Sept. 24, 2009, 8 a.m.

DocumentCloud adds impressive list of investigative-journalism outfits

DocumentCloud, the souped-up repository of primary-source material that I’ve been raving about since it first emerged in November, has a big announcement today: They’ve signed up 20 more organizations — including The Washington Post, New Yorker, MSNBC, and ACLU — to contribute documents and test the first iteration of the consortium, which is expected to launch privately by the end of this year.

The full list of members amounts to one of the most impressive collaborations of investigative-journalism outfits in — well, there really is no precedent for this:

ACLU National Security Project, Arizona Republic, The Atlantic, Center for Democracy and Technology / OpenCRS, Centre for Investigative Journalism (City University London), Center for Investigative Reporting / California Watch, Center for Public Integrity, Chicago Tribune, Dallas Morning News, Gotham Gazette, The Investigative Reporting Workshop at American University, The National Security Archive, The New York Times, New Yorker, MinnPost, MSNBC, Mother Jones, PBS NewsHour, ProPublica, St. Petersburg Times, Sunlight Foundation, Talking Points Memo, Voice of San Diego, Washington Post, WNYC

Imagine being able to search across the New York Times’ cache of records on Guantánamo Bay detainees, the ACLU’s unrivaled set of documents on detention policy, Jane Mayer’s source material for her coverage of the CIA in The New Yorker, and The Washington Post’s valuable contributions to all of the above. That’s the promise of DocumentCloud, which I’ve explained at length in previous posts.

Today they’re also announcing an official partnership with OpenCalais, the powerful Thomson Reuters product that turns text into meaningful data. (For instance, it can distinguish between Poland, the country, and Poland, Maine, or group references to Guantánamo and Gitmo.) Material submitted to DocumentCloud will be run through optical-character-recognition software, then OpenCalais and potentially other applications, with the goal of wringing as much value from it as possible.
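To make the "Guantánamo and Gitmo" example concrete, here is a minimal sketch in Python of what grouping references under one entity could look like. It is only an illustration and not how OpenCalais or DocumentCloud actually works: the alias table and the `group_mentions` function are hypothetical, and a real service would infer these links from context rather than from a hard-coded list.

```python
import unicodedata
from collections import defaultdict


def _key(text):
    """Normalize accents and case so 'Guantánamo' and 'guantánamo' compare equal."""
    return unicodedata.normalize("NFC", text).lower()


# Hypothetical alias table (an assumption for this sketch): each known
# surface form maps to one canonical entity name.
ALIASES = {
    _key(alias): canonical
    for alias, canonical in [
        ("Guantánamo", "Guantánamo Bay"),
        ("Guantánamo Bay", "Guantánamo Bay"),
        ("Gitmo", "Guantánamo Bay"),
    ]
}


def group_mentions(mentions):
    """Group raw mentions under a canonical name; unknown names map to themselves."""
    groups = defaultdict(list)
    for mention in mentions:
        canonical = ALIASES.get(_key(mention), mention)
        groups[canonical].append(mention)
    return dict(groups)


if __name__ == "__main__":
    print(group_mentions(["Gitmo", "Guantánamo", "Poland"]))
```

Telling Poland the country apart from Poland, Maine is the harder half of the problem, since it depends on the surrounding text, and this sketch does not attempt it.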

Eric Umansky, one of the co-founders of DocumentCloud, told me that reporters contributing documents will have a “period of exclusivity” in which they can utilize the database — to search for common references, say, or background information — without giving up their competitive advantage. In return, they’ll agree to eventually make the source material public. Other details are still being worked out, though I got some new specifics that I’ll add here later today.

I received a heads-up about today’s announcement on Tuesday, when I was attending the Transparent Text symposium at IBM. So I grabbed Aron Pilhofer, another DocumentCloud co-founder, who was at the conference, to chat about what they’ve been up to since winning a two-year $719,500 grant from the Knight Foundation. A transcript of our conversation follows.

Aron Pilhofer: This project has always been conceived as a consortium, and so the members that are going to be part of it are those organizations who will be contributing documents and helping us, at least initially, debug the system, helping us build it out, helping us figure out which features are useful to them, giving us feedback on how it works. And I think we’ll start with a very small subset of members who work with us and just grow it out over time.

Importantly, these are the organizations that are making a statement to us that they find value in this project and are interested in using it. In fact, I think if DocumentCloud existed today, every single one of these groups would have it and they’d use it right now, so that’s exciting. […]

The competitive issues, obviously, are there, but I also think it’s a misnomer that news organizations aren’t willing or able to collaborate. I think that, especially in this web world, that we’re all sort of recognizing the need to do these kinds of collaborations. And I think the Web provides a mechanism to do collaborations, even among competing organizations, but find ways of keeping a very secure kind of wall between what, say, we’re putting into DocumentCloud and are not ready to make public and what The Washington Post is putting in and not ready to make public. So we’re able to do that, provide a secure environment to them, and we’re also able to then create an environment where, once they make it public, it’ll just drive traffic back and attention to the reporting they’re doing. […]

Are there daunting issues? Yeah, there absolutely are.

Zach Seward: Like what?

Pilhofer: The ability to ingest, to process, database, index and then republish metadata for what could eventually, if this works extremely well, amount to tens of thousands, hundreds of thousands, millions, possibly, of pages of printed or textual data is an extraordinarily difficult technological task to overcome, problem to solve, and we’re at the Transparent Text conference now at IBM, listening to speaker after speaker talking about these massive systems that they are developing to do exactly that.

So I think it gives you some sense that this is not a trivial problem. So this is one of the things that keeps me up at night now, a solvable problem. A lot of the same people that are here have solved this very problem and are thrilled to work with us on this problem. So I think it’s a solvable problem, but that’s what keeps me up at night. But we knew that going in.

Seward: Sure, not that surprising, but still keeping you sweating. One element of the development process that you’re also announcing is that OpenCalais is a partner.

Pilhofer: That’s right.

Seward: So what does that involve technically?

Pilhofer: What that involves is we’ve, from the very beginning of this project, one of — that I think you saw in a presentation here, one of the many seeds of this project was a presentation that Tom Tague of OpenCalais gave to the technology group at The New York Times. He is basically Mr. OpenCalais. From that minute, I realized we have the back end we need to extract the metadata to make this system really work.

So what it means as far as us working with Calais: They, when I described this project to them, they immediately loved it. So we are going to work with them very closely. We’ll have access to their engineers, we’ll have certain — I can’t really talk, I can’t give you real specifics of this, but let’s just say we will have a very close relationship with Calais because this is precisely the kind of project they want to see built on top of the system that they’ve developed. He described himself as a plumbing contractor at a conference of folks who are doing a lot of data visualization. So we’re building a layer on top of Calais, which is exactly what they want to happen.
