Nieman Foundation at Harvard
Sept. 24, 2009, 8 a.m.

DocumentCloud adds impressive list of investigative-journalism outfits

DocumentCloud, the souped-up repository of primary-source material that I’ve been raving about since it first emerged in November, has a big announcement today: They’ve signed up 20 more organizations — including The Washington Post, New Yorker, MSNBC, and ACLU — to contribute documents and test the first iteration of the consortium, which is expected to launch privately by the end of this year.

The full list of members amounts to one of the most impressive collaborations of investigative-journalism outfits in — well, there really is no precedent for this:

ACLU National Security Project, Arizona Republic, The Atlantic, Center for Democracy and Technology / OpenCRS, Centre for Investigative Journalism (City University London), Center for Investigative Reporting / California Watch, Center for Public Integrity, Chicago Tribune, Dallas Morning News, Gotham Gazette, The Investigative Reporting Workshop at American University, The National Security Archive, The New York Times, New Yorker, MinnPost, MSNBC, Mother Jones, PBS NewsHour, ProPublica, St. Petersburg Times, Sunlight Foundation, Talking Points Memo, Voice of San Diego, Washington Post, WNYC

Imagine being able to search across the New York Times’ cache of records on Guantánamo Bay detainees, the ACLU’s unrivaled set of documents on detention policy, Jane Mayer’s source material for her coverage of the CIA in The New Yorker, and The Washington Post’s valuable contributions to all of the above. That’s the promise of DocumentCloud, which I’ve explained at length in previous posts.

Today they’re also announcing an official partnership with OpenCalais, the powerful Thomson Reuters product that turns text into meaningful data. (For instance, it can distinguish between Poland, the country, and Poland, Maine, or group references to Guantánamo and Gitmo.) Material submitted to DocumentCloud will be run through optical-character-recognition software, then OpenCalais and potentially other applications, with the goal of wringing as much value from it as possible.
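To make that pipeline concrete, here is a minimal sketch of the entity step, written in Python. It is not OpenCalais's API or DocumentCloud's code: the alias table, `extract_entities`, and `process_document` are all hypothetical, and the OCR stage is assumed to have already produced plain page text. It only shows the idea of mapping different surface forms (Guantánamo, Gitmo) to one entity and telling Poland the country apart from Poland, Maine.

```python
import re
from collections import defaultdict

# Hypothetical alias table: surface form -> canonical entity.
# A real extraction service infers these; they are hard-coded here
# purely for illustration.
ALIASES = {
    "poland, maine": "Poland (town, Maine)",
    "guantánamo": "Guantánamo Bay",
    "gitmo": "Guantánamo Bay",
    "poland": "Poland (country)",
}

# Try the longest aliases first so "Poland, Maine" wins over bare "Poland".
_PATTERN = re.compile(
    r"\b(?:"
    + "|".join(re.escape(a) for a in sorted(ALIASES, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def extract_entities(text):
    """Count mentions of each canonical entity found in the text."""
    counts = defaultdict(int)
    for match in _PATTERN.finditer(text):
        counts[ALIASES[match.group(0).lower()]] += 1
    return dict(counts)


def process_document(page_texts):
    """Join page text (assumed to be OCR output) and attach entity counts."""
    full_text = "\n".join(page_texts)
    return {"text": full_text, "entities": extract_entities(full_text)}
```

Calling `process_document` on a list of page strings returns the joined text plus a dictionary of entity counts, which is the kind of metadata a shared search index could be built on.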

Eric Umansky, one of the co-founders of DocumentCloud, told me that reporters contributing documents will have a “period of exclusivity” in which they can utilize the database — to search for common references, say, or background information — without giving up their competitive advantage. In return, they’ll agree to eventually make the source material public. Other details are still being worked out, though I got some new specifics that I’ll add here later today.

I received a heads-up about today’s announcement on Tuesday, when I was attending the Transparent Text symposium at IBM. So I grabbed Aron Pilhofer, another DocumentCloud co-founder, who was at the conference, to chat about what they’ve been up to since winning a two-year $719,500 grant from the Knight Foundation. That video is above, and a transcript follows.

Aron Pilhofer: This project has always been conceived as a consortium, and so the members that are going to be part of it are those organizations who will be contributing documents and helping us, at least initially, debug the system, helping us build it out, helping us figure out which features are useful to them, giving us feedback on how it works. And I think we’ll start with a very small subset of members who work with us and just grow it out over time.

Importantly, these are the organizations that are making a statement to us that they find value in this project and are interested in using it. In fact, I think if DocumentCloud existed today, every single one of these groups would have it and they’d use it right now, so that’s exciting. […]

The competitive issues, obviously, are there, but I also think it’s a misnomer that news organizations aren’t willing or able to collaborate. I think that, especially in this web world, we’re all sort of recognizing the need to do these kinds of collaborations. And I think the Web provides a mechanism to do collaborations, even among competing organizations, but find ways of keeping a very secure kind of wall between what, say, we’re putting into DocumentCloud and are not ready to make public and what The Washington Post is putting in and not ready to make public. So we’re able to do that, provide a secure environment to them, and we’re also able to then create an environment where, once they make it public, it’ll just drive traffic and attention back to the reporting they’re doing. […]

Are there daunting issues? Yeah, there absolutely are.

Zach Seward: Like what?

Pilhofer: The ability to ingest, to process, database, index and then republish metadata for what could eventually, if this works extremely well, amount to tens of thousands, hundreds of thousands, millions, possibly, of pages of printed or textual data is an extraordinarily difficult technological task to overcome, problem to solve, and we’re at the Transparent Text conference now at IBM, listening to speaker after speaker talking about these massive systems that they are developing to do exactly that.

So I think it gives you some sense that this is not a trivial problem. So this is one of the things that keeps me up at night now, a solvable problem. A lot of the same people that are here have solved this very problem and are thrilled to work with us on this problem. So I think it’s a solvable problem, but that’s what keeps me up at night. But we knew that going in.

Seward: Sure, not that surprising, but still keeping you sweating. One element of the development process that you’re also announcing is that OpenCalais is a partner.

Pilhofer: That’s right.

Seward: So what does that involve technically?

Pilhofer: What that involves is, from the very beginning of this project, one of the many seeds of this project (which I think you saw in a presentation here) was a presentation that Tom Tague of OpenCalais gave to the technology group at The New York Times. He is basically Mr. OpenCalais. From that minute, I realized we have the back end we need to extract the metadata to make this system really work.

So what it means as far as us working with Calais: They, when I described this project to them, they immediately loved it. So we are going to work with them very closely. We’ll have access to their engineers, we’ll have certain — I can’t really talk, I can’t give you real specifics of this, but let’s just say we will have a very close relationship with Calais because this is precisely the kind of project they want to see built on top of the system that they’ve developed. He described himself as a plumbing contractor at a conference of folks who are doing a lot of data visualization. So we’re building a layer on top of Calais, which is exactly what they want to happen.
