Sept. 24, 2009, 8 a.m.

DocumentCloud adds impressive list of investigative-journalism outfits

DocumentCloud, the souped-up repository of primary-source material that I’ve been raving about since it first emerged in November, has a big announcement today: They’ve signed up 20 more organizations — including The Washington Post, New Yorker, MSNBC, and ACLU — to contribute documents and test the first iteration of the consortium, which is expected to launch privately by the end of this year.

The full list of members amounts to one of the most impressive collaborations of investigative-journalism outfits in — well, there really is no precedent for this:

ACLU National Security Project, Arizona Republic, The Atlantic, Center for Democracy and Technology / OpenCRS, Centre for Investigative Journalism (City University London), Center for Investigative Reporting / California Watch, Center for Public Integrity, Chicago Tribune, Dallas Morning News, Gotham Gazette, The Investigative Reporting Workshop at American University, The National Security Archive, The New York Times, New Yorker, MinnPost, MSNBC, Mother Jones, PBS NewsHour, ProPublica, St. Petersburg Times, Sunlight Foundation, Talking Points Memo, Voice of San Diego, Washington Post, WNYC

Imagine being able to search across the New York Times’ cache of records on Guantánamo Bay detainees, the ACLU’s unrivaled set of documents on detention policy, Jane Mayer’s source material for her coverage of the CIA in The New Yorker, and The Washington Post’s valuable contributions to all of the above. That’s the promise of DocumentCloud, which I’ve explained at length in previous posts.

Today they’re also announcing an official partnership with OpenCalais, the powerful Thomson Reuters product that turns text into meaningful data. (For instance, it can distinguish between Poland, the country, and Poland, Maine, or group references to Guantánamo and Gitmo.) Material submitted to DocumentCloud will be run through optical-character-recognition software, then OpenCalais and potentially other applications, with the goal of wringing as much value from it as possible.
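To make that pipeline concrete, here is a minimal sketch in Python of the two stages described above: a text-extraction step followed by an entity-tagging step that groups different names for the same place. Everything in it is an assumption for illustration. The names `fake_ocr`, `tag_entities`, `process` and the `ALIASES` table are hypothetical, and the sketch does not call OpenCalais or any real OCR software; it only shows the shape of the idea, and it does not attempt the harder Poland-versus-Poland, Maine kind of disambiguation.

```python
from dataclasses import dataclass, field

# Hypothetical alias table: surface form (lowercase) -> canonical entity name.
# A real entity extractor would infer this; here it is hard-coded.
ALIASES = {
    "guantánamo": "Guantánamo Bay",
    "guantanamo": "Guantánamo Bay",
    "gitmo": "Guantánamo Bay",
}


@dataclass
class Document:
    doc_id: str
    text: str
    entities: set = field(default_factory=set)


def fake_ocr(scanned_page: str) -> str:
    """Stand-in for an OCR step; the 'scan' here is already plain text."""
    return scanned_page.strip()


def tag_entities(text: str) -> set:
    """Stand-in for an entity extractor: look up each word in ALIASES."""
    found = set()
    for word in text.split():
        key = word.strip(".,;:()\"'").lower()
        if key in ALIASES:
            found.add(ALIASES[key])
    return found


def process(doc_id: str, scanned_page: str) -> Document:
    """Run one page through both stages and return the tagged document."""
    text = fake_ocr(scanned_page)
    return Document(doc_id, text, tag_entities(text))


a = process("memo-1", "Detainees were transferred to Gitmo in 2002.")
b = process("memo-2", "The Guantánamo facility was expanded.")
```

Because both documents end up tagged with the same canonical entity, a search for one name can surface documents that only use the other, which is the kind of cross-referencing the post describes.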

Eric Umansky, one of the co-founders of DocumentCloud, told me that reporters contributing documents will have a “period of exclusivity” in which they can utilize the database — to search for common references, say, or background information — without giving up their competitive advantage. In return, they’ll agree to eventually make the source material public. Other details are still being worked out, though I got some new specifics that I’ll add here later today.

I received a heads-up about today’s announcement on Tuesday, when I was attending the Transparent Text symposium at IBM. So I grabbed Aron Pilhofer, another DocumentCloud co-founder, who was at the conference, to chat about what they’ve been up to since winning a two-year $719,500 grant from the Knight Foundation. That video is above, and a transcript is after the jump.

Aron Pilhofer: This project has always been conceived as a consortium, and so the members that are going to be part of it are those organizations who will be contributing documents and helping us, at least initially, debug the system, helping us build it out, helping us figure out which features are useful to them, giving us feedback on how it works. And I think we’ll start with a very small subset of members who work with us and just grow it out over time.

Importantly, these are the organizations that are making a statement to us that they find value in this project and are interested in using it. In fact, I think if DocumentCloud existed today, every single one of these groups would have it and they’d use it right now, so that’s exciting. [...]

The competitive issues, obviously, are there, but I also think it’s a misnomer that news organizations aren’t willing or able to collaborate. I think that, especially in this Web world, that we’re all sort of recognizing the need to do these kinds of collaborations. And I think the Web provides a mechanism to do collaborations, even among competing organizations, but find ways of keeping a very secure kind of wall between what, say, we’re putting into DocumentCloud and are not ready to make public and what The Washington Post is putting in and not ready to make public. So we’re able to do that, provide a secure environment to them, and we’re also able to then create an environment where, once they make it public, it’ll just drive traffic back and attention to the reporting they’re doing. [...]

Are there daunting issues? Yeah, there absolutely are.

Zach Seward: Like what?

Pilhofer: The ability to ingest, to process, database, index and then republish metadata for what could eventually, if this works extremely well, amount to tens of thousands, hundreds of thousands, millions, possibly, of pages of printed or textual data is an extraordinarily difficult technological task to overcome, problem to solve, and we’re at the Transparent Text conference now at IBM, listening to speaker after speaker talking about these massive systems that they are developing to do exactly that.

So I think it gives you some sense that this is not a trivial problem. So this is one of the things that keeps me up at night now, a solvable problem. A lot of the same people that are here have solved this very problem and are thrilled to work with us on this problem. So I think it’s a solvable problem, but that’s what keeps me up at night. But we knew that going in.
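(A note on the indexing step Pilhofer mentions: the core data structure behind full-text search is usually an inverted index, which maps each word to the pages that contain it. The sketch below is a toy version in Python, offered only as an illustration. The function names `build_index` and `search` and the sample pages are made up here, and nothing in the interview says how DocumentCloud will actually implement its index. The hard part he is describing is doing this at the scale of millions of pages, which this sketch does not address.)

```python
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word.strip(".,;:()\"'")].add(page_id)
    return index


def search(index, query):
    """Return the ids of pages that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Made-up sample pages, standing in for OCR output.
pages = {
    "p1": "Memo on detention policy.",
    "p2": "Budget report for 2008.",
    "p3": "Detention policy review, final.",
}
index = build_index(pages)
hits = search(index, "detention policy")
```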

Seward: Sure, not that surprising, but still keeping you sweating. One element of the development processes that you’re also announcing is OpenCalais is a partner.

Pilhofer: That’s right.

Seward: So what does that involve technically?

Pilhofer: What that involves is we’ve, from the very beginning of this project, one of — that I think you saw in a presentation here, one of the many seeds of this project was a presentation that Tom Tague of OpenCalais gave to the technology group at The New York Times. He is basically Mr. OpenCalais. From that minute, I realized we have the back end we need to extract the metadata to make this system really work.

So what it means as far as us working with Calais: They, when I described this project to them, they immediately loved it. So we are going to work with them very closely. We’ll have access to their engineers, we’ll have certain — I can’t really talk, I can’t give you real specifics of this, but let’s just say we will have a very close relationship with Calais because this is precisely the kind of project they want to see built on top of the system that they’ve developed. He described himself as a plumbing contractor at a conference of folks who are doing a lot of data visualization. So we’re building a layer on top of Calais, which is exactly what they want to happen.
