DocumentCloud, the souped-up repository of primary-source material that I’ve been raving about since it first emerged in November, has a big announcement today: They’ve signed up 20 more organizations — including The Washington Post, New Yorker, MSNBC, and ACLU — to contribute documents and test the first iteration of the consortium, which is expected to launch privately by the end of this year.
The full list of members amounts to one of the most impressive collaborations of investigative-journalism outfits in — well, there really is no precedent for this:
ACLU National Security Project, Arizona Republic, The Atlantic, Center for Democracy and Technology / OpenCRS, Centre for Investigative Journalism (City University London), Center for Investigative Reporting / California Watch, Center for Public Integrity, Chicago Tribune, Dallas Morning News, Gotham Gazette, The Investigative Reporting Workshop at American University, The National Security Archive, The New York Times, New Yorker, MinnPost, MSNBC, Mother Jones, PBS NewsHour, ProPublica, St. Petersburg Times, Sunlight Foundation, Talking Points Memo, Voice of San Diego, Washington Post, WNYC
Imagine being able to search across the New York Times’ cache of records on Guantánamo Bay detainees, the ACLU’s unrivaled set of documents on detention policy, Jane Mayer’s source material for her coverage of the CIA in The New Yorker, and The Washington Post’s valuable contributions to all of the above. That’s the promise of DocumentCloud, which I’ve explained at length in previous posts.
Today they’re also announcing an official partnership with OpenCalais, the powerful Thomson Reuters product that turns text into meaningful data. (For instance, it can distinguish between Poland, the country, and Poland, Maine, or group references to Guantánamo and Gitmo.) Material submitted to DocumentCloud will be run through optical-character-recognition software, then OpenCalais and potentially other applications, with the goal of wringing as much value from it as possible.
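For readers curious what that kind of processing looks like, here is a toy Python sketch of the two tricks mentioned above: grouping aliases under one name and telling two Polands apart by context. It assumes the text has already come out of OCR. The alias table, the function names and the ten-character lookahead are all invented for illustration; this is not the OpenCalais API, and it is not how DocumentCloud is built.

```python
import re
from collections import defaultdict

# Hypothetical alias table: different surface forms, one canonical entity.
ALIASES = {
    "gitmo": "Guantánamo Bay",
    "guantánamo": "Guantánamo Bay",
    "guantanamo": "Guantánamo Bay",
}


def resolve_poland(text, end):
    """Toy disambiguation: inspect the few characters after the match."""
    following = text[end:end + 10].lower()
    if re.match(r",?\s*maine\b", following):
        return "Poland, Maine (town)"
    return "Poland (country)"


def extract_entities(text):
    """Return {canonical entity: [character offsets]} for one document."""
    found = defaultdict(list)
    for match in re.finditer(r"\w+", text):
        word = match.group(0).lower()
        if word in ALIASES:
            # Group every alias under the same canonical name.
            found[ALIASES[word]].append(match.start())
        elif word == "poland":
            # Same spelling, different entity depending on context.
            found[resolve_poland(text, match.end())].append(match.start())
    return dict(found)


if __name__ == "__main__":
    sample = (
        "Detainees at Guantanamo, known as Gitmo, came from Poland. "
        "A lawyer from Poland, Maine, objected."
    )
    for entity, offsets in extract_entities(sample).items():
        print(entity, offsets)
```

A real entity extractor relies on far richer context than a lookup table and a lookahead, but the shape of the output is the point: once every document yields the same canonical names, a search for one spelling can surface records filed under another.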
Eric Umansky, one of the co-founders of DocumentCloud, told me that reporters contributing documents will have a “period of exclusivity” in which they can utilize the database — to search for common references, say, or background information — without giving up their competitive advantage. In return, they’ll agree to eventually make the source material public. Other details are still being worked out, though I got some new specifics that I’ll add here later today.
I received a heads-up about today’s announcement on Tuesday, when I was attending the Transparent Text symposium at IBM. So I grabbed Aron Pilhofer, another DocumentCloud co-founder, who was at the conference, to chat about what they’ve been up to since winning a two-year $719,500 grant from the Knight Foundation. That video is above, and a transcript is after the jump.
Aron Pilhofer: This project has always been conceived as a consortium, and so the members that are going to be part of it are those organizations who will be contributing documents and helping us, at least initially, debug the system, helping us build it out, helping us figure out which features are useful to them, giving us feedback on how it works. And I think we’ll start with a very small subset of members who work with us and just grow it out over time.
Importantly, these are the organizations that are making a statement to us that they find value in this project and are interested in using it. In fact, I think if DocumentCloud existed today, every single one of these groups would have it and they’d use it right now, so that’s exciting. [...]
The competitive issues, obviously, are there, but I also think it’s a misnomer that news organizations aren’t willing or able to collaborate. I think that, especially in this web world, that we’re all sort of recognizing the need to do these kinds of collaborations. And I think the Web provides a mechanism to do collaborations, even among competing organizations, but find ways of keeping a very secure kind of wall between what, say, we’re putting into DocumentCloud and are not ready to make public and what The Washington Post is putting in and not ready to make public. So we’re able to do that, provide a secure environment to them, and we’re also able to then create an environment where, once they make it public, it’ll just drive traffic back and attention to the reporting they’re doing. [...]
Are there daunting issues? Yeah, there absolutely are.
Zach Seward: Like what?
Pilhofer: The ability to ingest, to process, database, index and then republish metadata for what could eventually, if this works extremely well, amount to tens of thousands, hundreds of thousands, millions, possibly, of pages of printed or textual data is an extraordinarily difficult technological task to overcome, problem to solve, and we’re at the Transparent Text conference now at IBM, listening to speaker after speaker talking about these massive systems that they are developing to do exactly that.
So I think it gives you some sense that this is not a trivial problem. So this is one of the things that keeps me up at night now, a solvable problem. A lot of the same people that are here have solved this very problem and are thrilled to work with us on this problem. So I think it’s a solvable problem, but that’s what keeps me up at night. But we knew that going in.
Seward: Sure, not that surprising, but still keeping you sweating. One element of the development process that you’re also announcing is that OpenCalais is a partner.
Pilhofer: That’s right.
Seward: So what does that involve technically?
Pilhofer: What that involves is we’ve, from the very beginning of this project, one of — that I think you saw in a presentation here, one of the many seeds of this project was a presentation Tom Tague of OpenCalais gave to the technology group at the New York Times. He is basically Mr. OpenCalais. From that minute, I realized we have the back end we need to extract the metadata to make this system really work.
So what it means as far as us working with Calais: They, when I described this project to them, they immediately loved it. So we are going to work with them very closely. We’ll have access to their engineers, we’ll have certain — I can’t really talk, I can’t give you real specifics of this, but let’s just say we will have a very close relationship with Calais because this is precisely the kind of project they want to see built on top of the system that they’ve developed. He described himself as a plumbing contractor at a conference of folks who are doing a lot of data visualization. So we’re building a layer on top of Calais, which is exactly what they want to happen.