DocumentCloud adds impressive list of investigative-journalism outfits

By Zachary M. Seward / Sept. 24, 2009 / 8 a.m.

DocumentCloud, the souped-up repository of primary-source material that I’ve been raving about since it first emerged in November, has a big announcement today: They’ve signed up 20 more organizations — including The Washington Post, New Yorker, MSNBC, and ACLU — to contribute documents and test the first iteration of the consortium, which is expected to launch privately by the end of this year.

The full list of members amounts to one of the most impressive collaborations of investigative-journalism outfits in — well, there really is no precedent for this:

ACLU National Security Project, Arizona Republic, The Atlantic, Center for Democracy and Technology / OpenCRS, Centre for Investigative Journalism (City University London), Center for Investigative Reporting / California Watch, Center for Public Integrity, Chicago Tribune, Dallas Morning News, Gotham Gazette, The Investigative Reporting Workshop at American University, The National Security Archive, The New York Times, New Yorker, MinnPost, MSNBC, Mother Jones, PBS NewsHour, ProPublica, St. Petersburg Times, Sunlight Foundation, Talking Points Memo, Voice of San Diego, Washington Post, WNYC

Imagine being able to search across the New York Times’ cache of records on Guantánamo Bay detainees, the ACLU’s unrivaled set of documents on detention policy, Jane Mayer’s source material for her coverage of the CIA in The New Yorker, and The Washington Post’s valuable contributions to all of the above. That’s the promise of DocumentCloud, which I’ve explained at length in previous posts.

Today they’re also announcing an official partnership with OpenCalais, the powerful Thomson Reuters product that turns text into meaningful data. (For instance, it can distinguish between Poland, the country, and Poland, Maine, or group references to Guantánamo and Gitmo.) Material submitted to DocumentCloud will be run through optical-character-recognition software, then OpenCalais and potentially other applications, with the goal of wringing as much value from it as possible.
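To make the "group references to Guantánamo and Gitmo" idea concrete, here is a minimal sketch of alias grouping over already-OCR'd page text. It is a toy illustration only: the `ALIASES` table, the `group_mentions` function, and the sample pages are all hypothetical, and nothing here uses or represents the actual OpenCalais service, which also handles harder cases such as telling Poland from Poland, Maine.

```python
from collections import defaultdict

# Hypothetical alias table (an assumption for illustration):
# lowercase surface form -> canonical entity name.
ALIASES = {
    "guantánamo": "Guantánamo Bay",
    "guantánamo bay": "Guantánamo Bay",
    "gitmo": "Guantánamo Bay",
}


def group_mentions(pages):
    """Map each canonical entity to the sorted page numbers that mention it."""
    index = defaultdict(set)
    for page_number, text in pages.items():
        lowered = text.lower()
        for alias, canonical in ALIASES.items():
            # Naive substring match; a real system would tokenize and
            # use context to disambiguate.
            if alias in lowered:
                index[canonical].add(page_number)
    return {entity: sorted(numbers) for entity, numbers in index.items()}


# Made-up sample pages standing in for OCR output.
pages = {
    1: "Detainees were transferred to Gitmo in 2002.",
    2: "The memo concerns Guantánamo Bay detention policy.",
    3: "Budget figures for fiscal year 2008.",
}

result = group_mentions(pages)
print(result)
```

With these sample pages, pages 1 and 2 end up under the same entity even though they use different names for it, which is what would let a reporter search once and find both.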

Eric Umansky, one of the co-founders of DocumentCloud, told me that reporters contributing documents will have a “period of exclusivity” in which they can utilize the database — to search for common references, say, or background information — without giving up their competitive advantage. In return, they’ll agree to eventually make the source material public. Other details are still being worked out, though I got some new specifics that I’ll add here later today.

I received a heads up about today’s announcement on Tuesday, when I was attending the Transparent Text symposium at IBM. So I grabbed Aron Pilhofer, another DocumentCloud co-founder, who was at the conference, to chat about what they’ve been up to since winning a two-year $719,500 grant from the Knight Foundation. That video is above, and a transcript is after the jump.

Aron Pilhofer: This project has always been conceived as a consortium, and so the members that are going to be part of it are those organizations who will be contributing documents and helping us, at least initially, debug the system, helping us build it out, helping us figure out which features are useful to them, giving us feedback on how it works. And I think we’ll start with a very small subset of members who work with us and just grow it out over time.

Importantly, these are the organizations that are making a statement to us that they find value in this project and are interested in using it. In fact, I think if DocumentCloud existed today, every single one of these groups would have it and they’d use it right now, so that’s exciting. [...]

The competitive issues, obviously, are there, but I also think it’s a misnomer that news organizations aren’t willing or able to collaborate. I think that, especially in this web world, that we’re all sort of recognizing the need to do these kinds of collaborations. And I think the Web provides a mechanism to do collaborations, even among competing organizations, but find ways of keeping a very secure kind of wall between what, say, we’re putting into DocumentCloud and are not ready to make public and what The Washington Post is putting in and not ready to make public. So we’re able to do that, provide a secure environment to them, and we’re also able to then create an environment where, once they make it public, it’ll just drive traffic back and attention to the reporting they’re doing. [...]

Are there daunting issues? Yeah, there absolutely are.

Zach Seward: Like what?

Pilhofer: The ability to ingest, to process, database, index and then republish metadata for what could eventually, if this works extremely well, amount to tens of thousands, hundreds of thousands, millions, possibly, of pages of printed or textual data is an extraordinarily difficult technological task to overcome, problem to solve, and we’re at the Transparent Text conference now at IBM, listening to speaker after speaker talking about these massive systems that they are developing to do exactly that.

So I think it gives you some sense that this is not a trivial problem. So this is one of the things that keeps me up at night now, a solvable problem. A lot of the same people that are here have solved this very problem and are thrilled to work with us on this problem. So I think it’s a solvable problem, but that’s what keeps me up at night. But we knew that going in.

Seward: Sure, not that surprising, but still keeping you sweating. One element of the development processes that you’re also announcing is OpenCalais is a partner.

Pilhofer: That’s right.

Seward: So what does that involve technically?

Pilhofer: What that involves is we’ve, from the very beginning of this project, one of — that I think you saw in a presentation here, one of the many seeds of this project was a presentation that Tom Tague of OpenCalais gave to the technology group at the New York Times. He is basically Mr. OpenCalais. From that minute, I realized we have the back end we need to extract the metadata to make this system really work.

So what it means as far as us working with Calais: They, when I described this project to them, they immediately loved it. So we are going to work with them very closely. We’ll have access to their engineers, we’ll have certain — I can’t really talk, I can’t give you real specifics of this, but let’s just say we will have a very close relationship with Calais because this is precisely the kind of project they want to see built on top of the system that they’ve developed. He described himself as a plumbing contractor at a conference of folks who are doing a lot of data visualization. So we’re building a layer on top of Calais, which is exactly what they want to happen.

This entry was written by Zachary M. Seward and posted on September 24, 2009 at 8:00 am.


13 comments:

Trackbacks:

  1. Knight Foundation Blog » More news from DocumentCloud at 12:21 pm, September 24, 2009

    [...] high-wattage partners to contribute documents and feedback to DocumentCloud. NiemanLab sets up the tremendous possibilities here: Imagine being able to search across the New York Times’ cache of records on Guantánamo Bay [...]

     
  2. TwittLink - Your headlines on Twitter at 1:33 pm, September 24, 2009

    [...] DocumentCloud adds impressive list of investigative-journalism outfits » Nieman Journalism Lab [...]

     
  3. DocumentCloud | TightWind at 10:01 pm, September 24, 2009

    [...] Zachary M. Steward described DocumentCloud: Imagine being able to search across the New York Times’ cache of records on Guantánamo Bay detainees, the ACLU’s unrivaled set of documents on detention policy, Jane Mayer’s source material for her coverage of the CIA in The New Yorker, and The Washington Post’s valuable contributions to all of the above. That’s the promise of DocumentCloud [...]

     
  4. DocumentCloud and OpenCalais: Some Questions at ≈ Relations at 12:53 am, September 25, 2009

    [...] it announced another two-dozen high profile content partners (Nieman Labs view on this) as well as a partnership with ThomsonReuters OpenCalais (DocumentCloud Blog Post): This morning [...]

     
  5. links for 2009-09-27 « Sarah Hartley at 3:02 pm, September 27, 2009

    [...] DocumentCloud adds impressive list of investigative-journalism outfits » Nieman Journalism Lab Imagine being able to search across the New York Times’ cache of records on Guantánamo Bay detainees, the ACLU’s unrivaled set of documents on detention policy, Jane Mayer’s source material for her coverage of the CIA in The New Yorker, and The Washington Post’s valuable contributions to all of the above. That’s the promise of DocumentCloud, which I’ve explained at length in previous posts. (tags: journalism media newspaper news data interview citizenjournalism) [...]

     
  6. The Rise of the Non-Profit News Model | The Brandsynario Blog at 12:59 am, September 29, 2009

    [...] Friday, Document Cloud announced that it has signed up 20 news and information organizations, including the Washington Post, MSNBC, [...]

     
  7. Five projects on the frontier of text-based data analysis and visualization » Nieman Journalism Lab at 9:01 am, September 29, 2009

    [...] mentioned OpenCalais in the context of DocumentCloud, but there’s much more to the software, which was purchased by Thomson Reuters in 2007. In a [...]

     
  8. OpenCalais joins DocumentCloud, set to host a wealth of primary sources | TR! TellMe at 4:57 pm, September 29, 2009

    [...] across news reporters’ source documents. Aron Pilhofer, a DocumentCloud co-founder, discussed with Nieman Journalism Labs some of the collaborative and open-sourced foundations of the project, [...]

     
  9. Coming Soon: DocumentCloud, A Place to Access Primary Source Documents « ResourceShelf at 11:34 am, October 1, 2009

    [...] Also: DocumentCloud adds impressive list of investigative-journalism outfits (via Nieman Journalism Lab) Imagine being able to search across the New York Times’ cache of records on Guantánamo Bay [...]

     
  10. September #5 at take21.org/blog at 7:07 pm, October 2, 2009

    [...] Nieman LabDocumentCloud: investigative journalism organisations+ Open Calais Nieman Lab interviewFull audio and transcript of Clay Shirky’s talk at Harvard on the future of news Nieman LabPie [...]

     
  11. DocumentCloud – a paradigm shift in source documents? | The Art of Documentation at 12:53 pm, October 19, 2009

    [...] Also: DocumentCloud adds impressive list of investigative-journalism outfits (via Nieman Journalism Lab) Imagine being able to search across the New York Times’ cache of records on Guantánamo Bay [...]

     
  12. Why link out? Four journalistic purposes of the noble hyperlink » Nieman Journalism Lab at 9:39 am, June 8, 2010

    [...] If they went to city hall and saw the records, can they scan them for us? There is already infrastructure for journalists who want to do this. A link is the simplest, most comprehensive, and most [...]

     
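  13. What can Chinese online media learn from the BBC? - ccc at 10:32 pm, June 12, 2010

    [...] Beneath Herrmann's original blog post there is lively discussion among users, with particular attention to sharing models for academic journals and other non-free sources. Worth mentioning is the DocumentCloud project, which is building a serious repository of news documents, including time-limited protection for unpublished documents. [...]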

     
