ProPublica and NYT seek $1M to put everyone’s documents online
[Saturday was the deadline for submissions for this year's Knight News Challenge. In the coming days and weeks, we'll be looking at some of the most interesting applicants. If you know of one you think worth highlighting, let us know, via email or in the comments. —Ed.]
Two of the biggest names in journalism have applied to this year’s Knight News Challenge: The pioneering investigative-reporting non-profit ProPublica and The New York Times are seeking $1 million from the Knight Foundation to launch an online repository of primary-source documents. The project could lead to greater information sharing among news organizations and their audience. As they put it in their grant application:
Documents are the foundation of investigative journalism, but today’s newsroom is a throwaway culture. Too often, reporters gather reams of information, do their stories, then chuck rich source documents into a dusty corner, never again to see the light of day.
The project, called DocumentCloud, would let news organizations upload their materials for public consumption and analysis. (“Readers will also be able to quickly search, annotate and bookmark documents — and for the first time link directly to specific pages or passages.”)
The proposal relies on a piece of software called DocViewer, which was developed by the Times’ Interactive Newsroom Technologies team. The head of that team, Aron Pilhofer, recently confirmed that the Times will release DocViewer as open source “sometime after the election.” Brian Boyer, the blogger who broke that news, said the Times created the software for its searchable database of Hillary Clinton’s 11,000-page public schedule as first lady, itself a journalistic marvel.
In an email today, Pilhofer said the application has already made it to the second round of the News Challenge, and he explained the proposal’s provenance:
The project started with a conversation between Scott Klein, Eric Umansky (of ProPublica) and me and my boss, Marc Frons. They were interested in using our DocViewer, and we were talking about the possibility of just open sourcing the darn thing. So, we got into one of those… “Hey, wouldn’t it be cool if we could also…” sorts of conversations, and things went from there.
DocumentCloud would focus initially on New York City “because it has favorable FOI laws and a vibrant journalism and blogging community.” (The community focus is also a requirement of the News Challenge.) A consortium of media outlets, bloggers, and watchdog groups would submit documents, though the application mentions only one partner on board: the Gotham Gazette, a news website published by the Citizens Union Foundation of the City of New York. ProPublica also plans to contribute state- and federal-government documents.
For the technically inclined, DocumentCloud will run on open APIs, so readers or other news organizations could search and interact with the document database as necessary for investigative projects. “Think of it as a ‘card catalog’ of standardized metadata for primary source documents,” the application argues.
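To make the "card catalog" idea concrete, here is a minimal sketch in Python of what a metadata record and a fielded search over it might look like. The application does not publish a schema or an API, so everything here is an assumption for illustration: the `DocumentRecord` fields, the `search` and `page_link` helpers, the `#page=N` link format, and the placeholder records are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class DocumentRecord:
    """One hypothetical catalog entry; the field names are illustrative only."""
    title: str
    source_url: str
    agency: str
    location: str
    published: date
    topics: List[str] = field(default_factory=list)


def search(records: List[DocumentRecord],
           topic: Optional[str] = None,
           agency: Optional[str] = None,
           location: Optional[str] = None) -> List[DocumentRecord]:
    """Return the records that match every filter that was supplied."""
    results = []
    for rec in records:
        if topic is not None and topic not in rec.topics:
            continue
        if agency is not None and rec.agency != agency:
            continue
        if location is not None and rec.location != location:
            continue
        results.append(rec)
    return results


def page_link(record: DocumentRecord, page: int) -> str:
    """Build a link to one page, using an assumed '#page=N' fragment."""
    return f"{record.source_url}#page={page}"


# Placeholder records, not real documents.
catalog = [
    DocumentRecord("Sample budget memo", "https://example.org/docs/1",
                   "Example Agency", "New York City", date(2008, 1, 15),
                   ["budget"]),
    DocumentRecord("Sample inspection report", "https://example.org/docs/2",
                   "Other Agency", "Albany", date(2008, 3, 2),
                   ["housing", "budget"]),
]

nyc_budget = search(catalog, topic="budget", location="New York City")
link = page_link(nyc_budget[0], 12)
```

The point of the sketch is only that standardized fields such as topic, agency, and location make documents filterable, and that a stable per-page address makes individual passages linkable. A real service would expose this over the web rather than in memory.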
It isn’t clear if the project could or would go ahead without funding from Knight, which will award its News Challenge grants next summer. ProPublica’s $10-million annual budget is funded primarily by the Sandler Foundation. We’ve sent an email to Mike Webb, ProPublica’s director of communications, seeking more information.
The full text of the grant application is below the jump.
Project Title: DocumentCloud
Requested amount from Knight News Challenge: $1,000,000
Expected amount of time to complete project: 3 [years]
Total cost of project including all sources of funding: $1,000,000
Describe your project:
What is it? DocumentCloud is software, a website, and a set of open standards and APIs that will accelerate the daily work of investigative reporters, and will make investigative reporters out of every citizen, by improving the way we find, share, read and collaborate on source documents online.
Why do we need it? Documents are the foundation of investigative journalism, but today’s newsroom is a throwaway culture. Too often, reporters gather reams of information, do their stories, then chuck rich source documents into a dusty corner, never again to see the light of day. Documents that are placed on the web are typically just PDFs — a poor user experience that places documents out of context and, often, out of reach when the story fades from public consciousness. Further, news outfits do not benefit from the wisdom of the crowd since there is no good way to collaboratively examine large document sets.
How will it do it? DocumentCloud will host, and provide an open API to, an online database of source documents, contributed by a consortium of news orgs, watchdog groups and bloggers. Think of it as a “card catalog” of standardized metadata for primary source documents. Once submitted to DocumentCloud, documents can be found, linked to, and retrieved by anyone, anywhere on the Web. Thanks to the metadata, users will be able to search by topic, agency, or location. The project will lower barriers of participation by creating open standards and open-source software. DocViewer, a best-of-class web-based application, will allow even the smallest organizations to publish their documents online and contribute to DocumentCloud. Readers will also be able to quickly search, annotate and bookmark documents — and for the first time link directly to specific pages or passages.
How will your project improve the way news and information are delivered to geographic communities? Because source documents are often more scarce in metro reporting than in national, DocumentCloud can make its biggest initial impact by helping create an infrastructure for sharing on the local level. We’ve picked New York City for our initial rollout because it has favorable FOI laws and a vibrant journalism and blogging community. We have an agreement with our first local partner, Gotham Gazette, to work with us to build and test the software and APIs. They’ll also join the consortium, which will grow over the period of the grant to include many other local and national news organizations, bloggers and watchdog groups. While our pilot will focus on New York City, we will also include source documents from state and federal governments.
How is your idea innovative? (new or different from what already exists) DocumentCloud will take source documents beyond the inherent constraints of the PDF and out of the realm of clumsy scans or external application plug-ins and for the first time make them an intrinsic part of the semantic Web, and a part of reporting news online. Sharing information becomes much easier when you can share specific pages or paragraphs as well as entire documents. Source documents will be easier to find because users can search through fielded metadata, such as topics, locations, people, government agencies, publication date and other variables. Though of course the project stands on the shoulders of initiatives like Brewster Kahle’s Internet Archive, as well as the Open Archives Initiative, nothing like DocumentCloud exists.
What experience do you or your organization have to successfully develop this project? The New York Times has been at the forefront of the industry by fully integrating its newsroom and digital operations, and a leading innovator for digital content on the web among other platforms. In the past year, The Times has developed and launched a number of innovative products, including the Times People social network, Times Machine, an iPhone Times reader, the Times Developer Network, the Visualization Lab and two APIs. The Times is among the only major media organizations to form a dedicated team of journalist/developers focused exclusively on news projects, including the paper’s extensive Olympics and elections coverage this year. This team, Interactive Newsroom Technologies, has already built a lightweight version of the DocViewer, which will be released as an open source project.
ProPublica, the new, non-profit newsroom, has the largest team of reporters dedicated to investigative journalism anywhere in the country. It is uniquely qualified to help manage the effort, not only because its reporters could be “power-users” of the service but because it was organized to take on just this kind of effort — collaborating with newsrooms around the country. ProPublica has already partnered with news organizations including Newsweek, 60 Minutes, Politico, the Albany Times-Union, and the Los Angeles Times. Unlike most news organizations, ProPublica does not have an economic incentive to be competitive with other news organizations — in fact, just the opposite. Its model relies on just the kind of collaboration that will help spread DocumentCloud virally.
[Update: See Jay Rosen's concerns on this application here.]
[Hello, readers from Romenesko, and welcome to the newly launched Nieman Journalism Lab. We hope you'll come back every weekday for reporting, commentary, and conversation about the future of journalism. Here's our front page, here's more about us, and here's our RSS feed.]

For online annotation and collaboration on PDFs and Word docs in the browser, there’s also A.nnotate.com (our service). Some other tools, like Google and Amazon book search, also render documents as images so they display quickly on the web, without waiting for the whole document to download or for separate viewer plugins to start up.
A number of Flash-based document sites have appeared recently too, like Scribd, Docstoc and edocr, but these don’t do much more than display the document in a Flash panel, which doesn’t seem like such a big win over linking to a PDF.
Intriguing concept. Puzzling how it has already made it to the second round when the deadline was just Saturday.
I sent a somewhat similar proposal, although as an individual mine is not as complete as the Times one. I wonder how the judges will tackle ideas that overlap. Mine is called Acceso (Access in Spanish) and targets Mexico City.
http://tinyurl.com/5w5j8o
My other two ideas that are still in competition are:
News-Point: http://tinyurl.com/6dwagg
Crimesourcing: http://tinyurl.com/6hvkf4
Hope to read your comments. Thanks,
Gabriel Sama
With an invitation like that, I’m sure you’ll be flooded by applicants looking for exposure. But I think we have a nifty idea. It ties state-level campaign finance reports with the votes of state lawmakers and with bill analyses. It emphasizes the impact of multi-state political contributions on regional environmental policy. See:
http://tinyurl.com/5heo6r
I wonder how this would compare and contrast with the UCSF tobacco documents archive. If it’s just duplicating effort, bad. But if it’d make a job like that much easier…
(Here’s hoping we get to put many more such document sets online in future.)
Does ProPublica editor-in-chief Paul Steiger’s Knight Foundation trusteeship constitute a conflict of interest? How about Knight Foundation president & CEO Alberto Ibargüen’s position on the ProPublica board?
And while we’re on the topic of possibly odd arrangements, what’s the deal with NewsU offering incentives to its users to write testimonials (“We’re giving away prizes for the best stories that are submitted”), to use in order to get more funding? Is this a legitimate tactic?
Just ran across something from Ask Metafilter that might be relevant:
http://ask.metafilter.com/108087/Examples-of-Onlne-Archives-that-Allow-Users-to-Add-Metadata
(and I wonder why blogging platforms don’t segregate trackbacks from comments, or somehow make it possible to hide the trackbacks…)