Nieman Foundation at Harvard
Nov. 2, 2008, 1:17 p.m.

ProPublica and NYT seek $1M to put everyone’s documents online

[Saturday was the deadline for submissions for this year’s Knight News Challenge. In the coming days and weeks, we’ll be looking at some of the most interesting applicants. If you know of one you think worth highlighting, let us know, via email or in the comments. —Ed.]

Two of the biggest names in journalism have applied to this year’s Knight News Challenge: The pioneering investigative-reporting non-profit ProPublica and The New York Times are seeking $1 million from the Knight Foundation to launch an online repository of primary-source documents. The project could lead to greater information sharing among news organizations and their audience. As they put it in their grant application:

Documents are the foundation of investigative journalism, but today’s newsroom is a throwaway culture. Too often, reporters gather reams of information, do their stories, then chuck rich source documents into a dusty corner, never again to see the light of day.

The project, which is called DocumentCloud, would let news organizations upload their materials for public consumption and analysis. (“Readers will also be able to quickly search, annotate and bookmark documents — and for the first time link directly to specific pages or passages.”)

The proposal relies on a piece of software called DocViewer, which was developed by the Times’ Interactive Newsroom Technologies team. The head of that team, Aron Pilhofer, recently confirmed that the Times will release DocViewer as open source “sometime after the election.” Brian Boyer, the blogger who broke that news, said the software was created by the Times for its searchable database of Hillary Clinton’s 11,000-page public schedule as first lady, which was a journalistic marvel.

In an email today, Pilhofer said the application has already made it to the second round of the News Challenge, and he explained the proposal’s provenance:

The project started with a conversation between Scott Klein, Eric Umansky (of ProPublica) and me and my boss, Marc Frons. They were interested in using our DocViewer, and we were talking about the possibility of just open sourcing the darn thing. So, we got into one of those… “Hey, wouldn’t it be cool if we could also…” sorts of conversations, and things went from there.

DocumentCloud would focus initially on New York City “because it has favorable FOI laws and a vibrant journalism and blogging community.” (The community focus is also a requirement of the News Challenge.) A consortium of media outlets, bloggers, and watchdog groups would submit documents, though the application mentions only one partner on board: the Gotham Gazette, a news website published by the Citizens Union Foundation of the City of New York. ProPublica also plans to contribute state- and federal-government documents.

For the technically inclined, DocumentCloud will run on open APIs, so readers or other news organizations could search and interact with the document database as necessary for investigative projects. “Think of it as a ‘card catalog’ of standardized metadata for primary source documents,” the application argues.
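To make the “card catalog” idea concrete, here is a minimal sketch in Python of what a standardized metadata record and a fielded search over it might look like. Everything in it is an illustrative assumption: the `DocumentRecord` class, its field names, the `search` helper, and the sample records are hypothetical and made up for this example. The application does not specify a schema or an API.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DocumentRecord:
    """One hypothetical 'catalog card' describing a source document."""
    title: str
    agency: str
    location: str
    published: date
    topics: tuple[str, ...] = ()


def search(records, *, topic=None, agency=None, location=None):
    """Return the records that match every field the caller supplied."""
    results = []
    for rec in records:
        if topic is not None and topic not in rec.topics:
            continue
        if agency is not None and rec.agency != agency:
            continue
        if location is not None and rec.location != location:
            continue
        results.append(rec)
    return results


# Made-up sample records, for illustration only.
catalog = [
    DocumentRecord("School budget memo", "Department of Education",
                   "New York City", date(2008, 3, 1), ("education", "budget")),
    DocumentRecord("Transit audit", "Transit Authority",
                   "New York City", date(2008, 6, 15), ("transit", "budget")),
    DocumentRecord("Water quality report", "Department of Health",
                   "Albany", date(2007, 11, 20), ("health",)),
]

budget_docs = search(catalog, topic="budget", location="New York City")
print([rec.title for rec in budget_docs])
```

Because every record carries the same fields, a query by topic, agency, or location becomes a simple filter rather than a full-text hunt through PDFs, which is the kind of search the application describes.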

It isn’t clear if the project could or would go ahead without funding from Knight, which will award its News Challenge grants next summer. ProPublica’s $10-million annual budget is funded primarily by the Sandler Foundation. We’ve sent an email to Mike Webb, ProPublica’s director of communications, seeking more information.

The full text of the grant application is below the jump.

Project Title: DocumentCloud

Requested amount from Knight News Challenge: $1,000,000

Expected amount of time to complete project: 3 [years]

Total cost of project including all sources of funding: $1,000,000

Describe your project:

What is it? DocumentCloud is software, a website, and a set of open standards and APIs that will accelerate the daily work of investigative reporters, and will make investigative reporters out of every citizen, by improving the way we find, share, read and collaborate on source documents online.

Why do we need it? Documents are the foundation of investigative journalism, but today’s newsroom is a throwaway culture. Too often, reporters gather reams of information, do their stories, then chuck rich source documents into a dusty corner, never again to see the light of day. Documents that are placed on the web are typically just PDFs — a poor user experience that places documents out of context and, often, out of reach when the story fades from public consciousness. Further, news outfits do not benefit from the wisdom of the crowd since there is no good way to collaboratively examine large document sets.

How will it do it? DocumentCloud will host, and provide an open API to, an online database of source documents, contributed by a consortium of news orgs, watchdog groups and bloggers. Think of it as a “card catalog” of standardized metadata for primary source documents. Once submitted to DocumentCloud, documents can be found, linked to, and retrieved by anyone, anywhere on the Web. Thanks to the metadata, users will be able to search by topic, agency, or location. The project will lower barriers of participation by creating open standards and open-source software. DocViewer, a best-of-class web-based application, will allow even the smallest organizations to publish their documents online and contribute to DocumentCloud. Readers will also be able to quickly search, annotate and bookmark documents — and for the first time link directly to specific pages or passages.

How will your project improve the way news and information are delivered to geographic communities? Because source documents are often more scarce in metro reporting than in national, DocumentCloud can make its biggest initial impact by helping create an infrastructure for sharing on the local level. We’ve picked New York City for our initial rollout because it has favorable FOI laws and a vibrant journalism and blogging community. We have an agreement with our first local partner, Gotham Gazette, to work with us to build and test the software and APIs. They’ll also join the consortium, which will grow over the period of the grant to include many other local and national news organizations, bloggers and watchdog groups. While our pilot will focus on New York City, we will also include source documents from state and federal governments.

How is your idea innovative? (new or different from what already exists) DocumentCloud will take source documents beyond the inherent constraints of the PDF and out of the realm of clumsy scans or external application plug-ins and for the first time make them an intrinsic part of the semantic Web, and a part of reporting news online. Sharing information becomes much easier when you can share specific pages or paragraphs as well as entire documents. Source documents will be easier to find because users can search through fielded metadata, such as topics, locations, people, government agencies, publication date and other variables. Though of course the project stands on the shoulders of initiatives like Brewster Kahle’s Internet Archive, as well as the Open Archives Initiative, nothing like DocumentCloud exists.

What experience do you or your organization have to successfully develop this project? The New York Times has been at the forefront of the industry by fully integrating its newsroom and digital operations, and a leading innovator for digital content on the web among other platforms. In the past year, The Times has developed and launched a number of innovative products, including the Times People social network, Times Machine, an iPhone Times reader, the Times Developer Network, the Visualization Lab and two APIs. The Times is among the only major media organizations to form a dedicated team of journalist/developers focused exclusively on news projects, including the paper’s extensive Olympics and elections coverage this year. This team, Interactive Newsroom Technologies, has already built a lightweight version of the DocViewer, which will be released as an open source project. ProPublica, the new, non-profit newsroom, has the largest team of reporters dedicated to investigative journalism anywhere in the country. It is uniquely qualified to help manage the effort, not only because its reporters could be “power-users” of the service but because it was organized to take on just this kind of effort — collaborating with newsrooms around the country. ProPublica has already partnered with news organizations including Newsweek, 60 Minutes, Politico, the Albany Times-Union, and the Los Angeles Times. Unlike most news organizations, ProPublica does not have an economic incentive to be competitive with other news organizations — in fact, just the opposite. Its model relies on just the kind of collaboration that will help spread DocumentCloud virally.

[Update: See Jay Rosen’s concerns on this application here.]

[Hello, readers from Romenesko, and welcome to the newly launched Nieman Journalism Lab. We hope you’ll come back every weekday for reporting, commentary, and conversation about the future of journalism. Here’s our front page, here’s more about us, and here’s our RSS feed.]

PART OF A SERIES     Knight News Challenge 2009