Nov. 19, 2008, 8:36 a.m.

DocumentCloud: The innovation $1m in Knight money could buy

Here’s some more information about the Knight News Challenge application by ProPublica and The New York Times that generated some buzz and criticism earlier this month. They’re seeking a $1 million grant to develop an online repository of primary-source documents that anyone could contribute to or take from. I spoke at length with developers at both organizations, and they discussed the technology behind their effort, how it could benefit investigative journalism, and why they’re seeking seven figures to launch the project.

The venture, which is called DocumentCloud, seems like it could vastly improve document-based journalism. (That’s separate from the issue of whether they’re deserving of a News Challenge grant.) At the moment, when a reporter gets her hands on paper documents, the best she can typically do is post them online as scanned PDFs, where they often can’t be searched and will likely be forgotten by the end of the day. Worst of all, it’s a one-sided experience: The reporter drops a dead tree in a forest and has no idea if it ever makes a sound.

DocViewer, the technology behind DocumentCloud, promises several features that would address the current failings of the PDF model. It would allow users to run their documents through an OCR (optical character recognition) service, enabling full-text searches of otherwise impenetrable material. DocViewer then relies on OpenCalais, a web service developed by Thomson Reuters, to tag documents with the names of known people and places found within the text. Any reporter who has ever attempted to wade through a thick stack of paper on deadline will immediately realize how helpful this would be.
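To make the idea concrete, here is a minimal sketch of what the two steps buy you once OCR has produced text: a word index for full-text search and a pass that tags known names. It is an illustration only, not DocViewer's code. The page text is made up, and the `KNOWN_ENTITIES` lookup is a hypothetical stand-in for a tagging service such as OpenCalais, whose actual API is not shown here.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for an entity-tagging service; NOT the OpenCalais API.
KNOWN_ENTITIES = {
    "Jane Doe": "Person",
    "Office of Records": "Organization",
    "Springfield": "Place",
}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_index(pages):
    """Map each word to the set of page numbers where it appears."""
    index = defaultdict(set)
    for page_number, text in pages.items():
        for word in tokenize(text):
            index[word].add(page_number)
    return index


def search(index, query):
    """Return sorted page numbers that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return []
    hits = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(hits)


def tag_entities(pages):
    """Return {name: (type, sorted pages)} for known names found in the text."""
    tags = {}
    for name, kind in KNOWN_ENTITIES.items():
        found = sorted(
            number for number, text in pages.items()
            if name.lower() in text.lower()
        )
        if found:
            tags[name] = (kind, found)
    return tags


# Made-up OCR output, keyed by page number (assumed input format).
pages = {
    1: "Memo from the Office of Records regarding staffing.",
    2: "Jane Doe met with aides in Springfield.",
    3: "Staffing decisions were reviewed by the Office of Records.",
}
index = build_index(pages)
tags = tag_entities(pages)
```

With this in place, `search(index, "office records")` finds the pages that mention both words, and `tags` lists where each known person, place or organization turns up. A real service would recognize names it was never told about, which is the part this sketch leaves out.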

“The problem we’re trying to solve here is the problem that TPM Muckraker had when they got thousands of pages of attorney general documents, and then just sort of threw it up online and said, ‘Take a look through this,’” said Aron Pilhofer, editor of interactive news technology at the Times. That effort, which won a Polk Award, broke new ground in crowdsourced journalism, a topic that, incidentally, we’re discussing in this month’s Lab Book Club. (And the TPM Muckraker blogger who posted those docs, Paul Kiel, now works for ProPublica.)

But the process wasn’t perfect. TPM readers had to navigate large PDF files and post their observations in the comments section of a blog post, which was helpful in the moment but limited in its long-term usefulness. “Those comments become more than just comments,” Pilhofer said. “They become actual data.”

DocumentCloud seeks to make the most of such data by allowing journalists and readers to annotate documents for all to see and benefit. Think of it as highlighting for the crowd. Pilhofer said the current proof of concept for DocViewer includes an annotation feature that’s similar to the notes users can leave on photographs in Flickr. Users will also be able to link directly to specific pages or even phrases in a document.
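One way to picture annotations as "actual data" is to store each note with the page and character range it refers to, so it can be linked to directly. The sketch below is a guess at the general shape, not DocViewer's design: the `Annotation` fields, the `annotate` helper and the link format produced by `deep_link` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    """A reader's note attached to a span of text on one page."""
    page: int
    start: int   # character offset where the highlighted phrase begins
    end: int     # character offset just past the end of the phrase
    author: str
    note: str


def annotate(pages, page, phrase, author, note):
    """Attach a note to the first occurrence of `phrase` on the given page."""
    start = pages[page].find(phrase)
    if start == -1:
        raise ValueError(f"phrase not found on page {page}: {phrase!r}")
    return Annotation(page, start, start + len(phrase), author, note)


def deep_link(doc_id, annotation):
    """Build a link fragment pointing at the annotated span (made-up format)."""
    return f"#{doc_id}/p{annotation.page}/c{annotation.start}-{annotation.end}"


# Made-up page text and identifiers, for illustration only.
pages = {2: "Jane Doe met with aides in Springfield."}
highlight = annotate(pages, 2, "met with aides", "reader1", "Who were the aides?")
link = deep_link("doc42", highlight)
```

Because the note records offsets rather than living in a comment thread, the highlighted phrase can be recovered later with `pages[highlight.page][highlight.start:highlight.end]`, and the link keeps pointing at the same words.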

To get a sense of DocumentCloud’s potential, take a look at the database of Guantánamo Bay detainees that the Times made public on Nov. 3, when it was accompanied by a 1,500-word story. Each record is linked to relevant government documents that have been made public since “enemy combatants” were first held there in 2002. Pilhofer said the database isn’t using a full-featured version of DocViewer, but it certainly demonstrates the benefit of browsing documents grouped by subject rather than, say, the order in which the Defense Department happened to release them. What’s remarkable about the Gitmo collection, aside from its massive scope, is that the Times has offered up this information at all. As Pilhofer said, “It’s not usually in a newsroom’s DNA to release something like that to the public — and not just the public, the competition, too.”

Scott Klein, the director of online development for ProPublica, said that sharing — a maxim of the Internet, if not of newsrooms — would be the real power of DocumentCloud. The objective, he said, is to maximize the work of collecting documents that’s already been done on a particular topic and allow other journalists to build from there. “How can we collect those documents so the next reporter doing a story on this subject can find this information and use it and display it in a much more satisfying way?” he said.

ProPublica and the Times are asking the Knight Foundation for $1 million over three years to cover their anticipated costs. Klein said expenses would include staff to facilitate the program as well as hosting and bandwidth costs. I asked Pilhofer to respond to criticism of their application leveled by NYU’s Jay Rosen, who suggested that the for-profit Times Company shouldn’t be seeking foundation grants for its journalism. Here’s what Pilhofer said:

I can understand why some would feel that way, but I think it’s more a misunderstanding of what the project is and who it’s intended for…This is a grant submitted by us, but it’s not for us…The project is to create what we’re calling a consortium, some sort of entity that is not The New York Times, that is not ProPublica. Ideally, this will incorporate all sorts of media organizations and bloggers and watchdog groups and universities…If anything, Professor Rosen has it kind of backwards: We’re contributing to this effort. We’re contributing development resources, we’re contributing our time.

Obviously, I’m a fan of DocumentCloud and hope it sees the light of day. But whether they should receive a Knight grant is another question and depends, as my boss Josh asked, on whom the News Challenge is for. Based on the comments at my original post and around the web, it seems like DocumentCloud has generated some resentment among other News Challenge applicants more desperate for funding. One commenter also questioned whether ProPublica’s editor-in-chief, Paul Steiger, has an unfair advantage because he sits on the board of Knight, whose CEO, Alberto Ibargüen, is on the board of ProPublica. That web of ties could certainly help DocumentCloud’s application.

But what will help the project most is that it’s a good idea. And having waded through many News Challenge applications this month, I’ve seen that there’s truly a shortage of good ideas or, at least, of ones with clear potential to immediately improve journalism on a broad level. Kristen Taylor, Knight’s online community manager, said as much to me when she visited Cambridge in October. So while $1 million is a lot of money (a fifth of what Knight has committed to spend on News Challenge projects this year), I’d bet that much cash that DocumentCloud will be one of the winners when they’re announced next fall.

PART OF A SERIES     Knight News Challenge 2009