June 23, 2011, 1 p.m.

“The Drupal of dataviz”: Overview, AP’s News Challenge winner, wants to make sense of big document sets

When WikiLeaks released its Iraq War Logs last year, it unleashed upon the world, and particularly upon newsrooms, 391,832 individual documents. If a single reader were to tackle that document set, says Jonathan Stray, interactive technology editor for the Associated Press, it would take about three years of full-time work simply to read through the information contained within it. That’s factoring in a very ambitious consumption rate of one minute per doc — and factoring out the additional work of finding patterns and standouts in the text, writing about them, and otherwise putting the documents into meaningful context for readers.
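The three-year figure can be checked with a quick back-of-the-envelope calculation. This is a minimal sketch: the document count and the one-minute reading rate come from the article, while the 40-hour week and 52-week working year are assumptions made here to define "full-time."

```python
DOCS = 391_832          # documents in the Iraq War Logs release (from the article)
MINUTES_PER_DOC = 1     # the "very ambitious" reading rate cited above
HOURS_PER_WEEK = 40     # assumption: a standard full-time work week
WEEKS_PER_YEAR = 52     # assumption: no vacation or holidays


def reading_years(docs, minutes_per_doc=MINUTES_PER_DOC,
                  hours_per_week=HOURS_PER_WEEK, weeks_per_year=WEEKS_PER_YEAR):
    """Return the years of full-time work needed just to read every document."""
    total_hours = docs * minutes_per_doc / 60
    return total_hours / (hours_per_week * weeks_per_year)


years = reading_years(DOCS)
print(f"{years:.1f} years")  # → 3.1 years
```

Under those assumptions the estimate lands just above three years, and that is before any time spent on analysis or writing.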

Oof. There’s very little that’s redeeming about the term “document dump,” but one benefit of the phrase is its implicit illustration of a problem: As documents become easier than ever to copy and store and share, volume alone can lead to a kind of chaos. “Basically,” Stray noted at the NICAR conference earlier this year, “we’re drowning in documents.”

Stray and the AP think they’ve figured out a way to tackle that problem, and they’ve just won $475,000 from the Knight Foundation to test their theory. Over the next two years, Stray (a Lab contributor) and a team at the AP will work to build Overview, a visualization tool for documents. While DocumentCloud, another News Challenge fundee, offers a way to store and search docs, Overview offers a way to organize them in a manner that emphasizes connections among their content. There’s dataviz; Overview suggests a subset of that field. Call it docuviz.

Document search and document visualization aren’t mutually exclusive; on the contrary, Overview plans to interface with DocumentCloud, which has become expert at presentational issues like viewing, annotation, and searching. From there, Stray says, Overview will group documents into clusters that reveal the patterns hidden within the documents’ text.

That approach isn’t new — in fact, it’s the basic insight that drives search engines and other familiar technologies — but it’s novel, Stray notes, in its application to a particularly journalistic use case. Keyword searches, which until now have been the focus of document-set reporting, are, by their nature, limiting; so are word-frequency tools like the much-praised Google Ngram viewer. Together, they present a problem that’s as practical as it is philosophical: How do you know what search terms to use if you don’t yet know the content of the documents you’re searching?

Overview solves that riddle semantically, emphasizing connections over categories — or, rather, creating categories out of connections. The tool processes documents according to the words they contain, and then analyzes the overlap between keywords to identify trends — and clusters — within the text. Overview, premised on the idea that visualization is itself a kind of meaning-making, “is one response to the deluge of document and data sets that journalists and others are now confronting,” Stray says.
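To make the idea of "creating categories out of connections" concrete, here is a minimal sketch of grouping documents by the words they share. It is not Overview's actual algorithm, which the article does not specify. The TF-IDF weighting, the cosine-similarity threshold, the stopword list, and the four sample documents are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list; a real tool would use a much larger one.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "was", "were", "by", "for", "at"}


def tokenize(text):
    """Lowercase the text and keep alphabetic words that are not stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def tfidf_vectors(docs):
    """Weight each word by its count in a document and its rarity across documents."""
    counts = [Counter(tokenize(d)) for d in docs]
    doc_freq = Counter(word for c in counts for word in c)
    n = len(docs)
    return [{w: tf * math.log(n / doc_freq[w]) for w, tf in c.items()} for c in counts]


def cosine(u, v):
    """Cosine similarity between two sparse word-weight vectors."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def cluster(docs, threshold=0.1):
    """Put two documents in the same cluster when their word overlap is high enough."""
    vectors = tfidf_vectors(docs)
    labels = list(range(len(docs)))  # every document starts in its own cluster
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                old, new = labels[j], labels[i]
                labels = [new if label == old else label for label in labels]
    groups = {}
    for idx, label in enumerate(labels):
        groups.setdefault(label, []).append(idx)
    return sorted(groups.values())


# Invented sample documents, for illustration only.
docs = [
    "Roadside bomb explosion injured two soldiers near Baghdad",
    "Soldiers injured by roadside bomb outside Baghdad checkpoint",
    "Detainee transferred to prison after interrogation",
    "Prison detainee interrogation report filed",
]
print(cluster(docs))  # → [[0, 1], [2, 3]]
```

No search term is supplied anywhere: the two groups emerge from overlapping vocabulary alone, which is the property that lets a reporter see what a document set contains before knowing what to look for.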

If Overview were a newsroom tool (rather than, as it is at the moment, a working prototype), it might be used, for example, to help sort through the release of Sarah Palin’s email records (more than 24,000 of them). Or, for that matter, SEC filings or corporate communications or FOIA records or basically any set of documents that is big and bulky and otherwise unwieldy. Overview could prove particularly useful for investigative reporters, who routinely spend hours and days and weeks panning through documents in hopes of finding flecks of gold. Inefficiencies have been built into their work. “When you file a FOIA request,” Stray notes, “governments have an obligation to give you the material — but they don’t have an obligation to organize it, or tell you what it means, or how to find anything.”

Stray and his team have three main goals for the Overview project:

  • Produce a tool that journalists can use on real document-set problems, on deadline. To do that, Stray says, the team will have to pick a couple of key workflows, identify the most critical problems, and then build something that can live in, and be of use to, newsrooms. Ultimately, he says, “it’s got to be deployable.”
  • Build up a community of “journalist-geeks” who have an interest in the tool. The only way to build something that’s truly useful to journalists will be to involve journalists in the building process itself. Stray hopes to gather a group of people who, he says, “see it in their interest both to use the tool and contribute to its further development.” Overview will be a product; it will also be, ideally, a community.
  • Increase the speed of data-mining research in journalism. Overview is in some sense a proof of concept — and the ideas it’s testing could be of use far beyond the tool itself. “There’s no shortage of things that could be tried,” Stray points out, “but there isn’t an open platform where people can experiment with new techniques and with bringing technology over from other fields.” With the Overview project, he says, “we’re really trying to kick-start the development of open analytics technology.”

To begin with — employment alert! — Stray is looking to hire “two top-notch developers”: one person “who can put together a really good user experience,” and another, a computer scientist, who cares about the questions Overview is tackling and “wants to play with a lot of cutting-edge technology.”

Overview will be open-source — because of its News Challenge funding, yes, but more so because openness is in its DNA. (And that’s because of, not despite, the fact that the tool’s being built under the auspices of the AP. “This actually aligns very closely with our mission,” Stray says, “because part of what we do is try to support all of our members, and our customers, as well.” The AP, he notes, is “an organization that has a long tradition of being a shared resource for the industry.”) Though Overview will be based within, and used by, the AP, Stray is hoping that its production will be crowdsourced — through input, in particular, from fellow hacker-journalists, who he hopes will contribute both ideas and code to the effort. The broad goal? “A big, open-source project that people have contributed to and figured out how to configure to their needs.” Ultimately, Stray says, “we want this to be the Drupal of data visualization.”

Part of a series: Knight News Challenge 2011