June 23, 2011, 1 p.m.

“The Drupal of dataviz”: Overview, AP’s News Challenge winner, wants to make sense of big document sets

When WikiLeaks released its Iraq War Logs last year, it unleashed upon the world, and particularly upon newsrooms, 391,832 individual documents. If a single reader were to tackle that document set, says Jonathan Stray, interactive technology editor for the Associated Press, it would take about three years of full-time work simply to read through the information contained within it. That’s factoring in a very ambitious consumption rate of one minute per doc — and factoring out the additional work of finding patterns and standouts in the text, writing about them, and otherwise putting the documents into meaningful context for readers.
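
The three-year figure can be checked with a few lines of arithmetic. The sketch below assumes a 40-hour work week and 52 working weeks a year; those two numbers are illustrative assumptions, not figures Stray gave, and `reading_years` is a hypothetical helper name.

```python
def reading_years(num_docs, minutes_per_doc=1.0,
                  hours_per_week=40.0, weeks_per_year=52.0):
    """Estimate full-time years needed to read a document set.

    hours_per_week and weeks_per_year are assumptions made for
    illustration; the article only gives the one-minute-per-doc rate.
    """
    total_hours = num_docs * minutes_per_doc / 60.0
    return total_hours / (hours_per_week * weeks_per_year)


# 391,832 documents at one minute each
years = reading_years(391_832)
print(f"{years:.1f} years of full-time reading")
```

Under those assumptions the estimate lands a little above three years, which matches the "about three years" in Stray's account.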

Oof. There’s very little that’s redeeming about the term “document dump,” but one benefit of the phrase is its implicit illustration of a problem: As documents become easier than ever to copy and store and share, volume alone can lead to a kind of chaos. “Basically,” Stray noted at the NICAR conference earlier this year, “we’re drowning in documents.”

Stray and the AP think they’ve figured out a way to tackle that problem — and they’ve just won $475,000 from the Knight Foundation to test their theory. Over the next two years, Stray (a Lab contributor) and a team at the AP will work to build Overview, a visualization tool for documents. While Document Cloud, another News Challenge fundee, offers a way to store and search docs, Overview offers a way to organize them in a manner that emphasizes connections among their content. There’s dataviz; Overview suggests a subset of that field. Call it docuviz.

Document search and document visualization aren’t mutually exclusive; on the contrary, Overview plans to interface with Document Cloud, which has become expert at presentational issues like viewing, annotation, and searching. From there, Stray says, Overview will group documents into clusters that will identify the patterns hidden within documents’ text.

That approach isn’t new — in fact, it’s the basic insight that drives search engines and other familiar technologies — but it’s novel, Stray notes, in its application to a particularly journalistic use case. Keyword searches, which until now have been the focus of document-set reporting, are, by their nature, limiting; so are word-frequency tools like the much-praised Google Ngram viewer. Together, they present a problem that’s as practical as it is philosophical: How do you know what search terms to use if you don’t yet know the content of the documents you’re searching?

Overview solves that riddle semantically, emphasizing connections over categories — or, rather, creating categories out of connections. The tool processes documents according to the words they contain, and then analyzes the overlap between keywords to identify trends — and clusters — within the text. Overview, premised on the idea that visualization is itself a kind of meaning-making, “is one response to the deluge of document and data sets that journalists and others are now confronting,” Stray says.
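
To make the idea of "categories out of connections" concrete, here is a minimal sketch of grouping documents by keyword overlap. It is not Overview's actual method, which the article does not specify: the stopword list, the Jaccard overlap measure, the 0.2 threshold, the sample texts, and the function names are all assumptions chosen for illustration.

```python
import re
from itertools import combinations

# A tiny, hypothetical stopword list; a real tool would use a fuller one.
STOPWORDS = {"the", "and", "for", "after", "with", "from", "that", "this"}


def keywords(text):
    """Return the set of lowercase words longer than two letters, minus stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return {w for w in words if len(w) > 2 and w not in STOPWORDS}


def jaccard(a, b):
    """Share of keywords two documents have in common (0.0 to 1.0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster(docs, threshold=0.2):
    """Group document indices whose keyword overlap meets the threshold.

    Documents are linked pairwise, and linked documents are merged into
    one cluster with a small union-find structure.
    """
    keys = [keywords(d) for d in docs]
    parent = list(range(len(docs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(docs)), 2):
        if jaccard(keys[i], keys[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(docs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Invented sample texts, not taken from any real document set.
docs = [
    "Convoy attacked by roadside bomb near checkpoint",
    "Roadside bomb found near checkpoint, convoy delayed",
    "Budget meeting scheduled for school board",
    "School board approves budget after meeting",
]
print(cluster(docs))
```

No search term is supplied anywhere: the two groups emerge only from which words the documents share, which is the point Stray is making about keyword search. Comparing every pair, as this sketch does, would be far too slow for hundreds of thousands of documents, so a production tool would need a more scalable approach.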

If Overview were a newsroom tool (rather than, as it is at the moment, a working prototype), it might be used, for example, to help sort through the release of Sarah Palin’s email records (more than 24,000 of them). Or, for that matter, SEC filings or corporate communications or FOIA records or basically any set of documents that is big and bulky and otherwise unwieldy. Overview could prove particularly useful for investigative reporters, who routinely spend hours and days and weeks panning through documents in hopes of finding flecks of gold. Inefficiencies have been built into their work. “When you file a FOIA request,” Stray notes, “governments have an obligation to give you the material — but they don’t have an obligation to organize it, or tell you what it means, or how to find anything.”

Stray and his team have three main goals for the Overview project:

  • Produce a tool that journalists can use on real document-set problems, on deadline. To do that, Stray says, the team will have to pick a couple of key workflows, identify the most critical problems, and then build something that can live in, and be of use to, newsrooms. Ultimately, he says, “it’s got to be deployable.”
  • Build up a community of “journalist-geeks” who have an interest in the tool. The only way to build something that’s truly useful to journalists will be to involve journalists in the building process itself. Stray hopes to gather a group of people who, he says, “see it in their interest both to use the tool and contribute to its further development.” Overview will be a product; it will also be, ideally, a community.
  • Increase the speed of data-mining research in journalism. Overview is in some sense a proof of concept — and the ideas it’s testing could be of use far beyond the tool itself. “There’s no shortage of things that could be tried,” Stray points out, “but there isn’t an open platform where people can experiment with new techniques and with bringing technology over from other fields.” With the Overview project, he says, “we’re really trying to kick-start the development of open analytics technology.”

To begin with — employment alert! — Stray is looking to hire “two top-notch developers”: one person “who can put together a really good user experience,” and another, a computer scientist, who cares about the questions Overview is tackling and “wants to play with a lot of cutting-edge technology.”

Overview will be open-source — because of its News Challenge funding, yes, but more so because openness is in its DNA. (And that’s because of, not despite, the fact that the tool’s being built under the auspices of the AP. “This actually aligns very closely with our mission,” Stray says, “because part of what we do is try to support all of our members, and our customers, as well.” The AP, he notes, is “an organization that has a long tradition of being a shared resource for the industry.”) Though Overview will be based within, and used by, the AP, Stray is hoping that its production will be crowdsourced — through input, in particular, from fellow hacker-journalists, who he hopes will contribute both ideas and code to the effort. The broad goal? “A big, open-source project that people have contributed to and figured out how to configure to their needs.” Ultimately, Stray says, “we want this to be the Drupal of data visualization.”

PART OF A SERIES     Knight News Challenge 2011
