June 23, 2011, 1 p.m.

“The Drupal of dataviz”: Overview, AP’s News Challenge winner, wants to make sense of big document sets

When WikiLeaks released its Iraq War Logs last year, it unleashed upon the world, and particularly upon newsrooms, 391,832 individual documents. If a single reader were to tackle that document set, says Jonathan Stray, interactive technology editor for the Associated Press, it would take about three years of full-time work simply to read through the information contained within it. That’s factoring in a very ambitious consumption rate of one minute per doc — and factoring out the additional work of finding patterns and standouts in the text, writing about them, and otherwise putting the documents into meaningful context for readers.
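The three-year figure can be checked with a few lines of arithmetic. This is only a rough sketch: the document count and the one-minute rate come from the paragraph above, while the 40-hour week and 52-week year are assumptions about what "full-time" means.

```python
# Figures from the article
DOCUMENTS = 391_832
MINUTES_PER_DOC = 1

# Assumptions (not stated in the article): what "full-time" means
HOURS_PER_WEEK = 40
WEEKS_PER_YEAR = 52

total_hours = DOCUMENTS * MINUTES_PER_DOC / 60
work_years = total_hours / (HOURS_PER_WEEK * WEEKS_PER_YEAR)

print(f"{total_hours:,.0f} hours, about {work_years:.1f} working years")
```

Under those assumptions the total comes to a little over three working years, in line with Stray's estimate.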

Oof. There’s very little that’s redeeming about the term “document dump,” but one benefit of the phrase is its implicit illustration of a problem: As documents become easier than ever to copy and store and share, volume alone can lead to a kind of chaos. “Basically,” Stray noted at the NICAR conference earlier this year, “we’re drowning in documents.”

Stray and the AP think they’ve figured out a way to tackle that problem, and they’ve just won $475,000 from the Knight Foundation to test their theory. Over the next two years, Stray (a Lab contributor) and a team at the AP will work to build Overview, a visualization tool for documents. While DocumentCloud, another News Challenge fundee, offers a way to store and search docs, Overview offers a way to organize them in a manner that emphasizes connections among their content. There’s dataviz; Overview suggests a subset of that field. Call it docuviz.

Document search and document visualization aren’t mutually exclusive; on the contrary, Overview plans to interface with DocumentCloud, which has become expert at presentational issues like viewing, annotation, and searching. From there, Stray says, Overview will group documents into clusters that will identify the patterns hidden within documents’ text.

That approach isn’t new — in fact, it’s the basic insight that drives search engines and other familiar technologies — but it’s novel, Stray notes, in its application to a particularly journalistic use case. Keyword searches, which until now have been the focus of document-set reporting, are, by their nature, limiting; so are word-frequency tools like the much-praised Google Ngram viewer. Together, they present a problem that’s as practical as it is philosophical: How do you know what search terms to use if you don’t yet know the content of the documents you’re searching?

Overview solves that riddle semantically, emphasizing connections over categories — or, rather, creating categories out of connections. The tool processes documents according to the words they contain, and then analyzes the overlap between keywords to identify trends — and clusters — within the text. Overview, premised on the idea that visualization is itself a kind of meaning-making, “is one response to the deluge of document and data sets that journalists and others are now confronting,” Stray says.
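To make the idea of "creating categories out of connections" concrete, here is a minimal sketch of grouping documents by keyword overlap. It is not Overview's actual method, which the article does not describe in detail. The stopword list, the Jaccard overlap measure, the 0.25 threshold, the helper names, and the four sample documents are all assumptions invented for illustration.

```python
import re
from itertools import combinations

# Hypothetical, deliberately tiny stopword list
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "at", "by", "near"}

def keywords(text):
    """Lowercased word set with common stopwords removed."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def jaccard(a, b):
    """Fraction of keywords two documents share (0.0 to 1.0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def cluster(docs, threshold=0.25):
    """Link any two documents whose overlap meets the threshold, then
    return the connected groups as lists of document indices."""
    sets = [keywords(d) for d in docs]
    parent = list(range(len(docs)))

    def find(i):
        # Follow parent links to the group's representative
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(docs)), 2):
        if jaccard(sets[i], sets[j]) >= threshold:
            parent[find(j)] = find(i)

    groups = {}
    for i in range(len(docs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Made-up sample documents, not drawn from any real document set
docs = [
    "Roadside bomb explosion near checkpoint",
    "Bomb explosion reported at checkpoint",
    "Detainee transferred to prison facility",
    "Prison facility received detainee transfer",
]
clusters = cluster(docs)
print(clusters)
```

Nobody has to choose a search term in advance here: the groups emerge from whichever words the documents happen to share. A real tool would likely need word weighting and a more scalable clustering method to handle hundreds of thousands of documents.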

If Overview were a newsroom tool (rather than, as it is at the moment, a working prototype), it might be used, for example, to help sort through the release of Sarah Palin’s email records (more than 24,000 of them). Or, for that matter, SEC filings or corporate communications or FOIA records or basically any set of documents that is big and bulky and otherwise unwieldy. Overview could prove particularly useful for investigative reporters, who routinely spend hours and days and weeks panning through documents in hopes of finding flecks of gold. Inefficiencies have been built into their work. “When you file a FOIA request,” Stray notes, “governments have an obligation to give you the material — but they don’t have an obligation to organize it, or tell you what it means, or how to find anything.”

Stray and his team have three main goals for the Overview project:

  • Produce a tool that journalists can use on real document-set problems, on deadline. To do that, Stray says, the team will have to pick a couple of key workflows, identify the most critical problems, and then build something that can live in, and be of use to, newsrooms. Ultimately, he says, “it’s got to be deployable.”
  • Build up a community of “journalist-geeks” who have an interest in the tool. The only way to build something that’s truly useful to journalists will be to involve journalists in the building process itself. Stray hopes to gather a group of people who, he says, “see it in their interest both to use the tool and contribute to its further development.” Overview will be a product; it will also be, ideally, a community.
  • Increase the speed of data-mining research in journalism. Overview is in some sense a proof of concept — and the ideas it’s testing could be of use far beyond the tool itself. “There’s no shortage of things that could be tried,” Stray points out, “but there isn’t an open platform where people can experiment with new techniques and with bringing technology over from other fields.” With the Overview project, he says, “we’re really trying to kick-start the development of open analytics technology.”

To begin with — employment alert! — Stray is looking to hire “two top-notch developers”: one person “who can put together a really good user experience,” and another, a computer scientist, who cares about the questions Overview is tackling and “wants to play with a lot of cutting-edge technology.”

Overview will be open-source — because of its News Challenge funding, yes, but more so because openness is in its DNA. (And that’s because of, not despite, the fact that the tool’s being built under the auspices of the AP. “This actually aligns very closely with our mission,” Stray says, “because part of what we do is try to support all of our members, and our customers, as well.” The AP, he notes, is “an organization that has a long tradition of being a shared resource for the industry.”) Though Overview will be based within, and used by, the AP, Stray is hoping that its production will be crowdsourced — through input, in particular, from fellow hacker-journalists, who he hopes will contribute both ideas and code to the effort. The broad goal? “A big, open-source project that people have contributed to and figured out how to configure to their needs.” Ultimately, Stray says, “we want this to be the Drupal of data visualization.”

PART OF A SERIES     Knight News Challenge 2011