Nieman Foundation at Harvard
Sept. 29, 2009, 9 a.m.

Five projects on the frontier of text-based data analysis and visualization

Last week, I attended the Transparent Text symposium at IBM’s offices in Cambridge. The conference focused on text-based data storage, analysis, and visualization — awesomely nerdy stuff, in other words.

Some of the presentations would be familiar to loyal readers of this site: Amanda Michel’s distributed reporting at ProPublica, Ethan Zuckerman’s Media Cloud and “nutritional labeling” for news, DocumentCloud, and The Guardian’s crowdsourcing tool. Here, then, are five other projects that piqued my interest at the conference:

OpenCalais

I’ve mentioned OpenCalais in the context of DocumentCloud, but there’s much more to the software, which was purchased by Thomson Reuters in 2007. In a sentence, OpenCalais parses text for names, locations, organizations, and other entities to make unstructured documents more useful. Oh, and it’s free.

Above are the slides presented by Tom Tague, head of OpenCalais, whose talk focused on how publishers are using the service. The best example is on the last slide: Two investigative-journalism networks, which Tague did not name, are using OpenCalais to compare birth, death, and wedding records with government contracts to identify conflicts of interest that wouldn’t be otherwise apparent.
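
To make that kind of cross-referencing concrete, here is a minimal sketch in Python. It assumes the person names have already been extracted from each document by an entity-extraction service; it does not call OpenCalais, and the record shape, the `normalize` and `shared_names` helpers, and the sample records are all hypothetical and made up for illustration.

```python
def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(name.lower().split())


def shared_names(vital_records, contracts):
    """Return normalized names that appear in both record sets.

    Each record is a dict with a "people" list, standing in for the person
    entities an extraction service might return (hypothetical shape).
    """
    def names_in(records):
        return {normalize(p) for r in records for p in r["people"]}

    return names_in(vital_records) & names_in(contracts)


# Made-up illustrative records, not real data.
vital_records = [{"type": "wedding", "people": ["Jane  Doe", "John Roe"]}]
contracts = [{"id": "C-1", "people": ["jane doe", "Alex Poe"]}]

matches = shared_names(vital_records, contracts)
print(matches)
```

A shared name is only a lead for a reporter to check, not proof of a conflict of interest; real records would also need far more careful name matching than this.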

IBM’s DeepQA project

IBM’s successor to Deep Blue, the chess-playing supercomputer that defeated Garry Kasparov, is DeepQA, a natural language processor that’s being trained to play Jeopardy! It’s a whole different challenge, the complexities of which were explained in a New York Times article last spring and in the IBM promotional video above.

What does this have to do with journalism? Nothing, at first, but the research behind DeepQA (or “Watson,” as they call it at IBM) could improve the way information is processed and interpreted — and hasn’t that long been the news industry’s specialty?

Maplight

[Widget: Medicare Prescription Drug Price Negotiation Act of 2007 (at MAPLight.org; Center for Responsive Politics)]

Maplight is a project funded primarily by the Sunlight Foundation that seeks to “illuminate” the connection between money and politics in California and the federal government. Its databases allow users to compare votes on particular bills with campaign funding from interest groups that supported or opposed the legislation. The widget above, for instance, demonstrates the correlation, if not causation, between contributions and votes on a Medicare bill in 2007.
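
The basic comparison can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Maplight's actual method or data: the `average_by_vote` helper, the record fields, and every figure below are invented placeholders.

```python
from statistics import mean


def average_by_vote(legislators):
    """Average contributions from a bill's supporters, grouped by vote cast."""
    by_vote = {}
    for leg in legislators:
        by_vote.setdefault(leg["vote"], []).append(leg["from_supporters"])
    return {vote: mean(amounts) for vote, amounts in by_vote.items()}


# Invented placeholder figures, not real contribution data.
legislators = [
    {"name": "A", "vote": "yes", "from_supporters": 3000},
    {"name": "B", "vote": "yes", "from_supporters": 1000},
    {"name": "C", "vote": "no", "from_supporters": 500},
]

averages = average_by_vote(legislators)
print(averages)
```

A gap between the two averages would show correlation only; it says nothing about why any legislator voted as they did.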

IBM’s Many Eyes project

Many Eyes is IBM’s free data-visualization software. (I used it for two posts earlier this year.) Fernanda Viégas and Martin Wattenberg demonstrated some of their best text-based visualizations, like Word Tree, and previewed a new one that compares Google searches, pictured above comparing the most common endings of searches for “is my son…” and “is my daughter…” Think of it as an amped-up version of Google Suggest.
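
The idea behind that comparison, counting what most often follows a given opening phrase, can be sketched in Python. This is not the Many Eyes implementation, and it does not query Google; the `common_endings` helper is hypothetical and the query list is made up for illustration.

```python
from collections import Counter


def common_endings(queries, prefix, top=3):
    """Count what follows `prefix` in the queries that start with it."""
    prefix = prefix.lower().strip()
    endings = Counter()
    for q in queries:
        q = " ".join(q.lower().split())  # normalize case and spacing
        if q.startswith(prefix + " "):
            endings[q[len(prefix) + 1:]] += 1
    return endings.most_common(top)


# Made-up placeholder queries, not real search data.
queries = [
    "is my son tall",
    "is my son tall",
    "is my son ready for school",
    "is my daughter tall",
    "is my daughter ready for school",
    "is my daughter ready for school",
]

son = common_endings(queries, "is my son")
daughter = common_endings(queries, "is my daughter")
print(son)
print(daughter)
```

Placing the two ranked lists side by side is what gives the visualization its point: the same opening phrase, different endings.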

Linked data at The New York Times

I actually missed this presentation, but Alexis Lloyd of The New York Times Co.’s research and development group, which we profiled at length in May, discussed how the Times is using linked data to organize its content. ReadWriteWeb reported on this project in June. The slide above, for instance, illustrates how the Times classifies airline accidents to create a more-intelligent archive of its plane-crash coverage.

Slide photos by Andreas Myhrvold Braendhaugen and lite used under a Creative Commons license.
