Nieman Foundation at Harvard
Sept. 29, 2009, 9 a.m.

Five projects on the frontier of text-based data analysis and visualization

Last week, I attended the Transparent Text symposium at IBM’s offices in Cambridge. The conference focused on text-based data storage, analysis, and visualization — awesomely nerdy stuff, in other words.

Some of the presentations would be familiar to loyal readers of this site: Amanda Michel’s distributed reporting at ProPublica, Ethan Zuckerman’s Media Cloud and “nutritional labeling” for news, DocumentCloud, and The Guardian’s crowdsourcing tool. Here, then, are five other projects that piqued my interest at the conference:

OpenCalais

I’ve mentioned OpenCalais in the context of DocumentCloud, but there’s much more to the software, which was purchased by Thomson Reuters in 2007. In a sentence, OpenCalais parses text for names, locations, organizations, and other entities to make unstructured documents more useful. Oh, and it’s free.

Above are the slides presented by Tom Tague, head of OpenCalais, whose talk focused on how publishers are using the service. The best example is on the last slide: Two investigative-journalism networks, which Tague did not name, are using OpenCalais to compare birth, death, and wedding records with government contracts to identify conflicts of interest that wouldn’t otherwise be apparent.
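To make that last example concrete, here is a minimal sketch in Python of the cross-referencing step. It does not call OpenCalais or reflect how those networks actually work; it assumes the names have already been extracted from the documents, and every name, record, and function here (`normalize`, `find_possible_conflicts`) is made up for illustration.

```python
def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so 'PAT  EXAMPLE' matches 'Pat Example'."""
    return " ".join(name.lower().split())


def find_possible_conflicts(marriage_records, contracts, officials):
    """Flag contracts awarded to someone married to a public official.

    marriage_records: list of (spouse_a, spouse_b) name pairs
    contracts:        list of (contractor_name, agency) pairs
    officials:        list of public officials' names
    All three inputs are hypothetical, already-extracted entities.
    """
    # Build a two-way lookup of who is married to whom.
    spouse_of = {}
    for a, b in marriage_records:
        a, b = normalize(a), normalize(b)
        spouse_of.setdefault(a, set()).add(b)
        spouse_of.setdefault(b, set()).add(a)

    official_names = {normalize(o) for o in officials}

    flagged = []
    for contractor, agency in contracts:
        contractor = normalize(contractor)
        # A contractor whose spouse is an official is worth a closer look.
        for official in sorted(spouse_of.get(contractor, set()) & official_names):
            flagged.append((contractor, official, agency))
    return flagged


# Invented sample data, for illustration only.
marriage_records = [("Pat Example", "Sam Sample"), ("Lee Placeholder", "Kim Madeup")]
officials = ["Sam Sample"]
contracts = [("PAT  EXAMPLE", "Roads Department"), ("Kim Madeup", "Parks Department")]

conflicts = find_possible_conflicts(marriage_records, contracts, officials)
print(conflicts)
```

A match like this is only a lead for a reporter to check, not proof of anything. Exact name matching will also miss misspellings and confuse people who share a name, which is part of why entity extraction on messy documents is hard.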

IBM’s DeepQA project

IBM’s successor to Deep Blue, the chess-playing supercomputer that defeated Garry Kasparov, is DeepQA, a natural language processor that’s being trained to play Jeopardy! It’s a whole different challenge, the complexities of which were explained in a New York Times article last spring and in the IBM promotional video above.

What does this have to do with journalism? Nothing, at first, but the research behind DeepQA (or “Watson,” as they call it at IBM) could improve the way information is processed and interpreted — and hasn’t that long been the news industry’s specialty?

Maplight

Widget: Medicare Prescription Drug Price Negotiation Act of 2007 (at MAPLight.org; Center for Responsive Politics)

Maplight is a project funded primarily by the Sunlight Foundation that seeks to “illuminate” the connection between money and politics in California and the federal government. Its databases allow users to compare votes on particular bills with campaign funding from interest groups that supported or opposed the legislation. The widget above, for instance, demonstrates the correlation, if not causation, between contributions and votes on a Medicare bill in 2007.
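The comparison behind a widget like that is simple to sketch. The Python below is not Maplight's code or data; it assumes you already have each legislator's vote and the total they received from interest groups on one side of a bill, and the legislators, dollar figures, and `average_by_vote` function are all invented for illustration.

```python
from statistics import mean


def average_by_vote(votes, contributions):
    """Average contributions received, grouped by how each legislator voted.

    votes:         dict of legislator -> "yes" or "no"
    contributions: dict of legislator -> dollars received from interest
                   groups on one side of the bill (missing means $0)
    """
    amounts_by_vote = {}
    for legislator, vote in votes.items():
        amounts_by_vote.setdefault(vote, []).append(contributions.get(legislator, 0))
    return {vote: mean(amounts) for vote, amounts in amounts_by_vote.items()}


# Invented sample data, for illustration only.
votes = {
    "Legislator A": "yes",
    "Legislator B": "yes",
    "Legislator C": "no",
    "Legislator D": "no",
}
contributions = {
    "Legislator A": 30000,
    "Legislator B": 10000,
    "Legislator C": 5000,
}

averages = average_by_vote(votes, contributions)
print(averages)
```

A gap between the two averages shows that money and votes moved together on a bill. It says nothing about which came first, which is the correlation-not-causation caveat above.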

IBM’s Many Eyes project

Many Eyes is IBM’s free data-visualization software. (I used it for two posts earlier this year.) Fernanda Viégas and Martin Wattenberg demonstrated some of their best text-based visualizations, like Word Tree, and previewed a new one that compares Google searches. It’s pictured above, showing the most common endings of searches for “is my son…” and “is my daughter…” Think of it as an amped-up version of Google Suggest.

Linked data at The New York Times

I actually missed this presentation, but Alexis Lloyd of The New York Times Co.’s research and development group, which we profiled at length in May, discussed how the Times is using linked data to organize its content. ReadWriteWeb reported on this project in June. The slide above, for instance, illustrates how the Times classifies airline accidents to create a more intelligent archive of its plane-crash coverage.

Slide photos by Andreas Myhrvold Braendhaugen and lite used under a Creative Commons license.
