Nieman Foundation at Harvard
Dec. 9, 2014, 2:56 p.m.
Reporting & Production

The New York Times R&D Lab releases Hive, an open-source crowdsourcing tool

“We want to learn from others who are doing good things, and when we learn things we share them as well.”

A few months ago we told you about a new tool from The New York Times that allowed readers to help identify ads inside the paper’s massive archive. Madison, as it was called, was the first iteration of a new crowdsourcing tool from The New York Times R&D Lab that would make it easier to break down specific tasks and get users to help an organization get at the data it needs.

Today the R&D Lab is opening up the platform that powers the whole thing. Hive is an open-source framework that lets anyone build their own crowdsourcing project. The code responsible for Hive is now available on GitHub. With Hive, a developer can create assignments for users, define what they need to do, and keep track of their progress in helping to solve problems.

Here’s the R&D Lab’s Jacqui Maher with some of the nuts and bolts of Hive:

The system we built is Hive, an open-source platform that lets developers produce crowdsourcing applications for a variety of contexts. Informed by our work on Streamtools, Hive’s technical architecture takes advantage of Go’s efficiency in parsing and transmitting JSON along with its straightforward interface to Elasticsearch. Combining the speed of a compiled language with the flexibility of a search engine means Hive is able to handle a wide variety of user-submitted contributions on diverse sets of tasks.

Matt Boggie, executive director of the R&D Lab, said Madison evolved from the print archive app TimesMachine, but in creating the tool they realized it could serve multiple purposes outside the Times’ back pages. “The big thing was we realized the problem we were solving was one particular manifestation of a common problem lots of organizations have,” he said.

The decision to make Hive open-source was fairly simple, he said, since so many news organizations have made a habit of asking readers for help in sifting through documents or making sense of disorganized piles of data. The benefit to the Times is seeing how other people and organizations use the platform and what ideas they can apply at the paper. “We want to learn from others who are doing good things, and when we learn things we share them as well,” he said.

In the case of Madison, the Times needed several types of data: the text of an ad, the product it was selling, and any information on the visuals or the size of the ad. Boggie said the trick was to make a system that could fit their specific needs while also being open enough to be useful for other purposes. The solution was to break crowdsourcing down into a series of smaller tasks that create a kind of feedback loop. For instance, in Madison, users are asked to find, tag, and transcribe ads. Each of those steps is possible only through the work of the others; in order to tag or transcribe an ad, you first have to correctly identify what is an ad.
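The dependency between the find, tag, and transcribe steps described above can be sketched in a few lines of Go, the language Hive is written in. This is only an illustration of the idea, not Hive's actual data model or API: the names Task, Asset, requires, Available, Complete, and ErrBlocked are all hypothetical, and the rule that tagging and transcribing both wait on finding is taken from the description in this article.

```go
package main

import (
	"errors"
	"fmt"
)

// Task names one step in a hypothetical find/tag/transcribe workflow.
type Task string

const (
	Find       Task = "find"
	Tag        Task = "tag"
	Transcribe Task = "transcribe"
)

// requires maps a task to the task that must be finished first.
// Assumption for this sketch: an ad must be found before it can be
// tagged or transcribed.
var requires = map[Task]Task{
	Tag:        Find,
	Transcribe: Find,
}

// ErrBlocked is returned when a task's prerequisite is not yet done.
var ErrBlocked = errors.New("prerequisite task not completed")

// Asset is one item from an archive, for example a region of a scanned page.
type Asset struct {
	ID   string
	Done map[Task]bool
}

// NewAsset returns an asset with no completed tasks.
func NewAsset(id string) *Asset {
	return &Asset{ID: id, Done: map[Task]bool{}}
}

// Available lists the tasks that could be assigned to a user right now:
// tasks that are not done and whose prerequisite, if any, is done.
func (a *Asset) Available() []Task {
	var out []Task
	for _, t := range []Task{Find, Tag, Transcribe} {
		if a.Done[t] {
			continue
		}
		if pre, ok := requires[t]; ok && !a.Done[pre] {
			continue
		}
		out = append(out, t)
	}
	return out
}

// Complete records a finished task, refusing it if the prerequisite is missing.
func (a *Asset) Complete(t Task) error {
	if pre, ok := requires[t]; ok && !a.Done[pre] {
		return ErrBlocked
	}
	a.Done[t] = true
	return nil
}

func main() {
	ad := NewAsset("page-1-region-3")
	fmt.Println("available:", ad.Available())

	// Tagging before the ad has been found is rejected.
	if err := ad.Complete(Tag); err != nil {
		fmt.Println("tag:", err)
	}

	// Once the ad is found, the remaining tasks open up.
	if err := ad.Complete(Find); err != nil {
		fmt.Println("find:", err)
	}
	fmt.Println("available:", ad.Available())
}
```

A real system would also have to reconcile contributions from many users on the same asset, which this sketch leaves out.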

Boggie said more than 14,000 people have used Madison so far and contributed some form of work. More than 100,000 assignments have been completed, and Boggie said they hope to open up a new set of ads (get ready for the 1970s) in early 2015. They also plan to release the data Madison has collected on the ads from the 1960s.
