Nieman Foundation at Harvard
March 1, 2013, 1:19 p.m.
Link: www.theverge.com | Posted by: Joshua Benton | March 1, 2013

Russell Brandom at The Verge has a piece on Common Crawl, “a non-profit foundation dedicated to providing an open repository of web crawl data that can be accessed and analyzed by everyone.” At one extreme, that dataset could be used to build your own local or targeted search engine; at a smaller scale, it could be a boon to data journalists:

For example, web crawl data can be used to spot trends and identify patterns in politics, economics, health, popular culture and many other aspects of life. It provides an immensely rich corpus for scientific research, technological advancement, and innovative new businesses. It is crucial for our information-based society that the web be openly accessible to anyone who desires to utilize it.

Be forewarned: If you think a Hadoop cluster is a kind of Easter candy, this isn’t the weekend hacking project for you. (Here’s an earlier piece from MIT Technology Review.)
