Nieman Foundation at Harvard
Oct. 20, 2014, 4:47 p.m.

Light everywhere: The California Civic Data Coalition wants to make public datasets easier to crunch

Journalists from rival outlets are pursuing the dream of “pluggable data,” partnering to build open-source tools to analyze California campaign finance and lobbying data.

When Meg Whitman ran for governor of California in 2010, she donated $144 million of her own money to her campaign. Whitman, the Republican nominee, ultimately lost to Democrat Jerry Brown, but her spending ensured that the race was the most expensive non-presidential campaign in American history.

It was obvious Whitman was spending a fortune on the race, but it wasn’t easy to access California’s campaign finance and lobbying activity database, CAL-ACCESS, in order to do a more thorough analysis of her spending, or of the finances of any other California campaign through the years.

The database had a basic search function, but if you wanted to access the raw data, you’d need to send $5 to the state’s Secretary of State office and wait to receive a CD with the data on it back in the mail.

In August 2013, after a long, drawn-out fight with California journalists and civic data and open government activists, the Secretary of State agreed to put all the raw data online in a downloadable format. But even then, with 76 different tables and roughly 35 million records, the data was still unwieldy and difficult to use.

So last month reporters and developers from the Los Angeles Times, the Center for Investigative Reporting, and Stanford’s nascent Computational Journalism Lab formed the California Civic Data Coalition and published open source tools to make parsing and analyzing the data easier.

“Our whole goal for this project is to ask very simple questions,” said Aaron Williams, a news apps developer at CIR who was involved with the project. “If you want to know, say, who has spent the most money in a political campaign in California? We all knew that was Meg Whitman, for example, but to actually query that data out and to watch her campaign trail, to answer those kinds of questions, the data didn’t really provide you that information easily. And even with the raw data, there were still a lot of connections you needed to make.”
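The kind of question Williams describes comes down to linking tables and then aggregating. The sketch below is illustrative only: the `filers` and `expenditures` tables, their columns, and the placeholder rows are hypothetical and do not reflect the actual CAL-ACCESS schema or the coalition’s code.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with two hypothetical, tiny tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE filers (filer_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE expenditures (filer_id INTEGER, amount INTEGER);
        """
    )
    # Placeholder rows, not real campaign data.
    conn.executemany(
        "INSERT INTO filers VALUES (?, ?)",
        [(1, "Candidate A"), (2, "Candidate B")],
    )
    conn.executemany(
        "INSERT INTO expenditures VALUES (?, ?)",
        [(1, 100), (1, 250), (2, 300)],
    )
    return conn


def top_spender(conn):
    """Return (name, total) for the filer with the largest summed spending."""
    return conn.execute(
        """
        SELECT f.name, SUM(e.amount) AS total
        FROM expenditures AS e
        JOIN filers AS f ON f.filer_id = e.filer_id
        GROUP BY f.filer_id, f.name
        ORDER BY total DESC
        LIMIT 1
        """
    ).fetchone()


print(top_spender(build_demo_db()))
```

With only two tables the join is trivial. The difficulty the coalition describes comes from having to work out the equivalent connections across 76 tables.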

Though the Times and CIR are competitors, they decided to work together so they could both spend more time focusing on actually reporting on the contents of the data as opposed to making it usable.

“We want to compete on who can do the better deep dive, who can ask the smarter question, who can be more aggressive about getting the story,” said Ben Welsh, a database producer at the Times. “We don’t want to compete on who can unzip and link together 76 crappy database tables.”

Welsh and Agustin Armendariz, a former CIR staffer who is now at The New York Times, began discussing and working on a collaboration last year. And in August, through a grant from Knight Foundation and Mozilla, the coalition gathered in San Francisco for two days to actually build the tools.

The byproduct of the coalition’s work is two Django apps released a few weeks ago: one to access and download campaign finance and lobbying data from the state’s database and another, called the campaign browser, to “clean, regroup, filter and transform the massive, hairy state [campaign finance] database into something more legible.”

“We’re talking about making power tools for power users, for the small amount of people who really want to go after this data and look at it in more aggressive and different ways than the state’s website allows,” Welsh said. “That’s really our goal.”

“At the end of the day, we don’t want to make yet another campaign finance website,” he continued. “We want to make a set of tools, or power tools, to let analysts who are really interested in going at the data statistically, who are trying to figure stuff out in more complex ways — we just want to make that as easy as possible, so we can start diving in and doing it.”

This is a philosophy the coalition is calling “pluggable data,” an effort to improve and streamline how data is prepared by focusing on ways to make cleaning and extracting large data sets replicable.

The coalition compared the idea to packaged software, which developers already download regularly. As they write on the coalition site:

If a series of simple installation commands can provide a Django application with everything necessary to build a social networking site, why can’t it also provide U.S. Census statistics, the massive federal database that tracks our country’s chemical polluters or something as simple as a list of every U.S. county?

In our conversation, Welsh highlighted the work of GovTrack.us’s Josh Tauberer, The New York Times’ Derek Willis, and Eric Mill, formerly of the Sunlight Foundation, in establishing the @UnitedStates Project, a collaboration in open data at the federal level, as an example of a group that’s inspired the California Civic Data Coalition in emphasizing collaboration and pluggable data.

It’s a mindset the coalition is emphasizing as it continues to improve and build out its current apps. They’re also building a browser for CAL-ACCESS lobbying activity data. With the project open-sourced and on GitHub, a grad student at Berkeley has already started contributing code, and the coalition is hopeful students at the Stanford Computational Journalism Lab will also contribute once it launches this winter.

For now, the coalition is only focusing on its current data sets, though Williams and Welsh didn’t rule out expanding its focus in the future. The current emphasis, they said, is to make their current CAL-ACCESS data sets more accessible.

Williams, for example, is working to make the data exportable into flat CSV tables so it can be imported into programs like Microsoft Excel or Microsoft Access. The goal, Williams and Welsh said, was for individuals who aren’t programmers but are interested in data journalism to be able to meaningfully use the data to find stories.
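As a rough sketch of what a flat export could look like, the snippet below joins two small in-memory collections into one denormalized CSV that a spreadsheet program can open. The function name `flatten_to_csv`, the field names, and the sample records are assumptions made for illustration; they are not taken from the coalition’s tools.

```python
import csv
import io


def flatten_to_csv(filers, contributions, out):
    """Write one row per contribution with the filer name copied into it.

    filers: dict mapping a hypothetical filer_id to a display name.
    contributions: list of dicts with filer_id, contributor, and amount.
    out: any writable text file object.
    """
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["filer_name", "contributor", "amount"])
    for record in contributions:
        writer.writerow(
            [filers[record["filer_id"]], record["contributor"], record["amount"]]
        )


# Placeholder records, not real campaign data.
demo_filers = {1: "Committee A"}
demo_contributions = [{"filer_id": 1, "contributor": "Donor X", "amount": 50}]

buffer = io.StringIO()
flatten_to_csv(demo_filers, demo_contributions, buffer)
print(buffer.getvalue())
```

Repeating the filer name on every row trades some redundancy for a file that needs no joins, which suits readers who work in a spreadsheet rather than a database.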

“A dream outcome would be some graphic artist is able, in one day, to download that file, answer the question they want, and make a great graphic,” Welsh said.

Photo of California’s Kings River Canyon by Ansel Adams via the U.S. National Archives.
