Nieman Foundation at Harvard
Oct. 20, 2014, 4:47 p.m.

Light everywhere: The California Civic Data Coalition wants to make public datasets easier to crunch

Journalists from rival outlets are pursuing the dream of “pluggable data,” partnering to build open-source tools to analyze California campaign finance and lobbying data.

When Meg Whitman ran for governor of California in 2010, she donated $144 million of her own money to her campaign. Whitman, the Republican nominee, ultimately lost to Democrat Jerry Brown, but her spending ensured that the race was the most expensive non-presidential campaign in American history.

It was obvious Whitman was spending a fortune on the race, but it wasn’t easy to access California’s campaign finance and lobbying activity database, CAL-ACCESS, in order to do a more thorough analysis of her spending, or of the finances of any other California campaign through the years.

The database had a basic search function, but if you wanted the raw data, you’d need to send $5 to the California Secretary of State’s office and wait for a CD containing the data to arrive in the mail.

In August 2013, after a long, drawn-out fight with California journalists, civic data advocates, and open government activists, the Secretary of State agreed to put all the raw data online in a downloadable format. But even then, with 76 different tables and roughly 35 million records, the data was still unwieldy and difficult to use.

So last month reporters and developers from the Los Angeles Times, the Center for Investigative Reporting, and Stanford’s nascent Computational Journalism Lab formed the California Civic Data Coalition and published open source tools to make parsing and analyzing the data easier.

“Our whole goal for this project is to ask very simple questions,” said Aaron Williams, a news apps developer at CIR who was involved with the project. “If you want to know, say, who has spent the most money in a political campaign in California? We all knew that was Meg Whitman, for example, but to actually query that data out and to watch her campaign trail, to answer those kinds of questions, the data didn’t really provide you that information easily. And even with the raw data, there were still a lot of connections you needed to make.”
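To give a rough sense of the kind of query Williams describes, here is a minimal sketch in Python using the standard library's `sqlite3` module. It is not the coalition's code and does not use the real CAL-ACCESS schema: the `filers` and `expenditures` tables, their columns, and the sample rows are all hypothetical, chosen only to show the join-and-sum step that a "who spent the most?" question requires.

```python
import sqlite3


def top_spenders(conn, limit=3):
    """Return (name, total) rows ordered by total spending, largest first."""
    query = """
        SELECT f.name, SUM(e.amount) AS total
        FROM filers AS f
        JOIN expenditures AS e ON e.filer_id = f.filer_id
        GROUP BY f.filer_id, f.name
        ORDER BY total DESC
        LIMIT ?
    """
    return conn.execute(query, (limit,)).fetchall()


def build_demo_db():
    """Create an in-memory database with made-up rows (not real filings)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE filers (filer_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE expenditures (filer_id INTEGER, amount REAL);
    """)
    conn.executemany(
        "INSERT INTO filers VALUES (?, ?)",
        [(1, "Candidate A"), (2, "Candidate B"), (3, "Candidate C")],
    )
    conn.executemany(
        "INSERT INTO expenditures VALUES (?, ?)",
        [(1, 500.0), (1, 250.0), (2, 900.0), (3, 100.0), (2, 50.0)],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    for name, total in top_spenders(build_demo_db()):
        print(f"{name}: {total:,.2f}")
```

Even in this toy version, the answer depends on linking two tables by a shared identifier. With 76 tables in the real database, working out which links to make is presumably where much of the effort goes.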

Though the Times and CIR are competitors, they decided to work together so they could both spend more time focusing on actually reporting on the contents of the data as opposed to making it usable.

“We want to compete on who can do the better deep dive, who can ask the smarter question, who can be more aggressive about getting the story,” said Ben Welsh, a database producer at the Times. “We don’t want to compete on who can unzip and link together 76 crappy database tables.”

Welsh and Agustin Armendariz, a former CIR staffer who is now at The New York Times, began discussing and working on a collaboration last year. And in August, through a grant from Knight Foundation and Mozilla, the coalition gathered in San Francisco for two days to actually build the tools.

The byproduct of the coalition’s work is two Django apps released a few weeks ago: one to access and download campaign finance and lobbying data from the state’s database and another, called the campaign browser, to “clean, regroup, filter and transform the massive, hairy state [campaign finance] database into something more legible.”

“We’re talking about making power tools for power users, for the small amount of people who really want to go after this data and look at it in more aggressive and different ways than the state’s website allows,” Welsh said. “That’s really our goal.”

“At the end of the day, we don’t want to make yet another campaign finance website,” he continued. “We want to make a set of tools, or power tools, to let analysts who are really interested in going at the data statistically, who are trying to figure stuff out in more complex ways — we just want to make that as easy as possible, so we can start diving in and doing it.”

This is a philosophy the coalition is calling “pluggable data”: an effort to improve and streamline how data is prepared by focusing on ways to make cleaning and extracting large data sets replicable.

The coalition compared the idea to packaged software, which developers already download regularly. As they write on the coalition site:

If a series of simple installation commands can provide a Django application with everything necessary to build a social networking site, why can’t it also provide U.S. Census statistics, the massive federal database that tracks our country’s chemical polluters or something as simple as a list of every U.S. county?

In our conversation, Welsh highlighted the work of Josh Tauberer, The New York Times’ Derek Willis, and Eric Mill, formerly of the Sunlight Foundation, in establishing the @UnitedStates Project, a federal-level open data collaboration, as an example of a group that has inspired the California Civic Data Coalition’s emphasis on collaboration and pluggable data.

It’s a mindset the coalition is emphasizing as it continues to improve and build out its current apps. They’re also building a browser for CAL-ACCESS lobbying activity data. With the project open-sourced and on GitHub, a grad student at Berkeley has already started contributing code, and the coalition is hopeful students at the Stanford Computational Journalism Lab will also contribute once it launches this winter.

For now, the coalition is focusing only on its existing data sets, though Williams and Welsh didn’t rule out expanding its scope in the future. The immediate priority, they said, is to make the CAL-ACCESS data more accessible.

Williams, for example, is working to make the data exportable into flat CSV tables so it can be imported into programs like Microsoft Excel or Microsoft Access. The goal, Williams and Welsh said, is for people who aren’t programmers but are interested in data journalism to be able to use the data meaningfully to find stories.
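As an illustration of what a flat export involves, the following Python sketch joins two linked tables and writes one denormalized row per expenditure to CSV. It uses only the standard library, and it is an assumption-laden example rather than the coalition's implementation: the table names, the column names, and the sample rows are hypothetical.

```python
import csv
import io
import sqlite3


def export_flat_csv(conn, out):
    """Join the linked tables and write one flat row per expenditure.

    Returns the number of data rows written (header excluded).
    """
    cursor = conn.execute("""
        SELECT f.name AS filer_name, e.payee AS payee, e.amount AS amount
        FROM expenditures AS e
        JOIN filers AS f ON f.filer_id = e.filer_id
        ORDER BY e.rowid
    """)
    writer = csv.writer(out, lineterminator="\n")
    # Header row comes from the column aliases in the query.
    writer.writerow([col[0] for col in cursor.description])
    count = 0
    for row in cursor:
        writer.writerow(row)
        count += 1
    return count


def build_sample_db():
    """Create an in-memory database with made-up rows (not real filings)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE filers (filer_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE expenditures (filer_id INTEGER, payee TEXT, amount REAL);
    """)
    conn.executemany(
        "INSERT INTO filers VALUES (?, ?)",
        [(1, "Committee A"), (2, "Committee B")],
    )
    conn.executemany(
        "INSERT INTO expenditures VALUES (?, ?, ?)",
        [(1, "Print shop", 120.5), (2, "TV station", 300.0)],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    buffer = io.StringIO()
    export_flat_csv(build_sample_db(), buffer)
    print(buffer.getvalue(), end="")
```

The point of flattening is that each row carries the filer's name alongside the payment, so someone opening the file in a spreadsheet never has to look up an identifier in a second table.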

“A dream outcome would be some graphic artist is able, in one day, to download that file, answer the question they want, and make a great graphic,” Welsh said.

Photo of California’s Kings River Canyon by Ansel Adams via the U.S. National Archives.
