Nieman Foundation at Harvard
Feb. 2, 2022, 9:16 a.m.

How UC Berkeley computer science students helped build a database of police misconduct in California

When newsrooms, especially local ones, are strapped for engineering resources, the Berkeley students fill in a gap to help journalists complete more ambitious data projects.

In 2018, California passed the “Right to Know Act,” unsealing three types of internal law enforcement documents: use of force records, sexual assault records, and official dishonesty records.

Before the act, Senate Bill 1421, passed, California had some of the strictest laws in the United States shielding police officers’ privacy, according to Capital Public Radio, and police misconduct records were deemed “off-limits.”

Six news outlets — Bay Area News Group, Capital Public Radio, the Investigative Reporting Program at the University of California, Berkeley, KPCC/LAist, KQED, and the Los Angeles Times — got together to request those documents, forming the California Reporting Project. Now, 40 news outlets are part of the initiative.

They sent public records requests to more than 700 agencies across the state, from police departments and sheriffs’ offices to prisons, schools, and welfare agencies that have police presence on site. If you’ve ever submitted a records request to a government agency, you know it’s not easy or straightforward to extract information from documents, if you can even get them at all.

But to sort through the more than 100,000 records they’ve gotten back since 2018, Lisa Pickoff-White, KQED’s only data reporter and the data lead on the California Reporting Project, enlisted data science students from UC Berkeley to help organize the data.

The Data Science Discovery Program was founded in 2015 and is part of Berkeley’s Division of Computing, Data Science, and Society. Every semester, the program pairs around 200 students with companies and organizations that have data science–related projects they need help completing. Students spend six to 12 hours a week working on their assignments, for which they receive course credit.

The students have worked with media companies on editorial and operational projects, including the San Francisco Chronicle’s air quality map and the Wall Street Journal’s effort to analyze its source and topic diversity using natural language processing. When newsrooms, especially local ones, are strapped for engineering resources, the Berkeley students fill a gap to help journalists complete more ambitious projects.

“It’s a really natural fit. [We want] students to get a deep understanding of the context of the data analysis that they’re doing, and to consider human context and the implications of the insights and conclusions they’re making,” Data Science Discovery program manager Arlo Malmberg said. “All the things we emphasize in the data science program are at the core of what journalists do as well, in bringing forward the context of a problem in a story for readers, and in providing analysis of the causes of those issues.”

Pickoff-White co-selected four students to work with the California Reporting Project to build a police misconduct database from the records received. They all had particular interests in policing because of various connections in their personal lives. Usually in their data science courses, she said, they work individually on assignments and applications, but they were excited to work as a team on something tangible.

“The purpose of the project really resonated with me,” Pruthvi Innamuri, a sophomore computer science major who worked on the project, said. “During 2020, with a lot of police misconduct happening, I noticed a lot of communities feeling severely hurt and oppressed. I wanted to be able to use my computer science background to work on a project that’s able to better inform people in some way regarding this issue.”

Innamuri and his classmates built programs to recognize basic information from the police records, like names, locations, and case numbers. That made it easier to group files together and organize data for the journalists to analyze.
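The article does not describe the students’ actual code, but the general idea of pulling identifiers out of text and grouping files by them can be sketched in a few lines. The Python example below is a rough illustration only: the case-number pattern, the function names, and the sample file contents are all invented here, and real agency records use many different formats (and often need OCR first).

```python
import re
from collections import defaultdict

# Hypothetical case-number format such as "IA-2019-0042".
# This pattern is an assumption for illustration; real agencies vary widely.
CASE_NUMBER = re.compile(r"\b[A-Z]{2,4}-\d{4}-\d{3,6}\b")


def extract_case_numbers(text):
    """Return the sorted, de-duplicated case-number-like strings in text."""
    return sorted(set(CASE_NUMBER.findall(text)))


def group_by_case(documents):
    """Map each case number to the sorted list of file names mentioning it.

    `documents` is a dict of file name -> plain text of that file.
    """
    groups = defaultdict(list)
    for name, text in documents.items():
        for case in extract_case_numbers(text):
            groups[case].append(name)
    return {case: sorted(names) for case, names in groups.items()}


if __name__ == "__main__":
    # Invented sample text, not taken from any real record.
    sample = {
        "a.txt": "Report for case IA-2019-0042.",
        "b.txt": "Follow-up memo, IA-2019-0042 and UOF-2020-117.",
        "c.txt": "No identifiers here.",
    }
    for case, files in group_by_case(sample).items():
        print(case, files)
```

Grouping on a shared identifier is what lets scattered files from one incident be read together; names and locations are harder to match with a simple pattern and would likely need more than a regular expression.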

Stories that have come out of the records include a Mercury News story about how Richmond has more police dog bites than other cities and another on how Bakersfield police officers broke 45 bones in 31 people over four years. The database isn’t complete yet, and the students’ work will make future data collection easier.

“I don’t know if we’d be able to do this without them,” Pickoff-White said. “None of these newsrooms would be able to automate this work on their own.”

Photo by Lagos Techie.

Hanaa' Tameez is a staff writer at Nieman Lab. You can reach her via email or Twitter DM (@HanaaTameez).