Nieman Foundation at Harvard
March 15, 2023, 10:23 a.m.

Journalists should be looking for undocumented APIs. Here’s how to start.

At The Markup, we build our own datasets, a lot. It’s one of the core tenets of our newsroom, and how we test our hypothesis-driven journalism.

One way we do that is by finding and using undocumented APIs (application programming interfaces), which are hidden in plain sight. These APIs run behind the scenes on websites and do things so mundane that most people take them for granted. Autocompleting search queries, scrolling infinitely, and filtering a page after you press a button in the user interface are all usually powered by undocumented APIs.

Especially in circumstances when data is not accessible otherwise, finding an undocumented API can be the key to allowing us to do an investigation — by finding public access to the data.

I designed a step-by-step tutorial on how to find and use undocumented APIs based on my experience doing it at The Markup: https://inspectelement.org/apis.html.

Beyond the tutorial, let’s talk about why these undocumented APIs matter and how I’ve used them in my reporting.

Finding Google’s blocklist of YouTube ad placements

Finding an undocumented API in Google Ads allowed Aaron Sankin and me to confirm that Google blocked key racial justice terms such as “Black Lives Matter” but not “White lives matter” and other well-known hate terms. From the webpage, it would be impossible to distinguish between a blocked response and one that was obscure and returned no results. By finding the underlying API that populates the page, we were able to intercept search queries before search results were rendered on the page. This allowed us to find structural differences in the API response based on Google’s verdict on a word or phrase and create a clear categorization system. With the API at our service and a categorization system in place, we were able to test curated lists of keywords that we had sourced from nonprofits and independent civil rights groups.
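
One way to picture the categorization step is as a small function that sorts each raw API response by its structure rather than by what the page displays. The sketch below does not reproduce Google's actual response format: the `results` key, the JSON shapes, and the three category labels are all hypothetical, chosen only to show how a blocked term and an obscure term can look identical on the page yet differ in the underlying JSON.

```python
import json

# Hypothetical category labels; the real categorization system is not
# described in detail here.
BLOCKED = "blocked"
NO_RESULTS = "no_results"
HAS_RESULTS = "has_results"


def categorize(raw_response: str) -> str:
    """Sort one JSON response into a category based on its structure.

    Assumed (hypothetical) shapes:
      - blocked term: the "results" key is missing entirely
      - obscure term: {"results": []}
      - normal term:  {"results": [...]} with at least one item
    """
    payload = json.loads(raw_response)
    if "results" not in payload:
        return BLOCKED
    if not payload["results"]:
        return NO_RESULTS
    return HAS_RESULTS


def categorize_keywords(responses: dict[str, str]) -> dict[str, str]:
    """Map each keyword to the category of its saved raw response."""
    return {keyword: categorize(raw) for keyword, raw in responses.items()}
```

With a function like this, a curated keyword list can be run through the API once, the raw responses saved, and the categories recomputed later if the rules need refining.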

Identifying Amazon private label products

When Adrianne Jeffries and I started investigating whether or not Amazon gives an advantage to its own branded products in search, there was no definitive list saying which products were Amazon brands and exclusives.

At the time, Amazon brands were not clearly labeled on its website, and Amazon had more than 150 trademarked brands. We even ran a national survey of 1,000 adults and found that most Americans were unable to identify Amazon’s top brands.

We began inspecting Amazon’s website to find a reliable method of identifying Amazon’s brands and eventually found an unassuming button on the left-hand side of the search page that filtered the page to “our brands.”

When we listened to the network requests in the browser’s developer tools, we found that the “our brands” button calls an undocumented API that could take any search term and return products that were just Amazon brands and exclusives.
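
Once a request like this shows up in the developer tools, it can usually be rebuilt in code with the search term swapped out. Here is a minimal sketch using only Python's standard library. The endpoint `shop.example.com/api/search` and the parameter names `q`, `filter`, and `page` are placeholders, not Amazon's real API; in practice you would copy the URL and parameters from the request you observed.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint, standing in for whatever URL the browser's
# network tab shows when the filter button is clicked.
BASE_URL = "https://shop.example.com/api/search"


def build_filtered_search(term: str, page: int = 1) -> Request:
    """Build (but do not send) a request like the one the button triggers."""
    # Hypothetical parameter names; copy the real ones from the observed request.
    params = {"q": term, "filter": "our-brands", "page": page}
    return Request(
        f"{BASE_URL}?{urlencode(params)}",
        headers={"Accept": "application/json"},
    )


request = build_filtered_search("paper towels")
print(request.full_url)
```

Passing the returned object to `urllib.request.urlopen` would send it. Keeping the build step separate makes it easy to inspect the URL first and compare it with what the browser sent.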

Aside from using the undocumented API throughout the months-long data collection, we built a web extension that uses the API to re-create our findings for Amazon shoppers across multiple countries and languages.

Collecting more than a million internet plans

Most recently, Aaron and I used a series of APIs found on four internet service providers’ websites to collect more than 1.1 million internet plans across major cities in the United States. An investigation of that scope was possible only because we were able to scale up our data collection, which we originally developed using Selenium, a browser automation tool. Although browser automation is a fundamental tool for data collection, it consumed too many resources and was far too slow: our trial analysis of one city and one provider took weeks to complete. Finding the underlying APIs that powered the multistep lookup process let us streamline the collection to the point where we gathered a 10 percent sample of 21 cities in two days.
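
The speedup comes from two things: calling the API directly instead of driving a browser, and running many lookups at once. The sketch below illustrates that pattern and is not our actual collection code. It assumes the unit being sampled is a street address, and `lookup_plans` is a hypothetical stub where the real HTTP calls would go.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def sample_addresses(addresses: list[str], fraction: float = 0.10,
                     seed: int = 0) -> list[str]:
    """Draw a reproducible random sample (10 percent by default)."""
    k = max(1, round(len(addresses) * fraction))
    return random.Random(seed).sample(addresses, k)


def lookup_plans(address: str) -> dict:
    """Hypothetical stub for the direct API call.

    In real use this would send the requests behind the provider's
    lookup form and return the parsed plans.
    """
    return {"address": address, "plans": []}


def collect(addresses: list[str], workers: int = 8) -> list[dict]:
    """Run lookups concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lookup_plans, addresses))


city = [f"{n} Main St" for n in range(1, 101)]  # placeholder addresses
records = collect(sample_addresses(city))
```

Threads suit this kind of work because each lookup spends most of its time waiting on the network. The worker count should stay modest so the collection does not overload the site being queried.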

A key tool for your next investigation

As journalists, we have a suite of techniques we can use to build our own datasets. This includes filing public records requests (such as FOIA or state-level requests), keeping an eye on government data repositories, crowdsourcing, conducting surveys, web scraping, and even building physical sensors.

One way we frequently get data is through APIs, which make it possible for us to write code to communicate with servers and request records.

Some APIs are official and documented, such as the Census Data API, which we use to retrieve statistical survey data from across the United States. The benefits of documented APIs are that you know what you are going to get, and there are notes and examples to help developers use the tool as intended.
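
For comparison, here is roughly what working with a documented API looks like. This sketch builds a query URL for the Census Data API and turns its response, a JSON array whose first row is the header, into a list of records. The choice of the ACS 5-year dataset and the variable `B01003_001E` (total population) is an assumption made for illustration, and the sample payload contains placeholder values, not real figures. Check the Census documentation for the datasets and variables you need.

```python
import json
from urllib.parse import urlencode


def census_url(year: int, variables: list[str], geography: str) -> str:
    """Build a Census Data API query URL (ACS 5-year dataset assumed)."""
    params = {"get": ",".join(variables), "for": geography}
    # Keep commas, colons, and asterisks readable instead of percent-encoding them.
    query = urlencode(params, safe=",:*")
    return f"https://api.census.gov/data/{year}/acs/acs5?{query}"


def rows_to_records(raw: str) -> list[dict]:
    """Convert the array-of-arrays response into a list of dicts."""
    header, *rows = json.loads(raw)
    return [dict(zip(header, row)) for row in rows]


url = census_url(2021, ["NAME", "B01003_001E"], "state:*")

# Placeholder payload in the API's response shape; the values are made up.
sample = '[["NAME", "B01003_001E", "state"], ["Example State", "12345", "00"]]'
records = rows_to_records(sample)
```

The same two steps, building a URL from parameters and parsing the JSON that comes back, carry over to undocumented APIs. The difference is that there you have to work out the parameters yourself.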

However, as we’ve seen with Twitter and Facebook, official APIs can disappear at a moment’s notice. The environment built around these data sources collapses, and the entity that rescinded the API becomes more opaque.

When it comes to undocumented APIs, there’s even less of a guarantee that they’ll stay around forever, but they do exist for a reason and can help us ask questions and access data that is otherwise inaccessible and unadvertised. We’ll keep finding them and using them for our journalism, and hope you will too.

Again, if you’re interested in learning how to do this yourself, here’s my step-by-step tutorial: https://inspectelement.org/apis.html.
