When Nieman Lab asked me for a prediction about journalism in 2015, I knew instantly what I wanted to write about. I’m definitely not someone you could describe as a futurist, so I picked an issue that had been irritating me lately: the proliferation of small chart- or map-driven “data journalism” pieces produced as P.R. for Internet startups.
Bad Data Journalism is a particular hobby-horse of mine, so I was a bit surprised to see the post receive the amount of attention it did; clearly I’m not the only one bothered by the proliferation of chartjunk online. But I also left readers with a lot of questions, judging from the reactions wherever the post was shared on social media. The original post was one small plate on a large menu of predictions; going long would have missed the point. Here, I want to go a bit deeper on some of the things I had to elide before.
One of the first objections I spotted in places like Hacker News and Twitter is that the P.R. practice of commissioning real or dubious studies is not exactly new. Indeed, 4 out of 5 dentists would agree that some form of this nonsense has been happening for some time.
And yet I feel like this is a distinctly specific form of the genre, reflecting the particular details of modern data. It’s no longer necessary to commission a market research firm to poll the public or look for a medical study that can be twisted to support whatever product you are selling. We’re in the era of Big Data now, where even the smallest companies are collecting vast amounts of data on their users and analyzing it for trends and insight. With this has come a remarkably different approach to how we put questions to data about the world.
In the conventional model of journalism (and science), research starts with a hypothesis. Very often you can’t resolve a question directly, so you figure out a way to answer it accurately through more indirect means. For a traditional narrative journalist, this might mean reconstructing the events of a story by interviewing sources and assessing their veracity. For a data journalist, it means collecting or accessing datasets that might form a decent estimate of what you are trying to explore.
In statistics, this is called a proxy: a variable we can measure, used to answer questions about something we don’t have data for. For instance, economists often use per-capita GDP as a proxy for comparing the quality of life across countries, because quality of life is what they actually want to know but is hard to quantify directly. The similarity of this process to the scientific method is not a coincidence; both fields generally use deductive or inductive reasoning, establishing the veracity of a conclusion by establishing the correctness of its antecedents.
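To make the proxy idea concrete, here is a minimal Python sketch of hypothesis-first analysis. The country figures are invented for illustration; a real analysis would use published GDP and survey data.

```python
# A minimal sketch of hypothesis-first analysis with a proxy variable.
# Hypothesis: richer countries report higher life satisfaction.
# We can't measure "quality of life" directly, so we use per-capita
# GDP as a proxy. All numbers below are hypothetical.
from statistics import correlation  # Python 3.10+

gdp_per_capita = [5, 12, 24, 38, 47, 61]            # thousands of dollars
life_satisfaction = [4.1, 4.8, 5.9, 6.4, 7.0, 7.2]  # survey scores, 0-10

r = correlation(gdp_per_capita, life_satisfaction)
print(f"correlation between proxy and outcome: {r:.2f}")
# A strong correlation supports the hypothesis only insofar as the
# proxy is a faithful stand-in for the thing we actually care about.
```

Note the order of operations: the hypothesis and the choice of proxy come first, and the data is only asked to confirm or refute them.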
The rise of data science has changed the direction of these investigations. In abductive reasoning, the hypothesis is inferred from the data observed. Formerly rare, this reversed reasoning has seen a resurgence in the era of big data. For instance, Target famously figured out how to guess that a customer is pregnant based solely on the seemingly unrelated products she purchases. It is logically tenuous to conclude a customer is pregnant because she bought cocoa-butter lotion, dietary supplements, and a blue rug (an actual example provided by Target), but if that guess is right often enough to give Target a head start in marketing baby products, it’s worth it. Even if it’s a little creepy.
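Target has never published its model, but the abductive pattern is easy to sketch. In this toy Python score, the products and weights are entirely invented; the point is only the shape of the inference.

```python
# A toy version of purchase-based scoring in the spirit of the Target
# story. Products and weights are invented; the real model is unknown.
PREGNANCY_SIGNALS = {
    "cocoa-butter lotion": 2.0,
    "dietary supplements": 1.5,
    "blue rug": 1.0,
}

def pregnancy_score(basket):
    """Sum the (made-up) evidence weights for items in a shopping basket."""
    return sum(PREGNANCY_SIGNALS.get(item, 0.0) for item in basket)

basket = ["cocoa-butter lotion", "dietary supplements", "blue rug", "milk"]
score = pregnancy_score(basket)
print(f"score = {score:.1f}, flag for baby marketing: {score >= 4.0}")
# Abduction in miniature: no single purchase implies pregnancy, but
# together they make it the most convenient explanation, and a guess
# that is right often enough pays for itself.
```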
So it’s no surprise that smaller companies are also assembling their own data science teams for analyzing their data and profiling their users. Sometimes these efforts can go horribly wrong, but the marketing advantages are too great for companies to ignore the trend. And producing these little news factoids gives them a means of testing out their data analysis chops before they make egregious mistakes with customers. OkCupid was an early pioneer of this approach back in 2009, but I do feel it has accelerated in 2014 — though that is just a hunch.
The problem, though, is that these stories are not reported with the same rigor as traditional journalism. They start with the data and look for possible explanations hidden within it, rather than starting with a hypothesis and critically assessing whether the data at hand is a good proxy for it. It’s like trusting everything a source says without checking it out. Sometimes it will be true, but often it’s misleading, sometimes wildly so.
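This failure mode has a name, data dredging, and it is easy to demonstrate. The hypothetical Python sketch below tests a hundred variables of pure random noise against a random outcome; by chance alone, a handful will clear a conventional significance bar, and those are exactly the “findings” that become viral charts.

```python
# Why "start with the data, then hunt for a story" misleads: probe
# enough unrelated variables and some will look significant by chance.
# Everything here is random noise with no real signal.
import math
import random
from statistics import correlation  # Python 3.10+

random.seed(1)
n_users, n_questions = 500, 100
cutoff = 1.96 / math.sqrt(n_users)  # rough |r| threshold for p < 0.05

outcome = [random.random() for _ in range(n_users)]

hits = 0
for _ in range(n_questions):
    feature = [random.random() for _ in range(n_users)]  # more noise
    if abs(correlation(outcome, feature)) > cutoff:
        hits += 1

print(f"{hits} of {n_questions} random 'findings' look significant")
# Expect roughly 5. Publish only the winners as charts, and every one
# of them is a false discovery.
```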
I’ve argued their methodology is terrible and their news stories are wrong. Do I think these companies are evil? No and yes. I’m in no way arguing that these companies are purposely twisting data for malevolent purposes, as Big Tobacco did, for instance. Indeed, lacking any evidence to the contrary, I can only assume their mistakes are well intentioned, even if they have no idea what they’re doing. It’s all just fluff, isn’t it? Maybe I should just relax.
But I do have a big problem with how this content is framed. No matter how well intentioned the inept investigation might be, the data story that gets pushed out is inevitably divisive. Which political party is the most perverted? Which cities are the most stressful to live in? Which countries are the worst at sex? This is no accident; it helps these pieces go viral by catering to people’s inflated sense of their own group or their dismissive attitudes toward others. It’s a particularly xenophobic and crass form of confirmation bias. It would be easy to dismiss all of this as trivial fluff were these stories not so dogged about pitting one group against another. That makes them less amusing than vile.
Remember that article about Target figuring out which customers are pregnant? It’s hard not to see it as an invasion of privacy, even if it’s perfectly legal. Contrast that with this analysis by Jawbone showing how the Napa earthquake affected its users’ sleep. Despite being built on deeply personal information, it doesn’t seem to have raised any ire from readers online. Why?
It’s possible the difference is that wearing a Jawbone is voluntary, but so is shopping at Target. Indeed, I think it’s clear that both companies analyzed personal data that users generally assume is private. It looks like Jawbone managed to sidestep squeamishness by releasing a cool aggregate chart instead of boasting about its ability to track an individual user’s sleep patterns, although that’s likely something its servers can do. All of which suggests a golden opportunity for big retailers who haven’t known how to talk about their use of big data without sounding totally creepy. Now they can: with maps!
It’s pretty easy to imagine some compelling interactives that large retailers might produce, following the lead of this New York Times interactive of Google searches before Christmas. Amazon could report on what people are actually purchasing. Walmart could illustrate regional food purchases before Thanksgiving. Target could show which Halloween costumes are popular around the country. Done correctly, these kinds of pieces would almost certainly go viral. And unlike the data from a small startup, they might actually be a decently accurate reflection of the general population, and thus somewhat true too.
So, what can we do about all this? Nothing. We’re pretty much screwed.
Okay, that sounds a bit bleak. But likely the only thing that will abate this type of content is the limit on how fast companies can produce it. I would love it if the public developed a greater skepticism toward these posts, since online news sources don’t seem to have any, but I don’t see that happening. Instead, it’s more likely that the general popularity of chart posts will fade as their novelty declines, much like BuzzFeed’s quizzes or some listicles. That’s little comfort, but a little hope is better than nothing.
Jacob Harris is a senior software architect at The New York Times.