Analysis without benchmarks: An approach for measuring the success of innovation projects

Innovation on mobile is great, but we need new and better ways to judge whether what we’re trying is working.

By Sarah Schmalbach April 3, 2017, 12:27 p.m.

Newsroom innovation initiatives like our mobile lab in the Guardian U.S. are springing up everywhere. Projects are being funded by philanthropies and tech companies through smaller programs in New Jersey and larger ones at the BBC, and there are also national newsroom transformation projects underway like the Poynter Local News Innovation Program. Google, the Knight Foundation and ONA also recently teamed up to issue newsroom innovation challenge grants (submissions due April 10), and oh, Facebook plans to lend their engineers to newsrooms to build new products together.

However, in order for publishers to realize the potential of all of this innovation work, we need to transform the way we measure its success.

We need to quickly redefine the signals we use to tell us what’s working, and find ways to measure success without any pre-existing benchmarks.

Metrics developed to measure desktop news sites — like pageviews and time spent — aren’t useful when your innovation project isn’t a website and someone spending more time doesn’t equate to a better experience.

Since being able to measure the success of our projects is essential to making any meaningful progress, we set about building out our analytics structure from the ground up, with the help of the analytics and data science teams at MaassMedia.

What did we find?

Over time in the lab, we’ve found that the best way to truly gauge success has been to put the user back at the center of it, since it’s their new mobile habits and preferences that matter much more than our old and ingrained habits for tracking.

As a result, we’ve started looking at a new set of qualitative and quantitative metrics that help us gauge whether or not an experiment was valuable enough to be worth offering again, or offering to a wider audience.

And it’s a work in progress. Each day we’re inventing brand new benchmarks for success and building upon them. Our efforts also help us avoid the all-too-common practice of making unscientific comparisons between innovation work and existing products, and they give us something that looking at just pageviews and click-throughs can’t: multi-dimensional signals about how satisfied our audience was with an experiment.

The metrics we’ve chosen to look at tie directly into where we see potential for news organizations to build new formats that better serve and more deeply engage mobile news readers. This set of metrics also helps us gain a sense of user interest in a feature, which, if high, can lead us to see potential for other news organizations, and conversely, to see where we might want to develop or hone features in the future. We’ve outlined our approach below.

The first new metric: Net interaction rate

The net interaction rate, a quantitative measurement recommended by MaassMedia, signals whether or not a project was an overall success. Roughly, it offsets the user interactions we deem positive against those we see as negative, then divides the result by the total number of things shown (in this case, notifications).

Here’s what it looks like:

Net interaction rate = (positive interactions − negative interactions) / total notifications shown

Looking at the net interaction rate for an experiment helps us understand how positive or negative the experience was overall for our audience, and it’s also simple to calculate once the right tracking has been implemented.
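
As a minimal sketch of the calculation described above, assuming the positive and negative interactions have already been counted, the metric reduces to a single function. The function name, parameter names and sample numbers here are hypothetical and are not taken from the lab’s tracking setup.

```python
def net_interaction_rate(positive: int, negative: int, shown: int) -> float:
    """Return (positive - negative) / shown.

    positive, negative: counts of interactions the team has agreed to treat
    as positive or negative. shown: total number of things shown (for
    example, notifications sent).
    """
    if shown <= 0:
        raise ValueError("shown must be greater than zero")
    return (positive - negative) / shown


# Made-up counts, purely for illustration.
rate = net_interaction_rate(positive=120, negative=45, shown=1000)
print(f"Net interaction rate: {rate:.3f}")
```

A rate above zero means positive interactions outweighed negative ones for the items shown; a rate below zero means the reverse.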

There are a few additional reasons that using this metric might uniquely benefit innovation teams.

By their nature, innovation teams build things that haven’t been built before, and because we’re breaking new ground rather than treading well-worn paths, it’s important that we re-think what makes for a positive user experience. Applying the net interaction rate requires teams to talk upfront about the user experience, so positive and negative interaction categories can be established. Therefore using this metric builds deep thinking about the experience into the process, and promotes good and open team communication.
Applying this metric also helps avoid a scenario in which one team or one team member is overly responsible for the assessment of what makes an innovation project a success. In order to apply this metric, a multidisciplinary team (which may span editorial, product, analytics, development, design, etc.) needs to have a shared vision of which interactions are positive and which are negative before launching. Is someone closing a notification bad? Is someone sharing their quiz results good? Establishing a common definition of ‘good’ ensures that everyone can consistently interpret the data after an experiment.
This metric also applies well to innovation work because the calculation itself creates space for users to have negative interactions during an experiment. These are interactions that wouldn’t occur if users were just opening another app or browser tab, which are now routine tasks for a majority of the digital audience. The negative engagements that are bound to happen when trying something new can’t overshadow the positive engagements associated with offering something new and better. This leaves you free to focus on what you really want to know, which is whether or not this was a great experience for people on the whole and is worth pursuing.

Below is an example, drawn from our experimentation with a Leaderboard alert, an update of medal counts sent in a daily news notification during the Rio Olympics. The three data visualizations below illustrate the total interactions over time with the alert, the positive and negative interactions over time, as well as the net interaction rate with the alert over the two-week span of the games.

An example of a Leaderboard alert.

One of the lessons that came from analyzing the net interaction rate for this experiment was the need to define the ideal function of each notification in order to inform the positive and negative categories, particularly whether the alert was meant to drive deeper engagement with the Guardian site or to provide quick information-based utility for the audience.

To illustrate this in more detail, I’ve invited Lynette Chen, a senior analytics consultant at MaassMedia, to give context to the visualizations, and explain the initial application of the metric.

Lynette writes:

In this visualization of the Leaderboard alert over time, the increase in the number of interactions might lead you to think the experiment was more successful towards the end of the Olympics.

The total number of interactions with the Leaderboard alert over time.

A deeper analysis, in which you break out the engagements into positive and negative groupings, reveals that although the total number of interactions increased over time, this increase was actually driven by a surge in negative engagements. [For the Leaderboard alert experiment at the time of the analysis, a user closing the notification or managing their update settings counted as a negative engagement, whereas tapping on the alert or tapping through to the full leaderboard page counted as positive.]

A breakout of the number of user interactions by positive and negative engagements over time.
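
The grouping described above can be sketched as a lookup table that a team agrees on before launch. The positive and negative assignments follow the Leaderboard alert categories given in the text; the event names, the `tally` helper and the sample events are hypothetical.

```python
from collections import Counter

# Category assignments follow the Leaderboard alert analysis described above.
# The event names themselves are hypothetical labels, not real tracking keys.
CATEGORY = {
    "tap_alert": "positive",
    "tap_through_to_leaderboard": "positive",
    "close_notification": "negative",
    "manage_settings": "negative",
}


def tally(events):
    """Count interactions per category, rejecting uncategorized events."""
    counts = Counter()
    for event in events:
        if event not in CATEGORY:
            raise KeyError(f"No agreed category for event: {event!r}")
        counts[CATEGORY[event]] += 1
    return counts


# Made-up event stream, purely for illustration.
sample = [
    "tap_alert",
    "close_notification",
    "close_notification",
    "tap_through_to_leaderboard",
    "manage_settings",
]
counts = tally(sample)
print(counts["positive"], counts["negative"])
```

Raising an error on an uncategorized event is a deliberate choice in this sketch: it forces the categorization conversation to happen before the data is interpreted rather than after.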

To better demonstrate the difference between the positive and negative engagements, you need to use the net interaction rate. Through graphing the net interaction rate, a clear negative linear relationship becomes apparent. Thus, the net interaction rate decreases over time, suggesting that users were more likely to close the notification after receiving the Leaderboard alert every day for two weeks.

A calculation of the net interaction rate for the Leaderboard alert over time.
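
One way to check for the kind of downward trend described above is to compute the rate per day and fit a least-squares line through it. This is a sketch under stated assumptions: the daily counts are invented for illustration and are not the Leaderboard alert data, and the helper names are hypothetical.

```python
def net_interaction_rate(positive, negative, shown):
    """Return (positive - negative) / shown."""
    return (positive - negative) / shown


def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line through the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


# Invented daily counts: (day, positive, negative, notifications shown).
daily = [
    (1, 300, 100, 1000),
    (2, 280, 140, 1000),
    (3, 260, 180, 1000),
    (4, 250, 230, 1000),
    (5, 240, 280, 1000),
]

days = [day for day, _, _, _ in daily]
rates = [net_interaction_rate(p, n, shown) for _, p, n, shown in daily]
trend = least_squares_slope(days, rates)

# A negative slope means the net interaction rate fell from day to day.
print("declining" if trend < 0 else "not declining")
```

A slope alone does not say why the rate moved, which is where the survey context discussed in this piece comes in.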

This early experiment led to greater discussion of how to interpret expected behaviors in long-running experiments, such as notification closes or changes to settings. For instance, if users closed a notification because it provided all the information they needed, that kind of interaction may not necessarily be counted as negative. Going forward, we believe that survey information about how users interact with notifications might add more context.

Despite the fact that the net interaction rate with the Leaderboard alert declined over time, there are a few reasons we might still run a similar experiment in the future.

Primarily, we might run this type of experiment again to find out through survey responses if the alert provided value even when users closed it, which would impact our categorization of the interactions. We also received user feedback that there was limited incentive to tap through on the alert, since it led to that day’s live blog. The blog was a good source of background information on the day’s events but wasn’t directly linked to the content in the alert, which was a summary of the countries at the top of the medal-count leaderboard. If we had focused on providing more relevant information when users tapped through, they might have had more incentive to do so.

The second new metric: Survey responses from users

The second metric, or set of metrics, we look at is the qualitative feedback from a short survey we send subscribers after they participate in an experiment. These responses add helpful context to our quantitative analysis. We ask subscribers to rate the experiment’s usefulness, their level of interest in it and whether or not they’d sign up for it again.

For example, here are the results from a question we asked about the live-data alert offered in the Guardian apps the night of the U.S. presidential election.

Responses to a user survey about the live-data alert sent the night of the U.S. presidential election.

An overwhelming majority of people said that the content in the expanded version of the alert was useful, giving us one of many very good signals that they were happy with the new format.

The live-data alert sent the night of the U.S. presidential election.

We understand that users who fill in surveys are a self-selecting group and their opinions may diverge from those of the general population that participates in our experiments, so we always consider survey feedback ‘directional’ and combine it with quantitative data about actual engagement in order to gauge the success of an experiment.

Over the past few months of working with MaassMedia, we quickly learned that while running an experiment once tells you something, it doesn’t tell you everything.

As we continue to run experiments of the same kind, we can use each set of results to start creating benchmarks, from which we can draw better conclusions. We’re looking to see if engagement and satisfaction levels remain high if you run the same experiment for, say, a sports audience and a breaking news audience. We’re also interested to see if satisfaction levels stay the same when, for example, we cover politics in a new way, and then also apply the same format to economic news.

As you might expect, the results vary, and it is only after running multiple instances of an experiment across many topics that you can start to spot trends that point towards opportunities for a format’s lasting success.

As more teams like ours take root within journalism, we need to keep experimenting with better ways to interpret where our experimentation is most effective, and what signals will show us the way towards evolving news for new mobile platforms.

We’re eager to hear what methods and metrics you’ve found work for your team. Please add them in the comments below. We’d love to give them a try!

This piece is copublished with the Guardian Mobile Innovation Lab, of which Sarah Schmalbach is senior product manager. Disclosure: Both the Guardian Mobile Innovation Lab and Nieman Lab are funded by the Knight Foundation.

Photo of measuring tape by Sean MacEntee used under a Creative Commons license.
