How The New York Times assesses, tests, and prepares for the (un)expected news event

Rather than hastily address issues in the months leading up to big events where we expected lots of reader traffic, we decided to take stock of our systems as a whole and enact longer term resilience measures.

By Alexandra Shaheen, Megan Araula, and Shawn Bower July 15, 2021, 10:40 a.m.

The New York Times experiences traffic levels that ebb and flow with the news cycle. Planned events that occur at a fixed time and date, such as an election or the Olympics, are known traffic generators. We expect our users to visit The Times’s website and apps for special coverage during those events. However, unplanned events are also a major part of the news business. When a story breaks, a push notification is sent and users arrive at our platforms in droves.

Our technology must be able to handle both types of events. Times Engineering ensures that our journalists can get the news out when it happens and that our readers can access information when they need it most.

Over the past few years, our engineering team has put together what we call “election readiness efforts” to prepare our systems to withstand both planned and unplanned news events. These efforts have coincided with the United States election cycle because elections are important to our readers and often generate record-breaking traffic. Elections are an opportunity for us to merge our talents in journalism and engineering to provide readers with extensive coverage of the events and a user experience that helps them understand these moments in history.

To prevent our systems — which are a mix of legacy and modern software — from breaking during these important news events, we have spun up election readiness efforts to systematically improve our systems.

A brief history of our election readiness efforts

The election readiness effort in 2016 was small and centered around the implementation of a Content Delivery Network (or, CDN) that could provide us with protection during traffic surges. During the presidential election that year, we saw no outages and the CDN became a pivotal disaster recovery tool.

In 2017, we kicked off a large project to migrate our data from distributed data centers to the cloud. With a tight deadline, we pushed applications to the cloud using both microservices and the “lift and shift” technique. At the time, Google Kubernetes Engine (or, GKE) was the most mature managed environment for running containers, so we used that for our microservices. For legacy systems, we moved to Amazon Web Services (AWS) using EC2. These methods helped us quickly move our stack to the cloud and we were able to shut down our data center on April 30, 2018.

The downside to this approach was that we ended up with a bifurcated system spanning both clouds. Since this was the early days in Google, we had to make compromises by putting a number of endpoints on the internet with limited ability to extend our corporate networks to the Google backplane. This also meant the only way for our applications hosted in Google to talk to our applications hosted in AWS was over the internet. We spent significant amounts of time finding and implementing authentication solutions that fixed this issue.

This was a large migration that fundamentally changed our applications. This meant all the data we collected about how our systems ran during the 2016 elections was no longer relevant. There were also new business-critical systems built after 2016 that had not been assessed for reliability. We did not know where we were vulnerable, but would soon find out the hard way.

On September 5, 2018, the Times Opinion section published a guest essay by a then-anonymous author from within the Trump administration. The resulting traffic surge caused numerous issues with our website and apps, and showed us how much work we needed to do in the two short months before the 2018 midterm elections. While we saw some challenges on election night that year, our two months of work helped stave off major outages.

2018 was a turning point for us and our site reliability strategy. Rather than hastily address issues in the months leading up to big events where we expected lots of reader traffic, we decided to take stock of our systems as a whole and enact longer term resilience measures. In the fall of 2019, we kicked off the readiness effort for the 2020 presidential election.

Assessment phase

Many of the engineering teams at The Times are small and operate independently of each other. They don’t share programming languages, software development life cycles, project management methodologies, or deployment strategies. The teams monitor their own systems and performance, which was a strategic decision made when we migrated to the cloud. While this is great for agility and feature releases, it complicates the overall resilience of our systems.

Leading up to the 2020 election cycle, no one person or team fully understood our entire federated architecture. We needed a strategy, and fast. Without one, there was little chance we would be able to meet what we expected to be a historic news moment.

Step 0: Team formation

We first had to form a team that could assess the state of our architecture. The Times technology landscape is vast and tough to parse at a holistic level. We have our main website and apps that interact with numerous APIs; a CMS that creates and delivers data to our printing facilities and front-end applications; standalone products such as NYT Cooking and Games; our user and subscription platforms; data and analytics platforms; as well as the infrastructure (like CDN, Cloud and DNS) to deliver our content to readers.

In order to gather information about all of these systems as quickly as possible, we made sure the team was composed of engineers with different expertise from all over our Engineering group.

Step 1: Scope

Identifying the scope of this work was a fundamental part of the process. We knew we couldn’t address every resilience gap, so we needed to build consensus among the team and our stakeholders on which workflows and systems were most critical for the successful performance of our platforms for the 2020 presidential election.

We identified and ranked workflows — which might be the process by which we publish the homepage or the ability to take payment for a subscription. We then mapped which systems were critical to these workflows and created a tiered system.

Our most important workflows were assigned “Tier 0” and qualified as “mission critical.” Most of our Tier 0 workflows centered around publishing because if any of them failed during the election, The Times would not be able to get the news out, which would severely impact our report and business. There have been five moments in our history where we’ve failed to print the daily report in New York, the most recent during a 1978 labor strike.

Our “Tier 1” workflows qualified as critical and related to subscriptions, push notifications, and marketing. The “Tier 2” workflows were designated as important and included features such as commenting, targeted advertising and data capturing.

This tiered schema helped us define the scope of this work so we could strategically focus on improving our systems’ resilience.
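The tiering described above can be sketched as a small data structure. This is a minimal illustration in Python, not The Times’s actual tooling: the workflow names, the system names, and the rule that a system inherits the most critical tier of any workflow that depends on it are all assumptions made for the example.

```python
from enum import IntEnum


class Tier(IntEnum):
    """Lower value means more critical, mirroring Tier 0, Tier 1 and Tier 2."""
    MISSION_CRITICAL = 0
    CRITICAL = 1
    IMPORTANT = 2


# Hypothetical workflows and the systems they depend on (illustrative names only).
WORKFLOWS = {
    "publish_homepage": (Tier.MISSION_CRITICAL, ["cms", "cdn", "dns"]),
    "take_subscription_payment": (Tier.CRITICAL, ["subscription_platform", "cdn"]),
    "reader_comments": (Tier.IMPORTANT, ["comments_api", "cdn"]),
}


def system_tiers(workflows):
    """Assumed rule: a system gets the most critical tier among the workflows that use it."""
    tiers = {}
    for tier, systems in workflows.values():
        for system in systems:
            tiers[system] = min(tier, tiers.get(system, tier))
    return tiers


if __name__ == "__main__":
    ranked = sorted(system_tiers(WORKFLOWS).items(), key=lambda item: item[1])
    for system, tier in ranked:
        print(f"Tier {int(tier)} ({tier.name}): {system}")
```

In this sketch a shared dependency such as the CDN ends up in Tier 0 because the homepage publishing workflow relies on it, even though less critical workflows use it too.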

Step 2: Assess and test

Once we had a team and a scope, we were ready to assess our systems. We used architecture readiness reviews and operational maturity assessments to gauge the status of each system and measured them against formalized standards we created for each tier. We aggregated the scores from both assessments, which helped inform us and our stakeholders where investment and prioritization were required.
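One way to picture the score aggregation is the sketch below. It is only an illustration under stated assumptions: the 0 to 100 scale, the equal weighting of the two assessments, the per-tier thresholds in `TIER_STANDARDS`, and the sample systems are all hypothetical and do not come from the assessments themselves.

```python
from dataclasses import dataclass

# Hypothetical minimum combined score required for each tier (assumed values).
TIER_STANDARDS = {0: 90, 1: 75, 2: 60}


@dataclass
class Assessment:
    system: str
    tier: int
    architecture_score: float  # architecture readiness review, assumed 0-100 scale
    operations_score: float    # operational maturity assessment, assumed 0-100 scale

    @property
    def combined(self) -> float:
        # Assumption: both assessments are weighted equally.
        return (self.architecture_score + self.operations_score) / 2

    @property
    def gap(self) -> float:
        # Points below the tier's standard; 0 when the standard is met.
        return max(0.0, TIER_STANDARDS[self.tier] - self.combined)


def prioritize(assessments):
    """Return systems below their tier's standard, most critical tier first, then largest gap."""
    failing = [a for a in assessments if a.gap > 0]
    return sorted(failing, key=lambda a: (a.tier, -a.gap))


# Illustrative data only.
ASSESSMENTS = [
    Assessment("cms", 0, 80, 70),
    Assessment("cdn", 0, 95, 93),
    Assessment("comments", 2, 50, 40),
    Assessment("subscriptions", 1, 70, 60),
]

if __name__ == "__main__":
    for a in prioritize(ASSESSMENTS):
        print(f"Tier {a.tier} {a.system}: combined {a.combined:.0f}, gap {a.gap:.0f}")
```

Sorting by tier before gap reflects the idea that a shortfall in a mission-critical system outranks a larger shortfall in a less critical one. A real prioritization would likely weigh more factors than this.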

It can be difficult to prioritize resilience work and technical debt on feature teams’ roadmaps. A product manager often plans a few quarters ahead with work that includes new features and improvements that address user needs, and it can be hard to fit resilience work into these plans. Splitting developer time between infrastructure improvements and product requirements is also a challenge, particularly on smaller teams or newer teams with significant greenfield development tasks ahead of them.

As we assessed the teams, some did not have enough resources to split up the work, while others had to sacrifice new feature development in favor of the fortification of their systems.

As the first two election events of the year — the Iowa Caucus and Super Tuesday — rolled around, we gathered in an office war room at The Times’s headquarters in Midtown Manhattan. The stack held up. In between bites of food and sips of coffee, we talked about the news of a virus spreading around the world.

By mid-March, we had begun working remotely because of the coronavirus. We found ourselves in the midst of a news moment with many competing headlines and daily elevated traffic. When we began our election readiness work for 2020, we knew the election would likely be unprecedented, but by April we were planning for the unknown.

Stress tests are one of our primary tools for preparing our systems for big news events, and we have conducted them for many past election cycles. However, we quickly learned that remotely coordinating and conducting production stress tests for over 20 systems was challenging.

Over the course of the election cycle, we ran seven load tests on production, simultaneously hitting dependent systems to see how much load they could take before breaking and what the downstream impact would be. Because we couldn’t sit in a room together, we set up video calls and Slack channels so team leads could observe how and where systems degraded.

The election leads floated from hangout to hangout, watching for degradation and pitching in as needed when issues came up with the load testing software. We iterated on the process after every stress test, improving test operations and communications as we could.
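The core idea of a stress test, raising load until a system starts to degrade, can be sketched in a few lines. The example below is a simulation under stated assumptions and not the load testing software used in these tests: `simulated_error_rate` is a hypothetical stand-in for a real system, and the step size, ceiling and 1% error threshold are arbitrary choices. A real production test would send actual traffic and watch real monitoring.

```python
def simulated_error_rate(requests_per_second: int, capacity: int) -> float:
    """Hypothetical stand-in for a real system: errors appear once load exceeds capacity."""
    if requests_per_second <= capacity:
        return 0.0
    return (requests_per_second - capacity) / requests_per_second


def find_breaking_point(probe, start_rps=100, step_rps=100, max_rps=5000, max_error_rate=0.01):
    """Raise load step by step and return the first rate whose error rate exceeds the threshold.

    Returns None if the system stays under the threshold up to max_rps.
    """
    rps = start_rps
    while rps <= max_rps:
        if probe(rps) > max_error_rate:
            return rps
        rps += step_rps
    return None


if __name__ == "__main__":
    # Assumed capacity of 1,200 requests per second for the simulated system.
    breaking_point = find_breaking_point(lambda rps: simulated_error_rate(rps, capacity=1200))
    print(f"Simulated system started degrading at {breaking_point} requests per second")
```

Passing the probe in as a function keeps the ramp logic separate from whatever generates and measures the load, so the same loop could in principle drive several dependent systems at once.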

By September 2020, we were regularly stress testing our website. More teams were able to handle record breaking traffic and engineers were more comfortable with the process. As November grew closer, we were becoming confident. There were only a handful of systems that needed work; they were identified, and we had a plan to move forward.

The publication of The Times’s investigation into former President Donald J. Trump’s tax information on September 27, 2020, provided a true stress test of our systems, as readers came to our platforms in high numbers. It was a glimpse of what the election might look like. Most systems were able to handle the traffic, but there were gaps in systems that were not easily stress tested. We knew that degradation on election night would be catastrophic. We had more work to do before November.

Alexandra Shaheen is a program lead for The New York Times’s Delivery Engineering mission, which is responsible for foundational infrastructure and developer tooling. Megan Araula is a staff software engineer working on edge infrastructure with the Delivery Engineering team. Shawn Bower is the director of information security with The Times’s InfoSec team. This article originally appeared on NYT Open and is © 2021 The New York Times Company.

Illustration by Gizem Vural.
