Nieman Foundation at Harvard
July 31, 2024, 2:58 p.m.
Reporting & Production

To preserve their work — and drafts of history — journalists take archiving into their own hands

From loading up the Wayback Machine to meticulous Airtable bases to 72 hours of scraping, journalists are doing whatever they can to keep their clips when websites go dark.

When news sites shut down, those sites’ owners often don’t prioritize the preservation of the content.

MTV pulled down MTV News in June. After Deadspin was sold, many of its archives temporarily disappeared. This week, Flaming Hydra reported that The Awl’s archives are gone. And those examples are just from the past couple of months; in 2021, the authors of a Reynolds Journalism Institute report found that just 7 out of 24 newsrooms they interviewed were fully preserving their news content.

“It’s really kind of a web of responsibility in terms of creating an accurate record,” Talya Cooper, a research curation librarian at NYU and The Intercept’s former archivist, told me. “When you hear about something being shut down, it’s not just ‘Wow, all of this content is being lost.’ It’s also all of the content that is derived from this content — a key bedrock of evidence that could be used to verify a claim, or bolster someone’s career, or any number of things.”

AI further complicates matters — what happens when sites are used to feed ChatGPT, then go offline? “What happens when that information is baked into large language models and the source of that information is not live on the web anymore?” Cooper wondered. “It’s kind of mind-boggling to think about, but it is reality for a lot of websites that have been crawled and had their content put into the blender of large language models. How will it be possible, in the future, to trace back some of the claims that will be made by ChatGPT if the content is no longer alive?”

When news sites’ archives disappear, readers aren’t the only ones who lose out — there are all kinds of personal and professional challenges for journalists, too. They’re left to archive their work on their own, so that they have clips to show the next job. Web pages, photographs, and text stories are easier to save than audio files, interactives, and other types of digital journalism; to preserve those, journalists often have to get creative. Paid personal archiving services are available, but “it’s not necessarily appealing when you’re just trying to look for a way to save something that was previously online for free,” one journalist told me.

I spoke with three journalists about how they’re going beyond the Wayback Machine to preserve their own work.

“Is there a plan in place to archive work?”

Alex Azzi, a career women’s sports reporter, is a LexisNexis pro and often mines newspaper archives for stories about athletes’ middle and high school sports careers. It can actually be easier to find women’s sports stories from the 1980s and 1990s in newspaper archives, she said, than stories from the early 2000s, when print newspapers declined and bloggers started filling reporting gaps.

Often, those blogs don’t stay online forever. Women’s sports have long been undercovered by traditional news organizations; “fans stepped in to fill that void by creating their own work,” Azzi said. “[But] running a website, just maintaining servers, can be really expensive.”

Once, when Azzi was working on a story related to women’s hockey, she needed to find information about a lawsuit. A reporter named Meg Linehan had worked for a women’s sports publication called Excelle Sports and was the only person to cover the lawsuit in-depth at the time. But Excelle Sports shut down in 2017 after just two years of publishing and its stories were no longer online.

Azzi got in touch with Linehan. Linehan shared a copy of a draft of the story in Google Docs, but she no longer remembered how close the draft was to the published version.

In the case of women’s hockey, Azzi found that some of the best-kept archives existed on the servers of the University of Toronto. Since the 1990s, former University of Toronto computer science professor and hockey player Andria Hunter has archived Canadian women’s hockey coverage on her university website. Recently, she’s started moving all of that work over to a personal website, The Women’s Hockey Web.

“Thank goodness she did that because [otherwise] we would have no records of the early years of the first Women’s Hockey League in Canada,” Azzi said.

In November 2022, Azzi was laid off from NBC Sports. In June 2023, NBC started redesigning its website, which temporarily took much of Azzi’s work offline while she was applying to other jobs, leaving her without active links for clips. She found links to some of her stories on the Wayback Machine and Wikipedia; luckily, she also had some PDFs saved.

“I constantly have to be editing my resume to make sure that it’s [accurately linking] to my work,” Azzi said. “NBC Sports links ended up changing for a lot of my stories.”

In July 2023, Azzi was hired as senior editor of women’s sports coverage for The Messenger.

“One of the questions I was thinking about when I got offered that job — I don’t think I ultimately asked it because I didn’t want to sound like I was predicting the end too soon — was ‘Is there a plan in place to archive work?’” Azzi told me. The site shut down in February 2024, after just six months, and everyone’s work “disappeared by 7 p.m. the night the site shut down,” Azzi recalled. She doesn’t necessarily think that asking about archives when she got the job offer would have resulted in a different outcome. “But I do think it’s kind of funny, in retrospect, that that was one of the things that was already on my mind,” she said.

Luckily, Azzi undertook some of her own preservation work in her time at The Messenger. She archived around 30 stories on the Wayback Machine so that her writers would have some record of their work. She also copied and pasted freelancers’ stories into Word documents (a tedious process that required reformatting stories to remove ads, while making sure the documents still had visual indicators to show the stories had indeed been published on a professional news site).
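
For readers who want to batch that kind of Wayback Machine submission, here is a minimal Python sketch. It is not how Azzi did it. It assumes the Internet Archive's public "Save Page Now" address (web.archive.org/save/ followed by the page URL) and its availability lookup (archive.org/wayback/available) accept plain requests; the function names, the pause between requests, and the User-Agent string are all hypothetical choices made for illustration.

```python
import time
import urllib.request
from urllib.parse import urlencode

# Assumed public endpoints of the Internet Archive.
SAVE_ENDPOINT = "https://web.archive.org/save/"
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def save_request_url(story_url):
    """Build the address that asks the Wayback Machine to capture a page."""
    return SAVE_ENDPOINT + story_url


def availability_url(story_url):
    """Build the address that asks whether a capture of the page already exists."""
    return AVAILABILITY_ENDPOINT + "?" + urlencode({"url": story_url})


def _default_fetch(url):
    """Request a URL over the network and return the HTTP status code."""
    request = urllib.request.Request(url, headers={"User-Agent": "clip-archiver-sketch"})
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.status


def archive_all(story_urls, fetch=_default_fetch, pause_seconds=5.0):
    """Submit each story for capture, pausing between requests.

    Returns a dict mapping each story URL to an HTTP status code,
    or to a "failed: ..." string if the request raised an error.
    """
    results = {}
    for index, story_url in enumerate(story_urls):
        if index:
            time.sleep(pause_seconds)  # be polite to the archive's servers
        try:
            results[story_url] = fetch(save_request_url(story_url))
        except OSError as error:  # urllib's network errors subclass OSError
            results[story_url] = f"failed: {error}"
    return results
```

Calling `archive_all` with a list of story links would return a small report of which submissions went through, which is worth keeping alongside any PDFs or Word copies. A successful request only means the archive accepted the submission, so spot-checking the captured pages is still a good idea.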

“I went through and was like, okay, what are the stories that I would be saddest to see [gone] if there was no longer an online record of them? What are the stories that I think are most important to make sure that there’s some record of them going forward?” Azzi recalled. “It wasn’t a complete collection.”

“Any outlet could close”

Andrea Gutierrez, a freelance radio reporter based in Los Angeles, is currently updating her archive of her own work. She keeps a detailed Airtable base with the air dates of her pieces, the shows or programs they aired on, and Dropbox links to the physical audio files.

The physical audio files are key. When KPCC rebranded and moved all of its contents over to LAist.com, the KPCC links stopped working and she had to find the new URLs. Also, there isn’t always an easy way to download audio files from the site.

Some pieces may stay online, semi-trapped within the platforms they were originally uploaded to. When Gutierrez was working for TED Radio Hour, she produced social videos that combined audio scripts from the show with videos with the hosts. She didn’t save the video files for herself; they’re still on Instagram, but she can’t retrieve them without resorting to “shady sites to download the videos.”

“You have to think about those things,” Gutierrez said. “Is NPR always going to have a downloadable file? It’s easy to download them now. But [will that always be true]?”

Before working in journalism, Gutierrez was a production editor at a university where her main job was to archive students’ theses and dissertations. Later, she worked for a feminist print magazine called Make / Shift. When the magazine shut down in 2017, it donated all of its materials to an archive in Los Angeles. Those experiences made her realize that the person ultimately responsible for saving her work is her.

“Any job could end. Any outlet could close,” Gutierrez said. “I knew I had to have my own back.”

“My wife saved my ass”

In the months before Vice shut down, Matthew Gault, then a reporter for Motherboard, and his colleagues generally knew the end was coming, but they weren’t sure exactly when it would be. When they heard rumors that the closure was imminent, in February 2024, they scrambled to save their work. Gault had 10 years’ worth of clips from Vice, and the idea of right-clicking to “save as PDF” for hundreds of stories was unappealing.

“My wife saved my ass,” Gault told me. “She had already built a scraper that would scrape the sites, pull out all the advertisements, and then save it as a readable PDF for every single individual piece.”

Karen Gault, the heroine of this story, is a software engineer for a local power company. To scrape her husband’s byline, she customized a script based on Gotham Grabber, open-source code on GitHub that was written to scrape stories from Gothamist and DNAinfo. Once she had successfully scraped her husband’s byline, they moved on to help other writers.
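
To give a rough sense of what the "pull out all the advertisements" step can involve, here is a minimal Python sketch using only the standard library. It is not Karen Gault's script and not Gotham Grabber's code. It assumes ads and other page furniture sit inside tags such as `nav` and `aside`, or inside elements whose class names start with "ad-", "advert", or "sponsor"; those rules and every name below are hypothetical, and a real site would need its own. Fetching pages and converting the result to PDF are left out.

```python
from html.parser import HTMLParser

# Assumed rules: tags whose contents are never part of the story text.
SKIPPED_TAGS = {"head", "script", "style", "nav", "aside", "footer", "iframe", "form"}
# Tags that start or end a paragraph-like block of text.
BLOCK_TAGS = {"p", "div", "li", "blockquote", "h1", "h2", "h3", "h4", "h5", "h6"}
# Tags that never have a closing tag, so they are not tracked as open.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


def looks_like_ad(attrs):
    """Guess from the class attribute whether an element is an ad container."""
    classes = (dict(attrs).get("class") or "").lower().split()
    return any(
        name in {"ad", "ads"} or name.startswith(("ad-", "ads-", "advert", "sponsor"))
        for name in classes
    )


class ArticleTextExtractor(HTMLParser):
    """Collect readable paragraphs while skipping ads and page furniture."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._chunk = []          # text pieces of the paragraph being built
        self._open_tags = []      # stack of currently open non-void tags
        self._skip_depth = None   # stack depth at which skipping began

    def flush(self):
        """Finish the current paragraph, normalizing its whitespace."""
        text = " ".join("".join(self._chunk).split())
        if text:
            self.paragraphs.append(text)
        self._chunk = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()
        if tag in VOID_TAGS:
            if tag == "br" and self._skip_depth is None:
                self._chunk.append(" ")  # treat a line break as a space
            return
        if self._skip_depth is None and (tag in SKIPPED_TAGS or looks_like_ad(attrs)):
            self._skip_depth = len(self._open_tags)
        self._open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()
        if tag in VOID_TAGS or tag not in self._open_tags:
            return
        while self._open_tags.pop() != tag:
            pass  # also close any unclosed tags nested inside this one
        if self._skip_depth is not None and len(self._open_tags) <= self._skip_depth:
            self._skip_depth = None  # left the skipped element

    def handle_data(self, data):
        if self._skip_depth is None:
            self._chunk.append(data)


def extract_article_text(html):
    """Return the story's paragraphs as a list of strings."""
    parser = ArticleTextExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()
    return parser.paragraphs
```

A per-byline run would loop over each story's saved HTML, call `extract_article_text`, and write the paragraphs out in whatever format is easiest to keep. Images are ignored here; the source notes that image-heavy stories were the slowest part of the real job.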

Karen Gault’s computer screen, where she scraped multiple Vice bylines at a time.

Matthew organized a Signal chat with current and former Vice employees who wanted their work saved, and in one weekend, Karen and their friend Chris scraped more than 60 author tags. In a massive Google Drive folder, they uploaded a zipped file for each journalist with their own Vice archive.

One particular byline took nearly an entire day to scrape, Karen said, because so many of the reporter’s 3,200 stories were filled with images, which take more time to pull out.

Matthew was in charge of making dinner that weekend, and resolved to save his own work in the future. As of today, Vice’s website and content are still online, but no one knows when that might change.

“I’m fairly cynical about all this and tend to think that things go away very easily. And given that attitude, I should have known better,” Matthew said. “There’s a lot of stuff that I’m very proud of. In 10 years, to not be able to look at things and have my memory jogged about the experience of writing, or the talks with editors that came to make that piece happen — that would be a big personal loss for me.”

Large legacy news outlets are often better equipped to preserve their archives than smaller digital-only news outlets because they have the resources. But those protocols have to be constantly updated and refined to keep up with changing technology and evolving forms of journalism.

“It’s not just a question of running out of money in the future,” NYU’s Talya Cooper said. “What if your site gets ransomware? What if the power goes out at your website provider? There are a lot of scenarios in which you might want an archive of the things that you’ve created.”

Ryan Murphy has been working in the archives department of The New York Times since 1997, when he started as an intern. Today, he creates merchandise for sale using materials from the archives — calendars, books, and so on. He’s also the customer reprint sales agent, helping anyone who calls looking for a specific issue or story from the Times.

Even at a paper as large as the Times, things can go missing, Murphy said. There are two types of archives: the historical archives, usually in black and white and on microfilm, and the modern archives, in color and high resolution. When Murphy first joined the paper, the modern archives were on CD-ROMs and his job was to manage that collection. Each color issue of The Times was on its own CD.

“Whenever reporters wanted the original color issue, they would check it out like it was a library CD. Sometimes that would not come back,” Murphy said. “One day, I was like, ‘You know what, I’m just going to copy every CD-ROM into my computer.’ Finally, we made a server to hold all these old digital files, but a lot of them went missing. That just goes to show, like, we have access to information from hundreds of years ago, but sometimes we lose things from 2002.”

Good digital preservation is expensive, Cooper said. Less expensive methods are labor-intensive. Because of financial constraints, newsrooms may rely on content management systems as imperfect archives instead of investing in digital asset management systems.

“When I was at The Intercept, we were like, ‘Why would we need an archive when we have everything is in WordPress?’ But if a website is to shut down, where does the WordPress go? Where does the CMS go?” Cooper said. “All journalists have had a story in the CMS that somehow vanished when they refreshed the page. [We need to] think a little bit bigger-picture about the idea of journalism as evidence and an archive, and not just a business of putting things out every day.”

When I asked Cooper what she would do if she were an archivist for a news outlet that was shutting down, she said she would reach out to organizations like Archive Team or the Library of Congress, and ask them if they could take on the task of archiving the site and sustaining the archives long-term. Similar services exist in academia — CLOCKSS and Portico, for instance, are digital archiving services for scholarly articles and journals.

“It would be more beneficial to everyone involved if it were possible to transfer the server to an institution or allow the institution to access it and scrape it rather than doing a web crawl,” Cooper said. “Transferring over the backends, rather than relying on front-end capture is another way that we can think about digital preservation. I don’t know if I know of an example where that actually has happened. That’s a dream scenario.”

Public domain photo of a library card catalog

Hanaa' Tameez is a staff writer at Nieman Lab. You can reach her via email (hanaa@niemanlab.org), Twitter DM (@HanaaTameez), or on Signal (@hanaatameez.01).