Nieman Foundation at Harvard
July 30, 2015, 10:29 a.m.
Reporting & Production

The New York Times built a robot to help make article tagging easier

Developed by the Times R&D lab, the Editor tool scans text to suggest article tags in real time. But the automatic tagging system won’t be moving into the newsroom soon.

If you write online, you know that a final, tedious part of the process is adding tags to your story before sending it out to the wider world.

Tags and keywords in articles help readers dig deeper into related stories and topics, and give search audiences another way to discover stories. A Nieman Lab reader could go down a rabbit hole of tags, finding all our stories mentioning Snapchat, Nick Denton, or Mystery Science Theater 3000.

Those tags can also help newsrooms create new products and find inventive ways of collecting content. That’s one reason The New York Times Research and Development lab is experimenting with a new tool that automates the tagging process using machine learning — and does it in real time.

The Times R&D Editor tool analyzes text as it’s written and suggests tags along the way, much as spell-check tools highlight misspelled words:

Editor is an experimental text editing interface that explores how collaboration between machine learning systems and journalists could afford fine-grained annotation and tagging of news articles. Our approach applies machine learning techniques interactively, as part of the writing process, rather than retroactively. This approach can offload the burden of work to the computational processes, and can create affordances for journalists to augment, edit and correct those processes with their knowledge.

It’s similar to Thomson Reuters’ Open Calais system, which extracts metadata from text files of any kind. Editor works by connecting the corpus of tags housed at the Times with an artificial neural network designed to read over a writer’s shoulder in a text editing system. The lab explains:

As the journalist is writing in the text editor, every word, phrase and sentence is emitted on to the network so that any microservice can process that text and send relevant metadata back to the editor interface. Annotated phrases are highlighted in the text as it is written. When journalists finish writing, they can simply review the suggested annotations with as little effort as is required to perform a spell check, correcting, verifying or removing tags where needed. Editor also has a contextual menu that allows the journalist to make annotations that only a person would be able to judge, like identifying a pull quote, a fact, a key point, etc.
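The flow the lab describes (text goes out, annotations with positions come back, the editor highlights them) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it swaps the lab's neural network for a plain dictionary lookup, and the names `TAG_INDEX`, `Annotation`, and `annotate` are hypothetical, not part of Editor.

```python
import re
from dataclasses import dataclass


@dataclass
class Annotation:
    start: int   # character offset where the phrase begins
    end: int     # character offset just past the phrase
    phrase: str  # the matched text as the journalist typed it
    tag: str     # the suggested tag


# Hypothetical tag index. The real system draws on the Times' tag corpus.
TAG_INDEX = {
    "new york times": "Organization",
    "snapchat": "Company",
    "nick denton": "Person",
}


def annotate(text, tag_index=TAG_INDEX):
    """Return tag suggestions for every indexed phrase found in the text.

    A stand-in for one "microservice": it receives text and sends back
    metadata that an editor interface could highlight.
    """
    annotations = []
    for phrase, tag in tag_index.items():
        # Match whole words only, ignoring case.
        pattern = r"\b" + re.escape(phrase) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            annotations.append(
                Annotation(match.start(), match.end(), match.group(0), tag)
            )
    # Sort by position so the editor can highlight left to right.
    annotations.sort(key=lambda a: a.start)
    return annotations
```

Calling `annotate("Nick Denton joined Snapchat.")` would return two suggestions, one for the person and one for the company, each carrying the offsets an editor would need to highlight the phrase. A journalist could then accept, correct, or remove each one, as in a spell check.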

“We started looking at what we could do if we started tagging smaller entities in the articles. [We thought] it might afford greater capabilities for reuses and other types of presentation,” said Alexis Lloyd, creative director at the Times R&D Lab.

Tags are a big deal at the Times; the paper has a system of article tags that goes back over 100 years. That metadata makes things like Times Topics pages possible. It’s an important process that is entirely manual, relying on reporters and editors to provide a context layer around every story. And in some cases, that process can lag. The Times’ innovation report cited many gaps in the paper’s metadata system as a strategic weakness:

“Everyone forgets about metadata,” said John O’Donovan, the chief technology officer for The Financial Times. “They think they can just make stuff and then forget about how it is organized in terms of how you describe your content. But all your assets are useless to you unless you have metadata — your archive is full of stuff that is of no value because you can’t find it and don’t know what it’s about.”

Lloyd said the idea behind Editor was not just to make the metadata process more efficient, but also to make it more granular. By using a system that combs through articles at a word-by-word level, the amount of data associated with people, places, companies, and events becomes that much richer.

And that much more data opens new doors for potential products, Lloyd told me. “Having that underlying metadata helps to scale to all kinds of new platforms as they emerge,” she said. “It’s part of our broader thinking about the future of news and how that will become more complex, in terms of forms and formats.”

Reporters at the Times won’t be seeing Editor, at least in its current state, in the newsroom any time soon. Like many projects at the R&D lab, it’s possible that parts of Editor will eventually be put to use in systems around the Times. “Like most prototypes, it’s a way of exploring a set of approaches and new capabilities,” Lloyd said. “It’s not intended to be something that then moves, whole cloth, into production.”

The key feature of the automatic tagging system relies on bringing machines into the mix, a prospect that inspires conflicting feelings of progress and dread in some journalists. For Editor to work, the lab needed to build a way for machines and humans to supplement each other’s strengths. Humans are great at seeing context and connections and understanding language, while machines can do computations at enormous scale and have perfect memory. Mike Dewar, a data scientist at the Times R&D lab, said the artificial neural network makes connections between the text and an index of terms pulled from every article in the Times archive.

It took around four months to build Editor, and part of that time was spent training the neural network in how a reporter might tag certain stories. Dewar said that teaching the network the way tags are associated with certain phrases or words gives it a benchmark to use when checking text in the future.

The biggest challenge was latency, as Editor works to make connections between what’s being written and the index of tags. In order for Editor to be really effective, it has to operate at the speed of typing, Dewar said: “It needs to respond very quickly.”

While the Editor system may not find its way into wider use outside of the R&D lab, Lloyd said it hints at ways newsrooms can use automation and machine learning to make reporters’ jobs simpler. Instead of having a bot trained to check text against an index of tags, you might have a tool that could check quotes against things people have said in the past, or that could tally how often certain sources or subjects wind up in stories.
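The source-tallying idea is simple enough to sketch. The Python below is a minimal illustration, not anything the lab has built: the function name `tally_sources` and the sample inputs are hypothetical, and it uses naive case-insensitive substring matching rather than machine learning.

```python
from collections import Counter


def tally_sources(stories, sources):
    """Count how many stories mention each source at least once.

    `stories` is a list of article texts and `sources` is a list of names.
    Matching is a simple case-insensitive substring check (an assumption
    made for brevity; real name matching would need to be more careful).
    """
    counts = Counter()
    for story in stories:
        lowered = story.lower()
        for source in sources:
            if source.lower() in lowered:
                counts[source] += 1
    return counts
```

Given a list of article texts and the names "Alexis Lloyd" and "Mike Dewar", the function returns a `Counter` mapping each name to the number of stories that mention it, which a newsroom could use to spot sources it leans on too heavily.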

Robots continue to expand their foothold in the world of journalism. In March, the AP said it planned to use its automated reporting services to increase college sports coverage. Lloyd has experimented with how bots can work more cooperatively with people, or at least learn from them and their Slack conversations.

The notion of robots snatching up the shrinking number of jobs at American newspapers is a little far-fetched, Lloyd said. Many reporters see Editor and projects like it as potential tools in the writing process. The way to think about it, Lloyd said, is how machines can augment reporting.

“We think about how automated or computational approaches can provide superpowers for journalists,” Lloyd said.

Photo of a Cyberman by Chris Sampson used under a Creative Commons license.
