March 20, 2024, 11:52 a.m.
Business Models

The Intercept charts a new legal strategy for digital publishers suing OpenAI

Raw Story, AlterNet, and The Intercept are among the first smaller publications to go up against the AI goliath for copyright violations.

Two lawsuits against OpenAI are charting a new path for copyright litigation against AI developers — one tailored to outlets that only publish on the internet.

On February 28, The Intercept, as well as the progressive news sites Raw Story and AlterNet, filed lawsuits claiming OpenAI had used their stories to train ChatGPT without permission or compensation. All three publications are being represented by the same civil rights law firm, Loevy & Loevy.

These new suits come just two months after The New York Times filed a landmark case in federal district court. The Times was the first major American media company to sue OpenAI for infringing its copyrights by training the lucrative GPT large language models (LLMs) on its work.

Instead of relying on standard claims of copyright infringement, as the Times case does, the lawyers at Loevy & Loevy have homed in on OpenAI’s alleged violation of a 1998 law called the Digital Millennium Copyright Act, or DMCA.

“We think that this is the model that will give online news organizations, especially smaller ones, the best opportunity to ensure that they’re compensated for the use of their work in training AI models,” said Matt Topic, a partner at Loevy & Loevy and one of the lead lawyers on the suits. “We’re fully prepared to bring additional cases for other organizations who are similarly interested in obtaining compensation for what OpenAI has done with their work.”

The claims of The Intercept, Raw Story, and AlterNet will ring familiar for anyone who has been following the allegations from publishers against OpenAI over the last several months. Novelists, journalists, and other authors have alleged that OpenAI, in vacuuming up text from different corners of the internet, folded their copyrighted works into its training data sets. Some of the proof is in ChatGPT’s own outputs.

A study released this month by Patronus AI, a startup launched by former Meta researchers, found that GPT-4 reproduced copyrighted content at the highest rate among popular LLMs. When asked to finish a passage of a copyrighted novel, GPT-4 reproduced the text verbatim 60% of the time. The new lawsuits similarly allege that ChatGPT reproduces journalistic works near-verbatim when prompted.

John Byrne, the owner and CEO of both Raw Story and AlterNet, said that ChatGPT has also produced headlines in the style of Raw Story and rewritten existing news articles to mimic the voice of a particular author on his sites. “With one or two articles you would never be able to do it. You have to have hundreds or thousands of articles in order to do that,” he said. (I wasn’t able to reproduce these behaviors when I tested ChatGPT myself.)

When the Times filed its lawsuit in December, Byrne says, he thought for the first time that legal action might be possible. But the Times’ strategy isn’t replicable for a small publisher like Raw Story — mainly because it’s cost-prohibitive.

To sue OpenAI directly for copyright infringement, publishers must hold a registration for the infringed works with the U.S. Copyright Office. A print newspaper can pay a single fee, once a month, to register everything that appeared in its pages during that month. For that reason, the Times can pay as it goes to cover a large swath of its published journalism.

Digital-only publications, however, have to file electronic registrations for each individual article or piece of work on their sites. One standard electronic registration with the U.S. Copyright Office costs $65, so the cost of registering all the articles on a prolific news website like The Intercept could quickly balloon to tens of thousands of dollars, or even more.
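
The math adds up fast: at $65 per filing, registering even 1,000 articles would cost $65,000, and a site publishing, say, a dozen stories a day would cross that threshold in under three months.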

“The Copyright Office still hasn’t gotten an easy way to register online news organizations, which is an incredible failing,” said David Bralow, general counsel for The Intercept. “If you’re The New York Times you can register collective works in the old-fashioned microfilm sort of way, or just send in a bunch of fixed newspapers each month, and take advantage of the actual Copyright Act. But we can’t.”

In January, the U.S. Copyright Office announced it would consider a new type of registration for “frequently updated news sites.” If approved, it would allow digital-only publications, for the first time, to pay for filings in bulk. But for now, small and nonprofit publications like The Intercept are still up against exorbitant fees to legally claim copyright infringement by AI developers.

The DMCA, however, offers another way into the courts. The law was originally written to curb copyright violations during the early rise of the internet. One of its provisions makes it illegal to remove copyright management information from an online work, including the author’s name, the title of the work, and any copyright notices or terms of use.

While the Times lawsuit does include some DMCA claims among its many allegations, these new lawsuits center solely on the allegation that OpenAI stripped online articles from The Intercept, Raw Story, and AlterNet of that copyright management information. In particular, the lawsuits allege that OpenAI removed details like author names and terms of use when adding the articles to its training data sets. Loevy & Loevy has hired outside data scientists and AI experts to help verify this claim, according to Topic.

I reviewed OpenAI’s GitHub account, which is cited in the filing. One upload indicates that web pages from The Intercept, Raw Story, and AlterNet were included in an internal training data set at OpenAI called WebText. WebText was used to train GPT-2, an earlier iteration of the LLM released in 2019.

The description of WebText on GitHub says it includes “the text contents of 45 million links posted by users of the ‘Reddit’ social network.” An accompanying breakdown puts Raw Story and AlterNet among the 200 most-cited domains in the set, together totaling over 56,000 entries. If the publications win their DMCA violation case, they will be entitled to statutory damages starting at $2,500 per violation.
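
To put that figure in context: if each of those 56,000 entries were treated as a separate violation (a question a court would ultimately have to decide), the statutory minimum alone would come to $140 million.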

Since the release of GPT-2, OpenAI has stopped publishing such granular information on its training data.

“I think the world would be in a different place with AI if ChatGPT had been trained about titles, authors, and copyright notices,” Topic said. “It probably would figure out that if there’s a letter C with a circle around it included within a work, that means something, and what it means is that you can’t use this work without obtaining the permission of the author.”

A seat at the table

If successful, a legal strategy focused on DMCA violations could lead to other smaller digital publications filing suits against AI developers. That said, many newsrooms simply cannot afford to go to court.

The Intercept, Raw Story, and AlterNet may have sidestepped this issue by seeking out representation from Loevy & Loevy, a civil rights law firm with an extensive record of winning police brutality and wrongful conviction cases. Topic, in particular, has carved out a practice representing smaller newsrooms in state and federal FOIA cases. Though he declined to comment on specific arrangements in the OpenAI lawsuits, he confirmed many of his clients cannot afford legal representation upfront.

“We’re a firm that is accustomed to doing cases on a contingency basis. I think these arrangements allow smaller newsrooms to vindicate their rights in a way that they otherwise rarely would be able to,” he said.

All three plaintiffs, however, say their litigation is about more than just recouping damages. “I think that smaller publishers will be left behind if they don’t assert their rights to their journalism, or worse, just be obliterated,” said Byrne. “They’re not going to have a seat at the table.”

Major news companies, including The Associated Press and Axel Springer, have inked multi-year, multimillion-dollar licensing deals with OpenAI to use their stories to train AI models. But many smaller news organizations, which are also feeling the heat from the rise of generative AI, have not been approached, or have not been offered similarly competitive contracts.

According to Bralow, The Intercept is more than open to such licensing deals with AI developers, as long as it’s fairly compensated for its work. In fact, he argues it’s important for diverse and progressive news outlets to be included in such models. “If you just simply rely on the AP or The Wall Street Journal, you’re not going to get that parallax view of the world,” he said.

In the meantime, The Intercept isn’t the only outlet considering taking matters into its own hands. Since the lawsuit went public, Bralow has been fielding calls from a number of other nonprofit news organizations asking how they might build their own legal cases.

This industry-wide concern around copyright violations is surfacing as business models across nonprofit and independent journalism continue to erode. In February alone, The Intercept laid off 15 staffers, nearly half of its editorial team. The departures included editor-in-chief Roger Hodge.

“[The lawsuit] is not simply moral, it’s for the ultimate sustainability of these news organizations and these unique voices,” said Bralow. “The principle of being able to take content without compensation is critically important to us — and so is our survivability. They go hand in hand.”

Photo of OpenAI website from Wikimedia Commons used under a Creative Commons license.

Andrew Deck is a generative AI staff writer at Nieman Lab. Have tips about how AI is being used in your newsroom? You can reach Andrew via email (andrew_deck@harvard.edu), Twitter (@decka227), or Signal (+1 203-841-6241).