Feb. 6, 2025, 7:36 p.m.

How DeepSeek stacks up when citing news publishers

The DeepSeek hype cycle is in full force, but can the chatbot attribute sources more accurately than its competitors?

Over the past two weeks, DeepSeek has made a splash in the AI industry. On January 20, the Chinese startup released its new open-source model, DeepSeek-R1, which beat competitors like OpenAI’s o1 on several important performance benchmarks despite costing a fraction as much to develop.

In the DeepSeek hype cycle, however, little attention has been paid to the company’s approach to news publishers. It’s worth asking whether the model’s high performance extends to its ability to accurately cite and attribute its news sources. And while DeepSeek is turning heads by hurdling the cost barriers to training its foundation model, does that model actually respect the intellectual property of media companies?

DeepSeek did not respond to my requests for comment. The company, seemingly, has not responded to any interview requests from international media since it emerged on the global stage last month. So, for now, I decided to turn to DeepSeek’s product itself — its chatbot — to sketch out some preliminary insights into these questions.

Last summer, I published a story showing that ChatGPT, OpenAI’s chatbot, regularly hallucinated URLs for at least 10 of its news partners’ websites. These fake citations led users to 404 errors, including broken links to marquee investigations and Pulitzer Prize-winning stories. (ChatGPT has since made some improvements to its citations, namely through the launch of its web browsing feature SearchGPT late last year, which significantly changed the user interface for footnotes and sources.)

I conducted a similar round of tests with DeepSeek’s chatbot, using both its website and mobile app. I prompted the model to share details on dozens of original investigations by major news outlets and to share links to those stories. A few things jumped out in my tests. Most notably, the chatbot readily acknowledged that sharing the contents of these news articles could violate copyright and skirt subscription paywalls.
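
The tests described in this story were run by hand in the chatbot interfaces, but for readers curious about the mechanics, here is a minimal sketch of how the link-validity step could be automated. The URLs below are hypothetical placeholders standing in for links a chatbot might return; this is not the tooling used for these tests.

```python
# Minimal sketch: check whether chatbot-supplied citation URLs resolve or return 404s.
# The URLs below are hypothetical placeholders, not links from the actual tests.
import requests

chatbot_urls = [
    "https://www.example.com/hypothetical-investigation-1",
    "https://www.example.com/hypothetical-investigation-2",
]

for url in chatbot_urls:
    try:
        # HEAD keeps the request light; some sites only respond properly to GET.
        resp = requests.head(url, allow_redirects=True, timeout=10)
        result = resp.status_code  # 404 here would indicate a broken or hallucinated link
    except requests.RequestException as exc:
        result = f"request failed ({exc})"
    print(f"{result}\t{url}")
```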

“It’s important to respect copyright and subscription models”

DeepSeek’s chatbot, on both the web and the mobile app, has three settings: the default standard mode; “Search” mode, where it browses the web in real time while responding; and “DeepThink” mode, where it walks through its reasoning before providing a response.

In my standard mode tests, DeepSeek did in fact hallucinate URLs to news publications on several occasions. The chatbot offered up broken links to major stories by The Atlantic and Politico, among others. This was not a persistent problem, however, and more often than not DeepSeek did not provide a URL for its sources at all in this mode. Instead, the chatbot often credited news articles by including the headline, bylined author, or publication date directly in the text of its response. It then suggested I search for that article myself on the news publisher’s website.

For example, I prompted DeepSeek to share The Wall Street Journal’s 2018 investigation into Donald Trump’s involvement in hush money payments made to Stormy Daniels and Karen McDougal. “If you’d like to reach the original articles, I recommend searching for them on The Wall Street Journal’s website or through a news archive,” read one response from DeepSeek. “If you encounter a paywall, consider checking for a free article quota or limited access. Let me know if you need further assistance!”

In other tests, this redirection got even more specific: “Visit WSJ’s website and search the exact title or keywords like ‘Trump Cohen Stormy Daniels payment,’” read another response. DeepSeek even suggested I seek out a local library that might have free digital access to The Wall Street Journal. (Not all of the chatbot’s advice was sound. At one point, it did strangely suggest I find the investigation by searching PubMed Central, a database for biomedical and life science journals.)

[Screenshot: DeepSeek’s response to a request for The Wall Street Journal’s original reporting on the Stormy Daniels hush money payments]

Overall, in standard mode with web search turned off, DeepSeek regularly encouraged me to move off platform: to exit the chatbot interface and seek out a more reliable source, usually the news outlet that had published the story. There is a risk of overextrapolating from these responses, but it is noteworthy how often DeepSeek acknowledged, in my tests, that it was not the best place to access the information I was seeking.

After I turned on “DeepThink” mode in my tests, I got an imperfect peek under the hood to see how the chatbot was arriving at these responses. Other observers have noted that while talking through its “reasoning,” DeepSeek’s safeguards appear to kick in. Sometimes the model will self-censor or pivot its response during this process. I noticed a similar pattern. When reasoning through its ability to share content from news publishers, DeepSeek stated plainly that its response to my queries could violate copyright. It also acknowledged that these articles were likely paywalled, and that it could “violate policy” not to make that clear to users.

“Sometimes articles are available through archive services, but that might infringe on copyright. I shouldn’t suggest anything that’s against policies,” reads one of these rambling responses. “The best approach is to advise the user to search for the article themselves, perhaps mentioning the authors and publication time frame to aid their search. Also, note that if they don’t have a subscription, they might hit a paywall, but some news outlets offer a limited number of free articles per month.”

Other responses made the chatbot’s guardrails even more explicit: “I should also make sure to clarify that I can’t bypass the paywall or provide unauthorized access. It’s important to respect copyright and subscription models. So, the response should be helpful but within the constraints of available information and access policies.”

[Screenshot: the Wall Street Journal hush money payments citation prompt on DeepSeek]

In my testing of similar chatbots (not only ChatGPT, but also competitors like Claude and Perplexity), it is rare for a model, without a leading prompt, to call attention to a news publisher’s paywall, openly discuss the possibility of violating copyright, and give the user options for accessing the published material they are looking for in a more responsible or permissible way.

Can you have too many sources?

Turning on “Search” mode, which lets DeepSeek pull sources from the internet, opened up a different set of issues in my tests. With web browsing enabled, the chatbot was far less likely to suggest that I search for a news article on a news publisher’s website. Instead, it would automatically provide me with links to relevant news articles, sometimes as many as 50 of them.

One problem that comes up frequently with ChatGPT and similar products is that, when the chatbot cites sources, it often fails to link to the outlet or article that broke a story. A recent report by the Tow Center for Digital Journalism at Columbia University has termed this issue “copycat sources.” ChatGPT will often link to news publishers aggregating original reporting, elevating these copycat articles over the initial story, or failing to surface the initial story at all. Frequently these “copycats” are far less reputable, including blogs and websites that have outright plagiarized established news outlets. This copycat citation problem even plagues news publishers that have active licensing deals with OpenAI.

Take my prompt asking ChatGPT to share the story that first leaked the Supreme Court decision overturning Roe v. Wade. ChatGPT correctly stated that Politico broke that story in 2022 and encouraged me to read the story on Politico’s website. (Politico’s parent company, Axel Springer, has an ongoing licensing deal with OpenAI.) In one instance, ChatGPT asked me to “read the story here” but provided no hyperlink at all. When I expanded the sources at the top of the response, the first story cited was not Politico’s, but a 2025 article by The New York Sun about the pro-life movement’s current place in the Republican Party.

DeepSeek’s chatbot seems to have a slightly different version of this problem. In responses to the same Roe v. Wade question, and many other similar test prompts, DeepSeek rarely failed to include the link to the first outlet that published a major story or notable investigation. Sometimes these links would not appear in the copy or footnotes of a response, but only when I clicked on the sources tab. A scrollable pop-up window that looked a lot like a Google search results page would preview the links, including each story’s outlet, headline, and snippet. The original stories were almost always there, Politico’s Roe v. Wade story included.

That said, DeepSeek often opted to include a tremendous number of sources and links in its responses with web search enabled. In most cases, prompting DeepSeek to share original pieces of journalism turned up at least 20 sources, and as many as 50. By comparison, ChatGPT usually shared fewer than 15 sources in response to the exact same prompts. So while DeepSeek usually attributed the correct story, that story could at times be buried by the sheer number of sources generated.

[Screenshot: the Russell Brand sexual assault citation prompt on DeepSeek]

Overall, my tests showed DeepSeek’s chatbot has a relatively high standard for citing and attributing news publishers. It is worth noting, however, that the company does not have any ongoing licensing deals with major international news organizations. So while in my tests it often referred users more directly and accurately to news publishers, and encouraged them to respect subscription paywalls, DeepSeek is not compensating those publishers in any direct or indirect way. OpenAI, meanwhile, has signed contracts with at least 20 major news organizations.

There are also open questions about DeepSeek’s training data, and whether it relied on the mass scraping of news publishers’ websites. Some early reporting alleges the company siphoned off OpenAI’s data without permission or compensation. While there is clear irony in that, I’m not confident a second-order unauthorized use of news publishers’ stories for training writes over OpenAI’s “original sin.”

As our understanding of DeepSeek’s training practices continues to develop, it’s worth asking which industry norms the startup is actually breaking. Yes, DeepSeek is relatively cheap and open source, both of which hold the promise of democratizing access to sophisticated AI reasoning models. It remains to be seen whether DeepSeek is also challenging the status quo when it comes to the treatment of content from news publishers, or whether it’s simply cementing the tacit disregard for intellectual property that has become an industry norm.

Photo illustration of the DeepSeek app displayed on an iPhone screen on January 29, 2025, in New Delhi, India, by Vista Vault via Adobe Stock.

Andrew Deck is a staff writer covering AI at Nieman Lab. Have tips about how AI is being used in your newsroom? You can reach Andrew via email, Bluesky, or Signal (+1 203-841-6241).