Nieman Foundation at Harvard
Aug. 16, 2019, 9:50 a.m.

One potential route to flagging fake news at scale: Linguistic analysis

It’s not perfect, but legitimate and faked news articles use language differently in ways that can be detected algorithmically: “On average, fake news articles use more expressions that are common in hate speech, as well as words related to sex, death, and anxiety.”

Have you ever read something online and shared it among your networks, only to find out it was false?

As a software engineer and computational linguist who spends most of her work (and even leisure) hours in front of a computer screen, I’m concerned about what I read online. In the age of social media, many of us consume unreliable news sources. We’re exposed to a wild flow of information in our social networks — especially if we spend a lot of time scanning our friends’ random posts on Twitter and Facebook.

A study in the United Kingdom found that about two-thirds of the adults surveyed regularly read news on Facebook, and that half of those readers had at some point initially believed a fake news story. Another study, conducted by researchers at MIT, focused on the cognitive aspects of exposure to fake news and found that, on average, readers believe a false news headline at least 20 percent of the time.

It’s often difficult to find the origin of a story after partisan groups, social media bots and friends of friends have shared it thousands of times. Fact-checking sites such as Snopes and BuzzFeed can address only a small portion of the most popular rumors.

The technology behind the internet and social media has enabled this spread of misinformation; maybe it’s time to ask what this technology has to offer in addressing the problem.

My colleagues and I at the Discourse Processing Lab at Simon Fraser University have conducted research on the linguistic characteristics of fake news. Recent advances in machine learning have made it possible for computers to instantaneously complete tasks that would have taken humans much longer. When machine learning is applied to natural language processing, it is possible to build text classification systems that can distinguish one type of text from another.
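The kind of text classification system described above can be sketched in a few lines. This is a minimal illustration assuming the scikit-learn library; the example articles and labels are invented placeholders, not data from our lab:

```python
# A minimal sketch of a machine-learning text classifier.
# The training texts and labels below are illustrative placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labeled examples: short texts marked fake or real.
articles = [
    "Miracle cure doctors don't want you to know about",
    "City council approves budget for road repairs",
    "Shocking secret the government is hiding from you",
    "Quarterly earnings report shows modest growth",
]
labels = ["fake", "real", "fake", "real"]

# Convert each text into word-frequency features, then fit a classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(articles, labels)

# Classify a previously unseen headline.
print(model.predict(["Secret miracle cure revealed"])[0])
```

A real system would train on thousands of fact-checked articles rather than four toy sentences, but the pipeline is the same: turn text into numeric features, then learn which feature patterns separate the two classes.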

During the past few years, natural language processing scientists have become more active in building algorithms to detect misinformation; this helps us to understand the characteristics of fake news and develop technology to help readers. One approach finds relevant sources of information, assigns each source a credibility score, and then integrates them in order to confirm or debunk a given claim. This approach is heavily dependent on tracking down the original source of news and scoring its credibility based on a variety of factors.

A second approach examines the writing style of a news article rather than its origin. The linguistic characteristics of a written piece can tell us a lot about the authors and their motives. For example, specific words and phrases tend to occur more frequently in a deceptive text compared to one written honestly.

Our research identifies linguistic characteristics to detect fake news using machine learning and natural language processing technology. Our analysis of a large collection of fact-checked news articles on a variety of topics shows that, on average, fake news articles use more expressions that are common in hate speech, as well as words related to sex, death, and anxiety. Genuine news, on the other hand, contains a larger proportion of words related to work (business) and money (economy).
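Word-category features of this kind can be computed by counting matches against category lexicons. The mini-lexicons below are invented for illustration, not the actual word lists used in our research:

```python
# A sketch of style-based features: count how many words in a text
# fall into each category lexicon. These tiny word lists are
# illustrative placeholders, not a real research lexicon.
import re

LEXICONS = {
    "anxiety": {"fear", "worry", "panic", "threat"},
    "death":   {"die", "dead", "kill", "fatal"},
    "money":   {"economy", "market", "budget", "profit"},
}

def category_counts(text):
    """Return the number of words from each category found in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return {cat: sum(w in lex for w in words)
            for cat, lex in LEXICONS.items()}

print(category_counts("Panic and fear spread as the fatal threat grows"))
print(category_counts("The budget boosts the economy and market profits"))
```

Counts like these, computed over many categories, become the feature values that a classifier uses to tell the two writing styles apart.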

This suggests that a stylistic approach combined with machine learning might be useful in detecting suspicious news.

Our fake news detector is built based on linguistic characteristics extracted from a large body of news articles. It takes a piece of text and shows how similar it is to the fake news and real news items that it has seen before. (Try it out!)
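Reporting how similar a text is to previously seen fake and real items can be expressed as class probabilities. The following is a hypothetical illustration of that idea, not the lab's actual detector:

```python
# Hypothetical sketch: score a text against fake/real training examples
# by reporting class probabilities from a simple model.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented placeholder training data.
texts = [
    "shocking secret cure exposed",
    "council approves road budget",
    "miracle cure they hide",
    "earnings report shows growth",
]
labels = ["fake", "real", "fake", "real"]

vec = CountVectorizer()
model = MultinomialNB().fit(vec.fit_transform(texts), labels)

# Probability that the new text resembles each class.
probs = model.predict_proba(vec.transform(["secret miracle cure"]))
for cls, p in zip(model.classes_, probs[0]):
    print(f"{cls}: {p:.2f}")
```

Instead of a hard fake/real verdict, the reader sees a degree of similarity to each class, which is more honest about the model's uncertainty.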

The main challenge, however, is to build a system that can handle the vast variety of news topics and the rapid turnover of headlines online. Computer algorithms learn from samples, and if these samples are not sufficiently representative of online news, the model's predictions will not be reliable.

One option is to have human experts collect and label a large quantity of fake and real news articles. This data enables a machine-learning algorithm to find common features that recur in each collection regardless of topic or wording. Ultimately, the algorithm will be able to distinguish with confidence between previously unseen real and fake news articles.

Fatemeh Torabi Asr is a postdoctoral research fellow in the Discourse Processing Lab at Simon Fraser University. This article is republished from The Conversation under a Creative Commons license.
