In September 2023, days before Slovakia’s parliamentary election, an AI-synthesized impersonation of the voice of an opposition leader helped swing the vote toward a pro-Russia candidate. Another AI audio fake was layered onto a real video clip of a candidate in Pakistan, supposedly calling on voters to boycott the general election in February 2024. Ahead of Bangladesh’s elections in January 2024, several fakes created with inexpensive, commercial AI generators gained traction among voters by smearing rivals of the incumbent prime minister. And, in the U.S., an audio clip masquerading as the voice of President Joe Biden urged voters not to vote in one key state’s primary election.
Experts agree that the historic election year of 2024 is set to be the year of AI-driven deepfakes, with potentially disastrous consequences for at-risk democracies. Recent research suggests that, in general, about half of the public can’t tell the difference between real and AI-generated imagery, and that voters cannot reliably detect speech deepfakes — and technology has only improved since then. Deepfakes range from subtle image changes using synthetic media and voice cloning of digital recordings to hired digital avatars and sophisticated “face-swaps” that use customized tools. (The overwhelming majority of deepfake traffic on the internet is driven by misogyny and personal vindictiveness: to humiliate individual women with fake sexualized imagery — but this tactic is also increasingly being used to attack women journalists.)
Media manipulation investigators told GIJN that fake AI-generated audio simulations — in which a real voice is cloned by a machine learning tool to state a fake message — could emerge as an even bigger threat to elections in 2024 and 2025 than fabricated videos. One reason is that, like so-called cheapfakes, audio deepfakes are easier and cheaper to produce. (Cheapfakes have already been widely used in election disinformation; they involve video purportedly from one place that was actually shot somewhere else, short audio clips crudely spliced into videos, or closed captions blatantly edited.) Another advantage for bad actors is that they can be used in automated robocalls to target (especially) older, highly active voters with misinformation. And tracing the origin of robocalls remains a global blind spot for investigative reporters.
“AI audio fakes can pose a significant threat,” emphasizes Olga Yurkova, journalism trainer and cofounder of StopFake.org, an independent Ukrainian fact-check organization. “They are easier and cheaper to create than deepfake videos, and there are fewer contextual clues to detect with the naked eye. Also, they have a greater potential to spread, for example, in WhatsApp chats.”

She adds: “Analysis is more complex, and voice generation tools are more advanced than video generation tools. Even with voice samples and spectral analysis skills, it takes time, and there is no guarantee that the result will be accurate. In addition, there are many opportunities to fake audio without resorting to deepfake technology.”
Data journalism trainer Samantha Sunne says newsrooms need constant vigilance in elections — both for the sudden threat of comparatively under-researched AI audio fakes, and because “deepfake technology is changing quickly and so are the detection and monitoring tools.”
Fact check organizations and some pro-democracy NGOs have mobilized to help citizens groups and newsrooms analyze suspicious viral election content. For instance, a human rights empowerment nonprofit called Witness conducted a pilot Deepfakes Rapid Response project in the past year, using a network of about 40 research and commercial experts to analyze dozens of suspicious clips. In an interview with GIJN, the manager of the Rapid Response project, Shirin Anlen, said AI audio fakes appear to be both the easiest to make and the hardest to detect — and that they seem tailor-made for election mischief.
“As a community, we found that we are not as prepared for audio as we were for video — that’s the gap we see right now,” says Anlen, who added that researchers were “surprised” by the high proportion of impactful AI audio fakes in 2023. Of the six high-impact cases involving elections or human rights that the Response Force chose to deeply investigate, four were audio fakes.
“Audio does seem to be used more in elections and areas of crisis — it’s easier to create and distribute, through various platforms or robocalls,” Anlen explains. “It’s also very personalized — you often really need to know the person, the way they talk, to detect manipulation. Then you have double-audio and background noise, music, or cross-talking — all these make detection more complex, unlike video, where you can see manipulation, maybe with a glitch in the face.”
But Anlen warns that “video detection is also lagging behind the generative techniques,” and that the release of the new text-to-video OpenAI tool Sora illustrates a trend toward almost seamless simulations. She adds that a lack of media literacy among older voters amplifies the threat of audio fakes and AI-driven robocalls even further — “because people not used to, say, X [Twitter] or TikTok may have less ability to filter out audio fakes.”
The Financial Times reported that voice-cloning tools have also targeted elections in countries such as India, the U.K., Nigeria, Sudan, and Ethiopia. The FT investigation alleged that AI audio fakes were suddenly popular among propagandists due to the new, easy availability of inexpensive and powerful AI tools “from start-ups such as ElevenLabs, Resemble AI, Respeecher and Replica Studios.” Note that several text-to-speech AI tools are designed for pranks, commercial ads, or even fun gifts, but experts warn they can be repurposed for political propaganda or even incitement. The report showed that basic tools can be used from as little as $1 per month, and advanced tools for $330 per month — a tiny fraction of political campaign budgets.
To date, the most convincing audio fakes have cloned the voices with the most recorded speech online, which, of course, often means well-known public figures, including politicians. One of the most eerily accurate examples targeted British actor and intellectual Stephen Fry: an AI program exploited Fry’s extensive online narration of seven Harry Potter novels to create a fake narration about Nazi resistance, complete with German and Dutch names and words — perfectly modulated to Fry’s accent and intonation — that the actor himself had never said. The AI program had uncannily predicted how Fry would say those foreign words. (See Fry’s explainer clip from the 12:30 to 15:30-minute mark in the video below to gain a sense of the alarming sophistication of advanced speech deepfakes.)
However, Hany Farid, a computer science professor and media forensics expert at the University of California, Berkeley, told Scientific American magazine that a single minute’s recording of someone’s voice can now be enough to fabricate a new, convincing audio deepfake using generative AI tools that cost just $5 a month. This poses a new impersonation threat to mid-level election-related officials — bureaucrats whose public utterances are normally limited to short announcements. Farid explained the two primary ways that audio fakes are made: either text-to-speech — where a scammer uploads real audio and then types what they’d like the voice to “say” — or speech-to-speech, where the scammer records a statement in their own voice and then has the tool convert it. He described the effort involved in creating a convincing fake of even a non-public figure as “trivial.”
A new hybrid fake model is provided by the digital avatar industry, where some AI startups offer a selection of digitally fabricated actors that can be made to “say” longer messages that sync to their lips better than fake messages superimposed on real people in video clips. According to The New York Times, researchers at social media analysis company Graphika traced avatar-driven news broadcasts to services offered by “an AI company based above a clothing shop in London’s Oxford Circus,” which offers scores of digital characters and languages to choose from.
While expert analysis and new detection tools are required for advanced speech deepfakes that even friends of the speaker can’t distinguish, journalists can often recognize obvious manipulation in an audio clip right away, based on their knowledge of a candidate, the low quality of the recording, context, or just plain common sense. But experts warn that gut-level suspicion is just a small part of the detection process. A speedy, evidence-based response, highlighting real audio in a story “truth sandwich,” and tracing the source of the scam are all equally important.
Here’s a step-by-step process for analyzing potential audio deepfakes.
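As a rough illustration of the kind of first-pass inspection those steps involve, the hedged Python sketch below (assuming the open source librosa and matplotlib libraries, and a hypothetical file named suspicious_clip.wav) plots a waveform and log-frequency spectrogram so a reporter can eyeball abrupt splices, dead-flat silences, or an unnaturally hard frequency ceiling. Treat the output as a prompt for questions to put to forensic experts, not as a detection verdict.

```python
# First-pass visual inspection of a suspicious audio clip (illustrative sketch only).
# Assumes: pip install librosa matplotlib numpy
# "suspicious_clip.wav" is a hypothetical file name, not a clip from this story.
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

AUDIO_PATH = "suspicious_clip.wav"

# Load at the file's native sampling rate so resampling doesn't mask artifacts.
y, sr = librosa.load(AUDIO_PATH, sr=None, mono=True)
print(f"Duration: {len(y) / sr:.1f} s, sample rate: {sr} Hz")

# Short-time Fourier transform -> log-magnitude spectrogram.
stft = librosa.stft(y)
spec_db = librosa.amplitude_to_db(np.abs(stft), ref=np.max)

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 6), sharex=True)

# Waveform: look for abrupt amplitude jumps or dead-flat silences at possible splice points.
librosa.display.waveshow(y, sr=sr, ax=ax1)
ax1.set_title("Waveform")

# Spectrogram: cloned or re-encoded speech sometimes shows a hard frequency ceiling
# or unusually uniform background noise compared with a genuine field recording.
img = librosa.display.specshow(spec_db, sr=sr, x_axis="time", y_axis="log", ax=ax2)
ax2.set_title("Log-frequency spectrogram")
fig.colorbar(img, ax=ax2, format="%+2.0f dB")

plt.tight_layout()
plt.savefig("suspicious_clip_inspection.png", dpi=150)
```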
The increase in deepfakes also poses a maddening threat to investigative stories themselves. Politicians or partisan officials revealed as making outrageous statements or abusing people’s rights in real video or audio clips obtained by journalists may well claim that this legitimate evidence is simply the result of an advanced AI deepfake: a convenient denial that could be difficult to rebut. This has already happened with politicians in countries such as India and Ethiopia, and this new onus on journalists to prove that a properly sourced, verified recording is indeed real is a deep concern for experts such as Sam Gregory, executive director of Witness. The problem is known as “the liar’s dividend,” and its ultimate solution involves media trust: newsrooms must relentlessly ensure that all their other stories and sources on elections are also solid. (Watch Gregory discuss the threat of deepfakes in his TED Talk below.)
The Slovakia case is especially concerning for watchdog reporters, for two reasons. First, because the fake two-minute audio clip, which focused on election rigging, also fabricated the voice of an investigative journalist, Monika Tódová — supposedly in conversation with the opposition leader. In an investigative story on the incident by The Dial, Tódová revealed that she initially dismissed the viral clip as not believable. “[But] I had friends writing me that their college-educated coworkers had listened to it and believed it,” she recalled. “And they were sharing it on social media. I found myself in the midst of a totally new reality.”
And, second: the timing of the Slovakian audio deepfake bore the hallmarks of foreign state operatives. The Dial investigation found that the clip was released just before the start of Slovakia’s legislated two-day “silence” period, in which all campaigning is barred ahead of election day. The tactic both maximized impact and gave journalists little recourse to rebut it, because the country’s media was legally limited in debunking the disinformation. (This case precisely vindicates a prediction that ProPublica’s Craig Silverman made to GIJN in 2022: that “elections are likely most vulnerable to deepfakes in the 48 hours prior to election days, as campaigns or journalists would have little time to vet or refute.”)
The fake Biden robocall that circulated right before the New Hampshire primary election is also noteworthy. NBC News ultimately tracked down the source of that deepfake audio, a magician who claimed he was paid by a consultant from a rival Democratic presidential campaign. According to the report, the man acknowledged that “creating the fake audio took less than 20 minutes and cost only $1.” He came forward about his role in the disinformation campaign after regretting his involvement. “It’s so scary that it’s this easy to do,” he told NBC News. “People aren’t ready for it.”
Ukraine’s StopFake.org recently debunked and traced a deepfake video purporting to show a top general denouncing President Volodymyr Zelensky. Using the Deepware Scanner tool and consistency analysis, the team found that the scammer had used a machine learning technique called GAN (generative adversarial network) to superimpose fake imagery and audio onto a real video of the Ukrainian general taken the year before. Other analysts found that the deepfake was first posted by a Telegram channel that claims to share “humorous content.”
StopFake’s Yurkova says the organization has used detection tools in combination with normal reverse image tools to investigate suspicious multimedia content, but warns that “unfortunately, it doesn’t always work.”
“We have little experience with pure audio fakes,” she explains. “We often distinguish such fakes by ordinary listening, but this works mostly for low-quality ones.”
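For video-based fakes, one piece of the combined workflow Yurkova describes can be approximated with ordinary open source tools: extracting still frames from a suspect clip and running them through reverse image search to check whether the footage predates the event it supposedly shows. The sketch below is one possible approach, assuming FFmpeg is installed and using a hypothetical file named suspicious_video.mp4.

```python
# Illustrative sketch: pull still frames from a suspicious video so they can be
# checked with ordinary reverse image search engines.
# Assumes ffmpeg is installed and on the PATH; "suspicious_video.mp4" is a
# hypothetical example file, not a clip referenced in this article.
import subprocess
from pathlib import Path

VIDEO_PATH = "suspicious_video.mp4"
OUT_DIR = Path("frames")
OUT_DIR.mkdir(exist_ok=True)

# Extract one frame per second; each PNG can then be uploaded to a reverse image
# search engine to check whether the footage was originally shot somewhere else.
subprocess.run(
    [
        "ffmpeg",
        "-i", VIDEO_PATH,
        "-vf", "fps=1",
        str(OUT_DIR / "frame_%04d.png"),
    ],
    check=True,
)
print(f"Frames written to {OUT_DIR.resolve()}")
```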
It’s important to note that deepfake detection is an emerging technology, and both open source and commercial tools are frequently inaccurate or case-limited — and journalists need to alert audiences to their limits. Indeed, Witness’s Anlen warns that “from our experience, we have yet to find [a tool] that didn’t fail our tests and that provided transparent and accessible results.” Nonetheless, they can be helpful as leads or as supporting evidence.
Here are more technical tips for dealing with suspicious audio.
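One low-tech starting point is to look at what the file itself declares about its own origins. The sketch below, which assumes the ffprobe utility that ships with FFmpeg and uses a hypothetical file named suspicious_clip.mp3, dumps the container metadata and audio stream parameters; because metadata can be stripped or forged, anything it surfaces is a lead to verify, never proof of authenticity or fakery.

```python
# Illustrative sketch: inspect container metadata of a suspicious audio file.
# Metadata can be stripped or forged, so treat anything found as a lead, not proof.
# Assumes ffprobe (part of FFmpeg) is installed; "suspicious_clip.mp3" is a
# hypothetical example file.
import json
import subprocess

AUDIO_PATH = "suspicious_clip.mp3"

result = subprocess.run(
    [
        "ffprobe",
        "-v", "quiet",
        "-print_format", "json",
        "-show_format",
        "-show_streams",
        AUDIO_PATH,
    ],
    capture_output=True,
    text=True,
    check=True,
)
info = json.loads(result.stdout)

fmt = info.get("format", {})
print("Container:", fmt.get("format_long_name"))
print("Duration (s):", fmt.get("duration"))
print("Declared tags:", fmt.get("tags", {}))  # e.g. encoder, creation_time

for stream in info.get("streams", []):
    if stream.get("codec_type") == "audio":
        # A codec, sample rate, or channel layout that doesn't match the claimed
        # source (e.g. a "phone recording" re-encoded by an unusual tool) is worth noting.
        print("Codec:", stream.get("codec_name"),
              "| Sample rate:", stream.get("sample_rate"),
              "| Channels:", stream.get("channels"))
```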
In time-pressed cases, newsrooms can apply to human rights tech NGOs to help analyze suspicious election content. For instance, using this form, under-resourced newsrooms can apply for intensive analysis on “high-impact” clips from the experts at the Deepfakes Rapid Response project. (Bear in mind that this quick-response project has limited capacity.)
“We mostly collaborate with fact checkers or local journalists who have limited access to detection tools,” explains Anlen, who added that researchers had already engaged with newsrooms on elections in Indonesia, Pakistan, and India. “Therefore, we are less likely to work with, for example, The New York Times or The Guardian for analyzing requests because they have great investigative resources. We have 15 teams — about 40 experts — with different expertise: video or image or audio-specific; local context. We try to pass on as much analysis information as possible, and journalists can do whatever they wish with that data.”
The mantra for dealing with deepfakes among researchers at Witness is “Prepare, don’t panic.”
In his seminal blog post on the challenge, Sam Gregory wrote: “We need a commitment by funders, journalism educators, and the social media platforms to deepen the media forensics capacity and expertise in using detection tools of journalists and others globally who are at the forefront of protecting truth and challenging lies.”
Rowan Philp is senior reporter at the Global Investigative Journalism Network, where this story was originally published. It’s republished here under a Creative Commons Attribution-NonCommercial 4.0 International license.