Audio deepfakes are becoming more common. The technology uses artificial intelligence to analyze audio data, discern the patterns and characteristics of a target voice, and generate a clone of that voice that can be made to say anything its operators like. While it was developed for benevolent purposes – one startup, for example, uses audio deepfake technology to clone the voices of ALS patients so that they can continue to “speak” in a copy of their voice as they lose their own – the technology has increasingly been adopted for nefarious ends. Many experts assess that while scary audio deepfake demonstrations are a staple of security conferences, successful scams and disinformation campaigns using the technology are thus far relatively rare. Nonetheless, the capability very much exists – and it is increasingly sophisticated and accessible to criminals, trolls, and other bad actors.
How Audio Deepfakes Work
The most technologically advanced AI speech generators can synthesize a voice – and read out any input text in it – from a voice sample as short as three seconds. More widely available voice-generating programs, such as those from companies like ElevenLabs, Respeecher, Resemble AI, or Replica Studios, require samples of 10 minutes or more. These AI-generated “voices” vary in their believability: Some sound obviously robotic, like an early Siri or other voice assistant, while others sound fluid and human. The voices can deliver pre-programmed speeches or responses, or read out typed text in real time to simulate a conversation. These tools are readily available to scammers.
While the method of attack is cutting-edge, its raw material is classic: data breaches that expose the personal information of millions of people every year. In many instances, would-be scammers (most often ones who have purchased bundled leaked data on online black markets) simply perform internet searches to match leaked data to real people, looking for targets with publicly available voice samples. A TikTok video is a common example, but even those without accounts may have posted Facebook videos or Instagram stories that include their speaking voice. This methodology is relatively simple, relies on leaked information that is available in bulk, and can be run at scale for a higher likelihood of success. Scams targeting high-profile individuals are thought to be less common but can be more lucrative if successful and technically easier, as public figures tend to have ample voice samples available online.
Audio deepfakes can be easier to generate and harder to spot than video deepfakes. Listeners have fewer contextual clues when hearing a recording than when watching a video, where odd facial expressions, glitches, or incorrect anatomical or background details could tip a viewer off. Inconsistencies in a cloned voice can easily be concealed with background music or feigned audio issues – in one faked audio clip of a UK politician berating a staff member that circulated last year, for example, the audio was designed to sound as if it was recorded in a crowded restaurant to obfuscate imperfections.
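As a toy illustration of the kind of signal-level inspection that detection tools automate, the Python sketch below compares the share of spectral energy above a cutoff frequency in two synthetic signals: a pure tone standing in for clean speech, and the same tone with added wideband noise standing in for synthesis artifacts. The function, cutoff, and signals are illustrative assumptions only; real detectors rely on learned features across many dimensions, not a single hand-picked ratio.

```python
import numpy as np

def high_band_energy_ratio(signal, sample_rate, cutoff_hz=6000):
    """Fraction of total spectral energy above cutoff_hz.

    Toy heuristic (an assumption for illustration): wideband
    synthesis artifacts push energy into frequency bands where
    natural voiced speech carries little.
    """
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    total = spectrum.sum()
    if total == 0:
        return 0.0
    return float(spectrum[freqs >= cutoff_hz].sum() / total)

sample_rate = 16000
t = np.linspace(0, 1.0, sample_rate, endpoint=False)
voiced = np.sin(2 * np.pi * 220 * t)  # stand-in for clean speech
rng = np.random.default_rng(0)
# stand-in for a clone with wideband artifacts
cloned = voiced + 0.2 * rng.standard_normal(sample_rate)

print(high_band_energy_ratio(voiced, sample_rate))  # near zero
print(high_band_energy_ratio(cloned, sample_rate))  # noticeably higher
```

This also illustrates why the restaurant trick in the example above works: deliberately mixing in crowd noise raises the energy floor across the whole spectrum, drowning out exactly the anomalies a simple spectral check would look for.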
Types of Deepfake Financial Scams
Some of the most alarming examples of deepfake audio scams are ploys in which scammers call individuals, pose as loved ones in distress, and demand money. In one variant, scammers call their target claiming to have kidnapped a loved one – playing the loved one’s cloned voice pleading for help – and demand a ransom payment to free them. Similarly, some scammers use a loved one’s cloned voice to call a target and ask for help paying off someone whose car they hit or posting bail. In all variants of this scam, the heightened emotion and urgency created by hearing a loved one in danger or in a stressful situation can bypass the target’s skepticism, making them more likely to hand over money or personal information.
A similar scam is more common in the workplace: Like an elevated version of commonplace smishing or phishing scams, an employee receives a phone call from someone who sounds like their boss, who urges them to immediately transfer money to, or purchase gift cards for, a false vendor, a client, or even the boss himself. The scam already has a history of success: In 2019, criminals used voice-generating software to impersonate the chief executive of a UK-based energy firm’s parent company in a call to the firm’s CEO, successfully securing a €220,000 ($243,000 at the time) transfer to a false vendor. In 2020, a branch manager of an unnamed Japanese business in Hong Kong authorized $35 million in transfers at the behest of what he thought was a phone call from a director of the parent business. The firm leading the investigation has never disclosed whether any funds were recovered.
Another use of audio deepfakes in scams pairs leaked personal information with voice clones in attempts to fraudulently access financial accounts and other assets. Scammers target customer service representatives at financial institutions and other corporations, impersonating clients to secure large cash transfers, authorize fraudulent credit card charges, or achieve other ends.
Despite the success of money-seeking scams, the use of audio deepfakes to spread disinformation is, for AI and deepfake researchers, among the most concerning applications of the technology. Audio deepfakes have already been used to spread disinformation in the 2024 U.S. elections: Ahead of the New Hampshire primaries, some voters received a robocall featuring the cloned voice of President Joe Biden urging them not to vote, saying “your vote makes a difference in November, not this Tuesday.” New Hampshire officials described the robocall as potential voter suppression and are investigating. In September, an organization that rates the trustworthiness of news sites uncovered a network of TikTok accounts posing as real news outlets, featuring AI-generated voice-overs of real news anchors and other public figures peddling disinformation (one, posing as former President Barack Obama, lent credence to a conspiracy theory that he was involved in the accidental death of his personal chef).
Deepfake audio-enabled political disinformation is a global problem. In both Slovakia and Nigeria, fake audio of opposition politicians planning to rig the vote went viral in the days before voting. Audio and video deepfakes are rife in current election cycles in India, Pakistan, and Bangladesh, used both to bolster politicians (common examples included videos of Indian Prime Minister Narendra Modi singing in regional languages or speaking Arabic) and to denigrate them. At least 500,000 video and voice deepfakes were shared on social media sites globally in 2023, according to estimates by DeepMedia, one of several companies working on tools to detect synthetic media.
Beyond the damage done by discrete instances of audio deepfake disinformation, experts are concerned that their spread will further erode credibility in a media environment where trust and media literacy are already low. Trust in mainstream news media is at a historic low in the U.S., and the increasing prevalence of audio and video deepfakes is deepening the climate of confusion that characterizes the online world.
Strategies for Countering Audio Deepfakes
The implementation of new technologies is often an arms race between scammers seeking to exploit the technology for financial gain and companies and governments trying to understand it, regulate it, and guard against exploitation. While there are no federal laws against deepfakes, federal agencies have attempted to address their use: The Federal Trade Commission (FTC) last year stated that the production of voice clones could be illegal, as making, selling, or using a tool effectively designed to deceive violates its prohibition on deceptive or unfair conduct. California, Texas, Virginia, and New York have laws banning the creation or distribution of sexually explicit deepfakes, and until its 2023 expiration, a California law banned political deepfakes within 60 days of an election.
On the corporate side, financial institutions and large companies are already exploring deepfake detection technology. Companies like Pindrop, Nuance, and DeepMedia are developing tools they hope will be able to identify voice clones and other synthetic media in real time, and in the meantime, some are developing training programs for customer service and call center employees. Large corporations have long invested in anti-fraud training for employees, teaching them to check sender emails and be wary of urgent, unusual requests; now experts recommend that this training explicitly cover audio deepfake scams, as employees can easily be caught unawares by the cutting-edge technology.
For individuals, the best practices for avoiding audio deepfake scams are similar to general anti-fraud advice. People should be wary of urgent, unusual requests for money from any source, and carefully check credentials and email addresses to ensure that the source is legitimate. Unfortunately, spoofing a phone number (making a call appear to come from a number other than the one actually placing it) is simple, meaning that scammers can make a call appear to come from a loved one’s or a bank’s recognized number. When in doubt, hang up immediately and call the known number directly; scammers cannot intercept a call placed directly to the real number. Experts also recommend that individuals remain mindful of their privacy online, limiting public access to personal information and avoiding posting long samples of their voice or the voices of their loved ones.