Researchers Mount Vishing Attacks With Real-Time Voice Cloning


Cybersecurity researchers revealed Tuesday how artificial intelligence can be used to clone a person’s voice in real time to mount voice phishing attacks on unsuspecting organizations.

Researchers from the NCC Group noted in a company blog that they launched attacks using real-time voice cloning against real organizations and successfully recovered sensitive and confidential information.

“Not only that, but we have also shown how these techniques can convince people in positions of key operational responsibility to carry out actions on behalf of the attacker,” wrote the researchers, Pablo Alobera, Pablo López, and Víctor Lasa.

“In security assessments that simulated real-world attack conditions, we have been able to carry out actions such as email address changes, password resets, and so on,” they added.

When starting their project, the researchers identified several challenges to mounting voice phishing (vishing) attacks with cloned voices. One was the technology. The vast majority of state-of-the-art deepfake technologies and architectures, they discovered, were focused on offline inference.

“These were capable of achieving good results when cloning a pre-recorded extract of someone talking, but they couldn’t change a user’s voice in real time — a key requirement for vishing attacks,” they wrote.

“In addition to this, many of them were strongly dependent on TTS (text-to-speech) models,” they added.

How Real-Time Cloning Works

They pointed out that with that technology, attackers were left with two alternatives. They could use a set of pre-recorded sentences, which would present obvious problems when trying to have a real-time conversation, or they could input sentences into the TTS model on the fly, which would introduce unnatural delays in the cloned responses.

The researchers were able to route the signal from an attacker’s microphone to a real-time voice modulator that utilized a machine learning (ML) model of a person’s voice. As the attacker spoke, what the target heard was the cloned voice from the ML model.
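The NCC post does not include code, but the pipeline it describes — microphone in, streaming voice-conversion model, audio out — can be sketched as a duplex audio stream. The Python sketch below uses the sounddevice library; convert_chunk is a stand-in for whatever streaming voice-conversion model an attacker would plug in and is not taken from the researchers’ work.

```python
# Minimal sketch of the mic -> voice-conversion model -> output pipeline the
# researchers describe. `convert_chunk` is a placeholder; a real setup would
# call a streaming speech-to-speech (voice-conversion) model here.
import sounddevice as sd

SAMPLE_RATE = 16000   # Hz; a typical rate for speech models
BLOCK = 320           # 20 ms chunks keep end-to-end latency conversational

def convert_chunk(chunk):
    """Placeholder for the ML voice-conversion step.

    Here the audio is passed through unchanged; in the attack described,
    each chunk would be re-voiced with the cloned-voice model before it
    reaches the audio output.
    """
    return chunk

def callback(indata, outdata, frames, time, status):
    if status:
        print(status)
    outdata[:] = convert_chunk(indata)   # attacker speaks in, cloned voice out

# Duplex stream: reads the attacker's microphone and writes the converted
# audio to the default output device in real time.
with sd.Stream(samplerate=SAMPLE_RATE, blocksize=BLOCK,
               channels=1, dtype="float32", callback=callback):
    sd.sleep(10_000)   # run for 10 seconds
```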

The researchers also noted that the audio signal could be routed directly to applications like Microsoft Teams or Google Meet, allowing the use of AI-impersonated voices in popular messaging applications during a voice call in real time.
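The article does not spell out how that routing is done. One common approach — an assumption here, not a detail from the NCC post — is a virtual audio cable such as VB-Audio CABLE or BlackHole: the converted audio is written to the cable’s playback side, and the meeting client is pointed at the cable’s recording side as its microphone. A rough sketch, again using sounddevice, with an illustrative device name:

```python
# Sketch of routing converted audio into a meeting client. Assumes a virtual
# audio cable (e.g., VB-Audio CABLE on Windows, BlackHole on macOS) is
# installed; the device name below is illustrative, not from the article.
import sounddevice as sd

VIRTUAL_CABLE_NAME = "CABLE Input"   # check `sd.query_devices()` for yours

def callback(indata, outdata, frames, time, status):
    # In the full pipeline this is where the cloned-voice conversion runs;
    # here the audio is passed through unchanged.
    outdata[:] = indata

# Find the virtual cable's playback device by name.
cable_index = next(
    i for i, dev in enumerate(sd.query_devices())
    if VIRTUAL_CABLE_NAME.lower() in dev["name"].lower()
)

# Read the attacker's default microphone, write the (converted) audio to the
# virtual cable. In Teams or Meet, the cable's recording side is selected as
# the microphone, so the call hears the cloned voice instead of the real one.
with sd.Stream(samplerate=16000, blocksize=320, channels=1, dtype="float32",
               device=(sd.default.device[0], cable_index), callback=callback):
    sd.sleep(10_000)
```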

“In this proof of concept,” they added, “the impersonated individual’s phone number was spoofed with their consent. Like traditional vishing, spoofing is used to exploit the trust the victim will have when they are presented with a familiar caller ID on their device.”

Although the technology was not mature enough to support real-time cloned-voice vishing until recently, that is no longer the case. “[T]he reality today is that the tools and infrastructure needed for real-time voice cloning are accessible to those with even limited technical and financial means,” the researchers wrote.

“It is worth pointing out that this was all possible using hardware, audio sources, and audio processing software that were all ‘good enough’, rather than being exceptional,” they noted. “That is, the financial outlay to achieve our results would certainly be within the reach of many individuals and certainly of a small organization.”

Proliferating Technology

Real-time cloning is highly significant for vishers, noted Matthew Harris, senior product manager for fraud protection at Crane Authentication, a global brand protection and product security company. “Victims rely on the caller’s number, voice, and message content — all of which can be spoofed or cloned,” he told TechNewsWorld. “Real-time voice cloning makes the scam more believable and increases the chances of success.”

“Scammers initially struggled with realism — voices and visuals often had telltale flaws,” he said, “but open-source LLMs have helped overcome these issues. AI models now learn from past mistakes, making synthetic voices and faces increasingly indistinguishable from real ones.”

Real-time voice cloning is the difference between reading a script and improvising with a human, explained Brandon Kovacs, a senior security consultant at Bishop Fox, a cybersecurity firm specializing in offensive security testing and consulting, in Phoenix.

“Real-time voice conversion lets attackers handle questions, adjust requests, and escalate authority,” he told TechNewsWorld. “This becomes even more potent when paired with simultaneous deepfake video on Zoom or Teams.”

“Real-time voice cloning is a force multiplier,” added T. Frank Downs, senior director of proactive services at BlueVoyant, an enterprise cybersecurity company, in New York City. “It allows attackers to adapt tone, cadence, and context on the fly, which sustains the illusion of authenticity throughout the call.”

“This dynamic response capability dramatically increases the success rate and makes detection during the interaction nearly impossible,” he told TechNewsWorld.

Doing good voice cloning across a range of targets can be challenging, observed Roger Grimes, CISO advisor at KnowBe4, a security awareness training provider, in Clearwater, Fla. “Some voice cloning samples are easy to do while others take finessing,” he told TechNewsWorld. “It can be a skill to learn like anything else. But every month, the voice-cloning tools are making it easier to do without having great skills.”

“That’s because today’s generative AI is better than any previous tool at probabilistic pattern-matching,” he said. “The ability to do probabilistic pattern matching lets AI get far better results than previous voice-cloning tools. There are still some voices and languages that AI voice cloning can’t do well, and that’s usually due to a lack of inputted data around that style of voice or language. For example, voice-cloning tools don’t do well with African native speaking voices.”

He maintained that voice cloning will proliferate among social engineers. “We expect real-time voice cloning to make most voice-based social engineering AI-enabled by the end of 2026,” he predicted. “By the end of 2026, most voice-based social engineering will not involve a true human voice. Hacking via social engineering is getting ready to change forever.”

Deepfake Vishing Threats on the Rise

Even without real-time AI, it’s relatively easy to create voice messages that impersonate CEOs. “It only takes a recording of them along with a tool like ElevenLabs to create a fairly compelling fake,” explained Alex Quilici, CEO of YouMail, a call management services company, in Irvine, Calif.

“We have not seen a lot of these yet, though we are hearing more stories,” he told TechNewsWorld. “We fully expect them to be the next major attack vector.”

Quilici added that at the anecdotal level, deepfakes of CEOs are becoming more common. “However, to this point, they are certainly not widespread,” he said.

“What is currently widespread are text messages impersonating executives, and we have seen those at our enterprise clients,” he continued. “The scam is basically blasting texts to a number of different employees, impersonating the CEO, and trying to get them to take a particular action. This is because texts are the easiest to create and send and seem to work.”

Marc Maiffret, CTO of BeyondTrust, a maker of privileged account management and vulnerability management solutions, in Carlsbad, Calif., warned that AI-driven impersonation attacks are intensifying, targeting both individuals and organizations. “Combating them requires human vigilance and strong identity security,” he told TechNewsWorld. “Organizations should continue enforcing the principle of least privilege, identity infrastructure monitoring, and securing access to sensitive accounts.”

“Limit what attackers can do — even with stolen credentials,” he said. “AI-based social engineering attacks highlight why identity is one of the most important domains for businesses to secure. Deepfakes are, and will continue to be, a great example of the need to treat identity as the new perimeter.”

Meanwhile, the NCC researchers are moving on to deepfake videos. “Preliminary investigations so far have mainly succeeded only in identifying new limitations and problems in this field,” they wrote. “For example, there are difficulties in synchronizing the modified audio and video signals.”

“However,” they noted, “given the unprecedented speed at which this technology is moving forward, it is expected that a deepfake solution which realistically simulates both audio and video is feasible; it is just a matter of time.”
