AI in Audio for Video: Voice Cloning, Dubbing, Restoration, Generative Music

Why this matters

If your product carries sound — a conferencing app, a streaming service, a telemedicine platform, an e-learning tool, a creator app — a customer or a competitor will soon ask why it does not use AI for audio. Some of the answers are easy wins: a neural noise suppressor that scrubs a dog barking out of a sales call is a feature you can ship today, and it works. Others are traps: clone a presenter's voice without written permission and you have created legal exposure, not a feature. The hard part is telling the two apart, because the marketing language treats "neural codec," "AI dubbing," and "generative soundtrack" as if they were the same kind of decision, when one is a quiet infrastructure upgrade and another is a product that can get you sued. This article is for the product manager, founder, or engineer who has to decide what AI-audio capability to build, buy, or avoid. It separates the four families, gives you the real 2026 numbers, and is blunt about the legal and ethical edges, because in audio the edges are where the money and the lawsuits both live.

Four doors, one technique

Almost every "AI in audio" headline you will read is one of four jobs, and it helps to name them before the marketing blurs them together. The first job is compression: shrinking sound to fewer bits, the job a codec has always done, now done by a neural network. The second is cleanup: removing noise, echo, and gaps from a real recording — repairing sound that exists. The third is synthesis of a known voice: cloning a specific person's voice and using it to dub or re-voice content. The fourth is synthesis from nothing: generating music or sound effects that never existed.

Underneath all four sits the same engine. A neural network — a mathematical model with millions or billions of adjustable numbers, trained by showing it enormous quantities of example audio until it learns the statistical shape of sound — has, since roughly 2020, become good enough at audio to beat the hand-crafted algorithms that ruled for forty years. The difference between a neural codec and a neural dubbing tool is not the engine; it is what you train it to do and what you feed it at run time. That single fact is why this field moves so fast: a research advance in one door tends to open the next.

The four jobs also split cleanly on a question you should ask of any AI-audio feature before you build it: does it touch sound that exists, or does it create sound that does not? Compression and cleanup work on real, recorded audio — they are low-risk, infrastructure-grade, and rarely raise a legal eyebrow. Voice cloning and generative music create or impersonate — and that is where consent, likeness rights, copyright, and disclosure law all live. Keep that split in your head and most of the decisions in this article make themselves.

Figure 1. The four families of AI audio, split by whether they work on existing sound or create new sound. The right-hand half is where the legal risk lives.

Door 1 — neural codecs: AI that compresses sound

A codec is the software that squeezes sound into fewer bits to send or store, then unpacks it again. For forty years codecs were hand-designed by engineers who modelled how human hearing works — the masking tricks behind MP3, AAC, and Opus. A neural codec throws that out and lets a neural network learn the compression from data: feed it millions of audio clips and it discovers, on its own, how to represent sound compactly and rebuild it convincingly. We cover the architecture in depth in the next article, neural audio codecs: Lyra, EnCodec, SoundStream and what comes next; here is the short version of why it matters for video.

Three systems set the pace. SoundStream, from Google in 2021, was the first end-to-end neural codec to clearly beat the old guard: at 3 kbps it outperformed Opus running at 12 kbps, and approached the quality of the EVS speech codec at 9.6 kbps — roughly three to four times fewer bits for the same quality. Lyra v2, Google's production codec built on SoundStream, runs at 3.2, 6, or 9.2 kbps; at 9.2 kbps it sounds about as good as Opus at 14 kbps, and it can switch bitrate instantly because of a design trick called residual vector quantization. EnCodec, from Meta in 2022, pushed quality higher still — handling everything from 24 kHz mono speech at 1.5 kbps up to 48 kHz stereo music at 24 kbps, and scoring higher in listening tests than SoundStream at the same bitrate.
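Residual vector quantization is easier to see in miniature. The sketch below is a toy, not SoundStream's or Lyra's code: it quantizes single numbers instead of learned vectors, and the hand-made codebooks, the four levels per stage, and the 50 frames per second are all assumptions chosen for illustration. What it does show is the mechanism: each stage encodes the error left by the previous one, so a decoder can stop after fewer stages and get a coarser signal at a lower bitrate.

```python
import math


def make_codebooks(num_stages, levels=4):
    """Build one small codebook per stage (toy, hand-made, not learned).

    Stage 0 spreads `levels` values evenly over [-1, 1]; each later stage
    covers only the worst-case error the previous stage can leave behind.
    """
    codebooks = []
    half_width = 1.0
    for _ in range(num_stages):
        step = 2 * half_width / levels
        codebooks.append([-half_width + step * (i + 0.5) for i in range(levels)])
        half_width = step / 2
    return codebooks


def encode(x, codebooks):
    """Quantize x stage by stage; each stage encodes the previous residual."""
    indices = []
    residual = x
    for book in codebooks:
        best = min(range(len(book)), key=lambda i: abs(residual - book[i]))
        indices.append(best)
        residual -= book[best]
    return indices


def decode(indices, codebooks, stages=None):
    """Rebuild x from the first `stages` indices (all of them by default)."""
    used = len(indices) if stages is None else stages
    return sum(codebooks[k][indices[k]] for k in range(used))


def bitrate_bps(stages, levels, frames_per_second):
    """Bits per second if every frame sends one index per stage."""
    return frames_per_second * stages * math.log2(levels)


books = make_codebooks(num_stages=3)
codes = encode(0.6, books)
for n in (1, 2, 3):
    error = abs(0.6 - decode(codes, books, stages=n))
    print(n, "stage(s):", bitrate_bps(n, 4, 50), "bit/s, error", round(error, 4))
```

Dropping the last stage lowers the bitrate and raises the error without re-encoding anything, which is the property that lets an RVQ-based codec change bitrate on the fly.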

Walk through the SoundStream number, because it is the headline that matters. If a conferencing call currently uses Opus at 12 kbps per speaker, and a neural codec delivers the same perceived quality at 3 kbps, then a 50-person all-hands that was spending 600 kbps on audio could spend 150 kbps:

50 speakers × 12 kbps = 600 kbps (Opus)
50 speakers × 3 kbps = 150 kbps (neural codec at equal quality)

That is a 4× reduction on the audio budget, which matters most exactly where bandwidth is scarcest — a patient on rural mobile data in a telemedicine call, a student on a congested school network. The catch, and it is a real one, is cost on the other axis: neural codecs are far heavier on the processor than Opus, which decodes in microseconds. Lyra was engineered specifically to run on a phone through aggressive model-shrinking, and it ships in Google Meet, but most neural codecs in 2026 are still too expensive in CPU or battery to deploy at scale on every client. The bandwidth is almost free; the computation is not.
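The same arithmetic as a few lines of Python, for anyone who wants to plug in their own meeting size or bitrates. The function name is made up for this sketch, and it models the article's simple case in which every speaker's stream is carried at once.

```python
def audio_budget_kbps(speakers, kbps_per_speaker):
    """Total audio bitrate if every speaker's stream is carried at once."""
    return speakers * kbps_per_speaker


opus_total = audio_budget_kbps(50, 12)      # Opus at 12 kbps per speaker
neural_total = audio_budget_kbps(50, 3)     # neural codec at 3 kbps per speaker
saving_factor = opus_total / neural_total   # how many times smaller the budget is
print(opus_total, neural_total, saving_factor)
```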

[Chart: horizontal bars showing the bitrate each codec needs for roughly the same perceived quality; x-axis is bitrate in kbps, lower is better. Opus (baseline): 12 kbps. SoundStream: 3 kbps. EnCodec, which also handles music: 6 kbps. Lyra v2: 9.2 kbps, annotated as sounding about as good as Opus at 14 kbps.]

Figure 3. Neural codecs reach Opus-level quality at a fraction of the bitrate: the saving is in bits sent, the price is in processor cycles spent.

Door 2 — neural cleanup: AI that repairs sound

The most-shipped, least-controversial AI audio in 2026 is cleanup: using a neural network to remove the things that make real recordings sound bad. This is the AI you can deploy today without a lawyer in the room, because it repairs sound that genuinely exists rather than inventing or impersonating anything. Three jobs dominate.

Noise suppression removes background sound — the dog, the keyboard, the café — while keeping the voice. The neural era began with RNNoise in 2017 and now runs through Krisp, NVIDIA's RTX Voice, and the built-in suppressors in every major conferencing platform. We cover this in full in noise suppression: classical NS, RNNoise, Krisp, NVIDIA RTX Voice. Echo cancellation removes the far-end voice that leaks back through a speaker; the modern WebRTC AEC3 has neural extensions, covered in acoustic echo cancellation: how it really works. Both are mature, both ship, both work.

The most striking of the three is packet-loss concealment, or PLC — the job of inventing the audio that should have arrived when a network packet was dropped. Classical PLC repeated the last bit of waveform and faded it out, which sounds like a robotic stutter for anything longer than a syllable. Google's WaveNetEQ replaced that with a neural network that has learned what speech sounds like, so when a packet goes missing it generates a plausible continuation of the actual voice — the right vowel, the right pitch, in the speaker's own timbre — rather than a repeat. It is conditioned on the recent history of the signal, runs in real time inside the NetEQ jitter buffer (covered in jitter buffer: NetEQ, the brain of WebRTC audio), is compressed enough to run on a phone, and ships in Google's calling products. It ranked second in the 2022 Interspeech PLC Challenge.
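To make the "robotic stutter" concrete, here is a minimal sketch of the classical repeat-and-fade idea. It is a simplification written for this article, not NetEQ's actual algorithm: frames are plain lists of samples, a lost packet is represented as `None`, and the fade factor of 0.5 is an arbitrary assumption.

```python
def conceal_repeat_and_fade(frames, frame_len, fade=0.5):
    """Fill lost frames (None) by repeating the last good frame, quieter each time.

    frames: list of frames, each a list of float samples, or None if lost.
    """
    output = []
    last_good = [0.0] * frame_len  # silence until the first real frame arrives
    gain = 1.0
    for frame in frames:
        if frame is not None:
            last_good = list(frame)
            gain = 1.0  # a good frame resets the fade
            output.append(list(frame))
        else:
            gain *= fade  # every consecutive loss gets quieter
            output.append([sample * gain for sample in last_good])
    return output


received = [[1.0, -1.0], None, None, [0.5, 0.5]]
print(conceal_repeat_and_fade(received, frame_len=2))
```

Every concealed frame is the same waveform at lower volume, which is why long gaps sound mechanical. A neural concealer replaces the repeat step with a model that predicts what the voice would have done next.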

Here is the conceptual line worth marking, because it is the same line the law cares about. WaveNetEQ generates audio — it synthesises a voice that was never recorded for those few milliseconds. We call it "concealment" and not "fabrication" because it fills a gap in a real, consented conversation with the speaker's own voice, to preserve what they meant to say. The exact same generative capability, pointed at a different goal, becomes voice cloning. The technology does not know the difference; your product design and your consent flow do.

Pitfall — neural cleanup can erase real information. An aggressive noise suppressor does not know that the "noise" it is removing might be a wheeze in a telemedicine consult, a faint alarm in a security feed, or a musical detail in a performance. Neural models trained to maximise speech clarity will confidently delete anything that is not speech. In clinical, security, or music contexts, test what your suppressor removes, not just how clean the voice sounds — and give users an off switch. The same model that rescues a sales call can destroy a diagnostic recording.
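One way to "test what your suppressor removes" is to measure it on recordings where a person has marked the sounds that must survive. The sketch below is an assumption-heavy illustration, not a standard metric: it expects the original and processed signals to be time-aligned lists of samples of equal length, the event labels and the 0.5 threshold are invented, and a real test would also need listening checks.

```python
def energy(samples):
    """Sum of squared samples."""
    return sum(s * s for s in samples)


def removed_fraction(original, processed, start, end):
    """Energy of (original - processed) in [start, end), relative to the original.

    0.0 means the suppressor left the span untouched; 1.0 means it removed
    everything. The value is capped at 1.0.
    """
    before = energy(original[start:end])
    if before == 0.0:
        return 0.0
    difference = [o - p for o, p in zip(original[start:end], processed[start:end])]
    return min(1.0, energy(difference) / before)


def flag_erased_events(original, processed, events, max_removed=0.5):
    """Return the labelled events that lost more than `max_removed` of their energy."""
    flagged = []
    for label, start, end in events:
        fraction = removed_fraction(original, processed, start, end)
        if fraction > max_removed:
            flagged.append((label, fraction))
    return flagged


# Toy signals: the suppressor keeps the first half and wipes out the second.
original = [1.0] * 4 + [0.5] * 4
processed = [1.0] * 4 + [0.0] * 4
events = [("speech", 0, 4), ("wheeze", 4, 8)]  # hypothetical human annotations
print(flag_erased_events(original, processed, events))
```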

Door 3 — voice cloning and AI dubbing: synthesising a known voice

Here the risk profile changes completely. Voice cloning is training a model on samples of one specific person's voice so it can then say anything in that voice. AI dubbing is the headline application: take a video in one language, transcribe it, translate it, and generate a new soundtrack in the target language — in the original speaker's own voice, with their tone and pacing preserved.

The 2026 reference product is ElevenLabs Dubbing. Its version 2 pipeline shows how far this has come: it uses speaker diarization — automatically figuring out who is speaking when — to separate each voice in a clip, builds a voice clone of each speaker, and then conditions the dubbed output on the source speaker's vocal performance, capturing tone, pace, and emotional register, so the German dub of an English actor sounds like that actor being angry, not a generic German voice reading a script. It covers well over 90 languages. The economics have collapsed to the point of disruption: dubbing is billed per source minute, and a creator-tier plan dubs roughly an hour of source audio for a low double-digit dollar figure — a job that a traditional dubbing studio would price in the thousands and deliver in weeks. Voice cloning itself needs surprisingly little: instant clones work from a short clip, and a professional-grade clone from competitors like Resemble AI is recommended at 10 to 25 minutes of clean source audio.

For a video product, the legitimate use cases are real and large. An e-learning company can offer every course in 30 languages without re-recording. A telemedicine platform can let a doctor's after-visit summary play back in the patient's language. An OTT service can localize a back catalogue that was never commercially viable to dub by hand. These are genuine, valuable features.

And they are exactly where the law has moved fastest, because the same tool clones a consenting narrator and a non-consenting celebrity with equal ease.

The consent and likeness problem

A person's voice is, in most legal frameworks, both personal data and a protected aspect of their identity. Three distinct regimes now bear on cloning it, and they do not fully agree, so a product shipping globally has to satisfy the strictest one it touches.

In the United States, there is no single federal voice-cloning statute yet — the proposed NO FAKES Act, which would create a federal right against unauthorized AI replicas of a person's voice and likeness, has been introduced in Congress but not passed as of mid-2026. In its absence, a patchwork applies: state right-of-publicity laws (notably Tennessee's 2024 ELVIS Act, which explicitly covers voice), and existing likeness and unfair-competition law. In the European Union, two things apply at once: a voice is personal data under the GDPR, so cloning one needs a lawful basis (consent is the cleanest), and the EU AI Act adds a transparency layer on top.

That AI Act layer has a hard date. Regulation (EU) 2024/1689 entered into force on 1 August 2024, but the transparency obligations in its Article 50 apply from 2 August 2026, and two of them matter here. First, providers of AI systems that generate synthetic audio must mark their output, in a machine-readable format, as artificially generated (Article 50(2)). Second, anyone deploying an AI system that generates or manipulates audio constituting a "deep fake" must disclose that the content is artificially generated or manipulated (Article 50(4)), with a narrower disclosure carve-out for evidently artistic or satirical work. In plain terms: from August 2026, if your product clones a voice or generates a dub for an EU audience, you have to label it as AI-made, and the underlying generator has to mark it in a machine-readable way. This is not optional and the penalties under the Act are large.

The practical rule for a product team is simpler than the law: get explicit, written, scoped consent from the owner of any real voice you clone, and disclose AI-generated audio to the people who hear it. "Scoped" matters — consent to clone a voice for a course is not consent to use it in an advertisement. Build the consent capture and the disclosure label into the feature from day one; retrofitting them after launch is how products end up in the news for the wrong reason.

Figure 2. Before shipping AI-generated voice, two gates decide everything: is it a real person's voice, and do you have scoped consent. The EU AI Act adds a labelling requirement from August 2026.
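The two gates can be expressed as a small check that runs before any cloned-voice job is queued. This is a sketch of the product logic only, not legal advice and not any vendor's API: the `VoiceConsent` record, its field names, and the scope strings are all hypothetical, and treating a fully synthetic voice as needing no likeness consent is a simplification of this example.

```python
from dataclasses import dataclass
from datetime import date
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class VoiceConsent:
    speaker_id: str
    written: bool                 # a signed, stored consent exists
    scopes: FrozenSet[str]        # purposes covered, e.g. {"course_dubbing"}
    expires: Optional[date] = None


def may_clone_voice(is_real_person, consent, purpose, today):
    """Gate 1: is it a real person's voice? Gate 2: is there scoped consent?"""
    if not is_real_person:
        return True  # simplification: a fully synthetic voice skips the consent gate
    if consent is None or not consent.written:
        return False
    if consent.expires is not None and today > consent.expires:
        return False
    return purpose in consent.scopes  # consent for a course is not consent for an ad


consent = VoiceConsent("lecturer-1", written=True, scopes=frozenset({"course_dubbing"}))
print(may_clone_voice(True, consent, "course_dubbing", date(2026, 6, 1)))
print(may_clone_voice(True, consent, "advertising", date(2026, 6, 1)))
```

Passing this check only answers the consent question. The generated audio still needs its AI disclosure label wherever listeners will hear it.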

Door 4 — generative music and sound: synthesising from nothing

The fourth door creates audio that never existed: background music for a video, a sound effect, a full song from a text prompt. Suno and Udio are the consumer-facing names — type "upbeat corporate background, 90 seconds, no vocals" and get a finished track. For a video product, the appeal is obvious: royalty-free, on-demand soundtrack and effects with no licensing negotiation and no music-supervisor budget.

The technology works. The problem is the training data, and it has produced the defining legal fight of AI audio. In June 2024 the RIAA, on behalf of Universal, Sony, and Warner, sued both Suno and Udio, alleging they trained their models on copyrighted master recordings without permission. The case is the music industry's version of the question every generative-AI field is asking: is training on copyrighted work "fair use," or is it infringement? The answer is being negotiated and litigated in real time. By late 2025 and into 2026 the dispute had split three ways: Warner settled with Suno (November 2025) and Universal settled with Udio (October 2025), both turning into licensing deals — Universal's reportedly carrying a per-generation royalty in the range of fractions of a cent, plus content-identification and audit obligations — while Sony has not settled, and a pivotal fair-use ruling in the Sony cases is expected around mid-2026. Suno, still litigating, raised another $400 million in mid-2026, which tells you the market is betting the technology survives the lawsuits in some licensed form.

What this means for a video product in 2026 is concrete and cautionary. The legal status of a generated track depends on the tool's training data and its terms, and those are shifting under your feet as settlements land. If you put AI-generated music into content you publish or sell, you inherit whatever provenance risk the generator carries. The defensible path is to use generators that have either licensed their training data or trained only on owned or public-domain material, to read the commercial-use terms of whatever you use, and to keep records of what was generated and how. Treat a generated soundtrack the way you would treat stock music with an uncertain license — useful, but check the paperwork before it ships in a paying product.
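"Keep records of what was generated and how" can be as simple as one log line per track. The sketch below shows one possible shape, assuming nothing about any particular generator: the field names, the tool name, and the terms URL are placeholders invented for illustration. The SHA-256 hash ties the record to the exact audio file that shipped.

```python
import hashlib
import json
from datetime import datetime, timezone


def provenance_record(audio_bytes, generator, prompt, license_terms_url, created_at=None):
    """Describe one generated track so its origin can be shown later."""
    created = created_at or datetime.now(timezone.utc)
    return {
        "sha256": hashlib.sha256(audio_bytes).hexdigest(),  # fingerprint of the file
        "generator": generator,
        "prompt": prompt,
        "license_terms_url": license_terms_url,
        "created_at": created.isoformat(),
    }


record = provenance_record(
    audio_bytes=b"placeholder audio bytes",
    generator="example-music-generator",            # hypothetical tool name
    prompt="upbeat corporate background, 90 seconds, no vocals",
    license_terms_url="https://example.com/terms",  # hypothetical URL
)
print(json.dumps(record))  # one line per track, suitable for an append-only log
```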

A field guide: which door, which product, what to watch

The four doors map onto product decisions differently, and the table below is the summary worth keeping. Read the right-hand column first — the risk is the deciding factor more often than the capability.

Family	What it does	Where it ships in 2026	Maturity	Main thing to watch
Neural codecs	Compress audio with a learned model	Lyra in Google Meet; research and niche use elsewhere	Production for speech, emerging for music	CPU/battery cost, not bandwidth
Neural cleanup (NS, AEC, PLC)	Remove noise, echo, fill dropped packets	Every major conferencing platform	Mature	Over-aggressive models erase real information
Voice cloning / dubbing	Re-voice content in a known voice	ElevenLabs, Resemble, and rivals	Production, very fast	Consent, likeness rights, EU AI Act labelling
Generative music / SFX	Create new audio from a prompt	Suno, Udio, and others	Production, legally unsettled	Training-data copyright; check commercial terms

The pattern is plain. The left two doors — compression and cleanup — are infrastructure: ship them when the engineering math works, and the only real question is whether the quality gain beats the processor cost. The right two doors — cloning and generation — are content, and for those the gating question is never "does it work" (it does) but "do we have the right to ship what it makes." A founder who internalizes that split will spend their legal budget where it actually matters and their engineering budget where it actually pays.

How the four doors connect: a worked pipeline

To see how these fit together, follow a single piece of audio through a realistic 2026 video product — an e-learning platform localizing a recorded lecture. The lecturer recorded in a noisy home office. Door 2 runs first: a neural noise suppressor removes the room hum and an air conditioner, cleaning the real recording. The cleaned audio is transcribed and translated, then Door 3 generates the Spanish, Mandarin, and Arabic dubs in the lecturer's own cloned voice — which the platform may legally do because the lecturer signed a scoped consent at upload, and which it labels as AI-generated to satisfy Article 50 for its EU students. A short musical sting between sections comes from Door 4, a generative tool whose commercial terms the platform checked. And when a student watches on a weak connection, Door 1 — a neural codec — delivers the audio at a quarter of the usual bitrate without the voice degrading.

Four families, one lecture, each door doing the job it is actually good at. Notice that the two low-risk doors run silently in the infrastructure, while the two high-risk doors each required a deliberate legal step — consent at one, license-checking at the other — built into the product flow rather than bolted on. That is the whole discipline of AI audio in one example.
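The lecture example can be written down as an ordered list of steps, each carrying the gates that must be satisfied before it runs. The step names follow the walkthrough above; the gate identifiers and the `missing_gates` helper are hypothetical names for this sketch.

```python
# Each entry: (step, gates that must be satisfied before the step may run).
LECTURE_PIPELINE = [
    ("Door 2: neural noise suppression", []),
    ("Transcribe and translate", []),
    ("Door 3: dub in the lecturer's cloned voice", ["scoped_consent", "ai_label"]),
    ("Door 4: generated musical sting", ["commercial_terms_checked"]),
    ("Door 1: neural codec delivery", []),
]


def missing_gates(pipeline, satisfied):
    """List every (step, gate) pair whose gate has not been satisfied yet."""
    return [
        (step, gate)
        for step, gates in pipeline
        for gate in gates
        if gate not in satisfied
    ]


# Consent was captured at upload, but nothing else has been done yet.
print(missing_gates(LECTURE_PIPELINE, {"scoped_consent"}))
```

The two infrastructure steps never appear in the result because they carry no gates; only the cloning and generation steps can block a release.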

Where Fora Soft fits in

We build video conferencing, telemedicine, e-learning, OTT streaming, and surveillance products, and AI audio shows up in all of them — but rarely in the form the headlines suggest. The features we ship most are the quiet ones: neural noise suppression on conferencing and telemedicine calls, where a clean voice is a clinical and educational necessity, and the cleanup and concealment that keep a call intelligible on a bad network. When clients ask about voice cloning or AI dubbing for localization, our first conversation is about consent capture and disclosure, not the model — because the engineering is the easy part and the rights are where projects succeed or fail. For real-time AI processing on calls, the constraint is always latency and per-client compute, which is why we treat a neural feature as a budget line in the WebRTC audio pipeline, not a free upgrade. The right AI audio feature, shipped with the rights and the latency budget handled, is a real advantage; the wrong one is a liability.

Call to action

Talk to an audio engineer: book a 30-minute scoping call to discuss your plan for AI in audio for video.
See our case studies — 250+ shipped projects across video streaming, WebRTC, OTT, telemedicine, e-learning, surveillance, and AR/VR.
Download the AI Audio Decision Checklist — One page: the four families of AI audio (neural codecs, neural cleanup, voice cloning / dubbing, generative music), the 2026 facts, and the consent + copyright gates before you ship.

References

N. Zeghidour, A. Luebs, A. Omran, J. Skoglund, M. Tagliasacchi, SoundStream: An End-to-End Neural Audio Codec, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021. Google Research. https://research.google/blog/soundstream-an-end-to-end-neural-audio-codec/ (Primary source from the codec's authors: end-to-end neural codec with residual vector quantization; at 3 kbps outperforms Opus at 12 kbps and approaches EVS at 9.6 kbps.)
A. Défossez, J. Copet, G. Synnaeve, Y. Adi, High Fidelity Neural Audio Compression (EnCodec), arXiv:2210.13438, Meta AI, October 2022. https://arxiv.org/abs/2210.13438 (Primary source: convolutional encoder–decoder with RVQ and adversarial training; 24 kHz mono at 1.5 kbps to 48 kHz stereo at 24 kbps; higher MUSHRA than SoundStream at equal bitrate.)
Google Open Source Blog, Lyra V2 — a better, faster, and more versatile speech codec, September 2022. https://opensource.googleblog.com/2022/09/lyra-v2-a-better-faster-and-more-versatile-speech-codec.html (Primary vendor source: Lyra v2 bitrates 3.2 / 6 / 9.2 kbps; built on SoundStream; RVQ enables instant bitrate switching; 9.2 kbps ≈ Opus 14 kbps; runs on phones.)
J. Stimberg, A. Narest, et al., WaveNetEQ — Packet Loss Concealment with WaveRNN, IEEE 2021 (also Google AI Blog, "Improving Audio Quality in Duo with WaveNetEQ"). https://ieeexplore.ieee.org/document/9443419/ (Primary source: neural PLC conditioned on log-mel history; runs in real time inside the NetEQ jitter buffer; shipped in Google Duo; ranked 2nd in the Interspeech 2022 PLC Challenge.)
Regulation (EU) 2024/1689 (EU Artificial Intelligence Act), Article 50: Transparency obligations for providers and deployers of certain AI systems. Article 50 applies from 2 August 2026 (per Article 113); the Regulation itself entered into force on 1 August 2024. Official Journal version of 13 June 2024. https://artificialintelligenceact.eu/article/50/ (Standards / legislative primary source: §50(2) machine-readable marking of synthetic audio; §50(4) deep-fake disclosure obligation; artistic-work carve-out.)
ElevenLabs, Dubbing documentation and Dubbing Studio product pages, accessed 2026-06-07. https://elevenlabs.io/docs/overview/capabilities/dubbing (Vendor source: Dubbing v2 uses speaker diarization, per-speaker voice cloning, and performance-conditioned synthesis across 90+ languages; per-source-minute billing.)
Recording Industry Association of America (RIAA), Record Companies Bring Landmark Cases for Responsible AI Against Suno and Udio, June 2024. https://www.riaa.com/record-companies-bring-landmark-cases-for-responsible-ai-againstsuno-and-udio-in-boston-and-new-york-federal-courts-respectively/ (Primary source for the music-industry copyright suits against Suno and Udio.)
TechCrunch, Still facing copyright lawsuits, AI music generator Suno raises another $400M, 3 June 2026. https://techcrunch.com/2026/06/03/still-facing-copyright-lawsuits-ai-music-generator-suno-raises-another-400m/ (Current-status source: Suno $400M raise mid-2026; Warner–Suno and Universal–Udio settlements; Sony unsettled; fair-use ruling pending. Trade press — flag for a primary-court-record swap at review.)
C. K. A. Reddy, V. Gopal, R. Cutler, DNSMOS P.835: A Non-Intrusive Perceptual Objective Speech Quality Metric to Evaluate Noise Suppressors, ICASSP 2022, arXiv:2110.01763. https://arxiv.org/abs/2110.01763 (Source for the neural-noise-suppression evaluation context; ties Door 2 to the objective-metrics article.)
J. Zhang, M. Zhu, et al., The CCF AATC 2025 Speech Restoration Challenge, arXiv:2509.12974, 2025. https://arxiv.org/abs/2509.12974 (Current research source: diffusion-based and multi-degradation neural speech restoration — denoising, dereverberation, declipping — the cleanup frontier in 2025–2026.)
EU AI Act, Article 113 — Entry into Force and Application, Regulation (EU) 2024/1689. https://artificialintelligenceact.eu/article/113/ (Legislative primary source confirming the 2 August 2026 application date for the Article 50 transparency obligations.)

Living note: AI-audio law is moving monthly. The Suno/Udio settlements (Warner–Suno Nov 2025; Universal–Udio Oct 2025) and the pending Sony fair-use ruling were current as of 2026-06-07; the NO FAKES Act remained unpassed in Congress; the EU AI Act Article 50 date (2 Aug 2026) is fixed by Article 113. Re-verify the litigation status and the NO FAKES Act on the next refresh.

Related glossary terms

SoundStream

Opus

Lyra

Bitrate

Noise suppression

EnCodec

Jitter buffer

RNNoise