Published 2026-05-31 · 16 min read · By Nikolay Sapunov, CEO at Fora Soft
Why This Matters
If your product carries a live two-way conversation — video conferencing, telemedicine, a customer-support voice agent, a live-class breakout room — echo is the defect users notice first and forgive least. A caller who hears their own voice come back a fraction of a second later cannot hold a normal conversation; they stutter, pause, and blame your app. Echo cancellation is the feature that prevents this, and in 2026 it sits at a crossroads: the classical signal-processing canceller that every browser ships for free is good but not perfect, and a new wave of AI cancellers fixes its weak spots at a cost in compute and money. This article is for the product manager, founder, or engineering lead who has to decide whether the free canceller is enough or whether the product needs to reach for an AI one. By the end you will understand why echo happens, how both kinds of canceller work, why "double-talk" is the case that separates a good canceller from a bad one, and a clear test for choosing.
What Echo Actually Is
Start with the loop that creates the problem. Picture two people on a call, Anna and Boris. Anna speaks. Her voice travels to Boris's device, comes out of his speaker, and fills the room he is sitting in. Boris's microphone, sitting right there, picks up everything in that room — including Anna's voice coming out of his own speaker. That captured copy of Anna's voice then travels back across the network to Anna, who hears herself a moment later. That returned copy of her own voice is the echo.
The delay is what makes it intolerable. If the round trip were instant, Anna would not notice. But the trip takes time — the network, the buffers, the journey from speaker to microphone — so Anna hears her own words come back tens to hundreds of milliseconds after she said them. The international guideline for when this becomes a problem is set by an ITU-T standard on talker echo, which establishes that the longer the delay, the more even a faint echo annoys the listener (ITU-T G.131). Past roughly a few tens of milliseconds of round-trip delay, audible echo turns a conversation into a chore.
There are two flavours of this problem, and the distinction shapes everything that follows. The first is acoustic echo: the kind just described, where sound physically leaves a speaker, crosses the air of a room, and re-enters a microphone. This is the dominant problem on any device with an open speaker and microphone — a laptop, a phone on speaker, a conference-room system. The second is network echo (also called line echo): an older problem inside the telephone network, where signal reflects off the electrical junction that converts a four-wire line to the two-wire line running to a home phone. Acoustic echo is the one that matters for video products in 2026; network echo is largely a legacy telephony concern, governed by its own standard (ITU-T G.168).
A crucial boundary, because it trips people up constantly: echo cancellation is not the same as noise suppression. Noise suppression removes the room's background sounds — a fan, a dog, keyboard clatter — from your own microphone. Echo cancellation removes the other person's voice that leaked back through your speaker. They are different problems with different inputs, and a complete product needs both. Noise suppression has only the microphone signal to work with; echo cancellation has a second, decisive input we will meet in a moment. Noise suppression is covered in its own article on real-time noise suppression; this one is only about echo. Both sit in the same clean-audio front end that feeds everything downstream, including the streaming speech-to-text that turns the call into captions and transcripts.
Figure 1. Echo is the far-end voice leaking back through the near-end speaker and microphone. Acoustic echo is the case that matters for video products.
The One Input That Makes Cancellation Possible
Here is the insight the entire field rests on, and it is worth slowing down for. The echo canceller has one enormous advantage over noise suppression: it already has a clean copy of the sound it needs to remove.
Think about what the device knows. When Anna's voice arrives at Boris's device, the software receives that audio before it plays it out of the speaker. That incoming far-end audio is called the reference signal — the known, clean copy of what is about to come out of the speaker. A fraction of a second later, the microphone captures a messy mixture: Boris's own voice, plus the room noise, plus the echo of Anna's voice that just played. The canceller's job is to look at the clean reference it already holds, figure out how that reference got transformed on its trip through the speaker and the room, and subtract that transformed copy from the microphone mixture.
Engineers use a fixed vocabulary for these signals, and learning it makes every diagram and vendor doc readable. NVIDIA's audio SDK lays them out cleanly: the far-end signal x is the other person's voice (the reference); the near-end microphone signal y is what your mic captures, which is your own speech s plus the echo e; and the canceller's output s' is the microphone signal with the echo removed (NVIDIA, 2026). The whole operation is one subtraction expressed in plain terms:
near-end mic (y) = your speech (s) + echo (e)
output (s') = y − estimate of e
≈ your speech (s), with the echo gone
If only the far-end voice is present and you are silent, a perfect canceller outputs silence — it removed the echo and there was nothing else (NVIDIA, 2026). The difficulty, and the reason this is a whole field rather than one subtraction, is the phrase "estimate of e". The canceller never gets the echo handed to it directly; it only has the reference x and must predict what that reference became by the time it re-entered the microphone. Everything below is about making that prediction good.
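The signal model can be made concrete with a toy sketch. The echo path used here (a gain of 0.5 and a delay of 3 samples) and the function names are assumptions for illustration only; a real canceller never receives the echo directly and has to estimate it.

```javascript
// Toy signal model: y = s + e, output s' = y - estimate of e.
// The echo path (gain 0.5, delay 3 samples) is an assumption for illustration.
function simulateEcho(x, gain = 0.5, delay = 3) {
  // Echo is a delayed, attenuated copy of the far-end reference x.
  return x.map((_, n) => (n >= delay ? gain * x[n - delay] : 0));
}

function cancel(y, echoEstimate) {
  // The one subtraction: s' = y - estimate of e.
  return y.map((v, n) => v - echoEstimate[n]);
}

const x = [0.2, -0.4, 0.6, 0.1, -0.3, 0.5, 0.0, -0.2]; // far-end reference
const s = [0.0, 0.1, 0.0, -0.1, 0.2, 0.0, 0.1, 0.0];   // near-end speech
const e = simulateEcho(x);                              // echo produced in the room
const y = s.map((v, n) => v + e[n]);                    // what the microphone captures

const out = cancel(y, e);        // perfect estimate: output is the near-end speech
const farEndOnly = cancel(e, e); // near end silent: output is silence
```

Here the estimate is perfect because the sketch cheats and reuses the true echo; the rest of the problem is producing that estimate from x alone.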
How The Classical Canceller Works
The classical acoustic echo canceller, the design that has shipped in phones and conferencing systems since the 1990s and still forms the core of the canceller built into browsers today, predicts the echo with a tool called an adaptive filter. The word "filter" here means a small mathematical model that takes the reference signal in and produces a predicted echo out. The word "adaptive" means it tunes itself continuously, because the room it is modelling keeps changing.
To understand what the filter is modelling, picture the path Anna's voice takes inside Boris's room. It leaves the speaker, bounces off the desk, the wall, the window, and the ceiling, and each bounce arrives at the microphone at a slightly different time and volume. That collection of delayed, faded copies is called the echo path, and it is what turns the clean reference into the specific echo this room produces. The adaptive filter's job is to learn that echo path — to become a mathematical copy of "what this room does to sound" — so it can apply the same transformation to the reference and produce a matching prediction.
The filter learns by trial and error, many times per second. It starts with a guess, applies it to the reference to predict the echo, subtracts that prediction from the microphone signal, and looks at what is left over: the error. If the error still contains echo, the filter nudges its settings to reduce it, and repeats. The standard rule for how it nudges is an algorithm called Normalized Least Mean Squares, or NLMS, which adjusts the filter after each slice of audio to shrink the error and divides each adjustment by the energy of the recent input, so a sudden loud passage does not throw it off course (Benesty et al., 2015). Modern implementations run this in the frequency domain, splitting the sound into frequency bands and adapting each one separately, which is faster and is the approach used by the widely deployed open-source SpeexDSP canceller (Xiph.Org, 2026).
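The learning loop can be sketched as a minimal time-domain NLMS filter. This is a simplified illustration, not the frequency-domain structure production cancellers use; the filter length, step size `mu`, regulariser `eps`, and the four-tap `roomPath` are all assumed values.

```javascript
// Time-domain NLMS sketch. filterLength, mu and eps are illustrative assumptions.
function createNlms(filterLength, mu = 0.5, eps = 1e-6) {
  const w = new Float64Array(filterLength);    // current estimate of the echo path
  const xBuf = new Float64Array(filterLength); // recent reference samples, newest first

  return {
    weights: w,
    // x: far-end reference sample, y: microphone sample.
    // Returns the error y - predicted echo, which is also the canceller output.
    step(x, y, adapt = true) {
      xBuf.copyWithin(1, 0); // shift the history by one sample
      xBuf[0] = x;

      let predicted = 0;
      let energy = eps;
      for (let k = 0; k < filterLength; k++) {
        predicted += w[k] * xBuf[k];
        energy += xBuf[k] * xBuf[k];
      }
      const err = y - predicted;

      if (adapt) {
        const g = (mu * err) / energy; // step normalised by input energy
        for (let k = 0; k < filterLength; k++) w[k] += g * xBuf[k];
      }
      return err;
    },
  };
}

// Deterministic pseudo-random far-end signal so the run is repeatable.
let seed = 1;
function noise() {
  seed = (seed * 1664525 + 1013904223) % 4294967296;
  return seed / 4294967296 - 0.5;
}

const roomPath = [0, 0.5, 0, -0.2]; // assumed echo path for the demo
const aec = createNlms(roomPath.length);
const history = [0, 0, 0, 0];
let lastError = 0;
for (let n = 0; n < 2000; n++) {
  const x = noise();
  history.unshift(x);
  history.pop();
  const echo = roomPath.reduce((sum, h, k) => sum + h * history[k], 0);
  lastError = aec.step(x, echo); // far-end single talk: the mic hears only echo
}
```

With only the far end talking and no noise, the weights settle on the assumed room path and the error shrinks toward zero. The `adapt` flag is there so a caller can pause learning without pausing cancellation.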
No filter is perfect, so a classical canceller has a mandatory second stage. After the filter subtracts its best prediction, a residual echo suppressor (historically called a nonlinear processor, or NLP) attenuates whatever echo the filter could not model, by estimating how much echo energy remains in each frequency band and turning those bands down (VOCAL Technologies, 2026). This second stage is why even a good classical canceller can make the near-end talker sound slightly clipped or "pumping" to the listener at the far end: it is aggressively muting bands to kill residual echo, and that muting catches some real speech too.
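The band-by-band attenuation can be sketched with a simple Wiener-style gain rule. The source does not specify which rule real suppressors use, so this rule, the function name, and the `minGain` floor are assumptions chosen for clarity.

```javascript
// Per-band suppression gains from estimated energies (illustrative sketch).
// errorEnergy[b]: energy left in band b after the adaptive filter.
// residualEchoEnergy[b]: estimated echo energy still present in that band.
// minGain is an assumed floor that limits how hard a band is muted.
function suppressionGains(errorEnergy, residualEchoEnergy, minGain = 0.1) {
  return errorEnergy.map((total, b) => {
    if (total <= 0) return minGain; // nothing measurable in this band
    const speechShare = 1 - residualEchoEnergy[b] / total; // fraction judged not to be echo
    return Math.min(1, Math.max(minGain, speechShare));
  });
}

// Band 0 has no residual echo, band 1 is mostly echo,
// band 2 has an echo estimate larger than the measured energy.
const gains = suppressionGains([1.0, 1.0, 2.0], [0.0, 0.9, 4.0]);
```

Band 0 passes untouched while bands 1 and 2 are pushed down to the floor. Any near-end speech sharing those bands is attenuated with the echo, which is the clipping artefact described above.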
Figure 2. The classical canceller: an adaptive filter predicts the echo and subtracts it, an error signal tunes the filter, and a residual suppressor cleans up the rest.
The Case That Breaks Everything: Double-Talk
Every echo canceller meets one situation that decides whether it is good or merely adequate, and it is the moment both people speak at once. Engineers call it double-talk.
Recall how the filter learns: it looks at the leftover error and assumes the error is residual echo it should drive to zero. Now picture Anna and Boris speaking simultaneously. Boris's microphone now contains the echo of Anna's voice and Boris's own speech. The error signal suddenly contains Boris's real voice, which is not echo at all. If the filter keeps tuning to drive that error to zero, it will mistake Boris's voice for echo and corrupt its careful model of the room — and once the model is corrupted, the echo it was cancelling leaks back through (Benesty et al., 2015).
The classical defence is a double-talk detector: a separate module that watches for the moment both parties are speaking and, when it fires, freezes the adaptive filter so it stops learning until the near-end speaker goes quiet (VOCAL Technologies, 2026). This works, but it is a blunt instrument. Freeze too eagerly and the filter never adapts to a changing room; freeze too late and it has already corrupted itself. Tuning this trade-off is where decades of classical AEC research went, and it is exactly the seam where the AI cancellers pull ahead.
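One classic textbook detector is the Geigel detector, sketched here as an example of the freeze decision. It is not necessarily what any particular product ships, and the window length, threshold, and hold time are assumed values.

```javascript
// Geigel-style double-talk detector (a textbook design, used here as an assumption).
// Flags double-talk when the mic sample is louder than `threshold` times
// the loudest reference sample in a recent window, then holds the flag briefly.
function createGeigelDetector(windowLength = 256, threshold = 0.5, holdSamples = 240) {
  const recent = new Float64Array(windowLength); // recent |reference| values
  let pos = 0;
  let hold = 0;

  return function isDoubleTalk(x, y) {
    recent[pos] = Math.abs(x);
    pos = (pos + 1) % windowLength;

    let peak = 0;
    for (let i = 0; i < windowLength; i++) {
      if (recent[i] > peak) peak = recent[i];
    }

    if (Math.abs(y) > threshold * peak) {
      hold = holdSamples; // mic too loud to be echo alone: freeze for a while
    } else if (hold > 0) {
      hold--;
    }
    return hold > 0;
  };
}

// Tiny demo with a short window and hold so the behaviour is easy to follow.
const detect = createGeigelDetector(4, 0.5, 2);
const flags = [
  detect(1.0, 0.3), // mic quieter than half the reference peak: echo only
  detect(0.8, 0.9), // mic louder than the echo could be: double-talk
  detect(0.8, 0.2), // still inside the hold period
  detect(0.8, 0.2), // hold expired
];
```

A caller would skip the filter update whenever the detector returns true. The threshold and hold time are exactly the knobs behind the "too eagerly" versus "too late" trade-off.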
A second hard case is nonlinearity. The adaptive filter assumes the speaker reproduces the reference faithfully, just delayed and faded. Cheap laptop and phone speakers do not — driven loud, they distort, adding harmonics that were never in the reference. A linear filter cannot predict a distortion it has no clean copy of, so that nonlinear echo slips past the filter and lands on the residual suppressor, which can only crudely mute it. Nonlinear speaker echo is the single biggest reason classical cancellers underperform on consumer hardware.
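A small numerical sketch shows why distortion defeats a linear model. It fits the best single gain from reference to echo and reports how much echo energy is left. Hard clipping at 0.2 stands in for speaker distortion, and a one-tap fit stands in for the full filter; both are assumptions for illustration.

```javascript
// Fraction of echo energy a best-fit linear gain cannot explain.
function linearResidualRatio(reference, echo) {
  let xx = 0, xe = 0, ee = 0;
  for (let n = 0; n < reference.length; n++) {
    xx += reference[n] * reference[n];
    xe += reference[n] * echo[n];
    ee += echo[n] * echo[n];
  }
  const g = xe / xx; // least-squares gain from reference to echo
  let rr = 0;
  for (let n = 0; n < reference.length; n++) {
    const r = echo[n] - g * reference[n];
    rr += r * r;
  }
  return rr / ee;
}

// Reference: a sine wave, 20 samples per period.
const ref = Array.from({ length: 200 }, (_, n) => Math.sin((2 * Math.PI * n) / 20));

// Faithful speaker: the echo is just a scaled copy of the reference.
const cleanEcho = ref.map((v) => 0.5 * v);

// Distorting speaker (assumed): the same signal, hard-clipped at +/-0.2.
const clippedEcho = ref.map((v) => Math.max(-0.2, Math.min(0.2, 0.5 * v)));

const cleanLeft = linearResidualRatio(ref, cleanEcho);     // essentially zero
const clippedLeft = linearResidualRatio(ref, clippedEcho); // a noticeable fraction remains
```

The clipped echo contains harmonics that are absent from the reference, so no linear gain, and no longer linear filter either, can reproduce them. That leftover energy is what reaches the residual suppressor.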
How The AI Hybrid Works
The 2026 approach does not throw the classical canceller away. It keeps the adaptive filter — which is fast, cheap, and very good at the linear part of the problem — and replaces the brittle back end with a neural network. This is why the topic is "classical AEC plus AI hybrid" rather than "AI instead of AEC": the best systems are hybrids.
The division of labour is the key idea. The classical adaptive filter does what it has always done well: predict and subtract the bulk of the linear echo, cheaply. Then, instead of a hand-tuned nonlinear processor, a small neural network takes over the hard residual. It receives the microphone signal, the reference signal, and the filter's output, and it has been trained on thousands of hours of real echo to recognise what residual echo looks like versus what real near-end speech looks like — including the nonlinear speaker distortion and the double-talk cases the classical back end fumbles (Westhausen and Meyer, 2020). Where the classical double-talk detector makes a crude freeze-or-don't decision, the network has learned a far finer judgement about which parts of the signal are echo and which are the near-end talker, so it can suppress echo during double-talk without muting the person who is speaking.
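The division of labour can be sketched as the shape of one processing step. Both `linearStage` and `residualModel` are hypothetical stand-ins, not real library calls, and a per-sample gain is a simplification: published models such as DTLN-AEC work on learned or frequency-domain representations.

```javascript
// Shape of one hybrid processing step (sketch).
// linearStage: hypothetical stand-in for the adaptive filter.
// residualModel: hypothetical stand-in for a trained network that returns
// a gain between 0 and 1 for each sample of the frame.
function hybridFrame(micFrame, refFrame, linearStage, residualModel) {
  const linearOut = linearStage(micFrame, refFrame);         // bulk of the linear echo removed
  const mask = residualModel(micFrame, refFrame, linearOut); // the network sees all three signals
  return linearOut.map((v, n) => v * mask[n]);               // suppress what it judges to be echo
}

// Stubs for the demo only: a fixed-gain "filter" and two trivial "models".
const linearStub = (mic, ref) => mic.map((v, n) => v - 0.5 * ref[n]);
const passThroughModel = (mic, ref, linearOut) => linearOut.map(() => 1);
const muteAllModel = (mic, ref, linearOut) => linearOut.map(() => 0);

const kept = hybridFrame([0.6, 0.1], [1.0, 0.0], linearStub, passThroughModel);
const muted = hybridFrame([0.6, 0.1], [1.0, 0.0], linearStub, muteAllModel);
```

The point of the structure is that the network receives the microphone, the reference, and the linear stage's output together, so it only has to decide what remains after the cheap subtraction.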
A widely cited example of this design is DTLN-AEC, an open-source model whose name — Dual-signal Transformation LSTM Network for AEC — describes exactly what it does: it takes the two signals (microphone and reference) and uses a compact memory network to separate near-end speech from echo (Westhausen and Meyer, 2020). It was built deliberately small to run in real time, which matters because, as we will see, latency is the gate on all of this.
The reason this matters now and not five years ago is that the research community organised around the problem. Between 2021 and 2023, Microsoft ran a series of Acoustic Echo Cancellation Challenges at major speech and signal-processing conferences, including ICASSP, that standardised the datasets and the scoring and pulled the whole field forward. The 2023 edition, the fourth in the series, released training data recorded on more than 10,000 real audio devices with human speakers in real environments, added a personalised track, and, critically for live use, required submissions to keep total algorithmic plus buffering latency at or below 20 milliseconds (Cutler et al., 2023). That competition is why production-grade AI echo cancellation exists as an off-the-shelf option in 2026.
Figure 3. The hybrid keeps the classical adaptive filter for the linear echo and swaps the brittle back end for a neural network trained on real echo.
The table below sets the two approaches side by side on the axes that decide a product choice. Read it as trade-offs, not a leaderboard: the classical canceller wins on cost and latency, the hybrid wins on the hard echo cases, and the right column depends on the hardware your product runs on.
| Axis | Classical AEC | AI hybrid |
|---|---|---|
| Linear echo (clean speaker) | Strong | Strong (same adaptive filter) |
| Nonlinear echo (cheap speaker) | Weak — filter can't model it | Strong — network learned it |
| Double-talk (both speaking) | Crude freeze, can mute or leak | Fine-grained, suppresses without muting |
| Added latency | ~0–2 ms | up to ~20 ms |
| Compute | Very low (CPU, per-sample) | Higher (CPU or GPU, framed) |
| Cost | Free (AEC3, SpeexDSP) | Free open (DTLN-AEC) or paid SDK (Krisp, NVIDIA) |
| Best for | Headsets, clean audio chains | Loud open speakers, distorting hardware |
The Number That Gates A Live Call: Latency
As with every real-time audio feature, one number decides whether a canceller can be used in a live conversation at all: the delay it adds. A canceller that sounds flawless in a recording but adds 60 milliseconds of delay can be useless in a two-way call, because that delay stacks on top of everything else and pushes the conversation past the point where it feels natural.
The reason the AEC Challenge fixed its latency cap at 20 milliseconds is that this is roughly the ceiling a live communication path can absorb from one processing block (Cutler et al., 2023). The delay has two sources. The first is the frame size: the canceller processes audio in short slices, and it cannot finish a slice until the whole slice has arrived, so a 10-millisecond frame means at least a 10-millisecond wait. The second is look-ahead: a model that reaches forward into upcoming audio to make a better decision must wait for that audio to arrive, adding more delay. Classical cancellers add very little, because they work sample by sample or on very short blocks. AI cancellers add more, which is why they are designed against a hard latency budget rather than purely for quality.
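The two sources of delay reduce to simple arithmetic. The helper below is a sketch: the function name and the 16 kHz example values are assumptions, and it counts only algorithmic delay, not compute time or device buffering.

```javascript
// Algorithmic delay from framing plus look-ahead, in milliseconds.
function algorithmicLatencyMs(frameSamples, lookaheadSamples, sampleRateHz) {
  return ((frameSamples + lookaheadSamples) * 1000) / sampleRateHz;
}

// 160 samples at 16 kHz is one 10 ms frame.
const frameOnly = algorithmicLatencyMs(160, 0, 16000);       // 10
const withLookahead = algorithmicLatencyMs(160, 160, 16000); // 20
```

A model with one 10 ms frame and one frame of look-ahead already sits at the 20 ms cap before any processing time is counted.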
Make it concrete with the conversation budget. The ITU-T guideline for one-way mouth-to-ear delay says keep it at or below 150 milliseconds for a conversation to feel natural (ITU-T G.114). Suppose the rest of your pipeline — network, capture, encode, decode — already uses 110 milliseconds:
150 ms (G.114 target) − 110 ms (rest of pipeline) = 40 ms left
A classical canceller adding a couple of milliseconds barely touches that 40-millisecond headroom. An AI canceller built to the 20-millisecond AEC-Challenge budget spends half of it:
40 ms headroom − 20 ms (AI canceller) = 20 ms remaining
That can be the right trade when the AI canceller's quality gain is worth the delay, and the wrong one when your pipeline is already tight. The rule is to fix the latency budget first and shop for quality within it — never the reverse. The wider question of where every millisecond goes, and where the canceller should live (on the device, at the edge, or in the cloud), is the subject of our latency and deployment-topology article.
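The budget arithmetic above can be wrapped in a small helper, so each candidate canceller is checked against the same target. The function and field names are illustrative; the 150 ms default is the G.114 figure quoted in the text.

```javascript
// One-way mouth-to-ear target quoted from ITU-T G.114 in the text above.
const G114_ONE_WAY_TARGET_MS = 150;

// Headroom left after the rest of the pipeline, and after adding a canceller.
function latencyBudget(pipelineMs, cancellerMs, targetMs = G114_ONE_WAY_TARGET_MS) {
  const headroomMs = targetMs - pipelineMs;
  const remainingMs = headroomMs - cancellerMs;
  return { headroomMs, remainingMs, fits: remainingMs >= 0 };
}

const aiCase = latencyBudget(110, 20);       // { headroomMs: 40, remainingMs: 20, fits: true }
const classicalCase = latencyBudget(110, 2); // { headroomMs: 40, remainingMs: 38, fits: true }
const tightCase = latencyBudget(140, 20);    // { headroomMs: 10, remainingMs: -10, fits: false }
```

The last case is the one to watch for: a pipeline that already uses 140 ms has no room for a 20 ms canceller, however good it sounds.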
How To Tell If A Canceller Is Actually Good — The Metrics
Vendors love a before-and-after clip, which proves nothing because they chose the clip. Two objective measures let you read any benchmark critically, and they answer two different questions.
The classical engineering measure is Echo Return Loss Enhancement, or ERLE: how many decibels of echo the canceller removes, measured when only the far-end is talking. A high ERLE means strong echo removal in the easy single-talk case. Its limitation is that it says nothing about double-talk and does not track how a human actually perceives the call — a canceller can post a great ERLE while mangling the near-end voice (Purin et al., 2021).
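ERLE is straightforward to compute on your own recordings. The sketch below uses the standard energy-ratio definition over a far-end-only segment; the function name and sample values are illustrative.

```javascript
// ERLE in decibels over a far-end single-talk segment:
// 10 * log10(energy of mic signal / energy of canceller output).
function erleDb(mic, output) {
  let micEnergy = 0;
  let outEnergy = 0;
  for (let n = 0; n < mic.length; n++) {
    micEnergy += mic[n] * mic[n];
    outEnergy += output[n] * output[n];
  }
  return 10 * Math.log10(micEnergy / outEnergy); // Infinity if the output is pure silence
}

// Echo reduced to one tenth of its amplitude: about 20 dB of ERLE.
const micEcho = [1, -1, 1, -1];
const afterAec = [0.1, -0.1, 0.1, -0.1];
const erle = erleDb(micEcho, afterAec);
```

The measurement is only meaningful while the near end is silent. Any near-end speech in the segment counts as uncancelled energy and drags the number down.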
The modern measure fixes that. AECMOS, a neural model from Microsoft, listens to a processed clip and predicts the score a human listening panel would give, rating echo annoyance and other degradations separately, and it needs no clean reference clip to do it (Purin et al., 2021). Because it scores echo and speech degradation apart, it catches the canceller that kills echo by also killing the voice, which is the failure ERLE hides. The AEC Challenge supplied a full-band AECMOS as its objective metric, alongside a final ranking that combined listening-test scores with speech-recognition word accuracy, and the model is freely available, so you can score your own candidates on your own recordings (Cutler et al., 2023). For a product decision, run AECMOS on clips recorded in your real conditions (your actual devices, your actual rooms) and treat any vendor's demo reel as marketing.
The Free Baseline First — The Browser Already Cancels Echo
Before integrating any AI canceller, check whether you need to, because the web platform already cancels echo and turning it on is one line. The W3C standard that defines how a web page gets a microphone includes a constraint called echoCancellation, and when you request the microphone with it on, the browser applies its built-in canceller (W3C, 2026).
// Request the microphone with the browser's built-in echo cancellation on.
navigator.mediaDevices.getUserMedia({
  audio: { echoCancellation: true, noiseSuppression: true }
});
That built-in canceller is not a toy. In Chrome and every Chromium browser it is AEC3, the third-generation WebRTC echo canceller, a mature classical canceller that handles the linear echo and double-talk well and ships to billions of devices for free (BlogGeek.me, 2026). For a large share of products, especially desktop conferencing where users wear headsets and the echo path is weak, AEC3 with echoCancellation: true is enough, and the correct engineering decision is to ship it and move on.
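Requesting the constraint does not guarantee it was applied, so it can be worth reading the setting back from the track. The sketch below uses the standard `MediaStreamTrack.getSettings()` call. The helper names are illustrative, and `mediaDevices` is passed in as a parameter so the same code can be exercised outside a browser.

```javascript
// Report whether echo cancellation is actually active on a microphone track.
function describeAec(track) {
  const settings = track.getSettings(); // standard MediaStreamTrack method
  if (!("echoCancellation" in settings)) return "unknown"; // browser did not report it
  return settings.echoCancellation ? "on" : "off";
}

// Open the microphone with the built-in canceller requested and report the result.
// Pass navigator.mediaDevices when running in a page.
async function openMicWithAec(mediaDevices) {
  const stream = await mediaDevices.getUserMedia({
    audio: { echoCancellation: true },
  });
  const [track] = stream.getAudioTracks();
  return { stream, aec: describeAec(track) };
}
```

In a page this would be called as `openMicWithAec(navigator.mediaDevices)`. Logging the reported state during real-device testing shows whether the free baseline is actually running before you judge its quality.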
You reach past the baseline in specific cases: speakerphone-style usage where the speaker is loud and close to the microphone, cheap hardware that distorts and produces heavy nonlinear echo, or a product where audio quality is a headline feature and the residual artefacts of AEC3 are not acceptable. That is when you license an AI canceller (Krisp, whose SDK turns on AI echo cancellation by default and runs entirely on the device, or NVIDIA's audio SDK, which offers a GPU-accelerated AEC effect at 16 or 48 kHz) and feed your audio through it instead of, or in addition to, the browser's (Krisp, 2026; NVIDIA, 2026). The same vendor decision and on-device-vs-cloud trade-off come up for noise suppression too, and we walk through that build-versus-buy fork in detail in the noise suppression article. The deep mechanics of wiring a custom canceller into a live WebRTC pipeline, including where it sits and how to avoid double-cancelling, are the subject of our AI in video conferencing playbook.
A Common Mistake — Stacking Two Cancellers
The single most frequent error teams make is to run two echo cancellers on the same stream without realising it. They license an AI canceller, feed it the microphone audio, and forget that the browser's echoCancellation is still on, so the signal passes through AEC3 first and the AI canceller second.
The result is worse than either alone. The browser's canceller has already subtracted its estimate of the echo and applied its residual suppression, so the audio reaching the AI canceller no longer matches the reference signal the AI canceller expects — the echo it was trained to remove has been partially mangled by the first stage. The two cancellers fight each other, the adaptation gets confused, and you often hear more artefacts, not fewer. The fix is to pick one canceller per stream: if you are using an AI canceller, turn the browser's off with echoCancellation: false and feed the AI canceller the raw microphone and reference signals it needs. A canceller only works when it receives the unprocessed inputs it was designed for.
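The one-canceller-per-stream rule can be enforced where the constraints are built. This is a minimal sketch; the function name and the boolean flag are illustrative.

```javascript
// Build getUserMedia constraints so that exactly one canceller touches the stream.
function micConstraints(useExternalCanceller) {
  return {
    audio: {
      // Browser AEC is switched off when an external canceller will
      // process the raw microphone signal, and left on otherwise.
      echoCancellation: !useExternalCanceller,
    },
  };
}

const forAiCanceller = micConstraints(true);  // browser AEC off
const forBrowserOnly = micConstraints(false); // browser AEC on
```

In a page the result is passed straight to `navigator.mediaDevices.getUserMedia(...)`. Deriving the constraint from a single flag makes it hard to enable an external canceller while forgetting to disable the built-in one.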
Where Fora Soft Fits In
We build the products where echo is unforgivable — video conferencing platforms, telemedicine systems where a doctor and patient must hold a natural conversation, e-learning classrooms with many open microphones, and live broadcast tools. In that work the canceller choice is rarely about the single highest quality score; it is about matching the canceller to the device the product actually runs on. A headset-first desktop conferencing tool usually does fine on the browser's built-in AEC3 and should not pay for more. A mobile or kiosk product with a loud open speaker close to the microphone — the worst case for nonlinear echo — is where an on-device AI canceller earns its cost. The pattern we apply is to fix the latency budget first, then identify the worst echo path the product must survive, and only then decide whether the free canceller covers it or an AI one is required — because a canceller that blows the latency budget is not high quality, it is unusable.
How To Choose — Five Questions
Work through these in order; each one narrows the field.
- Is the browser's built-in canceller already enough? Turn on echoCancellation: true, test with real users on real devices, and if echo complaints stop, ship it and stop here. AEC3 is free and good.
- What is the echo path? Headsets and earbuds have a weak echo path the built-in canceller handles easily. A loud open speaker near the microphone (speakerphone, kiosk, conference room) is the hard case that may need AI.
- What hardware must it run on? Cheap speakers that distort produce nonlinear echo that classical cancellers miss; that points toward an AI canceller. A clean audio chain may not need one.
- What is your latency budget? Run the G.114 arithmetic for your pipeline. A classical canceller costs almost nothing; an AI one costs up to ~20 milliseconds — confirm you have the headroom before committing.
- Self-host or buy? Open-source options like DTLN-AEC or SpeexDSP are free but you integrate, tune, and maintain them; Krisp and NVIDIA are paid SDKs that hand you production quality and platform coverage for a fee.
What To Read Next
- Real-time noise suppression in production — Krisp, RNNoise, and DeepFilterNet
- Streaming ASR in production — Deepgram, Whisper, and AssemblyAI in 2026
- AI in video conferencing — engineering playbook
References
- NVIDIA. About the Acoustic Echo Cancellation Effect, NVIDIA Audio Effects (AFX) SDK User Guide, last updated 16 March 2026, accessed 2026-05-31.
  https://docs.nvidia.com/maxine/afx/latest/AboutTheEffects/AboutAcousticEchoCancellation.html. First-party vendor source for the canonical signal model used throughout this article: near-end mic y = s + e, far-end reference x, output s' = (s + e) − e; silent output when only far-end echo is present; 16 kHz / 48 kHz 32-bit-float operation; GPU-accelerated AEC as an off-the-shelf effect; batched server-side processing for conferencing.
- ITU-T. Recommendation G.131: Talker echo and its control, accessed 2026-05-31.
  https://www.itu.int/rec/T-REC-G.131. Primary standards source for why delay turns a faint echo intolerable: the longer the round-trip delay, the lower the echo level a talker will tolerate. Establishes the human-perception basis for needing cancellation at all.
- ITU-T. Recommendation G.168: Digital network echo cancellers, (2015) plus Corrigendum 1 (12/2022), accessed 2026-05-31.
  https://www.itu.int/rec/T-REC-G.168. Primary standards source defining the conformance tests for network (line) echo cancellers, the legacy telephony echo problem this article distinguishes from acoustic echo. Cited to draw the acoustic-vs-network boundary precisely.
- ITU-T. Recommendation G.167: Acoustic echo controllers, (03/1993), accessed 2026-05-31.
  https://www.itu.int/rec/T-REC-G.167. Primary standards source for acoustic echo control in hands-free terminals, the foundational standard for the acoustic-echo problem that dominates video products. Paired in conformance testing with P.340.
- ITU-T. Recommendation P.340: Transmission characteristics and speech quality parameters of hands-free terminals, accessed 2026-05-31.
  https://www.itu.int/rec/T-REC-P.340. Primary standards source for the hands-free-terminal characteristics against which acoustic echo control is tested. Anchors the claim that acoustic echo from open speaker/microphone setups is a standardised, measured problem.
- ITU-T. Recommendation G.114: One-way transmission time, accessed 2026-05-31.
  https://www.itu.int/rec/T-REC-G.114. Primary standards source for the mouth-to-ear delay budget (≤150 ms one-way for natural conversation) that every canceller's added latency must fit inside; the basis of this article's latency arithmetic.
- W3C. Media Capture and Streams (Recommendation), accessed 2026-05-31.
  https://www.w3.org/TR/mediacapture-streams/. Primary standards source for the browser echoCancellation constraint on getUserMedia audio: how it is requested and reported. Grounds the "free baseline first" recommendation and the one-line code example.
- Westhausen, N. L., Meyer, B. T. Acoustic Echo Cancellation with the Dual-Signal Transformation LSTM Network (DTLN-AEC), arXiv:2010.14337, 2020, accessed 2026-05-31.
  https://arxiv.org/abs/2010.14337. Primary academic source for the AI-hybrid design: a compact dual-signal (microphone + reference) network small enough for real time, separating near-end speech from echo where the classical back end fails; the canonical open example of the hybrid approach.
- Cutler, R., et al. ICASSP 2023 Acoustic Echo Cancellation Challenge, arXiv:2309.12553, 2023, accessed 2026-05-31.
  https://arxiv.org/abs/2309.12553. Primary academic source for the modern AI-AEC landscape: the fourth annual challenge, a >10,000-device real-recording dataset plus a synthetic set, a personalised track, the full-band AECMOS metric, MOS+WAcc ranking, and the 20 ms algorithmic-plus-buffering latency cap that defines the live-use budget.
- Purin, M., Sootla, S., Sponza, M., Saabas, A., Cutler, R. AECMOS: A Speech Quality Assessment Metric for Echo Impairment, arXiv:2110.03010, 2021, accessed 2026-05-31.
  https://arxiv.org/abs/2110.03010. Primary academic source for AECMOS, a neural, no-reference perceptual metric that scores echo annoyance separately from other degradation and correlates with human ratings, replacing intrusive measures like ERLE/PESQ for ranking cancellers. Per the standards/primary-source-first rule, this controls how this article frames "how to measure a canceller".
- Benesty, J., Paleologu, C., Gänsler, T., Ciochină, S. (and the optimized-NLMS survey literature). A Perspective on Stereophonic Acoustic Echo Cancellation and the NLMS adaptive-filter / double-talk literature, EURASIP/Springer, 2015, accessed 2026-05-31.
  https://asp-eurasipjournals.springeropen.com/articles/10.1186/s13634-015-0283-1. Academic source for the classical adaptive-filter mechanics: NLMS update rule, energy normalisation, and why double-talk corrupts the filter and must be detected and frozen.
- VOCAL Technologies. Acoustic Echo Canceller (technical reference), accessed 2026-05-31.
  https://vocal.com/echo-cancellation/acoustic-echo-canceller/. Vendor-deployer source for the production structure of a classical canceller: adaptive filter, double-talk detector that freezes adaptation, and a residual/nonlinear-processor stage that attenuates leftover echo by estimated band energy.
- Xiph.Org. SpeexDSP (libspeexdsp/mdf.c, echo canceller source), accessed 2026-05-31.
  https://github.com/xiph/speexdsp. First-party open-source source for the frequency-domain Multidelay Block (MDF) adaptive filter, the canonical free, permissively licensed classical AEC implementation, adapting each frequency band independently.
- BlogGeek.me. AEC (Acoustic Echo Cancellation), WebRTC glossary, accessed 2026-05-31.
  https://bloggeek.me/webrtcglossary/aec/. Vendor-deployer / educational source for WebRTC AEC3, the third-generation echo canceller shipped in Chrome and every Chromium browser, as the free, production-grade classical baseline that echoCancellation: true activates.


