Published 2026-06-02 · 17 min read · By Nikolay Sapunov, CEO at Fora Soft
Why This Matters
If your product carries a live two-way conversation in the browser — video conferencing, telemedicine, a virtual classroom, an AI voice agent — echo is the defect a user notices in the first ten seconds and blames on you. WebRTC, the technology that lets browsers send audio and video to each other directly, already includes a capable echo canceller for free, so most teams never need to build one. The trap is that "it's built in" hides three ways to get it wrong: feeding it the wrong signal, stacking a second canceller on top, and assuming a browser fix works the same on a phone. This article is for the product manager, founder, or engineering lead who has to ship clean calls and talk to engineers about why a call has echo. By the end you will understand how WebRTC cancels echo, the handful of switches that control it, why mobile is a different story, and the specific moment when a paid or open AI canceller earns its place. For the underlying theory of how any echo canceller works — adaptive filters, double-talk, the AI hybrid — read the companion article on echo cancellation fundamentals; this one is about making it work inside a real WebRTC call.
First, What Echo Is In A WebRTC Call
Before placing the canceller, anchor the problem it solves. The sound that loops back from a loudspeaker into a microphone during a two-way call is called acoustic echo. When the far-end person's voice plays out of your laptop speaker, it crosses the room, and your microphone catches it along with your own voice. Without cleanup, that captured copy gets sent back, and the far-end person hears themselves a fraction of a second late — which makes normal conversation almost impossible.
Two words will repeat through this article, so define them once. The "far-end" signal is the audio coming from the other person — what your device is about to play out of its speaker. The "near-end" signal is what your own microphone records — your voice plus, unfortunately, the echo of the far-end audio bouncing around your room. The whole job of echo cancellation is to remove the far-end echo from the near-end microphone signal while leaving your real voice untouched.
There is a second kind of echo worth naming so we can set it aside. Line echo, also called network echo, happens inside the telephone network on the electrical path, and it is the subject of the old telecom standard ITU-T G.168. That standard explicitly does not cover acoustic echo — the room-and-speaker kind — which is what every browser call deals with. In WebRTC, you care about acoustic echo, and the relevant terminal-quality standard is ITU-T P.340 for hands-free devices. Keep the two separate: the network kind is someone else's problem; the room kind is yours.
The One Signal That Makes Cancellation Possible
Here is the single most important idea in this whole topic, and the one most echo bugs come back to. To remove the echo, the canceller must be given a clean copy of the far-end audio — the exact sound that went to the speaker — as a separate input. This copy is called the reference signal, or in WebRTC's own code, the "render" signal. Think of it like a noise-cancelling headphone that is allowed to listen to the music it is trying to cancel: knowing the original is what makes subtraction possible.
The canceller works by predicting the echo from that reference and subtracting it. It builds a model of how your room changes the far-end sound on the way from speaker to microphone — the delay, the bounce off walls, the muffling — and uses that model to guess what the echo in your microphone will look like. Then it subtracts the guess from the microphone signal. Whatever is left should be only your voice.
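The predict-and-subtract loop can be sketched in a few lines. This is a minimal illustration using a time-domain NLMS (normalised least-mean-squares) filter, a textbook technique chosen here for readability; it is not AEC3's actual implementation, and the names `createNlmsCanceller`, `demoNlms`, the step size, and the simulated two-tap "room" are all assumptions made for the example.

```javascript
// Time-domain NLMS echo canceller sketch (illustrative only, not AEC3's code).
function createNlmsCanceller(taps, stepSize = 0.5, epsilon = 1e-6) {
  const weights = new Float64Array(taps); // current model of the speaker-to-mic path
  const history = new Float64Array(taps); // recent reference samples, newest first

  // `reference` is the far-end sample sent to the speaker;
  // `mic` is the microphone sample captured at the same moment.
  return function cancelSample(reference, mic) {
    // Shift the history by one and store the newest reference sample at index 0.
    history.copyWithin(1, 0, taps - 1);
    history[0] = reference;

    // Predict the echo as a weighted sum of recent reference samples.
    let predictedEcho = 0;
    let energy = epsilon;
    for (let i = 0; i < taps; i++) {
      predictedEcho += weights[i] * history[i];
      energy += history[i] * history[i];
    }

    // Subtract the prediction; what is left is near-end voice plus residual echo.
    const residual = mic - predictedEcho;

    // Nudge the model in the direction that would have reduced the residual.
    const gain = (stepSize * residual) / energy;
    for (let i = 0; i < taps; i++) {
      weights[i] += gain * history[i];
    }
    return residual;
  };
}

// Demo with a made-up room: the echo is two delayed, attenuated copies of the reference.
function demoNlms() {
  let state = 12345; // deterministic noise so every run behaves the same
  const nextNoise = () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 2147483648 - 1; // roughly uniform in [-1, 1)
  };

  const cancel = createNlmsCanceller(8);
  const reference = [];
  const total = 4000;
  const tail = 500; // measure only after the filter has had time to converge
  let echoPower = 0;
  let residualPower = 0;

  for (let n = 0; n < total; n++) {
    reference.push(nextNoise());
    const echo = 0.6 * (reference[n - 2] ?? 0) + 0.3 * (reference[n - 5] ?? 0);
    const residual = cancel(reference[n], echo); // no near-end voice in this demo
    if (n >= total - tail) {
      echoPower += echo * echo;
      residualPower += residual * residual;
    }
  }
  return { echoPower, residualPower };
}
```

Calling `demoNlms()` returns the echo power and the leftover power over the final 500 samples; with no near-end speech in the simulation, the leftover should be a tiny fraction of the original echo. A real room needs far more taps and has to cope with the near-end voice, which is where the later stages come in.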
This is why one wiring mistake is fatal. The reference must be the far-end audio before it is mixed with anything else — before your own microphone, before any music or sound effect your app plays. If the cancelled output accidentally gets fed back in as the reference, the canceller starts chasing its own tail and the audio falls apart. In production, when a WebRTC call has stubborn echo, the cause is far more often a broken reference wire than a weak algorithm.
Figure 1. Where echo cancellation sits in a WebRTC call: the far-end audio is tapped as the reference signal feeding the canceller, while the microphone path runs through the Audio Processing Module before encoding.
Where The Canceller Lives: WebRTC's Audio Processing Module
WebRTC does not treat echo cancellation as a standalone box. It lives inside a bundle of audio cleanup called the Audio Processing Module, usually shortened to APM. The APM is the part of WebRTC that takes raw microphone audio and makes it call-ready, and it runs three jobs in a fixed order: first echo cancellation, then noise suppression, then automatic gain control, which evens out how loud you sound.
The order is deliberate, and it matters for a reason worth understanding. Echo cancellation goes first because it needs the microphone signal in its most original state to line it up against the reference. If noise suppression ran first and altered the sound, the canceller's room model would be working against a moving target. So echo cancellation reads the rawest audio, removes the echo, and only then does the noise suppressor clean up hiss and the gain control set the level. This is also why the rule from our WebRTC noise suppression article matters here: anything you insert into the audio path has to respect this ordering, or you fight the built-in chain.
Because echo cancellation is the APM's first and most fragile stage, it is also the one most affected by where your audio comes from and what else touches it. The browser hands the APM the microphone capture and the playback reference, and the APM hands back clean audio for the Opus encoder — Opus being the standard voice codec WebRTC uses. The IETF specification that governs WebRTC audio, RFC 7874, recommends that implementations apply acoustic echo cancellation precisely because a browser cannot assume the user is wearing a headset. That recommendation is why the canceller is on by default, which we will see next.
AEC3: The Canceller Inside Every Browser
The specific echo canceller WebRTC ships today is called AEC3 — the "3" marks it as the third design, which replaced an older desktop canceller and a separate lightweight mobile one called AECM around 2017–2018. It runs in Chrome, Edge, and any application built on the WebRTC library, which means it has been tested across an enormous number of real calls. You almost never write AEC3 yourself; you turn it on and feed it correctly.
AEC3 cleans audio in five stages, and knowing them turns vague echo complaints into specific diagnoses. The first stage is delay estimation: before it can subtract anything, AEC3 has to figure out the time gap between the reference audio and the echo arriving at the microphone. That gap comes from playback buffers, the speaker, the air in the room, and the microphone — on a phone it can range from twenty to two hundred milliseconds and even drift during the call. AEC3 estimates it continuously by matching the two signals against each other, and if it locks onto the wrong gap, nothing downstream can work.
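To make "matching the two signals against each other" concrete, here is a minimal sketch of delay estimation by brute-force cross-correlation. It is a simplified stand-in, not AEC3's estimator; the function names, the 16 kHz sample rate, and the 37-sample simulated delay are assumptions chosen for the example.

```javascript
// Deterministic noise source so the demo gives the same result on every run.
function makeNoise(seed) {
  let state = seed;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 2147483648 - 1; // roughly uniform in [-1, 1)
  };
}

// Find the lag (in samples) at which the microphone signal best matches
// the reference, by trying every lag from 0 to maxLag.
function estimateDelaySamples(reference, mic, maxLag) {
  const length = Math.min(reference.length, mic.length);
  let bestLag = 0;
  let bestScore = -Infinity;
  for (let lag = 0; lag <= maxLag; lag++) {
    let score = 0;
    for (let n = lag; n < length; n++) {
      score += mic[n] * reference[n - lag];
    }
    if (score > bestScore) {
      bestScore = score;
      bestLag = lag;
    }
  }
  return bestLag;
}

// Demo: the "microphone" hears the reference 37 samples late at half amplitude.
function demoDelayEstimate() {
  const noise = makeNoise(42);
  const trueDelay = 37;
  const sampleRate = 16000; // assumed sample rate for the millisecond conversion
  const reference = Array.from({ length: 2000 }, () => noise());
  const mic = reference.map((_, n) => (n >= trueDelay ? 0.5 * reference[n - trueDelay] : 0));
  const lag = estimateDelaySamples(reference, mic, 100);
  return { lag, milliseconds: (lag / sampleRate) * 1000 };
}
```

A production estimator has to do this continuously, cheaply, and in the presence of near-end speech and noise, which is why a wrong lock is possible on real devices even though the idea is this simple.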
The second stage is the linear adaptive filter, which does the heavy lifting. "Adaptive" means it keeps adjusting a model of the room as the room changes; "linear" means it handles the well-behaved part of the echo. AEC3 runs this filter in the frequency domain in small blocks of sixty-four audio samples (four milliseconds each at a 16 kHz sample rate), which is both faster to compute and quicker to react than processing one sample at a time. A good linear filter removes roughly twenty to forty decibels of echo, a figure measured by a metric we will define shortly.
The third stage is double-talk detection, which guards against the hardest case in all of echo cancellation: both people talking at once. When only the far-end person speaks, the filter can safely learn from the microphone signal. But when you speak at the same time, your voice contaminates what the filter is trying to learn, and if it adapts to your voice it breaks. AEC3 watches for double-talk and freezes the filter's learning during it, then resumes once you stop. The fourth stage, residual echo suppression, is a safety net that turns down whatever faint echo the linear filter could not model — the distortion a cheap speaker adds at high volume, for instance. The fifth stage, comfort noise, fills the tiny silences that aggressive suppression creates with a soft synthetic hiss, so the call does not sound like it keeps cutting out.
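The "freeze learning during double-talk" idea can be illustrated with the classic Geigel detector, a textbook method that is much simpler than whatever AEC3 does internally. The sketch below is an assumption-laden illustration: the function name, the 0.5 threshold (which presumes the room attenuates the echo by at least half), and the hangover parameter are choices made for the example.

```javascript
// Geigel-style double-talk detector sketch (a textbook method, not AEC3's detector).
// It flags double-talk when the microphone is louder than any echo of the
// recent far-end signal could plausibly be.
function createGeigelDetector(windowSize, threshold = 0.5, hangoverSamples = 0) {
  const recent = new Float64Array(windowSize); // circular buffer of |far-end| samples
  let writeIndex = 0;
  let hangover = 0; // samples left to keep reporting double-talk after a detection

  return function isDoubleTalk(farEnd, mic) {
    recent[writeIndex] = Math.abs(farEnd);
    writeIndex = (writeIndex + 1) % windowSize;

    // Loudest far-end sample in the recent window.
    let peak = 0;
    for (let i = 0; i < windowSize; i++) {
      if (recent[i] > peak) peak = recent[i];
    }

    if (Math.abs(mic) > threshold * peak) {
      hangover = hangoverSamples;
      return true; // near-end speech is present: freeze filter adaptation
    }
    if (hangover > 0) {
      hangover--;
      return true; // still inside the hold period after a detection
    }
    return false; // far-end only: safe to keep adapting
  };
}

// Usage: skip the adaptive filter's weight update whenever this returns true.
const detectDoubleTalk = createGeigelDetector(4);
const quietMic = detectDoubleTalk(1.0, 0.3); // false: mic level is consistent with echo alone
const loudMic = detectDoubleTalk(1.0, 0.9);  // true: mic is too loud to be echo alone
```

The hangover exists because speech has brief dips; holding the "frozen" state for a short time after a detection avoids adapting during the gaps between syllables.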
Figure 2. The five stages of WebRTC AEC3: delay estimation, the linear adaptive filter, double-talk detection, residual echo suppression, and comfort noise generation.
A Number You Will Actually Use: ERLE
Engineers measure an echo canceller with a number called ERLE, which stands for Echo Return Loss Enhancement. In plain terms, it is how many times quieter the echo is after cancellation than before, written in decibels. Decibels compress big ratios into small numbers, so a little arithmetic makes it concrete.
Suppose the echo arriving at the microphone carries a certain amount of power, and after AEC3 runs, the leftover echo carries one ten-thousandth of that power. The ratio is 10,000 to 1. ERLE turns that ratio into decibels with a standard formula:
ERLE = 10 × log10(echo power before ÷ echo power after)
ERLE = 10 × log10(10,000)
ERLE = 10 × 4 = 40 dB
So a canceller that cuts echo power by a factor of 10,000 scores 40 dB of ERLE, the top of the range a good linear filter reaches. The practical value of this number is debugging. If you log AEC3's ERLE and it sits below 10 dB, the filter is not converging, usually because the delay estimate is wrong or the reference wire is broken. A healthy call shows ERLE climbing to the twenties or thirties within a second or two of the call starting. When someone says "the AI just isn't good enough", an ERLE reading often shows the real problem is timing, not intelligence.
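The same arithmetic, written as a small helper you could drop into a test harness. This is a sketch: the function names are hypothetical, and the "marginal" band between 10 and 20 dB is an assumption added for the example, since the text above only characterises readings below 10 dB and in the twenties or thirties.

```javascript
// Mean power of a block of audio samples.
function meanPower(samples) {
  let sum = 0;
  for (const sample of samples) sum += sample * sample;
  return sum / samples.length;
}

// ERLE in decibels from echo power before and after cancellation.
function erleDb(echoPowerBefore, echoPowerAfter) {
  return 10 * Math.log10(echoPowerBefore / echoPowerAfter);
}

// Rough reading of an ERLE value using the thresholds discussed above.
// The middle band is an assumption made for this sketch.
function interpretErle(erle) {
  if (erle < 10) return 'not converging: check the delay estimate and the reference wiring';
  if (erle < 20) return 'marginal';
  return 'healthy';
}

// The worked example from the text: a 10,000-to-1 power reduction.
const workedExample = erleDb(10000, 1);
```

In a real measurement you would compute `meanPower` over the microphone signal during far-end-only speech, once before and once after the canceller, and feed both values to `erleDb`.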
The Browser Switches You Control
For a web app, almost all of this is governed by one line. When you ask the browser for microphone access through getUserMedia — the standard web call that grants a page the microphone — you pass a constraint called echoCancellation. The W3C Media Capture and Streams specification defines it, and it is set to true by default in Chrome, Firefox, and Safari. Leaving it on gives you AEC3 (or the system's own canceller) with no extra work.
```javascript
// Ask for the microphone with echo cancellation on (this is the default).
const stream = await navigator.mediaDevices.getUserMedia({
  audio: { echoCancellation: true, noiseSuppression: true }
});
```
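A constraint is a request, not a guarantee, so it can be worth reading back what the browser actually applied. The sketch below uses the standard `MediaStreamTrack.getSettings()` call; the helper name `describeEchoSettings` is hypothetical, and the browser usage is shown in comments because it needs a page with microphone permission.

```javascript
// Report the audio-processing settings the browser actually applied to a track.
// `track` is a MediaStreamTrack (or any object with a compatible getSettings()).
function describeEchoSettings(track) {
  const settings = track.getSettings();
  return {
    echoCancellation: settings.echoCancellation,
    noiseSuppression: settings.noiseSuppression,
    autoGainControl: settings.autoGainControl,
  };
}

// Browser usage (inside an async function, with microphone permission):
//
//   const stream = await navigator.mediaDevices.getUserMedia({
//     audio: { echoCancellation: true, noiseSuppression: true },
//   });
//   const [track] = stream.getAudioTracks();
//   console.log(describeEchoSettings(track));
```

Logging this once per call gives support teams a quick answer to "was echo cancellation even on?" before anyone starts debugging the algorithm.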
There is a second, lesser-known switch that decides which canceller runs. Chrome added a constraint called echoCancellationType that can be set to browser or system. "Browser" means AEC3, the software canceller inside Chrome. "System" means hand the job to the operating system's own canceller — the Voice Capture DSP on Windows, for example, which sits physically closer to the audio hardware. By default Chrome uses the system or hardware canceller when a good one exists and falls back to AEC3 otherwise, so you rarely set this by hand. It matters mainly when you are debugging a device whose hardware canceller behaves badly and you want to force the predictable software path.
The one switch you must respect, rather than set, is consistency. If you turn echoCancellation on in the browser and also run a second canceller of your own in the audio path, you double-process the sound and usually make it worse — the same stacking trap that ruins noise suppression. The rule is one canceller in the chain, chosen deliberately. We will return to this when we discuss adding an AI canceller.
Mobile Is A Different Story: The Phone Does The Work
A fact that surprises many teams: on a phone, the echo canceller you actually use is usually not AEC3 but the phone's own hardware. Both Apple and Google build acoustic echo cancellation into the operating system, tuned for that exact device's speaker and microphone, and a tuned hardware canceller generally beats a generic software one because it knows the hardware it is correcting.
On iPhone and iPad, the relevant piece is a system audio component Apple calls the Voice-Processing I/O unit. When an app sets its audio session to the "voice chat" mode, iOS switches on this unit, which performs echo cancellation, noise suppression, and gain control together, tuned per device. The cost is a constraint: that mode assumes a phone-call-style setup, so an app that needs unusual audio routing — music mixed with voice, for instance — may find the system canceller fights it, and has to choose between the tuned hardware path and full control.
On Android the picture is the same shape but messier. Android exposes a class called AcousticEchoCanceler that switches on the device maker's built-in canceller — but the quality of that canceller varies enormously across the thousands of Android models in the wild. Some flagship phones cancel echo beautifully; some budget phones barely do. Because of this inconsistency, many serious voice apps ignore the device canceller on Android and run AEC3 in software instead, accepting a little more battery use in exchange for the same behaviour on every phone. The deciding factor is audio latency: a flagship phone might play and re-capture sound in ten milliseconds, a cheap one in over a hundred, and AEC3's delay estimator has to absorb that swing.
When To Reach Past The Built-In Canceller
AEC3 and the phone's hardware canceller handle the great majority of calls, so the honest default is to ship them and move on. There is a real set of cases, though, where they leave audible echo, and those cases share a cause: non-linear distortion. The classical filter assumes the echo is a tidy, predictable transformation of the reference. Cheap or loud speakers break that assumption — they distort the sound in ways no linear model can subtract — and so do hard double-talk and very reverberant rooms. This is exactly where a neural canceller earns its place.
A neural, or AI, echo canceller replaces or augments the classical filter with a small trained network that has learned, from thousands of hours of real echoey audio, to recognise and remove the messy echo the math leaves behind. The research community has pushed this hard through an annual contest, the ICASSP Acoustic Echo Cancellation Challenge, which ran in 2021, 2022, and 2023 and set the modern benchmarks. One standout result, Microsoft's DeepVQE, is a single real-time network that does echo cancellation, noise suppression, and de-reverberation at once, and was reportedly tested for Microsoft Teams, a sign that neural cancellers have moved from papers into shipping products. Newer work pushes toward tiny, efficient models that can run on a phone without draining it.
In practice you do not train one of these yourself; you buy or borrow it. Krisp, the most widely deployed voice-AI vendor, ships echo cancellation alongside its noise removal in an SDK that plugs into a WebRTC pipeline through WebAssembly and runs on the user's own device. NVIDIA's Maxine — renamed AI for Media in 2026 — includes an acoustic echo cancellation effect that runs on NVIDIA graphics cards, with a Windows build aimed at the user's machine and a Linux build aimed at your servers, which is how a cloud meeting product like Tencent Meeting uses it. For the full comparison of these vendors and the build-versus-buy decision, see our Krisp, Maxine, and Dolby article; for the open-model end of the spectrum, the noise suppression model deep-dive covers the same families that increasingly bundle echo handling.
Where An AI Canceller Attaches In WebRTC — And Why It Is Awkward
Deciding to use a neural canceller is the easy half; wiring it into WebRTC is the hard half, and it is worth being honest about why. The browser's AEC3 is buried inside the APM, before your code ever sees the audio, and you cannot simply swap a different model into that slot from JavaScript. So a custom canceller has to attach somewhere you can reach, and there are only two such places.
The first is on the user's device, in front of the browser's own processing. Using the Web Audio API's AudioWorklet — a standard mechanism for running your own audio code off the main thread — a vendor SDK like Krisp's processes the raw microphone audio and the playback reference itself, then you turn the browser's own echoCancellation off so the two do not stack. This is the on-device path, and its catch is that you must now supply the reference signal correctly yourself, the same wiring discipline AEC3 normally handles for you.
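The shape of that on-device path can be sketched with a bare AudioWorklet processor that receives the microphone on one input and the reference on another. This is a skeleton under stated assumptions, not any vendor's SDK: the processor name `echo-canceller`, the class name, and the `cancelBlock` hook are hypothetical, and the hook is a pass-through placeholder where a real canceller would go.

```javascript
// Fall back to an empty base class so this file also loads outside an
// AudioWorkletGlobalScope (for example in a unit test).
const WorkletBase =
  typeof AudioWorkletProcessor !== 'undefined' ? AudioWorkletProcessor : class {};

class EchoCancellerProcessor extends WorkletBase {
  constructor(options) {
    super(options);
    // Hypothetical hook: a real canceller would use `reference` to remove echo
    // from `mic`. This placeholder just copies the microphone through.
    this.cancelBlock = (mic, reference, out) => out.set(mic);
  }

  // Called by the audio engine once per render block.
  process(inputs, outputs) {
    const mic = inputs[0]?.[0];       // input 0, channel 0: raw microphone
    const reference = inputs[1]?.[0]; // input 1, channel 0: far-end audio bound for the speaker
    const out = outputs[0]?.[0];
    if (!mic || !out) return true;    // nothing connected yet

    if (reference) {
      this.cancelBlock(mic, reference, out);
    } else {
      out.set(mic);                   // no far-end audio, so there is no echo to remove
    }
    return true;                      // keep the processor alive
  }
}

if (typeof registerProcessor === 'function') {
  registerProcessor('echo-canceller', EchoCancellerProcessor);
}

// Main-thread wiring (browser only), assuming `micSource` and `referenceSource`
// are existing AudioNodes and the mic was captured with echoCancellation: false:
//
//   await audioContext.audioWorklet.addModule('echo-canceller.js');
//   const node = new AudioWorkletNode(audioContext, 'echo-canceller', {
//     numberOfInputs: 2,
//     numberOfOutputs: 1,
//   });
//   micSource.connect(node, 0, 0);       // microphone -> input 0
//   referenceSource.connect(node, 0, 1); // far-end audio -> input 1
```

The two `connect` calls are the "reference wire" in code form: if input 1 carries anything other than the unmixed far-end audio, the canceller has nothing valid to subtract.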
The second place is on the server. In a group call, audio flows through a media server called an SFU, which forwards each person's stream to the others; it is the same server-side layer where the broader WebRTC + AI integration APIs live. A neural canceller running there (NVIDIA Maxine's Linux build, or a managed model offered by a platform like LiveKit) cleans the audio centrally, which is the natural home when callers dial in from phone lines or devices you do not control and cannot run code on. The trade is cost and latency: server-side cleaning runs on hardware you pay for and adds a hop, and the added delay must fit inside the roughly 150-millisecond one-way mouth-to-ear target that ITU-T G.114 sets for a live conversation. Whichever place you pick, the iron rule holds: exactly one canceller in the chain. Turn the others off.
A Common Mistake: Two Cancellers Fighting
The most frequent self-inflicted echo bug in WebRTC is not too little cancellation but too much — two cancellers running on the same audio. It happens naturally because each layer defaults to "on". The browser turns on AEC3. The phone turns on its hardware canceller. A developer adds Krisp for good measure. Now three models are each trying to model and subtract the echo, and because each one alters the audio the next one sees, they confuse each other.
The symptom is distinctive and easy to misread. Instead of clean two-way audio, the call goes half-duplex — it behaves like a walkie-talkie, cutting off your voice whenever the other person speaks, because the stacked suppressors are collectively too aggressive and treat your speech as more echo to remove. Teams often respond by adding yet another layer to "fix" it, which makes it worse. The fix is subtraction: turn off every canceller but one. If you adopt a vendor SDK, set the browser's echoCancellation to false. If you trust the phone's hardware, do not also run AEC3. One canceller, chosen on purpose, beats three fighting by default.
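One way to make "one canceller, chosen on purpose" hard to get wrong in a web app is to derive the microphone constraints from a single explicit choice. The sketch below is an illustration only: the function name and the labels `'browser'`, `'vendor-sdk'`, and `'server'` are hypothetical, and it only manages the `echoCancellation` flag.

```javascript
// The places a canceller can run for a web client (labels are hypothetical).
const CANCELLER_CHOICES = ['browser', 'vendor-sdk', 'server'];

// Build getUserMedia constraints so the browser's canceller is on only
// when it is the one canceller you chose.
function buildAudioConstraints(chosenCanceller) {
  if (!CANCELLER_CHOICES.includes(chosenCanceller)) {
    throw new Error(`Unknown canceller choice: ${chosenCanceller}`);
  }
  return {
    audio: { echoCancellation: chosenCanceller === 'browser' },
  };
}

// Example: adopting an on-device vendor SDK means the browser canceller goes off.
const sdkConstraints = buildAudioConstraints('vendor-sdk');
// Browser usage: navigator.mediaDevices.getUserMedia(sdkConstraints)
```

Keeping the choice in one place means a later "let's also add X" change has to go through this function, where the conflict is visible.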
Figure 3. The double-processing trap: every layer defaults to on, so the safe pattern is to run exactly one canceller and disable the rest.
The Options Side By Side
With each option placed, the comparison reads as a set of trade-offs rather than a ranking. The right choice depends on where your audio comes from and how much control you need, not on which one is "best".
| Option | Where it runs | What it costs | Best fit | Watch out for |
|---|---|---|---|---|
| Browser AEC3 (echoCancellation: true) | In the browser, on the user's device | Free, built in | The default for any browser call | Misaligned reference; non-linear echo |
| System / hardware AEC (echoCancellationType: 'system') | OS / device hardware | Free, built in | Desktops with good hardware AEC | Quality varies by machine |
| iOS Voice-Processing I/O | iPhone / iPad hardware | Free, built in | Native iOS voice apps | "Voice chat" mode limits routing |
| Android AcousticEchoCanceler | Device hardware | Free, built in | Native apps on good devices | Wildly inconsistent across models |
| AEC3 in software on mobile | Your app, on device | Engineering + battery | Consistency across Android | More CPU than hardware path |
| Krisp / on-device neural SDK | User's device (WASM) | Annual SDK licence | Cross-platform, non-linear echo, privacy | Must wire the reference yourself |
| NVIDIA Maxine AEC | Your NVIDIA GPUs (server or client) | GPU + licence | GPU-served cloud meetings, broadcast | Cost scales with concurrent streams |
Read the table by your constraints. If you are in a browser and the calls are ordinary, the first row is your answer and you are done. If you are on iOS, the system already does the right thing. If you are on Android and quality is uneven, software AEC3 buys consistency. Only when classical cancellation leaves audible echo — loud or cheap speakers, heavy double-talk, reverberant rooms — does a neural canceller justify its licence and wiring.
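The reading rule above is simple enough to write down as a function, which can be handy as a checklist in a design review. This is only a restatement of the paragraph as code; the function name, the option names, and the returned labels are hypothetical.

```javascript
// Encode the decision rule from the paragraph above.
// `platform` is 'browser', 'ios', or 'android'.
// `unevenDeviceQuality` applies to Android fleets with inconsistent hardware AEC.
// `residualEchoAfterTuning` means classical cancellation still leaves audible echo
// after wiring and timing have been checked.
function recommendCanceller({ platform, unevenDeviceQuality = false, residualEchoAfterTuning = false }) {
  if (residualEchoAfterTuning) {
    return 'neural canceller';
  }
  switch (platform) {
    case 'browser':
      return 'browser AEC3';
    case 'ios':
      return 'iOS Voice-Processing I/O';
    case 'android':
      return unevenDeviceQuality ? 'software AEC3' : 'Android AcousticEchoCanceler';
    default:
      throw new Error(`Unknown platform: ${platform}`);
  }
}

const ordinaryBrowserCall = recommendCanceller({ platform: 'browser' });
```

Note the order of the checks: the neural option is considered only once the caller asserts that tuning has already been tried, which mirrors the article's advice to treat stubborn echo as a wiring problem first.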
Figure 4. A decision path: start with the free built-in canceller, treat stubborn echo as a wiring problem, and reach for a neural model only for non-linear echo.
Where Fora Soft Fits In
We build the live-video products that depend on clean two-way audio — video conferencing platforms, telemedicine consultations where a doctor must catch every word, e-learning classrooms, and AI voice agents — and echo is one of the first things we tune on any of them. The pattern we apply is the one this article argues for, settled in order: start with the browser's free AEC3, verify it with a real measurement rather than a single test call, and treat stubborn echo as a wiring-and-timing investigation before reaching for a different model. On mobile we decide per platform — trusting iOS's tuned hardware, and choosing between Android's device canceller and software AEC3 based on the devices a product actually targets. We move to a neural canceller only when non-linear echo survives the classical one, and when we do, we hold the line on a single canceller in the chain so a well-meaning second layer never turns a clean call into a walkie-talkie.
What To Read Next
- Echo cancellation — classical AEC + AI hybrid
- Real-time noise suppression in WebRTC — RNNoise, DeepFilterNet, and Krisp integration
- Krisp, NVIDIA Maxine, and Dolby — build vs buy for real-time voice enhancement
References
- Switchboard (Synervoz). Acoustic Echo Cancellation: How WebRTC AEC3 Works, accessed 2026-06-02. https://switchboard.audio/hub/how-webrtc-aec3-works/. Engineering deep-dive used for the AEC3 pipeline (delay estimation, partitioned-block frequency-domain adaptive filter, double-talk detection, residual echo suppression, comfort noise), the 64-sample/4 ms block size, the 20–40 dB linear-filter ERLE range, the 1–2 s convergence time, the 20–200 ms mobile delay range, the AEC3 → replaced AEC/AECM (~2017–2018) lineage, iOS Voice-Processing / AVAudioSession.voiceChat, Android AcousticEchoCanceler variability, and the reference-signal wiring and double-talk failure modes.
- WebRTC Project (Google). WebRTC AEC3 source: modules/audio_processing/aec3/, accessed 2026-06-02. https://webrtc.googlesource.com/src/+/refs/heads/main/modules/audio_processing/aec3/. First-party source for AEC3's location in the codebase, echo_canceller3.cc as the orchestrating entry point, and the C++ with SSE2/NEON SIMD implementation. Used to ground the claim that AEC3 is the production canceller inside the WebRTC library.
- W3C. Media Capture and Streams (Recommendation), accessed 2026-06-02. https://www.w3.org/TR/mediacapture-streams/. Primary standards source for getUserMedia and the echoCancellation constrainable property: the default-on switch that enables WebRTC echo cancellation and the switch you must disable to avoid double-processing with an added canceller.
- Chrome for Developers. More native echo cancellation, 2018-06-19, accessed 2026-06-02. https://developer.chrome.com/blog/more-native-echo-cancellation. First-party source for the echoCancellationType constraint (browser vs system), Chrome's default selection of hardware/system AEC when available with fallback to the software canceller, and the Windows Voice Capture DSP / macOS native-AEC behaviour.
- IETF. RFC 7874: WebRTC Audio Codec and Processing Requirements, May 2016, accessed 2026-06-02. https://www.rfc-editor.org/rfc/rfc7874. Primary standards source for the mandatory WebRTC audio codecs (Opus, G.711) and for the recommendation that WebRTC implementations apply acoustic echo cancellation as part of audio processing; the basis for AEC being on by default and running before the encoder.
- ITU-T. Recommendation G.168: Digital network echo cancellers (2015, Cor.1 12/2022), accessed 2026-06-02. https://www.itu.int/rec/T-REC-G.168. Primary standards source establishing the distinction between line/network echo cancellation (G.168's scope) and acoustic echo cancellation, which G.168 explicitly does not cover; used to separate the telecom-network echo problem from the room-acoustic echo problem WebRTC solves.
- ITU-T. Recommendation P.340: Transmission characteristics and speech quality parameters of hands-free terminals, accessed 2026-06-02. https://www.itu.int/rec/T-REC-P.340. Primary standards source for acoustic-echo and speech-quality requirements of hands-free terminals; the standards-body reference point for the acoustic (room-and-speaker) echo that WebRTC AEC addresses, as distinct from G.168's network echo.
- ITU-T. Recommendation G.131: Talker echo and its control, accessed 2026-06-02. https://www.itu.int/rec/T-REC-G.131. Primary standards source for why echo becomes intolerable as round-trip delay grows; the human-perception basis for treating echo as a defect that must be cancelled in real-time calls.
- ITU-T. Recommendation G.114: One-way transmission time, accessed 2026-06-02. https://www.itu.int/rec/T-REC-G.114. Primary standards source for the ≤150 ms one-way mouth-to-ear delay target that any added (e.g. server-side neural) canceller's latency must fit inside.
- Ristea, Saabas, Cutler, et al. (Microsoft). DeepVQE: Real-Time Deep Voice Quality Enhancement for Joint Acoustic Echo Cancellation, Noise Suppression and Dereverberation, Interspeech 2023, accessed 2026-06-02. https://www.isca-archive.org/interspeech_2023/ristea23_interspeech.pdf. Peer-reviewed source for the joint neural AEC+NS+dereverberation model, its state-of-the-art results on the ICASSP 2023 AEC Challenge test sets, real-time operation, and reported testing for Microsoft Teams.
- Cutler, et al. (Microsoft). ICASSP 2023 Acoustic Echo Cancellation Challenge, IEEE ICASSP 2023, accessed 2026-06-02. https://www.researchgate.net/publication/366205532_ICASSP_2023_ACOUSTIC_ECHO_CANCELLATION_CHALLENGE. Source for the ICASSP AEC Challenge series (2021–2023) that established modern neural-AEC benchmarks and datasets.
- NVIDIA. About the Acoustic Echo Cancellation Effect, NVIDIA Audio Effects (AFX) SDK User Guide, accessed 2026-06-02. https://docs.nvidia.com/maxine/afx/latest/AboutTheEffects/AboutAcousticEchoCancellation.html. First-party source for Maxine / AI for Media's acoustic echo cancellation effect, the Windows (client-side) vs Linux (server-side/datacenter) SDK split, and the GPU-based deployment model.
- Krisp. Web Browser SDK, Developer Hub, accessed 2026-06-02. https://sdk-docs.krisp.ai/docs/introduction. First-party source for Krisp's JavaScript SDK integrating with WebRTC and Web Audio as an audio filter via WebAssembly, on the user's device, offering noise and echo cancellation across Windows, macOS, Linux, Web, iOS, and Android.
- LiveKit. Noise & echo cancellation (documentation), accessed 2026-06-02. https://docs.livekit.io/transport/media/noise-cancellation/. Vendor source for server-side managed neural models (Krisp, ai-coustics) on a real-time platform and the explicit rule to run at most one client-side and one server-side processor to avoid over-suppression.
- Apple. What's new in voice processing, WWDC23, accessed 2026-06-02. https://developer.apple.com/videos/play/wwdc2023/10235/. First-party source for the iOS/macOS Voice-Processing I/O audio unit, its built-in echo cancellation tuned per device, and the AVAudioSession.voiceChat mode that enables it.
- Android Developers. AcousticEchoCanceler, API reference, accessed 2026-06-02. https://developer.android.com/reference/android/media/audiofx/AcousticEchoCanceler. First-party source for Android's platform AEC effect, its availability being device-dependent (supported only if the device implements AEC), and the basis for the cross-device-variability point.


