Why this matters

If you build a telemedicine platform, a video conferencing tool, a contact centre, an online classroom, or any product where two people talk over the internet, echo is the defect your users notice first and forgive least. A frozen video frame is annoying; hearing your own voice bounce back half a second later makes a conversation physically impossible. This article is for the product manager, founder, or operations lead who needs to understand echo cancellation well enough to read a bug report, ask an engineer the right question, and set realistic expectations with a customer. A senior engineer will also find every claim traced to the relevant ITU-T Recommendation or the WebRTC source. By the end you will be able to explain to a colleague why echo happens, how the canceller removes it, and why it sometimes fails for exactly one second after someone moves their laptop.


The problem: why a room creates an echo

Start with the physical situation, because the whole solution follows from it. You are on a call. The other person — the far end — speaks. Their voice arrives at your device as a signal, your device plays it through the speaker, and it fills the air in your room. That is normal and necessary; you need to hear them.

Here is the trouble. Your microphone is in the same room as your speaker. It cannot tell the difference between sound it should capture — your voice — and sound it should not — the far end's voice replaying from your speaker. So the microphone picks up both, mixes them into one signal, and sends the whole thing back. The far-end talker now hears a delayed copy of their own voice. That delayed copy is the acoustic echo.

The delay is what makes it unbearable. If the echo came back instantly, the brain would treat it like the natural reflection you hear in any room and ignore it. But the round trip over the internet adds time. By the time the far end hears their echo, it lags their original speech by a noticeable gap, and the brain can no longer fuse the two. The longer the gap, the worse it gets.

This is not a new observation. The telephone network has fought echo for a century, and the international standard that quantifies it, ITU-T Recommendation G.131 (Talker echo and its control, 11/2003), makes the dependence on delay explicit: the amount of echo a listener will tolerate drops as the one-way delay grows. G.131 expresses the tolerable echo as a "talker echo loudness rating" plotted against the mean one-way transmission time, and the practical reading is simple: past roughly 25 to 30 milliseconds of one-way delay, merely attenuating the echo is no longer enough; you need a dedicated device to remove it. That device is the echo canceller.

Two kinds of echo exist, and people confuse them. Line echo (or network echo) comes from electrical reflections inside old telephone hardware, at the point where a four-wire circuit meets a two-wire one. Acoustic echo comes from the speaker-to-microphone path in a room. This article is about acoustic echo, the kind that dominates modern voice and video apps. The standard for the line variety is ITU-T G.168; the standard for the acoustic variety was ITU-T G.167, since folded into the hands-free terminal standard ITU-T P.340.

The core idea: cancel by subtraction

The clean insight behind every echo canceller is that your device already knows the one thing it needs. It knows exactly what it is about to play through the speaker, because it decoded that audio itself before handing it to the speaker. That known signal, the far end's voice on its way to your speaker, is called the reference signal, or sometimes the far-end signal.

An echo canceller is, in one sentence, a noise-cancelling headphone for the room. A noise-cancelling headphone listens to the outside world and plays the opposite of it so the two cancel. An echo canceller listens to the reference — the sound it is feeding the speaker — predicts what version of that sound will return into the microphone after bouncing around the room, and subtracts the prediction from the microphone signal. If the prediction is good, what remains is only your voice.

So the canceller sits at the exact point where two signals meet:

  • The microphone signal: your voice, plus the echo of the far end, plus room noise.
  • The reference signal: the far end's voice, exactly as sent to the speaker.

The canceller's job is to figure out how the reference turns into the echo, build a copy of that echo, and subtract it. The leftover after subtraction is called the error signal — and confusingly, that error signal is the good output. It is what gets sent to the far end. "Error" here means "what is left after we removed the part we could predict," and ideally the only thing we could not predict is your actual voice.
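As a toy illustration of the subtraction, the sketch below assumes the prediction matches the echo exactly, an idealisation no real canceller enjoys. The sample values and the names `near_voice`, `echo`, and `predicted_echo` are invented for illustration.

```python
# Toy illustration: the microphone hears the near-end voice plus the echo.
near_voice = [0.10, -0.20, 0.05, 0.30]   # hypothetical near-end samples
echo = [0.40, 0.10, -0.30, 0.20]         # hypothetical echo of the far end

microphone = [v + e for v, e in zip(near_voice, echo)]

# Idealised assumption: the prediction matches the echo exactly.
predicted_echo = list(echo)

# The "error" signal is whatever survives the subtraction.
error = [m - p for m, p in zip(microphone, predicted_echo)]

# With a perfect prediction the error equals the near-end voice
# (up to floating-point rounding).
recovered = all(abs(e - v) < 1e-12 for e, v in zip(error, near_voice))
```

In a real call the prediction is never perfect, so the error also carries some leftover echo.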

[Diagram: the acoustic echo problem and the cancel-by-subtraction solution. The far-end voice arrives, is played through the speaker into the room, bounces off walls, and re-enters the microphone as echo. The reference signal is tapped before the speaker and fed to an adaptive filter, which builds an estimated echo that is subtracted from the microphone signal at a summing node, leaving the near-end voice as the error output sent back to the far end.]

Figure 1. The echo loop and the cancel-by-subtraction fix. The reference signal (what we are about to play) is tapped and used to predict the echo, which is then subtracted from the microphone. What survives the subtraction, the near-end voice, is the only thing sent back.

The hard part: nobody hands you the room

If the canceller knew exactly how the room transforms the reference into the echo, the problem would be trivial: predict, subtract, done. But it does not know. Every room is different, every laptop is different, and the path changes the moment anything moves. So the canceller has to learn the room while the call is happening, and keep re-learning it as conditions change.

The thing it learns is called the echo path — the full journey from the reference signal, through the device's playback buffer and digital-to-analog converter, out the speaker, across the air, off the walls and the desk and the coffee mug, and back into the microphone and its analog-to-digital converter. Engineers describe this path as an impulse response: if you played one infinitely short click through the speaker, the impulse response is the exact pattern of that click and all its reflections arriving back at the microphone over the next tens of milliseconds.

A typical room's echo path is long. Sound travels about 343 metres per second, so a reflection off a wall three metres away and back arrives roughly 17 milliseconds later. At a 48,000-samples-per-second rate — the rate WebRTC and most voice systems use — 17 milliseconds is about 800 samples. Real rooms produce reflections out to 100, 200, even 400 milliseconds in a reverberant space. To model that, the canceller needs a filter with hundreds or thousands of "taps," one weight per sample of delay it wants to account for.
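The reflection arithmetic above can be sketched as a small helper. The function name is hypothetical; the speed of sound and sample rate are the figures quoted in the text.

```python
SPEED_OF_SOUND_M_PER_S = 343.0  # figure used in the text


def reflection_delay_samples(wall_distance_m: float, sample_rate_hz: int) -> float:
    """Delay, in samples, of a reflection that travels to a wall and back."""
    round_trip_m = 2.0 * wall_distance_m
    delay_s = round_trip_m / SPEED_OF_SOUND_M_PER_S
    return delay_s * sample_rate_hz


# A wall three metres away, sampled at 48 kHz.
delay_samples = reflection_delay_samples(3.0, 48_000)
delay_ms = 1000.0 * delay_samples / 48_000
```

The unrounded result is a little over 17 milliseconds, so the sample count comes out somewhat above the rounded "about 800" figure.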

Let us walk the arithmetic once, out loud, because it explains why echo cancellation is computationally expensive:

filter length (in samples) = echo path duration (s) × sample rate (samples/s)
                            = 0.128 s × 48,000 samples/s
                            = 6,144 samples (taps)

A 128-millisecond echo path at 48 kHz needs roughly 6,000 filter taps, and the canceller must update those thousands of numbers continuously, for every block of audio, in real time. That is the engineering challenge underneath the simple "predict and subtract" picture.
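To put a number on "computationally expensive", the sketch below repeats the tap calculation and then counts multiply-accumulate operations for a direct time-domain filter, which needs one per tap per output sample. The helper names are hypothetical, and the count covers filtering only; the weight update adds a similar amount again.

```python
def filter_taps(echo_path_s: float, sample_rate_hz: int) -> int:
    """Number of filter taps needed to span the echo path."""
    return round(echo_path_s * sample_rate_hz)


def time_domain_macs_per_second(taps: int, sample_rate_hz: int) -> int:
    """Multiply-accumulates per second for direct (sample-by-sample) filtering."""
    return taps * sample_rate_hz


taps = filter_taps(0.128, 48_000)                  # 6144 taps
macs = time_domain_macs_per_second(taps, 48_000)   # 294,912,000 per second
```

Roughly 295 million multiply-accumulates per second just to apply the filter is why practical cancellers look for cheaper structures.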

The adaptive filter: guess, check, correct

The filter that models the echo path is adaptive, meaning it adjusts its own weights automatically. It does not need to know the room in advance. It starts with a blank guess, plays the prediction game, looks at how wrong it was, nudges its weights to be a little less wrong, and repeats — thousands of times per second. This guess-check-correct loop is the beating heart of every echo canceller.

The standard recipe for the correction step is an algorithm called the Normalized Least Mean Squares method, abbreviated NLMS. The name describes what it does. "Least mean squares" means it tries to make the average squared error as small as possible — it treats a big mistake as much worse than a small one, so it works hardest on the loudest leftover echo. "Normalized" means it scales each correction by how loud the reference currently is, so a shout and a whisper both produce sensible-sized adjustments instead of the shout slamming the filter around. NLMS is the workhorse because it is stable, cheap enough to run in real time, and good enough for speech. Survey literature on echo cancellation consistently names NLMS the primary adaptive algorithm in the field.

The speed at which the filter learns is set by a single dial called the step size. A big step size means the filter learns fast but jitters around the right answer; a small step size means it learns slowly but settles precisely. This is the central trade-off of adaptation, and it is why modern cancellers use a variable step size: large when the room has just changed and the filter is far off, small once it has converged and only needs fine adjustments. The journey from "filter is blank" to "filter matches the room" is called convergence, and a good canceller converges within a fraction of a second.
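The guess-check-correct loop can be sketched in a few lines. This is a minimal time-domain NLMS with a fixed step size, not the variable-step, frequency-domain form shipping cancellers use. The four-tap "room" in `true_path`, the random reference, and the function name `nlms_cancel` are assumptions made up for the demonstration.

```python
import random


def nlms_cancel(reference, microphone, num_taps, step_size, eps=1e-8):
    """Run NLMS sample by sample; return the error signal and final weights."""
    weights = [0.0] * num_taps   # blank initial guess of the echo path
    history = [0.0] * num_taps   # recent reference samples, newest first
    errors = []
    for x, d in zip(reference, microphone):
        history = [x] + history[:-1]
        predicted = sum(w * h for w, h in zip(weights, history))     # guess
        e = d - predicted                                            # check
        energy = sum(h * h for h in history) + eps                   # "normalized"
        scale = step_size * e / energy
        weights = [w + scale * h for w, h in zip(weights, history)]  # correct
        errors.append(e)
    return errors, weights


random.seed(0)
true_path = [0.6, -0.3, 0.15, 0.05]   # hypothetical 4-tap echo path
reference = [random.uniform(-1.0, 1.0) for _ in range(2000)]

# Far-end single-talk with no noise: the microphone holds only echo.
microphone = [
    sum(true_path[k] * reference[n - k] for k in range(len(true_path)) if n - k >= 0)
    for n in range(len(reference))
]

# step_size must stay between 0 and 2 for NLMS to remain stable.
errors, weights = nlms_cancel(reference, microphone, num_taps=4, step_size=0.5)
```

After the run, `weights` should sit very close to `true_path` and the tail of `errors` should be near zero. Raising `step_size` speeds up the early learning; lowering it makes the final estimate steadier once noise is present.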

There is one more refinement that matters for understanding real systems. Doing the math sample by sample in the time domain is slow. Real cancellers transform blocks of audio into the frequency domain — using the Fast Fourier Transform, which splits a sound into its component pitches — because filtering is far cheaper there. To keep latency low while still modelling a long echo path, they chop the long filter into several shorter pieces, each handling a different slice of delay, and process them as overlapping blocks. This structure is the partitioned block frequency-domain adaptive filter, and the foundational design is the multidelay block frequency-domain adaptive filter described by Soo and Pang (IEEE Transactions on Acoustics, Speech, and Signal Processing, 1990). You do not need the math; you need to know that "frequency-domain block processing" is why a canceller can model a 200-millisecond room without adding 200 milliseconds of delay to your call.
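The partitioning arithmetic can be sketched as follows. The 16 kHz rate and the function name are assumptions chosen for illustration; the 64-sample block size is borrowed from the AEC3 description later in this article. The point is that the buffering delay is on the order of one block, while the modelled echo path spans many blocks.

```python
import math


def partition_plan(echo_path_ms: float, sample_rate_hz: int, block_size: int):
    """Return (taps, partitions, block duration in ms) for a partitioned filter."""
    taps = round(echo_path_ms * sample_rate_hz / 1000)
    partitions = math.ceil(taps / block_size)
    block_ms = 1000.0 * block_size / sample_rate_hz
    return taps, partitions, block_ms


# A 200 ms echo path, assuming 16 kHz processing and 64-sample blocks.
taps, partitions, block_ms = partition_plan(200, 16_000, 64)
```

Under these assumptions the filter spans 3,200 taps split into 50 partitions, while each block covers only 4 milliseconds of audio.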

[Diagram: the adaptive filter convergence loop shown as a four-state cycle: predict the echo from the reference, subtract it from the microphone to get the error, measure how large the error is, and update the filter weights via NLMS, then repeat. A side panel shows echo return loss enhancement rising from zero toward 35 to 45 decibels over the first few hundred milliseconds as the filter converges, with a dip and recovery marked when the room changes.]

Figure 2. The guess-check-correct loop. The filter predicts, subtracts, measures the leftover error, and updates its weights, repeating thousands of times a second. The curve on the right shows the echo being removed (rising ERLE) as the filter converges, with the dip when the room changes.

Measuring success: ERLE

How do you know the canceller is working? You measure how much quieter the echo got. The standard metric is Echo Return Loss Enhancement, abbreviated ERLE, and it is simply the reduction in echo level, in decibels, that the canceller adds on top of whatever natural attenuation the room already provides.

A worked example makes it concrete. Suppose the far end's voice arrives at your microphone, as echo, at a level we will call 0 decibels for reference. The canceller subtracts its prediction, and the leftover echo measures −40 decibels — that is, ten thousand times weaker in power. The ERLE is the difference:

ERLE = echo level before cancellation − echo level after cancellation
     = 0 dB − (−40 dB)
     = 40 dB

A well-tuned modern acoustic echo canceller achieves somewhere around 35 to 45 dB of ERLE in steady single-talk conditions. That is enough to push the residual echo below the level of normal room noise, where the brain stops noticing it. The international test methodology for echo canceller performance, ITU-T G.168, is built around metrics like this; it is written for line echo cancellers but its measurement philosophy — drive the canceller with defined signals and verify the echo is suppressed by a target margin — is the template the whole industry uses.
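The worked example above can be reproduced from raw samples. This sketch computes ERLE as a ratio of mean powers expressed in decibels; the function names and the test signals are invented for illustration.

```python
import math


def mean_power(samples):
    """Average squared amplitude of a block of samples."""
    return sum(s * s for s in samples) / len(samples)


def erle_db(echo_before, echo_after):
    """Echo Return Loss Enhancement: echo power reduction in decibels."""
    return 10.0 * math.log10(mean_power(echo_before) / mean_power(echo_after))


echo_before = [1.0, -1.0, 1.0, -1.0]
# Amplitude 100 times smaller means power 10,000 times smaller.
echo_after = [0.01, -0.01, 0.01, -0.01]

value = erle_db(echo_before, echo_after)   # approximately 40 dB
```

A measurement like this is only meaningful during far-end single-talk, because near-end speech in the output would be counted as uncancelled echo.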

When subtraction is not enough: the residual suppressor

Here is an awkward truth: the adaptive filter alone almost never removes all the echo. Real speakers distort the sound a little — they add nonlinear effects that no linear filter can predict, because a linear filter can only model "the echo is some delayed, scaled copy of the reference." Cheap laptop speakers, phone speakers driven loud, and especially Bluetooth speakers introduce distortion the filter cannot match. Whatever survives the subtraction is called residual echo.

So every serious canceller has a second stage: a residual echo suppressor, sometimes built as a nonlinear processor (NLP). Where the adaptive filter subtracts, the suppressor attenuates — it turns down the volume in the moments and frequency bands where it believes residual echo is present and your voice is not. Think of the adaptive filter as the surgeon removing the bulk of the tumour and the suppressor as the follow-up treatment cleaning up the margins. ITU-T G.168 actually requires a nonlinear processing stage in a compliant echo canceller, precisely because subtraction alone leaves audible residue.

The suppressor is also where cancellers can do damage. If it attenuates too aggressively, it clips the front of your words or makes your voice sound like it is underwater. Tuning the suppressor is a balance between "let no echo through" and "let all of the near-end voice through," and those two goals pull in opposite directions.
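The balance between leaking echo and chopping speech can be seen in a generic per-band gain rule. The sketch below is a textbook-style attenuation with a floor, not the rule AEC3 or any particular product uses; the function name, the `gain_floor` value, and the band powers are all hypothetical.

```python
def suppressor_gains(band_power, residual_echo_power, gain_floor=0.1):
    """Per-band gain: attenuate in proportion to the estimated echo share."""
    gains = []
    for observed, residual in zip(band_power, residual_echo_power):
        if observed <= 0.0:
            gains.append(1.0)   # silent band: nothing to attenuate
            continue
        echo_fraction = min(residual / observed, 1.0)
        # Never go below the floor, so near-end speech is not muted outright.
        gains.append(max(gain_floor, 1.0 - echo_fraction))
    return gains


# Three bands: no residual echo, half echo, all echo.
gains = suppressor_gains([1.0, 1.0, 1.0], [0.0, 0.5, 1.0])
```

A lower floor lets less echo through but makes clipped or underwater-sounding speech more likely when the residual estimate is wrong; a higher floor does the opposite.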

The double-talk problem: the failure mode you will actually chase

Now the single most important thing to understand about echo cancellation, because it is the symptom engineers spend the most time on. Everything above works beautifully when only one person talks at a time. The hard case is double-talk: both people speaking at once.

Recall how the filter learns: it compares its prediction to the microphone signal and nudges its weights toward whatever reduces the error. During far-end single-talk, the microphone contains only echo, so "reduce the error" means "model the echo better," which is exactly right. But during double-talk, the microphone contains echo plus your voice. If the filter keeps adapting, it will mistake your voice for prediction error and start contorting its weights to try to cancel you. The result is a wrecked room model: the filter diverges, the echo leaks back, and the next time only the far end talks, the echo is suddenly loud again because the filter trained itself on garbage.

The fix is a double-talk detector, abbreviated DTD. Its one job is to notice when both people are talking and tell the filter to freeze — stop adapting, hold the weights steady, ride out the double-talk on the room model it already has, and resume learning only when the near-end goes quiet again. The canceller still subtracts during double-talk; it just stops updating its prediction.

Detecting double-talk is harder than it sounds, and the history of the field is largely a history of better detectors. The classic method is the Geigel algorithm, which simply compares the microphone level to the recent reference level: if the microphone is suddenly louder than the echo could explain, someone at the near end must be talking. It is cheap but blunt: research shows it misses double-talk when the near-end voice is not much louder than the echo, and it is fooled by background noise. Modern detectors use normalized cross-correlation between the reference and the microphone signal, which is far less dependent on the exact echo path strength and catches double-talk the Geigel method misses.
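The Geigel comparison is simple enough to sketch directly. The threshold of 0.5 is an assumption: it presumes the echo path attenuates the reference by at least half in amplitude, which need not hold for a loud laptop speaker. The function name and sample values are hypothetical.

```python
def geigel_double_talk(mic_sample, recent_reference, threshold=0.5):
    """Return True when the microphone is louder than the echo could explain."""
    peak_reference = max((abs(x) for x in recent_reference), default=0.0)
    return abs(mic_sample) > threshold * peak_reference


recent_reference = [0.2, -0.8, 0.5]   # recent far-end samples (peak 0.8)

echo_only = geigel_double_talk(0.3, recent_reference)      # 0.3 is below 0.4
with_near_end = geigel_double_talk(0.6, recent_reference)  # 0.6 is above 0.4

# The detector's verdict gates adaptation: freeze the filter on double-talk.
adapt_filter = not with_near_end
```

The bluntness is visible here: a quiet near-end talker who adds only a little level to the microphone would stay under the threshold and go undetected.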

The pitfall, stated plainly. When a customer says "it cuts off the beginning of my words whenever the other person is also talking," that is a double-talk-detection problem, not a network problem. The detector either froze too late (echo leaked) or the suppressor clamped down on the near-end voice (your words got chopped). It is the number-one echo complaint in production, and it is why you should test every voice product with both people deliberately talking over each other, not just polite turn-taking.

[Diagram: three speaking situations side by side. In single-talk far-end, only the far end speaks, the microphone holds only echo, and the filter adapts and the suppressor is active. In single-talk near-end, only the near end speaks, the microphone holds only voice, the filter holds and the suppressor passes audio. In double-talk, both speak, the microphone holds echo plus voice, the double-talk detector fires, the filter freezes, and the suppressor must pass the near-end voice without chopping it.]

Figure 3. The three situations a canceller must handle. The filter only adapts safely during far-end single-talk. During double-talk the detector must freeze the filter, or it will train itself on your voice and ruin the room model.

How WebRTC does it: AEC3 in plain English

Most browser-based and many native voice apps run the same open-source code, libwebrtc, and its echo canceller is called AEC3 — the third generation. If you have joined a video call in Chrome, Edge, or a mobile app built on WebRTC, AEC3 cleaned up your microphone. It is worth seeing how the pieces above map onto a real, shipping system, because the names recur in every engineer's bug report.

AEC3 lives inside the Audio Processing Module (APM), the capture-side clean-up stage that also holds noise suppression and gain control. We walk the whole pipeline in "The WebRTC Audio Pipeline End-to-End"; here we zoom into the echo box. The WebRTC source describes AEC3's top-level class as doing four things: it receives 10-millisecond frames of band-split audio, optionally applies an anti-hum high-pass filter, runs the lower-level echo cancellation on blocks of 64 samples, and partially handles the timing jitter between when audio is played and when it is captured. That last point matters more than it looks.

Delay estimation comes first. Before the filter can subtract anything, AEC3 has to line up the reference with the echo in time — it has to know how late the echo arrives relative to when the reference was played. That delay is not fixed: it includes the playback buffer, the digital-to-analog converter, the speaker-to-microphone air path, the analog-to-digital converter, and the capture buffer, and on phones it can range from about 20 to 200 milliseconds. AEC3 estimates this delay continuously by cross-correlating the reference and capture signals — sliding one against the other to find the lag where they line up best — and feeds that alignment to the filter. Get the delay wrong and the filter is trying to cancel echo against the wrong slice of reference; nothing works. This is why a sudden audio-route change (plugging in headphones mid-call) can cause a half-second of echo while the delay estimator re-locks.
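The "slide one against the other" idea can be sketched as a brute-force cross-correlation search. This is a generic illustration, not AEC3's actual estimator, which is more elaborate; the function name, the synthetic signals, and the 37-sample delay are invented for the demonstration.

```python
import random


def estimate_delay(reference, capture, max_lag):
    """Return the lag (in samples) where capture best matches the reference."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        # Correlate the capture with the reference shifted back by `lag`.
        score = sum(capture[n] * reference[n - lag] for n in range(lag, len(capture)))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


random.seed(1)
reference = [random.uniform(-1.0, 1.0) for _ in range(400)]

true_delay = 37   # hypothetical loop delay in samples
# The capture is a delayed, attenuated copy of the reference (echo only).
capture = [0.0] * true_delay + [0.5 * x for x in reference[:-true_delay]]

lag = estimate_delay(reference, capture, max_lag=100)
```

With a clean delayed copy the search recovers the delay exactly. Real signals add near-end speech, noise, and a delay that drifts, which is why the estimate has to be refreshed continuously.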

Then the linear adaptive filter models the echo path and subtracts its estimate, exactly as described above, using frequency-domain block processing for speed.

Then a residual echo suppressor attenuates whatever the filter could not remove — the nonlinear distortion from cheap or loud speakers — using a model of how confident it is that residual echo is present in each frequency band at each moment.

Throughout, a double-talk-aware control logic decides when it is safe to adapt the filter and how hard the suppressor should clamp, so that near-end speech survives.

AEC3 also exposes a hook called the echo leakage status: if some other detector notices echo getting through, it can flag the canceller to react. The design is modular on purpose — delay control, filter, and suppressor are separate pieces that can be tuned independently, which is exactly what you end up doing on a hard device.

A comparison you can use: what changes the difficulty

Not all echo problems are equal. The table below summarises why some setups are easy and some are nightmares — useful when you are scoping which devices your product must support.

| Situation | Echo path | Difficulty for the canceller | What you'll observe |
| --- | --- | --- | --- |
| Headset or earbuds (wired) | Speaker sealed at the ear; almost no path to the mic | Trivial: there is barely any echo to cancel | Clean audio; AEC has little to do |
| Laptop, moderate volume | Short, stable air path; some linear distortion | Easy: filter converges fast and stays converged | Good after a fraction of a second |
| Laptop, high volume | Louder echo, more speaker distortion | Moderate: more residual for the suppressor | Occasional leak on loud passages |
| Open-room speakerphone | Long, reverberant path; many reflections | Hard: long filter, slow convergence | Echo on movement; double-talk clipping |
| Bluetooth speaker or headset | Long, variable loop delay; codec distortion | Hardest: delay estimate keeps moving | Intermittent echo; detector struggles |

The pattern is clear: the canceller's life gets harder as the echo path gets longer, louder, more distorted, and more variable in delay. Bluetooth is the worst on every axis at once, which is why the practical advice in real products is blunt: encourage headphones, but design for the day the user forgets them. We go deep on the Bluetooth and AirPods case in "Echo Cancellation on Speakerphones, Bluetooth, and AirPods".

Where AEC sits relative to its neighbours

Echo cancellation does not work alone. In the WebRTC capture chain it runs in a fixed order alongside a high-pass filter and two siblings, noise suppression and automatic gain control, and the order is not arbitrary. The high-pass filter sweeps out low-frequency rumble first so the canceller reasons about a cleaner signal. Then AEC removes the echo. Then noise suppression lifts your voice out of background sound, and it must run after AEC, because trying to suppress noise while echo is still present confuses the noise estimate. Finally automatic gain control brings your voice to a consistent level.

Each sibling has its own deep dive: "Automatic Gain Control (AGC): Keeping Voice Level Consistent" and "Noise Suppression: Classical NS, RNNoise, Krisp, NVIDIA RTX Voice". The codec that carries the cleaned-up voice afterward is covered in "Opus: The Open Codec That Ate WebRTC". The reason the whole chain exists is to fight the three enemies of a live call (echo, noise, and an unreliable network), and AEC owns the first one. When echo and packet loss combine, the symptoms blur together; the troubleshooting order is covered in "Diagnosing Audio Problems in Production: A Runbook".

Where Fora Soft fits in

We have built real-time audio into video conferencing, telemedicine, e-learning, and live-shopping products since 2005, and echo is the first thing every one of those products has to get right. In telemedicine especially, a clinician on a clinic speakerphone with the patient on a laptop is the exact double-talk-plus-long-path scenario that breaks naive setups, and getting it clean is the difference between a usable consultation and a frustrated callback. Our work is mostly in choosing the right stack, configuring the WebRTC Audio Processing Module sensibly per device class, testing deliberately under double-talk and Bluetooth conditions, and knowing when a problem is the canceller versus the network. We do not rewrite AEC3; we make it behave on the messy devices your users actually own.

References

  1. ITU-T Recommendation G.131, Talker echo and its control (11/2003). The standard relating tolerable echo to one-way delay; defines the talker echo loudness rating and the "acceptable" (R = 74) versus "limiting case" (R = 60) curves. https://www.itu.int/rec/T-REC-G.131
  2. ITU-T Recommendation G.168, Digital network echo cancellers (04/2000 and later revisions). The performance test methodology for echo cancellers; defines ERLE-style metrics and requires a nonlinear processing stage. https://www.itu.int/rec/T-REC-G.168
  3. ITU-T Recommendation G.167, Acoustic echo controllers (03/1993). The original acoustic-echo-control recommendation for hands-free terminals; later superseded by ITU-T P.340. https://www.itu.int/rec/T-REC-G.167-199303-W/en
  4. ITU-T Recommendation P.340, Transmission characteristics and speech quality parameters of hands-free terminals (03/2004). The successor framework for hands-free acoustic echo and full-duplex behaviour. https://www.itu.int/rec/T-REC-P.340
  5. ITU-T Recommendation G.114, One-way transmission time (05/2003). The companion delay standard; the basis for why echo tolerance is a function of latency. https://www.itu.int/rec/T-REC-G.114
  6. WebRTC source, modules/audio_processing/aec3/echo_canceller3.h. The canonical AEC3 top-level class: 10 ms band-split frames, optional anti-hum high-pass filter, 64-sample block processing, render/capture jitter handling, and the echo-leakage status hook. Accessed 2026-06-06. https://webrtc.googlesource.com/src/+/main/modules/audio_processing/aec3/echo_canceller3.h
  7. J. S. Soo and K. K. Pang, "Multidelay block frequency domain adaptive filter," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 38, no. 2, pp. 373–376, 1990. The foundational partitioned block frequency-domain (multidelay) adaptive-filter design used in commercial cancellers. https://ieeexplore.ieee.org/document/52157
  8. EURASIP Journal on Advances in Signal Processing, "An overview on optimized NLMS algorithms for acoustic echo cancellation" (2015). Survey establishing NLMS as the primary adaptive algorithm for AEC and reviewing variable-step-size variants. https://asp-eurasipjournals.springeropen.com/articles/10.1186/s13634-015-0283-1
  9. J. Benesty et al., work on the normalized-cross-correlation double-talk detector and the limitations of the Geigel detector (the Geigel DTD fails when the near-end signal is not sufficiently louder than the echo). Representative: "The fast normalized cross-correlation double-talk detector," Signal Processing (2006). https://www.sciencedirect.com/science/article/abs/pii/S0165168405003166
  10. W3C, Media Capture and Streams — the echoCancellation constraint on getUserMedia that switches AEC on in browsers. https://www.w3.org/TR/mediacapture-streams/