Published: 2026-06-06 · Reading time: 24 min read · Author: Nikolay Sapunov, CEO at Fora Soft
Why this matters
If you are building a telemedicine app, a video conferencing tool, a contact centre, a live-shopping stream, or an online classroom, the audio pipeline is the part users judge you on first. People forgive a frozen video frame; they hang up on echo, robotic dropouts, or a voice that fades in and out. This article is for the product manager, founder, or operations lead who needs to understand what happens to a voice between two phones well enough to make decisions, read an engineer's bug report, and know which knob to ask about when a customer says "the audio is broken." A senior engineer will also find every claim sourced to the WebRTC source documentation or the relevant RFC. By the end you will be able to draw the pipeline yourself and explain why each box exists.
What "the pipeline" actually means
Before the diagram, one idea has to be solid. A real-time audio system is not a single program; it is a relay race. A tiny slice of sound — usually 10 to 20 milliseconds of it, called a frame — is handed from one runner to the next, and the sound only arrives intact if every runner does their leg correctly and quickly. The word "pipeline" captures exactly this: sound goes in one end as a continuous physical wave, gets chopped into frames, each frame is passed down a line of processing stages, and a continuous wave comes out the other end into someone's ear.
The technology that makes this work in a browser, with no plugin and no install, is called WebRTC — Web Real-Time Communication — a set of standards and a shared codebase that every major browser ships. When you join a video call in Chrome, Safari, or Firefox, the audio half of that call runs through the pipeline described below. Native mobile apps usually run the same open-source code (libwebrtc) under the hood, so the pipeline is nearly identical whether the call is in a browser tab or a phone app.
Here is the whole relay on one line, which the rest of the article unpacks one runner at a time:
Microphone → capture → high-pass filter → echo cancellation → noise suppression → gain control → encode (Opus) → packetize (RTP) → network → jitter buffer (NetEQ) → decode → render → speaker.
Figure 1. The end-to-end WebRTC audio pipeline. Sound enters at the microphone on the left and leaves at the speaker on the right. The dashed loop is the echo reference — a copy of what the speaker is playing, fed back so the echo canceller knows what to remove.
The pipeline splits cleanly into three regions, and it helps to hold them apart in your head. The send side (capture through packetize) lives on the talker's device and is mostly about cleaning up the sound and squeezing it small. The network is the part nobody controls, where packets are delayed, reordered, and lost. The receive side (jitter buffer through render) lives on the listener's device and is mostly about hiding what the network did to the audio.
The send side: capture and clean-up
Capture — turning air into numbers
The microphone is a sensor that turns air pressure into a fluctuating voltage. The sound card's analog-to-digital converter then measures that voltage many thousands of times per second and writes down each measurement as a number. WebRTC works at 48,000 measurements per second — a 48 kHz sample rate — because that is the rate Opus, its codec, runs at natively. A sample rate is like the tick of a clock: 48,000 times every second, the system asks the microphone "what level are you at right now?" and records the answer. (If sample rate is new to you, the foundations piece covers it: Sample Rate: 44.1, 48, 96, 192 kHz and Why 48 kHz Won Video.)
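To make those numbers concrete, here is a minimal sketch of the arithmetic that links sample rate to frame size. The helper name `samplesPerFrame` is ours, invented for illustration; it is not a WebRTC API.

```js
// Hypothetical helper: how many samples make up one frame of audio.
function samplesPerFrame(sampleRateHz, frameMs) {
  return Math.round((sampleRateHz * frameMs) / 1000);
}

console.log(samplesPerFrame(48000, 10)); // 480 samples in a 10 ms frame
console.log(samplesPerFrame(48000, 20)); // 960 samples in a 20 ms frame
```

So when an engineer mentions "a 960-sample frame", they mean 20 ms of audio at 48 kHz.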
In a browser, capture starts when your code calls getUserMedia, the function defined by the W3C Media Capture and Streams standard that asks the user for microphone permission and hands back a live audio track. That one call hides a lot: device selection, the operating system's audio path, and the constraints you can request (sample rate, channel count, and whether the browser's built-in clean-up is on).
```js
// Ask for the mic with WebRTC's built-in clean-up enabled.
// echoCancellation, noiseSuppression, autoGainControl are the three
// capture-side processors described below; all default to true in browsers.
const stream = await navigator.mediaDevices.getUserMedia({
  audio: {
    echoCancellation: true,
    noiseSuppression: true,
    autoGainControl: true
  }
});
```
Those three flags (echoCancellation, noiseSuppression, autoGainControl) switch on the three processors that do the heavy lifting next. They are on by default in the major browsers, which is why a plain video call usually sounds acceptable even on a laptop with the speakers turned up.
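Requesting a constraint is not the same as getting it, so it can help to check what the browser actually applied. The sketch below reads the track's settings through `MediaStreamTrack.getSettings()`. The helper names `describeCaptureProcessing` and `logMicProcessing` are ours, and which of the three flags a given browser reports is an assumption you should verify on your target devices.

```js
// Hypothetical helper: summarise the clean-up flags a track reports.
function describeCaptureProcessing(settings) {
  const flags = ['echoCancellation', 'noiseSuppression', 'autoGainControl'];
  const report = {};
  for (const flag of flags) {
    // Some browsers omit a flag entirely; report that rather than guess.
    report[flag] = flag in settings ? (settings[flag] ? 'on' : 'off') : 'not reported';
  }
  return report;
}

// Browser-only usage: open the mic and log what was applied.
async function logMicProcessing() {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const [track] = stream.getAudioTracks();
  console.log(describeCaptureProcessing(track.getSettings()));
}
```

This is a handy first step when a customer reports echo: it tells you whether the built-in processing was running at all.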
The Audio Processing Module: one box, several jobs
In libwebrtc, the capture-side clean-up lives in a component called the Audio Processing Module, or APM. Its job, in the project's own words, is to apply speech-enhancement effects to the microphone signal, and the canonical examples it lists are echo cancellation, noise suppression, and automatic gain control. The APM processes the captured audio in a fixed order, and the order matters: high-pass filter → acoustic echo cancellation (AEC3) → noise suppression (NS) → automatic gain control (AGC2). Each stage assumes the previous stage has already done its work, so reordering them would break the assumptions each one relies on.
High-pass filter — sweep out the rumble
The first stage is a high-pass filter, which removes very low-frequency energy: the desk thump, the air-conditioning hum, the handling rumble below the range of speech. Human speech lives roughly between 80 Hz and 8 kHz; energy below that band carries no useful voice and only makes the later stages work harder. A high-pass filter is like a sieve that lets the high frequencies through and holds the low rumble back. Removing the rumble early gives the echo canceller and noise suppressor a cleaner signal to reason about.
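For readers who want to see the sieve in code, here is a minimal first-order high-pass filter. It is a textbook sketch, not the filter libwebrtc ships, and the 80 Hz cutoff is our assumption, taken from the lower edge of the speech band mentioned above.

```js
// Textbook first-order high-pass filter (illustration only).
function highPass(samples, sampleRateHz, cutoffHz) {
  const rc = 1 / (2 * Math.PI * cutoffHz);
  const dt = 1 / sampleRateHz;
  const alpha = rc / (rc + dt); // closer to 1 means a lower cutoff
  const out = new Float64Array(samples.length);
  let prevIn = 0;
  let prevOut = 0;
  for (let n = 0; n < samples.length; n++) {
    // Pass changes in the input, let any constant offset decay away.
    prevOut = alpha * (prevOut + samples[n] - prevIn);
    prevIn = samples[n];
    out[n] = prevOut;
  }
  return out;
}

// A constant offset (0 Hz, the slowest possible "rumble") fades to nothing.
const offset = new Float64Array(4800).fill(1);
const filtered = highPass(offset, 48000, 80);
console.log(filtered[filtered.length - 1] < 1e-6); // true
```

A 1 kHz tone, well inside the speech band, passes through the same filter almost untouched.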
Acoustic echo cancellation — the hardest job in the pipeline
Echo is the single most damaging audio defect in a call, and the reason is geometry. When you are on speakerphone, the sound coming out of your speaker — the far end's voice — is picked up again by your own microphone and sent back. The far-end talker then hears themselves, delayed by the round trip, which is maddening and makes conversation almost impossible.
Acoustic echo cancellation, abbreviated AEC, solves this with a clever trick. The device already knows exactly what it is about to play through the speaker, because it generated that signal. That known signal is called the reference. The echo canceller takes the reference, models how the room and the speaker-to-microphone path distort and delay it, and subtracts the predicted echo from the microphone signal. What remains, ideally, is only your voice. WebRTC's modern echo canceller is called AEC3, and it is the default in the pipeline. The dashed loop in Figure 1 is exactly this reference signal being routed from the playback path back to the canceller.
The math is an adaptive filter. The canceller does not know the room in advance, so it guesses, compares its prediction to reality, and adjusts continuously, many times per second, until the prediction matches. When you move the laptop, cup the mic, or someone walks between you and the speaker, the room changes and the filter has to re-converge, which is why echo sometimes leaks for a second after a disturbance.
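The guess, compare, adjust loop can be sketched with a normalised least-mean-squares (NLMS) filter, the classic textbook adaptive filter. To be clear, this is a swapped-in teaching example and not AEC3, which is far more elaborate. The echo path in the demo (the reference arriving 3 and 7 samples late) and all parameter values are made up.

```js
// Textbook NLMS adaptive filter: an illustration, not AEC3.
function nlmsEchoCancel(reference, mic, taps = 16, stepSize = 0.5) {
  const weights = new Float64Array(taps); // current guess at the echo path
  const recent = new Float64Array(taps);  // latest reference samples, newest first
  const residual = new Float64Array(mic.length);
  for (let n = 0; n < mic.length; n++) {
    recent.copyWithin(1, 0, taps - 1);
    recent[0] = reference[n];
    let predictedEcho = 0;
    let energy = 1e-9; // small floor avoids dividing by zero
    for (let k = 0; k < taps; k++) {
      predictedEcho += weights[k] * recent[k];
      energy += recent[k] * recent[k];
    }
    const error = mic[n] - predictedEcho; // what is left after subtraction
    residual[n] = error;
    const scale = (stepSize * error) / energy;
    for (let k = 0; k < taps; k++) weights[k] += scale * recent[k];
  }
  return residual;
}

// Demo: returns residual energy divided by mic energy over the last 1000 samples.
function demoEchoReduction() {
  let seed = 1;
  const noise = () => {
    seed = (seed * 1664525 + 1013904223) % 4294967296; // simple repeatable generator
    return seed / 2147483648 - 1;
  };
  const reference = Float64Array.from({ length: 6000 }, noise);
  const mic = reference.map(
    (_, n) => 0.6 * (reference[n - 3] || 0) + 0.3 * (reference[n - 7] || 0)
  );
  const residual = nlmsEchoCancel(reference, mic);
  const energy = (x) => x.slice(-1000).reduce((sum, v) => sum + v * v, 0);
  return energy(residual) / energy(mic);
}
```

In this idealised demo there is no near-end voice and the room never changes, so the filter converges almost perfectly. Real rooms, double-talk, and drifting delay are exactly what make production echo cancellation hard.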
Common pitfall: the double-talk problem. Echo cancellation is easy when only one person talks. It gets hard when both people talk at once — "double-talk" — because the microphone now carries your voice and the echo of theirs, and the canceller must remove one without damaging the other. A weak canceller will chop your words when the far end is also speaking. This is the number-one symptom engineers chase, and Bluetooth speakerphones make it worse because they add unpredictable delay to the loop. We cover this fully in Acoustic Echo Cancellation (AEC): How It Really Works.
Noise suppression — lift the voice out of the room
Noise suppression, abbreviated NS, runs after echo cancellation. Its job is to attenuate steady background sound — fans, traffic, keyboard hiss, café murmur — while leaving speech alone. Classical noise suppression estimates the noise during the gaps between words, then subtracts that estimate from the whole signal. Modern deep-learning suppressors (RNNoise, and commercial systems such as Krisp and NVIDIA's RTX Voice) instead learn what speech looks like and keep only that, which lets them remove non-steady noise like a dog bark or a slamming door. The trade-off is always the same: push suppression too hard and the voice itself starts to sound underwater. The full ladder of techniques is in Noise Suppression: Classical NS, RNNoise, Krisp, NVIDIA RTX Voice.
Automatic gain control — keep the level steady
The last APM stage is automatic gain control, abbreviated AGC, and WebRTC's current version is AGC2. People sit at different distances from their microphones and speak at different volumes; AGC's job is to bring every voice to a consistent target level so the listener never has to ride the volume knob. AGC2 combines several controllers — an input-volume controller that nudges the operating system's mic gain, an adaptive digital gain that follows the speech level, and a fixed limiter that catches sudden peaks. The failure mode here is the "AGC chase": if the control loop is too aggressive, it pumps the level up and down audibly, and a quiet room's background noise swells every time you pause. Automatic Gain Control (AGC): Keeping Voice Level Consistent covers the tuning.
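The adaptive-gain-plus-limiter idea can be sketched in a few lines. This is a deliberately crude illustration, not AGC2: the function name `levelFrames`, the target level, the gain cap, the smoothing factor, and the hard-clip limiter are all our assumptions.

```js
// Crude level controller: adaptive gain followed by a hard limiter (illustration only).
function levelFrames(frames, targetRms = 0.1, maxGain = 8, smoothing = 0.2, limit = 0.98) {
  let gain = 1;
  return frames.map((frame) => {
    const rms = Math.sqrt(frame.reduce((sum, v) => sum + v * v, 0) / frame.length);
    if (rms > 1e-4) {
      // Adapt only when the frame carries signal, so pauses do not pump the gain up.
      const desired = Math.min(maxGain, targetRms / rms);
      gain += smoothing * (desired - gain); // move part of the way each frame
    }
    // Limiter: clamp any peak that the gain pushed past the ceiling.
    return frame.map((v) => Math.max(-limit, Math.min(limit, v * gain)));
  });
}
```

The `smoothing` value is where the "AGC chase" lives: set it too high and the level visibly pumps, set it too low and a quiet talker stays quiet for seconds.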
The send side: encode and packetize
Encode — squeeze the frame small with Opus
After clean-up, the audio is still raw numbers — about 768 kilobits per second for a single 48 kHz, 16-bit channel. Sending that over a phone network would be wasteful and fragile, so a codec compresses each frame. WebRTC's default codec is Opus, defined in IETF RFC 6716, and it is the reason WebRTC audio sounds good at low bitrates. Opus works on frames as short as 2.5 ms and as long as 60 ms, supports bitrates from 6 kbit/s to 510 kbit/s, and runs at 48 kHz internally. In practice WebRTC packetizes Opus in 20 ms frames, which is the standard trade-off between latency (shorter is better) and efficiency (longer compresses better).
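The 768 kbit/s figure is simple multiplication, and the same arithmetic shows how much work the codec does. In the sketch below the 32 kbit/s Opus target is our assumption, chosen as a plausible voice bitrate; the helper names are hypothetical.

```js
// Hypothetical helpers for the bitrate arithmetic.
function rawBitrate(sampleRateHz, bitsPerSample, channels) {
  return sampleRateHz * bitsPerSample * channels; // bits per second
}

function bytesPerFrame(bitrateBps, frameMs) {
  return (bitrateBps * frameMs) / 1000 / 8;
}

const raw = rawBitrate(48000, 16, 1);
console.log(raw);                      // 768000 bits per second, uncompressed
console.log(raw / 32000);              // 24, the compression ratio at 32 kbit/s
console.log(bytesPerFrame(32000, 20)); // 80 bytes of payload per 20 ms frame
```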
Opus is unusual because it contains two engines: a speech engine (SILK) and a music engine (CELT), with a switch that picks the right one or blends them. That is why the same codec handles a whispered telemed consult and background music on a live shopping stream. Two Opus features matter for the pipeline and reappear on the receive side:
- In-band forward error correction (FEC). Opus can embed a low-bitrate copy of the previous frame inside the current packet (these are called LBRR frames). If one packet is lost, the next packet carries a rough copy of what was missed, so the decoder can fill the gap. FEC is only available when Opus's speech engine (SILK) is active, so it does nothing for audio coded purely by the music engine.
- Discontinuous transmission (DTX). When you stop talking, Opus can stop sending full packets and instead emit a tiny "silence descriptor" roughly every 400 ms. This cuts the bandwidth a talker uses while silent by around 85–90%, and the saving adds up because in a two-person call each person is silent about half the time.
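Both features are signalled in the session description through the `useinbandfec` and `usedtx` format parameters defined in RFC 7587. One common way to request them is to edit the SDP text before it is applied, sketched below. The function name `setOpusFmtpParams` is ours, and whether a given browser honours an edited SDP is an assumption you should test rather than rely on.

```js
// Hypothetical helper: merge parameters into the Opus a=fmtp line of an SDP blob.
function setOpusFmtpParams(sdp, params) {
  const lines = sdp.split('\r\n');
  const rtpmapIndex = lines.findIndex((l) => /^a=rtpmap:\d+ opus\/48000/i.test(l));
  if (rtpmapIndex === -1) return sdp; // no Opus in this SDP, leave it untouched
  const payloadType = lines[rtpmapIndex].match(/^a=rtpmap:(\d+)/)[1];
  const prefix = `a=fmtp:${payloadType} `;
  const fmtpIndex = lines.findIndex((l) => l.startsWith(prefix));

  // Start from the existing parameters so nothing already negotiated is lost.
  const merged = new Map();
  if (fmtpIndex !== -1) {
    for (const pair of lines[fmtpIndex].slice(prefix.length).split(';')) {
      const [key, value = ''] = pair.trim().split('=');
      if (key) merged.set(key, value);
    }
  }
  for (const [key, value] of Object.entries(params)) merged.set(key, String(value));

  const fmtpLine = prefix + [...merged].map(([k, v]) => `${k}=${v}`).join(';');
  if (fmtpIndex !== -1) lines[fmtpIndex] = fmtpLine;
  else lines.splice(rtpmapIndex + 1, 0, fmtpLine);
  return lines.join('\r\n');
}

// Usage: ask for in-band FEC and DTX on the Opus stream.
// const munged = setOpusFmtpParams(offer.sdp, { useinbandfec: 1, usedtx: 1 });
```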
The deep dive on the codec itself is Opus: The Open Codec That Ate WebRTC.
Packetize — wrap the frame for the network with RTP
A compressed frame is not yet ready to travel. It needs an address label so the receiver can reassemble the stream in order and at the right time. That label is the Real-time Transport Protocol header, defined in IETF RFC 3550, and the wrapping step is packetization. The Opus-specific rules for how a frame maps into an RTP packet are in IETF RFC 7587. Three fields on every RTP header do the real work:
- A sequence number that increments by one per packet, so the receiver can detect loss and reordering.
- A timestamp that marks where the frame sits on the media clock, so the receiver knows how to space the audio out for playback. (Note: the RTP timestamp is not wall-clock time — aligning audio and video uses a separate report, covered in RTP Timestamps, RTCP Sender Reports, and NTP Synchronization.)
- A synchronization source identifier (SSRC) that names which stream the packet belongs to, so a client receiving several people's audio can keep them apart.
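The three fields above sit at fixed positions in the 12-byte RTP header laid out in RFC 3550, so reading them takes only a few lines. This is a minimal sketch that ignores header extensions and CSRC lists; the function name `parseRtpHeader` is ours.

```js
// Read the fixed 12-byte RTP header (RFC 3550). Extensions and CSRCs are ignored.
function parseRtpHeader(bytes) {
  if (bytes.length < 12) throw new Error('RTP header needs at least 12 bytes');
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  return {
    version: bytes[0] >> 6,              // top two bits, always 2 today
    marker: (bytes[1] & 0x80) !== 0,
    payloadType: bytes[1] & 0x7f,
    sequenceNumber: view.getUint16(2),   // network byte order (big-endian)
    timestamp: view.getUint32(4),
    ssrc: view.getUint32(8),
  };
}

// Example packet: payload type 111, sequence 42, timestamp 960, SSRC 0x12345678.
const header = parseRtpHeader(
  Uint8Array.from([0x80, 0x6f, 0x00, 0x2a, 0x00, 0x00, 0x03, 0xc0, 0x12, 0x34, 0x56, 0x78])
);
console.log(header.sequenceNumber, header.timestamp); // 42 960
```

With Opus the RTP clock runs at 48 kHz, so consecutive 20 ms packets have timestamps 960 apart while the sequence number goes up by one.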
The packet is then encrypted — WebRTC mandates SRTP, secured by a DTLS handshake (IETF RFC 5764) — and pushed onto the network. From here the send side is done; the packet is in the wild.
The network: the part you do not control
Once a packet leaves the device, three things can happen to it, and all three are bad. It can be delayed more than the packet before it, so packets that left evenly spaced arrive in clumps — this is called jitter. It can be reordered, arriving after a packet that was sent later. And it can be lost entirely, dropped by a congested router or a flaky Wi-Fi link. A typical mobile connection loses a small percentage of packets and varies its delay by tens of milliseconds; a bad one does far worse.
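Jitter can be put into a single number. The sketch below uses the running estimator from RFC 3550 (section 6.4.1), which smooths the change in transit time between consecutive packets. The RFC works in timestamp units; we use milliseconds for readability, and the packet objects with `sentMs` and `arrivalMs` fields are a made-up shape. This is the RTP reporting statistic, not necessarily what NetEQ uses internally.

```js
// RFC 3550 interarrival jitter: J = J + (|D| - J) / 16, here in milliseconds.
function interarrivalJitter(packets) {
  let jitter = 0;
  for (let i = 1; i < packets.length; i++) {
    const transitPrev = packets[i - 1].arrivalMs - packets[i - 1].sentMs;
    const transitCur = packets[i].arrivalMs - packets[i].sentMs;
    const d = Math.abs(transitCur - transitPrev); // change in one-way transit time
    jitter += (d - jitter) / 16;
  }
  return jitter;
}

const steady = [0, 20, 40, 60, 80].map((t) => ({ sentMs: t, arrivalMs: t + 50 }));
console.log(interarrivalJitter(steady)); // 0, because every packet took the same time
```

Note that a constant 50 ms of network delay produces zero jitter: jitter measures how much the delay varies, not how large it is.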
The send side cannot fix this, because by the time a packet is lost it is already gone. Everything the receive side does is damage control: smoothing out the jitter, reordering what arrived out of order, and inventing plausible audio to cover what never arrived at all. This is why the receive side, not the send side, is where most of WebRTC's audio cleverness lives.
The receive side: NetEQ, the brain of WebRTC audio
Why a buffer is unavoidable
Packets arrive on an irregular schedule, but playback must be perfectly regular — the speaker needs a fresh 10 ms of audio every 10 ms, forever, with no gaps. A jitter buffer is the device that bridges these two worlds. It is like the waiting area at an airport gate: passengers (packets) arrive at unpredictable times, but flights (playback) leave on a fixed schedule, so the waiting area holds people until it is time to board. The buffer holds arriving packets just long enough to release them in order, on time.
The hard part is choosing how long to wait. Wait too little and a slightly late packet misses its slot and counts as lost. Wait too much and you add audible delay to the conversation, which makes people talk over each other. A fixed buffer cannot win this trade-off because networks change minute to minute. The answer is an adaptive jitter buffer that continuously measures the network and adjusts its target delay.
What NetEQ does
WebRTC's adaptive jitter buffer is called NetEQ, and in the project's own description it is "the audio jitter buffer and packet loss concealer," continuously optimizing buffering delay against network conditions to keep playback smooth with as little delay as possible. It has two doors. Packets come in through InsertPacket, which discards anything too late to be useful and otherwise stores the packet, updating its statistics about how packets are arriving. Audio goes out through GetAudio, which the playback device calls to pull exactly 10 ms of sound whenever it needs it.
The trick that lets NetEQ hold delay low and avoid gaps is that it can quietly stretch or compress the audio in time. When the buffer is filling up (the network sped up), NetEQ plays a decoded frame slightly faster to drain the backlog. When the buffer is running dry (the network slowed down), it plays slightly slower to buy time. These adjustments are small enough to be inaudible in most cases. Every time the playback device asks for audio, NetEQ returns the result of one of six operations:
| NetEQ operation | When it happens | What the listener hears |
|---|---|---|
| Normal | A packet is available and the buffer level is fine | Clean decoded audio |
| Acceleration | Buffer too full (network sped up) | Audio played slightly faster to drain the buffer |
| Preemptive expand | Buffer too empty (network slowed) | Audio played slightly slower to add delay |
| Expand | No packet available — it is late or lost | Packet loss concealment: invented audio |
| Merge | A real packet arrives right after concealed audio | Concealed audio stitched smoothly to the real frame |
| Comfort noise (CNG) | The talker is silent and using DTX | Soft synthetic background hiss instead of dead silence |
Table 1. The six NetEQ playout operations. The first three manage delay; the last three hide loss and silence. Source: WebRTC NetEQ documentation.
Figure 2. How NetEQ decides what to play. Delay problems are solved by time-stretching (top band); missing audio is solved by concealment (bottom band). The full deep dive is in the NetEQ article.
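Table 1 can be read as a decision procedure, sketched below. This is a simplified illustration of the table, not NetEQ's real decision logic: the function name, the state fields, and the 1.5x and 0.5x thresholds are all invented for the example.

```js
// Simplified reading of Table 1 (illustration only, thresholds are made up).
function choosePlayoutOperation(state) {
  const { packetAvailable, talkerInDtx, previousWasExpand, bufferMs, targetMs } = state;
  if (!packetAvailable) {
    // Nothing to decode: either the talker is silent (DTX) or the packet is late or lost.
    return talkerInDtx ? 'comfort-noise' : 'expand';
  }
  if (previousWasExpand) return 'merge';                     // stitch real audio onto concealed audio
  if (bufferMs > targetMs * 1.5) return 'accelerate';        // too much queued, drain it
  if (bufferMs < targetMs * 0.5) return 'preemptive-expand'; // running dry, buy time
  return 'normal';
}

console.log(choosePlayoutOperation({
  packetAvailable: true, talkerInDtx: false, previousWasExpand: false, bufferMs: 40, targetMs: 40,
})); // 'normal'
```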
Hiding lost packets: PLC, FEC, and RED
When a packet never arrives, NetEQ's Expand operation generates packet loss concealment — it extrapolates from the audio it already has, repeating and fading the recent waveform so the gap sounds like a brief continuation rather than a click or silence. This works well for one lost packet and degrades as losses pile up.
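The "repeat and fade" idea fits in a few lines. The sketch below is far simpler than NetEQ's real Expand operation: it replays the last good frame and halves the volume for each consecutive loss. The function name and the fade factor are our assumptions.

```js
// Naive concealment: replay the last good frame, quieter with each consecutive loss.
function concealLostFrame(lastGoodFrame, consecutiveLosses, fadePerLoss = 0.5) {
  const gain = Math.pow(fadePerLoss, consecutiveLosses);
  return Float32Array.from(lastGoodFrame, (v) => v * gain);
}

const lastGood = Float32Array.from([1, -1, 0.5]);
console.log(concealLostFrame(lastGood, 1)); // first loss: half volume
console.log(concealLostFrame(lastGood, 3)); // third loss in a row: one eighth volume
```

The fade is what makes a long outage drift into silence instead of looping the same 20 ms forever, which is also why concealment stops helping once several packets in a row are gone.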
NetEQ also cooperates with the two redundancy schemes from the send side. If the Opus stream carries in-band FEC, NetEQ can recover a lost frame from the low-bitrate copy in the next packet. WebRTC's NetEQ documentation lists "forward error correction (RED or codec inband FEC)" among its responsibilities, alongside negative acknowledgement (NACK) tracking of lost packets and even an audio/video sync duty — NetEQ can be told to add latency deliberately to keep audio lined up with video. RED, defined in IETF RFC 2198, is a generic scheme that bundles copies of older frames into newer packets. The trade-offs between PLC, FEC, and RED — when each one saves a call and when it just wastes bandwidth — are covered in Packet Loss Concealment (PLC): Hiding the Missing Frames and Forward Error Correction (FEC), In-Band FEC, and RED Redundancy.
Decode and render
Once NetEQ has decided what 10 ms of audio to play, the Opus decoder turns the compressed frame back into raw samples, and the render stage hands those samples to the operating system's audio output, which drives the speaker or headphones. The relay is complete: a wave that started in one room is now a wave in another.
The latency budget you can put on a wall
Every hop costs time, and the sum is what determines whether a conversation feels natural. The industry rule of thumb, derived from the old telephone-network guidance (ITU-T Recommendation G.114), is that one-way mouth-to-ear delay under about 150 ms is unnoticeable, up to about 200 ms is comfortable for most conversation, and beyond roughly 400 ms people start talking over each other. Here is a representative budget for a good WebRTC call; real numbers vary with network and device.
| Stage | Typical one-way contribution | Why |
|---|---|---|
| Capture + APM clean-up | ~10–20 ms | Buffering plus echo/noise/gain processing |
| Opus encode | ~20 ms | One 20 ms frame must be collected before it can be encoded |
| Packetize + send | ~1 ms | RTP header, encryption, hand-off to the OS |
| Network (one way) | ~10–150 ms | The dominant and least controllable term |
| Jitter buffer (NetEQ) | ~20–100 ms | The adaptive delay it chooses to absorb jitter |
| Decode + render | ~10–30 ms | Decoding plus the output device's own buffer |
| Total (good network) | ~70–120 ms | Comfortably below the 200 ms target |
Table 2. A representative one-way latency budget for a WebRTC audio call. The two terms you can least control, the network and the jitter buffer, are also the two largest, which is why receive-side cleverness matters more than send-side optimization. Adding the controllable terms at representative values gives 15 + 20 + 1 + 10 = 46 ms of device-side delay (capture, encode, packetize, render) on a good network, leaving the rest to NetEQ and the wire.
Figure 3. The same budget as a stacked bar. The network and the jitter buffer are the long, variable segments; the device-side stages are short and fixed. A healthy call finishes well left of the 200 ms comfort ceiling.
The arithmetic is worth doing out loud once. On a good local network the controllable, device-side hops add up to roughly 15 ms (capture) + 20 ms (encode) + 1 ms (packetize) + 10 ms (render) = 46 ms, plus a small decode cost. The network adds maybe 10–30 ms each way, and NetEQ adds 20–60 ms of deliberate buffering. That lands a healthy call near 90–120 ms one-way — well inside the comfortable band. The moment the network degrades, NetEQ grows its buffer to protect smoothness, and that is the delay you feel creep in on a bad connection.
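The same sum is easy to keep in code so you can plug in your own measurements. The figures below are the ones from the paragraph above; the helper names and the label for the 200 to 400 ms band are our own. Taking the extremes of each range gives 76 to 136 ms, which brackets the 90 to 120 ms typical figure.

```js
// One-way budget in milliseconds as [best case, worst case], using the article's figures.
const budgetMs = {
  capture: [15, 15],
  encode: [20, 20],
  packetize: [1, 1],
  render: [10, 10],
  network: [10, 30],
  jitterBuffer: [20, 60],
};

function totalRange(budget) {
  return Object.values(budget).reduce(([lo, hi], [a, b]) => [lo + a, hi + b], [0, 0]);
}

// Bands follow the G.114 rule of thumb quoted earlier; the 200 to 400 ms label is ours.
function rateDelay(oneWayMs) {
  if (oneWayMs <= 150) return 'unnoticeable';
  if (oneWayMs <= 200) return 'comfortable';
  if (oneWayMs <= 400) return 'strained';
  return 'people talk over each other';
}

const [best, worst] = totalRange(budgetMs);
console.log(best, worst, rateDelay(worst)); // 76 136 'unnoticeable'
```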
Where Fora Soft fits in
We build the kinds of products that live or die by this pipeline: telemedicine platforms where a clinician must hear a patient clearly, video conferencing and e-learning tools where dozens of people share a room, contact-centre and live-shopping apps where every second of audio is the product. Across these projects we tune the same stages this article describes — choosing whether to lean on Opus FEC or RED for a lossy mobile audience, sizing NetEQ for the latency a use case can tolerate, and deciding when to mix audio in a server versus forward it untouched. The pipeline is standard; making it sound right for a specific product and a specific network is the engineering.
What to read next
- Acoustic Echo Cancellation (AEC): How It Really Works
- Jitter Buffer: NetEQ, the Brain of WebRTC Audio
- Opus: The Open Codec That Ate WebRTC
Call to action
- Talk to an audio engineer: book a 30-minute scoping call to talk through your WebRTC audio pipeline plan.
- See our case studies — 250+ shipped projects across video streaming, WebRTC, OTT, telemedicine, e-learning, surveillance, and AR/VR.
- Download the WebRTC audio pipeline cheat sheet: a one-page reference covering every pipeline stage from microphone to speaker, the real component at each stage (getUserMedia, HPF, AEC3, NS, AGC2, Opus, RTP, NetEQ), the six NetEQ operations, and a representative one-way latency budget with the…
References
- Audio Processing Module (APM): WebRTC source documentation, modules/audio_processing/g3doc/audio_processing_module.md, webrtc.googlesource.com (accessed 2026-06-06). Tier 2 (reference implementation documentation). Supports the APM's role and the processing order high-pass filter → AEC3 → NS → AGC2, and the AGC2 controller structure.
- NetEq: WebRTC source documentation, modules/audio_coding/neteq/g3doc/index.md, chromium.googlesource.com (accessed 2026-06-06). Tier 2 (reference implementation documentation). Supports NetEQ's definition as an adaptive jitter buffer and packet-loss concealer, the InsertPacket/GetAudio API, the six playout operations (Normal, Acceleration, Preemptive expand, Expand, Merge, Comfort noise), and its FEC/RED, NACK, and A/V-sync responsibilities.
- IETF RFC 6716: Definition of the Opus Audio Codec (J. Valin, K. Vos, T. Terriberry), September 2012. Tier 1 (official specification). Supports Opus frame sizes 2.5–60 ms, bitrate range 6–510 kbit/s, 48 kHz operation, SILK/CELT hybrid structure, in-band FEC (LBRR), and DTX. Note: RFC 6716 is updated by RFC 8251 (2017); a 2026 article on Opus cites both.
- IETF RFC 7587: RTP Payload Format for the Opus Speech and Audio Codec (J. Spittka, K. Vos, J. Valin), June 2015. Tier 1 (official specification). Supports the mapping of Opus frames into RTP packets and SDP signalling (e.g. usedtx, useinbandfec).
- IETF RFC 3550: RTP: A Transport Protocol for Real-Time Applications (H. Schulzrinne et al.), July 2003. Tier 1 (official specification). Supports the RTP header fields used by the pipeline: sequence number, timestamp, and SSRC.
- IETF RFC 2198: RTP Payload for Redundant Audio Data (C. Perkins et al.), September 1997. Tier 1 (official specification). Supports the RED generic audio-redundancy scheme.
- IETF RFC 5764: Datagram Transport Layer Security (DTLS) Extension to Establish Keys for SRTP (DTLS-SRTP) (D. McGrew, E. Rescorla), May 2010. Tier 1 (official specification). Supports the encryption step (DTLS-SRTP) applied before packets hit the network.
- W3C Media Capture and Streams: getUserMedia, W3C (Candidate Recommendation as of 2026). Tier 1 (official specification). Supports the capture entry point and the echoCancellation / noiseSuppression / autoGainControl constraints.
- W3C WebRTC 1.0: Real-Time Communication Between Browsers, W3C Recommendation. Tier 1 (official specification). Supports the browser-level audio transport model. Note: where popular blog posts describe browser audio behaviour, the article defers to the W3C and WebRTC source documentation and flags any discrepancy.
- ITU-T Recommendation G.114: One-way transmission time, ITU-T (current revision). Tier 1 (official specification). Supports the latency-budget guidance (≤150 ms unnoticed, ≤400 ms usable). Many blogs quote "150 ms" without the source; the controlling document is G.114.


