Why this matters
If you ship video calling, telemedicine, contact-centre, or online-classroom software, packet loss is not an edge case — it is the normal weather of mobile and home networks, and the first thing your users notice when it hits. The audio either degrades gracefully or it turns into a glitchy, robotic mess, and the difference between those two outcomes is almost entirely the quality of your packet loss concealment. This article is for the product manager, founder, or operations lead who needs to understand what PLC can and cannot do — so they can read a diagnostic stat, judge whether a "bad audio" complaint is a concealment problem or something upstream, and decide whether a newer neural PLC is worth the integration cost. A senior engineer will also find every claim traced to the WebRTC source documentation, the relevant RFCs and ITU recommendations, or the original research papers.
The problem: the show must go on
Start with the constraint that makes PLC necessary. In a real-time call the speaker is merciless: it demands a fresh, unbroken slice of audio — in WebRTC, exactly 10 milliseconds of it — to play on time, 100 times a second, forever. The audio device does not pause politely while a late packet finishes its journey across the network. If the next slice of sound is not ready when the speaker asks for it, something has to come out of the speaker anyway.
On a real network, packets go missing constantly. They are dropped by congested routers, lost on a weak Wi-Fi link, corrupted on a cellular hop, or they simply arrive too late to be useful. For one-way streaming this is annoying but recoverable, because the player can buffer several seconds and wait. For a live conversation it is not recoverable that way: waiting means delay, and delay past roughly 200 milliseconds one-way makes natural back-and-forth fall apart (we explain that latency ceiling in our article on the WebRTC audio pipeline). So when a packet is missing, the receiver has only bad options, and PLC is the least bad one.
Packet loss concealment is the family of techniques that fill the gap left by a missing packet with manufactured audio, so the listener hears something plausible instead of a hole. The name is precise: it does not recover the lost audio — the original sound is gone — it conceals the loss by generating a substitute. The whole art is making that substitute close enough to what was actually said that the listener never notices, or at least is not jarred.
The simplest answer, and why everyone abandoned it
The crudest PLC is zero insertion, also called silence substitution: when a packet is missing, play silence for those milliseconds. It is trivial to build and costs nothing.
It also sounds terrible. A sudden silence in the middle of a vowel is not heard as a quiet moment — it is heard as a sharp click or pop at each edge of the gap, because the waveform jumps abruptly to zero and back. The human ear is exquisitely sensitive to those discontinuities. A run of silence-filled gaps turns continuous speech into a stuttering, clicky mess that is harder to follow than the loss itself would suggest. Zero insertion is the baseline every other method is measured against, and every serious system has abandoned it.
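A minimal sketch of why the edges click. It assumes a steady 230 Hz tone sampled at 8 kHz as a stand-in for a held vowel (real speech is messier). It zeroes one 20 ms frame and compares the largest sample-to-sample jump inside the clean tone with the largest jump once the gap is inserted:

```python
import math

SAMPLE_RATE = 8000          # assumption: narrowband telephony rate
FRAME = SAMPLE_RATE // 50   # one 20 ms frame = 160 samples

def tone(n_samples, freq=230.0, amp=0.8):
    """A steady tone standing in for a held vowel."""
    return [amp * math.sin(2 * math.pi * freq * i / SAMPLE_RATE) for i in range(n_samples)]

def zero_insert(signal, lost_frame):
    """Silence substitution: overwrite the lost 20 ms frame with zeros."""
    out = list(signal)
    start = lost_frame * FRAME
    out[start:start + FRAME] = [0.0] * FRAME
    return out

def max_step(signal):
    """Largest jump between neighbouring samples; large jumps are heard as clicks."""
    return max(abs(b - a) for a, b in zip(signal, signal[1:]))

clean = tone(5 * FRAME)
patched = zero_insert(clean, lost_frame=2)
print(f"max step, clean tone:    {max_step(clean):.3f}")
print(f"max step, zero-inserted: {max_step(patched):.3f}")
```

In the clean tone, neighbouring samples never differ by much. At the gap's edges the waveform drops straight to zero from wherever it happened to be, and that single-sample cliff is the click.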
Waveform substitution: repeat the recent past
The next idea, and the one that powered VoIP for two decades, is waveform substitution: instead of silence, fill the gap by repeating a piece of the audio you just received. Speech is locally repetitive — a held vowel is essentially the same short waveform copied over and over — so a recent fragment is usually a good guess for what comes next.
The simplest form just repeats the last received frame. Better versions, like the one standardised in the ITU's recommendation G.711 Appendix I, work at the level of the pitch period — the length of one cycle of the speaker's voice, the thing that makes a voice sound high or low. Instead of repeating an arbitrary chunk, the algorithm finds the most recent pitch period and loops it, so the manufactured sound keeps the speaker's natural pitch rather than buzzing at the frame rate (ITU-T G.711 Appendix I, 1999). G.711 Appendix I is deliberately cheap — its published complexity is roughly 0.5 MIPS per channel, light enough for the simplest telephony hardware (ITU-T G.711 Appendix I, 1999).
To find the pitch period, the algorithm slides the recent waveform against itself and looks for the lag at which it best lines up — a calculation called a normalized cross-correlation. The lag that produces the highest correlation is the pitch period, and that fragment is what gets looped (US patent literature on waveform-alignment PLC, used in WebRTC's Expand). Two refinements make it sound less mechanical. First, the repeated fragment is faded down in volume the longer the gap lasts, because a loop held too long becomes an obvious drone; by fading toward silence, a long loss degrades gracefully instead of buzzing. Second, when the real audio finally returns, the manufactured tail and the fresh real audio are cross-faded together so the join does not click.
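The three steps above (find the pitch period by normalized cross-correlation, loop it with a fade, cross-fade on recovery) can be sketched as follows. This is a simplified illustration in the spirit of G.711 Appendix I, not the standard's reference code. The pitch search range (20 to 120 samples at 8 kHz, roughly 67 to 400 Hz), the 80-sample correlation window, the 60 ms linear fade and the linear cross-fade are all assumptions. The history must hold at least `window + max_lag` samples.

```python
import math

def find_pitch_period(history, min_lag, max_lag, window):
    """Slide the recent waveform against its own past and return the lag with the
    highest normalized cross-correlation; for voiced speech that is the pitch period."""
    recent = history[-window:]
    best_lag, best_score = min_lag, -2.0
    for lag in range(min_lag, max_lag + 1):
        past = history[-window - lag:-lag]
        dot = sum(a * b for a, b in zip(recent, past))
        norm = math.sqrt(sum(a * a for a in recent) * sum(b * b for b in past))
        score = dot / norm if norm > 0 else 0.0
        if score > best_score + 1e-9:   # on a tie keep the shorter lag, not a multiple of it
            best_lag, best_score = lag, score
    return best_lag

def conceal(history, n_missing, rate=8000, min_lag=20, max_lag=120, window=80, fade_ms=60):
    """Fill n_missing samples by looping the most recent pitch period,
    fading linearly to silence over fade_ms so a long loss decays instead of droning."""
    period = find_pitch_period(history, min_lag, max_lag, window)
    cycle = history[-period:]                 # the last full pitch cycle
    fade_len = rate * fade_ms // 1000         # samples until the loop reaches silence
    out = [cycle[i % period] * max(0.0, 1.0 - i / fade_len) for i in range(n_missing)]
    return out, period

def cross_fade(concealed, real):
    """Blend the manufactured tail into the first real samples with a linear ramp."""
    n = min(len(concealed), len(real))
    return [concealed[i] * (1 - (i + 1) / (n + 1)) + real[i] * (i + 1) / (n + 1)
            for i in range(n)]
```

On a 200 Hz tone at 8 kHz, `find_pitch_period` returns 40 samples (one cycle), and a 75 ms loss comes out as 60 ms of looped cycle that decays to silence. Looping a whole pitch cycle instead of an arbitrary 20 ms chunk is what keeps the manufactured sound at the speaker's pitch.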
Waveform substitution is genuinely good for short, isolated losses — a single missing 20-millisecond packet is usually inaudible. Its limits show up under two conditions. It cannot invent new sounds: if a packet that contained the start of a new word is lost, looping the previous vowel produces the wrong sound, not the missing word. And it degrades fast across a burst of consecutive losses, because each repeated-and-faded fragment drifts further from reality. Stretch a single pitch period across 200 milliseconds of loss and you get the watery, robotic smear everyone recognises from a bad call.
Figure 1. The four families of packet loss concealment, from the cheap-and-poor zero insertion to the expensive-and-excellent neural methods, on the same missing segment.
Model-based concealment: extend the speech, not the waveform
The third family steps up a level of abstraction. Instead of copying the raw waveform, model-based (or parametric) concealment uses the fact that the codec already describes speech as a small set of parameters — roughly, a model of the vocal tract (which shapes vowels and consonants) plus an excitation signal (the buzz of the vocal cords or the hiss of breath). When a packet is lost, the decoder extrapolates those parameters forward — holding the vocal-tract shape, continuing the pitch, gently decaying the energy — and synthesises new audio from the extended model rather than from a looped waveform.
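A toy sketch of the idea, and emphatically not SILK's algorithm (RFC 6716 §4.4 specifies that). It fits a short linear-prediction model to the recent audio using the Levinson-Durbin recursion, which stands in for the vocal tract. It drives that model with one pulse per pitch period, which stands in for the vocal cords, and it decays the pulse energy through the gap. The model order, the fixed pitch lag (a real decoder takes it from the last good frame), the decay rate and the bandwidth-expansion factor are illustrative assumptions.

```python
import math

def autocorr(x, order):
    """Biased autocorrelation r[0..order] (the 'autocorrelation method' of LPC)."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x))) for k in range(order + 1)]

def levinson_durbin(r, order):
    """Predictor a[1..order] with x[n] ~ sum(a[k] * x[n - k]); also returns residual energy."""
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err  # reflection coefficient
        prev = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        err *= 1.0 - k * k
    return a, err

def conceal_lpc(history, n_missing, order=10, pitch=40, decay=0.995, bw=0.98):
    """Extend the speech model into the gap instead of copying raw samples."""
    r = autocorr(history, order)
    r[0] *= 1.0001                                    # white-noise correction keeps the recursion well conditioned
    a, err = levinson_durbin(r, order)
    a = [coef * bw ** k for k, coef in enumerate(a)]  # bandwidth expansion: pull poles inward so ringing dies out
    amp = math.sqrt(err / len(history) * pitch)       # pulse power matched to the prediction residual
    mem = list(history[-order:])                      # synthesis-filter memory starts from the last real samples
    out = []
    for n in range(n_missing):
        excitation = amp if n % pitch == 0 else 0.0   # one pulse per pitch period
        y = excitation + sum(a[k] * mem[-k] for k in range(1, order + 1))
        out.append(y)
        mem.append(y)
        amp *= decay                                  # gently decay the energy the longer the gap lasts
    return out
```

The key difference from waveform substitution is that nothing is copied. The gap is synthesised from a handful of parameters, so the continuation follows the modelled vocal-tract shape and pitch rather than repeating a fixed chunk of waveform.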
This is how the concealment built into modern speech codecs works. Opus, the codec WebRTC uses for most voice, has a model-based concealment path in its SILK (speech) layer that the decoder runs automatically when it is told a frame is missing (IETF RFC 6716, §4.4, September 2012). Because it works on the speech model, it can ride out a slightly longer loss than raw waveform substitution before it starts to sound wrong, and it transitions more naturally between sounds. It still cannot invent words: a model extended into a gap produces a plausible continuation of the current sound, not the actual next syllable. Even so, for the same loss it is a meaningful step up in quality.
Neural concealment: let a network imagine the missing sound
The fourth family is the one that changed what is possible, and it arrived in production around 2020. Neural PLC replaces hand-tuned extrapolation rules with a deep neural network trained on enormous amounts of real speech. The network has, in effect, learned what human speech sounds like, so when it is handed the audio leading up to a gap it can generate a continuation that follows the natural statistics of speech — the right pitch contour, the right way a vowel decays into a consonant — far more convincingly than a fixed rule.
The first widely deployed example was Google's WaveNetEQ, shipped in Google Duo in 2020. It is a generative model built on DeepMind's WaveRNN that produces a chunk of audio to fill each gap. Because Duo is end-to-end encrypted, the model runs entirely on the receiver's phone, and Google engineered it to be fast enough to do so while still producing high-quality audio. It was trained on speech from more than 100 speakers across 48 languages, mixed with varied background noise so it would hold up in noisy rooms (Google Research, "Improving Audio Quality in Duo with WaveNetEQ", April 2020). Its honest, stated limitation is instructive: WaveNetEQ "learns how to plausibly continue speech on a short scale" — it can finish a syllable, but it does not predict words (Google Research, April 2020). Neural PLC, like every PLC before it, conceals; it does not read minds.
The most important recent development is in Opus 1.5, released in March 2024. Opus added a deep-learning PLC that, instead of hand-tuned heuristics, lets a DNN generate the missing audio; its authors placed second in Microsoft's 2022 Audio Deep Packet Loss Concealment Challenge with the underlying technique (Xiph.Org, "Opus 1.5 Released", March 2024). Crucially, deep PLC is a pure decoder upgrade — it changes nothing on the wire, so it is fully compatible with the existing Opus standard (RFC 6716) and any Opus encoder can talk to a deep-PLC-enabled decoder. You enable it at build time with --enable-deep-plc (about 1 MB of binary) and at run time by setting the decoder complexity to 5 or higher; the extra cost at a high loss rate is about 1 percent of one laptop CPU core (Xiph.Org, March 2024).
The neural vocoder that makes this cheap enough to ship is itself a piece of engineering worth naming: Opus 1.5 introduced FARGAN (a framewise autoregressive generative adversarial network) with a complexity of about 600 MFLOPS — one-fifth of the LPCNet vocoder it replaced — which is what lets it run in under 1 percent of a CPU core on a recent phone (Xiph.Org, March 2024).
When the loss is too long to imagine: DRED
Pure PLC, even neural PLC, has a hard ceiling: it can only continue from what it already heard. When packets go missing in a long burst — and on real networks they usually do, several in a row — entire phonemes or words disappear, and no amount of clever continuation can recover words that were never received. The honest fix is redundancy: send the audio more than once so a burst can be filled with what was actually said, not a guess.
Opus 1.5 ships a striking answer called Deep REDundancy (DRED). A normal codec keeps packets short — typically 20 milliseconds — to stay low-latency, which limits how much history it can pack in. DRED drops that constraint for the redundant copy: it uses a neural compressor (a rate-distortion-optimised variational autoencoder) to squeeze up to one full second of recent audio into about 12 to 32 kilobits per second of overhead, carried inside the padding of a normal Opus packet so older decoders simply ignore it. The effect is that every 20-millisecond frame is effectively transmitted about 50 times, spread across many later packets, at a cost similar to the older redundancy mechanism (Xiph.Org, March 2024). In the project's own testing with the full WebRTC stack, DRED kept speech intelligible even at 90 percent packet loss (Xiph.Org, March 2024). DRED is not yet a finalised standard — it is being worked through the IETF's mlcodec working group — so treat the 1.5 version as experimental, but the direction is clear: the line between "concealment" and "redundancy" is where the future of resilient audio is being decided.
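The "about 50 times" figure is simple arithmetic: one second of history divided into 20 ms frames. The same arithmetic shows which frames of a burst a decoder could rebuild. Below is a back-of-the-envelope sketch. It assumes each received packet carries redundancy covering the preceding `depth_ms` of audio. The real DRED bitstream, with its variable depth and quality, is more involved, and the sketch ignores whether the rebuilt frame arrives before its playout deadline.

```python
FRAME_MS = 20

def copies_per_frame(depth_ms, frame_ms=FRAME_MS):
    """How many later packets carry a given frame if each packet holds depth_ms of history."""
    return depth_ms // frame_ms

def recoverable(received, depth_ms, frame_ms=FRAME_MS):
    """For each lost frame (received[i] is False), report whether some later received
    packet's redundancy window still reaches back far enough to cover it."""
    frames_back = depth_ms // frame_ms
    result = {}
    for i, ok in enumerate(received):
        if ok:
            continue
        last = min(len(received), i + frames_back + 1)
        result[i] = any(received[j] for j in range(i + 1, last))
    return result

# A six-packet burst (120 ms) in the middle of an otherwise clean stream.
trace = [True] * 10 + [False] * 6 + [True] * 10
print(copies_per_frame(1000))                 # 50
print(all(recoverable(trace, 1000).values())) # True: one second of history covers the whole burst
print(recoverable(trace, 100))                # with only 100 ms of history, frame 10 is out of reach
```

Classic redundancy that carries only the previous frame covers a single lost packet. A deep history window is what turns a multi-packet burst from a guess into a recovery.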
What WebRTC actually does: NetEQ's Expand and Merge
In WebRTC, packet loss concealment is not a separate module you wire in — it lives inside NetEQ, the same adaptive jitter buffer that smooths uneven packet arrival, which we cover in depth in our article on the NetEQ jitter buffer. Every 10 milliseconds the audio device calls NetEQ's GetAudio function and NetEQ must return a slice of sound. When the packet it needs has not arrived — lost or merely late — it has no data to decode, and rather than play silence it performs the operation the WebRTC documentation calls Expand: it generates packet loss concealment "by extrapolating the remaining audio in the sync buffer or by asking the decoder to produce it" (WebRTC NetEq documentation, chromium.googlesource.com).
That single sentence captures the two-tier design. NetEQ has its own built-in waveform-substitution concealment, the pitch-period extrapolation described above, which works for any codec. For codecs that ship their own concealment, NetEQ can instead ask the codec's decoder to produce the missing audio. For Opus, that means NetEQ can invoke Opus's own model-based concealment (and, in a build with Opus 1.5 deep PLC enabled, its neural concealment). This is why a WebRTC stack that takes that path on a recent Opus can sound noticeably better under loss than one relying on the generic path alone.
When a real packet finally arrives after Expand has been manufacturing audio, the two cannot simply be stapled together — the seam would click, exactly like the edges of a zero-insertion gap. So NetEQ performs a second operation, Merge, which stitches the concealment output smoothly into the freshly decoded real audio so the join is inaudible (WebRTC NetEq documentation). Expand fills the hole; Merge hides the patch's edges. Together they are WebRTC's complete after-the-fact loss handling.
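A deliberately simplified model of how Expand and Merge are sequenced. It assumes one 10 ms tick per GetAudio call and asks only whether the packet due at that tick is in the buffer. This is not NetEQ's code: the real decision logic also weighs buffer level and time-stretching, and it has more operations than these three. The enum and function names are hypothetical.

```python
from enum import Enum

class Op(Enum):
    NORMAL = "normal"   # the packet that is due is here: decode it
    EXPAND = "expand"   # nothing to decode: manufacture concealment audio
    MERGE = "merge"     # first real packet after concealment: stitch it in smoothly

def choose_ops(arrivals):
    """arrivals[t] is True if the packet needed at 10 ms tick t is in the buffer.
    Returns the operation chosen at each tick."""
    ops = []
    previous = Op.NORMAL
    for available in arrivals:
        if not available:
            op = Op.EXPAND
        elif previous is Op.EXPAND:
            op = Op.MERGE       # a plain NORMAL here would leave an audible seam
        else:
            op = Op.NORMAL
        ops.append(op)
        previous = op
    return ops

print([op.value for op in choose_ops([True, True, False, False, True, True])])
# ['normal', 'normal', 'expand', 'expand', 'merge', 'normal']
```

The point of the sketch is the pairing. Every run of Expand ticks is followed by exactly one Merge tick when real audio returns, and that tick is where the patch's edge gets hidden.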
Figure 2. How NetEQ conceals a lost packet: Expand manufactures audio into the gap; Merge cross-fades the real packet back in when it returns.
A worked example: how fast quality falls with the loss pattern
Numbers make the burst problem concrete. Suppose your audio uses 20-millisecond frames, so each lost packet is a 20-millisecond hole. Consider two networks that both lose 5 percent of packets, but in different patterns.
On the first network, losses are random and isolated — one packet here, one there, never two in a row. With 20-millisecond frames, that means a typical gap is a single 20-millisecond hole surrounded by good audio on both sides. Waveform substitution loops one recent pitch period across 20 milliseconds, and the result is usually inaudible. Do the arithmetic on the worst single gap:
frame size = 20 ms
isolated loss = 1 frame
concealed span = 1 × 20 ms = 20 ms ← well within "inaudible" range
On the second network, the same 5 percent of loss arrives in bursts — when one packet drops, the next few usually drop with it. A burst of, say, six consecutive losses is a single continuous gap:
frame size = 20 ms
burst loss = 6 frames
concealed span = 6 × 20 ms = 120 ms ← far past where looping sounds natural
The same loss percentage produces a barely noticeable artifact in one case and an obvious robotic smear in the other, purely because of the pattern. This is why measuring only the loss percentage is misleading, and why the industry moved to realistic, bursty loss models for testing PLC; it is also exactly the regime where neural redundancy like DRED earns its keep, because 120 milliseconds of missing speech is too much to imagine but can be recovered if it was sent redundantly. The practical lesson: a "5 percent loss" headline tells you almost nothing about how your audio will actually sound — the burstiness does.
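You can reproduce the pattern effect without a real network. The sketch below compares independent random loss with a simplified two-state Gilbert-Elliott model, a standard way to generate bursty loss, in which every packet sent in the "bad" state is lost and every packet in the "good" state arrives. The transition probabilities are assumptions chosen so that both traces land near 5 percent loss, with bursty runs averaging about six packets (120 ms at 20 ms per frame).

```python
import random

FRAME_MS = 20

def random_loss(n, p, rng):
    """Independent losses: each packet is dropped with probability p."""
    return [rng.random() < p for _ in range(n)]

def gilbert_elliott_loss(n, p_good_to_bad, p_bad_to_good, rng):
    """Simplified two-state model: packets in the Bad state are lost, Good ones arrive.
    Mean burst length is 1 / p_bad_to_good; long-run loss is
    p_good_to_bad / (p_good_to_bad + p_bad_to_good)."""
    lost, bad = [], False
    for _ in range(n):
        if bad:
            bad = rng.random() >= p_bad_to_good
        else:
            bad = rng.random() < p_good_to_bad
        lost.append(bad)
    return lost

def burst_lengths(lost):
    """Lengths of the runs of consecutive lost packets."""
    runs, current = [], 0
    for is_lost in lost:
        if is_lost:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs

def summarize(name, lost):
    runs = burst_lengths(lost)
    print(f"{name}: loss {sum(lost) / len(lost):.1%}, "
          f"mean concealed span {FRAME_MS * sum(runs) / len(runs):.0f} ms, "
          f"longest {FRAME_MS * max(runs)} ms")

rng = random.Random(42)
summarize("random", random_loss(100_000, 0.05, rng))
summarize("bursty", gilbert_elliott_loss(100_000, 0.0088, 1 / 6, rng))
```

Both lines report roughly the same loss percentage. The mean concealed span, however, is about 20 ms for the random trace and well over 100 ms for the bursty one, which is the gap between an inaudible patch and a robotic smear.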
PLC is one of three tools — know which job it does
Packet loss concealment is the last line of defence and, by design, the only one of the three tools that involves a guess. It is worth placing it against the other two resilience tools, because teams routinely confuse them and reach for the wrong one.
Retransmission (RTX, defined in IETF RFC 4588) asks the sender to send the lost packet again. It recovers the exact audio with no guessing, but it costs at least one network round trip, so it only helps when the round trip is short relative to the jitter buffer's holding time. On a high-latency link the retransmitted packet arrives too late to play.
Forward error correction (FEC), including Opus's in-band low-bitrate redundancy and the RED format (IETF RFC 2198), sends redundant data ahead of time so a lost packet can be reconstructed without asking for it again. It costs bandwidth on every packet, whether or not loss occurs, but it adds no round-trip delay. We cover its trade-offs in our article on forward error correction, in-band FEC, and RED redundancy.
Packet loss concealment (PLC) does neither of those. It manufactures a substitute at the receiver, with zero added latency and zero extra bandwidth, but it is a guess and can only continue from what was already heard. The three are complementary, not alternatives: a well-built WebRTC stack uses FEC to repair most losses, RTX to recover some of the rest when latency permits, and PLC to conceal whatever is still missing. DRED is interesting precisely because it blurs the line, since it is redundancy delivered into the concealment path.
| Tool | What it does | Latency cost | Bandwidth cost | Recovers exact audio? |
|---|---|---|---|---|
| Retransmission (RTX) | Resends the lost packet on request | At least one round trip | Only when loss occurs | Yes |
| Forward error correction (FEC/RED) | Sends redundant data ahead of time | None | On every packet | Yes, up to the quality of the redundant copy |
| Packet loss concealment (PLC) | Manufactures a substitute at the receiver | None | None | No — it is a guess |
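The following rough sketch shows how the three tools might split the work on a single loss trace. Its rules are hypothetical simplifications, not any stack's real policy. FEC is assumed to carry a copy of the previous frame in the next packet, as Opus in-band FEC does. A retransmission is assumed useful only if loss detection (when the next packet arrives) plus one round trip fits inside the jitter-buffer delay. Everything else falls to PLC.

```python
def classify_losses(lost, rtt_ms, jitter_buffer_ms, use_fec=True, frame_ms=20):
    """Label each lost packet (lost[i] is True) with the tool that handles it.

    Illustrative assumptions:
      - FEC: packet i+1 carries a redundant copy of packet i.
      - RTX: the loss is noticed when the next received packet shows up, and the
        resend must land within the jitter-buffer delay to be played.
      - PLC: everything else, including a loss with no later packet to reveal it.
    """
    labels = {}
    for i, is_lost in enumerate(lost):
        if not is_lost:
            continue
        nxt = next((j for j in range(i + 1, len(lost)) if not lost[j]), None)
        if nxt is None:
            labels[i] = "PLC"
        elif use_fec and nxt == i + 1:
            labels[i] = "FEC"
        elif (nxt - i) * frame_ms + rtt_ms <= jitter_buffer_ms:
            labels[i] = "RTX"
        else:
            labels[i] = "PLC"
    return labels

trace = [False, True, False, True, True, True, False]   # one isolated loss, one 3-packet burst
print(classify_losses(trace, rtt_ms=50, jitter_buffer_ms=80))
# {1: 'FEC', 3: 'PLC', 4: 'PLC', 5: 'FEC'}
print(classify_losses(trace, rtt_ms=50, jitter_buffer_ms=120))
# {1: 'FEC', 3: 'RTX', 4: 'RTX', 5: 'FEC'}
```

Even this crude model shows the layering in the table. Single-frame FEC covers isolated losses and the tail of a burst, RTX reaches the middle of the burst only when the jitter buffer is deep enough to wait for it, and PLC takes the remainder.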
A common pitfall: treating a high conceal rate as a PLC bug
The most expensive mistake teams make is reading a high concealment rate and concluding "our PLC is broken". PLC sits at the very end of the pipeline; a flood of concealment almost always means packets are genuinely not arriving — real network loss, a congestion-control collapse, or a sender that throttled its bitrate — not that the concealment algorithm chose badly. Swapping in a fancier PLC when the real problem is 8 percent bursty loss makes the smear sound slightly better but does not remove the gaps. Measure the loss first; fix the network or add redundancy; tune concealment last.
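"Measure the loss first" can start from counters that the W3C webrtc-stats specification defines on the inbound RTP audio report: `packetsLost`, `packetsReceived`, `totalSamplesReceived`, `concealedSamples`, `silentConcealedSamples` and `concealmentEvents`. In the minimal sketch below, the two snapshots are plain dictionaries standing in for what `getStats()` would return a few seconds apart. The 48 kHz sample rate, the function name and the example numbers are assumptions for illustration.

```python
COUNTERS = ("packetsLost", "packetsReceived", "totalSamplesReceived",
            "concealedSamples", "silentConcealedSamples", "concealmentEvents")

def concealment_report(before, after, sample_rate=48_000):
    """Compare two inbound-rtp audio stats snapshots taken some seconds apart."""
    d = {k: after[k] - before[k] for k in COUNTERS}
    packets = d["packetsLost"] + d["packetsReceived"]
    loss = d["packetsLost"] / packets if packets else 0.0
    # totalSamplesReceived includes the concealed samples, so this is the concealed share of playout.
    concealed = d["concealedSamples"] / d["totalSamplesReceived"] if d["totalSamplesReceived"] else 0.0
    # One concealment event is one run of consecutive concealed samples,
    # so a long average event points at bursty loss rather than isolated drops.
    avg_event_ms = (1000 * d["concealedSamples"] / d["concealmentEvents"] / sample_rate
                    if d["concealmentEvents"] else 0.0)
    return {"loss": loss, "concealed": concealed, "avg_event_ms": avg_event_ms}

before = dict.fromkeys(COUNTERS, 0)
after = {"packetsLost": 50, "packetsReceived": 950, "totalSamplesReceived": 960_000,
         "concealedSamples": 48_000, "silentConcealedSamples": 0, "concealmentEvents": 8}
print(concealment_report(before, after))
# {'loss': 0.05, 'concealed': 0.05, 'avg_event_ms': 125.0}
```

In this made-up snapshot, concealment tracks the packet loss one for one, and each event averages 125 ms. That points to bursty network loss, not a concealment algorithm "choosing badly", so the fix belongs upstream or in redundancy.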
A second trap is judging PLC by the wrong metric. The old habit is to report a loss percentage and a generic audio score, but neither captures how concealment actually performs, because the same loss percentage sounds completely different depending on burstiness (as the worked example showed). The field now uses purpose-built measures. Microsoft's PLCMOS, a data-driven metric designed specifically to score concealment quality, was introduced alongside the 2022 Deep PLC Challenge precisely because general audio metrics missed concealment artifacts (Diener et al., 2022; Xiph.Org, March 2024). If you are comparing two PLC implementations, compare them on realistic bursty loss traces with a concealment-aware metric, not on uniform random loss scored with a general-purpose metric such as PESQ.
A third trap is assuming neural PLC is free. It is cheap — about 1 percent of a CPU core for Opus deep PLC — but "cheap" is not "zero", and it requires a build that compiled it in and a decoder complexity set high enough to switch it on (5 or more for deep PLC) (Xiph.Org, March 2024). A team that expects neural concealment but ships a stock decoder build gets the generic path and wonders why the audio still smears.
Where Fora Soft fits in
Fora Soft has built real-time audio into video conferencing, telemedicine, e-learning, and live-shopping products since 2005, and concealment behaviour is a routine part of how we diagnose and tune those systems. When a customer reports robotic or watery audio, the first move is to read the concealment counters in getStats, look at whether the loss is isolated or bursty, and decide whether the fix is upstream (network, congestion control, adding FEC) or in the decoder build (enabling a better Opus concealment path). A telemedicine call on hospital Wi-Fi and a 500-person webinar on mixed consumer connections lose packets in different patterns, and the right resilience mix — FEC, retransmission, concealment, and when it matters, neural redundancy — comes from the measured loss profile, not a guess. We instrument those statistics from the start so "the audio sounds broken" becomes a number we can read.
What to read next
- Jitter Buffer: NetEQ, The Brain Of WebRTC Audio
- Forward Error Correction (FEC), In-Band FEC, And RED Redundancy
- The WebRTC Audio Pipeline End-to-End
Call to action
- Talk to an audio engineer — book a 30-minute scoping call to talk through your packet loss concealment plan.
- See our case studies — 250+ shipped projects across video streaming, WebRTC, OTT, telemedicine, e-learning, surveillance, and AR/VR.
- Download the PLC decision & diagnostics cheat sheet — One-page reference: the four PLC families (zero insertion, waveform substitution / G.711 App I, model-based, neural), what WebRTC's NetEQ Expand and Merge operations do, how PLC compares with FEC/RED and retransmission (latency,….
References
- WebRTC project, NetEq (g3doc), chromium.googlesource.com — the canonical maintainer documentation: the GetAudio decision sequence, the six output operations including Expand (packet loss concealment generated by NetEq or the decoder) and Merge (audio stitched from concealment to decoded data after a loss), and the rule that NetEq either extrapolates from the sync buffer or asks the decoder to produce the concealment. Tier 2 (reference-implementation documentation); the controlling source for what WebRTC actually does on a lost packet. https://chromium.googlesource.com/external/webrtc/+/master/modules/audio_coding/neteq/g3doc/index.md
- ITU-T Recommendation G.711 Appendix I, A high quality low-complexity algorithm for packet loss concealment with G.711 (1999) — the standard waveform-substitution PLC: pitch-period repetition, energy fade for long losses, cross-fade on recovery, and the ~0.5 MIPS-per-channel complexity figure. Tier 1 (official ITU-T standard). https://www.itu.int/rec/T-REC-G.711-199909-I!AppI/en
- IETF RFC 6716, Definition of the Opus Audio Codec (J.-M. Valin, K. Vos, T. Terriberry), September 2012 — §4.4 on Opus/SILK packet loss concealment (model-based extrapolation invoked when a frame is flagged lost); the codec whose decoder NetEQ drives for concealment. Standards-track; updated by RFC 8251 (2017). Tier 1. https://www.rfc-editor.org/rfc/rfc6716
- Xiph.Org Foundation, Opus 1.5 Released (Opus development team), March 2024 — the primary source on Opus deep PLC (--enable-deep-plc, decoder complexity ≥ 5, ~1% CPU, ~1 MB), the FARGAN neural vocoder (~600 MFLOPS, one-fifth of LPCNet), Deep REDundancy / DRED (up to 1 s of audio in 12–32 kb/s, every 20 ms frame effectively sent ~50×, intelligible at 90% loss), the second-place finish in the 2022 Audio Deep PLC Challenge, PLCMOS, and the IETF mlcodec standardisation path. Tier 3 (first-party maintainer publication). https://opus-codec.org/demo/opus-1.5/
- Google Research, Improving Audio Quality in Duo with WaveNetEQ (April 2020) — the first widely deployed neural PLC: a WaveRNN-based generative model running on-device, trained on 100+ speakers across 48 languages with noise augmentation; the explicit limitation that it continues speech on a short scale (finishes syllables, does not predict words). Tier 4 (vendor engineering blog from the deployer). https://research.google/blog/improving-audio-quality-in-duo-with-waveneteq/
- L. Diener et al., INTERSPEECH 2022 Audio Deep Packet Loss Concealment Challenge (arXiv:2204.05222), 2022 — the challenge framing and the PLCMOS data-driven concealment-quality metric; establishes why concealment must be evaluated on realistic bursty loss rather than uniform random loss. Tier 5 (peer-reviewed/academic; primary source for PLCMOS). https://arxiv.org/abs/2204.05222
- IETF RFC 4588, RTP Retransmission Payload Format (J. Rey et al.), July 2006 — defines RTX, used here to distinguish exact-recovery retransmission from concealment. Standards-track. Tier 1. https://www.rfc-editor.org/rfc/rfc4588
- IETF RFC 2198, RTP Payload for Redundant Audio Data (C. Perkins et al.), September 1997 — the RED redundancy format referenced when contrasting PLC with sender-side redundancy. Standards-track. Tier 1. https://www.rfc-editor.org/rfc/rfc2198
- W3C, Identifiers for WebRTC's Statistics API (webrtc-stats), Candidate Recommendation — defines concealedSamples, concealmentEvents, and silentConcealedSamples on the inbound RTP audio stream report; the controlling source for the production concealment statistics in this article. https://www.w3.org/TR/webrtc-stats/
Source-conflict note (per our standards-first rule): popular articles often equate "PLC" with "the fix for packet loss" and lump it together with FEC and retransmission. The WebRTC documentation (ref 1) and RFCs 4588/2198 (refs 7–8) draw the precise boundary this article uses — PLC conceals at the receiver with no recovery of the original audio, while RTX and FEC recover it — so the article follows the specs and treats the looser usage as imprecise.


