Why this matters
If you build video conferencing, telemedicine, contact-centre, or online-classroom software, most of your participants are silent most of the time — listening, not talking. Sending a continuous stream of audio packets for every silent microphone wastes bandwidth and server capacity, and in a large call it is the difference between a system that scales to 500 people and one that falls over at 50. This article is for the product manager, founder, or operations lead who needs to understand that trade-off well enough to make a decision and ask an engineer a precise question — not to write the detector themselves. A senior engineer will also find every claim traced to the original RFC, ITU-T Recommendation, or the maintainer's own implementation.
The problem: a microphone that never stops talking
Start with what actually travels across the network on a voice call. The microphone captures sound continuously, the encoder chops it into small chunks — typically 20 milliseconds of audio each, called a frame — and the system wraps each frame in a packet and sends it. Twenty milliseconds per frame means fifty frames every second, fifty packets every second, for as long as the call is open.
Here is the catch: in a normal two-person conversation, each person is actually speaking less than half the time. In a meeting with ten people, any one participant might talk for two minutes out of an hour. The rest of the time their microphone is sending packets that carry nothing but the hum of their room — the fan, the air conditioner, the faint hiss of the office. Every one of those packets costs bandwidth on the sender's connection, bandwidth on the receiver's connection, and processing time on whatever server sits in the middle. Multiply that by every silent participant in every call, and the waste is enormous.
The fix has two halves that always travel together. The first half is a decision: is this frame speech, or is it silence? That decision is the job of voice activity detection, which from here we will call VAD — the part of the system that listens to each frame and labels it "speech" or "not speech". The second half is an action: given that this frame is silence, stop sending full packets. That action is discontinuous transmission, which we will call DTX — literally, transmission that is allowed to stop and start rather than running continuously. VAD decides; DTX acts on the decision. You cannot have useful DTX without a VAD feeding it, which is why the two are almost always discussed as one topic.
How voice activity detection makes its call
A VAD has one job and a hard constraint. The job: look at the current frame and output a single bit — speech or not. The constraint: do it now, with almost no peek at the future, because a live conversation falls apart once the round trip climbs past roughly 300 to 400 milliseconds, so the detector cannot sit and wait for more audio before deciding.
The simplest VAD just measures loudness. If the frame's energy is above a threshold, call it speech; below, call it silence. This is fast and tiny, and it fails the moment the room is noisy, because a loud air conditioner clears the threshold just as easily as a quiet voice does. Every serious VAD therefore looks at more than raw loudness. It looks at how the energy is distributed across frequencies, because speech and steady noise have different shapes. A voice has structure — it concentrates energy in the bands where human speech lives, and that structure changes from moment to moment as the talker forms different sounds. A fan or an air conditioner is flat and unchanging by comparison. One useful measure of this is spectral flatness: a single number that is high when energy is spread evenly across all frequencies, the way noise is, and low when energy is peaked in a few bands, the way voice is.
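To make the difference between loudness and spectral shape concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not any production detector: the function names, the two thresholds, and the naive DFT are all hypothetical choices made for readability.

```python
import cmath
import math
import random


def frame_energy(frame):
    """Mean squared amplitude of one frame of samples in [-1.0, 1.0]."""
    return sum(s * s for s in frame) / len(frame)


def power_spectrum(frame):
    """Naive DFT power spectrum (bins 1 .. N/2-1); slow but dependency-free."""
    n = len(frame)
    spectrum = []
    for k in range(1, n // 2):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(acc) ** 2 + 1e-12)  # small floor avoids log(0)
    return spectrum


def spectral_flatness(frame):
    """Geometric mean over arithmetic mean of the power spectrum, in (0, 1]."""
    p = power_spectrum(frame)
    geometric = math.exp(sum(math.log(x) for x in p) / len(p))
    arithmetic = sum(p) / len(p)
    return geometric / arithmetic


def toy_vad(frame, energy_floor=1e-4, flatness_ceiling=0.3):
    """Label a frame speech only if it is loud enough AND spectrally peaked.

    Both thresholds are illustrative assumptions, not tuned values.
    """
    return frame_energy(frame) > energy_floor and spectral_flatness(frame) < flatness_ceiling


# One 20 ms frame at 8 kHz is 160 samples.
rng = random.Random(0)
fan_noise = [rng.gauss(0, 0.1) for _ in range(160)]                         # loud and flat
voiced = [0.3 * math.sin(2 * math.pi * 200 * t / 8000) for t in range(160)]  # loud and peaked
```

Both test frames clear the energy floor, so a loudness-only detector would call both of them speech. The flatness check is what separates them: the broadband noise frame scores high and is rejected, while the peaked frame scores close to zero and is accepted.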
Three generations of VAD use that same idea with increasing sophistication.
The classical telecom VAD, standardised in ITU-T Recommendation G.729 Annex B (1996), was built for the phone network. It combines four features — full-band energy, low-band energy, a measure of the spectral shape, and the zero-crossing rate — into a speech/non-speech decision, and it continuously tracks the background noise so the threshold adapts as the room gets louder or quieter (ITU-T G.729 Annex B, 1996). It is the reference that later designs measured themselves against.
The WebRTC VAD, which ships inside Chrome, Edge, and most native voice apps, is the one most engineers actually touch. It splits each frame into six frequency bands, computes the energy in each, and feeds those into a small statistical model — a Gaussian mixture model, a method that holds one rough template for "speech energy" and one for "noise energy" and asks which the current frame resembles more. It accepts frames of 10, 20, or 30 milliseconds, and it exposes a single tuning knob — an "aggressiveness" setting from 0 to 3, where 0 reports speech readily and 3 is the most eager to call something noise. Raising the aggressiveness rejects more noise but risks chopping the quiet edges of real words (WebRTC common_audio/vad, libwebrtc source).
The neural VAD, the modern default for accuracy, replaces the hand-built rules with a small network trained on thousands of hours of speech and noise. The widely used open example, Silero VAD, is a compact model (about 309 thousand parameters, roughly 1 to 2 megabytes on disk in its version 5, released in 2024) that processes the audio in 512-sample chunks (32 milliseconds at a 16 kHz sample rate) and carries a little context from the previous chunk so it can read the flow of speech rather than judging each chunk in isolation. On a normal CPU it runs far faster than real time (its maintainers report under 1 millisecond per chunk) and it holds up across many languages and noise types where the energy-based detectors stumble (Silero VAD v5 release notes, 2024). The price is a model file to ship and a little more computation than the six-band approach.
The takeaway for a non-engineer: all three answer the same yes/no question. The newer the design, the better it tells a quiet voice from a loud room — and the better it is, the less it clips your users' words.
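The "which template does this frame resemble more" idea behind the WebRTC detector can be sketched with a single feature and two Gaussians. This is a toy, not the libwebrtc code: the real detector works on six bands and adapts its models as it runs, and the means, spreads, and per-level bias values below are hypothetical numbers chosen only to show how an aggressiveness setting shifts the decision.

```python
import math


def gaussian_log_pdf(x, mean, std):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * math.log(2 * math.pi * std * std) - (x - mean) ** 2 / (2 * std * std)


# Hypothetical templates for frame log-energy in dB. Equal spreads keep the
# decision monotone in energy, which makes the toy easier to reason about.
SPEECH = {"mean": -25.0, "std": 6.0}
NOISE = {"mean": -50.0, "std": 6.0}

# Hypothetical bias per aggressiveness level: a higher level demands
# stronger evidence before a frame is labelled speech.
BIAS = {0: 0.0, 1: 1.0, 2: 2.5, 3: 4.0}


def is_speech(log_energy_db, aggressiveness=0):
    """Compare how speech-like and how noise-like one frame looks."""
    log_likelihood_ratio = (
        gaussian_log_pdf(log_energy_db, **SPEECH)
        - gaussian_log_pdf(log_energy_db, **NOISE)
    )
    return log_likelihood_ratio > BIAS[aggressiveness]
```

With these assumed numbers, a borderline frame at -33 dB is labelled speech at level 0 but rejected at level 3, which is the word-edge clipping risk in miniature: the same quiet frame survives a gentle setting and is dropped by an aggressive one.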
The two mistakes every VAD can make
A VAD is a classifier, and like any classifier it has exactly two ways to be wrong. Naming them is the single most useful thing in this article, because every complaint you will ever receive about DTX maps to one of them.
A false negative is calling real speech "silence". The system then stops transmitting in the middle of a word, and the listener hears the first syllable of a sentence get swallowed — "...orning, everyone" instead of "Good morning, everyone". This is the clipped-speech complaint, and it is the one users hate most, because losing words breaks understanding.
A false positive is calling noise "speech". The system keeps transmitting full packets during silence, so you simply lose some of the bandwidth savings DTX was supposed to give you. Annoying for your cloud bill, invisible to the user.
These two errors trade off against each other directly, and the aggressiveness knob is the lever. Turn the VAD more aggressive and it rejects more noise — fewer false positives, more savings — but it also starts clipping quiet word-endings — more false negatives. Turn it gentler and speech is safe but the savings shrink. There is no setting that eliminates both; every real system picks a point on that line. The engineering trick that buys back most of the clipped-speech problem is hangover: after the VAD stops detecting speech, the system keeps transmitting for a short tail — a few extra frames — in case the talker is only pausing between words rather than truly finished. The classical G.729 Annex B VAD uses exactly this kind of noise-dependent hangover to protect the trailing edge of speech (ITU-T G.729 Annex B, 1996). It costs a little bandwidth to hold the line open, and it saves you most of your clipped-word tickets.
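Hangover is simple enough to show in a few lines. The sketch below uses a fixed tail length, which is a simplification: G.729 Annex B varies the tail with the noise level, and the default of five frames here is an assumed value, not a recommendation.

```python
def apply_hangover(raw_flags, hangover_frames=5):
    """Keep transmitting for `hangover_frames` frames after speech stops.

    raw_flags is the per-frame VAD output (True means speech). The result
    is the per-frame "send a full packet" decision.
    """
    send = []
    remaining = 0
    for is_speech in raw_flags:
        if is_speech:
            remaining = hangover_frames  # re-arm the tail on every speech frame
            send.append(True)
        elif remaining > 0:
            remaining -= 1               # still inside the tail: keep sending
            send.append(True)
        else:
            send.append(False)           # tail used up: DTX may engage
    return send
```

A two-frame pause between words is bridged when the tail is at least two frames long, so the stream never stops mid-sentence. The cost is that every genuine end of speech also transmits the same number of extra frames.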
Discontinuous transmission: acting on the decision
Once the VAD says "this stretch is silence", DTX decides what to actually put on the wire. The naive answer — send nothing at all — has a problem the telephone industry learned decades ago: total digital silence sounds broken. When the faint room hiss that was there a moment ago suddenly drops to a dead, absolute nothing, listeners assume the call has dropped and start saying "Hello? Are you still there?" The fix is to not go fully silent on the wire, but to send a tiny packet that says, in effect, "play a quiet hiss that matches the room until I tell you otherwise". That tiny packet is called a silence insertion descriptor, or SID — and the matched hiss the receiver plays from it is comfort noise, generated by a comfort noise generator (CNG).
The structure of that SID packet is standardised. IETF RFC 3389 (September 2002) defines an RTP payload that carries comfort-noise parameters: a single byte for the noise level (how loud the hiss should be, expressed in dBov, the level relative to the system's maximum) plus optional bytes describing the noise's spectral shape (its colour — bassy hum versus bright hiss) as reflection coefficients (RFC 3389, §3). The receiver feeds those parameters into its comfort noise generator and synthesises an appropriate hiss locally. Note what is not in the packet: any actual audio. RFC 3389 was written specifically for older codecs that have no built-in silence handling — G.711, G.722, G.726 — and it was modelled on the comfort-noise scheme in Appendix II of ITU-T Recommendation G.711 (February 2000), which in turn borrowed the VAD and DTX of G.729 Annex B (RFC 3389, §2). The payload carries a fixed RTP payload type of 13 at the 8 kHz clock rate (RFC 3389, §4).
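The payload layout described above is small enough to pack and unpack by hand. The sketch below handles only the level byte and treats any spectral bytes as opaque, since it does not attempt the reflection-coefficient quantisation; the function names are hypothetical.

```python
def build_cn_payload(noise_level_dbov, spectral_bytes=b""):
    """Pack a comfort-noise payload: one level byte, then optional spectral bytes.

    noise_level_dbov is a non-positive level such as -60. The wire format
    stores its magnitude (0..127) in the low 7 bits, with the top bit zero.
    """
    magnitude = -int(round(noise_level_dbov))
    if not 0 <= magnitude <= 127:
        raise ValueError("noise level must be between -127 and 0 dBov")
    return bytes([magnitude]) + bytes(spectral_bytes)


def parse_cn_payload(payload):
    """Return (noise_level_dbov, spectral_bytes) from a comfort-noise payload."""
    if not payload:
        raise ValueError("comfort-noise payload needs at least the level byte")
    level = -(payload[0] & 0x7F)
    return level, bytes(payload[1:])
```

A level of -60 dBov with no spectral information is a single byte on the wire, which is the whole point: the receiver gets enough to synthesise a plausible hiss and nothing more.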
Here is the arithmetic that makes the whole exercise worth it. A normal voice packet on a WebRTC call carries 20 milliseconds of encoded audio — for Opus speech, on the order of 40 to 80 bytes of payload, sent 50 times a second. A SID/comfort-noise packet carries only a level byte and a few spectral bytes — a handful of bytes — and it is sent far less often, because the room's hiss barely changes from moment to moment.
Active speech: 20 ms per frame → 50 packets/second, ~40–80 bytes payload each
Silence (DTX): ~2–3 byte descriptor, sent only every few hundred ms
Result: the idle microphone's payload drops by more than 99%
A worked example. Suppose one participant carries 60 bytes of Opus payload every 20 ms while speaking, which is 60 × 50 = 3,000 bytes per second of payload. The same microphone, idle under DTX sending one 3-byte descriptor every 400 ms, carries 3 × 2.5 = 7.5 bytes per second. That is a payload reduction of about 99.75 percent on the idle stream — the savings are dramatic precisely because most microphones in a call are idle most of the time. (Once you add the fixed RTP, UDP, and IP packet headers back in, the total per-stream saving is smaller than the payload saving, but still large, which is why the rule of thumb people quote for whole calls is "tens of percent" rather than "ninety-nine percent".)
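The same arithmetic, with and without packet headers, fits in a short script. The header sizes assume a plain 12-byte RTP header, UDP, and IPv4 with no options; encryption overhead and header extensions are ignored, so treat the with-headers figure as a rough estimate.

```python
RTP_HEADER = 12   # bytes, fixed RTP header without extensions
UDP_HEADER = 8    # bytes
IPV4_HEADER = 20  # bytes, no options


def bytes_per_second(payload_bytes, interval_ms, include_headers=False):
    """Average bytes per second for packets of a given size and spacing."""
    packets_per_second = 1000 / interval_ms
    overhead = RTP_HEADER + UDP_HEADER + IPV4_HEADER if include_headers else 0
    return (payload_bytes + overhead) * packets_per_second


def saving(active_rate, idle_rate):
    """Fraction of the active rate saved while the stream is idle."""
    return 1 - idle_rate / active_rate


# Numbers from the worked example: 60-byte frames every 20 ms while
# speaking, one 3-byte descriptor every 400 ms while idle.
payload_saving = saving(bytes_per_second(60, 20), bytes_per_second(3, 400))
wire_saving = saving(
    bytes_per_second(60, 20, include_headers=True),
    bytes_per_second(3, 400, include_headers=True),
)
```

Under these assumptions the payload saving is 99.75 percent and the on-the-wire saving for the idle stream is about 97.85 percent (107.5 bytes per second against 5,000). The whole-call figure is lower again because it averages in the participants who are actually speaking.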
How Opus does it — and one trap to avoid
Opus, the codec that carries almost all WebRTC audio, has VAD, DTX, and comfort noise built in, which is why a modern stack rarely uses the separate RFC 3389 machinery. Opus's own VAD lives in the speech half of the codec (the part called SILK) and judges each frame on band energy and spectral flatness, the same family of features described above. When DTX is enabled and the VAD reports silence, Opus encodes a frame only about once every 400 milliseconds instead of every 20 — and the decoder fills the gaps using its built-in loss-concealment and comfort-noise path (Opus implementation; getstream.io WebRTC engineering notes, 2024).
Three precise facts from the specification matter to anyone configuring this:
First, DTX is off by default. The Opus RTP payload format defines a session parameter usedtx, and "if no value is specified, the default is 0" — meaning continuous transmission (RFC 7587, §6.1, June 2015). You have to ask for DTX explicitly.
Second, when Opus drops silent frames it must drop whole frames only, sized so that successive RTP timestamps differ by a multiple of 120, so the receiver can tell the gap is intentional DTX rather than lost packets, and so it can use whole frames for concealment (RFC 7587, §3.1.3). A receiver distinguishes DTX from packet loss by reading the gap in timestamps against the sequence numbers (RFC 7587, §3.1.3).
Third — and this is the trap — do not pair Opus with the old RFC 3389 comfort-noise payload. Opus generates its own comfort noise internally; bolting RFC 3389 on top is redundant and is explicitly discouraged by the standard: "Using Comfort Noise as defined in [RFC3389] with Opus is discouraged" (RFC 7587, §3.1.3). The RFC 3389 path is for the codecs that lack their own silence handling, not for Opus.
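The second fact, telling an intentional DTX gap from packet loss, can be sketched as a comparison of the two RTP counters. This is a simplified illustration that assumes packets arrive in order and that every frame has the same duration; the function name and return labels are hypothetical, and a real jitter buffer handles reordering and duplicates as well.

```python
def classify_gap(prev_seq, prev_ts, seq, ts, frame_ticks=960):
    """Classify what happened between two consecutively received RTP packets.

    frame_ticks is one frame's duration in RTP clock ticks
    (960 is 20 ms at the 48 kHz clock Opus uses over RTP).
    """
    seq_step = (seq - prev_seq) & 0xFFFF       # sequence numbers wrap at 16 bits
    ts_step = (ts - prev_ts) & 0xFFFFFFFF      # timestamps wrap at 32 bits
    if seq_step != 1:
        return "loss_or_reorder"   # the sender numbered packets we never saw
    if ts_step == frame_ticks:
        return "continuous"        # next packet, next frame: nothing skipped
    if ts_step > frame_ticks and ts_step % 120 == 0:
        return "dtx"               # no packet missing, but time jumped forward
    return "unexpected"
```

The key observation is that DTX leaves the sequence numbers contiguous while the timestamp jumps, whereas loss leaves a hole in the sequence numbers. A 400 ms pause shows up as a timestamp step of 19,200 ticks with a sequence step of exactly one.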
A second trap worth naming: Opus DTX engages in its speech-oriented mode, not in the pure music mode. If you force the codec into music mode, or set it up so the silence path never activates, you will pay full freight on idle microphones even with usedtx=1 set. Match the codec mode to the content — voice for a meeting, not music.
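In stacks that expose the session description as text, asking for DTX usually comes down to adding usedtx=1 to the Opus format-parameters line. The helper below is a sketch of that edit, assuming CRLF line endings and an existing a=fmtp line for Opus; the function name is hypothetical, and where in the offer/answer exchange to apply it depends on your stack.

```python
import re


def enable_opus_dtx(sdp):
    """Return the SDP text with usedtx=1 set on every Opus fmtp line."""
    lines = sdp.split("\r\n")
    # Collect the payload type numbers that the rtpmap lines assign to Opus.
    opus_payload_types = {
        m.group(1)
        for line in lines
        if (m := re.match(r"a=rtpmap:(\d+) opus/48000", line, re.IGNORECASE))
    }
    out = []
    for line in lines:
        m = re.match(r"a=fmtp:(\d+) (.*)", line)
        if m and m.group(1) in opus_payload_types:
            # Drop any existing usedtx setting, then append usedtx=1.
            params = [
                p for p in m.group(2).split(";")
                if p and not p.strip().startswith("usedtx=")
            ]
            params.append("usedtx=1")
            line = f"a=fmtp:{m.group(1)} " + ";".join(params)
        out.append(line)
    return "\r\n".join(out)
```

The function is idempotent, so running it twice leaves a single usedtx=1, and it leaves non-Opus payload types untouched. If the description has no fmtp line for Opus at all, this sketch changes nothing, which a production version would need to handle.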
A common pitfall: turning DTX on for the wrong call
DTX is close to free money for a conversation, where the whole premise is that people take turns and most microphones are idle. It is the wrong default for a broadcast of continuous sound — a music performance, a DJ set streamed over WebRTC, an ambient-audio installation, a baby monitor whose entire purpose is to transmit the quiet room. In those cases the "silence" the VAD detects is the signal the listener actually wants, and DTX will strip the soft passages and replace them with synthetic hiss. The standard says as much: continuous transmission is recommended unless network constraints are severe, because DTX carries a slightly lower audio quality than sending every frame (RFC 7587, §3.1.3). The rule of thumb: DTX on for talking, DTX off for listening to sound that is meant to be quiet.
There is also a tuning interaction worth flagging to your engineers. DTX sits downstream of noise suppression and gain control in the capture chain. If the noise suppressor is aggressive, the VAD sees a cleaner signal and makes better silence calls; if the gain control pumps up a near-silent room, the VAD may mistake amplified hiss for speech and keep the stream alive. The pieces are not independent — they are tuned together.
VAD has a second career: it is not only about saving bandwidth
It is worth knowing that the same speech/silence decision is reused all over a modern voice product, often more visibly than the bandwidth job.
A transcription or speech-recognition pipeline uses VAD to find where utterances begin and end, so it can cut the audio into sentence-sized pieces to feed the recogniser — this is the "endpointing" that decides when your voice assistant thinks you have finished a sentence. A recording system uses VAD to skip dead air, producing a shorter file and a cleaner transcript. A noise-gate in a conferencing UI uses VAD to drive the "you are speaking" indicator and to mute a participant's stream until they actually talk. A voice agent uses VAD for turn-taking — deciding the human has stopped so the agent may start. Silero VAD's popularity comes largely from these uses, not from DTX. So when an engineer says "we need a good VAD", they may be talking about transcription accuracy or agent responsiveness, not bandwidth at all — worth clarifying which problem is on the table.
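Endpointing for transcription or turn-taking is the same per-frame decision read at a coarser scale. Here is a minimal sketch that turns a list of speech flags into utterance boundaries; the 20 ms frame and the 15-frame (300 ms) silence threshold are assumed values, and real endpointers tune the threshold per product.

```python
def find_utterances(flags, frame_ms=20, min_silence_frames=15):
    """Turn per-frame speech flags into (start_ms, end_ms) utterances.

    An utterance closes only after `min_silence_frames` consecutive
    non-speech frames, so short pauses between words do not split it.
    """
    utterances = []
    start = None        # index of the first speech frame in the open utterance
    last_speech = None  # index of the most recent speech frame
    for i, is_speech in enumerate(flags):
        if is_speech:
            if start is None:
                start = i
            last_speech = i
        elif start is not None and i - last_speech >= min_silence_frames:
            utterances.append((start * frame_ms, (last_speech + 1) * frame_ms))
            start = None
    if start is not None:
        # The audio ended while an utterance was still open.
        utterances.append((start * frame_ms, (last_speech + 1) * frame_ms))
    return utterances
```

A short silence threshold makes a voice agent feel responsive but cuts people off when they pause to think; a long one does the reverse. That is the same trade-off as the aggressiveness knob, expressed in milliseconds.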
Where Fora Soft fits in
Fora Soft has built real-time audio into video conferencing, telemedicine, e-learning, and live-shopping products since 2005. In those systems the VAD and DTX configuration is a routine but consequential tuning decision: a telemedicine call wants a gentle VAD with generous hangover so a doctor's quiet aside is never clipped, while a 500-person webinar wants aggressive DTX on every audience microphone so the media server is not drowning in idle packets. We tune that balance per product, watch the clipped-speech and packet-rate metrics together, and keep DTX off where the content is continuous sound rather than conversation. The same speech/silence signal then feeds the transcription and "active speaker" features users actually see.
What to read next
- The WebRTC Audio Pipeline End-to-End
- Opus: The Open Codec That Ate WebRTC
- Jitter Buffer: NetEQ, The Brain Of WebRTC Audio
Call to action
- Talk to an audio engineer — book a 30-minute scoping call to talk through your voice activity detection plan.
- See our case studies — 250+ shipped projects across video streaming, WebRTC, OTT, telemedicine, e-learning, surveillance, and AR/VR.
- Download the VAD & DTX tuning cheat sheet — One-page reference: VAD generations (energy / G.729-B classical / WebRTC GMM / Silero neural), the two failure modes (false negative clips words, false positive wastes bandwidth), hangover, the Opus DTX settings (usedtx default 0, ~400….
References
- IETF RFC 6716, Definition of the Opus Audio Codec (J.-M. Valin, K. Vos, T. Terriberry), September 2012 — the normative Opus definition, including the SILK speech path that hosts Opus's VAD. Updated by RFC 8251 (2017). https://www.rfc-editor.org/rfc/rfc6716
- IETF RFC 7587, RTP Payload Format for the Opus Speech and Audio Codec, June 2015 — §3.1.3 (DTX: whole-frame dropping, timestamp multiple of 120, DTX-vs-loss detection, "Using Comfort Noise as defined in [RFC3389] with Opus is discouraged") and §6.1 (usedtx default 0). The controlling source for how Opus DTX is signalled over RTP. https://www.rfc-editor.org/rfc/rfc7587
- IETF RFC 3389, RTP Payload for Comfort Noise (CN) (R. Zopf), September 2002 — the SID payload format: one level byte in dBov plus optional reflection-coefficient spectral bytes; static payload type 13; the VAD/DTX/CNG division of labour (§5). Standards-track. https://www.rfc-editor.org/rfc/rfc3389
- ITU-T Recommendation G.729 Annex B (1996), A silence compression scheme for G.729 optimized for terminals conforming to ITU-T V.70, the reference classical VAD/DTX/CNG: four-feature VAD, SID frames, noise-dependent hangover. https://www.itu.int/rec/T-REC-G.729
- ITU-T Recommendation G.711 Appendix II (February 2000), A comfort noise payload definition for ITU-T G.711 use in packet-based multimedia communication systems — the comfort-noise model RFC 3389 is based on. https://www.itu.int/rec/T-REC-G.711
- libwebrtc, common_audio/vad — the open-source WebRTC VAD: six frequency bands, Gaussian mixture model, frame sizes 10/20/30 ms, aggressiveness modes 0–3. Reference implementation shipping in Chrome and Edge. https://webrtc.googlesource.com/src/+/refs/heads/main/common_audio/vad/
- Silero VAD, version 5 release notes (2024) — a compact neural VAD (~309K parameters, ~1–2 MB), 512-sample (32 ms at 16 kHz) streaming chunks, sub-millisecond per-chunk inference on CPU, multilingual. https://github.com/snakers4/silero-vad
- GetStream WebRTC engineering notes, Opus Discontinuous Transmission (DTX) (2024) — deployer description of Opus's ~400 ms SID interval and the SFU large-call use case. Used for the production "what actually ships" framing; where it touches a spec fact, the spec (RFC 7587) governs. https://getstream.io/resources/projects/webrtc/advanced/dtx/


