Voice activity detection decides, frame by frame, whether the audio contains speech or only silence and noise. That one bit drives a surprising amount of a real-time stack: it triggers discontinuous transmission to stop sending packets during silence, gates noise suppression, marks segments for transcription, and powers 'who is speaking' indicators in a conference UI. Detectors range from simple energy thresholds to neural classifiers; the engineering tension is between false negatives, which clip the start of words, and false positives, which keep transmitting noise and waste the bandwidth VAD was meant to save. Good VAD is fast, slightly biased toward keeping speech, and tuned to the noise it will face.
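
As a rough illustration of the simplest kind of detector mentioned above, the sketch below thresholds per-frame energy and adds a short hangover so the decision leans toward keeping speech. The names `frame_energy_db` and `energy_vad`, the -40 dB threshold, and the hangover length are all assumptions chosen for the example, not values from any particular codec or library. A hangover only protects weak word endings; avoiding clipped word starts would need something extra, such as a short look-back buffer.

```python
import math

def frame_energy_db(frame):
    """Mean-square energy of one frame in dB (samples assumed in [-1, 1])."""
    mean_sq = sum(s * s for s in frame) / len(frame)
    # Small floor avoids log10(0) on digital silence.
    return 10.0 * math.log10(mean_sq + 1e-12)

def energy_vad(frames, threshold_db=-40.0, hangover_frames=5):
    """Return one boolean per frame: True if the frame is treated as speech.

    threshold_db and hangover_frames are illustrative defaults (assumed),
    and would need tuning against the noise the detector will face.
    """
    decisions = []
    hang = 0  # frames left to keep reporting speech after energy drops
    for frame in frames:
        if frame_energy_db(frame) > threshold_db:
            hang = hangover_frames   # re-arm the hangover on every loud frame
            decisions.append(True)
        elif hang > 0:
            hang -= 1                # quiet frame, but still inside the hangover
            decisions.append(True)
        else:
            decisions.append(False)
    return decisions

# Example: 20 ms frames at 8 kHz (160 samples), one tone burst between silences.
silence = [0.0] * 160
tone = [0.1 * math.sin(2 * math.pi * 200 * n / 8000) for n in range(160)]
print(energy_vad([silence, tone, silence, silence, silence], hangover_frames=2))
# prints [False, True, True, True, False]
```

Raising `hangover_frames` or lowering `threshold_db` trades more false positives (and more transmitted noise) for fewer clipped words, which is the tension described above.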

