Why this matters
Audio failures are the support tickets that make users abandon a product, because a video glitch is annoying but a call you cannot hear is unusable. If you build or operate a video conferencing tool, a telemedicine platform, an e-learning classroom, or a live-shopping app, your on-call engineer needs a calm, ordered procedure to run when the audio breaks at 2 a.m. — not a guess. This runbook is written for the engineer holding the pager and the product manager who needs to understand what that engineer is doing. It turns a vague "no sound" report into a five-step diagnosis with a named cause and a known fix, so the incident ends in minutes instead of an afternoon of trial and error.
How to use this runbook
A production audio failure feels chaotic because the symptom — "I can't hear them" — is the same for a dozen different root causes. The cure for chaos is a fixed order. This runbook lists five checks, and the order is deliberate: the earliest checks catch the most common and cheapest-to-fix problems, so most incidents resolve before you reach step three. Run the checks top to bottom. At each one, read a specific number or flag, compare it to the expected value, and either fix the problem or move to the next check. Do not skip ahead, and do not start with packet captures — that is step five for a reason.
Before anything else, capture two facts from the reporting user, because they cut the search space in half:
First, is the failure one-way or two-way? "I can hear them but they can't hear me" is a completely different problem from "neither of us hears anything." A one-way failure points at one specific endpoint's capture or the path in one direction; a two-way failure points at a shared cause like a server or a codec mismatch.
Second, is it total silence, distorted audio, or echo? Total silence means no signal is arriving or being rendered. Distortion (robotic, choppy, underwater) means the signal arrives but is damaged, usually in transit by the network and occasionally by a sample-rate mismatch. Echo means the signal arrives fine but a processing stage failed. These three symptoms route to different checks, as the decision tree below shows.
Figure 1. The runbook as a decision tree. Classify the symptom first, then run the five checks in order. The symptom tells you which check is most likely to fire, but you still run them in sequence from the top.
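The routing in Figure 1 can be written down as a small lookup. This is a minimal sketch rather than part of any library: the function name firstSuspect and the symptom labels are hypothetical, and the table only restates which check each symptom most often lands on.

```javascript
// Hypothetical helper: map a reported symptom to the check most likely to fire.
// The mapping restates the article's routing; all names are illustrative.
const SUSPECT_BY_SYMPTOM = {
  silence: { check: 1, name: "Device, permissions, and routing" },
  distortion: { check: 2, name: "The network path" },
  echo: { check: 3, name: "Echo cancellation convergence" },
};

function firstSuspect(symptom) {
  // Unknown symptoms fall back to the top of the list.
  return SUSPECT_BY_SYMPTOM[symptom] ?? SUSPECT_BY_SYMPTOM.silence;
}

console.log(firstSuspect("echo").name);
```

The lookup only says where the finding is most likely to appear; the checks are still run in order from Check 1.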
The rest of this article is those five checks, each with the symptom it explains, the number to read, the tool that reads it, and the fix.
The browser already knows: a one-minute primer on getStats
Several of the checks read a number from the same place, so it is worth understanding that place before we use it. Every live WebRTC connection (WebRTC is the technology browsers use to carry real-time audio and video) keeps a running scorecard of itself. You read that scorecard by calling getStats() on the connection; the fields it reports are defined in the W3C "Identifiers for WebRTC's Statistics API" specification. It returns a report containing dozens of named numbers, and a handful of them tell you everything about whether audio is flowing.
Think of getStats() as the dashboard of a car. You do not need to open the engine; the dashboard already shows you the fuel level, the speed, and the warning lights. The audio failure is almost always visible on the dashboard if you know which gauges to read.
Here is the minimal code that reads the dashboard for the audio that is arriving at a receiver. It works in any modern browser's developer console while a call is connected.
// Read the inbound audio stats for a live RTCPeerConnection named `pc`.
const report = await pc.getStats();
report.forEach((stat) => {
  if (stat.type === "inbound-rtp" && stat.kind === "audio") {
    console.log("packets received:", stat.packetsReceived);
    console.log("packets lost:", stat.packetsLost);
    console.log("jitter (s):", stat.jitter);
    console.log("jitter buffer delay (s):", stat.jitterBufferDelay);
    console.log("concealed samples:", stat.concealedSamples);
  }
});
The five numbers this prints map almost one-to-one to the failure classes in this runbook. packetsReceived climbing means audio is arriving at all; packetsLost and jitter rising mean a network problem; concealedSamples rising fast means the receiver is inventing audio to cover gaps, which is what distortion sounds like. The W3C specification defines a concealed sample precisely: it "is a sample that was lost or arrived too late to be played out, and therefore had to be replaced with a locally generated synthesized sample." We will use each of these numbers in the checks below. The companion article The WebRTC Audio Pipeline End-to-End explains the full chain these numbers describe.
A note on reading accumulators. Most getStats() numbers are totals counted since the call began: concealedSamples, packetsLost, totalAudioEnergy. A single reading tells you little; the rate of change between two readings a second apart tells you everything. Always sample twice and subtract. A packetsLost of 400 sounds alarming until you learn it accumulated over a two-hour call with no audible effect; the same 400 in ten seconds, out of roughly 500 packets sent in that time, is a dead call.
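The "sample twice and subtract" rule can be sketched as a small helper. The function name ratePerSecond and the sample readings are made up for illustration; the only API fact relied on is that every stats object carries a timestamp in milliseconds.

```javascript
// Hypothetical helper: per-second rate of a cumulative getStats() counter.
// `before` and `after` are two snapshots of the same stats object, and the
// `timestamp` field on each snapshot is in milliseconds.
function ratePerSecond(before, after, field) {
  const elapsedSeconds = (after.timestamp - before.timestamp) / 1000;
  if (elapsedSeconds <= 0) {
    throw new RangeError("snapshots must be taken at different times, in order");
  }
  return (after[field] - before[field]) / elapsedSeconds;
}

// Made-up readings taken two seconds apart.
const first = { timestamp: 10000, packetsLost: 40, concealedSamples: 9600 };
const second = { timestamp: 12000, packetsLost: 46, concealedSamples: 11520 };
console.log(ratePerSecond(first, second, "packetsLost")); // 3
console.log(ratePerSecond(first, second, "concealedSamples")); // 960
```

Keeping the previous snapshot around and calling the helper once per polling interval gives a rate for any cumulative counter without special-casing each field.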
Check 1 — Device, permissions, and routing (the most common cause)
Symptom it explains: total silence, usually one-way, often "it worked yesterday."
The number to read: audioLevel on the sender's media source, and the microphone permission state.
The tool: navigator.permissions, getUserMedia constraints, getStats() media-source report.
More audio tickets are caused by the operating system, the browser, and the hardware than by your code. The microphone is muted in the OS, the browser permission was denied, the user picked the wrong input device, or audio is routing to a Bluetooth headset that is connected but not selected. None of these are bugs in your application, and all of them masquerade as "your app is broken."
Start at the source: is the microphone actually producing signal? The sender's own statistics answer this. The W3C statistics API exposes an audioLevel on the audio media source — "a number between 0 and 1 (linear), where 1.0 represents 0 dBov, 0 represents silence." If the user is speaking and audioLevel sits at or near 0, no sound is entering the pipeline, and nothing downstream can fix that. You are looking at a muted or wrong-device problem, not a transport problem.
// Is the microphone producing any signal? Read the sender-side media source.
const report = await pc.getStats();
report.forEach((stat) => {
  if (stat.type === "media-source" && stat.kind === "audio") {
    console.log("audio level (0-1):", stat.audioLevel);
    console.log("total audio energy:", stat.totalAudioEnergy);
  }
});
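A single audioLevel reading can land in a pause between words, so averaging over a window is a steadier test. The sketch below uses the RMS formula that the statistics specification gives for totalAudioEnergy and totalSamplesDuration, applied to the difference between two snapshots. The function names and the silence threshold are assumptions for illustration, not values from the spec.

```javascript
// Hypothetical helper: RMS audio level (0..1) between two media-source snapshots.
function rmsLevel(before, after) {
  const energy = after.totalAudioEnergy - before.totalAudioEnergy;
  const duration = after.totalSamplesDuration - before.totalSamplesDuration;
  if (duration <= 0) return 0; // no new samples were captured in the window
  return Math.sqrt(energy / duration);
}

// SILENCE_THRESHOLD is an illustrative assumption; tune it per product.
const SILENCE_THRESHOLD = 0.001;

function looksSilent(before, after) {
  return rmsLevel(before, after) < SILENCE_THRESHOLD;
}

// Made-up snapshots two seconds apart while the user is speaking.
const earlier = { totalAudioEnergy: 1.0, totalSamplesDuration: 30 };
const later = { totalAudioEnergy: 1.18, totalSamplesDuration: 32 };
console.log(rmsLevel(earlier, later).toFixed(2), looksSilent(earlier, later));
```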
If audioLevel is flat at zero, walk the source chain outward in this order, because each is more common than the next: the microphone is muted at the operating system level (the user's physical mute switch or the OS sound settings); the browser tab is muted or the site's microphone permission was revoked; the wrong input device is selected (the laptop's built-in mic instead of the headset); or the selected device was unplugged mid-call and the browser did not fail over. The permission state is one line to read:
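// Check whether the page is actually allowed to use the microphone.
// Not every browser recognizes the "microphone" permission name, so guard the query.
try {
  const status = await navigator.permissions.query({ name: "microphone" });
  console.log("microphone permission:", status.state); // "granted" | "denied" | "prompt"
} catch (err) { console.log("permission query not supported here:", err.name); }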
The fixes are operational, not code: prompt the user to check the OS mute, re-grant permission, or pick the right device from your in-app device selector. The lasting fix is product design — every serious call app shows a live input-level meter in its pre-call screen, driven by exactly the audioLevel number above, so the user sees a dead microphone before the call instead of during it. The constraints that request the microphone, and the echoCancellation, autoGainControl, and noiseSuppression flags, are defined in the W3C "Media Capture and Streams" specification; getting the device and its constraints right at getUserMedia() time prevents most of these tickets.
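The wrong-device and unplugged-device cases can be checked by comparing the capture track's deviceId against the current device list. This is a sketch under assumptions: checkSelectedInput is a hypothetical name, and it expects the array resolved by navigator.mediaDevices.enumerateDevices() plus the object returned by track.getSettings(). Device labels are typically empty until microphone permission has been granted.

```javascript
// Hypothetical helper: is the capture track's input device still present?
// `devices` is the array from navigator.mediaDevices.enumerateDevices();
// `settings` is the object from track.getSettings().
function checkSelectedInput(devices, settings) {
  const inputs = devices.filter((device) => device.kind === "audioinput");
  const selected = inputs.find((device) => device.deviceId === settings.deviceId);
  return {
    inputCount: inputs.length,
    selectedLabel: selected ? selected.label : null,
    selectedMissing: selected === undefined,
  };
}

// Made-up device list standing in for the browser's answer.
const sampleDevices = [
  { kind: "audioinput", deviceId: "a", label: "Built-in Microphone" },
  { kind: "audiooutput", deviceId: "b", label: "Speakers" },
  { kind: "audioinput", deviceId: "c", label: "USB Headset" },
];
console.log(checkSelectedInput(sampleDevices, { deviceId: "c" }));

// In a browser (not run here):
//   const devices = await navigator.mediaDevices.enumerateDevices();
//   const [track] = localStream.getAudioTracks();
//   console.log(checkSelectedInput(devices, track.getSettings()));
```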
Common mistake: blaming the network for a muted mic. The instinct on a "no audio" report is to check the connection. But if the sender's audioLevel is zero, the network is irrelevant: there is nothing to send. Reading the source level first saves you from a fruitless packet capture. This is why device is Check 1 and capture is Check 5.
Check 2 — The network path: loss, jitter, and the buffer
Symptom it explains: choppy, robotic, "underwater," or intermittent audio — the signal is there but damaged.
The number to read: packetsLost rate, jitter, jitterBufferDelay, and concealedSamples rate.
The tool: getStats() inbound-rtp report, sampled twice.
If audio is arriving but sounds wrong, the network is the prime suspect, and the receiver's inbound statistics name the problem precisely. Three numbers work together here, and reading them as a group is the whole skill.
Packet loss is the fraction of audio packets that never arrived. Read packetsLost and packetsReceived twice, a few seconds apart, and compute the loss rate over that window:
loss rate = (packetsLost_now - packetsLost_before)
          / (packetsReceived_now - packetsReceived_before
             + packetsLost_now - packetsLost_before)
Plug in real numbers. At one packet every 20 milliseconds, ten seconds of audio is about 500 packets. Suppose between two readings ten seconds apart you see 12 newly lost packets and 488 newly received: loss rate = 12 / (488 + 12) = 12 / 500 = 0.024, or 2.4%. For voice, anything under about 1% is inaudible thanks to packet loss concealment; 2–5% is noticeable as occasional clicks; above 5% the call degrades sharply unless redundancy is in play. The repair tools for loss live in two companion articles: Packet Loss Concealment (PLC) hides the missing frames after the fact, and Forward Error Correction (FEC) and RED Redundancy prevents the gap by sending the audio twice.
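The windowed loss formula translates directly into code. This is a minimal sketch: lossRate and lossBand are hypothetical names, the readings are invented, and folding the 1–2% range into the "noticeable" band is an assumption, since the text leaves that gap open.

```javascript
// Hypothetical helper: packet loss rate over the window between two
// inbound-rtp snapshots. Both counters are cumulative, so subtract first.
function lossRate(before, after) {
  const lost = after.packetsLost - before.packetsLost;
  const received = after.packetsReceived - before.packetsReceived;
  const expected = lost + received;
  return expected > 0 ? lost / expected : 0;
}

// Rough voice-quality bands based on the thresholds in the text.
// Treating 1-2% as "noticeable" is an assumption of this sketch.
function lossBand(rate) {
  if (rate < 0.01) return "inaudible";
  if (rate <= 0.05) return "noticeable";
  return "severe";
}

// Made-up readings: 12 packets lost and 488 received inside the window.
const before = { packetsLost: 100, packetsReceived: 20000 };
const after = { packetsLost: 112, packetsReceived: 20488 };
const rate = lossRate(before, after);
console.log((rate * 100).toFixed(1) + "%", lossBand(rate)); // 2.4% noticeable
```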
Jitter is the variation in packet arrival timing — packets that should arrive evenly every 20 milliseconds instead bunch up and straggle. The jitter field reports this, and its calculation is defined in IETF RFC 3550, Appendix A.8, as a running average of the difference in transit time between consecutive packets. One subtlety trips up everyone: jitter is reported in seconds in getStats(), but RTP measures time internally in clock-rate units, and for Opus — the codec nearly every WebRTC call uses — that clock runs at 48,000 ticks per second, fixed by IETF RFC 7587. A jitter of 0.03 in the stats means 30 milliseconds of timing spread, which the jitter buffer must absorb.
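To make the units concrete, here is a sketch of the RFC 3550 estimator together with the seconds-to-ticks conversion. It is an illustration, not the browser's implementation: the packet shape ({ rtpTimestamp, arrivalMs }) and both function names are assumptions, and the default clock rate is the 48,000 Hz Opus clock.

```javascript
// Sketch of the RFC 3550 interarrival jitter estimator for one stream.
// Each packet is { rtpTimestamp, arrivalMs }: the RTP timestamp in clock
// ticks and the local arrival time in milliseconds (illustrative field names).
function interarrivalJitterSeconds(packets, clockRate = 48000) {
  let jitterTicks = 0;
  let previousTransit = null;
  for (const packet of packets) {
    // Transit time in ticks: arrival converted to ticks, minus the send timestamp.
    const transit = (packet.arrivalMs * clockRate) / 1000 - packet.rtpTimestamp;
    if (previousTransit !== null) {
      const d = Math.abs(transit - previousTransit);
      jitterTicks += (d - jitterTicks) / 16; // J = J + (|D| - J) / 16
    }
    previousTransit = transit;
  }
  return jitterTicks / clockRate; // convert ticks back to seconds
}

// Convert a getStats() jitter value (seconds) to milliseconds and clock ticks.
function describeJitter(jitterSeconds, clockRate = 48000) {
  return {
    milliseconds: Math.round(jitterSeconds * 1000),
    clockTicks: Math.round(jitterSeconds * clockRate),
  };
}

console.log(describeJitter(0.03)); // { milliseconds: 30, clockTicks: 1440 }
```

The division by 16 is why the reported value moves slowly: one late packet nudges the estimate, and only sustained irregularity keeps it high.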
The jitter buffer is the receiver's shock absorber. It holds arriving packets briefly so it can play them out on a smooth schedule despite their uneven arrival — like an airport waiting area that holds passengers who arrive on irregular flights so the bus can leave on a fixed timetable. The jitterBufferDelay field, divided by jitterBufferEmittedCount, gives the average time audio spent waiting. A larger buffer hides more jitter but adds delay; the buffer trades latency for smoothness automatically. When loss or jitter overwhelm it, the buffer gives up and the receiver synthesizes replacement audio, which it counts in concealedSamples. A concealedSamples count climbing fast — more than a percent or two of totalSamplesReceived over a window — is the numeric signature of audible distortion. The full machinery is the subject of Jitter Buffer: NetEQ, the Brain of WebRTC Audio.
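Both buffer readings described above are ratios of cumulative counters, so they are best computed over a window. The sketch below assumes two snapshots of the same inbound-rtp audio stats object; bufferHealth is a hypothetical name and the sample values are invented.

```javascript
// Hypothetical helper: summarize jitter-buffer health between two
// inbound-rtp snapshots of the same audio stream.
function bufferHealth(before, after) {
  const emitted = after.jitterBufferEmittedCount - before.jitterBufferEmittedCount;
  const delay = after.jitterBufferDelay - before.jitterBufferDelay;
  const samples = after.totalSamplesReceived - before.totalSamplesReceived;
  const concealed = after.concealedSamples - before.concealedSamples;
  return {
    // Average time a sample waited in the buffer during the window, in ms.
    averageDelayMs: emitted > 0 ? (delay / emitted) * 1000 : 0,
    // Share of samples in the window that were synthesized rather than decoded.
    concealmentRatio: samples > 0 ? concealed / samples : 0,
  };
}

// Made-up snapshots ten seconds apart on a 48 kHz stream.
const start = {
  jitterBufferEmittedCount: 0, jitterBufferDelay: 0,
  totalSamplesReceived: 0, concealedSamples: 0,
};
const end = {
  jitterBufferEmittedCount: 480000, jitterBufferDelay: 28800,
  totalSamplesReceived: 480000, concealedSamples: 9600,
};
console.log(bufferHealth(start, end));
```

With these invented numbers the average wait works out to about 60 ms and the concealment ratio to 2%, which is at the edge of the "a percent or two" signature described above.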
The fix depends on which number is high. High loss with low jitter is a lossy link — enable FEC or RED. High jitter is an unstable link — the buffer will grow to compensate, at the cost of latency, and there is little the application can do beyond ensuring the buffer is allowed to adapt. Both high at once is a saturated or congested path, which points at the bitrate-control story in Bitrate and Bandwidth Control in Real-Time Audio.
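That three-way decision can be sketched as a function. The thresholds here are illustrative assumptions only, since the text gives rough loss bands but no jitter cut-off; diagnoseNetwork and DEFAULT_LIMITS are hypothetical names.

```javascript
// Illustrative thresholds only; both defaults are assumptions to tune per product.
const DEFAULT_LIMITS = { lossRate: 0.02, jitterSeconds: 0.03 };

// Hypothetical helper: turn windowed loss and jitter readings into the
// remedy the text suggests for each combination.
function diagnoseNetwork({ lossRate, jitterSeconds }, limits = DEFAULT_LIMITS) {
  const highLoss = lossRate > limits.lossRate;
  const highJitter = jitterSeconds > limits.jitterSeconds;
  if (highLoss && highJitter) return "congested path: review bitrate control";
  if (highLoss) return "lossy link: enable FEC or RED";
  if (highJitter) return "unstable link: let the jitter buffer adapt";
  return "network looks healthy: move to the next check";
}

console.log(diagnoseNetwork({ lossRate: 0.06, jitterSeconds: 0.01 }));
```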
Figure 2. The network check, by the numbers. Loss happens in the cloud; jitter is uneven arrival; the buffer absorbs jitter at the cost of delay; concealed samples are the receiver inventing audio when the buffer runs dry. Each is a named getStats field.
Check 3 — Echo cancellation convergence
Symptom it explains: echo — one party hears their own voice come back a moment later.
The number to read: the echoCancellation setting on the capture track; whether headphones are in use.
The tool: MediaStreamTrack.getSettings(), plus a quick environment question.
Echo is not a network problem and not a device-silence problem; it is a processing problem, and it sits in its own check because its symptom is unmistakable. Echo happens when one participant's speaker plays the other participant's voice, that sound travels back into the first participant's microphone, and it gets sent back — so the second participant hears themselves a beat late. The job of acoustic echo cancellation, AEC, is to recognize the speaker output inside the microphone input and subtract it. The modern WebRTC implementation, AEC3, does this well, but it has a failure mode worth knowing.
First, confirm AEC is even switched on. The capture track carries its current settings, and the W3C Media Capture and Streams specification defines echoCancellation as a boolean you can read back:
// Confirm echo cancellation is actually enabled on the live capture track.
const [track] = localStream.getAudioTracks();
const settings = track.getSettings();
console.log("echoCancellation:", settings.echoCancellation); // expect true
If it reads false, something in your capture code requested raw audio — a common mistake when a developer copies a getUserMedia snippet meant for music recording, where echo cancellation is deliberately off. Re-request the track with echoCancellation: true.
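A sketch of what the re-request might look like follows. The three property names are the constrainable properties from Media Capture and Streams; the helper names voiceCallConstraints and aecIsOff are hypothetical, and the browser calls are left as comments because they need a live page and a microphone.

```javascript
// Constraints for a voice call: keep the browser's speech processing on.
function voiceCallConstraints() {
  return {
    audio: { echoCancellation: true, noiseSuppression: true, autoGainControl: true },
    video: false,
  };
}

// Hypothetical check: does a track's settings object show AEC switched off?
// A strict comparison is used because some browsers may omit the field.
function aecIsOff(settings) {
  return settings.echoCancellation === false;
}

console.log(aecIsOff({ echoCancellation: false })); // true

// In a browser (not run here):
//   const stream = await navigator.mediaDevices.getUserMedia(voiceCallConstraints());
//   const [track] = stream.getAudioTracks();
//   if (aecIsOff(track.getSettings())) console.warn("echo cancellation is off");
```

A constraint is a request, not a guarantee, so reading the value back from getSettings() after the track is created is the only way to know what the browser actually applied.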
If AEC is on and echo persists, the cause is usually one of three convergence problems, and the full mechanics are covered in Acoustic Echo Cancellation (AEC): How It Really Works. The canceller needs a moment at the start of a call to "converge" — to learn the acoustic delay between speaker and microphone — and during those first seconds echo can leak through. A long or variable delay defeats it: a Bluetooth speakerphone adds a hundred-plus milliseconds of unpredictable delay between the reference signal and the echo, which is exactly the case classical AEC assumptions break on. And "double-talk" — both people speaking at once — is the hardest moment for any canceller to hold its estimate. The deep version of the Bluetooth and speakerphone problem is Echo Cancellation on Speakerphones, Bluetooth, and AirPods.
The reliable operational fix is the oldest one: headphones eliminate echo at the source, because the speaker output never reaches the microphone. Design for the user who refuses to wear them — keep AEC on, prefer wired or low-latency audio paths — but when an echo ticket arrives, "are you on speakerphone or Bluetooth?" resolves it more often than any code change.
Check 4 — Sample-rate mismatch
Symptom it explains: persistent distortion or pitch shift that the network check could not explain — audio that sounds slightly fast, slow, or "chipmunk."
The number to read: the sample rate at capture, at the codec, and at the audio context.
The tool: AudioContext.sampleRate, track settings, and your server's codec configuration.
This is the subtle one, and it is Check 4 because it is rarer than the first three but invisible to them. Digital audio is a stream of samples taken at a fixed rate — for video and WebRTC, the standard is 48,000 samples per second, 48 kHz. Trouble appears when one stage of the pipeline assumes a different rate than another. If audio is captured at 44.1 kHz but a downstream stage treats it as 48 kHz, every second of sound is replayed in less than a second, and the pitch rises — the "chipmunk" effect. The reverse drops the pitch. A subtler version produces periodic clicks or slow drift as the two clocks slip against each other.
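The size of the effect is simple arithmetic, sketched below. The function name rateMismatch is hypothetical; the calculation only assumes that samples captured at one rate are played back as though they were recorded at another.

```javascript
// What happens when audio captured at one rate is played as if it were another.
function rateMismatch(capturedHz, assumedHz) {
  const speed = assumedHz / capturedHz; // above 1 means faster and higher-pitched
  return {
    speed,
    secondsPerCapturedSecond: 1 / speed,
    semitones: 12 * Math.log2(speed), // pitch shift in equal-tempered semitones
  };
}

const mismatch = rateMismatch(44100, 48000);
console.log(mismatch.speed.toFixed(3)); // 1.088
console.log(mismatch.secondsPerCapturedSecond.toFixed(3)); // 0.919
console.log(mismatch.semitones.toFixed(2)); // 1.47
```

A shift of about a semitone and a half with nine percent faster speech is easy to hear on a voice, which is why this failure tends to be reported as "sounds wrong" rather than "sounds broken."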
WebRTC standardizes on 48 kHz precisely to avoid this — Opus operates internally at 48 kHz and its RTP payload clock is fixed at 48,000 Hz by RFC 7587 — so in a pure WebRTC call the mismatch is usually introduced at the edges: a Web Audio graph running at the hardware's native rate, a recording branch written at the wrong rate, or a gateway bridging to a PSTN leg that runs at 8 kHz. The check is to read the rate at each boundary and confirm they agree or that an explicit resampler sits between them:
// What sample rate is the Web Audio graph actually running at?
const ctx = new AudioContext();
console.log("AudioContext sample rate:", ctx.sampleRate); // commonly 48000, sometimes 44100
The fix is never to "force" a rate and hope; it is to insert a proper resampler at the boundary where the rate legitimately changes, so 44.1 kHz capture is resampled to 48 kHz before it reaches the 48 kHz codec. The background on why 48 kHz won video, and what resampling costs, is in Sample Rate: 44.1, 48, 96, 192 kHz. A recording or transcription branch is a frequent offender because it is bolted on after the call works; see Recording and Transcription Pipelines.
Common mistake: reading distortion as a network problem. Choppy audio with high concealedSamples is the network. Audio that is continuously the wrong pitch or speed, with clean network stats, is a sample-rate mismatch. The tell is that Check 2 comes back clean (low loss, low jitter) yet the audio is still wrong. When the numbers say the path is healthy but the ears say otherwise, suspect the clock.
Check 5 — Packet capture: the last resort
Symptom it explains: anything the first four checks could not, especially server-side and signalling problems.
The tool: chrome://webrtc-internals (or the browser equivalent), and Wireshark on a PCAP when you need the wire itself.
If a problem survives the first four checks, you have exhausted what the application can see and you need to watch the bytes. There are two depths.
The first, and the one to try before reaching for anything heavier, is the browser's own dump. Chrome exposes chrome://webrtc-internals, a live page that records every getStats() value as a graph for every active connection, and lets you export the whole session as a file. This is the same data the code above reads, but plotted over time and captured automatically — invaluable when a problem is intermittent and you cannot reproduce it on demand. Ask the user to open chrome://webrtc-internals in a second tab before the call, reproduce the failure, and send you the dump. The graph of concealedSamples or packetsLost over the exact second the audio broke usually names the cause without further work.
The second depth is a true packet capture with Wireshark, the standard network protocol analyzer. You reach for it when the question is below the application — whether RTP packets are even leaving the machine, whether they carry the codec you expect, whether the negotiation (SDP) agreed on the payload format both sides assumed. Wireshark decodes RTP and RTCP, can show the per-stream loss and jitter it computes independently of the browser, and on a capture of audio it can even play the RTP stream back so you can hear what actually crossed the wire. This is how you catch the failures that never reach getStats() at all: a firewall dropping media while signalling succeeds (so the call "connects" but is silent), a codec mismatch from a broken SDP negotiation, or one-way media from an asymmetric NAT problem. The architecture that decides where these packets flow — and where to capture them — is covered in Audio in SFU vs MCU vs P2P.
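Before opening Wireshark, the negotiated payload format can be read straight from the session description text. The sketch below parses only a=rtpmap lines; parseRtpmaps and findOpus are hypothetical names, and the sample SDP is a made-up fragment rather than a full description.

```javascript
// Hypothetical helper: list the codecs an SDP blob maps to payload types.
// Parses only "a=rtpmap:<pt> <codec>/<clock>[/<channels>]" lines.
function parseRtpmaps(sdp) {
  const pattern = /^a=rtpmap:(\d+) ([^/\s]+)\/(\d+)(?:\/(\d+))?\s*$/;
  return sdp
    .split(/\r?\n/)
    .map((line) => pattern.exec(line))
    .filter((match) => match !== null)
    .map((match) => ({
      payloadType: Number(match[1]),
      codec: match[2].toLowerCase(),
      clockRate: Number(match[3]),
      channels: match[4] ? Number(match[4]) : 1,
    }));
}

function findOpus(sdp) {
  return parseRtpmaps(sdp).find((entry) => entry.codec === "opus") ?? null;
}

// Made-up fragment of an audio section.
const sampleSdp = [
  "m=audio 9 UDP/TLS/RTP/SAVPF 111 0",
  "a=rtpmap:111 opus/48000/2",
  "a=rtpmap:0 PCMU/8000",
].join("\r\n");
console.log(findOpus(sampleSdp));

// In a browser (not run here), compare both sides of the negotiation:
//   findOpus(pc.localDescription.sdp) and findOpus(pc.remoteDescription.sdp)
```

If one side reports no Opus entry, or the payload type seen in the capture matches neither description, the negotiation is the place to look rather than the network.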
Capture is Check 5 because it is the most expensive in time and skill, and because four times out of five one of the earlier checks has already found the answer. Reaching for Wireshark first is the classic rookie move — it is like dismantling the engine to find out the car was simply in neutral.
The five checks, side by side
The table is the runbook in one view. Print it, pin it next to the on-call dashboard, and work top to bottom.
| # | Check | Symptom | Number / signal to read | Typical fix |
|---|---|---|---|---|
| 1 | Device & permissions | Total silence, one-way | Sender audioLevel ≈ 0; permission denied | Unmute OS, re-grant, pick right device |
| 2 | Network path | Choppy, robotic, intermittent | packetsLost rate, jitter, concealedSamples rate | Enable FEC/RED; let buffer adapt |
| 3 | Echo cancellation | Caller hears themselves | echoCancellation = false; on speakerphone? | Enable AEC; use headphones |
| 4 | Sample-rate mismatch | Wrong pitch/speed, clean network | Rates at capture vs codec vs context | Insert a resampler at the boundary |
| 5 | Packet capture | Anything unexplained | webrtc-internals dump; Wireshark RTP | Find dropped/mismatched media on the wire |
The order encodes hard-won field experience: roughly seven in ten "broken audio" tickets are resolved at Check 1 or Check 2, because device confusion and bad networks are simply the most common things that go wrong. Each step down the list is rarer and costlier to investigate. Honoring the order is what turns a two-hour debugging session into a two-minute diagnosis.
Figure 3. Why the order matters. Most incidents resolve at the first two checks; each later check is rarer and more expensive. Running them in order is the difference between a two-minute and a two-hour diagnosis.
A worked incident: "the patient can't hear the doctor"
Concrete beats abstract, so here is the runbook applied to a real shape of ticket from a telemedicine call. The report: the doctor can hear the patient, but the patient says the doctor is silent. That single fact — one-way, doctor-to-patient — already tells us the patient's receive path or the doctor's send path is at fault, and that the network in the other direction is fine.
Check 1, on the doctor's side: read the doctor's sender audioLevel while they speak. It reads 0.34 — a healthy level. The microphone is producing signal, so this is not a muted-mic problem on the doctor's end. Move on.
Check 2, on the patient's side: read the patient's inbound-rtp stats for the doctor's audio. packetsReceived is not climbing between two readings. No packets are arriving at all. This is not loss or jitter — those would show packets arriving and being damaged. Zero arrival with a connected call points past the application, to the path itself.
Checks 3 and 4 can be dismissed at a glance, because they explain echo and wrong pitch, not silence, so we go to Check 5: zero packets arriving despite a "connected" call is the classic signature of media being blocked while signalling succeeded. A webrtc-internals dump from the patient confirms the audio inbound stream exists but shows zero bytes received. A short Wireshark capture on the patient's network shows the doctor's RTP never arrives: a corporate firewall on the patient's hospital network is permitting the signalling channel but dropping the UDP media. The fix is operational (a TURN relay over TCP/443 to traverse the firewall), not a code change to the audio pipeline. Total time to a named cause: under ten minutes, because the one-way classification and the "packets not climbing" reading routed us straight to the right depth.
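The routing used in this incident can be sketched as a small triage function. Everything here is illustrative: the function names, the level threshold, and the returned strings are assumptions, and the inputs are the sender's audioLevel plus two snapshots of the receiver's inbound-rtp audio stats.

```javascript
// Hypothetical helper: a "connected" call whose inbound audio counters
// do not move between two readings is receiving no media at all.
function inboundStalled(before, after) {
  return (
    after.packetsReceived === before.packetsReceived &&
    after.bytesReceived === before.bytesReceived
  );
}

// MIN_SPEAKING_LEVEL is an illustrative assumption, not a spec value.
const MIN_SPEAKING_LEVEL = 0.001;

// Route a one-way silence report using one sender reading and two receiver readings.
function triageOneWaySilence(senderAudioLevel, before, after) {
  if (senderAudioLevel < MIN_SPEAKING_LEVEL) return "Check 1: sender is not producing signal";
  if (inboundStalled(before, after)) return "Check 5: media is not arriving on the path";
  return "Check 2: media is arriving, inspect loss and jitter";
}

// The incident above: healthy sender level, frozen receiver counters.
const reading1 = { packetsReceived: 0, bytesReceived: 0 };
const reading2 = { packetsReceived: 0, bytesReceived: 0 };
console.log(triageOneWaySilence(0.34, reading1, reading2));
```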
The lesson is the runbook's whole thesis: the symptom classification plus one or two getStats() readings almost always point at the right check, and the right check names the cause.
Where Fora Soft fits in
We have operated real-time audio in telemedicine, video conferencing, e-learning, and live-event products since 2005, which means we have run this runbook in anger more times than we can count. The pattern holds across verticals: the overwhelming majority of "audio is broken" reports are device, permission, or network problems that the user's environment created, not defects in the media code — and the teams that struggle are the ones without a fixed order to check them in. In the products we ship, the pre-call screen always shows a live microphone meter so a dead device is caught before the call, and the in-call client samples getStats() continuously so support can read the loss and concealment numbers from a session that already ended. Building the observability in from the start is the difference between diagnosing from data and guessing from a screenshot.
What to read next
- The WebRTC Audio Pipeline End-to-End
- Jitter Buffer: NetEQ, the Brain of WebRTC Audio
- Acoustic Echo Cancellation (AEC): How It Really Works
Call to action
- Talk to an audio engineer: book a 30-minute scoping call to talk through your WebRTC audio troubleshooting plan.
- See our case studies — 250+ shipped projects across video streaming, WebRTC, OTT, telemedicine, e-learning, surveillance, and AR/VR.
- Download the WebRTC Audio Troubleshooting — Field Runbook — One page: the five ordered checks (device, network, echo, sample rate, capture), the exact getStats number to read at each, the expected value, and the fix.
References
- W3C: Identifiers for WebRTC's Statistics API (Candidate Recommendation, accessed 2026-06-06). Defines getStats(), RTCInboundRtpStreamStats (packetsReceived, packetsLost, jitter, jitterBufferDelay, jitterBufferEmittedCount, concealedSamples, totalSamplesReceived) and RTCAudioSourceStats (audioLevel, totalAudioEnergy, totalSamplesDuration). The normative definition of every metric this runbook reads; the "concealed sample" definition is quoted from here. https://www.w3.org/TR/webrtc-stats/
- W3C: Media Capture and Streams (Recommendation, accessed 2026-06-06). Defines getUserMedia() and the echoCancellation, autoGainControl, and noiseSuppression constrainable properties read back via MediaStreamTrack.getSettings(). Source for the device/constraint facts in Checks 1 and 3. https://www.w3.org/TR/mediacapture-streams/
- IETF RFC 3550: RTP: A Transport Protocol for Real-Time Applications, H. Schulzrinne et al., STD 64, July 2003. Appendix A.8 defines the interarrival jitter calculation (J(i) = J(i−1) + (|D(i−1,i)| − J(i−1))/16) that the jitter field reports. Read from rfc-editor.org. https://www.rfc-editor.org/rfc/rfc3550.html
- IETF RFC 7587: RTP Payload Format for the Opus Speech and Audio Codec, J. Spittka, K. Vos, JM. Valin, June 2015. Fixes the Opus RTP timestamp clock at 48,000 Hz, the basis for Check 4's "48 kHz is the WebRTC standard rate" and for converting jitter seconds to clock units. Read from rfc-editor.org. https://www.rfc-editor.org/rfc/rfc7587.html
- IETF RFC 6716: Definition of the Opus Audio Codec, JM. Valin, K. Vos, T. Terriberry, September 2012 (updated by RFC 8251). The codec carried by virtually every WebRTC call; its internal 48 kHz operation and built-in PLC underpin Checks 2 and 4. https://www.rfc-editor.org/rfc/rfc6716.html
- IETF RFC 8825: Overview: Real-Time Protocols for Browser-Based Applications, H. Alvestrand, January 2021. The WebRTC architecture overview defining the connection whose statistics getStats() reports. https://www.rfc-editor.org/rfc/rfc8825.html
- MDN Web Docs: RTCInboundRtpStreamStats (jitterBufferDelay, concealedSamples) and RTCAudioSourceStats (audioLevel, totalAudioEnergy), accessed 2026-06-06. Developer-facing companion to the W3C spec; source for the Math.sqrt(totalAudioEnergy/totalSamplesDuration) RMS formula used to detect a silent microphone. Where MDN and the spec differ, the spec (ref 1) governs. https://developer.mozilla.org/en-US/docs/Web/API/RTCInboundRtpStreamStats
- The WebRTC Project / Google: chrome://webrtc-internals documentation and the libwebrtc AEC3 source. Source for the browser stats-dump workflow in Check 5 and the AEC3 convergence behaviour in Check 3. https://webrtc.org/
- Wireshark User's Guide: RTP analysis, RTP stream player, and RTCP statistics, accessed 2026-06-06. The packet-capture tooling of Check 5; the independent loss/jitter computation cited there. https://www.wireshark.org/docs/wsug_html_chunked/
- T. Levent-Levi (BlogGeek.me): Making sense of getStats in WebRTC, accessed 2026-06-06. Practitioner guide to which getStats() fields matter in production; treated as a tier-4 vendor/practitioner source, subordinate to the W3C spec (ref 1) where they differ. https://bloggeek.me/getstats/


