Bitrate and bandwidth control in real-time audio

Why This Matters

Every real-time product — a video call, a webinar, a telehealth visit, a live-shopping stream — runs over networks the product team does not control. The link between two people is shared, congested, and changing, and the software has only a few hundred milliseconds to react before the call sounds broken. If you build or buy this kind of product, you need to understand who decides the audio bitrate, what signals they use, and which two knobs (minimum and maximum bitrate) actually change the experience. This article is written so a product manager or founder can follow the whole loop and ask an engineer the right questions.

The Problem: Capacity You Cannot See

Imagine a single-lane road between two towns. Some moments it is empty; other moments a truck pulls out and everything slows. You cannot phone ahead to ask how busy the road is right now — you can only watch how long your own cars take to arrive and infer the traffic from that.

A network path works the same way. The amount of data it can carry per second, called the available bandwidth, is never published anywhere. It shifts as other people on the same Wi-Fi start a download, as a phone moves between cell towers, as a home router fills its queue. The application has to estimate the capacity from indirect clues, then choose a sending rate that fits underneath it. Sending too much causes a queue to build, which adds delay and then drops packets; sending too little wastes quality you could have had.

This continuous guess-and-adjust loop is called congestion control, and the number it produces — the best estimate of how many bits per second the path can carry right now — is called the bandwidth estimate.

Block diagram of the real-time audio congestion-control loop. The sender's encoder feeds packets through a bitrate allocator into the network; the receiver measures arrival timing and packet loss; feedback travels back over RTCP; the bandwidth estimator turns that feedback into a target bitrate that is split between video and audio, with audio receiving a small protected floor.

Figure 1. The closed loop: send, measure, feed back, re-estimate, re-allocate. The audio encoder sits at the far left and receives whatever slice the allocator leaves it.

Two Clues the Network Leaves Behind

Congestion control reads two signals, and good systems read both.

The first clue is delay. The departure gap between two packets at the sender should match the arrival gap at the receiver. When a queue starts to form somewhere on the path, the second packet waits a little longer, so the arrival gap stretches. That stretching — the inter-arrival delay growing over time — is the earliest warning that the path is filling up, well before anything is lost. A delay-based estimator watches this trend and eases the bitrate down before packets start to drop.
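The gap comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the estimator that libwebrtc ships: the function name `delay_variations` and the sample timestamps are made up for the example, and a real estimator filters this signal over time rather than reading it raw.

```python
def delay_variations(send_ms, arrival_ms):
    """Return (arrival gap - departure gap) for each consecutive packet pair.

    A positive value means the later packet waited longer in the network
    than the earlier one, which is what a growing queue looks like.
    """
    if len(send_ms) != len(arrival_ms):
        raise ValueError("need exactly one arrival time per send time")
    variations = []
    for i in range(1, len(send_ms)):
        departure_gap = send_ms[i] - send_ms[i - 1]
        arrival_gap = arrival_ms[i] - arrival_ms[i - 1]
        variations.append(arrival_gap - departure_gap)
    return variations


# Hypothetical timestamps in milliseconds: packets leave every 20 ms,
# but each one arrives a little later than the last as a queue builds.
send = [0, 20, 40, 60, 80]
arrival = [50, 70, 92, 115, 139]

variations = delay_variations(send, arrival)
print(variations)  # [0, 2, 3, 4]

# Crude trend check (an assumption for this sketch): a positive sum
# means the arrival gaps are stretching overall.
queue_is_building = sum(variations) > 0
```

The steadily positive values are the early warning: nothing has been lost yet, but every packet is waiting a little longer than the one before it.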

The second clue is loss. When a queue overflows, packets are discarded. Loss is a blunt, late signal — by the time you see it, the damage is done — but it is unambiguous. A loss-based estimator reacts to the fraction of packets that went missing: above roughly 10% loss it cuts the rate hard, below roughly 2% it carefully probes upward, and in between it holds steady.
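The three loss bands can be written as one small function. The 10% and 2% thresholds come from the paragraph above; the step sizes (a cut proportional to half the loss fraction, and a 5% increase) follow the loss-based controller in draft-ietf-rmcat-gcc-02 as we read it, so treat them as illustrative rather than as what any particular stack ships. The function name `loss_based_update` is hypothetical.

```python
def loss_based_update(rate_bps, loss_fraction):
    """Return the next loss-based rate estimate in bits per second.

    loss_fraction is the share of packets reported missing, from 0.0 to 1.0.
    """
    if not 0.0 <= loss_fraction <= 1.0:
        raise ValueError("loss_fraction must be between 0.0 and 1.0")
    if loss_fraction > 0.10:
        # Heavy loss: cut in proportion to how much went missing.
        return rate_bps * (1.0 - 0.5 * loss_fraction)
    if loss_fraction < 0.02:
        # Almost no loss: probe upward by a small step.
        return rate_bps * 1.05
    # In between: hold steady.
    return rate_bps


current = 300_000
after_heavy_loss = loss_based_update(current, 0.20)  # about 270 kbps
after_clean_run = loss_based_update(current, 0.01)   # about 315 kbps
after_mild_loss = loss_based_update(current, 0.05)   # unchanged
```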

The system uses the lower of the two estimates. The delay signal usually moves first and keeps the path from ever filling; the loss signal is the safety net for paths where delay measurement is unreliable.

Google Congestion Control: The Algorithm in the Middle

The algorithm that combines these two clues in nearly every browser and WebRTC stack is Google Congestion Control, usually shortened to GCC. It is described in the IETF Internet-Draft draft-ietf-rmcat-gcc-02 (Holmer, Lundin, Carlucci, De Cicco, Mascolo, July 2016). Note that this is a draft, not a finished standard — it expired without becoming an RFC, yet it remains the de-facto reference because the shipping implementation in libwebrtc follows its structure. The code has evolved past the draft, but the two-part design — a delay-based controller and a loss-based controller, taking the minimum — is intact.

GCC runs as a fast inner loop. Several times a second it ingests the latest timing and loss feedback, updates both estimates, and emits one number: the target send rate for all media on that connection. A separate stage then divides that number between the video stream and the audio stream. That division is where audio's special treatment lives.

REMB, transport-cc, and Who Does the Math

GCC needs feedback from the receiver to work, and there are two generations of how that feedback is carried. The difference matters because it changed where the estimate is computed.

REMB — Receiver Estimated Maximum Bitrate. This is the older design, proposed in draft-alvestrand-rmcat-remb-03. The receiver does the delay math itself and then sends back one RTCP message that says, in effect, "the total I can take on this path is N bits per second." The sender trusts that number and caps its output to it. REMB relies on an RTP header extension called absolute send time so the receiver knows when each packet actually left.

transport-cc — Transport-Wide Congestion Control. This is the newer design, proposed in draft-holmer-rmcat-transport-wide-cc-extensions-01. Here the roles flip. The sender stamps every outgoing packet — audio and video alike — with a single transport-wide sequence number. The receiver does almost no thinking: it simply reports back, for each sequence number, when the packet arrived. All the estimation happens at the sender. This "keep the receiver dumb" approach is now the default in modern WebRTC because the sender has the full picture and can react faster.
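A rough sketch of the sender-side bookkeeping makes the "dumb receiver" idea concrete. Everything here is an assumption for illustration: the class name `TransportWideLog`, its methods, and the dictionary-shaped feedback are invented, the real feedback is a compact RTCP message, and real sequence numbers are 16 bits wide and wrap around, which this sketch ignores.

```python
class TransportWideLog:
    """Sender-side record of every packet, keyed by transport-wide sequence number."""

    def __init__(self):
        self._send_ms = {}   # sequence number -> send time in ms
        self._next_seq = 0

    def stamp(self, send_ms):
        """Give the next outgoing packet (audio or video) its sequence number."""
        seq = self._next_seq
        self._send_ms[seq] = send_ms
        self._next_seq += 1
        return seq

    def on_feedback(self, arrivals):
        """Digest a report mapping seq -> arrival time in ms (None = never arrived).

        Returns (loss_fraction, delay_variations), both computed at the sender.
        """
        received = sorted(seq for seq, t in arrivals.items() if t is not None)
        lost = sum(1 for t in arrivals.values() if t is None)
        loss_fraction = lost / len(arrivals) if arrivals else 0.0
        variations = []
        for prev, cur in zip(received, received[1:]):
            departure_gap = self._send_ms[cur] - self._send_ms[prev]
            arrival_gap = arrivals[cur] - arrivals[prev]
            variations.append(arrival_gap - departure_gap)
        return loss_fraction, variations


log = TransportWideLog()
# Audio and video packets share one counter; times are hypothetical.
seqs = [log.stamp(t) for t in (0, 5, 20, 25, 40)]

# The receiver only reports what it saw; packet 3 never arrived.
feedback = {0: 30, 1: 35, 2: 52, 3: None, 4: 75}
loss, variations = log.on_feedback(feedback)
print(loss, variations)  # 0.2 [0, 2, 3]
```

The receiver contributes nothing but timestamps. Both clues, loss and delay variation, fall out of data the sender already holds.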

The IETF later standardized a clean feedback format for the same idea in RFC 8888 (RTP Control Protocol (RTCP) Feedback for Congestion Control, January 2021), which registers the congestion-control feedback message and is the spec-track successor to the transport-cc draft. RFC 8888 reports arrivals per media stream, using each stream's own RTP sequence numbers, and it flags a real trade-off in the transport-wide approach: when feedback lumps all flows under one sequence space, it is harder to tell which stream, audio or video, actually lost packets, so per-stream repair decisions need extra care.

Side-by-side comparison of REMB and transport-cc. On the left, REMB: the receiver computes the bandwidth estimate and sends back a single maximum-bitrate number. On the right, transport-cc: the sender numbers every packet, the receiver reports raw arrival times, and the sender computes the estimate. A label notes that transport-cc is the modern default because all logic lives at the sender.

Figure 2. The shift from receiver-side estimation (REMB) to sender-side estimation (transport-cc). Same goal, opposite location for the math.

Why Audio Adapts Later Than Video

Here is the question every engineer eventually asks: a call gets bad, the video goes to mush, but the voice holds — why?

The answer is bitrate allocation priority. When the allocator splits the GCC target between streams, audio is served first and protected by a floor. A voice call needs very little to stay intelligible: the Opus codec that carries almost all WebRTC audio runs from a low floor up to high fidelity, and a clear voice call lives comfortably in the 16–40 kbps range. Video, by contrast, is hungry — it can usefully absorb hundreds or thousands of kbps and degrades gracefully as you starve it.

So the allocator's logic is simple and deliberate: give audio its small protected slice, then hand whatever remains to video. When the total estimate falls, video is squeezed across a wide range while audio stays untouched until the estimate drops near audio's own floor. Only on a truly broken link — when the whole path cannot carry even tens of kbps — does audio finally start to drop its rate, switch to its most aggressive low-bitrate mode, and lean on packet-loss concealment to fill gaps.

This ordering is the right call. People tolerate frozen or blocky video far better than choppy, dropping speech. A conversation survives on audio; it dies without it.

There is a second reason audio feels slower to react: audio carries fewer packets per second than video, so the feedback loop has fewer data points to work with and the estimator updates audio's situation less often. The protection is mostly deliberate design, but the sparser feedback reinforces it.

How Opus Actually Changes Its Bitrate

The bandwidth estimate is only useful if the codec can act on it instantly, and Opus is built for exactly this. Per RFC 6716 (the Opus specification, September 2012, updated by RFC 8251), Opus supports every bitrate from 6 kbps to 510 kbps and can change rate on a per-frame basis, which at the common 20-millisecond frame size means every 20 milliseconds, without any glitch, gap, or need to renegotiate the session. When the allocator hands Opus a new target, the next frame simply comes out at the new rate.
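To get a feel for what a bitrate target means per frame, the arithmetic is simple: bits per second times frame duration, divided by eight. The helper below is a hypothetical convenience for this article, not part of any Opus API, and it gives the average payload only; individual frames may come out larger or smaller.

```python
def opus_frame_bytes(bitrate_bps, frame_ms=20):
    """Average payload bytes per frame at a given target bitrate.

    bits/s * (ms / 1000) / 8 bits-per-byte, done in integer arithmetic.
    """
    return bitrate_bps * frame_ms // 8000


for rate in (6_000, 24_000, 510_000):
    print(rate, "bps ->", opus_frame_bytes(rate), "bytes per 20 ms frame")
# 6000 bps -> 15 bytes per 20 ms frame
# 24000 bps -> 60 bytes per 20 ms frame
# 510000 bps -> 1275 bytes per 20 ms frame
```

A 24 kbps voice stream is therefore about 60 bytes of audio per packet, which is why protecting it costs so little next to video.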

Opus defaults to variable bitrate (VBR), where each frame uses as many bits as the sound in it needs — silence and simple tones use few, complex speech and music use more — for the best quality at a given average. It also supports constant bitrate (CBR), where every frame is the same size. CBR is rarely the right choice for calls; RFC 6716 notes its two real uses are transports that demand fixed-size frames and certain encryption scenarios. For real-time audio you almost always want VBR and let the allocator move the target.

A worked example. Suppose GCC's total estimate drops to 250 kbps on a degrading link. The allocator protects audio with, say, a 24 kbps floor and hands the rest to video:

total estimate         = 250 kbps
audio floor (Opus VBR) =  24 kbps   ← protected, served first
video gets             = 250 − 24 = 226 kbps

If the link collapses further to 40 kbps, audio still gets its 24 kbps and video is left with 16 kbps — barely a slideshow, but the voice is intact. Push the estimate below ~24 kbps and only then does Opus begin to drop, eventually narrowing its audio bandwidth and bitrate to keep something flowing.
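The same split can be written as a tiny function. This is a deliberately simplified sketch of audio-first allocation under the assumptions of the worked example (one audio stream, one video stream, a 24 kbps floor). The name `allocate` is hypothetical, and a production allocator also handles audio ceilings, video minimums, and multiple streams.

```python
def allocate(total_kbps, audio_floor_kbps=24):
    """Split a total estimate into (audio_kbps, video_kbps), audio served first."""
    # Audio takes its protected slice, or everything if the link
    # cannot even carry the floor.
    audio = min(total_kbps, audio_floor_kbps)
    # Video gets whatever remains.
    video = total_kbps - audio
    return audio, video


print(allocate(250))  # (24, 226)
print(allocate(40))   # (24, 16)
print(allocate(18))   # (18, 0)
```

The last line is the broken-link case: only when the total falls under the floor does audio itself have to give ground.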

Setting Minimum and Maximum Bitrate Sensibly

You get two knobs that genuinely shape the experience, and most teams set them wrong by leaving them at defaults.

The maximum bitrate caps how good audio can get when the network is generous. For a voice-only product, capping Opus around 32–40 kbps spends nothing on inaudible quality and frees bandwidth for video or for more participants. For a music or high-fidelity product, raise it toward 64–128 kbps so a good link is actually used.

The minimum bitrate sets the floor the allocator will protect — the rate below which audio refuses to go. Set it too high and a weak link that could have carried a thin voice stream instead drops the call; set it too low and audio degrades sooner than it needed to. A floor around 16 kbps keeps speech intelligible on bad links; do not push voice below ~12 kbps unless you have tested how it sounds.

Use case	Suggested Opus min	Suggested Opus max	Why
Voice call / telehealth	16 kbps	32–40 kbps	Intelligible voice; leave room for video and resilience
Webinar / one-to-many voice	12–16 kbps	32 kbps	Many listeners; protect the floor, cap the ceiling
Music / high-fidelity	24 kbps	64–128 kbps	A good link should sound excellent
Conference with screen share	16 kbps	40 kbps	Audio floor protected; bandwidth steered to the shared screen

Set both as part of session setup; the floor is what the allocator guards when the network gets tight.
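One way to carry the table into code is a preset map plus a clamp, sketched below. The preset names and the `clamp_audio_target` helper are hypothetical. Where the table gives a range, this sketch takes the lower end for the minimum and the upper end for the maximum, which is an assumption you should replace with values you have tested.

```python
# (min_kbps, max_kbps) per use case, taken from the table above.
AUDIO_PRESETS_KBPS = {
    "voice_call": (16, 40),
    "webinar": (12, 32),
    "music": (24, 128),
    "conference_screen_share": (16, 40),
}


def clamp_audio_target(use_case, target_kbps):
    """Keep the allocator's audio target inside the preset's floor and ceiling."""
    floor, ceiling = AUDIO_PRESETS_KBPS[use_case]
    return max(floor, min(ceiling, target_kbps))


print(clamp_audio_target("voice_call", 96))  # 40: generous link, capped
print(clamp_audio_target("voice_call", 8))   # 16: weak link, floor holds
print(clamp_audio_target("music", 96))       # 96: inside the range, untouched
```

The second call shows why the floor needs care: audio keeps asking for 16 kbps even when only 8 kbps was offered, so a floor set too high can push a weak link over the edge.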

A Common Pitfall: Adding Redundancy Into Congestion

The single most expensive mistake in this area is fighting loss with more data when the loss was caused by too much data. If packets are dropping because the path is congested, turning up forward error correction or redundancy sends more bits into an already-full pipe and makes the congestion — and the loss — worse. Bandwidth control must come first: bring the rate under the estimate, let the queue drain, and only then decide whether the remaining loss is random (worth protecting against) or congestion-driven (which protection cannot fix). The order is always estimate first, protect second.
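The "estimate first, protect second" ordering can be expressed as a small decision function. This is a sketch under stated assumptions: the function and action names are invented, using a rising delay trend as the sign that loss is congestion-driven is a simplification, and the 2% threshold for turning on protection is a placeholder rather than a recommendation.

```python
def next_step(send_kbps, estimate_kbps, loss_fraction, delay_rising,
              min_loss_to_protect=0.02):
    """Pick one action per feedback interval, in strict priority order."""
    # 1. Bandwidth control first: never protect while over the estimate.
    if send_kbps > estimate_kbps:
        return "reduce_rate"
    # 2. Delay still climbing means a queue is still draining; adding
    #    redundancy now would refill it.
    if delay_rising:
        return "hold_and_drain"
    # 3. Loss that persists on a quiet path is treated as random loss,
    #    the kind redundancy can actually help with.
    if loss_fraction >= min_loss_to_protect:
        return "add_protection"
    return "no_change"


print(next_step(300, 250, 0.08, True))    # reduce_rate
print(next_step(240, 250, 0.08, True))    # hold_and_drain
print(next_step(240, 250, 0.08, False))   # add_protection
print(next_step(240, 250, 0.00, False))   # no_change
```

The same 8% loss produces three different answers depending on what the estimate and the delay trend say, which is the whole point of checking them before reaching for redundancy.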

Where Fora Soft Fits In

We build real-time audio and video systems — video conferencing, telemedicine, e-learning, live streaming, and surveillance — where the call has to survive whatever network the user happens to be on. Tuning the audio bitrate floor and ceiling, choosing between REMB and transport-cc on a given stack, and getting the allocator to protect voice before video are the unglamorous decisions that separate a product that holds a conversation on hotel Wi-Fi from one that drops it. We have shipped these systems across conferencing, OTT, and telehealth since 2005, on browsers, native apps, and SFUs alike.

Call to action

Talk to an audio engineer: book a 30-minute scoping call to talk through your real-time audio bitrate control plan.
See our case studies — 250+ shipped projects across video streaming, WebRTC, OTT, telemedicine, e-learning, surveillance, and AR/VR.
Download the Real-time audio bitrate control cheat sheet — One-page reference: the two clues (delay, loss), REMB vs transport-cc vs RFC 8888, why audio survives while video breaks first, the Opus numbers (6-510 kbps, per-20ms rate change, VBR default), the worked 250 kbps split, and the min/max….

References

Holmer, Lundin, Carlucci, De Cicco, Mascolo. A Google Congestion Control Algorithm for Real-Time Communication. IETF Internet-Draft draft-ietf-rmcat-gcc-02, July 2016. https://datatracker.ietf.org/doc/html/draft-ietf-rmcat-gcc-02 — primary source for the delay-based + loss-based two-estimator design; cited as a draft because it expired without becoming an RFC yet remains the libwebrtc reference.
Alvestrand. RTCP message for Receiver Estimated Maximum Bitrate. IETF Internet-Draft draft-alvestrand-rmcat-remb-03. https://datatracker.ietf.org/doc/html/draft-alvestrand-rmcat-remb-03 — definition of REMB and the absolute-send-time dependency.
Holmer et al. RTP Extensions for Transport-wide Congestion Control. IETF Internet-Draft draft-holmer-rmcat-transport-wide-cc-extensions-01. https://datatracker.ietf.org/doc/html/draft-holmer-rmcat-transport-wide-cc-extensions-01 — transport-wide sequence numbers and sender-side estimation.
Sarker, Perkins, Singh, Ramalho. RTP Control Protocol (RTCP) Feedback for Congestion Control. IETF RFC 8888, January 2021. https://www.rfc-editor.org/rfc/rfc8888.html — spec-track congestion-control feedback message (CCFB, FMT value 11); notes the per-stream repair trade-off of transport-wide feedback.
Valin, Vos, Terriberry. Definition of the Opus Audio Codec. IETF RFC 6716, September 2012 (updated by RFC 8251). https://www.rfc-editor.org/rfc/rfc6716.html — Opus 6–510 kbps range, per-frame rate change, VBR default and CBR use cases.
Valin, Maxwell, Terriberry, Vos. Updates to the Opus Audio Codec. IETF RFC 8251, October 2017. https://www.rfc-editor.org/rfc/rfc8251.html — corrections that update RFC 6716; both are cited together for any 2026 Opus claim.
Fora Soft. Bandwidth Estimation and Congestion Control in WebRTC. https://www.forasoft.com/learn/video-streaming/articles-streaming/webrtc-bandwidth-estimation — companion article covering the general (video-side) bandwidth-estimation loop; this article covers the audio half.
Google / WebRTC project. libwebrtc bitrate allocation and Opus integration notes. https://webrtc.googlesource.com/src/ — reference implementation showing the audio-floor-first allocation behaviour described in the "audio adapts later" section.

Related glossary terms

Bitrate

Opus

REMB

Transport-CC

RTCP

RTP

WebRTC audio

Audio codec