Why this matters
If you are planning a virtual classroom, an all-hands meeting tool, a town-hall webinar, a live-shopping stream, or a 911-style emergency bridge, the number you put in the "maximum participants" field on your product spec quietly decides your entire audio architecture and your server bill. Teams routinely design for the demo — twelve people in a grid — and then discover that the 300-person company meeting either melts the server or sounds like a stadium. This article is for the product manager, founder, or engineering lead who needs to understand why audio cost explodes with scale and what changes at each size threshold, so you can size your infrastructure honestly and ask your engineers the right question before launch instead of during the incident. A senior engineer will find every scaling claim traced to the Jitsi Last-N paper, the relevant IETF RFCs, and the published behaviour of Janus, mediasoup, and LiveKit. By the end you will know, for your specific participant target, whether you forward, select, or mix — and what each choice costs.
The number that decides everything: who hears whom
Before any architecture, hold one piece of arithmetic in your head, because every scaling decision in this article follows from it. In a call, each person who is talking produces one stream of audio, and each person listening needs to receive it. If everyone can both talk and hear everyone else, and there are K people in the room, then the total number of voice deliveries the system must handle is K people each receiving the other K−1 voices: K × (K − 1) deliveries.
Let us put numbers on that, because the shape of the growth is the whole story. Each line below gives the room size, then the multiplication, then the result.
- 5 people: 5 × 4 = 20 voice deliveries.
- 50 people: 50 × 49 = 2,450 voice deliveries.
- 500 people: 500 × 499 = 249,500 voice deliveries.
- 5,000 people: 5,000 × 4,999 = 24,995,000 voice deliveries.
Notice what happened. Going from 5 to 50 people — ten times more people — produced not ten times but roughly 122 times more deliveries. This is quadratic growth: the work rises with the square of the number of people, so doubling the room nearly quadruples the load. The Jitsi engineering team, in their 2015 paper on conferencing scale, put the same point plainly: in a system where each endpoint sends media to every other, "the required resources (CPU, network capacity) grow quadratically with the number of endpoints," and the architecture "does not scale well when the number of participants is substantial (≈ 50)." Fifty is the number where the naive approach starts to hurt; that is not a coincidence in this article's title.
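The arithmetic above is small enough to check in a few lines. This is a minimal sketch in Python; the function name `full_mesh_deliveries` is illustrative and does not come from any library.

```python
def full_mesh_deliveries(k: int) -> int:
    """Voice deliveries when each of k people hears the other k - 1."""
    return k * (k - 1)


for k in (5, 50, 500, 5000):
    print(f"{k:>5} people: {full_mesh_deliveries(k):>12,} deliveries")

# Growth factor when the room goes from 5 to 50 people.
growth = full_mesh_deliveries(50) / full_mesh_deliveries(5)
print(f"5 -> 50 people: {growth:.1f}x the deliveries")
```

Ten times the people yields 122.5 times the deliveries, which is the quadratic shape described above.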
A useful mental image. Picture a dinner party. With six guests, everyone can follow everyone — six people, thirty possible one-to-one listening relationships, and it still works. Now imagine a banquet hall of five hundred where everyone tries to talk to everyone at once: it is not a conversation, it is a roar, and no amount of clever wiring fixes the fact that there are a quarter of a million simultaneous listening relationships to service. The fix is not better wiring. The fix is deciding that most of those relationships do not need to exist at any given moment, because in any real meeting only a few people are talking at a time. That single realisation — only forward what someone is actually saying — is the lever that turns an impossible number back into a manageable one.
A note on "active" versus "passive". Throughout this article, an active participant is someone whose microphone is live and could be heard; a passive participant is listening only. A 5,000-person town hall usually has 3 active people and 4,997 passive ones. The active count, not the total count, drives most of the audio cost — and separating the two is the first thing a well-designed large call does.
The protocol family browsers use to carry live audio, WebRTC, does not solve this scaling problem for you. It gives you the pipe; deciding which audio flows through it at each scale is your architecture's job. To understand the three building blocks we combine below — sending direct, forwarding, and mixing — start with the companion article Audio in SFU vs MCU vs P2P, which explains each one in isolation. This article is about what happens when you push them to 50, 500, and 5,000.
Regime one — up to ~50 people: forward everything
The first regime is the small-to-medium group call: a team standup, a classroom, a board meeting. Here the right answer is a Selective Forwarding Unit, an SFU — a server that receives each person's voice and forwards it onward without ever decoding the sound. Each participant uploads one stream and downloads one stream per person they are listening to. The server's job is light because it never decodes audio; its main cost is encryption, which we will return to.
Here is the audio-specific fact that surprises even experienced video engineers: in this regime, all audio is forwarded to everyone, all the time. The famous "Last-N" scaling trick — where a server forwards only the few most relevant video streams and pauses the rest — was designed for video, and the original Jitsi paper is explicit that "the Last N scheme only applies to forwarding video streams and not to audio; audio from all endpoints is always forwarded." The reason is that audio is cheap and latency-sensitive. A voice stream using the Opus codec runs around 24 to 32 thousand bits per second — a fraction of a video stream's size — and if you withhold someone's audio waiting to decide whether they are "relevant," you clip the first word they say. So in a 50-person call, the server forwards every voice to every listener, and that is fine: the streams are small and the client can decode a few dozen of them.
Where does the cost actually go, then? Two places. First, the server's encryption load. WebRTC requires every media packet to be encrypted in transit between each client and the server (SRTP), so the SFU must decrypt each incoming packet and re-encrypt it for each outgoing copy. The Jitsi measurements showed CPU usage rising roughly in step with total bitrate, "mainly contributed by the decrypting and re-encrypting of the RTP packets," not by the audio itself, which is never decoded. Second, the client's decode and mixing load. Your device must decode every incoming voice and combine them for your speakers, and that work climbs with the number of active talkers.
The fix that scales audio without breaking it
If a 50-person room had 50 people all unmuted and breathing into open microphones, the result would be an unlistenable wash of background noise — fifty rooms' worth of fans, keyboards, and traffic, summed together. Two mechanisms keep that from happening, and both are pure audio engineering.
The first is discontinuous transmission (DTX): a quiet microphone effectively stops sending. Opus DTX drops a silent participant from around 24–32 kbit/s to roughly 1 kbit/s, sending only a tiny "comfort noise" hint every few hundred milliseconds. So in a 50-person call where three people are talking, only those three are sending full-rate audio; the other 47 are nearly silent on the wire. The room's real audio load is set by the active speakers, not the total. (How DTX and comfort noise work is covered in Voice Activity Detection and Discontinuous Transmission.)
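To see how much DTX changes the picture, here is a rough estimate of the aggregate bitrate arriving at the server. It is a sketch under stated assumptions: 28 kbit/s is simply the midpoint of the 24 to 32 kbit/s range quoted above, 1 kbit/s is the approximate DTX rate, and the function name is illustrative.

```python
ACTIVE_KBPS = 28.0  # assumed midpoint of the 24-32 kbit/s range above
DTX_KBPS = 1.0      # approximate rate of a silent microphone under DTX


def uplink_kbps(total: int, talking: int, dtx: bool = True) -> float:
    """Aggregate bitrate arriving at the server from all participants."""
    if not 0 <= talking <= total:
        raise ValueError("talking must be between 0 and total")
    idle_rate = DTX_KBPS if dtx else ACTIVE_KBPS
    return talking * ACTIVE_KBPS + (total - talking) * idle_rate


with_dtx = uplink_kbps(total=50, talking=3)
without_dtx = uplink_kbps(total=50, talking=3, dtx=False)
print(f"with DTX: {with_dtx:.0f} kbit/s, without: {without_dtx:.0f} kbit/s")
```

With these assumed rates, a 50-person room with three talkers sends about 131 kbit/s to the server with DTX and about 1,400 kbit/s without it.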
The second is active-speaker awareness. Each sender stamps the loudness of every audio packet into a small RTP header extension defined by IETF RFC 6464: a number from 0 to 127 in decibels below full scale, where 0 is the loudest possible signal and 127 is silence. The server can read it without decoding the sound. Even though it forwards all audio, the server uses these numbers to tell each client who the dominant speaker is, so the UI can highlight them and the client can choose to render only the loudest few. This is the foundation that the next regime builds on.
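Reading the RFC 6464 value is a matter of masking one byte: the top bit is the sender's voice-activity flag and the low seven bits are the level. The sketch below decodes that data byte and ranks participants by loudness. Locating the extension inside the RTP header is omitted, and the names `AudioLevel`, `parse_audio_level`, and `loudest` are illustrative, not part of any SFU's API.

```python
from typing import NamedTuple


class AudioLevel(NamedTuple):
    voice_activity: bool  # the V bit set by the sender
    level: int            # 0 = loudest, 127 = silence


def parse_audio_level(ext_byte: int) -> AudioLevel:
    """Decode the single data byte of an RFC 6464 audio-level extension."""
    if not 0 <= ext_byte <= 0xFF:
        raise ValueError("expected a single byte")
    return AudioLevel(bool(ext_byte & 0x80), ext_byte & 0x7F)


def loudest(levels: dict[str, int], n: int = 3) -> list[str]:
    """Return the n participants with the smallest (loudest) level value."""
    return sorted(levels, key=lambda pid: levels[pid])[:n]


parsed = parse_audio_level(0x9E)  # V bit set, level 30
top_two = loudest({"ana": 30, "ben": 127, "chi": 45, "dev": 60}, n=2)
print(parsed, top_two)
```

Because smaller numbers mean louder audio, the ranking sorts in ascending order.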
Figure 1. Why scale breaks audio. Forwarding every voice to everyone (orange) grows with the square of the room size. Forwarding only the few active speakers (green) grows with the number of talkers, which barely moves as the audience grows. The gap between the curves is the entire engineering problem.
Regime two — 50 to ~500 people: forward only the talkers
Past roughly fifty people, two things change. The room almost always becomes asymmetric — a few presenters and a large, mostly-muted audience — and even forwarding only the active speakers' audio to every listener starts to add up, because the number of listeners is now large. A single presenter's voice going to 499 listeners is 499 outbound copies the server must encrypt. With three presenters, that is nearly 1,500 encrypt-and-send operations per audio frame, every 20 milliseconds. The audio is still cheap to route, but the fan-out is now the cost.
The architecture for this regime is still an SFU, but now it forwards only the active speakers' audio, not everyone's. The server ranks participants by their RFC 6464 loudness numbers, identifies the few who are genuinely holding the floor, and forwards only those streams. A 200-person all-hands where one VP is presenting forwards essentially one audio stream to 199 listeners; when someone asks a question, their stream is added to the forwarded set and the VP's may drop out. The decision of who is the active speaker is the heart of this regime, and getting it right is harder than "whoever is loudest right now."
Picking the active speaker without picking wrong
A cough, a dropped phone, a single loud "yeah" of agreement — none of these should steal the floor from the person presenting. The industry-standard solution is the dominant-speaker identification algorithm published by Ilana Volfin and Israel Cohen, which the Jitsi paper adapted to run purely on RFC 6464 audio levels with no audio decoding. Instead of reacting to one loud packet, it scores each participant's speech activity over three time windows at once — an immediate window of a few phonemes, a medium window of a word or two, and a long window of a short sentence — and only declares a speaker switch when all three rise together on the same channel. The designers set explicit behaviour goals: transient noise and one-word interjections must not cause a switch, a switch can only be triggered by the start of a genuine speech burst, and a tolerable transition delay between speakers is "up to one second." The relative loudness of a person's voice must not bias their chance of being chosen — a soft-spoken presenter is detected as reliably as a loud one.
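The idea of requiring agreement across several time windows can be illustrated with a much simpler detector. The sketch below is not the Volfin and Cohen algorithm and not what any particular SFU ships. It only shows the principle that a short burst passes the immediate window but fails the longer ones. The speech threshold, window lengths, and minimum fractions are all assumed values chosen for the example.

```python
from collections import deque
from typing import Optional

SPEECH_LEVEL = 50             # assumed: RFC 6464 values below this count as speech
WINDOWS = (5, 25, 75)         # assumed window lengths, in 20 ms frames
MIN_ACTIVE = (0.6, 0.5, 0.4)  # assumed minimum speech fraction per window


class DominantSpeaker:
    def __init__(self) -> None:
        self.history: dict[str, deque] = {}
        self.current: Optional[str] = None

    def push(self, pid: str, level: int) -> None:
        """Record one frame's audio level for a participant."""
        frames = self.history.setdefault(pid, deque(maxlen=WINDOWS[-1]))
        # Thresholding to a boolean keeps raw loudness from biasing the choice.
        frames.append(level < SPEECH_LEVEL)

    def scores(self, pid: str) -> tuple:
        """Fraction of speech frames in each window, shortest first."""
        frames = list(self.history.get(pid, ()))
        return tuple(sum(frames[-w:]) / w for w in WINDOWS)

    def qualifies(self, pid: str) -> bool:
        return all(s >= m for s, m in zip(self.scores(pid), MIN_ACTIVE))

    def update(self) -> Optional[str]:
        # The current speaker keeps the floor while still qualifying.
        if self.current is not None and self.qualifies(self.current):
            return self.current
        candidates = [p for p in self.history if self.qualifies(p)]
        if candidates:
            # Prefer whoever has been speaking longest.
            self.current = max(candidates, key=lambda p: self.scores(p)[-1])
        return self.current


ds = DominantSpeaker()
for _ in range(40):  # ana talks, ben is silent
    ds.push("ana", 30)
    ds.push("ben", 127)
after_speech = ds.update()
for _ in range(3):   # ben coughs while ana keeps talking
    ds.push("ana", 30)
    ds.push("ben", 20)
after_cough = ds.update()
for _ in range(40):  # ana stops, ben starts a sentence
    ds.push("ana", 127)
    ds.push("ben", 30)
after_handover = ds.update()
print(after_speech, after_cough, after_handover)
```

In this run the cough does not take the floor, while a sustained sentence from the second participant does once the first has gone quiet.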
This algorithm runs inside mediasoup's ActiveSpeakerObserver and Jitsi Videobridge's dominant-speaker logic, and other production SFUs ship comparable detectors. The output is a stable "active speaker" signal that does two jobs at once: it tells the SFU which audio to forward, and it tells every client which tile to highlight.
The arithmetic of the saving
The payoff is the difference between quadratic and linear growth. The Jitsi measurements quantified it for video: in a 30-participant conference, forwarding a fixed number of streams instead of everyone cut outbound bandwidth by 72% to 89% depending on how many were forwarded, turning quadratic growth into linear growth in the number of participants. For audio the same logic applies once you forward only active speakers: the cost scales with the number of talkers (which stays near 1–5 in any real meeting) multiplied by the number of listeners, instead of with the square of the total. Concretely, in a 500-person webinar with 3 active speakers, you forward on the order of 3 × 500 = 1,500 audio deliveries instead of the 500 × 499 = 249,500 that "forward everything" would demand — a 166-fold reduction, achieved entirely by not forwarding silence.
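The webinar comparison above can be reproduced directly. This sketch uses the same rounding as the text (talkers multiplied by listeners) and assumes one packet per 20-millisecond frame, giving 50 packets per second per stream; the function names are illustrative.

```python
FRAMES_PER_SECOND = 50  # one packet per 20 ms audio frame


def forward_everything(k: int) -> int:
    """Every participant receives every other participant's voice."""
    return k * (k - 1)


def forward_talkers(talkers: int, listeners: int) -> int:
    """Only active speakers are forwarded. Uses the article's rounding:
    a talker does not receive their own voice, so the exact count is
    slightly lower than talkers * listeners."""
    return talkers * listeners


everything = forward_everything(500)
selective = forward_talkers(talkers=3, listeners=500)
reduction = everything / selective
packets_per_second = selective * FRAMES_PER_SECOND

print(f"forward everything: {everything:,}")
print(f"forward talkers:    {selective:,}")
print(f"reduction:          about {reduction:.0f}x")
print(f"outbound packets/s: {packets_per_second:,}")
```

Under these assumptions the server still sends 75,000 audio packets per second, which is why fan-out becomes the dominant cost in this regime.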
The remaining ceiling in this regime is per-machine capacity, on both ends of the connection. Even with only the active speakers forwarded, a single SFU has a finite number of outbound connections it can encrypt and serve, and a single client receiving a handful of audio streams plus the UI plus video has its own limit. Somewhere between several hundred and a couple of thousand participants on one server, you run out of one or the other. That ceiling is what defines the third regime.
Figure 2. Regime two: forward only the talkers. A few active speakers are detected from RFC 6464 loudness numbers and fanned out to all listeners; the muted majority cost almost nothing thanks to DTX. The server never decodes the audio — it only reads the loudness stamp and forwards.
Regime three — 500 to 5,000+ people: mix the few, fan out one
At a few thousand participants, the fan-out itself becomes the wall. Sending three speakers' audio to 5,000 listeners means 15,000 encrypted outbound packets per audio frame, and a single client that is also rendering video cannot comfortably receive and decode several separate audio streams while everything else is happening. The escape is to stop sending separate streams and mix the active speakers into one combined stream on the server, then send that single stream to everyone. This is the Multipoint Control Unit (MCU) approach, applied surgically to just the audio, and just the active speakers.
The mechanics matter. The server decodes only the few active speakers' audio (not all 5,000 — the rest are silent and excluded by the active-speaker logic), sums them into one mix, re-encodes it once, and ships that single stream to every listener. A listener now downloads exactly one audio stream regardless of whether the room has 500 people or 50,000. Janus's AudioBridge plugin is a production example of exactly this: it is "an audio MCU," and "all the audio streams are mixed rather than relayed," so "a single PeerConnection will be created no matter how many participants join the room," each receiving "a mix of the other" participants. The cost moves decisively from network fan-out to server CPU for the mix — but because only the active few are decoded and the mix is produced once, that CPU is bounded by the number of talkers, not the number of listeners.
The N-minus-1 rule, and why you never sum all 5,000
Two pieces of audio engineering make server mixing work. First, the mix sent to each speaker must exclude that speaker's own voice — otherwise they hear themselves echoed back a beat late. For N participants, each person receives the sum of the other contributors, the N-minus-1 mix. For pure listeners this is trivial (they contribute nothing, so they get the full mix), and for the handful of active speakers the server produces a slightly different mix per speaker. Second — and this is the rule that makes scale possible — the server mixes only the loudest few channels, never all of them. Summing 5,000 audio signals would pile up 5,000 noise floors into a hiss louder than any single voice, and would cost 5,000 decodes. A scalable conference mixer selects the top few active channels by their audio level and mixes only those, typically three to six voices. The same RFC 6464 loudness numbers that drove forwarding in regime two now drive selection for mixing in regime three.
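Both rules fit in a short sketch: pick the loudest few by RFC 6464 level, sum their decoded frames once, and subtract each talker's own samples to get their N-minus-1 mix. This is an illustration rather than how Janus or any specific mixer is implemented. It works on plain lists of 16-bit PCM samples, assumes all frames have equal length, and hard-clips the sum where a production mixer would more likely apply a limiter.

```python
SILENCE = 127  # RFC 6464 value for a silent frame


def clamp16(sample: int) -> int:
    """Hard-clip to the signed 16-bit range (a simplification)."""
    return max(-32768, min(32767, sample))


def select_talkers(levels: dict[str, int], max_voices: int = 3) -> list[str]:
    """Loudest non-silent participants; smaller level means louder."""
    speaking = [p for p, lvl in levels.items() if lvl < SILENCE]
    return sorted(speaking, key=lambda p: levels[p])[:max_voices]


def build_mixes(pcm: dict[str, list[int]]) -> tuple[list[int], dict[str, list[int]]]:
    """pcm holds decoded frames of the selected talkers only.

    Returns the full mix for pure listeners and, for each talker,
    the mix with that talker's own voice removed.
    """
    if not pcm:
        return [], {}
    length = len(next(iter(pcm.values())))
    total = [sum(frame[i] for frame in pcm.values()) for i in range(length)]
    full = [clamp16(s) for s in total]
    minus_self = {
        pid: [clamp16(t - own) for t, own in zip(total, frame)]
        for pid, frame in pcm.items()
    }
    return full, minus_self


levels = {"ana": 30, "ben": 45, "chi": 60, "dev": 127, "eli": 40}
talkers = select_talkers(levels, max_voices=3)
decoded = {"ana": [100, 200], "eli": [10, -20], "ben": [1, 1]}
full_mix, per_talker = build_mixes({p: decoded[p] for p in talkers})
print(talkers, full_mix, per_talker["ana"])
```

The sum is computed once and each talker's mix is derived by subtraction, so the per-frame work grows with the number of selected voices rather than with the audience.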
When even one server is not enough: cascading
A single server has a participant ceiling regardless of architecture. To go beyond it — into the tens of thousands — production systems cascade: they run many media servers that relay to each other, so a participant connected to a server in Frankfurt can hear a speaker connected to a server in São Paulo without either server handling all participants directly. LiveKit's published architecture is a worked example: edge servers form a distributed mesh, coordinate through a shared state layer (Redis), and relay tracks between regions, which is how the platform advertises support for sessions of up to 100,000 participants with sub-100-millisecond latency between them. Cascading does not change the per-room audio logic — you still forward the active speakers or mix the loudest few — it just spreads the fan-out across many machines so no single one hits its ceiling.
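A back-of-the-envelope model shows why cascading helps. The sketch assumes the simplest possible layout: all talkers connect to one origin server, the origin relays one copy of each talker's stream to every other server, and listeners are spread evenly. This is not a description of LiveKit's actual routing, and the function name is hypothetical.

```python
import math


def cascade_fanout(listeners: int, talkers: int, servers: int) -> tuple[int, int]:
    """Return (deliveries per server to local listeners, relay copies from origin)."""
    if servers < 1:
        raise ValueError("need at least one server")
    local_listeners = math.ceil(listeners / servers)
    local_deliveries = talkers * local_listeners
    relay_copies = talkers * (servers - 1)
    return local_deliveries, relay_copies


single = cascade_fanout(listeners=50_000, talkers=3, servers=1)
cascaded = cascade_fanout(listeners=50_000, talkers=3, servers=10)
print(f"one server:  {single[0]:,} local deliveries, {single[1]} relay copies")
print(f"ten servers: {cascaded[0]:,} local deliveries each, {cascaded[1]} relay copies")
```

In this model the relay traffic between servers is tiny compared with the local fan-out each server takes over.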
The honest endgame: at broadcast scale, it stops being a "call"
There is a threshold past which the right answer is to admit you are no longer running a conversation. A 50,000-viewer product launch with three presenters is a broadcast, not a group call. The standard production pattern is a hybrid: the few presenters and any audience members invited "on stage" run on a real-time SFU/MCU for genuine two-way interaction, while the passive audience receives the mixed program over a streaming protocol such as Low-Latency HLS through a CDN, exactly as they would a live video stream. The interactive core stays small and real-time; the massive passive tier is served by infrastructure built for massive passive tiers. Knowing where to draw that line — how many people genuinely need to talk back — is the most important scaling decision of all, and it is a product decision before it is an engineering one. The streaming half of this is its own discipline; see Audio in HLS, DASH, CMAF and Audio in Low-Latency Streaming.
Figure 3. Choosing by scale. Up to ~50: forward everything. Up to ~500: forward only the active speakers. Beyond that: mix the loudest few and either cascade across servers or, if almost no one needs to talk back, fan the program out over low-latency streaming.
The three regimes, side by side
The three regimes trade the same resources — server CPU, server fan-out bandwidth, client decode load, and latency — in different directions as the room grows. The table is the decision in one view. Read the participant counts as soft thresholds, not hard cutoffs; your real numbers depend on your codec settings, your hardware, and how many people talk at once.
| Property | ~50 people | ~500 people | 5,000+ people |
|---|---|---|---|
| Architecture | SFU, forward all audio | SFU, forward active speakers | MCU mix of loudest few; cascade or hybrid |
| Server decodes audio? | Never | Never | Only the active few |
| What the server sends each listener | Every voice | The active speakers' voices | One mixed stream |
| Dominant cost | Client decode + server encryption | Server fan-out (encrypt per copy) | Server mix CPU (bounded by talkers) |
| Streams a listener downloads | One per other participant (most near-silent under DTX) | One per active talker | Exactly one (the mix) |
| Added latency | Lowest (forward only) | Low (forward only) | Higher (decode + mix + re-encode) |
| Per-speaker control (mute, levels, spatial) | Full | Full | Lost once mixed |
| Single-file recording | Needs separate mix | Needs separate mix | The mix already exists |
| Real-world example | Jitsi, mediasoup, LiveKit room | Webinar SFU with active-speaker mode | Janus AudioBridge; LiveKit cascade; SFU+LL-HLS hybrid |
The pattern reads cleanly once you see it. As the room grows, the cost migrates from the client (decoding many streams) to the server's network (fanning out copies) to the server's CPU (mixing). Each migration buys you another order of magnitude of scale, and each one costs you something — usually per-speaker control and a little latency. The skill is changing regimes at the right moment, not too early (wasting the simplicity of forwarding) and not too late (after the incident).
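The decision in the table can be written down as a small function, which is sometimes useful as a starting point for a capacity-planning script. The thresholds are the soft ones used in this article, not hard limits, and the function name and return strings are illustrative.

```python
def choose_regime(participants: int, talk_back_needed: bool = True) -> str:
    """Map a participant target to one of the article's regimes."""
    if participants < 2:
        raise ValueError("a call needs at least two participants")
    if participants <= 50:
        return "SFU: forward all audio"
    if participants <= 500:
        return "SFU: forward active speakers only"
    if talk_back_needed:
        return "mix the loudest few; cascade across servers"
    return "small interactive stage plus low-latency streaming"


for size in (12, 300, 5000):
    print(size, "->", choose_regime(size))
print(50_000, "->", choose_regime(50_000, talk_back_needed=False))
```

Real thresholds depend on codec settings, hardware, and how many people talk at once, so treat the numbers as defaults to override.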
The common mistake: sizing for the average, paying for the peak
Here is the pitfall that catches the most teams, and it is specific to audio at scale. A meeting's audio cost is not driven by the total participant count or even the average talker count — it is driven by the worst-case simultaneous-talker count, and that number spikes at exactly the wrong moments. A 500-person all-hands sits at one or two active speakers for 55 minutes, then the CEO says "any questions?" and forty people unmute at once. If your active-speaker logic forwards or mixes everyone who is momentarily loud, that one moment can demand twenty times the audio resources of the steady state, and that is the moment your call degrades in front of your most important audience.
The fix is a hard cap on simultaneously forwarded or mixed speakers — typically three to six — enforced by the dominant-speaker ranking, plus a UI affordance that makes the cap legible (a "raise hand" queue, or a moderator-controlled stage). The Volfin–Cohen rule that "when simultaneous speech occurs on more than one channel, the dominant speaker is the one who began speaking first" is exactly what keeps a forty-person unmute from turning into forty forwarded streams: the system picks the few who started first and holds the rest. Design the cap deliberately; do not let it be an accident of whatever your SFU defaults to.
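A hard cap with first-to-speak priority can be sketched as a small piece of state. This is an illustration of the rule, not the behaviour of any specific SFU. The `SpeakerCap` class and its method names are hypothetical, and speech start and end events are assumed to come from whatever voice-activity logic is already in place.

```python
class SpeakerCap:
    """Forward at most max_speakers streams, earliest talker first."""

    def __init__(self, max_speakers: int = 3) -> None:
        self.max_speakers = max_speakers
        self.started_at: dict[str, float] = {}  # participant -> speech start time

    def on_speech_start(self, pid: str, now: float) -> None:
        # Keep the original start time if the participant is already talking.
        self.started_at.setdefault(pid, now)

    def on_speech_end(self, pid: str) -> None:
        self.started_at.pop(pid, None)

    def _by_start_time(self) -> list[str]:
        return sorted(self.started_at, key=lambda p: self.started_at[p])

    def forwarded(self) -> list[str]:
        """Participants whose audio is forwarded or mixed right now."""
        return self._by_start_time()[: self.max_speakers]

    def held(self) -> list[str]:
        """Participants who are talking but are over the cap."""
        return self._by_start_time()[self.max_speakers :]


cap = SpeakerCap(max_speakers=3)
for i in range(40):  # forty people unmute within half a second
    cap.on_speech_start(f"user{i}", now=i * 0.01)
first_three = cap.forwarded()
held_count = len(cap.held())
cap.on_speech_end("user0")  # the first talker finishes
next_three = cap.forwarded()
print(first_three, held_count, next_three)
```

The forty-person unmute costs three forwarded streams, and the next person in line is promoted only when a slot frees up.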
A second, quieter pitfall: DTX gaps confuse naive active-speaker logic. Because a quiet microphone under Opus DTX stops sending packets, a level reader that expects a packet every 20 milliseconds can misread the silence as "this speaker left" and flicker the active-speaker UI. If your speaker highlight stutters when people pause between sentences, DTX timing — not the algorithm — is the first place to look. This is documented behaviour in real SFUs, and the fix lives in how the level observer handles missing packets, not in the loudness numbers themselves.
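One way to handle missing packets is to distinguish a short gap (treat as silence) from a long absence (treat as gone). The sketch below is hypothetical and not taken from any SFU. The 40-millisecond gap threshold and the 5-second departure timeout are assumed values that would need tuning against real DTX behaviour.

```python
from typing import Optional

FRAME_MS = 20
SILENCE = 127          # RFC 6464 value for silence
GONE_AFTER_MS = 5000   # assumed timeout before a sender is considered gone


class LevelObserver:
    def __init__(self) -> None:
        self.last_seen: dict[str, int] = {}
        self.last_level: dict[str, int] = {}

    def on_packet(self, pid: str, level: int, now_ms: int) -> None:
        self.last_seen[pid] = now_ms
        self.last_level[pid] = level

    def level_at(self, pid: str, now_ms: int) -> Optional[int]:
        """Effective level now, or None if the sender is unknown or gone."""
        seen = self.last_seen.get(pid)
        if seen is None:
            return None
        gap = now_ms - seen
        if gap > GONE_AFTER_MS:
            return None
        if gap > 2 * FRAME_MS:
            # Probably a DTX gap: the sender is quiet, not absent.
            return SILENCE
        return self.last_level[pid]


obs = LevelObserver()
obs.on_packet("ana", level=30, now_ms=0)
talking = obs.level_at("ana", now_ms=20)
pausing = obs.level_at("ana", now_ms=300)
departed = obs.level_at("ana", now_ms=10_000)
print(talking, pausing, departed)
```

A pause between sentences now reads as silence from a present participant, so the highlight logic can decide whether to hold the floor instead of reacting to a false departure.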
Where Fora Soft fits in
We have built real-time audio into video conferencing, e-learning, telemedicine, and live-event products since 2005, and the "how many participants?" question is one we pin down before writing a line of media code, because it decides the architecture more than any other requirement. A 30-student virtual classroom is a straightforward SFU forwarding the teacher plus whoever unmutes; a 400-person company all-hands is an SFU in active-speaker mode with a capped speaker count and a moderated question queue; a 10,000-attendee product launch is a small interactive stage feeding a mixed program out over low-latency streaming to the passive audience. We size for the peak simultaneous-talker moment, not the comfortable average, because we have seen what the "any questions?" spike does to a system designed for the steady state. When a product needs both genuine large-scale interactivity and clean recordings, we plan the mixing CPU and the egress path up front rather than discovering them under load.
What to read next
- Audio in SFU vs MCU vs P2P
- Voice Activity Detection (VAD) and Discontinuous Transmission (DTX)
- Audio in Low-Latency Streaming (LL-HLS, LL-DASH, CMAF-LL)
Call to action
- Talk to an audio engineer: book a 30-minute scoping call to talk through your group video call scalability plan.
- See our case studies — 250+ shipped projects across video streaming, WebRTC, OTT, telemedicine, e-learning, surveillance, and AR/VR.
- Download the Group Call Scale — Field Guide — One page: the three scaling regimes (50 / 500 / 5,000), the quadratic-vs-linear arithmetic, the active-speaker and N-1 mixing rules, and the speaker-cap pitfall.
References
- B. Grozev, L. Marinov, V. Singh, E. Ivov, Last N: Relevance-Based Selectivity for Forwarding Video in Multimedia Conferences, NOSSDAV 2015 (ACM), DOI 10.1145/2736084.2736094. The primary source for the quadratic-vs-linear scaling result, the "audio is always forwarded, Last-N applies to video only" rule, and the measured CPU/bandwidth gains (e.g., outbound-bitrate gains of 72–89% at K=30; SFU CPU dominated by SRTP encrypt/decrypt). Read in full from jitsi.org. https://jitsi.org/wp-content/uploads/2016/12/nossdav2015lastn.pdf
- IETF RFC 6464, A Real-time Transport Protocol (RTP) Header Extension for Client-to-Mixer Audio Level Indication, J. Lennox, E. Ivov, E. Marocco, December 2011. The 0–127 -dBov audio-level header the server reads without decoding; the basis for both active-speaker forwarding (regime 2) and loudest-channel selection for mixing (regime 3). Read in full from rfc-editor.org. https://www.rfc-editor.org/rfc/rfc6464.html
- IETF RFC 7667, RTP Topologies, M. Westerlund, S. Wenger, November 2015. An informational catalogue of RTP topologies; defines the Selective Forwarding Middlebox (the SFU) and the Mixer (the MCU) and their differences. Obsoletes RFC 5117 and the draft cited in the Last-N paper. https://www.rfc-editor.org/rfc/rfc7667.html
- IETF RFC 3550, RTP: A Transport Protocol for Real-Time Applications, H. Schulzrinne et al., STD 64, July 2003. The base RTP/RTCP framework; defines the mixer and translator middlebox roles the MCU and SFU implement, and the CSRC list a mixer uses to indicate contributing sources. https://www.rfc-editor.org/rfc/rfc3550.html
- I. Volfin and I. Cohen, Dominant Speaker Identification for Multipoint Videoconferencing, Computer Speech & Language 27(4):895–910, 2013. The three-window (immediate/medium/long) dominant-speaker algorithm, its design goals (no false switch on transient noise; ≤1 s transition; loudness-independent), used by production SFUs on RFC 6464 levels. https://israelcohen.com/wp-content/uploads/2018/05/IEEEI2012_Volfin.pdf
- IETF RFC 6716, Definition of the Opus Audio Codec, JM. Valin, K. Vos, T. Terriberry, September 2012. The codec carried in all three regimes; the basis for the ~24–32 kbit/s active-voice and ~1 kbit/s DTX figures (per the codec's full 6–510 kbit/s range). https://www.rfc-editor.org/rfc/rfc6716.html
- Janus WebRTC Server — AudioBridge plugin documentation, Meetecho. Production MCU example: decodes and mixes Opus, one PeerConnection per participant regardless of room size, mixer thread shared across participants, per-participant encode thread. https://janus.conf.meetecho.com/docs/audiobridge
- Meetecho engineering blog, Improving the AudioBridge plugin. Detail on the decode → buffer → shared mixer thread → per-participant encode pipeline and its CPU characteristics. https://www.meetecho.com/blog/improving-the-audiobridge-plugin/
- mediasoup documentation, ActiveSpeakerObserver and AudioLevelObserver (v3 API), versatica/mediasoup. Dominant-speaker detection from RFC 6464 levels with no server-side audio decoding. https://mediasoup.org/documentation/v3/mediasoup/api/
- LiveKit engineering blog, How we built a globally distributed mesh network to scale WebRTC, and LiveKit SFU internals. Cascaded SFU mesh, Redis coordination, cross-region track relay, and the published 100,000-participant / sub-100 ms scaling target. https://livekit.com/blog/scaling-webrtc-with-distributed-mesh
- T. Levent-Levi (BlogGeek.me), WebRTC Conferences — To Mix or to Route Audio? Practitioner framing of when audio routing (SFU) gives way to audio mixing (MCU) as participant counts grow. https://bloggeek.me/webrtc-conferences-mix-or-route-audio/
- IETF RFC 8825, Overview: Real-Time Protocols for Browser-Based Applications, H. Alvestrand, January 2021. The WebRTC architecture overview; defines the protocol from a single browser's perspective and leaves multi-party topology and scaling to the application. https://www.rfc-editor.org/rfc/rfc8825.html


