Why this matters

If you build a video conferencing tool, a telemedicine platform, an online classroom, or a contact centre, the way your server handles audio sets your monthly server bill, your scaling ceiling, and whether a 200-person webinar sounds clean or like a crowded room. The audio decision is quieter than the video decision, so teams copy whatever their video architecture does and never look at it again — and then discover at scale that audio mixing is eating their CPU budget, or that selective forwarding broke their recording feature. This article is for the product manager, founder, or operations lead who needs to understand why one architecture mixes audio and another forwards it, so you can read an architecture proposal, ask your engineer the right question, and know what each choice costs you. A senior engineer will also find every claim traced to the relevant IETF RFC or the published behaviour of mediasoup, LiveKit, and Jitsi. By the end you will know which of the three to pick, and exactly when the cheap choice stops being cheap.


The one question every group call has to answer

Before we compare the three architectures, hold one idea in your head, because everything else follows from it. In a two-person call, audio is simple: my voice goes to you, your voice goes to me, done. The moment a third person joins, a new question appears that has no obvious answer. If three people are on a call, and I am one of them, I need to hear the other two. Should I receive two separate streams of audio and let my own device combine them? Or should something in the middle combine those two voices into one stream before it reaches me?

That single question — combine the voices, or keep them separate — is the entire difference between the three architectures. A topology, in the context of a real-time call, is the shape of the lines connecting the people in the room, plus the rule for what happens to the audio as it travels along those lines. The video version of this question — where the lines go and what the server does to the picture — is covered in depth in SFU vs MCU vs Mesh: The Three WebRTC Topologies in our Video Streaming hub. Here we look only at the audio half, because audio answers the "combine or keep separate" question very differently from video, and the difference is where most of the surprises live.

A useful mental image. Picture a dinner party. In a small group of four, everyone hears everyone directly — no one needs help. In a panel discussion, a sound engineer takes a microphone from each panellist, mixes them at a board, and sends one combined signal to the speakers in the hall. In a radio newsroom, a producer keeps every reporter's feed on its own channel and switches between them, sending whichever is live to air. The small group is P2P, the mixing board is an MCU (a Multipoint Control Unit), the channel switcher is an SFU (a Selective Forwarding Unit). The trade-offs in software map almost one-to-one onto these three real-world setups.

A note on naming. "P2P" and "Mesh" overlap. "P2P" (peer-to-peer) strictly means there is no server in the media path at all. With two people that is a genuine direct connection. With more than two people, P2P becomes a Mesh: every person opens a direct connection to every other person. We use P2P in this article because that is the term the product world uses for the no-server audio case, and we note where the Mesh distinction matters.

WebRTC, the protocol family browsers use to send live audio and video, does not pick a topology for you. The same JavaScript microphone API can feed all three patterns, which is exactly why choosing between them is your job, not the browser's.


P2P: every person sends their voice to every other person

The simplest architecture has no server in the audio path. Each participant encodes their microphone once per listener and sends it directly to each of them. With two people, this is one connection in each direction and it is the cleanest possible audio path — lowest latency, no middle box, no extra cost. This is why every WebRTC tutorial starts with a two-person P2P call.

The trouble starts when the call grows, and the trouble is upload bandwidth on the person's own device. In P2P, if there are five people on the call, your device must send your voice four separate times — once to each of the other four — and receive four separate voice streams in return. Audio is forgiving here compared to video, because a voice stream is small. A typical WebRTC voice stream using the Opus codec runs around 24 to 32 thousand bits per second; even ten of those is well under half a megabit, which any modern connection handles. So for audio alone, P2P survives larger groups than video P2P does.
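The arithmetic above can be sketched in a few lines. This is an illustration only: it assumes 32 kbit/s per Opus stream (the upper end of the range quoted above) and counts audio payload alone, not packet headers or encryption overhead. The function name `mesh_audio_load` is ours, not part of any WebRTC API.

```python
def mesh_audio_load(participants: int, kbps_per_stream: float = 32.0) -> dict:
    """Per-device and whole-call stream counts for a full-mesh (P2P) audio call."""
    if participants < 2:
        raise ValueError("a call needs at least two participants")
    peers = participants - 1  # everyone on the call except yourself
    return {
        "streams_up_per_device": peers,
        "streams_down_per_device": peers,
        "upload_kbps_per_device": peers * kbps_per_stream,
        # Every pair of participants shares one direct connection: n(n-1)/2.
        "connections_in_call": participants * peers // 2,
    }


if __name__ == "__main__":
    for n in (2, 5, 11):
        print(n, mesh_audio_load(n))
```

With 11 participants each device uploads ten streams, about 320 kbit/s of audio under these assumptions, which is why audio alone rarely breaks a mesh before video does.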

But three real ceilings still close in. First, your device now runs an echo canceller, a noise suppressor, and a decoder for every incoming stream, and that processing climbs with the number of people. Second, encryption overhead: every direct connection is separately encrypted, so the per-stream cost is not just the audio bits. Third — and this is the one that ends P2P — the moment you also send video, the upload math collapses, because video streams are tens of times larger than audio streams. Most products run audio and video together, so the audio P2P ceiling is dragged down to the video P2P ceiling.

The practical rule: P2P audio is the right answer for one-to-one calls and very small group calls of three or four people with no server budget. Above that, you need a server in the middle. The only question left is what that server does to the audio — mix it, or forward it.

Three audio call architectures side by side, each showing four participants. In the P2P panel, every participant connects directly to every other, forming a full mesh of lines, with no server. In the MCU panel, every participant sends one voice stream up to a central server that mixes all voices and sends one combined stream back down to each. In the SFU panel, every participant sends one voice stream up to a central server that forwards individual streams down, but the server never mixes them. Annotations note that P2P has no server cost but high client upload, MCU sends one stream down but decodes and re-encodes everything, and SFU forwards without decoding.

Figure 1. The three audio architectures. P2P keeps the server out of the path entirely. The MCU collapses every voice into one mixed stream. The SFU forwards each voice untouched. The difference between the last two is whether the server decodes the audio, and that difference sets the cost.


MCU: the server mixes every voice into one stream

An MCU puts a powerful server in the middle and gives it a hard job. Every participant sends their single voice stream up to the server. The server decodes each one back into raw sound, adds the sounds together into a mix, re-encodes the mix, and sends that one combined stream back down to each participant. From any participant's point of view, they upload one stream and download one stream, no matter how many people are in the call. A 50-person call looks, to your device, exactly like a 2-person call: one stream up, one stream down.

That property is the MCU's great strength, and it is worth being precise about. The mixing is not naive. If the server simply added everyone's audio together and sent the same mix to all, you would hear your own voice echoed back at you. So a correct audio MCU produces a slightly different mix for each participant: it sends you everyone's voice except your own. This is called an N-minus-1 mix — for N participants, each person gets the sum of the other N-1 voices. That subtraction is what stops the call from feeding back on itself.
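A minimal sketch of the N-minus-1 idea, assuming each participant's audio has already been decoded into one frame of 16-bit PCM samples of equal length. The function name and the frame layout are illustrative assumptions; a production mixer would also handle timing, gain, and resampling.

```python
def n_minus_one_mixes(frames: dict[str, list[int]]) -> dict[str, list[int]]:
    """Build one mix per participant: the sum of every voice except their own.

    `frames` maps a participant id to one frame of decoded 16-bit PCM samples.
    All frames are assumed to have the same length.
    """
    if not frames:
        return {}
    length = len(next(iter(frames.values())))
    # Sum every voice once, then subtract each listener's own samples.
    total = [sum(frame[i] for frame in frames.values()) for i in range(length)]
    mixes = {}
    for pid, own in frames.items():
        # Clamp to the int16 range so a loud moment saturates instead of wrapping.
        mixes[pid] = [max(-32768, min(32767, total[i] - own[i])) for i in range(length)]
    return mixes
```

Summing once and subtracting per listener avoids re-adding N-1 voices for every participant, but each of the N mixes still has to be encoded separately.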

What the MCU costs, in plain arithmetic

The price of all that convenience is server CPU, and the arithmetic is brutal. For every participant, the server must run a full audio decode, take part in the mix, and run a full audio re-encode. Let us make it concrete. Count each decode and each encode as one "audio job." In a 50-person call, the server decodes 50 incoming streams and encodes up to 50 outgoing mixes: on the order of 100 audio jobs, continuously, for one room. An SFU doing the same room runs zero audio jobs, because it never decodes or encodes. That is not a small difference; it is the difference between a server hosting a handful of rooms and a server hosting hundreds.
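The same count, written out as a small sketch. It counts each decode and each encode separately and treats "one encode per listener" as the upper bound for an MCU that builds N-minus-1 mixes. The function name and the two architecture labels are our own illustrative choices, and the result is a count of operations, not a CPU measurement.

```python
def server_audio_jobs(participants: int, architecture: str) -> dict:
    """Count continuous server-side decodes and encodes for one room."""
    if participants < 2:
        raise ValueError("a room needs at least two participants")
    if architecture == "mcu":
        decodes = participants   # every incoming voice is decoded
        encodes = participants   # upper bound: one N-minus-1 mix per listener
    elif architecture == "sfu":
        decodes = encodes = 0    # packets are forwarded with the payload untouched
    else:
        raise ValueError("architecture must be 'mcu' or 'sfu'")
    return {"decodes": decodes, "encodes": encodes, "total": decodes + encodes}


if __name__ == "__main__":
    print(server_audio_jobs(50, "mcu"))
    print(server_audio_jobs(50, "sfu"))
```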

The MCU also adds latency that the other two architectures avoid. Decoding, buffering enough audio to mix cleanly, mixing, and re-encoding all take time. Each hop through a decoder and encoder adds delay on top of the network trip, so the mouth-to-ear time on an MCU is reliably higher than on an SFU forwarding the same audio. For a deep look at where every millisecond of real-time audio delay comes from, see The WebRTC Audio Pipeline End-to-End.

Where the MCU still wins

Given the cost, why does the MCU survive at all? Three reasons, and they are real.

First, the receiver only has to decode one stream. A cheap set-top box, a phone-network gateway, an old smart-TV browser, or a dial-in telephone line cannot decode and mix ten separate audio streams — but it can play one. Any time your audience includes a device that can only handle a single stream, an MCU is the natural fit. The classic case is the telephone bridge: a PSTN caller dialled into a video meeting must receive one mixed audio stream, because a phone line carries exactly one.

Second, one mixed stream is trivial to record and to send onward. If you need a single audio file of the whole meeting, the MCU has already made it. We return to this below, because recording is where SFU architectures get complicated.

Third, the mix is consistent for everyone. Because the server controls the combined sound, it can apply uniform loudness normalisation, so a quiet speaker and a loud speaker arrive at comparable levels. (Loudness across speakers is its own discipline — see Loudness, Peak, RMS, LUFS.)


SFU: the server forwards voices without ever decoding them

An SFU also sits in the middle, but it does almost nothing to the audio. Each participant uploads their one voice stream. The server reads the packet headers, decides who should receive that stream, and forwards the packets onward — re-stamping the routing information but leaving the encoded audio payload completely untouched. It never decodes the sound, never mixes, never re-encodes. From your device's point of view you upload one stream, and you download one stream per person you are listening to.

Because the server never decodes, an SFU's CPU cost per room is a fraction of an MCU's. This is the single biggest reason the SFU became the default for group calls in the 2010s and remains so in 2026: it scales to far more concurrent rooms on the same hardware. The cost moves from the server to the client, because now your device receives and decodes several audio streams instead of one — but as we saw, audio streams are small, and decoding a handful of them is cheap on any modern phone.

The trick that makes SFU audio work: forward only the talkers

If an SFU naively forwarded every participant's audio to everyone, a 100-person call would flood each device with 99 incoming voice streams, and the combined hiss of 99 not-quite-silent microphones would be unlistenable. The SFU's defining audio feature is that it does not do this. It forwards only the streams that matter — typically the few loudest, currently-speaking participants — and drops the rest.

Here is the elegant part, and it is pure audio engineering. To decide who is loudest, the server would normally have to decode every stream and measure its level, which is exactly the expensive work the SFU is trying to avoid. The way out is a small RTP header extension defined in IETF RFC 6464 (Client-to-Mixer Audio Level Indication, December 2011). With it, each sender stamps a single number onto every audio packet: the loudness of that packet, expressed as a value from 0 to 127 in -dBov (decibels below the system's loudest possible signal, where 0 means full scale and 127 means digital silence). The number travels in an RTP header extension, outside the encrypted audio payload, so the server can read each packet's loudness without decrypting or decoding the audio itself. The standard even reserves one bit, the "V" bit, for the sender to flag whether it believes the packet contains actual voice.
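To make the header format concrete, here is a minimal sketch of reading the RFC 6464 data byte: the top bit is the V flag and the low seven bits are the level in -dBov. It assumes the caller has already located that byte inside the RTP header extension using the negotiated extension ID, a step not shown here. The `AudioLevel` type and `parse_audio_level` name are ours.

```python
from typing import NamedTuple


class AudioLevel(NamedTuple):
    voice: bool  # the "V" flag: the sender believes this packet contains voice
    dbov: int    # 0 (full scale) down to -127 (digital silence)


def parse_audio_level(data_byte: int) -> AudioLevel:
    """Decode the single data byte of the RFC 6464 audio-level extension."""
    if not 0 <= data_byte <= 0xFF:
        raise ValueError("expected a single byte value")
    voice = bool(data_byte & 0x80)  # most significant bit is the V flag
    level = data_byte & 0x7F        # low 7 bits: level expressed in -dBov
    return AudioLevel(voice=voice, dbov=-level)
```

Note that nothing here touches the audio payload: the whole decision input is one byte per packet.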

With that number on every packet, the SFU can rank participants by who is actually talking and forward only the top few — all without ever touching the sound. RFC 6464's own introduction states the purpose plainly: it lets a server "forward only a few of the loudest audio streams, without requiring it to decode and measure every stream that is received." The complementary standard, RFC 6465 (Mixer-to-Client Audio Level Indication, December 2011), lets a mixer tell each client the levels of the individual voices that went into a mix, which is how a client UI can light up the active speaker even when the audio arrives pre-mixed.
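The forwarding decision itself can be sketched as a ranking. This is a simplified illustration, not the logic of any particular SFU: it assumes the levels passed in have already been smoothed over time, and the limit of three streams and the -60 dBov silence floor are arbitrary values chosen for the example.

```python
def streams_to_forward(levels_dbov: dict[str, float],
                       listener: str,
                       max_streams: int = 3,
                       silence_floor_dbov: float = -60.0) -> list[str]:
    """Pick which senders' audio to forward to one listener.

    `levels_dbov` maps participant id -> smoothed level (closer to 0 is louder).
    """
    candidates = [
        (level, pid)
        for pid, level in levels_dbov.items()
        # Never forward a listener's own voice, and skip near-silent microphones.
        if pid != listener and level > silence_floor_dbov
    ]
    # Loudest first; ties broken by id so the result is deterministic.
    candidates.sort(key=lambda c: (-c[0], c[1]))
    return [pid for _, pid in candidates[:max_streams]]
```

For a room with levels `{"ann": -20, "bob": -35, "cy": -90, "dee": -25}`, listener `cy` would receive `ann`, `dee`, and `bob`, while `cy`'s own silent stream is forwarded to nobody.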

Dominant-speaker detection: turning loudness numbers into a decision

Reading the loudness number is step one. Deciding who the "active speaker" is — so the UI can highlight them and the SFU can forward them — is step two, and it is not as simple as "whoever is loudest right now." A cough, a dropped phone, or a single loud syllable should not steal the spotlight. RFC 6464 itself warns that servers "ought not base audio forwarding decisions directly on packet-by-packet audio level information," and should instead smooth the levels over time before acting.

The most widely deployed solution is an algorithm published by Ilana Volfin and Israel Cohen, Dominant Speaker Identification for Multipoint Videoconferencing, which analyses each participant's audio-level history across short, medium, and long time windows to decide who is genuinely holding the floor — all from the RFC 6464 loudness numbers, with no audio decoding. This is the algorithm behind mediasoup's ActiveSpeakerObserver (added in mediasoup 3.8.0) and Jitsi's dominant-speaker logic. The result is a stable "active speaker" signal that the SFU uses both to forward the right streams and to drive the UI highlight.
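To show why smoothing matters, here is a deliberately simple stand-in: an exponential moving average per participant plus a switching margin. This is not the Volfin and Cohen algorithm, which uses multiple time windows and a more careful comparison; the class name, the smoothing factor of 0.2, and the 6 dB margin are all assumptions made for the example.

```python
from typing import Optional


class SimpleDominantSpeaker:
    """Toy dominant-speaker tracker: smoothed levels plus a switch margin."""

    def __init__(self, alpha: float = 0.2, switch_margin_db: float = 6.0):
        self.alpha = alpha                        # weight given to each new level
        self.switch_margin_db = switch_margin_db  # how much louder a challenger must be
        self.smoothed: dict[str, float] = {}
        self.current: Optional[str] = None

    def update(self, pid: str, level_dbov: float) -> Optional[str]:
        """Feed one packet's level (0 loudest, -127 silent); return the dominant speaker."""
        prev = self.smoothed.get(pid, -127.0)
        self.smoothed[pid] = prev + self.alpha * (level_dbov - prev)
        loudest = max(self.smoothed, key=self.smoothed.get)
        if self.current is None:
            self.current = loudest
        elif loudest != self.current:
            gap = self.smoothed[loudest] - self.smoothed[self.current]
            if gap >= self.switch_margin_db:      # only switch on a clear, sustained lead
                self.current = loudest
        return self.current
```

A single loud packet moves the smoothed value only a fifth of the way toward the new level, so a cough cannot overtake someone who has been speaking steadily; a sustained louder speaker does take over after a dozen or so packets.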

A worked example makes the saving obvious. In a 200-person town-hall meeting where one person presents and the rest listen, an MCU would decode and mix all 200 streams forever. An SFU using RFC 6464 levels plus dominant-speaker detection forwards essentially one stream — the presenter — to all 199 listeners, decoding nothing on the server. The bandwidth and CPU difference is not incremental; it is the difference between a feasible product and an infeasible one.

A flow diagram of how an SFU selects which audio to forward, drawn left to right. On the left, four participant streams arrive, each carrying an RTP audio-level number stamped by the sender per RFC 6464. In the middle, a box labelled level reader reads each packet's loudness without decoding the audio, feeding a dominant-speaker detector that smooths levels over short, medium, and long windows. On the right, the SFU forwards only the top speakers and drops the silent streams, with a note that the server never decodes the audio payload.

Figure 2. How an SFU forwards only the talkers. Each sender stamps its loudness onto every packet (RFC 6464). The server ranks speakers from those numbers, never decoding the sound, and forwards only the active few, driven by a dominant-speaker algorithm that ignores brief noises.


The comparison, side by side

The three architectures trade the same four resources (client upload, client decode load, server CPU, and latency) in different directions. The table below is the decision in one view. The "winner" for any given row depends on your call shape, so read it against your own product, not in the abstract.

| Criterion | P2P | MCU | SFU |
|---|---|---|---|
| Server in audio path | None | Yes, heavy | Yes, light |
| Server decodes audio | n/a | Every stream | Never |
| Server CPU per room | Zero | Very high (decode + mix + encode per person) | Low (header inspection only) |
| Streams a client uploads | One per other person | One | One |
| Streams a client downloads | One per other person | One (the mix) | One per forwarded speaker |
| Client decode load | Moderate (one per person) | Lowest (one stream) | Low–moderate (a few speakers) |
| Added latency | Lowest (direct) | Highest (decode + mix + re-encode) | Low (forward only) |
| Scales to large rooms | No | Limited by server CPU | Yes (the modern default) |
| Easy single-file recording | No | Yes (mix already exists) | Harder (must mix separately) |
| Works for dumb receivers (PSTN, set-top) | No | Yes | No (client must mix) |
| Per-speaker control (mute, levels, layout) | n/a | Lost in the mix | Full (streams stay separate) |

The pattern reads cleanly once you see it. P2P wins on latency and cost but only for tiny calls. The MCU wins when the receiver is weak or you need one combined stream — at a steep server price. The SFU wins almost everywhere else, which is why it is the default, and its one real weakness is the very thing the MCU does for free: producing a single recording.


The common mistake: assuming the SFU recording is free

Here is the pitfall that surprises teams most often, and it is an audio-specific trap. Because an SFU keeps every voice on its own separate stream, there is no single combined audio stream anywhere in the system. That is wonderful for the live call — every speaker stays independent, so the UI can mute one person, show per-speaker levels, or apply spatial audio. But the moment a customer asks "can we record the meeting and email an MP3 afterwards?", the SFU has nothing to hand them. There is no mix. To make a single recording, you have to spin up a separate process that subscribes to every stream, decodes them all, mixes them down, and encodes the result — which is, ironically, exactly the work the MCU was doing all along.

This is why mature SFU products run a dedicated recording or "egress" service alongside the forwarder. LiveKit, for example, is an SFU that does not decode audio on the server during the live call; to record, you subscribe to the audio tracks, decode the Opus frames, aggregate them into raw PCM, and encode a file. The lesson for planning: if recording or transcription is a launch requirement, the SFU does not make it free, and you must budget the mixing CPU you thought you had escaped. The full recording-and-transcription path is its own topic — see Recording and Transcription Pipelines.
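The mixing step of such an egress process can be sketched as follows. It assumes the Opus frames have already been decoded to mono 16-bit PCM at a common sample rate by a codec library (not shown), and that each track carries the sample offset at which its speaker joined. The function name and track layout are illustrative, not LiveKit's API.

```python
def mix_down(tracks: list[tuple[int, list[int]]]) -> list[int]:
    """Mix decoded tracks into one recording.

    Each track is (start_offset_in_samples, decoded int16 PCM samples).
    """
    if not tracks:
        return []
    total_len = max(start + len(pcm) for start, pcm in tracks)
    mixed = [0] * total_len
    for start, pcm in tracks:
        for i, sample in enumerate(pcm):
            mixed[start + i] += sample  # place each voice on the shared timeline
    # Clamp to the int16 range so overlapping loud speech saturates instead of wrapping.
    return [max(-32768, min(32767, s)) for s in mixed]
```

The per-sample work is trivial; the cost that has to be budgeted is decoding every subscribed track and encoding the final file, for every room being recorded.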

A second, quieter pitfall: DTX confuses naive level readers. Opus discontinuous transmission (DTX) lets a quiet microphone drop from around 24–32 kbit/s to roughly 1 kbit/s by sending almost nothing during silence. That is a real bandwidth win — but a level observer that expects a packet every 20 milliseconds can misread the gaps as "the speaker stopped," because the packets simply are not arriving. LiveKit's own issue tracker documents exactly this interaction between its audio-level observer and Opus DTX. If your active-speaker UI flickers when people go quiet, DTX timing is the first place to look. For how DTX and silence suppression actually work, see Voice Activity Detection and Discontinuous Transmission.
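One common way to make a speaking indicator tolerant of gaps is a hangover timer: keep the "speaking" state alive for a short period after the last loud packet, and decide from timestamps rather than from how many packets arrived. The sketch below is our own illustration of that idea, not LiveKit's fix; the -45 dBov threshold and 600 ms hangover are assumed values.

```python
from typing import Optional


class HangoverActivity:
    """Speaking indicator that tolerates gaps between packets."""

    def __init__(self, speech_threshold_dbov: float = -45.0, hangover_ms: int = 600):
        self.speech_threshold_dbov = speech_threshold_dbov
        self.hangover_ms = hangover_ms
        self.last_speech_ms: Optional[int] = None

    def on_packet(self, arrival_ms: int, level_dbov: float) -> None:
        # Only packets loud enough to count as speech refresh the timer.
        if level_dbov >= self.speech_threshold_dbov:
            self.last_speech_ms = arrival_ms

    def is_speaking(self, now_ms: int) -> bool:
        # The decision depends on elapsed time, not on how many packets arrived.
        if self.last_speech_ms is None:
            return False
        return now_ms - self.last_speech_ms <= self.hangover_ms
```

A longer hangover gives a steadier highlight at the price of a slower "stopped speaking" transition, so the value is a product decision as much as an engineering one.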


How real products choose

In 2026 the field has settled into a clear pattern. Almost every group call beyond four participants runs on an SFU, because nothing else scales to large rooms on affordable hardware. P2P survives for one-to-one and tiny calls where the lowest possible latency and zero server cost matter most — a doctor and a single patient, a two-person sales call. The MCU has retreated to the specific cases where its expensive mixing earns its keep: a meeting that must bridge to the telephone network, a broadcast that needs one composed feed, or a system where the receiving devices genuinely cannot decode more than one stream.

Many production systems are hybrids. A call can start P2P with two people, then promote to an SFU the moment a third joins. An SFU-based conference can stand up a small MCU-style mixer just for the PSTN dial-in leg, mixing the room down to one stream for the phone caller while everyone else stays on independent SFU streams. The architecture is not a religion; it is a per-room, sometimes per-participant decision, and the audio-level header from RFC 6464 is what makes the SFU half of every hybrid affordable.

A decision tree, in plain words. Two people and no recording need? P2P. Receivers that can only play one stream, or a hard need for one combined recording with minimal extra engineering? MCU. Everything else — which is most products — SFU, with a separate egress service if you need recordings. The diagram below walks the same logic top to bottom.

A top-down decision tree for choosing an audio call architecture. The first diamond asks whether the call is only two people; if yes, the outcome is P2P. If no, the next diamond asks whether any receiver can only decode a single audio stream, such as a phone-network caller or a set-top box; if yes, the outcome is MCU. If no, a third diamond asks whether a single combined recording is required with minimal engineering; if yes, MCU is suggested, otherwise the outcome is SFU with a separate egress service for recording. A side note states that hybrids are common, for example P2P that promotes to SFU when a third person joins.

Figure 3. Choosing an audio architecture. Two people: P2P. A receiver that can only play one stream: MCU. A combined recording with minimal effort: MCU. Everything else: SFU, with a separate egress service when you need recordings.
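The same decision tree can be written as a small function. This is a sketch of the rule of thumb above, not a substitute for looking at your own call shapes; the parameter names and the three `recording` values ("none", "minimal-effort", "egress") are labels we made up for the example.

```python
def choose_audio_architecture(participants: int,
                              single_stream_receivers: bool = False,
                              recording: str = "none") -> str:
    """Apply the plain-words decision tree to one room.

    recording: "none", "minimal-effort" (one combined file with little extra
    engineering), or "egress" (you will run a separate mixing service).
    """
    if participants < 2:
        raise ValueError("a call needs at least two participants")
    if recording not in ("none", "minimal-effort", "egress"):
        raise ValueError("unknown recording option")
    if participants == 2 and recording == "none":
        return "P2P"
    if single_stream_receivers:          # e.g. a PSTN dial-in or a set-top box
        return "MCU"
    if recording == "minimal-effort":    # the MCU's mix doubles as the recording
        return "MCU"
    if recording == "egress":
        return "SFU + egress service"
    return "SFU"
```

Because hybrids are common, a real system might call a function like this per room or even per participant leg, for example when a phone caller joins an otherwise SFU-based meeting.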

Where Fora Soft fits in

We have built real-time audio into video conferencing, telemedicine, e-learning, and contact-centre products since 2005, and the architecture question above is one of the first we settle on every project. A telemedicine consultation between one clinician and one patient is often best as a clean P2P call; a virtual classroom with one teacher and forty students is an SFU forwarding the teacher plus the few students who unmute; a webinar that must also accept phone dial-ins needs an SFU with a small MCU bridge for the telephone leg. We choose the topology against the call shape, the recording requirements, and the device mix the audience will actually use — not against fashion. When a product needs both massive scale and clean recordings, we plan the egress mixing CPU up front, because we have seen what happens to teams who discover that cost the week before launch.

References

  1. IETF RFC 6464, A Real-time Transport Protocol (RTP) Header Extension for Client-to-Mixer Audio Level Indication, J. Lennox, E. Ivov, E. Marocco, December 2011. The audio-level header extension (0–127 in -dBov, the V voice-activity bit, URI urn:ietf:params:rtp-hdrext:ssrc-audio-level) and its stated purpose of forwarding only the loudest streams without decoding. https://www.rfc-editor.org/rfc/rfc6464.html
  2. IETF RFC 6465, A Real-time Transport Protocol (RTP) Header Extension for Mixer-to-Client Audio Level Indication, E. Ivov, E. Marocco, J. Lennox, December 2011. The complementary extension by which a mixer reports the levels of contributing sources back to clients. https://www.rfc-editor.org/rfc/rfc6465.html
  3. IETF RFC 3550, RTP: A Transport Protocol for Real-Time Applications, H. Schulzrinne et al., STD 64, July 2003. The base RTP/RTCP framework; defines the mixer and translator middlebox roles that the MCU and SFU implement. https://www.rfc-editor.org/rfc/rfc3550.html
  4. IETF RFC 5285, A General Mechanism for RTP Header Extensions, D. Singer, H. Desineni, July 2008 (since obsoleted by RFC 8285, October 2017). The one-byte / two-byte header-extension framing that carries the RFC 6464 audio level. https://www.rfc-editor.org/rfc/rfc5285.html
  5. IETF RFC 8825, Overview: Real-Time Protocols for Browser-Based Applications, H. Alvestrand, January 2021. The WebRTC architecture overview; defines the protocol from one browser's perspective and leaves multi-party topology to the application. https://www.rfc-editor.org/rfc/rfc8825.html
  6. I. Volfin and I. Cohen, Dominant Speaker Identification for Multipoint Videoconferencing, IEEE 27th Convention of Electrical and Electronics Engineers in Israel, 2012. The multi-window audio-level algorithm behind production dominant-speaker detection. https://israelcohen.com/wp-content/uploads/2018/05/IEEEI2012_Volfin.pdf
  7. mediasoup documentation, ActiveSpeakerObserver and AudioLevelObserver (v3 API), versatica/mediasoup. Implementation based on the Volfin–Cohen algorithm using RFC 6464 audio levels; no audio decoding on the server. https://mediasoup.org/documentation/v3/mediasoup/api/
  8. LiveKit documentation, LiveKit SFU internals and Audio recording / egress. A horizontally-scaling SFU that does not decode audio during the live call; recording requires subscribing, decoding Opus to PCM, and mixing. https://docs.livekit.io/home/
  9. LiveKit issue #159, AudioLevel observer is inaccurate with Opus DTX, livekit/livekit on GitHub. Documents the interaction between Opus discontinuous transmission and level-based active-speaker detection. https://github.com/livekit/livekit/issues/159
  10. T. Levent-Levi (BlogGeek.me), WebRTC Conferences — To Mix or to Route Audio? Practitioner analysis of audio MCU vs SFU trade-offs. https://bloggeek.me/webrtc-conferences-mix-or-route-audio/
  11. Jitsi / NOSSDAV 2015, Last N: Relevance-Based Selectivity for Forwarding Video in Multimedia Conferencing, E. Ivov et al. The relevance-ordering scheme (driven by audio activity) that an SFU uses to forward only the most relevant streams. https://jitsi.org/wp-content/uploads/2016/12/nossdav2015lastn.pdf
  12. IETF RFC 6716, Definition of the Opus Audio Codec, JM. Valin, K. Vos, T. Terriberry, September 2012. The codec carried by all three architectures; the basis for the bitrate and DTX figures cited. https://www.rfc-editor.org/rfc/rfc6716.html