Zoom, Meet, Discord and Slack run WebRTC at planet scale because they spent years engineering everything around it: SFUs, edge POPs, TURN, congestion control, monitoring, recording and compliance.
This page is the reference architecture we use to ship the same kind of system in 8–16 weeks for SaaS, telehealth, e-learning, fintech and SMB platforms, distilled from 625+ real-time products and the 600M+ monthly call minutes our clients run on it.
Five decisions that separate a production WebRTC system from a working demo. Each is expanded in detail below.

A production WebRTC system is six independent subsystems wired together: clients, signaling, media servers, edge/network, recording/storage, and ML/analytics. Each layer has one job, one failure mode, and one scaling axis. Confuse the layers and you ship a demo, not a product.
Every WebRTC product we ship at Fora Soft (video calls for telehealth, virtual classrooms, fintech advisor calls, surveillance, broadcasts) reduces to the same six layers:

Why this matters in production: every outage we have ever debugged in a WebRTC system traced back to confusing two of these layers, whether that was putting state in signaling, putting ML in the live path, recording from the client instead of the server, or skipping TURN. Get the boundaries right and the rest is engineering.
The client (Chrome, Safari, the iOS/Android SDK, an Electron desktop app, or a native C++ implementation on hardware) is where camera frames and microphone samples become RTP packets. The client is also where the user can blame their laptop, so we engineer it for minimal latency and predictable CPU on every plausible device, not just the one in your pocket.
What the WebRTC client must do, in order, every 33 ms (30 fps) or 16 ms (60 fps):
Skip a step or pick the wrong codec for one device class and you ship a product that works perfectly on the founder's MacBook and burns CPU on a 2-year-old Android. We catch this in week one of the engagement, not after launch.
Background blur, RNNoise/Krisp noise suppression, virtual background and face tracking all run far better on-device than server-side. Each round-trip you avoid saves 20–80 ms and offloads server CPU, often the difference between a $400/month and a $4,000/month media-server bill at 1,000 concurrent users.
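To make the on-device idea concrete, here is a minimal sketch using Chrome's Insertable Streams API (MediaStreamTrackProcessor / MediaStreamTrackGenerator). The API is not available in every browser, blurBackground() is a placeholder for whatever WASM/WebGL model you actually ship, and TypeScript may need extra type declarations for these classes.

```typescript
// Sketch: run a video effect on-device, before the frame ever reaches the encoder.
// Assumes Chrome's Insertable Streams API; blurBackground() is a stand-in effect.
async function blurBackground(frame: VideoFrame): Promise<VideoFrame> {
  return frame; // placeholder: your WASM/WebGL segmentation model goes here
}

async function publishWithLocalEffect(pc: RTCPeerConnection) {
  const stream = await navigator.mediaDevices.getUserMedia({
    video: { width: 1280, height: 720, frameRate: 30 },
    audio: { echoCancellation: true, noiseSuppression: true },
  });
  const rawTrack = stream.getVideoTracks()[0];

  // Expose raw frames as a readable stream, transform them, re-emit a new track.
  const processor = new MediaStreamTrackProcessor({ track: rawTrack });
  const generator = new MediaStreamTrackGenerator({ kind: 'video' });

  processor.readable
    .pipeThrough(
      new TransformStream({
        async transform(frame: VideoFrame, controller) {
          controller.enqueue(await blurBackground(frame));
        },
      })
    )
    .pipeTo(generator.writable);

  // The processed track is what gets encoded and sent - no server round-trip involved.
  pc.addTrack(generator, stream);
  pc.addTrack(stream.getAudioTracks()[0], stream);
}
```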
Signaling is the WebRTC layer everyone underestimates. It exists to answer one question: how do these two endpoints meet and agree on what to send each other? Signaling never carries audio or video. It carries setup and room state, nothing else.
A production signaling server has four jobs:
Typical implementations: WebSocket on Node.js / Go / Elixir for greenfield, SIP/SIMPLE when you must integrate with PBX or telephony, MQTT or NATS when you also publish presence and chat over the same bus.
Push room state into Redis or a Postgres LISTEN/NOTIFY channel, never the signaling node's memory. Then you can blue-green deploy mid-call, autoscale on connection count, and survive an AZ outage with a 5–10-second reconnect rather than a dropped session.
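As an illustration of that pattern, here is a minimal sketch of a stateless signaling node: room membership lives in Redis and messages fan out over Redis pub/sub, so any node can pick up a client after a reconnect. The libraries (ws, ioredis), port and message shapes are assumptions for the sketch, not a prescribed protocol.

```typescript
// Sketch: stateless signaling node - room state in Redis, fan-out over pub/sub.
import { WebSocketServer, WebSocket } from 'ws';
import Redis from 'ioredis';

const pub = new Redis();          // shared room state + publishing
const sub = new Redis();          // dedicated connection for subscriptions
const wss = new WebSocketServer({ port: 8080 });
const localSockets = new Map<string, WebSocket>(); // peerId -> socket on THIS node only

sub.subscribe('signaling');
sub.on('message', (_channel, raw) => {
  const msg = JSON.parse(raw);              // { to, type, payload }
  localSockets.get(msg.to)?.send(raw);      // deliver only if that peer is on this node
});

wss.on('connection', (socket) => {
  let peerId = '';
  socket.on('message', async (data) => {
    const msg = JSON.parse(data.toString()); // { type, roomId, peerId, to?, payload? }
    if (msg.type === 'join') {
      peerId = msg.peerId;
      localSockets.set(peerId, socket);
      await pub.sadd(`room:${msg.roomId}`, peerId); // room state in Redis, not node RAM
      return;
    }
    // offers / answers / ICE candidates: publish and let whichever node holds `to` deliver
    await pub.publish('signaling', JSON.stringify(msg));
  });
  socket.on('close', () => localSockets.delete(peerId));
});
```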
Forwards encrypted RTP packets without decoding • Tiny CPU footprint per stream • Lowest possible end-to-end latency • Supports simulcast and SVC • Open-source: mediasoup, LiveKit, Janus, Jitsi Videobridge, Pion, ion-sfu • A single Hetzner AX52 (16-core, ~€80/mo) handles 1,000–1,500 concurrent forwarded streams
Reach for an SFU when:
Skip an SFU only when:
Decodes every incoming stream, mixes them server-side, encodes one composite back to each participant • 3–5× the CPU per call vs an SFU • Adds 30–80 ms of mixing latency • Open-source: Jitsi Videobridge in mixing mode, Janus with the AudioBridge plugin, FreeSWITCH • Commercial: Pexip, Vidyo, Cisco • Costs jump to ~€0.50/concurrent-participant/hour at scale
Reach for an MCU when:
Avoid an MCU when:
MCUs simplify client logic by handing every participant a single pre-mixed stream. They pay for it with 3–5× the server cost per participant and an extra 30–80 ms of latency. Across 600M+ monthly call minutes our clients run on this stack, an SFU has been the right answer 9 times out of 10.
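One concrete reason the SFU wins: the client publishes simulcast layers and the server picks which rung to forward to each receiver, with no transcoding. A minimal sketch with the standard WebRTC API follows; the rid names and bitrates are illustrative, and in practice SFU SDKs (mediasoup, LiveKit and the rest) wrap this negotiation for you.

```typescript
// Sketch: publish three simulcast rungs so the SFU can forward a cheaper layer
// to constrained receivers instead of transcoding. Values are illustrative.
async function publishSimulcast(pc: RTCPeerConnection, stream: MediaStream) {
  const videoTrack = stream.getVideoTracks()[0];
  pc.addTransceiver(videoTrack, {
    direction: 'sendonly',
    streams: [stream],
    sendEncodings: [
      { rid: 'q', scaleResolutionDownBy: 4, maxBitrate: 150_000 },  // thumbnail grid
      { rid: 'h', scaleResolutionDownBy: 2, maxBitrate: 500_000 },  // small tile
      { rid: 'f', maxBitrate: 1_500_000 },                          // active speaker
    ],
  });
  pc.addTransceiver(stream.getAudioTracks()[0], { direction: 'sendonly', streams: [stream] });
}
```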
Picking the wrong media-server pattern is the single most expensive WebRTC mistake we are hired to clean up.
Codec choice tweaks 5–20 ms. Geography decides the other 70–200 ms. The edge layer's job is to keep every user within ~50 ms of an SFU and route their traffic around the broken half of the internet:
Round-trip latency above ~150 ms makes conversation feel awkward; above 250 ms it becomes "half-duplex." The cheapest way to fix it is a closer POP, not a smarter codec. We typically launch products in 3–4 regions, then expand based on web-analytics geography.
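On the client, the edge layer mostly shows up as ICE configuration. A minimal sketch with hypothetical hostnames: point the client at its closest regional TURN POP (GeoDNS or a latency probe picks it) and always include a turns: entry on port 443 so users behind strict firewalls still connect.

```typescript
// Sketch: regional TURN wiring on the client. Hostnames are hypothetical;
// credentials should be short-lived and issued per session by your API.
const pc = new RTCPeerConnection({
  iceServers: [
    { urls: 'stun:stun.example.com:3478' },
    {
      urls: [
        'turn:turn-eu-west.example.com:3478?transport=udp',  // lowest latency when UDP works
        'turn:turn-eu-west.example.com:3478?transport=tcp',
        'turns:turn-eu-west.example.com:443?transport=tcp',  // last resort through port 443
      ],
      username: 'short-lived-user',
      credential: 'short-lived-password',
    },
  ],
  iceTransportPolicy: 'all', // let ICE prefer direct/STUN paths, TURN only as fallback
});
```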
Recording is required in almost every B2B WebRTC product we ship: telehealth visit notes, classroom replays, fintech compliance archives, courtroom evidence. Even when end users "never replay calls," the regulator does.
A safe recording pipeline always has these four stages:
Audit logs, immutable storage and retention enforcement are 10× more expensive to retro-fit than to design in. Build the legal/compliance flow on day one; layer transcription, summaries and search on top later.
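As one hedged example of designing retention in from day one, here is a sketch of the storage step using the AWS SDK v3: each recorded segment is written with a per-tenant KMS key and an S3 Object Lock retention date, so immutability is a property of the write itself. Bucket naming, key aliases and the 7-year period are assumptions, and Object Lock must already be enabled on the bucket.

```typescript
// Sketch: the "store" stage of the recording pipeline, with encryption and
// retention as parameters of the write rather than a later cleanup job.
import { readFile } from 'node:fs/promises';
import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';

const s3 = new S3Client({ region: 'eu-central-1' });

async function storeSegment(tenantId: string, callId: string, localPath: string) {
  const retainUntil = new Date(Date.now() + 7 * 365 * 24 * 60 * 60 * 1000); // e.g. 7-year policy

  await s3.send(new PutObjectCommand({
    Bucket: `recordings-${tenantId}`,            // one bucket (or prefix) per tenant
    Key: `calls/${callId}/${Date.now()}.webm`,
    Body: await readFile(localPath),
    ServerSideEncryption: 'aws:kms',
    SSEKMSKeyId: `alias/recordings-${tenantId}`, // per-tenant KMS key
    ObjectLockMode: 'COMPLIANCE',                // immutable until the retention date
    ObjectLockRetainUntilDate: retainUntil,
  }));
}
```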
Machine learning adds intelligence to a real-time stack, and it adds latency, GPU cost and complexity if it lands in the wrong tier. Three tiers, picked deliberately, almost always beat one big monolithic AI service:
Where each model belongs:
Put ML in the live RTP path only when:
Keep ML out of the live path when:

If a feature does not need real-time feedback, push it onto a side bus (Kafka, NATS, SQS) that consumes a copy of the SFU stream. Same models, same outputs, just decoupled from the call-quality budget.
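A minimal sketch of that side-bus worker, assuming Kafka via kafkajs and a transcribe() placeholder for Whisper/Deepgram; the topic name and payload format are illustrative. The point is structural: the worker consumes a copy of the media, so a slow or crashed model never touches the live call.

```typescript
// Sketch: async ML worker on the side bus - transcribes audio chunks the
// SFU/egress service has already published to Kafka, off the live RTP path.
import { Kafka } from 'kafkajs';

const kafka = new Kafka({ clientId: 'ml-worker', brokers: ['kafka:9092'] });
const consumer = kafka.consumer({ groupId: 'transcription' });

async function transcribe(audio: Buffer): Promise<string> {
  // placeholder for Whisper / Deepgram / etc. - can take seconds without hurting the call
  return '';
}

async function run() {
  await consumer.connect();
  await consumer.subscribe({ topic: 'call-audio-chunks' });
  await consumer.run({
    eachMessage: async ({ message }) => {
      const callId = message.key?.toString();
      const text = await transcribe(message.value ?? Buffer.alloc(0));
      // write to your transcript store / search index here
      console.log(callId, text.slice(0, 80));
    },
  });
}

run().catch(console.error);
```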
Every video frame travels through 7 stages between camera and far-side eyeball. "Fix the latency" only means anything if you know which stage is bleeding milliseconds:
Realistic budgets at each layer (mouth-to-ear / glass-to-glass):
Healthy end-to-end target: ≤ 300 ms for conversation, ≤ 150 ms for music collaboration or voice gaming, ≤ 50 ms for surveillance/control loops (those usually need MoQ/QUIC, not stock WebRTC).
Run synthetic calls every 60 seconds from each region, record real-user MOS / round-trip / jitter via getStats(), feed it into Prometheus + Grafana. Server logs alone will hide the worst sessions: the ones your loudest users complain about.
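A minimal sketch of the client half of that loop: sample getStats() on an interval, extract round-trip time, jitter and packet loss, and beacon them to a metrics endpoint your Prometheus pipeline ingests. The /metrics URL and payload shape are assumptions; MOS is typically derived server-side from these raw values.

```typescript
// Sketch: periodic WebRTC stats sampling shipped to a metrics endpoint.
function reportCallStats(pc: RTCPeerConnection, sessionId: string) {
  setInterval(async () => {
    const stats = await pc.getStats();
    const sample: Record<string, number> = {};

    stats.forEach((report) => {
      if (report.type === 'remote-inbound-rtp' && report.kind === 'video') {
        sample.rttMs = (report.roundTripTime ?? 0) * 1000;   // seconds -> ms
        sample.remoteJitterMs = (report.jitter ?? 0) * 1000;
      }
      if (report.type === 'inbound-rtp' && report.kind === 'video') {
        sample.packetsLost = report.packetsLost ?? 0;
        sample.jitterMs = (report.jitter ?? 0) * 1000;
      }
    });

    // fire-and-forget; never let telemetry block the call
    navigator.sendBeacon('/metrics', JSON.stringify({ sessionId, ts: Date.now(), ...sample }));
  }, 5000);
}
```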
Almost every B2B WebRTC product we ship handles regulated data: PHI in telehealth, FERPA in classrooms, MNPI in financial advisory, evidence chain-of-custody in legal-tech. Compliance is an architecture decision (SRTP, recording retention, regional data residency), not a checklist done after launch.
Five non-negotiable controls in every production WebRTC stack:
Frameworks our clients have shipped against: HIPAA + HITRUST (Nucleus, telehealth), GDPR (every EU client), SOC 2 Type II (every B2B SaaS), FERPA (BrainCert and other LMS), PCI-DSS (some fintech voice products). When the auditor asks, we hand them an architecture diagram, not a promise.

Using S3 with default keys, hard-coding the region, recording on the client, mixing tenants in one bucket: each is a half-day decision in week one and a 6-week migration after the SOC 2 auditor finds it. Build it the right way once.
These four habits separate a launchable WebRTC product from a working demo. They are non-negotiable on every Fora Soft engagement: we set them up in the first sprint, before any feature work begins.
Drop 5% of packets, kill an SFU mid-call, sever a region, lose Wi-Fi for 8 seconds, swap from cellular to Wi-Fi mid-handshake. The first time you see those scenarios should not be a Sunday-night incident with paying users on the call.
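Part of surviving those drills is client-side: the peer connection has to notice a dead path and restart ICE instead of sitting in "disconnected" forever. A minimal sketch of that handling follows; the 3-second grace period is a typical value rather than a rule, and sendOffer() stands in for your signaling channel.

```typescript
// Sketch: recover from Wi-Fi drops and cellular handovers with an ICE restart.
function attachReconnectLogic(
  pc: RTCPeerConnection,
  sendOffer: (sdp: RTCSessionDescriptionInit) => void
) {
  let disconnectTimer: ReturnType<typeof setTimeout> | undefined;

  pc.oniceconnectionstatechange = () => {
    if (pc.iceConnectionState === 'disconnected') {
      // give the browser ~3 s to recover on its own (brief Wi-Fi blips usually do)
      disconnectTimer = setTimeout(() => pc.restartIce(), 3000);
    }
    if (pc.iceConnectionState === 'connected' || pc.iceConnectionState === 'completed') {
      clearTimeout(disconnectTimer);
    }
    if (pc.iceConnectionState === 'failed') {
      pc.restartIce(); // immediate restart: gather fresh candidates on the new network
    }
  };

  // restartIce() flags renegotiation; the new offer goes out over signaling as usual
  pc.onnegotiationneeded = async () => {
    const offer = await pc.createOffer();
    await pc.setLocalDescription(offer);
    sendOffer(offer);
  };
}
```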
These six failures account for ~80% of the WebRTC outages we are hired to clean up. Most are systemic, not bugs; they show up the first time real users hit the system at real scale.
Six failure modes that bite in production:
Why these slip through staging:
When something breaks, look at boundaries first: which layer, which region, which device class. We have never seen a WebRTC outage that survived 24 hours of architecture-first investigation.
Startup
MVP: 1-on-1 video / audio call, signaling, basic SFU on a single Hetzner box, recording to S3, simple admin. Right for validating an idea or replacing Zoom inside a niche product.
from
$15,000
from 4–6 weeks
Growth
Multi-party SaaS: multi-region SFU, TURN, recording, transcription, mobile SDKs, RBAC, SOC 2-ready logging, custom UI on web + iOS + Android.
from
$45,000
from 8–12 weeks
Enterprise
Enterprise: HIPAA / GDPR / SOC 2 hardening, on-premise or VPC, AI hooks (live captions, moderation, summaries), white-label SDKs, 99.9%+ SLO, on-call support.
from
$90,000
from 12–20 weeks
Ready for a realistic timeline and cost breakdown tailored to your WebRTC system needs? We offer a free SRS and a code audit for existing projects.
625+ real-time products shipped: BrainCert virtual classroom (500M+ minutes, $3M ARR), Nucleus secure SMB platform (600M+ call minutes / month), Tradecaster trader video community, V.A.L.T. police video. Production WebRTC, not slide-deck WebRTC.
Every engagement starts with a free SRS: architecture diagram, data-flow, six layers mapped to your tech stack, scaling plan, compliance plan. You see how signaling, media, storage and ML fit together before we write a single line.
Latency budgets are written in week one (sub-300 ms for conversation, sub-150 ms for music or trading floors). Edge POPs, codec choice, simulcast layers and jitter buffers are tuned to hit them, not measured after launch.
We design, customize and scale SFU clusters for multi-party calls, broadcast, recording, AI hooks and SIP gateways. Comfortable in JavaScript, TypeScript, Go, Rust, C++, Swift, Kotlin: whatever the SFU and your stack need.
Encryption, retention, audit logs and access control are part of the day-one design. We have shipped HIPAA + HITRUST (Nucleus), SOC 2 Type II, GDPR (every EU client), FERPA (BrainCert) and PCI-DSS; the auditor sees diagrams, not promises.
5% packet loss, region outages, mid-call Wi-Fi-to-cellular handover, SFU crash, TURN failure: we chaos-test every release against these scenarios. WebRTC must fail gracefully or it does not get to production.
Real-time video and audio, latency budgets, scaling, compliance: short answers from the team that has shipped 625+ WebRTC products.
Real-time video and audio inside SaaS, telehealth, virtual classrooms, fintech advisory, customer support, gaming voice, surveillance and broadcast. Anywhere you need sub-300 ms latency between human participants on commodity browsers and mobile devices.
Only if you stay 1-on-1 forever and need no recording. Past 4 participants the upload bandwidth and CPU on the client implode. Production systems use an SFU plus TURN fallback plus regional edge POPs, always.
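A rough illustration (assuming ~1.5 Mbps per 720p stream): in a 6-person mesh each client uploads five separate copies of its own video, roughly 7.5 Mbps of sustained uplink plus five parallel encodes, while behind an SFU the same client uploads a single 1.5 Mbps stream and the server handles the fan-out.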
Defaults: LiveKit when the team wants a managed-feeling SDK and built-in egress; mediasoup when we need maximum control over signaling and topology; Janus when SIP integration is required; Pion when the rest of the stack is Go. We pick on day one based on call patterns, not vendor preference.
Place SFUs in 3–4 regional POPs so every user is within ~50 ms; tune simulcast and codec choice per device tier; keep ML out of the live RTP path; instrument MOS / RTT / jitter via getStats() and alert on the worst 1% of sessions, not the average.
Yes: we have shipped HIPAA + HITRUST (Nucleus), SOC 2 Type II, GDPR (every EU client), FERPA (BrainCert) and PCI-DSS. SRTP/DTLS, regional data residency, append-only audit logs, KMS-encrypted recordings and per-tenant retention are designed in on day one, not retro-fitted.
Yes. Server-side recording from the SFU into S3 / R2 / B2; live captions via streaming Whisper or Deepgram; async post-processing for summaries, moderation, sentiment and embedding-based search across recorded calls.
Six patterns cover ~80% of incidents: latency spikes from one overloaded POP, audio/lipsync drift, TURN over-use (>50%), client CPU saturation from ML, recording silently failing, compliance gaps surfaced at audit. We chaos-test every release against all six.