WebRTC architecture · 2026 guide

A practical guide to WebRTC architecture for production real-time video and audio.

How WebRTC actually works end-to-end at production scale. The topology decision (P2P, SFU, MCU, hybrid, broadcast) every architect is asked to answer on the first call. The 12 components every production WebRTC system needs. The build-vs-SDK economics that decide whether a custom stack or Twilio / Agora / LiveKit Cloud is the right call. Written from the platforms we have shipped: Nucleus (600M+ call minutes per month), Worldcast Live (sub-second latency at 10,000 concurrent), StreamLayer (NBC / CBS / Chelsea FC / Sony Music interactive sports), and BrainCert (500M+ classroom minutes per year).

20+ years in real-time video · since 2005 | 625+ projects · 400+ clients | Top WebRTC Developer 2022 · Clutch 5.0 / 30 reviews
Industry recognition · 2019–2024
Top WebRTC Developer · 2022 · category leader
Top Telecom Software Dev · 2024 · industry recognition
EASA Best Software Partner · 2019 · category winner
DesignRush Top App Dev · 2020 · platform recognition
Clutch 5.0 / 30 reviews · Spring 2024 · top performer
Quick answer

WebRTC is the browser-native standard for real-time peer-to-peer media — four primitives (getUserMedia, RTCPeerConnection, MediaStream, RTCDataChannel) plus ICE / STUN / TURN for NAT traversal. WebRTC architecture is the system-design pattern for shipping that standard at production scale.

Every production WebRTC system has 12 components surrounding the browser primitives: signaling, a media gateway (SFU / MCU / P2P), TURN servers, recording pipeline, transcoding, CDN distribution, observability, billing, end-to-end encryption, compliance tooling, guardrails, and deployment. Skipping any of them is technical debt that surfaces in the first month of production traffic.

P2P, SFU, MCU: the topology decision. P2P is direct browser-to-browser with the lowest latency, but it stops at ~4 participants because each peer uploads a copy to every other peer: per-peer upload grows O(N) and total room traffic grows O(N²). An SFU (Selective Forwarding Unit: mediasoup, LiveKit, Janus, Pion, Kurento) receives each peer’s stream once and forwards it selectively; upload at the peer becomes O(1), so SFUs scale to 10,000+ participants per cluster. An MCU mixes peer streams server-side into a single composite stream, which is useful for legacy SIP interop but costly because the server transcodes everything. Most modern stacks ship hybrid: P2P for 1:1, SFU for groups, MoQ or LL-HLS for broadcast above ~10K viewers.

How you scale WebRTC depends on the participant tier. Below 100 participants, a single SFU node handles it. 100–1K, simulcast + SVC gates the publisher count and serves receive-only viewers cheaply. 1K–10K, cascade SFUs across regions or hybrid with LL-HLS broadcast. Above 10K, WebRTC becomes the talent layer; LL-HLS or MoQ (Media over QUIC, the emerging 2026 standard) handles broadcast fan-out. Fora Soft has shipped at every tier — Nucleus runs 600M+ call minutes per month; Worldcast Live runs HD broadcast at 10K concurrent with sub-second latency; StreamLayer powers interactive sports for NBC, CBS, Chelsea FC, and Sony Music.

Topics & use cases covered in this guide

P2P, SFU, MCU, hybrid, broadcast — and the verticals each one fits.

Four shapes of WebRTC stack dominate the 2026 landscape. Each gets a different topology, a different vendor stack, and a different scale ceiling.

Choose your scale — the architecture follows

WebRTC topology decision picker.

The central question every WebRTC architect is asked to answer on the first call: at what scale does your real-time call need to operate, and which topology actually holds up there? P2P, SFU, MCU, hybrid, broadcast — each one wins at a different participant count and breaks at the next tier. Pick your scale below to see the topology that fits, the vendor stack, the latency and bandwidth math, the failure mode that kills you at that tier, and the Fora Soft client running production at that scale.

The first decision is concurrency. Once you know it, 80% of the architecture follows.

Tier 01 · 1:1 · 2 participants
Direct peer connection · no media server

P2P direct — the lowest-latency option WebRTC offers.

A direct browser-to-browser RTCPeerConnection. ICE negotiates the path, STUN punches NAT, TURN relays when STUN cannot. No media server in the loop. The lowest-latency real-time path WebRTC offers — but it breaks the moment you add a third participant.

Topology

P2P direct

RTCPeerConnection · no SFU

Latency budget

80–250 ms

Glass-to-glass · same region

Cost shape

~$0 media server

TURN bandwidth only when needed

Vendor stack

Browser RTCPeerConnection API · coturn (TURN/STUN) · custom signaling (WebSocket). No SFU vendor required.

Bandwidth math

O(1) per peer. Each side uploads one stream and downloads one stream. ~2 Mbps each direction for 720p VP9.

Failure mode at this tier

Symmetric NAT on one side breaks STUN — TURN-relayed connection adds 50–150 ms. ~18–35% of cellular connections relay through TURN. Fix: deploy TURN in 3+ regions with anycast routing.

When it breaks

The moment you add a third participant. P2P scales O(N²) across the room because every peer uploads to every other peer. Three peers = 2 uploads each (3 connections in the room). Four peers = 3 uploads each (6 connections in the room). Stops working above ~4 participants in practice.
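The mesh arithmetic can be checked with a few lines of Python. This is a minimal sketch, not a capacity model: the 2 Mbps per-stream figure reuses the 720p estimate quoted for this tier, and the function name `mesh_load` is purely illustrative.

```python
def mesh_load(participants: int, stream_mbps: float = 2.0) -> dict:
    """Stream counts and per-peer uplink for a full-mesh (P2P) call.

    stream_mbps is an assumed per-stream bitrate, not a measured value.
    """
    if participants < 2:
        raise ValueError("a call needs at least 2 participants")
    uploads_per_peer = participants - 1                    # one copy per remote peer
    connections = participants * (participants - 1) // 2   # bidirectional peer pairs
    one_way_streams = participants * (participants - 1)    # total streams in the room
    return {
        "uploads_per_peer": uploads_per_peer,
        "connections": connections,
        "one_way_streams": one_way_streams,
        "uplink_mbps_per_peer": uploads_per_peer * stream_mbps,
    }


for n in (2, 3, 4, 6):
    print(n, mesh_load(n))
```

Per-peer uplink grows linearly while the room-wide stream count grows quadratically, which is why consumer uplinks tend to give out at around four participants.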

Fora Soft shipped at this tier

Nucleus (Fibernetics) · 600M+ call minutes per month · WebRTC + SIP business communication platform serving 5,000+ businesses with AI phone agents. SOC 2 + GDPR + HIPAA. 1:1 calls run P2P; group calls route through Kurento.

Tier 02 · 2–10 · Small group
SFU / P2P boundary · hybrid topology shines

SFU starts paying off. Above 4 participants, P2P collapses.

The boundary where most modern production stacks switch to an SFU (Selective Forwarding Unit). Each peer uploads one stream to the SFU; the SFU forwards copies to every other peer. Upload bandwidth per peer becomes O(1) regardless of group size, while download still grows with the number of streams received. Hybrid topology dominates in 2026: P2P for the 1:1 leg, SFU for groups, selected dynamically per session.

Topology

SFU (or hybrid)

P2P at 1:1 → SFU at 3+

Latency budget

150–300 ms

Glass-to-glass · region-local SFU

Cost shape

$200–$2K / mo

Per SFU node bandwidth + compute

Vendor stack

mediasoup (open-source SFU) · LiveKit Cloud or self-host · Janus · Pion (Go-based) · ion-sfu. For 1:1 leg fall-back: native RTCPeerConnection.

Bandwidth math

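O(1) uplink per peer · O(N) egress per published stream at the SFU node. For 10 peers at 720p simulcast (low/med/high): peer uplink ~3 Mbps; SFU egress up to 27 Mbps per publisher (9 copies × 3 Mbps), so up to ~270 Mbps if all 10 publish. Actual egress is lower because the SFU forwards one tier per receiver, not all three.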

Failure mode at this tier

Single-region SFU breaks for cross-continent groups (Sydney + London + NYC routed to one SFU = 30%+ failed joins). Fix: regional SFU cascade with simulcast routing.

When it breaks

Above ~100 publishers per SFU node. Above 10 active video streams the simulcast forwarding burns CPU instead of bandwidth — switch to SVC (scalable video coding) or cascade SFUs.

Fora Soft shipped at this tier

Ruume.ai · iMind · Meetric · business video conferencing at small-to-medium group scale. HIPAA-compliant where required (Ruume.ai). mediasoup + LiveKit + Twilio as the common stack.

Tier 03 · 10–100 · Classroom / meeting
SFU mandatory · simulcast becomes critical

SFU is non-negotiable. Simulcast / SVC enters the architecture.

The classroom-and-meeting tier. SFU is mandatory: a P2P mesh is impractical at this scale because every peer would have to upload a separate copy of its stream to each of the other 9–99 participants. Simulcast (multiple quality tiers per stream) becomes critical because participant networks span 4G cellular to fibre. The SFU drops quality tiers selectively per receiver. Recording moves server-side (client-side recording loses 30%+ of participants when tabs close).
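Per-receiver tier dropping reduces to a small selection rule. The sketch below is illustrative only: the three tier bitrates and the 80% headroom factor are assumptions, and a real SFU works from its own bandwidth estimate rather than a number passed in by hand.

```python
from typing import Optional

# Hypothetical simulcast tier bitrates in kbps; real values depend on encoder settings.
TIERS_KBPS = {"high": 2500, "medium": 500, "low": 150}


def pick_tier(available_kbps: float, headroom: float = 0.8) -> Optional[str]:
    """Return the highest tier that fits the receiver's estimated downlink.

    headroom keeps a safety margin so the chosen tier does not fill the link.
    Returns None when even the lowest tier does not fit (forward audio only).
    """
    budget = available_kbps * headroom
    for name, kbps in sorted(TIERS_KBPS.items(), key=lambda kv: kv[1], reverse=True):
        if kbps <= budget:
            return name
    return None


for estimate in (4000, 1000, 300, 100):
    print(estimate, "kbps ->", pick_tier(estimate))
```

The same publisher can therefore reach a fibre receiver at the high tier and a congested 4G receiver at the low tier without re-encoding on the server.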

Topology

SFU + simulcast

3 quality tiers per stream

Latency budget

200–400 ms

Region-local · sub-300 ms target

Cost shape

$2K–$10K / mo

SFU autoscale + recording pipeline

Vendor stack

mediasoup with simulcast · LiveKit Cloud or self-host on K8s · Kurento (still production at clients like Nucleus and Worldcast Live) · Janus. Server-side recording: FFmpeg orchestrated from SFU events.

Bandwidth math

~500 publishers per SFU node at production tuning. 100-participant classroom with 3 simulcast tiers: peer uplink ~3 Mbps; SFU egress ~30–90 Mbps depending on active video count.

Failure mode at this tier

Client-side recording drops students when tabs close. Cold-start SFU pods add 1.5–3 seconds to join. Fix: warm pool of 2–3 idle pods per region; server-side recording pulled from SFU, never from client.

When it breaks

Above ~500 publishers per SFU node. Single-region collapse on cross-continent classes. Recording pipeline backlog if transcoding cannot keep up with concurrent class endings.

Fora Soft shipped at this tier

BrainCert · 500M+ classroom minutes / year · 99.995% uptime · the world's first WebRTC + HTML5 virtual classroom LMS. $3M ARR, 100K+ customers, 10 worldwide datacenters. SOC 2 + ISO 27001 + HIPAA + GDPR + PCI DSS + CCPA compliance stack.

Tier 04 · 100–1K · Webinar / large room
SFU + aggressive simulcast / SVC · publisher count throttled

Publisher count gets gated. Most participants become receive-only.

Above 100 participants, every architecture in production gates the publisher count. Only the host plus a handful of panelists publish video; everyone else is receive-only. The SFU still serves 1,000+ receivers, but only 5–20 simultaneous publishers. SVC (scalable video coding, AV1 / VP9) starts to outperform simulcast on bandwidth efficiency. Recording becomes a separate pipeline track from live delivery.

Topology

SFU + receive-only fan-out

5–20 publishers · 1K receivers

Latency budget

250–450 ms

P99 ~ 800 ms · interactive bar

Cost shape

$10K–$50K / mo

Multi-region SFU + CDN start

Vendor stack

mediasoup at production tuning · LiveKit self-host on K8s · Pion-based custom SFU for control · server-side recording with FFmpeg · CDN edge cache for VOD replay.

Bandwidth math

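SFU egress dominates: 10 publishers × 500 receivers ≈ 5K outbound streams per node, because the SFU forwards one simulcast tier per publisher to each receiver (the 3 tiers multiply ingest to 30 inbound streams, not egress). Sharding receivers across SFU instances is mandatory.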
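A rough way to size receiver sharding is sketched below. It assumes the SFU forwards exactly one simulcast tier per publisher to each receiver, and the per-node outbound stream cap is a hypothetical tuning number that would need to be measured for your own deployment.

```python
import math


def shard_receivers(publishers: int, receivers: int, max_outbound_per_node: int) -> dict:
    """Estimate how many SFU nodes a receive-only audience needs.

    Assumes one forwarded stream per publisher per receiver.
    max_outbound_per_node is an assumed capacity figure, not a vendor spec.
    """
    receivers_per_node = max_outbound_per_node // publishers
    if receivers_per_node == 0:
        raise ValueError("a single receiver already exceeds the per-node cap")
    return {
        "total_outbound_streams": publishers * receivers,
        "receivers_per_node": receivers_per_node,
        "nodes": math.ceil(receivers / receivers_per_node),
    }


print(shard_receivers(publishers=10, receivers=1000, max_outbound_per_node=5000))
```

Lowering the publisher count is the cheapest lever here: halving the publishers doubles the receivers each node can serve.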

Failure mode at this tier

SFU egress saturation. Bandwidth contention when 1,000 students all turn cameras on at once. Fix: camera-default-off policy with one-click enable; aggressive simulcast tier-dropping; SVC for newer browsers.

When it breaks

When you try to keep interactive parity at broadcast scale. Above 1K simultaneous participants the math breaks: cascade SFUs or move to a hybrid SFU + LL-HLS architecture.

Fora Soft shipped at this tier

AllAboutLaw · 1,000 concurrent users · 85,000+ event registrations on serverless AWS (Lambda + DynamoDB + AWS Chime). The UK's biggest virtual law fair — week-long events scaling cleanly through aggressive fan-out and serverless backend.

Tier 05 · 1K–10K · Conference / large event
SFU cascade + LL-HLS hybrid · region cascade non-negotiable

Single SFU cannot serve. Cascade + hybrid is the only path.

The tier where a single SFU stops being viable. Cascade SFUs across regions: streams hop from publisher's regional SFU to receiver's regional SFU once, the receiver SFU serves all local viewers. For one-way fan-out above ~5K, hybrid topology wins: interactive layer on WebRTC SFU (sub-second latency for the talent + Q&A pool), broadcast layer on LL-HLS (2–4 second latency for the rest). Recording fully separated.
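The benefit of cascading can be put in numbers by counting how many streams cross region boundaries. This is a simplified sketch: the region names and counts are made up, and it ignores simulcast tiers and relay topology beyond a single hop.

```python
def inter_region_streams(publishers_by_region: dict, viewers_by_region: dict) -> tuple:
    """Compare cross-region stream counts without and with an SFU cascade.

    Without cascade, every remote viewer pulls each stream across regions.
    With cascade, each stream crosses to a remote region once and is fanned
    out locally by that region's SFU.
    """
    direct = 0
    cascaded = 0
    for src, pubs in publishers_by_region.items():
        for dst, viewers in viewers_by_region.items():
            if dst == src or viewers == 0:
                continue
            direct += pubs * viewers
            cascaded += pubs
    return direct, cascaded


# Hypothetical event: 5 publishers in "eu", viewers spread over three regions.
direct, cascaded = inter_region_streams(
    {"eu": 5},
    {"eu": 1000, "us": 2000, "ap": 500},
)
print("cross-region streams without cascade:", direct)
print("cross-region streams with cascade:", cascaded)
```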

Topology

SFU cascade + LL-HLS

Two-tier: interactive + broadcast

Latency budget

Sub-500 ms / 2–4 s

Interactive vs broadcast tier

Cost shape

$50K–$200K / mo

Multi-region + CDN egress

Vendor stack

Custom mediasoup cascade for interactive · Kurento at scale · LL-HLS via Cloudflare Stream or AWS Elemental for broadcast · Mux or Bunny.net for edge delivery · RTMP ingest for legacy broadcast input.

Bandwidth math

Interactive layer: 50–200 SFU-served participants. Broadcast layer: HLS fan-out to 5K–10K via CDN. CDN egress dominates cost (~$0.04–$0.08 per GB).

Failure mode at this tier

CDN cache miss storms at event start. Audio drift between WebRTC and LL-HLS layers (publishers ahead of viewers by 2–3 seconds). Fix: explicit timing-source sync; pre-warm CDN edges; staggered fan-out.

When it breaks

Above ~10K concurrent receivers per region. CDN regional saturation. Need multi-CDN failover and edge anycast routing.

Fora Soft shipped at this tier

Worldcast Live · 10,000 concurrent viewers · sub-second latency (0.4–0.5 s) · custom WebRTC + Kurento HD concert streaming platform with multichannel audio (5 channels) and 1.5 Gb/s bitrate for true HD. Full-duplex two-way streaming for remote performers playing together in real time.

Tier 06 · 10K+ · Broadcast / mass scale
LL-HLS / MoQ / HLS · WebRTC for interactive talent layer only

WebRTC is the talent layer. Broadcast goes to LL-HLS / MoQ.

Above 10K concurrent viewers, WebRTC is not the delivery layer for the bulk of the audience. WebRTC remains the interactive talent layer (50–500 talent + selected interactive viewers at sub-300 ms); LL-HLS, MoQ, or standard HLS handles the broadcast fan-out. MoQ (Media over QUIC) is the emerging 2026 standard for sub-500 ms broadcast at million-viewer scale — Cloudflare, Twitch, Meta have shipped early MoQ deployments. For new builds expecting > 100K simultaneous viewers, MoQ is the right thing to test.

Topology

Hybrid: WebRTC + LL-HLS / MoQ

Interactive talent + broadcast fan-out

Latency budget

Sub-300 ms / 200–500 ms

WebRTC talent · MoQ broadcast

Cost shape

$200K+ / mo

CDN egress dominates · multi-CDN

Vendor stack

Custom WebRTC SFU for talent layer · LL-HLS via Cloudflare Stream / AWS Elemental MediaPackage / Bunny / Mux · MoQ (Media over QUIC) for next-gen sub-500 ms broadcast · multi-CDN failover · RTMP ingest for broadcast camera feeds.

Bandwidth math

CDN egress dominates. 100K concurrent at 720p HLS ≈ 200 Gb/s sustained. Cost: ~$0.02–$0.06 per GB at scale tiers; ~$15K–$40K per million viewer-hours.
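The egress figures can be sanity-checked with simple arithmetic. The sketch below assumes ~2 Mbps per 720p viewer (the rate implied by 100K viewers at 200 Gb/s) and decimal gigabytes; the per-GB price is an input, not a quote from any CDN.

```python
def cdn_egress(viewers: int, bitrate_mbps: float, price_per_gb: float) -> dict:
    """Back-of-the-envelope CDN egress for a one-way broadcast.

    bitrate_mbps is an assumed average delivered bitrate per viewer.
    """
    sustained_gbps = viewers * bitrate_mbps / 1000.0
    # Mbit/s -> Mbit per hour -> MB per hour -> GB per hour
    gb_per_viewer_hour = bitrate_mbps * 3600.0 / 8.0 / 1000.0
    cost_per_million_viewer_hours = gb_per_viewer_hour * 1_000_000 * price_per_gb
    return {
        "sustained_gbps": sustained_gbps,
        "gb_per_viewer_hour": gb_per_viewer_hour,
        "cost_per_million_viewer_hours": cost_per_million_viewer_hours,
    }


for price in (0.02, 0.04, 0.06):
    print(price, cdn_egress(viewers=100_000, bitrate_mbps=2.0, price_per_gb=price))
```

Because cost is linear in bitrate, dropping the default rendition or tightening the adaptive bitrate ladder moves the bill more than most infrastructure changes.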

Failure mode at this tier

CDN regional saturation; cache-miss storms at major events; codec interop on long-tail receivers; audio drift between WebRTC and HLS layers. Fix: multi-CDN failover; pre-warmed edges; explicit timing sync.

When it breaks

A regional CDN provider has a quarterly outage at the worst possible moment. Plan: multi-CDN with anycast routing on day one.

Fora Soft shipped at this tier

StreamLayer · interactive sports streaming for NBC, CBS, Red Bull, Chelsea FC, Coca-Cola, Sony Music · $14.1M raised across 7 rounds. Polls, prediction games, watch parties, in-stream betting handoff. Interactive viewers watch 33% longer than passive. Platforms running the full stack report 60–100% revenue uplift over basic ad breaks.

Scale is the first decision. The vendor choice (mediasoup vs LiveKit vs Janus vs Kurento vs Pion) matters less than topology — most production-grade SFUs handle Tier 02 through Tier 05 well at the right tuning. The actual production reality is that no architecture is one-tier: most platforms ship hybrid topology that picks per-session. Fora Soft has shipped at every tier above. Book a call to map your concurrency model against this picker on the first conversation.
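The picker above can be condensed into a lookup table. This is a sketch under one stated assumption: where the tier ranges overlap at a boundary (for example 10 or 100 participants), the count is assigned to the lower tier. The labels are copied from the tier cards; the function name is illustrative.

```python
# (inclusive upper bound on participants, tier label, topology) from the tier cards.
TIERS = [
    (2, "Tier 01", "P2P direct"),
    (10, "Tier 02", "SFU (or hybrid)"),
    (100, "Tier 03", "SFU + simulcast"),
    (1_000, "Tier 04", "SFU + receive-only fan-out"),
    (10_000, "Tier 05", "SFU cascade + LL-HLS"),
]


def pick_topology(participants: int) -> tuple:
    """Map a concurrent participant count to the matching tier and topology."""
    if participants < 2:
        raise ValueError("a call needs at least 2 participants")
    for upper, tier, topology in TIERS:
        if participants <= upper:
            return tier, topology
    return "Tier 06", "Hybrid: WebRTC + LL-HLS / MoQ"


for n in (2, 8, 50, 500, 5_000, 50_000):
    print(n, pick_topology(n))
```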

Reference architecture · at a glance

The WebRTC stack, from your code to the wire.

Four layers, nineteen components. Browser primitives at the top, server infrastructure at the bottom. Each component is listed with its vendor or spec shortlist.

RTP packet path · sender → receiver
Application APIs · 5 | Media engine · 4 | Transport & security · 4 | Server infrastructure · 6

Layer 1 · Application APIs · 5 components · What your code writes. Browser-native.
01 · App · getUserMedia. Asks the OS for camera + mic. Returns a MediaStream. (W3C spec · navigator.mediaDevices.getUserMedia())
02 · App · RTCPeerConnection. The connection object. Holds codecs, ICE state, senders, receivers. (W3C spec · the central WebRTC object)
03 · App · MediaStream. Container for audio + video tracks at either end of the pipe. (W3C spec · addTrack() / removeTrack() / clone())
04 · App · RTCDataChannel. Reliable or unreliable arbitrary data, multiplexed on the same connection. (W3C spec · SCTP over DTLS · chat, telemetry, game state)
05 · App · Signaling client. Your WebSocket client. Exchanges SDP offer/answer + ICE candidates. (Not standardized; you build it with Socket.io or native WS)

Layer 2 · Media engine · 4 components · Codecs and DSP. Browser-internal.
06 · Media · Codecs. Audio: Opus (default). Video: VP8 / VP9 / H.264 / AV1. Negotiated via SDP. (Opus 16 kHz mono → 48 kHz stereo · VP9 + AV1 for newer browsers)
07 · Media · DSP · AEC / NS / AGC. Echo cancel, noise suppress, auto-gain. Tuning DSP per-language matters for global users. (libwebrtc audio_processing module · per-room tuning)
08 · Media · Jitter buffer. Receiver-side. Smooths packet arrival jitter so playback stays smooth. (Adaptive depth · trades latency for smoothness)
09 · Media · Simulcast / SVC. Publish 2–3 quality tiers; SFU drops tiers per receiver. Mandatory above 10 participants. (VP9 SVC (built-in) · H.264 simulcast (multi-encoder))

Layer 3 · Transport & security · 4 components · How packets actually move. The wire.
10 · Wire · DTLS-SRTP. DTLS handshake derives SRTP keys. Encrypts media between peers, or hop-by-hop between peer and SFU. (RFC 5764 · mandatory in every browser since 2017)
11 · Wire · SCTP / usrsctp. Reliable + unreliable transport for the data channel, multiplexed over DTLS. (RFC 4960 · usrsctp library in browsers)
12 · Wire · ICE / STUN / TURN. NAT traversal. ICE gathers candidates, STUN punches, TURN relays. 18–35% of cellular connections relay. (RFC 8445 (ICE) · RFC 5389 (STUN) · RFC 5766 (TURN))
13 · Wire · BWE / GCC. Bandwidth estimation, congestion control. Adapts encoder bitrate to network conditions. (Google Congestion Control (GCC) · transport-cc feedback)

Layer 4 · Server infrastructure · 6 components · What you build and operate.
14 · Server · Signaling server. WebSocket layer that exchanges SDP and ICE between peers. Owns auth and room state. (Custom · Node + Socket.io, Nest.js, Go gorilla/websocket)
15 · Server · Media gateway. SFU forwards, MCU mixes. The 2026 default is SFU for any room above 4 peers. (mediasoup · LiveKit · Janus · Pion · Kurento · ion-sfu)
16 · Server · TURN cluster. Relay servers when ICE/STUN can't punch NAT. Anycast-routed across 3+ regions. (coturn · Twilio TURN · Cloudflare TURN)
17 · Server · Recording + transcoding. Server-side recording from the SFU. FFmpeg transcodes to HLS/DASH for VOD. (LiveKit egress · FFmpeg · AWS MediaConvert · never client-side)
18 · Server · CDN egress / broadcast. LL-HLS for 2–4 s broadcast latency. MoQ for sub-500 ms at 100K+ viewers. (Cloudflare Stream · AWS CloudFront · Mux · Bunny.net)
19 · Server · Observability + billing. getStats() + RTCP feedback into a time-series DB. Per-participant-minute metering. (callstats.io (legacy) · Datadog · OpenTelemetry · custom)

Layer 1 + 2 ship inside every modern browser (Chrome, Firefox, Safari, Edge). Layer 3 is half-spec, half-build — the protocols are standardized, the deployment is yours. Layer 4 is where most of the engineering happens, and where most production bugs live.

Media stack · what actually moves the bits

From a video frame to a UDP packet on the wire.

Every WebRTC media packet stacks six layers, each adding a header on the way down and stripping it on the way up. Each layer has a protocol that owns it and a field-by-field byte budget.

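One outgoing packet · ~1300 bytes total
IP · 20 | UDP · 8 | DTLS · 13 | SRTP · 12 | RTP · 12 | Encoded media payload
Headers occupy byte 0 to byte ~65; the payload runs to byte ~1300.
Treat the DTLS and SRTP figures as a budget rather than literal per-packet headers: once the handshake has derived keys, SRTP media packets travel directly over UDP without a DTLS record header, and SRTP's own overhead is mainly an authentication tag appended after the payload.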
Layer 05 · RTP · RFC 3550

Real-time Transport Protocol header (12 bytes fixed)

The header that makes media reassemblable at the receiver. Sequence number for reorder detection, timestamp for jitter buffer pacing, SSRC for stream identification, payload type for codec dispatch.

  • Sequence number (16 bits): packet order across the stream
  • Timestamp (32 bits): media clock, not wall clock
  • SSRC (32 bits): unique stream identifier inside the session
  • Payload type (7 bits): tells receiver which codec to dispatch to
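The fields above map directly onto the first 12 bytes of every RTP packet. The parser below is a minimal sketch of the RFC 3550 fixed header only; it does not handle CSRC lists or header extensions, and the sample packet is synthetic (payload type 111 is just a common dynamic value used for Opus, chosen here as an example).

```python
import struct
from dataclasses import dataclass


@dataclass
class RtpHeader:
    version: int
    padding: bool
    extension: bool
    csrc_count: int
    marker: bool
    payload_type: int
    sequence_number: int
    timestamp: int
    ssrc: int


def parse_rtp_header(packet: bytes) -> RtpHeader:
    """Parse the 12-byte fixed RTP header (network byte order)."""
    if len(packet) < 12:
        raise ValueError("RTP packet shorter than the 12-byte fixed header")
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    return RtpHeader(
        version=b0 >> 6,               # top 2 bits
        padding=bool(b0 & 0x20),
        extension=bool(b0 & 0x10),
        csrc_count=b0 & 0x0F,
        marker=bool(b1 & 0x80),
        payload_type=b1 & 0x7F,        # low 7 bits
        sequence_number=seq,           # 16 bits
        timestamp=ts,                  # 32 bits, media clock
        ssrc=ssrc,                     # 32 bits
    )


# Synthetic packet: version 2, payload type 111, followed by dummy payload bytes.
sample = struct.pack("!BBHII", 0x80, 111, 4660, 90000, 0xDEADBEEF) + b"payload"
print(parse_rtp_header(sample))
```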

Encapsulation reads bottom-up at the sender, top-down at the receiver. Roughly 5% of a 1300-byte packet is headers (~65 of ~1300 bytes); the rest is payload. The header is a fixed cost per packet, so small voice packets pay proportionally more: a 100-byte audio payload carries the same ~65 bytes of headers. That is still cheap compared to the engineering it pays for.

NAT traversal · the ICE state machine

How a WebRTC connection actually finds the other peer.

ICE (RFC 8445) doesn’t just find one path: it gathers every candidate it can, pairs them up, runs connectivity checks on the cartesian product, and promotes the winner. Each state below lists what’s happening internally and what trips it into the next state.

Candidate gathering · what each peer collects before the state machine starts
HOST · type preference 126 · Direct interface IPs (Ethernet, Wi-Fi, cellular). The first to be gathered, fastest to connect when the network allows.
SRFLX · type preference 100 · Server-reflexive. STUN binding tells you your public IP+port after NAT. Works for ~65% of consumer connections.
PRFLX · type preference 110 · Peer-reflexive. Discovered during connectivity checks when a peer’s NAT shows you a new mapping mid-flight.
RELAY · type preference 0 · TURN-allocated. Last resort, but mandatory for symmetric NAT and most cellular. ~18–35% of cellular ends up here.
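The numbers 126 / 110 / 100 / 0 are the type preferences that feed the candidate priority formula in RFC 8445. The sketch below computes candidate and pair priorities from that formula; the local preference of 65535 and component ID 1 are typical single-interface values, used here as assumptions.

```python
# Recommended type preferences from RFC 8445.
TYPE_PREFERENCE = {"host": 126, "prflx": 110, "srflx": 100, "relay": 0}


def candidate_priority(cand_type: str, local_preference: int = 65535, component_id: int = 1) -> int:
    """priority = 2^24 * type pref + 2^8 * local pref + (256 - component ID)."""
    return (
        (TYPE_PREFERENCE[cand_type] << 24)
        + (local_preference << 8)
        + (256 - component_id)
    )


def pair_priority(controlling: int, controlled: int) -> int:
    """Pair priority: 2^32 * min(G, D) + 2 * max(G, D) + (1 if G > D else 0)."""
    g, d = controlling, controlled
    return (min(g, d) << 32) + 2 * max(g, d) + (1 if g > d else 0)


for kind in ("host", "prflx", "srflx", "relay"):
    print(kind, candidate_priority(kind))
```

Checks run in descending pair-priority order, which is why a working host or server-reflexive path normally wins before a relay pair is tried.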
new

The PeerConnection is built but no checks have started.

You called new RTCPeerConnection() and maybe even setLocalDescription, but ICE hasn’t fired off any STUN binding requests yet.

What kicks off the next state: first STUN binding request sent or first remote candidate received via signaling
Typical duration: milliseconds. If you’re stuck here for >1 s, signaling hasn’t delivered candidates
Failure mode: signaling server is unreachable, or addIceCandidate() is being called before setRemoteDescription

In production the path you actually care about is new → checking → connected → completed in under 500 ms, then living in completed for the session’s lifetime. Anything else is an incident.
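That rule is easy to turn into an automated check over logged state changes. The sketch below assumes you already collect (state, milliseconds-since-start) pairs from the client's ICE connection state events; the function name and the event format are hypothetical.

```python
HAPPY_PATH = ("new", "checking", "connected", "completed")


def audit_ice_timeline(events, budget_ms: float = 500.0) -> list:
    """Return a list of problems for one session's ICE state timeline.

    events: list of (state, ms_since_start) tuples in the order observed.
    An empty list means the session followed the expected path within budget.
    """
    problems = []
    seen = [state for state, _ in events]
    for state in seen:
        if state not in HAPPY_PATH:
            problems.append(f"unexpected state: {state}")
    happy = [s for s in seen if s in HAPPY_PATH]
    if happy != list(HAPPY_PATH[: len(happy)]):
        problems.append("states out of order")
    completed_at = next((t for s, t in events if s == "completed"), None)
    if completed_at is None:
        problems.append("never reached completed")
    elif completed_at > budget_ms:
        problems.append(f"completed after {completed_at:.0f} ms (budget {budget_ms:.0f} ms)")
    return problems


good = [("new", 0), ("checking", 40), ("connected", 210), ("completed", 320)]
bad = [("new", 0), ("checking", 50), ("failed", 5000)]
print(audit_ice_timeline(good))
print(audit_ice_timeline(bad))
```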

SFU vendor deep-dive

mediasoup vs LiveKit vs Janus vs Pion vs Kurento — five production SFUs compared.

Every production SFU choice is a trade-off between control and time-to-market. Fora Soft has shipped on all five. Each profile below covers the operational reality from production deployments: language, license, capacity, time-to-MVP, where each one wins, where it loses, and the Fora Soft client running it.

mediasoup · the architect's SFU
C++ media worker · Node.js API · ISC license · since 2016

Maximum control. Steep cognitive load. Production-correct.

Node.js API wrapping a C++ media worker. Each worker is a single OS process pinned to one CPU core; scale by spawning more workers across more cores. The API is explicit — you create routers, transports, producers, and consumers by hand. No magic. The choice for architects who need fine-grained control over simulcast layer activation, custom audio leveling, non-standard topologies, or single rooms with 1,000+ publishers.

Time to MVP

4–6 months

Higher cognitive load

Capacity / node

500–800 participants

Well-tuned m5.4xlarge

Per-core throughput

~500 consumers

~2× LiveKit on the same hardware

License

ISC (BSD-like)

Permissive · safe for commercial

Wins when

You need to control simulcast layer activation per receiver, implement custom audio leveling, run a non-standard topology (mesh of SFUs forwarding selectively), or scale a single room to 1,000+ publishers. Mediasoup's routing graph handles this well.

Loses when

Team is small and wants to ship in 8 weeks. Every transport, every producer, every consumer is yours to manage. No built-in cascade; you build the inter-SFU forwarder yourself.

Resource profile

~30–40 Mbps and ~5% of a Xeon core per active SFU stream. A c5.2xlarge handles ~150 concurrent participants comfortably; an m5.4xlarge tuned deployment reaches 500–800.

Production users

Daily (pre-rewrite), Discord (historical), Slack Huddles, multiple Fora Soft production deployments. The de facto pick when an architect wants minimum abstraction over the WebRTC media stack.

Fora Soft canon

Multiple production deployments. Default pick when the team grows into fine-grained control and the use case justifies the 4–6 month build window. Pairs well with Kubernetes operators and custom recording pipelines.

LiveKit · the SaaS-shaped SFU
Go · Apache 2.0 · single binary · LiveKit Cloud managed option

Fastest time-to-MVP. AI-agent framework. Cloud + self-host.

Go-based, Apache 2.0, ships as a single binary. Room → participant → track API; room configuration handles simulcast and recording declaratively. The LiveKit Agents framework (Python + Node SDKs) makes it the dominant choice for AI voice and video agents in 2026. LiveKit Cloud's globally distributed mesh handles regional SFU placement, routing, and failover transparently.

Time to MVP

2–3 months

Fastest of the five

Per-core throughput

~100 video tracks

Public benchmark

Cloud pricing

$0.0075 / participant-minute

Self-host is free

License

Apache 2.0

Permissive · safe for commercial

Wins when

You want a 2–3 month MVP, an AI agent layer is in the design, or LiveKit Cloud's managed hosting saves you the multi-region operations cost. Best-in-class globally distributed mesh in 2026.

Loses when

You want fine-grained codec routing (LiveKit's abstractions are higher level than mediasoup), need GPLv3-compatible plugins (Janus territory), or have very unusual topology requirements.

Killer feature

The Egress service handles server-side recording cleanly — pull MP4 or HLS to S3 with one API call. For many products, this alone saves 2–3 months of recording-pipeline engineering.

AI agents framework

First-class support for OpenAI, Anthropic, Google, ElevenLabs, Cartesia, Deepgram. Streaming VAD, audio chunking, turn detection, function calling — all out of the box. The 2026 default for AI voice agents.

Fora Soft canon

Several AI-agent production builds. Default greenfield pick when neither SIP nor server-side transcoding is required. Start on LiveKit Cloud for MVP; migrate to self-hosted on Kubernetes when scale crosses 5M participant-minutes / month.
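The migration threshold is easier to reason about in dollars. The sketch below uses the $0.0075 per participant-minute cloud price quoted in this profile; the self-hosted monthly cost is a hypothetical input you would replace with your own infrastructure and staffing estimate.

```python
CLOUD_PRICE_PER_PARTICIPANT_MINUTE = 0.0075  # USD, the figure quoted in this guide


def cloud_monthly_cost(participant_minutes: float,
                       price: float = CLOUD_PRICE_PER_PARTICIPANT_MINUTE) -> float:
    """Monthly managed-cloud bill for a given usage volume."""
    return participant_minutes * price


def break_even_minutes(self_host_monthly_usd: float,
                       price: float = CLOUD_PRICE_PER_PARTICIPANT_MINUTE) -> float:
    """Usage volume at which the cloud bill equals an assumed self-host cost."""
    return self_host_monthly_usd / price


print(cloud_monthly_cost(5_000_000))   # bill at the 5M participant-minute mark
print(break_even_minutes(20_000))      # hypothetical $20K/month self-host budget
```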

Janus · the SIP gateway
C · GPLv3 (commercial license available) · plugin-driven · since 2014

Mature. Plugin-driven. SIP interop unmatched.

C-based, plugin-driven. The core Janus server handles transport; every feature (VideoRoom, AudioBridge, Streaming, SIP, NoSIP) is a separately loadable plugin. The plugin model means Janus can do things other SFUs cannot — particularly the SIP gateway plugin, which bridges WebRTC to traditional VOIP carriers. The most mature SFU in production (since 2014).

Time to MVP

3–5 months

Plugin-driven

Maturity

Since 2014

Deepest production tooling

SIP gateway

First-class

Unmatched VOIP interop

License

GPLv3

Commercial license adds cost

Wins when

You need SIP interop (Nucleus pattern), stream-level features in plugins (audio mixing without full MCU, broadcast streaming, recording), or mature production tooling. The plugin model is unique in the SFU world.

Loses when

Licensing matters. Janus is GPLv3 by default. Commercial license available for proprietary deployments but adds cost. Greenfield projects without SIP rarely choose Janus over LiveKit or mediasoup in 2026.

Killer plugins

VideoRoom (SFU), AudioBridge (server-side audio mixing), Streaming (RTSP/RTP ingest), SIP gateway, NoSIP (raw WebRTC ↔ raw RTP bridge), Recorder. Each plugin loadable independently.

Production users

Telephony platforms, telehealth with SIP fallback, large conferencing where the audio mixing is server-side. Meetecho (the original authors and maintainers) ships Janus at scale.

Fora Soft canon

Multiple production deployments. Reserved for projects with SIP gateways at the heart — telecom, contact-center extensions, hybrid WebRTC + PSTN architectures. Not the default for AI-agent or pure-WebRTC builds in 2026.

Pion · the build-it-yourself library
Go · MIT · WebRTC primitives · not a server

Primitives, not a product. Full control. 12+ months to ship.

Go library implementing the WebRTC stack from scratch — RTCPeerConnection, RTP, ICE, DTLS. Not an SFU; you build your own SFU server on top. For teams building deeply custom servers (custom load-balancing logic, custom codec routing, server-side signal processing) and willing to own the entire stack in Go.

Time to MVP

6–12 months

You build the SFU

Maturity

Library

Not a server

Control level

Maximum

You own everything

License

MIT

Most permissive

Wins when

You are building a deeply custom server (custom load-balancing logic, custom codec routing, server-side signal processing) and want to own the entire stack in Go. Pion is the foundation many production SFUs are built on (including ion-sfu).

Loses when

You want any kind of out-of-the-box SFU behavior. Pion is primitives. Time to production is 12+ months for a full SFU built on it.

Best for

Custom server-side signal processing (audio analytics, ML on the media stream), embedded WebRTC (IoT, robotics), or research-grade experimentation. ion-sfu and several niche SFUs are Pion-based.

Community

Active GitHub project (~13K stars). Used by Twitch's go-rtmp-to-webrtc bridge. Apache Foundation projects depend on it. Strong upstream commit cadence.

Fora Soft canon

Niche custom builds where standard SFUs do not fit — custom routing logic, embedded WebRTC clients, server-side ML on raw RTP. Not a default pick for typical conferencing or AI-agent builds.

Kurento · the transcoding media server
C++ core · Java / JavaScript client APIs · Apache 2.0 · GStreamer pipeline · mature

Server-side media transformation. Transcoding first-class. JVM ops required.

Media server with a C++ / GStreamer core, driven through Java and JavaScript client APIs, with a graph-of-media-elements API. You compose pipelines like webrtc-endpoint → filter → recording-endpoint programmatically. Transcoding is first-class: the original design goal was bridging between WebRTC and legacy media. Nucleus and Worldcast Live both run Kurento because transcoding is core to those products.

Time to MVP

4–6 months

Pipeline composition learning curve

Transcoding

First-class

Opus ↔ G.711 · VP9 ↔ H.264

Pipeline model

Media elements

Composable filter graph

License

Apache 2.0

Permissive · safe for commercial

Wins when

You need server-side video manipulation (overlays, compositing, watermarking), audio transcoding (Opus ↔ G.711 for SIP interop), or server-side recording with on-the-fly format conversion. Nucleus and Worldcast Live both ship Kurento for these reasons.

Loses when

"Modern Go-shaped" matters — Kurento is Java, JVM tuning is part of operations. Innovation slowed 2023–2025; active development moved to LiveKit / mediasoup for greenfield work.

Killer feature

Composable media pipelines. Want to overlay a watermark on every recorded stream? It's a filter element in the pipeline. Want to bridge a 5-channel HD WebRTC publisher to a stereo G.711 SIP leg? It's a transcoder element.

Production users

Nucleus (Fibernetics) — WebRTC + SIP. Worldcast Live — HD concert streaming with multichannel audio. Long tail of legacy production deployments where transcoding is core.

Fora Soft canon

Nucleus and Worldcast Live both run Kurento in production because transcoding is core to those products. The pick when server-side media manipulation, multi-codec transcoding, or pipeline-style processing is non-negotiable.

No vendor is universally correct — the right pick depends on team skills, the SIP / transcoding / AI-agent shape of the product, and how much control you need over the media pipeline. Fora Soft canon: greenfield builds with no SIP and no transcoding requirement default to LiveKit for speed; mediasoup when the team grows into fine-grained control; Kurento for products where transcoding is core; Janus for SIP-heavy architectures; Pion for custom servers built from primitives.

Bandwidth & cost calculator

WebRTC bandwidth and TURN cost calculator.

The calculator takes a topology, participant count, bitrate per stream, and TURN relay percentage. It runs the four canonical WebRTC bandwidth formulas (publisher upload, SFU egress, MCU egress, total room bandwidth) and projects monthly TURN relay cost. This is the same back-of-the-envelope math architects walk through on the first scoping call.


Output · live calculation

Headline result

Publisher upload (per peer)

Server egress (per room)

Receiver download (per peer)

Total room bandwidth

Monthly TURN cost (self-hosted)

At ~$0.01 per GB

Monthly TURN cost (managed)

At ~$0.05 per GB

Self-hosted TURN bandwidth assumes Hetzner / OVH / dedicated tier (~$0.005–$0.02 / GB). Managed TURN reflects Twilio / Cloudflare / Subspace pricing (~$0.04–$0.06 / GB). Add ~20% for retransmit / FEC overhead in production. Numbers exclude SFU compute, recording storage, and signaling — those are roughly 10–30% of total infrastructure cost at typical fleet sizes.
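The formulas themselves are not spelled out on this page, so the following is a minimal sketch using the usual back-of-the-envelope approximations. It assumes every peer publishes one stream at a single bitrate, and the function names and the participant-hours input are hypothetical. The per-GB prices and the 20% overhead come from the note above.

```python
from dataclasses import dataclass


@dataclass
class RoomEstimate:
    publisher_upload_mbps: float    # what one publishing peer sends
    server_egress_mbps: float       # what the media server sends per room
    receiver_download_mbps: float   # what one peer receives
    total_room_mbps: float          # sum of every peer's upload + download


def estimate_room(topology: str, participants: int, bitrate_mbps: float) -> RoomEstimate:
    """Rough estimate; assumes every peer publishes one stream (an assumption)."""
    n, b = participants, bitrate_mbps
    if topology == "p2p":
        # Each peer sends to and receives from every other peer directly.
        up, egress, down = (n - 1) * b, 0.0, (n - 1) * b
    elif topology == "sfu":
        # Each peer uploads once; the SFU forwards every stream to the others.
        up, egress, down = b, n * (n - 1) * b, (n - 1) * b
    elif topology == "mcu":
        # Each peer uploads once and receives one composite stream.
        up, egress, down = b, n * b, b
    else:
        raise ValueError(f"unknown topology: {topology}")
    return RoomEstimate(up, egress, down, n * (up + down))


def monthly_turn_cost(stream_mbps: float, relay_fraction: float,
                      participant_hours: float, price_per_gb: float,
                      overhead: float = 0.20) -> float:
    """Relayed GB per month times price, with retransmit / FEC overhead added."""
    gb_per_hour = stream_mbps * 3600 / 8 / 1000   # Mbit/s -> GB per hour
    relayed_gb = gb_per_hour * participant_hours * relay_fraction * (1 + overhead)
    return relayed_gb * price_per_gb


# Hypothetical inputs: 10 peers at 2 Mbps, 25% relayed, 10,000 participant-hours.
sfu = estimate_room("sfu", participants=10, bitrate_mbps=2.0)
self_hosted = monthly_turn_cost(2.0, 0.25, 10_000, price_per_gb=0.01)
managed = monthly_turn_cost(2.0, 0.25, 10_000, price_per_gb=0.05)
```

The total-room figure counts each peer's access link separately, which is one of several reasonable definitions. Swap in a different one if your billing follows server egress only.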

Security architecture · deep-dive

DTLS-SRTP, SFrame, and the patterns that pass audits.

DTLS-SRTP encrypts media between peers, SFrame encrypts media end-to-end through the SFU, and the audit log is what gets you SOC 2 / HIPAA past the procurement gate. Four tabs, four production patterns.

The DTLS handshake bootstraps every SRTP session.

DTLS only runs at session start — one or two round trips, then it’s out of the data path. The whole point is to derive the SRTP keys safely; everything after that is straight UDP carrying SRTP packets.

1. Client → Server: ClientHello + cipher suites + DTLS-SRTP extension
2. Server → Client: HelloVerifyRequest with cookie (DDoS defense)
3. Client → Server: ClientHello with cookie echoed back
4. Server → Client: ServerHello + cert + cert verify + key share
5. Client → Server: Client key share + cert + verify + Finished
6. Both = SRTP keys: EXTRACTOR per RFC 5764 derives master + salt + auth keys
Modern
DTLS 1.3 (RFC 9147)

One round trip instead of two. Reduces session-setup latency by ~100–150 ms.

Cert auth
Self-signed peer certs

The SDP carries the peer’s cert fingerprint. Signaling identity binds to fingerprint binds to cert.

Failure mode
MTU < 1280

DTLS fragments on small MTUs. Below 1280 bytes the handshake silently fails. Always probe MTU with PMTUD or pin to 1200.

DTLS bootstraps trust. SRTP protects the wire. SFrame protects against the SFU. The audit log makes all of it auditable. Skip the last one and you fail the first procurement review.

Observability & SRE · what to measure, what to alert on

The four metric families every production WebRTC system tracks.

Quality tells you what users feel. Errors tell you what broke. Cost tells you what you’re paying for. Compliance tells you what an auditor sees. Click any metric to expand the alert rule.

Quality

What users feel

Five metrics that predict NPS. Track p50 + p95 + p99 on every one.

Errors

What broke

Connection lifecycle failures plus the rate of degraded sessions.

Cost

What you’re paying for

Per-minute, egress, and infrastructure cost decomposition.

Compliance

What an auditor sees

Audit-log completeness and consent flow integrity for SOC 2 / HIPAA / GDPR.

getStats() is the API you build everything on. RTCInboundRtpStreamStats + RTCIceCandidatePairStats + RTCRemoteOutboundRtpStreamStats get you the bulk of the metrics on this page. Sample every 2 seconds, ship to a time-series DB, alert on percentiles — never on means.
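To illustrate why the alert rule should sit on percentiles rather than means, here is a small sketch. The sample values and the 300 ms threshold are hypothetical, and the nearest-rank method is one of several valid percentile definitions.

```python
from statistics import mean
from typing import List


def percentile(samples: List[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, so no float rounding surprises.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Hypothetical round-trip-time samples (ms): most sessions fine, 6% very bad.
rtt_ms = [40.0] * 94 + [900.0] * 6
ALERT_THRESHOLD_MS = 300.0  # hypothetical alert rule

mean_fires = mean(rtt_ms) > ALERT_THRESHOLD_MS            # the mean hides the tail
p95_fires = percentile(rtt_ms, 95) > ALERT_THRESHOLD_MS   # the p95 exposes it
```

In this data set the mean stays under the threshold while the p95 crosses it, which is the situation where a mean-based alert stays silent as 6% of users suffer.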

Standardized WebRTC ingest / egress

WHIP and WHEP — HTTP-native WebRTC signaling.

RFC 9725 (WHIP, March 2024) and RFC 9737 (WHEP, July 2024). The IETF’s answer to: stop building bespoke WebSocket signaling for every product. Publish with HTTP POST, subscribe with HTTP POST, both with SDP in the body. Click any step to see the actual exchange.

WHIP RFC 9725 · ingest

Publisher uploads media to the server.

One HTTP POST creates the session and exchanges SDP. The server returns a resource URL you DELETE to tear down. That’s the entire protocol — the rest is ordinary WebRTC over UDP.

WHEP RFC 9737 · egress

Subscriber pulls media from the server.

Mirror image. POST the SDP offer; server replies with answer + resource URL. Layer switching (simulcast, SVC) happens via PATCH with SDP renegotiation or out-of-band.

WHIP step 1 · publisher

POST /publish — create the ingest session

POST /publish HTTP/1.1
Host: ingest.example.com
Content-Type: application/sdp
Authorization: Bearer eyJhbGc...

v=0
o=- 7456 2 IN IP4 0.0.0.0
s=-
t=0 0
m=audio 9 UDP/TLS/RTP/SAVPF 111
m=video 9 UDP/TLS/RTP/SAVPF 96
…

Single round trip. The body is a standard SDP offer with the same media descriptions you’d send to any SFU. The Bearer token is what authenticates the publisher — usually a short-lived JWT minted by your control plane.
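The exchange above can be sketched with the Python standard library. The snippet only builds the requests and does not send them; the function names, endpoint, and token are placeholders. In WHIP the server answers the POST with 201 Created and puts the resource URL in the Location header, which is the URL the teardown request targets.

```python
from urllib.request import Request


def build_whip_publish(endpoint: str, sdp_offer: str, bearer_token: str) -> Request:
    """POST the SDP offer to the WHIP endpoint (request is built, not sent)."""
    return Request(
        endpoint,
        data=sdp_offer.encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/sdp",
            "Authorization": f"Bearer {bearer_token}",
        },
    )


def build_whip_teardown(resource_url: str, bearer_token: str) -> Request:
    """DELETE the session resource returned by the server to end the session."""
    return Request(
        resource_url,
        method="DELETE",
        headers={"Authorization": f"Bearer {bearer_token}"},
    )


# Placeholder values for illustration only.
publish = build_whip_publish("https://ingest.example.com/publish", "v=0\r\n", "example-token")
teardown = build_whip_teardown("https://ingest.example.com/resource/example", "example-token")
```

Passing either object to urllib.request.urlopen would perform the call. A real publisher would take the SDP offer from its WebRTC stack rather than a literal string.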

Why WHIP / WHEP matters in 2026: every WebRTC product used to ship its own bespoke WebSocket signaling protocol. WHIP / WHEP collapses that to standard HTTP. OBS Studio ships native WHIP. Cloudflare, Twitch, AWS, mediasoup, LiveKit, and most CPaaS vendors all expose WHIP / WHEP endpoints. The result: ingesting WebRTC to any cloud is now a one-line config change.

Cascade SFUs · globally distributed mesh

How a single call spans seven regions.

A single-region SFU cluster tops out at a few thousand publishers. Beyond that, you cascade: each region runs its own SFU, peers join the nearest, and SFUs relay only the streams that need to cross regions. Click any region to see its config; hover an edge for inter-region latency.

Regions in the mesh (7): Virginia, São Paulo, Frankfurt, Mumbai, Singapore, Tokyo, Sydney.
eu-central-1 · Frankfurt

The European hub — GDPR-residency default

EU-only data deployments pin here. Highest density of cross-region edges: Virginia 78 ms, Mumbai 120 ms, Singapore 155 ms, Tokyo 220 ms. Best central position in the global mesh.

Typical load: 28% of global sessions
SFU nodes: 20 autoscaled
Compliance: GDPR data residency · ISO 27001
Routing rule: Pin EU-tenants here regardless of user location

How a session picks a region

  1. Signaling server reads the tenant’s residency rule (EU-only, US-only, AU-only, anywhere).
  2. If residency-locked, pin to that region’s SFU. End of decision.
  3. Otherwise, geo-lookup the user’s IP, pick nearest SFU on the map by network latency (not great-circle distance).
  4. If the user’s nearest SFU is full or unhealthy, fail over to the next nearest. ~3% of sessions hit failover in steady state.
  5. Cross-region cascade only when peers in the same call land on different SFUs. Each cross-region hop adds 70–220 ms.
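The first four steps can be sketched as a small selection function. The Region type, the jurisdiction codes, and the latency figures are hypothetical. Applying the latency and health checks inside a residency-locked set is also an assumption, for tenants whose jurisdiction spans more than one region.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Region:
    name: str
    jurisdiction: str          # hypothetical codes: "EU", "US", "AU"
    healthy: bool = True
    has_capacity: bool = True


def pick_region(residency: str, regions: List[Region],
                latency_ms: Dict[str, float]) -> Optional[Region]:
    # Steps 1-2: any rule other than "anywhere" restricts the candidate set.
    if residency != "anywhere":
        regions = [r for r in regions if r.jurisdiction == residency]
    # Step 3: order by measured network latency, not great-circle distance.
    ordered = sorted(regions, key=lambda r: latency_ms[r.name])
    # Step 4: skip full or unhealthy SFUs and fall through to the next nearest.
    for region in ordered:
        if region.healthy and region.has_capacity:
            return region
    return None  # nothing eligible: the caller decides how to surface this


# Hypothetical mesh: the user's nearest region (virginia) is unhealthy.
mesh = [
    Region("frankfurt", "EU"),
    Region("virginia", "US", healthy=False),
    Region("sydney", "AU"),
]
user_latency_ms = {"frankfurt": 95.0, "virginia": 20.0, "sydney": 210.0}
chosen = pick_region("anywhere", mesh, user_latency_ms)
```

Returning None for a residency-locked tenant with no healthy region is deliberate: failing over across the residency boundary would break the rule in step 1.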

Cascade only crosses regions for streams that need to. Same-region peers stay on the same SFU node. Cross-region pairs go SFU→SFU over private backbone (or public internet on cheap tiers). The architecture wins when single-region cost is much cheaper than centralizing everywhere — which it always is past a couple thousand concurrent sessions.

AI voice agent latency budget

The latency budget that decides whether an AI voice agent feels human.

Total end-to-end latency from user finishing speaking to agent's first audio frame at the user's speaker determines whether the conversation feels natural or robotic. Three scenarios shape the budget — best case (~450 ms achievable), production median (1.4–1.7 s industry-published 2025–2026 data), and P99 in the wild (3–5 s without disciplined ops). Click a scenario to see how each stage contributes and where the optimization wins sit.

Total end-to-end latency

450 ms

User finishes speaking → first audio frame at speaker

All optimizations stacked. ASR / LLM / TTS all streaming, co-located in one region, pre-warmed model contexts. The conversation feels natural; users interrupt the agent without friction.

User perception thresholds (highlighted band shows where current scenario lands)

< 300 ms

Feels human · interruption flows naturally

300–600 ms

Slightly sluggish · acceptable for routine queries

600 ms – 1 s

Users notice the gap · rhythm breaks

1 – 1.5 s

Talking over agent · UX bug territory

> 1.5 s

Users revert to taps · or hang up

Numbers reflect industry-published medians from 2025–2026 production deployments. The biggest individual lever is converting every component from request/response to streaming — partial ASR transcripts, streamed LLM tokens, streamed TTS audio. The second biggest is co-locating SFU, ASR, LLM, and TTS in one datacenter to eliminate cross-region hops. Speech-to-speech models (OpenAI Realtime, Gemini Multimodal Live) compress this stack but trade observability for latency.
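For dashboards, the perception thresholds above can be turned into a lookup. This is a sketch: the band labels are shortened from the list above, and placing a boundary value (for example exactly 600 ms) in the slower band is an arbitrary choice.

```python
from bisect import bisect_right

# Upper edges of the first four bands, in milliseconds.
BAND_EDGES_MS = [300, 600, 1000, 1500]
BAND_LABELS = [
    "Feels human",
    "Slightly sluggish",
    "Users notice the gap",
    "Talking over agent",
    "Users revert to taps or hang up",
]


def perception_band(latency_ms: float) -> str:
    """Map end-to-end latency to a perception band; boundaries go to the slower band."""
    return BAND_LABELS[bisect_right(BAND_EDGES_MS, latency_ms)]


best_case = perception_band(450)            # the ~450 ms best-case scenario
production_median = perception_band(1600)   # inside the 1.4-1.7 s median range
```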

Delivery protocol matrix

WebRTC vs LL-HLS vs MoQ vs HLS vs RTMP — what to pick for what.

Every video product touches at least two delivery protocols — one for interactive (sub-500 ms) and one for fan-out (broadcast scale). The matrix below plots latency against scale ceiling for the five protocols dominating 2026 production. Click a protocol to see the production reality, current 2026 status, and the Fora Soft client running it.

Figure: scale ceiling (Y-axis, concurrent viewers, 1K to 10M+) plotted against glass-to-glass latency (X-axis, 100 ms to 20 s+; lower is better for interactive use cases).

WebRTC: 80–500 ms · 10K
MoQ: 200–500 ms · 1M+
LL-HLS: 2–4 s · 10M+
RTMP: 2–5 s · ingest only
HLS: 6–30 s · 10M+

Most products above 1K concurrent run hybrid: WebRTC for the talent and interactive layer, LL-HLS or MoQ for broadcast fan-out, with a transcoding bridge in between. StreamLayer ships this pattern. Worldcast Live is the exception: it runs WebRTC end-to-end at 10K concurrent because the latency requirement is sub-second and the audience size sits in WebRTC's sweet spot.

Recording architecture · server-side, scaled, compliant

The recording pipeline that does not lose a frame.

Client-side recording breaks above ~200 participants and on every tab close. Production WebRTC records on the server, pulls media from the SFU, transcodes in a separate pipeline, writes to object storage with chunked uploads, and serves through a CDN. Six stages, each with its own scale ceiling.

Stage 02 · Recording orchestrator

One controller per recording job, scaling horizontally.

The orchestrator listens to SFU events, decides which sessions need recording, spawns recorder workers (typically Kubernetes pods or AWS Fargate tasks), and tracks lifecycle. It also enforces compliance — consent-flag check, retention policy, jurisdiction routing.

Inputs: SFU webhooks · tenant config · consent flags · retention rules
Outputs: Kubernetes Job spec · recorder pod with credentials, output paths, manifest URLs
Stack: Custom Go / Node service · Temporal or Argo Workflows for state · Kubernetes for compute
Failure mode: Orchestrator crashes mid-session. Running recorders survive (they’re independent pods), but new sessions don’t start until the orchestrator recovers. Run at HA from day one.

Plan for ~1 GB / hour / room at 720p. Multiply by participant count for multi-track recordings, divide by ~3 if you record a composite mix. Compliance retention (HIPAA 6 years, GDPR “as short as possible”) is the dominant cost driver above 50K recorded hours.
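A rough sizing sketch based on the ~1 GB / hour / room figure and the multi-track multiplier. The function names and sample volumes are hypothetical, the composite-mix adjustment is not modeled, and the steady-state formula assumes a constant monthly recording volume.

```python
GB_PER_ROOM_HOUR_720P = 1.0  # planning figure from the text above


def recording_storage_gb(room_hours: float, participants: int = 1,
                         multi_track: bool = False) -> float:
    """Storage written for a batch of recordings at 720p."""
    tracks = participants if multi_track else 1
    return room_hours * GB_PER_ROOM_HOUR_720P * tracks


def retained_gb(monthly_gb: float, retention_months: int) -> float:
    """Steady-state footprint once the retention window is full."""
    return monthly_gb * retention_months


# Hypothetical month: 5,000 room-hours, 4 participants recorded as separate tracks.
monthly = recording_storage_gb(5_000, participants=4, multi_track=True)
hipaa_footprint = retained_gb(monthly, retention_months=72)  # 6 years
```

The retention multiplier is why the compliance window, not the recording rate, ends up dominating the bill.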

WebRTC on mobile · architectures & pitfalls

Five platforms, one libwebrtc, very different gotchas.

All five mobile WebRTC platforms ultimately link against the same libwebrtc — the difference is what the platform layer makes hard. Audio session handling on iOS, background mode on Android, codec licensing on iOS, screen share permissions everywhere. Pick a platform.

iOS native · libwebrtc + AVAudioSession

You link against libwebrtc.xcframework (Apple-signed Google builds) and call from Swift or Objective-C. The platform layer is dominated by AVAudioSession — configure it wrong and you get one-way audio or no audio at all.

Owned gotchas

  • AVAudioSession category: must be .playAndRecord with .allowBluetooth + .mixWithOthers. Default category breaks Bluetooth headsets.
  • CallKit integration: required for inbound calls, or iOS refuses to wake the app from the background. PushKit + CallKit is a mandatory pairing.
  • H.264 only in hardware: the VideoToolbox encoder accelerates H.264. VP9 / AV1 work but run software-only, which murders the battery.
  • Screen share: ReplayKit Broadcast Extension required. Runs in a separate process with 50 MB memory ceiling; design accordingly.
  • Background audio: only works if Background Modes » Audio is enabled in entitlements. Video freezes when app backgrounds — expected behavior.
SDK source
Google’s WebRTC binary builds via CocoaPods or SPM (LiveKit, mediasoup, Daily all wrap it)
Min iOS
iOS 14 for the modern Encoded Transform API. iOS 12 for baseline DTLS-SRTP.
Battery profile
25–35% per hour for 720p video. H.264 hardware path is half the cost of VP9.
Production canon
Nucleus iOS · BrainCert iOS · Translinguist iOS

If you only have engineering capacity for one approach, ship native iOS and Android. Mobile web is for the “tap to join” flow only. React Native + Flutter are valid for cross-platform teams; just remember the native gotchas don’t go away, they just get one bridge call deeper.

Failure mode playbook · the on-call runbook

Twelve production WebRTC failures, in the order you’ll hit them.

Each entry: what users feel, the root cause, the fix that ships, the telemetry that catches it next time. Filter by severity to focus the runbook view.

PAGE · TURN cold-start adds 1.5–3 s to join · join latency

Symptom: User joins a call, sees the “Connecting…” spinner for 2–3 seconds longer than normal. Happens in the first hour after a region scales out.

Root cause: TURN pod spawning — the relay server takes time to load coturn config, register with the control plane, and warm DNS. Cold workers cost ~1.5–3 s before they answer their first STUN request.

Fix: Keep a warm pool of 2–3 idle TURN workers per region. Scale by predicted peak +20% buffer. Pre-emptive scale-out on the daily-rhythm signal.

Telemetry: Track p99 ICE-pair-formation-time by region. Alert when it crosses 800 ms for 5 minutes. Tag candidate.type for relay vs srflx breakdown.

PAGE · DTLS handshake silently fails when path MTU < 1280 · connection establishment

Symptom: ICE shows connected, but no media. The browser doesn’t surface why. iceConnectionState stays at connected forever; getStats() shows no inbound RTP.

Root cause: DTLS handshake messages (the certificate flight in particular) go out as datagrams larger than the path MTU. Some intermediate network drops the fragments, and the handshake stalls without raising an alert. Common on corporate VPNs (1380 MTU after IPsec), Cloudflare WARP, mobile cellular with WAP gateways.

Fix: Pin DTLS record size to 1200 bytes via libwebrtc kMinPacketSize. Or run PMTUD at session start. Most production stacks just pin to 1200 and never look back.

Telemetry: Log every RTCDtlsTransport state transition (new, connecting, connected, closed, failed). Alert when a transport stays in connecting > 5 s.
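One way to evaluate that alert server-side is to replay the state transitions each client reports. This is a sketch; the (timestamp, state) tuple shape is a hypothetical telemetry format.

```python
from typing import List, Tuple


def stuck_in_connecting(transitions: List[Tuple[float, str]], now_s: float,
                        limit_s: float = 5.0) -> bool:
    """True if the DTLS transport spent more than limit_s in 'connecting'.

    transitions: chronologically ordered (timestamp_s, state) pairs.
    """
    entered = None
    for ts, state in transitions:
        if state == "connecting":
            if entered is None:
                entered = ts
        else:
            # Left 'connecting': check how long the transport sat there.
            if entered is not None and ts - entered > limit_s:
                return True
            entered = None
    # Still in 'connecting' at evaluation time.
    return entered is not None and now_s - entered > limit_s


healthy = [(0.0, "new"), (0.1, "connecting"), (0.4, "connected")]
stalled = [(0.0, "new"), (0.1, "connecting")]
```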

PAGE · ICE consent freshness expires, calls die after 30 s · mid-call drop

Symptom: Users are happily in a call. Around 30 seconds after a network change (Wi-Fi to cellular, VPN reconnect, NAT mapping refresh) the call goes silent and disconnects.

Root cause: ICE consent freshness (RFC 7675) checks every 5 s. After ~6 missed responses the connection transitions to disconnected, then failed. The browser doesn’t auto-restart unless your code calls pc.restartIce().

Fix: Listen for the window online event (navigator.onLine only reports the current state). On any online event during an active session, call pc.restartIce() to renegotiate ICE candidates. Pair with a UI “Reconnecting…” banner.

Telemetry: Count sessions that drop from completed to failed within 30 s of an online event. That’s the recovery you’re missing.
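A sketch of that count over exported session records. The record shape (an optional failed_at timestamp plus a list of online-event timestamps, all in seconds) is hypothetical.

```python
from typing import Dict, List


def missed_recoveries(sessions: List[Dict], window_s: float = 30.0) -> int:
    """Count sessions that failed within window_s after an online event."""
    count = 0
    for session in sessions:
        failed_at = session.get("failed_at")
        if failed_at is None:
            continue  # session never reached the failed state
        if any(0 <= failed_at - t <= window_s for t in session["online_events"]):
            count += 1
    return count


# Hypothetical records, timestamps in seconds since session start.
records = [
    {"failed_at": 130.0, "online_events": [105.0]},   # died 25 s after a network change
    {"failed_at": 400.0, "online_events": [100.0]},   # unrelated failure
    {"failed_at": None, "online_events": [50.0]},     # recovered or never dropped
]
missed = missed_recoveries(records)
```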

PAGE · Untuned simulcast: receivers pin to highest tier · cost runaway

Symptom: SFU egress costs balloon 2–3× expected. CDN egress invoice arrives, finance asks what changed. No user-visible quality drop.

Root cause: Simulcast was enabled, but the SFU isn’t dropping tiers. Every receiver gets the high tier regardless of bandwidth. Common when GCC feedback isn’t flowing or the SFU is configured to forward all layers.

Fix: Verify SFU is using BWE feedback to drop tiers per receiver. Inspect a sample session in the SFU debug UI — if egress bitrate equals publisher bitrate × receiver count, you’re leaking. Pin per-receiver bitrate ceilings as a backstop.

Telemetry: Egress GB per participant-hour, broken down by simulcast layer. Budget 1.5 GB / participant-hour at 720p; alert at 2.5 GB / participant-hour.
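The leak test in the fix above (egress equal to publisher bitrate × receiver count) can be automated. This is a sketch; the function names, the 5% tolerance, and the sample bitrates are hypothetical.

```python
def fanout_ratio(publisher_kbps: float, receiver_count: int, egress_kbps: float) -> float:
    """Egress as a fraction of 'every receiver gets the top tier'."""
    full_fanout = publisher_kbps * receiver_count
    if full_fanout <= 0:
        raise ValueError("publisher bitrate and receiver count must be positive")
    return egress_kbps / full_fanout


def is_leaking(publisher_kbps: float, receiver_count: int, egress_kbps: float,
               tolerance: float = 0.05) -> bool:
    """True when egress is within tolerance of full top-tier fan-out."""
    return fanout_ratio(publisher_kbps, receiver_count, egress_kbps) >= 1 - tolerance


# Hypothetical room: a 2,500 kbps top tier, 20 receivers.
leaking = is_leaking(2500, 20, egress_kbps=50_000)   # every receiver on the top tier
tuned = is_leaking(2500, 20, egress_kbps=22_000)     # most receivers on lower tiers
```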

PAGE · Recording stitched client-side breaks at 200+ participants · recording integrity

Symptom: Recordings show frozen frames, missing speakers, or audio cuts where tabs were closed. Customer complaints about missing footage.

Root cause: The recording is being assembled from browser MediaRecorder on one of the participants’ machines. When a tab closes, refreshes, or backgrounds — gone. At 200+ participants the probability of disruption per session approaches 1.

Fix: Move recording to the server. Recorder joins the SFU as a participant, pulls SRTP, writes the master file in a dedicated pod. Never client-side except for prototype demos.

Telemetry: Track recording-success rate per session. Alert when it drops below 99% for any room size.

WARN · Aggressive noise suppression cuts whispered speech · audio quality

Symptom: Soft-spoken users in healthcare or legal calls report “the other person can’t hear me.” AGC is pumping but transcripts still drop whispered words.

Root cause: Default WebRTC noise suppression treats whispered consonants as noise. Per-language tuning matters — English vs Mandarin vs Russian have very different speech spectra.

Fix: Tune VAD aggressiveness per language. Set noiseSuppression: false when the user is on AirPods (which do their own NS). Use libwebrtc audio_processing module config + pass language hint from the signaling layer.

Telemetry: Inbound RTP audioLevel + voiceActivityFlag. Compare against client-side AGC compression ratio. Outliers are the channels with broken NS.

WARN · Audio drift between WebRTC and LL-HLS tracks · hybrid sync

Symptom: On hybrid SFU + LL-HLS broadcast deployments, the WebRTC talent audio is 200–500 ms ahead of the LL-HLS audience video. Reactions feel disconnected.

Root cause: RTP and HLS use independent clocks. The LL-HLS pipeline buffers segments; the WebRTC path doesn’t. Without explicit sync, drift grows.

Fix: Stamp every WebRTC packet with absolute capture time in an RTP header extension. Carry the same timestamp through HLS metadata (ID3 in TS or emsg in fragmented MP4). Receiver re-syncs by NTP-anchored timestamps.

Telemetry: A/V offset per session, sampled from a known reference frame embedded in both tracks. Alert at 250 ms drift.

WARN · SFU egress dominates compute at Tier 04 · cost shape

Symptom: Compute cost per participant looks fine in dev. In production at 200–1000 participants per room, egress bandwidth is 60%+ of total infrastructure spend.

Root cause: Cost shape inverts above Tier 04: bandwidth out-prices CPU on most cloud providers. Each receive-only viewer multiplies egress without adding much CPU.

Fix: Negotiate committed-rate egress with your cloud provider above 1 PB / month. Switch to R2 or B2 for recordings. Above Tier 05, route viewers via LL-HLS through CDN edge cache — cheap egress on cached segments.

Telemetry: Cost per participant-hour, decomposed into CPU + RAM + egress + storage. Decomposition tells you where the next optimization lives.

WARN · First 5 s bandwidth estimation undershoots · startup quality

Symptom: First 5–10 seconds of a call look choppy / low-res, then improve. Users on fast connections complain about “why does it always start grainy.”

Root cause: Google Congestion Control starts conservative (~300 kbps) and ramps up. Without an initial bandwidth hint, GCC doesn’t know the network is capable of 5 Mbps until it’s measured it.

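Fix: Seed the estimator with a start bitrate based on the user’s historical bandwidth (cookie or server-side per-user record). Browsers expose no standard initialBitrate field, so this usually means a native libwebrtc bitrate setting or an SDP-level hint. Set networkPriority: high for the active speaker’s send stream.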

Telemetry: Track p95 time-to-target-bitrate from session start. Alert when it crosses 3 s.

WARN · TURN allocations leak when close() is missed · cost & capacity

Symptom: TURN GB-relayed bill grows faster than session volume. Capacity alerts fire mid-week even though new-session count is flat.

Root cause: JS code path doesn’t call pc.close() on every exit branch — tab close, navigation, error, browser-kill. The TURN allocation stays open until its timeout (default 600 s on coturn).

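Fix: On beforeunload / pagehide, notify your signaling backend so it tells the SFU to tear down. navigator.sendBeacon() can only POST, so send a POST teardown message, or use fetch() with keepalive: true if the endpoint requires DELETE. Also set coturn allocation timeout shorter (60 s) for sessions tagged short-lived.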

Telemetry: Active TURN allocations vs active signaling sessions. Mismatch > 5% means leaks.
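The mismatch rule is simple enough to state as code. This is a sketch; the function names and the sample counts are hypothetical, and the 5% threshold is the one from the line above.

```python
def allocation_mismatch(active_allocations: int, active_sessions: int) -> float:
    """Excess TURN allocations as a fraction of active signaling sessions."""
    if active_sessions == 0:
        return float("inf") if active_allocations > 0 else 0.0
    return (active_allocations - active_sessions) / active_sessions


def has_leak(active_allocations: int, active_sessions: int,
             threshold: float = 0.05) -> bool:
    return allocation_mismatch(active_allocations, active_sessions) > threshold


leaky = has_leak(1_100, 1_000)    # 10% more allocations than sessions
normal = has_leak(1_030, 1_000)   # 3% is within the threshold
```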

INFO · VP9 SVC layer-drop causes flicker · cosmetic

Symptom: Brief blocky flicker on receiver every 2–5 seconds when bandwidth tightens. No freeze, just visual artifacts.

Root cause: VP9 SVC layer-drop is per-frame. When the SFU drops a temporal layer mid-GOP, the receiver renders a partial frame before the next keyframe.

Fix: Increase keyframe interval to ~2 s minimum (drops resilience but reduces flicker). Or force a keyframe via PLI / FIR when layer state changes (REMB only carries bitrate estimates and cannot request one).

Telemetry: framesDecoded vs keyFramesDecoded ratio. Spikes correlate with bandwidth events.

INFO · Bluetooth headset audio routing fails on first iOS call · first-call UX

Symptom: User with AirPods or Bluetooth headphones: first call routes audio to phone speaker. Second call routes correctly.

Root cause: AVAudioSession needs .allowBluetooth + .mixWithOthers in CategoryOptions. iOS caches the routing decision; first call is made before the option propagates.

Fix: Configure AVAudioSession at app launch with the full CategoryOptions set, not at call start. Force re-evaluation on every session activation.

Telemetry: Survey users: rate of “had to switch headphones” complaints. App-level metric, not WebRTC stats.

These twelve cover roughly 90% of production WebRTC incidents at Fora Soft over the past 24 months. The remaining 10% are unique to a particular client’s stack — not generalizable, but they all teach the same lesson: instrument before you launch, alert on percentiles, never on means.

CPaaS vendor matrix · pricing & when each one wins

Five WebRTC clouds. The lens you read them through decides which wins.

Same five vendors, very different optima. The cell highlighting changes based on what you’re optimizing for. 2026 pricing — check vendor calculators for binding numbers.

| | Twilio Video (Enterprise legacy) | Agora (Asia-first scale) | LiveKit Cloud (AI-native) | Daily (Embedded experience) | Vonage (Programmable comms) |
|---|---|---|---|---|---|
| Per-minute price (HD) | $0.0040 / pm | $0.99 / 1K min | $0.0040 / pm + agent-hour | $0.0050 / pm | $0.00475 / pm tiered |
| Recording (per minute) | $0.0040 + storage | $0.59 / 1K min | $0.0030 / pm egress | $0.0040 / pm + storage | $0.0040 + storage |
| Max participants / room | 50 (group), 200 (small-group) | 1M (broadcast) | 100K (broadcast), 1K (interactive) | 300 (interactive) | 500 (relayed) |
| AI agent integration | Voice Intelligence add-on | Conversational AI Engine | Agents framework (Python, Node, Go) | Pipecat (open-source) + Daily Bots | AI Studio (low-code) |
| Compliance (HIPAA / SOC 2 / GDPR) | HIPAA BAA + SOC 2 + GDPR + ISO + PCI | HIPAA BAA + SOC 2 + GDPR + ISO | HIPAA BAA + SOC 2 Type II + GDPR | HIPAA BAA + SOC 2 + GDPR | HIPAA + SOC 2 + GDPR + FedRAMP |
| Region count | 12+ | 200+ data centers | 20+ | 10+ | 15+ |
| Developer ergonomics | Mature, verbose REST + helper libs | SDK-heavy, doc gaps | Open-source server compatibility | Prebuilt UI + minimal config | Long history, dated API surface |
| Self-host fallback | No | No | Yes (same OSS server) | No | Limited (on-prem option) |

Pricing approximate, 2026 Q1 published rates. Always model against actual usage shape: heavy recording shifts the curve toward LiveKit + Daily; sub-second broadcast shifts toward Agora; HIPAA without engineering capacity shifts toward Twilio + Vonage; AI-agent runtime shifts toward LiveKit + Daily. Highlighted cells are the per-lens winners.

Production examples

Four shipped WebRTC systems and the architectural choices that made them work.

Telehealth and business communication at 600M+ call minutes per month. HD broadcast at 10K concurrent with sub-second latency. Interactive sports streaming for major networks. The world’s first WebRTC + HTML5 virtual classroom LMS at 500M+ classroom minutes per year. Four production builds running today across four very different scale shapes.

Nucleus · WebRTC + SIP

600M+ call minutes per month · 5,000+ businesses

Secure on-premise Slack alternative, serving 300,000+ customers on a national network processing 2 billion+ phone calls annually. Nucleus serves 5,000+ businesses with AI phone agents handling 600M+ call minutes monthly. Featured in Financial Post and PRWeb.

The WebRTC architecture combines real-time WebRTC calls with SIP for cellular and landline bridging: chat-to-SMS, voice-to-voice AI translation, video and audio calls, CRM and ERP integration. SOC 2, GDPR, and HIPAA-compliant.

Worldcast Live · Sub-second concert

10K concurrent · 0.4–0.5 s latency · 5-channel HD

HD concert streaming platform — one of the first to achieve sub-second latency (0.4–0.5s) at scale, serving 10,000 simultaneous viewers. Custom WebRTC + Kurento architecture with multichannel audio (5 channels), 1.5 Gb/s bitrate for true HD quality, and dynamic video quality adjustment for poor connections.

Full-duplex two-way streaming lets performers in different locations play together in real time. The white-label Multiple Venue Streaming (MVS) plugin syncs live streams across multiple websites simultaneously.

StreamLayer · Interactive sports

NBC, CBS, Red Bull, Chelsea FC, Sony Music

Interactive sports streaming SaaS used by NBC, CBS, Red Bull, Live Nation, Chelsea FC, Coca-Cola, and Sony Music. Powered live engagement at Lollapalooza and Jay-Z’s Made In America festival. Raised $14.1M across 7 rounds ($8M Series A in 2022 led by Las Vegas Sands). Featured on NASDAQ and Gaming America.

Platforms implementing the full StreamLayer stack report 60–100% revenue uplift over basic ad breaks, with effective CPM rates reaching $50–80+. Interactive viewers watch 33% longer than passive viewers — ~2x increases in dwell time, ad revenue, and subscriber growth.

BrainCert · Virtual classroom LMS

500M+ classroom minutes / year · 99.995% uptime

The world’s first WebRTC + HTML5 virtual classroom LMS. Bootstrapped to $3M annual revenue and 100K+ customers — a 12-person team outcompeting VC-backed giants across the $400B+ e-learning market. 500M+ real-time classroom minutes delivered across 10 worldwide datacenters with 99.995% guaranteed uptime. 4 Brandon Hall Awards (Triple Bronze 2021).

Full compliance stack: SOC 2 Type I & II, ISO/IEC 27001:2022, HIPAA, GDPR, PCI DSS, CCPA, NIST SP 800-171. Proprietary QuantumKey DRM encryption shipped at no additional cost to customers. Fora Soft built the product from the ground up; 2+ year ongoing partnership.

Decision framework

Build custom WebRTC, buy SDK, or hybrid — when each one wins.

Three architectural paths for shipping a real-time video product. None is universally correct. The right choice is a function of usage volume, customization depth, compliance scope, and whether the platform is the product or supports the product.

Build custom

When you need control SDK won’t give you

Wins when: real-time video is the product, concurrency clears 1,000 simultaneous, custom server-side routing logic, regulated industry (HIPAA, SOC 2, GDPR), brand-embedded experience, multi-tenant SaaS, or specific media routing (MoQ delivery, multi-region cascade, simulcast tuning by network condition). You own all the IP.

Cost shape: $5K–$40K build over  months. Plus $500-$2K monthly operations (depends on the usage). Break-even on per-minute economics typically falls around 8M participant-minutes per month. Archetypes: Nucleus, Worldcast Live, BrainCert, StreamLayer.

Buy SDK

When you can live in the vendor’s box

Wins when: under 5M participant-minutes per month, standard call patterns, no custom server-side routing, no regulatory or data-residency constraints, no in-house engineering capacity to operate multi-region WebRTC. Twilio Video, Agora, LiveKit Cloud, Daily, Vonage — all do this well.

Cost shape: $0.004–$0.06 per participant-minute (Twilio band), per-minute rate scales linearly with usage. No upfront build cost. Vendor lock-in. Most SDKs don’t offer BAAs.

Hybrid

When half the problem fits SDK

Wins when: the runtime fits an SDK but the differentiator (custom turn-taking, regional routing, voice cloning, embedded experience) needs custom code. The most common 2026 pattern in regulated SaaS plays.

Cost shape: SDK subscription + $–$0K for the custom layer + monthly maintenance. Designed to migrate to full custom if usage scales past ~8M participant-minutes.

Cost ranges are 2026-indicative. Implementation specifics — concurrent participant targets, compliance scope, recording requirements, codec mix, multi-region cascade depth — typically dominate the spread within each tier.

Custom WebRTC architecture costs $-build over  months. SDK alternatives cost $0.004–$0.06 per participant-minute with vendor lock-in. The break-even point typically falls around 8M participant-minutes per month — below, SDK wins; above, custom wins.
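The break-even arithmetic can be sketched as follows. It treats the custom stack as a flat monthly figure (operations plus amortized build), which is a simplification, and the $32,000 input is purely illustrative: it is what 8M participant-minutes cost at the low end of the SDK band.

```python
import math


def sdk_monthly_cost(participant_minutes: float, rate_per_minute: float) -> float:
    """SDK bill: scales linearly with usage."""
    return participant_minutes * rate_per_minute


def break_even_minutes(custom_monthly_cost: float, sdk_rate_per_minute: float) -> float:
    """Monthly minutes at which the SDK bill equals a flat custom-stack cost."""
    if sdk_rate_per_minute <= 0:
        raise ValueError("SDK rate must be positive")
    return custom_monthly_cost / sdk_rate_per_minute


# $0.004 / participant-minute is the low end of the SDK band quoted above.
sdk_bill_at_8m = sdk_monthly_cost(8_000_000, 0.004)
# Hypothetical flat custom cost chosen to reproduce the ~8M break-even.
minutes_needed = break_even_minutes(32_000, 0.004)
```

At the top of the SDK band ($0.06 per minute) the same flat cost breaks even far earlier, so the vendor rate you actually negotiate moves the threshold more than any other input.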
FAQ

Twelve questions every WebRTC architecture review covers.

What is WebRTC SFU architecture?

An SFU (Selective Forwarding Unit) is the media routing topology used by production WebRTC group calls above ~4 participants. Each peer uploads a single encoded stream to the SFU; the SFU forwards selected copies to the other peers without decoding or re-encoding. This keeps upload bandwidth at the peer O(1) regardless of room size, while download still grows with the number of streams received. SFU vendors include mediasoup, LiveKit, Janus, Pion, ion-sfu, Kurento, and Jitsi Videobridge. Modern SFUs add simulcast and SVC so they can drop quality tiers per receiver based on network conditions.

What is the difference between P2P and SFU?

P2P (peer-to-peer) connects peers directly via RTCPeerConnection, with no server in the media path. Every peer uploads to every other peer, so per-peer upload grows O(N) and total room bandwidth O(N²), which works for 1:1 and stops being viable above ~4 participants. An SFU routes streams through a server: each peer uploads once, the server forwards. Upload at the peer becomes O(1), letting SFUs scale to 10,000+ participants. Most production stacks ship hybrid topology: P2P for 1:1 legs, SFU for groups, selected dynamically per session.

How does SFU vs MCU compare?

SFU and MCU are both server-side topologies, but they route media differently. SFU forwards streams without transcoding: low latency (150–300 ms), but egress bandwidth at the server grows with every receiver. MCU mixes peer streams server-side into a single composite stream: higher latency (250–500 ms with transcoding), lower bandwidth per receiver, but very compute-heavy. In 2026, SFU is the default for almost every production WebRTC build. MCU is reserved for legacy SIP interop and broadcast composite output where a single mixed stream is required.

How do you scale WebRTC?

Scale depends on the participant tier. Below 100 participants, a single SFU node handles it. 100–1K, simulcast + SVC gates the publisher count and serves receive-only viewers cheaply. 1K–10K, cascade SFUs across regions or move to a hybrid SFU + LL-HLS architecture. Above 10K, WebRTC becomes the talent layer; LL-HLS or MoQ (Media over QUIC, the emerging 2026 standard) handles broadcast fan-out. The Topology Decision Picker above walks through each tier with vendor stack, latency budget, and failure mode.

Which topology should I use — P2P, SFU, MCU, hybrid, or broadcast?

It depends on participant count and use case. P2P for 1:1 (sub-250 ms latency, no server). SFU for 4–10,000 participants (the default in 2026). MCU only for legacy SIP interop or when a single composite output stream is required. Hybrid (P2P + SFU) for products where call patterns vary — consumer apps, telehealth, customer-support video. Broadcast (LL-HLS / MoQ) above 10,000 viewers — WebRTC remains the interactive talent layer.

How does WebRTC handle NAT traversal?

WebRTC uses ICE (Interactive Connectivity Establishment), STUN (Session Traversal Utilities for NAT), and TURN (Traversal Using Relays around NAT). ICE gathers candidate addresses from each peer. STUN punches direct connections through most NATs. TURN relays through a server when STUN cannot — and 18–35% of cellular connections relay through TURN in practice. Production deployments run TURN servers in 3+ regions with anycast routing to minimize relay latency.

What’s the typical latency budget for production WebRTC?

Glass-to-glass latency by topology: P2P direct 80–250 ms; SFU region-local 150–300 ms; SFU cross-region 250–500 ms; MCU 250–500 ms with transcoding; LL-HLS broadcast 2–4 seconds; MoQ broadcast 200–500 ms (emerging). For real-time interactive use cases (telehealth, customer-support video, conferencing) the target is sub-300 ms; for interactive broadcast (sports interactive layer) sub-500 ms; for near-real-time broadcast (LL-HLS sports streaming) 2–4 seconds.
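
The same figures can be kept as a lookup table and checked against a target. The sketch below uses the ranges quoted in this answer; treating a topology as acceptable only when its worst-case figure fits the target is an assumption, and the function name is hypothetical.

```python
# Glass-to-glass latency ranges in milliseconds (best case, worst case),
# taken from the figures quoted in the answer above.
LATENCY_MS = {
    "P2P direct": (80, 250),
    "SFU region-local": (150, 300),
    "SFU cross-region": (250, 500),
    "MCU": (250, 500),
    "LL-HLS broadcast": (2000, 4000),
    "MoQ broadcast": (200, 500),
}


def topologies_within(target_ms: int) -> list[str]:
    """Topologies whose worst-case latency still fits the target budget."""
    return [name for name, (_, worst) in LATENCY_MS.items() if worst <= target_ms]
```

With a 300 ms target only the P2P and region-local SFU rows qualify; relaxing the target to 500 ms admits every row except LL-HLS.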

Which is better: Janus, mediasoup, LiveKit, Pion, or Kurento?

All five are production-grade foundations for an SFU tier. The choice depends on team skills and product needs. mediasoup (Node.js API over C++ media workers, lowest-level) for teams that want fine-grained control. LiveKit (Go-based, opinionated, Cloud + self-host) for teams that want the fastest time-to-MVP and the AI-agent framework on top. Janus (C-based, plugin-driven, mature) for legacy SIP / VoIP integration. Pion (Go, a library rather than a server) for teams building custom servers from primitives. Kurento (C++ media server with Java and JavaScript client APIs, mature, used by Nucleus and Worldcast Live) for teams that need a media server with built-in transcoding and recording. Fora Soft has shipped production builds on all five.

How do you handle recording at scale?

Record server-side, pulling from the SFU media path, never from the client browser (a browser-side MediaRecorder stops capturing as soon as its tab closes). The SFU emits an event when a publisher joins or leaves; a recording orchestrator spawns an FFmpeg process that subscribes to the SFU as a participant and writes HLS / DASH segments to S3-compatible storage with chunked uploads. Transcoding happens server-side, with GPU tiers above 100K minutes / month. Plan for ~1 GB / hour / room at 720p, more for 1080p.
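
The ~1 GB / hour / room planning figure translates directly into storage and bitrate estimates. The sketch below is back-of-the-envelope arithmetic only: it assumes decimal gigabytes (1 GB = 10^9 bytes) and a constant average bitrate, and the function names are hypothetical.

```python
def recording_storage_gb(rooms: int, hours_per_room: float, gb_per_room_hour: float = 1.0) -> float:
    """Storage for a batch of recordings at the ~1 GB / hour / room planning figure."""
    return rooms * hours_per_room * gb_per_room_hour


def implied_bitrate_mbps(gb_per_hour: float = 1.0) -> float:
    """Average bitrate implied by a GB-per-hour figure (decimal GB assumed)."""
    megabits_per_gb = 8_000  # 10^9 bytes * 8 bits / 10^6
    seconds_per_hour = 3_600
    return gb_per_hour * megabits_per_gb / seconds_per_hour
```

For 200 rooms recording 30 hours each in a month, the first function gives 6,000 GB of new storage, and the default planning figure corresponds to an average of roughly 2.2 Mbps per room.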

How do you make WebRTC HIPAA-compliant?

Four layers. (1) A signed BAA with infrastructure providers (AWS, GCP, and Azure all offer BAAs; many WebRTC SDKs do not). (2) Encryption in transit (DTLS-SRTP is built into WebRTC) and at rest (encrypted recording storage). (3) Full audit logging of every join, leave, recording access, and admin action. (4) Access controls: RBAC, MFA, automatic session termination. Fora Soft has shipped HIPAA-compliant WebRTC platforms including Nucleus (SOC 2 + GDPR + HIPAA) and BrainCert (full compliance stack). Because most SaaS WebRTC SDKs do not offer BAAs, HIPAA almost always pushes toward custom or self-hosted deployments.
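
As an illustration of layer (3), an audit trail can start as one structured record per event. The sketch below is a minimal example and not a compliance implementation: the field names, the action list, and the JSON-lines shape are all assumptions, and a real deployment would also need tamper-evident storage and retention rules.

```python
import json
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class AuditAction(str, Enum):
    JOIN = "join"
    LEAVE = "leave"
    RECORDING_ACCESS = "recording_access"
    ADMIN_ACTION = "admin_action"


def audit_record(actor_id: str, action: AuditAction, room_id: str,
                 now: Optional[datetime] = None) -> str:
    """Serialize one audit event as a JSON line (hypothetical field names)."""
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    return json.dumps(
        {"ts": timestamp, "actor": actor_id, "action": action.value, "room": room_id},
        sort_keys=True,
    )
```

Passing `now` explicitly keeps the function deterministic in tests; in production the default UTC timestamp would normally be used.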

What’s the cost difference between custom WebRTC and Twilio Video / Agora?

A custom WebRTC platform costs $5K–$40K to build over 1–3 months, then $500–$2K per month to operate, depending on usage. Twilio Video costs $0.004–$0.06 per participant-minute; Agora $0.99 per 1,000 HD participant-minutes; LiveKit Cloud $0.40–$2.00 per agent-hour. Break-even is around 8M participant-minutes per month: below that, an SDK is cheaper; above it, custom wins on per-minute economics. Add regulatory constraints and custom often wins below the break-even, because most SDKs do not contract flexibly around HIPAA, GDPR EU-only data residency, or FedRAMP.
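
The shape of a break-even calculation can be sketched as follows. This is a deliberately simplified model under stated assumptions: it counts only an amortized build cost plus a flat monthly operating cost on the custom side and a single per-minute rate on the SDK side, and it ignores engineering maintenance, support, and bandwidth overage. It is not an attempt to reproduce the ~8M figure above, which comes from the guide and not from this sketch; where the crossover lands depends heavily on which costs are counted. All names and the amortization period are hypothetical.

```python
def custom_monthly_cost(build_cost: float, amortization_months: int, monthly_ops: float) -> float:
    """Monthly cost of a custom stack: amortized build plus flat operations."""
    if amortization_months <= 0:
        raise ValueError("amortization_months must be positive")
    return build_cost / amortization_months + monthly_ops


def break_even_minutes(custom_monthly: float, sdk_rate_per_minute: float) -> float:
    """Participant-minutes per month at which SDK spend equals the custom monthly cost."""
    if sdk_rate_per_minute <= 0:
        raise ValueError("sdk_rate_per_minute must be positive")
    return custom_monthly / sdk_rate_per_minute
```

Running the model with your own quotes, including the cost lines it leaves out, is more useful than any single headline number.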

What is MoQ and how does it relate to WebRTC?

Media over QUIC (MoQ) is a delivery protocol, currently being standardized at the IETF, for sub-500 ms latency at one-to-many scale. It uses QUIC (the HTTP/3 transport) with publish / subscribe semantics. MoQ is not a WebRTC replacement for interactive group calls; it targets the gap between WebRTC (interactive, sub-300 ms, up to 10K participants) and HLS (broadcast, 6–30 seconds, millions of viewers). Cloudflare, Twitch, and Meta have shipped early MoQ deployments. For new builds expecting >100K simultaneous viewers, MoQ is the option to test in 2026.

Where this guide goes deeper

Connected guides and references.

Each piece below extends one slice of this guide: the LiveKit agent runtime, the multimodal cross-cluster, the translation architecture, the e-learning live-class layer, the commercial path to commissioning a build, or the deeper blog dive on P2P vs MCU vs SFU.

Have a specific architectural question?

Engineer-to-engineer review on the first call.

If you are scoping a real-time video or audio platform and want a second opinion on the topology, the SFU vendor, the latency budget, the compliance approach, or the build-vs-SDK threshold, write to us. A senior engineer with shipped WebRTC platforms in production replies within 24 hours.

+1 (914) 775-5855
New York · USA
Specialist software house for video, real-time and AI products. Founded 2005.
50 in-house engineers.