Zoom, Meet, Discord and Slack run WebRTC at planet scale because they spent years engineering everything around it: SFUs, edge POPs, TURN, congestion control, monitoring, recording and compliance.
This page is the reference architecture we use to ship the same kind of system in 8–16 weeks for SaaS, telehealth, e-learning, fintech and SMB platforms, distilled from 625+ real-time products and the 600M+ monthly call minutes our clients run on it.
Five decisions that separate a production WebRTC system from a working demo. Each is expanded in detail below.

A production WebRTC system is six independent subsystems wired together: clients, signaling, media servers, edge/network, recording/storage, and ML/analytics. Each layer has one job, one failure mode, and one scaling axis. Confuse the layers and you ship a demo, not a product.
Every WebRTC product we ship at Fora Soft (video calls for telehealth, virtual classrooms, fintech advisor calls, surveillance, broadcasts) reduces to the same six layers:

Why this matters in production: every outage we have ever debugged in a WebRTC system traced back to confusing two of these layers, whether that was putting state in signaling, putting ML in the live path, recording from the client instead of the server, or skipping TURN. Get the boundaries right and the rest is engineering.
The client (Chrome, Safari, the iOS/Android SDK, an Electron desktop app, or a native C++ implementation on hardware) is where camera frames and microphone samples become RTP packets. The client is also where the user can blame their laptop, so we engineer it for minimal latency and predictable CPU on every plausible device, not just the one in your pocket.
What the WebRTC client must do, in order, every 33 ms (30 fps) or 16 ms (60 fps):
Skip a step or pick the wrong codec for one device class and you ship a product that works perfectly on the founder's MacBook and burns CPU on a 2-year-old Android. We catch this in week one of the engagement, not after launch.
Background blur, RNNoise/Krisp noise suppression, virtual background and face tracking all run far better on-device than server-side. Each round-trip you avoid saves 20–80 ms and offloads server CPU, often the difference between a $400/month and a $4,000/month media-server bill at 1,000 concurrent users.
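To make the on-device idea concrete, here is a minimal sketch using Chrome's Insertable Streams API (MediaStreamTrackProcessor / MediaStreamTrackGenerator). The API is not available in every browser, blurBackground() is a placeholder for whatever WASM/WebGL model you actually ship, and TypeScript may need extra type declarations for these classes.

```typescript
// Sketch: run a video effect on-device, before the frame ever reaches the encoder.
// Assumes Chrome's Insertable Streams API; blurBackground() is a stand-in effect.
async function blurBackground(frame: VideoFrame): Promise<VideoFrame> {
  return frame; // placeholder: your WASM/WebGL segmentation model goes here
}

async function publishWithLocalEffect(pc: RTCPeerConnection) {
  const stream = await navigator.mediaDevices.getUserMedia({
    video: { width: 1280, height: 720, frameRate: 30 },
    audio: { echoCancellation: true, noiseSuppression: true },
  });
  const rawTrack = stream.getVideoTracks()[0];

  // Expose raw frames as a readable stream, transform them, re-emit a new track.
  const processor = new MediaStreamTrackProcessor({ track: rawTrack });
  const generator = new MediaStreamTrackGenerator({ kind: 'video' });

  processor.readable
    .pipeThrough(
      new TransformStream({
        async transform(frame: VideoFrame, controller) {
          controller.enqueue(await blurBackground(frame));
        },
      })
    )
    .pipeTo(generator.writable);

  // The processed track is what gets encoded and sent - no server round-trip involved.
  pc.addTrack(generator, stream);
  pc.addTrack(stream.getAudioTracks()[0], stream);
}
```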
Signaling is the WebRTC layer everyone underestimates. It exists to answer one question: how do these two endpoints meet and agree on what to send each other? Signaling never carries audio or video. It carries setup and room state, nothing else.
A production signaling server has four jobs:
Typical implementations: WebSocket on Node.js / Go / Elixir for greenfield, SIP/SIMPLE when you must integrate with PBX or telephony, MQTT or NATS when you also publish presence and chat over the same bus.
Push room state into Redis or a Postgres LISTEN/NOTIFY channel, never the signaling node's memory. Then you can blue-green deploy mid-call, autoscale on connection count, and survive an AZ outage with a 5–10-second reconnect rather than a dropped session.
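As an illustration of that pattern, here is a minimal sketch of a stateless signaling node: room membership lives in Redis and messages fan out over Redis pub/sub, so any node can pick up a client after a reconnect. The libraries (ws, ioredis), port and message shapes are assumptions for the sketch, not a prescribed protocol.

```typescript
// Sketch: stateless signaling node - room state in Redis, fan-out over pub/sub.
import { WebSocketServer, WebSocket } from 'ws';
import Redis from 'ioredis';

const pub = new Redis();          // shared room state + publishing
const sub = new Redis();          // dedicated connection for subscriptions
const wss = new WebSocketServer({ port: 8080 });
const localSockets = new Map<string, WebSocket>(); // peerId -> socket on THIS node only

sub.subscribe('signaling');
sub.on('message', (_channel, raw) => {
  const msg = JSON.parse(raw);              // { to, type, payload }
  localSockets.get(msg.to)?.send(raw);      // deliver only if that peer is on this node
});

wss.on('connection', (socket) => {
  let peerId = '';
  socket.on('message', async (data) => {
    const msg = JSON.parse(data.toString()); // { type, roomId, peerId, to?, payload? }
    if (msg.type === 'join') {
      peerId = msg.peerId;
      localSockets.set(peerId, socket);
      await pub.sadd(`room:${msg.roomId}`, peerId); // room state in Redis, not node RAM
      return;
    }
    // offers / answers / ICE candidates: publish and let whichever node holds `to` deliver
    await pub.publish('signaling', JSON.stringify(msg));
  });
  socket.on('close', () => localSockets.delete(peerId));
});
```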
Forwards encrypted RTP packets without decoding • Tiny CPU footprint per stream • Lowest possible end-to-end latency • Supports simulcast and SVC • Open-source: mediasoup, LiveKit, Janus, Jitsi Videobridge, Pion, ion-sfu • A single Hetzner AX52 (16-core, ~€80/mo) handles 1,000–1,500 concurrent forwarded streams
Reach for an SFU when:
Skip an SFU only when:
Decodes every incoming stream, mixes them server-side, encodes one composite back to each participant • 3–5× the CPU per call vs an SFU • Adds 30–80 ms of mixing latency • Open-source: Jitsi Videobridge in mixing mode, Janus with the AudioBridge plugin, FreeSWITCH • Commercial: Pexip, Vidyo, Cisco • Costs jump to ~€0.50/concurrent-participant/hour at scale
Reach for an MCU when:
Avoid an MCU when:
MCUs simplify client logic by handing every participant a single pre-mixed stream. They pay for it with 3–5× the server cost per participant and an extra 30–80 ms of latency. Across 600M+ monthly call minutes our clients run on this stack, an SFU has been the right answer 9 times out of 10.
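One concrete reason the SFU wins: the client publishes simulcast layers and the server picks which rung to forward to each receiver, with no transcoding. A minimal sketch with the standard WebRTC API follows; the rid names and bitrates are illustrative, and in practice SFU SDKs (mediasoup, LiveKit and the rest) wrap this negotiation for you.

```typescript
// Sketch: publish three simulcast rungs so the SFU can forward a cheaper layer
// to constrained receivers instead of transcoding. Values are illustrative.
async function publishSimulcast(pc: RTCPeerConnection, stream: MediaStream) {
  const videoTrack = stream.getVideoTracks()[0];
  pc.addTransceiver(videoTrack, {
    direction: 'sendonly',
    streams: [stream],
    sendEncodings: [
      { rid: 'q', scaleResolutionDownBy: 4, maxBitrate: 150_000 },  // thumbnail grid
      { rid: 'h', scaleResolutionDownBy: 2, maxBitrate: 500_000 },  // small tile
      { rid: 'f', maxBitrate: 1_500_000 },                          // active speaker
    ],
  });
  pc.addTransceiver(stream.getAudioTracks()[0], { direction: 'sendonly', streams: [stream] });
}
```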
Picking the wrong media-server pattern is the single most expensive WebRTC mistake we are hired to clean up.
Codec choice tweaks 5–20 ms. Geography decides the other 70–200 ms. The edge layer's job is to keep every user within ~50 ms of an SFU and route their traffic around the broken half of the internet:
Round-trip latency above ~150 ms makes conversation feel awkward; above 250 ms it becomes "half-duplex." The cheapest way to fix it is a closer POP, not a smarter codec. We typically launch products in 3–4 regions, then expand based on web-analytics geography.
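On the client, the edge layer mostly shows up as ICE configuration. A minimal sketch with hypothetical hostnames: point the client at its closest regional TURN POP (GeoDNS or a latency probe picks it) and always include a turns: entry on port 443 so users behind strict firewalls still connect.

```typescript
// Sketch: regional TURN wiring on the client. Hostnames are hypothetical;
// credentials should be short-lived and issued per session by your API.
const pc = new RTCPeerConnection({
  iceServers: [
    { urls: 'stun:stun.example.com:3478' },
    {
      urls: [
        'turn:turn-eu-west.example.com:3478?transport=udp',  // lowest latency when UDP works
        'turn:turn-eu-west.example.com:3478?transport=tcp',
        'turns:turn-eu-west.example.com:443?transport=tcp',  // last resort through port 443
      ],
      username: 'short-lived-user',
      credential: 'short-lived-password',
    },
  ],
  iceTransportPolicy: 'all', // let ICE prefer direct/STUN paths, TURN only as fallback
});
```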
Recording is required in almost every B2B WebRTC product we ship: telehealth visit notes, classroom replays, fintech compliance archives, courtroom evidence. Even when end users "never replay calls," the regulator does.
A safe recording pipeline always has these four stages:
Audit logs, immutable storage and retention enforcement are 10× more expensive to retro-fit than to design in. Build the legal/compliance flow on day one; layer transcription, summaries and search on top later.
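As one hedged example of designing retention in from day one, here is a sketch of the storage step using the AWS SDK v3: each recorded segment is written with a per-tenant KMS key and an S3 Object Lock retention date, so immutability is a property of the write itself. Bucket naming, key aliases and the 7-year period are assumptions, and Object Lock must already be enabled on the bucket.

```typescript
// Sketch: the "store" stage of the recording pipeline, with encryption and
// retention as parameters of the write rather than a later cleanup job.
import { readFile } from 'node:fs/promises';
import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';

const s3 = new S3Client({ region: 'eu-central-1' });

async function storeSegment(tenantId: string, callId: string, localPath: string) {
  const retainUntil = new Date(Date.now() + 7 * 365 * 24 * 60 * 60 * 1000); // e.g. 7-year policy

  await s3.send(new PutObjectCommand({
    Bucket: `recordings-${tenantId}`,            // one bucket (or prefix) per tenant
    Key: `calls/${callId}/${Date.now()}.webm`,
    Body: await readFile(localPath),
    ServerSideEncryption: 'aws:kms',
    SSEKMSKeyId: `alias/recordings-${tenantId}`, // per-tenant KMS key
    ObjectLockMode: 'COMPLIANCE',                // immutable until the retention date
    ObjectLockRetainUntilDate: retainUntil,
  }));
}
```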
Machine learning adds intelligence to a real-time stack, and it adds latency, GPU cost and complexity if it lands in the wrong tier. Three tiers, picked deliberately, almost always beat one big monolithic AI service:
Where each model belongs:
Put ML in the live RTP path only when:
Keep ML out of the live path when:

If a feature does not need real-time feedback, push it onto a side bus (Kafka, NATS, SQS) that consumes a copy of the SFU stream. Same models, same outputs, just decoupled from the call-quality budget.
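A minimal sketch of that side-bus worker, assuming Kafka via kafkajs and a transcribe() placeholder for Whisper/Deepgram; the topic name and payload format are illustrative. The point is structural: the worker consumes a copy of the media, so a slow or crashed model never touches the live call.

```typescript
// Sketch: async ML worker on the side bus - transcribes audio chunks the
// SFU/egress service has already published to Kafka, off the live RTP path.
import { Kafka } from 'kafkajs';

const kafka = new Kafka({ clientId: 'ml-worker', brokers: ['kafka:9092'] });
const consumer = kafka.consumer({ groupId: 'transcription' });

async function transcribe(audio: Buffer): Promise<string> {
  // placeholder for Whisper / Deepgram / etc. - can take seconds without hurting the call
  return '';
}

async function run() {
  await consumer.connect();
  await consumer.subscribe({ topic: 'call-audio-chunks' });
  await consumer.run({
    eachMessage: async ({ message }) => {
      const callId = message.key?.toString();
      const text = await transcribe(message.value ?? Buffer.alloc(0));
      // write to your transcript store / search index here
      console.log(callId, text.slice(0, 80));
    },
  });
}

run().catch(console.error);
```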
Every video frame travels through 7 stages between camera and far-side eyeball. "Fix the latency" only means anything if you know which stage is bleeding milliseconds:
Realistic budgets at each layer (mouth-to-ear / glass-to-glass):
Healthy end-to-end target: ≤ 300 ms for conversation, ≤ 150 ms for music collaboration or voice gaming, ≤ 50 ms for surveillance/control loops (those usually need MoQ/QUIC, not stock WebRTC).
Run synthetic calls every 60 seconds from each region, record real-user MOS / round-trip / jitter via getStats(), feed it into Prometheus + Grafana. Server logs alone will hide the worst sessions: the ones your loudest users complain about.
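A minimal sketch of the client half of that loop: sample getStats() on an interval, extract round-trip time, jitter and packet loss, and beacon them to a metrics endpoint your Prometheus pipeline ingests. The /metrics URL and payload shape are assumptions; MOS is typically derived server-side from these raw values.

```typescript
// Sketch: periodic WebRTC stats sampling shipped to a metrics endpoint.
function reportCallStats(pc: RTCPeerConnection, sessionId: string) {
  setInterval(async () => {
    const stats = await pc.getStats();
    const sample: Record<string, number> = {};

    stats.forEach((report) => {
      if (report.type === 'remote-inbound-rtp' && report.kind === 'video') {
        sample.rttMs = (report.roundTripTime ?? 0) * 1000;   // seconds -> ms
        sample.remoteJitterMs = (report.jitter ?? 0) * 1000;
      }
      if (report.type === 'inbound-rtp' && report.kind === 'video') {
        sample.packetsLost = report.packetsLost ?? 0;
        sample.jitterMs = (report.jitter ?? 0) * 1000;
      }
    });

    // fire-and-forget; never let telemetry block the call
    navigator.sendBeacon('/metrics', JSON.stringify({ sessionId, ts: Date.now(), ...sample }));
  }, 5000);
}
```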
Almost every B2B WebRTC product we ship handles regulated data: PHI in telehealth, FERPA in classrooms, MNPI in financial advisory, evidence chain-of-custody in legal-tech. Compliance is an architecture decision (SRTP, recording retention, regional data residency), not a checklist done after launch.
Five non-negotiable controls in every production WebRTC stack:
Frameworks our clients have shipped against: HIPAA + HITRUST (Nucleus, telehealth), GDPR (every EU client), SOC 2 Type II (every B2B SaaS), FERPA (BrainCert and other LMS), PCI-DSS (some fintech voice products). When the auditor asks, we hand them an architecture diagram, not a promise.

Using S3 with default keys, hard-coding the region, recording on the client, mixing tenants in one bucket: each is a half-day decision in week one and a 6-week migration after the SOC 2 auditor finds it. Build it the right way once.
These four habits separate a launchable WebRTC product from a working demo. They are non-negotiable on every Fora Soft engagement: we set them up in the first sprint, before any feature work begins.
Drop 5% of packets, kill an SFU mid-call, sever a region, lose Wi-Fi for 8 seconds, swap from cellular to Wi-Fi mid-handshake. The first time you see those scenarios should not be a Sunday-night incident with paying users on the call.
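Part of surviving those drills is client-side: the peer connection has to notice a dead path and restart ICE instead of sitting in "disconnected" forever. A minimal sketch of that handling follows; the 3-second grace period is a typical value rather than a rule, and sendOffer() stands in for your signaling channel.

```typescript
// Sketch: recover from Wi-Fi drops and cellular handovers with an ICE restart.
function attachReconnectLogic(
  pc: RTCPeerConnection,
  sendOffer: (sdp: RTCSessionDescriptionInit) => void
) {
  let disconnectTimer: ReturnType<typeof setTimeout> | undefined;

  pc.oniceconnectionstatechange = () => {
    if (pc.iceConnectionState === 'disconnected') {
      // give the browser ~3 s to recover on its own (brief Wi-Fi blips usually do)
      disconnectTimer = setTimeout(() => pc.restartIce(), 3000);
    }
    if (pc.iceConnectionState === 'connected' || pc.iceConnectionState === 'completed') {
      clearTimeout(disconnectTimer);
    }
    if (pc.iceConnectionState === 'failed') {
      pc.restartIce(); // immediate restart: gather fresh candidates on the new network
    }
  };

  // restartIce() flags renegotiation; the new offer goes out over signaling as usual
  pc.onnegotiationneeded = async () => {
    const offer = await pc.createOffer();
    await pc.setLocalDescription(offer);
    sendOffer(offer);
  };
}
```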
These six failures account for ~80% of the WebRTC outages we are hired to clean up. Most are systemic, not bugs; they show up the first time real users hit the system at real scale.
Six failure modes that bite in production:
Why these slip through staging:
When something breaks, look at boundaries first: which layer, which region, which device class. We have never seen a WebRTC outage that survived 24 hours of architecture-first investigation.
Startup
MVP: 1-on-1 video / audio call, signaling, basic SFU on a single Hetzner box, recording to S3, simple admin. Right for validating an idea or replacing Zoom inside a niche product.
from
$15,000
from 4–6 weeks
Growth
Multi-party SaaS: multi-region SFU, TURN, recording, transcription, mobile SDKs, RBAC, SOC 2-ready logging, custom UI on web + iOS + Android.
from
$45,000
from 8–12 weeks
Enterprise
Enterprise: HIPAA / GDPR / SOC 2 hardening, on-premise or VPC, AI hooks (live captions, moderation, summaries), white-label SDKs, 99.9%+ SLO, on-call support.
from
$90,000
from 12–20 weeks
Ready for a realistic timeline and cost breakdown tailored to your WebRTC system needs? We offer a free SRS and a code audit for existing projects.
625+ real-time products shipped: BrainCert virtual classroom (500M+ minutes, $3M ARR), Nucleus secure SMB platform (600M+ call minutes / month), Tradecaster trader video community, V.A.L.T. police video. Production WebRTC, not slide-deck WebRTC.
Every engagement starts with a free SRS: architecture diagram, data-flow, six layers mapped to your tech stack, scaling plan, compliance plan. You see how signaling, media, storage and ML fit together before we write a single line.
Latency budgets are written in week one (sub-300 ms for conversation, sub-150 ms for music or trading floors). Edge POPs, codec choice, simulcast layers and jitter buffers are tuned to hit them, not measured after launch.
We design, customize and scale SFU clusters for multi-party calls, broadcast, recording, AI hooks and SIP gateways. Comfortable in JavaScript, TypeScript, Go, Rust, C++, Swift, Kotlin: whatever the SFU and your stack need.
Encryption, retention, audit logs and access control are part of the day-one design. We have shipped HIPAA + HITRUST (Nucleus), SOC 2 Type II, GDPR (every EU client), FERPA (BrainCert) and PCI-DSS; the auditor sees diagrams, not promises.
5% packet loss, region outages, mid-call Wi-Fi-to-cellular handover, SFU crash, TURN failure: we chaos-test every release against these scenarios. WebRTC must fail gracefully or it does not get to production.
Real-time video and audio, latency budgets, scaling, compliance: short answers from the team that has shipped 625+ WebRTC products.
Real-time video and audio inside SaaS, telehealth, virtual classrooms, fintech advisory, customer support, gaming voice, surveillance and broadcast. Anywhere you need sub-300 ms latency between human participants on commodity browsers and mobile devices.
Only if you stay 1-on-1 forever and need no recording. Past 4 participants the upload bandwidth and CPU on the client implode. Production systems use an SFU plus TURN fallback plus regional edge POPs, always.
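A rough illustration (assuming ~1.5 Mbps per 720p stream): in a 6-person mesh each client uploads five separate copies of its own video, roughly 7.5 Mbps of sustained uplink plus five parallel encodes, while behind an SFU the same client uploads a single 1.5 Mbps stream and the server handles the fan-out.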
Defaults: LiveKit when the team wants a managed-feeling SDK and built-in egress; mediasoup when we need maximum control over signaling and topology; Janus when SIP integration is required; Pion when the rest of the stack is Go. We pick on day one based on call patterns, not vendor preference.
Place SFUs in 3–4 regional POPs so every user is within ~50 ms; tune simulcast and codec choice per device tier; keep ML out of the live RTP path; instrument MOS / RTT / jitter via getStats() and alert on the worst 1% of sessions, not the average.
Yes: we have shipped HIPAA + HITRUST (Nucleus), SOC 2 Type II, GDPR (every EU client), FERPA (BrainCert) and PCI-DSS. SRTP/DTLS, regional data residency, append-only audit logs, KMS-encrypted recordings and per-tenant retention are designed in on day one, not retro-fitted.
Yes. Server-side recording from the SFU into S3 / R2 / B2; live captions via streaming Whisper or Deepgram; async post-processing for summaries, moderation, sentiment and embedding-based search across recorded calls.
Six patterns cover ~80% of incidents: latency spikes from one overloaded POP, audio/lipsync drift, TURN over-use (>50%), client CPU saturation from ML, recording silently failing, compliance gaps surfaced at audit. We chaos-test every release against all six.