Scalable video conferencing architecture handling thousands of concurrent users and high-quality streams

Key takeaways

Architecture matters. SFU handles 100+ concurrent participants cost-effectively; P2P collapses at 6+; MCU scales to 50 but burns CPU.

WebRTC alone isn’t enough. Add STUN/TURN for NAT, signaling servers for orchestration, and jitter buffers for mobile networks.

CPaaS vs custom. Agora, Daily, and 100ms charge under $0.02/min; a custom SFU costs about $0.005/min at scale but requires an 18–24 month engineering investment.

Scaling to 1M+. Requires multi-region deployment, auto-scaling SFU clusters, geographic redundancy, and sub-150ms RTT across all regions.

Security + compliance. DTLS-SRTP encrypts media; add HIPAA/GDPR/SOC 2 audit trails, consent logging, and data residency rules for regulated verticals.

Why Fora Soft Wrote This Guide

We’ve built and scaled video conferencing platforms for 10+ years. Our engineering teams have shipped:

  • BrainCert (100K+ concurrent users, 500M+ minutes recorded, 10 datacenters, 4 Brandon Hall Awards)
  • ProVideoMeeting (WebRTC + SIP bridging + e-signature integration)
  • CirrusMED (HIPAA-compliant telemedicine, 1,500+ active patients, encrypted recording)

This guide distills 500+ architecture decisions, scaling failures, codec negotiation bugs, and regulatory audits into actionable patterns. Whether you’re deciding between Agora and a custom SFU, or debugging jitter at 50K concurrent users, this is the reference you need.

Evaluate Video Conferencing for Your Stack

1-on-1 architecture review: we assess your scale, latency targets, budget, and compliance needs. 30 minutes, no sales pitch.


What Makes Video Conferencing Scalable

Scalability in video conferencing is measured across three orthogonal dimensions:

1. Concurrent Participants in a Single Call

Small deployments (1K users) can manage with P2P or Mesh topologies. Medium scale (100K users) requires a Selective Forwarding Unit (SFU). Enterprise scale (1M+ users) demands geo-distributed SFU clusters with sub-150ms round-trip time across all regions and auto-scaling.

2. Concurrent Calls Across the System

If your platform runs 10,000 simultaneous calls with 8 participants each, you have 80,000 participants publishing media at once, and an SFU forwards each published stream to up to 7 subscribers. That’s a different problem than a single 100K-person all-hands. You must plan for connection pools, memory per stream, and CPU per SFU instance.
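The arithmetic can be sketched as follows, assuming an SFU topology in which every participant publishes one stream and subscribes to everyone else in the call. The function name is illustrative, not part of any library.

```python
from typing import Tuple


def sfu_stream_counts(calls: int, participants_per_call: int) -> Tuple[int, int]:
    """Return (inbound, outbound) stream counts summed over all SFUs.

    Assumption: each participant publishes one stream and receives one
    stream from every other participant in the same call.
    """
    inbound = calls * participants_per_call
    outbound = calls * participants_per_call * (participants_per_call - 1)
    return inbound, outbound


inbound, outbound = sfu_stream_counts(10_000, 8)
print(inbound, outbound)  # 80000 560000
```

The outbound figure is usually the one that drives SFU egress bandwidth and CPU, so it is worth tracking separately from the publisher count.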

3. Geographic Distribution

A user in Mumbai connecting to an SFU in Virginia will see 300ms+ RTT. Add codec delay, jitter buffer, and network loss: a 2-second stall is common. Deploy regional SFU clusters in US-East, US-West, EU, APAC, and LATAM. Use anycast DNS or a geolocation API to route each user to the nearest SFU.
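A minimal sketch of region selection. Instead of a geolocation lookup it assumes the client has already probed each region and reports round-trip times; the region names, probe values, and the 150ms budget default are illustrative.

```python
from typing import Dict, Tuple


def pick_region(rtt_ms: Dict[str, float], max_rtt_ms: float = 150.0) -> Tuple[str, bool]:
    """Pick the region with the lowest measured RTT.

    Returns (region, within_budget). within_budget is False when even the
    best region exceeds max_rtt_ms, which is a signal to add a closer region.
    """
    if not rtt_ms:
        raise ValueError("no RTT measurements supplied")
    region = min(rtt_ms, key=lambda name: rtt_ms[name])
    return region, rtt_ms[region] <= max_rtt_ms


# Hypothetical probe results for a client in Mumbai.
probes = {"us-east": 310.0, "eu": 140.0, "apac": 45.0}
print(pick_region(probes))  # ('apac', True)
```

Measured RTT reflects the real network path, which a geographic lookup can only approximate, at the cost of a short probing step before the call starts.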

WebRTC Foundation: Building Blocks

Media Pipeline

Capture audio/video from device → encode with H.264/VP8/VP9 → split into RTP packets → encrypt with DTLS-SRTP → send via UDP to SFU. The SFU decrypts each packet, selects which streams and layers each subscriber needs, and forwards them re-encrypted without decoding or re-encoding the media. Total latency target: <200ms end-to-end. Mobile networks add 50–100ms due to jitter buffers.

Signaling

Signaling servers (Node.js, Go, or custom) handle call initiation, offer/answer exchange, ICE candidate exchange, and call state. Use WebSocket for low-latency signaling. Never send media through signaling (that’s the SFU’s job). Typical signaling message: 1–5 KB, 50–200ms round-trip.

STUN, TURN, and ICE

STUN (Session Traversal Utilities for NAT) discovers your public IP from behind a NAT. TURN (Traversal Using Relays around NAT) relays media when direct P2P fails. ICE (Interactive Connectivity Establishment) chooses the best path: direct > STUN > TURN. Budget 3–5 TURN servers per region; each can relay 50K–100K concurrent flows at $0.30–0.50 per 1 GB egress.
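The "direct > STUN > TURN" ordering falls out of the candidate priority formula in RFC 8445. The sketch below uses that RFC’s recommended type preferences; real ICE agents go on to rank candidate pairs, which is omitted here.

```python
# Type preferences recommended by RFC 8445 (higher is preferred).
TYPE_PREFERENCE = {"host": 126, "prflx": 110, "srflx": 100, "relay": 0}


def candidate_priority(cand_type: str, local_pref: int = 65535, component_id: int = 1) -> int:
    """priority = 2^24 * type_pref + 2^8 * local_pref + (256 - component_id)."""
    return (TYPE_PREFERENCE[cand_type] << 24) + (local_pref << 8) + (256 - component_id)


# host = direct, srflx = discovered via STUN, relay = via TURN.
ranked = sorted(["relay", "host", "srflx"], key=candidate_priority, reverse=True)
print(ranked)                      # ['host', 'srflx', 'relay']
print(candidate_priority("host"))  # 2130706431
```

These are the same priority values that appear in the `a=candidate` lines of an SDP, which helps when reading logs of failed candidate pairs.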

Codec Negotiation

H.264 (hardware support, patent overhead), VP8 (royalty-free, limited hardware support), VP9 (higher quality, 2x CPU), and Opus (audio codec) are standard. Offer H.264 + VP8 for maximum compatibility. Opus bitrate: 6–128 kbps for voice. Video bitrate: 500 kbps (360p) to 5 Mbps (1080p) depending on codec and loss.

Architecture Choices: P2P, Mesh, SFU, MCU

P2P (Peer-to-Peer)

Pros: Zero server cost, lowest latency (<50ms), full-quality media streams. Cons: Collapses at 6+ participants; each peer uploads N–1 streams, consuming 5–10 Mbps per person. NAT traversal is fragile. Best for 1–4 person calls (pair programming, one-on-one sales calls).

Mesh

Pros: Low latency, high quality. Cons: Every peer sends its stream directly to every other peer, so each peer uploads N–1 streams and total bandwidth scales as O(N²). Viable for 4–8 participants with strong networks. Used by some team collaboration tools for small groups.

SFU (Selective Forwarding Unit)

Pros: Handles 100+ participants, O(N) bandwidth, cheap to run ($0.005–0.01 per minute). Cons: 20–100ms added latency, requires reliable signaling. Standard for Zoom, Discord, Google Meet, Jitsi. SFU options: LiveKit (Go, easy to deploy), Mediasoup (Node.js, powerful), Janus (C, modular), ion-sfu (Go), Pion (Go, simple).

MCU (Multipoint Control Unit)

Pros: One video stream per call (composite grid), low bandwidth. Cons: Heavy CPU (re-encoding), 100–200ms latency, cost per call. Scales to 20–50 participants. Used in legacy systems (some PBX integrations) and HIPAA-grade setups where quality is paramount.

Quick Comparison: Architecture at a Glance

| Architecture | Max Participants | Bandwidth per Person | Server Cost | Latency | Best For |
|---|---|---|---|---|---|
| P2P | 4 | 5–10 Mbps | $0 | <50ms | One-on-one, pair coding |
| Mesh | 8 | 2–5 Mbps | $0 | <100ms | Small team sync, strong networks |
| SFU | 100+ | 0.5–3 Mbps | $0.005–0.02/min | 50–150ms | Webinars, training, all-hands |
| MCU | 50 | 0.3–1 Mbps | $0.05–0.1/min | 100–200ms | HIPAA/regulated, legacy interop |
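
The bandwidth differences come from how many streams each peer sends and receives. A minimal sketch, assuming one published stream per participant and no simulcast; the function and topology labels are illustrative.

```python
from typing import Tuple


def per_peer_streams(topology: str, n: int) -> Tuple[int, int]:
    """Return (upload_streams, download_streams) for one peer in an n-person call."""
    if n < 2:
        return (0, 0)
    if topology in ("p2p", "mesh"):
        return (n - 1, n - 1)   # a copy to, and a stream from, every other peer
    if topology == "sfu":
        return (1, n - 1)       # upload once; the SFU fans it out
    if topology == "mcu":
        return (1, 1)           # upload once; receive one composite stream
    raise ValueError(f"unknown topology: {topology}")


for topology in ("mesh", "sfu", "mcu"):
    print(topology, per_peer_streams(topology, 8))
# mesh (7, 7)
# sfu (1, 7)
# mcu (1, 1)
```

Upload is usually the scarcer direction on consumer connections, which is why the SFU row scales where mesh does not.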

CPaaS vs. Custom WebRTC: Decision Framework

CPaaS (Communications Platform as a Service)

Pricing: Agora ($0.0099–0.0199/min), Daily.co ($0.0083–0.0199), 100ms ($0.003–0.01), Dyte ($0.0033–0.0099), Twilio ($0.0085). Pros: No infrastructure cost, global SFU clusters included, compliance templates (HIPAA, GDPR), 24/7 support. Cons: Vendor lock-in, limited customization, billing unpredictability at scale.

Custom WebRTC (SFU)

Cost: $0.005–0.01/min at scale (self-hosted) or $0.01–0.05 with managed SFU (LiveKit Cloud, Mediasoup managed). Pros: Full control, custom UI/UX, integration with your backend, compliance on your terms. Cons: 18–24 month engineering investment, on-call ops, security audits, TURN infrastructure.
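
A rough break-even sketch for the CPaaS versus custom decision. The per-minute rates come from the ranges above; the fixed monthly engineering and ops cost is a placeholder assumption you would replace with your own figure.

```python
def break_even_minutes(fixed_monthly_usd: float, cpaas_per_min: float, custom_per_min: float) -> float:
    """Monthly participant-minutes at which a custom SFU matches CPaaS cost."""
    if cpaas_per_min <= custom_per_min:
        raise ValueError("custom never breaks even at these rates")
    return fixed_monthly_usd / (cpaas_per_min - custom_per_min)


# Assumption: $100K/month for the team and on-call ops (hypothetical).
minutes = break_even_minutes(100_000, cpaas_per_min=0.01, custom_per_min=0.005)
print(f"{minutes:,.0f} minutes/month")  # 20,000,000 minutes/month
```

Below that volume the CPaaS bill is smaller than the fixed cost of running your own stack; above it, the per-minute saving starts to pay for the team.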

Hybrid: CPaaS for Calls, Custom for Recordings

Use Agora/Daily for real-time calls (CDN + SFU). Record to your server and post-process (transcription, diarization, object detection). Best for e-learning, telemedicine, legal firms needing control over compliance and long-term storage.

Build vs. Buy Analysis

Let’s compare cost, timeline, and risk for your specific scale. Talk to our architects.


Signaling Servers: Orchestrating Calls

Your signaling server is the brain. It handles:

  • Room/call creation: Allocate a room ID, assign SFU region, set participant limits, track metadata.
  • Offer/answer exchange: Broker SDP (Session Description Protocol) between peers. Use WebSocket for <200ms latency.
  • ICE candidates: Collect STUN/TURN candidates from each client, forward to peers so they find the best network path.
  • Call state transitions: Track who’s joining/leaving, enforce bandwidth limits, trigger recording, log call duration for billing.
  • Authentication: JWT or OAuth2 to confirm the user is allowed in this room. Prevent session hijacking.

Stack choices: Node.js (Socket.io, Express), Go (gRPC, Gin), Rust (Actix, Tokio). Budget 50–100ms for offer/answer round-trip. Cache SFU endpoints in Redis; don’t query DNS on every call.
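
The routing core of a signaling server can be sketched without any network code. The example below is a hypothetical in-memory hub: each peer registers a `send` callback (in practice a WebSocket write), and the hub relays offers, answers, and ICE candidates to the other peers in the room. Authentication and persistence are left out.

```python
from typing import Callable, Dict, Optional

Send = Callable[[dict], None]


class SignalingHub:
    """Relays SDP offers/answers and ICE candidates between peers in a room."""

    RELAYED_TYPES = {"offer", "answer", "candidate"}

    def __init__(self) -> None:
        self.rooms: Dict[str, Dict[str, Send]] = {}

    def join(self, room: str, peer_id: str, send: Send) -> None:
        peers = self.rooms.setdefault(room, {})
        # Notify peers already in the room, then register the newcomer.
        for other_send in peers.values():
            other_send({"type": "peer-joined", "from": peer_id})
        peers[peer_id] = send

    def leave(self, room: str, peer_id: str) -> None:
        peers = self.rooms.get(room, {})
        if peers.pop(peer_id, None) is None:
            return
        for other_send in peers.values():
            other_send({"type": "peer-left", "from": peer_id})
        if not peers:
            self.rooms.pop(room, None)

    def relay(self, room: str, sender: str, message: dict) -> int:
        """Forward a message to its target, or to all other peers. Returns delivery count."""
        peers = self.rooms.get(room, {})
        if sender not in peers:
            raise PermissionError("sender has not joined this room")
        if message.get("type") not in self.RELAYED_TYPES:
            raise ValueError("unsupported message type")
        target: Optional[str] = message.get("to")
        outgoing = {**message, "from": sender}
        delivered = 0
        for peer_id, send in peers.items():
            if peer_id != sender and target in (None, peer_id):
                send(outgoing)
                delivered += 1
        return delivered


hub = SignalingHub()
inbox_alice, inbox_bob = [], []
hub.join("room-1", "alice", inbox_alice.append)
hub.join("room-1", "bob", inbox_bob.append)
hub.relay("room-1", "alice", {"type": "offer", "sdp": "v=0"})
print(inbox_bob[-1]["type"], inbox_bob[-1]["from"])  # offer alice
```

Note that only small JSON messages pass through the hub; media never does.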

SFU Options: LiveKit, Mediasoup, Janus, ion-sfu, Pion

LiveKit

Language: Go. Strengths: Docker-ready, built-in recording to S3, Kubernetes operators, agent mode for AI. Cost: $0.005–0.01/min self-hosted; managed $0.015/min. Best for: Rapid deployment, startups, AI integration.

Mediasoup

Language: C++ (SFU) + Node.js (API). Strengths: Fine-grained control, producer/consumer model, bitrate adaptation, powerful. Complexity: Steep learning curve. Best for: Custom video apps, live streaming, heavy customization.

Janus

Language: C. Strengths: Modular plugins (videoroom, echo, streaming), lightweight, sub-100ms latency. Weakness: Less active community. Best for: Real-time communication frameworks, IoT, embedded.

ion-sfu

Language: Go. Strengths: Simple, fast, scalable. Drawback: Smaller ecosystem. Best for: Greenfield projects, teams comfortable with Go.

Pion

Language: Go (pure implementation). Strengths: Transparent, no C/C++ dependencies, good for learning. Weakness: Lower performance than C-based alternatives. Best for: Education, prototypes, Go shops.

TURN and STUN: Enabling NAT Traversal

Most corporate firewalls block direct P2P. Your signaling server must advertise STUN/TURN servers so clients can discover their public IP (STUN) and relay media if direct connection fails (TURN).

Budget per region:

  • 3–5 TURN servers (t2.medium, 2 vCPU, 4 GB RAM) = $80–150/month per server.
  • Egress traffic: $0.30–0.50 per GB (AWS, GCP, Azure). Each TURN server handles 50K–100K concurrent flows.
  • Use coturn (open-source) or commercial options (Twilio, Xirsys, Netmanaged).
  • Monitor: CPU, connection count, bandwidth. If average RTT > 50ms or packet loss > 2%, add TURN servers or expand capacity.

ICE troubleshooting: If a client gets stuck in "checking" state, it’s failing all candidate pairs. Common causes: TURN server down, firewall blocking port 3478 (UDP/TCP) or 5349 (TLS), or signaling server not providing credentials. Log every failed candidate pair; set up alerts.
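
One common way for the signaling server to provide credentials is the time-limited scheme that coturn supports through its `use-auth-secret` and `static-auth-secret` options: the username carries an expiry timestamp and the password is an HMAC-SHA1 of that username. A minimal sketch, with the secret and user ID as placeholders:

```python
import base64
import hashlib
import hmac
import time
from typing import Dict, Optional


def turn_credentials(shared_secret: str, user_id: str, ttl_seconds: int = 3600,
                     now: Optional[float] = None) -> Dict[str, str]:
    """Build short-lived TURN credentials for one client.

    username   = "<unix expiry>:<user id>"
    credential = base64(HMAC-SHA1(shared_secret, username))
    """
    current = time.time() if now is None else now
    expiry = int(current + ttl_seconds)
    username = f"{expiry}:{user_id}"
    digest = hmac.new(shared_secret.encode(), username.encode(), hashlib.sha1).digest()
    return {"username": username, "credential": base64.b64encode(digest).decode()}


# The secret must match the TURN server configuration (placeholder value here).
creds = turn_credentials("replace-with-shared-secret", "alice", ttl_seconds=600)
print(creds["username"].split(":")[1])  # alice
```

Because the TURN server recomputes the HMAC itself, no per-user state has to be shared between the signaling tier and the TURN tier.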

Recording and Transcription at Scale

Recording is a separate pipeline. Don’t mix it with real-time SFU logic.

  • Architecture: SFU sends raw RTP to a recording daemon (FFmpeg, Jitsi Videobridge JVB recorder, or custom Go binary). Record to local disk, upload to S3/GCS asynchronously.
  • Bitrate: 1–5 Mbps per call depending on resolution and codec. 100 concurrent calls = 100–500 Mbps, or roughly 1–5 TB/day if sustained around the clock.
  • Storage cost: ~$0.023/GB/month (S3 Standard). Retaining 30 days of recordings at 10 TB/day (300 TB) = ~$7K/month.
  • Transcription: Use Google Speech-to-Text ($1.44/hour), Azure Cognitive ($1/hour), or open-source Whisper (GPU cost). Latency: real-time (Google API) or batch (Whisper).
  • Diarization: Identify speaker changes. Whisper does not label speakers on its own; pair it with pyannote, or use a service with built-in diarization such as AWS Transcribe.
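
Recording volume and storage cost are easy to misjudge by an order of magnitude, so it helps to script the conversion. A small sketch using decimal units (1 TB = 10^12 bytes) and the S3 Standard price quoted above; the example inputs are illustrative.

```python
def tb_per_day(total_mbps: float) -> float:
    """Terabytes written per day at a sustained aggregate bitrate in Mbps."""
    bytes_per_second = total_mbps * 1_000_000 / 8
    return bytes_per_second * 86_400 / 1e12


def monthly_storage_cost(stored_tb: float, usd_per_gb_month: float = 0.023) -> float:
    """Monthly cost in USD of keeping stored_tb terabytes in object storage."""
    return stored_tb * 1_000 * usd_per_gb_month


print(round(tb_per_day(1_000), 2))       # 10.8  (a sustained 1 Gbps)
print(round(monthly_storage_cost(300)))  # 6900  (30 days retained at 10 TB/day)
```

Retention policy matters as much as bitrate: the bill tracks how much is kept, not how much is written per day.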

Security and Compliance Frameworks

  1. Media encryption: DTLS-SRTP (standard). Key exchange via DTLS handshake. Rotate keys per call. Verify DTLS fingerprint via signaling to prevent MITM.
  2. HIPAA compliance: Audit trails (who, when, duration, data size), encrypted recordings, consent logs, data residency (US only), penetration testing annually, BAA with third parties.
  3. GDPR compliance: Right to be forgotten (delete recordings within 30 days), data residency (EU only or cloud region), consent management, privacy impact assessment, DPA with processors.
  4. SOC 2 Type II: Annual audit, change control, incident response, access logs, encryption standards, vendor assessment.
  5. Signaling security: TLS 1.3 for all signaling, JWT with <5 min expiry, rate-limiting (100 req/sec per IP), CSRF tokens for web apps, CSP headers to prevent XSS.
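
The per-IP rate limit in item 5 can be sketched as a token bucket. This is an in-process illustration only; a multi-instance signaling tier would need shared state (for example in Redis), and the class names are hypothetical.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill for the elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class PerKeyLimiter:
    """One bucket per key, e.g. per client IP or per user ID."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate, self.capacity, self.clock = rate, capacity, clock
        self.buckets: Dict[str, TokenBucket] = {}

    def allow(self, key: str) -> bool:
        bucket = self.buckets.get(key)
        if bucket is None:
            bucket = self.buckets[key] = TokenBucket(self.rate, self.capacity, self.clock)
        return bucket.allow()


limiter = PerKeyLimiter(rate=100, capacity=100)  # 100 req/sec per IP
accepted = sum(limiter.allow("203.0.113.7") for _ in range(10))
print(accepted)  # 10
```

In a long-running server the bucket map also needs eviction of idle keys, which is omitted here for brevity.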

Scaling Patterns for 1M+ Concurrent Users

Real deployments:

  • Zoom: reportedly 19 datacenters, 50K+ SFU instances, geographic load balancer (anycast), auto-scaling by region.
  • Discord: reportedly 20 clusters, each handling 1M+ concurrent connections, event-driven architecture (Redis pubsub), low-latency voice codec (Opus 20ms frames). Discord’s engineering blog describes a custom C++ SFU with Elixir signaling rather than a Pion-based Go stack.
  • Google Meet: gRPC services, Spanner (globally distributed database), per-region SFU, traffic routed via Andromeda SDN.

Deployment checklist:

  • Multi-region SFU: us-east-1, us-west-2, eu-west-1, ap-southeast-1, sa-east-1. Each region is independent.
  • Geolocation API to route users to nearest SFU (<100ms RTT target).
  • Auto-scaling: 80% CPU or memory → spawn new SFU instance. Scale down after 15 min idle.
  • Health checks: ping SFU every 10s, remove from pool if 2 pings fail.
  • Recording pipeline on separate tier, can autoscale independently.
  • Monitoring: Prometheus (latency, jitter, packet loss), Grafana dashboards, PagerDuty for alerts >5% error rate.
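
The health-check and auto-scaling rules in the checklist can be sketched as a small pool manager. The thresholds come from the list above; the class names are hypothetical, "2 pings fail" is read here as two consecutive failures, and scale-down is omitted.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class SfuNode:
    name: str
    cpu: float = 0.0       # utilisation as a fraction, 0.0 to 1.0
    memory: float = 0.0
    failed_pings: int = 0


class SfuPool:
    SCALE_UP_AT = 0.80     # 80% CPU or memory triggers a new instance
    MAX_FAILED_PINGS = 2   # assumption: consecutive failures

    def __init__(self) -> None:
        self.nodes: Dict[str, SfuNode] = {}

    def add(self, name: str) -> None:
        self.nodes[name] = SfuNode(name)

    def report(self, name: str, cpu: float, memory: float, ping_ok: bool) -> bool:
        """Record one health check. Returns False if the node was removed."""
        node = self.nodes[name]
        if ping_ok:
            node.cpu, node.memory, node.failed_pings = cpu, memory, 0
            return True
        node.failed_pings += 1
        if node.failed_pings >= self.MAX_FAILED_PINGS:
            del self.nodes[name]
            return False
        return True

    def needs_scale_up(self) -> bool:
        if not self.nodes:
            return True
        count = len(self.nodes)
        avg_cpu = sum(n.cpu for n in self.nodes.values()) / count
        avg_mem = sum(n.memory for n in self.nodes.values()) / count
        return avg_cpu >= self.SCALE_UP_AT or avg_mem >= self.SCALE_UP_AT


pool = SfuPool()
pool.add("sfu-1")
pool.add("sfu-2")
pool.report("sfu-1", cpu=0.85, memory=0.40, ping_ok=True)
pool.report("sfu-2", cpu=0.90, memory=0.50, ping_ok=True)
print(pool.needs_scale_up())  # True
```

Averaging across the pool avoids scaling on a single hot node; whether that is the right trigger depends on how evenly rooms are spread across instances.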

AI-Assisted Video Platform Engineering

Large language models (Claude, GPT-4, Gemini) accelerate video platform development:

  • Code generation: LLM writes 70% of boilerplate (signaling server, TURN credential service, recording pipelines). Engineers review and customize.
  • Architecture design: Prompt an LLM with your constraints (1M users, 150ms RTT, $0.01/min cost). Get a 10-page document with dataflow diagrams, capacity planning, and failure modes.
  • Debugging: Paste a jitter buffer algorithm; LLM spots edge cases (rounding errors, buffer overflow, timestamp wrap-around at 32 bits).
  • Documentation: Auto-generate API docs, architecture docs, runbooks from comments and type hints.

Prompt template for architecture design: "I’m building a video conferencing platform. Scale: 100K concurrent users, 50 calls/second. Target latency: <150ms. Codec: H.264 + Opus. Compliance: HIPAA. Budget: $5K/month SFU cost. Recommend architecture (signaling, SFU, recording, TURN, databases). Include capacity planning."

Case Study: BrainCert’s Journey to 500M+ Minutes

BrainCert is a learning platform used by 100K+ concurrent learners and educators. Over 10 years, Fora Soft built and scaled their video infrastructure:

  • Year 1–3: P2P WebRTC for small groups (<8 people). Low latency but unpredictable quality.
  • Year 4–5: Migrated to self-hosted Janus SFU in 3 regions (US, EU, APAC). Added recording to local disk.
  • Year 6–7: Scaled to 10 datacenters, auto-scaling Mediasoup clusters, S3 recording with lifecycle policies (30 day auto-delete for compliance).
  • Year 8–10: 500M+ minutes recorded, HIPAA audit passed, added AI transcription (Whisper) for 50M+ minutes. Multi-region SFU failover.

Key metrics: 4 Brandon Hall Awards (e-learning excellence), 99.99% uptime SLA, <120ms p99 latency, <1% dropped calls. Cost: $0.008/min SFU + $0.002/min recording.

Decision Framework: Five Questions

  1. How many concurrent users? <100 → CPaaS (Agora, Daily). 100–10K → Managed SFU (LiveKit Cloud). 10K+ → Custom self-hosted.
  2. Is latency critical? Yes (gaming, live collaboration) → P2P or SFU. No (webinars, training) → CPaaS or MCU.
  3. Regulated industry? Yes (HIPAA, GDPR, finance) → Custom SFU with audit trails. No → Any.
  4. What’s your engineering budget? <$500K/year → CPaaS. $500K–2M/year → Managed SFU. >$2M/year → Custom.
  5. Do you need custom integrations? Yes (CRM, analytics, recording workflows) → Custom SFU. No → CPaaS faster to market.

12 Common Pitfalls (and How to Avoid Them)

  1. Picking P2P for a 10-person call: P2P collapses at 6+. Use SFU from day one if you plan to scale beyond 4.
  2. Underestimating TURN infrastructure: TURN is not optional. 30% of corporate users need it. Budget $500/month minimum per region.
  3. Codec mismatch: Client offers H.264 + VP8; SFU only supports VP8. Negotiation fails silently. Test all codec pairs before prod.
  4. Recording bottleneck: SFU sends RTP directly to disk. At scale, disk I/O becomes the limit. Use FFmpeg or a dedicated recording daemon.
  5. Ignoring mobile networks: Mobile: 20–50ms jitter, 500 Kbps average bitrate. Desktop: <5ms jitter, 2 Mbps. Design codec settings per device type.
  6. Forgetting signaling redundancy: Signaling server goes down = all calls drop. Run 3+ instances in different AZs with auto-failover.
  7. No monitoring: "Latency increased 50ms" without knowing why. Instrument: ICE setup time, codec/resolution, packet loss, jitter buffer depth.
  8. Codec bitrate too high: 2 Mbps H.264 works on 10 Mbps WiFi but fails on 4G. Use 500–1500 Kbps for adaptive bitrate; client adjusts based on network.
  9. Not handling connection loss: Network drops for 5 seconds; client shows black screen forever. Implement reconnection with exponential backoff (1s, 2s, 4s, 8s, stop).
  10. DTLS timeout: DTLS handshake takes 200–500ms. If your UI doesn’t show a loading spinner, users think the app froze. Use a 5-second timeout + retry.
  11. Jitter buffer too small: Packets arriving out of order; buffer underruns cause audio dropouts. Use 50–100ms buffer on desktop, 100–200ms on mobile.
  12. No rate limiting on signaling: Attacker sends 10K SDP offers/sec. Signaling server CPU hits 100%, all calls drop. Rate-limit per IP (100 req/sec), per user (10 req/sec).
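
The reconnection schedule from pitfall 9 (1s, 2s, 4s, 8s, then stop) can be sketched as a generator plus a retry loop. The optional jitter parameter is an addition not mentioned above; it spreads out retries when many clients drop at once. `connect` stands in for whatever re-establishes the session in your client.

```python
import random
import time
from typing import Callable, Iterator


def backoff_delays(base: float = 1.0, factor: float = 2.0, attempts: int = 4,
                   jitter: float = 0.0) -> Iterator[float]:
    """Yield delays base, base*factor, ... for a fixed number of attempts."""
    for attempt in range(attempts):
        delay = base * factor ** attempt
        # jitter=0.25 stretches each delay by a random 0 to 25 percent.
        yield delay * (1 + jitter * random.random())


def reconnect(connect: Callable[[], bool],
              sleep: Callable[[float], None] = time.sleep, **backoff) -> bool:
    """Wait, then try to connect; give up after the last delay."""
    for delay in backoff_delays(**backoff):
        sleep(delay)
        if connect():
            return True
    return False


print(list(backoff_delays()))  # [1.0, 2.0, 4.0, 8.0]
```

When `reconnect` returns False, the UI should surface a clear "call ended" state rather than leave a black video tile on screen.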

KPIs: Measuring Video Quality

Tier 1: User experience

  • Mean Opinion Score (MOS): 1–5 scale (5 = excellent). Target: ≥4.2. Measured via user surveys or algorithmic proxies (ITU-T P.862 PESQ, P.863 POLQA).
  • Call setup time: Offer sent → media flowing. Target: <2 seconds.
  • Call drop rate: Target: <0.1%.

Tier 2: Network metrics

  • Round-trip time (RTT): Target: <150ms (can be perceived as “live”). >300ms = awkward delay.
  • Packet loss: Target: <2%. >5% = degraded quality.
  • Jitter: Target: <20ms. >50ms = choppy audio.
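
The jitter figure that WebRTC stats report follows the interarrival jitter estimator from RFC 3550: a running average of how much the transit time changes between consecutive packets. A minimal sketch, assuming a 48 kHz RTP clock (as used by Opus) and ignoring 32-bit timestamp wrap-around:

```python
from typing import Optional


class JitterEstimator:
    """RFC 3550 interarrival jitter: J += (|D| - J) / 16."""

    def __init__(self, clock_rate: int = 48_000) -> None:
        self.clock_rate = clock_rate
        self.jitter = 0.0                      # in RTP timestamp units
        self._prev_transit: Optional[float] = None

    def update(self, rtp_timestamp: int, arrival_seconds: float) -> float:
        """Feed one packet; returns the current jitter estimate in milliseconds."""
        transit = arrival_seconds * self.clock_rate - rtp_timestamp
        if self._prev_transit is not None:
            d = abs(transit - self._prev_transit)
            self.jitter += (d - self.jitter) / 16
        self._prev_transit = transit
        return self.jitter / self.clock_rate * 1000


est = JitterEstimator()
# 20ms Opus frames: 960 timestamp units apart. The last packet arrives 10ms late.
for ts, arrival in [(0, 0.000), (960, 0.020), (1920, 0.040), (2880, 0.070)]:
    jitter_ms = est.update(ts, arrival)
print(round(jitter_ms, 3))  # 0.625
```

The division by 16 smooths the estimate, so a single late packet moves it only slightly; sustained values above the 20ms target indicate a consistently unstable path.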

Tier 3: Infrastructure

  • SFU CPU per stream: 50–200 µs of CPU time per second (depends on codec). 100K streams = 5–20 physical cores needed.
  • Memory: 100–500 MB per 100 streams. 100K streams = 100–500 GB RAM.
  • Bandwidth: 0.5–3 Mbps per participant depending on resolution and codec.
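
These ratios can be turned into a quick sizing helper. The sketch below scales the figures above linearly (5–20 cores per 100K streams, 100–500 MB per 100 streams); real SFUs rarely scale perfectly linearly, so treat the output as a starting point for load tests.

```python
import math
from typing import Tuple


def sfu_capacity(streams: int,
                 cores_per_100k: Tuple[float, float] = (5, 20),
                 mb_per_100_streams: Tuple[float, float] = (100, 500)):
    """Return ((min_cores, max_cores), (min_ram_gb, max_ram_gb)) for a stream count."""
    cores = tuple(math.ceil(streams / 100_000 * c) for c in cores_per_100k)
    ram_gb = tuple(streams / 100 * mb / 1_000 for mb in mb_per_100_streams)
    return cores, ram_gb


print(sfu_capacity(100_000))  # ((5, 20), (100.0, 500.0))
print(sfu_capacity(250_000))  # ((13, 50), (250.0, 1250.0))
```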

When NOT to Build Your Own Video Platform

If any of these apply, use CPaaS:

  • You have <6 months to launch and <10 engineers.
  • Your budget for ops/infrastructure is <$100K/year.
  • Your scale is unpredictable (might need 1K users or 1M users in 6 months).
  • You don’t have 24/7 on-call ops support.
  • You need to be live in multiple countries (GDPR, data residency) and don’t have global ops.

Ready to Build?

We’ve built video platforms from MVP to 100M+ minutes. Let’s talk timeline, tech stack, and team.


Frequently Asked Questions

Can I use WebRTC without a signaling server?

No. WebRTC defines how media flows between peers but not how they find each other; you need out-of-band signaling to exchange SDP and ICE candidates. Even P2P calls need a lightweight server to broker the initial SDP.

What’s the cheapest way to stream video to 1 million users?

That is a one-to-many broadcast, not a video call. Use a CDN + HLS/DASH. Cost: $0.005–0.01 per GB (Cloudflare, Akamai). Assuming about 1 Mbps per viewer, 1M viewers pull roughly 1 Tbps (125 GB/s), which is $0.63–1.25/sec, or about $2.3K–4.5K/hour. Video calls with 1M participants are unrealistic; most platforms cap at 100K.

How do I handle video calls when both users are behind NAT?

ICE tries direct connection (STUN), then relayed (TURN). If both clients are behind symmetric NAT, STUN fails and TURN is required. Budget 100–200ms extra latency for TURN. Most calls (70%) succeed with STUN; design for the 30% TURN fallback.

Is H.264 or VP8 better?

H.264 has hardware acceleration (faster, lower CPU); VP8 is royalty-free. For web, offer both. The SFU negotiates codecs with each peer independently, but it forwards packets without re-encoding: if peer A sends H.264 and peer B can only receive VP8, the stream must be transcoded (expensive). Pick one and standardize.

Can I use AWS or GCP for my video infrastructure?

Yes. Deploy SFU on EC2/GCE, TURN on Compute Engine, recording to S3/GCS, signaling on Lambda/Cloud Functions. Total cost is higher than self-managed (cloud tax) but management is easier. For <100K users, cloud is cost-effective. For >500K, self-managed datacenters become cheaper.

How do I scale recording from 100 to 10K concurrent calls?

Don’t record on the SFU. Use a separate recording daemon. SFU sends RTP to daemon via UDP or writes to a shared message queue (Kafka). Daemon writes to local NVMe SSD, uploads to S3 asynchronously. Scale: add recording instances, auto-scale based on queue depth.

What’s the difference between CPaaS and a CDN?

CPaaS (Agora, Daily) handles real-time media + signaling for group calls. CDN (Akamai, Cloudflare) distributes pre-recorded or live-streamed content. Use CPaaS for meetings; use CDN for broadcast.

What tools help debug video quality issues in production?

WebRTC Statistics API (getStats) exposes jitter, packet loss, RTT, codec details. Tools like chrome://webrtc-internals (Chrome), about:webrtc (Firefox), and third-party services (Agora Inspect, LiveKit Insights) visualize these metrics. Log getStats every 1–2 seconds to a backend for historical analysis and alerting.

Final Thoughts

Video conferencing is no longer novel. Users expect sub-150ms latency, <1% dropped packets, and crystal-clear audio on 4G. Competing on features (filters, backgrounds, recording) is table stakes. Winning means scaling to millions while keeping ops lean and costs predictable.

The choice between CPaaS and custom SFU hinges on three variables: scale (concurrent users), timeline (months to launch), and customization (how unique is your product). If you’re unsure, start with CPaaS. If you reach 50K+ concurrent users, revisit the cost model. Custom SFU breaks even at scale; CPaaS wins if you want to move fast.

And remember: monitoring, alerts, and runbooks are not optional. Video platforms fail silently. A 50ms spike in latency or a 2% increase in jitter can degrade user experience before you notice. Instrument everything. Then build.

Let’s Talk Architecture

Whether you’re evaluating CPaaS, sketching a custom SFU, or debugging scaling challenges, our architects are here. Free 30-minute strategy call.
