Why this matters
If you run a WebRTC product with more than five participants per room, the choice between simulcast and SVC sits underneath every quality complaint, every CPU bill on the publisher device, every "why does my mobile viewer see a low-resolution stream of a desktop sharing a 4K spreadsheet" support ticket, and a measurable line on your TURN egress invoice. The decision is not abstract: it changes what your encoder pipeline does, what your SFU's RTP forwarding code does, what your subscribers' decoders do, and what you can ship to Safari versus what only works in Chromium. This article gives you the architecture, the standards, the numbers, and the 2026 production reality across the five major SFUs, so when an engineer says "we should switch to AV1 SVC", you know exactly what you're buying and what you're giving up.
The problem a heterogeneous audience creates
Picture a video call with twelve people in it. Three are sitting at desks on gigabit fibre, four are on hotel Wi-Fi that fluctuates between 3 Mbps and 30 Mbps, two are on phones moving between cell towers, one is sharing the call on a smart-TV web browser through a flaky mesh router, and two are AI agents running headless in a data centre that can take any quality you give them. The publisher of each camera is uploading to a Selective Forwarding Unit — a server that, by design, does not decode or re-encode the media; it inspects packet headers and forwards the right packets to the right subscribers. (For the topology context, see SFU vs MCU vs Mesh.) If every publisher uploaded one stream at one bitrate, the SFU would have to forward that exact stream to every subscriber. The high-resolution stream that delights the desktop viewer would saturate the phone's LTE link and trigger an unrecoverable congestion event; the low-resolution stream that fits the phone's pipe would arrive on the desktop as a blurry rectangle.
This is the heterogeneous-audience problem. Every real WebRTC product hits it on day one. The two solutions that the WebRTC ecosystem actually ships in 2026 are simulcast and SVC, and almost no product picks one to the complete exclusion of the other — they coexist inside the same server, often inside the same room.
Simulcast in plain language
Simulcast is the simpler of the two ideas. Instead of asking the publisher's encoder to produce one stream, you ask it to produce several — typically three — at different resolutions and bitrates from the same camera frame. A typical configuration sends 180p at around 150 kbps, 360p at around 500 kbps, and 720p at around 1,500 kbps. Each of these is a fully independent, self-contained stream: the 180p stream is a complete video that can be decoded by any compliant decoder without reference to the others, and the same is true of the 360p and the 720p. The publisher uploads all three to the SFU over the same RTP session. The SFU sees three streams arrive, tagged with identifiers that mark them as alternatives for the same underlying media source.
When a subscriber joins, the SFU decides which of the three streams to forward to that subscriber. The decision is continuous and based on three signals: the subscriber's reported receiver bandwidth estimate, the subscriber's requested layer (because applications can ask for "the lowest layer" when the subscriber's video tile is collapsed to a thumbnail), and the SFU's own knowledge of which streams are actually being produced. When the subscriber's network improves, the SFU upgrades — it starts forwarding packets from the higher-bitrate stream instead of the lower one, dropping the lower-bitrate stream out of the forwarding decision. When the network degrades, the SFU downgrades, forwarding the lower-bitrate stream and stopping the higher-bitrate one.
The mechanism is described normatively in RFC 8853 (January 2021), Using Simulcast in Session Description Protocol (SDP) and RTP Sessions. RFC 8853 specifies how to advertise simulcast capability in the SDP offer/answer exchange (the a=simulcast media-level attribute) and how to identify the different RTP streams of one media source at the RTP level (the RtpStreamId source description item, with the RTP header extension method as the SHOULD-implement option). The IETF MMUSIC working group authored the document; the related RFC 8852 specifies the RtpStreamId itself.
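To make the signalling concrete, here is a minimal sketch of reading an a=simulcast attribute line the way RFC 8853 lays it out: a direction (send or recv), then stream identifiers separated by ";", with "," separating alternatives for the same stream and a leading "~" marking a stream that starts paused. The function name and the shape of the returned object are illustrative assumptions, not part of any library.

```javascript
// Parse one "a=simulcast:" SDP attribute line (RFC 8853 syntax).
// Returns { send: [[{ rid, paused }]], recv: [[{ rid, paused }]] }.
// Each inner array is a list of alternatives for one simulcast stream.
function parseSimulcastAttribute(line) {
  const prefix = 'a=simulcast:';
  if (!line.startsWith(prefix)) {
    throw new Error('not an a=simulcast line');
  }
  const tokens = line.slice(prefix.length).trim().split(/\s+/);
  const result = { send: [], recv: [] };
  for (let i = 0; i + 1 < tokens.length; i += 2) {
    const direction = tokens[i];
    if (direction !== 'send' && direction !== 'recv') {
      throw new Error(`unexpected direction: ${direction}`);
    }
    result[direction] = tokens[i + 1].split(';').map((alternatives) =>
      alternatives.split(',').map((id) => ({
        rid: id.replace(/^~/, ''),
        paused: id.startsWith('~'),
      }))
    );
  }
  return result;
}

const parsed = parseSimulcastAttribute('a=simulcast:send q;h;f');
console.log(parsed.send.map((alts) => alts[0].rid)); // [ 'q', 'h', 'f' ]
```

In a real offer the same media section would also carry one a=rid line per identifier; this sketch ignores those.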
The cost of simulcast — measured in CPU, memory, and uplink
The publisher pays. The publisher's encoder runs three independent encodes of the same camera, simultaneously. The CPU cost is sub-linear because the down-scaled inputs are cheaper to encode and many encoders share motion estimation across resolutions, but the cost is real — figures in the range of 1.5 to 2.5 times a single-stream encode are typical depending on codec, resolution set, and hardware acceleration. Memory usage is similarly higher. Uplink bandwidth is the sum of the three streams: if you encode 180p at 150 kbps, 360p at 500 kbps, and 720p at 1,500 kbps, the publisher uploads roughly 2,150 kbps for that camera, compared to 1,500 kbps for a single 720p stream — about 43% more bytes on the wire to give the SFU the choice.
Here is the arithmetic out loud. A single 720p stream at 1,500 kbps consumes 1,500 kbps of uplink. The three-rung simulcast set at 150 + 500 + 1,500 kbps sums to 2,150 kbps. The ratio is 2,150 ÷ 1,500 = 1.43. The publisher pays 43% more uplink so the SFU can switch quality per subscriber.
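The same arithmetic as a small helper, so you can plug in your own ladder. The function name and return shape are illustrative assumptions.

```javascript
// Sum a simulcast ladder and compare it with uploading only the top rung.
function simulcastUplink(layerKbps) {
  const totalKbps = layerKbps.reduce((sum, kbps) => sum + kbps, 0);
  const topKbps = Math.max(...layerKbps);
  const overheadPct = Math.round((totalKbps / topKbps - 1) * 100);
  return { totalKbps, topKbps, overheadPct };
}

console.log(simulcastUplink([150, 500, 1500]));
// { totalKbps: 2150, topKbps: 1500, overheadPct: 43 }
```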
The cost on the subscriber side is zero. Each subscriber receives exactly one stream — the one the SFU chose for them — and decodes it as it would decode any normal RTP video stream. The subscriber's decoder is unchanged from the single-stream case. This is the great gift of simulcast: the complexity lives entirely in the publisher and the server. The subscriber, which is usually the more numerous side of the equation, is untouched.
Where simulcast struggles
The publisher's CPU and uplink are the obvious costs. The less-obvious costs show up when you try to optimise. Three independent encodes mean three independent rate-control loops, three keyframe schedules, and three sets of long-tail moments where one stream's keyframe interval lands in the middle of another's update. Quality-switching latency is small but not zero: when the SFU decides to switch the subscriber from 360p to 720p, the subscriber's decoder needs a keyframe at 720p to start decoding it, and the SFU has to either wait for the next scheduled 720p keyframe or send a PLI (Picture Loss Indication) RTCP feedback message to the publisher to request one. PLIs to the publisher cause an extra keyframe in all simulcast layers, which costs extra bytes for a moment. Tuning the keyframe interval to balance switching latency and bandwidth efficiency is one of the standard simulcast operational chores.
Simulcast also gives a coarse-grained quality knob. With three layers, you have three quality choices per subscriber. If you want finer-grained adaptation between those, you cannot — the encoder is producing exactly three streams. SVC, the next section, addresses exactly this constraint.
SVC in plain language
Scalable Video Coding is a different answer to the same question. Instead of asking the encoder to produce N independent streams, you ask it to produce one stream organised into nested layers. A base layer is a complete, decodable, low-quality version of the video. Enhancement layers, when added to the base layer, decode to a higher-quality version. Any prefix of the layers, from "just the base layer" to "all the layers", is itself a valid stream that decodes to a valid picture. The encoder uses motion-compensated and frequency-domain dependencies across the layers so that the total bit cost of "base + enhancement" is close to (but not equal to) the cost of encoding the high-quality version directly.
The layers come in two dimensions, sometimes three. Temporal layers split the frame rate: a base temporal layer at 7.5 fps, plus one enhancement layer that adds frames to get to 15 fps, plus one more enhancement layer that adds frames to get to 30 fps. Spatial layers split the resolution: a base spatial layer at 180p, plus one enhancement layer that adds resolution to get to 360p, plus one more for 720p. Some codecs also support quality (SNR) layers — same resolution and frame rate, different fidelity. The W3C Scalable Video Coding (SVC) Extension for WebRTC standardises a compact naming convention: LxTy describes x spatial layers and y temporal layers. L1T3 is one spatial layer with three temporal layers — pure temporal scalability, available in every WebRTC codec including VP8 and H.264. L3T3 is three spatial layers and three temporal layers — full spatial+temporal SVC, available only in VP9 and AV1.
When the SFU receives an SVC stream, it forwards only the packets belonging to the layers the subscriber needs. Because the encoder has labelled each packet with its layer (through the VP9 payload descriptor for VP9, or the Dependency Descriptor RTP header extension for AV1), the SFU can decide per packet whether to drop or forward it. The subscriber's decoder reassembles whatever packets arrive into a valid picture at the resolution and frame rate that the forwarded layers represent.
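A minimal sketch of that per-packet decision, assuming the SFU has already parsed each packet's layer identifiers into hypothetical spatialId and temporalId fields. It also assumes a full-dependency mode such as L3T3, where every layer at or below the target is needed; modes like L3T3_KEY need a different rule outside key frames.

```javascript
// Forward a packet only if it belongs to a layer at or below the subscriber's
// target. The field names (spatialId, temporalId, maxSpatial, maxTemporal)
// are illustrative, not taken from any SFU's API.
function shouldForward(packet, target) {
  return (
    packet.spatialId <= target.maxSpatial &&
    packet.temporalId <= target.maxTemporal
  );
}

const packets = [
  { seq: 1, spatialId: 0, temporalId: 0 },
  { seq: 2, spatialId: 1, temporalId: 0 },
  { seq: 3, spatialId: 2, temporalId: 0 },
  { seq: 4, spatialId: 0, temporalId: 2 },
];
const thumbnailTarget = { maxSpatial: 0, maxTemporal: 0 };
const forwarded = packets.filter((p) => shouldForward(p, thumbnailTarget));
console.log(forwarded.map((p) => p.seq)); // [ 1 ]
```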
The cost of SVC — measured in encoder complexity and codec support
The publisher's CPU cost is lower than simulcast because there is only one encode in flight, but the encoder is doing more sophisticated work than a single-stream encode: it must maintain reference relationships across the layer hierarchy and label every output packet with its layer membership. Empirically, an SVC encode at the same target quality and resolution costs roughly 1.1 to 1.4 times a single-stream encode — meaningfully less than the 1.5 to 2.5 times a three-rung simulcast encode costs. Uplink bandwidth is also lower: the layer dependencies mean the total bytes for "180p + 360p + 720p" delivered as SVC layers is roughly 10–25% less than the same three streams delivered as simulcast, because the enhancement layers carry differences rather than full restatements.
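As a rough illustration of the uplink claim, the sketch below applies the 10–25% saving quoted above to a simulcast total. The helper name is an assumption, and the output is only as good as that range.

```javascript
// Estimate the SVC uplink range from a simulcast total and a saving range.
function svcUplinkRangeKbps(simulcastTotalKbps, minSavingPct = 10, maxSavingPct = 25) {
  const scale = (pct) => Math.round((simulcastTotalKbps * (100 - pct)) / 100);
  return { lowKbps: scale(maxSavingPct), highKbps: scale(minSavingPct) };
}

console.log(svcUplinkRangeKbps(2150)); // { lowKbps: 1613, highKbps: 1935 }
```

For the 150 + 500 + 1,500 kbps ladder used earlier, that puts the SVC upload somewhere between roughly 1,600 and 1,950 kbps.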
The subscriber-side cost is again essentially zero. The subscriber's decoder receives a partial stream — only the layers the SFU decided to forward — and decodes it as a normal video stream. The decoder must support the codec's SVC profile, but every modern VP9 and AV1 decoder does.
The catch is codec support. SVC requires a codec that defines scalability semantics, and the SDP / RTP machinery requires that both ends understand those semantics. In 2026 the practical map is:
- VP8: temporal SVC only (L1T2, L1T3). No spatial layers. Supported in every browser that supports VP8, that is, all of them.
- H.264: temporal SVC only (L1T2, L1T3). Supported in every browser that supports H.264, which is now all of them, including Chromium-based browsers since simulcast support landed.
- VP9: full spatial+temporal SVC up to L3T3 and its L3T3_KEY variant. Supported in Chrome, Firefox, and Edge; Safari supports VP9 decode but not VP9 encode, which means a Safari user cannot publish VP9 SVC.
- AV1: full spatial+temporal SVC including L3T3 and many additional modes. Supported for both encode and decode in Chromium-based browsers and Firefox (with some restrictions); Safari supports AV1 decode on devices with hardware AV1 (iPhone 15 Pro and newer, Apple Silicon Macs) but not yet AV1 encode.
The Safari gap is the practical wall that keeps simulcast alive. If your product has any Safari publishers, you cannot rely on VP9 or AV1 SVC to cover them; you fall back to simulcast in VP8 or H.264. This is why every production SFU in 2026 supports both simulcast and SVC, often inside the same room.
Scalability modes, by example
The W3C webrtc-svc specification names the scalability mode you ask for on the encoder side. You pass a string like "L1T3" or "L3T3_KEY" to the RTCRtpSender.setParameters() call inside the encodings[i].scalabilityMode field. The encoder, if it supports the requested mode for the requested codec, sets up its dependency graph accordingly and tags every output packet. The receiver does not need to know which mode was used — the layer membership is in the payload descriptor.
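A minimal sketch of that call, written against anything that exposes getParameters() and setParameters() like an RTCRtpSender. The helper name is an assumption; in a browser, setParameters() is expected to reject when the mode is not supported for the negotiated codec, so callers should be ready to catch that.

```javascript
// Ask the encoder behind `sender` to use the given scalability mode on its
// first encoding, then read back what the sender reports.
async function requestScalabilityMode(sender, mode) {
  const params = sender.getParameters();
  if (!params.encodings || params.encodings.length === 0) {
    throw new Error('sender has no encodings to configure');
  }
  params.encodings[0].scalabilityMode = mode;
  await sender.setParameters(params);
  return sender.getParameters().encodings[0].scalabilityMode;
}

// Browser usage (not run here):
//   const applied = await requestScalabilityMode(transceiver.sender, 'L1T3');
```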
The naming convention reads cleanly once you internalise it:
- L1T1: one spatial layer, one temporal layer. No SVC; this is a plain single-stream encode.
- L1T2: one spatial layer, two temporal layers. Splits the frame rate into a half-rate base and a full-rate enhancement. The SFU can deliver "every other frame" to a constrained subscriber and "every frame" to an unconstrained subscriber, from the same uploaded stream.
- L1T3: one spatial layer, three temporal layers. Splits the frame rate into quarter / half / full. The most common temporal SVC mode in 2026 production.
- L2T3 / L3T3: two or three spatial layers, three temporal layers. Full spatial + temporal SVC. VP9 and AV1 only.
- L3T3_KEY: a variant in which higher spatial layers predict from lower spatial layers only at key frames; between key frames each spatial layer references only its own earlier frames. That makes layer switching cleaner at the cost of slightly more bytes. This is the default scalability mode LiveKit asks for when publishing VP9 or AV1.
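The naming convention is regular enough to parse mechanically. The sketch below handles only the LxTy and LxTy_KEY forms discussed here (the W3C specification defines more variants), and the frame-rate helper assumes each temporal layer doubles the rate of the one below it. Both function names are illustrative.

```javascript
// Split a scalability mode string into its layer counts.
function parseScalabilityMode(mode) {
  const match = /^L(\d)T(\d)(_KEY)?$/.exec(mode);
  if (!match) {
    throw new Error(`unsupported scalability mode: ${mode}`);
  }
  return {
    spatialLayers: Number(match[1]),
    temporalLayers: Number(match[2]),
    keyFrameInterLayerOnly: match[3] === '_KEY',
  };
}

// Frame rate reached at each temporal layer, lowest first, assuming each
// layer doubles the rate of the previous one.
function temporalFrameRates(fullFps, temporalLayers) {
  return Array.from(
    { length: temporalLayers },
    (_, t) => fullFps / 2 ** (temporalLayers - 1 - t)
  );
}

console.log(parseScalabilityMode('L3T3_KEY'));
// { spatialLayers: 3, temporalLayers: 3, keyFrameInterLayerOnly: true }
console.log(temporalFrameRates(30, 3)); // [ 7.5, 15, 30 ]
```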
A worked numeric example: temporal layers and the bandwidth a constrained subscriber consumes
Suppose a publisher encodes a 720p VP9 stream with L1T3 — one spatial layer at 720p, three temporal layers splitting the frame rate into 7.5 / 15 / 30 fps. The encoder targets a total bitrate of 1,500 kbps at 30 fps. A common SVC bit budget allocates roughly half the bits to the base temporal layer and roughly a quarter to each enhancement, so a typical breakdown is 750 / 375 / 375 kbps for the three layers.
A subscriber on a constrained mobile network requests "lowest temporal layer only" — the SFU forwards only the base layer, and the subscriber decodes 720p at 7.5 fps consuming 750 kbps. A subscriber on a healthy desktop link gets all three layers — the SFU forwards everything, and the subscriber decodes 720p at 30 fps consuming the full 1,500 kbps. The same single publisher upload (1,500 kbps) serves both subscribers simultaneously; the SFU's forwarding decision is the only thing that changes. Compare this to simulcast, where the publisher would have had to encode and upload two separate streams — one at 7.5 fps, one at 30 fps — to give the SFU the same choice.
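The same example as code: given the total bitrate and the per-layer shares, compute what a subscriber consumes when the SFU forwards the first one, two, or three temporal layers. The 50/25/25 split is the one assumed above; real encoders allocate differently, and the function name is illustrative.

```javascript
// Cumulative bitrate for each prefix of temporal layers, lowest first.
function cumulativeLayerKbps(totalKbps, shares) {
  let running = 0;
  return shares.map((share) => {
    running += totalKbps * share;
    return running;
  });
}

console.log(cumulativeLayerKbps(1500, [0.5, 0.25, 0.25]));
// [ 750, 1125, 1500 ]
```

The middle value is the case the text skips: a subscriber taking the first two temporal layers decodes 720p at 15 fps for 1,125 kbps.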
This is the bandwidth-efficiency advantage of SVC, and it scales: every additional layer the SFU can hand out adds a quality option without adding a full new encoded stream to the publisher's uplink. The disadvantage is the codec constraint already discussed, so the advantage is largely confined to rooms where the publishers can encode an SVC-capable codec that every subscriber in the room can decode.
How the SFU actually decides which layers to forward
A subscriber's effective receive bandwidth is the SFU's primary input. Most production SFUs run a per-subscriber bandwidth estimator on top of transport-wide congestion control (TWCC), an RTP header extension with matching RTCP feedback that LiveKit, mediasoup, Janus, and Pion-based SFUs all support. On the downlink the SFU is the sender: it stamps each outgoing packet with a transport-wide sequence number, and the subscriber reports every received packet's arrival time back to the SFU. Those reports feed a bandwidth estimator on the SFU that produces a per-subscriber estimate of how many bytes per second this subscriber can comfortably absorb. The estimator is described in detail in the IETF transport-wide congestion control work and in the Google Congestion Control implementation that ships in libwebrtc.
Once the SFU knows the subscriber's bandwidth, it picks the layer set that uses the most of it without exceeding it. For simulcast: forward the highest-bitrate stream whose mean bitrate fits within the estimate, with hysteresis to avoid thrashing. For SVC: forward the highest prefix of layers (the largest set that still fits). The SFU also exposes an application-level API for hint inputs — a constraint that overrides bandwidth, used when the subscriber's UI shows a video tile small enough that 180p is plenty regardless of what the network can handle. mediasoup's consumer.setPreferredLayers() and LiveKit's per-subscription layer parameters do exactly this; the SFU treats them as a cap on the layer the bandwidth estimator can pick.
The forwarding decision is continuous. Every 100–500 ms (depending on SFU), the per-subscriber loop reruns: re-estimate bandwidth, re-pick the layer, react to PLI feedback from the subscriber (which forces the SFU to either request a keyframe from the publisher or fall back to a layer that already has a recent keyframe). When the subscriber's video tile changes size — a participant gets pinned, a thumbnail expands — the application calls into the SFU to update the preferred layers, and the next loop iteration acts on the new cap.
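One iteration of that loop for simulcast can be sketched as below. This is not any particular SFU's algorithm: the layer objects, the 1.2 upgrade headroom, and the function name are all assumptions chosen to show the three ingredients named above (fit within the estimate, hysteresis, and an application cap).

```javascript
// layers: ascending by bitrate, e.g. [{ rid: 'q', kbps: 150 }, ...]
// currentIndex: the layer being forwarded now
// capIndex: highest layer the application allows (UI hint)
function pickLayer(layers, estimateKbps, currentIndex,
                   capIndex = layers.length - 1, upgradeHeadroom = 1.2) {
  let best = 0; // always forward at least the lowest layer
  for (let i = 0; i < layers.length && i <= capIndex; i++) {
    // Hysteresis: moving up needs spare headroom, staying or moving down
    // only needs the layer's own bitrate to fit.
    const required =
      i > currentIndex ? layers[i].kbps * upgradeHeadroom : layers[i].kbps;
    if (required <= estimateKbps) best = i;
  }
  return best;
}

const ladder = [
  { rid: 'q', kbps: 150 },
  { rid: 'h', kbps: 500 },
  { rid: 'f', kbps: 1500 },
];
console.log(pickLayer(ladder, 1600, 1)); // 1
console.log(pickLayer(ladder, 1900, 1)); // 2
console.log(pickLayer(ladder, 5000, 2, 0)); // 0
```

With an estimate of 1,600 kbps the subscriber stays on 360p because 720p plus headroom does not fit; at 1,900 kbps it upgrades; with a cap of 0 it stays on the lowest layer however good the network is.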
Common mistake: assuming the SFU "knows" the subscriber's screen
A frequent failure mode at first deployment is forgetting that the SFU cannot, by itself, know how big the subscriber's video tile is. The SFU sees the subscriber's bandwidth and the subscriber's RTCP receiver reports. It does not see the subscriber's DOM or the subscriber's display resolution. If the subscriber has a 90 × 90-pixel thumbnail of a participant in a sidebar, the SFU will happily forward 720p packets to fill that thumbnail unless the application explicitly tells it not to. The result is a bill: the SFU's egress to the subscriber carries far more bytes than the subscriber actually consumes visually. The fix is to wire the subscriber's UI state (pinned versus thumbnail, expanded versus collapsed, visible versus off-screen) into a call that sets the preferred layer on the subscription. mediasoup, LiveKit, Janus, Jitsi Videobridge, and Pion-based SFUs all expose this hook; missing it is the single most common reason a production WebRTC bill is twice what it should be.
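A minimal sketch of the client-side half of that wiring: map a tile's state to a preferred layer. The tile shape, the height thresholds, and the function name are assumptions; the returned object uses the spatialLayer / temporalLayer fields that mediasoup's consumer.setPreferredLayers() accepts, and other SFUs will need their own mapping.

```javascript
// Translate UI state into a layer preference. Returns null for an off-screen
// tile so the caller can pause or skip that subscription instead.
// Spatial layers 0, 1, 2 are assumed to be 180p, 360p, 720p.
function preferredLayersForTile(tile) {
  if (!tile.visible) return null;
  let spatialLayer = 2;
  if (tile.heightPx <= 180) spatialLayer = 0;
  else if (tile.heightPx <= 360) spatialLayer = 1;
  return { spatialLayer, temporalLayer: 2 };
}

console.log(preferredLayersForTile({ visible: true, heightPx: 90 }));
// { spatialLayer: 0, temporalLayer: 2 }
```

The application would send this preference to its signalling server whenever a tile is pinned, resized, or scrolled out of view, and the server would apply it to the matching subscription.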
What the five major SFUs actually do in 2026
The five projects compared in Choosing an SFU all support both simulcast and SVC, but they implement the layer-selection logic and the publisher APIs differently. The map below describes the 2026 reality.
mediasoup ships first-class simulcast and SVC support across VP8, VP9, H.264, and AV1. The consumer API exposes preferredLayers and setPreferredLayers() for the spatial+temporal target; the SFU runs its own per-consumer scoring loop that picks the actual layer based on the consumer's estimated bandwidth and the producer's available layers. AV1 SVC support was added to mediasoup's worker C++ core ahead of most other servers and is in active production use at customers running large-scale Pion/mediasoup hybrids.
LiveKit uses simulcast by default for VP8 and H.264 publishers and switches to SVC automatically for VP9 and AV1 publishers, requesting L3T3_KEY as the default scalability mode. LiveKit's Dynacast feature couples the layer-forwarding decision to subscriber visibility, automatically pausing streams entirely when no subscriber is viewing them and pausing individual layers (for simulcast) or whole streams (for SVC, where individual layer pausing is not possible) when subscribers downgrade. LiveKit also implements multi-codec simulcast: the publisher can run a backup codec (VP8 simulcast) in parallel with the primary codec (AV1 SVC) so that legacy Safari subscribers get a working stream while Chromium subscribers benefit from AV1's compression efficiency. The multi-codec simulcast pattern is the only architectural answer in 2026 to the cross-browser AV1 SVC gap that does not require giving up SVC's benefits for the subscribers that can use it.
Janus implements simulcast in the videoroom plugin and added initial AV1-SVC Dependency Descriptor support in 2022 (Meetecho merged the work in pull request #2741); the support has matured since and is production-ready in 2026 for AV1 with L1T2/L1T3 and partial support for spatial AV1 SVC. Janus's plugin model means the layer-selection logic lives inside janus_videoroom and is configurable per subscriber through the plugin's request API.
Jitsi Videobridge has implemented temporal-layer-aware forwarding for VP8 simulcast since before the Octo/relay rebuild, with the bridge selecting the lowest temporal layer that satisfies a subscriber's constraints. The bridge also supports VP9 SVC; AV1 support landed alongside the dependency-descriptor work in libwebrtc but is less production-validated than VP8 simulcast inside the typical Jitsi Meet deployment, which still defaults to VP8 simulcast for cross-browser stability.
Pion is the WebRTC library, not the SFU — but it carries the protocol primitives (RTP header extensions, the AV1 Dependency Descriptor parser, RTCP feedback messages) that a Pion-based SFU needs to implement layer-aware forwarding. ion-sfu and LiveKit (which is built on Pion) ship the actual layer selection on top.
Side-by-side: what you write on the publisher
```javascript
// Simulcast on the publisher side (works on every codec in every browser)
const sender = pc.addTransceiver(track, {
  direction: 'sendonly',
  sendEncodings: [
    { rid: 'q', maxBitrate: 150_000, scaleResolutionDownBy: 4 },   // 180p
    { rid: 'h', maxBitrate: 500_000, scaleResolutionDownBy: 2 },   // 360p
    { rid: 'f', maxBitrate: 1_500_000, scaleResolutionDownBy: 1 }, // 720p
  ],
}).sender;
```

```javascript
// SVC on the publisher side (VP9 or AV1)
const sender = pc.addTransceiver(track, {
  direction: 'sendonly',
  sendEncodings: [
    { scalabilityMode: 'L3T3_KEY', maxBitrate: 1_500_000 },
  ],
}).sender;
```
The simulcast version explicitly enumerates three encodings with rid (RTP stream identifier) values and resolution scale factors; the SFU sees three streams. The SVC version asks for one encoding with the scalabilityMode set; the SFU sees one stream with embedded layer metadata. The application code is dramatically simpler for SVC — the price is the codec constraint.
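In practice a publisher often has to pick between the two at runtime. The sketch below chooses SVC when a VP9 or AV1 entry advertises the wanted mode and falls back to the simulcast ladder otherwise. It takes a plain list of { mimeType, scalabilityModes } objects as input; how you obtain that list is an assumption here (Chromium reports scalabilityModes on the codecs returned by RTCRtpSender.getCapabilities('video'), but other browsers may not), and the function name is illustrative.

```javascript
const SIMULCAST_ENCODINGS = [
  { rid: 'q', maxBitrate: 150_000, scaleResolutionDownBy: 4 },
  { rid: 'h', maxBitrate: 500_000, scaleResolutionDownBy: 2 },
  { rid: 'f', maxBitrate: 1_500_000, scaleResolutionDownBy: 1 },
];

// Pick SVC if an SVC-capable codec supports the preferred mode,
// otherwise fall back to three-rung simulcast.
function chooseSendEncodings(supportedCodecs, preferredMode = 'L3T3_KEY') {
  const svcCodec = supportedCodecs.find(
    (codec) =>
      /^video\/(VP9|AV1)$/i.test(codec.mimeType) &&
      (codec.scalabilityModes || []).includes(preferredMode)
  );
  if (svcCodec) {
    return {
      strategy: 'svc',
      mimeType: svcCodec.mimeType,
      sendEncodings: [{ scalabilityMode: preferredMode, maxBitrate: 1_500_000 }],
    };
  }
  return { strategy: 'simulcast', mimeType: null, sendEncodings: SIMULCAST_ENCODINGS };
}

const safariLike = [{ mimeType: 'video/VP8' }, { mimeType: 'video/H264' }];
console.log(chooseSendEncodings(safariLike).strategy); // simulcast
```

The chosen sendEncodings can be passed straight to addTransceiver(); making the browser actually negotiate the SVC codec is a separate step that this sketch leaves out.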
Bandwidth, quality, and the actual production trade-off
For the same target quality at the highest tier, SVC consumes less publisher uplink than simulcast, typically 10–25% less depending on codec and content. On the publisher's CPU, SVC encodes a similar number of pixels in a single pass at lower total cost than three independent simulcast encodes. The SFU-side cost is broadly similar: both techniques are RTP-packet-level forwarding decisions, and a mature SFU handles both with roughly equivalent per-stream CPU.
The other side of the ledger is robustness. Simulcast is robust to a layer dying: if the 720p encode breaks for a moment because of a CPU spike on the publisher, the 360p and 180p encodes keep going, and the SFU continues forwarding them. SVC is less robust to the same kind of failure: when a frame in a lower layer is lost or its encode stalls, every enhancement frame that depends on it becomes undecodable, and a brief glitch may propagate longer because the layer dependency graph stalls. The L3T3_KEY variant was designed specifically to soften this: it restricts inter-layer prediction to key frames, so between key frames each spatial layer depends only on its own earlier frames, which makes layer switching cleaner and recovery from a layer failure faster.
Quality-switching latency is also different. With simulcast, switching layers requires a new keyframe on the destination layer; in the worst case the SFU sends a PLI and waits for the publisher's next forced keyframe. With SVC, switching to a lower layer is essentially free (drop higher-layer packets, decode what remains), while switching to a higher layer needs the dependency chain to be re-established — which L3T3_KEY makes practical without a full publisher keyframe round-trip.
The 2026 production pattern for most teams: simulcast as the default, SVC where the codec coverage allows it, and a multi-codec fallback to cover Safari publishers. Tsahi Levent-Levi's Five WebRTC Predictions for 2026 observed that AV1 will not become the dominant WebRTC codec in 2026; VP8 and H.264 remain the workhorses, and AV1 SVC will keep growing where its bandwidth-efficiency story justifies the extra integration work. The realistic ramp to AV1 SVC as the default for cross-browser publishers points to 2028 at the earliest, gated on Safari shipping an AV1 encoder.
Where Fora Soft fits in
In our WebRTC, conferencing, telemedicine, and e-learning practice, simulcast in VP8 is still the cross-browser default for new builds — particularly for products with Safari publishers where SVC support cannot be assumed. We enable VP9 or AV1 SVC for Chromium-only deployments, AI-agent rooms, and broadcast tails where the bandwidth savings move a real cost line. The most consistent operational win across these projects is wiring subscriber UI state — pin, thumbnail, off-screen — into the SFU's layer-selection hint, regardless of which technique the publisher uses. That wiring routinely cuts SFU egress by 30–60% on rooms with many small video tiles, and it works the same for simulcast and SVC subscribers.
What to read next
- SFU vs MCU vs Mesh: the three WebRTC topologies — the topology context.
- mediasoup, Janus, LiveKit, Jitsi Videobridge, Pion: choosing an SFU — the five servers compared.
- WebRTC bandwidth estimation — the algorithm that feeds layer selection.
References
- IETF RFC 8853, Using Simulcast in Session Description Protocol (SDP) and RTP Sessions, January 2021. The normative source for SDP simulcast signalling.
- IETF RFC 8852, RTP Stream Identifier Source Description (SDES) Items, January 2021. The companion document that defines RtpStreamId.
- W3C, Scalable Video Coding (SVC) Extension for WebRTC, Working Draft. The browser-facing API for scalabilityMode.
- W3C, WebRTC 1.0: Real-Time Communication Between Browsers, W3C Recommendation. The base RTCPeerConnection / RTCRtpSender API surface that simulcast and SVC plug into.
- Alliance for Open Media (AOMedia), RTP Payload Format For AV1. The specification that defines the Dependency Descriptor the SFU uses for AV1 layer-aware forwarding.
- IETF RFC 9628, RTP Payload Format for VP9 Video, March 2025. The VP9 payload descriptor that carries per-layer metadata.
- IETF draft, RTP Extensions for Transport-wide Congestion Control. The TWCC RTP header extension the SFU uses to estimate per-subscriber bandwidth.
- LiveKit Documentation, Codecs and more, accessed 2026-05-25. LiveKit's behaviour for simulcast versus SVC and the multi-codec simulcast feature.
- mediasoup API, Consumer.setPreferredLayers and ConsumerOptions.preferredLayers. The mediasoup API for layer preferences.
- Meetecho / Janus, PR #2741: Initial support for AV1-SVC Dependency Descriptor. The history of AV1 SVC support in Janus.
- Tsahi Levent-Levi, Five WebRTC Predictions for 2026, WebRTC.ventures, December 2025. The industry's 2026 outlook on AV1 / SVC adoption.
- LiveKit blog, An introduction to WebRTC Simulcast. Per-publisher overhead numbers for simulcast.


