Why this matters
If you are a media founder, a product manager, or a streaming CTO, the scariest line item on your platform is not the build — it is the bill on the night your biggest event goes live. Scaling is where OTT platforms either earn their margin or quietly lose it: the same stream that costs a few hundred dollars a month for a small audience can cost tens of thousands of dollars an hour at a million concurrent viewers, and a control plane that was never load-tested for a login storm will fail at exactly the moment the most people are watching. This article gives you the mental model and the numbers to plan capacity before you sign a CDN contract, talk to engineers without being sold the wrong architecture, and tell the difference between a platform that was designed to scale and one that merely works in a demo. The downloadable capacity-planning worksheet lets you run the arithmetic for your own audience.
Concurrency is the number that matters, not catalog size
The first instinct of someone new to streaming is to size a platform by how much content it holds. That is the wrong axis. A catalog of ten thousand films sitting in storage costs almost nothing to keep; what costs money is people watching it at the same time. The single most important capacity number in OTT is concurrency: the count of streams being delivered simultaneously at a given instant. A platform with a million registered users but only five thousand watching at once is a five-thousand-concurrency platform, and that is the number every delivery decision is sized against.
Concurrency matters because video is heavy and live. Unlike a web page, which is fetched once and then sits on the screen costing nothing, a video stream is a continuous river of bytes that keeps flowing for as long as the viewer watches. Ten thousand viewers do not download a file ten thousand times and stop; they each pull several megabits every second for an hour or more. The platform's job is to keep all those rivers flowing at once, and the cost and engineering of doing that scale almost linearly with how many flow together.
This is why the same platform behaves like a completely different animal at different concurrency levels. At a thousand concurrent viewers, a single well-configured CDN and a modest origin handle everything and you barely think about it. At a million, you are moving more data per second than many national internet backbones, every component has to be redundant, and a single mis-sized service can take the whole event down. The architecture described in how an OTT platform works does not change shape as you scale (the boxes are the same), but each box must be engineered for the peak.
The arithmetic of a million viewers
Numbers make this concrete, so let us walk the math out loud. The core figure is per-viewer bitrate: a 1080p high-definition stream on a modern codec runs around 4 megabits per second (Mbps). To find the total throughput your delivery must sustain, multiply per-viewer bitrate by concurrency:
1,000,000 viewers × 4 Mbps = 4,000,000 Mbps = 4 terabits per second (Tbps)
Four terabits per second is the wall of water your CDN must push out continuously while the event runs. That is the throughput figure. To get the cost, convert to gigabytes, because CDNs bill per gigabyte of egress — the data sent out to viewers. One viewer-hour at 4 Mbps is:
4 Mbps × 3,600 seconds = 14,400 megabits = 1,800 megabytes ≈ 1.8 GB per viewer-hour
So a million viewers watching for one hour move:
1,000,000 × 1.8 GB = 1,800,000 GB = 1.8 petabytes in one hour
At a representative discounted CDN egress rate of $0.04 per gigabyte (high-volume tier; egress pricing is tiered, commit-dependent, and changes — see below), that single hour costs:
1,800,000 GB × $0.04 = $72,000 for one hour of the event
Seventy-two thousand dollars an hour is why scaling is a margin question, not a feature. The same arithmetic at a thousand concurrent viewers is $72 an hour — a rounding error. Nothing about the platform's logic changed between the two; only the multiplier did. The full cost picture, including storage, encoding, and DRM, lives in the OTT cost model; here the point is narrower: egress dominates at scale, and egress is a function of concurrency.
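The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration rather than a pricing tool: the function name `egress_estimate` and its parameters are hypothetical, decimal units are assumed (1 GB = 1,000 MB) as in the worked figures, and the $0.04 per gigabyte rate is the same representative assumption used above.

```python
def egress_estimate(viewers, bitrate_mbps=4.0, hours=1.0, usd_per_gb=0.04):
    """Return (throughput_tbps, total_gb, cost_usd) for a concurrency level.

    Assumes every viewer pulls the same bitrate for the whole period and
    uses decimal units (1 GB = 1,000 MB), matching the worked example.
    """
    throughput_tbps = viewers * bitrate_mbps / 1_000_000   # Mbps -> Tbps
    gb_per_viewer_hour = bitrate_mbps * 3600 / 8 / 1000    # megabits -> MB -> GB
    total_gb = viewers * gb_per_viewer_hour * hours
    cost_usd = total_gb * usd_per_gb
    return throughput_tbps, total_gb, cost_usd


for viewers in (1_000, 100_000, 1_000_000):
    tbps, gb, usd = egress_estimate(viewers)
    print(f"{viewers:>9,} viewers: {tbps:.3f} Tbps, {gb:,.0f} GB/h, ${usd:,.0f}/h")
```

Changing only the `viewers` argument reproduces the point of this section: the formula is identical at every scale, and the multiplier alone moves the bill from $72 to $72,000 an hour.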
A note on the egress price itself, because it is the most misquoted number in streaming. CDN egress is never a single universal figure. It is tiered (the per-gigabyte rate falls as monthly volume rises), commit-dependent (annual spend commitments unlock lower rates), and often the bandwidth component is billed at the 95th percentile of your throughput rather than a flat average. Under the usual convention the top 5% of throughput samples are discarded, so peaks that together last longer than about 5% of the billing period set the rate you pay all month. Amazon CloudFront, for example, starts around $0.085/GB for the first tier in the US and falls below $0.040/GB at high volume, with commitments saving up to 30% (AWS CloudFront pricing, 2026). Always model your own tier and date the assumption. The engineering of that bill is covered in CDN cost engineering.
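To make the 95th-percentile point concrete, here is a minimal sketch of one common way burstable bandwidth is computed: throughput is sampled every five minutes, the samples are sorted, the top 5% are discarded, and the highest remaining sample is the billable rate. The function name `billable_p95`, the five-minute interval, and the traffic figures are illustrative assumptions; real contracts define their own sampling and rounding rules.

```python
import math


def billable_p95(samples_gbps):
    """Sort the samples, discard the top 5%, and return the highest one left."""
    if not samples_gbps:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_gbps)
    index = math.ceil(len(ordered) * 95 / 100) - 1
    return ordered[index]


# A 30-day month of 5-minute samples is 8,640 samples; one day is 288.
baseline = [10.0] * 8640

one_day_event = baseline[:]
one_day_event[:288] = [400.0] * 288      # about 3.3% of the month at 400 Gbps

two_day_event = baseline[:]
two_day_event[:576] = [400.0] * 576      # about 6.7% of the month at 400 Gbps

print("one-day event bills at", billable_p95(one_day_event), "Gbps")
print("two-day event bills at", billable_p95(two_day_event), "Gbps")
```

In this sketch the one-day burst falls entirely inside the discarded 5%, so the billable rate stays at the 10 Gbps baseline, while the two-day burst pushes it to 400 Gbps for the whole month.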
Figure 1. The cost of concurrency. Throughput and hourly egress rise almost linearly with concurrent viewers; the lever that bends the curve is cache offload, not a cheaper per-gigabyte rate.
The lever that bends the curve: cache offload
If egress scaled purely with concurrency and nothing could be done about it, every large platform would be unprofitable. The reason they are not is that most of those bytes never travel the expensive path. A CDN is a fleet of edge caches — servers placed close to viewers that keep copies of the most-requested video segments. A CDN edge cache is a corner store stocked with the items the neighbourhood asks for most, so the platform does not drive to the warehouse every time. When two thousand viewers in one city are all watching the same live segment, the edge fetches it from the origin once and serves it from local memory two thousand times.
The measure of how well this works is the cache-hit ratio (or origin offload): the share of bytes served to viewers that came from the cache rather than from your origin. If 95% of bytes are served from cache, your origin, the authoritative store everything is built from, only has to produce 5% of the traffic, and your most expensive infrastructure shrinks twentyfold. Industry practice targets a cache-hit ratio above 85% for video, with healthy live platforms well into the 90s (Fastly and CacheFly origin-offload guidance, 2024). The arithmetic is direct: the same 4 Tbps event with a 95% offload ratio puts only 200 Gbps on the origin instead of 4 Tbps.
This is why concurrency, counterintuitively, can make delivery cheaper per viewer rather than more expensive. A live event where everyone watches the same stream at the same moment is the best possible case for a cache: one fetch, served to everyone. The hard case is the opposite — a large catalog of rarely-watched titles, where each request is for something different and the cache is cold. That "long tail" is why video-on-demand (VOD) and live scale differently, a distinction drawn in live vs VOD: two pipelines.
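The contrast between the live case and the long tail can be illustrated with a toy simulation. The sketch below assumes a simple least-recently-used (LRU) cache, equal-sized objects (so request hit ratio stands in for byte hit ratio), and made-up request patterns; `lru_hit_ratio` and `origin_load_gbps` are hypothetical helper names, and real CDN caches are considerably more sophisticated.

```python
import random
from collections import OrderedDict


def origin_load_gbps(edge_tbps, cache_hit_ratio):
    """Traffic that still reaches the origin, in Gbps, for a given hit ratio."""
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    return edge_tbps * 1000 * (1.0 - cache_hit_ratio)


def lru_hit_ratio(requests, cache_size):
    """Replay a request sequence through a toy LRU cache; return the hit ratio."""
    cache = OrderedDict()
    hits = 0
    for key in requests:
        if key in cache:
            hits += 1
            cache.move_to_end(key)          # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)   # evict the least recently used
    return hits / len(requests)


rng = random.Random(7)  # fixed seed so the sketch is repeatable

# Live: 2,000 viewers all ask for the same segment.
live = ["segment-0001"] * 2000
# Long tail: 2,000 requests scattered across a 10,000-title catalog.
long_tail = [f"title-{rng.randrange(10_000)}" for _ in range(2000)]

print(f"live hit ratio:      {lru_hit_ratio(live, 100):.4f}")
print(f"long-tail hit ratio: {lru_hit_ratio(long_tail, 100):.4f}")
print(f"origin load at 95% offload: {origin_load_gbps(4, 0.95):.0f} Gbps")
```

The live sequence misses exactly once, on the first request, while the scattered catalog requests rarely find what they need in a 100-object cache.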
Origin shielding — protecting the source of truth
Even with edge caches, a problem appears at scale. A large CDN has dozens of edge locations, and on a cache miss each one independently asks the origin for the same segment. The origin, which must run compute-heavy work like just-in-time packaging, can be overwhelmed by dozens of simultaneous identical requests — the classic thundering herd. The fix is a single intermediate cache that all the edges fetch through, called origin shielding (Amazon CloudFront Origin Shield, Cloudflare Tiered Cache, Fastly Shield POPs). With a shield, the dozens of edge misses collapse into as few as one request to your origin per object. Amazon reports customers using Origin Shield for live streaming and multi-CDN workloads see up to a 57% reduction in origin load (AWS, Origin Shield announcement, 2020). The deeper mechanics are in origin and origin shielding.
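The collapse of edge misses into a single origin request can be shown with a small model. This is a sketch under simplifying assumptions: the class names are hypothetical, every edge starts cold, requests arrive one after another, and the in-flight request coalescing that production shields also perform is not modeled.

```python
class Origin:
    """Counts how many requests actually reach the source of truth."""

    def __init__(self):
        self.requests = 0

    def fetch(self, key):
        self.requests += 1
        return f"bytes-of-{key}"


class Cache:
    """A cache tier that asks its upstream only on a miss."""

    def __init__(self, upstream):
        self.upstream = upstream
        self.store = {}

    def fetch(self, key):
        if key not in self.store:
            self.store[key] = self.upstream.fetch(key)
        return self.store[key]


def origin_requests(edge_count, use_shield):
    """Return the number of origin fetches when every edge misses once."""
    origin = Origin()
    upstream = Cache(origin) if use_shield else origin
    edges = [Cache(upstream) for _ in range(edge_count)]
    for edge in edges:
        edge.fetch("segment-0001.m4s")   # each edge starts with an empty cache
    return origin.requests


print("without shield:", origin_requests(40, use_shield=False))  # 40
print("with shield:   ", origin_requests(40, use_shield=True))   # 1
```

The shield is simply one more cache tier placed between the edges and the origin, which is why offload compounds: a miss at the edge is usually a hit one level up.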
Figure 2. Why a million viewers do not crush your origin. A shield collapses edge misses into one origin fetch per object; the edge serves the rest from cache. Offload compounds down the tree.
The two planes scale on different axes
Here is the idea that organizes all capacity planning: an OTT platform is two systems that scale on different measures, and confusing them is the most common scaling mistake.
The data plane is the path the video bytes travel — encode, package, origin, CDN, player. It scales with bandwidth: terabits per second, gigabytes of egress. Its cost and capacity are a function of concurrency × bitrate, and its main lever is cache offload, as we just saw.
The control plane is everything that decides and records without the video flowing through it: sign-in and identity, entitlement (may this account watch this title, here, now?), billing, ad decisioning, metadata, and analytics. It scales not with bandwidth but with requests per second: how many sign-ins, license requests, entitlement checks, and analytics beacons hit it each second. A 4 Mbps video stream and a tiny one-kilobyte "may I watch?" request live in completely different cost universes. The control plane moves almost no data, but at a million concurrent viewers it must answer millions of small questions in a tight window.
This split, drawn in detail in the streaming pipeline box by box, explains why platforms fail in surprising ways. A team that obsesses over CDN capacity and forgets the control plane will watch their sign-in service collapse under a login storm while the video edge sits half-idle. The two planes need separate capacity plans.
Figure 3. Two planes, two axes. The data plane is sized in terabits per second and tamed by cache offload; the control plane is sized in requests per second and tamed by autoscaling and queues.
The thundering herd: when a million people arrive at once
Steady-state concurrency is the easy part. The hard part is the arrival curve — how fast viewers show up. A popular VOD catalog grows and shrinks its audience smoothly over the day, and capacity can follow it gently. A live premiere or a sports match does the opposite: the audience can go from near zero to its peak in the first sixty seconds after kickoff, as everyone presses play within moments of each other. That synchronized rush is the thundering herd, and it stresses the control plane far more than the data plane.
Think about what happens in that minute. Hundreds of thousands of apps simultaneously open a session, check entitlement, request a DRM license, and begin sending analytics. The video edge can absorb this because they are all asking for the same first segment — a cache's best case. But the control plane receives hundreds of thousands of distinct small requests in seconds: a genuine traffic spike of the kind that topples under-provisioned services. The 2026 ICC T20 World Cup final streamed on JioHotstar reportedly reached around 821 million concurrent viewers (Indian Television, March 2026) — an audience that arrives in a wave no reactive system can scale into fast enough.
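A rough way to put a number on that first minute is to multiply arrivals by the control-plane calls each start-up makes and divide by the arrival window. The sketch below is a back-of-the-envelope estimate: the function name, the 500,000-viewer figure, and the assumption of exactly four calls per start (session, entitlement, DRM license, first analytics beacon) are illustrative, not measurements.

```python
def startup_requests_per_second(arrivals, window_seconds, requests_per_start=4):
    """Average control-plane request rate while an audience is arriving.

    requests_per_start=4 assumes one call each for session, entitlement,
    DRM license, and the first analytics beacon (a simplification).
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return arrivals * requests_per_start / window_seconds


# Hypothetical event: 500,000 viewers press play within 60 seconds.
rate = startup_requests_per_second(500_000, 60)
print(f"{rate:,.0f} requests/second averaged over the window")
```

Because arrivals are rarely spread evenly, the busiest second will sit above this average, which is one reason to plan with headroom.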
The answer is to provision ahead of the curve, not in reaction to it. Three techniques do this. Predictive autoscaling spins up control-plane capacity before the event based on the expected curve rather than waiting for load to appear; JioHotstar reported using machine-learning-driven, ladder-based autoscaling on Kubernetes that pre-scales for known cricket traffic patterns (engineering accounts, 2026). Pre-warming asks the CDN to pre-position content and stand up capacity at the edge before doors open. And queuing and graceful degradation — a waiting room, a retry with backoff, a brief static image instead of a hard error — keep the platform standing when demand briefly exceeds even a generous plan. Live-event specifics are covered in live event delivery and the premiere spike.
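Of the three techniques, retry with backoff is the easiest to show in code. The following is a minimal client-side sketch using exponential backoff with random jitter, so that clients that failed together do not all retry at the same instant. The function names, the 0.5-second base delay, the 30-second cap, and the choice of `ConnectionError` as the retryable failure are assumptions for illustration.

```python
import random
import time


def backoff_delays(retries, base=0.5, cap=30.0, rng=None):
    """Return one randomized delay per retry ("full jitter" exponential backoff)."""
    rng = rng or random.Random()
    return [rng.uniform(0, min(cap, base * 2 ** n)) for n in range(retries)]


def call_with_retry(operation, attempts=5, sleep=time.sleep, rng=None):
    """Call operation(); on ConnectionError wait a jittered delay and try again."""
    delays = backoff_delays(attempts - 1, rng=rng)
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise                      # out of attempts: surface the error
            sleep(delays[attempt])


# Demo: a hypothetical license request that fails twice, then succeeds.
calls = {"count": 0}


def flaky_license_request():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("license service busy")
    return "license-granted"


waits = []  # record the delays instead of sleeping so the demo runs instantly
result = call_with_retry(flaky_license_request, sleep=waits.append,
                         rng=random.Random(1))
print(result, "after", calls["count"], "calls; waits:", [round(w, 2) for w in waits])
```

The jitter matters as much as the exponential growth: without it, a herd that failed in the same second would come back in the same second.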
Common mistake: sizing for the average, not the peak. A platform that provisions for its average concurrency will be comfortably over-provisioned 95% of the time and catastrophically under-provisioned during the one event that matters. Capacity is sized to the peak of the arrival curve plus headroom, not to the daily mean. The whole business case for a live event can be undone by a control plane that was sized like a VOD service.
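A small worked example shows how misleading the average can be. The day profile below is invented for illustration (23 quiet hours and one event hour), the 30% headroom is an assumed default rather than a recommendation, and `capacity_target` is a hypothetical helper.

```python
from statistics import mean


def capacity_target(hourly_concurrency, headroom=0.3):
    """Size to the observed peak plus a safety margin (30% is an assumed default)."""
    return max(hourly_concurrency) * (1 + headroom)


# Hypothetical day: 23 quiet hours at 5,000 viewers, one event hour at 200,000.
day = [5_000] * 23 + [200_000]
average = mean(day)
covered = sum(1 for load in day if load <= average)

print(f"average concurrency: {average:,.0f}")
print(f"hours an average-sized platform covers: {covered} of {len(day)}")
print(f"peak-based target: {capacity_target(day):,.0f}")
```

Sizing to the average of 13,125 covers 23 of the 24 hours and fails in the only hour that carries the event.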
More than one CDN: failover and content steering
At a thousand concurrent viewers, one CDN is fine. At scale, relying on a single CDN is a bet that one company's network will never have a bad night in your biggest market — and that bet eventually loses. Serious platforms deliver through more than one CDN for two reasons: resilience (if one degrades during a spike, traffic shifts to another) and reach (different CDNs are strongest in different regions). JioHotstar's record streams ran across four CDNs — Akamai, Amazon CloudFront, Cloudflare, and its own Jio CDN — with sub-250-millisecond automated failover between them (engineering accounts, 2026).
The mechanism that makes multi-CDN practical without rebuilding your players is content steering, now a published standard. A small remote service hands the player a ranked list of delivery sources and can change that list at start-up or mid-stream, moving viewers off a struggling CDN onto a healthy one without interrupting playback. It is standardized so the same steering server controls both HLS and DASH players: HLS exposes it through the EXT-X-CONTENT-STEERING master-playlist tag with a SERVER-URI pointing at the steering manifest (Apple HLS Content Steering Specification), and DASH defines the equivalent in ETSI TS 103 998 v1.1.1 (DASH-IF Content Steering, January 2024), implemented in the dash.js reference player. Because the protocol is shared, one control service can steer your entire audience across CDNs in real time. The architecture itself — origins, shields, orchestration — is covered in the Video Streaming section's multi-CDN article.
A minimal HLS content-steering tag looks like this:
#EXTM3U
#EXT-X-CONTENT-STEERING:SERVER-URI="https://steer.example.com/manifest.json",PATHWAY-ID="cdn-a"
#EXT-X-STREAM-INF:BANDWIDTH=4000000,PATHWAY-ID="cdn-a"
https://cdn-a.example.com/1080p.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=4000000,PATHWAY-ID="cdn-b"
https://cdn-b.example.com/1080p.m3u8
The two PATHWAY-ID values are the two CDNs; the steering server decides, per viewer and per moment, which pathway to prefer.
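On the server side, the steering manifest that `SERVER-URI` points at is a small JSON document. The sketch below builds one in Python, using the `VERSION`, `TTL`, `PATHWAY-PRIORITY`, and optional `RELOAD-URI` fields as described in the HLS content steering specification; check the current specification before relying on the exact field set. Ranking pathways by an observed error rate is a hypothetical policy chosen for illustration, as are the function name and the numbers.

```python
import json


def steering_manifest(error_rates, ttl_seconds=30, reload_uri=None):
    """Build a steering manifest that prefers the pathway with the fewest errors.

    error_rates maps a PATHWAY-ID to an observed error rate (a hypothetical
    health signal); ties are broken alphabetically for a stable ordering.
    """
    ranked = sorted(error_rates, key=lambda pathway: (error_rates[pathway], pathway))
    manifest = {
        "VERSION": 1,
        "TTL": ttl_seconds,            # seconds before the player asks again
        "PATHWAY-PRIORITY": ranked,    # most preferred pathway first
    }
    if reload_uri:
        manifest["RELOAD-URI"] = reload_uri
    return manifest


# cdn-a is struggling, so cdn-b is ranked first.
manifest = steering_manifest({"cdn-a": 0.12, "cdn-b": 0.01})
print(json.dumps(manifest, indent=2))
```

A short `TTL` lets the service move viewers quickly during an incident, at the cost of more steering requests hitting the control plane.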
Latency is a scaling lever too
There is one more dimension that scales against you: latency. Low-latency live streaming — getting glass-to-glass delay down from the 20–30 seconds of standard HLS to the 3–8 seconds of Low-Latency HLS (LL-HLS) and Low-Latency DASH — is what makes a live sports stream feel live. But low latency is bought with shorter segments and more frequent requests, which means more requests per second hitting your edge and your control path. Choosing aggressive low latency at a million concurrency multiplies the small-request load you must plan for. It is a real product decision with a real capacity cost, not a free toggle; the trade-off is detailed in the Video Streaming section's low-latency article.
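The request-rate cost of low latency can be estimated with simple division. The sketch below assumes each viewer fetches one media object and one playlist update per interval, with 6-second segments for standard delivery and 1-second partial segments for low latency; these durations and the two-requests-per-interval model are illustrative assumptions, and `edge_requests_per_second` is a hypothetical helper.

```python
def edge_requests_per_second(viewers, object_seconds, requests_per_object=2):
    """Rough edge request rate for a live stream.

    Assumes each viewer fetches one media object and one playlist update
    per interval (requests_per_object=2 is a simplification).
    """
    if object_seconds <= 0:
        raise ValueError("object_seconds must be positive")
    return viewers * requests_per_object / object_seconds


for label, seconds in (("6 s segments", 6.0), ("1 s parts", 1.0)):
    rate = edge_requests_per_second(1_000_000, seconds)
    print(f"{label}: {rate:,.0f} requests/second")
```

Under these assumptions the same million viewers generate six times as many requests at the shorter interval, even though the bytes delivered are roughly unchanged.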
Capacity planning in practice
Putting it together, planning capacity for a platform that must grow from a thousand to a million viewers comes down to a short, disciplined sequence. First, forecast peak concurrency, not average — the single highest number of simultaneous streams you expect, separately for steady catalog traffic and for any live spike. Second, compute data-plane throughput and egress from that peak (concurrency × bitrate, then × watch-hours for cost), and size the CDN and origin for the peak with a realistic cache-offload assumption. Third, size the control plane in requests per second for the arrival curve, and choose predictive autoscaling for known events. Fourth, engineer for failure: more than one CDN with content steering, origin shielding, and graceful degradation so a spike that exceeds the plan bends instead of breaks. Fifth, load-test the whole thing at the target peak before the event, because the only way to know a platform scales is to push it there in a rehearsal, not in production on the biggest night. The reference architecture that ties these components together is the OTT reference architecture.
Where Fora Soft fits in
Fora Soft has built video streaming and OTT/Internet TV software since 2005, across 625+ shipped projects for 400+ clients. The recurring scaling problem we are hired to solve is the gap between a platform that works in a demo and one that holds up when the audience arrives all at once: sizing the data plane for peak concurrency, engineering cache offload and multi-CDN delivery so egress stays affordable, and building the control plane — entitlement, billing, analytics — to absorb a live premiere's thundering herd. We are vendor-neutral on CDNs and DRM services; we translate the scale requirement into an architecture, then load-test it against the real peak before the event, not after.
What to read next
- The OTT cost model: what a platform actually costs to build and run
- CDN cost engineering: egress, commits, and the 95th percentile
- The OTT reference architecture: the full picture
Call to action
- Talk to a streaming engineer — book a 30-minute scoping call to talk through your OTT platform scaling plan.
- See our case studies — 250+ shipped projects across video streaming, WebRTC, OTT, telemedicine, e-learning, surveillance, and AR/VR.
- Download the OTT scaling & capacity-planning worksheet — A one-page worksheet to run the concurrency arithmetic for your own audience (peak concurrency, throughput, hourly egress), tick the cost-curve levers (cache offload, origin shielding, multi-CDN content steering), size the data and….
References
- IETF, HTTP Live Streaming, RFC 8216 (August 2017) — the HLS specification. Content steering is added in the in-progress 2nd edition, draft-pantos-hls-rfc8216bis. https://datatracker.ietf.org/doc/html/rfc8216 — Tier 1 (primary spec).
- Apple, HLS Content Steering Specification — defines the EXT-X-CONTENT-STEERING master-playlist tag and SERVER-URI attribute. https://developer.apple.com/streaming/HLSContentSteeringSpecification.pdf — Tier 2 (issuing body's guidance / spec).
- ETSI / DASH-IF, Content Steering for DASH, ETSI TS 103 998 v1.1.1 (January 2024) — the multi-CDN content-steering standard for MPEG-DASH, compatible with HLS steering at the client level. — Tier 1 (primary standard).
- ISO/IEC, Dynamic adaptive streaming over HTTP (DASH), ISO/IEC 23009-1 — the DASH specification underlying multi-CDN delivery. — Tier 1 (primary spec).
- ISO/IEC, Common Media Application Format (CMAF), ISO/IEC 23000-19 — the package-once segment format that lets one cached object serve both HLS and DASH, improving cache-hit ratio. — Tier 1 (primary spec).
- Amazon Web Services, Announcing Amazon CloudFront Origin Shield (October 2020) — "up to a 57% reduction in origin load" for live-streaming and multi-CDN workloads. https://aws.amazon.com/about-aws/whats-new/2020/10/announcing-amazon-cloudfront-origin-shield — Tier 3 (first-party vendor). Re-verify the figure; vendor claims change.
- Fastly, Origin Offload: A measure of CDN efficiency for reducing egress cost (2024) — defines origin offload and its relationship to egress cost. https://www.fastly.com/blog/origin-offload-a-measure-of-cdn-efficiency-for-reducing-egress-cost — Tier 3 (first-party vendor).
- Amazon Web Services, Amazon CloudFront pricing (accessed 2026-06) — tiered US egress rates and commitment discounts. https://aws.amazon.com/cloudfront/pricing/ — Tier 3 (first-party vendor). Date-sensitive; re-verify before quoting.
- SVTA, Multi-CDN Delivery (October 2024) — Streaming Video Technology Alliance guidance on multi-CDN systems and content steering. https://www.svta.org/2024/10/24/multi-cdn-delivery/ — Tier 4 (industry alliance).
- Indian Television, JioHotstar hits record 821 million concurrent views (March 2026) — reported peak concurrency for the ICC T20 World Cup 2026 final. https://indiantelevision.com/television/jiohotstar-hits-record-821-million-concurrent-viewers-as-india-crushed-new-zealand/ — Tier 6 (trade press; orientation only). Concurrency figure as reported, not independently verified.
Where popular sources disagreed with the specs, the article follows the spec: many articles describe multi-CDN as a custom per-player integration, but content steering is now a published standard (ETSI TS 103 998, Apple HLS) that both HLS and DASH players implement — so multi-CDN does not require bespoke player code. We also override the common "one CDN is enough" framing with the spec-backed steering approach.


