Why this matters

If you are a media founder, a product manager, or a streaming CTO making your first build, you need one mental model that holds the whole platform — not nine disconnected diagrams. Without it, teams wire entitlement into each player, treat analytics as an afterthought, and encrypt with a scheme that locks out a third of their devices, then discover the mistakes in production where they are expensive to fix. This reference architecture gives you the boxes, the standard on every hop, and the boundaries that must not move: where the protection boundary sits, what belongs in the control plane, and which component decides your margin. Read it once and you can talk to engineers, content owners, and CDN salespeople without being sold the wrong thing. The downloadable component-notes pack lets you check your own design against it.

The whole picture in one view

Most OTT diagrams fail because they draw a single left-to-right arrow and stop. A real platform has two flows running at right angles. The data plane is the path the video bytes travel, from a source file or a live camera all the way to the screen. The control plane is everything that decides whether a viewer may watch, charges them, picks what to show, and records what happened — none of which the video itself flows through. Think of an airport: the data plane is the runway and the planes; the control plane is ticketing, security, gate assignment, and the operations dashboard. Both are essential, and confusing them is how platforms get built wrong.

The data plane, left to right, is seven boxes: ingest (accept the source), encode (build the adaptive bitrate ladder), package (wrap segments as CMAF for HLS and DASH), encrypt (apply cbcs Common Encryption), origin (the authoritative store of segments and manifests), CDN (the edge caches that actually serve viewers), and player (decrypt and play on every screen). The control plane sits beside it: identity and authentication, entitlement (may this account watch this title now?), billing and monetization, ad decisioning for ad-supported tiers, metadata and recommendations, and analytics and quality of experience (QoE). The rest of this article walks each box, names the standard on it, and marks the one boundary that protects your catalog.

[Diagram: complete OTT reference architecture. A left-to-right data-plane pipeline from ingest to player with the standard labeled on each hop, the cbcs protection boundary drawn around encrypt-origin-license, and a control-plane band of identity, entitlement, billing, ads, metadata, and analytics services beneath it.]

Figure 1. The full picture. The data plane (top) carries bytes left to right; the control plane (bottom) decides and records. The protection boundary wraps the segments from encryption through licensed playback.

The data plane, box by box

Ingest — accept the source

Ingest is the front door. For video on demand (VOD), it accepts a finished high-quality master file, often called the mezzanine — the clean, lightly-compressed copy everything downstream is built from. For live, it accepts a continuous feed over a contribution protocol such as RTMP, SRT, or the newer WHIP, from an encoder at the venue. The job of ingest is narrow: validate the source, normalize it, and hand it to encoding. Get this wrong — accept a broken file, lose a live segment — and every box downstream inherits the defect.

Encode — build the ladder

A single file cannot serve a viewer on hotel Wi-Fi and a viewer on fibre equally well, so the platform builds several versions at different resolutions and bitrates. That set is the encoding ladder, and the player climbs up or down it as the network changes, a technique called adaptive bitrate (ABR) streaming. The analogy is seat classes on a plane: same flight, different price and comfort, and the player books the class the network can afford. The encoder is a cost-heavy box because compute is billed per output minute; a typical managed cloud transcoder such as AWS Elemental MediaConvert charges roughly $0.015 per output minute for HD (AWS MediaConvert pricing, 2026). Each rung of the ladder is a separate output, so the charge multiplies with the number of renditions, but it is paid once per title rather than once per view. We link out to the Video Encoding section for codec mechanics; here the ladder is a product decision, covered in encoding ladder explained.
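The ladder-climbing rule can be sketched in a few lines. This is a minimal illustration, not any particular player's ABR algorithm: the rung values in `LADDER`, the `safety` factor, and the `pick_rung` name are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rung:
    height: int        # vertical resolution in pixels
    bitrate_kbps: int  # average video bitrate of this rendition


# Hypothetical ladder; real ladders are tuned per catalog and codec.
LADDER = [
    Rung(360, 800),
    Rung(540, 1800),
    Rung(720, 3000),
    Rung(1080, 5000),
]


def pick_rung(ladder: list[Rung], throughput_kbps: float, safety: float = 0.8) -> Rung:
    """Return the highest rung whose bitrate fits inside a safety margin of
    the measured throughput, falling back to the lowest rung otherwise."""
    budget = throughput_kbps * safety
    best = min(ladder, key=lambda r: r.bitrate_kbps)
    for rung in sorted(ladder, key=lambda r: r.bitrate_kbps):
        if rung.bitrate_kbps <= budget:
            best = rung
    return best
```

With a measured throughput of 4,000 kbps the sketch settles on the 720p rung, because the 5,000 kbps rung exceeds the 3,200 kbps budget. Production players also weigh buffer level and switch history, which this example deliberately leaves out.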

Package — wrap the segments once

Packaging chops each rung of the ladder into short segments (typically 2–6 seconds) and writes the manifest — the index file the player reads to find the segments. The modern practice is to package once into Common Media Application Format (CMAF, ISO/IEC 23000-19): one set of fragmented-MP4 segments that both HLS (IETF RFC 8216) and MPEG-DASH (ISO/IEC 23009-1) can index. Before CMAF, teams packaged the same video twice — once for Apple's HLS, once for DASH — doubling storage and cache load. CMAF ends that. One set of segments, two manifests, every device.
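To make "manifest" concrete, here is a minimal sketch of the top-level HLS playlist a packager writes, using only the `#EXTM3U` and `#EXT-X-STREAM-INF` tags from RFC 8216. The function name, the rendition tuples, and the file names are hypothetical; a production playlist would normally also carry attributes such as `CODECS`, and a DASH MPD would index the same CMAF segments in its own XML format.

```python
def multivariant_playlist(renditions: list[tuple[int, int, int, str]]) -> str:
    """Build a minimal HLS multivariant playlist.

    Each rendition is (width, height, peak_bandwidth_bps, media_playlist_uri).
    """
    lines = ["#EXTM3U"]
    for width, height, bandwidth, uri in renditions:
        # One STREAM-INF tag per rung, followed by the URI of that rung's
        # media playlist, which in turn lists the individual segments.
        lines.append(
            f"#EXT-X-STREAM-INF:BANDWIDTH={bandwidth},RESOLUTION={width}x{height}"
        )
        lines.append(uri)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(multivariant_playlist([
        (1280, 720, 3_300_000, "video_720p.m3u8"),
        (1920, 1080, 5_500_000, "video_1080p.m3u8"),
    ]))
```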

Encrypt — seal it once, the right way

This is the box most teams get wrong, and the error is expensive because it can breach a studio licence. Premium content must be protected by digital rights management (DRM), the system that keeps the decrypted video from being copied. The umbrella standard is Common Encryption (CENC, ISO/IEC 23001-7). It defines four protection schemes (cenc, cens, cbc1, and cbcs), of which two matter in practice: cenc (AES-CTR mode) and cbcs (AES-CBC mode with pattern encryption). They are not interchangeable. Apple FairPlay requires cbcs; the modern convergence is cbcs everywhere. Encrypt your segments once with cbcs, and you can issue licences for all three DRM systems (Google Widevine, Microsoft PlayReady, and Apple FairPlay) from the same files. This is the "encrypt once, license many" truth, and it is the single most valuable line in this article. Encrypt with cenc only, and FairPlay devices (every iPhone, iPad, Apple TV, and Safari browser) cannot play your catalog.
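The compatibility rule in this section can be written down as a small lookup, which is a convenient pre-flight check before a catalog is encrypted. The table below is a simplification of what the text states (older client versions may behave differently), and the names `SCHEME_SUPPORT`, `licensable_drms`, and `locked_out` are illustrative.

```python
# Which encryption schemes each DRM system can license, per the text above:
# FairPlay needs cbcs; Widevine and PlayReady are assumed here to accept
# either scheme on current devices.
SCHEME_SUPPORT = {
    "widevine": {"cenc", "cbcs"},
    "playready": {"cenc", "cbcs"},
    "fairplay": {"cbcs"},
}


def licensable_drms(scheme: str) -> set[str]:
    """DRM systems that can issue licences for segments in this scheme."""
    return {drm for drm, schemes in SCHEME_SUPPORT.items() if scheme in schemes}


def locked_out(scheme: str) -> set[str]:
    """DRM systems whose devices cannot play segments in this scheme."""
    return set(SCHEME_SUPPORT) - licensable_drms(scheme)


if __name__ == "__main__":
    for scheme in ("cenc", "cbcs"):
        print(scheme, "licensable:", sorted(licensable_drms(scheme)),
              "locked out:", sorted(locked_out(scheme)))
```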

Origin — the source of truth

The origin is the authoritative store that holds every segment and manifest. It is where the CDN goes when it does not already have a requested segment cached. The origin is not where most viewers' bytes come from — that is the CDN's job — but it must be correct and available, because a cache miss that the origin cannot answer is a stalled stream. Object storage such as Amazon S3 (around $0.023 per gigabyte per month, AWS S3 pricing, 2026) is the common origin store. A layer called origin shielding — a single designated cache that absorbs misses on the origin's behalf — protects it from the thundering herd of a popular premiere; we cover it in origin and origin shielding.

CDN — where viewers actually get their video

The content delivery network (CDN) is the fleet of edge servers near viewers that serve the vast majority of segments. A CDN edge cache is a corner store stocked with the items the neighbourhood asks for most, so the platform does not drive to the warehouse (the origin) every time. The CDN is the cost-deciding box, because its egress, the charge per gigabyte of video sent out to viewers, is the recurring bill that sets your margin. Egress is tiered, commit-dependent, and often billed at the 95th percentile, so it is never a single universal price. Amazon CloudFront's US pricing, for example, makes the first 1 TB per month free, then charges about $0.085/GB for the next 9 TB, falling to $0.040/GB and below in higher tiers, with annual commitments saving up to 30% (AWS CloudFront pricing, updated 2026-06-08). Because delivery decides margin, serious platforms use more than one CDN (see multi-CDN architecture) and engineer cache offload aggressively, covered in CDN cost engineering.
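Tiered pricing is easier to reason about as a calculation. The sketch below walks a monthly volume through a list of tiers. The first two tiers use the figures quoted above; the final flat $0.040/GB tier is a hypothetical stand-in for the real higher tiers, and the decimal terabyte (1 TB = 1,000 GB) is also an assumption, so check the vendor's price sheet before relying on the output.

```python
GB_PER_TB = 1000  # decimal terabytes; an assumption, vendors may count differently

# (tier size in GB, price per GB). A size of None means "everything above".
# The last tier is a hypothetical simplification of the real higher tiers.
TIERS = [
    (1 * GB_PER_TB, 0.0),    # first 1 TB per month free
    (9 * GB_PER_TB, 0.085),  # next 9 TB
    (None, 0.040),           # assumed flat rate beyond 10 TB
]


def tiered_egress_cost(total_gb: float, tiers=TIERS) -> float:
    """Monthly egress cost in USD for total_gb, charged tier by tier."""
    remaining = total_gb
    cost = 0.0
    for size_gb, price_per_gb in tiers:
        if remaining <= 0:
            break
        billed = remaining if size_gb is None else min(remaining, size_gb)
        cost += billed * price_per_gb
        remaining -= billed
    return cost


if __name__ == "__main__":
    for volume_gb in (500, 5_000, 20_000):
        print(f"{volume_gb:>6} GB -> ${tiered_egress_cost(volume_gb):,.2f}")
```

Passing a different `tiers` list lets the same function model a second CDN or a negotiated commit rate.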

Player — decrypt and play on every screen

The player is the app on each screen — web, iOS, Android, smart TV, streaming stick — that reads the manifest, fetches segments, runs the ABR algorithm, gets a licence, decrypts, and plays. On the web, playback uses two W3C standards: Media Source Extensions (MSE) to feed segments to the video element, and Encrypted Media Extensions (EME, a W3C Recommendation since 2017) to talk to the device's Content Decryption Module for DRM. Each platform speaks its native DRM — Safari and Apple devices use FairPlay, Chrome and Android use Widevine, Edge and Windows use PlayReady — which is exactly why cbcs matters: one set of segments, three licence types, every screen. The player is also where you measure quality, instrumented to send QoE beacons back to analytics.

The control plane — decide and record

The control plane never touches the video bytes, yet it decides whether your business works. Six services live here.

Identity and authentication establishes who the viewer is — sign-up, sign-in, and the session token that travels with each request. Entitlement is the gate that answers one question on every play: may this account watch this title, in this region, right now? Entitlement must be a single first-class service, not logic copied into each player, or the rules drift and viewers hit contradictory answers on different devices. Billing and monetization runs subscriptions, transactions, or the ad relationship, depending on whether the tier is SVOD, TVOD, or AVOD — business models that carry different requirements, covered in SVOD, AVOD, TVOD business models.
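The entitlement question has a small, fixed shape, which is what makes it practical to centralize. The following is a toy sketch of that decision, assuming a hypothetical `Grant` record and an in-memory list of grants; a real service would sit behind an API and read from the billing system, but every player would call the same single function.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Grant:
    """Hypothetical record: one account's right to one title."""
    account_id: str
    title_id: str
    regions: frozenset[str]
    starts: datetime
    ends: datetime


def may_watch(grants: list[Grant], account_id: str, title_id: str,
              region: str, now: datetime) -> bool:
    """Answer the one entitlement question: may this account watch this
    title, in this region, right now?"""
    return any(
        g.account_id == account_id
        and g.title_id == title_id
        and region in g.regions
        and g.starts <= now < g.ends
        for g in grants
    )


if __name__ == "__main__":
    grant = Grant(
        "acct-1", "title-9", frozenset({"US", "CA"}),
        datetime(2026, 1, 1, tzinfo=timezone.utc),
        datetime(2026, 2, 1, tzinfo=timezone.utc),
    )
    now = datetime(2026, 1, 15, tzinfo=timezone.utc)
    print(may_watch([grant], "acct-1", "title-9", "US", now))
    print(may_watch([grant], "acct-1", "title-9", "DE", now))
```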

For ad-supported tiers, ad decisioning chooses which ad to show. The dominant production pattern is server-side ad insertion (SSAI), which stitches ads into the stream so ad-blockers and device quirks cannot break them; ad opportunities are marked in the stream with SCTE-35 (ANSI/SCTE 35 2023r1) and the ad itself is requested with IAB VAST 4.3 (December 2022). Metadata and recommendations is the catalog's nervous system — titles, artwork, genres, and the discovery logic that decides what each viewer sees first; recommendation model internals live in the AI for Video Engineering section. Finally, analytics and QoE is how you see into the platform: viewership counting (a "play" is a defined event, not a guess), and quality metrics like startup time and rebuffering, defined in the Video Streaming section. Build entitlement and analytics as their own services from day one; both are cheap in bytes and decisive in outcome.

The protection boundary — the one line that protects your catalog

Draw a box around the segments from the moment they are encrypted to the moment a licensed player decrypts them. That box is the protection boundary, and it is the architectural feature that a studio licence audit examines. Inside it, the video is cbcs-encrypted CMAF; it stays encrypted at the origin, across the CDN, and in transit to the player. The keys live in a separate key store, and the licence server, one logical service issuing Widevine, PlayReady, and FairPlay licences, hands a decryption key to a player only after entitlement says yes. The clear, decrypted video exists only inside the player's protected pipeline on the device.
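The ordering inside the boundary, entitlement first and key second, can be shown with a toy gate. This is only a sketch of the control flow: `release_key`, the `is_entitled` callback, and the dictionary key store are hypothetical, and a real licence server wraps the key in a DRM-specific licence rather than returning raw bytes.

```python
from typing import Callable, Optional


def release_key(
    account_id: str,
    title_id: str,
    is_entitled: Callable[[str, str], bool],
    key_store: dict[str, bytes],
) -> Optional[bytes]:
    """Return the content key only when entitlement says yes.

    The key store is not consulted at all for a refused request.
    """
    if not is_entitled(account_id, title_id):
        return None
    return key_store.get(title_id)


if __name__ == "__main__":
    keys = {"title-9": bytes(16)}  # placeholder 16-byte key, not a real secret
    print(release_key("acct-1", "title-9", lambda a, t: True, keys) is not None)
    print(release_key("acct-1", "title-9", lambda a, t: False, keys))
```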

Be exact about what this protects. DRM stops copying of the decrypted bytes; it does not stop a camera pointed at the screen — that is a separate problem addressed by forensic watermarking. Output protection such as HDCP (High-bandwidth Digital Content Protection) is yet another, separate layer that guards the cable between a device and a display. Multi-DRM is one locked box with three differently-cut keys, one for each brand of lock the devices use — not three separate boxes. The mechanics live in this section's unique core: multi-DRM, one workflow, every device and CENC, CTR, CBCS explained.

[Diagram: protection boundary. A single cbcs Common Encryption step feeding one logical license service that issues Widevine, PlayReady, and FairPlay licenses to the matching devices, with the key store inside the boundary and the clear zone marked only inside the player.]

Figure 2. Encrypt once, license many. One cbcs encryption feeds three DRM license types from the same segments; the clear video exists only inside the player.

Live and VOD on the same architecture

A platform usually serves both pre-recorded video on demand and live events, and a common mistake is to imagine two entirely separate platforms. They share most of the architecture; they differ at the front. The VOD path ingests a finished file, encodes the ladder once, packages, encrypts, and parks the segments at the origin — there is no clock. The live path ingests a continuous feed, encodes in real time, and packages segments as they are produced, so latency — the delay between the real event and the viewer's screen — becomes a budget measured in seconds. Both paths converge at the same packager output, the same cbcs encryption, the same origin, CDN, and player. Designing them to converge is what lets one platform do both without doubling cost; the detail lives in live vs VOD: two pipelines, and the spike of a live premiere is handled in scaling and concurrency.

[Diagram: topology of live and VOD pipelines drawn side by side, the VOD path ingesting a file and the live path ingesting a continuous feed, both converging at a shared packager, cbcs encryption, origin, CDN, and player.]

Figure 3. Two front ends, one back end. Live and VOD differ only at ingest and timing; they converge at the packager and share everything downstream.

A worked example: sizing delivery for 50,000 concurrent viewers

Because delivery decides margin, the number worth working out loud is the egress bill at scale. Suppose a live event peaks at 50,000 concurrent viewers, each streaming an HD rendition that averages 5 megabits per second (Mbps). The math, one line at a time:

bytes per viewer per hour = 5 Mbps ÷ 8 bits/byte × 3,600 s = 2,250 MB ≈ 2.25 GB/hour
total egress per hour     = 50,000 viewers × 2.25 GB       = 112,500 GB ≈ 112.5 TB/hour
egress cost (blended)     = 112,500 GB × ~$0.05/GB         ≈ $5,625 per hour

If that peak held for a full three hours, the event would cost roughly $16,900 in egress alone at a blended $0.05/GB rate, before encoding, origin, and DRM. Real audiences ramp up and down, so treat the figure as an upper bound. Two levers move that number hard. First, a more efficient codec at the same quality cuts the 5 Mbps and scales the whole bill down linearly. Second, the blended rate falls with commit tiers and a higher cache-hit ratio, because every segment served from the CDN edge cache is a segment you do not also pay origin egress for. This is why the architecture is drawn to maximize offload, and why the recurring egress line, not the one-time build, is the number that decides whether the business has margin. The full month-level model is the OTT cost model.
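The same arithmetic as a small function makes it easy to vary the inputs. This is a sketch that reproduces the worked example under its own assumptions: constant peak concurrency, a single average bitrate, decimal units (1 GB = 1,000 MB), and a flat blended rate. The function and field names are illustrative.

```python
def egress_estimate(viewers: int, mbps: float, hours: float,
                    usd_per_gb: float) -> dict[str, float]:
    """Estimate delivery volume and cost for a flat audience."""
    # Mbps -> MB/s (divide by 8) -> MB/hour (x 3,600) -> GB/hour (divide by 1,000)
    gb_per_viewer_hour = mbps / 8 * 3600 / 1000
    total_gb = viewers * gb_per_viewer_hour * hours
    return {
        "gb_per_viewer_hour": gb_per_viewer_hour,
        "total_gb": total_gb,
        "cost_usd": total_gb * usd_per_gb,
    }


if __name__ == "__main__":
    one_hour = egress_estimate(50_000, 5, 1, 0.05)
    event = egress_estimate(50_000, 5, 3, 0.05)
    print(f"per hour: {one_hour['total_gb']:,.0f} GB, ${one_hour['cost_usd']:,.0f}")
    print(f"3 hours:  {event['total_gb']:,.0f} GB, ${event['cost_usd']:,.0f}")
```

Lowering `mbps` shows the codec lever and lowering `usd_per_gb` shows the commit-tier lever, both of which scale the result linearly.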

The architecture as a component table

When the boxes sit in a table with the standard on each and a column for what you actually own, the build decisions become concrete. Managed components are faster to ship; the ones marked "own it" are where margin and portability live.

| Component | Standard / protocol | Typical implementation | Plane | Own it? |
|---|---|---|---|---|
| Ingest | RTMP / SRT / WHIP (live); file (VOD) | Managed live encoder or upload | Data | Optional |
| Encode (ladder) | H.264 / HEVC / AV1 | MediaConvert, Bitmovin, ffmpeg | Data | Optional |
| Package | CMAF (ISO/IEC 23000-19) | Shaka Packager, managed packager | Data | Recommended |
| Encrypt | cbcs CENC (ISO/IEC 23001-7) | Multi-DRM service | Data | Yes (plan day one) |
| Origin | HTTP origin + object storage | S3 + origin service | Data | Yes |
| CDN | HLS (RFC 8216) / DASH (ISO/IEC 23009-1) | CloudFront, Akamai, Fastly (multi) | Data | Yes (margin lever) |
| Player | MSE + EME (W3C); native DRM | Shaka, hls.js, ExoPlayer, AVPlayer | Data | Yes |
| Identity / auth | OAuth / JWT sessions | Custom or managed identity | Control | Yes |
| Entitlement | Internal API | Single first-class service | Control | Yes (never per-player) |
| Billing | Subscriptions / transactions | Stripe, Cleeng, custom | Control | Yes |
| Ad decisioning | SCTE-35 + VAST 4.3 (SSAI) | MediaTailor, custom SSAI | Control | If AVOD |
| Analytics / QoE | QoE beacons | Mux Data, Conviva, custom | Control | Yes |

Table 1. Every component, the standard on it, the plane it belongs to, and whether you should own it. The "own it" column marks where margin (CDN), portability (CMAF, cbcs), and product control (entitlement) live.

A common mistake: bolting the control plane onto the player

The most frequent architecture failure we see is not in the data plane — it is treating the control plane as an afterthought wired into each player. When entitlement logic is duplicated inside the web app, the iOS app, and the TV app, the three drift apart, and a viewer who is allowed to watch on their phone is refused on their TV. When analytics is added late, you ship a platform you cannot see into, and you debug rebuffering blind. The fix is structural: draw entitlement and analytics as their own first-class services beside the data plane from day one, with every player calling the same entitlement API and emitting the same QoE beacons. The second common mistake — cenc-only encryption that silently breaks every FairPlay device — is avoided by the single rule already stated: encrypt once with cbcs.

Where Fora Soft fits in

A reference architecture is only useful if a team can wire it correctly at the scale the business needs, and the hard part is exactly the boundaries: keeping delivery portable across CDNs, putting the protection boundary in the right place with cbcs from day one, and making entitlement and analytics first-class rather than bolted on. Fora Soft has built video streaming, OTT/Internet TV, e-learning, telemedicine, and video surveillance software since 2005 — 625+ shipped projects for 400+ clients — and that work is precisely this assembly problem: connecting managed encoding, multi-CDN delivery, and standards-based multi-DRM into a custom platform whose unit economics hold as the audience grows from a thousand viewers to a million. When a media company needs an architecture that scales without surrendering its margin to a reseller's egress markup, that is the engineering we bring.

References

  1. RFC 8216 — HTTP Live Streaming (HLS) — IETF. The manifest/segment format delivered as one of the two outputs from a single CMAF package; defines the .m3u8 playlist the player reads. Tier 1 (official standard). https://www.rfc-editor.org/rfc/rfc8216 (accessed 2026-06-15)
  2. ISO/IEC 23009-1 — Dynamic Adaptive Streaming over HTTP (MPEG-DASH) — ISO/IEC. The DASH delivery standard indexed from the same CMAF segments as HLS. Tier 1. https://www.iso.org/standard/83314.html (accessed 2026-06-15)
  3. ISO/IEC 23000-19 — Common Media Application Format (CMAF) — ISO/IEC. One set of fragmented-MP4 segments for both HLS and DASH; the "package once" foundation of the data plane. Tier 1. https://www.iso.org/standard/85623.html (accessed 2026-06-15)
  4. ISO/IEC 23001-7 — Common Encryption (CENC) — ISO/IEC. Defines the cenc (AES-CTR) and cbcs (AES-CBC) schemes; cbcs enables encrypt-once for Widevine/PlayReady/FairPlay; FairPlay requires cbcs. Tier 1. https://www.iso.org/standard/84637.html (accessed 2026-06-15)
  5. Encrypted Media Extensions (EME) — W3C Recommendation, 18 September 2017. The browser API the web player uses to talk to the device Content Decryption Module for DRM playback. Tier 1. https://www.w3.org/TR/encrypted-media/ (accessed 2026-06-15)
  6. Media Source Extensions (MSE) — W3C. The browser API that feeds adaptive-bitrate segments to the HTML video element. Tier 1. https://www.w3.org/TR/media-source/ (accessed 2026-06-15)
  7. ANSI/SCTE 35 2023r1 — Digital Program Insertion Cueing Message — SCTE. The in-stream signaling standard that marks ad opportunities for server-side ad insertion; latest revision published 2023-11-30. Tier 1. https://account.scte.org/standards/library/catalog/scte-35-digital-program-insertion-cueing-message/ (accessed 2026-06-15)
  8. VAST 4.3 — Video Ad Serving Template — IAB Tech Lab, December 2022. The XML standard used to request and track the ad once an SCTE-35 opportunity fires. Tier 1 (industry standards body). https://iabtechlab.com/standards/vast/ (accessed 2026-06-15)
  9. Amazon CloudFront pay-as-you-go pricing — Amazon Web Services. US egress tiers (first 1 TB free, ~$0.085/GB next 9 TB, down to $0.040/GB and below); annual Savings Bundle up to 30%. Page updated 2026-06-08. Tier 4 (vendor pricing). https://aws.amazon.com/cloudfront/pricing/pay-as-you-go/ (accessed 2026-06-15)
  10. AWS Elemental MediaConvert pricing — Amazon Web Services. Per-output-minute transcode (~$0.015/min HD Basic AVC). Tier 4. https://aws.amazon.com/mediaconvert/pricing/ (accessed 2026-06-15)
  11. Amazon S3 pricing — S3 Standard — Amazon Web Services. ~$0.023/GB-month first 50 TB (us-east-1); the common origin store. Tier 4. https://aws.amazon.com/s3/pricing/ (accessed 2026-06-15)

Source note (per §4.3.2): every claim about a delivery format, encryption scheme, DRM behavior, or ad-signaling standard traces to a tier-1 primary source (refs 1–8: RFC 8216, ISO/IEC 23009-1, 23000-19, 23001-7, W3C EME and MSE, SCTE 35 2023r1, IAB VAST 4.3). Cost figures are dated 2026 vendor pricing (refs 9–11, tier 4). The "encrypt once, license many" framing follows ISO/IEC 23001-7 cbcs; the common listicle claim that "multi-DRM means three encodes" is overridden by the spec. No lower-tier source overrode a standard.