Why this matters

If you are a founder, product manager, or first-time streaming CTO, the word "OTT" hides a dozen moving parts that vendors are happy to keep mysterious. You cannot make a build-versus-buy decision, talk credibly to engineers, or read a cloud bill until you can name each part and say what it does. This article gives you that vocabulary and that mental model in plain language, with the costs marked on the map. By the end you will be able to look at any streaming product and point to the box that is failing — or the box that is quietly eating your budget.

What "OTT" actually means

Start with the name. OTT stands for over-the-top, and the "top" it goes over is the traditional pay-TV system: the cable box, the satellite dish, the telecom's managed network. An OTT platform bypasses that system and delivers video across the ordinary public internet, straight to an app, with no cable subscription in between. Netflix, Disney+, YouTube, and a regional sports app are all OTT. So is the training-video portal a corporation builds for its staff.

The companion term is VOD, short for video on demand — content sitting in a library that a viewer can start whenever they choose, as opposed to a live broadcast that everyone watches at the same moment. Most OTT platforms serve both: a catalog of on-demand titles and, sometimes, live events on top. A later article in this course separates the vocabulary — OTT, VOD, live, linear, and FAST — in detail. For now, hold one sentence: OTT is the delivery method; VOD is one kind of content it delivers.

The reason OTT is worth building is the size of the prize. Worldwide OTT video revenue is projected at roughly $353 billion in 2026, with average revenue per user around $81 a year (Statista, 2026). And the audience has moved to the living room: connected TVs now capture about 58% of streaming watch time, ahead of mobile, and in US homes two platforms — Roku at 28% and Samsung at 23% — control most of the connected-TV gateway (Parks Associates, April 2026). Where your viewers watch shapes which boxes in the pipeline you cannot skip.

The pipeline in one sentence

Here is the whole platform in a single line, the order content travels:

Ingest → Encode → Package → Protect (DRM) → Deliver (CDN) → Play → Monetize → Measure.

Think of it as a factory. Raw material arrives at one end (ingest), gets cut into sellable sizes (encode), boxed for shipping (package), sealed with a tamper lock (DRM), trucked to regional depots near the customer (CDN), unwrapped and consumed (play), rung up at the till (monetize), and counted for the next planning cycle (measure). Every box below is one station on that line. We will walk through each, say what it does, name the typical technology, mark the common failure, and flag where the cost concentrates.

[Figure: horizontal eight-stage OTT pipeline from ingest through encode, package, DRM, CDN, player, and monetization to analytics, with cost-heavy stages marked.]

Figure 1. The end-to-end OTT pipeline. The three orange stages (encode, CDN delivery, and DRM) are where recurring cost concentrates.

Box 1 — Ingest: getting the video in

Ingest is simply the act of getting source video into your platform. There are two shapes, and they look almost nothing alike.

For on-demand content, ingest is a file upload. A studio or a content team hands you a high-quality master file — called a mezzanine, the pristine "negative" you keep and re-encode from, never the version a viewer sees — and you store it. The mezzanine is large (a feature film can be tens or hundreds of gigabytes) and you keep it because every future device, codec, or quality bump re-encodes from it.

For live content, ingest is a real-time feed arriving over a contribution protocol — the link that carries video to your platform, as opposed to distribution, which carries it onward to viewers. The common contribution protocols are RTMP (Real-Time Messaging Protocol, old but everywhere), SRT (Secure Reliable Transport, the modern broadcast-grade choice), and WHIP (WebRTC-HTTP Ingestion Protocol, for sub-second latency). The mechanics of these live in our Video Streaming section; here, the point is only that live video has to arrive before anything else can happen, and there is no second chance to re-pull a frame that was dropped on the way in.

Typical technology: object storage (Amazon S3 and equivalents) for mezzanines; an ingest endpoint or media server for live feeds. Common failure: treating the viewer-facing rendition as the master, then being unable to re-encode for a new device years later. Cost note: storage of mezzanines is real but rarely the budget-breaker — it is cheap per gigabyte and grows slowly.

Box 2 — Encode: making the video streamable

A raw camera feed or a mezzanine file is far too large to stream as-is. Encoding (also called transcoding when you are converting from one compressed form to another) shrinks the video using a codec, the coder-decoder algorithm that throws away visual information a human eye will not miss. This is the most computation-heavy station on the line, and one of the three that decide your margin.

But you do not encode the video once. You encode it many times, into a ladder.

The encoding ladder

An encoding ladder is a set of versions of the same video at different resolutions and bitrates — think of the seat classes on a plane: same flight, several prices and comfort levels, and the passenger takes the one they can afford. Here the "passenger" is the viewer's network connection, and the player picks the rung it can stream without stalling. A modest ladder might look like this:

| Rung | Resolution | Bitrate | Who it serves |
|---|---|---|---|
| 1 | 1920×1080 (1080p) | 5,000 kbps | Fast home Wi-Fi, big screen |
| 2 | 1280×720 (720p) | 3,000 kbps | Solid broadband |
| 3 | 854×480 (480p) | 1,200 kbps | Mobile on a good cell signal |
| 4 | 640×360 (360p) | 600 kbps | Weak or congested connection |

The viewer's player measures the network and hops up or down the ladder in real time. That technique — switching rungs to avoid the spinning wheel — is adaptive bitrate streaming (ABR), and its internals are covered in our adaptive bitrate streaming guide. For this map, hold the idea: one title becomes many files, and you pay compute to make every rung.
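The rung-hopping idea can be sketched in a few lines. This is a minimal illustration, not how any particular player implements ABR: the `Rung` type, the `pick_rung` function, and the 0.8 safety factor are all assumptions made for the example, and real players also weigh buffer level and screen size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rung:
    height: int        # vertical resolution in pixels, e.g. 720 for 720p
    bitrate_kbps: int  # target video bitrate for this rung


# The example ladder from the table above, highest rung first.
LADDER = [
    Rung(1080, 5000),
    Rung(720, 3000),
    Rung(480, 1200),
    Rung(360, 600),
]


def pick_rung(measured_kbps: float, ladder=LADDER, safety: float = 0.8) -> Rung:
    """Return the highest rung whose bitrate fits inside a safety margin.

    `safety` is an assumed headroom factor: only 80% of the measured
    throughput is treated as usable, so a brief dip does not stall playback.
    """
    budget = measured_kbps * safety
    for rung in sorted(ladder, key=lambda r: r.bitrate_kbps, reverse=True):
        if rung.bitrate_kbps <= budget:
            return rung
    # Nothing fits: fall back to the lowest rung rather than refuse to play.
    return min(ladder, key=lambda r: r.bitrate_kbps)


if __name__ == "__main__":
    for kbps in (8000, 4000, 1000, 300):
        print(kbps, "->", pick_rung(kbps))
```

A player would call something like `pick_rung` before each segment request, using its latest throughput estimate.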

Which codec

The codec you choose sets both quality-per-bit and device reach. Four matter in 2026:

  • H.264 (AVC) — the universal floor. Every device made in the last fifteen years decodes it. Least efficient of the four, but you almost always include it as a fallback rung.
  • HEVC (H.265) — roughly 40–50% more efficient than H.264, strong on TVs and Apple devices, with a tangled licensing history.
  • AV1 — a royalty-free codec from the Alliance for Open Media; the best efficiency of the three codecs in wide deployment, and growing fast on newer hardware.
  • VVC (H.266) — the newest, most efficient, least widely deployed; one to watch, not yet a workhorse.

Codec internals — how each one squeezes the bits — belong to our Video Encoding section; see how to choose a codec in 2026. Here, the product fact is the trade-off: newer codecs cut your delivery bill but reach fewer devices, so most platforms ship a mix.

Typical technology: cloud transcoders (AWS Elemental MediaConvert, Bitmovin) or open-source encoders such as FFmpeg. Common failure: a fixed ladder applied to every title, so a simple cartoon is encoded at the same high bitrate as a fast action film, wasting compute on the way in and egress on the way out. The fix, per-title encoding, gets its own article in Block 2. Cost note: encoding is cost-heavy. You pay per minute of video per rung, and a deep ladder across a large catalog adds up quickly.
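To make "one title becomes many files" concrete, here is a hedged sketch that builds one FFmpeg command per rung of the example ladder. The file names, the 128 kbps AAC audio setting, and the buffer size of twice the bitrate are assumptions for illustration; the script only prints the commands and does not run them.

```python
import shlex

# (height, video bitrate in kbps) pairs taken from the example ladder.
LADDER = [(1080, 5000), (720, 3000), (480, 1200), (360, 600)]


def ffmpeg_command(source: str, height: int, kbps: int) -> list[str]:
    """Build one H.264 encode command for a single rung."""
    return [
        "ffmpeg", "-y",
        "-i", source,
        "-vf", f"scale=-2:{height}",   # keep aspect ratio, force an even width
        "-c:v", "libx264",
        "-b:v", f"{kbps}k",
        "-maxrate", f"{kbps}k",
        "-bufsize", f"{2 * kbps}k",    # assumed rate-control buffer size
        "-c:a", "aac", "-b:a", "128k",
        f"out_{height}p.mp4",          # hypothetical output name
    ]


if __name__ == "__main__":
    for height, kbps in LADDER:
        print(shlex.join(ffmpeg_command("mezzanine.mov", height, kbps)))
```

Four rungs mean four encodes of the same title, which is why the per-minute, per-rung charge multiplies so quickly across a catalog.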

Box 3 — Package: boxing the video for shipping

Encoding gives you compressed video. Packaging wraps that video into the container format and the small downloadable chunks a streaming player understands. Streaming does not send one giant file; it cuts the video into segments — short pieces, typically 2 to 6 seconds each — plus a manifest, a text index that lists every segment and every ladder rung so the player knows what to request next.

Two packaging formats dominate, and they correspond to two manifest types:

  • HLS (HTTP Live Streaming) — Apple's format, specified in IETF RFC 8216 (August 2017), with a current second edition in draft. Its manifest is the .m3u8 playlist. HLS is mandatory for good playback on Apple devices.
  • MPEG-DASH (Dynamic Adaptive Streaming over HTTP) — the ISO standard, ISO/IEC 23009-1. Its manifest is the .mpd file. Common on Android, smart TVs, and the open web.

For years, supporting both meant packaging the video twice — two sets of segment files, double the storage. CMAF (Common Media Application Format), standardized as ISO/IEC 23000-19, ended that. CMAF defines a single fragmented-MP4 segment format that both an HLS .m3u8 and a DASH .mpd manifest can point at. You package one set of segments and serve every device. The deep mechanics live in our CMAF explainer and the HLS-versus-DASH comparison; the product takeaway is: package once with CMAF, address it from two manifests.
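The manifest described above is plain text, so a small sketch can show its shape. The function below writes a bare-bones HLS multivariant (master) playlist for the example ladder. The path layout is hypothetical, and BANDWIDTH is set to the video bitrate alone as a simplification; production playlists normally also carry a CODECS attribute and account for audio.

```python
LADDER = [
    # (width, height, video bitrate in kbps)
    (1920, 1080, 5000),
    (1280, 720, 3000),
    (854, 480, 1200),
    (640, 360, 600),
]


def master_playlist(ladder=LADDER) -> str:
    """Return an HLS multivariant playlist that lists every rung."""
    lines = ["#EXTM3U"]
    for width, height, kbps in ladder:
        # BANDWIDTH is expressed in bits per second.
        lines.append(
            f"#EXT-X-STREAM-INF:BANDWIDTH={kbps * 1000},RESOLUTION={width}x{height}"
        )
        # Hypothetical location of this rung's media playlist.
        lines.append(f"{height}p/playlist.m3u8")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(master_playlist())
```

A DASH .mpd would describe the same rungs in XML; with CMAF, both manifests can reference the same segment files.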

Typical technology: Shaka Packager, AWS MediaPackage, Bitmovin. Common failure: packaging separate HLS and DASH files when CMAF would have served both, doubling storage and cache footprint for no benefit. Cost note: modest. Packaging is light compute; its real leverage is letting one set of files serve all devices, which lowers storage and improves cache efficiency downstream.

Box 4 — Protect: locking premium content

If your catalog includes licensed movies, sports, or any content a studio expects you to guard, you need digital rights management (DRM), the technology that keeps the decrypted video from being copied off the device. Skip it and you can breach the very license that let you carry the content. DRM is one of the three margin-deciding boxes, and the one most often misunderstood.

A crucial distinction first: DRM protects the stream and the keys, not the screen. It stops a viewer from saving the decrypted file; it cannot stop a camera pointed at the display. A separate layer, HDCP (High-bandwidth Digital Content Protection), guards the cable between a device and an external monitor. Be precise about this threat model — it is the single most common DRM misconception.

Three systems, one workflow

Three DRM systems cover almost every consumer device, split by ecosystem:

  • Widevine — Google's, on Android, Chrome, and most smart TVs.
  • PlayReady — Microsoft's, on Windows, Xbox, and many TVs.
  • FairPlay — Apple's, on iPhone, iPad, Mac, and Apple TV.

The naïve fear is that three systems mean three separate encryptions. They do not — and this is where most articles get it wrong. The umbrella standard Common Encryption (CENC), ISO/IEC 23001-7, defines encryption schemes that all three DRMs can read. It specifies two schemes used in practice: cenc (AES in counter mode) and cbcs (AES in cipher-block-chaining mode with a sampling pattern). Apple's FairPlay has only ever supported cbcs. Because Widevine and PlayReady now also support cbcs, the industry has converged: encrypt the segments once with cbcs, then issue Widevine, PlayReady, and FairPlay licenses from those same files.

The slogan is "encrypt once, license many." Picture one locked box with three differently-cut keys — one for each brand of lock the devices carry. You seal the box a single time; you hand out three shapes of key. Common Encryption's deeper mechanics live in our Common Encryption (CENC) article; the multi-DRM workflow gets its own deep dive later in this course.
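The "encrypt once, license many" rule can be checked with a small lookup. The sketch below is a simplification built only from the statements above: the platform names are illustrative keys, and the scheme sets assume current devices (older Widevine and PlayReady clients may lack cbcs support).

```python
# DRM system per ecosystem, following the list above (real devices vary).
DRM_BY_PLATFORM = {
    "android": "widevine",
    "chrome": "widevine",
    "windows": "playready",
    "xbox": "playready",
    "ios": "fairplay",
    "mac": "fairplay",
    "tvos": "fairplay",
}

# Encryption schemes each DRM can decrypt, per the text: FairPlay is
# cbcs-only; Widevine and PlayReady are assumed to accept cenc and cbcs.
SCHEMES_BY_DRM = {
    "widevine": {"cenc", "cbcs"},
    "playready": {"cenc", "cbcs"},
    "fairplay": {"cbcs"},
}


def locked_out(scheme: str) -> list[str]:
    """Platforms that cannot play content encrypted with `scheme`."""
    return sorted(
        platform
        for platform, drm in DRM_BY_PLATFORM.items()
        if scheme not in SCHEMES_BY_DRM[drm]
    )


if __name__ == "__main__":
    print("cenc locks out:", locked_out("cenc"))
    print("cbcs locks out:", locked_out("cbcs"))
```

Under these assumptions, cbcs leaves no platform behind, while cenc drops every Apple target.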

Typical technology: multi-DRM services (Axinom, EZDRM, PallyCon, Verimatrix) that broker all three licenses. Common failure: encrypting with cenc only, which silently breaks FairPlay and locks every Apple viewer out. Cost note: DRM is a recurring cost — multi-DRM services charge per license issued or per active stream. Budget it from day one for any licensed catalog.

Box 5 — Deliver: getting bytes to the viewer

Now the segments exist, packaged and protected. Delivery is the job of moving them from your storage to a viewer who might be anywhere on earth, and the system that does it is the CDN (content delivery network). This is the largest recurring line on most streaming bills, and the last of the three cost-heavy boxes on the line.

A CDN is a network of edge servers spread across the world that keep copies of your popular segments close to viewers. Think of an edge cache as a corner store stocked with the items the neighborhood asks for most, so you do not drive to the central warehouse — your origin server — for every request. When a viewer in Berlin presses play, the segments come from a German edge, not from your storage in Virginia. That cuts latency and, crucially, cuts the bill, because every request the edge serves from cache is one your origin did not have to.
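The corner-store analogy maps onto a very small piece of code. The toy cache below is a sketch under stated assumptions: `EdgeCache` is a hypothetical class, it uses a least-recently-used eviction policy (real CDNs use more elaborate ones), and the origin fetch is a placeholder rather than a real HTTP request.

```python
from collections import OrderedDict


class EdgeCache:
    """Toy LRU edge cache that counts how often the origin is hit."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.store = OrderedDict()
        self.hits = 0
        self.origin_fetches = 0

    def get(self, segment_url: str) -> bytes:
        if segment_url in self.store:
            self.store.move_to_end(segment_url)  # mark as recently used
            self.hits += 1
            return self.store[segment_url]
        self.origin_fetches += 1
        data = self._fetch_from_origin(segment_url)
        self.store[segment_url] = data
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict the least recently used
        return data

    def _fetch_from_origin(self, segment_url: str) -> bytes:
        # Placeholder for a real request to origin storage.
        return f"bytes of {segment_url}".encode()

    def offload(self) -> float:
        """Fraction of requests served without touching the origin."""
        total = self.hits + self.origin_fetches
        return self.hits / total if total else 0.0


if __name__ == "__main__":
    edge = EdgeCache(capacity=100)
    # 1,000 viewers in one region request the same 10 segments.
    for _ in range(1000):
        for n in range(10):
            edge.get(f"/title/720p/seg{n}.m4s")
    print(edge.origin_fetches, round(edge.offload(), 3))
```

In the demo loop the origin is asked for each segment once and the edge serves every later request, which is the cache offload the paragraph describes.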

Why egress decides your margin

The recurring charge that matters is egress — what the CDN bills to send bytes out to viewers. Egress is tiered (the price per gigabyte drops as volume rises), commit-dependent (you negotiate lower rates by promising volume), and often billed on a percentile of peak traffic rather than a flat per-gigabyte rate. So a single quoted price is never the whole story — always cite the model and the date.
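Percentile billing is easier to reason about with an example. The sketch below assumes the commonly described convention of periodic throughput samples (often taken every five minutes) where the top 5% are discarded and the highest remaining sample sets the bill. Contracts differ, so treat the function name and the sampling details as illustrative.

```python
def billable_mbps(samples_mbps: list[float], percentile: int = 95) -> float:
    """Return the percentile-billed rate from periodic throughput samples.

    Samples are sorted, the top (100 - percentile)% are discarded, and the
    highest remaining sample is the rate the bill is based on.
    """
    if not samples_mbps:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_mbps)
    # Integer ceiling of n * percentile / 100, converted to a 0-based index.
    index = (len(ordered) * percentile + 99) // 100 - 1
    return ordered[index]


if __name__ == "__main__":
    quiet_with_one_spike = [100.0] * 19 + [900.0]
    print(billable_mbps(quiet_with_one_spike))
```

With twenty samples, a single spike falls into the discarded 5% and does not raise the bill; a second spike would.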

Walk the arithmetic once, because the number surprises people. Take a public list rate as the example: AWS CloudFront charges roughly $0.085 per gigabyte for the first 10 TB each month in the US and Europe and about $0.080 for the next 40 TB, tiering down toward $0.06 at higher volumes (AWS, Q2 2026). Treat 10 TB as 10,000 GB for simplicity, and suppose 10,000 viewers each watch one hour at the 3,000 kbps rung:

bytes per viewer-hour = 3,000 kbps ÷ 8 × 3,600 s = 1,350,000 KB ≈ 1.35 GB
total = 1.35 GB × 10,000 viewers ≈ 13,500 GB ≈ 13.5 TB
cost ≈ 10,000 GB × $0.085 + 3,500 GB × $0.080 ≈ $850 + $280 = $1,130

One ordinary hour for a modest audience is over a thousand dollars in egress alone — and it recurs every time anyone watches. That is why CDN cost engineering (multi-CDN, cache offload, the 95th-percentile bill) gets a whole block later, and why per-title encoding matters: shave the bitrate without hurting quality and you shave this bill on every play.
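The same arithmetic can be wrapped in two small functions so you can vary the bitrate or the audience. The tier table mirrors the worked example and is illustrative only: it collapses everything beyond the first 10,000 GB into the $0.080 rate and ignores commits and regional pricing.

```python
# Tier sizes and rates mirror the worked example above; they are
# illustrative list prices, not a quote.
TIERS = [
    (10_000, 0.085),        # first 10,000 GB
    (float("inf"), 0.080),  # everything beyond that, in this simplified model
]


def gb_per_viewer_hour(bitrate_kbps: float) -> float:
    """Convert kilobits per second to decimal gigabytes per hour."""
    return bitrate_kbps / 8 * 3600 / 1_000_000


def egress_cost(total_gb: float, tiers=TIERS) -> float:
    """Walk the tiers, charging each slice of traffic at its own rate."""
    remaining, cost = total_gb, 0.0
    for size_gb, rate in tiers:
        slice_gb = min(remaining, size_gb)
        cost += slice_gb * rate
        remaining -= slice_gb
        if remaining <= 0:
            break
    return round(cost, 2)


if __name__ == "__main__":
    total = gb_per_viewer_hour(3000) * 10_000
    print(round(total), "GB costs about $", egress_cost(total))
```

Swapping 3,000 kbps for a lower average bitrate shows directly how much a smarter ladder saves on every viewer-hour.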

Typical technology: Akamai, Amazon CloudFront, Cloudflare, Fastly; often two or more in a multi-CDN setup for resilience and price leverage — see our multi-CDN architecture article. Common failure: single-CDN lock-in with no failover, so one provider's regional outage takes your whole service down — and no second vendor to negotiate price against. Cost note: egress is usually the single largest recurring cost. Protect it with caching and bitrate discipline.

Box 6 — Play: the app on every screen

Everything so far is invisible to the viewer. The player is the part they touch — the app on the phone, the web page, the smart-TV channel — and it does far more than show pixels. It downloads the manifest, runs the adaptive-bitrate logic that hops up and down the ladder, requests a DRM license to decrypt protected segments, manages the buffer so playback stays smooth, and reports what happened back to your analytics.

The hard part of the player is that it lives on many screens at once, and they do not share code. A serious OTT platform supports some mix of: the web (HTML5 video using the browser's Media Source Extensions and Encrypted Media Extensions — W3C standards, with EME a 2017 Recommendation), iOS and Android apps, and the high-value living-room targets — Roku, Samsung Tizen, LG webOS, Apple tvOS, and Amazon Fire TV. Each platform has its own player framework, its own DRM quirks, and its own certification process. Because connected TVs now carry most of the watch time, the TV apps are not optional polish — they are where the audience is.

Typical technology: hls.js and Shaka Player (web), AVPlayer (Apple), ExoPlayer/Media3 (Android), plus native SDKs per TV platform. Common failure: shipping a great web and mobile player and a weak TV app — which is exactly where most viewing now happens. Cost note: client development is a build cost, not a recurring one, but it scales with the number of platforms you support; each new screen is a new codebase to maintain.

Box 7 — Monetize: turning views into revenue

A platform that no one pays for is a hobby. Monetization is the layer that turns viewing into money, and the shape you pick bends the architecture of every box before it. There are four basic models:

  • SVOD (subscription VOD) — viewers pay a recurring fee for access. Netflix is the archetype. Needs billing, entitlement (who is allowed to watch what), and usually strong DRM.
  • AVOD (advertising VOD) — free to watch, paid for by ads. Needs an ad-insertion stack and ad-decision servers.
  • TVOD (transactional VOD) — pay per title, to rent or buy. Needs a storefront and transactional billing.
  • FAST (free ad-supported streaming TV) — ad-supported linear channels, the streaming cousin of broadcast TV.

Most real platforms run a hybrid — subscription with an ad-supported tier, say. The business-model choice is consequential enough to get its own article (SVOD/AVOD/TVOD), as is the ad-insertion machinery, which uses standards like SCTE-35 to mark where ads go and VAST to serve them. The one fact to carry from this map: the monetization model is not a feature you bolt on at the end — it decides your paywall, your ad stack, your DRM tier, and your analytics. Choose it before you build, not after.
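To see how the model reaches into the pipeline, consider the decision made at every play request. The sketch below is a deliberately simple, hypothetical entitlement check: the `Model`, `Viewer`, and `playback_decision` names are invented for the example, and hybrid tiers are not modeled.

```python
from dataclasses import dataclass, field
from enum import Enum


class Model(Enum):
    SVOD = "subscription"
    AVOD = "advertising"
    TVOD = "transactional"
    FAST = "free ad-supported linear"


@dataclass
class Viewer:
    has_subscription: bool = False
    purchased_titles: set = field(default_factory=set)


def playback_decision(model: Model, viewer: Viewer, title_id: str) -> dict:
    """Return whether playback is allowed and whether ads must be inserted."""
    if model is Model.SVOD:
        return {"allowed": viewer.has_subscription, "ads": False}
    if model is Model.TVOD:
        return {"allowed": title_id in viewer.purchased_titles, "ads": False}
    # AVOD and FAST are free to watch and paid for by advertising.
    return {"allowed": True, "ads": True}


if __name__ == "__main__":
    renter = Viewer(purchased_titles={"film-42"})
    print(playback_decision(Model.TVOD, renter, "film-42"))
    print(playback_decision(Model.SVOD, renter, "film-42"))
```

Even this toy version shows why the choice comes first: each branch implies a different backing service (billing, storefront, or ad stack).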

Typical technology: subscription billing platforms, server-side ad insertion (Google DAI, AWS MediaTailor), paywall and entitlement services. Common failure: building the pipeline first and the business model second, then discovering the ad stack or the entitlement rules force a re-architecture. Cost note: mixed — billing and ad-serving fees are recurring and model-dependent; the cost lives in the stack the model requires.

Box 8 — Measure: knowing what happened

The last box closes the loop. Analytics is how you learn whether the platform is working — both the business questions (how many people watched, for how long, did they churn) and the quality questions, grouped under quality of experience (QoE): how long playback took to start, how often it stalled to rebuffer, what video quality viewers actually received.

QoE matters because it maps directly to revenue. A slow start or a mid-show rebuffer is the most common reason a viewer abandons a stream, and an abandoned stream is lost watch time and, for ad-supported models, lost ad revenue. The discipline of measuring startup time and rebuffering is covered in our video QoE metrics article; the OTT-specific analytics map gets its own block here.
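Two of the QoE numbers named above reduce to simple arithmetic on session timestamps. The field names below are assumptions, and analytics vendors define these metrics slightly differently, so read this as one plausible formulation rather than a standard.

```python
from dataclasses import dataclass


@dataclass
class Session:
    play_requested_at: float  # seconds, when the viewer pressed play
    first_frame_at: float     # seconds, when video first rendered
    watched_seconds: float    # time spent actually playing
    stalled_seconds: float    # time spent rebuffering after startup


def startup_time(s: Session) -> float:
    """Seconds between pressing play and seeing the first frame."""
    return s.first_frame_at - s.play_requested_at


def rebuffer_ratio(s: Session) -> float:
    """Stalled time as a share of playing plus stalled time."""
    total = s.watched_seconds + s.stalled_seconds
    return s.stalled_seconds / total if total else 0.0


if __name__ == "__main__":
    session = Session(10.0, 11.5, 570.0, 30.0)
    print(startup_time(session), rebuffer_ratio(session))
```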

One precision point, because it bites everyone: "a play" is a defined event, not a guess. Autoplay, bots, and the difference between "video element loaded" and "viewer actually watched 30 seconds" can inflate your numbers two- or threefold if you do not define the metric first. Decide what counts as a play before you report one.
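Writing the definition down as code is one way to force the decision. The sketch below uses the 30-second threshold from the example in the text; excluding autoplay and bot traffic outright is an assumed policy, and the event fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PlaybackEvent:
    viewer_id: str
    watched_seconds: float
    autoplay: bool = False
    is_bot: bool = False


MIN_WATCHED_SECONDS = 30  # threshold borrowed from the example in the text


def count_plays(events, min_seconds: float = MIN_WATCHED_SECONDS) -> int:
    """Count events that meet the agreed definition of a play."""
    return sum(
        1
        for e in events
        if not e.is_bot and not e.autoplay and e.watched_seconds >= min_seconds
    )


if __name__ == "__main__":
    events = [
        PlaybackEvent("a", 45),
        PlaybackEvent("b", 120, autoplay=True),
        PlaybackEvent("c", 300, is_bot=True),
        PlaybackEvent("d", 5),
    ]
    print(len(events), "raw events,", count_plays(events), "counted plays")
```

The gap between raw events and counted plays is the inflation the paragraph warns about.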

Typical technology: Mux Data, Conviva, Datazoom, plus your own data pipeline. Common failure: counting autoplays as plays, then making programming decisions on inflated engagement. Cost note: low relative to encoding and egress, but the data it produces is what tells you where to spend the other budgets.

Where the cost actually concentrates

Step back from the line and three boxes glow orange: encode, deliver, and protect. Encoding is compute you pay per minute per rung. Delivery (CDN egress) is the recurring bill that scales with every viewer-hour. DRM licensing is a per-stream or per-license fee on protected content. Storage, packaging, and analytics are real but secondary. Client development is a build cost that scales with the number of screens, not with viewers.

The practical consequence: the two biggest levers on your margin are the encoding ladder (a smarter ladder cuts both compute and egress) and CDN strategy (caching, multi-CDN, and commit negotiation cut the largest recurring line). Almost every cost-optimization story in OTT comes back to one of those two. The deeper OTT cost model walks the full arithmetic with a worked example.

[Figure: bar chart ranking OTT cost categories, with CDN egress, encoding compute, and DRM licensing as the three largest recurring costs.]

Figure 2. Where recurring cost concentrates. Egress, encoding compute, and DRM licensing dominate; storage, packaging, and analytics trail.

Build, buy, or assemble

Knowing the eight boxes lets you answer the first real decision: do you build each box, buy a finished platform, or assemble from cloud parts? An off-the-shelf OTT product (Brightcove, JW Player, Vimeo OTT, Muvi) hands you every box pre-wired, fast, with the least control and the thinnest margin. Assembling from cloud primitives (AWS MediaConvert and MediaLive for encode, MediaPackage for package, CloudFront for delivery, a multi-DRM broker for protection) gives more control for more integration work. Fully custom gives the most control and the best long-run margin for the most up-front cost.

There is no universally right answer — only the answer that fits your scale, your catalog, your monetization model, and how much of the margin you need to keep. The build-versus-buy article walks the trade-offs in money, time, and control. And yes — to answer the question every founder asks — you can build a "Netflix clone" in the sense of assembling these eight boxes; the hard part is not the architecture, which is well understood, but operating it at scale and cost, which is what the rest of this course is about. For a shorter, commercial overview, see our OTT platform guide.

Where Fora Soft fits in

The reason this pipeline reads as second nature to us is that we have built each box, repeatedly, at scale. Fora Soft has shipped video streaming, OTT and Internet-TV, WebRTC and conferencing, e-learning, telemedicine, and AR/VR software since 2005 — 625+ projects for 400+ clients. The recurring engineering problem in OTT is not making one stream play; it is making a hundred thousand concurrent streams play affordably, with a protected catalog, across every screen — the encode-deliver-protect triangle this article marked in orange. When a media company needs a platform that scales without the egress bill outrunning the subscription revenue, that intersection of streaming, encoding, content protection, and monetization is exactly where we work.

References

  1. IETF. RFC 8216 — HTTP Live Streaming (HLS). August 2017. https://www.rfc-editor.org/rfc/rfc8216 — HLS manifest (.m3u8) and media-segment format; protocol version 7. (Tier 1.)
  2. ISO/IEC. 23009-1 — Dynamic Adaptive Streaming over HTTP (MPEG-DASH). https://www.iso.org/standard/83314.html — DASH .mpd manifest and adaptive delivery. (Tier 1.)
  3. ISO/IEC. 23000-19 — Common Media Application Format (CMAF). https://www.iso.org/standard/85623.html — single fragmented-MP4 segment format serving both HLS and DASH. (Tier 1.)
  4. ISO/IEC. 23001-7 — Common Encryption in ISO base media file format files (CENC). https://www.iso.org/standard/68042.html — defines the cenc and cbcs encryption schemes used by multi-DRM. (Tier 1.)
  5. ISO/IEC. 14496-12 — ISO base media file format (ISOBMFF). https://www.iso.org/standard/83102.html — the MP4 container family CMAF segments build on. (Tier 1.)
  6. W3C. Encrypted Media Extensions. Recommendation, 18 September 2017. https://www.w3.org/TR/encrypted-media/ — the browser API that lets a web player request a DRM license. (Tier 1.)
  7. W3C. Media Source Extensions. https://www.w3.org/TR/media-source/ — the browser API for adaptive-bitrate playback on the web. (Tier 1.)
  8. Statista. OTT Video — Worldwide market forecast. 2026. https://www.statista.com/outlook/amo/media/tv-video/ott-video/worldwide — 2026 revenue ≈ $352.96B; ARPU ≈ $81.44. (Tier 5.)
  9. Parks Associates. Roku (28%) and Samsung (23%) dominate US connected-TV platforms. April 2026. https://www.parksassociates.com/blogs/press-releases/roku-and-samsung-dominate-connected-tv-platforms — connected-TV platform share; CTV ≈ 58% of streaming time. (Tier 5.)
  10. Amazon Web Services. CloudFront Pricing. Q2 2026. https://aws.amazon.com/cloudfront/pricing/ — US/EU egress ≈ $0.085/GB first 10 TB, tiering down; illustrates the tiered, commit-dependent egress model. (Tier 4.)
  11. Apple. HTTP Live Streaming — authoring specification and FairPlay Streaming. https://developer.apple.com/streaming/ — FairPlay requires the cbcs scheme; HLS is required for Apple-device playback. (Tier 3.)
