A/B Testing and Experimentation for Streaming

Why this matters

On a large catalog, every change to recommendations, the home screen, the player, the sign-up flow, or the pricing page is worth a great deal of watch time and retention — and at scale a one-percent improvement is millions of hours, so you cannot afford to guess which changes help and which quietly hurt. Controlled experimentation is the only reliable way to tell a real improvement from random noise, and running it well at scale is itself an engineering discipline with a platform behind it. This article is for the founder, product manager, or streaming CTO who has to decide what to test, how long to run it, which metric to trust, and how much experimentation machinery is worth building. You will not derive the statistics by hand, but you do have to set the rules — what counts as a win, when a test is allowed to stop, and which numbers can never be allowed to get worse — and understand why a change that "obviously" improves the product so often does not survive a fair test.

What an A/B test actually is

Start with the plainest version. An A/B test — also called a controlled experiment or split test — takes the people using your product, divides them into groups at random, gives each group a different version of one thing, and then compares a chosen outcome between the groups. One group, the control (the "A"), sees the current product. The other group, the treatment (the "B"), sees the change. Because the only systematic difference between the groups is the change itself, any reliable difference in the outcome can be credited to that change. That is the whole idea, and its power is that it replaces opinion with evidence.

The reason this matters so much is a problem the experimentation literature named bluntly: the HiPPO, the "Highest Paid Person's Opinion" (Kohavi et al., 2009). Without experiments, product decisions default to whoever is most senior or most confident in the room, and intuition about what viewers want is wrong far more often than teams expect. The discipline of A/B testing exists to let data, not hierarchy, decide — and the consistent finding across companies that test heavily is that a large share of changes everyone was sure would help turn out to do nothing or to hurt.

Randomization is the quiet hero here. A useful analogy: if you want to know whether a new fertilizer works, you do not put it on the sunny field and leave the old one on the shady field, because then you cannot tell the fertilizer from the sunlight. You scatter both treatments randomly across the same field so that sun, soil, and water even out between them. Random assignment of viewers does the same job — it balances out age, device, country, taste, and the heavy-viewer-versus-light-viewer mix, so the groups are comparable and the change is the only thing left to explain a difference.

The metric that matters: optimize watch time, not clicks

The most important decision in an experiment is made before it runs: what outcome are you measuring? The experimentation literature calls this the Overall Evaluation Criterion, or OEC — the single agreed measure of whether the change met its objective (Kohavi et al., 2009). Choosing it well is harder than it sounds and matters more than any statistical detail, because a test optimizes exactly what you point it at, including the wrong thing.

For a streaming product, the trap is the click. Clicks, taps, and "plays started" are easy to measure and feel like success, so teams reach for them — and a change tuned to maximize clicks will cheerfully win on clicks while losing on everything that pays the bills. A more sensational thumbnail earns the tap and then disappoints. A row of clickbait titles lifts plays-started and lowers the watch time those plays produce. The fix is to anchor the OEC on the outcome you actually care about: watch time and longer-term retention. Netflix is explicit that its core A/B evaluation metrics are "month-to-month subscription retention and member streaming hours" — the durable signals of member satisfaction — not the easy proxies (Netflix Technology Blog, 2017). This is the same watch-time-over-clicks discipline that governs how recommendations and merchandising are tuned: a click that leads to a thirty-second bounce is a failure wearing the costume of a win.

Retention is the truest OEC and also the most painful, because it is slow — you cannot know a month-to-month retention effect in an afternoon. That tension drives much of the craft that follows: teams use faster, more sensitive proxy metrics (engagement, streaming hours over a few weeks) to move quickly, while validating the changes that matter against the slow retention metric that decides whether subscribers actually stay.

Three families of metric belong in every streaming experiment, and keeping them distinct prevents most disasters.

Metric role	What it is	Streaming examples	What you do with it
Primary / OEC	The one outcome that defines success	Member streaming hours; month-to-month retention	Optimize it — the change wins or loses on this
Secondary	Supporting signals that explain why	Plays started, completion rate, titles browsed, search use	Diagnose and understand the result
Guardrail	Numbers that must not get worse	Rebuffering ratio, video start time, crash rate, app errors, unsubscribes	Block the launch if they regress, even on an OEC win

Table 1. The three roles a metric plays in a streaming experiment. The primary metric (the OEC) is what you optimize; secondary metrics explain the movement; guardrail metrics are the trip-wires. A recommendation change that lifts streaming hours but pushes up the rebuffering ratio has failed a guardrail and should not ship — engagement bought with a worse-quality stream is not a real win (Kohavi, Tang & Xu, 2020).

Figure 1. The metric stack for a streaming experiment. Point the experiment at the OEC you truly want (streaming hours and retention), use secondary metrics to understand the movement, and set quality-of-experience guardrails (startup time, rebuffering, crashes) that can veto a launch. Optimizing clicks alone is the classic way to win the test and lose the product.

The guardrail row deserves emphasis because it is where streaming differs from a generic web A/B test. A change to the player or the encoding ladder can lift one metric while silently degrading playback quality. So the quality-of-experience numbers — how long video takes to start, how often it stops to rebuffer, how often the app crashes — sit alongside the OEC as guardrails that can veto a launch no matter how good the headline number looks. The precise definitions of those quality metrics — startup time, rebuffering ratio, and the rest — are covered in the streaming-side reference on video quality-of-experience metrics; here the point is simply that they must be in every experiment as guardrails.
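The veto logic in Table 1 can be written down as a small decision rule. The sketch below is illustrative only: the `MetricResult` fields, the metric names, and the `launch_decision` function are hypothetical, and it assumes some upstream analysis has already decided whether each metric moved significantly.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class MetricResult:
    """Outcome of one metric in an experiment (field names are illustrative)."""
    name: str
    relative_change: float         # treatment vs control, e.g. 0.012 means +1.2%
    significant: bool              # did the change pass the agreed stopping rule?
    higher_is_better: bool = True  # False for rebuffering, crash rate, start time


def improved(m: MetricResult) -> bool:
    """Significant movement in the good direction."""
    if not m.significant:
        return False
    return m.relative_change > 0 if m.higher_is_better else m.relative_change < 0


def regressed(m: MetricResult) -> bool:
    """Significant movement in the bad direction."""
    if not m.significant:
        return False
    return m.relative_change < 0 if m.higher_is_better else m.relative_change > 0


def launch_decision(oec: MetricResult, guardrails: Sequence[MetricResult]) -> str:
    """Guardrails are checked first, so an OEC win cannot override a regression."""
    blocked = [g.name for g in guardrails if regressed(g)]
    if blocked:
        return "block: guardrail regression in " + ", ".join(blocked)
    if improved(oec):
        return "ship"
    return "no ship: OEC did not improve"
```

With this rule, a recommendation change that lifts streaming hours but significantly raises the rebuffering ratio is blocked, which is the behavior the guardrail row describes.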

How big a test, and for how long: the arithmetic of power

The two questions every team asks — "how many viewers do we need?" and "how long do we run it?" — have a real answer, and walking through it once removes most of the mystery. The driver is statistical power: the ability of a test to detect a true difference of a given size without being fooled by random noise. Three things set the sample size you need: how noisy the metric is (its variance), how small an effect you want to be able to catch (the minimum detectable effect, or MDE), and how confident you want to be.

A widely used rule of thumb from the experimentation literature captures it (Kohavi et al., 2009). For a test with the standard settings — 95% confidence and 80% power — the number of viewers you need per group is roughly:

n ≈ 16 × variance / (effect size)²

Make it concrete. Suppose your OEC proxy is "the fraction of new accounts that start a qualifying play in their first week," and the current rate is 60%. For a yes/no rate like this, the variance is p × (1 − p) = 0.60 × 0.40 = 0.24. Say you want to detect a 1 percentage point absolute improvement — moving 60% to 61% — so the effect size is 0.01 and its square is 0.0001. Then:

n ≈ 16 × 0.24 / 0.0001 = 38,400 viewers per group

So about 76,800 viewers in total to reliably catch a one-point move. Two lessons fall out of this arithmetic, and both shape real planning. First, smaller effects cost dramatically more: because the effect size is squared, detecting half the effect (0.5 points instead of 1) needs four times the sample, about 153,600 per group. Second, the sample size translates directly into duration: if 76,800 of the right kind of viewer flow through the tested surface in a week, the test takes a week; if only a tenth as many do, it takes ten weeks, which is more than two months. This is why a small platform cannot run the firehose of experiments a large one can, and why the techniques in the sections that follow (sequential testing, interleaving, and variance reduction), which buy speed and sensitivity, are not luxuries but the difference between learning quickly and barely learning at all.
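The arithmetic above is easy to turn into a planning helper. This is a minimal sketch of the rule of thumb only, assuming the standard 95% confidence and 80% power settings and two equal-sized groups; the function names and the weekly traffic figures are made up for illustration.

```python
import math


def viewers_per_group(variance: float, min_detectable_effect: float) -> int:
    """Rule-of-thumb sample size per group: n = 16 * variance / effect^2."""
    raw = 16 * variance / min_detectable_effect ** 2
    # Round first so floating-point dust does not push the ceiling up by one.
    return math.ceil(round(raw, 6))


def weeks_to_run(per_group: int, eligible_viewers_per_week: int) -> int:
    """Whole weeks needed to fill both groups at the given weekly traffic."""
    total_needed = 2 * per_group
    return -(-total_needed // eligible_viewers_per_week)  # integer ceiling


baseline_rate = 0.60
variance = baseline_rate * (1 - baseline_rate)   # 0.24 for a yes/no metric

n_one_point = viewers_per_group(variance, 0.01)   # detect 60% -> 61%
n_half_point = viewers_per_group(variance, 0.005) # detect 60% -> 60.5%
```

Here `n_one_point` is 38,400 and `n_half_point` is 153,600, matching the figures in the text, and `weeks_to_run(n_one_point, 7_680)` gives ten weeks for the low-traffic case.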

The cardinal sin: peeking, and how sequential testing fixes it

Here is the most common and most expensive mistake in all of experimentation, and it is a discipline problem, not a math problem. A team launches a test planned for two weeks, watches the dashboard daily, sees the result cross the "statistically significant" line on day four, and ships the winner. This is called peeking, and it quietly destroys the reliability of the result (Johari et al., 2017).

Why it breaks is worth understanding in plain terms. A standard A/B test's promise — "only a 5% chance of a false alarm" — holds only if you look once, at the predetermined end. Each extra time you peek and allow yourself to stop, you give randomness another chance to wander across the significance line by luck. Peek every day for two weeks and your real false-positive rate is not 5% but something far higher; you will "find" wins that are pure noise and ship changes that do nothing. The naïve fix — "just don't look" — is a non-starter for a streaming business that must catch a harmful rollout fast.
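A small simulation makes the inflation visible. The sketch below runs many A/A tests, where both groups are drawn from the same distribution so every "win" is a false alarm, and compares looking once at the end with stopping the first time any daily look crosses the usual 1.96 threshold. The parameters (14 daily looks, 100 viewers per group per day, unit variance, the seed) are assumptions chosen for illustration, not figures from the cited paper.

```python
import math
import random


def false_positive_rates(days=14, viewers_per_day=100, simulations=2000, seed=7):
    """Return (fixed_horizon_rate, peeking_rate) for simulated A/A tests."""
    rng = random.Random(seed)
    z_crit = 1.96                          # two-sided 5% threshold
    daily_sd = math.sqrt(viewers_per_day)  # sd of a daily sum of unit-variance values
    fixed_hits = 0
    peeking_hits = 0
    for _ in range(simulations):
        sum_a = sum_b = 0.0
        n = 0
        crossed_on_any_day = False
        z = 0.0
        for _day in range(days):
            # Both groups come from the same distribution: there is no real effect.
            sum_a += rng.gauss(0.0, daily_sd)
            sum_b += rng.gauss(0.0, daily_sd)
            n += viewers_per_day
            z = (sum_b / n - sum_a / n) / math.sqrt(2.0 / n)
            if abs(z) > z_crit:
                crossed_on_any_day = True
        if abs(z) > z_crit:        # looked only once, at the planned end
            fixed_hits += 1
        if crossed_on_any_day:     # would have stopped at the first "significant" look
            peeking_hits += 1
    return fixed_hits / simulations, peeking_hits / simulations


fixed_rate, peeking_rate = false_positive_rates()
```

Under these assumptions `fixed_rate` should land near the promised 5%, while `peeking_rate` comes out several times higher, even though nothing about the product changed.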

The real fix is a different statistical method built for continuous monitoring: sequential testing. Where a classic test is valid only at a single fixed endpoint, a sequential test produces always-valid results — p-values and confidence intervals that remain trustworthy no matter how often you look — so you are allowed to monitor continuously and stop as soon as the evidence is conclusive, in either direction (Johari et al., 2017). The cost is a modest one: to stay honest under constant looking, the method demands slightly stronger evidence before declaring a win. The payoff is large for streaming: you can end a clearly winning test early and roll it out, and — more importantly — you can catch a change that is harming playback and kill it within hours instead of weeks. Netflix built exactly this kind of sequential methodology specifically to monitor safe rollouts and protect the streaming experience in real time, stopping regressions early while controlling false alarms (Netflix Technology Blog, 2023).

Figure 2. Why peeking breaks a fixed test, and how sequential testing fixes it. A classic test's 5% false-alarm promise holds only at the single planned endpoint; stopping early the first time it looks significant inflates false positives. A sequential test uses always-valid boundaries that permit continuous monitoring, so you can stop a clear winner, or a harmful change, as soon as the evidence is conclusive.

If you take one operational rule from this article, make it this: decide the stopping rule before the test starts. Either fix the sample size and duration in advance and look only at the end, or adopt a sequential method designed for continuous monitoring. What you must never do is run a fixed-horizon test and stop it early because it looked good — that is peeking, and it manufactures wins that are not real.
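To show what "always-valid" means in practice, here is a minimal sketch of an always-valid p-value built on the mixture sequential probability ratio test described by Johari et al. (2017). It is deliberately simplified: it assumes roughly normal data with a known per-viewer variance `sigma2`, equal group sizes, and a normal mixing prior with variance `tau2` over the true difference. Both parameter names and the example numbers are assumptions, and this is not a description of any particular vendor's or Netflix's implementation.

```python
import math
from typing import List, Sequence


def msprt_log_likelihood_ratio(mean_diff: float, n_per_group: int,
                               sigma2: float, tau2: float) -> float:
    """Log mixture likelihood ratio against the 'no difference' hypothesis.

    mean_diff   : treatment mean minus control mean so far
    n_per_group : viewers observed in each group so far
    sigma2      : assumed known variance of the metric per viewer
    tau2        : variance of the normal prior over plausible true differences
    """
    v = 2.0 * sigma2 / n_per_group  # variance of the observed difference in means
    return 0.5 * math.log(v / (v + tau2)) + tau2 * mean_diff ** 2 / (2.0 * v * (v + tau2))


def always_valid_p_values(mean_diffs: Sequence[float], ns_per_group: Sequence[int],
                          sigma2: float, tau2: float) -> List[float]:
    """p-value after each look; it can only go down, so looking often is safe."""
    p = 1.0
    history = []
    for diff, n in zip(mean_diffs, ns_per_group):
        log_lr = msprt_log_likelihood_ratio(diff, n, sigma2, tau2)
        p = min(p, math.exp(-log_lr))  # running minimum of 1 / likelihood ratio
        history.append(p)
    return history


# Three looks at a growing experiment (illustrative numbers only).
p_by_look = always_valid_p_values([0.02, 0.05, 0.08], [2000, 5000, 10000],
                                  sigma2=1.0, tau2=0.0001)
```

The stopping rule is then simply "stop when the latest value drops below 0.05", and that rule stays honest at every look. The price is visible in the formula: the leading square-root term penalizes the evidence, which is the slightly higher bar the text mentions.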

Interleaving: comparing rankings with a fraction of the traffic

Recommendations pose a special problem for A/B testing. Improvements to a ranking algorithm are often small but valuable, and as the system gets better, detecting the next improvement against the noise of streaming hours needs ever-larger samples and ever-longer tests (Netflix Technology Blog, 2017). Running every candidate ranker through a full A/B test would let only a handful of ideas be tried a year. Streaming platforms solve this with a technique borrowed from search-engine research: interleaving.

The idea is a direct analogy to a taste test. To learn whether a population prefers Coke or Pepsi, the slow way is to split people in two, give one group only Coke and the other only Pepsi, and compare how much each group drinks — but consumption varies so wildly from person to person that the signal is buried in noise (Netflix Technology Blog, 2017). The sharper way is to offer each person both, unlabeled, and see which they reach for. Removing the person-to-person variation makes the preference jump out with far fewer tasters. Interleaving does the streaming version: instead of showing ranker A to one group and ranker B to another, it blends both rankings into a single row shown to one viewer, then watches which ranker's recommendations earn the viewing.

The blending has to be fair, and the method most platforms use is team-draft interleaving, which works exactly like two captains picking a pickup team (Chapelle et al., 2012). A coin toss decides who picks first; then the two rankers alternate, each contributing its highest-ranked title not already taken, until the row is full. Because the algorithms take turns, each one is equally likely to occupy the attention-grabbing left-most slots — which neutralizes position bias, the strong tendency of viewers to play whatever sits on the left regardless of quality (Netflix Technology Blog, 2017). The platform then attributes each play to the ranker that contributed that title and tallies which ranker won the viewer's hours.

Figure 3. Team-draft interleaving for recommendations. Two candidate rankers alternate picks (like two captains choosing a team) into one row shown to a single viewer, so each ranker is equally likely to hold the high-attention left slots. Plays are attributed to the ranker that contributed the title, and the ranker that wins more viewing hours is preferred. The result is a preference signal that needs a fraction of the audience an A/B test would require.
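The drafting procedure is short enough to sketch in code. The version below follows the description in this article (one coin toss for the first pick, then strict alternation) and is an illustration rather than any platform's production implementation; the function names, the title IDs, and the hours data are hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple


def team_draft_interleave(ranking_a: Sequence[str], ranking_b: Sequence[str],
                          row_size: int, rng: random.Random) -> List[Tuple[str, str]]:
    """Blend two rankings into one row of (title, contributing_ranker) pairs."""
    rankings = {"A": ranking_a, "B": ranking_b}
    row: List[Tuple[str, str]] = []
    taken = set()
    exhausted = set()
    turn = "A" if rng.random() < 0.5 else "B"  # coin toss for the first pick
    while len(row) < row_size and len(exhausted) < 2:
        # The ranker whose turn it is contributes its best title not yet in the row.
        pick = next((t for t in rankings[turn] if t not in taken), None)
        if pick is None:
            exhausted.add(turn)                # this ranker has nothing left to offer
        else:
            row.append((pick, turn))
            taken.add(pick)
        turn = "B" if turn == "A" else "A"     # alternate picks
    return row


def credit_hours(row: Sequence[Tuple[str, str]],
                 hours_by_title: Dict[str, float]) -> Dict[str, float]:
    """Attribute viewing hours to the ranker that contributed each played title."""
    totals = {"A": 0.0, "B": 0.0}
    for title, ranker in row:
        totals[ranker] += hours_by_title.get(title, 0.0)
    return totals


row = team_draft_interleave(["t1", "t2", "t3", "t4"], ["t3", "t1", "t5", "t6"],
                            row_size=4, rng=random.Random(0))
totals = credit_hours(row, {"t1": 1.0, "t5": 2.5})
```

Whichever ranker wins the toss, each contributes two titles here, and the viewer's 2.5 hours on `t5` are credited to ranker B. Summed over many viewers, the ranker with more credited hours is the preferred one.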

The payoff is dramatic. Netflix reports that interleaving identifies the better ranker reliably while needing over 100× fewer members than its most sensitive A/B metric to reach the same confidence, and that the interleaving preference is strongly predictive of how a ranker later performs in a full A/B test (Netflix Technology Blog, 2017). That is why interleaving is used as the first stage of a two-stage process: a fast pruning round that finishes in days narrows a large field of ideas to the few most promising, and only those go on to a full A/B test that measures the slow, decisive outcomes like retention.

Figure 4. The two-stage experimentation pipeline for recommendations. Stage one uses interleaving as a fast, highly sensitive filter that ranks a large set of candidate algorithms by viewer preference in a matter of days. Stage two takes only the strongest survivors into a traditional A/B test, where larger samples measure the longer-term outcomes (streaming hours and retention) that interleaving cannot directly capture.

Interleaving has one important limit, and naming it keeps you honest: it measures only relative preference between two rankings, not the absolute change in a business metric like retention (Netflix Technology Blog, 2017). It tells you viewers prefer ranker B's recommendations; it cannot tell you that ranker B keeps subscribers longer. That is precisely why the slower A/B second stage still exists. Interleaving makes you fast; the A/B test makes you sure.

When you cannot randomize: quasi-experiments

Some streaming changes cannot be split across individual viewers. If you switch a content delivery network in one region, every viewer there is affected at once — there is no clean random control group next door. For these cases, teams use quasi-experiments: comparisons that approximate a controlled test without true per-viewer randomization, for example by comparing a changed region against a similar unchanged region over the same period, or against its own behavior before and after, with statistical adjustments for the differences. Quasi-experiments are weaker than a randomized A/B test because something other than the change could explain the difference, so they are the fallback when randomization is genuinely impossible — infrastructure swaps, pricing changes, region-wide launches — not the default. The honest move is to randomize at the individual viewer whenever you can, and treat the quasi-experiment as the best available evidence when you cannot.
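One common way to combine the "similar region" and "before and after" comparisons is a difference-in-differences estimate. The article does not name this technique, so treat it as one illustrative choice among several; the function name and the daily rebuffering figures below are invented for the example.

```python
from statistics import mean
from typing import Sequence


def difference_in_differences(treated_before: Sequence[float],
                              treated_after: Sequence[float],
                              control_before: Sequence[float],
                              control_after: Sequence[float]) -> float:
    """Change in the treated region minus the change in the comparison region.

    Subtracting the comparison region's change removes whatever shifted
    everywhere at once (seasonality, a big release), under the assumption
    that both regions would otherwise have moved in parallel.
    """
    treated_change = mean(treated_after) - mean(treated_before)
    control_change = mean(control_after) - mean(control_before)
    return treated_change - control_change


# Daily rebuffering ratio (%) around a hypothetical CDN switch in one region.
effect = difference_in_differences(
    treated_before=[2.0, 2.2], treated_after=[1.5, 1.7],
    control_before=[2.0, 2.0], control_after=[1.9, 1.9],
)
```

Here the treated region improved by 0.5 points and the comparison region by 0.1, so the estimated effect of the switch is about -0.4 points. The parallel-trends assumption is exactly the confounding risk the paragraph warns about, which is why this remains a fallback.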

Method	What it measures	Sensitivity / speed	Best for	Key limitation
Classic A/B test	Absolute change in any metric, incl. retention	Baseline; needs large samples, runs weeks	The decisive test before a launch; UI, pricing, sign-up, player	Slow for small effects; invalid if you peek
Interleaving	Relative preference between two rankings	Very high — >100× fewer viewers (Netflix, 2017)	Pruning many recommendation rankers fast	Rankings only; relative, not retention
Sequential test	Same as A/B, valid under continuous monitoring	Lets you stop early; safe to watch live	Safe rollouts; killing harmful changes fast	Slightly stronger evidence bar to declare a win
Quasi-experiment	Region- or time-based before/after comparison	Variable; weaker causal claim	CDN swaps, pricing, region-wide changes	No true random control; confounding risk

Table 2. The streaming experimenter's toolkit, and when each method is the right one. A classic A/B test is the decisive instrument; interleaving is the fast filter for rankings; sequential testing makes continuous monitoring honest and rollouts safe; quasi-experiments are the fallback when individual randomization is impossible. Mature platforms use all four, choosing by what is being changed and how fast an answer is needed.

Variance reduction: getting more from the viewers you have

There is one more lever worth knowing because it directly attacks the sample-size arithmetic from earlier. Recall that the viewers you need scale with the variance of the metric — the noisier the metric, the more viewers you need. Variance reduction techniques shrink that noise so the same number of viewers yields a sharper answer, or the same answer arrives sooner.

The best-known method is CUPED — "Controlled-experiment Using Pre-Experiment Data." The intuition is that a viewer's behavior before the experiment predicts a lot of their behavior during it: a heavy viewer last month is a heavy viewer this month, regardless of which test group they land in. By subtracting out that predictable, pre-existing part of each viewer's behavior, CUPED removes noise that has nothing to do with the change being tested (Deng et al., 2013). In the original work on Bing's experimentation system, CUPED roughly halved the variance of key metrics — equivalent to reaching the same conclusion with about half the viewers, or in half the time (Deng et al., 2013). For a streaming platform with finite traffic, that is the difference between running twice as many experiments and leaving good ideas untested. Variance reduction does not change what you measure; it makes the measurement more efficient, which on a traffic-constrained product is its own kind of competitive advantage.
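The adjustment itself is only a few lines. The sketch below applies the CUPED idea to a single list of viewers, using each viewer's pre-experiment hours as the covariate; in a real analysis the coefficient would typically be estimated on the pooled data from both groups and then applied to each. The function name and the sample numbers are illustrative assumptions.

```python
from statistics import mean, pvariance
from typing import List, Sequence


def cuped_adjust(during: Sequence[float], before: Sequence[float]) -> List[float]:
    """Remove the part of each viewer's metric predicted by pre-experiment data.

    adjusted = during - theta * (before - mean(before)),
    with theta = cov(before, during) / var(before).
    """
    mean_before = mean(before)
    mean_during = mean(during)
    n = len(before)
    cov = sum((x - mean_before) * (y - mean_during)
              for x, y in zip(before, during)) / n
    var_before = sum((x - mean_before) ** 2 for x in before) / n
    if var_before == 0:
        return list(during)  # the covariate carries no information
    theta = cov / var_before
    return [y - theta * (x - mean_before) for x, y in zip(before, during)]


hours_before = [1.0, 5.0, 10.0, 20.0, 40.0]  # last month's viewing (hypothetical)
hours_during = [2.0, 6.0, 9.0, 22.0, 38.0]   # viewing during the experiment
adjusted = cuped_adjust(hours_during, hours_before)
```

The adjusted values keep the same mean as the raw ones, so the measured effect is unchanged, but their variance is much smaller because heavy and light viewers no longer dominate the spread. Plugging the smaller variance into the sample-size rule is where the savings come from.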

The experimentation platform a streaming product needs

Running one experiment is a project; running hundreds is a platform. As experiments multiply — large streaming services run thousands a year, most of them short-lived — the bottleneck stops being statistics and becomes infrastructure (Netflix Technology Blog, 2016). A streaming experimentation platform has to do a handful of jobs well. It must assign viewers to experiments consistently, so a viewer sees the same variant every session and overlapping tests do not contaminate each other. It must track allocation — how many of the right viewers each test has gathered, and therefore how long until it can conclude (Netflix Technology Blog, 2016). It must compute metrics — the OEC, the secondary metrics, and the guardrails — from the same event data that feeds the rest of the product, which is exactly the event-collection and data pipeline that personalization depends on. And it must surface results in a way that protects non-statisticians from the traps, especially peeking, ideally by building the stopping rule into the tool rather than trusting discipline.
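Consistent assignment is usually achieved without storing anything, by hashing the viewer and experiment identifiers. The following is a minimal sketch of that idea, not a description of any specific platform; the identifier formats and the `assign_variant` name are assumptions.

```python
import hashlib
from typing import Sequence


def assign_variant(viewer_id: str, experiment_id: str,
                   variants: Sequence[str] = ("control", "treatment")) -> str:
    """Deterministically map a viewer to a variant for one experiment.

    The same viewer always gets the same variant in a given experiment, and
    including the experiment ID in the hash makes assignments in different
    experiments effectively independent of each other.
    """
    key = f"{experiment_id}:{viewer_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]


variant = assign_variant("viewer-42", "home-row-test")
```

Because the result depends only on the two IDs, web, mobile, and TV clients can all compute the same answer for the same viewer, which is what keeps a viewer in one variant across sessions and devices.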

One organizational point matters as much as any feature. The platform's job is to make the right analysis the easy one: to default to the trustworthy method, flag a sample-ratio mismatch (when the groups did not split as planned, a classic sign the experiment is broken), enforce guardrails automatically, and present always-valid results so a product manager cannot accidentally peek their way into a false win. A platform that makes good experimentation effortless is what lets a streaming company test thousands of ideas a year and trust the answers, which is the real source of a durable advantage in recommendations and discovery.
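A sample-ratio mismatch check is one of the simplest of these automatic safeguards. The sketch below uses a chi-square goodness-of-fit test with one degree of freedom on the two group counts; the function name and the strict 0.001 alert threshold are assumptions for illustration, and a platform would pick its own.

```python
import math


def srm_p_value(control_count: int, treatment_count: int,
                expected_treatment_share: float = 0.5) -> float:
    """p-value that the observed split is consistent with the planned split."""
    total = control_count + treatment_count
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = ((control_count - expected_control) ** 2 / expected_control
            + (treatment_count - expected_treatment) ** 2 / expected_treatment)
    # Survival function of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


def has_sample_ratio_mismatch(control_count: int, treatment_count: int,
                              alpha: float = 0.001) -> bool:
    """Flag the experiment as broken if the split is very unlikely by chance."""
    return srm_p_value(control_count, treatment_count) < alpha
```

A planned 50/50 test that collected 50,000 control and 51,500 treatment viewers would be flagged, even though the gap looks small to the eye. When that happens the results should not be read at all until the cause of the imbalance is found.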

A common mistake: shipping the novelty effect

Beyond peeking, the pitfall that catches experienced teams is the novelty effect. When you change something visible (a new home-screen layout, a redesigned player control), existing viewers react to it because it is new: they click the unfamiliar button, explore the rearranged rows, and engagement spikes. Run a short test and you will measure that spike and ship the change, only to watch the lift evaporate weeks later once the novelty wears off and behavior returns to normal. The reverse, known as the primacy effect, also happens: a genuinely better change can underperform at first because regular viewers are briefly disoriented. The defense is to run visible-change experiments long enough for both effects to fade, and to look at how the effect trends over time rather than at a single early snapshot: a change whose lift is shrinking day over day is probably a novelty effect, while a change whose lift holds is more likely real (Kohavi, Tang & Xu, 2020). This is one more reason the "stop early because it looks good" instinct is so dangerous: early is exactly when novelty is loudest.
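Looking at the trend can be as simple as fitting a straight line through the daily lift. The sketch below computes the least-squares slope of lift against day number; the function name and the sample lifts are illustrative, and because daily lifts are noisy a slope like this is a prompt for a closer look rather than a statistical test in its own right.

```python
from typing import Sequence


def lift_trend_per_day(daily_lifts: Sequence[float]) -> float:
    """Least-squares slope of the daily lift (treatment minus control) over time.

    A clearly negative slope suggests a fading novelty effect; a slope near
    zero suggests the lift is holding.
    """
    n = len(daily_lifts)
    if n < 2:
        raise ValueError("need at least two days of data to estimate a trend")
    mean_day = (n - 1) / 2.0
    mean_lift = sum(daily_lifts) / n
    numerator = sum((day - mean_day) * (lift - mean_lift)
                    for day, lift in enumerate(daily_lifts))
    denominator = sum((day - mean_day) ** 2 for day in range(n))
    return numerator / denominator


fading = lift_trend_per_day([0.06, 0.05, 0.04, 0.03, 0.02])  # lift shrinks daily
stable = lift_trend_per_day([0.03, 0.03, 0.03, 0.03, 0.03])  # lift holds
```

In the first series the lift falls by about one point per day, the signature of novelty wearing off. In the second the slope is zero, which is what a durable improvement looks like.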

Where Fora Soft fits in

Experimentation is a scale discipline before it is a statistics one: the value of a one-percent improvement is enormous at volume, but only a platform that runs many trustworthy tests can find those improvements without being fooled by noise or novelty. Across 625+ shipped projects for 400+ clients since 2005 in video streaming, OTT/Internet TV, e-learning, and video surveillance, the pattern we build is the full experimentation stack: consistent viewer assignment across web, mobile, and TV; an OEC anchored on watch time and retention rather than clicks; quality-of-experience guardrails wired in so an engagement win never ships on the back of a worse-quality stream; sequential monitoring so a harmful rollout is caught in hours; and interleaving for fast recommendation-ranking comparisons where it fits. Our approach is scalability-first and vendor-neutral: we start from your traffic volume and the size of effect you need to detect, decide where a hosted feature-flag and experimentation service is enough and where a custom platform earns its cost, and connect experimentation to the same recommendation, metadata, and analytics systems so that what you learn from a test flows straight back into the product.

Call to action

Talk to a streaming engineer: book a 30-minute scoping call to talk through your A/B testing plan for streaming.
See our case studies — 250+ shipped projects across video streaming, WebRTC, OTT, telemedicine, e-learning, surveillance, and AR/VR.
Download the Streaming Experimentation Readiness Checklist — One Page — Choosing the OEC (watch time and retention, not clicks), setting quality-of-experience guardrails, sizing a test for power, avoiding the peeking trap with sequential testing, and picking between A/B, interleaving, sequential, and….

References

Controlled experiments on the web: survey and practical guide. Kohavi, R., Longbotham, R., Sommerfield, D. & Henne, R. M. Data Mining and Knowledge Discovery, 18(1): 140–181, 2009. DOI 10.1007/s10618-008-0114-1. Tier 1 (peer-reviewed primary literature). Source of: the definition and randomization basis of controlled experiments (A/B tests); the Overall Evaluation Criterion (OEC); the HiPPO ("Highest Paid Person's Opinion") problem; the sample-size rule of thumb n ≈ 16σ²/Δ² for 95% confidence and 80% power; the warning that click-based proxies can mislead. https://ai.stanford.edu/~ronnyk/2009controlledExperimentsOnTheWebSurvey.pdf — accessed 2026-06-18.
Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data (CUPED). Deng, A., Xu, Y., Kohavi, R. & Walker, T. Proceedings of the 6th ACM International Conference on Web Search and Data Mining (WSDM '13): 123–132, 2013. DOI 10.1145/2433396.2433413. Tier 1 (peer-reviewed primary literature). Source of: variance reduction using pre-experiment data; the result that CUPED roughly halved variance on Bing's experimentation system — equivalent to the same conclusion with about half the users or half the duration. https://dl.acm.org/doi/10.1145/2433396.2433413 — accessed 2026-06-18.
Peeking at A/B Tests: Why It Matters, and What to Do About It. Johari, R., Koomen, P., Pekelis, L. & Walsh, D. Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '17): 1517–1525, 2017. DOI 10.1145/3097983.3097992. Tier 1 (peer-reviewed primary literature). Source of: the peeking problem (continuous monitoring of a fixed-horizon test inflates the false-positive rate); always-valid p-values and confidence intervals; the mixture sequential probability ratio test (mSPRT) underpinning sequential A/B testing. https://www.kdd.org/kdd2017/papers/view/peeking-at-ab-tests-why-it-matters-and-what-to-do-about-it — accessed 2026-06-18.
Large-scale validation and analysis of interleaved search evaluation. Chapelle, O., Joachims, T., Radlinski, F. & Yue, Y. ACM Transactions on Information Systems, 30(1): Article 6, 2012. DOI 10.1145/2094072.2094078. Tier 1 (peer-reviewed primary literature). Source of: interleaving as a sensitive online evaluation method; team-draft interleaving and its handling of position bias; the empirical finding that interleaving needs far less data than absolute A/B metrics to detect a ranking preference. https://www.cs.cornell.edu/~tj/publications/chapelle_etal_12a.pdf — accessed 2026-06-18.
Innovating Faster on Personalization Algorithms at Netflix Using Interleaving. Parks, J., Aurisset, J. & Ramm, M. Netflix Technology Blog, 2017. Tier 3 (first-party engineering, "what ships"). Source of: the two-stage experimentation process (interleaving to prune, A/B to confirm); core OEC metrics being month-to-month retention and member streaming hours; the Coke/Pepsi repeated-measures analogy; the >100× sensitivity gain over the most sensitive A/B metric; team-draft interleaving at Netflix and its position-bias control; interleaving as a relative-preference measure that cannot directly measure retention. https://netflixtechblog.com/interleaving-in-online-experiments-at-netflix-a04ee392ec55 — accessed 2026-06-18.
It's All A/Bout Testing: The Netflix Experimentation Platform. Diaz, S., Parks, J. et al. Netflix Technology Blog, 2016. Tier 3 (first-party engineering). Source of: the scale of streaming experimentation (thousands of short-lived tests a year); the platform jobs of viewer allocation and allocation forecasting; metrics of importance typically being streaming hours and retention; experimentation as the antidote to deciding by opinion. https://netflixtechblog.com/its-all-a-bout-testing-the-netflix-experimentation-platform-4e1ca458c15 — accessed 2026-06-18.
Sequential A/B Testing Keeps the World Streaming Netflix (Part 1: Continuous Data; Part 2: Counting Processes). Lindon, M., Sanadhya, S., Moore, D. et al. Netflix Technology Blog, 2023. Tier 3 (first-party engineering). Source of: Netflix's use of sequential testing to monitor rollouts continuously and safely, stopping harmful changes early while controlling the false-positive rate — the streaming-specific application of always-valid inference. https://netflixtechblog.com/sequential-a-b-testing-keeps-the-world-streaming-netflix-part-1-continuous-data-cba6c7ed49df — accessed 2026-06-18.
Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Kohavi, R., Tang, D. & Xu, Y. Cambridge University Press, 2020. ISBN 978-1-108-72426-5. Tier 2 (authoritative practitioner text, peer-reviewed authors). Source of: guardrail metrics and the principle that an OEC win that regresses a guardrail should not ship; sample-ratio mismatch as a validity check; the novelty/primacy effect and the need to run visible-change tests long enough for it to fade. https://experimentguide.com — accessed 2026-06-18.
Experimentation is a major focus of Data Science across Netflix. Netflix Technology Blog, 2021. Tier 3 (first-party engineering). Source of: the breadth of experimentation across the streaming product (recommendations, UI, streaming/QoE, sign-up, payments, messaging); experimentation as the default decision-making tool; OEC anchored on member-centric long-term metrics. https://netflixtechblog.com/experimentation-is-a-major-focus-of-data-science-across-netflix-f67923f8e985 — accessed 2026-06-18.

Where popular explanations oversimplified, the peer-reviewed literature was followed. The peeking/sequential-testing, CUPED, interleaving, and OEC claims are cited to the original papers (refs 1–4) rather than to vendor paraphrases; Netflix's product-specific behaviors and numbers (the two-stage process, the >100× interleaving sensitivity, streaming-hours/retention as the OEC, sequential monitoring of rollouts) are cited to Netflix's own engineering blog (refs 5–7, 9). This topic is not governed by a delivery/encryption/DRM specification, so the ≥3-primary-source bar is met with peer-reviewed primary literature rather than standards documents — consistent with the section's treatment of other product/ML topics in Block 7.

Related glossary terms

Retention

Watch time

A/B testing

Rebuffering

Engagement

Startup time

Completion rate

Encoding ladder