Video Recommendation System: How It Works

Why this matters

If you run or are building a streaming service, recommendations are the engine that turns a catalog you paid for into hours actually watched: Netflix reports that its recommender drives about 80% of streamed hours (Gomez-Uribe & Hunt, 2015). You will not train the underlying models yourself; you will hire or buy that expertise. But you do have to decide what the system optimizes for, what data it needs, how it behaves on day one when it knows nothing about a viewer, and how you will prove a change helped. This article covers the product wiring: the parts a founder, product manager, or first-time streaming CTO must understand to make those calls and talk to the engineers and vendors who build the rest. For the mathematics inside the models, we link out to the recommendation-model internals in AI for Video Engineering rather than re-deriving them here.

"Because you watched" is two questions, not one

Start with the row every streaming app has: Because you watched [something], here are ten titles. It looks like one decision. It is actually two, and keeping them separate is the key that unlocks everything else.

The first question is: what could this viewer plausibly watch? Your catalog might hold tens of thousands of titles. You cannot carefully evaluate every one of them for every viewer every time the home screen loads, because there is not enough time in the few hundred milliseconds you have before the screen must paint. So the first stage casts a wide, cheap net and pulls back a few hundred titles that are probably relevant. This stage is called candidate generation (sometimes "retrieval").

The second question is: of those few hundred, what should we show first, and in what order? Now the numbers are small enough to spend real effort on each one, scoring it with a richer model and far more detail about the viewer and the title. This stage is called ranking.

This two-stage split is not a quirk of one company; it is, in the words of the YouTube engineers who published their architecture, "the classic two-stage information retrieval dichotomy" (Covington, Adams & Sargin, Deep Neural Networks for YouTube Recommendations, ACM RecSys 2016). Think of it like hiring. You do not interview every applicant for three hours each. You skim thousands of résumés quickly to get a shortlist of a few dozen (candidate generation), then interview that shortlist in depth (ranking). The skim is fast and approximate; the interview is slow and precise. A recommender works the same way, and for the same reason: you cannot afford the expensive step on the whole pool.

Figure 1. The recommendation funnel. Candidate generation casts a wide, cheap net over the whole catalog (millions → hundreds); ranking scores that shortlist in depth and orders the few dozen the viewer actually sees (hundreds → dozens). Splitting the work this way is what lets a recommender stay fast on a large catalog.

Why does the split matter to you as a product owner and not just to an engineer? Because it tells you where your two hardest problems live. Coverage — making sure good but obscure titles ever get a chance to surface — is a candidate-generation problem. Quality — making sure the first three rows are truly the best of the shortlist — is a ranking problem. When someone complains "the recommendations are stale," they usually mean candidate generation is too narrow. When they complain "the recommendations are obvious," they usually mean ranking is playing it safe. Different stage, different fix.

The three ways a system decides what is "relevant"

Both stages need some notion of what makes a title a good match for a viewer. There are three classic strategies, and a real product almost always blends them. You do not need the math to choose between them — you need to know what each one is good and bad at, because that shapes what data you must collect.

The first strategy is collaborative filtering, which means deciding what to recommend from the behavior of many viewers rather than from anything about the title itself. The plain-language version is "people who watched what you watched also watched this." The classic production example is Amazon's, whose engineers described an approach called item-to-item collaborative filtering: instead of finding viewers similar to you (which is expensive and unstable), it precomputes, for each title, the other titles most often watched by the same people, then recommends from the titles you have already watched (Linden, Smith & York, Amazon.com Recommendations: Item-to-Item Collaborative Filtering, IEEE Internet Computing, 2003). Its strength is that it needs to know nothing about the content — it works purely from the pattern of who-watched-what, so it can surface a connection no human tagger would think of. Its weakness is that it is helpless about a title nobody has watched yet.
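As a rough illustration of the item-to-item idea, the sketch below precomputes co-watch counts per title and recommends from the titles a viewer has already watched. The viewer names, titles, and function names are hypothetical, and raw co-watch counts are a simplification: they lean toward popular titles, and a production system would normalize the counts into a similarity measure.

```python
from collections import defaultdict
from itertools import combinations


def build_co_watch(histories):
    """For each title, count how many viewers also watched each other title."""
    co_watch = defaultdict(lambda: defaultdict(int))
    for titles in histories.values():
        # Every unordered pair in one viewer's history is one co-watch.
        for a, b in combinations(sorted(set(titles)), 2):
            co_watch[a][b] += 1
            co_watch[b][a] += 1
    return co_watch


def recommend(co_watch, watched, k=3):
    """Score unseen titles by how often they co-occur with watched ones."""
    scores = defaultdict(int)
    for title in watched:
        for other, count in co_watch[title].items():
            if other not in watched:
                scores[other] += count
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [title for title, _ in ranked[:k]]


# Hypothetical viewing histories (viewer -> titles watched).
histories = {
    "ana": ["Heist", "Night Train", "Cold Case"],
    "ben": ["Heist", "Night Train", "Bake Off"],
    "cy": ["Heist", "Cold Case"],
    "dee": ["Bake Off", "Garden Hour"],
}
co_watch = build_co_watch(histories)
print(recommend(co_watch, {"Heist"}))
```

A title that appears in no history never enters `co_watch`, so it can never be recommended by this method, which is the weakness described above.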

The second strategy is content-based filtering, which means recommending titles that resemble what a viewer already liked, judged by the attributes of the titles themselves — genre, cast, director, language, tone, theme. The plain-language version is "this is similar to something you watched." The fuel for this approach is metadata: the structured description of every title. A content-based system can recommend a brand-new title the moment it is added, because it can see that the new thriller shares a director and genre with three thrillers you finished. Its weakness is the mirror image of collaborative filtering's strength: it tends to recommend more of the same and rarely surprises, and it is only ever as good as the metadata behind it — which is why a clean metadata pipeline is the unglamorous core of discovery.
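To make the metadata dependence concrete, here is a minimal sketch that compares titles by the overlap of their attribute tags (Jaccard similarity). The catalog, tag names, and the choice of Jaccard are assumptions for illustration only; real systems typically use richer learned representations of the same metadata.

```python
def jaccard(a, b):
    """Share of attributes two titles have in common (0.0 to 1.0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similar_titles(catalog, liked, k=2):
    """Return the k titles whose metadata most resembles the liked title."""
    liked_tags = catalog[liked]
    scored = [
        (jaccard(liked_tags, tags), title)
        for title, tags in catalog.items()
        if title != liked
    ]
    # Highest similarity first; ties broken alphabetically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for _, title in scored[:k]]


# Hypothetical catalog metadata (title -> set of attribute tags).
catalog = {
    "Night Train": {"thriller", "dir:Okafor", "en"},
    "Cold Case": {"thriller", "crime", "en"},
    "Bake Off": {"reality", "food", "en"},
    "Last Signal": {"thriller", "dir:Okafor", "en"},  # brand-new, zero views
}
print(similar_titles(catalog, "Night Train"))
```

The brand-new title ranks first without a single view because only its tags are consulted. The same property cuts the other way: a title with missing or wrong tags is scored on that bad data.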

The third strategy is a hybrid, which simply means using both, because each one fails exactly where the other is strong. Collaborative filtering is blind to new titles; content-based filtering can see them. Content-based filtering is repetitive; collaborative filtering finds the unexpected. Nearly every serious streaming recommender is a hybrid, and the modern direction is to fold both kinds of signal into a single learned representation — Netflix's 2025 account of its unified recommender describes combining learned behavioral patterns with content metadata in one foundation model so that the same system can handle both well-known and unseen titles (Netflix Technology Blog, Foundation Model for Personalized Recommendation, 2025).

Figure 2. The three relevance strategies at a glance. Collaborative filtering runs on behavior and finds the unexpected but cannot see a new title; content-based filtering runs on metadata and carries new titles but repeats; the hybrid blends both. The bottom row, "handles a new title?", is the cold-start question that decides most product designs.

The table below adds the failure mode of each — the thing to watch out for — so you can plan around it.

Approach	Plain-language idea	Main input	Surprises?	New title?	Watch out for
Collaborative filtering	"People who watched what you watched also watched this"	Viewing behavior across all users	Yes — finds links	No — needs history a new title lacks	Cold start; popularity bias
Content-based filtering	"This is similar to what you watched"	Title metadata (genre, cast, theme)	Rarely — repeats	Yes — judges by attributes	Filter bubble; only as good as metadata
Hybrid (most real systems)	Both, blended so each covers the other's blind spot	Behavior + metadata	Yes, controlled	Yes — content carries new titles	More moving parts to wire and tune

Table 1. The three ways a recommender decides what is relevant, with the failure mode of each. The "New title?" column — whether the approach can recommend a title nobody has watched yet — is the one that most shapes a real product, because it is the cold-start problem in disguise.

The mathematics that turns "who watched what" into a recommendation — matrix factorization, embeddings, neural ranking models — belongs to the machine-learning layer, and we deliberately leave it to the recommendation-model internals article in AI for Video Engineering. The product point stands without the math: choose your data collection to feed a hybrid, because a hybrid is what you will end up needing.

The cold-start problem: the day the system knows nothing

Every recommender has an embarrassing first day. It is called the cold-start problem: the system cannot make good recommendations when it has no history to learn from. It shows up in three distinct forms, and they have different fixes, so it pays to name them separately.

The new-viewer cold start is the one you feel most sharply. A subscriber signs up, the home screen loads, and the system has never seen them watch anything. Collaborative filtering has nothing to work with — it runs on behavior, and there is none. The usual fixes are practical, not clever: ask a few taste questions during sign-up (the "pick three shows you like" onboarding step), lean on what is broadly popular right now as a safe default, and use any context you do have — country, language, device, time of day — to avoid showing a child the wrong thing. As the viewer watches even two or three titles, behavior arrives and the system warms up fast.

The new-title cold start is quieter but more expensive, because it directly wastes the content you paid to license. A title nobody has watched has no co-watch pattern, so collaborative filtering will not surface it, and it can sit invisible in the catalog. This is exactly where content-based filtering earns its place: because it judges a title by its metadata, it can recommend the new thriller to thriller-watchers on day one. The YouTube engineers handled the related "fresh content" problem with an explicit signal they called an "example age" feature, so the model would actively favor recently uploaded videos rather than drift toward the well-worn back catalog (Covington et al., 2016). The product lesson is blunt: without a content signal and clean metadata, your newest, most-promoted titles are the ones the recommender is worst at showing.

The new-platform cold start is the one founders underestimate. On launch day you have neither viewer history nor co-watch data for anything. A pure collaborative-filtering system has nothing to stand on. A new service therefore leans on content-based recommendations and editorial curation first, and only shifts weight toward collaborative filtering as watch data accumulates over the first weeks and months. Plan for this explicitly: the recommender you launch with is not the one you run a year later.

Figure 3. The three faces of cold start and what actually fixes each. A new viewer is warmed with onboarding, popularity, and context; a new title is carried by content metadata; a new platform leans on editorial and content-based recommendations until behavior accumulates. Collaborative filtering is the weakest in every cold case, which is why every real system keeps a content-based fallback.
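One simple way to picture the shift from content-based toward collaborative recommendations is a blend weight that grows with the amount of behavior available. The formula below is an illustrative assumption, not a method taken from the cited sources, and `half_point` is a hypothetical tuning knob: the number of watch events at which the two signals are weighted equally.

```python
def collaborative_weight(n_events, half_point=20):
    """Weight on the collaborative score: 0.0 with no history, rising toward 1.0."""
    return n_events / (n_events + half_point)


def blended_score(content_score, collab_score, n_events, half_point=20):
    """Mix the two scores according to how much behavior has been observed."""
    w = collaborative_weight(n_events, half_point)
    return (1 - w) * content_score + w * collab_score


# With no history the content-based score is used as-is; the collaborative
# score takes over gradually as watch events accumulate.
for n in (0, 5, 20, 200):
    print(n, round(collaborative_weight(n), 2))
```

The same shape applies per viewer, per title, or per platform: whichever side has no history gets a weight near zero on the collaborative signal until events arrive.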

The pattern that makes it scale: candidate generation, then ranking

Now put the funnel and the strategies together, because this is the architecture you will actually be specifying. The reason the two-stage split exists at all is a hard limit you can feel in arithmetic, so let us walk it out loud.

Suppose your catalog has 50,000 titles and your home screen must render in about 200 milliseconds to feel instant. Imagine a good ranking model takes 1 millisecond to score one title for one viewer with all the features you care about. Scoring the whole catalog would take 50,000 × 1 ms = 50,000 ms, or 50 seconds. That is not a slow home screen; it is a broken one — 250 times over the budget. Even parallelized across servers, paying the expensive per-title price on the entire catalog for every viewer on every load is economically and physically out of reach. (YouTube's corpus is in the millions and its latency budget is "tens of milliseconds," which is the same problem several orders of magnitude harder; Covington et al., 2016.)
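The same arithmetic as a few lines of Python, using the figures from the paragraph above. The constant names are illustrative, and the calculation assumes titles are scored one after another with nothing else competing for the budget.

```python
CATALOG_SIZE = 50_000        # titles in the catalog
BUDGET_MS = 200              # time allowed before the home screen must paint
SCORE_MS_PER_TITLE = 1       # cost of one careful ranking score

# Cost of scoring every title for one viewer on one page load.
full_scan_ms = CATALOG_SIZE * SCORE_MS_PER_TITLE
overrun = full_scan_ms / BUDGET_MS

# How many titles the expensive model could score inside the budget.
max_titles_in_budget = BUDGET_MS // SCORE_MS_PER_TITLE

print(f"full scan: {full_scan_ms} ms ({full_scan_ms / 1000:.0f} s)")
print(f"over budget by a factor of {overrun:.0f}")
print(f"titles scorable within budget: {max_titles_in_budget}")
```

Under these assumptions the expensive model can afford roughly 200 titles per load, which is why it has to be handed a shortlist rather than the catalog.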

So you split the work. Candidate generation uses a cheap operation (a fast lookup rather than a careful per-title score) to pull, say, the 500 most plausible titles out of the 50,000. The modern way to make that lookup fast is to turn every viewer and every title into a list of numbers (an "embedding") arranged so that similar things sit near each other, then ask a specialized index for the nearest titles to this viewer. That approximate nearest-neighbor lookup typically takes a few milliseconds or less, and its cost grows much more slowly than the catalog does, because the index does not check every title one by one. YouTube describes exactly this: at serving time the scoring "reduces to a nearest neighbor search in the dot product space" (Covington et al., 2016), and the same two-tower-plus-nearest-neighbor pattern is now standard practice for large-scale retrieval (Google Cloud, Implement two-tower retrieval for large-scale candidate generation, 2024).
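The lookup itself can be sketched as a dot-product search over embeddings. This version checks every title exhaustively, which is exactly what a production index avoids; it is a stand-in to show what "nearest" means, not an approximate nearest-neighbor implementation. The three-number vectors and titles are made up for illustration.

```python
import heapq


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def top_k_titles(viewer_vec, title_vecs, k=2):
    """Exhaustive nearest-neighbor search by dot product.

    A real system would replace this loop with an approximate
    nearest-neighbor index so that not every title is checked.
    """
    return heapq.nlargest(
        k, title_vecs, key=lambda title: dot(viewer_vec, title_vecs[title])
    )


# Hypothetical embeddings: similar titles point in similar directions.
title_vecs = {
    "Night Train": [0.9, 0.1, 0.0],
    "Cold Case": [0.8, 0.3, 0.1],
    "Bake Off": [0.0, 0.2, 0.9],
    "Garden Hour": [0.1, 0.1, 0.8],
}
viewer_vec = [1.0, 0.2, 0.0]  # a viewer whose history leans toward thrillers

print(top_k_titles(viewer_vec, title_vecs))
```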

Then ranking does the expensive work on just those 500. At 1 ms each, scored one after another, that is 500 ms. That is still over budget, so in practice the shortlist is trimmed and the ranker is tuned to fit, but the point is that the math now lives in a range you can engineer down to your 200 ms. The expensive model only ever sees a shortlist, never the catalog. That is the whole trick, and it is why "candidate generation then ranking" appears in essentially every production video recommender, from YouTube's 2016 paper to Netflix's 2025 foundation model, which still produces embeddings used "for candidate generation" feeding downstream ranking (Netflix Technology Blog, 2025).

One more practical consequence: because candidate generation can blend several sources, you can mix a personalized collaborative-filtering retriever with a content-based retriever (for new titles) and an editorial list (for things you want promoted) into one shortlist, then let ranking sort it out. The YouTube paper notes the design "enables blending candidates generated by other sources" (Covington et al., 2016). This is how a hybrid is actually wired in production: not one clever model, but several retrievers feeding one ranker.
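The "several retrievers feeding one ranker" wiring can be sketched in a few lines. The round-robin merge, the function names, and the predicted watch minutes are all assumptions made for illustration; the cited paper states that candidates from other sources can be blended but does not prescribe how.

```python
def blend_candidates(*sources, limit=500):
    """Merge several retrievers' lists into one de-duplicated shortlist.

    Sources are interleaved round-robin so each one contributes its
    best candidates before any source's tail is used.
    """
    seen = set()
    shortlist = []
    longest = max((len(source) for source in sources), default=0)
    for position in range(longest):
        for source in sources:
            if position < len(source) and source[position] not in seen:
                seen.add(source[position])
                shortlist.append(source[position])
                if len(shortlist) == limit:
                    return shortlist
    return shortlist


def rank_shortlist(shortlist, score):
    """Order the shortlist by the expensive score, best first."""
    return sorted(shortlist, key=score, reverse=True)


# Hypothetical outputs of three retrievers.
collaborative = ["Cold Case", "Night Train"]
content_based = ["Last Signal", "Cold Case"]   # carries the new title
editorial = ["Garden Hour"]                    # promoted by the content team

shortlist = blend_candidates(collaborative, content_based, editorial)

# Stand-in for the ranking model: predicted watch minutes per title.
predicted_watch_minutes = {
    "Cold Case": 31.0,
    "Last Signal": 44.0,
    "Garden Hour": 12.0,
    "Night Train": 38.0,
}
print(rank_shortlist(shortlist, predicted_watch_minutes.get))
```

Retrieval decides what gets a chance; ranking decides the order. Here the editorial pick is guaranteed a place on the shortlist, but the ranker still places it last.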

What the ranker must optimize: watch time, not clicks

Here is the single most important decision in the whole system, and it is a product decision, not a technical one: what does "good" mean to the ranker? Whatever you tell it to maximize, it will maximize — including in ways you did not intend.

The tempting target is clicks, because clicks are easy to count. A discovery layer tuned to maximize clicks learns, reliably, to promote whatever gets tapped: shocking thumbnails, misleading titles, the cheap hook over the satisfying watch. The viewer taps, watches ninety seconds, bounces, and trusts the home screen a little less next time. The click went up; the watching, and the trust, went down.

The engineers who run the largest video recommender on earth say this in plain terms, and it is worth quoting because it settles the argument. In the ranking stage, YouTube's objective "is generally a simple function of expected watch time per impression. Ranking by click-through rate often promotes deceptive videos that the user does not complete ('clickbait') whereas watch time better captures engagement" (Covington et al., 2016). They go further and refuse to even train on clicks alone: positive examples are weighted by how long the video was actually watched, so a title that is clicked and abandoned counts for almost nothing.

This lines up exactly with the thesis of this block's anchor article — that discovery must be measured by watch time and return, not clicks — and with how Netflix describes tuning its recommender toward member retention rather than taps (Gomez-Uribe & Hunt, 2015). The practical instruction for your team is short: define success as engagement that predicts a viewer coming back — watch time, completed sessions, next-week return — and make the ranker optimize that, even though it is harder to measure and slower to move than clicks. Then prove every change against that target with a real experiment, which is the subject of the A/B testing and experimentation article.
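The gap between the two targets is easy to see with made-up numbers. The sketch below compares two hypothetical titles by click-through rate and by observed watch minutes per impression. It computes plain averages over invented figures, not the weighted training scheme in the YouTube paper.

```python
from dataclasses import dataclass


@dataclass
class TitleStats:
    impressions: int        # times the title was shown
    clicks: int             # times it was tapped
    minutes_watched: float  # total minutes watched across all clicks

    @property
    def ctr(self):
        return self.clicks / self.impressions

    @property
    def watch_minutes_per_impression(self):
        return self.minutes_watched / self.impressions


# Hypothetical figures: a clickbait title abandoned after about ninety
# seconds, and a quieter title that viewers mostly finish.
stats = {
    "Shock Clip": TitleStats(impressions=1000, clicks=300, minutes_watched=450),
    "Slow Burn": TitleStats(impressions=1000, clicks=120, minutes_watched=4200),
}

by_ctr = sorted(stats, key=lambda t: stats[t].ctr, reverse=True)
by_watch_time = sorted(
    stats, key=lambda t: stats[t].watch_minutes_per_impression, reverse=True
)

print("ranked by clicks:    ", by_ctr)
print("ranked by watch time:", by_watch_time)
```

The two orderings are reversed: the title that wins on clicks delivers about a tenth of the watch time per impression.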

A recommender is a product system, not a model

It is easy to picture the recommendation system as "the model." In production it is mostly everything around the model: the events flowing in, the metadata feeding it, and the experiments judging it. Seeing the whole loop is what lets you budget and staff the work realistically.

The loop starts with events: every play, pause, completion, search, and skip is logged as a signal of what the viewer actually did. These implicit signals — what people watch — are far more plentiful and honest than explicit ones like thumbs-up, which is why YouTube trains on watches rather than ratings (Covington et al., 2016). Those events flow into the personalization data pipeline, which cleans them, joins them to title metadata, and stores the result where the models can use it. The models — candidate generation and ranking — produce the rows, which are then arranged into the personalized home screen of rows and artwork the viewer sees. The viewer responds, generating new events, and the loop closes. The recommender gets better not because the model is retrained in a vacuum but because the loop keeps turning.
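A minimal sketch of the first step of that loop: a viewing-event record and an aggregation into watch time per viewer and title. The field names and event kinds are a hypothetical schema, not one taken from the cited sources, and `seconds_watched` is assumed to mean the seconds of playback since the previous event.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ViewEvent:
    viewer_id: str
    title_id: str
    kind: str                 # e.g. "play", "pause", "complete", "skip"
    seconds_watched: int = 0  # playback seconds since the previous event


def watch_seconds(events):
    """Total watch time per (viewer, title), the implicit signal to train on."""
    totals = defaultdict(int)
    for event in events:
        totals[(event.viewer_id, event.title_id)] += event.seconds_watched
    return dict(totals)


# Hypothetical event log for one viewer: two thrillers finished,
# one comedy abandoned after five minutes.
events = [
    ViewEvent("v1", "thriller-a", "pause", 2400),
    ViewEvent("v1", "thriller-a", "complete", 3000),
    ViewEvent("v1", "thriller-b", "complete", 6000),
    ViewEvent("v1", "comedy-c", "skip", 300),
]
print(watch_seconds(events))
```

In a real pipeline this aggregate would then be joined to title metadata before the models see it, as described above.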

Two boundaries on that loop matter to a product owner. First, the loop runs on viewing data, which is sensitive; what you may collect and feed back is constrained by privacy rules, and the pipeline must respect that consent boundary — the privacy of viewing data is a design input, not an afterthought. Second, the loop is only as trustworthy as your measurement: a recommendation that is perfect but hidden behind a slow-loading player still loses the viewer, so quality-of-experience metrics like startup time and rebuffering have to be watched alongside recommendation metrics, or each will be blamed for the other's failures.

Figure 4. The recommendation system as a loop, not a model. Events (plays, completions) feed the data pipeline and metadata store; those feed candidate generation and ranking; the resulting rows produce watch time and new events. Metadata is the fuel, the pipeline is the plumbing, experiments are the judge; the model is one box among many.

A worked example: warming up a new subscriber

Tie it together with a single viewer's first week, because the abstract pieces click into place when you watch them run.

On day one, a new subscriber signs up. The system has no behavior for them — the new-viewer cold start. Onboarding asks them to pick three titles they like; they choose two thrillers and a documentary. Candidate generation now has a seed: a content-based retriever pulls titles whose metadata resembles those three, a popularity retriever adds what is broadly trending this week, and the two lists merge into a shortlist of a few hundred. Ranking orders them using the little context available — device, country, time of evening — and the home screen paints in under 200 ms. It is not yet personal, but it is not empty.

By day three, they have watched two thrillers to completion and abandoned a comedy after five minutes. Those events flow through the pipeline. Now collaborative filtering has something: other viewers who finished those same two thrillers also finished a particular limited series, which no content-based rule would have connected. That series enters the candidate set. The abandoned comedy, weighted by its near-zero watch time, teaches the ranker to demote similar comedies — exactly the watch-time weighting the YouTube engineers describe.

By day seven, the home screen is meaningfully personal: a "Because you watched" row built from real completions, a collaborative-filtering row surfacing the unexpected series, and a content-based row carrying two newly added thrillers that no one has watched yet but whose metadata fits. The same viewer, the same catalog — but the system has warmed from popularity-plus-onboarding to a genuine hybrid in a week, because the loop kept turning. That arc, from cold to warm, is the thing to design for, and it is why a launch recommender and a mature one are different systems.

A common mistake: one giant model that scores the whole catalog

The most expensive architecture error teams make is to skip the funnel — to build a single model that tries to score every title for every viewer on every load. It is intuitive ("just rank the whole catalog"), and it works in a demo with 200 titles. It collapses the moment the catalog and the audience grow, for the reason the arithmetic above showed: the expensive per-title score cannot run across the whole catalog inside a 200 ms budget, and the bill for trying scales with catalog × viewers × loads. The fix is the two-stage funnel — a cheap retrieval step to get the shortlist, then expensive ranking only on the shortlist.

The second, subtler version of the mistake is optimizing the wrong thing — shipping a recommender tuned for clicks and celebrating when taps rise, while watch time and retention quietly fall. The fix is the discipline this article and its anchor insist on: rank for expected watch time and return, and prove every change with an experiment measured against retention, not taps. And the third version is neglecting the cold cases — launching with collaborative filtering only, then wondering why new titles never surface and new subscribers see a generic wall. The fix is to keep a content-based retriever and clean metadata in the system from day one. Skip the funnel and it will not scale; skip watch time and it will mislead; skip cold start and your newest content stays invisible.

Where Fora Soft fits in

A recommendation system is mostly plumbing — an event stream of plays and completions, a clean metadata pipeline, a fast candidate-generation index, and a ranking step wired to watch time rather than clicks — and most of the engineering cost is in that plumbing, not in the model that sits on top. Fora Soft has built video streaming and OTT/Internet-TV platforms since 2005, across 625+ shipped projects for 400+ clients, which means we have wired the viewing-event pipelines and metadata stores that feed discovery, integrated recommendation and search services into real apps on web, mobile, and TV, and instrumented the watch-time and retention metrics that tell you whether any of it works. Our approach is scalability-first and vendor-neutral: we start from the size of your catalog, the concurrency you must serve, and the engagement you need to retain subscribers, then build the candidate-generation-and-ranking layer — or integrate a recommendation service — that your scale actually requires, leaving the model internals to the specialist layer and owning the product wiring around them.

Call to action

Talk to a streaming engineer — book a 30-minute scoping call to talk through your video recommendation system plan.
See our case studies — 250+ shipped projects across video streaming, WebRTC, OTT, telemedicine, e-learning, surveillance, and AR/VR.
Download the Recommendation-System Build Checklist — One Page — The two-stage candidate-generation/ranking funnel, the three relevance strategies, the cold-start fixes, and the watch-time-not-clicks metric to specify before you build — on a single sheet.

References

Deep Neural Networks for YouTube Recommendations. Covington, P., Adams, J. & Sargin, E. Proceedings of the 10th ACM Conference on Recommender Systems (RecSys '16), 2016. DOI 10.1145/2959100.2959190. Tier 1 (peer-reviewed, first-party engineering). Source of the two-stage candidate-generation/ranking dichotomy, the millions→hundreds→dozens funnel, nearest-neighbor retrieval at serving time, training on implicit watches, the "example age" freshness signal, candidate blending, and the central claim that ranking optimizes expected watch time rather than click-through because CTR "often promotes deceptive videos … ('clickbait')." https://research.google/pubs/deep-neural-networks-for-youtube-recommendations/ — accessed 2026-06-18.
The Netflix Recommender System: Algorithms, Business Value, and Innovation. Gomez-Uribe, C. A. & Hunt, N. ACM Transactions on Management Information Systems, 6(4), Article 13, 2015. DOI 10.1145/2843948. Tier 1 (peer-reviewed, first-party engineering). Source of the "≈80% of hours from recommendations" figure and the framing of the recommender as tuned to retention rather than clicks. https://dl.acm.org/doi/10.1145/2843948 — accessed 2026-06-18.
Amazon.com Recommendations: Item-to-Item Collaborative Filtering. Linden, G., Smith, B. & York, J. IEEE Internet Computing, 7(1), pp. 76–80, 2003. DOI 10.1109/MIC.2003.1167344. Tier 1 (peer-reviewed, first-party engineering; IEEE's 2017 "test of time" selection). Source of item-to-item collaborative filtering — recommending from similar items rather than similar users to scale to a huge catalog — and the "customers who bought this also bought" pattern. https://dl.acm.org/doi/10.1109/MIC.2003.1167344 — accessed 2026-06-18.
Foundation Model for Personalized Recommendation. Netflix Technology Blog, 2025. Tier 3 (first-party engineering blog). Source of the shift from many specialized models to one unified model, embeddings used for candidate generation, and combining learned ID embeddings with content metadata to handle unseen (cold) titles. https://netflixtechblog.com/foundation-model-for-personalized-recommendation-1a0bd8e02d39 — accessed 2026-06-18. Vendor engineering blog — dated 2025; re-verify capabilities at publish.
Cold start (recommender systems). Wikipedia, 2026. Tier 6 (educational, orientation only). Used to frame the three cold-start forms (new user, new item, new system) and why collaborative filtering is most affected while content-based filtering is resilient to the new-item case; the substantive claims are anchored to the primary sources above. https://en.wikipedia.org/wiki/Cold_start_(recommender_systems) — accessed 2026-06-18.
Implement two-tower retrieval for large-scale candidate generation. Google Cloud Architecture Center, 2024. Tier 3 (first-party vendor engineering documentation). Source of the modern candidate-retrieval pattern: separate user and item embedding "towers," precomputed item embeddings, and approximate nearest-neighbor search to retrieve a few hundred candidates from a corpus of 100M+ items. https://docs.cloud.google.com/architecture/implement-two-tower-retrieval-large-scale-candidate-generation — accessed 2026-06-18. Vendor docs — capabilities and product names change; dated.
The history of Amazon's recommendation algorithm. Amazon Science, 2019. Tier 3 (first-party engineering). Context on how item-to-item collaborative filtering scaled e-commerce personalization and how recommendations outperform untargeted merchandising. https://www.amazon.science/the-history-of-amazons-recommendation-algorithm — accessed 2026-06-18.
Artwork Personalization at Netflix. Netflix Technology Blog, 2017. Tier 3 (first-party engineering blog). Used for the point that even the home screen and its artwork are recommendation decisions (contextual bandits), connecting recommendations to the merchandising surface. https://netflixtechblog.com/artwork-personalization-c589f074ad76 — accessed 2026-06-18.

Where sources disagreed, the peer-reviewed and first-party engineering papers were preferred over secondary explainers. The two-stage funnel, the nearest-neighbor retrieval, the implicit-watch training, and above all the "rank for watch time, not clicks ('clickbait')" claim are cited directly from the YouTube RecSys 2016 paper (ref 1), not from the many blog paraphrases of it. Item-to-item collaborative filtering is cited from the original Amazon paper (ref 3). The cold-start framing draws its structure from a tier-6 educational source (ref 5) but anchors each fix in the primary sources. This article covers the product wiring of recommendation systems; the model mathematics is delegated to the AI for Video Engineering section, so no delivery-format, encryption, DRM, ad-signaling, or legal spec citation is required here.

Related glossary terms

Metadata

Collaborative filtering

Watch time

Content-based filtering

Retention

Recommendation system

Cold-start problem

Engagement