Capstone — Building A Video Archive Search And Q&A Engine With Multimodal RAG

Why This Matters

This article is for the founder, product lead, or operations head at a media company, streaming service, broadcaster, or any business sitting on years of recorded video who has been asked the obvious question: "Can we just ask it things?" You have webinars, past broadcasts, a back catalogue, recorded calls, lecture libraries, and surveillance footage, and right now the only way to find a moment inside them is for a human to remember where it is or to scrub through hours of timeline. This capstone shows you what an archive search-and-Q&A engine actually costs, how long it takes to build, which parts are bought and which are built, and where the law and the access rules draw hard lines. It is equally for the engineer who has read the individual multimodal lessons and wants them welded into one deployable system with named technologies, real prices, and an accuracy bar. It assumes you have met the underlying ideas already, because a capstone assembles rather than re-derives; the cross-links point back to each foundational lesson when you need the detail. By the end you will be able to draw the engine on a whiteboard, name the exact 2026 technology in every box, defend the cost per question to a finance team, sequence the build so a first version ships in weeks, and tell a design that a security and privacy review will pass from one that a legal team will stop at the door.

What You Are Building, Stated Precisely

Fix the product before any technology. You are building an internal or customer-facing service with one job: a person types or speaks a question about the contents of a video library, and the service returns a short written answer together with the specific clips and timecodes that answer came from. It is two features that share a spine. The first is search — type a description of a moment and get back the moments that match, ranked by relevance. The second is question-answering, usually shortened to Q&A — ask a real question and get a written answer assembled from the footage, not just a list of links. Both rest on the same machinery; the Q&A feature simply adds a final step that reads the retrieved clips and writes prose.

Two terms in that description carry weight. OTT stands for "over-the-top" — video delivered to viewers over the open internet rather than through a cable or satellite box, which is what every modern streaming service is. An archive, here, means the accumulated library of finished and raw video a platform holds: past episodes, recorded live events, the unused footage, the whole back catalogue. So the product is a search-and-Q&A engine pointed at the archive of an OTT platform — not a recommendation system for viewers, but a tool that the platform's own people, or its power users, use to find and reason about what is inside the library.

The technique underneath has a name you have met: retrieval-augmented generation, almost always shortened to RAG. The plain version of the idea is the way a librarian answers a hard question — not by reading every book in the building, but by walking to the right shelf, pulling the two or three books that cover the topic, and answering from those pages. Retrieval is the walk to the shelf; generation is the answer written from what was pulled. "Multimodal" means the shelf holds moving pictures and sound, not just text, and the system can search across all of it. The full mechanics are the subject of the video-RAG lesson; this capstone is the buildable system that wraps them into a product.

The scope line that matters most is what the engine is allowed to answer, and to whom. It answers from the archive, with citations, and only over the footage the asker is permitted to see. It does not answer from the model's own memory — that is how a model invents confident, wrong facts, a failure called hallucination — and it does not let a junior user search footage a senior user locked down. Hold that line and the engine is a trustworthy tool that an operations team relies on. Cross it — let the model answer from memory, or let anyone retrieve anything — and you have built a confident liar that also leaks confidential footage. The whole architecture below is shaped to keep the engine on the safe side of that line by construction, not by good intentions.

The Spine: Two Rules That Outlast The Models

Two ideas carry the entire build. Get them right and everything else is detail — and, as with every capstone in this course, both rules exist because the technology underneath is moving faster than any single article can track.

The first rule is index once, retrieve per question — the archive never enters the model. You do the expensive work of understanding and indexing each video exactly once, when it enters the library; from then on, every question reads only the handful of moments that actually relate to it. This is not a preference; it is forced by arithmetic the cost section makes concrete. A modern frontier model can read a few hours of video in one request, but an archive is measured in thousands of hours, and feeding the whole thing into the model on every question is both impossible — it does not fit — and, even at the margin where it fits, ruinously expensive and less accurate, because a model's attention thins out across a giant input. So you split the system in two: a slow, do-it-once ingestion half that turns each video into searchable form, and a fast, cheap query half that runs on every question. Keeping those two halves straight in your head is the single most useful mental model for budgeting the whole engine.

The second rule is no answer without a citation. Every answer the engine produces must point back to the exact source — the video, and the timecode inside it — that it was drawn from. This is the difference between a party trick and a tool a business will trust. An answer a user can verify in one click, by jumping to 14:32 in the March recording and seeing the moment for themselves, is an answer they will rely on; an answer with no source is one they must independently re-check, which defeats the purpose. The citation is not decoration bolted on at the end. It is a constraint that runs backwards through the whole pipeline: every searchable piece must carry its source video and timestamp from the moment it is created, or the final answer has nothing to cite. Build the citation in from the first milestone, because retrofitting provenance into an engine that threw timestamps away is a rebuild.

Hold the two rules together and the engine has a clean shape. The first rule decides where the cost lives — in a one-time ingestion step, walled off from the per-question path that must stay cheap. The second rule decides what makes the engine trustworthy — not the fluency of the answer, which is now easy, but the chain back to the footage, which is what a user actually needs. Everything in the rest of this article fills in the boxes between those two rules, decides what you build versus buy, and prices it.

Figure 1. The engine has two halves and a governance floor. The left half runs once per video and carries the cost; the right half runs on every question and stays cheap because it touches only a few retrieved moments; permissions and privacy are enforced at both index time and query time.

The Production Architecture, Box By Box

A real deployment is more than a call to a search model. Nine kinds of component show up in every archive search-and-Q&A engine we have scoped, and naming them precisely is the first hour of any project. They divide cleanly into the two halves of the first spine rule.

The ingestion half runs once per video. It opens with the chunker, which cuts each video into the small searchable pieces that get retrieved and shown. A chunk is the unit a user gets back, so its size is a real product decision: cut every thirty seconds by the clock and you slice through the middle of a sentence or an action; cut on scene boundaries — a new shot, a new speaker, a new topic — and each chunk is one coherent idea that retrieves cleanly. The rule is to align chunks to meaning, not to the clock.
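A minimal sketch of meaning-aligned chunking, assuming an upstream detector has already produced scene-change times and the ASR step has produced word-level timestamps. The names `Word`, `Chunk`, and `chunk_by_scene` are hypothetical and exist only for illustration; a real chunker would also weigh speaker and topic changes.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the video


@dataclass
class Chunk:
    video_id: str
    start: float
    end: float
    text: str


def chunk_by_scene(video_id, words, boundaries, duration):
    """Cut one chunk per scene; `boundaries` are scene-change times in seconds."""
    inner = sorted({b for b in boundaries if 0.0 < b < duration})
    edges = [0.0] + inner + [duration]
    chunks = []
    for start, end in zip(edges, edges[1:]):
        # Assign each spoken word to the scene in which it starts.
        spoken = [w.text for w in words if start <= w.start < end]
        chunks.append(Chunk(video_id, start, end, " ".join(spoken)))
    return chunks
```

Each chunk keeps its video id and start time from the moment it is cut, which is what later lets an answer cite a timecode.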

Next is the frame selector, which exists because video has far too many frames to process them all. A single minute at sixty frames per second holds 3,600 frames, and successive frames are nearly identical. The selector first downsamples (keeping, say, four frames per second instead of sixty, which leaves 240 frames per minute) and then keeps only the keyframes that actually change, which in practice takes a minute of footage from 3,600 frames down to a few dozen. That reduction, roughly ninety-fold if about forty frames survive, is the difference between an archive you can afford to index and one you cannot.
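The two-stage selection can be sketched as follows. This is an illustration under simplifying assumptions: frames are plain lists of pixel intensities, and "change" is measured by mean absolute difference against the last kept frame, which is a stand-in for a proper scene-change detector. The function name and the threshold value are hypothetical.

```python
def select_keyframes(frames, source_fps=60, target_fps=4, threshold=10.0):
    """Return indices of frames kept after downsampling and change detection.

    `frames` is a list of equal-length sequences of pixel intensities (0-255).
    """
    step = max(1, source_fps // target_fps)  # 60 // 4 keeps every 15th frame
    kept, last = [], None
    for i in range(0, len(frames), step):
        frame = frames[i]
        if last is None:
            kept.append(i)  # always keep the first sampled frame
            last = frame
            continue
        # Mean absolute pixel difference against the last kept frame.
        diff = sum(abs(a - b) for a, b in zip(frame, last)) / len(frame)
        if diff > threshold:
            kept.append(i)
            last = frame
    return kept
```

On sixty frames where the picture changes once halfway through, only two frames survive: the first sampled frame and the first sampled frame after the change.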

The extractor turns each chunk into searchable meaning, and it is where you choose between the two honest designs the video-RAG lesson laid out. The cheap, mature path is text grounding: convert everything to text first — the spoken words through automatic speech recognition (ASR, the technology behind captions, which crucially records when each word was said), the on-screen text through optical character recognition (OCR), and the purely visual content of each scene through a vision-language model prompted to describe what it sees — then search that text with an ordinary, cheap text pipeline. The richer, heavier path is direct visual embedding: run the frames themselves through a multimodal model so that purely visual meaning survives even when nobody narrated it. Most strong 2026 systems are hybrid, leaning on text grounding for the bulk and layering visual embeddings in where pure-visual questions matter. The ASR and OCR engineering live in the WhisperX lesson and the PaddleOCR lesson.

The embedding model converts that extracted meaning into an embedding — a list of numbers, a few hundred to a few thousand of them, that captures the meaning of a piece of content as a single point in space, so that two clips about the same thing land near each other and a question can be turned into a point on the same map. The indexer then stores every embedding in a vector database — a database built for one job, finding the nearest stored points to a query point fast across millions of them — together with the metadata that makes the second spine rule possible: the source video, the exact timecode, the chunk's text, and the permission tags that say who may see it. That metadata is not an afterthought; it is what lets the final answer cite "minute 14 of the 3 March recording" and what lets the engine refuse to show a clip to someone who lacks access.
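As a sketch of what one stored record might carry, assuming hypothetical field names: the vector sits beside the source video, the timecodes, the text, and the permission tags. The brute-force `nearest` function below only illustrates the nearest-point lookup; a vector database does the same job with an approximate index across millions of records.

```python
import math
from dataclasses import dataclass


@dataclass
class IndexedChunk:
    video_id: str
    start_s: float
    end_s: float
    text: str
    permissions: frozenset  # tags saying who may see this chunk
    vector: tuple           # the embedding


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest(index, query_vector, k=3):
    """Return the k stored chunks closest in meaning to the query vector."""
    ranked = sorted(index, key=lambda c: cosine(c.vector, query_vector), reverse=True)
    return ranked[:k]
```

Because every hit is a full record, the caller gets the timecode and permission tags back with the match rather than having to look them up afterwards.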

The query half runs on every question and must stay fast and cheap. It opens with query understanding, a light step that cleans the question, expands it, and decides whether it is a search ("show me clips of…") or a real question ("what did the analyst say about…"). Then the hybrid retriever does the central work: it turns the question into a point on the same meaning map and asks the vector database for the closest chunks (semantic search), and at the same time runs an ordinary keyword search (lexical search) so that exact terms, names, and codes are not lost. Blending the two, semantic for meaning and lexical for precision, is what production systems in 2026 do, because each catches what the other misses. The reranker then takes the merged candidates, typically a few dozen, and re-scores them with a slower, more careful model, keeping only the best few (often two or three for a focused question); reranking is cheap because the candidate list is already short.
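One common way to blend a semantic ranking with a lexical one is reciprocal rank fusion. The text above does not name a fusion method, so treat this as one reasonable choice rather than the engine's prescribed one. The constant `k=60` is the conventional default for this technique, and the chunk ids are made up.

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked lists of chunk ids into one list, best first.

    Each list contributes 1 / (k + rank) for every id it contains, so an id
    that appears high in several lists beats one that appears in only one.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: scores[cid], reverse=True)


semantic = ["c2", "c1", "c3"]   # nearest by meaning
lexical = ["c1", "c4"]          # exact keyword matches
fused = reciprocal_rank_fusion([semantic, lexical])
```

Here `c1` rises to the top because both retrievers found it, while `c4`, found only by the keyword search, still survives into the merged list for the reranker to judge.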

Finally, the answer model — a vision-language or large language model — reads only those few retrieved chunks together with the original question and writes the answer using only what it was given, ending in the citation the second rule demands. The answer-and-citation surface is what the user actually sees: the written answer, the source clips with their timecodes, and a one-click jump to the moment. Running underneath all of it is the governance and audit layer — the permission checks, the privacy controls over the people in the footage, and the log of every question asked and every clip returned. For an engine that reaches into a company's whole video memory, that layer is not optional infrastructure; it is what lets the engine be deployed at all.

Two of these boxes are deliberately treated as swappable parts rather than fixed choices: the embedding model and the answer model. Both are the fastest-moving, least durable pieces of the system, and both should sit behind a thin internal interface so that replacing one is a configuration change, not a rewrite — the reason the next section spends its time on abstraction rather than on picking a single winner.

Build Versus Buy: The 2026 Verdict, Component By Component

A capable team does not write all of this from scratch, and does not buy all of it either. The rule of thumb mirrors the other capstones in this course: adopt the mature infrastructure, rent or self-host the fast-moving models, and build only the parts that are your actual product — here, the chunking-and-extraction pipeline tuned to your archive, the hybrid retrieval logic, and the citation-and-governance layer. Those are the boxes that make the engine accurate and safe to deploy; everything else is bought.

Component	Build or buy	Concrete 2026 choice	Why
Embedding model	Buy / self-host	Gemini Embedding 2, Cohere Embed 4, Twelve Labs Marengo 3, or open SigLIP 2 / Jina v5	Fastest-moving part; never your moat, always swappable
Answer model (VLM/LLM)	Buy / self-host	Gemini 2.5, Claude, GPT, or open LLaVA / Qwen-VL	Also rented; route by cost and sensitivity
Chunking + extraction pipeline	Build	Scene chunker + ASR + OCR + VLM captioner	Tuned to your archive; this is your accuracy
Frame selection	Build on libraries	Downsample + keyframe detection	Standard CV; small, owned, cheap
Vector database / index	Buy (managed)	Pinecone, Qdrant Cloud, Weaviate, or Milvus self-hosted	A solved infrastructure domain; don't reinvent
Hybrid retrieval + rerank	Build on a search engine	Vector search + keyword search (OpenSearch-class) + a reranker	The relevance bar is your product
Citation + answer surface	Build	Your answer assembly + timecode deep-links	The trust feature; cannot be delegated
Governance / permissions / audit	Build	Your access checks at index and query time	Your legal obligation; cannot be bought
ASR / OCR / captioning models	Buy / self-host	Whisper-class ASR, PaddleOCR, a captioning VLM	Mature components, rented or self-run

Two cells deserve a note because they are where teams go wrong. The embedding-model cell is marked "buy or self-host", and the choice between those two is a real fork: a managed embedding API is faster to ship and always current, while a self-hosted open model such as SigLIP 2 keeps every frame inside your own infrastructure — which matters when the archive itself is confidential, such as unreleased footage. The governance cell is marked "build", and it is the one teams are most tempted to skip — search is exciting, permissions are paperwork. Skipping it is the single most expensive mistake available, because an engine that answers over footage the asker should not see is a data-leak incident waiting for an audit, and one that ignores the privacy of the people inside the footage is a regulatory one.

Figure 2. What to build and what to adopt. Rent the embedding and answer models and the vector database; build the extraction pipeline, the hybrid retrieval, the citation surface, and the governance layer — the parts that make the engine accurate and safe.

Choosing The Models — And Why You Abstract Them Anyway

This engine rents two models, and both should sit behind a thin internal interface, because the right answer for each changes every quarter. The point is not which model wins today; it is that no single model wins for long, and an engine wired directly to one specific vendor inherits that vendor's volatility.

The first rented model is the embedding model — the part that turns both your archive and the incoming question into points on the same meaning map. Mid-2026 gives you a crowded field. Google's Gemini Embedding 2, launched on 10 March 2026, is its first natively multimodal embedding model and maps text, images, video, audio, and PDFs into a single 3,072-dimensional space, which makes it a natural fit for an archive that holds all of those. Cohere Embed 4 is the other strong managed option, and Twelve Labs Marengo 3 is purpose-built for video and reachable both through Twelve Labs directly and through Amazon Bedrock, alongside Amazon's own Nova multimodal embeddings. On the open-weights side, SigLIP 2 and Jina v5 now rival the commercial APIs on retrieval benchmarks and run on your own GPUs, keeping the archive inside your walls. The embeddings primer is the CLIP lesson.

The second rented model is the answer model — the vision-language or large language model that reads the retrieved chunks and writes the grounded answer. Here the choice is the same one the rest of the course has drawn repeatedly: a closed frontier model such as Google's Gemini 2.5, Anthropic's Claude, or OpenAI's GPT for the strongest reasoning, or an open model such as LLaVA or Qwen-VL when the footage must stay on your own hardware. Because the answer model only ever sees a few small retrieved chunks — never the archive — even a mid-tier model produces good answers, so this is a place to route by cost and sensitivity rather than always reaching for the most expensive option. The open-model field is the LLaVA and Qwen-VL lesson.

Now the reason both sit behind an interface, stated as a live case rather than a principle. On 30 March 2026, Twelve Labs sunset its Marengo 2.7 model: teams could no longer index new content, run searches, or retrieve embeddings from content they had already indexed, and were told to migrate. An archive whose every vector had been produced by Marengo 2.7 faced not just a code change but a re-indexing — because embeddings from one model are not comparable with embeddings from another, retiring an embedding model means re-embedding the archive. A team that had wrapped the embedding step behind one internal interface, and had kept the cheap extracted text alongside the vectors, could re-index against a new model as a background job. A team that had hard-wired Marengo 2.7's specific calls throughout its pipeline spent that spring untangling them under deadline. Embeddings are the one swap that carries a real cost, which is exactly why the abstraction and the retained source text matter most there.
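The shape of that abstraction can be sketched as a single interface plus a re-index job that works from the retained text. Everything here is hypothetical: `Embedder`, `ToyEmbedder`, and `reindex` are illustrative names, and `ToyEmbedder` is a deterministic stand-in, not any vendor's client.

```python
from typing import Protocol, Sequence


class Embedder(Protocol):
    """The only surface the pipeline is allowed to call."""
    name: str

    def embed(self, texts: Sequence[str]) -> list[list[float]]: ...


class ToyEmbedder:
    """Stand-in for a real model client; NOT a real vendor API."""

    def __init__(self, name: str, dims: int):
        self.name, self.dims = name, dims

    def embed(self, texts):
        # Deterministic fake vectors so the sketch runs without a model.
        return [[float(len(t) % (i + 2)) for i in range(self.dims)] for t in texts]


def reindex(records, embedder: Embedder):
    """Re-embed retained chunk text with a new model, keeping all metadata."""
    vectors = embedder.embed([r["text"] for r in records])
    return [{**r, "vector": v, "model": embedder.name} for r, v in zip(records, vectors)]
```

Recording the model name on each record is a small design choice that pays off during a migration: old and new vectors can coexist in the index without ever being compared with each other.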

Model role (mid-2026)	Strong managed options	Strong open options	Note for this engine
Embedding model	Gemini Embedding 2 (multimodal, 3,072-dim) · Cohere Embed 4 · Twelve Labs Marengo 3 · Amazon Nova	SigLIP 2 · Jina v5 · Qwen3-Embedding	Swapping it forces a re-index — abstract it and keep the source text
Answer model	Gemini 2.5 · Claude · GPT	LLaVA · Qwen-VL	Only ever reads a few chunks — route by cost and sensitivity
ASR (speech → text)	Managed Whisper-class APIs	Whisper, faster-whisper, WhisperX	Must emit word-level timestamps for citations
OCR (on-screen text)	Managed OCR APIs	PaddleOCR	Cheap; adds slide titles, signs, captions to search

Notice what the table makes obvious: the two rented models pull in opposite directions on switching cost. The answer model is nearly free to swap — point the final call at a different API and the engine keeps working — while the embedding model is the expensive swap, because changing it invalidates every vector in the index. That asymmetry is the whole argument for the architecture: abstract both, but spend your real engineering care on the embedding interface and on keeping the cheap extracted text beside the vectors, so that the one costly swap becomes a re-index you schedule rather than a crisis you survive.

Figure 3. The engine rents two models that swap at very different costs. The answer model is nearly free to replace; the embedding model is the expensive swap because changing it invalidates the index — so abstract both and keep the extracted text beside the vectors.

Following One Question From Query To Cited Answer

Numbers and boxes become concrete when you trace a single request through the system. Follow one job: an operations analyst at a streaming service needs to know, across two years of a weekly news show, every time a particular sponsor was mentioned by name, so the ad-sales team can audit delivery. They open the search box and type "every mention of the sponsor Northwind by the host or guests".

The question first hits query understanding. It is recognised as a search that needs both exact-term precision — the brand name "Northwind" must match literally — and semantic breadth, because a guest might have said "the Northwind people" or "our sponsor this week". The step keeps the literal term for the keyword search and expands the meaning for the semantic search.

The hybrid retriever runs both at once. The keyword search finds every chunk whose transcript literally contains "Northwind"; the semantic search finds chunks whose meaning is close even where the exact word is fuzzy. The two result lists are merged. This is the moment the once-per-video ingestion pays off: out of perhaps forty thousand chunks across two years of episodes, the retriever pulls back a few dozen candidates in milliseconds, because the hard work of understanding every episode was done at index time.

The reranker takes those few dozen and re-scores them carefully, dropping the false matches — a chunk that mentioned a different "Northwind" in an unrelated story — and keeps the genuine sponsor mentions, each still carrying its source episode and timecode from ingestion.

The answer model reads only those retrieved chunks, not the archive, and writes the answer: a list of every confirmed mention, each with the episode date and the timecode, and a one-line note on the context. Because the metadata travelled with each chunk from the moment it was created, every line in the answer ends in a citation the analyst can click to jump straight to 11:48 in the 14 February episode and confirm it. None of this happens outside the governance layer: the analyst's rights to this show's archive are applied as a filter at retrieval time, before any candidate is returned, so a contractor without rights to unreleased segments would silently have those excluded from the results. Every step (the question, the retrieved candidates, the reranked set, the answer, and the clips returned) is written to the audit log, so the engine can later show exactly what was asked and what was surfaced.
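Turning a chunk's stored metadata into a clickable citation is a small amount of code. The sketch below assumes hypothetical field names (`video_id`, `title`, `start_s`) and a placeholder player URL with a `t` query parameter; a real player's deep-link format will differ.

```python
def timecode(seconds: float) -> str:
    """Format seconds as M:SS, or H:MM:SS past one hour."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes}:{secs:02d}"


def cite(chunk: dict, base_url: str = "https://archive.example/watch") -> dict:
    """Build the citation shown beside a claim, with a jump-to-moment link."""
    start = int(chunk["start_s"])
    return {
        "label": f'{chunk["title"]} at {timecode(start)}',
        "url": f'{base_url}/{chunk["video_id"]}?t={start}',
    }
```

For a chunk starting 708 seconds into the 14 February episode, the label reads "14 February episode at 11:48" and the link opens the player at that second.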

Notice the discipline. The cost lived in the one-time index, not in the question; the question touched only a few chunks and stayed cheap; the answer is grounded in the footage and carries a citation to every claim; and the permission check happened before the user saw anything. That shape is what keeps the engine fast for the analyst, cheap for the platform, and defensible to an auditor.

Figure 4. One question, end to end. The cost lived in the one-time index; the question touches only a few retrieved chunks, the answer cites every claim by episode and timecode, and the permission check runs before anything is shown.

The Accuracy Problem You Must Design Around

An engine that answers fluently but wrongly is more dangerous than one that obviously fails, because a confident wrong answer slips past a busy user and into a decision. Three failure modes matter in this engine, and the architecture has to be built around all three rather than trusting the models to avoid them.

The first is the bad chunk — the relevant moment is buried inside a chunk that is mostly about something else, or split across the seam between two chunks so neither one retrieves cleanly. This is a failure of the ingestion half, and it is invisible at query time because the retriever can only return chunks that were cut well in the first place. The defence is scene-aligned chunking rather than fixed-clock cutting, so each chunk is one coherent idea, plus a little overlap between chunks so a moment near a boundary survives in both. When an engine "can't find" something a user knows is there, the chunker is the first suspect, not the search.

The second is the retrieval miss — the right chunk exists and was cut well, but the search did not surface it, because the user's words and the footage's words did not line up. Pure semantic search can drift past an exact term; pure keyword search misses a paraphrase. This is precisely why production systems run hybrid retrieval and then rerank: the keyword half guarantees exact names and codes are caught, the semantic half catches meaning, and the reranker cleans up the merged list. An engine running only one of the two retrieval styles will mysteriously miss a class of questions, and the fix is almost always to add the other half rather than to change the model.

The third is unique to the answer: the ungrounded answer — the model writes something plausible that the retrieved chunks do not actually support, or answers confidently when retrieval found nothing relevant at all. This is hallucination wearing the costume of a cited answer. The defence is twofold. First, instruct the answer model to use only the retrieved material and to abstain — to say "I don't find that in the archive" — when retrieval comes back empty or weak, which is far more trustworthy than a confident guess. Second, enforce the second spine rule mechanically: an answer that cannot attach a citation to a claim is a flag, not a feature, and the surface should show the cited clip beside every claim so a human can confirm it in one click. The citation is not just a convenience for the user; it is the mechanism that keeps the model honest.
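The mechanical half of that defence can be sketched as a deterministic check that runs after the answer model. The assumptions are loud ones: the answer model is taken to return claims as (text, chunk id) pairs, the reranker is taken to supply a score per chunk, and the `min_score` threshold is an arbitrary illustrative value.

```python
ABSTAIN = "I don't find that in the archive."


def grounded_answer(claims, retrieved, min_score=0.5):
    """Keep only claims that cite a strongly retrieved chunk; abstain otherwise.

    `claims` is a list of (text, chunk_id) pairs from the answer model.
    `retrieved` maps chunk_id to its rerank score.
    """
    strong = {cid for cid, score in retrieved.items() if score >= min_score}
    if not strong:
        # Retrieval came back empty or weak: refuse rather than guess.
        return {"answer": ABSTAIN, "claims": [], "flagged": []}
    kept = [(text, cid) for text, cid in claims if cid in strong]
    flagged = [text for text, cid in claims if cid not in strong]
    if not kept:
        return {"answer": ABSTAIN, "claims": [], "flagged": flagged}
    answer = " ".join(text for text, _ in kept)
    return {"answer": answer, "claims": kept, "flagged": flagged}
```

Flagged claims are returned separately rather than silently dropped, so they can be logged and reviewed as evidence of where the answer model drifts.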

The engineering answer to all three is the same posture the whole course teaches: cheap, deterministic checks around expensive, probabilistic models. Chunk on meaning, retrieve two ways and rerank, and force every answer to point at its source — and accuracy stops being a property you hope the models have and becomes one the system enforces.

Figure 5. The accuracy strategy. Retrieve two ways, rerank, and force every claim to carry a citation or abstain; design around three failure modes that live in the chunker, the retriever, and the answer model respectively.

The Governance Problem — The Part That Decides Whether You Can Ship

This is the section that separates a demo from a deployable engine, and it has three layers. None of them is about model quality; all of them are about whether a legal and security team will let the engine touch the archive at all.

The first layer is permissions — answering only over what the asker may see. An archive is almost never flat: some footage is public, some is internal, some is embargoed until a release date, some is restricted to a particular team. A search engine that ignores those boundaries becomes a way to exfiltrate exactly the footage that was locked down, because retrieval is blind to access by default — it returns whatever is nearest in meaning, regardless of who is asking. The durable design carries a permission tag on every chunk at index time and filters by the asker's rights at query time, before retrieval returns anything, so a user can never even see that a restricted clip exists. Bolting permissions on after the fact — filtering the answer rather than the retrieval — leaks the existence of restricted material through "no results" patterns and is the wrong place to enforce it. Access is a retrieval-time filter, not an answer-time afterthought.
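A minimal sketch of the retrieval-time filter, assuming a simple tag model in which a user may see a chunk if they hold at least one of its permission tags. That rule, the dictionary layout, and the function names are all illustrative; in a managed vector database the same effect usually comes from a metadata filter passed with the query.

```python
def allowed(chunk_tags, user_rights):
    """A chunk is visible if the user holds at least one of its tags."""
    return bool(set(chunk_tags) & set(user_rights))


def retrieve(index, score_fn, user_rights, k=3):
    """Filter by access first, then rank; restricted chunks are never scored."""
    visible = [c for c in index if allowed(c["permissions"], user_rights)]
    return sorted(visible, key=score_fn, reverse=True)[:k]


index = [
    {"id": "c1", "permissions": {"public"}, "score": 0.4},
    {"id": "c2", "permissions": {"embargoed"}, "score": 0.9},
]
public_view = retrieve(index, lambda c: c["score"], {"public"})
editor_view = retrieve(index, lambda c: c["score"], {"public", "embargoed"})
```

The embargoed chunk is the better match, yet the public user's result list contains only `c1`, and nothing downstream ever learns that `c2` exists.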

The second layer is the privacy of the people inside the footage. An OTT archive is full of identifiable people — presenters, guests, callers, bystanders — and a search engine that can find "every clip of a particular person" is, by definition, processing biometric and personal data at scale. In the European Union this is governed by the General Data Protection Regulation (GDPR), Regulation (EU) 2016/679, which sets rules for processing personal data including the images and voices of identifiable people, and increasingly by the EU AI Act, whose restrictions on biometric identification bear directly on any feature that searches by face or voice. The practical consequences are concrete: searching by content ("clips about the budget") is ordinary; searching by person identity ("every clip of this individual") is a higher-risk feature that needs a lawful basis, access controls, and often a deliberate decision not to build face-identity search at all. The regulatory engineering is the subject of the EU AI Act lesson; the design point here is that the governance layer decides which kinds of query are even allowed, not only which footage. This is engineering context, not legal advice; confirm your obligations with counsel before you ship.

The third layer is audit and retention. Because the engine reaches into a company's whole video memory and answers questions about it, it must record what it was asked and what it returned — both to investigate misuse and to prove, later, that access rules were honoured. The same log is what lets you answer a data-subject request about whether and how someone's footage was searched. Treat the audit log as a first-class part of the product, not as debug output, because for an engine of this reach it is the difference between "we can show exactly who searched what" and a shrug during a review.
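An audit record can be as simple as one JSON line per query. The sketch below assumes chunk ids of the form `video:index`, which is a made-up convention, and shows how the same log could answer a question about whether a given video was ever surfaced. Retention periods and tamper protection are out of scope here.

```python
import json
import time


def audit_record(user_id, question, returned_chunk_ids, now=None):
    """Serialise one query event as a single JSON line."""
    event = {
        "ts": time.time() if now is None else now,
        "user": user_id,
        "question": question,
        "returned": list(returned_chunk_ids),
    }
    return json.dumps(event, sort_keys=True)


def searches_touching(log_lines, video_prefix):
    """Find events that returned any chunk of a given video."""
    hits = []
    for line in log_lines:
        event = json.loads(line)
        if any(cid.startswith(video_prefix) for cid in event["returned"]):
            hits.append(event)
    return hits
```

Writing the returned chunk ids, not just the question, is what makes the log usable for proving that access rules were honoured.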

Figure 6. The three governance gates. Filter by permission at retrieval time, treat person-identity search as a higher-risk feature under GDPR and the EU AI Act, and log every question and result — the engine's reach is exactly why these are load-bearing.

A Cost Model With The Arithmetic Shown

Pricing this engine correctly means separating two numbers that behave completely differently: the one-time cost to index a video, paid once when it enters the archive, and the per-question cost, paid on every query. Conflating them is the most common budgeting error, because the first is a fixed investment that grows with the archive and the second is a tiny variable cost that grows with usage. Walk both out loud, then compare against the naive alternative.

Start with indexing one hour of video, paid once. The frame selection and chunking are cheap compute. The real money is in the extraction models: transcribing the audio, captioning the keyframes, and embedding the result. A managed video-understanding index such as Twelve Labs prices this directly — Pegasus indexing runs about $0.042 per minute of video — so:

index one hour = 60 minutes × $0.042/minute
              ≈ $2.50 per hour of video, paid once

A self-built text-grounding pipeline lands in the same neighbourhood: ASR at roughly $0.006 per minute is about $0.36 an hour, keyframe captioning and embedding add a dollar or two, and the total sits around two to three dollars per hour. Either way, call it about $2.50 to index one hour, paid once. Storing the resulting vectors is almost free: a few hundred chunks an hour, each a few kilobytes, costs cents per month even for a large archive — Pinecone's serverless storage is around $0.33 per gigabyte per month, and an hour of video produces a few megabytes of vectors.

Now the per-question cost. The question is embedded (a fraction of a cent), the vector search reads a handful of units (a fraction of a cent — Pinecone bills reads at about $8.25 per million read units, and one query uses a tiny number of them), the rerank is small, and the answer model reads only the few retrieved chunks — a few thousand tokens — and writes a short answer. Even at a frontier model's rate the generation is a cent or two:

per question ≈ embed + search + rerank + generate
            ≈ ~$0.01–0.02 per question

Round to one to two cents per question. Now the comparison that decides the whole architecture. The naive alternative is to skip indexing and send video straight to a long-context model on every question. Google's Gemini converts video to tokens at about 263 tokens per second of footage, so one hour is about 946,800 tokens (263 × 3,600 seconds), and at the long-prompt rate that applies above 200,000 input tokens, $2.50 per million input tokens:

one hour into a long-context model = 946,800 tokens × $2.50 / 1,000,000
                                  ≈ $2.37 per question, per hour of video

That is roughly the one-time index cost of an hour, but paid again on every single question, and only for the one hour that fits. A realistic archive of a thousand hours would need about 947 million tokens to send in full (1,000 × 946,800), which no 2026 model accepts in one request, so the naive path is not merely expensive at scale; it is impossible. RAG inverts the economics: index the thousand hours once for roughly a thousand times $2.50 (about $2,500, a fixed investment), pay a few dollars a month to store the vectors, and then answer every question for a cent or two. The crossover is immediate: past a single moderate video and a handful of questions, indexing wins, and the gap widens with every hour of archive and every query. Keep the one-time index cost and the per-question cost on separate lines in any budget, because the first is capital spent on the archive and the second is the marginal cost of use, and conflating them hides where the money actually goes. The per-feature cost discipline is the subject of the cost-optimization lesson.
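The crossover can be checked directly. The sketch below uses the text's own figures and takes the upper end of the per-question estimate (two cents); the function names are illustrative, and the long-context function deliberately ignores context-window limits, so it only describes footage that actually fits in one request.

```python
import math

TOKENS_PER_SECOND = 263               # video tokenisation rate quoted in the text
LONG_CONTEXT_USD_PER_M_TOKENS = 2.50  # input rate above 200k tokens
INDEX_USD_PER_HOUR = 2.50             # one-time index cost, rounded as in the text
RAG_USD_PER_QUESTION = 0.02           # upper end of the per-question estimate


def tokens_for(hours: float) -> int:
    """Input tokens needed to send `hours` of footage to the model."""
    return round(hours * 3600 * TOKENS_PER_SECOND)


def long_context_cost(hours: float, questions: int) -> float:
    """Naive path: resend the footage on every question."""
    return questions * tokens_for(hours) * LONG_CONTEXT_USD_PER_M_TOKENS / 1_000_000


def rag_cost(hours: float, questions: int) -> float:
    """Index once, then pay the small per-question cost."""
    return hours * INDEX_USD_PER_HOUR + questions * RAG_USD_PER_QUESTION


def break_even_questions(hours: float):
    """Smallest question count at which RAG is strictly cheaper, or None if never."""
    saving_per_question = long_context_cost(hours, 1) - RAG_USD_PER_QUESTION
    if saving_per_question <= 0:
        return None  # footage so short that resending it costs less than a RAG query
    return math.floor(hours * INDEX_USD_PER_HOUR / saving_per_question) + 1


for q in (1, 2, 10, 100):
    print(f"1 hour, {q:>3} questions: "
          f"long-context ${long_context_cost(1, q):.2f} vs RAG ${rag_cost(1, q):.2f}")
print("break-even for a one-hour video:", break_even_questions(1), "questions")
```

Under these assumptions the one-hour case breaks even at the second question, which is what "the crossover is immediate" means in numbers.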

$Per-unit cost model for the engine, shown in two panels. The left panel is the two-line cost structure: a one-time index cost of about $2.50 to index one hour of video, paid once when it enters the archive, made of roughly thirty-six cents of speech recognition plus a dollar or two of keyframe captioning and embedding, plus near-free vector storage of cents per month; and a per-question cost of about one to two cents, made of a fractional-cent question embedding, a fractional-cent vector search, a small rerank, and a cent or two of answer generation that reads only the few retrieved chunks. The right panel is a comparison that decides the architecture: indexing a thousand-hour archive once costs about $2,500 as a fixed investment, then answers each question for one to two cents, whereas sending one hour of video to a long-context model costs about $2.37 per question because an hour is about 946,800 tokens at $2.50 per million, paid again on every question, and a full thousand-hour archive at about 947 million tokens does not fit in any 2026 model at all. A caption notes that past a single moderate video and a handful of questions, indexing wins, and that the one-time index cost and the per-question cost belong on separate budget lines.$

Figure 7. The two-line economics. Indexing is a one-time cost per hour of archive; answering is a cent or two per question; sending the archive to a long-context model on every question is far more expensive where it fits and impossible where it does not.

Common Mistake: Dumping The Archive And Trusting The Answer

Two failures account for almost every archive Q&A project that stalls, and they are the inverses of the two spine rules.

The first is dumping the archive into the model. A team sees that a frontier model accepts hours of video and concludes they can skip the indexing pipeline entirely — just send the footage and ask. It demos beautifully on one short clip, and then the archive grows, the per-question bill climbs to dollars, the latency stretches to minutes, and the day someone points it at the real thousand-hour library it simply does not fit. The fix is the first spine rule, and it costs almost nothing if you do it first: index once, retrieve per question. Long-context models are a genuine tool — for a single moderate video and a few questions they are the simpler answer, and the honest engineer reaches for them there — but they are a complement to RAG at archive scale, not a replacement for it. Build the index before you build anything that scales.

The second is trusting the answer. A team treats the fluent written answer as the product and the citation as a nicety to add later. The engine ships, the answers read well, and then one of them is confidently wrong, a user acts on it, and the trust never comes back — because no one could check the answer against the footage. Retrofitting citations into an engine that discarded timestamps at ingestion is a rebuild, not a patch, since the provenance the answer needs was thrown away upstream. The citation is not a finishing touch; it is the load-bearing wall, and it must be poured at the ingestion stage where the timestamps are still attached. The model writes the answer; the citation is what makes it a tool.

Both mistakes share a root: mistaking the fluent part for the trustworthy part. Generation is the fluent part and it is now easy. The valuable parts are the unglamorous ones — the index that makes the archive affordable to question, and the citation that lets a human verify what the model said. Build those first.

The Build Plan: Five Milestones, Value At Every Step

Sequence the build so a usable tool exists early and each milestone ships something a team can actually use, rather than a year of plumbing before the first answer.

Milestone one — the cited search box. Index one slice of the archive with a text-grounding pipeline (ASR plus keyframe captions), store the chunks with their source video and timecode in a vector database, and put a search box in front that returns ranked clips, each deep-linking to its moment. No Q&A yet — just search that lands you on the exact second. Even this is immediately useful, and crucially it builds the second spine rule, the citation, into the foundation rather than bolting it on later.
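As a rough illustration of what storing each chunk "with its source video and timecode" can look like, the sketch below shows one possible chunk record and the deep link built from it. The `Chunk` class, its field names, and the `archive.example` URL pattern are all hypothetical; a real player will have its own link format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    video_id: str   # which file in the archive the chunk came from
    start_s: float  # where the chunk begins, in seconds from the start
    end_s: float    # where the chunk ends
    text: str       # transcript and keyframe captions for this span


def timecode(seconds: float) -> str:
    """Format a second offset as HH:MM:SS, dropping fractions of a second."""
    total = int(seconds)
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"


def deep_link(chunk: Chunk, base_url: str = "https://archive.example/watch") -> str:
    """Build a link that opens the player at the chunk's first second."""
    return f"{base_url}/{chunk.video_id}?t={int(chunk.start_s)}"


chunk = Chunk("webinar-2024-03", 3725.4, 3741.0, "pricing changes for the enterprise tier")
print(timecode(chunk.start_s), deep_link(chunk))
```

The point of the record is that the timecode travels with the text from ingestion onward, so every later stage can cite without looking anything up.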

Milestone two — grounded Q&A with abstention. Add the answer model that reads the retrieved chunks and writes a grounded answer, instructed to use only retrieved material and to abstain when retrieval is weak, with every claim carrying its citation. This is the milestone that turns search into question-answering, and doing it with abstention from the start is what keeps it trustworthy.
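One way to wire abstention in from the start is to decide it before the answer model is ever called. The sketch below is an assumption-laden illustration, not a prescribed design: the `Hit` record, the 0.5 score threshold, the relevance scale, and the prompt wording are all hypothetical and would need tuning against a real retriever.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Hit:
    video_id: str
    start_s: int
    text: str
    score: float  # retriever relevance; a 0-to-1 scale is assumed here


MIN_SCORE = 0.5  # hypothetical threshold below which retrieval counts as weak
ABSTAIN_MESSAGE = "I could not find this in the archive."


def build_grounded_prompt(question: str, hits: list[Hit],
                          min_score: float = MIN_SCORE) -> Optional[str]:
    """Return a prompt restricted to strong hits, or None to signal abstention."""
    strong = [h for h in hits if h.score >= min_score]
    if not strong:
        return None  # caller shows ABSTAIN_MESSAGE without calling the model
    sources = "\n".join(
        f"[{i}] {h.video_id} @ {h.start_s}s: {h.text}"
        for i, h in enumerate(strong, start=1)
    )
    return (
        "Answer using only the numbered sources below. "
        "Cite a source number after every claim. "
        f"If the sources do not answer the question, reply exactly: {ABSTAIN_MESSAGE}\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )


hits = [Hit("townhall-07", 90, "The CFO confirms the budget freeze.", 0.82)]
print(build_grounded_prompt("When was the budget freeze announced?", hits))
```

Abstaining in code when nothing clears the threshold, and again in the prompt when the sources do not answer, gives two independent chances to avoid a confident invention.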

Milestone three — hybrid retrieval and reranking. Add the keyword half alongside the semantic search and a reranker over the merged list. This is the milestone that fixes the retrieval-miss failure mode and lifts accuracy from "demo-good" to "production-good", especially on exact names, codes, and rare terms.
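The text does not say how the semantic and keyword lists are merged. A common choice, shown here purely as an example, is reciprocal rank fusion, which needs only the rank positions from each list and no score calibration. The constant `k = 60` is the conventional default, and the chunk ids are made up.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of chunk ids; each list adds 1 / (k + rank) per id."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id so the order is stable.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


semantic = ["a", "b", "c"]  # order returned by the vector search
keyword = ["c", "a", "d"]   # order returned by the keyword search
merged = reciprocal_rank_fusion([semantic, keyword])
print(merged)
```

Chunks that appear in both lists rise to the top, and the merged list is what the reranker then reorders.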

Milestone four — governance: permissions, privacy, audit. Add the permission tags at index time and the access filter at query time, the policy decision on whether person-identity search is offered at all, and the audit log. This is the milestone that turns a clever tool into a deployable service, and doing it fourth — not last — is the whole point of the governance section.

Milestone five — scale, re-indexing, and evaluation. Index the full archive, add the background re-indexing job that lets you swap the embedding model without a crisis, build the retrieval-evaluation harness that measures whether the right chunks are being found, and add cost controls. This is the milestone that makes the engine cheap to run at archive scale and safe to evolve as the models change.

Order it so a usable product exists after milestone one and the safeguards arrive before the full archive is exposed, not after. A team that ships search and Q&A and then leaves permissions for "later" has built exactly the data leak the governance section warns against.

Production Concerns: Re-Indexing, Permissions At Query Time, And Evaluating Retrieval

Three operational realities decide whether the engine survives contact with a real archive.

The first is re-indexing without downtime. Embedding models retire — Marengo 2.7 went dark in March 2026 — and chunking strategies improve, so you will re-index, and the archive cannot go dark while you do. The durable pattern keeps the cheap extracted text (transcripts, captions, OCR) stored permanently beside the vectors, so a re-index re-embeds from text you already have rather than re-running the expensive extraction; and it builds the new index alongside the old one and switches over atomically. Keeping the source text is the cheap insurance that turns the one truly costly swap into a background job.
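The pattern can be sketched in memory: re-embed from the stored text into a new named index, then move a single "live" pointer. The `VersionedIndex` class and the toy embedders are hypothetical stand-ins; many vector databases offer an alias or namespace feature that plays the role of the pointer, but the exact mechanism depends on the product.

```python
from typing import Callable, Optional

Embedder = Callable[[str], list[float]]


class VersionedIndex:
    """Keeps several named indexes and one pointer to the live one."""

    def __init__(self) -> None:
        self._indexes: dict[str, dict[str, list[float]]] = {}
        self.live: Optional[str] = None

    def build(self, name: str, stored_text: dict[str, str], embed: Embedder) -> None:
        """Embed the stored text into a new index; the live index is untouched."""
        self._indexes[name] = {cid: embed(text) for cid, text in stored_text.items()}

    def switch_to(self, name: str) -> None:
        """Point queries at `name`; a single assignment, so there is no mixed state."""
        if name not in self._indexes:
            raise KeyError(f"index {name!r} has not been built")
        self.live = name

    def vectors(self) -> dict[str, list[float]]:
        if self.live is None:
            raise RuntimeError("no live index")
        return self._indexes[self.live]


# Extracted text kept permanently beside the vectors (toy data).
stored_text = {"c1": "budget freeze announced", "c2": "new pricing tiers"}


def old_model(text: str) -> list[float]:  # toy stand-in for the retiring embedder
    return [float(len(text))]


def new_model(text: str) -> list[float]:  # toy stand-in for its replacement
    return [float(len(text)), float(text.count(" "))]


index = VersionedIndex()
index.build("v1", stored_text, old_model)
index.switch_to("v1")
index.build("v2", stored_text, new_model)  # runs in the background; v1 keeps serving
index.switch_to("v2")
print(index.live, index.vectors()["c1"])
```

Nothing in the rebuild touches the video files or the extraction models, which is the saving the stored text buys.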

The second is enforcing permissions at query time, not ingest time alone. Access rules change after a video is indexed — an embargo lifts, a contractor's access is revoked, a clip is reclassified — so the permission decision cannot be frozen into the index. The index carries the tags; the filter is applied live, against the asker's current rights, on every query. An engine that resolved permissions only at index time will happily serve footage to someone whose access was revoked yesterday.
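A minimal sketch of the live filter follows, assuming each indexed chunk carries a set of group tags and that the asker's groups can be looked up at query time. The `TaggedHit` record, the group names, and the in-memory `directory` are hypothetical; in production the lookup would go to the identity system, and the filter would usually be pushed into the vector query as a metadata filter rather than applied afterwards.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class TaggedHit:
    chunk_id: str
    allowed_groups: frozenset  # permission tags written at index time


def filter_by_current_rights(hits: list[TaggedHit], user_id: str,
                             current_groups: Callable[[str], set]) -> list[TaggedHit]:
    """Keep only hits the asker may see right now."""
    groups = current_groups(user_id)  # resolved on every query, never cached in the index
    return [h for h in hits if h.allowed_groups & groups]


# Toy identity directory standing in for the real access-control system.
directory = {"dana": {"newsroom", "contractors"}}


def lookup(user_id: str) -> set:
    return directory.get(user_id, set())


hits = [
    TaggedHit("c1", frozenset({"newsroom"})),
    TaggedHit("c2", frozenset({"legal"})),
]

before = filter_by_current_rights(hits, "dana", lookup)
directory["dana"].discard("newsroom")  # access revoked after indexing
after = filter_by_current_rights(hits, "dana", lookup)
print([h.chunk_id for h in before], [h.chunk_id for h in after])
```

The index is identical before and after the revocation; only the live lookup changed, which is exactly why the decision cannot be frozen at ingest.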

The third is evaluating retrieval, not just generation. The tempting thing to measure is whether the written answers read well, but the answer can only be as good as the chunks the retriever found, and a fluent answer over the wrong chunks is the most dangerous output of all. Build a small evaluation set of real questions with their known correct moments, and measure whether the retriever actually surfaces those moments — the discipline the course covers for video AI features generally, applied here to the retrieval step specifically. When accuracy drops, this harness tells you whether the chunker, the retriever, or the answer model is at fault, instead of leaving you guessing.
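A retrieval harness can be very small. The sketch below measures recall at k, one common metric for this job: the share of questions for which at least one of the top k retrieved chunks covers the known correct moment. The tuple layouts, the five-second tolerance, and the fake retriever are assumptions made for illustration.

```python
from typing import Callable

Span = tuple[str, float, float]  # (video_id, start_s, end_s) of a retrieved chunk
Moment = tuple[str, float]       # (video_id, second) of the known correct moment


def is_hit(retrieved: Span, gold: Moment, tolerance_s: float = 5.0) -> bool:
    """True when the chunk is from the right video and covers the gold second."""
    video_id, start_s, end_s = retrieved
    gold_video, gold_s = gold
    return video_id == gold_video and start_s - tolerance_s <= gold_s <= end_s + tolerance_s


def recall_at_k(eval_set: list[tuple[str, Moment]],
                retrieve: Callable[[str], list[Span]], k: int = 5) -> float:
    """Fraction of questions whose gold moment appears in the top k results."""
    if not eval_set:
        raise ValueError("evaluation set is empty")
    found = sum(
        any(is_hit(span, gold) for span in retrieve(question)[:k])
        for question, gold in eval_set
    )
    return found / len(eval_set)


# Toy retriever standing in for the real search call.
fake_results = {
    "when was the freeze announced?": [("townhall-07", 80.0, 100.0)],
    "who presented the roadmap?": [("allhands-02", 10.0, 30.0)],
}


def retrieve(question: str) -> list[Span]:
    return fake_results.get(question, [])


eval_set = [
    ("when was the freeze announced?", ("townhall-07", 92.0)),
    ("who presented the roadmap?", ("allhands-02", 600.0)),
]
score = recall_at_k(eval_set, retrieve)
print(f"recall@5 = {score:.2f}")
```

Because the metric never looks at the written answer, a drop in this number points at the chunker or the retriever, while a steady number alongside worse answers points at the answer model.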

Where Fora Soft Fits In

Fora Soft has built video software since 2005, and OTT, streaming, and the AI software around them are among the verticals we ship, alongside video conferencing, e-learning, telemedicine, and surveillance. The engine described here (a scene-aware extraction pipeline feeding a vector index, a hybrid retriever with reranking, a grounded answer model that cites every claim, and a governance layer that filters by permission and respects the privacy of the people in the footage) is the shape of the multimodal-search work we scope for media archives. The build order and the build-versus-buy verdicts in this article are not theory for us; they are the checklist we apply, because they are the difference between an engine that lands a user on the exact second they asked for and one that confidently invents an answer or leaks a restricted clip. The governance map is part of that checklist too: we put permissions and the citation in the foundation, treat person-identity search as a deliberate policy decision rather than a default feature, and keep the extracted text beside the vectors so a model retirement is a scheduled re-index rather than an emergency. In this kind of product, an accurate retriever and an honest, cited answer are the core rather than a decoration.

References

ISO/IEC 14496-12:2022 — Information technology — Coding of audio-visual objects — Part 12: ISO Base Media File Format (ISO BMFF). The container standard the archive's mezzanine and source files are stored in, and the basis for the byte ranges a timecode citation resolves to. Cited for the archive-storage and clip-extraction layer. https://www.iso.org/standard/83102.html
ISO/IEC 23000-19:2024 — Information technology — Multimedia application format (MPEG-A) — Part 19: Common Media Application Format (CMAF). The container a retrieved clip is packaged into when the engine returns a playable deep-linked moment over OTT. Cited for the clip-delivery handoff. https://www.iso.org/standard/85623.html
W3C — WebVTT: The Web Video Text Tracks Format (W3C Candidate Recommendation). The standard timed-text format that carries the word-level timestamps an ASR step emits and that the engine's citations resolve against. Cited for the timestamp/citation layer. https://www.w3.org/TR/webvtt1/
Regulation (EU) 2016/679 (General Data Protection Regulation, GDPR) — processing of personal data, including the images and voices of identifiable people in an archive. The legal basis for the permission and privacy gates; person-identity search processes biometric/personal data and needs a lawful basis. https://eur-lex.europa.eu/eli/reg/2016/679/oj
Regulation (EU) 2024/1689 (EU AI Act) — restrictions on biometric identification and categorisation. Bears directly on any feature that searches an archive by a person's face or voice; informs the decision whether to offer person-identity search at all. https://artificialintelligenceact.eu/
Twelve Labs — Marengo 2.7 sunset notice and Marengo & Pegasus model overview / Pegasus 1.2 pricing (developer docs, 2026). Marengo 2.7 retired 30 March 2026 — indexing, search, and embedding retrieval all withdrawn; Pegasus indexing priced at ~$0.042/min. The concrete case for abstracting the embedding model and keeping the extracted text, and the index-cost anchor. https://docs.twelvelabs.io/docs/get-started/release-notes
Google — Gemini Embedding 2: our first natively multimodal embedding model (Google blog, 10 March 2026). Maps text, images, video, audio, and PDFs into a single 3,072-dimensional space; the managed multimodal embedding option for a mixed-media archive. https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-embedding-2/
Google — Gemini Developer API pricing and long-context / video tokenisation (Google AI for Developers, 2026). Gemini 2.5 Pro at $1.25/M input up to 200k tokens, $2.50/M above; video tokenised at ~263 tokens/second of footage. The basis for the long-context-baseline arithmetic the RAG cost model is compared against. https://ai.google.dev/gemini-api/docs/pricing
NVIDIA — Advance Video Analytics AI Agents Using the NVIDIA AI Blueprint for Video Search and Summarization (VSS) and An Easy Introduction to Multimodal RAG for Video and Audio (NVIDIA Technical Blog, 2026). Reference architecture for ingesting archival video for search, summarisation, and interactive Q&A; reports RAG adds ~10% latency to chat Q&A and ~1% to summarisation. Tier-4 deployer reference for the pipeline shape. https://developer.nvidia.com/blog/advance-video-analytics-ai-agents-using-the-nvidia-ai-blueprint-for-video-search-and-summarization/
Pinecone — Serverless pricing (2026). Storage ~$0.33/GB/month; reads ~$8.25 per 1M read units; writes ~$2.00 per 1M write units; ~$70/month at 10M vectors. The vector-storage and per-query-read cost anchors. https://www.pinecone.io/pricing/
Netflix Technology Blog — "Synchronizing the Senses: Powering Multimodal Intelligence for Video Search" (April 2026). A production deployer's account of multimodal embeddings and hybrid retrieval for video search at scale; corroborates the hybrid (semantic + lexical) retrieval design. https://netflixtechblog.com/powering-multimodal-intelligence-for-video-search-3e0020cf1202
Amazon Web Services — Power video semantic search with Amazon Nova Multimodal Embeddings and Video semantic search with AI on AWS (AWS blogs, 2026). A managed multimodal-embedding path and a worked hybrid-search architecture that splits vector storage from keyword (OpenSearch) storage. Tier-4 deployer reference for the retrieval layer. https://aws.amazon.com/blogs/machine-learning/power-video-semantic-search-with-amazon-nova-multimodal-embeddings/