Published 2026-06-01 · 21 min read · By Nikolay Sapunov, CEO at Fora Soft
Why This Matters
You have a growing pile of recorded video (past webinars, surveillance footage, recorded consultations, course lectures, support calls), and someone has asked the obvious question: "Can we just ask it things?" Find the moment a defect first appeared. Pull every clip where a specific procedure was demonstrated. Summarise what customers complained about across forty support calls. The technology that answers those questions is video RAG, and it is the bridge between the vision-language models you have already read about and a product feature people will actually pay for. This lesson is written for the product manager, founder, or operations lead who has to scope such a feature and decide what it costs, where it will be flaky, and whether a simpler approach would do. It builds on the open-frontier VLM lesson and the fine-tuning lesson; if "vision-language model", "weights", and "embedding" are brand-new words, read the video-VLM lesson first.
What RAG Is, In One Plain Picture
Start with the everyday version of the idea, because the AI version is the same trick. When a librarian answers a hard question, they do not read every book in the building. They walk to the right shelf, pull the two or three books that cover the topic, open them to the right pages, and answer from those pages. Retrieval is the walk to the shelf; generation is the answer written from what was pulled.
Retrieval-augmented generation, almost always shortened to RAG, is exactly that pattern applied to an AI model. Instead of asking the model to answer from memory — which is how it invents wrong facts, a failure called hallucination — you first retrieve the relevant source material from your own collection, then hand it to the model alongside the question, and ask it to answer using only what you gave it. The model stops guessing and starts reading. This is why RAG is the standard way to make a model answer reliably about your data: your documents, your recordings, your archive.
Text RAG — retrieval over documents — has been a solved, ordinary thing since 2023. Video RAG is the harder, newer cousin: the source material is hours of moving pictures and sound, not pages of text. The 2025 research that named the field puts it plainly: most RAG systems focus on text, some handle images, and video — the richest source of all — was largely left out until recently. Closing that gap is what this lesson is about.
[Image: Diagram contrasting a model answering from memory versus retrieval-augmented generation. Top row labelled 'Without RAG': a question goes straight into a vision-language model box, which outputs an answer marked with a warning that reads 'may hallucinate — answers from memory, not your archive'. Bottom row labelled 'With video RAG': the same question first goes to a retrieval step that searches a video archive and pulls three relevant clips, those clips plus the question feed the vision-language model, and the model outputs an answer marked 'grounded in your archive, with timestamps and source clips'. A caption underneath reads: retrieval is the walk to the right shelf; generation is the answer written from what was pulled.] Figure 1. Without retrieval, the model answers from memory and can invent facts. Video RAG fetches the relevant clips first, so the answer is grounded in your archive — and can cite the exact moments it used.
Why Not Just Send The Whole Archive To A Long-Context Model?
This is the first question any sharp product owner asks, and it deserves a real answer rather than a reflexive "you need RAG". Modern frontier models accept enormous inputs. Google's Gemini family accepts up to two million tokens (a token is the small chunk of text or media the model reads as one unit), which is enough to hold roughly nineteen hours of audio or about two hours of video in a single request. In research testing with an even larger experimental window, Google reported that one of these models could find a single secret word hidden in a random frame of a ten-and-a-half-hour video. So if the model can read hours of video at once, why chop anything up?
Three reasons, and you only need one of them to be true for RAG to win.
The first is size. A two-million-token window sounds vast until you measure your archive against it. Gemini converts video to tokens at a fixed rate of about 263 tokens for every second of footage at default quality. Walk the arithmetic out loud:
tokens per hour = 263 tokens/second × 3,600 seconds/hour
= 946,800 tokens/hour
So a single hour of video already eats almost a million tokens — roughly half the entire two-million-token window. A modest hundred-hour archive would need about ninety-five million tokens to send in full. There is no model on the planet in 2026 that accepts that in one request. The archive does not fit, full stop, and chopping it into searchable pieces is the only way in.
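The same arithmetic as a small Python sketch, handy for checking your own archive against a context window. The 263 tokens-per-second rate and the two-million-token window are the figures quoted above; the function names are illustrative, and the result is only as good as those two inputs.

```python
TOKENS_PER_SECOND = 263      # video tokenization rate quoted above (default quality)
CONTEXT_WINDOW = 2_000_000   # the two-million-token window discussed above


def video_tokens(hours: float) -> int:
    """Tokens needed to send `hours` of video in full."""
    return round(hours * 3600 * TOKENS_PER_SECOND)


def fits_in_window(hours: float) -> bool:
    """True if the whole video fits in a single request."""
    return video_tokens(hours) <= CONTEXT_WINDOW


max_hours = CONTEXT_WINDOW / (3600 * TOKENS_PER_SECOND)

print(f"1 hour:    {video_tokens(1):,} tokens")
print(f"100 hours: {video_tokens(100):,} tokens, fits: {fits_in_window(100)}")
print(f"window holds about {max_hours:.1f} hours of video")
```

Swap in your own archive size to see how far past the window it lands.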
The second reason is cost, and it bites even when the content does fit. Gemini 2.5 Pro's published input rate is $1.25 per million tokens for prompts of up to 200,000 tokens and $2.50 per million tokens for larger prompts. An hour of video is far past that threshold, so the higher rate applies, and sending one hour of video costs:
cost of one hour = ~946,800 tokens × $2.50 / 1,000,000 tokens
≈ $2.37 per hour, per question
If you send the whole hour every time someone asks a question, you pay that $2.37 again on every single query. Ask a hundred questions about that one hour and you have spent $237 re-reading the same footage a hundred times. RAG flips this: you pay a one-time cost to process and index the archive, then each question reads only a few small retrieved chunks — typically a few thousand tokens, fractions of a cent.
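A rough sketch of that comparison in Python. It assumes the higher rate applies to the whole prompt once it passes 200,000 tokens, which is what the $2.37 figure implies. The 6,000-token retrieval prompt is an illustrative assumption standing in for "a few thousand tokens".

```python
TOKENS_PER_HOUR = 263 * 3600   # 946,800 tokens for one hour of video
PRICE_SMALL_PROMPT = 1.25      # USD per 1M input tokens, prompts up to 200k tokens
PRICE_LARGE_PROMPT = 2.50      # USD per 1M input tokens, larger prompts
THRESHOLD = 200_000


def input_cost(tokens: int) -> float:
    """Input cost in USD for one request of `tokens` tokens."""
    rate = PRICE_SMALL_PROMPT if tokens <= THRESHOLD else PRICE_LARGE_PROMPT
    return tokens * rate / 1_000_000


def long_context_total(questions: int, hours: float = 1.0) -> float:
    """Re-send the whole video on every question."""
    return questions * input_cost(round(hours * TOKENS_PER_HOUR))


def rag_query_total(questions: int, tokens_per_query: int = 6_000) -> float:
    """Send only the retrieved chunks on every question (assumed size)."""
    return questions * input_cost(tokens_per_query)


for n in (1, 10, 100):
    print(f"{n:>3} questions: long context ${long_context_total(n):.2f}, "
          f"RAG queries ${rag_query_total(n):.4f}")
```

The RAG column excludes the one-time indexing cost, which is paid once and discussed later in the lesson.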
The third reason is accuracy. Even when a long context fits and you are willing to pay, models do not attend evenly across a giant input. Practitioners commonly report that recall degrades on very long inputs (roughly 400,000 tokens is a figure often quoted, though it varies by model and task), and accuracy drops further when the answer requires stitching together facts scattered far apart in the context. Retrieving the three relevant minutes gives the model a clean, short, focused input, and a focused model is a more accurate model.
The honest exception, which we return to at the end: if your "archive" is one moderate video and a user asks a handful of questions about it, skip RAG and send the whole thing to a long-context model. RAG earns its complexity at scale, not on a single clip.
The Core Problem — Video Is Not Searchable
Here is the obstacle that makes video RAG its own discipline. Text is already made of the same stuff a search runs on: words. Video is made of pixels and sound waves, which carry no searchable meaning until something extracts it. You cannot type "the moment the forklift entered the loading bay" and have raw pixels match it. Something has to first turn the video into a representation a computer can compare against a question.
That representation is an embedding — and the word is worth defining carefully because the whole pipeline rests on it. An embedding is a list of numbers, usually a few hundred to a couple of thousand of them, that captures the meaning of a piece of content as a single point in space. Think of it as a set of map coordinates for meaning: two clips about forklifts land near each other on the map; a clip about a sunset lands far away. The trick that makes search possible is that a question can be turned into coordinates on the same map. "Find the forklift moment" becomes a point, and the system simply returns the clips whose points sit nearest to it.
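To make the "coordinates for meaning" idea concrete, here is a toy sketch of nearest-neighbour search. The three-number vectors are invented for illustration; real embeddings come from an embedding model and have hundreds or thousands of dimensions. Cosine similarity is one common way to measure closeness, not the only one.

```python
import math


def cosine_similarity(a, b):
    """1.0 means the vectors point the same way; near 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# Hypothetical clip embeddings (made-up numbers, not model output).
clips = {
    "forklift_enters_bay": [0.9, 0.1, 0.0],
    "forklift_unloads_pallet": [0.8, 0.3, 0.1],
    "sunset_over_harbour": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of the question "find the forklift moment".
question = [0.85, 0.2, 0.05]

ranked = sorted(
    clips,
    key=lambda name: cosine_similarity(question, clips[name]),
    reverse=True,
)
for name in ranked:
    print(f"{name}: {cosine_similarity(question, clips[name]):.3f}")
```

The two forklift clips score close to 1.0 and the sunset clip scores far lower, which is the whole trick: the search is a distance comparison, not a keyword match.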
A multimodal embedding model is one trained so that pictures, sound, and words all land on the same map. This is the magic ingredient: it lets a text question find a silent video clip, because both were placed in the same coordinate space. The 2021 model called CLIP — Contrastive Language–Image Pre-training — was the first widely used version of this idea, and its descendants (SigLIP, X-CLIP, InternVideo, and purpose-built video models like Twelve Labs' Marengo) are what power video search in 2026. We covered the family in the CLIP and VLM lessons; here we only need the one-line version: an embedding model is the thing that turns un-searchable video into searchable coordinates.
The Two Honest Designs — Visual Embeddings vs Text Grounding
There are two respectable ways to make video searchable, and the choice between them shapes everything downstream — cost, accuracy, and what kinds of questions your product can answer. NVIDIA's engineering write-up on the subject names three theoretical approaches; in practice two of them matter, and they sit at opposite ends of a trade-off.
The first design is direct visual embedding. You run the video itself — frames, and sometimes motion across frames — through a multimodal embedding model, and store those visual coordinates directly. The strength is that purely visual meaning survives: a clip of someone slipping on a wet floor is findable even though nobody said the word "slip". The cost is that visual embedding models are heavier to run and the embeddings are larger to store. This design wins for surveillance, sports, manufacturing, medical procedures — anywhere the important information is in what you see, not what is said. It also pairs naturally with the detection and anomaly-detection work covered earlier in the course, where the events you want to find are visual by nature.
The second design is text grounding, and it is the workhorse of most production systems. You convert everything in the video into text first — the spoken words via speech recognition, the on-screen text via optical character recognition, and a written description of each scene via a vision-language model — and then you embed and search that text with an ordinary, cheap, mature text pipeline. The strength is simplicity and low cost: text RAG is a solved problem, and you inherit all of it. The weakness is that the conversion is lossy — a written caption of a scene throws away detail the pixels held, so a question about a subtle visual the captioner did not mention will miss.
Most strong 2026 systems are hybrid: text grounding for the bulk of retrieval because it is cheap and good, plus visual embeddings layered in where pure-visual questions matter. The decision is not religious. It is a budget-and-question-type call, and the diagram below is the one to put in front of your engineer.
[Image: Decision diagram for choosing the retrieval design in a video RAG system. Top diamond reads 'Is the important information mostly in what is SAID or ON-SCREEN TEXT?'. The Yes branch points to a green box 'Text grounding: ASR + OCR + scene captions, then ordinary text RAG. Cheapest, most mature.' The No branch goes to a second diamond 'Is the important information mostly VISUAL — actions, objects, scenes nobody narrates?'. Its Yes branch points to a purple box 'Direct visual embeddings: embed frames/clips with a multimodal model. Heavier, but visual meaning survives.' A third path from both leads to a blue box at the bottom labelled 'Hybrid (most production systems): text grounding for the bulk, visual embeddings layered in for pure-visual queries.' A side note reads: this is a budget and question-type call, not a religious one.] Figure 2. The two honest designs and the question that picks between them. Text grounding is cheap and mature; visual embedding preserves what nobody narrated; most real systems blend the two.
The Pipeline, Stage By Stage
A video RAG system has two halves that run at different times. The ingestion half runs once per video, when it enters the archive — it is the slow, expensive, do-it-once work. The query half runs every time a user asks a question — it must be fast and cheap. Keeping these two halves straight in your head is the single most useful mental model for budgeting the feature, because the costly work happens once and the per-question work is tiny. We walk the stages in order.
Stage 1 — Chunking: cut the video into searchable pieces
You cannot search "a video"; you search pieces of it. Cutting the video into pieces is called chunking, and it is the stage that quietly determines whether your whole system works. A chunk is the unit that gets retrieved and shown to the user, so its size is a real product decision: too large, and a retrieved chunk buries the relevant two seconds inside two irrelevant minutes, wasting tokens and confusing the model; too small, and a chunk loses the context that made it meaningful, so retrieval fragments.
The naive approach is fixed-length chunking: cut every thirty seconds, regardless of content. It is trivial to build and it is what most first prototypes use. Its flaw, documented in 2025 research, is that it slices straight through the middle of meaningful moments — a sentence gets split across two chunks, an action is severed from its result. The better approach is scene-based chunking: cut where the video naturally changes — a new shot, a new speaker, a new topic. Classical computer vision can find shot boundaries by watching for sudden changes in the picture; newer systems use a language model reading the transcript to cut on narrative boundaries, so each chunk is one coherent idea. Scene-aware cutting costs more to compute but produces chunks that retrieve far more cleanly. The rule of thumb: align chunks to meaning, not to the clock.
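The difference between the two approaches can be sketched in a few lines of Python. This is a simplified illustration, not a production shot detector: `change_scores` is a hypothetical per-second measure of how much the picture changed, and the threshold and minimum chunk length are assumptions you would tune.

```python
def fixed_chunks(duration_s, length_s=30):
    """Cut every `length_s` seconds, regardless of content."""
    return [
        (start, min(start + length_s, duration_s))
        for start in range(0, duration_s, length_s)
    ]


def scene_chunks(change_scores, threshold=0.5, min_len_s=5):
    """Cut where the picture changes sharply.

    change_scores[t] is how different second t looks from second t-1 (0..1).
    Very short chunks are avoided by requiring `min_len_s` since the last cut.
    """
    chunks, start = [], 0
    for t, score in enumerate(change_scores):
        if t > 0 and score >= threshold and t - start >= min_len_s:
            chunks.append((start, t))
            start = t
    chunks.append((start, len(change_scores)))
    return chunks


# Toy 70-second video with real scene changes at 22 s and 48 s,
# plus a brief flicker at 50 s that should not create its own chunk.
scores = [0.0] * 70
scores[22], scores[48], scores[50] = 0.9, 0.8, 0.7

print(fixed_chunks(70))      # [(0, 30), (30, 60), (60, 70)]
print(scene_chunks(scores))  # [(0, 22), (22, 48), (48, 70)]
```

The fixed cuts at 30 s and 60 s land in the middle of scenes; the scene-based cuts follow the content.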
Stage 2 — Extraction: pull the meaning out of each chunk
Now each chunk gets turned into something searchable, using whichever design you chose. In a text-grounding system this stage does real work, and it is worth seeing the parts because each is a model you covered earlier in this course.
Speech becomes text through automatic speech recognition, the technology behind every captioning feature — production systems lean on Whisper and its faster variants, covered in the WhisperX lesson. Crucially, ASR gives you timestamps: it records not just what was said but when, which is how the final answer can jump the user to the exact second. On-screen text — slide titles, signs, captions baked into the picture — becomes searchable through optical character recognition. And the purely visual content of each scene becomes a written description through a vision-language model prompted to describe what it sees. These three text streams — said, written-on-screen, and seen — are blended into one coherent description per chunk.
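A minimal sketch of that blending step, assuming the ASR, OCR, and captioning models have already run and each returned timestamped text. The `Segment` structure, the labels, and the sample text are hypothetical; the point is that every line keeps its timestamp.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start_s: float
    end_s: float
    source: str  # "asr", "ocr" or "caption"
    text: str


LABELS = {"asr": "Said", "ocr": "On screen", "caption": "Seen"}


def describe_chunk(chunk_start, chunk_end, segments):
    """Blend every segment that overlaps the chunk into one searchable text."""
    overlapping = [
        s for s in segments
        if s.start_s < chunk_end and s.end_s > chunk_start
    ]
    overlapping.sort(key=lambda s: s.start_s)
    return "\n".join(
        f"[{LABELS[s.source]} @ {s.start_s:.0f}s] {s.text}" for s in overlapping
    )


segments = [
    Segment(12.0, 15.5, "asr", "Now tighten the valve clockwise."),
    Segment(10.0, 20.0, "ocr", "Step 3: Valve assembly"),
    Segment(10.0, 20.0, "caption", "A technician turns a red valve with a wrench."),
    Segment(40.0, 44.0, "asr", "Next, check the gauge."),
]

description = describe_chunk(10, 20, segments)
print(description)
```

The resulting text is what gets embedded and indexed for the chunk that runs from 10 s to 20 s; the segment at 40 s belongs to a later chunk and is left out.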
In a visual-embedding system this stage is simpler to describe but heavier to run: each chunk's frames go straight through the multimodal embedding model, producing visual coordinates without an intermediate text step.
Stage 3 — Frame selection: do not process every frame
Whichever design you chose, you hit the same brutal number: video has far too many frames to process them all. A single minute of footage at sixty frames per second contains 3,600 frames. Multiply across an archive and the cost of touching every frame is absurd. So every serious pipeline thins the frames first, and the savings are dramatic enough to walk through with NVIDIA's published example.
Start by downsampling — keeping only a fraction of the frames. Dropping from 60 frames per second to 4 frames per second already cuts that one minute from 3,600 frames to 240, because successive frames are nearly identical and carry almost no new information:
frames after downsampling = 4 frames/second × 60 seconds = 240 frames (down from 3,600)
Then select keyframes — the handful of frames that actually represent something new, found by measuring how different each frame is from the last and keeping only the ones that change meaningfully. In NVIDIA's worked example that step takes the 240 frames down to about 40 — a final reduction of roughly ninety-fold from the original 3,600. Forty frames a minute is cheap to caption or embed; 3,600 is not. This frame-selection idea is exactly the one the VideoRAG research highlights as essential: long videos hold more frames than a model can read, and not all frames are equally important.
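The two thinning steps can be sketched as follows. This is a toy illustration: frames are tiny lists of pixel values, and mean absolute difference stands in for the SSIM-based comparison used in NVIDIA's write-up. The threshold and the synthetic footage are assumptions chosen so that the numbers match the worked example.

```python
def downsample(frames, source_fps=60, target_fps=4):
    """Keep every Nth frame so the clip plays at roughly target_fps."""
    step = source_fps // target_fps  # 60 // 4 = keep every 15th frame
    return frames[::step]


def mean_abs_diff(a, b):
    """Average per-pixel difference between two equally sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def select_keyframes(frames, threshold=3.0):
    """Keep a frame only if it differs enough from the last frame kept."""
    if not frames:
        return []
    kept = [frames[0]]
    for frame in frames[1:]:
        if mean_abs_diff(frame, kept[-1]) >= threshold:
            kept.append(frame)
    return kept


# Toy minute of 60 fps footage: each "frame" is 4 pixel values, and the
# picture changes every 1.5 seconds (90 frames), giving 40 distinct scenes.
frames = [[(i // 90) * 6] * 4 for i in range(60 * 60)]

sampled = downsample(frames)
keyframes = select_keyframes(sampled)
print(len(frames), len(sampled), len(keyframes))  # 3600 240 40
```

Real footage will not land on exactly 40 keyframes; the count depends on how busy the video is and where the threshold is set.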
Stage 4 — Indexing: store the coordinates for fast search
Every chunk now has an embedding — a point on the meaning map. You store these in a vector database, a database built for one job: given a query point, find the nearest stored points fast, even across millions of them. Pinecone, Weaviate, Milvus, and Qdrant are the names you will hear; the open-frontier lesson and your engineer will have a favourite. Alongside each embedding you store metadata — the source video, the timestamp, the chunk's text — because that metadata is what lets the final answer say "see minute 14 of the March 3rd recording" instead of a vague paraphrase. The metadata is not an afterthought; it is what makes the feature trustworthy.
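As an illustration of what "embedding plus metadata" means in practice, here is a sketch of a single stored record. The field names and values are hypothetical, and a plain dictionary stands in for the vector database; each real product has its own schema and client library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkRecord:
    chunk_id: str
    embedding: tuple[float, ...]  # the point on the meaning map
    video_title: str              # metadata: which recording
    recorded_on: str              # metadata: ISO date of the recording
    start_s: int                  # metadata: where the chunk starts
    end_s: int                    # metadata: where the chunk ends
    text: str                     # metadata: the chunk's searchable text

    def citation(self) -> str:
        """Human-checkable pointer back to the source moment."""
        minutes, seconds = divmod(self.start_s, 60)
        return f"{self.video_title} ({self.recorded_on}) at {minutes:02d}:{seconds:02d}"


# Stand-in for a vector database: records keyed by chunk id.
index: dict[str, ChunkRecord] = {}

record = ChunkRecord(
    chunk_id="walkthrough-0042",
    embedding=(0.12, -0.48, 0.91),  # made-up numbers, not model output
    video_title="Plant walkthrough",
    recorded_on="2026-03-03",
    start_s=840,
    end_s=872,
    text="The forklift enters the loading bay.",
)
index[record.chunk_id] = record

print(index["walkthrough-0042"].citation())
# Plant walkthrough (2026-03-03) at 14:00
```

Without the title, date, and start time, the system could still retrieve the chunk but could not tell the user where to look.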
This is the end of the once-per-video ingestion half. Everything so far happened when the video entered the archive. From here on, we are in the fast per-question half.
[Image: Horizontal pipeline diagram of a video RAG system, split into two labelled halves. The left half, labelled 'Ingestion — runs once per video', shows five boxes connected by arrows: 'Raw video' then 'Chunk into scenes' then 'Frame selection: 60fps to 4fps to keyframes (~90x fewer)' then 'Extract meaning: ASR + OCR + scene captions, or visual embeddings' then 'Index in vector DB with timestamps and metadata'. The right half, labelled 'Query — runs every question, fast and cheap', shows: 'User question' then 'Embed the question to the same meaning map' then 'Vector search: fetch top-k nearest chunks' then 'Rerank to the best few' then 'Vision-language model writes the answer from retrieved chunks' then 'Answer with cited timestamps and source clips'. A divider between the halves notes 'costly work happens once; per-question work is tiny'.] Figure 3. The full pipeline. The left half runs once per video and carries the cost; the right half runs on every question and stays cheap because it touches only a few retrieved chunks.
Stage 5 — Retrieval: find the few chunks that matter
A user asks a question. The system turns that question into a point on the same meaning map using the same embedding model, then asks the vector database for the top-k nearest chunks — the k closest points, where k is a small number you choose, often between three and ten. This is the moment all the ingestion work pays off: out of an archive of thousands of chunks, the system pulls back only the handful that actually relate to the question, in milliseconds.
A second, cheap step usually follows: reranking. The fast vector search is good but not perfect, so a more careful (and slightly slower) model re-scores those few candidates and keeps only the best two or three. Reranking on a short list is cheap because the list is already short — you are polishing a handful of candidates, not searching the archive again.
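The retrieve-then-rerank pattern, sketched with toy data. Two things here are stand-ins: the two-number embeddings are invented, and the reranker simply counts shared words, where a production system would typically use a dedicated reranking model. The shape of the flow is the point: a fast, wide pass followed by a careful, narrow one.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def top_k(query_vec, index, k):
    """Fast first pass: the k chunks whose embeddings sit nearest the query."""
    ranked = sorted(index, key=lambda c: cosine(query_vec, c["embedding"]), reverse=True)
    return ranked[:k]


def rerank(question, candidates, keep):
    """Careful second pass on the short list (word overlap as a stand-in)."""
    q_words = set(question.lower().split())

    def overlap(chunk):
        return len(q_words & set(chunk["text"].lower().split()))

    return sorted(candidates, key=overlap, reverse=True)[:keep]


index = [
    {"id": "c1", "embedding": [0.9, 0.1], "text": "forklift parked outside the warehouse"},
    {"id": "c2", "embedding": [0.8, 0.2], "text": "the forklift enters the loading bay"},
    {"id": "c3", "embedding": [0.7, 0.3], "text": "workers stack pallets near the bay"},
    {"id": "c4", "embedding": [0.1, 0.9], "text": "sunset over the car park"},
    {"id": "c5", "embedding": [0.0, 1.0], "text": "night-time exterior, no activity"},
]

question = "when does the forklift enter the loading bay"
query_vec = [1.0, 0.0]  # hypothetical embedding of the question

candidates = top_k(query_vec, index, k=3)
best = rerank(question, candidates, keep=2)
print([c["id"] for c in candidates])  # ['c1', 'c2', 'c3']
print([c["id"] for c in best])        # ['c2', 'c1']
```

The vector search ranks the parked forklift first; the reranker promotes the chunk that actually describes the forklift entering the bay.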
Stage 6 — Generation: write the grounded answer
Finally the retrieved chunks — the relevant transcript passages, scene descriptions, or actual frames — are handed to a vision-language model together with the original question, and the model writes the answer using only what it was given. Because the metadata travelled with each chunk, the answer can cite its sources: "the procedure is demonstrated at 12:40 in the onboarding recording." That citation is the difference between a party trick and a feature an operations team will trust, because it lets a human verify the answer in one click. An answer the user can check is an answer the user will rely on.
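A sketch of how the retrieved chunks and their metadata might be assembled into the prompt for the generation step. The wording of the instruction and the chunk fields are assumptions, and the call to the model itself is left out because it depends on which provider you use.

```python
def mmss(seconds):
    """Format a second count as MM:SS for citations."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def build_prompt(question, chunks):
    """Number each source and carry its timestamp so the answer can cite it."""
    sources = [
        f"[{n}] {c['video']} at {mmss(c['start_s'])}: {c['text']}"
        for n, c in enumerate(chunks, start=1)
    ]
    return (
        "Answer the question using only the sources below. "
        "Cite sources by number and timestamp. "
        "If the sources do not contain the answer, say so.\n\n"
        "Sources:\n" + "\n".join(sources) + f"\n\nQuestion: {question}"
    )


retrieved = [
    {"video": "Onboarding recording", "start_s": 760,
     "text": "The trainer demonstrates the lock-out procedure step by step."},
    {"video": "Onboarding recording", "start_s": 1315,
     "text": "Recap slide listing the three lock-out steps."},
]

prompt = build_prompt("Where is the lock-out procedure demonstrated?", retrieved)
print(prompt)
```

Telling the model to say when the sources lack the answer matters as much as the citation: it gives the system a way to decline instead of guessing.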
A Common Mistake — Chunking By The Clock Instead Of By Meaning
The single most frequent failure in first-version video RAG systems is fixed-length chunking, and it is worth a callout because it looks fine in a demo and falls apart in production. A team cuts every thirty seconds because it is the easy thing to build, the demo on a tidy two-minute clip works, and the feature ships. Then a real user asks about something that happened across a chunk boundary — the question was asked at second 28 and answered at second 34 — and the system retrieves the chunk with the question but not the answer, or vice versa, and returns something confidently wrong.
The fix is not exotic: cut on scene and speaker boundaries so each chunk is one coherent unit of meaning, and let chunks overlap slightly so a moment that straddles a boundary survives in at least one chunk. The 2025 SceneRAG research showed that segmenting on narrative-consistent scenes, rather than the clock, measurably improves retrieval. The lesson for a product owner: when an engineer says "we'll just chunk every N seconds", ask what happens to a moment that spans two chunks — and budget for scene-aware chunking before launch, not after the first wave of bad answers.
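The boundary problem and the overlap fix can be checked with the numbers from the example: a question asked at second 28 and answered at second 34. The sketch below shows only the overlap half of the fix; the padding value is an assumption, and scene-aware cutting would be applied before it.

```python
def add_overlap(chunks, pad_s, duration_s):
    """Extend each (start, end) chunk by pad_s on both sides, clamped to the video."""
    return [
        (max(0, start - pad_s), min(duration_s, end + pad_s))
        for start, end in chunks
    ]


def chunks_containing(chunks, start_s, end_s):
    """Chunks that hold the whole moment from start_s to end_s."""
    return [c for c in chunks if c[0] <= start_s and end_s <= c[1]]


fixed = [(0, 30), (30, 60)]  # thirty-second cuts of a one-minute clip
padded = add_overlap(fixed, pad_s=5, duration_s=60)

print(chunks_containing(fixed, 28, 34))   # []
print(chunks_containing(padded, 28, 34))  # [(0, 35), (25, 60)]
```

With hard cuts, no single chunk contains both the question and the answer. With a few seconds of overlap, the moment survives intact in at least one chunk, at the price of indexing slightly more footage.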
Build It Yourself, Or Buy The Embeddings?
As with most of this course, there is a build path and a buy path, and the honest answer is that most teams should start by buying the hard part. The hard, expensive, easy-to-get-wrong component is the video embedding model — the thing that turns video into good searchable coordinates. You can self-host an open model (CLIP, SigLIP, InternVideo and relatives) and own the whole stack, which is the right call when data cannot leave your premises — surveillance and medical footage often cannot — or when archive scale makes per-call API pricing painful.
The buy path is led by Twelve Labs, whose Marengo model produces the video embeddings and whose Pegasus model answers questions about clips and can reason over the temporal arc of an asset up to two hours long. As of 2026 these are offered both through Twelve Labs' own APIs and through AWS Bedrock, and they integrate with the same vector databases you would use anyway. Note the pace of change: Twelve Labs sunset its Marengo 2.7 model on March 30, 2026, with Marengo 3.0 taking over — a reminder that the managed-API layer of this stack moves fast and that any version number in a contract needs a "what happens at end-of-life" clause. The frontier general models (Gemini, GPT, Claude) play the generation role well and can serve as the whole system for small archives via long context, but they are not, in 2026, the cheapest way to index a large archive for repeated search.
The decision mirrors the one in the build-vs-buy cost lesson: buy to get to a working feature fast and learn what your users actually ask; revisit self-hosting once volume, data-residency rules, or per-call costs justify owning the pipeline.
Where The Money Actually Goes
Put real structure on the cost, because it is the question every budget owner asks and the answer is reassuring once you see the shape. The expensive work is ingestion, and you pay it once per video: running ASR, captioning selected frames with a VLM, and computing embeddings. Frame selection is what keeps this bill sane — recall the roughly ninety-fold frame reduction from Stage 3 — and it is why a competent pipeline can index an archive for a sensible one-time fee rather than a fortune.
The cheap work is per query, and it stays cheap by design: embedding one short question costs almost nothing, the vector search is a database lookup measured in milliseconds, and the generation step reads only the few retrieved chunks (a few thousand tokens) rather than the whole archive. Compare that with the naive long-context number from earlier: $2.37 to re-read a single hour on every question. A retrieval query that reads three two-thousand-token chunks reads six thousand tokens, which at $1.25 per million input tokens costs about three-quarters of a cent. That gap (pay once to index, pay under a cent per question) is the entire economic argument for video RAG, and it widens with every additional question and every additional hour in the archive.
| Approach | One-time cost | Cost per question | Best when |
|---|---|---|---|
| Long-context model, whole video each time | None | High — re-reads everything (~$2.37/hour of video, every query) | One moderate video, few questions |
| Video RAG (text grounding) | Moderate — ASR + captioning + embedding, once | Very low — reads a few small chunks | Large archive, many questions, mostly spoken/on-screen content |
| Video RAG (visual embeddings) | Higher — heavier embedding model, once | Very low — reads a few small chunks | Large archive where the answer is in what you see |
| Hybrid | Highest: both pipelines run, once | Very low | Production systems that need both |
Table 1. The cost shape of each approach. RAG trades a one-time indexing cost for near-free questions; long context trades zero setup for a high per-question bill. The crossover comes fast as archive size and question volume grow.
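To see where the crossover falls for your own numbers, a break-even sketch helps. The per-question figures are the ones derived in this lesson ($2.37 to re-read an hour; 6,000 retrieved tokens at $1.25 per million, or $0.0075). The indexing costs passed in below are hypothetical placeholders, not quoted prices; replace them with a real estimate from your engineer or vendor.

```python
import math

LONG_CONTEXT_PER_QUESTION = 2.37  # USD, re-reading one hour of video
RAG_PER_QUESTION = 0.0075         # USD, 6,000 tokens at $1.25 per 1M


def break_even_questions(indexing_cost_per_hour: float) -> int:
    """Questions per hour of video after which RAG is the cheaper option."""
    saving_per_question = LONG_CONTEXT_PER_QUESTION - RAG_PER_QUESTION
    return math.ceil(indexing_cost_per_hour / saving_per_question)


# Hypothetical one-time indexing costs per hour of video (placeholders).
for cost in (5.0, 20.0):
    print(f"indexing at ${cost:.2f}/hour pays for itself after "
          f"{break_even_questions(cost)} questions")
```

Under these assumptions the one-time cost is recovered within a handful of questions per hour of footage, which is why the table recommends long context only for a single video with few questions.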
Where Fora Soft Fits In
We build the systems this lesson describes inside real products: searchable archives for OTT and internet-TV platforms, "ask the recording" features for e-learning and video-conferencing tools, and investigation tools over surveillance footage where the answer must come with a verifiable timestamp. Across video conferencing, OTT, e-learning, telemedicine, and surveillance, the recurring request is the same — turn a growing pile of recordings into something a non-engineer can query in plain language — and video RAG is the architecture that delivers it without the runaway cost of re-reading everything on every question. Our work in these verticals since 2005 means we have seen which corners (chunking, timestamp metadata, data residency) decide whether the feature earns trust or quietly gets switched off.
What To Read Next
- Fine-tuning a video VLM on a domain — the other way to specialise a model, and when to retrieve instead.
- Open-frontier VLMs — LLaVA, Qwen-VL, InternVL — choosing the model that does the generation step.
- The real cost of AI in video products — the build-vs-buy and token-cost math that decides your architecture.
References
- Jeong, S., Kim, K., Baek, J., Hwang, S. J. "VideoRAG: Retrieval-Augmented Generation over Video Corpus." Findings of the Association for Computational Linguistics: ACL 2025; arXiv:2501.05874 (2025, accessed 2026-06-01). — Primary source for the framing that most RAG systems target text while video was largely overlooked, for the dynamic video-retrieval-plus-visual-and-textual-generation design, and for the video frame-selection mechanism (extract the most informative subset of frames; not all frames are equally important; extract textual information when subtitles are absent).
- Ren, X., et al. "VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos." arXiv:2502.01549 (2025, accessed 2026-06-01); code at github.com/HKUDS/VideoRAG. — Source for the dual-channel design (graph-based textual knowledge grounding + multi-modal context encoding) for unbounded-length video, and for the LongerVideos benchmark of 160+ videos totalling 134+ hours across lecture, documentary, and entertainment categories.
- Luo, Y., et al. "Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension." arXiv:2411.13093 (2024–2025, accessed 2026-06-01). — Source for the training-free pipeline that replaces extended visual tokens with auxiliary texts from OCR, ASR, and object detection, and for the finding that ASR auxiliary text drives general improvement across video durations.
- Varshney, T., Surla, A. "An Easy Introduction to Multimodal Retrieval-Augmented Generation for Video and Audio." NVIDIA Technical Blog (developer.nvidia.com, accessed 2026-06-01). — Source for the three architectural approaches (common embedding space; N parallel pipelines; grounding in a common modality), the five-stage video pipeline, the frame-rate arithmetic (1 min @ 60 fps = 3,600 frames; downsample to 4 fps = 240; keyframe-select to ~40), SSIM-based keyframe selection, and the retrieve-rerank-generate query path with timestamp metadata.
- Wang, T., et al. "SceneRAG: Scene-level Retrieval-Augmented Generation for Video Understanding." arXiv:2506.07600 (2025, accessed 2026-06-01). — Source for the finding that fixed-length chunking disrupts context and fails to capture true scene boundaries, and that LLM-driven segmentation into narrative-consistent scenes (using ASR transcripts plus temporal metadata) improves retrieval.
- Google DeepMind. "Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context." arXiv:2403.05530 (2024, accessed 2026-06-01); and Google Cloud, "The Needle in the Haystack Test and How Gemini Pro Solves It." — Source for the up-to-2-million-token context window, the near-perfect recall (>99.7%) across text, video, and audio, and the recovery of a secret word from a random frame of a 10.5-hour video.
- Google AI for Developers. "Video understanding" and "Understand and count tokens" (ai.google.dev/gemini-api, accessed 2026-06-01). — Primary source for the video tokenization rate (~263 tokens/second of video at default media resolution; 258 tokens/frame visual, 32 tokens/second audio), the default 1-FPS sampling, and the low-media-resolution alternative (~100 tokens/second).
- Google AI for Developers. "Gemini Developer API pricing" (ai.google.dev/gemini-api/docs/pricing, accessed 2026-06-01). — Primary source for Gemini 2.5 Pro input pricing ($1.25 per 1M tokens up to 200k context; $2.50 per 1M above), used in the per-hour and per-query cost arithmetic.
- Twelve Labs. "Video Foundation Models: Marengo & Pegasus", "Marengo 3.0", and developer release notes (twelvelabs.io and docs.twelvelabs.io, accessed 2026-06-01). — Source for Marengo as the multimodal embedding/search model and Pegasus as the video-understanding model that reasons over the temporal arc of an asset up to two hours, for AWS Bedrock and SaaS deployment, and for the Marengo 2.7 sunset on 2026-03-30 with Marengo 3.0 succeeding it.


