Published 2026-06-03 · 26 min read · By Nikolay Sapunov, CEO at Fora Soft
Why This Matters
Almost every organization that runs cameras or keeps recorded video is sitting on an archive it cannot actually use. The footage exists, but finding the one moment that matters means a person scrubbing through hours of timeline, and so most of the archive is never watched at all. An investigator agent changes the economics of that search: it reads the archive once, builds a searchable understanding of it, and then answers questions on demand in seconds instead of hours. If you build or buy video software — surveillance, intelligent video analytics, OTT, conferencing, e-learning, telemedicine — this is the lesson where the whole agentic phase comes together into something you could actually scope, budget, and ship. You do not need to write the code yourself, but you do need to understand the shape of the system well enough to tell a sound design from an expensive mistake, to know which decisions carry legal weight, and to ask an engineering team the right questions. This is the Phase 7 capstone: it assumes you have read the earlier lessons and pulls every one of them into a single build.
What We Are Building — And Why It Is The Capstone
Every lesson in this phase taught one piece in isolation. This one assembles them. The way to understand the capstone is to see it as the one project that needs all of them at once, the way a working car needs the engine, the gearbox, the brakes, and the steering to all be present and connected — not as a parts catalogue, but as a thing that drives.
Here is the job in one sentence. You point the system at a body of recorded video — a year of camera footage, a decade of broadcast archive, a moderation backlog, a library of recorded lessons — and afterwards a person can ask it questions in ordinary language and get back precise answers, each one carrying the clip that proves it. The person never scrubs a timeline. They type or speak a question; the system investigates and reports, the way a research assistant would if the assistant could watch a million hours of video and never get tired.
It helps to fix the boundary against two things this is not. It is not a search box over filenames and manual tags — that only finds what a human already labelled, which is almost nothing. And it is not a single giant model that you feed the whole archive to and ask a question — no model in 2026 can hold a year of video in its attention at once, and trying would cost a fortune. The investigator agent sits between these: it builds a cheap, structured understanding of the archive ahead of time, and then reasons over that understanding at question time, reaching for the expensive full-fidelity tools only on the few moments that matter.
Why call it the capstone rather than just another applied agent? Because it is the first lesson that requires you to make every Phase 7 decision in one place. You need the agent loop to structure how it investigates (the agent-loop lesson). You need the tool-use, memory, and planning primitives to give it hands and a notebook (the primitives lesson). You need to choose a framework to wire it together durably (the framework-decision lesson). You reuse the forensic-search agent for the live question (the investigator lesson) and the async-review agent for the one-time indexing (the async-review lesson). You run it under the eval, safety, cost, and observability discipline (the AgentOps lesson). And you pick concrete tools from the 2026 framework landscape (the Manus / Claude Agent SDK / Google ADK lesson). Seven earlier lessons, one diagram.
Figure 1. The capstone is where the phase converges. Each earlier lesson supplies one part; this build connects them into one system.
The Two Halves — Index Once, Investigate On Demand
The single most important idea in this whole build is that it has two halves that run on completely different clocks, and confusing them is the most common way these projects go wrong. One half is slow, runs once, and touches everything. The other is fast, runs on demand, and touches almost nothing. Keep them separate in your head and the rest of the architecture falls into place.
The first half is the index — a one-time, patient pass over the entire archive that turns raw video into a structured, searchable understanding. Think of it as the difference between a shipping container full of unlabelled boxes and a warehouse with a catalogue: the index is the catalogue. Nobody is waiting while it is built, so it runs as the async-review pattern from the previous applied lesson — a fleet of workers grinding through the archive on its own schedule, using the cheapest model pricing available. We will see exactly what it records in a moment.
The second half is the investigation — what happens when a person actually asks a question. This is the forensic-search agent from the investigator lesson: it takes a plain-language query, plans how to answer it, searches the index, reads the few clips that look promising, reasons about what it found, and reports back with evidence. Someone is waiting here, so this half optimizes for speed and accuracy, not for the rock-bottom pricing the index could afford.
The reason this split matters so much is cost, and the cost gap is enormous. Reading video with a vision-language model — a model that looks at frames and describes them in words, which we will call the reader from here on — is the expensive operation in the whole system. If you skip the index and make the agent read raw video every time someone asks a question, every single query re-watches the archive from scratch, and you pay the reader's full price on every question forever. Build the index once, and the reader's expensive work is done a single time; from then on, each question searches a cheap catalogue and only reads the handful of clips that survive the search. The index is a cost you pay once; the alternative is a cost you pay on every query for the life of the product.
| | The index (build once) | The investigation (per question) |
|---|---|---|
| When it runs | Once over the whole archive, then on new footage | Every time someone asks a question |
| Who is waiting | Nobody — runs on its own schedule | A person, expecting an answer in seconds |
| Pattern it reuses | Async-review agent (batch, fleet, funnel) | Investigator agent (the live agent loop) |
| Optimizes for | Lowest cost per hour of video | Speed and accuracy of the answer |
| Model pricing it uses | Batch APIs — 50% off, 24-hour turnaround | Standard real-time pricing |
| Touches | Everything in the archive | Only the few clips a query surfaces |
Figure 2. Two halves on two clocks. The index is built once with no one waiting; the investigation runs on demand while a person waits. They meet at the catalogue.
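The cost argument behind the split can be written down as a small break-even sketch. Everything here is an assumption for illustration: the function names are hypothetical and the prices are placeholder units, not real vendor rates. The point is only the shape of the two formulas.

```python
def cost_without_index(queries: int, archive_hours: float,
                       reader_cost_per_hour: float) -> float:
    """No index: every query makes the reader re-watch the whole archive."""
    return queries * archive_hours * reader_cost_per_hour


def cost_with_index(queries: int, archive_hours: float,
                    index_cost_per_hour: float, clips_per_query: int,
                    clip_hours: float, reader_cost_per_hour: float) -> float:
    """Index once, then each query reads only a handful of short clips."""
    one_time_index = archive_hours * index_cost_per_hour
    per_query = clips_per_query * clip_hours * reader_cost_per_hour
    return one_time_index + queries * per_query


# Placeholder units: 1,000 archive hours, reader at 1.0 per hour,
# indexing at 0.5 per hour, 14 clips of 0.01 hours read per query.
no_index = cost_without_index(10, 1000, 1.0)
indexed = cost_with_index(10, 1000, 0.5, 14, 0.01, 1.0)
print(no_index, indexed)
```

The first formula grows with the archive on every query; the second pays for the archive once and then grows only with the number of clips a query surfaces.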
What The Index Actually Records
The index is not one thing — it is three layers of understanding stacked on top of the same footage, each answering a different kind of question. Building all three is what lets the agent answer the wide range of questions a real investigation throws at it. Let us walk them from cheapest to richest.
The first layer is detections and tracks — the output of the computer-vision primitives from Phase 2. A detector finds the objects in each sampled frame ("there is a person here, a vehicle there"), and a tracker stitches those detections across time into a single moving thing ("this is person #7, and here is their path through the scene"). We covered the detector lineage in the YOLO lesson and the stitching step in the multi-object tracking lesson. This layer is cheap to run and answers counting and movement questions — how many, where, when, how fast.
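A minimal sketch of what a record in this layer might look like, and how a counting question runs against it. The `Track` fields and the `count_after` helper are hypothetical names chosen for illustration; a real event store would hold far more detail per track.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass(frozen=True)
class Track:
    track_id: int
    label: str            # object class from the detector, e.g. "person"
    camera_id: str
    first_seen: datetime
    last_seen: datetime


def count_after(tracks: list[Track], label: str, cutoff: time) -> int:
    """Count tracks of one label first seen at or after a time of day."""
    return sum(
        1 for t in tracks
        if t.label == label and t.first_seen.time() >= cutoff
    )


tracks = [
    Track(7, "person", "cam-03", datetime(2026, 5, 4, 22, 15), datetime(2026, 5, 4, 22, 17)),
    Track(8, "person", "cam-03", datetime(2026, 5, 4, 21, 50), datetime(2026, 5, 4, 21, 52)),
    Track(9, "vehicle", "cam-01", datetime(2026, 5, 4, 23, 5), datetime(2026, 5, 4, 23, 6)),
]
print(count_after(tracks, "person", time(22, 0)))
```

The query never touches video: it is a filter over rows the index already wrote.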
The second layer is embeddings — and this is the layer that makes plain-language search possible, so it is worth slowing down. An embedding is a list of numbers that captures the meaning of a piece of content, arranged so that things which mean similar things sit close together. Picture every clip and every search phrase as a pin dropped on a vast map, where "a red truck reversing" lands near other clips of trucks reversing and far from clips of an empty corridor. To search, you drop a pin for the question and gather the nearest clips. The place that stores these pins and finds the nearest ones fast is a vector database — a vector is just the list of numbers, and the database is built to answer "what is near this point?" across billions of points. We cover how this works for video in detail in the multimodal video RAG lesson and the embedding idea itself in the CLIP lesson. In 2026 you can compute these video embeddings with a dedicated video model — Twelve Labs' Marengo 3.0, for example, exposes an Embed API specifically for this, and general multimodal embedding models such as Google's Gemini Embedding 2 place text, image, and video into one shared map so a text question can find a video moment directly.
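The pin-on-a-map idea can be sketched in a few lines. This is a brute-force toy under stated assumptions: the three-number vectors are made up for illustration, whereas real embeddings have hundreds or thousands of dimensions and come from an embedding model, and a production vector database uses an approximate index rather than comparing against every clip.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest(query: list[float], clips: dict[str, list[float]], k: int = 3) -> list[str]:
    """Return the ids of the k clips whose vectors sit closest to the query."""
    ranked = sorted(clips, key=lambda cid: cosine(query, clips[cid]), reverse=True)
    return ranked[:k]


# Made-up vectors standing in for real clip embeddings.
clips = {
    "truck_reversing": [0.9, 0.1, 0.0],
    "empty_corridor": [0.0, 0.1, 0.9],
    "truck_parked": [0.8, 0.3, 0.1],
}
question = [1.0, 0.0, 0.0]  # stand-in for the embedded text of the question
print(nearest(question, clips, k=2))
```

Because the question and the clips live in the same space, the search is the same operation whether the pin comes from text, an image, or another clip.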
The third layer is descriptions — short written summaries of what happens in each meaningful segment, produced by the reader. "A delivery van pulls up to the loading bay; two people unload boxes for four minutes; the van leaves." This is the richest and most expensive layer, so the index produces it only for segments the cheaper layers flagged as interesting, never for the dead air. A video-understanding model such as Twelve Labs' Pegasus, which can caption and answer questions about clips, is built for exactly this step, and in 2026 its analysis endpoints handle clips up to two hours at a time. These descriptions are themselves embedded and stored, so the agent can search the meaning of what happened, not just the objects present.
Stack the three and the archive becomes answerable. The detection layer answers "how many people entered after 22:00." The embedding layer answers "find footage that looks like this." The description layer answers "what happened at the loading bay on Tuesday." A good investigator agent uses all three, picking the cheapest layer that can answer each part of a question.
The End-To-End Architecture
Now we can draw the whole thing. Do not let the number of boxes intimidate you — every box is something you have already met, and the data flows in one direction through two clear lanes. Read it left to right: raw video comes in, the index lane turns it into a catalogue, and the investigation lane answers questions against that catalogue.
Raw footage enters through ingest — the unglamorous but essential step that pulls video off cameras, files, or a streaming source and normalizes it into a standard form the rest of the system can read. From ingest, the index lane takes over: the cheap filter drops dead air, the keyframe sampler picks the frames worth looking at, the CV indexers produce detections and tracks, the embedder produces vectors, and the reader produces descriptions for the flagged segments. Everything lands in two stores: a vector store for the embeddings (for "find clips like this" search) and an event store for the structured detections, tracks, and descriptions (for "how many, when, what" queries). This lane is the async-review fleet, run once and then incrementally on new footage.
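The first two stages of the index lane, the cheap filter and the keyframe sampler, can be sketched together. The `Frame` type, the `motion_score` field, and the thresholds are assumptions for illustration; the sketch also assumes frames arrive sorted by timestamp.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    timestamp_s: float
    motion_score: float   # hypothetical 0..1 score from a cheap motion detector


def sample_keyframes(frames: list[Frame], motion_threshold: float,
                     min_gap_s: float) -> list[Frame]:
    """Drop dead air, then keep at most one frame per min_gap_s seconds."""
    kept: list[Frame] = []
    for frame in frames:
        if frame.motion_score < motion_threshold:
            continue  # cheap filter: nothing is moving
        if kept and frame.timestamp_s - kept[-1].timestamp_s < min_gap_s:
            continue  # sampler: too close to the last kept frame
        kept.append(frame)
    return kept


frames = [
    Frame(0.0, 0.0), Frame(0.5, 0.9), Frame(1.0, 0.9),
    Frame(1.5, 0.9), Frame(2.0, 0.1), Frame(2.5, 0.8),
]
kept = sample_keyframes(frames, motion_threshold=0.5, min_gap_s=1.0)
print([f.timestamp_s for f in kept])
```

Only the frames that survive this step go on to the CV indexers, the embedder, and the reader, which is why the later, more expensive stages see a small fraction of the raw volume.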
The investigation lane is the agent. A person's question arrives, and the agent runs its loop: it plans how to answer, calls tools to search the two stores and to read specific clips at full fidelity, reasons over what comes back, keeps a working memory of what it has found so far, and assembles an answer with evidence — the verdict plus the exact timestamps and clips that support it. Critically, the answer does not go straight to action. It passes through a human gate: for anything consequential, a person reviews the agent's finding before it becomes a decision. The agent proposes; a person disposes.
Two cross-cutting systems wrap the whole diagram and are easy to forget until they bite you. The first is the framework — the software that actually wires the agent's loop together and, just as importantly, makes it durable, so an investigation that crashes midway resumes instead of starting over. The second is observability and evaluation — the instrumentation that records every tool call and reasoning step so you can debug, cost, and trust the system. We give each its own section below, because in a real build they are where most of the engineering effort actually goes.
Figure 3. The full system. The index lane turns raw video into two stores once; the investigation lane answers questions against them on demand. The framework wires it durably; observability watches it run.
Wiring The Agent With A Framework
The agent loop is a concept; a framework is what makes it real code that survives contact with production. In 2026 the default choice for this kind of work is LangGraph, and it is worth understanding why it is the default, because the reason is exactly the property this build needs most.
LangGraph models an agent as a graph (a set of steps with arrows showing which step can follow which), and its defining feature is checkpointing: it saves the agent's state after every step. We compared it against CrewAI and AutoGen in the framework-decision lesson; here the checkpointing is the point. An investigation can be a long chain of tool calls (search, read, reason, search again, read again), and if the process crashes at step nine of fifteen, checkpointing means it resumes at step nine, from the state saved after step eight, rather than re-running the eight steps it already paid for. In practice you use a simple in-memory checkpointer while developing and a database-backed one, such as the Postgres checkpointer, in production, where the state has to survive a server restart.
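To make the resume behaviour concrete, here is a toy checkpoint runner. It is not LangGraph's API: `run_with_checkpoints`, the `store` dictionary, and `thread_id` as a plain string are assumptions that only illustrate the idea of saving state after every step and picking up after the last saved one.

```python
from typing import Callable

Step = Callable[[dict], dict]


def run_with_checkpoints(steps: list[Step], store: dict, thread_id: str) -> dict:
    """Run steps in order, saving state after each; resume after the last saved step."""
    saved = store.get(thread_id, {"next_step": 0, "state": {}})
    state, start = dict(saved["state"]), saved["next_step"]
    for index in range(start, len(steps)):
        state = steps[index](state)  # may raise mid-run
        store[thread_id] = {"next_step": index + 1, "state": dict(state)}
    return state


calls: list[str] = []


def make_step(name: str, fail_once: bool = False) -> Step:
    """Build a step that records its name; optionally crash on the first attempt."""
    crashed = {"yet": False}

    def step(state: dict) -> dict:
        if fail_once and not crashed["yet"]:
            crashed["yet"] = True
            raise RuntimeError(f"{name} crashed")
        calls.append(name)
        return {**state, name: True}

    return step


steps = [make_step("search"), make_step("read", fail_once=True), make_step("report")]
store: dict = {}
try:
    run_with_checkpoints(steps, store, "investigation-1")
except RuntimeError:
    pass  # simulated crash during the second step
result = run_with_checkpoints(steps, store, "investigation-1")
print(calls)
```

On the second call the runner skips `search`, because its result was already saved, and starts again at the step that failed. An in-memory dictionary plays the role of the development checkpointer; a database table would play the role of the production one.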
A fair caveat, because the engineering press argued about it through 2026: checkpointing is not the same as full durable execution, the stronger guarantee that an entire long-running workflow survives any failure and resumes exactly where it left off. For the live investigation, which runs in seconds, the framework's checkpointer is plenty. For the index lane, which runs for hours over a huge archive, you want the stronger guarantee — and that is where a durable-execution engine such as Temporal earns its place, replaying its event log to resume a half-finished indexing job rather than restarting it. This is the same durability spine we built in the async-review lesson; here it protects the index, while the framework's lighter checkpointing protects each investigation.
How does the agent actually call its tools? Through a small, sharp tool set, exposed in 2026 through a common standard called the Model Context Protocol (MCP) — an open protocol, introduced in late 2024 and now stewarded by a Linux Foundation body, that standardizes how an agent discovers and calls external tools. By 2026 it had become near-universal: the major model providers support it natively, and the major frameworks, LangGraph included, treat it as the default way to plug in tools. The practical payoff is that the investigator's tools — search_events, search_clips, read_clip, get_track — are defined once as MCP tools and reused across frameworks and models, instead of being re-glued to each one. The agent's "hands" become portable.
Keep the tool set deliberately small. The temptation in a capstone is to give the agent every tool in the building; the discipline is to give it the four or five it actually needs and nothing else. A small tool set is easier for the model to use correctly, cheaper to run, and far easier to audit when something goes wrong — which, in a system that touches recorded video of real people, it eventually will.
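A sketch of how the four tools might be declared. The `name` / `description` / `inputSchema` shape follows how MCP describes a tool, with a JSON Schema for the arguments; every parameter name below is an assumption made for illustration, as are the `_tool` and `missing_arguments` helpers.

```python
def _tool(name: str, description: str, properties: dict, required: list[str]) -> dict:
    """Build one MCP-style tool definition with a JSON Schema for its arguments."""
    return {
        "name": name,
        "description": description,
        "inputSchema": {"type": "object", "properties": properties, "required": required},
    }


# Hypothetical shared time-window parameters.
_WINDOW = {
    "start": {"type": "string", "description": "ISO 8601 start of the time window"},
    "end": {"type": "string", "description": "ISO 8601 end of the time window"},
}

TOOLS = [
    _tool("search_events", "Query the event store for tracks of one label in a time window.",
          {"label": {"type": "string"}, **_WINDOW}, ["label", "start", "end"]),
    _tool("search_clips", "Find clips whose embeddings are nearest to a text query.",
          {"query": {"type": "string"}, "limit": {"type": "integer"}}, ["query"]),
    _tool("read_clip", "Run the reader on one clip at full fidelity (expensive).",
          {"clip_id": {"type": "string"}, "question": {"type": "string"}}, ["clip_id"]),
    _tool("get_track", "Fetch the full path of one tracked object.",
          {"track_id": {"type": "string"}}, ["track_id"]),
]

TOOLS_BY_NAME = {tool["name"]: tool for tool in TOOLS}


def missing_arguments(tool_name: str, arguments: dict) -> list[str]:
    """Return required arguments the caller left out; KeyError for unknown tools."""
    required = TOOLS_BY_NAME[tool_name]["inputSchema"]["required"]
    return [name for name in required if name not in arguments]


print(missing_arguments("search_events", {"label": "forklift"}))
```

With four definitions the whole surface the agent can touch fits on one screen, which is what makes it auditable.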
A Worked Investigation, End To End
Abstractions become concrete when you walk a single real question through the system, watching the cost at each step. Take an industrial-safety query against a year of footage from forty cameras: "Show me every time a forklift came within two metres of a person on foot last month." Here is what happens, and what it costs.
First, the arithmetic that makes the index worth it. A year of footage from forty cameras, at thirty frames per second, is an astronomical number of frames — forty cameras times thirty frames times the seconds in a year is roughly 38 billion frames. Reading every one with the reader is impossible at any budget. But the index already ran. The cheap filter and keyframe sampler kept only the frames where something moved and only a few per second of those, cutting the volume by a factor of hundreds before the reader ever saw it; and the index ran on a batch API at half price because nothing was waiting. The expensive pass happened once, offline, at the lowest available rate. We keep the exact dollar figures in the living cost-of-AI reference, because they move; the structure is what matters — billions of frames reduced to a searchable catalogue, paid for once.
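The frame count is easy to check. The camera count and frame rate come from the example; the reduction factor of 300 is an assumed stand-in for "a factor of hundreds", not a measured figure.

```python
CAMERAS = 40
FPS = 30
SECONDS_PER_YEAR = 365 * 24 * 60 * 60   # 31,536,000, ignoring leap years

raw_frames = CAMERAS * FPS * SECONDS_PER_YEAR
print(f"{raw_frames:,}")                # 37,843,200,000, roughly 38 billion

ASSUMED_REDUCTION = 300                 # hypothetical funnel factor
frames_for_reader = raw_frames // ASSUMED_REDUCTION
print(f"{frames_for_reader:,}")         # 126,144,000
```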
Now the live question, which is cheap because the index did the heavy lifting. The agent plans: this is a spatial-proximity question about two object types over a time window, so the event store can answer most of it. It calls search_events for forklift tracks and person-on-foot tracks in last month's window — a fast database query over the index, not a re-watch of video. It reasons over the returned tracks, computing where a forklift path and a person path were within two metres at the same moment — simple geometry over data the index already extracted. That yields, say, fourteen candidate moments. Only now does the expensive tool fire: the agent calls read_clip on those fourteen clips to confirm each one is a genuine close approach and not a tracking artefact. Fourteen reads, not 38 billion frames. It assembles an answer: "Eleven confirmed close approaches; here are the eleven clips, timestamped, with the two flagged as the closest." And it routes that to the human gate, because a safety finding is consequential.
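The "simple geometry" step can be sketched as follows. It assumes the index stored each track as points in ground-plane metres (which requires calibrated cameras) sampled on whole-second timestamps so two tracks can be aligned; `TrackPoint` and `close_approaches` are hypothetical names.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrackPoint:
    track_id: str
    timestamp_s: int   # whole seconds, so samples from two tracks align
    x_m: float         # ground-plane position in metres (assumed calibrated)
    y_m: float


def close_approaches(forklifts: list[TrackPoint], people: list[TrackPoint],
                     max_distance_m: float = 2.0) -> list[tuple[int, str, str, float]]:
    """Return (timestamp, forklift_id, person_id, distance) for same-moment pairs in range."""
    people_by_time: dict[int, list[TrackPoint]] = {}
    for p in people:
        people_by_time.setdefault(p.timestamp_s, []).append(p)

    hits = []
    for f in forklifts:
        for p in people_by_time.get(f.timestamp_s, []):
            distance = math.hypot(f.x_m - p.x_m, f.y_m - p.y_m)
            if distance <= max_distance_m:
                hits.append((f.timestamp_s, f.track_id, p.track_id, distance))
    return hits


forklifts = [TrackPoint("forklift-2", 10, 0.0, 0.0)]
people = [
    TrackPoint("person-7", 10, 1.0, 1.0),   # about 1.41 m away at the same moment
    TrackPoint("person-8", 10, 3.0, 0.0),   # 3 m away, out of range
    TrackPoint("person-9", 11, 0.0, 0.0),   # same spot, but a second later
]
for hit in close_approaches(forklifts, people):
    print(hit)
```

Each hit is only a candidate: it is the list the agent then passes to `read_clip` for confirmation, since a tracking artefact can produce a geometric near-miss that never happened.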
The lesson of the worked example is the funnel, the same one from the async-review lesson, now pointed at a single query: cheap, structured search narrows billions of frames to fourteen candidates, and the expensive reader confirms only those. A common and costly mistake is to skip the structured search and have the agent read clips to answer everything — "the reader can watch video now, so let it." It can; your invoice cannot. The reader is the last resort, fired on a handful of clips, never the first move.
Figure 4. One question, walked through. Structured search over the index narrows billions of frames to a handful; the expensive reader confirms only those. The funnel is the whole economics.
Running It Safely — Eval, Observability, And Cost
A demo that answers one scripted question is easy. A system you trust to search a real archive, day after day, is not — and the difference is entirely the discipline from the AgentOps lesson. Three things separate a toy from a product.
The first is evaluation — measuring whether the agent's answers are actually right, on purpose and repeatedly, rather than trusting that they look right. For an investigator agent this means a held-out set of questions with known answers — a golden set — that you re-run whenever you change a model, a prompt, or a tool, so you catch the day an "improvement" quietly makes the agent miss half the real events. Two failure modes matter most here. A missed event (the agent says nothing happened when something did) is dangerous in a safety or security setting. A false positive (the agent flags something that did not happen) erodes trust and wastes human review time. You measure both, and you decide which one your product can least afford, because tuning the agent to reduce one usually raises the other.
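A golden-set check reduces to comparing two sets of event ids. This is a minimal sketch; the ids and the `score` helper are hypothetical, and a real harness would also match events by time overlap rather than by exact id.

```python
def score(predicted: set[str], golden: set[str]) -> dict[str, float]:
    """Compare the agent's reported events against the known-correct set."""
    true_positives = len(predicted & golden)
    return {
        "missed": len(golden - predicted),            # real events the agent did not report
        "false_positives": len(predicted - golden),   # reported events that did not happen
        "recall": true_positives / len(golden) if golden else 1.0,
        "precision": true_positives / len(predicted) if predicted else 1.0,
    }


golden = {"e1", "e2", "e3", "e4"}
predicted = {"e1", "e2", "e5"}
print(score(predicted, golden))
```

Recall tracks missed events and precision tracks false positives, so re-running this after every model, prompt, or tool change shows which of the two a change has traded away.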
The second is observability — recording every step the agent takes so you can see what it did and why. An agent is not a single function call; it is a chain of plans, tool calls, and reasoning, and when an answer is wrong you need to see the whole chain to know which link failed. In 2026 this is a settled discipline with a vendor-neutral standard: the OpenTelemetry project's generative-AI conventions now define how to record agent, tool, and model steps as structured traces, and tools such as LangSmith, Langfuse, and AgentOps build on that to give you a replayable view of each investigation. The practical rule is simple: if you cannot replay an investigation step by step after the fact, you cannot debug it, cost it, or defend it.
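The replay rule can be illustrated with a toy trace recorder. This is not the OpenTelemetry API and the field names are not its semantic conventions; `Trace`, `span`, and the attribute keys are assumptions that show only the idea of one structured record per agent step.

```python
import time
from contextlib import contextmanager


class Trace:
    """Collects one record per plan, tool, or model step of an investigation."""

    def __init__(self, investigation_id: str):
        self.investigation_id = investigation_id
        self.spans: list[dict] = []

    @contextmanager
    def span(self, kind: str, name: str, **attributes):
        record = {"kind": kind, "name": name, "attributes": attributes, "error": None}
        started = time.perf_counter()
        try:
            yield record              # the step can add attributes as it runs
        except Exception as exc:
            record["error"] = repr(exc)
            raise
        finally:
            record["duration_s"] = time.perf_counter() - started
            self.spans.append(record)  # recorded when the step finishes


trace = Trace("investigation-1")
with trace.span("tool", "search_events", label="forklift") as record:
    record["attributes"]["rows_returned"] = 14
try:
    with trace.span("tool", "read_clip", clip_id="clip-001"):
        raise ValueError("clip unavailable")
except ValueError:
    pass

for s in trace.spans:
    print(s["kind"], s["name"], s["error"])
```

Failed steps are recorded as well as successful ones, so the stored list is enough to walk back through an investigation and see which link broke.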
The third is cost control, which is mostly architecture you have already seen but must keep watching. The index uses batch pricing; the live agent uses the funnel so the reader fires rarely; observability tracks the spend per investigation so a single runaway query — an agent that reads two hundred clips instead of fourteen — is caught and capped rather than discovered on the invoice. A pitfall worth naming: an agent with no per-query budget and a reader tool will, on a vague question, happily read far more than it needs. Put a ceiling on tool calls per investigation, and make the agent ask a clarifying question instead of blindly reading when a query is too broad.
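A per-investigation ceiling can be a few lines of code. The class and function names below are hypothetical, and the ceiling of 20 reads is an arbitrary example value, not a recommendation.

```python
class BudgetExceeded(RuntimeError):
    """Raised when an investigation tries to exceed its read ceiling."""


class ToolBudget:
    def __init__(self, max_reads: int):
        self.max_reads = max_reads
        self.reads_used = 0

    def charge_read(self) -> None:
        """Call before every read_clip; refuses once the ceiling is reached."""
        if self.reads_used >= self.max_reads:
            raise BudgetExceeded(f"read budget of {self.max_reads} clips exhausted")
        self.reads_used += 1


def next_action(candidate_clips: list[str], budget: ToolBudget) -> dict:
    """Read the candidates if they fit the remaining budget; otherwise ask to narrow."""
    remaining = budget.max_reads - budget.reads_used
    if len(candidate_clips) > remaining:
        return {
            "action": "clarify",
            "message": f"{len(candidate_clips)} clips match; "
                       "please narrow the time window or location.",
        }
    return {"action": "read", "clips": candidate_clips}


budget = ToolBudget(max_reads=20)
print(next_action([f"clip-{i}" for i in range(14)], budget)["action"])
print(next_action([f"clip-{i}" for i in range(200)], budget)["action"])
```

The check runs before any read fires, so a vague query costs one clarifying question instead of two hundred reader calls; the hard ceiling in `charge_read` is the backstop if the planning check is ever bypassed.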
The Hard Part — What The Law Will Not Let The Agent Do
The most important section of this lesson is not about engineering. A video archive investigator is, by its nature, software that watches recorded footage of real people, and in 2026 that places it squarely inside the European Union's AI Act, the comprehensive AI law whose main provisions become applicable on 2 August 2026. Getting the engineering right and the law wrong is not a smaller failure; it is the one that stops the product from shipping. None of what follows is legal advice: treat it as the map of where the bright lines are, and bring a lawyer to the specifics.
Start with the brightest line. The AI Act's list of outright prohibited practices, in force since 2 February 2025, bans real-time remote biometric identification — recognizing specific people by their face or body in public spaces as it happens — for law-enforcement use, with only narrow, pre-authorized exceptions. An investigator agent searching a recorded archive is not doing real-time identification, which keeps it clear of that particular ban. But the neighbouring rule is the one that shapes the design: post-event remote biometric identification — identifying specific named people in previously recorded footage — is not prohibited, but it is classified as high-risk and, for law-enforcement investigations, requires prior authorization from a judicial authority. The obligations that attach to high-risk biometric systems carry their own later compliance deadline, applying from 2 December 2027. The practical consequence for your build is a design rule: by default, the agent operates on objects and events, not identities. "A forklift came within two metres of a person" is an event-level finding and carries none of that weight. "That person is Jane Doe" is biometric identification and pulls the whole system into the high-risk regime. We go deeper on the biometric line in the face-detection lesson.
Two more obligations apply broadly. Transparency (the Act's Article 50) requires that people be told when they are interacting with AI and that AI-generated or manipulated content be disclosed — relevant the moment your system summarizes footage or generates a synthetic clip. And the principle that ends every applied lesson in this phase is here a legal as much as an engineering one: a human stays in the loop on consequential decisions. The agent surfaces evidence; a person decides what it means and what to do. Build the human gate as a hard requirement, not a configurable option, and keep an audit trail of who decided what, so the system can show its work.
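One way to make the human gate a hard requirement rather than a setting is to give a finding no path to "approved" except through a named reviewer. This is a sketch under assumptions: `Finding`, `HumanGate`, and the audit-record fields are hypothetical, and it illustrates the engineering pattern only, not what any regulation specifically demands.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Finding:
    summary: str
    clip_ids: list[str]
    status: str = "pending_review"   # a finding never starts as approved


@dataclass
class HumanGate:
    audit_log: list[dict] = field(default_factory=list)

    def decide(self, finding: Finding, reviewer: str, approved: bool, reason: str) -> Finding:
        """Record a human decision; the only way a finding changes status."""
        if not reviewer.strip():
            raise ValueError("a named human reviewer is required")
        finding.status = "approved" if approved else "rejected"
        self.audit_log.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "reviewer": reviewer,
            "decision": finding.status,
            "reason": reason,
            "clip_ids": list(finding.clip_ids),
        })
        return finding


def can_act_on(finding: Finding) -> bool:
    """Downstream actions check this; there is no bypass flag."""
    return finding.status == "approved"


gate = HumanGate()
finding = Finding("Eleven confirmed close approaches", ["clip-001", "clip-002"])
print(can_act_on(finding))
gate.decide(finding, reviewer="safety.officer", approved=True, reason="Clips reviewed")
print(can_act_on(finding), len(gate.audit_log))
```

Because the gate has no switch that turns it off, removing the review step means changing code, and the audit log records who decided what and on which clips.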
The design that satisfies all of this is the one we have been building the whole way: index objects and events rather than faces, keep identification out of the default path so the product clears the brightest lines by design instead of by configuration, disclose the AI where the law requires it, gate every consequential action through a person, and log everything. A sound architecture and a compliant one are, here, the same architecture.
Where Fora Soft Fits In
We build video products across surveillance, intelligent video analytics, conferencing, streaming, OTT, e-learning, telemedicine, and AR/VR, and an archive investigator agent draws on most of that experience at once — the computer-vision indexing from our analytics work, the retrieval and agent design from our AI work, and the streaming and ingest plumbing from our OTT work. When a client wants to make an archive answerable, our discipline is the one in this lesson: index cheaply and once, search structured data before reaching for the reader, give the agent a small sharp tool set, keep it at the level of objects and events by default, and put a human gate on every consequential finding. We treat the EU AI Act's bright lines as design constraints from the first diagram, not as a compliance pass bolted on at the end, so a product clears them by construction. The same skeleton serves a surveillance archive, an intelligent-video-analytics deployment, or an OTT back-catalogue with only the indexers and the question set changing.
What To Read Next
- The async video review agent — an archive pattern — the batch pattern that builds the index.
- The video investigator agent — surveillance use case — the live forensic-search half in depth.
- Agent eval, safety, cost, and observability — AgentOps — how to run a fleet like this and trust it.
Talk To Us · See Our Work · Download
- Talk to a video engineer — scope an archive investigator agent for your footage and your jurisdiction: /services/llm-agent-development.
- See our case studies — surveillance, intelligent video analytics, and OTT projects we have shipped: /portfolio.
- Download the blueprint — Video Archive Investigator Agent — Architecture & Compliance Blueprint (PDF): the two-half architecture, the three index layers, the cost funnel, the durability and observability spine, and the EU AI Act bright-line checklist, on one page.
References
- LangChain. "Durable execution — LangGraph documentation." docs.langchain.com, accessed 2026-06-03. (Tier 4 — framework documentation. Source for checkpointing, thread identifiers, MemorySaver in development versus a Postgres checkpointer in production, and resume-from-last-checkpoint behaviour.)
- langchain-ai/langgraph. "Build resilient agents." GitHub repository, accessed 2026-06-03. (Tier 4 — reference implementation. Source for the graph/checkpointer model.)
- Diagrid. "Checkpoints Are Not Durable Execution: Why LangGraph, CrewAI, Google ADK and Others Fall Short for Production Agent Workflows." diagrid.io blog, 2026, accessed 2026-06-03. (Tier 4 — vendor engineering. Source for the checkpointing-versus-durable-execution distinction; presented as one side of an industry debate, balanced against the LangGraph docs.)
- Twelve Labs. "Video Foundation Models: Marengo & Pegasus" and "Marengo 3.0: Real-World Multimodal Embedding AI." twelvelabs.io, 2026, accessed 2026-06-03. (Tier 4 — vendor documentation. Source for Marengo 3.0 Embed/Search API, Pegasus video understanding/QA, the Marengo 2.7 sunset on 2026-03-30, and asynchronous analysis of clips up to two hours.)
- Google. "Gemini Embedding 2: multimodal search." Google developer materials, 2026, accessed 2026-06-03. (Tier 4 — vendor documentation. Source for a unified text/image/video embedding space enabling cross-modal retrieval.)
- Anthropic. "Introducing the Model Context Protocol." anthropic.com/news/model-context-protocol, 2024; with 2026 status via the Model Context Protocol project. Accessed 2026-06-03. (Tier 4 — protocol documentation. Source for MCP's introduction in late 2024, donation to a Linux Foundation body in December 2025, and near-universal 2026 adoption across model providers and frameworks.)
- OpenTelemetry. "AI Agent Observability — Evolving Standards and Best Practices" and GenAI semantic-conventions materials. opentelemetry.io, 2025–2026, accessed 2026-06-03. (Tier 4/6 — open standard. Source for agent/tool/model spans and the vendor-neutral tracing conventions underpinning LangSmith, Langfuse, and AgentOps.)
- European Union. Regulation (EU) 2024/1689 (the AI Act), Article 5 — Prohibited AI Practices, via artificialintelligenceact.eu and the European Commission AI Act materials. Accessed 2026-06-03. (Tier 1 — primary legislation. Source for the prohibition on real-time remote biometric identification for law enforcement, in force from 2 February 2025, and its narrow exceptions.)
- European Union. AI Act timeline and high-risk provisions, European Commission ("AI Act — Shaping Europe's digital future") and FPF analysis "Red Lines under the EU AI Act." Accessed 2026-06-03. (Tier 1/2 — primary legislation plus expert analysis. Source for the 2 August 2026 main-application date, post-remote biometric identification as high-risk requiring prior judicial authorization, and the 2 December 2027 deadline for certain high-risk biometric obligations.)
- European Union. AI Act Article 50 — Transparency obligations, via artificialintelligenceact.eu. Accessed 2026-06-03. (Tier 1 — primary legislation. Source for disclosure of AI interaction and AI-generated/manipulated content.)
- arXiv:2505.23990. "Multi-RAG: A Multimodal Retrieval-Augmented Generation System for Adaptive Video Understanding." 2025, accessed 2026-06-03. (Tier 5 — peer-reviewed/preprint. Source for the multimodal-RAG-over-video architecture pattern.)
- Fora Soft async-review and investigator lessons (this section), for the batch funnel, durability spine, and forensic-search loop reused here. Internal cross-reference.


