The Video Investigator Agent — A Surveillance Use Case

Why This Matters

Surveillance is the use case where an agent earns its keep fastest, because the work it replaces — a human scrubbing through hours of footage — is slow, expensive, and easy to get wrong. If you run a video product for physical security, retail analytics, manufacturing safety, or a smart building, this lesson shows you what an investigator agent actually is, what parts it needs, and where the cost and the legal risk hide. You do not need to write the pipeline yourself, but you do need to know enough to tell a sound design from a reckless one — to ask whether the agent reads every frame or retrieves the right ones, and whether a human signs off before anything happens. This is also the first of three applied agent lessons; the meeting copilot and the async review agent reuse the same skeleton on different footage.

What An Investigator Agent Is — And Is Not

Start with the picture most people have, because it is wrong. An investigator agent is not a robot watching a wall of monitors and pressing alarms. It is a piece of software that sits on top of footage already recorded by your cameras and your video management system — the software, called a VMS, that records and stores camera streams — and answers questions about that footage on demand. You ask in plain language; it does the looking.

The word that matters is investigator, and it points at a specific job: looking backward through recorded video to reconstruct what happened. Security teams call this forensic search — searching stored footage after an event to find the clip that explains it. The classic version is a person clicking through twelve hours of recordings from sixteen cameras to find the ninety seconds that matter. That is the task the agent takes over. In 2026 the security industry made this concrete: Brivo and Eagle Eye Networks launched an agent named Eeva that watches cameras from a plain-language prompt, and a startup called Conntour raised seven million dollars in March 2026 to build, in their words, an AI search engine for security video. The pattern is no longer theoretical.

It helps to fix the boundary against two neighbors. An investigator agent is not the same as a live alerting rule ("ping me when the detector sees a person in this zone"), which is a simpler, always-on trigger covered by ordinary video analytics. And it is not a live monitoring operator that watches streams in real time to intervene as events unfold. The investigator works mostly in the past tense: a question arrives, it investigates the archive, it reports. Some products blur the line by letting the same agent both watch live and search history, but the engineering and the law treat "react to a live face" and "search yesterday's footage" very differently, and you should keep them separate in your head.

Figure 1. Three different jobs that get lumped together. Alerting is an always-on trigger; live monitoring watches the present; the investigator agent answers questions about the past. This lesson is about the third.

The Investigator Loop

Every agent runs a loop — perceive, reason, act, observe — and we drew that loop in the agent-loop lesson. The investigator agent is that loop pointed at a specific goal: answer a question about footage. Walk one investigation slowly and the loop becomes concrete.

A question arrives: "Did anyone enter the loading bay after 22:00 last night?" The agent first decomposes the goal — breaks the big question into smaller, answerable steps. Which cameras cover the loading bay? What counts as "entering"? What is the time window? This decomposition is the planning skill from the agent primitives lesson, and it is what separates a real agent from a single search box.

Then the loop turns. The agent picks the first step — "list cameras covering the bay" — and answers it by consulting its memory of the facility's camera map. It picks the next step — "find motion-and-person events in those feeds between 22:00 and 06:00" — and executes it by calling a tool, the person detector, with the camera IDs and the time window filled into a structured form. The detector returns three candidate clips. Those land back in the agent's working memory as a new observation. The agent reasons again: are these real entries or false hits from a swaying shadow? It calls a second tool, the vision-language reader, to look at each candidate clip and describe it. Two are shadows; one is a person carrying a box. The loop turns until the question is answered, and then the agent writes the finding — with the clip, the camera, and the timestamp attached as evidence.
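The walk-through above can be sketched in a few lines. This is a minimal sketch, not a production agent: a fixed plan stands in for the model's planning, and `CAMERA_MAP`, `person_detector`, `vlm_describe`, the camera IDs, and the canned results are all hypothetical stand-ins chosen to reproduce the example (three candidates, two shadows, one person).

```python
from dataclasses import dataclass, field

# Hypothetical semantic memory: which cameras cover which zone.
CAMERA_MAP = {"loading_bay": ["cam3", "cam7"]}

@dataclass
class Clip:
    camera: str
    time: str              # "HH:MM", kept as text for simplicity
    description: str = ""

def person_detector(cameras, start, end):
    """Stub for the cheap detector tool; returns canned candidate clips."""
    return [Clip("cam3", "22:41"), Clip("cam3", "23:14"), Clip("cam7", "02:05")]

def vlm_describe(clip):
    """Stub for the costly vision-language reader; canned descriptions."""
    canned = {"23:14": "person carrying a box"}
    return canned.get(clip.time, "swaying shadow")

@dataclass
class Finding:
    question: str
    evidence: list = field(default_factory=list)
    needs_human_review: bool = True   # the agent reports; a person decides

def investigate(question, zone, start, end):
    cameras = CAMERA_MAP[zone]                          # step 1: consult memory
    candidates = person_detector(cameras, start, end)   # step 2: cheap, wide tool
    finding = Finding(question)
    for clip in candidates:                             # step 3: costly tool, few clips
        clip.description = vlm_describe(clip)           # observation lands in working memory
        if "person" in clip.description:
            finding.evidence.append(clip)
    return finding

result = investigate("Did anyone enter the loading bay after 22:00 last night?",
                     "loading_bay", "22:00", "06:00")
for clip in result.evidence:
    print(clip.camera, clip.time, clip.description)
```

The point of the shape is the order of the calls: memory first, the cheap tool over the whole window, and the expensive reader only on what the cheap tool surfaced. The finding carries its evidence and a review flag rather than triggering anything itself.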

Notice what each turn needed. It needed planning to choose the next step, tool use to execute it, and memory to hold the goal and recall the camera map. Those are exactly the three primitives from the previous lesson; the investigator agent is where they stop being abstract. The security industry describes the same shape in its own words — see, then think about what was seen over time, then assess how serious it is, then act — and that four-beat description from vendors like Ambient.ai is the same agent loop wearing a security uniform.

Figure 2. One investigation as a loop. Planning picks the step, a tool executes it, memory holds the state and the camera map, and the loop repeats until the agent writes a finding, which a human reviews before anything happens.

The Tool Belt — What The Agent Can Actually Do

An agent is only as capable as the tools you give it, and the art is giving it few, sharp ones rather than many dull ones. A surveillance investigator needs a small, well-chosen belt. Each tool answers one kind of question, and most of them are computer-vision models we covered earlier in this section.

The first tool is a detector — a model that finds objects and people in a frame and draws a box around each one. The production workhorse is the YOLO family, covered in the YOLO lineage lesson; it answers "what and where" cheaply and fast enough to scan a whole night of footage. The detector is the agent's wide net: it turns raw video into a stream of "at 23:14 on camera 3 there is a person" facts the agent can reason over.

The second tool is a tracker: a model that links the same object across consecutive frames so it becomes one consistent identity instead of a new detection every frame. Trackers like ByteTrack, from the multi-object tracking lesson, do this within a single camera's feed; following someone from camera to camera typically needs an extra hand-off step that matches tracks between feeds by time, location, and appearance. Together they let the agent answer "where did that person go after the bay?" Without tracking, the agent sees a thousand disconnected boxes; with it, it sees one person's path.

The third tool is a segmenter — a model that outlines an object precisely, pixel by pixel, rather than with a loose box, and can follow that outline through a clip. SAM 2, from the SAM 2 for video lesson, is the tool here; the agent uses it to isolate a specific object — an abandoned bag, a vehicle — and confirm it is the same one across time.

The fourth tool is the forensic search index — a way to retrieve clips by describing them in words rather than by scrubbing a timeline. This is the heart of natural-language video search, and it is the same retrieval machinery as multimodal RAG over a video archive. We unpack how it works in the next section, because it is the part that makes the whole agent affordable.

The fifth tool is the vision-language reader — a model that looks at a clip and describes what is happening in plain language, the open and closed VLMs from the open-frontier VLM lesson. This is the agent's careful eye. It is slow and costly, so it is used last and least: only on the handful of candidate clips the cheaper tools have already surfaced. The reader is what turns "three boxes flagged by the detector" into "a man in a dark jacket carried a box into the bay at 23:14."

The sixth tool is the report writer — the agent assembling its findings, the supporting clips, and the timestamps into a short, human-readable summary with the evidence linked. This is not a vision tool at all; it is the agent using the language model's native strength to communicate a result a person can act on.

Tool	What it answers	Cost to run	Video example
Detector (YOLO)	"What and where, in this frame?"	Very low — scans whole nights	Find every person in the bay feeds overnight
Tracker (ByteTrack)	"Where did this object go next?"	Low	Follow one person across four cameras
Segmenter (SAM 2)	"Is this the exact same object?"	Low–medium	Confirm one abandoned bag across a clip
Forensic search index	"Which clips match this description?"	Low at query time (built once)	"Person in a red jacket near the gate"
Vision-language reader	"What is actually happening here?"	High — use sparingly	Describe the 3 candidate clips in words
Report writer	"Summarize the finding with evidence"	Low	One paragraph + 3 linked clips + times

A common and expensive mistake is tool overload: handing the agent thirty tools because each one seemed useful. The more tools an agent can choose from, the more often it picks the wrong one or stalls deciding, and accuracy on tool selection drops as the menu grows. Start with the six above, and add a seventh only when a real investigation proves it is missing.
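One way to make the "few, sharp tools" rule hard to break is to put it in the registry itself. The sketch below is an illustration under stated assumptions: the `Tool` and `ToolBelt` classes, the numeric `cost_rank` values, and the cap of seven are hypothetical, and the ranks only encode the relative ordering from the table, not measured costs.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tool:
    name: str
    answers: str
    cost_rank: int   # illustrative ordering only: 1 = cheapest, higher = costlier

class ToolBelt:
    """A small registry that refuses to grow past a fixed size."""

    def __init__(self, max_tools=7):   # hypothetical cap: the six tools plus one
        self.max_tools = max_tools
        self._tools = {}

    def register(self, tool):
        if len(self._tools) >= self.max_tools:
            raise ValueError(
                f"tool belt is full ({self.max_tools}); "
                f"remove a tool before adding {tool.name!r}"
            )
        self._tools[tool.name] = tool

    def cheapest_first(self):
        # The order the agent should prefer: wide and cheap before narrow and costly.
        return sorted(self._tools.values(), key=lambda t: t.cost_rank)

belt = ToolBelt()
belt.register(Tool("detector", "What and where, in this frame?", 1))
belt.register(Tool("tracker", "Where did this object go next?", 2))
belt.register(Tool("segmenter", "Is this the exact same object?", 3))
belt.register(Tool("search_index", "Which clips match this description?", 2))
belt.register(Tool("vlm_reader", "What is actually happening here?", 5))
belt.register(Tool("report_writer", "Summarize the finding with evidence", 2))

for tool in belt.cheapest_first():
    print(tool.cost_rank, tool.name)
```

A hard cap is a blunt instrument, but it forces the conversation the paragraph above asks for: adding a tool means arguing for it.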

Figure 3. The investigator's tool belt, ordered from cheap-and-wide (the detector, run over everything) to costly-and-precise (the VLM reader, run on a few clips). The order is the cost-control strategy.

How Natural-Language Forensic Search Works

The single most important idea in this lesson is that you never run the expensive model on all the footage. Understanding why, and what you do instead, is what separates a system that costs a few dollars a night from one that costs thousands.

Do the arithmetic once and the problem is obvious. Take a modest site: sixteen cameras, each recording at fifteen frames per second, for the twelve dark hours when most investigations happen. Frames from one camera in one night come to 15 × 3,600 × 12 = 648,000 frames. Across sixteen cameras that is 648,000 × 16 = 10,368,000 frames — about ten million pictures every night. Now suppose you tried to answer questions by sending each frame to a vision-language model at, say, $0.0005 per image. That is 10,368,000 × $0.0005 = $5,184 per night, or more than $155,000 a month, for one small site — and you would still be paying it whether or not anyone ever asked a question. Reading everything with the smart model does not scale. You can verify the price side of this against the real cost of AI in video products; the lesson there is the same one the arithmetic teaches here.
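The same arithmetic as a script you can rerun with your own site's numbers. The per-image price is the lesson's illustrative figure, not a quoted rate from any provider, and the 30-day month is an assumption.

```python
FPS = 15                       # frames per second per camera
HOURS = 12                     # the dark hours being recorded
CAMERAS = 16
PRICE_PER_IMAGE_USD = 0.0005   # illustrative price, not a real quote
DAYS_PER_MONTH = 30            # assumption for the monthly figure

frames_per_camera = FPS * 3600 * HOURS
frames_per_night = frames_per_camera * CAMERAS
nightly_cost = frames_per_night * PRICE_PER_IMAGE_USD
monthly_cost = nightly_cost * DAYS_PER_MONTH

print(f"{frames_per_camera:,} frames per camera per night")
print(f"{frames_per_night:,} frames per night across the site")
print(f"${nightly_cost:,.0f} per night, ${monthly_cost:,.0f} per month")
```

Changing `CAMERAS` or `FPS` shows how quickly the naïve design gets worse: the cost is linear in both, and nothing in it depends on whether anyone asked a question.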

The way out has two phases, and the split between them is the whole design. The first phase happens once, at ingest time, as footage is recorded. A cheap detector runs over the stream and writes down a compact note for each meaningful moment — "person, camera 3, 23:14, near the bay door." At the same time, the system samples a representative still frame every so often (you do not need all fifteen per second; one every second or two captures the scene) and turns each sampled frame into a short string of numbers called an embedding — a numeric fingerprint that places similar-looking images near each other in a mathematical space. Frames are fingerprinted with an image-text model such as CLIP or SigLIP, the same family from the CLIP primer; the audio track, if any, is transcribed to text; and camera, time, and zone metadata are attached. All of this — detector notes, frame fingerprints, transcript, metadata — is filed in a searchable index. The index is built continuously and quietly in the background, and it is cheap because the detector and the embedding model are small.
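The ingest phase described above can be sketched as follows. Treat this as a toy under loud assumptions: `toy_embed` is a bag-of-words stand-in so the example runs with the standard library only, where a real system would call an image-text model such as CLIP or SigLIP on the pixels; `describe_frame` is a hypothetical callable standing in for frame content; and `IndexEntry` and `build_index` are illustrative names, not any product's API.

```python
import math
from collections import Counter
from dataclasses import dataclass

def toy_embed(text):
    """Stand-in fingerprint: normalised word counts, NOT a real image embedding."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}

@dataclass
class IndexEntry:
    camera: str
    second: int        # seconds since the start of the recording
    kind: str          # "note" (from the detector) or "frame" (sampled still)
    text: str
    embedding: dict

def build_index(camera, duration_s, detector_notes, describe_frame, every_s=2):
    entries = []
    # 1) Every detector note is filed, whatever second it falls on.
    for second, note in sorted(detector_notes.items()):
        entries.append(IndexEntry(camera, second, "note", note, toy_embed(note)))
    # 2) Only one still every `every_s` seconds is fingerprinted.
    for second in range(0, duration_s, every_s):
        text = describe_frame(second)
        entries.append(IndexEntry(camera, second, "frame", text, toy_embed(text)))
    return entries

notes = {4: "person near bay door", 11: "person red jacket"}
index = build_index("cam3", 20, notes, lambda s: "empty loading bay at night")

frames_recorded = 20 * 15   # 20 seconds at 15 frames per second
frames_indexed = sum(1 for e in index if e.kind == "frame")
print(frames_recorded, "frames recorded,", frames_indexed, "fingerprinted")
```

Even in this toy, sampling cuts the fingerprinting work thirtyfold, while the detector notes keep the moments that fall between samples.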

The second phase happens only when a question is asked, at query time, and it is fast and cheap because the heavy lifting already happened. The question — "person in a red jacket near the gate around midnight" — is turned into the same kind of numeric fingerprint, and the index returns the handful of clips whose fingerprints sit closest to it, filtered by the time and camera the question implied. Because visual, transcript, and metadata indexes all get searched and their results merged, the system can match on appearance, on spoken words, and on where-and-when at the same time. A reranking step then re-sorts that short list for relevance. Only now — on the five or ten clips that survived — does the costly vision-language reader actually look, confirm the match, and describe it. The smart model touches ten clips instead of ten million frames. That is the difference between a few cents and five thousand dollars.
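The query phase can be sketched the same way. Again the assumptions are loud: `toy_embed` is a word-count stand-in for a real image-text embedding, `ROWS` is a hypothetical pre-built index, `vlm_confirm` fakes the reader with a string check, and the reranking step is left out for brevity. What the sketch does show faithfully is the order of operations: filter by metadata, rank by similarity, and call the costly reader only on the survivors.

```python
import math
from collections import Counter

def toy_embed(text):
    """Stand-in fingerprint: normalised word counts, NOT a real embedding model."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}

def cosine(a, b):
    # Both vectors are already unit length, so the dot product is the cosine.
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())

# Hypothetical index rows written at ingest time: (camera, hour, note).
ROWS = [
    ("cam1", 23, "person red jacket near gate"),
    ("cam1", 14, "person red jacket near gate"),
    ("cam2", 23, "delivery truck at dock"),
    ("cam1", 0, "person blue coat near gate"),
    ("cam3", 23, "cat on fence"),
]
INDEX = [(cam, hour, note, toy_embed(note)) for cam, hour, note in ROWS]

def search(query, hours, top_k=2):
    q = toy_embed(query)
    in_window = [row for row in INDEX if row[1] in hours]          # metadata filter
    ranked = sorted(in_window, key=lambda row: cosine(q, row[3]), reverse=True)
    return ranked[:top_k]                                           # short list only

vlm_calls = 0

def vlm_confirm(row):
    """Fake reader: counts how often the expensive model would be invoked."""
    global vlm_calls
    vlm_calls += 1
    return "red jacket" in row[2]

hits = search("person in a red jacket near the gate", hours={23, 0})
confirmed = [row for row in hits if vlm_confirm(row)]

print("reader calls:", vlm_calls, "of", len(INDEX), "indexed rows")
for cam, hour, note, _ in confirmed:
    print("confirmed:", cam, hour, note)
```

The afternoon sighting on the same camera never reaches the reader because the time filter removes it first, and the blue-coat clip reaches the reader but is rejected there. Cheap filters narrow the field and the reader makes the final call.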

Aspect	Read everything (naïve)	Detect-then-retrieve (production)
When the VLM runs	On every frame, all night	On ~10 candidate clips per query
Frames the VLM sees	~10,000,000 / night	~10 / query
Rough nightly model cost	~$5,000 for one site	A few cents to a few dollars
Cost when no one asks	You pay anyway	Almost nothing
Who does the wide scan	The expensive model	A cheap detector + embedding index

This two-phase shape — index cheaply and continuously, retrieve and read expensively but rarely — is the production pattern behind every natural-language video search product shipping in 2026, from Eagle Eye's Smart Video Search to Conntour to Ambient.ai's edge reasoning model. When you evaluate any "AI video search" tool, the first question to ask is where the heavy model runs. If the answer is "on every frame," the bill will tell on it.

Figure 4. Forensic search in two phases. Ingest time (top) is cheap and continuous: index everything with small models. Query time (bottom) is where the costly reader runs — but only on the few clips retrieval surfaced. The split is the cost control.

Memory — Why The Agent Has To Remember

An investigator that forgets everything between questions is a search box, not an agent. The difference is memory, and a surveillance agent needs all four kinds we named in the agent primitives lesson.

Its working memory is the current investigation (this question, the clips found so far, the running plan), and it lives in the model's context window for the length of one task. Its semantic memory is the standing knowledge of the facility: the camera map, which feed covers which zone, the list of normal vehicles, the layout of the site. Its episodic memory is the record of specific past events, such as "the loading-bay camera produced false hits from headlights last Tuesday," so the agent does not re-flag the same shadow night after night. Its procedural memory is the investigation playbook: the standing steps the agent always follows when it gets an "after-hours entry" question. Semantic and procedural memory make the agent competent on day one; episodic memory is what makes it get better over time instead of repeating yesterday's mistakes.

Episodic memory is the one teams most often skip, and skipping it is why so many analytics systems are exhausting to use: they cry wolf every night because nothing remembers that last night's wolf was a tree. An agent that writes "camera 3 false-flags on headlight glare after 23:00" into its memory and reads it back on the next investigation is doing the one thing a tireless human guard would do — learning the site.
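A minimal sketch of that write-then-read-back behavior, assuming a very simple pattern shape. `FalseFlagPattern`, `EpisodicMemory`, the hour-based matching, and the `0.2` weight are all hypothetical; a real system would persist the patterns and match on richer features than camera and hour.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FalseFlagPattern:
    camera: str
    from_hour: int   # simplification: applies from this hour until midnight
    cause: str

class EpisodicMemory:
    def __init__(self):
        self._patterns = []

    def remember(self, pattern):
        """Write a lesson learned during one investigation."""
        if pattern not in self._patterns:
            self._patterns.append(pattern)

    def known_cause(self, camera, hour):
        """Read it back on the next one; None means nothing is known."""
        for p in self._patterns:
            if p.camera == camera and hour >= p.from_hour:
                return p.cause
        return None

memory = EpisodicMemory()
memory.remember(FalseFlagPattern("cam3", 23, "headlight glare"))

events = [("cam3", 22), ("cam3", 23), ("cam5", 23)]
for camera, hour in events:
    cause = memory.known_cause(camera, hour)
    weight = 0.2 if cause else 1.0   # down-weight, never silently drop
    print(camera, hour, weight, cause or "no known pattern")
```

Down-weighting rather than deleting is a deliberate choice in this sketch: a real intruder can walk through a spot that usually produces glare, so the event should stay reviewable.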

The Hard Part — What The Law Will Not Let The Agent Do

Surveillance is where agent engineering meets the sharpest legal limits, and in Europe those limits are not advisory. The EU AI Act sorts AI uses into bands by risk, and surveillance touches the two strictest bands directly. Getting this wrong is not a bug; for a prohibited practice it is a fine of up to €35 million or 7% of worldwide annual turnover, whichever is higher.

The brightest line is real-time remote biometric identification in publicly accessible spaces — scanning faces in a live public crowd to identify who people are. For most actors this is prohibited outright under Article 5 of the AI Act, a ban that has been in force since February 2025, with narrow law-enforcement exceptions (a missing-child search, an imminent terrorist threat) that require prior authorization. For a commercial video product, treat live face identification in public space as off the table. An investigator agent must be designed so it cannot do this — not "configured not to," but built without the capability.

The softer but still serious line is post-event biometric identification — running facial recognition on yesterday's recording to identify a person after the fact. This is not banned, but it is classified as high-risk, which brings heavy obligations: documentation, logging, human oversight, a conformity assessment. Because of these obligations, the safe default for most products is to keep the agent at the level of objects, events, and descriptions — "a person in a red jacket," "a vehicle at the gate," "an after-hours entry" — and never at the level of identity — "this is Jane Smith." Searching for what happened is ordinary analytics; searching for who someone is crosses into biometric territory. The same boundary is drawn in detail in the face detection under the EU AI Act lesson; an investigator agent should respect it by design.

One date matters for planning, and it changed in 2026. The Act's high-risk obligations were originally due to apply from 2 August 2026. In May 2026 EU institutions reached a provisional political agreement, the "Digital Omnibus," to postpone them: obligations for stand-alone high-risk systems (the band that covers law-enforcement and biometric tools) are now set to apply from 2 December 2027, and those for high-risk AI embedded in regulated products from 2 August 2028. The prohibitions in Article 5, including the real-time biometric ban, were not delayed and remain in force. The postponement buys engineering time; it does not move the red line.

There is also a design principle that sits above the law and makes the law easier to satisfy: keep a human in the loop on every consequential action. The agent investigates and proposes; a person decides. The agent may surface "a likely break-in at the bay, here are the three clips," but it does not lock a door, dispatch a guard, or file a police report on its own. This is both the responsible design and the compliant one — human oversight is exactly what the high-risk rules demand — and it is why Figure 2 ends on a review gate rather than an action.
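The "agent proposes, person decides" rule is easiest to trust when the code path makes the alternative impossible. The sketch below is one hedged way to express that, with hypothetical names throughout (`Proposal`, `ReviewGate`, `execute`): an action can only run after a named human has approved it, and every decision is logged.

```python
from dataclasses import dataclass
from enum import Enum

class Status(Enum):
    PROPOSED = "proposed"
    APPROVED = "approved"
    REJECTED = "rejected"

@dataclass
class Proposal:
    summary: str                       # what the agent found
    action: str                        # what it suggests doing
    status: Status = Status.PROPOSED
    decided_by: str = ""

class ReviewGate:
    def __init__(self):
        self.audit_log = []            # who decided what, for later review

    def decide(self, proposal, reviewer, approve):
        if not reviewer:
            raise PermissionError("a named human reviewer is required")
        proposal.status = Status.APPROVED if approve else Status.REJECTED
        proposal.decided_by = reviewer
        self.audit_log.append((proposal.action, proposal.status.value, reviewer))
        return proposal

def execute(proposal):
    """The only way to act: refuses anything a human has not approved."""
    if proposal.status is not Status.APPROVED:
        raise PermissionError("action not approved by a human")
    return f"executed: {proposal.action}"

gate = ReviewGate()
proposal = Proposal("likely break-in at the bay, three clips attached",
                    "dispatch guard")
gate.decide(proposal, reviewer="security lead", approve=True)
print(execute(proposal))
print(gate.audit_log)
```

The agent code only ever constructs `Proposal` objects; it never holds a reference to anything that can lock a door. The audit log also supports the logging and oversight obligations discussed above.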

Use of the agent	EU AI Act band	Practical rule for a product
Live face ID in public space	Prohibited (Article 5, since Feb 2025)	Do not build the capability at all
Recognizing who a person is, post-event	High-risk (applies 2 Dec 2027)	Avoid by default; heavy obligations if used
Searching for objects, events, descriptions	Ordinary analytics	The safe default scope for the agent
Auto-acting on a match (lock, dispatch)	Removes human oversight	Always gate behind a human decision

A Worked Investigation, End To End

Tie the pieces together with one run. The question comes in at 07:00: "Anything unusual at the loading bay overnight?" The agent decomposes it into "find after-hours human activity in the bay feeds between 22:00 and 06:00." It consults semantic memory for which two cameras cover the bay. It calls the detector, already run at ingest, and pulls every person-event in those feeds in the window — fourteen of them. It checks episodic memory: camera 3 false-flags on headlight glare after 23:00, so it down-weights the six events that match that pattern. It sends the remaining eight candidate clips, not the night's ten million frames, to the vision-language reader, which describes each: six are a staff member on a known shift, one is a fox, one is a person in dark clothing carrying a box to the bay door at 23:14 who is not on any roster. The agent uses the tracker to follow that person's path across two more cameras to the perimeter fence. It writes a report: one paragraph, three linked clips, the times, the camera path. Then it stops — and a human security lead, not the agent, decides what to do. Every primitive fired; the costly model touched eight clips; no face was identified; a person made the call.
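The funnel in that run is worth checking as numbers. The sketch below replays it with canned data: the camera IDs, the hour-based glare rule, and the `seen` field (standing in for what the vision-language reader would report) are hypothetical, chosen only to reproduce the counts in the worked example.

```python
# Pattern recalled from episodic memory (hypothetical shape).
GLARE = {"camera": "cam3", "from_hour": 23}

def matches_glare(event):
    return event["camera"] == GLARE["camera"] and event["hour"] >= GLARE["from_hour"]

# Fourteen person-events pulled from the ingest-time index (canned).
events = (
    [{"camera": "cam3", "hour": 23, "seen": "headlight glare"}] * 6
    + [{"camera": "cam7", "hour": 22, "seen": "staff member on known shift"}] * 6
    + [{"camera": "cam7", "hour": 3, "seen": "fox"}]
    + [{"camera": "cam7", "hour": 23, "seen": "unrostered person carrying a box"}]
)

down_weighted = [e for e in events if matches_glare(e)]
to_reader = [e for e in events if not matches_glare(e)]       # reader sees only these
reported = [e for e in to_reader if e["seen"].startswith("unrostered")]

frames_in_night = 15 * 3600 * 12 * 16   # the site from the cost arithmetic
print("events:", len(events))
print("down-weighted by memory:", len(down_weighted))
print("clips sent to the reader:", len(to_reader), "instead of", f"{frames_in_night:,}", "frames")
print("findings for human review:", len(reported))
```

Each stage is cheaper than the one after it and hands on less than it received, which is the whole cost-control strategy in four numbers: 14, 6 set aside, 8 read, 1 reported.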

Build, Buy, Or Wrap

You have three honest options, and the right one depends on how much your product is about this feature. You can buy a finished agent — Eeva, Conntour, an Ambient.ai deployment — and integrate it, which is fastest and right when investigation is a feature, not your product. You can wrap an existing VMS: vendors now add agent layers onto platforms like Milestone XProtect, so you keep the recording and storage you have and bolt an agent on top. Or you can build the pipeline from the open primitives in this section — detector, tracker, segmenter, embeddings, VLM, an agent framework from the framework lesson — which is right when the investigation logic is your product and the off-the-shelf agents do not fit your domain. Most teams should start by buying or wrapping to learn what questions users actually ask, then build only the parts the market tools get wrong for their vertical.

Where Fora Soft Fits In

We build video products across surveillance, video analytics, conferencing, streaming, OTT, e-learning, telemedicine, and AR/VR, and the investigator agent is squarely in the surveillance and analytics work we ship. When a client needs forensic search, our design discipline is the one in this lesson: index cheaply and continuously, retrieve and read expensively but rarely, give the agent a small sharp tool set, and put a human on every consequential action. We keep the agent at the level of objects and events rather than identity by default, so a product clears the EU AI Act's bright lines by design instead of by configuration. In analytics-heavy verticals — retail, industrial safety, smart buildings — the same skeleton serves intelligent video analytics work without rebuilding the agent for each site.

References

Eagle Eye Networks / Brivo — "Introducing Eeva: Your AI Video Agent That Keeps an Eye on Anything," BusinessWire (17 March 2026) — https://www.businesswire.com/news/home/20260317090970/en/Introducing-Eeva-Your-AI-Video-Agent-That-Keeps-an-Eye-on-Anything — tier 4 (product deployer). Primary source for the Eeva natural-language video agent running on existing camera estates; used for the "products shipped in 2026" claim.
TechCrunch — "Conntour raises $7M from General Catalyst, YC to build an AI search engine for security video systems" (26 March 2026) — https://techcrunch.com/2026/03/26/conntour-raises-7m-from-general-catalyst-yc-to-build-an-ai-search-engine-for-security-video-systems/ — tier 4 (reporting). Source for the Conntour funding and "natural language query across cameras without preset categories" claim.
Ambient.ai — "AI Video Analytics for Physical Security: What Works in 2026" and "Agentic AI Security" — https://www.ambient.ai/blog/ai-video-analytics-in-physical-security — tier 4 (deployer). Source for the see → think → assess → act agentic loop framing and the edge-VLM forensic-search pattern.
European Union — Regulation (EU) 2024/1689 (Artificial Intelligence Act), Article 5 (Prohibited AI Practices) — https://artificialintelligenceact.eu/article/5/ — tier 1 (primary legislation). Source for the real-time remote biometric identification prohibition and its law-enforcement exceptions; in force since 2 February 2025.
Future of Privacy Forum — "Red Lines under the EU AI Act: Restricting Real-time Remote Biometric Identification Systems for Law Enforcement Purposes" — https://fpf.org/blog/red-lines-under-the-eu-ai-act-restricting-real-time-remote-biometric-identification-systems-for-law-enforcement-purposes/ — tier 3 (legal analysis). Used for the real-time-vs-post-event distinction and the high-risk classification of post-remote biometric identification.
Council of the EU — "Artificial Intelligence: Council and Parliament agree to simplify and streamline rules" (7 May 2026) — https://www.consilium.europa.eu/en/press/press-releases/2026/05/07/artificial-intelligence-council-and-parliament-agree-to-simplify-and-streamline-rules/ — tier 1 (primary, institutional). Source for the Digital Omnibus provisional agreement postponing high-risk obligations.
Gibson Dunn — "EU AI Act Omnibus Agreement — Postponed High-Risk Deadlines and Other Key Changes" — https://www.gibsondunn.com/eu-ai-act-omnibus-agreement-postponed-high-risk-deadlines-and-other-key-changes/ — tier 3 (legal analysis). Source for the specific new dates: 2 December 2027 (stand-alone high-risk) and 2 August 2028 (embedded in regulated products).
Lee et al. — "VideoRAG: Retrieval-Augmented Generation over Video Corpus," arXiv:2501.05874 (2025) — https://arxiv.org/pdf/2501.05874 — tier 5 (primary, academic). Grounds the retrieval-then-read architecture for video and the need to select an informative subset of frames rather than process all of them.
NVIDIA Developer Blog — "An Easy Introduction to Multimodal Retrieval-Augmented Generation for Video and Audio" — https://developer.nvidia.com/blog/an-easy-introduction-to-multimodal-retrieval-augmented-generation-for-video-and-audio/ — tier 4 (vendor engineering). Source for the ingest-time embedding / query-time retrieval pipeline and CLIP/SigLIP frame embeddings.
Radford et al. — "Learning Transferable Visual Models From Natural Language Supervision (CLIP)," arXiv:2103.00020 (OpenAI, 2021) — https://arxiv.org/abs/2103.00020 — tier 5 (primary, algorithmic). The shared image-text embedding space that makes describe-it-in-words frame retrieval possible.
Jocher & Ultralytics — YOLO documentation and model cards — https://docs.ultralytics.com — tier 4 (production deployer). Source for the detector as the cheap, wide-scan tool; full lineage in Lesson 2.2.
Ravi et al. — "SAM 2: Segment Anything in Images and Videos," Meta AI (2024) — https://ai.meta.com/sam2/ — tier 4 (first-party). Source for the segmenter tool and its memory-based propagation across a clip; full treatment in Lesson 2.4.