Capstone — Building A Surveillance Investigation System With A Multimodal Agent

Why This Matters

This article is for the founder or product manager who has decided to build an AI video investigation product — for physical security, retail loss prevention, manufacturing safety, or smart buildings — and now needs to know what the real thing costs, how long it takes, which parts are bought versus built, and where the law draws a hard line. It is equally for the engineer who has read the individual computer-vision lessons and wants them welded into one deployable system with named technologies and numbers. It assumes you have met the underlying ideas already, because a capstone assembles rather than re-derives; the cross-links point back to each foundational lesson when you need the detail. By the end you will be able to draw the production architecture on a whiteboard, name the exact 2026 technology in every box, defend the cost per investigation to a finance team, sequence the build so a first version ships in weeks, and tell a lawful design from one that will be fined out of the European market.

What You Are Building, Stated Precisely

Fix the product before any technology. You are building a system that sits on top of footage your cameras have already recorded and answers questions about it on demand. A user types a question in plain English — "did a forklift block fire exit 3 during yesterday's late shift?" — and the system returns the clips that answer it, each with a timestamp and a one-paragraph written explanation of what it found. It is not a robot watching a wall of monitors and sounding alarms; it is an analyst-in-software that does the looking so a human does not have to scrub through hours of video frame by frame.

Three capabilities make that possible, and the whole rest of this article is about wiring them together. The system must find events of interest cheaply across enormous amounts of footage, retrieve the right few clips when asked a question described in words, and read those clips closely enough to write a trustworthy answer. A useful mental model is a law firm's discovery process: paralegals first cull millions of documents down to a relevant box, and only then does an expensive senior lawyer read the box closely. Reverse that order — put the senior lawyer on every page — and the bill is astronomical and the work never finishes. The same economics govern video, and getting the order right is the entire engineering problem.

"Investigation" is the word that shapes every later decision, and it has a precise meaning here. This system works on recorded footage after the fact, answering questions about what already happened. That is a deliberate scope choice, not a limitation, and it matters enormously for the law: as we will see, reading old footage to find a described event is a very different legal animal from scanning live faces in a public square, and conflating the two is how products get banned.

The Spine: A Retrieval Cascade And The Agent Pattern

Two ideas carry the entire build. Get them right and everything else is detail.

The first is the retrieval cascade: the discovery-box idea above, made concrete. Raw footage is enormous: a single 1080p camera at 15 frames per second produces about 1.3 million frames a day, and a 200-camera site produces over 250 million frames a day. No system can afford to run a smart, slow model on every one of those frames. So you build a funnel. A fast, cheap detector watches every frame and flags only the ones with something in them, such as a person, a vehicle, or a left-behind bag. A search index then turns those flagged moments into a form you can query by description. When a question arrives, the index returns a shortlist of candidate clips, and only that shortlist (a few dozen clips, not a few hundred million frames) reaches the expensive model that reads video closely. Each stage is cheaper per item and handles more items than the stage after it; each stage hands the next one far less work. This is the single most important design decision in the system, and we return to its arithmetic below.

The second idea is the AI agent as the orchestrator. Rather than hard-coding one fixed sequence of steps, you write a small program that plans the investigation the way a junior analyst would: it reads the question, decides which tools to call and in what order, calls them, looks at what comes back, and decides whether it has enough to answer or needs to look again. The tools on its belt are exactly the cascade's stages — the detector, the search index, a tracker that follows one person across cameras, and the video reader. This is the agent loop from the Phase 7 lessons applied to a tool belt of computer-vision components; the lesson on the agent loop for video is the conceptual foundation, and the investigator-agent lesson walks the loop in detail. Here we take the loop as settled and build the deployable system around it.
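The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the API of any agent framework: the tool names `search_index` and `read_clips`, their stub return values, and the hard-coded `plan_next` planner are all hypothetical. In a real system the planner would be a language model choosing the next tool; a fixed rule is used here only so the control flow is visible.

```python
from typing import Callable

# Hypothetical tool belt: each tool is a plain function returning stub data.
def search_index(query: str) -> list:
    """Stand-in for the embedding search; returns candidate clip IDs."""
    return ["clip-007", "clip-012"]

def read_clips(clip_ids: list, question: str) -> dict:
    """Stand-in for the vision-language reader; one answer per clip."""
    return {cid: f"stub answer to {question!r}" for cid in clip_ids}

TOOLS: dict = {"search_index": search_index, "read_clips": read_clips}

def plan_next(question: str, state: dict):
    """Hard-coded planner: search first, read the shortlist second, then stop."""
    if "candidates" not in state:
        return "search_index", {"query": question}
    if "answers" not in state:
        return "read_clips", {"clip_ids": state["candidates"], "question": question}
    return None, {}

def investigate(question: str, max_steps: int = 5):
    """Run the plan / call / observe loop, recording every tool call."""
    state, audit = {}, []
    for _ in range(max_steps):
        tool, args = plan_next(question, state)
        if tool is None:          # the planner decided it has enough to answer
            break
        result = TOOLS[tool](**args)
        audit.append({"tool": tool, "args": args})
        state["candidates" if tool == "search_index" else "answers"] = result
    return state, audit

state, audit = investigate("Was the bay door left open?")
```

Note that the expensive `read_clips` call only ever receives the shortlist the cheap search produced, and that the `max_steps` cap keeps a confused planner from looping forever.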

Hold the two ideas together and the platform has a clean shape. The cascade decides what work is cheap enough to run on everything versus what work is expensive enough to run only on a shortlist. The agent decides which tools to call for this particular question. Everything in the rest of this article is filling in the boxes, deciding what you build versus buy, and pricing it.

The Production Architecture, Box By Box

A real deployment is more than a detector and a model. Seven kinds of component show up in every video-investigation system we have scoped, and naming them precisely is the first hour of any project.

The camera and ingest layer is where footage enters. Cameras stream video — almost always using a protocol called RTSP, which is just the standard way a camera publishes a live feed — into a recorder. You do not write this; you receive streams from hardware that already exists on the customer's site.

The video management system, universally abbreviated VMS, is the software that records those streams, stores them, and manages retention — how many days of footage you keep before deleting it. This is mature, boring, essential infrastructure with established vendors (Milestone, Genetec, and the cloud-native Eagle Eye Networks among them). You integrate with a VMS; you do not rebuild one.

The edge inference layer is the fast, cheap detector running close to the cameras. "Edge" means on a small computer at the customer's site rather than in a distant data centre, which matters because sending every camera's full video to the cloud is expensive and slow. In 2026 this layer is typically a detector such as YOLO11 running on an NVIDIA Jetson device through the DeepStream pipeline — a toolkit built precisely for running detection on many camera streams at once. A Jetson Orin AGX handles roughly 8 to 16 camera streams; you size the hardware to the camera count.

The tracking and event layer turns raw detections into things worth indexing. A detector says "there is a person in this frame"; a tracker stitches those per-frame boxes into "this is one person, who appeared at 22:04 and left at 22:09." Multi-object trackers like ByteTrack do this stitching, and a video segmenter like SAM 2 can cut the exact shape of an object out across time when an investigation needs the precise outline rather than a rough box. An optional anomaly detector sits here too, flagging unusual motion patterns without being told what to look for.
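To make the "boxes become events" step concrete, here is a small sketch that assumes the tracker has already assigned a `track_id` to every detection. The `Detection` and `Event` types and their field names are hypothetical; they are not ByteTrack's output format. The sketch only shows the grouping that happens after tracking.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    track_id: int   # ID assigned upstream by the tracker (assumed input)
    label: str      # e.g. "person", "forklift"
    t: float        # seconds since the start of the recording

@dataclass
class Event:
    track_id: int
    label: str
    start: float
    end: float

def detections_to_events(detections):
    """Collapse per-frame detections into one event per track."""
    events = {}
    for d in sorted(detections, key=lambda d: d.t):
        ev = events.get(d.track_id)
        if ev is None:
            events[d.track_id] = Event(d.track_id, d.label, d.t, d.t)
        else:
            ev.end = d.t   # extend the event to the latest sighting
    return list(events.values())

dets = [
    Detection(1, "person", 10.0),
    Detection(1, "person", 10.5),
    Detection(2, "forklift", 11.0),
    Detection(1, "person", 12.0),
]
events = detections_to_events(dets)
```

Four detections become two events, which is the reduction that keeps the index small: the index stores events, not frames.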

The search index is the system's memory and the heart of natural-language search. As events are detected, the system converts each clip into a list of numbers — an embedding — that captures its meaning, the way a fingerprint captures an identity. These embeddings live in a vector database, a store built to answer "find me the clips most similar to this description." When a user types "a person in a red jacket near the entrance," that text is turned into the same kind of embedding and the database returns the closest matches. This is the engine behind video retrieval-augmented generation, covered in lesson 4.6.
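A toy version of that lookup, in plain Python, may help. Everything here is an assumption for illustration: the three-number embeddings are hand-made, the `ClipIndex` class is hypothetical, and a real deployment would use a trained embedding model and a vector database rather than a linear scan. What the sketch does show faithfully is the shape of the operation: embeddings are stored when events are detected, and a query is a similarity ranking over stored vectors.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class ClipIndex:
    """In-memory stand-in for a vector database (illustrative only)."""

    def __init__(self):
        self._items = []  # list of (clip_id, embedding)

    def add(self, clip_id, embedding):
        # Called once, at detection time, not at query time.
        self._items.append((clip_id, embedding))

    def search(self, query_embedding, top_k=3):
        scored = [(cosine(query_embedding, emb), cid) for cid, emb in self._items]
        scored.sort(reverse=True)
        return [(cid, score) for score, cid in scored[:top_k]]

index = ClipIndex()
index.add("clip-red-jacket", [0.9, 0.1, 0.0])   # toy vectors, not real embeddings
index.add("clip-forklift", [0.0, 0.2, 0.9])
index.add("clip-blue-coat", [0.6, 0.7, 0.1])

# Pretend this vector is the embedding of "a person in a red jacket".
results = index.search([1.0, 0.0, 0.0], top_k=2)
```

The query never touches video: it ranks vectors that already exist, which is why search can be fast even over months of footage.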

The reasoning layer is the expensive video reader: a vision-language model, or VLM — a model that takes both video frames and a text question and answers in words. This is where Qwen-VL or a comparable model reads the shortlist of candidate clips, confirms which ones actually answer the question, and writes the finding. It is the most capable and most costly component, which is exactly why the cascade works so hard to send it only a handful of clips. Lesson 4.4 on open vision-language models covers the model choices in depth.

The application layer is everything a human touches: the search box, the results view with clips and timestamps, the audit log, and the case file where findings are saved. This is ordinary web-application territory — a database, an API, object storage for clips and reports — and it is deliberately boring, because the hard problems are solved elsewhere. Crucially, it is also where a human reviewer confirms or rejects every finding before it counts as a result.

Figure 1. The production deployment. A cheap edge detector runs on everything; an embedding index makes footage searchable by description; an agent calls a costly vision-language model only on the shortlist; and a human signs off before any finding counts.

Build Versus Buy: The 2026 Verdict, Component By Component

A capable team does not write all of this from scratch, and does not buy all of it either. The line in 2026 sits in a fairly stable place, and getting it right is the difference between shipping in a quarter and burning a year. The rule of thumb mirrors the conferencing capstone's: adopt the mature infrastructure, buy or adopt the fast-moving models, and build only the part that is your actual product — the investigation experience and the safeguards around it.

Component	Build or buy	Concrete 2026 choice	Why
Video management (VMS)	Integrate	Milestone, Genetec, or Eagle Eye Networks	Recording and retention are a solved, regulated domain; never rebuild a VMS
Edge detector	Adopt open model	YOLO11 on Jetson via DeepStream	Mature, fast, well-supported; Ultralytics advises against v12/v13 for production
Tracker / segmenter	Adopt open source	ByteTrack; SAM 2 / SAM 3 for masks	Standard, free, and battle-tested across the industry
Anomaly detector	Adopt or skip first	BN-WVAD-class model, or a VLM-prompt detector	Useful but noisy; add after search works, not before
Embedding + search index	Adopt or buy	Open embeddings + a vector DB, or Twelve Labs Marengo	The retrieval engine; buy the managed model if volume is modest
Vision-language reader	Buy or self-host	Qwen-VL family (self-host) or a frontier VLM API	Models improve monthly; self-host only at large, steady volume
Agent orchestration	Build on a framework	LangGraph or a comparable agent framework	The control logic is your product; the framework is plumbing
App, audit, review UI	Build	Your stack	This is your product — the experience, the case file, and the guardrails

Two cells deserve a note because they changed in 2026. The detector choice is no longer "newest is best": the YOLO family kept releasing (v12 and v13 arrived through 2025), but the maintainer, Ultralytics, explicitly recommends the older YOLO11 (or the streamlined YOLO26) for production, because the attention-heavy v12 and v13 bring training instability, higher memory use, and slower CPU inference in exchange for only marginal accuracy gains. A capstone that reaches for the highest version number ships a slower, more fragile detector; the production-correct choice is the stable one. The embedding-and-search cell now has a credible buy option that did not exist a few years ago: managed video-embedding services such as Twelve Labs Marengo (version 3.0 as of early 2026) turn footage into searchable embeddings through an API, so a small team can have natural-language search without standing up its own model. The per-model reasoning for the detector and tracker is the subject of lesson 2.2 on the YOLO lineage and lesson 2.4 on SAM 2 for video.

Figure 2. What to build and what to adopt. The pattern is consistent with every system in this course: integrate the mature infrastructure, adopt or buy the models, build only the agent logic and the experience that differentiate you.

Following One Investigation From Question To Clip

Numbers and boxes become concrete when you trace a single question through the system. Follow one investigation: a warehouse operations lead asks, "Was the loading bay door left open for more than ten minutes after the last truck left yesterday evening?"

The question reaches the agent orchestrator, which does what a junior analyst would: it breaks the request into parts it can act on. It needs to know when the last truck left, then whether the door stayed open afterward, then for how long. It does not read any video yet — reading video is the expensive step, and a good agent earns its keep by postponing it.

First the agent queries the search index for "truck at the loading bay" within yesterday evening's footage. Because every detected event was already turned into an embedding when it happened, this is a fast database lookup, not a video scan; it returns a handful of candidate clips ranked by how well they match. The agent reads the timestamps and finds the last truck departure at 21:47.

Next the agent narrows the window — 21:47 onward — and asks the index for "loading bay door, open." Again, a fast lookup over pre-computed embeddings returns a shortlist. Now, and only now, the agent spends real money: it sends those few candidate clips to the vision-language model, with the precise question, "Is the bay door open in this clip, and is the area otherwise clear?" The model reads the frames and answers in words for each clip, with the times the door is visibly open. The agent assembles those into a span — open from 21:49 to 22:06 — does the arithmetic (22:06 − 21:49 = 17 minutes, which exceeds the ten-minute threshold), and drafts a finding: "The loading bay door remained open for approximately 17 minutes after the last truck departed at 21:47, from 21:49 to 22:06."
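The span assembly and the duration check are simple enough to write out. In this sketch the three per-clip intervals are hypothetical (the walkthrough only gives the overall 21:49 to 22:06 span), the calendar date is arbitrary, and the one-minute merge tolerance is an assumed parameter.

```python
from datetime import datetime, timedelta

def merge_intervals(intervals, max_gap=timedelta(minutes=1)):
    """Merge per-clip (start, end) intervals that overlap or nearly touch."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def at(hhmm):
    # The date is arbitrary; only the time of day matters for this example.
    return datetime.strptime("2026-01-01 " + hhmm, "%Y-%m-%d %H:%M")

# Hypothetical per-clip answers from the reader: "door visibly open from X to Y".
clip_answers = [
    (at("21:49"), at("21:55")),
    (at("21:55"), at("22:01")),
    (at("22:00"), at("22:06")),
]

spans = merge_intervals(clip_answers)
open_for = spans[0][1] - spans[0][0]
exceeds_threshold = open_for > timedelta(minutes=10)
```

Doing this arithmetic in code rather than asking the model to do it keeps the duration reproducible: the model reports what it sees in each clip, and the agent computes the span.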

The draft does not go straight to the user as fact. It lands in the review view, where a human sees the finding, the supporting clips, and the timestamps, and confirms or corrects it before it is saved to the case file. Every step — every tool call, every clip retrieved, every model answer — is written to the audit log, so that weeks later anyone can reconstruct exactly how the system reached its conclusion.

Notice the discipline. Two cheap index lookups did the heavy narrowing; the expensive model ran on a handful of clips, not on an evening of footage; the arithmetic was explicit; and a person signed off. That shape is not decoration — it is exactly what keeps the system fast, affordable, and defensible.

Figure 3. One investigation, end to end. Cheap index lookups narrow the search; the costly model reads only the shortlist; the arithmetic is explicit; and a human signs off before anything is saved.

The Cascade That Makes It Affordable

The whole system lives or dies on one decision: never run the expensive model on more video than you must. The cascade is how you enforce that, and it is worth seeing the funnel in numbers.

Picture a single camera over one 24-hour day at 15 frames per second. That is 15 × 60 × 60 × 24 = 1,296,000 frames — call it 1.3 million. Stage one, the edge detector, runs on all of them, because it is cheap: a YOLO-class detector processes a frame in a few milliseconds on a Jetson, and most frames contain nothing of interest and are dropped immediately. Suppose 5 percent of frames contain a person or vehicle worth noting — that is about 65,000 frames, grouped by the tracker into perhaps 1,500 distinct events (a person walking through is one event spanning many frames, not hundreds of separate hits).

Stage two, indexing, turns those 1,500 events into embeddings and stores them. This is moderately cheap and happens once, when the event is detected, not when someone searches. Now the day's footage is searchable.

Stage three runs only when a question is asked. The index returns, say, the top 20 candidate clips for the query. Stage four, the vision-language model, reads those 20 clips, not 1.3 million frames. Counting each clip as one item, that is 20 out of 1,296,000, so the model never sees roughly 99.998 percent of the day's footage; even if it samples a few dozen frames from each clip, the skipped share stays above 99.9 percent. That is the entire trick: the costly reader is asked to look at twenty clips, and the question is answered in seconds for a few cents instead of in hours for hundreds of dollars.

Run the comparison out loud. If you naively sent every frame to a frontier vision-language model, even at an optimistic one cent per frame you would spend 1,300,000 × $0.01 = $13,000 per camera per day — absurd, and that is before you multiply by hundreds of cameras. With the cascade, the model touches 20 clips per query; at a few cents per clip, a query costs well under a dollar, and the only always-on cost is the cheap edge detector and the modest indexing. The difference is not a tuning detail; it is the difference between a viable product and a bankrupt one.
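The funnel arithmetic from this section, written out so the numbers can be changed and re-run. The 5 percent hit rate, the one cent per frame, and the 20 clips per query come from the text; the 4 cents per clip is an assumed stand-in for "a few cents". Integer arithmetic in cents is used to avoid floating-point rounding.

```python
FPS = 15
SECONDS_PER_DAY = 60 * 60 * 24

frames_per_day = FPS * SECONDS_PER_DAY            # every frame the detector sees
flagged_frames = frames_per_day * 5 // 100        # 5 percent contain something

# Naive design: the frontier model reads every frame at 1 cent each.
NAIVE_CENTS_PER_FRAME = 1
naive_cost_dollars = frames_per_day * NAIVE_CENTS_PER_FRAME / 100

# Cascade design: the model reads only the shortlist, once per query.
CLIPS_PER_QUERY = 20
CENTS_PER_CLIP = 4                                # assumption: "a few cents"
query_cost_dollars = CLIPS_PER_QUERY * CENTS_PER_CLIP / 100

# How many cascade queries one camera-day of naive reading would pay for.
queries_for_same_money = (frames_per_day * NAIVE_CENTS_PER_FRAME) // (
    CLIPS_PER_QUERY * CENTS_PER_CLIP
)
```

With the exact frame count the naive figure is $12,960 per camera per day, which the text rounds to $13,000. Under these assumptions, that sum would pay for 16,200 cascade queries.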

Figure 4. The cascade funnel. Each stage is cheaper per item and handles more items than the stage below; the expensive reader sees only the shortlist. This single design choice separates a viable product from an unaffordable one.

A Cost Model With The Arithmetic Shown

Pricing this system correctly means understanding that costs come in three shapes, and only one of them grows every second the cameras are on. Walk through a concrete example: a 50-camera site running for a month, with staff asking about 500 investigation queries over that month.

The always-on edge cost is the detector running on every camera around the clock. This is hardware, not a per-frame fee: a Jetson Orin AGX handles roughly 8 to 16 streams, so 50 cameras need between four devices (at 16 streams each) and seven (at 8 streams each). The cost is the one-time hardware plus power: on the order of a few thousand dollars of equipment amortised over years, plus a small monthly electricity bill. Critically, it does not rise when people ask more questions; it tracks camera count, which is known in advance.

The indexing cost is turning detected events into embeddings as they happen. It scales with how much activity the cameras see, not with footage length — an empty corridor at 3 a.m. produces almost no events to index. For a busy 50-camera site this is a steady, modest compute cost, whether you self-host the embedding model or pay a service such as Twelve Labs per minute of indexed activity. Budget it as a known monthly line item in the low hundreds of dollars for a site this size.

The per-query reasoning cost is the vision-language model reading shortlisted clips, and it tracks the number of investigations, not the number of cameras or the hours of footage. Suppose each query sends about 20 clips to the model and each clip costs a few cents to read; that is roughly $0.50 to $1.00 per investigation. For 500 queries in the month: 500 × $0.75 ≈ $375 of reasoning for the month. Self-hosting the VLM on your own GPU replaces this per-query fee with a fixed server cost, which is cheaper at high, steady query volume but more expensive at low volume, especially once you count the engineers who keep the GPU running.

Add the shapes: a few thousand dollars of one-time edge hardware, a couple hundred dollars a month of indexing, and a few hundred dollars a month of reasoning. The shape matters more than the exact figure: a system that ran the smart model continuously on every camera would pay the $13,000-per-camera-per-day nightmare from the previous section, while this one pays cheap-detector hardware plus a few hundred dollars of model time a month. Understand the shape and you price and scale correctly. The full set of cost levers — from self-hosting break-even to model selection and quantization — is the subject of lesson 8.4 on video-AI cost optimization.
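The same three cost shapes as a small calculation for the 50-camera example. The stream capacity range, the $0.75 midpoint per query, and the 500 queries are from the text; the $200 indexing figure is an assumed value for "a couple hundred dollars", and hardware prices are deliberately left out because the text gives no per-device price.

```python
import math

CAMERAS = 50
QUERIES_PER_MONTH = 500

# Shape 1: always-on edge hardware, sized by camera count.
STREAMS_PER_DEVICE_HIGH = 16   # optimistic end of the 8 to 16 range
STREAMS_PER_DEVICE_LOW = 8     # conservative end
devices_min = math.ceil(CAMERAS / STREAMS_PER_DEVICE_HIGH)
devices_max = math.ceil(CAMERAS / STREAMS_PER_DEVICE_LOW)

# Shape 2: indexing, scales with activity. Assumed flat figure for this site.
INDEXING_PER_MONTH = 200.0     # assumption: "a couple hundred dollars"

# Shape 3: per-query reasoning, scales with investigations only.
COST_PER_QUERY = 0.75          # midpoint of the $0.50 to $1.00 range
reasoning_per_month = QUERIES_PER_MONTH * COST_PER_QUERY

recurring_per_month = INDEXING_PER_MONTH + reasoning_per_month
```

Doubling the number of queries doubles only `reasoning_per_month`; doubling the number of cameras changes the device count and indexing but leaves the per-query cost untouched. That separation is the point of pricing by shape.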

Common Mistake: Reading Every Frame With The Smart Model

The failure we are called to fix most often on video-AI products is not a weak model — it is a good model pointed at far too much video. Three versions recur, and all three are decided at architecture time, not in tuning.

The first is running the vision-language model continuously on every camera "so it never misses anything." It is seductive because it is simple, and it is ruinous because the cost scales with frames per second × cameras × seconds of footage, the three largest numbers in the system. The fix is the cascade: a cheap detector and a search index narrow the work, and the smart model reads only a shortlist. If a stakeholder insists the model must watch everything live, that is a different and much more expensive product and, as the section on the law below shows, often an illegal one.

The second is indexing at query time instead of at detection time. A team builds search, then computes embeddings only when a user searches, and every query takes minutes because it re-scans footage. Embeddings must be computed once, when the event is detected, and stored; search then becomes an instant database lookup. Getting this backward turns a sub-second search into a multi-minute one and quietly caps how many users the system can serve.

The third is trusting the model's answer without a human in the loop. Vision-language models are confident even when wrong, and in zero-shot anomaly detection they show a documented bias toward calling things "normal," so they miss real events while sounding sure. A finding that triggers a consequence (a report, an escalation, a person being approached) must be confirmed by a human reviewer. This is not only good engineering; in Europe it is close to a legal requirement for the high-risk uses covered in the legal section below. Build the review step in from the first milestone; bolting it on later means re-plumbing every result path.
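One way to make the review step structural rather than optional is to give every finding a status and let only confirmed findings count. The sketch below is a hypothetical data model, not a prescribed schema; the names `Finding`, `Status`, and `countable` are illustrative.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Status(Enum):
    DRAFT = "draft"
    CONFIRMED = "confirmed"
    REJECTED = "rejected"

@dataclass
class Finding:
    text: str
    clip_ids: list
    status: Status = Status.DRAFT      # every model output starts as a draft
    reviewer: Optional[str] = None

    def review(self, reviewer: str, approve: bool) -> None:
        """Record a human decision; a finding can be reviewed only once."""
        if self.status is not Status.DRAFT:
            raise ValueError("finding already reviewed")
        self.status = Status.CONFIRMED if approve else Status.REJECTED
        self.reviewer = reviewer

def countable(findings):
    """Only human-confirmed findings are allowed to reach the case file."""
    return [f for f in findings if f.status is Status.CONFIRMED]

a = Finding("Bay door open for about 17 minutes", ["clip-007"])
b = Finding("Forklift blocked fire exit 3", ["clip-031"])
a.review("reviewer-1", approve=True)
saved = countable([a, b])
```

Because `b` was never reviewed it stays a draft and is excluded, so an unreviewed model answer cannot reach a report by accident.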

The Build Plan: Five Milestones, Value At Every Step

You do not build this all at once, and you do not build it in the order the diagram is drawn. You build it so that a working product exists after the first milestone and every later milestone ships independently, which keeps the project fundable and the team motivated.

Milestone 1 — search over recorded footage. Stand up the VMS integration, the edge detector, the tracker, the embedding index, and a simple search box. Ship the ability to type "person at the side entrance after 9 p.m." and get back ranked clips. No agent and no vision-language model yet — just fast retrieval. This alone replaces hours of manual scrubbing and is the foundation everything else attaches to. If this is shaky, no amount of agent cleverness will save the product.

Milestone 2 — the reading step. Add the vision-language model as a second stage that reads the top clips a search returns and writes a one-line description of each. The product crosses from "here are some clips that might match" to "here is what is happening in these clips, in words." This is the moment users feel the system understands footage rather than just indexes it.

Milestone 3 — the agent loop. Wrap the search and the reader in an agent that can plan multi-step investigations — find the last truck, then check the door, then measure the duration — instead of answering one-shot queries. This is where the product becomes an investigator rather than a search engine, and it reuses the retrieval and reading stages you already built. The investigator-agent lesson is the blueprint for this milestone.

Milestone 4 — the review and audit layer. Build the human-in-the-loop confirmation view, the case file, and the audit log that records every tool call and every clip. This is the milestone that makes the system defensible — to a customer's security team, to an auditor, and to a regulator. It is also where chain-of-custody features live for any customer who might use a finding in a formal process.

Milestone 5 — the regulated extensions. Anomaly detection, cross-camera tracking of a single person, and any feature that touches identity come last, because they are higher effort, noisier, and carry the heaviest compliance weight. Identity features in particular must be designed against the legal map below before a line of code is written — for some markets the answer is that you do not build them at all.

The five-milestone staircase, the reference stack, the build-versus-buy verdicts, and the go-live and legal checklist are all collected on one page in the downloadable blueprint at the end of this article, so a team can pin it to the wall and work down it.

The Hard Part — What The Law Will Not Let The System Do

A video system that works in a demo is not a product if it is illegal to operate where your customers are. For surveillance built or sold in the European Union, the law is now specific, dated, and strict, and it is the single most important section of this article to read before you write code. This is engineering context, not legal advice; confirm the specifics with counsel for your market.

The governing text is the European Union's AI Act, Regulation (EU) 2024/1689, and it sorts uses of this system into three sharply different boxes.

The first box is outright prohibited. Under Article 5, using AI for real-time remote biometric identification of people in publicly accessible spaces for law-enforcement purposes is banned, with only narrow exceptions (a targeted search for a missing person or victim, prevention of an imminent terrorist threat, or locating a suspect in a serious crime, each subject to authorisation). The Act also prohibits building face databases by scraping the internet or CCTV, and inferring emotions in workplaces and schools. These prohibitions have applied since 2 February 2025. The practical line for your product: a feature that scans live camera feeds in a public space to identify who someone is, by their face, is on the wrong side of the law unless you are a law-enforcement body operating inside a narrow, authorised exception. Most commercial products simply must not do this.

The second box is high-risk but permitted. Analysing recorded footage after the fact to identify a person by biometric features — post-event facial recognition — is classified as high-risk under Annex III, not banned. It is allowed, but only with the full weight of compliance: a documented risk-management system, technical documentation, logging, human oversight, accuracy and robustness testing, and registration in the EU database. These high-risk obligations apply from 2 August 2026. If your roadmap includes identifying named individuals from archived video, you are building a high-risk system and must budget for that compliance from the start, not as an afterthought.

The third box is everything our investigation system does by default, and it is deliberately the lightest. Searching footage for described events and objects — "a red truck," "a person who fell," "a door left open" — is not biometric identification, because it never tries to determine who a person is. It finds what happened, not who did it. This is why the system in this article is scoped to description-based search and reads footage with a vision-language model rather than matching faces against a watchlist: that scope keeps the core product out of the prohibited box and out of the heaviest high-risk obligations. The moment a customer asks you to add "and tell me their name from our employee photos," you have crossed from the third box into the second, and the compliance bill changes entirely.
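Some teams find it useful to encode the three boxes as a triage check in design review, so that a feature request is sorted before anyone estimates it. The sketch below mirrors only this article's simplified reading of the three boxes; it is a hypothetical engineering aid, not a legal determination. It deliberately ignores the law-enforcement exceptions and the other Article 5 prohibitions (such as emotion inference), and both input flags are assumptions about how a team might describe a feature.

```python
from enum import Enum

class Box(Enum):
    PROHIBITED = "box 1: treat as prohibited, do not build without counsel"
    HIGH_RISK = "box 2: high-risk, budget for full compliance"
    DESCRIPTION_SEARCH = "box 3: description-based search, lightest box"

def triage_feature(identifies_person_biometrically: bool,
                   live_feed_in_public_space: bool) -> Box:
    """Sort a proposed feature into the article's three boxes (simplified)."""
    if identifies_person_biometrically and live_feed_in_public_space:
        return Box.PROHIBITED
    if identifies_person_biometrically:
        # Covers after-the-fact identification from recorded footage.
        return Box.HIGH_RISK
    return Box.DESCRIPTION_SEARCH

door_search = triage_feature(False, False)      # "a door left open"
name_lookup = triage_feature(True, False)       # "tell me their name" on archive
live_face_id = triage_feature(True, True)       # live face ID in a public space
```

The check is intentionally conservative: anything that tries to determine who a person is leaves the third box, and the result is a prompt to involve counsel, not a substitute for doing so.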

Two transparency duties round out the picture. Article 50 of the same Regulation, applicable from 2 August 2026, requires that AI-generated or AI-altered media be disclosed as such, which is relevant if your product ever produces a synthetic reconstruction. And the human-in-the-loop review from Milestone 4 is not merely good practice for high-risk uses; meaningful human oversight is one of the law's explicit requirements. The full regulatory picture, including the biometric rules and Article 50, is the subject of lesson 8.5 on EU AI Act engineering.

Figure 5. The legal map. Description-based search stays in the lightest box; live face identification in public is prohibited; after-the-fact face identification is high-risk. Scope the product into the lightest box on purpose, and confirm with counsel for each market.

Production Concerns: Chain Of Custody, Observability, And Security

Three cross-cutting concerns separate a prototype from something a security team will buy and trust.

Chain of custody is the discipline of being able to prove, later, that a clip is genuine and unaltered and to show exactly how a finding was reached. The audit log from Milestone 4 is the backbone: every tool call, every retrieved clip, every model answer, time-stamped and immutable. For any customer who might use a finding in an insurance claim, an HR process, or a court, this is not optional polish — it is the feature that makes the output usable at all. Store original clips alongside any derived findings, and never overwrite the source.
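One common way to make a log tamper-evident is to chain entries by hash, so that editing any earlier entry invalidates every hash after it. The sketch below uses only Python's standard library; the entry layout and the function names are hypothetical. Note that this makes tampering detectable rather than impossible: a production system would also need write-once storage and trusted timestamps, which are not shown.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry

def _digest(event, prev_hash):
    body = json.dumps({"event": event, "prev": prev_hash}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()

def append_entry(log, event):
    """Append an event whose hash covers the event and the previous hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "prev": prev_hash, "hash": _digest(event, prev_hash)})

def verify(log):
    """Recompute the chain from the start; any edited entry breaks it."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash or entry["hash"] != _digest(entry["event"], prev_hash):
            return False
        prev_hash = entry["hash"]
    return True

log = []
append_entry(log, {"tool": "search_index", "query": "truck at the loading bay"})
append_entry(log, {"tool": "read_clips", "clips": ["clip-007"]})

intact = verify(log)
log[0]["event"]["query"] = "edited later"   # simulate after-the-fact tampering
after_tampering = verify(log)
```

Storing a hash of each original clip in the same chain would extend the guarantee from "how the finding was reached" to "which exact footage it was reached from".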

Observability means you can see why an investigation went wrong after it ended. Video-AI systems fail in ways ordinary web apps do not — a camera that drifted out of focus, an edge device that fell behind and dropped frames, a model that returned a confident wrong answer. You want per-investigation traces (which tools ran, how long each took, how many clips were read, what the model said) collected centrally, so support can answer "why did the system miss the event I know happened?" without guessing. Build this in from Milestone 1; retrofitting telemetry into a live pipeline is painful.
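A per-investigation trace can start as something very small. The sketch below is a hypothetical in-process recorder built on the standard library; a real deployment would more likely export spans to a tracing backend, and the `Trace` class and its field names are assumptions made for illustration.

```python
import time
from contextlib import contextmanager

class Trace:
    """Collects one timed span per tool call for a single investigation."""

    def __init__(self, investigation_id):
        self.investigation_id = investigation_id
        self.spans = []

    @contextmanager
    def span(self, tool, **attrs):
        start = time.perf_counter()
        try:
            yield attrs            # the caller can attach results to the span
        finally:
            elapsed = time.perf_counter() - start
            self.spans.append({"tool": tool, "seconds": elapsed, **attrs})

trace = Trace("inv-001")

with trace.span("search_index", query="loading bay door, open") as s:
    s["candidates"] = 20           # stand-in for the real search call

with trace.span("read_clips") as s:
    s["clips_read"] = 20           # stand-in for the real VLM call

tools_run = [sp["tool"] for sp in trace.spans]
```

Because the span is recorded in a `finally` block, a tool call that raises still leaves a timing record, which is often exactly the investigation support needs to look at.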

Security and data minimisation start from the fact that footage is among the most sensitive data a company holds. Encrypt clips at rest and in transit, enforce strict access controls on who can run investigations and see results, and keep footage no longer than the retention policy and the law require — minimising what you store is both a security posture and, under data-protection law, an obligation. For customers in regulated sectors, self-hosting the whole pipeline so that footage never leaves their infrastructure is often the deciding factor, and this architecture supports it: every component here can run on the customer's own hardware.
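Retention is easiest to enforce when it is a function that runs on a schedule. The following is a minimal sketch: the 30-day window, the clip IDs, and the idea of a `held` set (clips referenced by an open case file, which the chain-of-custody discipline says must not be lost) are all assumptions for illustration, and the actual retention period is a policy and legal decision.

```python
from datetime import datetime, timedelta

def expired_clips(recorded_at, now, retention_days, held=frozenset()):
    """Return IDs of clips past the retention window and not on hold."""
    cutoff = now - timedelta(days=retention_days)
    return sorted(
        clip_id
        for clip_id, ts in recorded_at.items()
        if ts < cutoff and clip_id not in held
    )

now = datetime(2026, 3, 31, 12, 0)
recorded_at = {
    "clip-a": datetime(2026, 2, 1, 9, 0),    # older than the window
    "clip-b": datetime(2026, 2, 15, 9, 0),   # older, but attached to a case
    "clip-c": datetime(2026, 3, 20, 9, 0),   # still inside the window
}

to_delete = expired_clips(recorded_at, now, retention_days=30, held={"clip-b"})
```

Returning the list rather than deleting inside the function lets the deletion itself be written to the audit log before it happens.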

Where Fora Soft Fits In

Fora Soft has built video software since 2005, and computer-vision systems for video surveillance are one of the verticals we ship, alongside video conferencing, streaming, e-learning, and telemedicine. The system described here — a mature VMS underneath, an edge detector running on everything, an embedding index that makes footage searchable by description, an agent that calls a vision-language model only on a shortlist, and a human reviewer who signs off — is the backbone of the surveillance and intelligent-video-analytics work we scope. The build order and the build-versus-buy verdicts in this article are not theory for us; they are the checklist we apply, because they are the difference between an investigation tool that answers in seconds for cents and one that either misses events or runs an unaffordable bill. The legal map is part of that checklist too: we scope products into the lightest-touch box on purpose, because a feature that is brilliant and banned ships to no one. Our work here lives in video surveillance and intelligent video analytics, where real-time computer vision on recorded footage is the core of the product rather than a decoration.

Call to action

Talk to a video engineer: book a 30-minute scoping call to walk through your plan for an AI video surveillance investigation system.
See our case studies — 250+ shipped projects across video streaming, WebRTC, OTT, telemedicine, e-learning, surveillance, and AR/VR.
Download the Surveillance Investigation System Build Blueprint — One-page reference for assembling an AI video investigation system: the retrieval-cascade + agent reference stack, the build-vs-buy verdict per component, the cascade cost arithmetic, the five-milestone build order, the EU AI Act legal….

References

European Union — Regulation (EU) 2024/1689 (AI Act), Article 5 — Prohibited AI practices (real-time remote biometric identification in publicly accessible spaces for law enforcement is prohibited with narrow exceptions; prohibitions apply from 2 February 2025). https://eur-lex.europa.eu/eli/reg/2024/1689/oj
European Union — Regulation (EU) 2024/1689 (AI Act), Annex III & Article 6 — High-risk classification (post-remote biometric identification is high-risk; obligations apply from 2 August 2026). https://artificialintelligenceact.eu/annex/3/
European Union — Regulation (EU) 2024/1689 (AI Act), Article 50 — Transparency obligations (disclosure of AI-generated or manipulated media; applies from 2 August 2026). https://eur-lex.europa.eu/eli/reg/2024/1689/oj
Bai et al. — Qwen2.5-VL Technical Report (arXiv:2502.13923, February 2025; native video understanding, dynamic resolution, absolute time encoding, second-level event localization in long video). https://arxiv.org/abs/2502.13923
Qwen Team, Alibaba — Qwen3-VL Technical Report (arXiv:2511.21631, November 2025; 256K-token interleaved context across text, image, and video; dense 2B–32B and MoE variants). https://arxiv.org/abs/2511.21631
Meta AI — SAM 3: Segment Anything with Concepts (arXiv:2511.16719, released 19 November 2025; concept-prompted detect-segment-track; SAM 2-style masklet propagation in video). https://arxiv.org/abs/2511.16719
Ultralytics — YOLO model comparison and production guidance (YOLO11 / YOLO26 recommended for production; YOLO12/YOLO13 not recommended due to training instability, memory use, and slower CPU inference). https://docs.ultralytics.com/compare/
NVIDIA — DeepStream SDK documentation (multi-stream video-analytics pipeline on Jetson and data-centre GPUs; Orin AGX supports up to 8–16 streams). https://docs.nvidia.com/metropolis/deepstream/dev-guide/
Twelve Labs — Marengo 3.0: multimodal video embeddings (natural-language retrieval over long-form video; the managed embedding-and-search option, 2026). https://www.twelvelabs.io/blog/marengo-3-0
Zhou et al. — BN-WVAD: weakly-supervised video anomaly detection (state-of-the-art public results: AUC 87.24% on UCF-Crime, AP 84.93% on XD-Violence; ~30–50 ms on RTX 4090 / Jetson Orin AGX). https://arxiv.org/abs/2311.15367
Cao et al. — AnyAnomaly: Zero-Shot Customizable Video Anomaly Detection with LVLM (WACV 2026; VLM-prompted anomaly detection generalizes across datasets). https://arxiv.org/abs/2503.04504
Brivo — Introducing Eeva, an AI video agent driven by natural-language prompts (17 March 2026; commercial natural-language video monitoring on the Brivo/Eagle Eye VMS). https://www.brivo.com/introducing-eeva/