Published 2026-06-04 · 34 min read · By Nikolay Sapunov, CEO at Fora Soft

Why This Matters

This article is for the founder or product manager who has decided to build an AI video investigation product — for physical security, retail loss prevention, manufacturing safety, or smart buildings — and now needs to know what the real thing costs, how long it takes, which parts are bought versus built, and where the law draws a hard line. It is equally for the engineer who has read the individual computer-vision lessons and wants them welded into one deployable system with named technologies and numbers. It assumes you have met the underlying ideas already, because a capstone assembles rather than re-derives; the cross-links point back to each foundational lesson when you need the detail. By the end you will be able to draw the production architecture on a whiteboard, name the exact 2026 technology in every box, defend the cost per investigation to a finance team, sequence the build so a first version ships in weeks, and tell a lawful design from one that will be fined out of the European market.

What You Are Building, Stated Precisely

Fix the product before any technology. You are building a system that sits on top of footage your cameras have already recorded and answers questions about it on demand. A user types a question in plain English — "did a forklift block fire exit 3 during yesterday's late shift?" — and the system returns the clips that answer it, each with a timestamp and a one-paragraph written explanation of what it found. It is not a robot watching a wall of monitors and sounding alarms; it is an analyst-in-software that does the looking so a human does not have to scrub through hours of video frame by frame.

Three capabilities make that possible, and the whole rest of this article is about wiring them together. The system must find events of interest cheaply across enormous amounts of footage, retrieve the right few clips when asked a question described in words, and read those clips closely enough to write a trustworthy answer. A useful mental model is a law firm's discovery process: paralegals first cull millions of documents down to a relevant box, and only then does an expensive senior lawyer read the box closely. Reverse that order — put the senior lawyer on every page — and the bill is astronomical and the work never finishes. The same economics govern video, and getting the order right is the entire engineering problem.

"Investigation" is the word that shapes every later decision, and it has a precise meaning here. This system works on recorded footage after the fact, answering questions about what already happened. That is a deliberate scope choice, not a limitation, and it matters enormously for the law: as we will see, reading old footage to find a described event is a very different legal animal from scanning live faces in a public square, and conflating the two is how products get banned.

The Spine: A Retrieval Cascade And The Agent Pattern

Two ideas carry the entire build. Get them right and everything else is detail.

The first is the retrieval cascade: the discovery-box idea above, made concrete. Raw footage is enormous: a single 1080p camera at 15 frames per second produces about 1.3 million frames a day, and a 200-camera site produces over 250 million frames a day. No system can afford to run a smart, slow model on every one of those frames. So you build a funnel. A fast, cheap detector watches every frame and flags only the ones with something in them, such as a person, a vehicle, or a left-behind bag. A search index then turns those flagged moments into a form you can query by description. When a question arrives, the index returns a shortlist of candidate clips, and only that shortlist (a few dozen clips, not a few hundred million frames) reaches the expensive model that reads video closely. Each stage is cheaper per item and handles more items than the stage after it; each stage hands the next one far less work. This is the single most important design decision in the system, and we return to its arithmetic below.

The second idea is the AI agent as the orchestrator. Rather than hard-coding one fixed sequence of steps, you write a small program that plans the investigation the way a junior analyst would: it reads the question, decides which tools to call and in what order, calls them, looks at what comes back, and decides whether it has enough to answer or needs to look again. The tools on its belt are exactly the cascade's stages — the detector, the search index, a tracker that follows one person across cameras, and the video reader. This is the agent loop from the Phase 7 lessons applied to a tool belt of computer-vision components; the lesson on the agent loop for video is the conceptual foundation, and the investigator-agent lesson walks the loop in detail. Here we take the loop as settled and build the deployable system around it.

Hold the two ideas together and the platform has a clean shape. The cascade decides what work is cheap enough to run on everything versus what work is expensive enough to run only on a shortlist. The agent decides which tools to call for this particular question. Everything in the rest of this article is filling in the boxes, deciding what you build versus buy, and pricing it.

The Production Architecture, Box By Box

A real deployment is more than a detector and a model. Seven kinds of component show up in every video-investigation system we have scoped, and naming them precisely is the first hour of any project.

The camera and ingest layer is where footage enters. Cameras stream video — almost always using a protocol called RTSP, which is just the standard way a camera publishes a live feed — into a recorder. You do not write this; you receive streams from hardware that already exists on the customer's site.

The video management system, universally abbreviated VMS, is the software that records those streams, stores them, and manages retention — how many days of footage you keep before deleting it. This is mature, boring, essential infrastructure with established vendors (Milestone, Genetec, and the cloud-native Eagle Eye Networks among them). You integrate with a VMS; you do not rebuild one.

The edge inference layer is the fast, cheap detector running close to the cameras. "Edge" means on a small computer at the customer's site rather than in a distant data centre, which matters because sending every camera's full video to the cloud is expensive and slow. In 2026 this layer is typically a detector such as YOLO11 running on an NVIDIA Jetson device through the DeepStream pipeline, a toolkit built precisely for running detection on many camera streams at once. A Jetson AGX Orin handles roughly 8 to 16 camera streams; you size the hardware to the camera count.

The tracking and event layer turns raw detections into things worth indexing. A detector says "there is a person in this frame"; a tracker stitches those per-frame boxes into "this is one person, who appeared at 22:04 and left at 22:09." Multi-object trackers like ByteTrack do this stitching, and a video segmenter like SAM 2 can cut the exact shape of an object out across time when an investigation needs the precise outline rather than a rough box. An optional anomaly detector sits here too, flagging unusual motion patterns without being told what to look for.
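The stitching step can be sketched in a few lines. This is a minimal illustration, assuming a tracker has already attached a track ID to every per-frame detection; the `Detection` and `Event` types and the `group_into_events` helper are hypothetical names for this sketch, not part of ByteTrack or any other tracker's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One per-frame detection that a tracker has already labelled with a track ID."""
    track_id: int
    label: str
    frame_index: int


@dataclass(frozen=True)
class Event:
    """One tracked object, summarised as a time span instead of per-frame boxes."""
    track_id: int
    label: str
    start_s: float
    end_s: float
    frame_count: int


def group_into_events(detections: list[Detection], fps: float) -> list[Event]:
    """Collapse per-frame detections into one event per track ID."""
    by_track: dict[int, list[Detection]] = {}
    for det in detections:
        by_track.setdefault(det.track_id, []).append(det)

    events = []
    for track_id, dets in by_track.items():
        frames = [d.frame_index for d in dets]
        events.append(Event(
            track_id=track_id,
            label=dets[0].label,
            start_s=min(frames) / fps,   # first frame the object was seen in
            end_s=max(frames) / fps,     # last frame the object was seen in
            frame_count=len(frames),
        ))
    return sorted(events, key=lambda e: e.start_s)


# Illustrative data: a person seen in frames 0-44, a vehicle in frames 30-59.
detections = (
    [Detection(1, "person", i) for i in range(0, 45)]
    + [Detection(2, "vehicle", i) for i in range(30, 60)]
)
events = group_into_events(detections, fps=15.0)
for event in events:
    print(event)
```

Seventy-five detections become two events, and it is events like these, not raw frames, that the index stores.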

The search index is the system's memory and the heart of natural-language search. As events are detected, the system converts each clip into a list of numbers — an embedding — that captures its meaning, the way a fingerprint captures an identity. These embeddings live in a vector database, a store built to answer "find me the clips most similar to this description." When a user types "a person in a red jacket near the entrance," that text is turned into the same kind of embedding and the database returns the closest matches. This is the engine behind video retrieval-augmented generation, covered in lesson 4.6.
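A toy version of that lookup, assuming cosine similarity as the closeness measure (a common choice, though the article does not name one). The three-dimensional vectors and clip IDs below are made up for illustration; in a real system the clip vectors and the query vector would come from the same embedding model and have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Similarity of two vectors by angle: 1.0 means they point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: list[float], index: dict[str, list[float]], k: int) -> list[tuple[str, float]]:
    """Return the k clips whose embeddings are closest to the query embedding."""
    scored = [(clip_id, cosine_similarity(query, vec)) for clip_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical stored clip embeddings, computed when each event was detected.
clip_index = {
    "cam03_2204": [0.9, 0.1, 0.0],
    "cam07_0915": [0.1, 0.9, 0.1],
    "cam03_2231": [0.8, 0.2, 0.1],
}

# Hypothetical embedding of the user's text query.
query_embedding = [1.0, 0.0, 0.0]

results = top_k(query_embedding, clip_index, k=2)
for clip_id, score in results:
    print(f"{clip_id}: {score:.3f}")
```

A vector database does the same ranking, but with an approximate nearest-neighbour structure so that it stays fast across millions of stored events.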

The reasoning layer is the expensive video reader: a vision-language model, or VLM — a model that takes both video frames and a text question and answers in words. This is where Qwen-VL or a comparable model reads the shortlist of candidate clips, confirms which ones actually answer the question, and writes the finding. It is the most capable and most costly component, which is exactly why the cascade works so hard to send it only a handful of clips. Lesson 4.4 on open vision-language models covers the model choices in depth.

The application layer is everything a human touches: the search box, the results view with clips and timestamps, the audit log, and the case file where findings are saved. This is ordinary web-application territory — a database, an API, object storage for clips and reports — and it is deliberately boring, because the hard problems are solved elsewhere. Crucially, it is also where a human reviewer confirms or rejects every finding before it counts as a result.

Production deployment diagram of an AI video investigation system, left to right. On the left, site cameras stream over RTSP into a video management system that records and retains footage. An edge inference layer of Jetson devices runs a fast detector on every stream. Detections feed a tracking and event layer with a multi-object tracker, a video segmenter, and an optional anomaly detector. Events are turned into embeddings and stored in a vector database that serves as the search index. In the centre, an agent orchestrator plans each investigation and calls tools: the detector, the tracker, the search index, and a vision-language model that reads candidate clips. On the right, an application layer presents results, an audit log, and a case file to a human reviewer who signs off on every finding.

Figure 1. The production deployment. A cheap edge detector runs on everything; an embedding index makes footage searchable by description; an agent calls a costly vision-language model only on the shortlist; and a human signs off before any finding counts.

Build Versus Buy: The 2026 Verdict, Component By Component

A capable team does not write all of this from scratch, and does not buy all of it either. The line in 2026 sits in a fairly stable place, and getting it right is the difference between shipping in a quarter and burning a year. The rule of thumb mirrors the conferencing capstone's: adopt the mature infrastructure, buy or adopt the fast-moving models, and build only the part that is your actual product — the investigation experience and the safeguards around it.

| Component | Build or buy | Concrete 2026 choice | Why |
|---|---|---|---|
| Video management (VMS) | Integrate | Milestone, Genetec, or Eagle Eye Networks | Recording and retention are a solved, regulated domain; never rebuild a VMS |
| Edge detector | Adopt open model | YOLO11 on Jetson via DeepStream | Mature, fast, well-supported; Ultralytics advises against v12/v13 for production |
| Tracker / segmenter | Adopt open source | ByteTrack; SAM 2 / SAM 3 for masks | Standard, free, and battle-tested across the industry |
| Anomaly detector | Adopt or skip first | BN-WVAD-class model, or a VLM-prompt detector | Useful but noisy; add after search works, not before |
| Embedding + search index | Adopt or buy | Open embeddings + a vector DB, or Twelve Labs Marengo | The retrieval engine; buy the managed model if volume is modest |
| Vision-language reader | Buy or self-host | Qwen-VL family (self-host) or a frontier VLM API | Models improve monthly; self-host only at large, steady volume |
| Agent orchestration | Build on a framework | LangGraph or a comparable agent framework | The control logic is your product; the framework is plumbing |
| App, audit, review UI | Build | Your stack | This is your product: the experience, the case file, and the guardrails |

Two cells deserve a note because they changed in 2026. The detector choice is no longer "newest is best": the YOLO family kept releasing — v12 and v13 arrived through 2025 — but the maintainer, Ultralytics, explicitly recommends the older YOLO11 (or the streamlined YOLO26) for production, because the attention-heavy v12 and v13 bring training instability, higher memory use, and slower processing on ordinary processors for only marginal accuracy. A capstone that reaches for the highest version number ships a slower, more fragile detector; the production-correct choice is the stable one. The embedding-and-search cell now has a credible buy option that did not exist a few years ago: managed video-embedding services such as Twelve Labs Marengo (version 3.0 as of early 2026) turn footage into searchable embeddings through an API, so a small team can have natural-language search without standing up its own model. The per-model reasoning for the detector and tracker is the subject of lesson 2.2 on the YOLO lineage and lesson 2.4 on SAM 2 for video.

Component build-versus-buy matrix for the video investigation system. Eight rows list the video management system, edge detector, tracker and segmenter, anomaly detector, embedding and search index, vision-language reader, agent orchestration, and the application and review layer. For each, a colored tag marks whether you integrate, adopt open source, buy a managed service, or build it yourself, with the concrete 2026 technology named and a one-line reason. Integrate and adopt decisions dominate the infrastructure rows; build is reserved for the agent logic and the product layer.

Figure 2. What to build and what to adopt. The pattern is consistent with every system in this course: integrate the mature infrastructure, adopt or buy the models, build only the agent logic and the experience that differentiate you.

Following One Investigation From Question To Clip

Numbers and boxes become concrete when you trace a single question through the system. Follow one investigation: a warehouse operations lead asks, "Was the loading bay door left open for more than ten minutes after the last truck left yesterday evening?"

The question reaches the agent orchestrator, which does what a junior analyst would: it breaks the request into parts it can act on. It needs to know when the last truck left, then whether the door stayed open afterward, then for how long. It does not read any video yet — reading video is the expensive step, and a good agent earns its keep by postponing it.

First the agent queries the search index for "truck at the loading bay" within yesterday evening's footage. Because every detected event was already turned into an embedding when it happened, this is a fast database lookup, not a video scan; it returns a handful of candidate clips ranked by how well they match. The agent reads the timestamps and finds the last truck departure at 21:47.

Next the agent narrows the window — 21:47 onward — and asks the index for "loading bay door, open." Again, a fast lookup over pre-computed embeddings returns a shortlist. Now, and only now, the agent spends real money: it sends those few candidate clips to the vision-language model, with the precise question, "Is the bay door open in this clip, and is the area otherwise clear?" The model reads the frames and answers in words for each clip, with the times the door is visibly open. The agent assembles those into a span — open from 21:49 to 22:06 — does the arithmetic (22:06 − 21:49 = 17 minutes, which exceeds the ten-minute threshold), and drafts a finding: "The loading bay door remained open for approximately 17 minutes after the last truck departed at 21:47, from 21:49 to 22:06."
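The span assembly and the arithmetic can be made explicit in code. This is a sketch under stated assumptions: the date, the clip boundaries, and the shape of the reader's per-clip answers are all hypothetical, and the open clips are assumed to be contiguous so that the earliest start and latest end form one span.

```python
from datetime import datetime, timedelta

DAY = "2026-06-03"  # illustrative date, not from the article


def at(hhmm: str) -> datetime:
    """Build a timestamp on the illustrative day from an HH:MM string."""
    return datetime.strptime(f"{DAY} {hhmm}", "%Y-%m-%d %H:%M")


# Hypothetical per-clip answers from the reader: (clip start, clip end, door open?)
clip_answers = [
    (at("21:49"), at("21:55"), True),
    (at("21:55"), at("22:01"), True),
    (at("22:01"), at("22:06"), True),
    (at("22:06"), at("22:12"), False),
]

last_truck_departure = at("21:47")
threshold = timedelta(minutes=10)

# Keep only the clips where the reader said the door was open.
open_clips = [(start, end) for start, end, is_open in clip_answers if is_open]

# Assumes the open clips are contiguous, so they merge into a single span.
span_start = min(start for start, _ in open_clips)
span_end = max(end for _, end in open_clips)

duration = span_end - span_start
minutes = int(duration.total_seconds() // 60)
exceeds = duration > threshold

print(
    f"Door open {span_start:%H:%M} to {span_end:%H:%M} ({minutes} min) "
    f"after last truck at {last_truck_departure:%H:%M}; exceeds threshold: {exceeds}"
)
```

Doing the subtraction in code rather than asking the model to do it keeps the number reproducible and easy to show in the audit trail.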

The draft does not go straight to the user as fact. It lands in the review view, where a human sees the finding, the supporting clips, and the timestamps, and confirms or corrects it before it is saved to the case file. Every step — every tool call, every clip retrieved, every model answer — is written to the audit log, so that weeks later anyone can reconstruct exactly how the system reached its conclusion.

Notice the discipline. Two cheap index lookups did the heavy narrowing; the expensive model ran on a handful of clips, not on an evening of footage; the arithmetic was explicit; and a person signed off. That shape is not decoration — it is exactly what keeps the system fast, affordable, and defensible.

Numbered question-and-data flow for one investigation. Step one, a user asks a plain-language question. Step two, the agent orchestrator plans the investigation into sub-questions. Step three, the agent queries the embedding search index for the first event and gets a ranked shortlist from pre-computed embeddings. Step four, the agent narrows the time window and queries the index again. Step five, only the shortlisted candidate clips are sent to the vision-language model, which reads them and answers in words. Step six, the agent assembles a finding with timestamps and shown arithmetic. Step seven, a human reviewer confirms or corrects the finding. Step eight, the result and the full tool-call trace are written to the case file and audit log.

Figure 3. One investigation, end to end. Cheap index lookups narrow the search; the costly model reads only the shortlist; the arithmetic is explicit; and a human signs off before anything is saved.

The Cascade That Makes It Affordable

The whole system lives or dies on one decision: never run the expensive model on more video than you must. The cascade is how you enforce that, and it is worth seeing the funnel in numbers.

Picture a single camera over one 24-hour day at 15 frames per second. That is 15 × 60 × 60 × 24 = 1,296,000 frames — call it 1.3 million. Stage one, the edge detector, runs on all of them, because it is cheap: a YOLO-class detector processes a frame in a few milliseconds on a Jetson, and most frames contain nothing of interest and are dropped immediately. Suppose 5 percent of frames contain a person or vehicle worth noting — that is about 65,000 frames, grouped by the tracker into perhaps 1,500 distinct events (a person walking through is one event spanning many frames, not hundreds of separate hits).

Stage two, indexing, turns those 1,500 events into embeddings and stores them. This is moderately cheap and happens once, when the event is detected, not when someone searches. Now the day's footage is searchable.

Stage three runs only when a question is asked. The index returns, say, the top 20 candidate clips for the query. Stage four, the vision-language model, reads those 20 clips, not 1.3 million frames. Treating each clip as a single item, that is 20 out of 1.3 million: the model never sees roughly 99.998 percent of the day's frames, and even with several frames sampled per clip the share stays above 99.9 percent. That is the entire trick: the costly reader is asked to look at twenty clips, and the question is answered in seconds for a few cents instead of in hours for hundreds of dollars.
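The funnel above, written out as arithmetic. The 5 percent hit rate, the 1,500 events, and the 20-clip shortlist are the article's illustrative figures, not measurements; the 200-camera multiplier is the site size used earlier in the article.

```python
FPS = 15
SECONDS_PER_DAY = 60 * 60 * 24

frames_per_camera_day = FPS * SECONDS_PER_DAY        # every frame the detector sees
frames_per_site_day = frames_per_camera_day * 200    # the 200-camera site from earlier
flagged_frames = frames_per_camera_day * 5 // 100    # 5 percent contain a person or vehicle
tracked_events = 1_500                               # illustrative: tracker groups frames into events
shortlist_clips = 20                                 # illustrative: top results for one query

funnel = [
    ("raw frames (1 camera, 1 day)", frames_per_camera_day),
    ("frames flagged by detector", flagged_frames),
    ("tracked events indexed", tracked_events),
    ("clips read by the VLM", shortlist_clips),
]

for stage, count in funnel:
    print(f"{stage:<32}{count:>12,}")

print(f"{'raw frames (200 cameras, 1 day)':<32}{frames_per_site_day:>12,}")
```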

Run the comparison out loud. If you naively sent every frame to a frontier vision-language model, even at an optimistic one cent per frame you would spend 1,300,000 × $0.01 = $13,000 per camera per day — absurd, and that is before you multiply by hundreds of cameras. With the cascade, the model touches 20 clips per query; at a few cents per clip, a query costs well under a dollar, and the only always-on cost is the cheap edge detector and the modest indexing. The difference is not a tuning detail; it is the difference between a viable product and a bankrupt one.

Cascade cost funnel for one camera over one day. The widest band at the top shows about 1.3 million raw frames per camera per day at 15 frames per second. An arrow narrows to the edge detector stage, which runs on every frame cheaply and passes roughly 65,000 frames with something in them, grouped into about 1,500 tracked events. A further narrowing shows those events turned into embeddings in the search index. At query time, the index returns about 20 candidate clips, and only those reach the vision-language model at the narrow bottom of the funnel. A side panel contrasts two costs: sending every frame to a frontier model at one cent per frame would cost about 13,000 dollars per camera per day, while the cascade sends about 20 clips per query for well under one dollar.

Figure 4. The cascade funnel. Each stage is cheaper per item and handles more items than the stage below; the expensive reader sees only the shortlist. This single design choice separates a viable product from an unaffordable one.

A Cost Model With The Arithmetic Shown

Pricing this system correctly means understanding that costs come in three shapes, and only one of them grows every second the cameras are on. Walk through a concrete example: a 50-camera site running for a month, with staff asking about 500 investigation queries over that month.

The always-on edge cost is the detector running on every camera around the clock. This is hardware, not a per-frame fee: a Jetson AGX Orin handles roughly 8 to 16 streams, so 50 cameras need about four such devices at 16 streams each and up to seven at 8 streams each. The cost is the one-time hardware plus power: on the order of a few thousand dollars of equipment amortised over years, plus a small monthly electricity bill. Critically, it does not rise when people ask more questions; it tracks camera count, which is known in advance.
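Sizing the edge layer is a ceiling division over the streams-per-device range. A minimal sketch; `devices_needed` is a hypothetical helper, and the real streams-per-device figure depends on resolution, frame rate, and model size, so it should be measured on the target hardware.

```python
import math


def devices_needed(cameras: int, streams_per_device: int) -> int:
    """Round up: a partly used device is still a whole device to buy."""
    return math.ceil(cameras / streams_per_device)


CAMERAS = 50

best_case = devices_needed(CAMERAS, streams_per_device=16)   # each device carries 16 streams
worst_case = devices_needed(CAMERAS, streams_per_device=8)   # each device carries only 8 streams

print(f"{CAMERAS} cameras need between {best_case} and {worst_case} edge devices")
```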

The indexing cost is turning detected events into embeddings as they happen. It scales with how much activity the cameras see, not with footage length — an empty corridor at 3 a.m. produces almost no events to index. For a busy 50-camera site this is a steady, modest compute cost, whether you self-host the embedding model or pay a service such as Twelve Labs per minute of indexed activity. Budget it as a known monthly line item in the low hundreds of dollars for a site this size.

The per-query reasoning cost is the vision-language model reading shortlisted clips, and it tracks the number of investigations, not the number of cameras or the hours of footage. Suppose each query sends about 20 clips to the model and each clip costs a few cents to read; that is roughly $0.50 to $1.00 per investigation. Taking the $0.75 midpoint for 500 queries in the month: 500 × $0.75 = $375 of reasoning for the month. Self-hosting the VLM on your own GPU replaces this with server cost, which is cheaper at high, steady query volume but more expensive at lower volume once you count the engineers who keep the GPU running.

Add the shapes: a few thousand dollars of one-time edge hardware, a couple hundred dollars a month of indexing, and a few hundred dollars a month of reasoning. The shape matters more than the exact figure: a system that ran the smart model continuously on every camera would pay the $13,000-per-camera-per-day nightmare from the previous section, while this one pays cheap-detector hardware plus a few hundred dollars of model time a month. Understand the shape and you price and scale correctly. The full set of cost levers — from self-hosting break-even to model selection and quantization — is the subject of lesson 8.4 on video-AI cost optimization.
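The recurring part of that model fits in a small class. This is a sketch, not a pricing tool: `SiteCostModel` is a hypothetical name, the $200 indexing figure is an assumed stand-in for "a couple hundred dollars," the $0.75 per query is the midpoint used above, and one-time edge hardware is left out because it does not recur monthly.

```python
from dataclasses import dataclass


@dataclass
class SiteCostModel:
    cameras: int
    queries_per_month: int
    indexing_per_month: float = 200.0   # assumption: "a couple hundred dollars"
    cost_per_query: float = 0.75        # midpoint of the $0.50 to $1.00 range

    def reasoning_per_month(self) -> float:
        """Grows with the number of investigations, not with camera count."""
        return self.queries_per_month * self.cost_per_query

    def recurring_per_month(self) -> float:
        """Indexing plus reasoning; one-time edge hardware is excluded."""
        return self.indexing_per_month + self.reasoning_per_month()


def naive_per_month(cameras: int, days: int = 30, dollars_per_camera_day: int = 13_000) -> int:
    """The no-cascade alternative: the smart model reads every frame of every camera."""
    return cameras * days * dollars_per_camera_day


site = SiteCostModel(cameras=50, queries_per_month=500)
busy_site = SiteCostModel(cameras=50, queries_per_month=1_000)

print(f"reasoning, 500 queries:   ${site.reasoning_per_month():,.2f}")
print(f"recurring, 500 queries:   ${site.recurring_per_month():,.2f}")
print(f"recurring, 1,000 queries: ${busy_site.recurring_per_month():,.2f}")
print(f"naive, every frame:       ${naive_per_month(50):,}")
```

Doubling the query volume moves only the reasoning line, which is the point of separating the three cost shapes.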

Common Mistake: Reading Every Frame With The Smart Model

The failure we are called to fix most often on video-AI products is not a weak model — it is a good model pointed at far too much video. Three versions recur, and all three are decided at architecture time, not in tuning.

The first is running the vision-language model continuously on every camera "so it never misses anything." It is seductive because it is simple, and it is ruinous because the cost scales with frames per second × cameras × seconds of footage, three of the largest numbers in the system. The fix is the cascade: a cheap detector and a search index narrow the work, and the smart model reads only a shortlist. If a stakeholder insists the model must watch everything live, that is a different and much more expensive product and, as the legal section below shows, often an illegal one.

The second is indexing at query time instead of at detection time. A team builds search, then computes embeddings only when a user searches, and every query takes minutes because it re-scans footage. Embeddings must be computed once, when the event is detected, and stored; search then becomes an instant database lookup. Getting this backward turns a sub-second search into a multi-minute one and quietly caps how many users the system can serve.
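The compute-once, look-up-many pattern looks like this in miniature. Everything here is a stand-in: `toy_embed` counts vocabulary words instead of calling a real embedding model, the text descriptions stand in for video clips, and `EventIndex` is a hypothetical class rather than any vector database's API. The counter exists only to show that searches do not trigger new clip embeddings.

```python
VOCAB = ["truck", "door", "person", "open", "forklift"]


def toy_embed(text: str) -> list[int]:
    """Stand-in for a real embedding model: counts vocabulary words in the text."""
    words = text.lower().split()
    return [words.count(term) for term in VOCAB]


class EventIndex:
    def __init__(self) -> None:
        self._vectors: dict[str, list[int]] = {}
        self.clip_embeddings_computed = 0

    def add_event(self, clip_id: str, description: str) -> None:
        """Called once per event, at detection time; the vector is stored."""
        self._vectors[clip_id] = toy_embed(description)
        self.clip_embeddings_computed += 1

    def search(self, query: str, k: int = 3) -> list[str]:
        """Embeds only the query text; stored clip vectors are reused as-is."""
        q = toy_embed(query)

        def score(clip_id: str) -> int:
            return sum(a * b for a, b in zip(q, self._vectors[clip_id]))

        return sorted(self._vectors, key=score, reverse=True)[:k]


index = EventIndex()
index.add_event("cam02_2147", "truck leaving loading bay")
index.add_event("cam02_2149", "loading bay door open")
index.add_event("cam05_2150", "person walking in corridor")

first = index.search("truck", k=1)
second = index.search("door open", k=1)

print(first, second, index.clip_embeddings_computed)
```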

The third is trusting the model's answer without a human in the loop. Vision-language models are confident even when wrong, and in zero-shot anomaly detection they show a documented bias toward calling things "normal," so they miss real events while sounding sure. A finding that triggers a consequence (a report, an escalation, a person being approached) must be confirmed by a human reviewer. This is not only good engineering; in Europe, human oversight is a legal requirement for the high-risk uses covered in the legal section below. Design the review step into the result path from the first milestone, even though the full review view ships in Milestone 4; bolting it on later means re-plumbing every result path.

The Build Plan: Five Milestones, Value At Every Step

You do not build this all at once, and you do not build it in the order the diagram is drawn. You build it so that a working product exists after the first milestone and every later milestone ships independently, which keeps the project fundable and the team motivated.

Milestone 1 — search over recorded footage. Stand up the VMS integration, the edge detector, the tracker, the embedding index, and a simple search box. Ship the ability to type "person at the side entrance after 9 p.m." and get back ranked clips. No agent and no vision-language model yet — just fast retrieval. This alone replaces hours of manual scrubbing and is the foundation everything else attaches to. If this is shaky, no amount of agent cleverness will save the product.

Milestone 2 — the reading step. Add the vision-language model as a second stage that reads the top clips a search returns and writes a one-line description of each. The product crosses from "here are some clips that might match" to "here is what is happening in these clips, in words." This is the moment users feel the system understands footage rather than just indexes it.

Milestone 3 — the agent loop. Wrap the search and the reader in an agent that can plan multi-step investigations — find the last truck, then check the door, then measure the duration — instead of answering one-shot queries. This is where the product becomes an investigator rather than a search engine, and it reuses the retrieval and reading stages you already built. The investigator-agent lesson is the blueprint for this milestone.

Milestone 4 — the review and audit layer. Build the human-in-the-loop confirmation view, the case file, and the audit log that records every tool call and every clip. This is the milestone that makes the system defensible — to a customer's security team, to an auditor, and to a regulator. It is also where chain-of-custody features live for any customer who might use a finding in a formal process.

Milestone 5 — the regulated extensions. Anomaly detection, cross-camera tracking of a single person, and any feature that touches identity come last, because they are higher effort, noisier, and carry the heaviest compliance weight. Identity features in particular must be designed against the legal map below before a line of code is written — for some markets the answer is that you do not build them at all.

The five-milestone staircase, the reference stack, the build-versus-buy verdicts, and the go-live and legal checklist are all collected on one page in the downloadable blueprint at the end of this article, so a team can pin it to the wall and work down it.

The Hard Part — What The Law Will Not Let The System Do

A video system that works in a demo is not a product if it is illegal to operate where your customers are. For surveillance built or sold in the European Union, the law is now specific, dated, and strict, and it is the single most important section of this article to read before you write code. This is engineering context, not legal advice; confirm the specifics with counsel for your market.

The governing text is the European Union's AI Act, Regulation (EU) 2024/1689, and it sorts uses of this system into three sharply different boxes.

The first box is outright prohibited. Under Article 5, using AI for real-time remote biometric identification of people in publicly accessible spaces for law-enforcement purposes is banned, with only narrow exceptions (a targeted search for a missing person or victim, prevention of an imminent terrorist threat, or locating a suspect in a serious crime, each subject to authorisation). The Act also prohibits building facial-recognition databases through untargeted scraping of face images from the internet or CCTV footage, and inferring emotions in workplaces and schools except for medical or safety reasons. These prohibitions have applied since 2 February 2025. The practical line for your product: a feature that scans live camera feeds in a public space to identify who someone is, by their face, is on the wrong side of the law unless you are a law-enforcement body operating inside a narrow, authorised exception. Most commercial products simply must not do this.

The second box is high-risk but permitted. Analysing recorded footage after the fact to identify a person by biometric features — post-event facial recognition — is classified as high-risk under Annex III, not banned. It is allowed, but only with the full weight of compliance: a documented risk-management system, technical documentation, logging, human oversight, accuracy and robustness testing, and registration in the EU database. These high-risk obligations apply from 2 August 2026. If your roadmap includes identifying named individuals from archived video, you are building a high-risk system and must budget for that compliance from the start, not as an afterthought.

The third box is everything our investigation system does by default, and it is deliberately the lightest. Searching footage for described events and objects — "a red truck," "a person who fell," "a door left open" — is not biometric identification, because it never tries to determine who a person is. It finds what happened, not who did it. This is why the system in this article is scoped to description-based search and reads footage with a vision-language model rather than matching faces against a watchlist: that scope keeps the core product out of the prohibited box and out of the heaviest high-risk obligations. The moment a customer asks you to add "and tell me their name from our employee photos," you have crossed from the third box into the second, and the compliance bill changes entirely.

Two transparency duties round out the picture. Article 50 of the same Regulation, in force from 2 August 2026, requires that AI-generated or AI-altered media be disclosed as such — relevant if your product ever produces a synthetic reconstruction. And the human-in-the-loop review from Milestone 4 is not merely good practice for high-risk uses; meaningful human oversight is one of the law's explicit requirements. The full regulatory picture, including the biometric rules and Article 50, is the subject of lesson 8.5 on EU AI Act engineering.
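The three boxes can be encoded as a feature gate that a product team runs over its roadmap. This is a deliberate simplification of the article's own map, not a legal test: `LegalBox` and `classify_feature` are hypothetical names, the narrow authorised exceptions are not modelled, and the result is a prompt to talk to counsel rather than a verdict.

```python
from enum import Enum


class LegalBox(Enum):
    LIGHTEST = "description-based search (lightest box)"
    HIGH_RISK = "high-risk but permitted (Annex III)"
    PROHIBITED = "prohibited (Article 5)"


def classify_feature(
    identifies_by_biometrics: bool,
    real_time_public_law_enforcement: bool = False,
) -> LegalBox:
    """Map a feature onto the article's three boxes. Simplified; exceptions not modelled."""
    if not identifies_by_biometrics:
        # Finds what happened, not who did it.
        return LegalBox.LIGHTEST
    if real_time_public_law_enforcement:
        # Live biometric identification in public spaces for law enforcement.
        return LegalBox.PROHIBITED
    # Biometric identification outside the prohibited case, e.g. on recorded footage.
    return LegalBox.HIGH_RISK


roadmap = {
    "search for 'a door left open'": classify_feature(False),
    "name a person from employee photos": classify_feature(True),
    "live face ID in a public square": classify_feature(True, True),
}

for feature, box in roadmap.items():
    print(f"{feature}: {box.value}")
```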

Decision tree for the legal status of a video investigation feature under the EU AI Act. The top question asks whether the feature identifies a person by biometric features such as their face. If no, and it only searches for described events and objects, the outcome is the lightest-touch standard path that the default system uses. If yes, a second question asks whether identification happens in real time on live feeds in a public space for law enforcement. If yes, the outcome is prohibited under Article 5, in force since February 2025, with only narrow authorised exceptions. If identification instead happens after the fact on recorded footage, the outcome is high-risk but permitted under Annex III, with full compliance obligations from August 2026. A footnote notes that Article 50 transparency for AI-generated media also applies from August 2026 and that this is engineering context, not legal advice.

Figure 5. The legal map. Description-based search stays in the lightest box; live face identification in public is prohibited; after-the-fact face identification is high-risk. Scope the product into the lightest box on purpose, and confirm with counsel for each market.

Production Concerns: Chain Of Custody, Observability, And Security

Three cross-cutting concerns separate a prototype from something a security team will buy and trust.

Chain of custody is the discipline of being able to prove, later, that a clip is genuine and unaltered and to show exactly how a finding was reached. The audit log from Milestone 4 is the backbone: every tool call, every retrieved clip, every model answer, time-stamped and immutable. For any customer who might use a finding in an insurance claim, an HR process, or a court, this is not optional polish — it is the feature that makes the output usable at all. Store original clips alongside any derived findings, and never overwrite the source.
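One common way to make an audit log tamper-evident is to chain entries by hash, so that changing any earlier record invalidates everything after it. The sketch below is a minimal illustration using only the Python standard library; the entry fields and the helper names `append_entry` and `verify_chain` are assumptions made for this example, not the schema from Milestone 4. A hash chain detects alteration but does not prevent it, so a production system would presumably also write the log to append-only storage.

```python
import hashlib
import json
from datetime import datetime, timezone

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(body: dict) -> str:
    # Canonical JSON (sorted keys, fixed separators) so the same body
    # always produces the same hash.
    payload = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_entry(log: list, event: str, detail: dict) -> dict:
    """Append a time-stamped entry linked to the hash of the previous one."""
    body = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "event": event,
        "detail": detail,  # must be JSON-serialisable
        "prev": log[-1]["hash"] if log else GENESIS,
    }
    entry = {**body, "hash": _digest(body)}
    log.append(entry)
    return entry


def verify_chain(log: list) -> bool:
    """Return True only if every entry is intact and correctly linked."""
    prev = GENESIS
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if entry["prev"] != prev or entry["hash"] != _digest(body):
            return False
        prev = entry["hash"]
    return True


audit_log = []
clip_bytes = b"stand-in for the original clip file contents"
append_entry(audit_log, "tool_call", {"tool": "search", "query": "forklift at exit 3"})
append_entry(audit_log, "clip_retrieved", {
    "clip_id": "cam07-0001",  # hypothetical identifier
    "sha256": hashlib.sha256(clip_bytes).hexdigest(),
})
print(verify_chain(audit_log))
```

Recording the SHA-256 of each original clip in the log lets a reviewer later confirm that the stored source file is the one the finding was based on.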

Observability means you can see why an investigation went wrong after it ended. Video-AI systems fail in ways ordinary web apps do not — a camera that drifted out of focus, an edge device that fell behind and dropped frames, a model that returned a confident wrong answer. You want per-investigation traces (which tools ran, how long each took, how many clips were read, what the model said) collected centrally, so support can answer "why did the system miss the event I know happened?" without guessing. Build this in from Milestone 1; retrofitting telemetry into a live pipeline is painful.
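A per-investigation trace can be as small as a list of timed spans, one per tool call, serialised and shipped to wherever logs are collected. The sketch below assumes a hypothetical `InvestigationTrace` class and made-up tool names; a real deployment might use an existing tracing library instead, and this only shows the shape of the data worth capturing.

```python
import json
import time
import uuid
from contextlib import contextmanager


class InvestigationTrace:
    """Collects one record per tool call for a single investigation."""

    def __init__(self, question: str):
        self.trace_id = uuid.uuid4().hex
        self.question = question
        self.spans = []

    @contextmanager
    def span(self, tool: str, **attrs):
        # The caller can add fields (clips read, model output) to `record`.
        record = {"tool": tool, "status": "ok", **attrs}
        start = time.perf_counter()
        try:
            yield record
        except Exception as exc:
            record["status"] = "error"
            record["error"] = repr(exc)
            raise
        finally:
            elapsed = time.perf_counter() - start
            record["duration_ms"] = round(elapsed * 1000, 3)
            self.spans.append(record)

    def to_json(self) -> str:
        return json.dumps({
            "trace_id": self.trace_id,
            "question": self.question,
            "spans": self.spans,
        })


trace = InvestigationTrace("did a forklift block fire exit 3?")
with trace.span("embedding_search", camera="dock-3") as rec:
    rec["clips_returned"] = 12  # stand-in for a real search result count
with trace.span("vlm_read") as rec:
    rec["clips_read"] = 3
    rec["answer"] = "stand-in for the model's written finding"
print(trace.to_json())
```

Because failed spans are recorded before the exception propagates, the trace still shows which step broke and how long it ran, which is the information support needs to explain a missed event.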

Security and data minimisation start from the fact that footage is among the most sensitive data a company holds. Encrypt clips at rest and in transit, enforce strict access controls on who can run investigations and see results, and keep footage no longer than the retention policy and the law require — minimising what you store is both a security posture and, under data-protection law, an obligation. For customers in regulated sectors, self-hosting the whole pipeline so that footage never leaves their infrastructure is often the deciding factor, and this architecture supports it: every component here can run on the customer's own hardware.
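Retention is the part of data minimisation that is easiest to automate. The sketch below selects clips that have aged past a retention window; the `Clip` type, the `on_hold` flag (an assumed way to protect footage tied to an open finding from deletion), and the 30-day figure are all illustrative assumptions, since the actual period depends on the customer's policy and the applicable law.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Clip:
    clip_id: str
    recorded_at: datetime
    # Hypothetical flag: the clip backs an open finding and must be kept.
    on_hold: bool = False


def clips_to_purge(clips, retention_days: int, now: datetime):
    """Return clips older than the retention window that are not on hold."""
    cutoff = now - timedelta(days=retention_days)
    return [c for c in clips if c.recorded_at < cutoff and not c.on_hold]


now = datetime(2026, 6, 4, tzinfo=timezone.utc)
clips = [
    Clip("old", now - timedelta(days=45)),
    Clip("old-evidence", now - timedelta(days=45), on_hold=True),
    Clip("recent", now - timedelta(days=5)),
]
expired = clips_to_purge(clips, retention_days=30, now=now)
print([c.clip_id for c in expired])
```

Passing `now` in explicitly keeps the function deterministic and easy to test, and deleting only what this function returns keeps the purge job auditable.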

Where Fora Soft Fits In

Fora Soft has built video software since 2005, and computer-vision systems for video surveillance are one of the verticals we ship, alongside video conferencing, streaming, e-learning, and telemedicine. The system described here — a mature VMS underneath, an edge detector running on everything, an embedding index that makes footage searchable by description, an agent that calls a vision-language model only on a shortlist, and a human reviewer who signs off — is the backbone of the surveillance and intelligent-video-analytics work we scope. The build order and the build-versus-buy verdicts in this article are not theory for us; they are the checklist we apply, because they are the difference between an investigation tool that answers in seconds for cents and one that either misses events or runs an unaffordable bill. The legal map is part of that checklist too: we scope products into the lightest-touch box on purpose, because a feature that is brilliant and banned ships to no one. Our work here lives in video surveillance and intelligent video analytics, where real-time computer vision on recorded footage is the core of the product rather than a decoration.

References

  1. European Union — Regulation (EU) 2024/1689 (AI Act), Article 5 — Prohibited AI practices (real-time remote biometric identification in publicly accessible spaces for law enforcement is prohibited with narrow exceptions; prohibitions apply from 2 February 2025). https://eur-lex.europa.eu/eli/reg/2024/1689/oj
  2. European Union — Regulation (EU) 2024/1689 (AI Act), Annex III & Article 6 — High-risk classification (post-remote biometric identification is high-risk; obligations apply from 2 August 2026). https://artificialintelligenceact.eu/annex/3/
  3. European Union — Regulation (EU) 2024/1689 (AI Act), Article 50 — Transparency obligations (disclosure of AI-generated or manipulated media; applies from 2 August 2026). https://eur-lex.europa.eu/eli/reg/2024/1689/oj
  4. Bai et al. — Qwen2.5-VL Technical Report (arXiv:2502.13923, February 2025; native video understanding, dynamic resolution, absolute time encoding, second-level event localization in long video). https://arxiv.org/abs/2502.13923
  5. Qwen Team, Alibaba — Qwen3-VL Technical Report (arXiv:2511.21631, November 2025; 256K-token interleaved context across text, image, and video; dense 2B–32B and MoE variants). https://arxiv.org/abs/2511.21631
  6. Meta AI — SAM 3: Segment Anything with Concepts (arXiv:2511.16719, released 19 November 2025; concept-prompted detect-segment-track; SAM 2-style masklet propagation in video). https://arxiv.org/abs/2511.16719
  7. Ultralytics — YOLO model comparison and production guidance (YOLO11 / YOLO26 recommended for production; YOLO12/YOLO13 not recommended due to training instability, memory use, and slower CPU inference). https://docs.ultralytics.com/compare/
  8. NVIDIA — DeepStream SDK documentation (multi-stream video-analytics pipeline on Jetson and data-centre GPUs; Orin AGX supports up to 8–16 streams). https://docs.nvidia.com/metropolis/deepstream/dev-guide/
  9. Twelve Labs — Marengo 3.0: multimodal video embeddings (natural-language retrieval over long-form video; the managed embedding-and-search option, 2026). https://www.twelvelabs.io/blog/marengo-3-0
  10. Zhou et al. — BN-WVAD: weakly-supervised video anomaly detection (state-of-the-art public results: AUC 87.24% on UCF-Crime, AP 84.93% on XD-Violence; ~30–50 ms on RTX 4090 / Jetson Orin AGX). https://arxiv.org/abs/2311.15367
  11. Cao et al. — AnyAnomaly: Zero-Shot Customizable Video Anomaly Detection with LVLM (WACV 2026; VLM-prompted anomaly detection generalizes across datasets). https://arxiv.org/abs/2503.04504
  12. Brivo — Introducing Eeva, an AI video agent driven by natural-language prompts (17 March 2026; commercial natural-language video monitoring on the Brivo/Eagle Eye VMS). https://www.brivo.com/introducing-eeva/