Published 2026-05-23 · 32 min read · By Nikolay Sapunov, CEO at Fora Soft
Why This Matters
If you ship a video product in 2026 — surveillance, OTT, telemedicine, e-learning, conferencing — and you are not detecting anomalies yet, your competitors are. Anomaly detection moved from a research curiosity to a line item on customer RFPs between 2023 and 2025; in 2026 it is the difference between an enterprise security pitch you win and one you lose. The good news is that the underlying tech is finally ready for production: the academic benchmarks have stabilised, the vendor APIs have matured, the edge silicon is cheap, and the regulatory frame is — if not stable — at least visible. The bad news is that the field is now so deep that a product manager who reads only the marketing copy will pick the wrong architecture and pay for it for three years. This article is for the product manager who has to defend an architecture choice to a board, the founder who has to choose between a SaaS contract at $50 per camera per month and a custom build, and the engineer who has to explain to that founder why "just use a VLM" sometimes works and sometimes burns the runway. It assumes you have already read lesson 1.2 on latency and deployment topology and lesson 1.4 on AI cost in video products, because every decision in this article reduces to a latency number or a dollar number.
What "Anomaly" Actually Means In A Video Product
The word "anomaly" is doing too much work in most product specifications, and the choice of model depends entirely on which kind of anomaly you mean. The taxonomy below is the one we use at Fora Soft when we sit down with a new client and ask them to point at the problem they want solved. Until that taxonomy is clear, no model selection conversation is useful.
A behavioural anomaly is a person or object doing something the policy of the space forbids — a customer running through a "no running" zone in a warehouse, a fence-climber, a loiterer in a restricted corridor, a person collapsing on a transit platform. These are the dominant academic benchmark category and the dominant production category for surveillance products. They are temporal — you need several seconds of context to recognise them — and they are the natural home of 3D-CNNs (I3D, SlowFast) and video transformers (TimeSformer, VideoSwin, VideoMAE).
A physical anomaly is an object that should not be there, or an object that should be there but is not — an unattended bag in a station, a puddle on a factory floor, a missing fire extinguisher in a hospital corridor, a fallen sign on a road. These are frame-level problems with a temporal-persistence overlay; the right architecture is an object detector (YOLO, RT-DETR) feeding into a state-tracking layer that fires the alert when the object has persisted past a threshold time.
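The state-tracking layer described above can be sketched in a few lines. This is a minimal illustration, not a production tracker: the class name `PersistenceTracker`, the `threshold_s` parameter, and the assumption that an upstream detector already supplies stable object IDs are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PersistenceTracker:
    """Fire an alert once a tracked object has stayed in view long enough."""
    threshold_s: float                       # hypothetical persistence threshold
    first_seen: dict = field(default_factory=dict)
    alerted: set = field(default_factory=set)

    def update(self, visible_ids, now_s):
        """Feed the IDs visible in this frame; return the IDs that newly alert."""
        visible = set(visible_ids)
        # Forget objects that left the scene so their timer restarts on return.
        for gone in set(self.first_seen) - visible:
            del self.first_seen[gone]
            self.alerted.discard(gone)
        fired = []
        for obj_id in sorted(visible):
            start = self.first_seen.setdefault(obj_id, now_s)
            if now_s - start >= self.threshold_s and obj_id not in self.alerted:
                self.alerted.add(obj_id)     # alert once per continuous presence
                fired.append(obj_id)
        return fired


tracker = PersistenceTracker(threshold_s=60.0)
tracker.update({"bag-1"}, now_s=0.0)         # nothing fires yet
print(tracker.update({"bag-1"}, now_s=61.0)) # ['bag-1']
```

A real deployment would likely tolerate a few missed detections before resetting the timer, since object detectors drop frames under occlusion.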
A density or crowd anomaly is a count of people or objects crossing a threshold in a defined region — a queue overflowing, a crowd crush forming, a platform too dense, a refrigerator section emptying. These are heat-map regression problems plus per-region threshold logic; modern crowd-counting models from MMDetection's CrowdNet family handle them at frame rates of 30 frames per second on a modest GPU.
A schedule or temporal anomaly is movement, presence, or absence that violates a time-of-day or day-of-week rule — motion in a building at 03:00, a vehicle stationary in a fire lane for more than five minutes, a back-of-house door propped open during business hours. These are mostly solved by rules plus a normal-pattern learning layer; the AI value here is in suppressing false positives by learning that "motion at the loading dock at 06:00 on Tuesday is normal," not in the underlying detection.
A statistical or signal anomaly is a data-stream-level deviation — a sudden quality drop, a frozen camera feed, a stuck encoder, a network jitter spike, a metric inside the video pipeline itself crossing a learned threshold. These are not "what is in the picture" problems at all; they are time-series problems and they belong with isolation forests, autoencoder reconstruction loss, or LOF on engineered features. They matter because the failure mode of every video anomaly pipeline above includes "the camera silently froze and the pipeline kept emitting the last good frame's predictions" — and only signal-level anomaly detection catches that.
The single biggest win you can engineer into an anomaly detection RFP is to name the anomaly class in the first paragraph. "We need real-time anomaly detection" is not a specification; "we need to detect a person running in a designated no-running zone within 250 milliseconds, on a Jetson Orin Nano, at 30 frames per second" is.
Figure 1. The five anomaly classes a video product can ship, with the model family that best fits each one.
The 2026 Latency Budget — The First Number You Negotiate
Real-time means "fast enough to act before the event ends," and that maps to concrete budgets that the rest of the architecture must respect. Get this number wrong and every other decision compounds the mistake; get it right and the model choice, the hardware choice, and the topology choice all fall out naturally.
A transit-platform fall that needs to trigger a gate-close or a train-hold needs to land under 150 milliseconds end-to-end — the time from the event happening to the actuator being commanded. That budget allows roughly 33 ms for capture and codec, 50 ms for inference, 30 ms for the alert bus, and 30 ms of headroom. There is no room in that budget for a network round-trip; the inference has to run on a device next to the camera.
A perimeter intrusion or a fence-climb that needs to arm a siren and dispatch a guard runs in the 200 to 300 ms band. That budget allows a thin edge-to-edge hop, or a tightly engineered LAN round-trip to an on-site NVR with a GPU.
A retail shoplift that needs to put staff on the floor in time to intercept runs in the 300 to 500 ms band. That is the sweet spot for a hybrid architecture — a fast edge detector at high recall, a cloud re-scorer that confirms the alert before staff are dispatched.
A crowd-density or queue alert that triggers staffing or routing decisions runs in the 500 to 1000 ms band. Cloud inference is fine here; the WAN round-trip and a modest GPU queue still fit inside a second.
A forensic re-analysis has no real-time requirement — minutes to hours is fine — and the architecture should optimise for cost and accuracy, not latency. This is where frontier VLMs earn their per-token rates.
The corollary is sharper than it looks. If your target event is "preventable by a human if alerted in under 10 seconds," you need edge inference. If your target event is "investigable after the fact," cloud is fine, cheaper, and easier to upgrade. The architectures for those two cases share no components except the camera; do not let a vendor sell you one as a substitute for the other.
A worked example. A 1080p camera capturing at 30 frames per second produces a frame every 33 ms. A Jetson Orin Nano Super running a quantised SlowFast model runs inference in about 25 to 35 ms per clip. The ONVIF event publish to a local broker is about 5 ms. A relay actuator close to the camera responds in about 10 ms. End-to-end: 33 + 30 + 5 + 10 = 78 ms. The same pipeline routed through a cloud region 50 ms away in each direction becomes 33 + 50 + 30 + 50 + 10 = 173 ms, and now the 150 ms budget for the train-hold case is blown. The hardware did not change; the topology did.
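The arithmetic in this worked example is easy to keep honest in code. The sketch below uses the stage timings quoted in the text; the stage names and the helper functions `end_to_end_ms` and `fits_budget` are illustrative labels, not part of any library.

```python
def end_to_end_ms(stages):
    """Sum the per-stage latencies (milliseconds) of a pipeline."""
    return sum(stages.values())


def fits_budget(stages, budget_ms):
    """True when the pipeline's total latency is within the budget."""
    return end_to_end_ms(stages) <= budget_ms


# Stage timings taken from the worked example; names are illustrative.
edge = {"capture": 33, "inference": 30, "event_publish": 5, "actuator": 10}
cloud = {"capture": 33, "uplink": 50, "inference": 30, "downlink": 50, "actuator": 10}

TRAIN_HOLD_BUDGET_MS = 150

for name, stages in (("edge", edge), ("cloud", cloud)):
    total = end_to_end_ms(stages)
    print(f"{name}: {total} ms, within budget: {fits_budget(stages, TRAIN_HOLD_BUDGET_MS)}")
# edge: 78 ms, within budget: True
# cloud: 173 ms, within budget: False
```

Keeping the budget as a named dictionary makes it obvious which stage a topology change adds, which is the point of the example.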
Figure 2. Latency budget bands and the only deployment topology that respects each one.
The Model Families That Matter In 2026 — A Layered Stack, Not A Beauty Contest
The model landscape moved twice between 2023 and 2026. In 2023 the field was a contest between 3D-CNNs and the early video transformers; in 2024 CLIP-based methods started landing on the leaderboards; in 2025 multi-modal LLMs (Holmes-VAD, AnomalyRuler, VadCLIP++) started producing explanations that customers could actually use; in 2026 the consensus production architecture is a layered stack rather than a single model. We walk each layer.
The first layer is a fast, cheap, frame-level detector that runs on every frame at the edge. Its job is to surface candidate clips — windows of time that are statistically unusual enough to be worth a closer look. The two main approaches are a quantised 3D-CNN (I3D or SlowFast at INT8) running on a Jetson Orin Nano Super or a Hailo-8 module, and an autoencoder (ST-AE, MemAE) that flags frames whose reconstruction loss crosses a threshold. The 3D-CNN approach is faster and more accurate when you have labels; the autoencoder approach is the only thing that works when you have none. Both run comfortably at 30 frames per second on a $250 edge device.
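For the autoencoder route, the "reconstruction loss crosses a threshold" step might look like the sketch below. The mean-plus-k-sigma calibration rule and the default `k=3.0` are assumptions chosen for illustration; the text does not prescribe how the threshold is set, and the per-frame errors would come from a model such as MemAE that is not shown here.

```python
from statistics import mean, pstdev


def calibrate_threshold(normal_errors, k=3.0):
    """Set the threshold k standard deviations above the mean error
    observed on footage known to be normal (an assumed calibration rule)."""
    return mean(normal_errors) + k * pstdev(normal_errors)


def flag_frames(errors, threshold):
    """Return the indices of frames whose reconstruction error exceeds the threshold."""
    return [i for i, err in enumerate(errors) if err > threshold]


# Hypothetical per-frame reconstruction errors from normal footage.
normal = [0.10, 0.12, 0.11, 0.09, 0.10, 0.08]
threshold = calibrate_threshold(normal)

# Frame 2 reconstructs badly, so it becomes a candidate for the next layer.
print(flag_frames([0.10, 0.11, 0.45, 0.12], threshold))  # [2]
```

In practice the calibration window would be refreshed per camera and per time of day, because lighting changes alone shift the error distribution.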
The second layer is a weakly-supervised classifier that re-examines the candidate clips and assigns them to anomaly classes. This is the layer where the academic state of the art lives. Multiple instance learning (MIL) frameworks such as RTFM, MGFN, Unbiased MIL, and the newer Multi-Timescale Feature Learning (MTFL, 2024) let you train an instance-level classifier from video-level labels, which is the difference between a project that ships and one that drowns in annotation cost. On UCF-Crime the current state of the art sits at roughly 88 to 90% AUC for the most recent MIL variants on raw features, and at 95% AUC or higher when you stack a transformer on top. On XD-Violence the equivalent figure is 85 to 86% AP.
The third layer is a CLIP-based or VLM-based re-scorer that handles two jobs the MIL layer cannot: open-vocabulary detection (catching anomalies the model was never trained on, by their textual description), and few-shot adaptation to new sites. VadCLIP, the first CLIP-based weakly-supervised VAD method, integrates text and visual prompts and beats the older MIL-only state of the art on UCF-Crime. TPWNG (Text Prompt with Normality Guidance, CVPR 2024) fuses visual features with CLIP text embeddings to produce pseudo-labels for training. VadCLIP++ and TrCLIP-VAD (2025–2026) push that further with dynamic vision-language models and CLIP training improved by text rewriting. The practical lift from this layer is two things: you can describe a new anomaly class in a sentence and get reasonable detection without retraining, and you cut your labelling budget by something like 80% on the long tail.
The fourth layer is a frontier VLM that handles the hardest queries and produces the explanation. Holmes-VAD constructed the first large-scale multimodal VAD instruction-tuning benchmark (VAD-Instruct50k) and showed that a fine-tuned multimodal LLM can both localise anomalies precisely in time and produce a coherent narrative explaining what happened. AnomalyRuler does few-shot prompting with normal reference samples to induce rules that the LLM then applies to new footage. The 2025 Cerberus paper showed that a cascaded VLM architecture can hit 97.2% accuracy on UCF-Crime while running at 57.68 frames per second on an NVIDIA L40S — within range of real-time, finally, for VLM-class methods. In production today, the practical pattern is to use a small VLM (Llama 3.2 Vision, Qwen-VL, LLaVA) on-device for explanation, and a frontier VLM (Gemini 2.5 Pro, Claude Opus 4.7) in the cloud for the borderline cases that the smaller model is unsure about.
The fifth layer, often missed, is the signal anomaly detector that watches the pipeline itself. An isolation forest or an LSTM-autoencoder on engineered features (frame freeze duration, encoder bitrate variance, dropped packet rate, model confidence histogram) catches the failure modes that the vision layers cannot: a stuck encoder, a camera that has slowly drifted out of focus, a model that has started predicting the modal class on every frame because of a silent training-data drift. We have seen production systems run for weeks with every vision layer reporting "all clear" while the camera fed a still image of the parking lot at sunset on a loop. The signal layer is not optional.
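One of the engineered features named above, frame freeze duration, can be approximated with a very small check. The sketch below is an assumption-laden illustration: `FreezeDetector` and `max_repeats` are hypothetical names, and hashing raw frame bytes only catches bit-identical frames. A looped still that is re-encoded each time would need a perceptual difference measure instead.

```python
import hashlib


class FreezeDetector:
    """Flag a feed as frozen after too many consecutive identical frames."""

    def __init__(self, max_repeats):
        self.max_repeats = max_repeats   # hypothetical tolerance, in frames
        self._last_digest = None
        self._repeats = 0

    def observe(self, frame_bytes):
        """Feed one decoded frame; return True once the feed looks frozen."""
        digest = hashlib.sha256(frame_bytes).digest()
        if digest == self._last_digest:
            self._repeats += 1
        else:
            self._last_digest = digest
            self._repeats = 0
        return self._repeats >= self.max_repeats


detector = FreezeDetector(max_repeats=3)
states = [detector.observe(b"same-frame") for _ in range(4)]
print(states)  # [False, False, False, True]
```

The boolean output, or the running repeat count, would be one input feature for the isolation forest or LSTM-autoencoder rather than an alert on its own.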
| Layer | Job | Typical model | Where it runs | Why you need it |
|---|---|---|---|---|
| 1 — Candidate filter | Surface unusual windows | I3D / SlowFast INT8, MemAE | Edge NPU | 30 fps at $250 hardware |
| 2 — Class classifier | Assign anomaly class | MIL (RTFM, MGFN, MTFL) | Edge or cloud GPU | 90% AUC on UCF-Crime |
| 3 — VLM re-scorer | Open-vocabulary + few-shot | VadCLIP, TPWNG, VadCLIP++ | Cloud GPU | Long-tail anomalies, low labels |
| 4 — Explanation | Human-readable narrative | Holmes-VAD, Gemini 2.5, Claude | Cloud | Audit trail, regulator review |
| 5 — Signal anomaly | Watch the pipeline | Isolation Forest, LSTM-AE | Edge / cloud sidecar | Catch silent failures |
Figure 3. The five-layer stack a 2026 production anomaly detection pipeline ships. Most failure stories trace to a missing layer or a missing handoff between layers.
The shape of that stack is the shape of the conversation you have with a vendor. If a vendor offers you a single model — "our transformer detects anomalies" — they are selling you layer 2 and pretending the rest is solved. If a vendor offers you a complete pipeline at one price, ask which of the five layers they own and which they assume you will build; usually they own layers 1 and 2 and the rest is your problem to integrate.
The Datasets And Benchmarks Worth Knowing — And The Ones That Will Lie To You
Benchmarks anchor model conversations even though they do not predict production behaviour. The 2026 reference numbers below are the ones that an engineer needs to know to read a paper or evaluate a vendor; they are also the numbers that a careless reader will misuse.
UCF-Crime is the dominant weakly-supervised benchmark: 1,900 surveillance videos and 13 anomaly classes, including abuse, arrest, arson, assault, burglary, fighting, road accidents, robbery, shooting, shoplifting, stealing, and vandalism. The state of the art has moved through the 84 to 90% AUC range over the past two years for MIL-only methods on raw features, with the absolute best CLIP-augmented and transformer-augmented methods at roughly 95 to 97% AUC. Watch for two things when a vendor quotes a UCF-Crime number: did they test on the standard test split or on a cherry-picked subset, and did they report video-level AUC or frame-level AUC? The difference can be 8 points.
ShanghaiTech Campus has 437 videos of campus life, dominated by pedestrian-anomaly scenarios. SOTA on the standard split is in the 95 to 96% AUC range with VideoSwin and VideoMAE. Watch for the version of the dataset — the original (one-class supervised, no anomalies in training) and the weakly-supervised reorganisation (Zhong et al.) have different difficulty.
Avenue is a 37-video set (16 training, 21 testing) of unusual pedestrian behaviour at a campus avenue. It is small enough that everything overfits; SOTA sits above 98% AUC. Treat reports above 98% AUC on Avenue as "the model can pass the dataset," not "the model works."
XD-Violence is a 4,754-video set with audio, 6 violence-related classes, multi-label. SOTA sits at roughly 85 to 86% AP for the recent CLIP-based and MIL approaches.
Street Scene is the cross-domain benchmark that most reports leave out. The same model that scores 97% AUC on UCF-Crime will typically drop 10 to 15 percentage points on Street Scene. That gap is the gap you should expect between benchmark performance and your customer's first month of production.
The honest read on benchmarks is this: a model scoring 97% on UCF-Crime will require fine-tuning on 2 to 6 weeks of your actual site footage before it produces customer-grade performance on your cameras. Budget for that data collection and the relabelling work upfront, and you will not be surprised by it. Treat any vendor that does not ask about your site data as a vendor that has never shipped to production.
How CLIP And VLMs Changed The Field — And When To Reach For Them
The single biggest engineering shift in video anomaly detection between 2024 and 2026 was the arrival of CLIP-based and VLM-based methods. They are different enough from the older MIL stack that you need a separate mental model to know when to use them.
CLIP is a contrastive vision-language model trained on image–text pairs from the open web. It maps images and short text descriptions into a shared embedding space where semantically similar items cluster. The application to anomaly detection is direct: you describe the anomaly class in text ("a person climbing a fence", "a person fighting", "a person collapsing"), embed both the text and a video clip, and threshold on the cosine similarity. With zero training examples of the target class you get a credible detector — not as accurate as a fully-trained model, but accurate enough to bootstrap a labelling program or to ship a v1 in a week.
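The embed-and-threshold step can be sketched without any model code. In the example below the three-dimensional vectors are toy stand-ins for real CLIP embeddings, and `zero_shot_flags` and the threshold value are hypothetical; a real pipeline would obtain the vectors from a CLIP text encoder and image encoder, and the threshold would need calibrating on site footage.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_flags(clip_embedding, prompt_embeddings, threshold):
    """Return the prompts whose similarity to the clip meets the threshold."""
    scores = {p: cosine(clip_embedding, e) for p, e in prompt_embeddings.items()}
    return {p: s for p, s in scores.items() if s >= threshold}


# Toy embeddings for illustration only; real ones come from a CLIP encoder.
prompts = {
    "a person climbing a fence": [1.0, 0.0, 0.0],
    "a person collapsing": [0.0, 1.0, 0.0],
}
clip_vec = [0.9, 0.1, 0.0]

hits = zero_shot_flags(clip_vec, prompts, threshold=0.8)
print(sorted(hits))  # ['a person climbing a fence']
```

Adding a new anomaly class in this scheme means adding one more text prompt to the dictionary, which is why the approach is attractive for bootstrapping.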
VadCLIP (AAAI 2024) was the first CLIP-based weakly-supervised VAD method to land on the leaderboards. It integrates textual priors via text and visual prompts and beats the older MIL-only state of the art on UCF-Crime by several points. TPWNG (Text Prompt with Normality Guidance, CVPR 2024) fuses visual features with CLIP text embeddings to produce pseudo-labels for training, which closes the gap with fully-supervised methods on long-tail classes. VadCLIP++ (2025) introduces a dynamic vision-language model that adapts CLIP at inference time. TrCLIP-VAD (2026) improves CLIP training with text rewriting and is the current near-SOTA for weakly-supervised UCF-Crime detection. WSVAD-CLIP (2025) adds temporal-aware prompt learning and frame-aware fusion. AVadCLIP (2025) extends the approach to audio-visual collaboration, which is the right call for XD-Violence-style violence detection.
The arrival of multimodal LLMs as anomaly detectors is the second shift. Holmes-VAD (2024–2025) fine-tunes a multimodal LLM on the VAD-Instruct50k benchmark and produces both precise temporal localisation and a coherent natural-language explanation. AnomalyRuler (ECCV 2024) treats VAD as a few-shot reasoning task: feed the LLM a few examples of normal behaviour, ask it to induce rules, then apply those rules to new footage. Cerberus (2025) cascades smaller and larger VLMs to balance accuracy and frame rate — 97.2% accuracy, 57.68 fps on an L40S.
The practical question is when to reach for which layer. The rule we use at Fora Soft is straightforward. If you have fewer than 100 labelled examples of the target anomaly class, start with CLIP zero-shot (VadCLIP or TPWNG, depending on whether you have text prompts you trust). If you have 100 to 10,000 labelled examples, train a weakly-supervised MIL layer on top of CLIP features. If you have more than 10,000 labelled examples, train a fully-supervised classifier and use CLIP only as a long-tail safety net. If your product needs to explain why an event was flagged (for a court, for a regulator, for a security analyst's review queue), wire a frontier VLM into the explanation layer. If your product needs to answer ad-hoc queries against the archive ("did anyone leave a bag near the south entrance last Tuesday?"), wire a multimodal RAG over your archive (see the video RAG lesson in Phase 4).
The pitfall with VLM-based methods is the cost. A frontier VLM at $1.25 per million input tokens, ingesting a 30-second clip at default media resolution (about 9,000 input tokens), costs roughly $0.011 per clip — call it a cent. Multiply that by a thousand alerts a day and you have $10 a day per camera, or $300 a month per camera, just for the explanation layer. That is more than most SaaS surveillance contracts. The fix is to route only the borderline cases — confidence between 0.4 and 0.7 from the MIL layer — to the frontier VLM, and to use a small on-device VLM (Qwen-VL 7B, Llama 3.2 Vision 11B, Moondream 2B) for the easier explanations. We work through the full cost arithmetic in lesson 1.4.
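The cost arithmetic and the borderline-routing fix can be written down as a short sketch. The token count, price, alert volume, and the 0.4 to 0.7 confidence band come from the paragraph above; the function names, the route labels, and the 10% escalation share are hypothetical.

```python
def vlm_cost_per_clip(tokens_per_clip, usd_per_million_tokens):
    """Input-token cost in USD of sending one clip to a VLM."""
    return tokens_per_clip * usd_per_million_tokens / 1_000_000


def monthly_cost(alerts_per_day, cost_per_clip, escalated_fraction=1.0, days=30):
    """Monthly cost per camera when only a fraction of alerts is escalated."""
    return alerts_per_day * escalated_fraction * cost_per_clip * days


def route(confidence, low=0.4, high=0.7):
    """Send borderline MIL scores to the frontier VLM, the rest on-device."""
    return "frontier_vlm" if low <= confidence <= high else "on_device_vlm"


per_clip = vlm_cost_per_clip(9_000, 1.25)
print(f"per clip: ${per_clip:.5f}")                             # per clip: $0.01125
print(f"all alerts: ${monthly_cost(1_000, per_clip):.2f}")      # all alerts: $337.50
# Hypothetical: only 10% of alerts fall in the borderline band.
print(f"borderline only: ${monthly_cost(1_000, per_clip, 0.1):.2f}")
print(route(0.55), route(0.90))  # frontier_vlm on_device_vlm
```

The unrounded figure is $337.50 a month; the $300 quoted in the text comes from rounding the per-clip cost down to one cent first.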
The Edge-Versus-Cloud Topology Decision
Three deployment topologies cover essentially every production anomaly detection system in 2026. Picking among them is a function of the latency budget you negotiated above, the camera count, the connectivity at the site, and the compliance regime. Get those four numbers in front of you and the topology choice is mechanical.
Edge only. A Jetson Orin Nano Super ($249), a Hailo-8 module ($150 to $300), or a Google Coral Dev Board ($150) next to each camera, or one per group of four to sixteen cameras. The Orin Nano Super delivers about 67 TOPS for AI workloads; the Hailo-8 delivers 26 TOPS at about 2.5 watts. Inference latency for a quantised SlowFast or I3D model lands at 30 to 80 ms; total end-to-end latency to a local alert bus is well under 200 ms. The hardware cost runs $200 to $600 per device, plus a one-time integration cost. The advantage is hard: frames never leave the site (the strongest privacy story in regulated industries), there is no WAN dependency, and the compliance footprint under the EU AI Act is much smaller. The disadvantage is operational: every device is a thing to maintain, monitor, update, and replace.
Cloud only. Stream every camera to an NVIDIA L4 ($1 to $1.50 per hour rented), an L40S ($1.80 to $2.20 per hour), or an A100 ($1.30 to $2.50 per hour) in a managed region. Each L4 handles more than 100 1080p streams at 30 frames per second for object detection; an L40S handles roughly 60 or more streams of the heavier 3D-CNN models. Inference latency on the GPU itself is 20 to 50 ms; end-to-end latency depends entirely on the WAN. The advantage is operational: one place to deploy, retrain, and monitor; hot model swaps; central audit storage. The disadvantage is twofold: the WAN round-trip and the bandwidth cost. A 50 ms RTT plus 30 ms inference plus 30 ms alert fan-out is already at 110 ms before capture and encode are counted, which leaves little headroom inside a 200 ms SLA. A 1080p H.264 camera at 4 Mbps consumes about 1.3 TB of bandwidth per month; multiply by 100 cameras and you are at 130 TB per month, which dominates the cost equation at $0.05 per GB egress.
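The bandwidth figures follow directly from the bitrate. The sketch below reproduces them under stated assumptions: a 30-day month, decimal units (1 TB = 10^12 bytes), and a constant bitrate. The helper names are illustrative.

```python
def monthly_tb(bitrate_mbps, days=30):
    """Data volume in decimal terabytes for one constant-bitrate stream."""
    bytes_per_second = bitrate_mbps * 1_000_000 / 8
    return bytes_per_second * 86_400 * days / 1e12


def transfer_cost_usd(terabytes, usd_per_gb):
    """Transfer cost at a flat per-gigabyte rate (1 TB = 1,000 GB)."""
    return terabytes * 1_000 * usd_per_gb


per_camera = monthly_tb(4)        # 1080p H.264 at 4 Mbps
site = per_camera * 100           # 100 cameras

print(f"{per_camera:.3f} TB per camera per month")        # 1.296 TB per camera per month
print(f"{site:.1f} TB per month for 100 cameras")         # 129.6 TB per month for 100 cameras
print(f"${transfer_cost_usd(site, 0.05):,.0f} per month") # $6,480 per month
```

At the quoted rate the transfer bill alone comes to roughly $6,500 a month for 100 cameras, before any GPU time is paid for.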
Hybrid, the 2026 default. Edge devices run layer 1 (candidate filter) and layer 2 (MIL classifier) at high recall and moderate precision, so that few true events are missed before the cloud confirms them. Borderline events, typically the 5 to 20% of candidates that the edge model is unsure about, are escalated to the cloud, where layer 3 (CLIP re-score) and layer 4 (VLM explanation) refine the verdict and produce the audit trail. The Qdrant team published a worked example where a 50-camera surveillance deployment generates 432,000 clips per day; the edge-to-cloud escalation pattern cuts cloud processing volume by about 6× while catching 95% of true anomalies. The split is the right one for almost every production system above 20 cameras: latency stays inside the budget at the edge, while the long tail of model updates and the explanation layer live in the cloud.
The anti-pattern we see most often is pushing raw 1080p streams to the cloud for a latency-critical use case just to reuse an existing cloud GPU. You pay in bandwidth, you pay in latency, you pay in regulatory exposure, and you usually break the latency SLA in the first stress test. The fix is not to add more bandwidth; it is to move layer 1 to the edge.
A worked example. A 50-camera retail deployment with a 300 ms latency budget and an EU AI Act compliance regime. Edge: 50 Hailo-8 modules at $200 each = $10,000 capex, running a quantised SlowFast model for shoplift detection at 30 fps per camera, 70 ms inference per camera. Local alert bus on MQTT. Cloud: borderline-event escalation to a single L40S instance running VadCLIP for re-scoring and a Gemini 2.5 Pro call for the explanation on the top-1% most ambiguous events. End-to-end latency to staff alert: 180 ms median, 240 ms p95. Bandwidth: edge-only filtering cuts the cloud upload to roughly 5% of raw stream volume, about 3.3 TB per month for the whole site instead of 65 TB. Monthly cost: ~$1,100 cloud + ~$200 amortised edge hardware = $1,300 against the $50 per camera per month SaaS alternative of $2,500. The custom build wins on cost, on latency, and on the AI Act compliance posture; the SaaS option wins only if the deployment will not scale past 20 cameras.
Figure 4. The hybrid edge-plus-cloud topology that is the 2026 default. Edge runs the candidate filter and classifier; cloud owns the re-scorer, explanation, and audit trail.
Integration Protocols — How The Detector Plugs Into The Rest Of The World
Detection without integration is a demo. The detector has to publish events to a VMS, a SIEM, a staff mobile app, an actuator, a notification system, a ticketing tool, and an audit log — all on different protocols, on different latency budgets, with different reliability guarantees. Four protocols cover the production stack in 2026.
ONVIF Profile M is the metadata and analytics profile of the ONVIF specification, ratified in 2022 and broadly adopted by Milestone XProtect, Genetec Security Center, Avigilon Unity, Hikvision Bricks, and Axis Communications Camera Station between 2023 and 2026. If your detector cannot emit ONVIF events, it will not plug into enterprise Video Management Systems (VMS) without per-vendor custom work. Profile M defines the JSON event payload, the WS-BaseNotification subscription pattern, and the metadata XML schema; the cost of compliance is a one-time integration effort of two to three engineer-weeks and an annual ONVIF membership fee for the conformance test.
RTSP is still the default camera-to-pipeline stream protocol. Camera latency is 30 to 50 ms LAN, 100 to 300 ms WAN. The edge box pulls the RTSP stream, decodes it (often with a hardware decoder on the same NPU module), and runs inference. RTSP is older than the modern WebRTC stack but it is universal in the IP camera market and ONVIF defines the streaming side as RTSP for backward compatibility.
WebRTC is the newer streaming layer, with 20 to 100 ms latency, better firewall traversal, stronger encryption, and adaptive bitrate. It is adopted by newer cloud VMS platforms and the operator UIs we ship when sub-second delivery to a remote console matters. The integration story is more complex (you need a signalling server, a TURN server, an SFU at scale) but the latency story is unambiguously better.
MQTT is the alerting bus. The anomaly detector publishes a topic like event/shoplift/zone-5/conf-0.87 to a broker; subscribers — the SIEM, the ticketing system, the staff mobile app, the actuator, the audit log — pick up the event and fan it out. MQTT is the de-facto standard for IoT-class alerting because of its low overhead and quality-of-service guarantees; it is the right choice over HTTP webhooks for any deployment above a dozen cameras.
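A publisher for this bus needs little more than a topic string and a payload. The sketch below builds both without touching a broker; the topic layout follows the example in the text, while the function name `build_event` and the JSON field names are hypothetical, and the actual publish call is left to whichever MQTT client library the deployment uses.

```python
import json


def build_event(kind, zone, confidence, model_version):
    """Build an MQTT topic and JSON payload for one anomaly event.

    The payload field names are illustrative, not a standard schema.
    """
    topic = f"event/{kind}/zone-{zone}/conf-{confidence:.2f}"
    payload = json.dumps(
        {
            "kind": kind,
            "zone": zone,
            "confidence": confidence,
            "model_version": model_version,
        },
        sort_keys=True,
    )
    return topic, payload


topic, payload = build_event("shoplift", 5, 0.87, "slowfast-int8-v3")
print(topic)    # event/shoplift/zone-5/conf-0.87
print(payload)
```

Carrying the confidence in the payload as well as the topic lets subscribers use a wildcard subscription such as `event/shoplift/#` and filter on the numeric value themselves.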
The pattern we use in production is: ONVIF Profile M out to the customer's VMS (because they already have one and any other choice creates integration pain), MQTT for the high-frequency event bus to the on-site response systems, and a dedicated audit-log write to immutable object storage (S3 Object Lock, Azure Blob WORM) for regulatory evidence. The detector itself is connected to the camera over RTSP or WebRTC depending on the camera generation.
Vendor Pricing And The Build-Versus-Buy Math In May 2026
The build-versus-buy decision in anomaly detection is a function of camera count, horizon, and the value of data ownership. Below 20 cameras and under 18 months of horizon, SaaS wins almost every time. Above 100 cameras and over a 3-year horizon, custom wins almost every time. The middle is genuinely contested.
The current SaaS pricing band, as of May 2026:
| Vendor | Indicative price | Typical fit | Watch-outs |
|---|---|---|---|
| BriefCam | $50 to $200 per camera per month | Enterprise, investigation + live | EU AI Act footprint, vendor lock-in |
| Motorola Avigilon | $30 to $150 per camera per month | Public sector, large campuses | Hardware tied to ecosystem |
| Milestone XProtect + plugin | $200 to $400 per server per month | VMS-led shops | Plugin quality varies |
| Verkada | $25 to $75 per camera per month | SMB, cloud-first | Proprietary cameras |
| Eagle Eye Networks | $10 to $30 per camera per month | Cloud VMS, basic analytics | Limited advanced anomaly detection |
| iOmniscient | $40 to $120 per camera per month | Transport, high-density anomaly | Fewer integrators |
| Genetec Security Center + KiwiVision | $40 to $150 per camera per month | Enterprise, large-scale ops | Annual maintenance ~20% list |
| AnyVision (Oosto) | $50 to $200 per camera per month | Identity-aware retail and venues | Biometric regime |
| Custom build (Fora Soft) | Project + infra | 100+ cameras, niche classes, regulator-grade audit | Time-to-first-pilot vs SaaS |
The numbers above are May 2026 list-price bands; negotiated enterprise contracts at scale routinely come in 30 to 50% below list. Treat the table as an order-of-magnitude reference, not a quote.
A worked crossover analysis. A 100-camera deployment, three-year horizon, EU AI Act-compliant. SaaS path (BriefCam mid-band at $100 per camera per month, including some EU AI Act paperwork wrapped in): $100 × 100 × 36 = $360,000 over three years. Custom build path: 12-week engineering build at a mid-market service rate plus $25,000 of edge hardware plus $30,000 a year of cloud and ops. Total custom over three years: roughly $200,000 to $280,000 depending on how aggressive the engineering team is and whether you reuse an existing VMS. The crossover lands somewhere around year 1.7 to year 2.1 — beyond that point, custom is cheaper and you own the data, the model, the audit trail, and the integration. Below 18 months of horizon, the SaaS contract is the right call; you do not have time to amortise the build.
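The crossover point can be computed rather than eyeballed. In the sketch below the SaaS price, camera count, hardware cost, and yearly running cost come from the paragraph above; the $150,000 engineering figure is a hypothetical placeholder, since the text gives only "a 12-week build at a mid-market service rate", and `crossover_years` is an illustrative helper.

```python
def crossover_years(saas_per_camera_month, cameras, build_upfront, run_per_year):
    """Years until cumulative SaaS spend overtakes the custom build.

    Returns None when the custom build never becomes cheaper.
    """
    saas_per_year = saas_per_camera_month * cameras * 12
    yearly_saving = saas_per_year - run_per_year
    if yearly_saving <= 0:
        return None
    return build_upfront / yearly_saving


ENGINEERING_USD = 150_000   # hypothetical; not stated in the text
HARDWARE_USD = 25_000
RUN_PER_YEAR_USD = 30_000

upfront = ENGINEERING_USD + HARDWARE_USD
years = crossover_years(100, 100, upfront, RUN_PER_YEAR_USD)

print(f"SaaS over 3 years:   ${100 * 100 * 36:,}")                   # $360,000
print(f"Custom over 3 years: ${upfront + 3 * RUN_PER_YEAR_USD:,}")   # $265,000
print(f"Crossover after about {years:.1f} years")                    # about 1.9 years
```

The result is sensitive to the engineering estimate: every $9,000 added to the upfront cost moves the crossover out by about a tenth of a year at these rates.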
The market growth context. The data anomaly detection market is at a 19.5% CAGR — it is one of the fastest-growing segments in the broader AI tooling space. The intelligent video analytics market is at $14.65 billion in 2026 forecast to $41.39 billion by 2031 (23.1% CAGR). The general video surveillance market is at $95 billion in 2026 forecast to $261 billion by 2034 (13.5% CAGR). Anomaly detection capability is the differentiator the major vendors are pricing into their new contracts; in 24 months it will be table stakes, and the products that ship without it will be re-platforming under pressure from their own customers.
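The growth rates quoted here can be checked against their own endpoints with the standard compound annual growth rate formula. The helper name `cagr` is illustrative; the market sizes and horizons are the ones in the paragraph above.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate between two values over a number of years."""
    return (end_value / start_value) ** (1 / years) - 1


# Intelligent video analytics: $14.65B (2026) to $41.39B (2031).
print(f"{cagr(14.65, 41.39, 5):.1%}")   # 23.1%

# General video surveillance: $95B (2026) to $261B (2034).
print(f"{cagr(95, 261, 8):.1%}")        # 13.5%
```

Both quoted rates are consistent with their start and end values, which is a quick sanity check worth running on any market figure before it goes into a board deck.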
The EU AI Act, GDPR, And The Compliance Engineering The Codebase Has To Carry
The EU AI Act (Regulation 2024/1689) is the largest single change to the economics of video anomaly detection in the past decade. It came into partial force in August 2025; the high-risk-system obligations come fully into force in August 2026. If your product is in the EU, or has EU customers, the relevant sections are now an engineering input, not a legal afterthought.
Real-time remote biometric identification of natural persons in publicly accessible spaces is heavily restricted under Article 5 (the prohibited-practices article). The exceptions are narrow (search for missing persons, prevention of a specific and immediate terrorist threat, identification of a perpetrator of a serious crime) and require judicial authorisation. Live face-recognition-based anomaly detection is the high-risk centre of the act; many vendors retrenched out of the EU market in 2025 specifically because their products could not satisfy Article 5.
Annex III lists the high-risk AI systems. Critical infrastructure surveillance, biometric identification, and law-enforcement-adjacent systems all land here. For a high-risk system the obligations under Articles 9–15 include a documented risk management system, dataset governance with bias documentation, technical documentation that covers the model and the training set, automatic logging, human oversight provisions, accuracy and robustness requirements, and a conformity assessment. These are not legal-team artefacts; they are engineering artefacts that the codebase has to produce. Model cards, data sheets, training logs, drift logs, alert audit trails — these are the deliverables that satisfy Articles 11 and 13.
The fines are not symbolic. Up to 7% of global annual turnover or €35 million, whichever is higher, for the most serious prohibited-practice violations. Up to 3% or €15 million for high-risk-system non-compliance. A small product that ships in the EU without a conformity assessment is exposed in a way that no product in this space was three years ago.
GDPR has not gone away. Article 9 still treats biometric data used to uniquely identify a person as a special category requiring an explicit legal basis. Article 35 still requires a Data Protection Impact Assessment for any meaningful video surveillance deployment. The CCTV-specific guidance from the European Data Protection Board (EDPB) and the UK ICO has tightened steadily since 2022. The practical consequence is that face-based anomaly detection in the EU is exceptionally hard to ship legally outside narrowly defined use cases; gait, posture, object, and behaviour-based detection sit on a much easier legal footing.
The engineering pattern that satisfies the act is the one we already recommended for technical reasons. Prefer gait and posture anomaly detection over face-based when the use case allows. Keep frames on-site when latency allows — edge-only inference is a much smaller compliance surface than cloud inference. Log every detection with confidence, the reviewer ID, the disposition, and the model version — that log is the audit trail the regulator will ask for. Wire human oversight into the response workflow; the act requires it. Maintain the model card and the training data documentation in the same repo as the model; do not treat them as a separate artefact.
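As a rough illustration of the audit-trail entry described above, the sketch below models one detection as an immutable record and serialises it to a JSON line. The class name, field names, and example values are assumptions made for illustration; the act does not prescribe a schema, and a real deployment would add whatever its own reviewers and regulator ask for.

```python
import json
from dataclasses import asdict, dataclass, field, replace
from datetime import datetime, timezone
from typing import Optional


def _utc_now() -> str:
    return datetime.now(timezone.utc).isoformat()


@dataclass(frozen=True)
class DetectionAuditRecord:
    """One audit-trail entry; field names are hypothetical."""
    camera_id: str
    anomaly_class: str
    confidence: float
    model_version: str
    reviewer_id: Optional[str] = None   # filled in once a human has reviewed
    disposition: Optional[str] = None   # e.g. "confirmed" or "false_positive"
    detected_at: str = field(default_factory=_utc_now)

    def __post_init__(self) -> None:
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")


def to_log_line(record: DetectionAuditRecord) -> str:
    """Serialise to one JSON line, suitable for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)


if __name__ == "__main__":
    raised = DetectionAuditRecord("cam-01", "loitering", 0.91, "example-model-v1")
    # The review produces a new record instead of mutating the original.
    reviewed = replace(raised, reviewer_id="op-17", disposition="confirmed")
    print(to_log_line(raised))
    print(to_log_line(reviewed))
```

Keeping the record frozen and writing the review as a second line means the log shows both what the model said and what the human decided, in order.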
We cover the full regulatory layer in lesson 8.5 on EU AI Act and Article 50 engineering. The short version: in 2026 the compliance posture is a feature of the product, not an external imposition.
The Twelve Anomaly Detection Algorithms You Actually Choose Between
The next article in this series, 2.16 — Anomaly detection algorithms: 12 production approaches, walks through every algorithm in detail. The summary below is the short version: enough to know which name to use in an RFP without claiming depth you do not have.
One-class SVM is the classical baseline. Train it on normal frames or features, threshold the distance to the boundary at inference. Lightweight, no anomaly examples needed, and the right floor for "is anything moving in this normally-empty space?" use cases.
Isolation Forest isolates anomalies by recursively partitioning the feature space with random binary trees. Anomalies are isolated in fewer splits than normal points. Fast, scales linearly, requires no anomaly examples. The right tool for signal-level anomaly detection on engineered features.
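To make the "fewer splits" idea concrete, here is a toy isolation forest in plain Python. It is a teaching sketch under simplifying assumptions (tuples of floats as features, trees stored as nested tuples, hypothetical function names), not a substitute for a maintained library implementation. Scores closer to 1 mean the point was isolated in fewer splits.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def _avg_path(n):
    """Average path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def _build(points, depth, max_depth, rng):
    if depth >= max_depth or len(points) <= 1:
        return ("leaf", len(points))
    # Only split on dimensions where the points actually differ.
    dims = [d for d in range(len(points[0]))
            if min(p[d] for p in points) < max(p[d] for p in points)]
    if not dims:
        return ("leaf", len(points))
    dim = rng.choice(dims)
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return ("node", dim, split,
            _build(left, depth + 1, max_depth, rng),
            _build(right, depth + 1, max_depth, rng))


def _path_length(tree, point, depth=0):
    if tree[0] == "leaf":
        # Unsplit leaves get the expected extra depth for their size.
        return depth + _avg_path(tree[1])
    _, dim, split, left, right = tree
    branch = left if point[dim] < split else right
    return _path_length(branch, point, depth + 1)


def fit_forest(points, n_trees=100, sample_size=256, seed=0):
    if len(points) < 2:
        raise ValueError("need at least two points")
    rng = random.Random(seed)
    psi = max(2, min(sample_size, len(points)))
    max_depth = math.ceil(math.log2(psi))
    trees = [_build(rng.sample(points, psi), 0, max_depth, rng)
             for _ in range(n_trees)]
    return trees, psi


def anomaly_score(forest, point):
    """Score in (0, 1]; higher means isolated in fewer splits."""
    trees, psi = forest
    mean_path = sum(_path_length(t, point) for t in trees) / len(trees)
    return 2.0 ** (-mean_path / _avg_path(psi))


if __name__ == "__main__":
    normal = [(i / 10, j / 10) for i in range(10) for j in range(10)]
    forest = fit_forest(normal + [(5.0, 5.0)])
    print("far point:", round(anomaly_score(forest, (5.0, 5.0)), 3))
    print("grid point:", round(anomaly_score(forest, (0.5, 0.5)), 3))
```

In a video pipeline the input tuples would be engineered features per stream (bitrate, dropped frames, motion energy, and so on) rather than raw pixels.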
Local Outlier Factor (LOF) identifies local density-based anomalies — points in a low-density region surrounded by higher density. Better than Isolation Forest when normal behaviour varies in density across the feature space.
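The density comparison behind LOF fits in a few lines. The sketch below is a brute-force version for small point sets, assuming Euclidean distance, no handling of distance ties, and no duplicate points; the function name is hypothetical. A score near 1 means a point is about as dense as its neighbours, and a score well above 1 means it sits in a sparser region than they do.

```python
import math


def lof_scores(points, k=3):
    """Brute-force Local Outlier Factor for a small list of coordinate tuples."""
    n = len(points)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of points")
    dist = [[math.dist(a, b) for b in points] for a in points]

    neighbours = []   # indices of the k nearest neighbours of each point
    k_distance = []   # distance from each point to its k-th nearest neighbour
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbours.append(order[:k])
        k_distance.append(dist[i][order[k - 1]])

    def local_reachability_density(i):
        # Reachability distance to o is never smaller than o's own k-distance.
        reach = [max(k_distance[j], dist[i][j]) for j in neighbours[i]]
        return len(reach) / sum(reach)

    lrd = [local_reachability_density(i) for i in range(n)]
    # LOF is the mean ratio of the neighbours' density to the point's own density.
    return [sum(lrd[j] for j in neighbours[i]) / (k * lrd[i]) for i in range(n)]


if __name__ == "__main__":
    grid = [(float(x), float(y)) for x in range(5) for y in range(5)]
    scores = lof_scores(grid + [(10.0, 10.0)], k=3)
    print("isolated point:", round(scores[-1], 2))
    print("grid centre:", round(scores[12], 2))
```

The quadratic distance matrix is fine for a demo and wrong for production; a real deployment would use a library with a spatial index.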
Autoencoder reconstruction trains an encoder–decoder network to compress and reconstruct normal frames; anomalies have high reconstruction error. The dominant unsupervised approach for video anomaly detection in 2020–2024 and still the right call when you have zero labels.
ST-AE and MemAE are spatial-temporal autoencoders specialised to video. MemAE adds a memory module that prevents the autoencoder from generalising too well to anomalies; ST-AE adds temporal modelling. Both run on edge hardware.
Variational Autoencoder (VAE) and GAN-based detectors model the distribution of normal data and flag samples with low likelihood. Well established in research, less common in production because the models, GANs in particular, are hard to train stably.
3D-CNN (I3D, C3D, SlowFast) is the workhorse for behavioural anomaly detection. Quantised to INT8, it runs at 30 fps on a Jetson Orin Nano Super. The right layer-2 model for production.
Video transformers (TimeSformer, VideoSwin, VideoMAE, ViViT) are the accuracy ceiling. They dominate the small-dataset benchmarks (Avenue, ShanghaiTech) and are the right choice for cloud-side re-scoring. Not yet real-time-feasible at the edge for most variants in 2026.
Weakly-supervised MIL (RTFM, MGFN, AAMIL, Unbiased MIL, MTFL) is the academic state of the art on UCF-Crime. The right layer-2 model when you have video-level labels but not frame-level labels.
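For readers who have not met the MIL idea, the sketch below shows the ranking objective from Sultani et al. (2018), the paper in the references that introduced the MIL framework for this task. It works on plain lists of per-segment scores in [0, 1] for one anomalous video and one normal video. It is not the training code of RTFM, MGFN, or any other method named above, and the weight defaults are illustrative.

```python
def mil_ranking_loss(anomalous_scores, normal_scores,
                     smooth_weight=8e-5, sparse_weight=8e-5):
    """MIL ranking loss for one (anomalous video, normal video) pair.

    Each argument is a list of per-segment anomaly scores in [0, 1].
    Only video-level labels are needed: we know one video contains an
    anomaly somewhere and the other does not.
    """
    # The top-scoring segment of the anomalous video should outrank the
    # top-scoring segment of the normal video by a margin of 1.
    hinge = max(0.0, 1.0 - max(anomalous_scores) + max(normal_scores))
    # Temporal smoothness: neighbouring segments should score similarly.
    smoothness = sum((a - b) ** 2
                     for a, b in zip(anomalous_scores, anomalous_scores[1:]))
    # Sparsity: anomalies should occupy only a few segments.
    sparsity = sum(anomalous_scores)
    return hinge + smooth_weight * smoothness + sparse_weight * sparsity


if __name__ == "__main__":
    well_separated = mil_ranking_loss([0.1, 0.9, 0.1], [0.1, 0.2, 0.1])
    badly_separated = mil_ranking_loss([0.2, 0.2, 0.2], [0.8, 0.8, 0.8])
    print(well_separated < badly_separated)
```

The point for a product decision is the input: one label per video, not one per frame, which is what makes the labelling budget tractable.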
CLIP-based methods (VadCLIP, TPWNG, VadCLIP++, TrCLIP-VAD, WSVAD-CLIP, AVadCLIP) are the layer-3 re-scorer family. Use them when you need open-vocabulary detection or when your labelling budget is small.
Multimodal LLMs (Holmes-VAD, AnomalyRuler, Cerberus) are the layer-4 explainer family. Use them when the alert has to be auditable in natural language for a human reviewer or a regulator.
LSTM and temporal CNN on engineered features are the layer-5 signal-anomaly tools. Use them on the pipeline metadata to catch the failures the vision layers miss.
The Fora Soft default architecture wires layers 1, 2, 3, 4, and 5 together — a quantised 3D-CNN at the edge, a MIL classifier on top, a CLIP-based re-scorer in the cloud, a frontier VLM for explanation on the top 1% of borderline cases, and an isolation forest watching the pipeline. The cost runs in the band described in the worked example above; the accuracy lands at the top of the field; the audit trail satisfies the AI Act. Other architectures work for narrower problems; this one works for the general case.
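The escalation logic of such a cascade can be sketched as a single routing function. Everything here is an assumption made for illustration: the thresholds, the definition of "borderline" as a band around the alert threshold, the `Alert` fields, and the `rescore` and `explain` callables that stand in for the cloud re-scorer and the explainer. It shows the control flow only, not the models.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Alert:
    clip_id: str
    edge_score: float
    cloud_score: Optional[float] = None
    explanation: Optional[str] = None
    raised: bool = False


def route_clip(clip_id: str,
               edge_score: float,
               rescore: Callable[[str], float],
               explain: Callable[[str], str],
               edge_threshold: float = 0.5,
               alert_threshold: float = 0.8,
               borderline_margin: float = 0.05) -> Alert:
    """Escalate a clip through the cascade; all thresholds are hypothetical."""
    alert = Alert(clip_id, edge_score)
    if edge_score < edge_threshold:
        return alert  # nothing leaves the edge device
    alert.cloud_score = rescore(clip_id)
    alert.raised = alert.cloud_score >= alert_threshold
    # Only scores close to the decision boundary pay for an explanation.
    if abs(alert.cloud_score - alert_threshold) <= borderline_margin:
        alert.explanation = explain(clip_id)
    return alert


if __name__ == "__main__":
    result = route_clip("clip-42", 0.9,
                        rescore=lambda clip: 0.82,
                        explain=lambda clip: "stub explanation")
    print(result)
```

The cost behaviour follows from the early returns: most clips stop at the first check, and the most expensive call runs only inside the narrow band.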
A Common Mistake: Treating Anomaly Detection As An Object Detection Problem
The most expensive architectural mistake in this field is the one where a team takes a working YOLO pipeline they have already built for object detection and bolts an anomaly rule on top — "alert if a person is in zone X" or "alert if a count exceeds N". It looks cheap because it reuses code. It is not.
The reason it fails is that an anomaly is rarely "an object in a place"; it is "an object behaving differently in time, in a context the model has not been told about". A person walking through a zone is normal at 09:00 and anomalous at 03:00; a bag near a bench is normal for 2 minutes and anomalous after 10; a person running is normal in a corridor and anomalous on a transit platform. None of these distinctions live in a YOLO output. To get them, you need either a temporal model (a 3D-CNN, a video transformer, an LSTM on object tracks) or a rule layer with a schedule and a context (which is fine for the simplest cases and quickly becomes unmaintainable).
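The rule-layer option looks like the sketch below. It shows that the decision depends on inputs an object detector never emits: the time of day and how long the object has been there. The `ZonePolicy` fields, the opening hours, and the ten-minute limit are hypothetical values chosen to mirror the examples above.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta


@dataclass(frozen=True)
class ZonePolicy:
    """Context the detector does not know about; all values are illustrative."""
    open_from: time
    open_until: time
    max_unattended: timedelta


def person_is_anomalous(policy: ZonePolicy, seen_at: datetime) -> bool:
    """The same 'person in zone' detection is normal or anomalous by schedule."""
    return not (policy.open_from <= seen_at.time() < policy.open_until)


def bag_is_anomalous(policy: ZonePolicy, first_seen: datetime, now: datetime) -> bool:
    """The same 'bag near bench' detection becomes anomalous with dwell time."""
    return now - first_seen >= policy.max_unattended


if __name__ == "__main__":
    policy = ZonePolicy(time(6, 0), time(22, 0), timedelta(minutes=10))
    day = datetime(2026, 5, 23, 9, 0)
    night = datetime(2026, 5, 23, 3, 0)
    print(person_is_anomalous(policy, day), person_is_anomalous(policy, night))
    print(bag_is_anomalous(policy, day, day + timedelta(minutes=2)),
          bag_is_anomalous(policy, day, day + timedelta(minutes=10)))
```

Two rules and one policy object are manageable. The maintenance problem starts when every zone, shift pattern, and object class needs its own variant.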
The fix is to start the architecture from layer 2 of the stack above — a temporal model that ingests several seconds of video — and only layer in the object detector when you need to localise the anomaly to a specific entity for a downstream action. The reverse path almost always lands a year into the project, four kinds of edge cases deep, with a tangle of rules that nobody can debug.
Pitfall avoided: when a stakeholder says "we already have object detection, let's just add anomaly detection on top", the right response is "what's the latency budget, what's the anomaly class, what is the temporal context length, and what's the labelling budget?" Those four questions force the conversation back to layer 2 of the stack, not layer 6 of the rules engine.
Where Fora Soft Fits In
We have shipped production video products for two decades — surveillance with Netcam Studio, conferencing on WebRTC, telemedicine, e-learning, OTT, AR/VR. The anomaly detection layer is the workload we have been most active on in 2024–2026, both for greenfield surveillance products and for retrofit projects where an existing video product needs the capability added without disturbing the rest of the pipeline. Our default architecture is the five-layer stack in this article; our default deployment topology is the hybrid edge-plus-cloud pattern; our default integration is ONVIF Profile M plus MQTT plus a regulator-grade audit trail. We have worked across retail, transit, industrial safety, healthcare, and conferencing verticals — different anomaly classes, different latency budgets, the same engineering discipline.
What To Read Next
- Anomaly detection algorithms — 12 production approaches — the deep dive on each of the twelve algorithms named above.
- Computer vision in retail, industrial, and intelligent video analytics — vertical playbooks for the largest commercial anomaly-detection use cases.
- The cost model — what AI in video actually costs at scale — the dollar arithmetic behind every layer of the stack above.
Talk To Us / See Our Work / Download
Talk to a video engineer. Book a 30-minute scoping call with a Fora Soft senior engineer; we will map your latency budget, camera fleet, and compliance shape to a concrete architecture you can ship.
See our case studies. Our surveillance and video analytics work is collected in the projects portfolio at forasoft.com/projects.
Download the anomaly detection architecture decision worksheet. A one-page printable that walks the five-layer stack against the four numbers you need — latency budget, camera count, labelled examples, compliance regime. Download the worksheet (PDF).
References
- Sultani, W., Chen, C., & Shah, M. (2018). Real-World Anomaly Detection in Surveillance Videos. CVPR. https://arxiv.org/pdf/1801.04264 — Foundational paper introducing UCF-Crime and the MIL framework for weakly-supervised video anomaly detection.
- Wu, P. et al. (2024). VadCLIP: Adapting Vision-Language Models for Weakly Supervised Video Anomaly Detection. AAAI. https://www.semanticscholar.org/paper/VadCLIP%3A-Adapting-Vision-Language-Models-for-Weakly-Wu-Zhou/a58a1eb932372f70039dbfb0b49af84de855d18e — First CLIP-based weakly-supervised VAD method; the layer-3 reference for the architecture in this article.
- Yang, Z. et al. (2024). Text Prompt with Normality Guidance for Weakly Supervised Video Anomaly Detection. CVPR. https://openaccess.thecvf.com/content/CVPR2024/papers/Yang_Text_Prompt_with_Normality_Guidance_for_Weakly_Supervised_Video_Anomaly_CVPR_2024_paper.pdf — TPWNG; CLIP-text-embedding pseudo-labelling for VAD.
- Zhang, H. et al. (2024). Holmes-VAD: Towards Unbiased and Explainable Video Anomaly Detection via Multi-modal LLM. https://arxiv.org/pdf/2406.12235 — Multimodal LLM for VAD with the VAD-Instruct50k benchmark; the layer-4 reference.
- Yang, Y. et al. (2024). Follow the Rules: Reasoning for Video Anomaly Detection with Large Language Models. ECCV. https://arxiv.org/pdf/2407.10299 — AnomalyRuler; few-shot rule induction for VAD.
- Liu, X. et al. (2025). Cerberus: Real-Time Video Anomaly Detection via Cascaded Vision-Language Models. https://arxiv.org/pdf/2510.16290 — 97.2% UCF-Crime accuracy at 57.68 fps on L40S; real-time VLM-class VAD.
- Lv, H. et al. (2023). Unbiased Multiple Instance Learning for Weakly Supervised Video Anomaly Detection. CVPR. https://openaccess.thecvf.com/content/CVPR2023/papers/Lv_Unbiased_Multiple_Instance_Learning_for_Weakly_Supervised_Video_Anomaly_Detection_CVPR_2023_paper.pdf — Layer-2 MIL state of the art reference.
- MTFL: Multi-Timescale Feature Learning for Weakly-Supervised Anomaly Detection in Surveillance Videos. (2024). https://arxiv.org/pdf/2410.05900 — Multi-timescale MIL approach; current SOTA on several VAD benchmarks.
- Foundation Models and Transformers for Anomaly Detection: A Survey. (2025). https://arxiv.org/abs/2507.15905 — July 2025 survey covering transformers and foundation models for visual anomaly detection.
- A Comprehensive Survey of Transformer-Based Models for Video Anomaly Detection. (2025). https://iieta.org/download/file/fid/191034 — October 2025 survey covering transformer-based VAD.
- Foundation Models for Anomaly Detection: Vision and Challenges. AI Magazine. (2025). https://onlinelibrary.wiley.com/doi/full/10.1002/aaai.70045 — Wiley overview of foundation model use in anomaly detection.
- A Survey on Video Anomaly Detection via Deep Learning: Human, Vehicle, and Environment. (2025). https://arxiv.org/html/2508.14203v1 — Recent survey covering human, vehicle, and environment-centric VAD applications.
- Qdrant. Video Anomaly Detection From Edge to Cloud With Qdrant. (2025). https://qdrant.tech/blog/video-anomaly-detection-edge-to-cloud/ — Worked production example showing edge-to-cloud escalation cutting cloud volume by 6×.
- NVIDIA. Metropolis: Intelligent Vision AI for Smart Infrastructure. Read 2026-05-23. https://www.nvidia.com/en-us/autonomous-machines/intelligent-video-analytics-platform/ — NVIDIA's vision AI platform for edge-to-cloud video analytics.
- Hailo. Edge AI Cameras and Devices. Read 2026-05-23. https://hailo.ai/blog/edge-ai-device/ — Reference for Hailo-8 specifications (26 TOPS at 2.5W) used in the edge worked example.
- Fortune Business Insights. Video Analytics Market Size, Share, Growth & Trends 2034. Read 2026-05-23. https://www.fortunebusinessinsights.com/industry-reports/video-analytics-market-101114 — Primary source for the $14.65B 2026 / $41.39B 2031 video analytics market projection.
- Mordor Intelligence. Video Analytics Market Size, Growth Trends 2030. Read 2026-05-23. https://www.mordorintelligence.com/industry-reports/video-analytics-market — Cross-check for the 23.1% CAGR figure in the video analytics market.
- Fortune Business Insights. Video Surveillance Market Size, Share & Industry Analysis Report 2034. Read 2026-05-23. https://www.fortunebusinessinsights.com/video-surveillance-market-102673 — Source for the $95B 2026 / $261B 2034 video surveillance market projection.
- Market.us. Data Anomaly Detection Market — CAGR of 19.5%. Read 2026-05-23. https://market.us/report/data-anomaly-detection-market/ — Source for the 19.5% CAGR figure for the data anomaly detection segment.
- Latham & Watkins. EU AI Act — GPAI obligations in force. Read 2026-05-23. https://www.lw.com/en/insights/eu-ai-act-gpai-model-obligations-in-force-and-final-gpai-code-of-practice-in-place — Primary legal source for the EU AI Act timeline, Annex III high-risk classification, and fine structure.
- ONVIF. Profile M for Metadata and Analytics. Read 2026-05-23. https://www.onvif.org/profiles/profile-m/ — Primary source for ONVIF Profile M technical requirements.
- Yang, Y. et al. (2024). Follow the Rules: Reasoning for Video Anomaly Detection with Large Language Models. ECCV. https://www.ecva.net/papers/eccv_2024/papers_ECCV/papers/10568.pdf — Conference version of the AnomalyRuler paper.


