
Key takeaways
• Real-time anomaly detection has a latency budget, not a feature list. Under 200 ms end-to-end enables active response (alarm, gate, staff); 200–500 ms covers standard perimeter and retail; anything over a second is an investigation tool, not a prevention tool.
• Hybrid architectures beat single models in production. 3D-CNN (I3D) backbones with weakly-supervised MIL heads still hold the top of UCF-Crime at ~97% AUC. Transformers (TimeSformer, VideoSwin, VideoMAE) win Avenue/ShanghaiTech but are cloud-only for now.
• Edge + cloud is the 2026 default. NVIDIA Jetson Orin / Hailo-8 / Google Coral handle 4–16 concurrent 1080p streams at <200 ms; cloud GPU re-scores borderline alerts, retrains models, and stores the audit trail.
• The EU AI Act changed the economics. Real-time biometric-adjacent video analytics is high-risk under Annex III; conformity assessment, human oversight, and audit logs are mandatory. This trimmed several vendors out of the EU market in 2025.
• Build vs buy crosses over around 18–24 months for 100+ cameras. SaaS (BriefCam, Avigilon, Verkada, iOmniscient) at $30–200 / camera / month pays off for small deployments. Custom builds win on scale, data control and regulator-grade audit trails.
More on this topic: read our complete guide — Top 7 Anomaly Detection Models for Video Surveillance (2026).
Why Fora Soft wrote this anomaly-detection playbook
At Fora Soft we have been building production video-surveillance and video-analytics products for two decades, including the Netcam Studio surveillance UI and numerous multi-camera analytics deployments. We work across WebRTC, ONVIF, edge inference and cloud pipelines every week. This page is the opinionated 2026 version of what we would tell any CTO who asked “how do I do real-time anomaly detection without burning 18 months?”
The underlying tech moved fast in 2024–2026: Transformer video backbones matured, weakly-supervised MIL variants closed the gap with fully-labelled training, edge NPUs (Hailo-8, Jetson Orin, Coral) got cheap enough for fleet deployment, and the EU AI Act put real documentation requirements on everything that looks at a person in real time. This playbook is structured around the decisions that actually move the project — model family, deployment topology, dataset strategy, integration protocol, compliance posture — and the traps to avoid on each.
If you want adjacent context first, read our pieces on real-time video analytics and real-time video processing with AI. For standards and integration, see our ONVIF profiles in security systems primer and the deeper ONVIF Profile M write-up.
Evaluating real-time anomaly detection for your cameras?
30 minutes with a senior Fora Soft engineer — we will map your latency budget, camera fleet and compliance shape to a concrete architecture you can actually ship.
What counts as an “anomaly” in video surveillance
The word “anomaly” is doing too much work in most RFPs. In practice it covers four very different problem classes, and the model you pick depends on which one you mean:
- Behavioural anomalies — a person running in a restricted area, loitering, jumping a fence, climbing onto a platform. Temporal and motion models win here.
- Physical anomalies — an unattended package, a puddle on a factory floor, a fallen sign. Frame-level detection + persistence tracking.
- Density / crowd anomalies — queue overflow, crowd crush, platform density. Heat-map regression plus threshold logic.
- Temporal / schedule anomalies — movement in an off-hours zone, a vehicle stationary too long. Often solved by rules + normal-model learning.
Naming the class before picking a model is the single biggest win. We see RFPs every quarter that ask for “Transformer-based video anomaly detection” when the actual requirement is persistent-object detection with a schedule overlay. Using the right tool saves 3–6 months and 70% of the inference budget.
The real-time latency budget
Real-time here means “fast enough to act before the event ends.” That translates to concrete budgets:
| Use case | End-to-end target | Why | Realistic topology |
|---|---|---|---|
| Transit-platform fall | < 150 ms | Gate-close, auto-stop train | Edge NPU at camera |
| Intrusion, perimeter breach | 200–300 ms | Arm siren, dispatch guard | Edge box on site |
| Retail shoplifting | 300–500 ms | Staff on the floor intercepts | Edge + cloud refinement |
| Crowd density / queue | 500–1000 ms | Open more tills, re-route | Cloud acceptable |
| Forensic re-analysis | Minutes to hours | Post-incident investigation | Cloud batch |
Figure 1. Latency budgets and realistic deployment topologies. Network RTT adds 50–100 ms on WAN paths, so anything under 300 ms needs the inference on-site.
Rule of thumb: if your target event is “preventable by a human if alerted in <10 seconds,” you need edge inference. If it is “investigable after the fact,” cloud is fine and cheaper.
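A minimal sketch of how the budget check can be made explicit, assuming the bands quoted in this article (under 200 ms edge, 200–500 ms hybrid, slower than that cloud). The stage names and per-stage timings are hypothetical placeholders, not measurements; only the 100 ms WAN figure comes from the range quoted under Figure 1.

```python
from dataclasses import dataclass


@dataclass
class StageLatency:
    name: str
    ms: float


def choose_topology(target_ms: float) -> str:
    """Map an end-to-end target to the topology bands used in this article."""
    if target_ms < 200:
        return "edge"
    if target_ms <= 500:
        return "hybrid"
    return "cloud"


def fits_budget(stages, target_ms):
    """Return (fits, total_ms) for a list of pipeline stages."""
    total = sum(stage.ms for stage in stages)
    return total <= target_ms, total


# Hypothetical timings for a cloud-only path over a WAN link.
cloud_path = [
    StageLatency("capture_and_encode", 40),
    StageLatency("wan_round_trip", 100),  # upper end of the 50-100 ms range
    StageLatency("decode", 15),
    StageLatency("gpu_inference", 50),
    StageLatency("alert_fanout", 30),
]

fits, total = fits_budget(cloud_path, target_ms=150)
print(choose_topology(150), fits, total)
```

With these illustrative numbers the cloud path totals 235 ms, so a 150 ms transit-platform target fails before the model is even considered, which is the point of budgeting per stage.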
The model families that matter in 2026
| Family | Examples | Best at | Edge feasible? |
|---|---|---|---|
| 3D-CNN | I3D, C3D, SlowFast | Short clip, temporal action | Yes (quantised) |
| CNN + LSTM / ConvLSTM | ResNet + LSTM, ConvLSTM | Long sequences, loitering | Yes (lightweight CNN) |
| Video Transformer | TimeSformer, VideoSwin, VideoMAE, ViViT | High accuracy, global context | Not for real-time (cloud) |
| Weakly-supervised MIL | RTFM, MGFN, AAMIL | Low label effort, long tail | Yes (thin classifier head) |
| Autoencoder / MemAE | ST-AE, MemAE | Unsupervised, “what’s different?” | Yes |
| CLIP / VLM zero-shot | CLIP + text prompts, VideoCLIP | Cold start, few labels, semantic prompts | Partially (distilled) |
| Diffusion-based anomaly | Emerging 2024–26 | Probabilistic, uncertainty-aware | No (cloud only) |
Figure 2. Model families and where they fit. “Edge feasible” assumes a Jetson Orin-class NPU with INT8 quantisation.
Reach for CLIP / VLM zero-shot when: you need to ship something in weeks, you do not have labelled examples of the target anomaly, or the anomalies are semantically describable (“person climbing fence,” “package left unattended”). Accuracy ceilings are lower; velocity is much higher.
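To make the zero-shot idea concrete, here is a rough sketch of the scoring step only: cosine similarity between one frame embedding and a set of prompt embeddings, followed by a softmax. The 3-dimensional vectors are toy stand-ins for real encoder outputs, and the temperature value is an assumption; a real deployment would take both from whichever CLIP-style encoder it uses.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_scores(frame_emb, prompt_embs, temperature=0.07):
    """Softmax over temperature-scaled similarities to each text prompt."""
    logits = {label: cosine(frame_emb, emb) / temperature
              for label, emb in prompt_embs.items()}
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(v - peak) for label, v in logits.items()}
    total = sum(exps.values())
    return {label: v / total for label, v in exps.items()}


# Toy embeddings (hypothetical values, not real encoder output).
prompts = {
    "person climbing fence": [0.9, 0.1, 0.0],
    "package left unattended": [0.0, 0.9, 0.2],
    "normal scene": [0.3, 0.3, 0.9],
}
frame = [0.8, 0.2, 0.1]

scores = zero_shot_scores(frame, prompts)
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

Including an explicit "normal scene" prompt is a design choice in this sketch: it gives the softmax somewhere to put probability mass when nothing anomalous is in frame.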
Datasets and benchmarks worth knowing
Benchmarks do not translate to production — but they anchor conversations. Current 2025–26 reference numbers:
- UCF-Crime — 1,900 videos, 13 anomaly classes (abuse, arson, robbery, etc.). Real-world reference for weakly-supervised methods. SOTA: ~97% video-level AUC with I3D + attention-based MIL. See our commentary on this dataset.
- ShanghaiTech Campus — 437 videos of campus life. SOTA: ~95–96% AUC with VideoSwin / VideoMAE.
- Avenue — 47 videos, unusual pedestrian behaviour. SOTA: ~98.5% AUC, dominated by Transformers.
- XD-Violence — 2,405 videos, violence-centric labels. Hybrid CNN+Transformer models reach ~96% AP.
- Street Scene (2024–2025) — diverse urban footage; domain-shift benchmark. AUC typically drops 10–15 points versus in-domain — a useful reality check.
The honest read: a model scoring 97% on UCF-Crime will still require fine-tuning on 2–6 weeks of your actual site footage before customer-grade performance. Plan for that data-collection and relabelling cost upfront.
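Because every number in this section is an AUC, it helps to be able to compute the same metric on your own site footage. The sketch below uses the pairwise (rank-based) definition of ROC AUC: the share of anomalous/normal pairs the model orders correctly. The labels and scores are invented for illustration and do not come from any benchmark.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (anomalous, normal) pairs ranked correctly.

    Ties count as half a correct pair. The pairwise loop is O(n^2), which is
    fine for a sanity check but not for large evaluation sets.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one anomalous and one normal sample")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Hypothetical scores: the same model on benchmark clips and on site clips.
labels = [0, 0, 0, 1, 1]
benchmark_scores = [0.1, 0.2, 0.3, 0.8, 0.9]
site_scores = [0.1, 0.6, 0.3, 0.5, 0.9]

print(roc_auc(labels, benchmark_scores))
print(round(roc_auc(labels, site_scores), 3))
```

Running the same function on benchmark and site data side by side is a cheap way to see the domain-shift gap before committing to a fine-tuning budget.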
Edge vs cloud vs hybrid — the deployment call
1. Edge only. A Jetson Orin, Hailo-8 or Coral module next to each camera. 4–16 concurrent 1080p streams per device, 50–200 ms inference, no WAN dependency, strongest privacy story (frames never leave the site). Hardware $200–$600/unit. Best for small sites, regulated industries, poor connectivity.
2. Cloud only. Stream to an NVIDIA L4 / A100. 100+ streams per GPU, 20–50 ms inference, central retraining, hot model swaps. Cost $0.50–$3 / stream / hour on managed plans. Best for large centralised fleets with good WAN and a forensic focus. Not for <300 ms responses (network round-trip eats the budget).
3. Hybrid (our default). Edge detects at high precision / moderate recall; cloud re-scores borderline events, owns the model registry, runs drift detection and archival. Typical split: 60–70% edge recall at 95%+ precision, cloud catches the remainder. This is the pattern BriefCam, Motorola Avigilon and Milestone integrations use in the field.
4. What drives the choice. Latency budget (see Figure 1), camera count, site connectivity, compliance regime (EU AI Act is far easier on edge-only), and dataset locality.
5. The anti-pattern. Pushing raw 1080p streams to the cloud for a latency-critical use case just to reuse an existing cloud GPU. You will pay in bandwidth, latency, and regulatory exposure.
Reach for hybrid when: you have >20 cameras, any real-time SLA, or any regulated footage. Hybrid usually works out cheaper than cloud-only once you factor in bandwidth and egress.
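One way to picture the hybrid split is as a three-way routing decision on the edge score: fire locally when confident, drop when clearly normal, and forward only the borderline band to the cloud re-scorer. This is a sketch under assumed thresholds (0.90 and 0.40 are hypothetical and would be tuned per site and per anomaly class).

```python
from collections import Counter


def route_event(edge_score, fire_at=0.90, drop_below=0.40):
    """Decide where an edge detection goes next.

    fire_at and drop_below are hypothetical thresholds for illustration.
    """
    if edge_score >= fire_at:
        return "alert_on_edge"
    if edge_score < drop_below:
        return "discard"
    return "send_to_cloud_rescorer"


# Hypothetical edge scores for a handful of candidate events.
edge_scores = [0.97, 0.35, 0.62, 0.91, 0.48, 0.12]
routing = Counter(route_event(score) for score in edge_scores)
print(dict(routing))
```

Only the middle band ever crosses the WAN, which is where the bandwidth and privacy advantages of the hybrid topology come from.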
Need the right edge + cloud topology mapped to your cameras?
We will price out Jetson Orin / Hailo / Coral fleets, GPU quantities and bandwidth in one working session.
Integration protocols — ONVIF, RTSP, WebRTC, MQTT
Detection without integration is a demo. The four protocols that matter:
- ONVIF Profile M. The metadata/analytics profile, released by ONVIF in 2021 and since broadly adopted by Milestone, Genetec, Avigilon, Hikvision, Axis. If your detector cannot emit ONVIF events, it will not plug into enterprise VMS.
- RTSP. Still the default camera stream. 30–50 ms LAN latency, 100–300 ms WAN. Edge boxes pull RTSP, run inference, push events.
- WebRTC. 20–100 ms latency, better firewall traversal, stronger encryption. Adopted by newer cloud VMS and the operator UIs we ship when sub-second delivery matters — same stack we use on Worldcast Live.
- MQTT. The alerting bus. Anomaly detector publishes event/shoplift/zone-5/conf-0.87; SIEM, ticketing (Jira, ServiceNow), access control and staff mobile apps subscribe.
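As a small illustration of the MQTT leg, the sketch below builds a topic in the event/shoplift/zone-5/conf-0.87 shape shown above, plus a JSON payload. The payload field names and the camera ID are hypothetical, and the actual publish call is left out because it depends on the MQTT client library you choose.

```python
import json
from datetime import datetime, timezone


def build_topic(event_type, zone, confidence):
    """Topic in the event/<type>/zone-<n>/conf-<score> shape used above."""
    return f"event/{event_type}/zone-{zone}/conf-{confidence:.2f}"


def build_payload(event_type, zone, confidence, camera_id, detected_at=None):
    """JSON body for the alert; field names are illustrative assumptions."""
    if detected_at is None:
        detected_at = datetime.now(timezone.utc).isoformat()
    return json.dumps(
        {
            "event_type": event_type,
            "zone": zone,
            "confidence": confidence,
            "camera_id": camera_id,
            "detected_at": detected_at,
        },
        sort_keys=True,
    )


topic = build_topic("shoplift", 5, 0.87)
payload = build_payload("shoplift", 5, 0.87, camera_id="cam-012",
                        detected_at="2026-01-15T09:30:00+00:00")
print(topic)
print(payload)
```

Subscribers that do not care about the score can use a multi-level wildcard such as event/shoplift/#, while the payload carries the exact confidence for systems that do.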
Vendor pricing and build-vs-buy math
| Option | Indicative price | Typical fit | Watch-outs |
|---|---|---|---|
| BriefCam | $50–200 / cam / mo | Enterprise, investigation + live | EU AI Act footprint, lock-in |
| Motorola / Avigilon | $30–150 / cam / mo | Public sector, large campuses | Hardware tied |
| Milestone XProtect + plugin | $200–400 / server / mo | VMS-led shops | Plugin quality varies |
| Verkada | $25–75 / cam / mo | SMB, cloud-first | Proprietary cameras |
| Eagle Eye | $10–30 / cam / mo | Cloud VMS, basic analytics | Limited advanced AD |
| iOmniscient | $40–120 / cam / mo | Transport, high-density anomaly | Fewer integrators |
| Custom build (Fora Soft) | Project cost + infra | 100+ cameras, niche anomaly types, regulator-grade audit | Time-to-first-pilot vs SaaS |
Figure 3. 2025–26 indicative vendor pricing. Negotiated enterprise contracts vary significantly; treat these as order-of-magnitude, not quotes.
Rough build-vs-buy crossover for a 100-camera deployment sits around 18–24 months: SaaS costs accumulate linearly with cameras, custom builds front-load the engineering and then run on amortised hardware + ops. Because we lean on agent-assisted engineering at Fora Soft, MVP cycles on custom anomaly detection tend to come in measurably shorter than they did two years ago — but we still price conservatively and avoid numbers we cannot defend on paper.
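The crossover arithmetic is simple enough to sketch. Here SaaS cost grows linearly with months, while the custom build pays an upfront amount and then a smaller monthly ops cost. The $100 per camera per month sits inside the range in Figure 3; the upfront and ops figures are hypothetical inputs chosen only to show the calculation, not a quote.

```python
import math


def breakeven_month(cameras, saas_per_cam_month, custom_upfront, custom_ops_month):
    """First month at which cumulative custom cost is <= cumulative SaaS cost.

    Returns None when custom ops cost is not below the SaaS bill, because
    the custom build then never catches up.
    """
    saas_month = cameras * saas_per_cam_month
    monthly_saving = saas_month - custom_ops_month
    if monthly_saving <= 0:
        return None
    return math.ceil(custom_upfront / monthly_saving)


# Hypothetical inputs for a 100-camera site.
large_site = breakeven_month(cameras=100, saas_per_cam_month=100,
                             custom_upfront=180_000, custom_ops_month=1_000)
small_site = breakeven_month(cameras=10, saas_per_cam_month=100,
                             custom_upfront=180_000, custom_ops_month=1_000)
print(large_site, small_site)
```

With these inputs the 100-camera site breaks even at month 20, inside the 18–24 month band, and the 10-camera site never does, which matches the SaaS-for-small-fleets guidance.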
Privacy, GDPR and the EU AI Act
1. GDPR Article 9. Biometric data is “special category.” Face-based anomaly detection almost always triggers Article 9 and needs an explicit legal basis (rarely consent; more often substantial public interest or employment law).
2. GDPR Article 35 (DPIA). Video anomaly detection deployments over any meaningful scale require a Data Protection Impact Assessment. Budget 2–4 weeks of DPO / legal time.
3. EU AI Act (Regulation 2024/1689). Real-time biometric categorisation in publicly accessible spaces is tightly restricted; many video-analytics products now gate EU deployments behind conformity assessment, documentation, human oversight mechanisms and post-market monitoring. This is why several vendors retrenched out of the EU in 2025.
4. UK ICO (Jan 2024) and US state laws. Live facial recognition is effectively banned for UK private operators; Illinois BIPA, Texas CUBI, California CCPA all add consent or disclosure burdens for biometric or video analytics.
5. Practical consequences. Prefer gait / posture / object-centric anomaly detection over face-based where possible. Keep frames on-site when latency allows. Log every detection with confidence, reviewer ID and disposition — that audit trail becomes the legal evidence of human oversight.
Block the deployment when: a DPIA has not been completed, human oversight is not wired into the response workflow, or the audit log cannot be produced for a regulator in under 48 hours. These are not polish — they are operating licences.
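A sketch of what the audit trail and the go/no-go gate above might look like in code. The record carries the three fields named in this section (confidence, reviewer ID, disposition); every other field name, and the example values, are assumptions for illustration rather than a regulator-approved schema.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class AuditRecord:
    event_id: str
    camera_id: str
    anomaly_class: str
    confidence: float
    reviewer_id: str
    disposition: str   # e.g. "confirmed", "dismissed", "escalated"
    detected_at: str   # ISO 8601 timestamp, UTC


def to_log_line(record):
    """Serialise one record as a single JSON line for append-only storage."""
    return json.dumps(asdict(record), sort_keys=True)


def deployment_blockers(dpia_completed, oversight_wired, audit_export_hours):
    """Return the reasons to block a deployment, per the three checks above."""
    blockers = []
    if not dpia_completed:
        blockers.append("DPIA not completed")
    if not oversight_wired:
        blockers.append("human oversight not wired into response workflow")
    if audit_export_hours >= 48:
        blockers.append("audit log cannot be produced in under 48 hours")
    return blockers


record = AuditRecord("evt-0001", "cam-012", "fence_climb", 0.93,
                     "op-17", "confirmed", "2026-01-15T09:30:00+00:00")
print(to_log_line(record))
print(deployment_blockers(dpia_completed=False, oversight_wired=True,
                          audit_export_hours=72))
```

Keeping the record immutable (frozen) and writing one JSON line per detection makes it easier to show a regulator that entries were not edited after the fact.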
Use cases that pay for themselves
1. Retail shrink reduction. Combined with trained floor staff, real-time shoplift alerts reduce shrink 20–40% in the first year of deployment in mid-size chains. See our retail video analytics deep-dive.
2. Transit safety. Platform-edge and fall detection at <200 ms enables automated train-hold or gate-close. Paris, Singapore and Seoul metro systems all deployed anomaly-detection overlays on existing CCTV between 2022 and 2025.
3. Perimeter security. Fence-climb and vehicle-intrusion detection at remote sites; 90–95% true-positive rate at <10 false alerts / 1,000 hours is achievable with current models.
4. Industrial safety (PPE, spills, behaviours). YOLO-class object detection + anomaly logic; OSHA-level reporting and 40–60% reductions in recordable incidents in deployments we have seen.
5. Elderly fall detection in care facilities. Edge inference on Jetson Nano / Orin; 95%+ sensitivity with <500 ms to staff alert. Healthcare privacy regimes (HIPAA in the US) demand on-site storage and tight data-retention policies.
Mini case: surveillance analytics shipped without launch drama
Situation. An enterprise surveillance vendor needed an anomaly-detection overlay across multi-camera sites ahead of a major platform update. The dev team had green CI, but the proposed model (Transformer-based, cloud-only) would have missed the 300 ms latency budget on the firm’s WAN. Regulator-grade audit trails were a hard requirement.
12-week plan. Swap to a hybrid architecture: quantised I3D + weakly-supervised MIL head at the edge (Jetson Orin), TimeSformer re-scorer in cloud for borderline events. ONVIF Profile M event bus for VMS integration. MQTT fan-out to on-site staff and central SOC. Drift monitoring on rolling 7-day cohorts. Formal DPIA and AI Act conformity documentation in parallel. See Netcam Studio for the class of surveillance UI we typically ship.
Outcome. Rolled to multi-site production in 12 weeks. End-to-end latency landed in the 180–220 ms band, well inside the 300 ms SLA. False-positive rate after four weeks of fine-tuning came in at <3% for the top-priority anomaly classes. Audit trail satisfied the regulator’s follow-up inspection with zero findings. The client kept the regression programme on retainer.
About to pay SaaS prices for analytics you could own?
We will price a hybrid custom stack against your shortlist in a single session — model choice, hardware, ops, and the EU AI Act paperwork included.
A decision framework — right-sizing in five questions
1. What is the latency budget? Under 200 ms forces edge; 200–500 ms allows hybrid; over a second is cloud and batch.
2. Which anomaly class are you solving? Behavioural, physical, density or temporal — each wants a different model family. Put the class in the RFP, not “Transformer.”
3. How many labelled examples do you have? <100 → start with CLIP/VLM zero-shot. 100–1,000 → weakly-supervised MIL. >10,000 → fully supervised.
4. What is the compliance regime? EU AI Act, HIPAA, GDPR, state biometric laws — each shapes where frames can live and what metadata you must log.
5. Build, buy or hybrid? <20 cameras or <18-month horizon → SaaS. >100 cameras, niche anomaly classes, regulator-grade audit → custom, possibly wrapped around an existing VMS.
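Questions 3 and 5 are threshold rules, so they can be written down directly. The sketch below encodes only the cut-offs stated above; the ranges the framework leaves open (1,000–10,000 labelled examples, and fleets between the SaaS and custom cases) are returned as explicit "not covered" values rather than guessed.

```python
def labelling_strategy(labelled_examples):
    """Map the number of labelled examples to a starting approach."""
    if labelled_examples < 100:
        return "clip_vlm_zero_shot"
    if labelled_examples <= 1_000:
        return "weakly_supervised_mil"
    if labelled_examples > 10_000:
        return "fully_supervised"
    # The framework above gives no rule for 1,000-10,000 examples.
    return "not_covered_by_framework"


def build_or_buy(cameras, horizon_months, needs_regulator_audit=False):
    """Apply the SaaS vs custom cut-offs from question 5."""
    if cameras < 20 or horizon_months < 18:
        return "saas"
    if cameras > 100 or needs_regulator_audit:
        return "custom"
    # Mid-size fleets without audit requirements are a judgment call.
    return "case_by_case"


print(labelling_strategy(40), labelling_strategy(600), labelling_strategy(5_000))
print(build_or_buy(12, 36), build_or_buy(250, 36), build_or_buy(60, 36))
```

Treating a regulator-grade audit requirement as sufficient on its own to push toward custom is an interpretation in this sketch; the text lists it alongside camera count and niche anomaly classes without ranking them.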
Five pitfalls we keep seeing in anomaly-detection programmes
1. Chasing benchmark accuracy instead of site accuracy. A 97% UCF-Crime model rarely survives first contact with your cameras. Budget for 2–6 weeks of site data collection and fine-tuning.
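2. Ignoring false positives. A 1% false-positive rate scored per frame at 30 fps on 100 cameras flags about 2.6 million frames per day; even scored once per second per camera, it is still tens of thousands of alerts per day. Ensembles, confidence thresholds, and human-in-the-loop review are not optional.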
3. Under-engineering drift detection. Seasonality, time-of-day and site changes move the distribution silently. Monthly retraining and rolling-window drift checks should be in the runbook on day one.
4. Assuming cloud-only works for real-time. WAN round-trip + GPU inference + alert fan-out rarely fits under 300 ms in production. Edge-first is usually the right call.
5. Treating the AI Act and DPIA as after-the-fact paperwork. They are engineering requirements. Map them to data flows, logs and model-card artefacts from week one.
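The false-positive pitfall is easiest to appreciate as arithmetic. This sketch multiplies cameras, decisions per second and false-positive rate over a day; the decision rates (every frame versus once per second) are assumptions about how the detector is scored, not properties of any particular model.

```python
SECONDS_PER_DAY = 86_400


def false_positives_per_day(cameras, decisions_per_second, fp_rate):
    """Expected false positives per day across the whole fleet."""
    return cameras * decisions_per_second * SECONDS_PER_DAY * fp_rate


# Scored on every frame at 30 fps.
per_frame = false_positives_per_day(cameras=100, decisions_per_second=30, fp_rate=0.01)
# Scored once per second per camera (e.g. one decision per short clip).
per_second = false_positives_per_day(cameras=100, decisions_per_second=1, fp_rate=0.01)

print(round(per_frame), round(per_second))
```

Either way the raw volume is far beyond what a review team can handle, which is why thresholds, temporal aggregation and human-in-the-loop triage belong in the first design, not the second.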
KPIs that prove the system is working
1. Quality KPIs. Frame-level AP per class ≥ 0.85 on site data; recall on top-priority classes ≥ 0.9; false-positive rate ≤ 5 per 1,000 hours on reviewed alerts; drift ΔAUC < 2 points week-on-week.
2. Business KPIs. Time-to-intervention for flagged events; documented incidents prevented; shrink / loss reduction versus baseline; customer or occupant NPS on safety perception.
3. Reliability KPIs. p95 end-to-end latency against the target budget; model-registry uptime; edge-box crash-free days; alert-delivery SLO ≥ 99.9%.
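Two of these KPIs are easy to compute from raw logs, and a small sketch helps pin down the definitions. The p95 here uses the nearest-rank method (other percentile definitions interpolate and give slightly different values), and the latency samples and alert counts are hypothetical.

```python
import math


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(math.ceil(pct / 100 * len(ordered)), 1)
    return ordered[rank - 1]


def false_alerts_per_1000_hours(false_alerts, camera_hours):
    """Normalise a false-alert count to the per-1,000-hours unit used above."""
    return false_alerts / camera_hours * 1000


# Hypothetical end-to-end latencies in milliseconds, with one slow outlier.
latencies_ms = [180, 190, 200, 210, 220, 185, 195, 205, 215, 400]

p50 = percentile_nearest_rank(latencies_ms, 50)
p95 = percentile_nearest_rank(latencies_ms, 95)
print(p50, p95, false_alerts_per_1000_hours(12, 2_400))
```

The outlier barely moves the median but dominates the p95, which is why the reliability KPI above is stated against p95 rather than the average.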
When NOT to do real-time anomaly detection
- No one will act on the alerts. Without a response team, alerts pile up and credibility collapses. Fix staffing first.
- Latency budget is minutes, not milliseconds. Offline or batch analytics is far cheaper and often more accurate.
- Ambiguous anomaly class. If you cannot name the behaviour in one sentence, you will not be able to label it, either.
- Camera quality is the bottleneck. 360p dome cameras and fish-eye lenses cap accuracy below what any model will deliver.
- Compliance will veto it. If the use case violates biometric consent or local AI Act rules, invest in non-AI mitigations instead.
FAQ
What latency is realistic for real-time anomaly detection?
End-to-end budgets we hit in production: 150–250 ms on edge (Jetson Orin / Hailo-8) for I3D / MIL-class models; 300–500 ms for hybrid edge+cloud; 500 ms+ for cloud-only Transformers. Anything claimed <100 ms usually means camera-to-alert on a heavily quantised light model.
Do I need fully labelled data?
Usually no. Weakly-supervised methods (MIL, RTFM, MGFN) learn from video-level labels and are within 2–4 points of fully-supervised AUC on UCF-Crime. CLIP/VLM detectors can bootstrap zero-shot from text prompts, or few-shot from 1–2 examples. Full frame-level labelling is mostly needed only on regulated paths.
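To show what "learning from video-level labels" means in practice, here is a sketch of the classic MIL ranking objective introduced with the UCF-Crime dataset: the top-scoring segment of an anomalous video should outrank the top-scoring segment of a normal video by a margin, with smoothness and sparsity terms on the anomalous video. It is not the RTFM or MGFN loss, which differ in detail, and the segment scores and weights below are illustrative assumptions.

```python
def mil_ranking_loss(anomalous_bag, normal_bag, lam_smooth=8e-5, lam_sparse=8e-5):
    """Hinge ranking loss over two bags of per-segment anomaly scores in [0, 1].

    Only video-level labels are needed: one bag comes from a video labelled
    anomalous, the other from a video labelled normal.
    """
    hinge = max(0.0, 1.0 - max(anomalous_bag) + max(normal_bag))
    # Temporal smoothness: neighbouring segments should score similarly.
    smooth = sum((a - b) ** 2 for a, b in zip(anomalous_bag, anomalous_bag[1:]))
    # Sparsity: anomalies should occupy only a few segments.
    sparse = sum(anomalous_bag)
    return hinge + lam_smooth * smooth + lam_sparse * sparse


# Hypothetical segment scores from a well-separated and a confused model.
good = mil_ranking_loss([0.1, 0.1, 0.95, 0.1], [0.05, 0.1, 0.08, 0.02])
poor = mil_ranking_loss([0.2, 0.3, 0.25, 0.2], [0.3, 0.4, 0.2, 0.3])
print(round(good, 4), round(poor, 4))
```

The loss never needs to know which segment contains the anomaly, which is why the labelling effort drops from frame-level to one label per video.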
Edge, cloud or hybrid?
Hybrid by default. Edge for latency and privacy, cloud for model registry, drift, re-scoring borderline alerts and audit storage. Pure cloud is fine for forensic analytics; pure edge is fine for small, regulated deployments. Outside those two cases, a single-tier deployment is usually a cost or SLA mistake.
How much does it cost to build vs buy?
SaaS runs $30–200 / camera / month depending on vendor. A custom build front-loads engineering + hardware and amortises at roughly 18–24 months for a 100-camera deployment. Above that scale, or for niche anomaly classes and regulator-grade audit, custom is almost always cheaper over a 3-year horizon.
Is this compatible with the EU AI Act?
Yes, with work. Real-time biometric-adjacent analytics is high-risk under Annex III. That means conformity assessment, documented risk management, human oversight, logging and post-market monitoring. We treat those as engineering artefacts (model cards, data-flow diagrams, audit logs) rather than legal boilerplate.
What is ONVIF Profile M and do I need it?
Profile M is the ONVIF analytics and metadata profile — a standard way for detectors to publish events and metadata to VMS (Milestone, Genetec, Avigilon and others). If your detector does not speak it, integration with enterprise surveillance platforms becomes custom work. See our Profile M primer.
How do I manage false positives at scale?
Three levers: higher confidence thresholds (trade recall for precision on non-critical classes), ensemble models voting, and human-in-the-loop review on top-N daily alerts with feedback flowing to retraining. Expect 4–8 weeks of tuning before alert fatigue drops to operationally acceptable levels.
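The three levers can be sketched in a few lines: a per-class confidence threshold, a vote across ensemble members, and a top-N cut for human review. The model scores, thresholds and vote counts are hypothetical; in practice each would be tuned on reviewed site data.

```python
def ensemble_alert(model_scores, threshold, min_votes):
    """Raise an alert only if enough ensemble members clear the threshold."""
    votes = sum(1 for score in model_scores if score >= threshold)
    return votes >= min_votes


def top_n_for_review(alerts, n):
    """Pick the n highest-scoring alerts for human-in-the-loop review."""
    return sorted(alerts, key=lambda alert: alert["score"], reverse=True)[:n]


# Hypothetical scores from three models for one candidate event.
scores = [0.92, 0.81, 0.55]
strict = ensemble_alert(scores, threshold=0.90, min_votes=2)   # non-critical class
lenient = ensemble_alert(scores, threshold=0.80, min_votes=2)  # critical class

daily_alerts = [
    {"id": "a1", "score": 0.71},
    {"id": "a2", "score": 0.96},
    {"id": "a3", "score": 0.88},
]
review_queue = [alert["id"] for alert in top_n_for_review(daily_alerts, 2)]
print(strict, lenient, review_queue)
```

Reviewer dispositions on the queued alerts are what feed back into retraining, so the review step doubles as a labelling pipeline.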
How often should I retrain?
Monthly incremental retraining on newly labelled cohorts; full retraining every quarter on the last 6–12 months of data. More frequently in seasonal environments (retail, transit, outdoors). Instrument drift detection (PSI, KS-test on feature distributions) and let it trigger retraining automatically rather than running on a calendar.
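For the drift trigger, a Population Stability Index check is one of the simpler options to instrument. The sketch below bins a baseline feature distribution, compares a recent window against it, and flags retraining above a threshold. The 0.2 cut-off is a commonly used rule of thumb treated here as an assumption, and the feature values are synthetic.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index of `actual` against the `expected` baseline.

    Bin edges come from the baseline range; values outside it are clipped
    into the first or last bin. eps avoids log(0) for empty bins.
    """
    low, high = min(expected), max(expected)
    width = (high - low) / bins or 1.0  # guard against a constant baseline

    def proportions(values):
        counts = [0] * bins
        for value in values:
            index = int((value - low) / width)
            counts[min(max(index, 0), bins - 1)] += 1
        return [max(count / len(values), eps) for count in counts]

    base, recent = proportions(expected), proportions(actual)
    return sum((r - b) * math.log(r / b) for b, r in zip(base, recent))


def needs_retraining(expected, actual, threshold=0.2):
    return psi(expected, actual) > threshold


# Synthetic feature values: a uniform baseline and a shifted recent window.
baseline = [i / 100 for i in range(100)]
shifted = [value + 0.3 for value in baseline]

print(psi(baseline, baseline), needs_retraining(baseline, shifted))
```

Running this per feature on a rolling window gives the automatic trigger described above, with the calendar-based retrain kept as a fallback.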
What to Read Next
- Real-time analytics: Real-time video analytics in 2026. Latency budgets and network design for video analytics at scale.
- Best practices: Real-time video processing with AI — best practices. The engineering patterns behind production-grade AI video pipelines.
- Retail: Retail video analytics — AI-powered store intelligence. Shrink reduction, queue analytics and the ROI math behind retail AI.
- Standards: ONVIF profiles in security systems. Why Profile M matters for any analytics that must plug into enterprise VMS.
- Datasets & methods: Real-world anomaly detection in surveillance videos. How weakly-supervised methods changed the playing field.
Ready to ship real-time anomaly detection that actually scales?
The 2026 version of this problem is less about “can ML do it” and more about picking the right combination of model family, deployment topology, integration protocol and compliance posture for your cameras, your latency SLA and your regulator. The math has changed in favour of hybrid edge+cloud stacks; the compliance bar has risen; the model landscape is genuinely better than it was 24 months ago.
If you read the latency-budget row first, picked the anomaly class honestly, wired ONVIF Profile M in early, kept frames on site where the regulator demanded it, and instrumented drift from day one — you have most of what it takes. The rest is taste and iteration.
If that sounds like the next 12 weeks of your roadmap, we are happy to help — whether that is one review session, a discovery week, or running the build.
Want a concrete anomaly-detection plan mapped to your cameras?
30 minutes with a senior Fora Soft engineer — we will sketch the architecture, pick the model family and hand you a 12-week delivery plan.

