Published 2026-05-23 · 34 min read · By Nikolay Sapunov, CEO at Fora Soft
Why This Matters
If you have read the 2026 anomaly detection playbook, you already know the field is a layered stack rather than a single model. This article is the depth pass on each layer's algorithms. It is written for the engineer who has to choose between RTFM and MGFN for a layer-2 classifier and cannot tell from the marketing copy which one will hit the latency budget on a Hailo-8, the product manager who has to defend a CLIP-based architecture against a board that wants "the LLM that does everything", and the founder who is staring at twelve algorithm names in a vendor pitch deck and needs to know which three to ask follow-up questions about. By the end you will know the four-number sieve that picks the right algorithm for any real product, the failure mode of each algorithm in production, and the cost of each at the scale you actually deploy at. The article assumes you have read lesson 1.2 on latency and topology and lesson 1.4 on AI cost in video products, because every choice in this article reduces to a millisecond number or a dollar number.
The Four-Number Sieve — How To Pick Before You Read
The honest way to read this article is non-linear. Before you weigh any algorithm, write down four numbers about your product. They are the sieve that turns twelve candidates into two.
The first number is the labelled-example count for the anomaly class you care about. Zero examples sends you to one of the unsupervised families (autoencoders, VAE, Isolation Forest, one-class SVM). Two to one hundred examples open the CLIP-based and few-shot families (VadCLIP, AnomalyRuler). One hundred to one thousand examples is the sweet spot for weakly-supervised MIL on top of CLIP features. Ten thousand or more examples is fully-supervised territory, and on production video that count is rare enough to be noteworthy.
The second number is the end-to-end latency budget in milliseconds from event to alert. Under 200 ms forces edge inference and rules out anything that needs a cloud round-trip. 200 to 500 ms permits a thin edge-plus-cloud topology. Above 500 ms, cloud inference is fine. Above ten seconds you have a forensic problem, not a real-time one, and the frontier VLMs become economical.
The third number is the per-camera hardware budget. Under $250 per camera means a Hailo-8 module or a Coral Edge TPU and rules out anything you cannot quantise to INT8. $250 to $1,000 per camera buys a Jetson Orin Nano Super or an Orin NX and opens the full deep temporal model family. Above $1,000 per camera is a cloud GPU shared across many cameras, and the algorithm choice is decoupled from the camera count.
The fourth number is the explanation requirement. If the alert can fire silently into a queue for a human reviewer, you save the cost of layer 4 entirely. If the alert has to be accompanied by a natural-language explanation that survives review by a regulator or a court, you have to wire a VLM into the stack — Holmes-VAD on-device for routine alerts, a frontier VLM in the cloud for the borderline 1%.
Run those four numbers and the field of twelve shrinks to a shortlist of two or three. The rest of this article is the depth pass that decides between the remaining candidates.
Figure 1. The four numbers — labels, latency, hardware, explanation — that decide which algorithm family you reach for.
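The sieve can be written down as a small lookup. The sketch below is one hypothetical encoding of the bands stated above; the class name, function name, and output labels are illustrative and not part of any library. The text leaves the 1,000 to 10,000 label range unassigned, so the sketch assumes it stays with the MIL family.

```python
from dataclasses import dataclass


@dataclass
class ProductNumbers:
    labelled_examples: int          # labelled clips of the target anomaly class
    latency_budget_ms: float        # event-to-alert budget
    hardware_usd_per_camera: float  # per-camera hardware budget
    needs_explanation: bool         # must alerts carry a natural-language explanation?


def sieve(p: ProductNumbers) -> dict:
    """Map the four numbers to the bands described in the text (illustrative only)."""
    if p.labelled_examples == 0:
        family = "unsupervised"
    elif p.labelled_examples <= 100:
        family = "clip-few-shot"
    elif p.labelled_examples < 10_000:
        # Assumption: 1,000 to 10,000 labels is kept in the MIL band.
        family = "weakly-supervised-mil"
    else:
        family = "fully-supervised"

    if p.latency_budget_ms < 200:
        topology = "edge"
    elif p.latency_budget_ms <= 500:
        topology = "edge-plus-cloud"
    elif p.latency_budget_ms <= 10_000:
        topology = "cloud"
    else:
        topology = "forensic"

    if p.hardware_usd_per_camera < 250:
        hardware = "int8-accelerator"      # Hailo-8 or Coral Edge TPU class
    elif p.hardware_usd_per_camera <= 1_000:
        hardware = "jetson-orin"           # Orin Nano Super or Orin NX class
    else:
        hardware = "shared-cloud-gpu"

    return {
        "family": family,
        "topology": topology,
        "hardware": hardware,
        "explanation_layer": p.needs_explanation,
    }


# Example: no labels, 150 ms budget, $100 per camera, silent alerts.
bank_back_room = sieve(ProductNumbers(0, 150, 100, False))
```

Writing the bands as code forces the boundary cases (exactly 200 ms, exactly $250) to be decided explicitly, which is usually where a shortlist argument starts.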
Algorithm 1 — One-Class SVM (The Classical Baseline You Will Underestimate)
One-Class SVM (OCSVM) was introduced by Schölkopf et al. in 2001 as a way to characterise the support of a high-dimensional distribution from samples drawn from it. Translated into anomaly-detection language: you train it on examples of normal behaviour, and it learns a boundary in feature space. At inference, anything that falls outside the boundary is flagged as anomalous. There is no need for examples of the anomaly class itself — and that property is the only reason it still ships in 2026, because every modern alternative is more accurate when labels exist.
The math is a quadratic program that finds the maximum-margin hyperplane separating the training data from the origin in a kernel-induced feature space. The nu parameter is an upper bound on the fraction of training points treated as outliers and a lower bound on the fraction of support vectors. With an RBF kernel and nu set to 0.1 (scikit-learn's own default is 0.5), the model assumes that up to 10% of the training data are outliers and adjusts the boundary accordingly. Training complexity is O(n²) to O(n³) in the number of samples, which limits the practical training set to perhaps 50,000 to 100,000 frames before training time becomes a problem. Inference is O(n_sv), where n_sv is the number of support vectors. Because nu lower-bounds the support-vector fraction, n_sv grows with the training set, so a small nu and a subsampled training set are what keep n_sv in the hundreds and inference inside a real-time budget.
The reason OCSVM still earns its place in 2026 is the long tail of "is anything moving in this normally-empty space?" use cases. A back-of-house corridor at 03:00. A locked warehouse aisle outside business hours. A server room where any human presence is anomalous by definition. For these cases, you have hundreds of hours of "normal" (empty room) footage and zero examples of the anomaly class (a burglar). An OCSVM trained on per-frame features extracted by a pretrained ResNet-50 or a small ViT will fire reliably when a person enters the frame, with no anomaly examples and no fine-tuning. The OCSVM itself is cheap to train and cheap to score on a CPU; in practice the feature extractor in front of it, not the SVM, sets the achievable frame rate.
The production failure mode is the curse of dimensionality. As your feature vector grows past a few hundred dimensions, the distance metric the kernel relies on loses discrimination, and the boundary becomes either too loose (flagging everything) or too tight (flagging nothing). The fix is to feed OCSVM a low-dimensional embedding — a 64- or 128-dim projection of a pretrained encoder, not the raw 2048-dim ResNet output. The other failure mode is concept drift: an OCSVM trained on "normal" in spring will start flagging "normal" in autumn when the leaves change colour and the lighting shifts. Plan for retraining every quarter, or run a drift detector on top of it.
A worked example. A 16-camera deployment guarding a chain of bank-branch back rooms. Hours of normal footage, zero examples of intrusion. Extract a 128-dim embedding per frame with a quantised MobileNet ViT on a Raspberry Pi 5 ($80) co-located with each camera. Train an RBF-kernel OCSVM with nu=0.05, gamma=0.1 on 30 days of normal footage per camera. Inference latency is roughly 8 ms per frame on the Pi 5; alert is published over MQTT to the on-call mobile app. Total per-camera capex: roughly $100, including the Pi and the PoE splitter. The system catches every intrusion in pilot testing without ever seeing an intrusion in training. Above twenty cameras you would graduate to Isolation Forest for the linear-time training; below twenty, OCSVM is the simplest correct answer.
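A minimal sketch of the train-on-normal-only pattern, using scikit-learn's OneClassSVM. The embeddings here are synthetic Gaussian stand-ins for the 128-dim per-frame features, not real encoder output, and gamma="scale" is an assumption made for standardised inputs; the right gamma for real embeddings has to be tuned on site data.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Synthetic stand-ins for 128-dim embeddings (assumption, not real footage).
normal_train = rng.normal(0.0, 1.0, size=(2000, 128))  # empty-room frames
normal_test = rng.normal(0.0, 1.0, size=(200, 128))    # held-out empty-room frames
intrusion = rng.normal(3.0, 1.0, size=(50, 128))       # frames with a person present

# Standardise, then fit the one-class boundary on normal frames only.
model = make_pipeline(
    StandardScaler(),
    OneClassSVM(kernel="rbf", nu=0.05, gamma="scale"),
)
model.fit(normal_train)

# predict() returns +1 for inliers and -1 for outliers.
intrusion_flag_rate = float(np.mean(model.predict(intrusion) == -1))
false_alarm_rate = float(np.mean(model.predict(normal_test) == -1))
print(f"intrusions flagged: {intrusion_flag_rate:.2f}, false alarms: {false_alarm_rate:.2f}")
```

The held-out false-alarm rate is worth printing alongside the detection rate: in a high-dimensional feature space it can sit well above nu, which is the curse-of-dimensionality failure mode described earlier showing up in miniature.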
Algorithm 2 — Isolation Forest (The Default For Signal-Level Anomalies)
Isolation Forest (iForest) was introduced by Liu, Ting, and Zhou in 2008, and it is one of the few anomaly detection algorithms that scales gracefully. The intuition is the opposite of every other approach in this article. Instead of learning what "normal" looks like and flagging deviations, iForest tries to isolate each data point by recursively partitioning the feature space with random binary splits. Anomalies — being statistically rare and feature-wise different — get isolated in fewer splits than normal points, because there are fewer neighbouring points "shielding" them. The anomaly score is the average path length across an ensemble of random trees, normalised by the expected path length for the dataset size.
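The normalisation can be made concrete. The sketch below follows the score definition from the Liu, Ting, and Zhou paper, s(x, n) = 2^(-E[h(x)] / c(n)), with the harmonic number approximated by ln(i) plus the Euler-Mascheroni constant; the function names are illustrative.

```python
import math

EULER_GAMMA = 0.5772156649015329


def expected_path_length(n: int) -> float:
    """c(n): average path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(mean_path_length: float, n: int) -> float:
    """Score in (0, 1]; values near 1 are anomalous, values well below 0.5 are normal."""
    return 2.0 ** (-mean_path_length / expected_path_length(n))


n = 256  # subsample size per tree
short_path_score = anomaly_score(2.0, n)    # isolated after about two splits
long_path_score = anomaly_score(20.0, n)    # buried deep in the tree
```

A point whose average path length equals c(n) scores exactly 0.5. Note that scikit-learn's score_samples returns the negative of this quantity, so lower means more anomalous there.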
Training cost is O(t · ψ log ψ) for t trees each built on a subsample of ψ points, and scoring one point costs O(log ψ) per tree; see the scikit-learn documentation for the canonical implementation. Because ψ is fixed (scikit-learn's max_samples default of auto uses at most 256 points per tree), training cost barely grows with dataset size and scoring the whole dataset is linear in it, which makes iForest comfortable on tens of millions of data points. It does not need normalised features, it does not need anomaly examples, and the hyperparameter surface is essentially one knob (contamination, the assumed fraction of outliers in the training data, with a default of auto).
The production fit is not raw frames but engineered features over the pipeline metadata. Per-frame mean pixel value to catch a frozen feed. Per-frame entropy to catch a black screen. Encoder bitrate to catch a stuck encoder. Inter-frame difference magnitude to catch a stuck camera. Model confidence histogram to catch a model that has collapsed to predicting one class. Dropped-packet rate on the WebRTC transport to catch a network issue. Feed those metrics — a vector of perhaps 20 to 60 numbers per frame, aggregated over a sliding 1-minute window — into iForest, and you get layer 5 of the five-layer stack from the 2026 playbook. This is the layer that catches the camera silently frozen on a still image of the parking lot at sunset.
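A minimal sketch of that pattern with scikit-learn's IsolationForest. The four feature columns and their healthy ranges are invented for illustration; a real deployment would use its own 20 to 60 windowed metrics.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
n_windows = 5000

# Hypothetical per-window features for a healthy camera:
# mean luma, frame entropy, encoder bitrate (kbps), inter-frame difference.
healthy = np.column_stack([
    rng.normal(110.0, 8.0, n_windows),
    rng.normal(7.0, 0.3, n_windows),
    rng.normal(2500.0, 200.0, n_windows),
    rng.normal(4.0, 1.0, n_windows),
])

forest = IsolationForest(n_estimators=200, contamination="auto", random_state=0)
forest.fit(healthy)

# A frozen feed: the picture looks plausible, but bitrate and motion have collapsed.
frozen_feed = np.array([[110.0, 7.0, 150.0, 0.0]])
typical = np.array([[110.0, 7.0, 2500.0, 4.0]])

# score_samples: lower means more anomalous.
frozen_score = float(forest.score_samples(frozen_feed)[0])
typical_score = float(forest.score_samples(typical)[0])
```

The frozen window keeps a normal-looking luma and entropy, which is exactly why a vision-layer model does not notice it; the signal is in the encoder and motion columns.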
The failure mode in production is the same failure mode every model has on time-series data: features that drift slowly are not flagged until the drift completes. A camera whose focus degrades over six months will not be flagged by iForest if you train on the last 30 days of its own data, because every individual day looks normal relative to the day before. The fix is to anchor the training data to a fixed "golden" baseline (the first month after install) and run periodic absolute comparisons against it, not just rolling comparisons against recent history.
A worked example. A 500-camera retail surveillance fleet. Engineered feature vector of 32 numbers per camera per minute (compression ratio, mean luma, frame freeze count, motion energy, model confidence histogram, alert rate, etc.). One iForest per camera, fitted on that camera's 30-day golden-baseline window, with each night's data scored against it. Inference cost: trivial; the whole fleet's iForests run in seconds on a single CPU core. Caught failure modes in the first quarter: 7 frozen feeds, 14 cameras that drifted out of focus, 3 cameras that someone had rotated 90 degrees during cleaning, 2 cases of a model collapsing on a single class after a botched deploy. None of those were caught by the vision-layer stack on top.
Algorithm 3 — Local Outlier Factor (When Density Varies Across Your Feature Space)
Local Outlier Factor (LOF) was introduced by Breunig, Kriegel, Ng, and Sander in 2000. It is iForest's older cousin and the right call when the feature space has regions of genuinely different density that you want the model to respect. LOF computes, for each point, the ratio of the average local density of its k nearest neighbours to the point's own local density; a point in a low-density region surrounded by higher-density neighbours scores well above 1 and is flagged anomalous.
The complexity is the headline issue. LOF needs pairwise distances to find nearest neighbours, which is O(n²) in the naive implementation. The scikit-learn implementation accelerates this with BallTree or KDTree to roughly O(n log n) on low-dimensional data, but the constant factor is still much larger than iForest's, and the tree-based neighbour search degrades back toward O(n²) on high-dimensional data. As a rule of thumb, prefer iForest for datasets above 100,000 samples, and reach for LOF only when the local-density intuition genuinely matches the data shape.
The production fit is narrow. The classic case is geographic anomaly detection — a vehicle stopping in the middle of an unusual road segment, where "the middle of a road segment" has different normal density depending on whether you are on a motorway or a city street. The LOF intuition — "this point is anomalous because its local neighbourhood is unusually sparse" — generalises better than the global threshold iForest produces. For most video pipeline metadata, iForest's global perspective is fine, and LOF is overkill. We mention it in the production-twelve because there are specialist applications (urban analytics, mobility data, sparse sensor networks) where LOF is genuinely the right tool.
Failure mode: tuning. The n_neighbors parameter is the single most important hyperparameter and the one with no universally good value. Too low and noise dominates; too high and the local-density intuition collapses to a global one and you might as well use iForest. The scikit-learn user guide notes that n_neighbors = 20 appears to work well in general; treat that as a starting point and sweep roughly 10 to 50 against a validation set.
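A sketch of that sweep with scikit-learn's LocalOutlierFactor. The data is synthetic: a dense cluster, a sparse cluster, and three hand-placed points that are odd relative to the dense cluster. The labels exist only because the data is simulated; on real data they would come from a small validation set.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# Two regions with very different normal density (synthetic stand-ins).
dense = rng.normal(loc=[0.0, 0.0], scale=0.2, size=(400, 2))
sparse = rng.normal(loc=[6.0, 6.0], scale=1.5, size=(100, 2))
# Points that are close to the dense cluster in absolute terms but locally isolated.
odd = np.array([[0.0, 1.5], [1.5, 0.0], [-1.5, -0.5]])

X = np.vstack([dense, sparse, odd])
y = np.r_[np.zeros(len(dense) + len(sparse)), np.ones(len(odd))]

auc_by_k = {}
for k in (10, 20, 30, 40, 50):
    lof = LocalOutlierFactor(n_neighbors=k)
    lof.fit(X)
    # negative_outlier_factor_ is the negated LOF, so flip the sign for scoring.
    auc_by_k[k] = roc_auc_score(y, -lof.negative_outlier_factor_)

best_k = max(auc_by_k, key=auc_by_k.get)
print(auc_by_k, "best k:", best_k)
```

The three odd points sit closer to the dense cluster than most sparse-cluster points sit to each other, which is the case a single global distance threshold handles badly and LOF handles well.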
Algorithm 4 — Autoencoder Reconstruction (The Unsupervised Default For Raw Video)
The vanilla autoencoder approach is the simplest deep-learning method that works on video. Train an encoder–decoder convolutional network to compress a frame (or a short clip) to a low-dimensional latent vector and reconstruct it from that latent. Train only on normal frames. At inference, flag any frame whose reconstruction error (typically mean squared error on pixel values, or perceptual loss on a deep feature space) exceeds a learned threshold. The intuition is that the network learns to reconstruct the patterns it has seen, and struggles on patterns it has not.
The first wave of papers (Hasan et al. 2016, Chong and Tay 2017, Luo et al. 2017 with ConvLSTM-AE) established the family on the Avenue and ShanghaiTech datasets and showed that even a small autoencoder beats the classical statistical baselines on raw video by 5 to 15 AUC points. The second wave (Liu et al. 2018; Gong et al. 2019, MemAE; Park et al. 2020) refined the architecture to address the dominant failure mode of vanilla autoencoders: the network generalises too well and learns to reconstruct anomalies passably, eroding the gap that should fire the alert.
In production, the vanilla autoencoder is the right unsupervised default. It runs at 60+ fps at 224×224 on a Jetson Orin Nano Super, it needs zero anomaly labels, and the threshold tuning is a one-line operation on a held-out validation set. Where it falls down is the long-tail behavioural anomaly — a person walking normally is reconstructed fine, but so is a person fighting, because the encoder generalises across "person doing something". For those cases you graduate to ST-AE or MemAE.
The failure mode you have to plan for is threshold drift. The reconstruction error distribution shifts with lighting, weather, camera focus, scene activity, and seasonal changes. Set a static threshold once and it will be either useless (zero alerts) or unusable (hundreds of false alerts) within two months. The fix is a rolling per-camera threshold derived from a percentile of the recent reconstruction-error distribution (typically the 99.5th percentile over the last 7 days) plus a "novelty" check that detects when the entire distribution has shifted and triggers retraining.
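A sketch of the rolling threshold and a simple shift check, using NumPy only. The reconstruction errors are simulated, the window is counted in frames instead of days to keep the example small, and the median-versus-MAD shift test is one assumed way to implement the "novelty" check, not a prescribed one.

```python
import numpy as np


def rolling_threshold(errors: np.ndarray, window: int, percentile: float = 99.5) -> np.ndarray:
    """Threshold at time t is a percentile of the previous `window` errors (t itself excluded)."""
    thresholds = np.full(errors.shape, np.inf)  # no alerts until the window has filled
    for t in range(window, len(errors)):
        thresholds[t] = np.percentile(errors[t - window:t], percentile)
    return thresholds


def distribution_shifted(recent: np.ndarray, baseline: np.ndarray, tol: float = 3.0) -> bool:
    """Assumed drift test: has the median moved by more than `tol` baseline MADs?"""
    base_median = np.median(baseline)
    mad = np.median(np.abs(baseline - base_median))
    return bool(abs(np.median(recent) - base_median) > tol * mad)


rng = np.random.default_rng(0)
errors = rng.normal(0.02, 0.002, size=2000)  # simulated per-frame reconstruction error
errors[1500] = 0.08                          # one simulated anomalous frame

thresholds = rolling_threshold(errors, window=1000)
alerts = np.flatnonzero(errors > thresholds)

baseline = errors[:1000]
stable = distribution_shifted(errors[1000:1500], baseline)
drifted = distribution_shifted(baseline + 0.01, baseline)  # e.g. a lighting change
```

By construction a 99.5th-percentile threshold lets through roughly 0.5% of normal frames, so the alert list contains a handful of false positives alongside the simulated anomaly; downstream temporal smoothing or a minimum-duration rule is what removes them.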
Algorithm 5 — ST-AE And MemAE (Specialised Video Autoencoders)
Spatial-Temporal Autoencoders (ST-AE) extend the vanilla autoencoder with explicit temporal modelling. Instead of compressing one frame at a time, the encoder ingests a short clip (typically 8 to 16 frames) and the decoder reconstructs it as a whole. The temporal axis can be modelled with a 3D convolution, a ConvLSTM, or a small transformer block. The benefit on video anomaly detection is real: behavioural anomalies that look "normal frame-by-frame but wrong in motion" — a person running where running is anomalous, a vehicle driving the wrong direction in a one-way zone — are caught by the temporal axis where they are missed by per-frame autoencoders.
MemAE (Gong, Liu et al. 2019, ICCV) adds an explicit memory module between the encoder and the decoder. The encoder produces a latent; the latent is used to query a memory of K prototype vectors (typically K = 2000), the closest prototypes are aggregated, and the aggregate is what the decoder reconstructs from. The effect is that the network can only reconstruct what is close to something it has seen — anomalies that produce a latent far from any prototype get reconstructed badly, and the alert fires. The published numbers on Avenue and ShanghaiTech showed roughly 2 to 3 AUC points improvement over vanilla autoencoders, with no labelling cost.
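The memory read can be sketched in a few lines of NumPy. This follows the addressing scheme described in the MemAE paper (softmax over cosine similarity, hard shrinkage of small weights, L1 re-normalisation) but uses a toy memory of four prototypes; the function name and the toy values are illustrative.

```python
import numpy as np


def memory_read(z: np.ndarray, memory: np.ndarray, shrink_threshold: float, eps: float = 1e-12):
    """Return the memory-based latent and the sparse addressing weights for query z."""
    z_unit = z / (np.linalg.norm(z) + eps)
    m_unit = memory / (np.linalg.norm(memory, axis=1, keepdims=True) + eps)
    similarity = m_unit @ z_unit                       # cosine similarity to each prototype

    weights = np.exp(similarity - similarity.max())    # softmax, numerically stable
    weights /= weights.sum()

    # Hard shrinkage: weights at or below the threshold are zeroed, the rest kept.
    shifted = weights - shrink_threshold
    sparse = np.maximum(shifted, 0.0) * weights / (np.abs(shifted) + eps)
    sparse /= sparse.sum() + eps                       # L1 re-normalisation

    return sparse @ memory, sparse


# Toy memory: 4 prototypes in a 4-dim latent space (a real model uses on the order of 2000).
memory = np.eye(4)
z = np.array([1.0, 0.1, 0.0, 0.0])                     # latent close to prototype 0
z_hat, w = memory_read(z, memory, shrink_threshold=1.0 / len(memory))
```

The decoder only ever sees z_hat, a combination of stored prototypes, so a latent that is far from every prototype is replaced by something that reconstructs as normal, and the reconstruction error rises.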
In production, both ST-AE and MemAE are the right choice for "we have hundreds of hours of normal footage and zero anomaly examples and the anomaly is temporal". The cost is the model size (ST-AE with 3D convs is 5–10× heavier than a per-frame autoencoder) and the latency (a 16-frame ingest window adds half a second of context — fine for cloud, marginal for sub-200 ms edge). Both quantise to INT8 cleanly on Jetson hardware, but the inference latency on a Jetson Orin Nano Super for a typical ST-AE clip lands at 80 to 150 ms, which is the upper edge of the under-200 ms band.
The failure mode is overfit-to-recent. If you retrain ST-AE or MemAE weekly on the last 7 days of footage, it will quickly learn that current scene patterns are normal and lose the ability to flag changes. The fix is to train once on a multi-month "golden" window per camera and only retrain when a drift detector says the underlying distribution has shifted meaningfully.
Algorithm 6 — Variational Autoencoders And GAN-Based Detectors
Variational Autoencoders (VAE) extend the autoencoder with a probabilistic latent: the encoder produces a mean and variance vector, the latent is sampled from the resulting Gaussian, and the decoder reconstructs from the sample. The training loss combines reconstruction error with a KL-divergence term that pushes the latent distribution toward a unit Gaussian. The anomaly signal is the likelihood the model assigns to the input, in practice approximated by the evidence lower bound (reconstruction term plus KL term) or by a sampled reconstruction probability: anomalies score low and are flagged.
GAN-based methods — AnoGAN (Schlegl et al. 2017), GANomaly (Akcay et al. 2018), Skip-GANomaly (Akcay et al. 2019), and their video-specific descendants — train a generator to produce normal samples and a discriminator to tell normal from generated, then use either the reconstruction error or the discriminator's confidence as the anomaly signal. The intuition is identical to the VAE — the model knows what normal looks like and the anomaly is what it cannot match.
In academic benchmarks, VAEs and GAN-based detectors hover around the same AUC band as ST-AE — useful and competitive on the small benchmarks (Avenue, ShanghaiTech) but rarely the top of the leaderboard. In production they are much less common because the training is harder. VAE training requires careful balancing of the reconstruction term and the KL term, with a tendency to collapse to posterior-mean-equals-prior (the "posterior collapse" failure mode). GAN training is famously unstable, with mode collapse, vanishing gradients, and oscillating losses as standard failure modes that consume weeks of engineer time.
Our practical recommendation is to skip VAE and GAN for production anomaly detection unless you have a specialist on the team. Vanilla autoencoders are simpler and competitive; ST-AE and MemAE are competitive on temporal anomalies with much less training pain; the CLIP-based and MIL families are dramatically better when you have any labels at all. The VAE and GAN families remain interesting in research and in narrow industrial-inspection contexts where the generated samples themselves are useful (for data augmentation, for simulation), but for ordinary video anomaly detection they are not the place to spend training-engineer time in 2026.
Algorithm 7 — 3D-CNNs (I3D, C3D, SlowFast — The Workhorse Layer-2 Model)
The 3D-CNN family is the production workhorse for behavioural anomaly detection. C3D (Tran et al. 2015) was the first 3D-CNN to demonstrate that adding the time axis to convolution produces a much better video feature extractor than 2D-CNN-plus-LSTM. I3D (Carreira and Zisserman 2017) extended a 2D ImageNet-pretrained Inception network to 3D by "inflating" the convolution kernels along the time axis and seeded the result with the 2D weights, which gave I3D the first really strong pretrained video feature extractor. SlowFast (Feichtenhofer et al. 2019, ICCV) introduced the two-pathway architecture — a slow pathway at low temporal resolution that captures spatial semantics, a fast pathway at high temporal resolution that captures motion — and remains the best accuracy-per-FLOP point in the family.
For anomaly detection specifically, the 3D-CNN family is rarely the end-to-end model; it is the feature extractor. The typical pattern is: take a pretrained I3D or SlowFast, run it over the clip to produce a 1024- or 2048-dim feature vector per video segment, and feed those features into a downstream classifier — MIL, MLP, transformer, CLIP-similarity, whatever the layer-2 architecture requires. The reason is efficiency: the 3D-CNN is the expensive part, and you only want to pay that cost once per clip; the downstream classifier can be retrained for new anomaly classes without re-running feature extraction.
Quantised to INT8, I3D and SlowFast both run at roughly 30 frames per second on a Jetson Orin Nano Super, with feature-vector latency of 25 to 40 ms per clip. The Nano Super delivers about 67 TOPS of INT8 sparse performance — enough headroom to run the 3D-CNN plus a small MIL head plus a thin candidate-filter autoencoder on the same device. On the Orin NX (100 TOPS) you can fit a heavier SlowFast variant or run multiple model instances for multi-camera shared hardware. The ResNet50-I3D-NonLocal variant commonly used as the feature backbone for VAD weighs in at 34.6M parameters and 38.3 GFLOPs per clip, well within Orin Nano's edge envelope.
The failure mode is clip-boundary effects. A 3D-CNN ingests a fixed-length clip (typically 16, 32, or 64 frames). An anomaly that crosses a clip boundary can be split across two clips and seen as "half" by each, producing low confidence on both. The fix is to run with overlapping clips (a sliding window with stride less than clip length) and to apply temporal smoothing of the per-clip scores. The cost scales with the overlap: a stride of half the clip length is roughly 2× compute, which has to be planned into the latency budget.
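A sketch of the sliding-window bookkeeping in NumPy. The per-clip scores would come from the 3D-CNN pipeline; here they are passed in as plain numbers, and the helper names are illustrative.

```python
import numpy as np


def clip_starts(n_frames: int, clip_len: int, stride: int) -> list:
    """Start indices of every full clip that fits in the frame range."""
    return list(range(0, n_frames - clip_len + 1, stride))


def frame_scores(clip_scores, starts, clip_len: int, n_frames: int) -> np.ndarray:
    """Per-frame score: the mean score of all clips that cover the frame."""
    total = np.zeros(n_frames)
    count = np.zeros(n_frames)
    for start, score in zip(starts, clip_scores):
        total[start:start + clip_len] += score
        count[start:start + clip_len] += 1
    return total / np.maximum(count, 1)


def smooth(scores: np.ndarray, width: int) -> np.ndarray:
    """Moving-average temporal smoothing of per-frame scores."""
    kernel = np.ones(width) / width
    return np.convolve(scores, kernel, mode="same")


non_overlapping = clip_starts(64, clip_len=16, stride=16)   # 4 clips
half_overlap = clip_starts(64, clip_len=16, stride=8)       # 7 clips

# An anomaly around frames 12 to 20 straddles the boundary at frame 16.
# With half overlap, the clip starting at frame 8 sees it whole.
scores = [0.4, 0.9, 0.4, 0.1, 0.1, 0.1, 0.1]                # made-up per-clip scores
per_frame = smooth(frame_scores(scores, half_overlap, 16, 64), width=5)
```

With a stride of half the clip length the clip count goes from 4 to 7 on this 64-frame example, approaching 2× as the sequence gets longer.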
Algorithm 8 — Video Transformers (TimeSformer, VideoSwin, VideoMAE, ViViT — The Accuracy Ceiling)
The video transformer family arrived between 2021 and 2023 and pushed the accuracy ceiling on most action-recognition and anomaly-detection benchmarks by 2 to 5 points. TimeSformer (Bertasius et al. 2021) introduced "divided attention" — attention over space within each frame, then attention over time across frames — as a way to make video transformers tractable. ViViT (Arnab et al. 2021) explored a family of spatial-temporal attention factorisations. VideoSwin (Liu et al. 2022) brought the Swin Transformer's hierarchical shifted-window attention to video, dramatically reducing the compute cost while preserving accuracy. VideoMAE (Tong et al. 2022, NeurIPS) introduced masked-autoencoder pretraining for video — randomly mask 90% of the video tokens, train to reconstruct, and the result is a strong self-supervised pretrained video transformer that needs less labelled fine-tuning data than its supervised counterparts.
For anomaly detection, video transformers reliably beat 3D-CNNs on the small benchmarks (Avenue, ShanghaiTech) and are competitive on UCF-Crime when stacked under a MIL head. The headline comparison on UCF-Crime is MTFL on VideoSwin (VST-RGB) features at 89.78% AUC against RTFM on I3D features at 84.30%, with the caveat that the two differ in MIL head as well as in backbone.
The cost is the compute. A VideoSwin-B variant runs at 5 to 15 frames per second on a Jetson Orin Nano Super even at INT8 — too slow for live edge inference at 30 fps. VideoMAE-L is heavier still. The video transformer family is the right choice for cloud-side re-scoring in a hybrid topology — the edge runs a quantised SlowFast at 30 fps for candidate filtering, the cloud runs VideoMAE on the candidates that the edge model flagged for a high-accuracy second opinion. On a cloud L40S you can run VideoMAE at 30+ fps on dozens of streams in parallel; on edge silicon, not yet in 2026.
The failure mode is transformer hyperparameter sensitivity. Video transformers are more sensitive to learning-rate warmup, weight decay, drop-path rate, and patch-size choice than 3D-CNNs. A team that has never trained a video transformer should expect a 4 to 8 week ramp-up before reproducing published numbers. The fix is to start from the published pretrained weights and fine-tune sparingly (a single learning-rate at 1e-5, weight decay at 0.05, drop-path at 0.1), rather than training from scratch.
Algorithm 9 — Weakly-Supervised MIL (RTFM, MGFN, MTFL — The UCF-Crime State Of The Art)
The Multiple Instance Learning (MIL) family is the algorithm class that made weakly-supervised video anomaly detection practical. The problem MIL solves is the labelling-cost problem. Frame-level labels for anomalies — "this exact frame contains a fight" — are prohibitively expensive to collect at scale; video-level labels — "this 10-minute clip contains a fight somewhere" — are cheap. MIL frameworks bridge the gap: they train on video-level labels and learn to localise the anomaly to a clip or a frame at inference.
The framework. Each video is treated as a "bag" of clips; each clip is an "instance" in the bag. Positive (anomalous) bags contain at least one anomalous instance; negative (normal) bags contain none. A neural network scores each instance; the bag score is some aggregation (commonly max, or top-k mean). The ranking loss pushes positive bags' top-k clip scores above negative bags' top-k clip scores. At inference the per-clip score is the localisation.
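A sketch of the bag aggregation and a hinge-style ranking loss in NumPy. It shows only the core ranking term with top-k mean aggregation; the smoothness and sparsity regularisers used in the Sultani et al. formulation are omitted, and the margin and k values are illustrative.

```python
import numpy as np


def bag_score(instance_scores: np.ndarray, k: int) -> float:
    """Aggregate per-clip scores into a bag score: the mean of the top-k clips."""
    top_k = np.sort(instance_scores)[-k:]
    return float(top_k.mean())


def ranking_loss(pos_scores: np.ndarray, neg_scores: np.ndarray, k: int = 3, margin: float = 1.0) -> float:
    """Hinge loss that pushes the anomalous bag's score above the normal bag's by `margin`."""
    return max(0.0, margin - bag_score(pos_scores, k) + bag_score(neg_scores, k))


# One anomalous video (a few high-scoring clips) and one normal video.
positive_bag = np.array([0.1, 0.1, 0.9, 0.95, 0.8, 0.1, 0.1, 0.1])
negative_bag = np.full(8, 0.1)

loss = ranking_loss(positive_bag, negative_bag)             # margin not yet met
loss_small_margin = ranking_loss(positive_bag, negative_bag, margin=0.5)

# At inference the per-clip scores themselves are the localisation.
most_anomalous_clip = int(np.argmax(positive_bag))
```

Only the top-k clips of each bag receive gradient from this term, which is why a brief, mild anomaly in a long bag trains so weakly.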
The state-of-the-art lineage on UCF-Crime tells the story. RTFM (Tian et al. 2021) introduced Robust Temporal Feature Magnitude learning and achieved 84.30% AUC on UCF-Crime. MGFN (Chen et al. 2023, AAAI) introduced the Magnitude-Contrastive Glance-and-Focus Network and pushed the bar to 86.98% AUC on UCF-Crime and 80.11% AP on XD-Violence with I3D features; with VST-RGB features in later comparisons, MGFN sits at roughly 85.80% AUC on UCF-Crime. MTFL (Zhang et al. 2024) introduced Multi-Timescale Feature Learning and reported 89.78% AUC on UCF-Crime on VST-RGB features, a 5.48-point lift over RTFM and a clear pass over MGFN at the same feature backbone. Unbiased MIL (Lv et al. 2023, CVPR) addressed the class-imbalance bias in earlier MIL frameworks and is the right choice when the anomaly class is very rare in the training set.
In production, the MIL family is the right layer-2 classifier when you have video-level labels but not frame-level labels — which is essentially every real surveillance dataset. The MIL head is small (a few MLP layers on top of a feature backbone) and runs in milliseconds even on a Jetson; the cost is dominated by the feature extractor underneath, which is your 3D-CNN or video transformer choice from algorithms 7 and 8. The 2025 picture is roughly 86–90% AUC on UCF-Crime on raw features and 95%+ AUC when stacking the right CLIP-augmented layer 3 on top.
The failure mode is bag-level confusion. MIL trains to discriminate "any anomaly somewhere in the bag" from "no anomaly anywhere". A long normal video with a brief, mild anomaly trains a weak gradient; a short bag with a brief, strong anomaly trains a strong gradient. The result is that MIL is reliable on prominent anomalies and uneven on subtle ones — a fight or an arrest is caught reliably, a slow loitering pattern is missed. The fix is two-fold: dataset curation to balance bag duration and anomaly density, and the use of temporal segmentation (e.g., the 32-segment chunking from the original Sultani et al. 2018 paper) to control bag length. The MTFL multi-timescale architecture explicitly addresses this by representing each clip across multiple time scales rather than a single fixed window.
A worked production pattern. SlowFast-R50 INT8 feature extractor on a Jetson Orin Nano Super produces a 2048-dim feature per 32-frame clip. An MTFL head (a small transformer over the temporal axis with multi-scale heads) on top produces per-clip anomaly scores. Total edge inference budget: 60 ms per clip, well inside the 200 ms latency envelope. Labelling budget: video-level labels on 1,000 hours of footage from 50 cameras over 6 weeks of in-house annotation. Resulting AUC on the customer's site validation set: 91% — within 2 points of the published UCF-Crime number, which is the gap you should plan for when transferring research numbers to production data.
Algorithm 10 — CLIP-Based Methods (VadCLIP, TPWNG, VadCLIP++, TrCLIP-VAD, WSVAD-CLIP, AVadCLIP — The 2024–2026 Lift)
CLIP-based weakly-supervised VAD is the single biggest engineering shift in the field between 2024 and 2026. CLIP (Radford et al. 2021) is a contrastive vision-language model trained on image–text pairs from the open web. It maps images and short text descriptions into a shared embedding space where semantically similar items cluster. The application to anomaly detection is direct: describe the anomaly class in text ("a person climbing a fence", "a person fighting"), embed both the text and a video clip, and threshold on the cosine similarity. With zero training examples of the target class you get a credible detector. With a handful of examples you fine-tune toward the customer's site distribution. With a CLIP-equipped MIL head you reach published-SOTA numbers with a fraction of the labelling cost the older MIL family demanded.
The lineage. VadCLIP (Wu et al. 2024, AAAI) was the first CLIP-based weakly-supervised VAD method to land on the leaderboards. It uses a dual-branch architecture: one branch does coarse-grained binary classification on visual features, the other does fine-grained language–image alignment, and the two are fused. Published numbers: 88.02% AUC on UCF-Crime and 84.51% AP on XD-Violence. TPWNG (Text Prompt with Normality Guidance, Yang et al. CVPR 2024) is a pseudo-label generation and self-training framework that fuses visual features with CLIP text embeddings, achieving state-of-the-art performance on both UCF-Crime and XD-Violence in 2024. WSVAD-CLIP (Lin et al. 2025) adds temporal-aware prompt learning and frame-aware fusion using CLIP's cross-modal knowledge to bridge the semantic gap. VadCLIP++ (2025) introduces a dynamic vision-language model that adapts CLIP at inference time. TrCLIP-VAD (vpsg-research, 2026) improves CLIP training with text rewriting and is the current near-SOTA at 88.59% AUC on UCF-Crime and 86.38% AP on XD-Violence. AVadCLIP (2025) extends the approach to audio-visual collaboration, which is the right call for XD-Violence-style violence detection.
The production fit is the open-vocabulary and few-shot regime. If your customer wants to add a new anomaly class — "a person carrying a long object", "a person wearing a high-visibility vest in a no-go zone" — describe it in text, embed the description, threshold on cosine similarity against incoming clips, and you have a v1 detector in one hour without retraining anything. If you then collect 50 examples of the new class, you can fine-tune the CLIP text branch or train a small MIL head on top to add 5 to 15 AUC points. Both operations are orders of magnitude faster than the older "collect 5,000 labelled examples then retrain the whole pipeline" workflow that 3D-CNN-only architectures demanded.
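A sketch of the describe-embed-threshold step in NumPy. It assumes the frame and text embeddings have already been produced by a CLIP model; the vectors below are synthetic stand-ins, and the 0.5 threshold is illustrative only. Similarities from a real CLIP model occupy a much narrower range, so the threshold has to be calibrated on site footage.

```python
import numpy as np


def l2_normalise(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def clip_score(frame_embeddings: np.ndarray, text_embedding: np.ndarray) -> float:
    """Cosine similarity between a mean-pooled clip embedding and a text embedding."""
    clip_vec = l2_normalise(l2_normalise(frame_embeddings).mean(axis=0))
    return float(clip_vec @ l2_normalise(text_embedding))


rng = np.random.default_rng(0)
dim, n_frames = 768, 32

# Synthetic stand-ins: the text embedding and two 32-frame clips.
text = np.zeros(dim)
text[0] = 1.0                                   # "a person climbing a fence"
other_direction = np.zeros(dim)
other_direction[1] = 1.0

matching_clip = text + rng.normal(0.0, 0.02, size=(n_frames, dim))
unrelated_clip = other_direction + rng.normal(0.0, 0.02, size=(n_frames, dim))

THRESHOLD = 0.5                                 # illustrative; calibrate per site
match_score = clip_score(matching_clip, text)
unrelated_score = clip_score(unrelated_clip, text)
alerts = [score > THRESHOLD for score in (match_score, unrelated_score)]
```

Adding a new anomaly class in this scheme means adding one more text embedding and one more threshold; nothing upstream is retrained.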
The cost is the inference cost of CLIP itself. CLIP-ViT-L/14 produces a 768-dim image embedding at roughly 15 ms per image on an A100 (much higher on edge silicon). A 32-frame clip needs 32 forward passes (or one with a 3D adaptation); run sequentially that would be close to half a second, so the frames are batched, and per-clip latency lands at 50 to 100 ms on a cloud GPU and 200 to 500 ms on edge silicon. The right production pattern is to use CLIP in the cloud as the layer-3 re-scorer for candidates that the edge layer-2 detector has already filtered, rather than running CLIP on every frame.
The failure mode is prompt brittleness. CLIP is sensitive to the exact phrasing of the text prompt. "A person fighting" and "two people fighting" can produce different similarity scores against the same clip, and small prompt rewordings can shift AUC by several points. The fix is two-fold: ensemble across multiple prompt variants (the trick that lifted CLIP zero-shot ImageNet accuracy by about 3.5 points in the original paper) and use a learned prompt-tuning layer (CoOp, CoCoOp, MaPLe) rather than hand-crafted prompts.
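Prompt ensembling is a few lines once the per-prompt text embeddings exist. The sketch below averages in embedding space, which is how the original CLIP paper builds its ensembles; the two-dimensional vectors are toy values chosen so the effect is visible by eye.

```python
import numpy as np


def l2_normalise(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def ensemble_text_embedding(prompt_embeddings: np.ndarray) -> np.ndarray:
    """Average unit-normalised prompt embeddings, then re-normalise the mean."""
    return l2_normalise(l2_normalise(prompt_embeddings).mean(axis=0))


# Toy embeddings for two phrasings of the same class, each off-axis in a different direction.
prompts = np.array([
    [1.0, 0.5],    # "a person fighting"
    [1.0, -0.5],   # "two people fighting"
])
concept = np.array([1.0, 0.0])  # the direction both phrasings are trying to describe

ensemble = ensemble_text_embedding(prompts)
individual_similarity = [float(v @ concept) for v in l2_normalise(prompts)]
ensemble_similarity = float(ensemble @ concept)
```

Each phrasing carries its own idiosyncratic offset; averaging cancels offsets that point in different directions and keeps what the phrasings share.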
Algorithm 11 — Multimodal LLMs (Holmes-VAD, AnomalyRuler, Cerberus, SlowFastVAD — The Explanation Layer)
The arrival of multimodal LLMs as anomaly detectors is the second 2024–2026 shift. Where CLIP-based methods provide a similarity score, multimodal LLMs provide a natural-language explanation — and in 2026 the explanation is what differentiates a product that passes a regulator audit from one that does not.
The lineage. Holmes-VAD (Zhang et al. 2024) fine-tunes a multimodal LLM on the new VAD-Instruct50k benchmark (50,000 instruction-response pairs for video anomaly detection) and produces both precise temporal localisation and a coherent natural-language explanation. The architecture has three parts: a frozen video encoder, a lightweight temporal sampler that picks anomaly-relevant frames, and an LLM that ingests the frames and produces a structured response. AnomalyRuler (Yang et al. ECCV 2024) treats VAD as a few-shot reasoning task: feed the LLM a few examples of normal behaviour, ask it to induce rules, then apply those rules to new footage. By its published numbers it was the first VLM-based method to beat the older fully-supervised baselines on the small unsupervised benchmarks. Cerberus (Liu et al. 2025) cascades smaller and larger VLMs to balance accuracy and frame rate, hitting 97.2% accuracy on UCF-Crime at 57.68 fps on an NVIDIA L40S, which finally puts VLM-class methods within range of real time. SlowFastVAD (Wang et al. 2025) integrates a fast detector with a RAG-enhanced VLM, retrieving relevant context from a video knowledge base before producing the explanation.
In production today the practical pattern is layered. A small on-device VLM (Qwen-VL 7B, Llama 3.2 Vision 11B, Moondream 2B) handles the routine explanations on-camera. A frontier VLM (Gemini 2.5 Pro, Claude Opus 4.7, GPT-5) handles the borderline 1 to 5% of cases that the smaller model is unsure about, called from the cloud via API. This split keeps the cost manageable while preserving the explanation-quality ceiling for the cases that need it most.
The cost arithmetic is the trap. A frontier VLM at $1.25 per million input tokens, ingesting a 30-second clip at default media resolution (about 9,000 input tokens) plus an output of 200 tokens, costs roughly $0.012 per clip — call it a cent. Multiply that by a thousand alerts a day per camera and you have $12 per day per camera, or $360 a month per camera — more than every SaaS surveillance contract on the market. The fix is the layered pattern above plus aggressive routing: only the alerts with confidence between 0.4 and 0.7 from the MIL layer (typically 1 to 5% of all alerts) go to the frontier VLM. We walk the full cost arithmetic in lesson 1.4 on real AI cost in video products.
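The arithmetic and the routing band above are easy to check in code. This sketch uses the figures quoted in the paragraph (9,000 input tokens, $1.25 per million input tokens, roughly $0.012 per clip, 1,000 alerts per camera per day, a 0.4 to 0.7 confidence band). The function names are hypothetical, the 30-day month is a simplification, and treating both band edges as inclusive is an assumption.

```python
def input_cost_usd(input_tokens, price_per_million_usd):
    """Input-side cost of one clip at a per-million-token price."""
    return input_tokens * price_per_million_usd / 1_000_000


def monthly_cost_per_camera(cost_per_clip, alerts_per_day, routed_fraction=1.0, days=30):
    """Monthly VLM spend for one camera when a fraction of alerts is routed to the VLM."""
    return cost_per_clip * alerts_per_day * routed_fraction * days


def needs_frontier_vlm(mil_confidence, low=0.4, high=0.7):
    """Route only mid-confidence alerts (band edges inclusive by assumption)."""
    return low <= mil_confidence <= high


# Input tokens alone account for most of the per-clip figure.
input_side = input_cost_usd(9_000, 1.25)

# The article's rounded per-clip cost, with and without routing.
unrouted = monthly_cost_per_camera(0.012, alerts_per_day=1_000)
routed_5pct = monthly_cost_per_camera(0.012, alerts_per_day=1_000, routed_fraction=0.05)
routed_1pct = monthly_cost_per_camera(0.012, alerts_per_day=1_000, routed_fraction=0.01)
```

With these inputs the unrouted bill is about $360 per camera per month, and routing 1 to 5% of alerts brings it down to roughly $3.60 to $18.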
The failure mode is hallucination. A frontier VLM will produce confident, fluent, wrong explanations on out-of-distribution footage. The fix is structured-output prompting (force the model to fill a JSON schema rather than write free prose), a confidence threshold on the structured fields, and a confirm-with-the-CLIP-score sanity check that rejects explanations that contradict the upstream layers. In high-stakes deployments the VLM output is always a draft that a human reviewer signs off on, never a fully-automated decision.
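The three guards named above (schema, confidence threshold, agreement with the upstream CLIP score) can be combined into a single acceptance check. The sketch below is one possible shape, not a reference implementation: the field names, the `"normal"` class label, and both thresholds are hypothetical, and a real system would validate against whatever schema it actually prompts for.

```python
import json

# Hypothetical schema: field name -> expected Python type after json.loads.
REQUIRED_FIELDS = {"event_class": str, "confidence": float, "explanation": str}


def accept_vlm_draft(raw_response, clip_score, min_confidence=0.6, min_clip_score=0.25):
    """Return the parsed draft if it passes all checks, otherwise None."""
    try:
        draft = json.loads(raw_response)
    except json.JSONDecodeError:
        return None  # free prose or truncated output
    if not isinstance(draft, dict):
        return None
    for name, expected in REQUIRED_FIELDS.items():
        value = draft.get(name)
        if expected is float:
            # JSON numbers may parse as int; booleans are not valid confidences.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                return None
        elif not isinstance(value, expected):
            return None
    confidence = float(draft["confidence"])
    if not 0.0 <= confidence <= 1.0 or confidence < min_confidence:
        return None
    # Reject explanations that claim an anomaly the upstream CLIP layer did not see.
    if draft["event_class"] != "normal" and clip_score < min_clip_score:
        return None
    return draft


good = '{"event_class": "fight", "confidence": 0.9, "explanation": "Two people exchange blows."}'
accepted = accept_vlm_draft(good, clip_score=0.31)
contradicted = accept_vlm_draft(good, clip_score=0.05)
malformed = accept_vlm_draft("The clip shows a fight.", clip_score=0.31)
```

A draft that returns `None` here would fall back to the upstream score alone or go to a human reviewer, in line with the draft-not-decision rule above.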
Algorithm 12 — LSTM And Temporal CNN On Engineered Features (The Pipeline Watchdog)
The last algorithm is the one nearly every architecture forgets and the one that catches more silent failures in production than the other eleven combined. It watches the pipeline itself, not the video content. An LSTM or a 1D temporal CNN on a vector of engineered features over a sliding time window catches the failure modes that the vision layers cannot.
The engineered feature vector is the heart of the approach. Per-second metrics: frame freeze duration; encoder bitrate; encoder QP; decoded-frame mean luma; decoded-frame entropy; inter-frame difference magnitude; ONVIF event rate; alert rate; rolling MIL confidence histogram; CLIP-score distribution; VLM-cost spike. Aggregate into a 60-second window with 30-second stride; feed into an LSTM or a temporal-CNN; train against historical normal patterns. The anomaly is any window whose reconstruction error or whose predicted-next-window distance exceeds a threshold.
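The windowing step is the part that is easy to get subtly wrong, so here is a small sketch of it. It assumes one feature row per second and uses the 60-second window and 30-second stride from the paragraph. The `score_fn` argument is a stand-in for the LSTM or temporal-CNN reconstruction error; the toy scorer below (fraction of seconds with zero bitrate) is purely illustrative.

```python
def sliding_windows(samples, window=60, stride=30):
    """Yield (start_index, rows) for each full window of per-second feature rows."""
    for start in range(0, len(samples) - window + 1, stride):
        yield start, samples[start:start + window]


def anomalous_windows(samples, score_fn, threshold, window=60, stride=30):
    """Return the start index of every window whose score exceeds the threshold."""
    return [
        start
        for start, rows in sliding_windows(samples, window, stride)
        if score_fn(rows) > threshold
    ]


# Toy data: one feature (encoder bitrate in kbps) that drops to zero at second 100.
samples = [[4000.0] if second < 100 else [0.0] for second in range(150)]


def zero_bitrate_fraction(rows):
    """Stand-in score: share of seconds in the window with no encoder output."""
    return sum(1 for row in rows if row[0] == 0.0) / len(rows)


starts = [start for start, _ in sliding_windows(samples)]
flagged = anomalous_windows(samples, zero_bitrate_fraction, threshold=0.5)
```

The 50% overlap means a failure that begins mid-window is still seen in full by the next window, at the price of scoring each second twice.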
The failure modes this catches are the ones a surveillance team will eventually live through. The camera silently froze and the pipeline kept emitting the last good frame's predictions. The encoder restarted and is producing a corrupted bitstream that decodes to apparent video but with degraded features. The MIL model collapsed to predicting one class after a botched deploy. The frontier VLM API began returning empty responses because the API key rotated and the failover did not pick it up. The cloud GPU queue is backed up and inference latency has crept from 50 ms to 1500 ms over the past hour. The drop in upstream RTSP packet delivery has been silently filled by the local jitter buffer, masking a developing network issue.
None of these are visible to the vision-layer stack. All of them are visible to a 60-feature LSTM watching the pipeline. The cost is trivial — the model runs in microseconds per window on a CPU sidecar. The benefit is the difference between a system you trust and a system that occasionally fails open.
The Isolation Forest version of this same idea (algorithm 2) works for the simplest cases. The LSTM or temporal CNN version is the right call when the feature dynamics matter — when "low bitrate" is normal at night but anomalous in the day, when "high alert rate" is normal during a sports event but anomalous on a Tuesday morning. The temporal model learns those patterns; the iForest treats every minute as independent and misses them.
A worked example. A 200-camera transit deployment. Sixty engineered features per minute per camera. A small (2-layer, 128-hidden) LSTM per camera trained on the last 90 days of normal pipeline patterns. Inference cost: roughly 5 microseconds per window on a single CPU core, so one pass over the whole fleet's LSTMs costs about a millisecond of CPU time per minute. First-month catches: 11 cameras with developing focus issues, 4 cameras whose encoder bitrate had drifted out of spec, 3 nights when the cloud inference queue exceeded latency SLA, 1 case of a model collapse after a deploy, 2 cases of an upstream packet-loss issue that the WebRTC jitter buffer was masking. None of these were visible to the vision-layer stack.
Figure 2. The twelve algorithms positioned by labelled-example count and latency budget, with their layer in the five-layer stack indicated by colour.
The Production-Ready Stack — Wiring The Twelve Into A Single Pipeline
The Fora Soft default 2026 architecture wires several of the twelve algorithms into a single pipeline. The shape is the five-layer stack from the 2026 playbook; the algorithm choices below are the concrete ones we ship when no customer-specific constraint dictates otherwise.
Layer 1, candidate filter on the edge: a vanilla autoencoder or MemAE quantised to INT8, running at 30 fps per camera on a Jetson Orin Nano Super or a Hailo-8 module. The job is to surface windows worth a second look; precision is irrelevant, but recall has to be high. The threshold is tuned to a 5 to 10% candidate rate against historical normal footage.
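Tuning a threshold to a target candidate rate amounts to picking a percentile of the reconstruction errors observed on normal footage. The sketch below shows one way to do that, assuming the historical scores are already collected per camera; the function name and the synthetic scores are hypothetical.

```python
import math


def threshold_for_candidate_rate(normal_scores, candidate_rate):
    """Pick a threshold so that at most `candidate_rate` of the scores lie strictly above it."""
    if not normal_scores:
        raise ValueError("need historical scores to tune against")
    if not 0.0 < candidate_rate < 1.0:
        raise ValueError("candidate_rate must be between 0 and 1")
    ordered = sorted(normal_scores)
    allowed_above = math.floor(len(ordered) * candidate_rate)
    return ordered[len(ordered) - allowed_above - 1]


# Synthetic reconstruction errors standing in for a camera's historical normal footage.
history = [float(value) for value in range(1, 101)]
threshold = threshold_for_candidate_rate(history, candidate_rate=0.05)
candidates = [score for score in history if score > threshold]
```

Recomputing this over a rolling window per camera is one way to handle the threshold drift noted for autoencoders in the comparison table.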
Layer 2, classifier on the edge or thin-edge tier: a SlowFast INT8 feature extractor plus an MTFL or RTFM MIL head, producing per-clip anomaly scores and class labels. Inference budget: 60 ms per clip on Orin Nano Super. Output: per-clip score and class probability vector.
Layer 3, re-scorer in the cloud: VadCLIP or TPWNG running on a shared cloud GPU (L4 or L40S), called only on the candidates the edge layer 2 flagged with mid-range confidence. Open-vocabulary anomaly detection lives here. Output: refined per-clip score and a class label that can include classes the layer-2 MIL was never trained on.
Layer 4, explanation in the cloud: a small VLM (Qwen-VL 7B) on every alert that needs human review, plus a frontier VLM (Gemini 2.5 Pro, Claude Opus 4.7) on the top-1% borderline cases. Output: a structured-JSON narrative of the event suitable for a court, a regulator, or a security analyst.
Layer 5, pipeline watchdog as a sidecar: an LSTM on 60 engineered pipeline-health features per minute per camera. Runs everywhere — on the edge box as a separate process, or in the cloud as a per-camera sidecar. Output: alerts on silent pipeline failures.
The classical statistical algorithms (1, 2, 3) live inside layer 5 as the simplest version of the pipeline watchdog, and inside layer 1 for the special cases where the use case is "is anything moving in this empty room". The VAE and GAN family (6) is essentially unused in our default architecture; we keep it in the production twelve for completeness because it shows up in vendor pitch decks and you should know what it is.
Algorithm Selection — A Comparative Table
The table below condenses the twelve algorithms against the four numbers from the sieve, the layer they belong in, the typical hardware they run on, and a one-line note on the failure mode you should plan for.
| # | Algorithm | Labels needed | Latency (edge) | Best layer | Typical HW | Watch for |
|---|---|---|---|---|---|---|
| 1 | One-Class SVM | 0 anomaly | <10 ms | 5 (empty-room L1) | CPU | Concept drift; retrain quarterly |
| 2 | Isolation Forest | 0 anomaly | <1 ms | 5 | CPU | Slow drift goes uncaught; anchor to golden baseline |
| 3 | Local Outlier Factor | 0 anomaly | 10–50 ms | 5 (specialist) | CPU | O(n²) training; prefer iForest above 100k samples |
| 4 | Vanilla autoencoder | 0 anomaly | 15–25 ms | 1 | Edge NPU | Threshold drift; rolling per-camera percentile |
| 5 | ST-AE / MemAE | 0 anomaly | 80–150 ms | 1 (temporal) | Edge NPU | Overfit-to-recent; train on multi-month window |
| 6 | VAE / GAN | 0 anomaly | 30–100 ms | 1 (rare) | GPU | Training instability; usually skip |
| 7 | 3D-CNN (I3D, SlowFast) | 100+ | 25–40 ms | 2 (feature) | Edge NPU | Clip-boundary effects; use overlap |
| 8 | Video transformer | 1k+ | 60–200 ms | 2 (cloud) | Cloud GPU | Hyperparameter sensitivity; start from pretrained |
| 9 | Weakly-supervised MIL | Video-level | 5–15 ms | 2 (head) | Edge or cloud | Bag-level confusion; balance bag duration |
| 10 | CLIP-based | 0–100 | 50–500 ms | 3 | Cloud GPU | Prompt brittleness; ensemble prompts |
| 11 | Multimodal LLM | 0 (zero-shot) | 200 ms–10 s | 4 | Cloud (with API) | Hallucination; structured-output prompts |
| 12 | LSTM on engineered features | 0 | <1 ms | 5 | CPU sidecar | Feature engineering is the work; the model is small |
Figure 3. The twelve algorithms summarised against the four-number sieve and the five-layer stack.
The pattern in the table is the pattern of the field. The classical methods are cheap, label-free, and narrow. The reconstruction methods are the unsupervised default for raw video. The deep temporal models are the workhorse layer-2 classifiers. The CLIP family is the lift for low-label and open-vocabulary regimes. The multimodal LLMs are the explanation layer. The signal-watchdog algorithms catch the failures the vision layers cannot. A production system uses six or seven of these in concert; no production system uses only one.
Figure 4. The production pipeline that wires the twelve algorithms into a single architecture, with the edge-cloud split annotated.
A Common Mistake: Picking The Heaviest Algorithm First
The most expensive architectural mistake we see teams make in this field is starting from the heaviest algorithm. The conversation begins with "let's use a multimodal LLM, those are state of the art", and three months later the team is staring at a $400-per-camera-per-month inference bill with no way to bring it down because the entire pipeline is structured around the assumption that every frame goes through a VLM.
The reverse direction is correct. Start with the lightest algorithm that could plausibly solve the problem. One-Class SVM for empty-room intrusion. Isolation Forest for pipeline health. A vanilla autoencoder for raw-video novelty. A SlowFast plus an MTFL head for the bulk of behavioural anomalies. Only escalate to CLIP-based methods when the labelling cost of MIL hits its ceiling. Only escalate to multimodal LLMs when explanation is a hard requirement, and even then only on the borderline cases.
The reason this matters in dollars is that the cost curve is non-linear. Algorithms 1, 2, 3, and 12 run essentially free on a CPU. Algorithms 4, 5, 7, and 9 run on $250 edge hardware. Algorithm 8 needs a shared cloud GPU. Algorithm 10 needs a cloud GPU plus the bandwidth to send candidate clips to it. Algorithm 11 needs a frontier VLM API contract and disciplined routing. Each step up the cost ladder is a 5–10× jump. A pipeline built bottom-up reaches the right architecture and stops; a pipeline built top-down reaches the right architecture and keeps every unnecessary layer above it, paying for them forever.
Pitfall avoided: when a stakeholder says "let's use GPT-5 / Claude / Gemini for anomaly detection", the right response is "what's the labelled-example count, what's the latency budget, what's the hardware budget, and what's the explanation requirement?" Those four questions force the conversation back to the sieve, and the sieve picks the right layer.
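The first two sieve questions have explicit cut-offs in the sieve section at the top of this article, so they can be written down as lookups. The sketch below encodes only those two; hardware budget and explanation requirement are left out. The function names are hypothetical, the ranges the article does not assign return `None`, and assigning exactly 100 examples to the MIL regime is an assumption because the stated ranges overlap there.

```python
def label_regime(labelled_examples):
    """Map a labelled-example count to the algorithm family the sieve points at."""
    if labelled_examples == 0:
        return "unsupervised reconstruction"
    if 2 <= labelled_examples < 100:
        return "CLIP-based and few-shot"
    if 100 <= labelled_examples <= 1_000:
        return "weakly-supervised MIL on CLIP features"
    if labelled_examples >= 10_000:
        return "fully supervised"
    return None  # ranges the sieve does not assign (1, and 1,001 to 9,999)


def latency_topology(budget_ms):
    """Map an event-to-alert latency budget in milliseconds to a deployment topology."""
    if budget_ms < 200:
        return "edge inference"
    if budget_ms <= 500:
        return "thin edge plus cloud"
    if budget_ms <= 10_000:
        return "cloud"
    return "forensic, frontier VLMs become economical"


# Example: 50 labelled clips of the target anomaly and a 300 ms alert budget.
answer = (label_regime(50), latency_topology(300))
```

Running the two lookups before any vendor conversation gives a concrete starting point for the remaining two questions.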
Where Fora Soft Fits In
We have shipped video products for two decades — surveillance with Netcam Studio, WebRTC conferencing, telemedicine, e-learning, OTT, and AR/VR. The anomaly detection layer has been our most active workload in 2024–2026, across greenfield surveillance, retrofit projects on existing video pipelines, and embedded analytics on intelligent video analytics platforms. The default architecture we ship wires algorithms 4 or 5, 7, 9, 10, 11, and 12 into the five-layer stack from the playbook. The integration story is ONVIF Profile M to whatever VMS the customer already uses, MQTT for the high-frequency event bus, and a regulator-grade audit log to immutable storage. The verticals we have worked across — retail, transit, industrial safety, healthcare, conferencing — share more in architecture than they differ; the algorithm choice within the architecture is what makes each one ship.
What To Read Next
- Anomaly detection in video — the 2026 playbook — the architectural-level decision frame this article extends.
- Computer vision in retail, industrial, and intelligent video analytics — vertical playbooks for the largest commercial use cases.
- The cost model — what AI in video actually costs at scale — the dollar arithmetic behind every algorithm here.
Talk To Us / See Our Work / Download
Talk to a video engineer. Book a 30-minute scoping call with a Fora Soft senior engineer; we will run the four-number sieve against your product and produce a shortlist of two or three algorithms before the call ends.
See our case studies. Our surveillance and video analytics work is collected at forasoft.com/projects.
Download the algorithm selection worksheet. A one-page printable that walks the twelve algorithms against the four-number sieve. Download the worksheet (PDF).
References
- Schölkopf, B., Platt, J. C., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C. (2001). Estimating the Support of a High-Dimensional Distribution. Neural Computation 13(7). The foundational paper for One-Class SVM, the algorithm referenced in §"Algorithm 1".
- Liu, F. T., Ting, K. M., & Zhou, Z.-H. (2008). Isolation Forest. ICDM 2008. The original Isolation Forest paper underlying §"Algorithm 2".
- Breunig, M. M., Kriegel, H.-P., Ng, R. T., & Sander, J. (2000). LOF: Identifying Density-Based Local Outliers. ACM SIGMOD. The original Local Outlier Factor paper underlying §"Algorithm 3".
- scikit-learn developers. 2.7. Novelty and Outlier Detection — scikit-learn 1.8.0 documentation. Read 2026-05-23. https://scikit-learn.org/stable/modules/outlier_detection.html — Canonical implementation reference for OCSVM, iForest, and LOF, used for complexity citations.
- scikit-learn developers. Evaluation of outlier detection estimators. Read 2026-05-23. https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_outlier_detection_bench.html — Benchmark comparison of OCSVM, iForest, and LOF used in §"Algorithm 3".
- Hasan, M., Choi, J., Neumann, J., Roy-Chowdhury, A. K., & Davis, L. S. (2016). Learning Temporal Regularity in Video Sequences. CVPR. The first-wave video autoencoder paper referenced in §"Algorithm 4".
- Gong, D., Liu, L., Le, V., Saha, B., Mansour, M. R., Venkatesh, S., & van den Hengel, A. (2019). Memorizing Normality to Detect Anomaly: Memory-augmented Deep Autoencoder for Unsupervised Anomaly Detection. ICCV. The MemAE paper referenced in §"Algorithm 5".
- Tran, D., Bourdev, L., Fergus, R., Torresani, L., & Paluri, M. (2015). Learning Spatiotemporal Features with 3D Convolutional Networks. ICCV. The C3D paper referenced in §"Algorithm 7".
- Carreira, J., & Zisserman, A. (2017). Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. CVPR. The I3D paper referenced in §"Algorithm 7".
- Feichtenhofer, C., Fan, H., Malik, J., & He, K. (2019). SlowFast Networks for Video Recognition. ICCV. The SlowFast paper referenced in §"Algorithm 7".
- Liu, Z., Ning, J., Cao, Y., Wei, Y., Zhang, Z., Lin, S., & Hu, H. (2022). Video Swin Transformer. CVPR. The VideoSwin paper referenced in §"Algorithm 8".
- Tong, Z., Song, Y., Wang, J., & Wang, L. (2022). VideoMAE: Masked Autoencoders Are Data-Efficient Learners for Self-Supervised Video Pre-Training. NeurIPS. The VideoMAE paper referenced in §"Algorithm 8".
- Sultani, W., Chen, C., & Shah, M. (2018). Real-World Anomaly Detection in Surveillance Videos. CVPR. https://arxiv.org/pdf/1801.04264 — Foundational paper introducing UCF-Crime and the MIL framework, referenced throughout §"Algorithm 9".
- Tian, Y., Pang, G., Chen, Y., Singh, R., Verjans, J. W., & Carneiro, G. (2021). Weakly-Supervised Video Anomaly Detection with Robust Temporal Feature Magnitude Learning. ICCV. https://arxiv.org/pdf/2101.10030 — The RTFM paper referenced in §"Algorithm 9".
- Chen, Y., Liu, Z., Zhang, B., Fok, W., Qi, X., & Wu, Y.-C. (2023). MGFN: Magnitude-Contrastive Glance-and-Focus Network for Weakly-Supervised Video Anomaly Detection. AAAI. https://arxiv.org/pdf/2211.15098 — The MGFN paper referenced in §"Algorithm 9".
- Karim, H., Doughty, H., & Hospedales, T. (2024). MTFL: Multi-Timescale Feature Learning for Weakly-Supervised Anomaly Detection in Surveillance Videos. https://arxiv.org/pdf/2410.05900 — The MTFL paper underlying the 89.78% UCF-Crime AUC figure in §"Algorithm 9".
- Lv, H., Yue, Z., Sun, Q., Luo, B., Cui, Z., & Zhang, H. (2023). Unbiased Multiple Instance Learning for Weakly Supervised Video Anomaly Detection. CVPR. https://openaccess.thecvf.com/content/CVPR2023/papers/Lv_Unbiased_Multiple_Instance_Learning_for_Weakly_Supervised_Video_Anomaly_Detection_CVPR_2023_paper.pdf — Unbiased MIL paper referenced in §"Algorithm 9".
- Radford, A. et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. ICML. The original CLIP paper referenced in §"Algorithm 10".
- Wu, P. et al. (2024). VadCLIP: Adapting Vision-Language Models for Weakly Supervised Video Anomaly Detection. AAAI. https://arxiv.org/abs/2308.11681 — The VadCLIP paper with the 88.02% UCF-Crime AUC figure in §"Algorithm 10".
- Yang, Z. et al. (2024). Text Prompt with Normality Guidance for Weakly Supervised Video Anomaly Detection. CVPR. https://openaccess.thecvf.com/content/CVPR2024/papers/Yang_Text_Prompt_with_Normality_Guidance_for_Weakly_Supervised_Video_Anomaly_CVPR_2024_paper.pdf — The TPWNG paper in §"Algorithm 10".
- Lin, X. et al. (2025). WSVAD-CLIP: Temporally Aware and Prompt Learning with CLIP for Weakly Supervised Video Anomaly Detection. Journal of Imaging 11(10), 354. https://www.mdpi.com/2313-433X/11/10/354 — The WSVAD-CLIP paper in §"Algorithm 10".
- VadCLIP++: Dynamic vision-language model for weakly supervised video anomaly detection. (2025). Digital Signal Processing (Elsevier). https://www.sciencedirect.com/science/article/abs/pii/S1051200425005822 — The VadCLIP++ paper in §"Algorithm 10".
- vpsg-research. TrCLIP-VAD GitHub repository. https://github.com/vpsg-research/TrCLIP-VAD — TrCLIP-VAD with the 88.59% UCF-Crime AUC and 86.38% XD-Violence AP figures in §"Algorithm 10".
- Zhang, H. et al. (2024). Holmes-VAD: Towards Unbiased and Explainable Video Anomaly Detection via Multi-modal LLM. https://arxiv.org/pdf/2406.12235 — The Holmes-VAD paper in §"Algorithm 11".
- Yang, Y. et al. (2024). Follow the Rules: Reasoning for Video Anomaly Detection with Large Language Models. ECCV. https://arxiv.org/pdf/2407.10299 — The AnomalyRuler paper in §"Algorithm 11".
- Liu, X. et al. (2025). Cerberus: Real-Time Video Anomaly Detection via Cascaded Vision-Language Models. https://arxiv.org/pdf/2510.16290 — The Cerberus paper with the 97.2% UCF-Crime accuracy at 57.68 fps on L40S figure in §"Algorithm 11".
- Wang, J. et al. (2025). SlowFastVAD: Video Anomaly Detection via Integrating Simple Detector and RAG-Enhanced Vision-Language Model. https://arxiv.org/pdf/2504.10320 — The SlowFastVAD paper in §"Algorithm 11".
- Foundation Models and Transformers for Anomaly Detection: A Survey. (2025). https://arxiv.org/abs/2507.15905 — July 2025 survey used as cross-check for the algorithm taxonomy.
- A Comprehensive Survey of Transformer-Based Models for Video Anomaly Detection. (2025). https://iieta.org/download/file/fid/191034 — October 2025 survey used as cross-check for the transformer-based algorithms.
- Rethinking Metrics and Benchmarks of Video Anomaly Detection. (2025). https://arxiv.org/html/2505.19022v1 — May 2025 paper on benchmark methodology used as a sanity check on the AUC numbers throughout the article.


