Tuning Analytics: False Alarms, Accuracy, and the Operator's Reality

This is engineering guidance, not legal advice. Confirm specifics with qualified counsel.

Why this matters

This is the article every other analytics article points to, because tuning is where a surveillance system succeeds or fails in production. A detector that scores well in a vendor demo can be useless on your loading dock if it fires forty times a night at swaying trees, and a system tuned so conservatively that it never false-alarms may also be missing the one intrusion it was bought to catch. The person who lives with that outcome — the security integrator who configures it, the product manager who scoped it, the operations lead whose team watches it — needs to understand that "accuracy" is a range set by tuning, not a fixed property of the camera. Get the dial and the levers right and the control room trusts the system and acts on it; get them wrong and you have spent the whole analytics budget on alerts nobody reads.

The honest starting point: accuracy is a range, not a number

Start by retiring the word that wrecks surveillance projects: "accuracy," used as if it were a single percentage stamped on a box. It is not. Every analytic — object detection, a line-crossing rule, a face match, an anomaly score — produces a confidence number for each thing it sees, and a threshold decides which of those become alarms. Move the threshold and the same camera, the same model, the same scene produce a different "accuracy." So a vendor who tells you a detector is "98% accurate" has told you almost nothing useful, because they have not told you the threshold, the scene, the lighting, or which of the two mistakes that 98% is hiding.

There are exactly two mistakes an analytic can make, and naming them is the whole foundation. The first is a false positive, also called a false alarm: the system says "person!" when it was a shadow, a cat, or a headlight sweep. The second is a false negative, also called a miss: a real person walks through and the system says nothing. Alongside them are the two ways of being right — a true positive (real event, correctly flagged) and a true negative (nothing happened, correctly silent). Those four boxes are called a confusion matrix, and every accuracy claim is just some arithmetic on them.

[Image: a 2x2 confusion matrix showing true positive, false positive, false negative, and true negative, with precision and recall defined from the boxes and the cost of each error in surveillance.]

Figure 1. The four outcomes of any analytic. A false positive is a false alarm; a false negative is a miss. Precision and recall are two different ratios read off these boxes, and in surveillance the two errors cost very different things.

From those four boxes come the only two numbers that matter, and the reason one number was never enough. Precision asks: of all the alarms the system raised, how many were real? Write it as precision = true positives ÷ (true positives + false positives). High precision means when it cries wolf, there is usually a wolf. Recall — also called sensitivity — asks the opposite: of all the real events that happened, how many did the system catch? Write it as recall = true positives ÷ (true positives + false negatives). High recall means few real events slip through. A system can have wonderful precision and terrible recall (it only ever alarms when it is dead certain, so it is right when it speaks but silent most of the time), or wonderful recall and terrible precision (it alarms at everything, so it never misses a real event but buries it in noise). The single score people quote to combine them is the F1 score, the harmonic mean of precision and recall — useful as a summary, useless as a substitute for knowing which of the two you actually care about.
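The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a vendor API: the `ConfusionCounts` class and the two sets of counts are hypothetical, chosen only to show a "cautious" system (good precision, poor recall) next to a "noisy" one (good recall, poor precision).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts of the four outcomes over one evaluation period (hypothetical type)."""
    true_positives: int   # real events correctly flagged
    false_positives: int  # false alarms
    false_negatives: int  # misses
    true_negatives: int   # correctly silent (not used by precision or recall)

    def precision(self) -> float:
        raised = self.true_positives + self.false_positives
        return self.true_positives / raised if raised else 0.0

    def recall(self) -> float:
        real = self.true_positives + self.false_negatives
        return self.true_positives / real if real else 0.0

    def f1(self) -> float:
        # Harmonic mean of precision and recall.
        p, r = self.precision(), self.recall()
        return 2 * p * r / (p + r) if (p + r) else 0.0


# Illustrative numbers only: 20 real events in each case.
cautious = ConfusionCounts(true_positives=4, false_positives=1,
                           false_negatives=16, true_negatives=979)
noisy = ConfusionCounts(true_positives=19, false_positives=181,
                        false_negatives=1, true_negatives=799)

for name, counts in (("cautious", cautious), ("noisy", noisy)):
    print(f"{name}: precision={counts.precision():.2f} "
          f"recall={counts.recall():.2f} f1={counts.f1():.2f}")
```

Both systems get a mediocre F1, but for opposite reasons, which is why the summary score cannot replace looking at precision and recall separately.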

The dial: why you cannot maximize both at once

Here is the mechanism underneath all of it. A modern detector does not output "person / not person." It outputs a confidence score — a number, say from 0 to 1, for how sure it is. Turning that score into an alarm requires a cutoff: a confidence threshold. Anything above the threshold becomes an alarm; anything below it is ignored. That single cutoff is the master dial of the entire system.

Now watch what the dial does. Lower the threshold, say from 0.7 down to 0.3, and you accept weaker, less certain detections. You catch more real events (recall goes up) but you also let in more junk (precision goes down, false alarms climb). Raise the threshold back up toward 0.9 and the opposite happens: the junk disappears (precision goes up) but the weak-but-real detections fall below the line and become misses (recall goes down). You are not improving or degrading the model when you do this; the model is fixed. You are sliding along a trade-off curve that the model already has, choosing how many false alarms you will accept in exchange for how few misses.

[Image: a confidence-threshold dial with a precision-recall curve. A low threshold gives high recall and low precision with many false alarms, a high threshold gives high precision and low recall with more misses, and you cannot reach the top-right corner.]

Figure 2. The threshold is the master dial. Slide it down for recall (catch everything, more false alarms); slide it up for precision (clean alerts, more misses). The perfect top-right corner, with everything caught and nothing false, does not exist on a real curve.

The curve that draws this out is the precision-recall curve (a close cousin is the ROC curve): plot precision against recall as you sweep the threshold across its whole range, and you see the menu of operating points the model offers. The dream point — 100% recall and 100% precision, top-right corner — is not on the curve for any real detector in any real scene. That is the precise, technical reason this whole section bans the phrase "100% accuracy": it describes a point that does not exist. What exists is a curve, and tuning is the act of choosing the one point on that curve that is right for your site. A summary number like mean average precision (mAP) describes the whole curve's quality, but the model internals behind that number — the architectures and training that decide how good the curve can be — belong to the model-engineering layer, covered in the AI for Video Engineering section, not here. Here we choose the operating point and live with it.
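To make the sweep concrete, here is a small sketch that applies several thresholds to one fixed set of scored detections. The scores and labels in `scored` are invented for illustration, and `operating_point` is a hypothetical helper, not part of any camera or VMS SDK; a real evaluation would use labelled footage from your own site.

```python
# (confidence score, is it a real event?) for a hypothetical labelled clip set.
scored = [
    (0.95, True), (0.91, True), (0.88, False), (0.82, True), (0.74, True),
    (0.66, False), (0.58, True), (0.47, False), (0.41, True), (0.35, False),
    (0.28, False), (0.22, True), (0.15, False), (0.09, False),
]


def operating_point(detections, threshold):
    """Return (precision, recall) when every score >= threshold raises an alarm."""
    tp = sum(1 for score, real in detections if score >= threshold and real)
    fp = sum(1 for score, real in detections if score >= threshold and not real)
    fn = sum(1 for score, real in detections if score < threshold and real)
    precision = tp / (tp + fp) if (tp + fp) else 1.0  # no alarms, so no false ones
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# The model (the scores) never changes; only the cutoff moves.
for threshold in (0.3, 0.5, 0.7, 0.9):
    p, r = operating_point(scored, threshold)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, each step up in threshold buys precision and pays for it in recall, which is the curve described above read off as four operating points.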

Choosing the operating point: it depends on what a mistake costs

If you cannot have both, you have to decide which mistake you can tolerate — and that depends entirely on what each mistake costs at your specific site. This is the judgment that separates someone who understands tuning from someone reading a spec sheet.

Consider two systems at opposite ends. The first is the perimeter of a substation or a data center. Here a miss — a real intruder the system ignores — is a catastrophe, while a false alarm is a guard glancing at a screen for ten seconds. The cost of a miss vastly exceeds the cost of a false alarm, so you tune for recall first: set the threshold low, accept that you will get nuisance alarms, and add a verification step (a human or a second analytic) to filter them. You would rather chase ten shadows than miss one person on the fence.

The second is retail people-counting for a footfall report. Here a miss is one uncounted shopper out of thousands — statistically invisible — while a flood of false counts quietly corrupts the number the merchandising team is trusting. The costs are reversed, so you tune for precision, or for a balance, and let the occasional miss go. Same technology, opposite dial setting, because the cost of being wrong points the other way. The lesson is the one buyers most often miss: there is no universally "correct" threshold. The right operating point is a property of your use case, not of the camera, and a good integrator sets it deliberately rather than shipping the vendor's default of 0.5.

[Image: a decision view mapping use cases to operating points. Perimeter intrusion tunes recall-first, watch-list matching needs precision plus verification, retail counting tunes for balance, with the cost of a miss versus a false alarm driving each choice.]

Figure 3. The operating point is a business decision. Where a miss is catastrophic (perimeter), tune for recall and verify the noise. Where false positives corrupt the data or harm a person (counting, watch-lists), tune for precision and keep a human in the loop.
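One way to turn "it depends on what a mistake costs" into a procedure is to assign a cost to each error type and pick the threshold with the lowest total. The sketch below assumes you have labelled, scored detections from your own site; the data, the cost figures, and the helper names `expected_cost` and `cheapest_threshold` are all hypothetical.

```python
# (confidence score, is it a real event?) for a hypothetical labelled clip set.
scored = [
    (0.95, True), (0.91, True), (0.88, False), (0.82, True), (0.74, True),
    (0.66, False), (0.58, True), (0.47, False), (0.41, True), (0.35, False),
    (0.28, False), (0.22, True), (0.15, False), (0.09, False),
]
candidates = (0.1, 0.3, 0.5, 0.7, 0.9)


def expected_cost(detections, threshold, cost_miss, cost_false_alarm):
    """Total cost of the errors made at this threshold."""
    misses = sum(1 for score, real in detections if real and score < threshold)
    false_alarms = sum(1 for score, real in detections
                       if not real and score >= threshold)
    return misses * cost_miss + false_alarms * cost_false_alarm


def cheapest_threshold(detections, thresholds, cost_miss, cost_false_alarm):
    """Pick the candidate threshold with the lowest total error cost."""
    return min(thresholds,
               key=lambda t: expected_cost(detections, t, cost_miss, cost_false_alarm))


# Perimeter: a miss is assumed to cost 1000x a false alarm (illustrative ratio).
perimeter = cheapest_threshold(scored, candidates, cost_miss=1000, cost_false_alarm=1)
# Footfall counting: a false count is assumed to cost 5x a miss (illustrative ratio).
counting = cheapest_threshold(scored, candidates, cost_miss=1, cost_false_alarm=5)

print(f"perimeter threshold: {perimeter}")
print(f"counting threshold:  {counting}")
```

The same scores produce opposite answers once the cost ratio flips, which is the point of the two examples above. The cost numbers are the hard part in practice and belong to the site owner, not the code.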

Why rare events are mathematically brutal

There is a trap hiding inside even a very good detector, and it catches teams who reason from accuracy alone: when the thing you are looking for is rare, false alarms swamp real ones no matter how accurate the system sounds. This is the base-rate fallacy, sometimes called the false-positive paradox, and it is arithmetic, not pessimism.

Walk the numbers out loud, because the result surprises almost everyone the first time. Picture a busy entrance where the analytics process 50,000 person-detections in a day. You want to flag a rare event — say a specific watch-listed individual — that truly occurs about 5 times that day. Your detector is "99% accurate," which sounds airtight; read that as a 1% false-positive rate. Now do the multiplication. False positives: 1% of the roughly 50,000 non-matches is about 500 wrong flags. True positives: 99% of the 5 real events is about 5 correct flags. So the operator's screen shows about 505 alerts, of which 5 are real — a precision near 1%. The system is "99% accurate" and almost every alarm it raises is wrong, because the real event is so rare that even a tiny error rate, applied to a huge number of non-events, produces far more false alarms than true ones.

[Image: the base-rate trap. 50,000 detections with five true rare events and a 1 percent false-positive rate produce about 500 false alarms versus five real ones, so a 99 percent accurate system is right about 1 percent of the time it alarms.]

Figure 4. Why "99% accurate" can mean "wrong almost every time it alarms." When real events are rare, a tiny false-positive rate applied to a huge number of non-events buries the few true hits. Rare-event analytics must be triage plus verification, never an autonomous alarm.
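The worked example can be reproduced directly. This sketch uses the article's own figures and follows its reading of "99% accurate" as a 1% false-positive rate together with 99% of real events caught; the function name `alarm_breakdown` is hypothetical.

```python
def alarm_breakdown(total_detections, true_events, false_positive_rate, recall):
    """Return (true alarms, false alarms, precision) for one day of detections."""
    non_events = total_detections - true_events
    false_alarms = non_events * false_positive_rate
    true_alarms = true_events * recall
    precision = true_alarms / (true_alarms + false_alarms)
    return true_alarms, false_alarms, precision


true_alarms, false_alarms, precision = alarm_breakdown(
    total_detections=50_000,
    true_events=5,
    false_positive_rate=0.01,  # the "1%" behind "99% accurate"
    recall=0.99,               # assumes 99% of real events are caught
)

print(f"true alarms:  about {true_alarms:.0f}")
print(f"false alarms: about {false_alarms:.0f}")
print(f"precision:    {precision:.1%}")
```

Changing `true_events` to a larger number shows how quickly precision recovers once the event stops being rare, which is why the same detector can be fine for common events and unusable for rare ones without verification.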

This is not an argument against analytics; it is an argument about how to deploy the rare-event ones. The same math governs face-recognition watch-lists (see face recognition in surveillance) and learned anomaly detection, which is why both must be run as a triage layer that cues a human, not a button that acts on its own. You raise precision with verification — a second analytic, a confirming sensor, or an operator who confirms before anything happens — so that the 5 real events get found without the 500 false ones triggering a response. The base rate cannot be argued with; it can only be designed around.

The operator's reality: alert fatigue is the real failure mode

Now the part the data sheets never mention, and the reason tuning is a human problem as much as a technical one. A surveillance system does not run in a vacuum; it runs in front of a person, and people have hard limits. The dominant failure mode of a deployed analytics system is not a model that is insufficiently accurate. It is alert fatigue — an operator who has been flooded with false alarms stops trusting the system and starts ignoring it, so that when a real alarm finally arrives it lands in a stream of noise the operator has already tuned out. This is the "cry wolf" effect, and it is lethal precisely because it is invisible: the system is "working," the alerts are firing, and nobody is reading them.

The numbers behind human monitoring are sobering. A person asked to watch video feeds for activity loses attention fast: a widely cited figure in the security field holds that an operator misses up to 45% of screen activity after about 12 minutes of continuous watching, and up to 95% after about 22 minutes — and the peer-reviewed CCTV-vigilance literature confirms a real, measurable decline in detection over a monitoring shift. Field observation of live control rooms found that only about a third of detections came from operators proactively watching, rather than from someone phoning in or an alarm pointing them at a screen. Humans are bad at staring at quiet video and excellent at responding to a trustworthy alert — which is the entire reason analytics exist, and the entire reason a system that cannot be trusted is worse than none.

This reframes what good tuning is for. You are not tuning to make a number on a dashboard look better. You are tuning to produce a stream of alerts an operator will still believe at hour seven of a night shift. A system that throws 1,200 alarms a day teaches the operator to ignore it by lunchtime; a system that throws 12 alarms a day, almost all of them real, keeps the operator responsive. The target is not zero false alarms — that is impossible — but few enough that trust survives.

The levers: how a real operator cuts false alarms

So how do you actually get from 1,200 noisy alarms to 12 trustworthy ones without going blind to real events? Not by yanking the master threshold up until the noise stops — that throws away recall and starts hiding real intrusions. You get there by stacking several filters, each of which removes a category of false alarm while costing you almost no real detections. Think of it as a cascade: each stage removes noise the previous stage let through.

The starting point in most legacy systems is plain motion detection (often called video motion detection, or VMD), which fires whenever enough pixels change. It is the noisiest analytic in existence: rain, snow, wind in foliage, shifting shadows, headlights, insects on the lens, and a flag on a pole all "move." A site of 40 cameras on raw motion can easily generate well over a thousand nuisance alarms a day. The levers below turn that into something a person can actually watch.

The first and biggest lever is object classification. Instead of alarming on motion, alarm only when the AI recognizes a person or a vehicle and ignore everything else. This alone is what the industry means when it reports AI cutting false alarms by up to around 90% versus motion — the swaying tree still moves, but it is not a person, so it is silent. The remaining levers each cut the survivors further.

A detection zone (a region of interest) restricts the analytic to the part of the scene that matters — the doorway, not the public sidewalk behind it. A direction filter alarms only on the relevant motion — someone entering through a one-way exit, not the normal outbound flow. An object-size filter, set with a quick perspective calibration so the system knows how big a person looks near versus far, discards a cat in the foreground and a distant bird. A dwell or persistence requirement — the object must be present for, say, three seconds before it counts — kills the momentary blips that cause a surprising share of false alarms. A schedule arms the rule only when it should be armed (after hours, not during the working day). And rule combination ties them together with AND logic: alarm only when a person is in this zone after hours for at least three seconds. Each condition the event must satisfy strips out another class of false alarm.
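The AND logic described above can be sketched as a single predicate. Everything here is illustrative: the `Detection` and `Rule` types are hypothetical, the zone is simplified to a rectangle in normalized image coordinates, and direction and size filters are left out for brevity. Real products configure these conditions through their own rule engines.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class Detection:
    label: str            # class reported by the object classifier
    x: float              # object anchor point, normalized 0..1
    y: float
    dwell_seconds: float  # how long the object has been tracked
    at: time              # local time of the detection


@dataclass(frozen=True)
class Rule:
    labels: frozenset
    zone: tuple           # (x_min, y_min, x_max, y_max), rectangular for simplicity
    min_dwell_seconds: float
    armed_from: time      # the schedule may wrap past midnight
    armed_until: time

    def is_armed(self, at: time) -> bool:
        if self.armed_from <= self.armed_until:
            return self.armed_from <= at < self.armed_until
        return at >= self.armed_from or at < self.armed_until

    def should_alarm(self, d: Detection) -> bool:
        x_min, y_min, x_max, y_max = self.zone
        in_zone = x_min <= d.x <= x_max and y_min <= d.y <= y_max
        # Every condition must hold; each one removes a class of false alarm.
        return (d.label in self.labels
                and in_zone
                and d.dwell_seconds >= self.min_dwell_seconds
                and self.is_armed(d.at))


# "A person is in this zone after hours for at least three seconds."
rule = Rule(labels=frozenset({"person"}), zone=(0.3, 0.4, 0.7, 1.0),
            min_dwell_seconds=3.0, armed_from=time(20, 0), armed_until=time(6, 0))

intruder = Detection("person", 0.5, 0.8, 4.2, time(2, 15))
cat = Detection("cat", 0.5, 0.8, 4.2, time(2, 15))
daytime_staff = Detection("person", 0.5, 0.8, 4.2, time(14, 0))

print(rule.should_alarm(intruder), rule.should_alarm(cat),
      rule.should_alarm(daytime_staff))
```

Note that the confidence threshold does not appear in the rule at all: the conditions narrow what counts as an alarm without touching the detector's cutoff.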

[Image: a false-alarm reduction cascade. 1,200 raw motion alarms per day on 40 cameras fall to about 120 after AI object classification and to about 12 after detection zones, schedules, and a dwell requirement.]

Figure 5. Stack filters, do not just raise the threshold. Object classification removes the biggest category of noise; zones, schedules, direction, size, and dwell each strip out another. The same logic that cuts the count protects recall, because each filter targets noise, not real events.

Walk the arithmetic, because the cascade is the practical heart of this article. Start with 40 cameras throwing about 30 raw motion alarms each per day: 40 × 30 = 1,200 alarms a day, a number no team will read. Add object classification that removes roughly 90% of the non-person, non-vehicle noise: about 120 a day survive. Add detection zones, an after-hours schedule, and a three-second dwell requirement, which together remove roughly another 90% of what is left: about 12 a day. Twelve reviewed alerts, almost all real, is a system an operator trusts and acts on. The cascade got there without ever lowering recall in the way a brute-force threshold increase would, because each filter targets a kind of false alarm rather than the model's overall confidence. This is also exactly how the police world solved the same problem at scale: jurisdictions that require verified alarms before dispatch, using video or a second signal to confirm, have cut false dispatches by an estimated 90%, because verification is just another filter stage applied to a notoriously noisy detector.
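The cascade arithmetic is simple enough to keep as a small script and rerun with your own site's numbers. The figures below are the article's illustrative ones; the stage names and the 90% reductions are estimates, not measurements.

```python
cameras = 40
raw_alarms_per_camera_per_day = 30

# (stage name, fraction of the remaining alarms that the stage removes)
stages = [
    ("object classification", 0.90),
    ("zones + after-hours schedule + 3 s dwell", 0.90),
]

alarms_per_day = [cameras * raw_alarms_per_camera_per_day]
for _name, fraction_removed in stages:
    alarms_per_day.append(round(alarms_per_day[-1] * (1 - fraction_removed)))

print(f"raw motion: {alarms_per_day[0]} alarms/day")
for (name, _), count in zip(stages, alarms_per_day[1:]):
    print(f"after {name}: about {count} alarms/day")
```

Replacing the assumed reductions with measured ones from a pilot period turns this from an estimate into a tuning record.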

Common mistakes to avoid

Four tuning errors sink real deployments. The first is chasing zero false alarms by cranking the threshold, which silently destroys recall: you stop seeing nuisance alarms and you stop seeing intruders. The second is tuning once and walking away, when scenes change with the seasons (bare branches in winter, dense foliage in summer), with new lighting, and with new traffic patterns, so a system tuned in March is mis-tuned by July. The third is tuning a number nobody measured: if you are not tracking false alarms per camera per day and testing recall with deliberate walk-tests, you are guessing, not tuning. The fourth, and the most consequential, is treating an analytic's output as a decision rather than a cue: letting a noisy detector trigger a real-world action against a person without a human confirming it. That is both an operational and a legal mistake; the GDPR Art. 22 and EU AI Act Art. 14 entries in the References give the legal basis.
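The two measurements named in the third mistake can be tracked with very little code. This sketch assumes an operator marks each reviewed alarm as real or false and that walk-tests are logged as pass or fail; the log format, camera names, and function names are hypothetical.

```python
from collections import Counter

# Hypothetical operator-reviewed alarm log: (camera id, day, was the alarm real?).
reviewed_alarms = [
    ("cam-01", "day-1", False),
    ("cam-01", "day-1", True),
    ("cam-01", "day-2", False),
    ("cam-02", "day-1", False),
    ("cam-02", "day-2", False),
    ("cam-02", "day-2", True),
]

# One entry per staged walk-through: True if the system raised an alarm.
walk_tests = [True, True, False, True, True]


def false_alarms_by_camera(alarms):
    """Count false alarms per camera, to find the cameras that need retuning."""
    counts = Counter()
    for camera, _day, was_real in alarms:
        if not was_real:
            counts[camera] += 1
    return counts


def false_alarms_per_camera_per_day(alarms, camera_count, day_count):
    """Fleet-wide average; camera and day counts are passed in because
    quiet cameras and quiet days never appear in the alarm log."""
    false_total = sum(1 for _camera, _day, was_real in alarms if not was_real)
    return false_total / (camera_count * day_count)


def walk_test_recall(results):
    """Share of staged intrusions that the system actually caught."""
    return sum(results) / len(results) if results else 0.0


rate = false_alarms_per_camera_per_day(reviewed_alarms, camera_count=2, day_count=2)
print(f"false alarms per camera per day: {rate:.1f}")
print(f"worst cameras: {false_alarms_by_camera(reviewed_alarms).most_common(1)}")
print(f"walk-test recall: {walk_test_recall(walk_tests):.0%}")
```

Recording both numbers before and after every tuning change is what shows whether a drop in false alarms was bought with a drop in recall.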

How much is standardized — and how much is vendor-locked

If you run a multi-vendor fleet, you need to know which parts of this tuning surface are portable and which are locked to the camera's maker — because here, as everywhere in surveillance, the ONVIF interoperability standard guarantees a baseline, not full feature parity.

The ONVIF standard does standardize the configuration of rules. Its Analytics Service Specification defines operations to create, read, modify, and delete analytics rules (CreateRules, GetRules, ModifyRules, DeleteRules), and its Annex A defines a set of normative rule types — line detectors, field (zone) detectors, loitering detectors, and counting — that a conformant device exposes and a Video Management System (the software platform that records and manages the cameras, called a VMS) can configure over ONVIF Profile M, the metadata-and-analytics profile. So drawing a zone, placing a tripwire, and reading the resulting events are, in principle, portable across conformant devices. The deep treatment of that interface is in events, metadata, and the ONVIF analytics interface.

But the dial itself is not standardized. The actual confidence threshold, the sensitivity curves, the scene-calibration model, and the quality of the underlying detector are implementation-specific — they live in the vendor's firmware and SDK, not in the ONVIF interface. Two ONVIF-conformant cameras can expose the same rule types and produce wildly different false-alarm rates, because conformance is about the interface, not the accuracy. The clean way to hold it: ONVIF standardizes how you describe a rule; the vendor decides how well the rule actually works. For features beyond the ONVIF baseline you reach into the vendor SDK, the subject of proprietary camera SDKs beyond ONVIF. For the commercial overview of how the ONVIF profiles fit a security system, see Fora Soft's guide to ONVIF profiles in security systems.

The tuning levers at a glance

The table is the operator's toolbox in one view. Each lever removes a different class of false alarm; the right tuning stacks several rather than relying on the master threshold alone.

Lever	What it removes	Recall risk if overused	Standardized in ONVIF?
Confidence threshold	Weak, low-certainty detections	High — the bluntest cut; hides real events	No — vendor-specific value
Object classification	Non-person / non-vehicle motion (trees, animals)	Low — targets a clear noise class	Object type via Profile M metadata
Detection zone (ROI)	Activity outside the area of interest	Low — if the zone is drawn correctly	Yes — field detector (Annex A)
Direction / line rule	Motion in the irrelevant direction	Medium — a wrong line misses crossings	Yes — line detector (Annex A)
Object-size + calibration	Too-small / too-large objects (cat, bird)	Low–medium — needs correct perspective	Partly — vendor calibration
Dwell / persistence	Momentary blips and flickers	Low — a few seconds rarely hides intent	Partly — loitering / rule param
Schedule (arming)	Alarms during permitted hours	Low — if the schedule matches policy	Configured in the VMS
Rule combination (AND)	Events that miss any condition	Medium — over-constraining hides events	Composed in the VMS

Where Fora Soft fits in

Fora Soft has built video streaming, real-time communication, and computer-vision software since 2005, across 625+ delivered projects for 400+ clients, with surveillance and computer vision at the center of that work. On tuning, our stance is the one this article argues: we set the operating point from the cost of a miss versus a false alarm for each specific scene; we stack classification, zones, calibration, dwell, and scheduling rather than leaning on the raw threshold; and we measure false alarms per camera per day and recall with deliberate walk-tests before and after. We treat rare-event and biometric analytics as triage that cues a human, never as an autonomous trigger, and we build the human-in-the-loop and the audit trail in from the start, so the control room still trusts the alerts at the end of a long shift and the system holds up under review.

Call to action

Talk to a surveillance engineer: book a 30-minute scoping call to talk through your plan for tuning analytics and cutting false alarms.
See our case studies — 250+ shipped projects across video streaming, WebRTC, OTT, telemedicine, e-learning, surveillance, and AR/VR.
Download the Analytics Tuning / False-Alarm Checklist — One-Page Reference — How to tune deployed video analytics so the alerts stay trustworthy, on one page: the vocabulary (false positive vs false negative; precision = TP/(TP+FP), recall = TP/(TP+FN); accuracy is a range, never 100%); the confidence-threshold….

References

ONVIF Analytics Service Specification, Ver. 25.12 — ONVIF. (Tier 1.) Defines the standardized analytics-rule configuration interface — CreateRules, GetRules, ModifyRules, DeleteRules — and, in Annex A, the normative rule types (line detector, field/zone detector, loitering detector, and counting) that a conformant device exposes. The basis for "ONVIF standardizes how you describe a rule (zone, tripwire, loitering) while the detection sensitivity, threshold values, and underlying model accuracy are implementation-specific." https://www.onvif.org/specs/srv/analytics/ONVIF-Analytics-Service-Spec.pdf
ONVIF Profile M Specification v1.0 — ONVIF. (Tier 1.) Standardizes the configuration and query of analytics rules and the filtering/streaming of the resulting metadata and events from a device into a VMS, and states that conformance is a baseline for interoperability, not a guarantee of accuracy. The basis for "rule configuration and events are portable over Profile M; the false-alarm rate is not." https://www.onvif.org/wp-content/uploads/2021/06/onvif-profile-m-specification-v1-0.pdf
GDPR — Regulation (EU) 2016/679, Art. 4(1) (personal data), Art. 22 (automated individual decision-making), Art. 35 (DPIA) — European Union (EUR-Lex). (Tier 1.) Art. 22 gives a person the right not to be subject to a solely automated decision with legal or similarly significant effects; Art. 35 requires a DPIA for systematic monitoring; Art. 4(1) makes the footage personal data. The legal basis for "an analytic must cue a human, not act on its own — keep a human in the loop, especially for a false-positive that would trigger action against a person." https://eur-lex.europa.eu/eli/reg/2016/679/oj
EU AI Act — Regulation (EU) 2024/1689, Art. 14 (human oversight) — European Union. (Tier 1.) Requires that high-risk AI systems be designed for effective human oversight, that the overseer understand the system's capabilities and limits, and that they remain aware of automation bias (over-relying on the system's output). The basis for the human-in-the-loop and automation-bias framing of tuning. https://artificialintelligenceact.eu/article/14/
Guidelines 3/2019 on processing of personal data through video devices — European Data Protection Board (EDPB). (Tier 1.) Ranks intelligent video analysis by intrusiveness and sets the DPIA and human-review expectations for systematic monitoring. The basis for "the more consequential the alarm, the more a human must confirm it." https://www.edpb.europa.eu/our-work-tools/our-documents/guidelines/guidelines-32019-processing-personal-data-through-video_en
The Base-Rate Fallacy and Its Implications for the Difficulty of Intrusion Detection — Stefan Axelsson, ACM CCS (1999/2000). (Tier 5, academic.) Shows formally that when the target event is rare, even a very low false-positive rate produces a flood of false alarms relative to true ones, so detector "accuracy" alone does not predict usable precision. The basis for the rare-event base-rate worked example. https://www.researchgate.net/publication/234791135_The_Base-Rate_Fallacy_and_the_Difficulty_of_Intrusion_Detection
The False-Positive Paradox Explains Why You Misjudge Risk — Scientific American (2021). (Tier 6, educational.) Plain-language treatment of the false-positive paradox with the worked example of a 99%-accurate face-recognition camera screening 10,000 people producing far more false positives than true matches. Orientation source for the base-rate section. https://www.scientificamerican.com/article/the-false-positive-paradox-explains-why-you-misjudge-risk/
Work exposure and vigilance decrements in closed circuit television surveillance — Applied Ergonomics (2015). (Tier 5, academic.) Peer-reviewed evidence that CCTV operators' detection performance declines measurably over a monitoring period — the vigilance decrement — confirming alert fatigue as a real operational limit, not folklore. The academic basis for the operator-attention section. https://pubmed.ncbi.nlm.nih.gov/25479991/
International Association of Chiefs of Police — verified-response / false-alarm position; Center for Problem-Oriented Policing, "False Burglar Alarms" — IACP / ASU COPS. (Tier 5, institutional.) Documents that the large majority of monitored-alarm activations historically dispatched to police are false (commonly cited in the 90–99% range), that false alarms consume major police resource, and that verified-response policies cut false dispatches by an estimated ~90%. The basis for "verification is just another filter stage, and the police world solved this exact problem with it." https://popcenter.asu.edu/content/false-burglar-alarms-0
Object detection evaluation: precision, recall, the precision-recall curve, and mAP — Label Your Data; MathWorks object-detection metrics. (Tier 6, educational.) Describes how a detector's confidence threshold is swept to trade precision against recall, how the precision-recall curve visualizes the trade-off, and how mAP summarizes it — the basis for the threshold-dial and precision/recall mechanics. https://labelyourdata.com/articles/object-detection-metrics
