This is engineering guidance, not legal advice. Confirm specifics with qualified counsel.

Why this matters

If you are scoping or buying a surveillance system, "AI object detection" is the feature that does the most work for the least money, and the one most often misunderstood. Get it right and your operators stop drowning in false alarms, your recorded footage becomes searchable, and every other analytic you might add later has something solid to stand on. Get it wrong (the wrong camera placement, a detector tuned for a demo instead of your loading dock at 2 a.m., or a vendor's single "accuracy" number taken at face value) and you pay for it every night in alerts nobody trusts. This article is written for the security integrator, product manager, or operations lead who needs to specify detection sensibly and talk to engineers about it, without needing to know how the underlying neural network is built. A senior video engineer should also find the treatment of accuracy and standards sound; the writing serves the non-technical reader first.

What "object detection" actually means — and how it differs from classification

Start with two words people use interchangeably and shouldn't, because the difference decides what you can build.

Image classification answers one question about a whole picture: what is this an image of? Show it a frame and it replies "this image contains a car" — one label for the entire frame, with no idea where the car is or whether there are two of them. It is the simplest form of computer vision, and on its own it is nearly useless for surveillance, because a camera scene is never "an image of one thing."

Object detection answers a harder, more useful pair of questions at once: what objects are here, and where is each one? It draws a bounding box — a rectangle — around every object of interest and attaches a class label to each box: person, car, truck, bicycle, bag, animal. So detection is really localization (the box) plus classification (the label), done for every object in the frame at the same time. The reader-friendly way to hold it: classification labels the photo; detection labels the things in the photo and tells you where they are.

That "where" is the whole point for surveillance. "There is a person in this frame" is mildly interesting; "there is a person inside the fenced zone you told me to watch, at these coordinates, at 02:14" is an alarm you can act on. Detection is what turns a camera from something that records into something that notices.
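As a minimal sketch of how "where" turns into an alarm, the Python below checks whether a detected person stands inside a watched zone. The `Detection` type, the `FENCED_ZONE` polygon, the sample boxes, and the choice of the box's bottom-center as the object's ground position are all illustrative assumptions, not any particular vendor's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    label: str         # class name, e.g. "person"
    box: tuple         # (x_min, y_min, x_max, y_max) in pixels
    confidence: float  # detector score in [0, 1]


def foot_point(det):
    """Bottom-center of the box, used here as a proxy for where the object stands."""
    x_min, _, x_max, y_max = det.box
    return ((x_min + x_max) / 2.0, y_max)


def point_in_polygon(point, polygon):
    """Ray casting: count how many polygon edges a horizontal ray from the point crosses."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge spans the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def in_zone(det, zone, wanted=("person",)):
    """True only when the class is one we watch for AND the object is inside the zone."""
    return det.label in wanted and point_in_polygon(foot_point(det), zone)


# Hypothetical zone and detections, in pixel coordinates.
FENCED_ZONE = [(100, 200), (500, 200), (500, 450), (100, 450)]
person_inside = Detection("person", (280, 150, 340, 400), 0.91)
person_outside = Detection("person", (600, 150, 660, 400), 0.88)
car_inside = Detection("car", (200, 250, 400, 420), 0.95)

for det in (person_inside, person_outside, car_inside):
    print(det.label, foot_point(det), in_zone(det, FENCED_ZONE))
```

Only the first detection raises an alarm: the second person is outside the polygon, and the car is inside it but is not a watched class. A classifier alone could not make either distinction, because it has no box to test.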

Figure 1. Classification versus detection. Classification produces one label for the entire frame. Detection localizes every object with a box and a class, which is what makes a surveillance scene actionable: you know not just that a person is present, but where, and how many.

From a box to a searchable event

A detected box on its own lives and dies in a single video frame. The value appears when the system turns that box into an event — a small, typed, timestamped record: object class, the box coordinates, a confidence score, and sometimes a simple attribute like color. That record is what makes a surveillance system alertable (drive a notification now) and searchable (find it months later). A camera produces pixels; object detection produces events; the events are the product.
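To make the event record concrete, here is a small sketch of one possible shape for it, plus a forensic-search filter over a list of such records. The field names, the `camera_id` field, the sample dates, and the in-memory list standing in for a VMS index are assumptions chosen for illustration; a real system would use its own schema and storage.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class DetectionEvent:
    camera_id: str       # hypothetical identifier for the source camera
    timestamp: str       # ISO 8601, UTC
    object_class: str    # e.g. "person", "truck"
    box: tuple           # (x_min, y_min, x_max, y_max)
    confidence: float    # detector score in [0, 1]
    color: Optional[str] = None  # optional simple attribute


def make_event(camera_id, object_class, box, confidence, when, color=None):
    """Turn one detected box into a typed, timestamped record."""
    stamp = when.astimezone(timezone.utc).isoformat()
    return DetectionEvent(camera_id, stamp, object_class, tuple(box), confidence, color)


def search(events, object_class, start, end, min_confidence=0.0):
    """Forensic search: filter the index by class, time window, and score."""
    return [
        e for e in events
        if e.object_class == object_class
        and start <= datetime.fromisoformat(e.timestamp) <= end
        and e.confidence >= min_confidence
    ]


def at(hour, minute):
    """Helper for the made-up sample day used below."""
    return datetime(2026, 1, 15, hour, minute, tzinfo=timezone.utc)


index = [
    make_event("dock-03", "person", (412, 180, 468, 330), 0.87, at(2, 14)),
    make_event("dock-03", "truck", (90, 120, 520, 400), 0.93, at(2, 40), color="white"),
    make_event("gate-01", "person", (50, 60, 95, 190), 0.41, at(14, 5)),
]

hits = search(index, "person", at(0, 0), at(6, 0), min_confidence=0.5)
print(json.dumps([asdict(e) for e in hits], indent=2))
```

The same record serves both uses named above: it can be pushed as a notification the moment it is created, and it can be filtered months later without touching the video.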

There is an industry standard for the "becomes an event" step, and it is the part this surveillance course owns rather than the AI course. ONVIF is the common language that lets cameras and software from different makers understand each other, and one ONVIF profile is built specifically for analytics. ONVIF Profile M standardizes generic object classification plus metadata for vehicle, license plate, human face, and human body, and the event interfaces for things like object counting — and it lets that metadata travel inside the video stream, through the ONVIF event service, or over MQTT, a lightweight messaging protocol common in connected-device systems (ONVIF, Profile M, spec v1.1). The detail that matters for a buyer: a Profile M product can be the camera, an on-site server, or a cloud service, and the consumer can be the VMS, a network video recorder (NVR), or a cloud app — so the same classified-object event language works no matter where the detection ran (ONVIF). Bounding boxes, a center-of-gravity point, class labels, and simple attributes are serialized as ONVIF "Scene Description" data carried with the stream.

As always with ONVIF, conformance is a baseline, not a guarantee of every feature. Two Profile M products will reliably exchange standard object-classification events; a vendor's special attribute — a particular sub-class, a proprietary confidence model — may still need that maker's own software kit. Keep "ONVIF-conformant" and "fully featured over ONVIF" separate in your head. For the standards layer beneath this, see events, metadata, and the ONVIF analytics interface; for the commercial overview of how Profile M ends multi-vendor metadata chaos, Fora Soft's Profile M explainer is the companion piece.
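For readers who want to see what a classified-object record might look like on the wire, the sketch below parses a small metadata fragment with Python's standard XML library. The fragment is hand-written and simplified; the element names, attribute names, and coordinate values reflect our reading of the ONVIF schema and are assumptions to verify against the Profile M specification and a real device before relying on them.

```python
import xml.etree.ElementTree as ET

# Namespace prefix used by the sample below (assumed to match the ONVIF schema).
NS = {"tt": "http://www.onvif.org/ver10/schema"}

# Illustrative, hand-written fragment. Not captured from a real camera.
SAMPLE = """<tt:MetadataStream xmlns:tt="http://www.onvif.org/ver10/schema">
  <tt:VideoAnalytics>
    <tt:Frame UtcTime="2026-01-15T02:14:00Z">
      <tt:Object ObjectId="12">
        <tt:Appearance>
          <tt:Shape>
            <tt:BoundingBox left="0.10" top="0.40" right="0.25" bottom="-0.30"/>
            <tt:CenterOfGravity x="0.175" y="0.05"/>
          </tt:Shape>
          <tt:Class>
            <tt:Type Likelihood="0.87">Human</tt:Type>
          </tt:Class>
        </tt:Appearance>
      </tt:Object>
    </tt:Frame>
  </tt:VideoAnalytics>
</tt:MetadataStream>"""


def parse_objects(xml_text):
    """Extract time, object id, class, likelihood, and box for each object found."""
    root = ET.fromstring(xml_text)
    results = []
    for frame in root.iterfind(".//tt:Frame", NS):
        for obj in frame.iterfind("tt:Object", NS):
            box = obj.find(".//tt:BoundingBox", NS)
            cls = obj.find(".//tt:Class/tt:Type", NS)
            if box is None or cls is None:
                continue  # skip objects without a box or a class
            results.append({
                "utc_time": frame.get("UtcTime"),
                "object_id": obj.get("ObjectId"),
                "class": (cls.text or "").strip(),
                "likelihood": float(cls.get("Likelihood", "nan")),
                "box": {k: float(box.get(k)) for k in ("left", "top", "right", "bottom")},
            })
    return results


objects = parse_objects(SAMPLE)
print(objects)
```

The point of the exercise is that a consumer needs no vendor kit to read the standard fields. Anything outside them, such as a proprietary sub-class, would not appear in a parser written this way.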

Figure 2. The detection pipeline. Pixels enter on the left; the detector produces a box plus a class; that becomes a typed ONVIF Profile M event; the VMS indexes it so it can raise a live alert now and answer a forensic search later. The box is intermediate; the event is the deliverable.

The accuracy reality: a range, never a number

Here is the rule the whole section is built on, applied to detection: accuracy is always a range tied to scene, lighting, angle, and tuning, and it is never "100%." Two ideas make that concrete.

First, the metrics. Detection quality is reported with three linked numbers. Precision is the share of the detector's alerts that are real — if it flags 100 "people" and 90 are truly people, precision is 90%. Recall is the share of real objects it actually catches — if 100 people walk through and it finds 75, recall is 75%. The two trade off: turn sensitivity up and recall rises but precision falls (more false flags); turn it down and the reverse. To score detectors on one number, researchers use mean average precision (mAP), which rolls precision and recall across confidence thresholds and object classes into a single figure, judged by how well each predicted box overlaps the true box — a measure called Intersection over Union (IoU) (Roboflow; Encord).
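The three ideas above fit in a few lines of Python. This is a sketch for intuition rather than a full mAP implementation: it computes IoU for a pair of boxes, precision and recall from raw counts, and then shows the trade-off by moving a confidence threshold over a tiny, made-up list of scored detections.

```python
def iou(a, b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision(true_pos, false_pos):
    """Share of the detector's alerts that are real."""
    flagged = true_pos + false_pos
    return true_pos / flagged if flagged else 0.0


def recall(true_pos, false_neg):
    """Share of real objects the detector actually catches."""
    real = true_pos + false_neg
    return true_pos / real if real else 0.0


def precision_recall_at(scored, total_real, threshold):
    """scored: (confidence, is_real) per detector output; keep those at or above threshold."""
    kept = [is_real for conf, is_real in scored if conf >= threshold]
    true_pos = sum(kept)
    false_pos = len(kept) - true_pos
    return precision(true_pos, false_pos), recall(true_pos, total_real - true_pos)


# The two worked examples from the text.
print(precision(90, 10))  # 90 of 100 flags are real people
print(recall(75, 25))     # 75 of 100 real people are found

# Two equal boxes offset by half their width overlap with IoU of one third.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))

# Made-up scored detections; 5 real objects were actually in the scene.
SCORED = [(0.95, True), (0.90, True), (0.80, True), (0.70, False),
          (0.60, True), (0.40, False), (0.30, False)]
for threshold in (0.75, 0.25):
    print(threshold, precision_recall_at(SCORED, total_real=5, threshold=threshold))
```

With the strict threshold every kept detection is real but two of the five objects are missed; with the loose threshold one more object is caught at the cost of three false flags. That is the sensitivity dial described above, reduced to seven data points.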

Second, the numbers themselves, and why a low-looking score is not bad news. On the standard academic benchmark — the COCO dataset, scored at the strict mAP@0.5:0.95 setting that demands tight boxes across 80 object classes and many object sizes — current real-time detectors land roughly in the 38–55% band. As concrete 2026 examples, the compact YOLOv12-N scores about 40.6% mAP at 1.6 ms per image on a test GPU, while the largest YOLOv12-X reaches about 55.2%, and transformer-based detectors such as RT-DETR/RF-DETR sit on the same frontier (YOLOv12, arXiv 2502.12524; Roboflow). That sounds low only because the benchmark is deliberately punishing. In a fixed surveillance scene, watching the few classes you care about (person, vehicle) under known lighting, a tuned detector reaches far higher operational precision and recall — published surveillance work reports mAP@0.5 around 96% for medium objects and 85% for small ones after fine-tuning on scene footage (ScienceDirect, 2025). The benchmark measures a hard general problem; your camera solves an easier specific one.

The honest takeaway for a buyer: never accept a single "99% accurate." Ask for precision and recall in your conditions — your camera height, your lighting, the object sizes you care about — and ask what happens to small or distant objects, which are where detectors struggle most.

Figure 3. Detection accuracy is a range. The strict COCO mAP@0.5:0.95 band (~38–55%) is the hard general benchmark; operational precision and recall in a fixed, tuned scene run much higher. No setting reaches 100%: accuracy moves with scene, lighting, angle, and object size.

Why detection beats motion detection: the false-alarm math

The single biggest practical reason to care about object detection is what it does to false alarms. The technology it replaces, classic motion detection, works by comparing pixels between frames and firing whenever enough of them change. It cannot tell a person from a swaying branch, a cat, a cloud shadow, or a car's headlight sweeping a wall — so it fires on all of them. Traditional motion-triggered systems are widely reported to produce false-alarm rates above 90% (industry analyses, 2026). An operator who stops trusting alerts is an operator who misses the real one; this is the core failure mode of surveillance.

Object detection fixes this by filtering on class. The rule changes from "alert when pixels change" to "alert when a person (or a vehicle) appears here" — and shadows, leaves, and weather simply are not in those classes. Vendors and integrators report that moving from motion to AI object detection removes on the order of 80–95% of nuisance alarms (multiple 2026 sources). That is the difference between an alert stream a human can watch and one they tune out.
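The change of rule can be shown side by side. In the sketch below, each candidate carries the fraction of pixels that changed and whatever the detector labeled in that frame; the scene names, fractions, scores, and thresholds are invented for illustration only.

```python
ALERT_CLASSES = {"person", "vehicle"}


def motion_alert(changed_fraction, threshold=0.02):
    """Classic rule: fire whenever enough pixels change between frames."""
    return changed_fraction >= threshold


def class_alert(detections, min_confidence=0.5):
    """Class rule: fire only if a watched class was detected with enough confidence."""
    return any(
        label in ALERT_CLASSES and score >= min_confidence
        for label, score in detections
    )


# (what actually happened, fraction of pixels changed, detector output)
CANDIDATES = [
    ("swaying branch", 0.08, []),
    ("headlight sweep", 0.30, []),
    ("cat", 0.03, [("animal", 0.82)]),
    ("intruder", 0.05, [("person", 0.90)]),
]

motion_hits = [name for name, frac, _ in CANDIDATES if motion_alert(frac)]
class_hits = [name for name, _, dets in CANDIDATES if class_alert(dets)]

print("motion rule fires on:", motion_hits)
print("class rule fires on: ", class_hits)
```

The motion rule fires on all four candidates; the class rule fires only on the intruder. Note that the cat is detected but ignored, because "animal" is not in the set of classes the operator asked to be alerted on.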

But — and this is the number that decides whether a deployment is usable — even a very good detector can drown operators if the volume is high enough. A little arithmetic shows why. Take a 30-camera site that produces, across all cameras, about 100,000 candidate detections a day (a busy mixed indoor/outdoor site easily does). Suppose the detector runs at a 1% false-positive rate — the kind of figure a vendor would call "99% accurate":

false alarms per day = 1% × 100,000 candidates = 1,000 false alarms.

If the number of real events worth acting on is, say, 10 a day, operators see roughly 1,010 alerts to find 10 that matter — so the precision they actually experience is:

precision ≈ 10 real ÷ 1,010 total ≈ 1%.

A "99% accurate" detector delivered a 1%-useful alert stream. This is the base-rate problem, and it is exactly why this section refuses the phrase "100% accuracy": at scale, what decides usefulness is not how often the detector is right on one frame, but how many false alarms survive to reach a human. The fix is the layered design detection makes possible — a light, cheap detector filters the candidates so that heavier analytics and your operators only ever see the few that matter. Getting that dial right is the subject of the block's honest capstone, tuning analytics: false alarms, accuracy, and the operator's reality.
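The arithmetic above is easy to rerun for your own site. The sketch below reproduces the 30-camera example and then inverts it to ask how low the false-positive rate would need to be for a chosen share of alerts to be real. Like the worked example, it assumes every real event is detected; the function names are ours.

```python
def alert_stream(candidates_per_day, false_positive_rate, real_events_per_day):
    """Return (false alarms, total alerts, precision the operator experiences)."""
    false_alarms = candidates_per_day * false_positive_rate
    total = false_alarms + real_events_per_day
    experienced = real_events_per_day / total if total else 0.0
    return false_alarms, total, experienced


def required_false_positive_rate(candidates_per_day, real_events_per_day, target_precision):
    """False-positive rate at which target_precision of all alerts are real."""
    allowed_false_alarms = real_events_per_day * (1 - target_precision) / target_precision
    return allowed_false_alarms / candidates_per_day


false_alarms, total, experienced = alert_stream(
    candidates_per_day=100_000, false_positive_rate=0.01, real_events_per_day=10
)
print(f"false alarms/day: {false_alarms:.0f}")
print(f"alerts/day:       {total:.0f}")
print(f"precision seen:   {experienced:.2%}")

needed = required_false_positive_rate(100_000, 10, target_precision=0.5)
print(f"FP rate for half of alerts to be real: {needed:.4%}")
```

For half of the alerts at this site to be real, the false-positive rate reaching operators would have to fall from 1% to about 0.01%, a hundredfold reduction. That gap is what the layered design and scene tuning have to close.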

Figure 4. Why class filtering wins. Motion detection reacts to any pixel change: shadows, foliage, animals, headlights. Object detection reacts only to the classes you asked for, removing 80–95% of nuisance alarms, though the false-alarm volume that survives still has to be tuned down to something operators trust.

Where detection runs, and what it costs in compute

A useful property of object detection is that it is light enough to run on the camera itself. Modern surveillance cameras carry a small AI chip: a neural processing unit, or NPU, rated in trillions of operations per second (TOPS). Mainstream camera silicon from vendors like Ambarella, HiSilicon, Novatek, and Rockchip delivers roughly 1–6 TOPS, which is enough to run a quantized (slimmed-down) detector at 15–30 frames per second on the camera (industry analysis, 2026). As a rough hierarchy, detecting a human body is the lightest task, vehicle classification wants around 2 TOPS or more, and richer behavioral analytics want 3–4 TOPS and up; higher-end edge chips now reach 20–60+ TOPS for multi-stream or multi-analytic work.

Running detection on the camera — the edge tier — has three consequences a buyer should weigh. It keeps latency low (no round trip to a server), it slashes bandwidth (the camera sends a tiny event, not a full video stream, when nothing is happening), and it keeps raw video on-device, which is friendlier to privacy. The trade is that an on-camera model is smaller and so a little less accurate than a heavy server-side model, and you are limited to the analytics the camera's chip can run. Where analytics run — camera, on-site server, or cloud — is its own decision with its own latency, bandwidth, and privacy profile, and it is covered in depth in edge vs cloud video analytics and on-camera (edge) AI. This article owns what detection does and how it surfaces; those own where.

The model lives in another section — on purpose

One deliberate boundary keeps this article honest. How the detector is built — the network architecture, the training data, the accuracy-per-compute-budget tradeoffs of the YOLO family, transformer detectors, and open-vocabulary models — is the territory of our AI for Video Engineering section. The deep dives on the YOLO production lineage (v8–v12) and open-vocabulary detection own the model engineering. This surveillance article owns the application: how detection plugs into the camera, the VMS, and storage, and what it delivers in practice. The intent split is simple — that section engineers the detector; this one embeds it in a working surveillance system.

A clear line: detecting a person is not identifying one

This matters enough to state plainly. Object detection that classifies a box as "person" or "vehicle" tells you that something is there and what kind — it does not tell you who. It builds no facial template and reads no plate. On its own, plain person-or-vehicle detection is therefore a low privacy-weight analytic, and it is generally not the "biometric data" that the EU's GDPR treats as special-category data (GDPR Art. 9), nor the biometric identification restricted under the EU AI Act and Illinois' Biometric Information Privacy Act (BIPA, 740 ILCS 14).

Two analytics built on top of detection cross that line. Face recognition measures a face into a numeric template to answer "is this Person X?" and license-plate recognition reads a plate into text that links to a registered owner — both are a legal gate before they are a feature, and both get their own articles: face recognition in surveillance and the LPR deep dive, with the full law in Block 6. Even non-biometric detection can acquire privacy weight once it feeds tracking and re-identification — following one person across cameras builds a movement trail even without a name — which is the subject of object tracking and re-identification. The clean rule: detection is the foundation; the legal gate sits at the analytics that turn "a person" into "this person." This is engineering guidance, not legal advice.

Figure 5. The privacy line. Detecting a person or vehicle as a class is not biometric identification and is low privacy weight. The analytics that build on detection (face recognition, license-plate recognition) pass through a legal gate (GDPR, EU AI Act, BIPA) first; cross-camera tracking sits in between.

Detection at a glance

| | Image classification | Object detection | Object detection + tracking |
| --- | --- | --- | --- |
| Question answered | What is this an image of? | What is here, and where? | Where did this object go? |
| Output | One label per frame | A box + class per object | Boxes linked over time |
| Surveillance value | Low on its own | High (actionable per object) | Path/dwell, multi-camera |
| Surfaces in VMS as | A frame tag | Classified-object events (ONVIF Profile M) | Object tracks |
| Typical tier | Edge | Edge (camera NPU) | Edge + server |
| Privacy weight | Low | Low (a class, not an identity) | Medium (a movement trail) |

Table 1. Classification, detection, and detection-plus-tracking compared on what a buyer needs to know. Detection is the column most surveillance value rests on; tracking (article 4.3) and the biometric analytics (4.4, 4.5) build upward from it.

A common mistake to avoid

The costliest pattern we see is specifying detection by a single accuracy number from a datasheet instead of by precision and recall in the actual scene. A vendor's "99%" was measured somewhere — rarely on a camera at your mounting height, in your light, against the object sizes you care about. The close runner-up is mounting the camera for a pretty picture rather than for the detector: a camera angled for a wide human-pleasing view often gives the model tiny, oblique objects it cannot reliably classify, while a slightly lower, tighter placement transforms recall. Detection accuracy is set as much by camera placement and scene control as by the model. Demand a short on-site pilot that reports precision and recall on your footage before you commit to a platform.

Where Fora Soft fits in

Fora Soft has built real-time video, streaming, and computer-vision software since 2005, across 625+ shipped projects, and object detection is the analytic we tune most often, because it is where the gap between a demo and a deployment is widest. Teams come to us when an off-the-shelf platform's detection fires too much (operators stop trusting it) or too little (it misses the events that matter), and the fix is almost never a "better model" in the abstract — it is the right model for the camera, quantized to the edge chip, tuned to the scene, and wired into the VMS so the events are actually searchable. The framing we lead with is how the system behaves under real load first: the realistic precision and recall at your camera heights and lighting, the false-alarm rate your operators will live with, and only then the capability. A detector operators trust beats one that demos well.

References

  1. ONVIF — "Profile M" (standardizes generic object classification and metadata for vehicle, license plate, human face and human body; event interfaces for object counting, face and license-plate recognition; metadata over the stream, the ONVIF event service, or MQTT; a conformant product can be a camera, a server, or a cloud service, and a client can be a VMS, NVR, or cloud service; conformance is a baseline). Primary (tier 1). https://www.onvif.org/profiles/profile-m/
  2. ONVIF — "Profile M Specification v1.1" (the normative profile text behind the summary above; object metadata serialized as ONVIF Scene Description). Primary (tier 1). https://www.onvif.org/wp-content/uploads/2024/04/onvif-profile-m-specification-v1-1.pdf
  3. IEC — "IEC 62676 series, Video surveillance systems for use in security applications" (the international VSS standard; Part 6 covers video content analytics — performance testing and grading, the standards basis for grading detection performance rather than asserting a single accuracy; Part 4:2025 application guidelines). Primary (tier 1). https://webstore.iec.ch/en/publication/83425
  4. European Union — "GDPR, Regulation (EU) 2016/679, Art. 9" (biometric data processed to uniquely identify a person is special-category data; the basis for the line that classifying a 'person' is not biometric identification, while face recognition and LPR are). Primary (tier 1). https://eur-lex.europa.eu/eli/reg/2016/679/oj
  5. YOLOv12 authors — "YOLOv12: Attention-Centric Real-Time Object Detectors" (arXiv 2502.12524) (YOLOv12-N ≈ 40.6% mAP at ~1.6 ms on a T4 GPU; YOLOv12-X ≈ 55.2% mAP — concrete 2026 anchors for the 38–55% strict-COCO band). First-party engineering / research (tier 3). https://arxiv.org/abs/2502.12524
  6. Roboflow — "Best object detection models 2026 / mAP explained" (real-time detectors including YOLOv12 and RF-DETR/RT-DETR on the same accuracy-speed frontier; mAP rolls precision and recall across thresholds and classes via IoU). First-party engineering / educational (tier 3/6). https://blog.roboflow.com/best-object-detection-models/
  7. ScienceDirect — "Object detection in real-time video surveillance using attention-based transformer-YOLOv8" (fine-tuned detectors reach mAP@0.5 ≈ 96% for medium and ≈ 85% for small objects on scene footage — the gap between strict COCO and a tuned fixed scene). Peer-reviewed research (tier 3/5). https://www.sciencedirect.com/science/article/pii/S1110016825000468
  8. Uniview / Wisenet-class vendor analyses and integrator reports, 2026 — "AI object detection vs motion detection" (traditional motion detection produces 90%+ false-alarm rates; moving to AI object/person detection removes ~80–95% of nuisance alarms by filtering on object class). Vendor engineering / institutional (tier 4/5). https://wiznet.ae/cctv-motion-detection-vs-ai-detection/
  9. AIMultiple — "Edge AI chips 2026" (mainstream camera NPUs deliver ~1–6 TOPS, enough to run quantized detectors at 15–30 FPS on-camera; human-body detection lightest, vehicle classification ≈2 TOPS, behavior analytics 3–4 TOPS; high-end edge chips 20–60+ TOPS). Institutional/analyst (tier 5). https://research.aimultiple.com/edge-ai-chips/
  10. Encord — "Mean Average Precision (mAP) in object detection: a comprehensive guide" (precision, recall, IoU, AP, and the COCO mAP@0.5:0.95 convention — the metric definitions used throughout this article). Educational (tier 6). https://encord.com/blog/mean-average-precision-object-detection/