Object tracking follows a detected object from frame to frame, giving it a persistent identity so the system knows that the person in this frame is the same one as a moment ago. Where detection sees each frame fresh, tracking adds memory: it stitches a sequence of detections into a continuous track with a path, a duration, and an ID. This is what lets analytics reason about movement — direction, speed, dwell — rather than isolated snapshots.
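As a rough illustration of what a track carries, the sketch below stores a track as an ID plus a timestamped path and derives duration, displacement, and average speed from it. The `Track` class, its field names, and the coordinate units are assumptions made for this example; they do not come from any particular product or standard.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Track:
    """Hypothetical track record: a persistent ID plus a timestamped path."""
    track_id: int
    # Each point is (timestamp_seconds, x, y) for the object's centre.
    points: list = field(default_factory=list)

    def add(self, t, x, y):
        """Append one observation of the object."""
        self.points.append((t, x, y))

    def duration(self):
        """Seconds between the first and last observation (dwell in view)."""
        if len(self.points) < 2:
            return 0.0
        return self.points[-1][0] - self.points[0][0]

    def displacement(self):
        """Net movement (dx, dy) from first to last point, i.e. direction."""
        if len(self.points) < 2:
            return (0.0, 0.0)
        (_, x0, y0), (_, x1, y1) = self.points[0], self.points[-1]
        return (x1 - x0, y1 - y0)

    def average_speed(self):
        """Path length divided by duration, in coordinate units per second."""
        d = self.duration()
        if d <= 0:
            return 0.0
        length = sum(
            math.hypot(b[1] - a[1], b[2] - a[2])
            for a, b in zip(self.points, self.points[1:])
        )
        return length / d


track = Track(track_id=7)
track.add(0.0, 0.0, 0.0)
track.add(1.0, 3.0, 4.0)
track.add(2.0, 6.0, 8.0)
print(track.duration(), track.displacement(), track.average_speed())
```

None of these values can be computed from a single detection; each needs at least two observations known to belong to the same object.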

Tracking is the layer most behavioural rules sit on. "Crossed this line", "loitered for two minutes", and "moved against the flow" all require knowing that one object did the moving, and the track is what establishes that. The common approach is tracking-by-detection: a detector finds objects in each frame and the tracker links them across frames. This typically runs on the camera or a local server, and the track is surfaced via ONVIF Profile M with a per-object ID.
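A minimal sketch of the linking step in tracking-by-detection might look like the following. It assumes the detector supplies boxes as `(x1, y1, x2, y2)` tuples and links them by greedy overlap (intersection over union) matching. The class name `GreedyIouTracker` and the `iou_threshold` and `max_missed` parameters are hypothetical; production trackers usually add motion prediction and appearance features, which are omitted here.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


class GreedyIouTracker:
    """Hypothetical tracker: links each frame's boxes to existing tracks."""

    def __init__(self, iou_threshold=0.3, max_missed=5):
        self.iou_threshold = iou_threshold  # minimum overlap to link a box
        self.max_missed = max_missed        # frames a track may go unseen
        self.next_id = 1
        self.tracks = {}  # track_id -> {"box": box, "missed": int}

    def update(self, detections):
        """Process one frame of boxes; return a list of (track_id, box)."""
        # Score every track/detection pair, best overlap first.
        pairs = sorted(
            ((iou(t["box"], d), tid, j)
             for tid, t in self.tracks.items()
             for j, d in enumerate(detections)),
            reverse=True,
        )
        matched_tracks, matched_dets, results = set(), set(), []
        for score, tid, j in pairs:
            if score < self.iou_threshold:
                break  # remaining pairs overlap too little to link
            if tid in matched_tracks or j in matched_dets:
                continue
            self.tracks[tid] = {"box": detections[j], "missed": 0}
            matched_tracks.add(tid)
            matched_dets.add(j)
            results.append((tid, detections[j]))
        # Tracks with no detection this frame age, then expire.
        for tid in list(self.tracks):
            if tid not in matched_tracks:
                self.tracks[tid]["missed"] += 1
                if self.tracks[tid]["missed"] > self.max_missed:
                    del self.tracks[tid]
        # Detections with no track start a new identity.
        for j, d in enumerate(detections):
            if j not in matched_dets:
                self.tracks[self.next_id] = {"box": d, "missed": 0}
                results.append((self.next_id, d))
                self.next_id += 1
        return results


tracker = GreedyIouTracker()
print(tracker.update([(0, 0, 10, 10), (100, 100, 110, 110)]))
print(tracker.update([(101, 100, 111, 110), (1, 0, 11, 10)]))
```

In the second frame the boxes arrive in a different order but keep their IDs, because each is matched to the track it overlaps most. If an object goes undetected for more than `max_missed` frames, its track is dropped and the object receives a new ID when it reappears.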

The pitfall is expecting tracks to be unbroken. Occlusion (a pillar, a crowd), objects that look alike, and fast motion cause identity switches and broken tracks, so single-camera tracking accuracy, while good (often around 80% on standard measures), is not perfect. The per-object ID is also local to that camera: by itself it does not mean the same identity on the next camera, which is the harder problem of re-identification. The tracking model internals are covered in the AI for Video Engineering section.
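To make the effect of identity switches on an accuracy figure concrete, the sketch below counts switches and folds them into MOTA (multiple object tracking accuracy), one widely used measure. Whether the figure quoted above refers to MOTA specifically is an assumption, and the function names, input format, and error counts are invented for illustration.

```python
def count_id_switches(assignments):
    """Count how often a real object's tracker ID changes.

    `assignments` is a list of per-frame dicts mapping a ground-truth
    object label to the tracker ID it was given in that frame.
    """
    last_seen = {}
    switches = 0
    for frame in assignments:
        for obj, tid in frame.items():
            if obj in last_seen and last_seen[obj] != tid:
                switches += 1
            last_seen[obj] = tid
    return switches


def mota(false_negatives, false_positives, id_switches, ground_truth_objects):
    """MOTA = 1 - (FN + FP + ID switches) / total ground-truth objects."""
    if ground_truth_objects <= 0:
        raise ValueError("ground_truth_objects must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / ground_truth_objects


# Object "b" is hidden in the middle frame and returns with a new ID.
frames = [{"a": 1, "b": 2}, {"a": 1}, {"a": 1, "b": 3}]
print(count_id_switches(frames))

# Illustrative counts only, not measurements.
print(mota(false_negatives=120, false_positives=60,
           id_switches=20, ground_truth_objects=1000))
```

With these made-up counts the score comes out at about 0.8: missed objects, false alarms, and identity switches together account for a fifth of all ground-truth object instances.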