Cross-Camera Object Tracking and Re-Identification · Video Surveillance & VMS

This is engineering guidance, not legal advice. Confirm specifics with qualified counsel.

Why this matters

The moment a buyer asks "can it follow someone from the parking lot to the loading dock?", they have left object detection behind and walked into tracking and re-identification — and the gap between the demo and the deployment is wider here than almost anywhere else in surveillance. Single-camera tracking is mature and runs on the camera itself; cross-camera re-identification is a research-grade problem that degrades sharply with distance, lighting, and time, and a vendor who quotes you a single accuracy number is hiding that. This article is written for the security integrator, retail-operations lead, or product manager who needs to specify cross-camera tracking sensibly, understand what it can and cannot promise, and — critically — recognize that a movement trail is personal data even when no name is attached. A senior video engineer will find the metrics and standards framing accurate; the writing serves the non-technical reader first.

From detection to a track: what tracking actually adds

Start with what comes before tracking, because tracking is built directly on top of it. As covered in object detection and classification in surveillance, a detector looks at a single frame and draws a labelled box around each object: "person here, car there." But a detector has no memory. Run it on the 30 frames in one second of video and it produces 30 independent sets of boxes, with no idea that the person in frame 2 is the same person as in frame 1. To the raw detector, every frame is a fresh crowd of strangers.

Object tracking adds the memory. It takes the per-frame boxes and links them over time into tracks — continuous trajectories, each with a stable track ID. Track #7 is "this particular person," frame after frame, until they leave the scene. The dominant method is called tracking-by-detection: detect objects in each frame, then associate each new box with an existing track (the same object, moved a little) or start a new track for anything genuinely new. Detection answers "what is here?"; tracking answers "and is it the same thing I saw a moment ago?"

The association step is cheaper and more mechanical than people expect. Two classic tools do most of the work. A Kalman filter predicts where each tracked object should appear in the next frame based on its recent motion — a person walking left keeps moving left — so the system knows roughly where to look. The Hungarian algorithm then matches the new detections to those predictions at lowest total "cost" (closest box, most similar appearance). Modern trackers such as ByteTrack and BoT-SORT refine this: ByteTrack's insight is to associate every detection box, including low-confidence ones, which recovers objects during brief occlusion; BoT-SORT adds camera-motion compensation so tracking survives a camera that pans or shakes, plus an appearance model for harder matches (ByteTrack, arXiv 2110.06864; BoT-SORT, arXiv 2206.14651).
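
The predict-then-match loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not how ByteTrack or BoT-SORT are implemented: a constant-velocity shift stands in for the Kalman filter's prediction step, the cost is simply 1 minus box overlap (IoU), and a brute-force search over assignments stands in for the Hungarian algorithm (it finds the same lowest-cost matching, but only scales to a handful of objects). The function names and the `min_iou` threshold are hypothetical.

```python
from itertools import permutations

def predict(box, velocity):
    """Constant-velocity stand-in for a Kalman prediction: shift the box by its last motion."""
    x1, y1, x2, y2 = box
    dx, dy = velocity
    return (x1 + dx, y1 + dy, x2 + dx, y2 + dy)

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def associate(tracks, detections, min_iou=0.3):
    """tracks: {track_id: (box, velocity)}; detections: list of boxes.
    Returns ({track_id: detection_index}, [indices of detections that start new tracks])."""
    track_ids = list(tracks)
    predicted = [predict(*tracks[t]) for t in track_ids]
    cost = [[1.0 - iou(p, d) for d in detections] for p in predicted]

    # Brute-force stand-in for the Hungarian algorithm: try every assignment,
    # where None means "this track gets no detection in this frame".
    slots = list(range(len(detections))) + [None] * len(track_ids)

    def total(assignment):
        return sum(1.0 if d is None else cost[i][d] for i, d in enumerate(assignment))

    best = min(permutations(slots, len(track_ids)), key=total)

    matches = {}
    for i, d in enumerate(best):
        if d is not None and 1.0 - cost[i][d] >= min_iou:
            matches[track_ids[i]] = d
    matched = set(matches.values())
    unmatched = [d for d in range(len(detections)) if d not in matched]
    return matches, unmatched

# Two people walking toward each other, plus one newcomer.
tracks = {7: ((100, 100, 150, 200), (5, 0)), 8: ((300, 100, 350, 200), (-5, 0))}
detections = [(296, 101, 346, 201), (104, 99, 154, 199), (500, 50, 540, 150)]
matches, new_tracks = associate(tracks, detections)
print(matches, new_tracks)
```

Here track 7 is matched to the second detection, track 8 to the first, and the third detection is left over to start a new track. A production tracker would replace the brute-force search with a real assignment solver and add an appearance term to the cost.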

Figure 1. Tracking adds memory to detection. The detector emits independent boxes every frame; the tracker associates them across frames, predicting motion with a Kalman filter and matching with the Hungarian algorithm, into continuous tracks, each with a stable ID. The box is per-frame; the track is the object's path through time.

The easy case: tracking within one camera's view

Tracking one object inside one camera's field of view is the mature, solved-enough part of this story. The objects stay roughly the same size, the lighting is consistent, and the gaps between frames are tiny, so the motion prediction is usually right. This single-camera tracking is light enough to run in real time on the camera's own AI chip or a small edge box, and it is what powers a long list of everyday analytics: counting how many people crossed a line and in which direction, measuring how long someone lingered in an aisle (dwell time), detecting loitering, and feeding the rule-based behavioral analytics — loitering, intrusion, and zones. None of those work without a stable track; you cannot measure dwell time if every frame is a new stranger.
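
To make the dependence on a stable track concrete, here is a small sketch of two of those analytics computed from a single track. The track format (a time-ordered list of `(seconds, x, y)` samples), the rectangular zone, and the vertical counting line are all assumptions chosen for illustration; real products define zones and lines in their own configuration formats.

```python
def dwell_seconds(track, zone):
    """Time spent inside a rectangular zone (x1, y1, x2, y2).
    Counts an interval only when both of its end samples are inside the zone."""
    x1, y1, x2, y2 = zone

    def inside(sample):
        return x1 <= sample[1] <= x2 and y1 <= sample[2] <= y2

    total = 0.0
    for prev, cur in zip(track, track[1:]):
        if inside(prev) and inside(cur):
            total += cur[0] - prev[0]
    return total

def line_crossings(track, line_x):
    """Crossings of the vertical line x = line_x as (left_to_right, right_to_left)."""
    left_to_right = right_to_left = 0
    for prev, cur in zip(track, track[1:]):
        if prev[1] < line_x <= cur[1]:
            left_to_right += 1
        elif prev[1] >= line_x > cur[1]:
            right_to_left += 1
    return left_to_right, right_to_left

# One person walking left to right through an aisle zone.
track = [(0.0, 10, 50), (1.0, 30, 50), (2.0, 55, 50), (3.0, 60, 50), (4.0, 90, 50)]
zone = (40, 0, 80, 100)

print(dwell_seconds(track, zone))     # 1.0
print(line_crossings(track, 50))      # (1, 0)

# If the tracker loses the person mid-zone and restarts with a new ID,
# neither fragment contains a full in-zone interval and the dwell disappears.
fragments = [track[:3], track[3:]]
print(sum(dwell_seconds(f, zone) for f in fragments))  # 0.0
```

The last line is the point: the same positions, split across two track IDs, report no dwell at all.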

The characteristic failure of single-camera tracking has a name worth knowing: the identity switch (or "ID switch"). When two people cross paths, or one walks behind a pillar and re-emerges, the tracker can lose the thread and assign the re-emerging person a new track ID — or, worse, swap two people's IDs. A high identity-switch count is the difference between "person #7 dwelled for 40 seconds" and a garbled trail that splits one visitor into three. Occlusion — one object hiding another — is the usual cause, and it is exactly what ByteTrack's "associate every box" trick was built to reduce.

The hard case: re-identification across cameras

Now make the problem genuinely difficult. A person walks out of Camera A's view and, twenty seconds later, appears in Camera B around the corner. Camera B's tracker has never seen this person; to Camera B, they are a brand-new track. Connecting "track #7 on Camera A" to "track #3 on Camera B" — recognizing they are the same person — is re-identification, and it is a different and much harder problem than tracking within one view.

Re-ID works by turning each object into an appearance signature — a compact numeric summary (an "embedding") of what the object looks like: the colors and textures of clothing, body shape, the make and color of a vehicle. When a new person appears on Camera B, the system computes their signature and searches for the closest match among recently seen people from other cameras. A close enough match says "same person." Crucially, on its own this is appearance matching, not facial or plate recognition — it is "the person in the red jacket with the black backpack," not "John Smith."
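
The "closest match among recently seen people" step can be sketched as a nearest-neighbour search over embeddings. This is an illustrative sketch only: the four-number vectors are toy values (real embeddings have hundreds of dimensions and come from a trained network), cosine similarity is one common choice of comparison rather than the only one, and the `threshold` of 0.8 is a hypothetical value that would need tuning per site.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def best_match(query, gallery, threshold=0.8):
    """gallery: {(camera_id, local_track_id): embedding}.
    Returns the most similar gallery key, or None if nothing clears the threshold."""
    best_key, best_sim = None, threshold
    for key, embedding in gallery.items():
        sim = cosine(query, embedding)
        if sim >= best_sim:
            best_key, best_sim = key, sim
    return best_key

# Signatures recently seen on Camera A (toy values).
gallery = {
    ("cam_a", 7): [0.9, 0.1, 0.3, 0.0],   # "red jacket, black backpack"
    ("cam_a", 9): [0.1, 0.8, 0.0, 0.5],
}

same_person = [0.8, 0.2, 0.35, 0.05]      # new track on Camera B, similar look
stranger = [0.0, 0.1, 0.9, 0.4]           # new track on Camera B, different look

print(best_match(same_person, gallery))   # ('cam_a', 7)
print(best_match(stranger, gallery))      # None
```

The threshold is where the tradeoff lives: set it low and strangers get merged into existing trails; set it high and the same person is split into several.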

Why is it so hard? Because the two cameras may share no overlap at all, so there is no moment where both see the object at once to anchor the match. Between them the viewpoint flips (front view on A, back view on B), the lighting changes (daylight entrance, fluorescent corridor), the object is partly hidden, and time passes. Researchers frame cross-camera tracking — formally multi-target multi-camera tracking (MTMC) — as three stacked problems: track within each camera, re-identify across cameras, and stitch the result into one network-wide trajectory (Ristani et al., "Features for Multi-Target Multi-Camera Tracking and Re-Identification," CVPR 2018). Each layer adds error. And appearance is fragile in a specific way every buyer should hear: change the clothing and appearance re-ID breaks. Someone who removes a jacket can become a different "person" to the system.

Figure 2. Re-identification across non-overlapping cameras. Each camera tracks independently and assigns its own local ID. Re-ID compares appearance signatures to decide that Camera A's track #7 and Camera B's track #3 are the same object, stitching them into one cross-camera trajectory. The match is on appearance, not a face or a plate.

The accuracy reality: a range, and it falls off a cliff between cameras

Here is the rule the whole section is built on, applied to tracking: accuracy is a range tied to scene, lighting, viewpoint, and time — never a single number, and never 100%. Tracking has its own honest metrics, and the gap between the single-camera number and the cross-camera number is the most important thing a buyer can understand.

First, the metrics, in plain language. Single-camera tracking quality is reported with three linked scores. MOTA (Multi-Object Tracking Accuracy) mostly measures detection quality — how many objects were missed or invented. IDF1 (the Identity F1 score) measures identity quality — how consistently each object kept the same ID over its whole path, which is what you actually care about for following someone. HOTA (Higher Order Tracking Accuracy) is the modern balance of the two: the geometric mean of how well the system detects objects and how well it associates them over time (Luiten et al., HOTA, IJCV 2021). When a vendor shows tracking numbers, ask which metric — a high MOTA with a low IDF1 means "good at spotting objects, bad at keeping their identities straight."
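
For readers who want the arithmetic behind those three scores, here is a short sketch using their standard definitions. The counts in the example are made up to show the "high MOTA, low IDF1" pattern and are not results from any benchmark. Note that published HOTA is averaged over a range of localization thresholds; the function below shows the single-threshold form only.

```python
import math

def mota(false_negatives, false_positives, id_switches, ground_truth_boxes):
    """MOTA = 1 - (misses + false alarms + identity switches) / ground-truth boxes."""
    return 1 - (false_negatives + false_positives + id_switches) / ground_truth_boxes

def idf1(idtp, idfp, idfn):
    """IDF1 = 2*IDTP / (2*IDTP + IDFP + IDFN), counted on identity-consistent matches."""
    return 2 * idtp / (2 * idtp + idfp + idfn)

def hota(detection_accuracy, association_accuracy):
    """HOTA at one localization threshold: geometric mean of DetA and AssA."""
    return math.sqrt(detection_accuracy * association_accuracy)

# Hypothetical sequence: 1000 ground-truth boxes, most of them detected,
# but only 600 of them carry the right identity end to end.
print(round(mota(100, 50, 20, 1000), 3))      # 0.83
print(round(idf1(600, 350, 400), 3))          # 0.615
print(round(hota(0.80, 0.50), 3))             # 0.632
```

In this made-up case the tracker finds 90% of the boxes and scores 83% MOTA, yet IDF1 is only about 62% because identities fragment along the way.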

Second, the numbers. On the standard single-camera pedestrian benchmark MOT17, the best real-time trackers land around 80% MOTA, 77–80% IDF1, and 63–65% HOTA: for example, ByteTrack at roughly 80.3 MOTA / 77.3 IDF1 / 63.1 HOTA at 30 frames per second, and BoT-SORT slightly ahead at about 80.5 / 80.2 / 65.0 (ByteTrack, arXiv 2110.06864; BoT-SORT, arXiv 2206.14651). Re-identification as a retrieval task looks even better on its home benchmark: on the Market-1501 person re-ID dataset, strong 2025 models reach about 96% Rank-1 accuracy (the correct match is the top result about 96% of the time) and 91–93% mAP (mean average precision, which also credits how highly every correct match is ranked, not just the first). But move to the harder, more realistic MSMT17 dataset, with more cameras and more lighting variation, and the same class of model drops to roughly 86–87% Rank-1 and 70–75% mAP (recent re-ID literature, 2025). That fall, from one clean benchmark to a messier one, is the whole story in miniature.
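
Rank-1 and mAP are both computed from ranked search results, and a small sketch shows why the two can differ. The input format here is an assumption for illustration: for each query, a best-first list of booleans marking which returned gallery items are the same identity. The three example queries are invented, not benchmark data.

```python
def rank1(ranked_hits):
    """Share of queries whose top-ranked result is a correct match."""
    return sum(1 for hits in ranked_hits if hits and hits[0]) / len(ranked_hits)

def average_precision(hits):
    """Mean of the precision values measured at each correct match in one ranked list."""
    found = 0
    precisions = []
    for rank, hit in enumerate(hits, start=1):
        if hit:
            found += 1
            precisions.append(found / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(ranked_hits):
    """mAP: average precision averaged over all queries."""
    return sum(average_precision(hits) for hits in ranked_hits) / len(ranked_hits)

# Three invented queries; True marks a gallery item showing the same identity.
queries = [
    [True, False, True],    # right answer first, second sighting ranked third
    [False, True, False],   # right answer only in second place
    [True, True, False],    # both sightings ranked at the top
]

print(round(rank1(queries), 3))                    # 0.667
print(round(mean_average_precision(queries), 3))   # 0.778
```

Rank-1 only asks whether the first result is right; mAP also rewards ranking every other sighting of that identity near the top, which is closer to what an operator needs when reviewing a trail.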

Now the cliff. Stitch tracking and re-ID into full cross-camera tracking and the end-to-end identity score drops well below the single-camera figure. On the city-scale vehicle benchmark CityFlow — 40 cameras spread across intersections up to 2.5 km apart — leading multi-camera systems report cross-camera IDF1 in the 70–85% range, and a 2025 camera-only entry to the AI City Challenge reported an end-to-end HOTA around 45% (CityFlow, arXiv 1903.09254; AI City Challenge 2025). The honest takeaway: a system that tracks beautifully on one camera will lose a meaningful fraction of identities every time an object crosses between cameras, and the more cameras and the longer the gaps, the more it loses. Never accept "we track people across your whole site with 99% accuracy." Ask for the cross-camera identity score, on a layout like yours, with your camera spacing.
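
A back-of-the-envelope model shows why losses grow with the number of camera handoffs. It rests on a strong simplifying assumption that does not hold exactly in practice: that every handoff keeps the correct identity with the same probability, independently of the others. The 90% per-handoff figure is a hypothetical input, not a measured result.

```python
def identity_retention(per_handoff_retention, handoffs):
    """Chance a trail is still attached to the right identity after N handoffs,
    assuming each handoff succeeds independently with the same probability."""
    return per_handoff_retention ** handoffs

for handoffs in (1, 3, 6):
    print(handoffs, round(identity_retention(0.90, handoffs), 3))
# 1 0.9
# 3 0.729
# 6 0.531
```

Even with a handoff that works nine times out of ten, a path across six camera boundaries keeps the right identity only about half the time under this model. That is why a pilot should measure retention over realistic multi-camera paths rather than a single handoff.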

Figure 3. Accuracy falls between cameras. Single-camera tracking (MOTA/IDF1 ≈ 80%) and re-ID retrieval on a clean benchmark (Rank-1 ≈ 96%) look strong; stitched cross-camera tracking (IDF1 ≈ 70–85%, end-to-end HOTA can fall to ~45%) is much harder. Every figure is a range that moves with camera spacing, lighting, and time, and it is never 100%.

How a track surfaces in the VMS — and the ONVIF caveat that catches teams out

A track is only useful if the Video Management System (VMS), the software that ingests and records many camera streams, can read it. There is an industry standard for that, and it is the part this surveillance course owns rather than the AI course. ONVIF is the common language that lets cameras and software from different makers understand each other, and ONVIF Profile M is the profile built for analytics metadata. Profile M standardizes generic object classification and the metadata stream that carries detections, and it lets that metadata travel inside the video stream, through the ONVIF event service, or over MQTT, a lightweight messaging protocol common in connected-device systems (ONVIF, Profile M).

Tracking specifically surfaces through the object identifier in the ONVIF scene description. Each tracked object gets an ObjectId, and the standard's analytics interface even models the messy reality of tracking: it defines operations to rename an object's ID once the algorithm gathers enough evidence about its identity, to split one object into two (assigning a fresh ID when it cannot yet tell which is which), to merge, and to delete an object once it leaves the algorithm's memory (ONVIF Analytics Service Specification). In other words, the standard itself encodes the identity-switch problem we described earlier — it gives the camera a clean way to say "I thought this was one object; it's now two."
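
On the receiving side, a VMS has to apply those identity operations to whatever it has stored per object. The sketch below shows one way that bookkeeping could look. It is deliberately abstract: the tuple-shaped events are a hypothetical simplification and not the ONVIF XML schema, and storing only a first-seen timestamp per object (inherited on split, earliest kept on merge) is an assumption made for illustration.

```python
def apply_event(first_seen, event):
    """first_seen: {object_id: first-seen time in seconds} for ONE device.
    event: a simplified tuple standing in for a rename/split/merge/delete message."""
    kind = event[0]
    if kind == "rename":            # ("rename", old_id, new_id)
        _, old_id, new_id = event
        first_seen[new_id] = first_seen.pop(old_id)
    elif kind == "split":           # ("split", old_id, [new_ids])
        _, old_id, new_ids = event
        started = first_seen.pop(old_id)
        for new_id in new_ids:      # both halves inherit the shared history
            first_seen[new_id] = started
    elif kind == "merge":           # ("merge", [old_ids], new_id)
        _, old_ids, new_id = event
        first_seen[new_id] = min(first_seen.pop(old_id) for old_id in old_ids)
    elif kind == "delete":          # ("delete", object_id)
        first_seen.pop(event[1], None)
    return first_seen

state = {7: 10.0}
apply_event(state, ("rename", 7, 12))
print(state)                        # {12: 10.0}
apply_event(state, ("split", 12, [12, 13]))
print(state)                        # {12: 10.0, 13: 10.0}
apply_event(state, ("merge", [12, 13], 14))
print(state)                        # {14: 10.0}
apply_event(state, ("delete", 14))
print(state)                        # {}
```

A consumer that ignores these operations will treat a renamed object as a brand-new arrival and restart its dwell clock, which is the identity-switch problem reappearing at the integration layer.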

Here is the caveat that catches integration teams out, and it is the single most important practical point in this article: the ONVIF ObjectId is local to one device's analytics. It is not a cross-camera identity. Camera A's "object 7" and Camera B's "object 7" have nothing to do with each other — the numbers are reused independently. ONVIF standardizes how one camera reports its own tracks; it does not standardize re-identification across cameras. Stitching tracks across the camera network is the job of the VMS or a dedicated analytics layer on top, and it is where the vendor's own software and tuning live. As always with ONVIF, conformance is a baseline, not a guarantee of every feature — keep "ONVIF-conformant" and "cross-camera tracking works out of the box" firmly separate. For the standards layer beneath this, see events, metadata, and the ONVIF analytics interface; the commercial overview is Fora Soft's Profile M explainer.
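
In code, the practical consequence is that a per-device object ID should never be used as a key on its own. The sketch below keys every track by the pair (device, local ID) and keeps a separate site-wide identity that only a re-ID decision can merge. The class and method names are hypothetical and do not come from ONVIF or any particular VMS.

```python
import itertools

class GlobalIdentityMap:
    """Maps (device_id, local_object_id) pairs to site-wide identity numbers."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self._global = {}   # (device_id, local_object_id) -> global identity

    def global_id(self, device_id, local_object_id):
        """Every new (device, local ID) pair starts as its own identity."""
        key = (device_id, local_object_id)
        if key not in self._global:
            self._global[key] = next(self._next_id)
        return self._global[key]

    def link(self, key_a, key_b):
        """Record a re-ID decision: both local tracks are the same object."""
        keep = self.global_id(*key_a)
        drop = self.global_id(*key_b)
        for key, value in self._global.items():
            if value == drop:
                self._global[key] = keep
        return keep

identities = GlobalIdentityMap()

# "Object 7" on two cameras is two unrelated tracks until proven otherwise.
a7 = identities.global_id("cam_a", 7)
b7 = identities.global_id("cam_b", 7)
print(a7 == b7)   # False

# The re-ID layer decides Camera A's track 7 is Camera B's track 3.
identities.link(("cam_a", 7), ("cam_b", 3))
print(identities.global_id("cam_b", 3) == a7)   # True
print(identities.global_id("cam_b", 7) == a7)   # False
```

Keeping the link step separate from the key also makes it possible to log and review every cross-camera merge, which matters for both accuracy audits and privacy.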

Where tracking and re-ID run

The two halves of this problem sit in different places, and that shapes cost, latency, and privacy. Single-camera tracking is light and lives at the edge — on the camera's own neural processing unit (NPU) or a small on-site box — because the association math (a Kalman filter plus the Hungarian algorithm) is cheap and must keep up with the live frame rate. Cross-camera re-identification tends to run on a server or in the cloud, for a structural reason: to match an object seen on Camera A against one on Camera B, something has to see both cameras' data at once, so the re-ID step needs a vantage point above any single camera. That central matching also wants more compute to run the appearance-embedding model and search across many recent tracks.

The consequence for a buyer is a familiar tradeoff. Doing re-ID centrally means sending appearance signatures (small) or sometimes cropped images (larger) up from every camera, which costs bandwidth and concentrates the most privacy-sensitive data — a searchable index of who-was-where — in one place. Keeping more of it at the edge cuts bandwidth and keeps raw video on-device but limits how richly you can match across the network. Where analytics run — camera, on-site server, or cloud — is its own decision with its own latency, bandwidth, and privacy profile, covered in depth in edge vs cloud video analytics and latency and accuracy at each tier. This article owns what tracking and re-ID do and how they surface; those own where.

Figure 5. Where the two halves run. Single-camera tracking is light and stays at the edge; re-ID has to see every camera at once, so it runs centrally on a server or in the cloud, fed by small appearance signatures sent up from each camera. That central index of who-was-where is where bandwidth and privacy concentrate.

The model lives in another section — on purpose

One boundary keeps this article honest. How the tracker and the re-ID network are built — the appearance-embedding architectures, the metric-learning training that pulls the same identity's images together and pushes different identities apart, the association algorithms behind ByteTrack and BoT-SORT — is the territory of our AI for Video Engineering section. That section engineers the model; this surveillance article owns the application: how a track plugs into the camera, the VMS, and storage, what it delivers in practice, and what it costs in accuracy and privacy. When you need the model internals, follow the cross-links; when you need to specify and deploy tracking in a working system, stay here.

The privacy weight: a movement trail is personal data, even without a name

This is the part a buyer cannot skip, because cross-camera tracking changes the privacy character of a surveillance system even when no biometric is involved. Three points draw the line.

First: a movement trail can be personal data on its own. Europe's GDPR defines personal data as any information relating to a person who is identifiable, directly or indirectly, and its recitals explicitly count "singling out" a person as a means of identification, even without a name (GDPR, Reg. (EU) 2016/679, Art. 4(1) and Recital 26). A persistent re-ID trail ("this individual entered at 09:02, visited these four zones, left at 10:15, and returned Tuesday") singles a person out across space and time. So building cross-camera trails is processing personal data, with all the GDPR duties that follow, even if the system never learns who the person is. The absence of a name is not the absence of privacy weight.

Second: appearance re-ID is generally not biometric, but two close cousins are. Matching people by the color of their clothing is appearance matching, and the European Data Protection Board treats such "soft" traits — ones that do not by themselves uniquely identify a person — as generally outside the special-category "biometric data" rules (EDPB, Guidelines 3/2019 on processing of personal data through video devices). But the line is thin. Face recognition measures a face into a template to answer "is this Person X?", and gait recognition — identifying someone by how they walk — is a behavioral biometric; both are processing for the purpose of uniquely identifying a person, which is what triggers GDPR Art. 9's special-category protection. The instant your "tracking" leans on a face or a gait signature to hold identity across cameras, you have crossed from appearance matching into biometric identification, and the legal gate slams shut. Those are their own topics: face recognition in surveillance and the full law in GDPR for video surveillance.

Third: scale and place raise the bar. Systematic tracking of people across a publicly accessible space, on a large scale, is exactly the kind of processing for which GDPR requires a Data Protection Impact Assessment (DPIA) before you switch it on (GDPR Art. 35; EDPB Guidelines 3/2019). And in the EU, the AI Act goes further at the sharp end: real-time remote biometric identification of people in publicly accessible spaces for law-enforcement purposes (live face-matching a crowd to identify the people in it) is a prohibited practice, applicable since 2 February 2025, with only narrow exceptions under strict authorization (EU AI Act, Reg. (EU) 2024/1689, Art. 5). Appearance-based tracking is not that prohibition, but the closer your design moves to identifying named individuals in public in real time, the closer it moves to a hard legal wall. The clean engineering rule: following the red jacket is not the same as naming the person, but the trail it builds is still personal data, and adding a face or a gait turns it into biometric identification. This is engineering guidance, not legal advice.

Figure 4. The privacy gradient of tracking. A single-camera track is low weight; an appearance-based cross-camera trail is personal data (it singles a person out) at medium weight; face or gait recognition crosses the biometric legal gate (GDPR Art. 9 · EU AI Act · DPIA under Art. 35). No name does not mean no privacy weight.

Tracking and re-ID at a glance

	Single-camera tracking	Cross-camera re-identification
Question answered	Where did this object go in this view?	Is this the same object I saw on another camera?
How it links	Motion + appearance, frame to frame	Appearance signature match across views
Typical accuracy	~80% MOTA / 77–80% IDF1 (MOT17)	Cross-camera IDF1 ~70–85%; HOTA can fall to ~45%
Where it runs	Edge (camera NPU / small box)	Server or cloud (needs all cameras)
Surfaces in VMS as	ONVIF Profile M `ObjectId` (per device)	VMS / vendor layer — not standardized by ONVIF
Main failure	Identity switch at occlusion	Clothing change, viewpoint, long time gaps
Privacy weight	Low	Medium — a movement trail is personal data

Table 1. Single-camera tracking versus cross-camera re-identification on what a buyer needs to know. Accuracy drops, and both the compute tier and the privacy weight step up, when you cross from one camera to many. Add a face or a gait signature and the privacy weight steps up again, into biometric law (article 4.4).

A common mistake to avoid

The costliest pattern we see is assuming a single-camera track ID carries across cameras — wiring a system as if "object 7" means the same person everywhere — when the ONVIF ObjectId is local to one device and cross-camera identity is a separate, lossy re-ID step that has to be built and tuned. The close runner-up is specifying cross-camera tracking with a single-camera accuracy number: a vendor demo that follows one actor flawlessly around one room tells you almost nothing about identity retention across twelve cameras and a two-minute gap. The third, quieter mistake is treating an appearance trail as privacy-neutral because there is no face — a re-ID trail singles people out and is personal data, and it deserves a DPIA, retention limits, and access controls like any other personal data. Demand a pilot on a multi-camera slice of your real site, reported with a cross-camera identity metric (IDF1 or HOTA), before you believe any "follow anyone anywhere" claim.

Where Fora Soft fits in

Fora Soft has built real-time video, streaming, and computer-vision software since 2005, across 625+ shipped projects, and cross-camera tracking is one of the features where we most often reset expectations before we write code. Teams come to us expecting a single switch labelled "track people across the site"; the honest build is a layered one — solid single-camera tracking at the edge, an appearance-re-ID layer tuned to the actual camera spacing and lighting, and a VMS that stitches and stores tracks without pretending the identity score is perfect. The framing we lead with is how the system behaves under real load first: the realistic cross-camera identity retention at your camera layout, the bandwidth and storage of a central re-ID index, and the privacy posture of holding a who-was-where trail — and only then the capability. A tracker that is honest about where it loses identities beats one that demos as flawless and disappoints in the field.

Call to action

Talk to a surveillance engineer — book a 30-minute scoping call to talk through your cross-camera tracking plan.
See our case studies — 250+ shipped projects across video streaming, WebRTC, OTT, telemedicine, e-learning, surveillance, and AR/VR.
Download the Object Tracking & Re-Identification — One-Page Reference — Single-camera tracking vs cross-camera re-identification, the honest accuracy metrics (MOTA/IDF1/HOTA, the single-to-cross-camera drop, never 100%), how a track surfaces via ONVIF Profile M (the per-device ObjectId caveat), where it….

References

ONVIF — "Profile M" (standardizes generic object classification and the metadata stream; metadata for geolocation, vehicle, license plate, human face and body; event interfaces for object counting, face and license-plate recognition; metadata carried over the stream, the ONVIF event service, or MQTT; a conformant product can be a camera, a server, or a cloud service, and a client can be a VMS, NVR, or cloud service; conformance is a baseline). Primary (tier 1). https://www.onvif.org/profiles/profile-m/
ONVIF — "Analytics Service Specification" (the scene description and ObjectTree: each tracked object carries an ObjectId; the standard defines Rename, Split, Merge, and Delete operations for object identity over time — the standardized mechanism by which a single device reports its own tracks, and the basis for the point that the ObjectId is device-local, not a cross-camera identity). Primary (tier 1). https://www.onvif.org/specs/srv/analytics/ONVIF-Analytics-Service-Spec.pdf
European Union — "GDPR, Regulation (EU) 2016/679, Art. 4(1) and Art. 9" (personal data includes a person who can be 'singled out' directly or indirectly — the basis for 'a movement trail is personal data without a name'; Art. 9 makes biometric data processed to uniquely identify a person special-category — the basis for the face/gait gate). Primary (tier 1). https://eur-lex.europa.eu/eli/reg/2016/679/oj
European Data Protection Board — "Guidelines 3/2019 on processing of personal data through video devices" (soft traits that do not uniquely identify a person are generally outside the special-category biometric rules; systematic large-scale monitoring of a publicly accessible area triggers a DPIA under Art. 35 — the basis for the appearance-vs-biometric line and the DPIA point). Primary / issuing-body guidance (tier 1/2). https://www.edpb.europa.eu/our-work-tools/our-documents/guidelines/guidelines-32019-processing-personal-data-through-video_en
European Union — "EU AI Act, Regulation (EU) 2024/1689, Art. 5" (real-time remote biometric identification in publicly accessible spaces for law-enforcement purposes is a prohibited practice, applicable since 2 February 2025, with narrow exceptions under strict authorization; the basis for the public-space biometric wall). Primary (tier 1). https://artificialintelligenceact.eu/article/5/
Zhang, Y., et al. — "ByteTrack: Multi-Object Tracking by Associating Every Detection Box" (arXiv 2110.06864) (associates every detection box including low-confidence ones; ≈ 80.3 MOTA / 77.3 IDF1 / 63.1 HOTA on MOT17 at 30 FPS — the single-camera accuracy anchor and the occlusion-recovery idea). First-party research (tier 3). https://arxiv.org/abs/2110.06864
Aharon, N., et al. — "BoT-SORT: Robust Associations Multi-Pedestrian Tracking" (arXiv 2206.14651) (adds camera-motion compensation and an appearance model; ≈ 80.5 MOTA / 80.2 IDF1 / 65.0 HOTA on MOT17, ranks first on MOT17/MOT20 — the second single-camera anchor and the camera-motion point). First-party research (tier 3). https://arxiv.org/abs/2206.14651
Ristani, E., Tomasi, C. — "Features for Multi-Target Multi-Camera Tracking and Re-Identification" (CVPR 2018, arXiv 1803.10859) (frames cross-camera tracking as single-camera tracking + re-ID + network-wide stitching, each adding error — the basis for the MTMC three-layer framing). First-party research (tier 3). https://arxiv.org/abs/1803.10859
Tang, Z., et al. — "CityFlow: A City-Scale Benchmark for Multi-Target Multi-Camera Vehicle Tracking and Re-Identification" (arXiv 1903.09254) and the AI City Challenge (40 cameras up to 2.5 km apart; leading cross-camera IDF1 ≈ 70–85%, 2025 camera-only end-to-end HOTA ≈ 45% — the cross-camera accuracy-cliff evidence). First-party research / benchmark (tier 3). https://arxiv.org/abs/1903.09254
Luiten, J., et al. — "HOTA: A Higher Order Metric for Evaluating Multi-Object Tracking" (IJCV 2021) and standard person re-ID results on Market-1501 (Rank-1 ≈ 96%, mAP ≈ 91–93%) and MSMT17 (Rank-1 ≈ 86–87%, mAP ≈ 70–75%), 2025 literature (the metric definitions — MOTA/IDF1/HOTA — and the re-ID benchmark numbers, including the clean-vs-messy benchmark drop). First-party research (tier 3). https://link.springer.com/article/10.1007/s11263-020-01375-2

Related glossary terms

ONVIF

Re-identification

GDPR

Bandwidth

Object tracking

Face recognition

Object detection

ONVIF Profile M