Published 2026-05-27 · 22 min read · By Nikolay Sapunov, CEO at Fora Soft
Why This Matters
If your video product needs to count something, follow something, or reason about a journey, you need a tracker. A surveillance system that counts how many people walked through a door needs to know the difference between three different people walking through and one person walking through three times. A retail analytics tool that measures dwell time needs to know that the shopper at the cereal aisle at second 12 is the same shopper at second 47. A sports broadcast that overlays player names on a basketball court needs to know that the body wearing jersey number 23 in frame 1 is still wearing number 23 in frame 1500, even when number 23 walks behind number 12 for half a second. A video-conferencing UI that highlights the active speaker with a bounding box around their face needs to keep that box on the right face when two people swap seats. All of those features are the same problem under the hood — multi-object tracking — and all of them break in the same way if the tracker is wrong. The technology has been around since the 1990s, but the model choices, the failure modes, the deployment patterns, and the evaluation metrics have all shifted since 2022, and the documentation that engineers read on the first day of a project is usually three years out of date. This article gives you the 2026 picture: which tracker to pick, why ByteTrack and OC-SORT replaced DeepSORT as the default, what HOTA measures that MOTA missed for a decade, and how to avoid the four production mistakes that sink most tracking deployments.
What Multi-Object Tracking Actually Is
A video is a sequence of frames. An object detector (YOLO, RT-DETR, Grounding DINO, whatever you picked in the open-vocabulary detection lesson) runs on each frame independently and emits a list of bounding boxes, each with a class label and a confidence score. As far as the detector is concerned, the detections in frame 1 and the detections in frame 2 are unrelated: it does not remember what it saw a moment ago. A tracker sits on top of the detector and stitches those per-frame detections into trajectories. Each trajectory carries a stable track ID, a small integer that says "this is the same object I saw on the previous frame". The track ID is the connective tissue that turns a sequence of detection lists into a list of trajectories that downstream code can reason about.
Why bother? Because almost no useful video feature works on per-frame detections alone. You cannot count cars crossing a line without persistent IDs — every car would be counted dozens of times as it crossed multiple frames. You cannot grade a tennis serve from a video without persistent IDs — you would not know which detection belonged to the server and which to the receiver. You cannot flag a fight in a CCTV stream without persistent IDs — the rule that fires the alert is "the same two people were within one metre for more than two seconds", and that rule requires identity persistence. Every video-AI feature that involves motion, dwell, counting, choreography, or interaction is downstream of a tracker.
The job sounds simple — match the boxes in frame t to the boxes in frame t+1 — but it gets hard fast. Objects occlude each other; the model misses a frame here or there; objects move at non-constant velocity; the camera moves; two objects look similar and trade IDs; one object splits into two detections; two detections of the same object get merged. The history of multi-object tracking is largely the history of getting more robust to those failure modes.
Figure 1. A tracker turns a sequence of independent detection lists into a small set of trajectories with stable identities. Without persistent IDs, every downstream feature breaks.
Tracking-By-Detection — The Pipeline That Won
By 2022, almost every published tracker that mattered had converged on the same skeleton — tracking-by-detection. The pipeline has four stages, and almost every modern tracker is a different choice in each of them.
The first stage is detection. A modern detector emits one bounding box per object per frame, with a class label and a confidence score. The quality of the detector dominates everything downstream — a tracker on top of a 95-percent-recall detector and a tracker on top of an 80-percent-recall detector live in different universes, and the gap between them is far larger than any clever association algorithm can close. Almost every paper since 2022 reports tracker numbers on top of a fixed, public detector (often YOLOX-X or a comparable model) precisely so that tracker comparisons measure tracker quality and not detector quality.
The second stage is motion prediction. For each existing track, the tracker predicts where the object will be in the next frame, based on its motion history. The canonical predictor is the Kalman filter, a 1960 algorithm that models the object's state, typically the box (x, y, aspect_ratio, height) plus the velocity of each component (dx, dy, da, dh), as a Gaussian. Each frame, a constant-velocity motion model pushes the mean and covariance forward (the predict step), and the matched detection, when there is one, pulls them back toward what was observed (the update step). Modern trackers all use a Kalman filter or a close cousin; the differences are in the state vector and the noise parameters, not in the algorithm itself.
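The predict and update steps are easier to see on a single coordinate. The following is a minimal sketch, assuming a two-element state (position and velocity of, say, the box centre x) instead of the eight-element box state described above. The class name `Kalman1D` and the noise values `q` and `r` are illustrative assumptions, not values taken from any of the trackers discussed here.

```python
from dataclasses import dataclass


@dataclass
class Kalman1D:
    """Constant-velocity Kalman filter for one coordinate (illustrative sketch)."""
    x: float             # position estimate
    v: float = 0.0       # velocity estimate, in units per frame
    # 2x2 covariance [[pxx, pxv], [pxv, pvv]] stored as three scalars.
    pxx: float = 10.0
    pxv: float = 0.0
    pvv: float = 100.0
    q: float = 0.01      # process noise (assumed value)
    r: float = 1.0       # measurement noise (assumed value)

    def predict(self) -> float:
        # Transition F = [[1, 1], [0, 1]]: position advances by one frame of velocity.
        self.x += self.v
        # P' = F P F^T + Q, expanded by hand for the 2x2 case.
        pxx = self.pxx + 2 * self.pxv + self.pvv + self.q
        pxv = self.pxv + self.pvv
        pvv = self.pvv + self.q
        self.pxx, self.pxv, self.pvv = pxx, pxv, pvv
        return self.x

    def update(self, z: float) -> None:
        # Measurement H = [1, 0]: only the position is observed.
        s = self.pxx + self.r                  # innovation covariance
        kx, kv = self.pxx / s, self.pxv / s    # Kalman gain
        residual = z - self.x
        self.x += kx * residual
        self.v += kv * residual
        # P = (I - K H) P
        pxx = (1 - kx) * self.pxx
        pxv = (1 - kx) * self.pxv
        pvv = self.pvv - kv * self.pxv
        self.pxx, self.pxv, self.pvv = pxx, pxv, pvv


# Usage: an object moving two pixels per frame.
kf = Kalman1D(x=0.0)
for z in [2.0, 4.0, 6.0, 8.0]:
    kf.predict()
    kf.update(z)
print(round(kf.v, 2), round(kf.predict(), 2))
```

If `update` is skipped for several frames, `predict` keeps extrapolating along the last velocity while the covariance grows, which is the drift that the occlusion handling in later trackers tries to contain.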
The third stage is association. Given the predicted track positions and the new detections, the tracker computes a cost matrix between every existing track and every new detection, then solves the assignment problem to match them up. The cost can be a geometric distance (intersection-over-union between the predicted box and the detection box), an appearance distance (cosine distance between deep features of the two boxes), or a weighted combination. The assignment is solved with the Hungarian algorithm, a 1955 result that finds the optimal one-to-one matching in O(n³) time and runs in microseconds at the scales tracking cares about.
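As a rough illustration of the association stage, the sketch below builds a 1 - IoU cost matrix and picks the one-to-one matching with the lowest total cost. To stay dependency-free it searches all assignments by brute force, which is only viable for a handful of boxes; that search is a stand-in for the Hungarian algorithm, and a production tracker would typically call a solver such as `scipy.optimize.linear_sum_assignment` instead. The function names and the `max_cost` gate are assumptions made for this example.

```python
from itertools import permutations


def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def associate(tracks, dets, max_cost=0.7):
    """Return (track_index, detection_index) pairs with minimal total 1 - IoU cost.

    Brute-force search, used here only as a small-scale stand-in for the
    Hungarian algorithm. Pairs whose cost exceeds max_cost are dropped afterwards.
    """
    n_t, n_d = len(tracks), len(dets)
    if n_t == 0 or n_d == 0:
        return []
    cost = [[1.0 - iou(t, d) for d in dets] for t in tracks]
    if n_t <= n_d:
        candidates = ([(i, p[i]) for i in range(n_t)]
                      for p in permutations(range(n_d), n_t))
    else:
        candidates = ([(p[j], j) for j in range(n_d)]
                      for p in permutations(range(n_t), n_d))
    best = min(candidates, key=lambda pairs: sum(cost[i][j] for i, j in pairs))
    return [(i, j) for i, j in best if cost[i][j] <= max_cost]


predicted = [(0, 0, 10, 10), (20, 20, 30, 30)]
detections = [(21, 21, 31, 31), (1, 1, 11, 11), (100, 100, 110, 110)]
print(associate(predicted, detections))
```

The third detection overlaps no predicted box, so it stays unmatched and is handed to the lifecycle stage as a candidate new track.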
The fourth stage is track lifecycle management. New detections that do not match any existing track become new tracks. Existing tracks that do not match any detection get marked unconfirmed, and after a few frames of being unconfirmed they get killed. Tracks that have just been born are usually marked tentative and have to survive a few frames of consistent detection before they become confirmed. The lifecycle parameters — birth threshold, death threshold, maximum unconfirmed frames — are tuning knobs that matter more than people realise.
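The lifecycle rules can be sketched as a small state machine. This is one plausible reading rather than the logic of any specific tracker: the state names, the `min_hits` and `max_misses` parameters, and the choice to drop a tentative track on its first miss are all assumptions for illustration. The state called `LOST` here corresponds to what the paragraph above calls unconfirmed.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    TENTATIVE = "tentative"   # just born, not yet trusted
    CONFIRMED = "confirmed"   # matched consistently
    LOST = "lost"             # currently unmatched, kept alive for a while
    REMOVED = "removed"       # dead, ID never reused


@dataclass
class Track:
    track_id: int
    state: State = State.TENTATIVE
    hits: int = 1      # consecutive frames with a matched detection
    misses: int = 0    # consecutive frames without one

    def step(self, matched: bool, min_hits: int = 3, max_misses: int = 30) -> State:
        """Advance the lifecycle by one frame."""
        if self.state is State.REMOVED:
            return self.state
        if matched:
            self.hits += 1
            self.misses = 0
            # A lost track that reappears is trusted again immediately;
            # a tentative one has to accumulate min_hits first.
            if self.state is State.LOST or self.hits >= min_hits:
                self.state = State.CONFIRMED
        else:
            self.hits = 0
            self.misses += 1
            if self.state is State.TENTATIVE or self.misses > max_misses:
                self.state = State.REMOVED
            else:
                self.state = State.LOST
        return self.state


track = Track(track_id=1)
history = [track.step(m, max_misses=2).value for m in (True, True, False, True, False, False, False)]
print(history)
```

Raising `max_misses` keeps identities alive through longer occlusions at the price of more stale tracks that can steal a match from a newly arrived object.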
Every tracker we discuss below is a different choice inside this four-stage pipeline. ByteTrack changes how the detection threshold is applied in stage 3. OC-SORT changes the Kalman filter's behaviour after a missed detection. DeepSORT adds deep appearance features to the cost matrix. BoT-SORT adds camera-motion compensation to the Kalman filter and appearance embeddings to the cost. The skeleton is the same; the muscles differ.
The Four Tracker Families That Matter In 2026
Every production deployment in 2026 picks one of four open-source tracker families. We walk through them in chronological order, because each one solved a problem that the next one inherited.
DeepSORT — the 2017 classic that introduced appearance embeddings
DeepSORT was published in 2017 by Nicolai Wojke, Alex Bewley, and Dietrich Paulus, as an extension of the 2016 SORT tracker. SORT — Simple Online and Realtime Tracking — was already a respectable baseline: Kalman filter for motion prediction, IoU distance for association, Hungarian algorithm for matching. SORT ran at hundreds of frames per second on a CPU but lost identities through occlusions and crossings because it used only geometry. DeepSORT's contribution was to add a second cost matrix on top of the IoU cost: a cosine distance between deep convolutional appearance features of the detection box and a feature gallery for each track. The appearance features were extracted by a small CNN trained for person re-identification on the MARS dataset. The result was the first tracker that could maintain identities through brief occlusions, and it dominated the literature for the next four years.
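The appearance term can be sketched in a few lines. The example below assumes each track keeps a bounded gallery of recent feature vectors and scores a detection by its smallest cosine distance to any of them. The class name `AppearanceGallery` and the `budget` of 100 features are illustrative assumptions, and the feature vectors are stand-ins for the output of a Re-ID network (assumed non-zero).

```python
import math
from collections import deque


def cosine_distance(a, b):
    """1 - cosine similarity between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


class AppearanceGallery:
    """Bounded store of a track's recent appearance features (illustrative)."""

    def __init__(self, budget=100):
        # Oldest features are evicted automatically once the budget is reached.
        self.features = deque(maxlen=budget)

    def add(self, feature):
        self.features.append(list(feature))

    def cost(self, det_feature):
        # Smallest distance to any stored feature: the detection only has to
        # resemble one past view of the object, not the average of all views.
        return min(cosine_distance(f, det_feature) for f in self.features)


gallery = AppearanceGallery(budget=2)
gallery.add([1.0, 0.0])
gallery.add([0.0, 1.0])
print(round(gallery.cost([1.0, 1.0]), 4))
```

One such cost per track and detection pair fills the appearance cost matrix, which is then combined with the geometric cost before the assignment is solved.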
DeepSORT's numbers on MOT17, the standard benchmark for pedestrian tracking, were 60.3 MOTA, 61.2 IDF1, and 48.8 HOTA when paired with a strong YOLOX-X detector. Those are retroactive numbers for a 2017 algorithm: YOLOX was released in 2021 and HOTA was published in 2020, so neither existed when DeepSORT came out, but the retroactive evaluation is what the field now uses for comparison.
In 2026, the practical reason to know about DeepSORT is twofold. First, it is the tracker that every tracking tutorial uses, and the one that every junior engineer reaches for first. Second, the appearance-feature pattern it introduced is the foundation that BoT-SORT, StrongSORT, and Deep OC-SORT all build on. The original DeepSORT, however, is no longer the default for new projects. The 2022-vintage trackers below beat it on every metric and are equally easy to deploy.
ByteTrack — the 2022 detection-driven workhorse
ByteTrack was published at ECCV 2022 by Yifu Zhang and colleagues. It is the single most influential paper in tracking-by-detection since DeepSORT. The contribution is deceptively small: associate every detection box, not just the high-confidence ones.
Most prior trackers discarded detections below a confidence threshold (typically 0.5) before association. ByteTrack's observation was that low-confidence detections are usually still real objects: the detector saw them faintly because of partial occlusion, motion blur, or unusual pose. Discarding them throws away exactly the detections that the tracker most needs to maintain identity through occlusion. ByteTrack's procedure: first associate high-confidence detections to existing tracks using IoU distance; then take the unmatched tracks and try to associate them with the low-confidence detections, again using IoU. High-confidence detections that fail to match anything become new tracks; low-confidence detections that fail to match are discarded.
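The two-pass procedure can be sketched as follows. This is a simplified illustration, not the reference implementation: it uses greedy IoU matching as a stand-in for the Hungarian solver, takes the Kalman-predicted track boxes as given, and the thresholds `high`, `low`, and `min_iou` are assumed values chosen for the example.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(tracks, dets, min_iou):
    """Greedy best-IoU-first matching (stand-in for an optimal solver)."""
    ranked = sorted(((iou(t, d), i, j)
                     for i, t in enumerate(tracks)
                     for j, d in enumerate(dets)), reverse=True)
    used_t, used_d, matches = set(), set(), []
    for score, i, j in ranked:
        if score < min_iou:
            break
        if i not in used_t and j not in used_d:
            used_t.add(i)
            used_d.add(j)
            matches.append((i, j))
    free_t = [i for i in range(len(tracks)) if i not in used_t]
    free_d = [j for j in range(len(dets)) if j not in used_d]
    return matches, free_t, free_d


def byte_associate(tracks, dets, scores, high=0.6, low=0.1, min_iou=0.3):
    """Return (matches, new_track_detections, unmatched_tracks) as index lists."""
    high_idx = [j for j, s in enumerate(scores) if s >= high]
    low_idx = [j for j, s in enumerate(scores) if low <= s < high]

    # Pass 1: high-confidence detections against all tracks.
    m1, left_t, left_high = greedy_match(tracks, [dets[j] for j in high_idx], min_iou)
    matches = [(i, high_idx[j]) for i, j in m1]

    # Pass 2: the still-unmatched tracks against low-confidence detections.
    m2, still_t, _ = greedy_match([tracks[i] for i in left_t],
                                  [dets[j] for j in low_idx], min_iou)
    matches += [(left_t[i], low_idx[j]) for i, j in m2]

    new_tracks = [high_idx[j] for j in left_high]   # unmatched high-confidence boxes
    unmatched = [left_t[i] for i in still_t]        # tracks with no detection at all
    return matches, new_tracks, unmatched           # unmatched low-confidence boxes are dropped


tracks = [(0, 0, 10, 10), (50, 50, 60, 60)]
dets = [(1, 1, 11, 11), (51, 51, 61, 61), (200, 200, 210, 210), (300, 300, 310, 310)]
scores = [0.9, 0.3, 0.8, 0.2]
print(byte_associate(tracks, dets, scores))
```

In the usage example the second track survives only because of the 0.3-confidence detection, which a single-threshold tracker would have thrown away before association.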
The full algorithm is fewer than fifty lines of Python. It uses no appearance features, no deep neural network beyond the detector itself, no learned components. It just changes the order in which detections are used. The result on MOT17 — 80.3 MOTA, 77.3 IDF1, 63.1 HOTA — beat every prior tracker including DeepSORT, BoT-SORT predecessors, and the entire family of trackers that used appearance embeddings. The lesson was that the detector was doing most of the work all along, and a smart cascade over detection confidence captured most of the value.
ByteTrack is the right default for any project where the detector is strong, the objects do not change appearance much, and the scene does not involve aggressive camera motion. It is roughly twice as fast as DeepSORT (since it does not run an appearance feature extractor), and it is also substantially faster than BoT-SORT-ReID, which adds camera-motion estimation and an appearance extractor; the single-stream CPU example later in this article puts that gap at a little over 2× per frame. The reference implementation is at github.com/ifzhang/ByteTrack under MIT license, and ByteTrack ships as a built-in tracker in Ultralytics YOLO under bytetrack.yaml.
OC-SORT — the 2023 motion-only tracker that fixes non-linear paths
OC-SORT — Observation-Centric SORT — was published at CVPR 2023 by Jinkun Cao and colleagues. The paper identifies a specific failure mode of every Kalman-filter-based tracker, including ByteTrack: after an object is occluded for several frames, the Kalman filter's predicted position drifts because the filter keeps propagating the constant-velocity assumption while no observation arrives to correct it. When the object reappears and the tracker has to associate it, the predicted position has drifted away from the true position, and the tracker fails the match.
OC-SORT's fix is observation-centric re-update. When a previously-lost track gets re-associated with a detection after k frames of occlusion, the tracker does not just update the Kalman filter with the new observation. It also goes back, reconstructs the trajectory between the last observation and the new one using both endpoints, and re-updates the filter as if it had seen consistent observations all along. The state vector at the moment of re-association is therefore based on the actual two endpoints of the gap rather than the drifted prediction. The paper also adds an observation-centric momentum term that uses the direction of recent observations directly in the cost matrix.
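The reconstruction step can be illustrated with a short sketch. It assumes the virtual trajectory is a straight line between the two real observations, and it only generates the in-between boxes; feeding them through the filter's predict and update steps is left as a comment. The function name `virtual_observations` is hypothetical.

```python
def virtual_observations(last_box, last_frame, new_box, new_frame):
    """Linearly interpolate boxes for the frames strictly between two real observations.

    Boxes are (x1, y1, x2, y2) tuples. Returns a list of (frame, box) pairs.
    """
    gap = new_frame - last_frame
    out = []
    for frame in range(last_frame + 1, new_frame):
        w = (frame - last_frame) / gap   # 0 at the old observation, 1 at the new one
        box = tuple(a + w * (b - a) for a, b in zip(last_box, new_box))
        out.append((frame, box))
    return out


# A track last seen at frame 10 and re-associated at frame 14.
for frame, box in virtual_observations((0, 0, 10, 10), 10, (40, 0, 50, 10), 14):
    # In a tracker, each virtual box would be passed through the Kalman
    # filter's predict and update steps in order, replacing the drifted state.
    print(frame, box)
```

After the replay, the filter's velocity estimate reflects the displacement actually covered during the gap rather than the velocity it had when the object disappeared.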
The OC-SORT MOT17 numbers (78.0 MOTA, 77.5 IDF1, 63.2 HOTA) sit right next to ByteTrack on the headline metrics, but the relative improvement is much larger on benchmarks with non-linear motion. On the DanceTrack benchmark, which features acrobatic and dance motion that violates the constant-velocity assumption, OC-SORT beats ByteTrack by roughly seven to eight HOTA points in the published comparisons. The DanceTrack result is the one that matters for any product where the objects move non-linearly: dance instruction, gymnastics, parkour, drone footage, animal tracking, anything in a sports broadcast that involves changes of direction.
OC-SORT does not use appearance features either. Like ByteTrack, it is detector-driven and runs at the speed of the detector itself. The reference implementation is at github.com/noahcao/OC_SORT under MIT license. Deep OC-SORT, the 2023 extension by Maggiolino and colleagues, adds an adaptive appearance feature on top and reaches 64.9 HOTA on MOT17, within a tenth of a point of BoT-SORT-ReID's 65.0.
BoT-SORT — the 2022 tracker that became the Ultralytics default
BoT-SORT was published in 2022 by Nir Aharon, Roy Orfaig, and Ben-Zion Bobrovsky. It is built on the same skeleton as ByteTrack and adds two upgrades. First, camera-motion compensation: the tracker estimates how the camera moved between the two frames (using sparse optical flow or the ECC image-alignment algorithm) and warps the Kalman filter's predicted state by that camera motion before association. This single change is what makes BoT-SORT work on handheld and drone footage where the constant-velocity assumption fails because the world is moving relative to the camera, not because the objects are. Second, appearance embeddings: BoT-SORT-ReID, the appearance-augmented variant, attaches a re-identification feature extractor (the SBS-S50 model from the FastReID library, a Bag-of-Tricks baseline on a ResNeSt50 backbone that this article abbreviates to BoT-S50) and adds an appearance cost to the association.
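The warping step itself is small once the camera motion is known. The sketch below assumes a 2×3 affine matrix has already been estimated elsewhere (OpenCV's `cv2.estimateAffinePartial2D` on matched keypoints is one way to obtain one) and applies it to a predicted box in (cx, cy, w, h) form. Applying the linear part to the width and height as well as to the centre is our reading of the paper's formulation; the function name `warp_box` is hypothetical.

```python
def warp_box(box, affine):
    """Apply a 2x3 affine camera-motion matrix to a (cx, cy, w, h) box.

    affine is ((a, b, tx), (c, d, ty)). The linear part [[a, b], [c, d]] is
    applied to both the centre and the size; the translation only moves the centre.
    """
    (a, b, tx), (c, d, ty) = affine
    cx, cy, w, h = box
    return (a * cx + b * cy + tx,
            c * cx + d * cy + ty,
            a * w + b * h,
            c * w + d * h)


# The camera panned so that the scene shifted 5 px right and 3 px up in the image.
pan = ((1, 0, 5), (0, 1, -3))
print(warp_box((100, 50, 20, 40), pan))
```

Without this correction, a pan of a few pixels per frame is indistinguishable from every object in the scene changing velocity at once, and the IoU between predicted and detected boxes collapses.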
BoT-SORT's MOT17 numbers are 80.5 MOTA, 80.2 IDF1, 65.0 HOTA; at publication, that was the state of the art across all three headline metrics. The reference implementation is at github.com/NirAharon/BoT-SORT under MIT license. The more important fact for production teams: as of 2024, BoT-SORT is the default tracker in Ultralytics YOLO, with config at ultralytics/cfg/trackers/botsort.yaml. If you call model.track() on an Ultralytics model without specifying a tracker, BoT-SORT is what runs. Check the with_reid setting in that config before assuming appearance features are active: recent Ultralytics releases ship it switched off, so the out-of-the-box default is BoT-SORT with camera-motion compensation but without Re-ID. The default is a useful signal all the same: BoT-SORT, with Re-ID enabled where identity matters, is a sensible production choice for general-purpose pedestrian and vehicle tracking.
The cost is throughput. BoT-SORT-ReID is markedly slower than ByteTrack because of the appearance feature extractor and the camera-motion estimation: in the single-stream CPU example later in this article, the full BoT-SORT-ReID pipeline takes a little over twice as long per frame (32.6 ms versus 14.4 ms). For most production deployments that overhead is acceptable, since single-stream pedestrian or vehicle tracking on a modern CPU still runs at about 30 frames per second, but for high-density scenes or many simultaneous streams the cost adds up.
Figure 2. The four tracker families that dominate production in 2026. ByteTrack is the throughput pick; BoT-SORT is the accuracy pick; OC-SORT is the non-linear-motion pick; DeepSORT is the legacy reference.
The Metrics That Actually Matter — MOTA, IDF1, And Why HOTA Replaced Both
For a decade the tracking community ranked trackers by MOTA — Multiple Object Tracking Accuracy. MOTA was introduced in 2008 by Bernardin and Stiefelhagen as a single number that combined false positives, false negatives, and identity switches. It worked well enough that it became the headline metric on MOTChallenge, the dominant benchmark for pedestrian tracking, and the number you would see at the top of every tracker paper from 2008 to 2020.
MOTA had a known flaw: it was heavily biased toward detection quality. A tracker that detected every object correctly but swapped identities at every occlusion would score nearly as high as a tracker that never swapped identities at all, because the identity-switch term in MOTA's formula is small compared to the false-positive and false-negative terms. The community responded by introducing IDF1 — an identity-focused F1 score that punishes identity switches much more aggressively. IDF1 was the right metric for any application where identity persistence mattered (multi-target re-identification, person-of-interest tracking) but it overcorrected: it ignored detection quality in turn.
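Writing the formula out makes the bias visible. In the CLEAR MOT definition, the three error types are summed over all frames t and divided by the total number of ground-truth boxes:

```latex
\mathrm{MOTA} \;=\; 1 \;-\; \frac{\sum_t \left( \mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDSW}_t \right)}{\sum_t \mathrm{GT}_t}
```

As an illustrative example with made-up counts, take a sequence with 10,000 ground-truth boxes, 1,000 misses, and 500 false positives. With zero identity switches, MOTA is 1 - 1,500 / 10,000 = 0.850. With 100 identity switches, which would be a serious problem for any counting or dwell-time feature, MOTA is 1 - 1,600 / 10,000 = 0.840. One identity switch costs exactly as much as one missed box, and there are far fewer of them, so the headline number barely moves.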
The resolution came in 2020 with HOTA — Higher Order Tracking Accuracy — proposed by Jonathon Luiten and colleagues. HOTA is an explicit, balanced combination of a detection score (DetA, how well the tracker detects objects in the first place) and an association score (AssA, how well it keeps identities consistent across detections that match the same ground-truth object). The two are combined geometrically — HOTA is the square root of DetA × AssA — so a tracker that scores high on one but low on the other gets penalised. HOTA also reports the two sub-scores separately, so a paper can show which axis a new tracker improved on.
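In symbols, HOTA is computed at each localisation threshold α (the minimum overlap for a prediction to count as a match) and then averaged over the thresholds 0.05, 0.10, ..., 0.95:

```latex
\mathrm{DetA}_\alpha = \frac{|\mathrm{TP}|}{|\mathrm{TP}| + |\mathrm{FN}| + |\mathrm{FP}|},
\qquad
\mathrm{HOTA}_\alpha = \sqrt{\mathrm{DetA}_\alpha \cdot \mathrm{AssA}_\alpha},
\qquad
\mathrm{HOTA} = \frac{1}{19} \sum_{\alpha \in \{0.05,\, 0.10,\, \ldots,\, 0.95\}} \mathrm{HOTA}_\alpha
```

A quick illustrative calculation shows the effect of the geometric mean. A tracker with DetA = 0.64 and AssA = 0.49 at some threshold scores HOTA = √(0.64 × 0.49) = 0.56 there. Raising DetA to 0.81 while AssA stays at 0.49 only lifts the score to √(0.81 × 0.49) = 0.63, so a tracker cannot buy a high HOTA with detection quality alone.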
By 2026 HOTA is the metric the field reports. MOTA still appears in every results table, but the leaderboards on MOTChallenge, DanceTrack, and SportsMOT all rank trackers by HOTA. If you read only one number from a tracker comparison, read HOTA.
| Metric | What it measures | Strength | Weakness |
|---|---|---|---|
| MOTA | False positives + false negatives + ID switches | Single number; widely cited | Detection-biased; underweights ID switches |
| IDF1 | F1 score over identity-correct frames | Association-focused | Detection-blind |
| HOTA | √(DetA × AssA), balanced detection + association | Balanced; reports sub-scores | Slower to compute |
Numbers from the MOTChallenge leaderboard (MOT17 test set, private detection) for the four trackers we cover:
| Tracker | MOTA | IDF1 | HOTA |
|---|---|---|---|
| DeepSORT (with YOLOX-X detector) | 60.3 | 61.2 | 48.8 |
| ByteTrack | 80.3 | 77.3 | 63.1 |
| OC-SORT | 78.0 | 77.5 | 63.2 |
| BoT-SORT-ReID | 80.5 | 80.2 | 65.0 |
The numbers say what we said above: ByteTrack and OC-SORT are within a hair of each other on MOT17; BoT-SORT-ReID is a step above them; DeepSORT is the historical reference. The story is different on DanceTrack, where OC-SORT pulls ahead of ByteTrack by roughly seven to eight HOTA points, and on SportsMOT, where BoT-SORT-ReID's appearance features matter most. The right tracker depends on the data, not on the leaderboard headline.
Picking The Right Tracker — A Decision Walkthrough
Here is the decision tree we use at Fora Soft when a project asks for a tracking feature. It runs in four questions.
The first question is: is the camera moving? If the camera is on a phone, drone, body-cam, dashcam, or any rig where the operator and the world both move, the constant-velocity Kalman filter at the heart of ByteTrack and OC-SORT will drift because the predictions are made in image coordinates, not world coordinates. BoT-SORT's camera-motion-compensation step is the fix. Pick BoT-SORT for any handheld, mobile, or vehicle-mounted camera.
The second question is: does the motion violate the constant-velocity assumption? Dance, gymnastics, parkour, animal movement, fast sports, and any other scenario where objects suddenly change direction break the Kalman filter's prediction step. OC-SORT's observation-centric re-update is the fix. Pick OC-SORT for dance, sports, animal-tracking, and any scenario with sudden direction changes.
The third question is: how many simultaneous streams or how dense is the scene? Appearance-based trackers (BoT-SORT-ReID, Deep OC-SORT, StrongSORT) extract a deep feature per detection per frame. On a CPU, that feature extraction is the dominant cost — running BoT-SORT-ReID on sixteen 1080p streams pegs a server. If the budget says "many streams, modest hardware", pick ByteTrack: it does no appearance work, runs at the speed of the detector, and is the right choice for high-density retail analytics and large-scale surveillance.
The fourth question is: how important is identity persistence through long occlusions? If the application can tolerate brief identity switches (vehicle counting, dwell-time estimation, crowd density), motion-only trackers (ByteTrack, OC-SORT) are fine. If identity must persist through five or more seconds of occlusion (person-of-interest tracking, sports broadcast graphics, multi-camera handoff), appearance-based trackers (BoT-SORT-ReID, Deep OC-SORT, StrongSORT) become necessary. Pick BoT-SORT-ReID for long-occlusion identity persistence.
In our experience, eighty percent of projects end up at ByteTrack (fastest, simplest, works) or BoT-SORT-ReID (slightly slower, more accurate, the Ultralytics default). OC-SORT is the special-case pick. DeepSORT is rarely the right answer in 2026 — every problem it was good at is now solved better by one of the three above.
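The four questions can be written down as a short function. This is one possible encoding, assuming the questions are answered in the order given and the first "yes" wins; the function name, the parameter names, and the ByteTrack fallback when every answer is "no" are our assumptions for the sketch.

```python
def pick_tracker(camera_moving: bool,
                 nonlinear_motion: bool,
                 many_streams: bool,
                 long_occlusions: bool) -> str:
    """Walk the four questions in order and return the first matching pick."""
    if camera_moving:
        return "BoT-SORT"        # camera-motion compensation
    if nonlinear_motion:
        return "OC-SORT"         # observation-centric re-update
    if many_streams:
        return "ByteTrack"       # no appearance extractor to pay for
    if long_occlusions:
        return "BoT-SORT-ReID"   # appearance features carry identity through gaps
    return "ByteTrack"           # simplest option when nothing else applies


print(pick_tracker(camera_moving=False, nonlinear_motion=False,
                   many_streams=True, long_occlusions=False))
```

Real projects often answer "yes" to more than one question, for example a drone filming a sports event, and in those cases the ordering above is a starting point for an evaluation on real footage rather than a verdict.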
Numeric example — the cost of running a tracker on a 1280 × 720 video at 30 frames per second
Here is the math for a single stream on a modern Intel server CPU, with a YOLOv8-S detector and twenty pedestrians in frame.
| Step | Time per frame |
|---|---|
| YOLOv8-S detection (640 × 640 input, 20 detections) | about 14 ms |
| ByteTrack association (20 detections, 20 active tracks) | about 0.4 ms |
| BoT-SORT-ReID appearance feature extraction (BoT-S50, 20 crops) | about 18 ms |
| BoT-SORT-ReID association | about 0.6 ms |

Pipeline totals per frame:
- ByteTrack pipeline: 14 + 0.4 = 14.4 milliseconds per frame.
- BoT-SORT-ReID pipeline: 14 + 18 + 0.6 = 32.6 milliseconds per frame.

Per second at 30 frames per second:
- ByteTrack: 432 ms of CPU.
- BoT-SORT-ReID: 978 ms of CPU.
The ByteTrack stream uses about 43 percent of one core. The BoT-SORT-ReID stream uses nearly a full core all by itself. By straight division, a sixteen-core server therefore handles about thirty-seven concurrent ByteTrack streams (16 / 0.432) or sixteen concurrent BoT-SORT-ReID streams (16 / 0.978), before reserving any headroom for video decoding and the operating system. The price difference shows up directly in the cloud bill: at AWS c7i.4xlarge pricing (sixteen vCPUs, about $0.71/hour on demand), ByteTrack costs about $0.019 per stream-hour and BoT-SORT-ReID about $0.044 per stream-hour. Both numbers are small in absolute terms, but the ratio, roughly 2.3×, matters when you scale to thousands of cameras.
If you need more streams per server, the move is not to switch to a lighter tracker; it is to drop the tracker frame rate. Most tracking-driven features (counting, dwell time, fall detection) work fine at 10 frames per second of tracking. Running the detector and tracker on every third frame, and interpolating each track's box on the skipped frames, cuts the cost by roughly 3× with negligible accuracy loss for those features.
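The arithmetic above is easy to rerun with your own measurements. The sketch below takes the per-frame timings and the server price from this section as inputs; the function name `stream_budget` is hypothetical, and the floor division gives a theoretical ceiling, so a production plan would reserve some headroom below it.

```python
import math


def stream_budget(ms_per_frame, fps=30, cores=16, server_usd_per_hour=0.71):
    """Return (core share per stream, max streams per server, USD per stream-hour)."""
    cpu_share = ms_per_frame * fps / 1000.0     # fraction of one core used by one stream
    streams = math.floor(cores / cpu_share)     # ceiling on streams, no headroom reserved
    return cpu_share, streams, server_usd_per_hour / streams


for name, ms in [("ByteTrack", 14 + 0.4), ("BoT-SORT-ReID", 14 + 18 + 0.6)]:
    share, streams, usd = stream_budget(ms)
    print(f"{name}: {share:.0%} of a core, {streams} streams, ${usd:.3f} per stream-hour")
```

Swapping in timings from your own hardware and detector is the point of the exercise: the per-frame numbers vary a lot between CPUs, and the stream count is very sensitive to them.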
Where To Run It — Browser, Edge, Or Server
The deployment-topology decision applies to trackers exactly the same way it applied to pose models in the pose-tracking lesson. The same model can be deployed in three different places, with three very different trade-offs.
In the browser, the tracker runs inside the user's tab. None of the four trackers we cover has a first-party browser implementation, but ByteTrack and OC-SORT, being motion-only and tiny, have been ported to JavaScript and WebAssembly, typically with the detector running through ONNX Runtime Web. The deployment is appropriate for browser-based gesture and ergonomics features. BoT-SORT-ReID with appearance features is too heavy for typical browser hardware budgets in 2026.
On the edge, the tracker runs on a small box near the camera: an NVIDIA Jetson Orin Nano, an Intel NUC, or a Hailo-8L AI accelerator. All four trackers fit comfortably on Jetson Orin Nano alongside a YOLOv8-S detector. ByteTrack is the throughput pick (more streams per box). BoT-SORT-ReID is the accuracy pick (one or two streams per box, but higher identity persistence). Edge deployment is the default for AI security cameras, intelligent video analytics, retail, and industrial computer vision.
On the server, the tracker runs in your cloud account alongside the detector. RT-DETR or YOLOv11 on GPU, plus ByteTrack or BoT-SORT-ReID on CPU, is a workable pattern. The latency is dominated by the network round-trip (typically 30 to 100 milliseconds) plus the inference time. Server deployment is the default for sports analytics platforms, broadcast contribution, batch processing of recorded footage, and any feature that needs the largest detectors.
| Topology | Default tracker | Cost owner | Privacy posture | Latency budget |
|---|---|---|---|---|
| Browser | ByteTrack via ONNX Runtime Web | User's device | Frames stay on device | 16 ms (60 FPS budget) |
| Edge | ByteTrack on Jetson (many streams) or BoT-SORT-ReID (high accuracy) | One-time hardware | Frames stay on LAN | 50 ms (round-trip + model) |
| Server | BoT-SORT-ReID on CPU, with GPU detector | Cloud bill per stream | Frames travel to cloud | 100 ms (RTT + model) |
The choice between the three is rarely close. The use case picks the topology, and the topology picks the tracker.
Figure 3. The same tracker family deploys differently depending on where the model runs. The topology drives the tracker pick at least as much as the algorithm does.
Four Production Pitfalls That Sink Tracking Features
We have shipped a lot of tracking features into video products at Fora Soft. Four failure modes show up repeatedly. They are not novel — every team rediscovers them — but the cost of rediscovery is large, so it is worth naming them up front.
The first pitfall is blaming the tracker for what is really a detector problem. When a tracker drops identities or counts wrong, the instinct is to swap trackers. The fix usually is not the tracker; it is the detector. A YOLOv8-N detector trained on COCO will miss most small-pedestrian detections at 1080p resolution, and no tracker on earth can stitch identities across missed detections. Always profile the detector first — if recall on your specific data is below ninety percent, fix the detector before touching the tracker. Switching from ByteTrack to BoT-SORT-ReID buys you a few HOTA points; switching from YOLOv8-N to YOLOv8-X buys you ten.
The second pitfall is using DeepSORT in 2026. DeepSORT was the right answer in 2018. It is rarely the right answer now. Every comparable problem it was good at is now solved better by ByteTrack (faster, more accurate) or BoT-SORT-ReID (more accurate, equally easy to deploy). The reason DeepSORT survives is that the tutorials a junior engineer finds on a first project are usually three years old. Default to ByteTrack for new projects; default to BoT-SORT-ReID if you need appearance features. We have rewritten DeepSORT pipelines into ByteTrack pipelines for three clients in the past two years, and the swap was always a net upgrade on latency, throughput, and HOTA together.
The third pitfall is leaving the Kalman filter and lifecycle parameters at their defaults on data the trackers were not tuned for. Every tracker we discussed is published with hyperparameters tuned for MOT17: pedestrians, mostly forward motion, well-lit scenes, fixed cameras. Apply that configuration to a drone-view sports broadcast or a CCTV stream of cars on a highway, and the Kalman filter's noise terms are wrong, the maximum-unconfirmed-frames knob is wrong, and the appearance threshold is wrong. The fix is to tune. Specifically: the track_buffer (how long a lost track is kept alive in case it returns) and the match_thresh (the matching threshold that, in the reference ByteTrack and Ultralytics implementations, gates the first, high-confidence association pass) are the two knobs that move the most. We habitually run a grid search over those two on a held-out video from the actual deployment data before shipping.
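A grid search over those two knobs needs very little code. The sketch below assumes you supply an `evaluate` callable that runs the tracker on a held-out clip with the given settings and returns a score where higher is better (for example HOTA computed with a toolkit such as TrackEval). That callable, the function name `grid_search`, and the grid values are all hypothetical.

```python
from itertools import product


def grid_search(evaluate,
                track_buffers=(15, 30, 60, 90),
                match_threshs=(0.6, 0.7, 0.8, 0.9)):
    """Try every (track_buffer, match_thresh) pair and return (best_score, best_params)."""
    best = None
    for track_buffer, match_thresh in product(track_buffers, match_threshs):
        score = evaluate(track_buffer=track_buffer, match_thresh=match_thresh)
        if best is None or score > best[0]:
            best = (score, {"track_buffer": track_buffer, "match_thresh": match_thresh})
    return best


# Stand-in scorer for demonstration only: a real one would run the tracker
# on held-out video and compute a tracking metric against ground truth.
def fake_evaluate(track_buffer, match_thresh):
    return -abs(track_buffer - 60) / 100 - abs(match_thresh - 0.8)


print(grid_search(fake_evaluate))
```

Sixteen tracker runs on a short annotated clip is usually cheap compared with the detector pass, because detections can be cached once and replayed for every parameter combination.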
The fourth pitfall is trusting track IDs across camera boundaries. A track ID is local to a single camera and a single tracker run. The same person walking from camera A to camera B will be assigned different IDs by the two trackers, and no amount of clever single-camera tracking will fix that. Cross-camera identity persistence is a separate problem — person re-identification or multi-camera multi-target tracking — and it requires a dedicated Re-ID model, a shared feature gallery, and a global graph-solving step. We will cover the cross-camera case in a future lesson; for now, if your application crosses cameras, do not pretend the single-camera tracker is doing it.
Figure 4. The four production pitfalls that sink tracking features. None of them is technically new; all of them are rediscovered by every team.
Where Fora Soft Fits In
We have shipped tracking features into half a dozen product categories at Fora Soft. In video conferencing, we use lightweight ByteTrack on the SFU side to keep persistent IDs on participants who change camera angles mid-call — the IDs feed downstream features like active-speaker labelling and automated meeting summaries. In retail and intelligent video analytics, we use BoT-SORT-ReID on Jetson edge boxes for dwell-time, queue-length, and out-of-stock detection — the appearance features matter because shoppers occlude each other and the appearance model resolves the resulting identity ambiguity. In OTT and broadcast contribution, we use BoT-SORT-ReID on cloud GPUs for sports overlays and player-tracking dashboards, where the longer occlusions and high-stakes identity matter. In e-learning, we use ByteTrack to track multiple students in classroom video and feed the trajectories into attention-monitoring features. In surveillance and security, we use BoT-SORT-ReID for fall, fight, and intrusion detection — the appearance features help maintain identity through long static-camera occlusions. In every project the tracker is the second decision; the detector and the topology are decisions one and zero.
What To Read Next
- YOLO Production Lineage — v8, v9, v10, v11, v12 — the detector that feeds the tracker.
- OpenPose, MediaPipe Pose, RTMPose, ViTPose — The Pose-Tracking Stack For Video In 2026 — the pose pipeline that uses ByteTrack as its third stage.
- Latency, Deployment Topology, And Real-Time-Vs-Batch — the topology-choice framework this article applies.
References
- Wojke, N., Bewley, A., Paulus, D. Simple Online and Realtime Tracking with a Deep Association Metric (DeepSORT). ICIP 2017. arXiv:1703.07402. The original DeepSORT paper that introduced deep appearance features into the SORT pipeline.
- Bewley, A., Ge, Z., Ott, L., Ramos, F., Upcroft, B. Simple Online and Realtime Tracking (SORT). ICIP 2016. arXiv:1602.00763. The original SORT paper that DeepSORT extended.
- Zhang, Y., Sun, P., Jiang, Y., Yu, D., Yuan, Z., Luo, P., Liu, W., Wang, X. ByteTrack: Multi-Object Tracking by Associating Every Detection Box. ECCV 2022. arXiv:2110.06864. The ByteTrack paper that introduced the two-stage detection-cascade association.
- ByteTrack reference implementation. github.com/ifzhang/ByteTrack, accessed 2026-05-27. The MIT-licensed reference code that ships in Ultralytics YOLO.
- Cao, J., Pang, J., Weng, X., Khirodkar, R., Kitani, K. Observation-Centric SORT: Rethinking SORT for Robust Multi-Object Tracking. CVPR 2023. The OC-SORT paper that introduced observation-centric re-update.
- OC-SORT reference implementation. github.com/noahcao/OC_SORT, accessed 2026-05-27. The MIT-licensed reference code.
- Maggiolino, G., Ahmad, A., Cao, J., Kitani, K. Deep OC-SORT: Multi-Pedestrian Tracking by Adaptive Re-Identification. ICIP 2023. arXiv:2302.11813. The appearance-augmented extension of OC-SORT.
- Aharon, N., Orfaig, R., Bobrovsky, B.-Z. BoT-SORT: Robust Associations Multi-Pedestrian Tracking. arXiv:2206.14651, 2022. The BoT-SORT paper that introduced camera-motion compensation and Re-ID-based association upgrades.
- BoT-SORT reference implementation. github.com/NirAharon/BoT-SORT, accessed 2026-05-27. The MIT-licensed reference code.
- Ultralytics. Multi-Object Tracking with Ultralytics YOLO. docs.ultralytics.com/modes/track, accessed 2026-05-27. First-party documentation for the BoT-SORT (default) and ByteTrack trackers shipped with the Ultralytics framework.
- Luiten, J., Ošep, A., Dendorfer, P., Torr, P., Geiger, A., Leal-Taixé, L., Leibe, B. HOTA: A Higher Order Metric for Evaluating Multi-Object Tracking. International Journal of Computer Vision, 2021. The HOTA paper that balances detection and association accuracy in a single metric.
- Du, Y., Zhao, Z., Song, Y., Zhao, Y., Su, F., Gong, T., Meng, H. StrongSORT: Make DeepSORT Great Again. IEEE Transactions on Multimedia, 2023. The StrongSORT paper that modernises the DeepSORT pipeline with better detector, Re-ID, and post-processing.
- Stanojevic, V., Todorovic, B. BoostTrack: Boosting the Similarity Measure and Detection Confidence for Improved Multiple Object Tracking. Machine Vision and Applications, 2024. The BoostTrack paper that combines confidence boosting and a soft IoU measure.
- Sun, P., Cao, J., Jiang, Y., Yuan, Z., Bai, S., Kitani, K., Luo, P. DanceTrack: Multi-Object Tracking in Uniform Appearance and Diverse Motion. CVPR 2022. The DanceTrack benchmark that exposes the constant-velocity Kalman filter's weakness on non-linear motion.
- Cui, Y., Zeng, C., Zhao, X., Yang, Y., Wu, G., Wang, L. SportsMOT: A Large Multi-Object Tracking Dataset in Multiple Sports Scenes. ICCV 2023. The SportsMOT benchmark for sports-tracking evaluation.
- Bernardin, K., Stiefelhagen, R. Evaluating Multiple Object Tracking Performance: The CLEAR MOT Metrics. EURASIP JIVP, 2008. The original CLEAR-MOT paper that defined MOTA, the metric HOTA replaced.


