
You want AI that reliably spots people, vehicles, or anomalies across multiple cameras and fires alerts without lag. Generic open-source demos work fine in a lab. They fall apart in real offices, factories, and construction sites, where low light, occlusions, camera movement, and 10+ concurrent streams cause dropped IDs, slow frame rates, and missed events.
In 2026, the practical stack pairs the latest YOLO model generation with ByteTrack or BoT-SORT tracking, RTSP input, and WebRTC output. That combination delivers real-time performance on edge hardware while staying straightforward to maintain.
This guide walks you through the full pipeline, honest benchmarks, and trade-offs so you can decide what fits your project. You will also see exactly where custom streaming integration makes the difference between a prototype and software that holds up under real conditions.
Key Takeaways
- The latest YOLO generation with BoT-SORT or ByteTrack is the strongest speed-accuracy combination for 2026 deployments
- ByteTrack runs faster; BoT-SORT handles occlusions more reliably – choose based on your scene
- TensorRT optimisation delivers 3–5× inference speedup on Jetson hardware
- Edge deployment (Jetson Orin) keeps latency under 100 ms and keeps footage off external servers
- WebRTC output enables sub-second live views on any device without additional media server overhead
- Custom training on your camera data cuts false positives significantly versus off-the-shelf weights
Why the 2026 Stack Beats Legacy Systems
Legacy surveillance software treats detection as a post-processing step: record first, analyse later. That approach works for forensics. It fails when you need instant alerts.
Modern stacks run detection and tracking in a single, continuous pipeline. The latest YOLO models (with the most recent release in early 2026) deliver end-to-end NMS-free inference, smaller parameter counts, and up to 43% faster CPU performance than earlier versions. The nano variant runs efficiently on edge hardware while maintaining solid accuracy for most surveillance tasks.
Pair detection with native trackers: BoT-SORT for accuracy in crowded or occluded scenes, ByteTrack for lower overhead and higher throughput. DeepSORT remains a valid option if your use case requires custom appearance-based re-identification features, though most teams now skip it in favour of the native options, which require far less tuning.
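Swapping trackers in Ultralytics is a one-line change, which makes A/B testing on your own footage cheap. A minimal sketch, with the RTSP URL and weights file as placeholders:

```python
from ultralytics import YOLO

model = YOLO("yolo_latest_n.pt")  # placeholder for the current nano checkpoint

# Pick one tracker config; both ship with Ultralytics.
tracker = "bytetrack.yaml"  # lower overhead, higher throughput
# tracker = "botsort.yaml"  # stronger through occlusions, costs a few FPS

model.track(source="rtsp://your-camera-url", tracker=tracker)
```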
RT-DETR is worth noting for scenarios that demand higher accuracy in complex scenes. However, YOLO variants typically win on speed and deployment ease for standard surveillance workloads.
The result: sub-second alerts, fewer false positives, and more straightforward scaling to multi-camera setups.
Detection Models Compared: YOLOv8, YOLO11, YOLO-World, RT-DETR
Choose the right detector based on your hardware constraints and scene demands. The table below compares the main options on COCO benchmark metrics and edge FPS estimates.

YOLO's 2026 nano variant consistently beats RT-DETR in edge speed and parameter count while staying close enough in accuracy for most surveillance tasks. Start with the nano or small variant: scale to medium only if you need to detect small, distant objects. YOLO-World is worth evaluating if you need open-vocabulary detection, though it comes with a speed penalty.
Tracking Shootout: DeepSORT vs ByteTrack vs BoT-SORT
Detection alone is not enough. You need consistent object IDs across frames to drive reliable alerts and event logging.

BoT-SORT wins on accuracy; ByteTrack wins on speed. Both beat DeepSORT in most real-world tests because they use smarter association logic rather than relying solely on appearance features.
A practical starting point: set track_buffer higher (50–100 frames) for longer occlusions, and always test on your actual footage before committing. Expect 5–15% ID switches in busy areas regardless of tracker; plan alert logic around groups or zones rather than single persistent IDs where you can.
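As a concrete sketch of that tuning, copy the stock BoT-SORT config, raise track_buffer, and point the tracker at your file. The field names below follow the botsort.yaml shipped with Ultralytics; verify them against the version you have installed before relying on this:

```python
from pathlib import Path
from ultralytics import YOLO

# Custom BoT-SORT config with a longer track_buffer for extended occlusions.
# Field names follow the stock botsort.yaml; check your installed version.
custom_tracker = """\
tracker_type: botsort
track_high_thresh: 0.5     # first-pass association threshold
track_low_thresh: 0.1      # second-pass (low-confidence) association
new_track_thresh: 0.6      # confidence needed to start a new track
track_buffer: 80           # frames to keep lost tracks alive (stock ~30)
match_thresh: 0.8          # IoU matching threshold
fuse_score: True           # fuse detection score into the matching cost
gmc_method: sparseOptFlow  # global motion compensation for camera shake
proximity_thresh: 0.5
appearance_thresh: 0.25
with_reid: False
"""
Path("botsort_long_buffer.yaml").write_text(custom_tracker)

model = YOLO("yolo_latest_n.pt")
model.track(source="rtsp://your-camera-url", tracker="botsort_long_buffer.yaml")
```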
Step-by-Step Pipeline: Installation, Custom Training, TensorRT Optimisation
Start simple, then add complexity incrementally. Below are the four stages most teams follow.
Step 1 – Install and test quickly
```bash
pip install ultralytics
```
Verify with a quick single-camera test:
```python
from ultralytics import YOLO

# Run detection + tracking on a live RTSP feed with an on-screen preview.
model = YOLO("yolo_latest_n.pt")
results = model.track(source="rtsp://your-camera-url", tracker="botsort.yaml", show=True)
```
Step 2 – Build and label your dataset
Prepare a data.yaml with your classes (person, vehicle, anomaly, etc.). Use Roboflow or CVAT for labelling. Aim for 300–1,000 images minimum per class, captured under your actual lighting and camera angles.
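A minimal data.yaml for this stack might look like the following; the paths and class list are placeholders for your own dataset layout:

```python
from pathlib import Path

# Minimal Ultralytics data.yaml; paths and class names are placeholders
# for your own dataset layout.
Path("data.yaml").write_text("""\
path: datasets/site-cameras   # dataset root
train: images/train           # relative to path
val: images/val
names:
  0: person
  1: vehicle
  2: anomaly
""")
```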
Step 3 – Custom training
```python
from ultralytics import YOLO

# Fine-tune the pretrained nano weights on your labelled site footage.
model = YOLO("yolo_latest_n.pt")
model.train(data="data.yaml", epochs=100, imgsz=640, device=0)
```
Add augmentations for low light and weather variation. Train for 50–200 epochs depending on dataset size. Fine-tune confidence thresholds on a held-out validation set before deploying.
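One way to bias training toward low-light and weather robustness is to widen the colour-space augmentation ranges. The values below are illustrative starting points rather than tuned settings; the parameter names follow Ultralytics' train arguments:

```python
from ultralytics import YOLO

model = YOLO("yolo_latest_n.pt")

# Illustrative augmentation overrides for low-light / weather variation.
# hsv_v widens brightness jitter, hsv_s saturation jitter; tune on your data.
model.train(
    data="data.yaml",
    epochs=100,
    imgsz=640,
    device=0,
    hsv_v=0.6,    # brightness variation (stock default ~0.4)
    hsv_s=0.8,    # saturation variation (stock default ~0.7)
    fliplr=0.5,   # horizontal flips
    mosaic=1.0,   # mosaic augmentation
)
```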
Step 4 – Export and optimise for Jetson
```python
# Export the trained model to a TensorRT engine (runs on GPU index 0).
model.export(format="engine", device=0)
```
TensorRT typically delivers 3–5× inference speedup over PyTorch on Jetson hardware. For the highest throughput, consider a C++ inference wrapper – Python overhead adds 10–20% latency at sustained stream counts.
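The exported engine loads back through the same Python API, so the tracking code from Step 1 runs unchanged; a quick sketch:

```python
from ultralytics import YOLO

# The TensorRT engine loads through the same YOLO API,
# so tracking code stays identical to the PyTorch version.
model = YOLO("yolo_latest_n.engine")
model.track(source="rtsp://your-camera-url", tracker="botsort.yaml")
```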
Edge vs Cloud Deployment
For most enterprise surveillance, edge is the right default. Latency stays under 100 ms, footage never leaves the premises, and ongoing costs are predictable. Cloud deployment makes sense for centralised analytics across dozens of distributed sites — stream metadata and alerts only, not raw video.
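What "metadata and alerts only" looks like in practice: per frame, ship a compact JSON event instead of video. A minimal stdlib-only sketch; the schema and function name are illustrative, not a fixed standard:

```python
import json
import time

def detections_to_event(site_id, camera_id, result):
    """Convert one Ultralytics frame result into a compact alert payload.

    `result` is a single frame's result object from model.track();
    the field names here are illustrative, not a fixed schema.
    """
    return json.dumps({
        "site": site_id,
        "camera": camera_id,
        "ts": time.time(),
        "objects": [
            {
                "track_id": int(b.id) if b.id is not None else None,
                "cls": result.names[int(b.cls)],
                "conf": round(float(b.conf), 3),
            }
            for b in result.boxes
        ],
    })
```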

Common Production Challenges
- GPU memory pressure at 10+ concurrent streams – use Jetson Orin NX or YOLO nano variants with frame-skipping
- ID switches during occlusion – increase track_buffer and combine with zone-based alert rules (see the sketch after this list)
- Low-light false positives – custom training with low-light footage and calibrated confidence thresholds
- GDPR / HIPAA compliance – local processing where possible; encrypted WebRTC streams for remote views
- Camera clock drift in multi-site setups – NTP sync and frame timestamp validation at ingest
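A zone-based alert rule can be as simple as a point-in-polygon test on each track's foot point, combined with frame-skipping to ease GPU pressure. A minimal sketch using only the standard library; the zone coordinates and the box tuple format are assumptions to adapt to your pipeline:

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting point-in-polygon test; polygon is [(x, y), ...]."""
    inside = False
    j = len(polygon) - 1
    for i in range(len(polygon)):
        xi, yi = polygon[i]
        xj, yj = polygon[j]
        if (yi > y) != (yj > y) and x < (xj - xi) * (y - yi) / (yj - yi) + xi:
            inside = not inside
        j = i
    return inside

RESTRICTED_ZONE = [(100, 400), (600, 400), (600, 700), (100, 700)]  # pixel coords
FRAME_SKIP = 3  # evaluate every 3rd frame to relieve GPU pressure

def check_zone_alerts(frame_idx, tracked_boxes):
    """tracked_boxes: iterable of (track_id, x1, y1, x2, y2) tuples."""
    if frame_idx % FRAME_SKIP != 0:
        return []
    alerts = []
    for track_id, x1, y1, x2, y2 in tracked_boxes:
        foot_x, foot_y = (x1 + x2) / 2, y2  # bottom-centre of the box
        if point_in_polygon(foot_x, foot_y, RESTRICTED_ZONE):
            alerts.append(track_id)
    return alerts
```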
Test on your real cameras early. Most teams underestimate network jitter and lighting variation until they reach production.
When to Build In-House vs Engage Video Specialists

If your core business is not video AI, building the detection and streaming layer yourself carries real hidden costs: the learning curve on tracker tuning, RTSP quirks, WebRTC NAT traversal, and compliance documentation. Specialists who have already shipped multi-camera platforms know where the edge cases live.
Our Expertise in Action
We have spent over 20 years building real-time video systems — not just detection scripts, but full pipelines that handle RTSP ingest, AI processing, low-latency delivery, and compliance requirements as an integrated whole.
One surveillance platform we have supported for more than a decade now serves 650 organisations and 25,000 daily users. We added AI motion detection, PTZ control, and word-searchable recordings (Amazon Transcribe) while keeping sub-second live views on web and mobile clients.
A second project integrated YOLO-based detection with Kurento media pipelines and WebRTC for mobile alerting. The client's deployment runs reliably across multiple sites with encrypted streams and GDPR-compliant local processing.
A third used LiveKit and FFmpeg to turn smartphones into intelligent edge cameras with AI motion tracking, offline-first recording, and automatic cloud sync when connectivity resumes. These systems hold up under packet loss, variable networks, and mixed hardware — because we treat streaming and AI as one product, not separate layers bolted together.
We use spec-driven Agentic Engineering: detailed specifications before any code is written, AI agents on bounded tasks, and human checkpoints at plan, code, and performance stages. Estimate variance averages around 6% versus the 30% common in traditional approaches. Delivery typically runs 4–10× faster on standard components.
FAQ
Which tracker should I pick first — ByteTrack or BoT-SORT?
Start with ByteTrack if raw speed matters and your cameras are mostly static. Switch to BoT-SORT if you see too many ID switches in crowded or occluded scenes and can afford a few FPS. Both outperform DeepSORT in most real-world deployments.
Does the latest YOLO generation work well on Jetson devices?
Yes. The nano and small models reach 33–52 FPS with TensorRT on Jetson Orin Nano. For sustained high-throughput workloads, a C++ inference wrapper reduces Python overhead further.
Can I use existing RTSP cameras with this stack?
Absolutely. FFmpeg handles RTSP / ONVIF ingestion, YOLO processes the decoded frames, and WebRTC delivers the annotated output to any device. Kurento or LiveKit can be layered in for recording, mixing, or server-side composition if needed.
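If you prefer to own the decode loop rather than hand Ultralytics the stream URL, OpenCV's FFmpeg-backed capture is the common route. A minimal sketch with a placeholder URL:

```python
import cv2
from ultralytics import YOLO

model = YOLO("yolo_latest_n.pt")
cap = cv2.VideoCapture("rtsp://your-camera-url")  # FFmpeg backend decodes RTSP

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break  # stream dropped; reconnect logic goes here in production
    # persist=True keeps track IDs stable across successive calls
    results = model.track(frame, persist=True, tracker="botsort.yaml")

cap.release()
```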
How much custom training is realistically required for my site?
Start with pretrained weights and fine-tune on 300–1,000 labelled images from your specific cameras. Focus on your actual objects, lighting conditions, and camera angles. Most teams see reliable results in 50–100 epochs. Off-the-shelf weights typically produce too many false positives on site-specific scenes.
What about GDPR or HIPAA compliance?
Process video locally on edge devices wherever possible — no raw footage leaves the premises. Use encrypted WebRTC for any remote viewing. Compliance also involves documenting data flows, retention policies, and consent mechanisms. We build these requirements into the spec from day one rather than retrofitting them.
How long until I have a working prototype?
A basic single-camera tracker takes 1–2 weeks to set up. A full multi-camera system with alert logic, WebRTC live view, and recording typically lands in 4–8 weeks depending on camera count, custom detection classes, and integration requirements.
Do I need a powerful GPU server?
Not for most deployments. Jetson Orin hardware handles multi-camera edge workloads without any cloud GPU. Cloud GPUs are useful primarily for long-term storage, centralised dashboards, or retraining pipelines — not for real-time inference at the edge.
What if I already have an existing surveillance system I need to extend?
We can do a free code audit and integration review. Many existing systems can be extended with a modern AI layer without replacing the full stack; whether that works depends on your current ingest and storage architecture.
Next Steps
If you are evaluating whether to build this in-house or want specialist input on the detection and streaming layer, we are straightforward to talk to.
A short call is usually enough to give you a clear picture of what is achievable and what it takes.



