Phase 2 Capstone — Real-Time Detector + Tracker + Segmenter On Jetson Orin

Why This Matters

This article is for the product manager scoping a surveillance, retail, or robotics feature who needs to know whether "all of this runs on one cheap box next to the camera" is true, the founder weighing an edge build against a cloud subscription, and the engineer who has read the individual Phase 2 lessons and now wants to see them assembled into a deployable whole. It assumes you have read lesson 2.2 on the YOLO detector lineage, lesson 2.8 on multi-object tracking, and lesson 2.4 on SAM 2 for video, because this capstone connects those three into a single pipeline rather than re-deriving them. By the end you will be able to explain the four parts of an edge vision system, put a defensible frame-rate and power number on a real Jetson deployment, and tell the difference between the AI-assisted-development claims that hold up and the ones that do not.

What "Real-Time On The Edge" Actually Means

Before any models, fix the two words that drive every decision in this article: "edge" and "real-time".

"Edge" is the opposite of "cloud". A cloud system sends the camera's video over the internet to a data centre full of powerful computers, gets an answer back, and acts on it. An edge system does the thinking on a small computer that sits right next to the camera — in the same room, often inside the same housing. The video never leaves the building. Think of the difference between mailing a photo to an expert and asking them to mark it up, versus having the expert standing at your shoulder. The expert at your shoulder answers faster, never depends on the postal service, and never lets the photo out of your hands. Those three advantages — lower latency, no network dependency, and privacy — are why so many video products move their AI to the edge.

"Real-time" has a precise meaning here, and it is not "fast". A camera produces a fresh picture, called a frame, at a fixed rhythm — 30 frames every second is the common rate for surveillance and conferencing. Real-time means the system finishes all of its thinking about a frame before the next frame arrives. At 30 frames per second, a new frame lands every 33 milliseconds (one second ÷ 30 frames = 0.033 seconds = 33 ms). So the entire pipeline — detect, track, segment, and draw — has a budget of 33 milliseconds per frame. Go over that budget and frames start piling up faster than you can process them; the system falls behind and the live view turns into a slideshow. The whole engineering challenge of this capstone is staying inside that 33 ms envelope on a computer the size of a stack of business cards.

A useful image: the frame budget is like a conveyor belt at a fixed speed. Every 33 milliseconds a part arrives. You have exactly that long to inspect it, label it, and pass it on. You cannot ask the belt to slow down, and you cannot let parts stack up. Everything below is about fitting three inspectors — the detector, the tracker, and the segmenter — onto one belt without anyone falling behind.
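
The budget arithmetic above can be checked in a few lines. The sketch below is illustrative only: the function names and the 40 ms processing time are assumptions chosen for the example, not measurements, and it models a single worker that never drops frames.

```python
def frame_budget_ms(fps: float) -> float:
    """Time available per frame, in milliseconds, at a given camera rate."""
    return 1000.0 / fps


def backlog_after(seconds: float, fps: float, processing_ms: float) -> int:
    """Frames left waiting after `seconds` if each frame takes `processing_ms`.

    Assumes one worker that processes frames strictly one after another
    and never drops any (a simplification for illustration).
    """
    arrived = int(seconds * fps)
    processed = min(arrived, int(seconds * 1000.0 / processing_ms))
    return arrived - processed


budget = frame_budget_ms(30)            # about 33.3 ms
behind = backlog_after(10, 30, 40.0)    # hypothetical 40 ms pipeline
print(f"budget: {budget:.1f} ms, backlog after 10 s: {behind} frames")
```

A pipeline that takes 40 ms per frame is only 7 ms over budget, yet in this model it is 50 frames behind after ten seconds, which is the "slideshow" effect described above.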

The Four Parts Of An Edge Vision System

Every system in this capstone, and almost every shipped edge-vision product, is built from the same four parts. Understanding these four is the single most useful thing in the article, because the parts stay the same whether the camera watches a shop floor, a hospital corridor, a factory line, or a delivery robot's path.

The first part is the detector: a model that looks at one frame and draws a box, called a bounding box, around every object it recognises, attaching a label ("person", "car", "package") and a confidence score from 0 to 1. The detector is the eyes. In production this is almost always a member of the YOLO family, covered in depth in lesson 2.2. YOLO — the name stands for "You Only Look Once" — earned its place at the edge because it sees the whole frame in a single pass rather than scanning it piece by piece, which is what makes it fast enough to run inside the frame budget.

The second part is the tracker: detection alone gives you a box in frame 1 and a box in frame 2, but it has no idea they are the same person. A tracker assigns each detected object a stable identity number and follows it across frames, so the system can reason about a path, a dwell time, or a direction of travel. The tracker is the memory. The production trackers — ByteTrack, DeepSORT, OC-SORT — are the subject of lesson 2.8. We use ByteTrack here for a reason we return to: it is nearly free in terms of computing cost, which matters enormously when the budget is 33 milliseconds.

The third part is the segmenter: a box tells you where an object roughly is; a segment tells you exactly which pixels belong to it, tracing the object's true outline. The segmenter is the precision tool. You do not run it on everything — that would blow the budget instantly — you run it only on the few objects a rule cares about. Our segmenter is a distilled, sped-up version of SAM 2, the model from lesson 2.4. "Distilled" means a small model trained to copy a big one; we explain why that is mandatory on the edge below.

The fourth part is the orchestrator: the plumbing that pulls frames off the camera, hands each frame to the three models in the right order, applies the business rules, and draws the result. The orchestrator is the conveyor belt itself. It is unglamorous and it is where most real projects lose their time budget, because a model that runs in 20 milliseconds on a benchmark can run in 40 in a badly built pipeline that copies the frame back and forth between the main processor and the graphics chip. Most of the second half of this article is about the orchestrator, because the three models are the easy part.

Figure 1. The four parts every edge-vision system shares — detector, tracker, segmenter, and the orchestrator that ties them together on one board.

Meet The Hardware: The Jetson Orin Family

The board doing all this work is an NVIDIA Jetson Orin. It is a small computer built around a graphics processor — the same kind of chip that renders video games — because the maths behind AI vision is the maths graphics chips were built to do: the same simple operation applied to millions of pixels at once. Jetson packs that capability into a module small enough to sit inside a camera housing and frugal enough to run on the power a single bright light bulb draws.

The family comes in tiers, and the tier you pick decides how many of the four parts you can run at once and at what frame rate. The numbers below are NVIDIA's own published specifications as of 2026, after the JetPack 6.2 software update unlocked a faster "Super Mode" on the smaller boards.

The smallest, the Jetson Orin Nano 8GB (Super Mode), delivers 67 trillion operations per second — written 67 TOPS, where a TOPS is one trillion operations per second — using 1,024 graphics cores running at 1,020 megahertz, with a memory bandwidth of 102 gigabytes per second, all within a power budget you can set to 7, 15, or 25 watts. For comparison, a 25-watt draw is roughly that of a laptop charger. The mid-tier Jetson Orin NX 16GB (Super Mode) reaches 157 TOPS in the same physical footprint, configurable up to a 40-watt mode. The top of the embedded line, the Jetson AGX Orin 64GB, delivers 275 TOPS at up to 60 watts, and the newest Jetson AGX Thor module pushes to 800 TOPS for robotics workloads that need to run several large models at once. NVIDIA has committed to producing the Orin modules through 2032, which matters for any product that has to stay in the field for years.

Here is the practical takeaway, with the arithmetic shown. A YOLO detector small enough for real-time work needs on the order of a few TOPS of sustained throughput to hit 30 frames per second. An Orin Nano at 67 TOPS therefore has comfortable headroom for one detector and a tracker on a single camera, and just enough left for an occasional segmentation call. An Orin NX at 157 TOPS — more than double — is the natural home for a multi-camera build or a system that segments more aggressively. The TOPS number is not a frame-rate promise; it is a ceiling. What you actually get depends on the model, the precision, and the quality of the orchestrator, which is the rest of this article.

Jetson Orin module (2026)	Peak AI (INT8, sparse)	GPU	Memory bandwidth	Power modes	Good for
Orin Nano 8GB (Super)	67 TOPS	1,024 cores @ 1,020 MHz	102 GB/s	7 / 15 / 25 W / MAXN	One camera: detect + track
Orin NX 16GB (Super)	157 TOPS	1,024 cores @ 1,173 MHz	102 GB/s	10–40 W / MAXN	Several cameras, more segmenting
AGX Orin 64GB	275 TOPS	2,048 cores	204 GB/s	15–60 W	Many cameras, big models
AGX Thor	800 TOPS	Blackwell GPU	273 GB/s	up to 130 W	Robotics, several large models

Table 1. The Jetson Orin family in 2026, with the newer AGX Thor for comparison. The "good for" column is a rule of thumb for a detector-plus-tracker-plus-occasional-segmenter workload like this capstone; your real numbers depend on model size, precision, and pipeline quality. Source: NVIDIA Jetson module specifications and the JetPack 6.2 Super Mode benchmarks.

The One Step That Makes It Fit: TensorRT And Quantization

A model trained on a big computer will not, untouched, hit the frame budget on a Jetson. Two transformations make it fit, and skipping either is the most common reason an edge project misses its timing.

The first is TensorRT. A trained model is a description of millions of small mathematical steps. TensorRT is NVIDIA's tool that takes that description and rewrites it for one specific chip, the way a translator rewrites a sentence to read naturally in another language rather than word-for-word. It does three things: it fuses many small steps into fewer big ones so the chip starts and stops less often (this is called layer fusion), it picks the fastest available routine for each operation on that exact chip (kernel auto-tuning), and it can reduce the numerical precision of the maths (precision calibration), which leads to the second transformation.

The second is quantization. A freshly trained model stores every number as a 32-bit floating-point value, which is very precise but wasteful. You almost never need that precision to recognise a person. Quantization rounds those numbers to a smaller format. The two that matter on Jetson are FP16 (16-bit, half-precision) and INT8 (8-bit integers). Think of it like describing a colour: you can specify it to six decimal places, or you can say "medium blue", and for telling a person from a car, "medium blue" is plenty. The payoff is well measured. On a Jetson Orin Nano, the YOLOv8-nano detector runs at about 27 milliseconds per frame in FP16 and about 23 milliseconds in INT8 (roughly 37 versus 43 frames per second); on the larger Orin NX, INT8 quantization lifts the same model from roughly 52 frames per second to about 65. Those two examples are gains of roughly 17% and 25%. Published 2026 benchmarks report larger gains, typically 40–80% over FP16, for other model and board combinations, so measure the gain on your own model rather than assuming the high end. INT8 weights also take half the memory of FP16 weights, which helps a model fit in the board's limited memory at all.

There is a price, and you must measure it, not assume it. Rounding the numbers down can cost a little accuracy. For a well-trained detector like YOLO the drop is usually under one percentage point of detection accuracy when INT8 is done with proper calibration — feeding the conversion tool a few hundred representative images so it learns the right rounding ranges. Skip the calibration and the accuracy loss can be severe. This is the single most important number to put on a test rig before you commit to INT8: measure your detector's accuracy in FP16, measure it again in INT8 on your footage, and confirm the gap is acceptable for your rule.
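
To make the role of calibration concrete, here is a minimal sketch of symmetric INT8 quantization with and without a calibrated range. It is a simplified illustration, not TensorRT's own calibration algorithm: the 99th-percentile clipping rule, the synthetic "activations", and every function name are assumptions made for the example.

```python
def quantize_int8(values, clip):
    """Symmetric INT8 quantization: map [-clip, clip] onto integers [-127, 127]."""
    scale = clip / 127.0
    return [max(-127, min(127, round(v / scale))) for v in values], scale


def dequantize(q_values, scale):
    """Map the integers back to approximate real values."""
    return [q * scale for q in q_values]


def mean_abs_error(a, b):
    """Average absolute difference between two equal-length lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Hypothetical activations: many small values plus one rare outlier.
typical = [i / 100.0 for i in range(-100, 101)]      # 201 values in [-1, 1]
activations = typical + [50.0]

# No calibration: the range is set by the single largest value seen.
naive_clip = max(abs(v) for v in activations)
# Calibrated (assumed rule): cover 99% of the magnitudes and clip the rest.
ranked = sorted(abs(v) for v in activations)
calibrated_clip = ranked[int(0.99 * (len(ranked) - 1))]

errors = {}
for name, clip in (("naive", naive_clip), ("calibrated", calibrated_clip)):
    q, scale = quantize_int8(typical, clip)
    errors[name] = mean_abs_error(typical, dequantize(q, scale))
    print(f"{name}: clip={clip:.2f}, mean error on typical values={errors[name]:.4f}")
```

With the outlier setting the range, the 255 available integer levels are spread over values that almost never occur, so the typical values are rounded far more coarsely. That is the mechanism behind the accuracy loss described above, and it is why the calibration images need to look like your real footage.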

Common mistake — quoting the benchmark frame rate as your frame rate. A vendor page says "65 FPS YOLOv8 on Orin NX". That number is the detector alone, on a benchmark image stream, with nothing else running. Your system also decodes video, runs the tracker, sometimes runs the segmenter, applies rules, and draws overlays — and it does so while the board's power mode and temperature fluctuate. Real deployments routinely land at half the headline number once the full pipeline is in place. Budget for the whole pipeline, on your hardware, on your footage, in the enclosure it will actually live in.

Walking The Frame Budget

This is the heart of the capstone. We have 33 milliseconds per frame at 30 frames per second. Let us spend it, step by step, for a single-camera deployment on an Orin NX. The exact numbers vary with model and board; the shape of the budget does not.

The first cost is decoding. The camera does not send raw pixels — it sends compressed video, the same way a streaming service does, to save bandwidth. The frame must be decompressed before any model can read it. Jetson boards have a dedicated hardware decoder, called NVDEC, that does this without touching the graphics cores, so it costs almost nothing of our AI budget — call it 1–2 milliseconds, and crucially it runs in parallel with the AI work on the previous frame. This is why keeping the frame on the chip the whole time matters so much.

The second cost is detection, and it is the big one. The YOLO detector, quantized to INT8 and compiled with TensorRT, runs in roughly 12–18 milliseconds per frame on an Orin NX for a small-to-medium model size. That is already more than half the budget, which tells you immediately why the detector is the part you size the board around.

The third cost is tracking, and here is the happy surprise. ByteTrack does its work on the main processor, not the graphics chip, using simple geometry: it predicts where each tracked object will be using a motion model (a Kalman filter, which is just a principled way of guessing the next position from the last few), then matches the new detections to those predictions (using the Hungarian algorithm, a standard method for pairing things up at lowest total cost). Both steps are cheap. Published measurements put ByteTrack's per-frame update at well under one millisecond. In budget terms, the tracker is nearly free — which is exactly why it is the right choice on a board where every millisecond is contested.
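
The predict-then-match step can be sketched in plain Python. This is a simplified stand-in, not ByteTrack's implementation: a constant-velocity step replaces the Kalman filter (which also tracks uncertainty), and a brute-force search over permutations replaces the Hungarian algorithm, which reaches the same optimum far more efficiently. All names and box values are hypothetical.

```python
from itertools import permutations


def predict(box, velocity):
    """Constant-velocity guess for the next position of an (x1, y1, x2, y2) box."""
    dx, dy = velocity
    x1, y1, x2, y2 = box
    return (x1 + dx, y1 + dy, x2 + dx, y2 + dy)


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def best_assignment(predicted, detections):
    """Pair each predicted track with one detection at the lowest total cost.

    Cost is 1 - IoU. Brute force is fine for a handful of boxes.
    Assumes len(detections) >= len(predicted).
    """
    best, best_cost = None, float("inf")
    for perm in permutations(range(len(detections)), len(predicted)):
        cost = sum(1.0 - iou(predicted[t], detections[d]) for t, d in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return dict(enumerate(best))   # track index -> detection index


# Two tracks moving toward each other, each with a (dx, dy) velocity per frame.
tracks = [((0, 0, 10, 10), (5, 0)), ((100, 0, 110, 10), (-5, 0))]
predicted = [predict(box, vel) for box, vel in tracks]
detections = [(94, 0, 104, 10), (6, 0, 16, 10)]   # listed in a different order
print(best_assignment(predicted, detections))
```

The work here is a few additions and comparisons per box pair, with no neural network involved, which is why this stage fits on the main processor in well under a millisecond for typical object counts.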

The fourth cost is segmentation, and the trick is that you do not pay it every frame. Running a full SAM 2 on every frame would cost more than the entire budget by itself. Instead you run the segmenter only when a rule asks for a precise outline — say, on the one tracked person who just crossed a line — and often only once every few frames, reusing the result in between. A distilled segmenter like NanoSAM runs in single-digit milliseconds on Orin when given a box prompt from the detector you already ran. Spread across frames, its average cost lands in the low single digits of milliseconds. We return to why the distilled version is mandatory below.

The fifth cost is the rules and the overlay: deciding whether a tracked, optionally-segmented object trips a condition, and drawing the boxes and outlines onto the output. On the graphics chip this is a few milliseconds at most.

Add it up for the Orin NX single-camera case: roughly 2 (decode) + 15 (detect) + 1 (track) + 3 (segment, amortised) + 3 (rules and draw) ≈ 24 milliseconds. That fits inside the 33-millisecond budget with about 9 milliseconds of headroom — and headroom is not slack to be proud of, it is the safety margin that absorbs a busy frame with forty objects in it, a momentary thermal throttle, or a second camera. The discipline is to keep the sum comfortably under budget, never exactly at it.
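
The same sum can be kept as a small table in code, which makes it easy to re-run when a stage changes. The stage costs below are the rounded figures from the text; the variable names are illustrative.

```python
FRAME_BUDGET_MS = 33.0   # the rounded 30 fps budget used in the text

# Per-frame stage costs in milliseconds, as walked through above.
stage_cost_ms = {
    "decode": 2.0,
    "detect": 15.0,
    "track": 1.0,
    "segment (amortised)": 3.0,
    "rules and draw": 3.0,
}

total_ms = sum(stage_cost_ms.values())
headroom_ms = FRAME_BUDGET_MS - total_ms

for stage, cost in stage_cost_ms.items():
    print(f"{stage:<22}{cost:5.1f} ms  ({cost / FRAME_BUDGET_MS:4.0%} of budget)")
print(f"{'total':<22}{total_ms:5.1f} ms, headroom {headroom_ms:.1f} ms")
```

Replacing the figures with timings measured on your own board, footage, and enclosure turns this from an estimate into a check you can run after every pipeline change.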

Figure 2. Spending the 33-millisecond frame budget on an Orin NX single-camera build. The detector dominates; the tracker is nearly free; the segmenter is paid only when a rule needs it; headroom is the safety margin, not spare capacity.

The Detector In Detail — YOLO On Orin

The detector earns the largest slice of the budget, so it deserves the most care. Three choices decide its cost: which YOLO size, which precision, and which input resolution.

Model size is a dial, not a switch. The YOLO family ships in sizes from "nano" (the smallest and fastest) up through "small", "medium", and larger. A nano model at INT8 on an Orin Nano runs comfortably above 30 frames per second on a single camera; a medium model gives you better accuracy on small or distant objects but eats more of the budget. The right choice is the smallest model that still detects the objects your rule cares about at the distance they appear. Pick it by testing on your footage, not by reading a leaderboard, because leaderboard accuracy is measured on a generic dataset that may look nothing like your camera's view.

Precision we covered: compile to INT8 with proper calibration, verify the accuracy gap on your own footage, keep FP16 as the fallback if the gap is too wide.

Input resolution is the quietly expensive dial. A detector does not see your full 1080p frame; it shrinks every frame to a fixed square — commonly 640 by 640 pixels — before looking. A larger input square finds smaller objects but costs more time, roughly with the area: going from 640 to 1280 on each side quadruples the pixel count and roughly quadruples the detection time. Many edge projects that "can't hit frame rate" are simply running a larger input than their rule needs. If your objects are big in the frame, a smaller input square can claw back several milliseconds with no accuracy loss that matters.
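
As a rough planning aid, the area rule of thumb can be written as a one-line estimate. The sketch assumes detection time scales with pixel count, which is the approximation used in the text rather than a measured law, and the 15 ms baseline is a hypothetical figure.

```python
def scaled_detect_ms(base_ms: float, base_side: int, new_side: int) -> float:
    """Estimate detection time at a new square input size.

    Assumes cost grows in proportion to pixel count (side squared).
    """
    return base_ms * (new_side / base_side) ** 2


base = 15.0   # hypothetical detection time at 640 x 640, in ms
for side in (480, 640, 960, 1280):
    print(f"{side:>4} px: ~{scaled_detect_ms(base, 640, side):.1f} ms")
```

Treat the output as a first guess for which input sizes are worth testing; the real timing on a given board should always be measured.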

The Tracker In Detail — Why ByteTrack Fits The Edge

The tracker's job is identity: keep calling the same person "track 7" as they move, even when the detector briefly loses them behind a pillar. ByteTrack's specific cleverness, explained in full in lesson 2.8, is that it does not throw away the detector's low-confidence boxes. Most trackers keep only the boxes the detector is sure about and discard the rest; ByteTrack runs a second matching pass that tries to attach the leftover low-confidence boxes to existing tracks. The result is that a person who gets briefly blurry or half-hidden, and so produces a low-confidence detection, keeps their identity instead of being treated as a new person. On the standard MOT17 benchmark this approach reaches a MOTA (multi-object tracking accuracy) score of 80.3, near the top of the field, while staying light enough to run in real time.
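
The two-pass idea can be shown in a short sketch. It is a simplification under stated assumptions: a greedy best-overlap matcher stands in for the assignment step ByteTrack actually uses, track boxes are taken as already predicted, and the 0.5 confidence split and 0.3 overlap threshold are hypothetical values.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(tracks, detections, min_iou):
    """Pair tracks with detections, best overlap first. Returns {track_id: det_index}."""
    pairs = sorted(
        ((iou(box, det["box"]), tid, i)
         for tid, box in tracks.items()
         for i, det in enumerate(detections)),
        reverse=True,
    )
    matches, used = {}, set()
    for score, tid, i in pairs:
        if score >= min_iou and tid not in matches and i not in used:
            matches[tid] = i
            used.add(i)
    return matches


def two_pass_associate(tracks, detections, high=0.5, min_iou=0.3):
    """Match confident boxes first, then offer the leftovers to unmatched tracks."""
    strong = [d for d in detections if d["score"] >= high]
    weak = [d for d in detections if d["score"] < high]

    first = greedy_match(tracks, strong, min_iou)
    remaining = {tid: box for tid, box in tracks.items() if tid not in first}
    second = greedy_match(remaining, weak, min_iou)

    result = {tid: strong[i] for tid, i in first.items()}
    result.update({tid: weak[i] for tid, i in second.items()})
    return result


tracks = {7: (0, 0, 10, 10), 8: (50, 0, 60, 10)}
detections = [
    {"box": (1, 0, 11, 10), "score": 0.9},    # clear view of track 7
    {"box": (51, 0, 61, 10), "score": 0.2},   # track 8, half hidden
]
matched = two_pass_associate(tracks, detections)
print({tid: det["score"] for tid, det in matched.items()})
```

A tracker that kept only the confident box would lose track 8 on this frame; the second pass keeps it alive using the 0.2-confidence detection.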

The reason this matters for the edge specifically is cost. ByteTrack does not run a neural network of its own; it is geometry and bookkeeping on the main processor. That means it does not compete with the detector and segmenter for the graphics chip, and it adds well under a millisecond per frame. Trackers that do run a neural network to recognise each object's appearance — DeepSORT is the classic example — hold identity more reliably when objects look alike, but they add a second model to your budget. On a board where the detector already eats half the budget, ByteTrack's near-zero cost is usually the right trade. Reach for an appearance-based tracker only when your scene truly demands it: many near-identical objects crossing paths, where geometry alone confuses their identities.

The Segmenter In Detail — Why You Must Distill SAM 2

SAM 2 is the model that, given a point or a box, produces a pixel-perfect outline of an object and — its headline feature — tracks that outline through a video using an internal memory of what the object looked like before. It is remarkable, and in its original form it is too slow for the edge. Published measurements put the full SAM 2 at around one frame per second on a mobile device, because the very memory mechanism that makes it good at video is also its slowest part. One frame per second against a 30-frame-per-second budget is a non-starter.

The fix is distillation: training a small, fast model to imitate the big one's outputs. Several distilled variants now exist precisely for this job. NanoSAM is a distilled Segment Anything model built to run in real time on Jetson Orin with TensorRT, reported at around five times the speed of the already-small MobileSAM with little accuracy loss. EdgeTAM, a 2025 on-device "track anything" model, hits 16 frames per second on a high-end phone and runs more than twenty times faster than standard SAM 2 by replacing the expensive memory step with a lighter one. EdgeSAM reaches 30-plus frames per second on phones, Jetson GPUs, and ARM devices. NVIDIA's own Super Mode benchmarks show a base SAM 2 climbing from 4.4 to 6.3 frames per second on an Orin Nano 8GB after the JetPack 6.2 update — useful for occasional calls, still far short of every-frame use, which is exactly why you run it selectively and reuse its results.

The engineering pattern that makes segmentation affordable on the edge is therefore: never segment blindly, never segment everything, never segment every frame. Let the cheap detector and the nearly-free tracker decide which one or two objects matter this second, hand the segmenter a box prompt for just those, run a distilled variant, and reuse the resulting outline for several frames while the object barely changes shape. Done this way, pixel-exact segmentation costs a few milliseconds on average instead of blowing the entire budget.
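
One way to implement the "reuse the outline for several frames" part of this pattern is a small per-track cache. The class below is a hypothetical sketch: the name MaskCache, the refresh interval of five frames, and the stand-in segmenter are all assumptions, and a real segmenter call would replace fake_segmenter.

```python
class MaskCache:
    """Re-run the segmenter for a track only every `refresh_every` frames."""

    def __init__(self, segment_fn, refresh_every=5):
        self.segment_fn = segment_fn          # callable(frame, box) -> mask
        self.refresh_every = refresh_every
        self._cache = {}                      # track_id -> (frame_index, mask)
        self.calls = 0                        # how often the segmenter really ran

    def get(self, track_id, frame_index, frame, box):
        cached = self._cache.get(track_id)
        if cached is not None and frame_index - cached[0] < self.refresh_every:
            return cached[1]                  # reuse the recent outline
        mask = self.segment_fn(frame, box)
        self.calls += 1
        self._cache[track_id] = (frame_index, mask)
        return mask

    def forget(self, track_id):
        """Drop a track that has left the scene."""
        self._cache.pop(track_id, None)


def fake_segmenter(frame, box):
    """Stand-in for a distilled segmenter; returns the box as a 'mask'."""
    return {"outline_of": box}


cache = MaskCache(fake_segmenter, refresh_every=5)
for frame_index in range(30):                 # one second of video at 30 fps
    cache.get(track_id=7, frame_index=frame_index, frame=None, box=(0, 0, 10, 10))
print(f"segmenter ran {cache.calls} times for 30 frames")
```

In this example the segmenter runs 6 times instead of 30 for one flagged track, which is how a call costing several milliseconds averages out to a much smaller per-frame cost. The right refresh interval depends on how quickly the object's shape changes in your footage.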

The Orchestrator — Where Projects Actually Lose Their Time

The three models are well-understood building blocks. The orchestrator — the code that moves frames between them — is where real projects quietly lose half their performance, and it is the part no benchmark measures for you.

The recurring villain is copying the frame around. On a Jetson the graphics chip and the main processor share the same physical memory, but an ordinary CPU buffer and a GPU buffer are still separate allocations, so a frame that moves from one to the other is copied unless the pipeline is deliberately written to share it, and a copy of a high-resolution frame costs real time. A naive pipeline copies the frame to the main processor to run the tracker, back to the graphics chip to run the segmenter, back again to draw, and those copies can easily cost more than the AI itself. The fix is to keep the frame on the graphics chip from decode to overlay, and only ever move the small results (a handful of boxes, a few outlines) across the boundary. This is the difference between a pipeline built around a 15-millisecond detector finishing each frame in 24 milliseconds and finishing it in 40.

There are two ways to build the orchestrator, and the choice is one of the most consequential in the project.

The first is NVIDIA DeepStream, a ready-made pipeline framework built for exactly this. DeepStream is built on GStreamer, a long-established system for assembling media pipelines as a chain of plug-in stages. Its stages map cleanly onto our four parts: a decode stage feeds a batching stage (nvstreammux) that groups frames from one or many cameras, then an inference stage (nvinfer) runs the TensorRT detector, a tracking stage (nvtracker) assigns identities, and an on-screen-display stage draws the result. Critically, DeepStream keeps the frame on the graphics chip across all of these stages by design, so you get the no-copy discipline for free. It is the right default for any serious multi-camera deployment.

The second is a custom pipeline, usually in Python with the Ultralytics library for the detector and your own glue for the rest. This is faster to prototype, easier to read, and fully flexible — Ultralytics supports detection, segmentation, and pose out of the box, where DeepStream's tracker integration is tuned mainly for detection and needs extra work for segmentation. The cost is that you are now responsible for the no-copy discipline yourself, and it is easy to get wrong. A custom pipeline is the right call for a single camera, an early prototype, or a system whose logic is too unusual to express in DeepStream's configuration files.

# Single-camera edge pipeline sketch (Ultralytics + ByteTrack).
# The detector runs a TensorRT INT8 engine; detector.track() with the
# ByteTrack config assigns identities; segmentation is called only on
# selected tracks, not every object.
from ultralytics import YOLO


def rule_fires(track_id, box):
    """Placeholder business rule: line crossing, dwell time, zone entry."""
    return False


def segment_one(frame, xyxy):
    """Placeholder for the distilled SAM 2 call; should return a mask."""
    raise NotImplementedError("plug in your distilled segmenter here")


def raise_alert(track_id, mask):
    """Placeholder alert sink."""
    print(f"alert for track {track_id}")


detector = YOLO("yolov8n.engine", task="detect")   # pre-compiled TensorRT INT8 engine

# stream=True yields one result per frame without buffering the whole video;
# persist=True keeps track identities alive across frames (ByteTrack).
for result in detector.track(source="rtsp://camera/stream",
                             tracker="bytetrack.yaml",
                             stream=True, persist=True):
    for box in result.boxes:
        if box.id is None:
            continue   # detection not attached to a track yet, so no rule applies
        track_id = int(box.id.item())
        # Only when a rule fires do we request a precise mask for that one track:
        if rule_fires(track_id, box):
            mask = segment_one(result.orig_img, box.xyxy[0])   # distilled SAM 2 call
            raise_alert(track_id, mask)

The sketch above is deliberately small, and that smallness is the point: the models are a few lines; the engineering is everything around them — compiling the engine, calibrating INT8, keeping frames on the chip, choosing when rule_fires, and segmenting one track instead of all of them.

Figure 3. Choosing the orchestrator. One camera or an unusual rule set favours a custom Python pipeline; many cameras and a need to scale favour DeepStream's built-in no-copy, multi-stream design.

Scaling To Many Cameras

A single camera is the teaching case. Real surveillance and retail deployments watch many cameras from one board, and the scaling maths is worth seeing because it is more forgiving than it first appears.

The reason one board can serve many cameras is batching. Instead of running the detector once per camera, the pipeline collects one frame from each of, say, eight cameras and runs the detector once on all eight together. The graphics chip is far more efficient processing eight frames in one pass than eight frames in eight passes, because it spends less time starting and stopping. The published figure makes the point: an Orin NX 16GB running a small YOLOv8 detector at INT8 can serve on the order of forty camera streams at about five frames per second each. Five frames per second is plenty for many surveillance rules — a person does not cross a doorway in a fifth of a second — and it shows that the per-camera frame rate is a dial you trade against camera count.

The trade is explicit and worth stating plainly: total work per second is roughly fixed by the board's throughput, and you divide it between cameras and frame rate. One camera at 30 frames per second, or thirty cameras at one frame per second, draw on the same budget. Pick the point on that curve your rule actually needs. A licence-plate rule at a gate needs high frame rate on few cameras; a "was anyone in this restricted area" rule across a warehouse needs low frame rate on many. Sizing the board is choosing where on that curve you sit, plus headroom.
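
The camera-count-versus-frame-rate trade can be turned into a quick sizing helper. This sketch assumes throughput divides evenly between cameras, as the paragraph above describes, and takes the 40 streams at 5 frames per second quoted earlier as the board's batched throughput; the 25% headroom figure and the function name are assumptions.

```python
def max_cameras(board_frames_per_second: float, per_camera_fps: float,
                headroom: float = 0.0) -> int:
    """Cameras one board can serve at a given per-camera frame rate.

    `headroom` is the fraction of throughput held back as safety margin.
    """
    usable = board_frames_per_second * (1.0 - headroom)
    return int(usable // per_camera_fps)


# 40 streams x 5 fps, the Orin NX figure quoted above, is 200 frames per second.
board_fps = 40 * 5
for fps in (5, 8, 15, 30):
    cams = max_cameras(board_fps, fps, headroom=0.25)
    print(f"{fps:>2} fps per camera: up to {cams} cameras")
```

The real curve is less tidy than this, because batching efficiency changes with the number of streams, so use the result to pick a starting point and then confirm it on the board.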

Track B — How AI Coding Assistants Help Build This (And Where They Don't)

This capstone carries the course's Track B theme: the AI-assisted tools a video engineer actually uses to build the product, as distinct from the AI features inside it. Embedded vision is a revealing test case, because it is exactly the kind of specialised, hardware-bound work where the marketing claims and the reality diverge — so it is worth being precise about both.

Start with what actually helps in 2026. AI coding assistants — Cursor, GitHub Copilot, Claude Code — have become standard equipment, and on this kind of project they earn their place in specific, bounded ways. They are excellent at the orchestrator's boilerplate: wiring up a GStreamer or Ultralytics pipeline, writing the argument-parsing and configuration-loading scaffolding, generating the calibration-image loader for INT8, and drafting the rule-evaluation glue. NVIDIA shipped this idea directly into the toolchain: the 2026 DeepStream release added coding agents that generate a deployable, optimised multi-camera pipeline from a plain-text description — "ingest four RTSP streams, detect people, track them, alert on zone entry" — which removes hours of fiddly configuration that used to be copy-pasted from forum posts. For the assembly work, these tools are a real multiplier.

Now the boundary, stated as plainly. The hard, performance-critical core of edge vision is where today's assistants are weakest, and pretending otherwise wastes a budget. Writing or hand-optimising a custom graphics-chip routine — a CUDA kernel — is the clearest example: peer-reviewed 2026 evaluations find that even frontier models still struggle to produce correct, fast CUDA kernels out of the box, lagging behind established compiler tools. The reason is structural, not temporary: getting an edge pipeline fast depends on facts an assistant cannot see from your code — the exact board, its current thermal state, where memory copies happen across the chip boundary, how the INT8 calibration interacts with your specific footage. An assistant works inside a fixed context window (128,000 to a million tokens of text in 2026) and cannot watch your nvpmodel power setting or your frame timings. It will confidently suggest a pipeline that looks right and copies the frame three times.

So the working pattern that holds up is a division of labour. Let the assistant write the scaffolding, the config, the boilerplate, and a first-draft pipeline — and verify its no-copy discipline yourself by profiling. Keep the human engineer firmly in charge of the four decisions that decide whether the system hits its frame budget: model size, precision and calibration, input resolution, and the memory path through the pipeline. The assistant accelerates the 80% that is assembly; the 20% that is performance engineering is still yours, and on the edge that 20% is the whole game.

Figure 4. The honest 2026 split. AI coding assistants accelerate the assembly of an edge pipeline; the performance-critical decisions that decide whether it hits the frame budget stay with the human engineer.

A Worked Example — Sizing And Powering A Four-Camera Site

Put numbers on a concrete brief: a small retail site, four cameras, a rule that flags when a tracked person enters a stockroom doorway, with a precise outline drawn on the flagged person for the review clip.

Start with the per-camera frame rate the rule needs. A person entering a doorway is not a fast event; eight frames per second is more than enough to catch the crossing and follow the person to the exit. Four cameras at eight frames per second is a total of 32 frames to detect per second (4 cameras × 8 frames = 32). On an Orin NX 16GB at 157 TOPS, with a small INT8 YOLO detector costing on the order of 15 milliseconds per frame, the detector consumes roughly 32 × 15 = 480 milliseconds of graphics-chip time per second even if every frame runs as its own pass. That is under half of the one-second budget, and batching the four cameras' frames into one pass should only lower it. The rest is comfortable room for the tracker (nearly free), the occasional segmentation call on a flagged person, and the overlay. One Orin NX 16GB covers this site with headroom; an Orin Nano would be tight at four cameras and is better matched to one or two.

Now power, because the edge promise is partly an energy promise. Run the Orin NX in its 25-watt mode for this load. Twenty-five watts continuous is 25 watts × 24 hours = 600 watt-hours, or 0.6 kilowatt-hours per day. At a representative electricity price of 15 cents per kilowatt-hour, that is 0.6 × 0.15 = 9 cents of electricity per day, about $33 a year, for a box that processes four camera streams without sending a single frame to the cloud. Set that against a cloud video-analytics subscription priced per camera per month and the edge build's running cost is, for many sites, a rounding error. The capital cost of the board and the engineering to deploy it is the real number to weigh — which is the scoping question the downloadable worksheet below is built to answer.
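
The sizing and power arithmetic in this worked example can be reproduced with two small functions, which makes it easy to substitute your own camera count, frame rate, wattage, and tariff. The function names are illustrative, and the inputs are the figures used in the text.

```python
def gpu_busy_fraction(cameras: int, fps_per_camera: float, ms_per_frame: float) -> float:
    """Fraction of each second the detector keeps the graphics chip busy.

    Treats every frame as a separate pass, which is the conservative case.
    """
    frames_per_second = cameras * fps_per_camera
    return frames_per_second * ms_per_frame / 1000.0


def yearly_energy_cost(watts: float, price_per_kwh: float) -> float:
    """Electricity cost of a device drawing `watts` continuously for a year."""
    kwh_per_day = watts * 24 / 1000.0
    return kwh_per_day * price_per_kwh * 365


busy = gpu_busy_fraction(cameras=4, fps_per_camera=8, ms_per_frame=15.0)
cost = yearly_energy_cost(watts=25, price_per_kwh=0.15)
print(f"detector load: {busy:.0%} of each second")
print(f"electricity: ${cost:.2f} per year")
```

The second function assumes the board draws its full mode wattage around the clock, so it gives an upper bound on the electricity bill for that power mode.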

Common mistake — ignoring the enclosure and the heat. Every benchmark in this article assumes the board can shed its heat. A Jetson in MAXN Super Mode that exceeds its thermal budget throttles itself to a lower frequency to stay safe, and your real frame rate drops with it — silently. A board on an open bench hits its numbers; the same board sealed in a weatherproof camera housing on a sunny wall may not. Validate frame rate in the actual enclosure, in the actual environment, at the actual ambient temperature, before you sign off. The thermal design is part of the AI design.

Where Fora Soft Fits In

Fora Soft builds video products across surveillance, retail analytics, telemedicine, e-learning, and video conferencing, and the edge pattern in this capstone recurs in most of them: a detector, a tracker, a selective segmenter, and an orchestrator that has to hit a hard frame budget on constrained hardware. The work that decides whether such a system ships is rarely the choice of model — it is the unglamorous engineering this article dwells on: compiling and calibrating the models for the target board, keeping frames on the graphics chip, sizing the camera-count-versus-frame-rate trade to the actual rule, and validating frame rate inside the real enclosure rather than on a bench. Our teams treat the AI-assisted tooling the way this article recommends, as an accelerator for assembly with a human owning the performance-critical core. If you are scoping an edge-vision feature and want a defensible frame-rate, power, and cost estimate before committing a budget, that is precisely the kind of question we help answer.

Call to action

Talk to a video engineer — book a 30-minute scoping call to talk through your Jetson Orin object-detection plan.
See our case studies — 250+ shipped projects across video streaming, WebRTC, OTT, telemedicine, e-learning, surveillance, and AR/VR.
Download the Jetson Orin Edge-Vision Scoping Worksheet — a one-page printable that sizes any detector-plus-tracker-plus-segmenter deployment on Jetson Orin: pick the board, set the camera-count-vs-frame-rate trade, budget the 33 ms frame, and avoid the four edge traps.

References

NVIDIA — Jetson Orin module specifications (Orin Nano 67 TOPS, Orin NX 157 TOPS, AGX Orin 275 TOPS; CUDA-core counts, power modes; production lifecycle through 2032). https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-orin/
NVIDIA — Jetson Modules, Support, Ecosystem, and Lineup (per-module TOPS and power-mode table, AGX Thor up to 800 TOPS). https://developer.nvidia.com/embedded/jetson-modules
NVIDIA Technical Blog — JetPack 6.2 Brings Super Mode to Jetson Orin Nano and Orin NX (Super Mode power modes, up to 2× inference; SAM2-base 4.4→6.3 FPS Orin Nano 8GB, Grounding DINO 4.1→6.2, ViT-base 98→158 FPS; CUDA 12.6, TensorRT 10.3). https://developer.nvidia.com/blog/nvidia-jetpack-6-2-brings-super-mode-to-nvidia-jetson-orin-nano-and-jetson-orin-nx-modules/
NVIDIA — DeepStream SDK documentation: Overview (GStreamer-based DAG; nvstreammux batching, nvinfer TensorRT inference, nvtracker, NVDEC hardware decode; frames kept on GPU). https://docs.nvidia.com/metropolis/deepstream/dev-guide/text/DS_Overview.html
NVIDIA — Gst-nvtracker plugin documentation (batch tracking, persistent object IDs, low-level tracker library). https://docs.nvidia.com/metropolis/deepstream/dev-guide/text/DS_plugin_gst-nvtracker.html
NVIDIA — Gst-nvinfer plugin documentation (TensorRT inference on batched NV12/RGBA buffers). https://docs.nvidia.com/metropolis/deepstream/dev-guide/text/DS_plugin_gst-nvinfer.html
NVIDIA Technical Blog — How to Build Vision AI Pipelines Using DeepStream Coding Agents (2026; text-to-pipeline generation, multi-camera RTSP). https://developer.nvidia.com/blog/how-to-build-vision-ai-pipelines-using-deepstream-coding-agents/
Ultralytics — YOLO on NVIDIA Jetson using DeepStream SDK and TensorRT (TensorRT INT8/FP16 deployment; ~40 streams at ~5 fps on Orin NX 16GB with YOLOv8s INT8; INT8/TensorRT-version caveats). https://docs.ultralytics.com/guides/deepstream-nvidia-jetson/
Seeed Studio — YOLOv8 Performance Benchmarks on NVIDIA Jetson Devices (FP16 vs INT8 latency and FPS on Orin Nano / Orin NX). https://www.seeedstudio.com/blog/2023/03/30/yolov8-performance-benchmarks-on-nvidia-jetson-devices/
MDPI Computers — Benchmarking YOLOv8 Variants for Object Detection Efficiency on Jetson Orin NX (2026; per-variant FPS and energy on Orin). https://www.mdpi.com/2073-431X/15/2/74
arXiv — EdgeTAM: On-Device Track Anything Model (SAM 2 ~1 FPS mobile; EdgeTAM 16 FPS on iPhone 15 Pro Max, >20× faster than SAM 2; memory-attention bottleneck). https://arxiv.org/abs/2501.07256
NVIDIA-AI-IOT — NanoSAM (distilled SAM running real-time on Jetson Orin with TensorRT; ~5× MobileSAM). https://github.com/NVIDIA-AI-IOT/nanosam
FoundationVision — ByteTrack (two-stage association with low-confidence boxes; Kalman + Hungarian; 80.3 MOTA on MOT17; sub-millisecond per-frame update). https://github.com/FoundationVision/ByteTrack
arXiv — CudaForge: An Agent Framework with Hardware Feedback for CUDA Kernel Optimization (2025; LLMs still lag compiler tools on correct/performant CUDA kernels). https://arxiv.org/abs/2511.01884
arXiv — Tokalator: A Context Engineering Toolkit for AI Coding Assistants (2026; finite context windows 128K–1M tokens; prompt-length growth). https://arxiv.org/abs/2604.08290
NVIDIA — Jetson Orin lifecycle and roadmap (Orin production through 2032; JetPack 5/6 Super Mode roadmap). https://developer.nvidia.com/embedded/lifecycle