Published 2026-05-24 · 22 min read · By Nikolay Sapunov, CEO at Fora Soft
Why This Matters
If you are about to ship object detection in a video product (a surveillance system, a retail analytics platform, a fitness app, an OTT moderation pipeline, a telemedicine triage tool), YOLO is almost certainly the model your engineering team will reach for first. The question is which YOLO. Pick the wrong one and you spend three months retraining when the next version comes out, you discover that AGPL-3.0 obliges you to open-source your entire application, or you ship a model that can never meet its latency budget at your camera count. This article assumes you have already read lesson 2.1 on pre-processing for computer vision projects, because the pre-processing pipeline determines half of YOLO's accuracy in production, and lesson 1.4 on the real cost of AI in video products, because the per-stream cost of a YOLO inference is what turns a feature decision into a unit-economics decision. By the end you will be able to read a YOLO release announcement, map it to the five versions below, and tell your engineering team exactly which one to pick and why.
What "YOLO" Means In Production In 2026
The acronym "You Only Look Once" — YOLO — has been doing too much work in the computer vision conversation since 2016, and the word now refers to at least three different things that get confused in every kickoff meeting. Naming them clearly is the first step in any project decision.
The original YOLO, called YOLOv1, was a single 2015–2016 architecture by Joseph Redmon that proposed something genuinely new at the time: predict object boxes and class labels for an entire image in one forward pass of a convolutional network, instead of running a region proposal stage and then a classification stage separately. That paper is why "YOLO" is a brand. Redmon and his collaborators wrote YOLOv1, v2, and v3 between 2016 and 2018 and then stepped away from the project.
The second thing called YOLO is the lineage of versions that followed the original — YOLOv4, YOLOv5, YOLOv6, YOLOv7, YOLOv8, YOLOv9, YOLOv10, YOLOv11, YOLOv12, and, as of late 2025, YOLO26. Each was written by a different team in a different organisation, each carried a different licence, each shipped a different architecture, and each claims to be the legitimate descendant of the original. They are not really one lineage. They are a brand fragmented across teams who agreed that the YOLO name carried enough engineering credibility to be worth fighting over.
The third thing called YOLO is the Ultralytics product line: YOLOv5, YOLOv8, YOLOv11, and YOLO26 are all written and maintained by Ultralytics, the British company founded by Glenn Jocher (the author of YOLOv5). Ultralytics ships the ultralytics Python package, owns the AGPL-3.0 licence on the open-source release, and sells a commercial enterprise licence for teams that cannot live with AGPL obligations. When a product manager says "we'll use YOLO", they usually mean "we'll use the Ultralytics package", because that is the install that has the documentation, the cloud training service, the Roboflow integration, and the inference SDK. The versions in this article — v8, v9, v10, v11, v12 — span both Ultralytics releases and independent academic releases that work alongside the Ultralytics package.
The implication for a product decision is direct. Picking "YOLO" is not enough; the team has to pick a specific version, a specific licence, and a specific deployment target. The five versions below are the ones a 2026 production team realistically chooses from. Older versions (v5, v7) still ship in production, but they have no architectural advantage worth the migration friction unless legal constraints lock you to them.
Figure 1. Five versions of YOLO shipped between January 2023 and February 2025. Each version changed one specific thing about the production decision; the right version is the one whose change matches your project's bottleneck.
YOLOv8 — January 2023 — The Anchor-Free Multi-Task Baseline
YOLOv8 is the version that turned the YOLO brand into a general-purpose vision platform instead of a single-task object detector. Ultralytics released it on January 10, 2023, and within six months it was the default starting point for almost every new computer vision project that did not have a specific reason to pick something else. The reason was not raw accuracy — earlier YOLOs were close. The reason was that YOLOv8 made three engineering decisions that collapsed the cost of shipping a CV product.
The first decision was the anchor-free detection head. Earlier YOLOs used "anchors": a set of pre-defined box shapes (tall, wide, square, small, large) that the model learned to adjust to fit real objects. Anchors had to be configured per dataset because a coffee mug dataset has different aspect ratios than a parked-car dataset, and getting them wrong cost you accuracy. YOLOv8 deleted the anchors and asked the model to predict box coordinates directly. The result was that the model trained faster on custom datasets, generalised better when the customer's data did not match the pretraining distribution, and produced fewer redundant boxes for the Non-Maximum Suppression (NMS) step to clean up.
The second decision was the C2f module (Cross-Stage Partial Bottleneck with two convolutions). The C2f replaced the older C3 module in the backbone — the part of the model that turns raw pixels into useful features. The point of C2f was not to improve accuracy by a large margin; it was to improve gradient flow during training so that the model converged in fewer epochs at the same final accuracy. In practice this meant a custom-trained YOLOv8 model reached production-grade quality in 50 epochs where YOLOv5 would have needed 100, which directly translated into a 50% reduction in cloud training cost per experiment.
The third decision was multi-task support out of the box. YOLOv8 supports five tasks (detection, instance segmentation, classification, pose/keypoint detection, and oriented bounding box detection) that share the same backbone and the same training pipeline. A retail analytics team that needs detection now and segmentation in six months can run both jobs through the same ultralytics package, the same dataset format, and the same training script. No other detector in 2023 packaged the five tasks that cleanly. In a video surveillance project that has to detect people, segment their silhouettes for privacy blur, and estimate their pose for fall detection, the multi-task story alone justified picking YOLOv8.
The COCO benchmark numbers for YOLOv8 (COCO is the standard reference for an object detector's accuracy on a held-out dataset of 80 object classes) sit at 37.3% mAP for YOLOv8n (the nano variant, 3.2M parameters), 44.9% for YOLOv8s (11.2M parameters), 50.2% for YOLOv8m (25.9M parameters), 52.9% for YOLOv8l (43.7M parameters), and 53.9% for YOLOv8x (68.2M parameters). The numbers are quoted from Ultralytics' official model card. The "mAP" (mean Average Precision) is the headline accuracy metric for an object detector: the area under the precision-recall curve, averaged over the 80 classes and over IoU thresholds from 0.50 to 0.95, where IoU (Intersection over Union) measures how much a predicted box overlaps the ground-truth box. It is not a literal hit rate, so a score of 50.2% does not mean the model finds exactly half the objects. It is best read as a relative score: a higher mAP means precision and recall hold up better across that strict range of box-overlap requirements.
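The IoU range behind the mAP figure is easier to reason about with a concrete box pair. The sketch below is a minimal illustration, not the COCO evaluation code: the helper names and the example boxes are assumptions made up for this article, and a real mAP computation also needs confidence-ranked precision-recall curves per class.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap rectangle (zero if the boxes do not touch).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# The ten thresholds in the 0.50-0.95 range: 0.50, 0.55, ..., 0.95.
COCO_IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def thresholds_passed(pred_box, gt_box):
    """Thresholds at which this prediction would count as a match."""
    score = iou(pred_box, gt_box)
    return [t for t in COCO_IOU_THRESHOLDS if score >= t]


if __name__ == "__main__":
    ground_truth = (0, 0, 100, 100)
    prediction = (10, 10, 110, 110)  # same size, shifted 10 px on each axis
    print(round(iou(prediction, ground_truth), 3))      # 0.681
    print(thresholds_passed(prediction, ground_truth))  # [0.5, 0.55, 0.6, 0.65]
```

A box that is only 10 pixels off on a 100-pixel object already fails six of the ten thresholds, which is why mAP at 0.50–0.95 is much lower than mAP at 0.50 alone.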
The headline weakness of YOLOv8 in 2026 is that the C2f-and-anchor-free design has been beaten on accuracy and efficiency by every subsequent version. For a new project there is no strong reason to start with v8 unless the team has a YOLOv8-trained model already in production and a migration cost the product cannot absorb this quarter.
YOLOv9 — February 2024 — The Gradient Pipeline That Saved Small Models
YOLOv9 was released on February 21, 2024, by an academic team led by Chien-Yao Wang and his collaborators (who had previously written YOLOv4 and YOLOv7). It is not an Ultralytics release, but Ultralytics added YOLOv9 support to the ultralytics package shortly after. The release was published as a paper, "YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information", which was later accepted to ECCV 2024.
YOLOv9 made one architectural change that mattered more than the rest: it solved the small-model accuracy gap. In every YOLO version before v9, the smallest variants (the "n" and "s" sizes that fit on edge hardware) lost a measurable amount of accuracy that the bigger variants ("m", "l", "x") did not. The reason was buried in the training dynamics: as the model got deeper, useful gradient signal got lost in the layers and the small model — which had fewer layers in the first place — was disproportionately starved. The fix in v9 was two-part.
The first part was Programmable Gradient Information (PGI). The training pipeline carried a parallel auxiliary network that fed the main network with reliable gradient signal for every layer, and was then discarded at inference time. The auxiliary network cost nothing at deployment because it was deleted before export. During training, it kept the small variants from starving for gradient.
The second part was a new backbone called GELAN (Generalized Efficient Layer Aggregation Network). GELAN was designed around what the authors called "gradient path planning" — a structural choice that explicitly engineered the routes that gradients took backwards through the network, so that the right signal reached the right layer. The architectural detail is less important than the practical result: a YOLOv9-tiny model reached 38.3% mAP on COCO, beating the YOLOv8n's 37.3% with 21% fewer parameters and 24% fewer FLOPs. Across the lineup, YOLOv9 reduced parameters by 49% and computations by 43% compared to YOLOv8 at the same accuracy. For an edge deployment where the model has to fit on a Raspberry Pi or a Jetson Orin Nano, that is the difference between shipping and not shipping.
The headline weakness of YOLOv9 was that it landed in a crowded release window: a Tsinghua University team released YOLOv10 (also via paper, with Ultralytics integration following) three months later, and YOLOv10's NMS-free story stole the deployment attention from v9's gradient story. In 2026, v9 is the right pick for one specific scenario: an edge-constrained project that needs the absolute smallest model with respectable accuracy and can live with the standard NMS post-processing pipeline.
YOLOv10 — May 2024 — The NMS-Free Architecture
YOLOv10 was released in May 2024 by an academic team at Tsinghua University, led by Ao Wang, Hui Chen, and Lihao Liu, and published as the paper "YOLOv10: Real-Time End-to-End Object Detection" (arXiv 2405.14458). The paper was accepted to NeurIPS 2024. Ultralytics added YOLOv10 support to the ultralytics package shortly after release.
YOLOv10 made one production-grade change that mattered more than every accuracy improvement on the page: it deleted the Non-Maximum Suppression step from the inference pipeline. To understand why that matters, you need to understand what NMS is doing and why every prior YOLO needed it.
Every prior YOLO version produced multiple overlapping boxes for the same object during inference. A "person" in the frame would typically produce three, five, or twelve candidate boxes, each with a confidence score, and the NMS step would walk through them, pick the highest-confidence one, throw away the overlapping ones, and repeat until only one box per object remained. NMS is correct but expensive. On a YOLOv8s model running on an NVIDIA T4 GPU, NMS added roughly 4 to 6 milliseconds of latency to every inference — about a quarter of the total inference time. NMS also did not export cleanly to embedded inference runtimes; teams targeting CoreML, TFLite, and OpenVINO frequently had to re-implement NMS by hand in the deployment runtime, which is a category of bug that ate weeks of engineering time per project.
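The walk-through above (take the highest-confidence box, drop everything that overlaps it, repeat) can be sketched in a few lines. This is a minimal single-class greedy NMS in plain Python, written for illustration; the function names, the 0.5 overlap threshold, and the example boxes are assumptions, and production runtimes use vectorised, per-class implementations.

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Greedy NMS over detections given as (x1, y1, x2, y2, confidence)."""
    remaining = sorted(detections, key=lambda d: d[4], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)  # highest-confidence candidate left
        kept.append(best)
        # Discard every candidate that overlaps the kept box too much.
        remaining = [d for d in remaining if iou(best[:4], d[:4]) < iou_threshold]
    return kept


if __name__ == "__main__":
    candidates = [
        (100, 100, 200, 300, 0.90),  # person A, best box
        (105, 98, 205, 295, 0.80),   # person A, near-duplicate
        (95, 110, 198, 305, 0.60),   # person A, near-duplicate
        (400, 120, 480, 310, 0.70),  # person B
    ]
    for box in nms(candidates):
        print(box)
    # (100, 100, 200, 300, 0.9)
    # (400, 120, 480, 310, 0.7)
```

The loop is sequential and its cost grows with the number of candidate boxes, which is part of why it is awkward to fuse into an exported inference graph.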
YOLOv10 solved the NMS problem with consistent dual label assignments. During training, the model used two parallel label-assignment heads: a one-to-many head (which produced the multiple-overlapping-boxes behaviour that NMS would later clean up) and a one-to-one head (which produced exactly one box per object). At inference time, the model dropped the one-to-many head and used only the one-to-one head, which produced clean predictions without any NMS post-processing. The two heads were trained to agree on the same predictions, so accuracy did not drop.
The benchmark numbers for YOLOv10-S are 46.7% mAP at roughly 2.49 ms inference on a T4 GPU, beating YOLOv8-S by 1.8 mAP (46.7% versus the 44.9% quoted above) at 19% less latency, and beating YOLOv9-S by 0.7 mAP at 30% less latency. The NMS-free training cut the end-to-end latency of YOLOv10-S by 4.63 ms versus a comparable v8 baseline, which is exactly the per-frame budget you need back if you are pushing a 200-camera surveillance deployment through one GPU.
The headline weakness of YOLOv10 is the licensing situation: the academic release is under AGPL-3.0 (inherited from the Ultralytics integration), and the THU-MIG GitHub repository is also AGPL-3.0. The same legal calculus that gates YOLOv8 production deployments gates YOLOv10. In 2026, v10 is the right pick for any team that needs the lowest per-frame latency on a real-time pipeline (live surveillance alerting, in-call video effects, WebRTC pose tracking) and can either afford the Ultralytics enterprise licence or is willing to open-source the application.
YOLOv11 — September 2024 — The Backbone Redesign For Edge
YOLOv11 was released by Ultralytics at the end of September 2024. It is an Ultralytics product release (like v5 and v8), distributed through the same ultralytics Python package and under the same AGPL-3.0 / enterprise licence dual-track. The version did not propose a single dramatic new idea like v9's PGI or v10's NMS-free training; it consolidated the YOLO lineage into a cleaner architecture that beat v8 on every metric while running faster on edge hardware.
The architectural change that mattered was the C3k2 block, which replaced the C2f block in the backbone. C3k2 is a variant of the Cross-Stage Partial pattern that uses different kernel sizes (mixed 3×3 and 5×5 convolutions) and a channel separation strategy that reduced redundant feature extraction. The practical result was that the YOLO11m model achieved roughly 1.3 percentage points higher mAP than YOLOv8m on COCO (51.5% versus 50.2%) with 22% fewer parameters, which translated into a roughly 22% reduction in ONNX export size and a measurable speedup on CPU-bound inference.
YOLOv11 also pairs the SPPF block (Spatial Pyramid Pooling, Fast), carried over from v8, with one new attention-flavoured module: C2PSA (Cross-Stage Partial Spatial Attention), a spatial-attention block in the neck. C2PSA is the lighter-weight cousin of full self-attention; it does not pay the quadratic compute cost of a transformer, but it does help the model focus on the relevant regions of an image, which improved accuracy on small objects and occluded objects. The retail-camera scenario where a half-hidden person in the back of an aisle has to be detected was the canonical motivating example.
The headline COCO numbers for YOLOv11 are 39.5% mAP for YOLOv11n (2.6M parameters), 47.0% for YOLOv11s, 51.5% for YOLOv11m, 53.4% for YOLOv11l, and 54.7% for YOLOv11x. The mAP-per-parameter ratio is the best in the lineage up to this point. On a T4 GPU with TensorRT, YOLOv11n hits an inference latency of around 1.5 ms; on an NVIDIA Jetson Orin Nano, the same model hits 8 to 10 ms, which is enough for 30-fps real-time inference on multiple camera streams.
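The step from a per-frame latency to "multiple camera streams" is simple arithmetic, sketched below. The function name is hypothetical, and the model is deliberately naive: it assumes batch size 1, back-to-back inference, and no time spent on video decoding or pre-processing, so real deployments will land lower.

```python
import math


def streams_supported(latency_ms, fps_per_stream):
    """Upper bound on streams one device can serve at the given frame rate."""
    inferences_per_second = 1000.0 / latency_ms
    return math.floor(inferences_per_second / fps_per_stream)


if __name__ == "__main__":
    # Jetson Orin Nano range quoted above: 8 to 10 ms per inference.
    print(streams_supported(10, 30))  # 3 streams at full 30 fps
    print(streams_supported(8, 30))   # 4 streams at full 30 fps
    # Sampling each camera at 6 fps instead of 30 stretches the same budget.
    print(streams_supported(10, 6))   # 16 streams
```

The last line shows why the inference cadence per camera is usually a bigger lever than the choice between two adjacent model versions.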
YOLOv11 also retains the full multi-task line-up — detection, instance segmentation, classification, pose, and oriented bounding box — that v8 introduced. For most product teams in 2026 building a new vision feature on CPU or modest GPU hardware, YOLOv11 is the default pick. It is the version with the cleanest documentation, the largest open-source pretrained checkpoint catalogue, and the most mature export pipeline for the production runtimes the team is most likely to target (ONNX, TensorRT, OpenVINO, CoreML, TFLite).
The headline weakness of YOLOv11 is the same as YOLOv8: AGPL-3.0 makes commercial use awkward unless the team buys the Ultralytics enterprise licence, and the licence is the conversation legal will have with engineering before the first model is exported.
YOLOv12 — February 2025 — The Attention-Centric Detector
YOLOv12 was released on February 18, 2025, by an academic team led by Yunjie Tian, Qixiang Ye, and David Doermann, published as "YOLOv12: Attention-Centric Real-Time Object Detectors" (arXiv 2502.12524) and accepted as a NeurIPS 2025 poster. The release made the architectural change every YOLO observer had been waiting for since 2020: a real-time detector built around attention instead of around convolutions, with the speed numbers to back it up.
For a non-technical reader, "attention" is the mechanism that powers every modern large language model — GPT-5, Claude Opus 4, Gemini 2.5 — and also the modern vision transformers like ViT, SAM 2, and CLIP. Attention works by letting every part of an input look at every other part and learn which other parts matter for the current prediction. The trade-off has historically been compute: a naive attention mechanism scales quadratically with the input size, which is why convolutional networks (where each filter only looks at a small local patch) dominated real-time object detection for a decade. Every prior attempt to bring attention into a real-time detector — DETR, RT-DETR, Deformable DETR — paid in latency what it gained in accuracy.
YOLOv12 made attention cheap enough to ship by combining three ideas. The first was an Area Attention module (A2), which divides the feature map into spatial segments and only computes attention within each segment. The receptive field (the part of the input the model can see at once) stays large because each segment is itself a sizeable slice of the feature map, while the attention cost falls in proportion to the number of segments because no token attends outside its own segment. The second was R-ELAN (Residual Efficient Layer Aggregation Network), a redesigned feature aggregation block with a residual connection and a feature-scaling mechanism. The R-ELAN block solved the optimisation challenges that earlier attention-based detectors hit, where the attention heads diverged early in training and the model never recovered. The third was the use of FlashAttention at inference time. FlashAttention is a numerically equivalent implementation of attention that reorders the memory accesses to fit the GPU's fast on-chip SRAM, which can deliver 2 to 4× speedup on attention operations without changing the output.
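To see why restricting attention to segments saves compute, it helps to count query-key comparisons. The sketch below is back-of-envelope arithmetic, not the YOLOv12 implementation: the function name, the 40×40 feature map, and the choice of four equal segments are illustrative assumptions.

```python
def attention_pairs(num_tokens, num_areas=1):
    """Query-key score computations when tokens are split into equal areas.

    With one area this is ordinary global attention (num_tokens squared).
    """
    if num_tokens % num_areas != 0:
        raise ValueError("tokens must divide evenly into areas")
    tokens_per_area = num_tokens // num_areas
    return num_areas * tokens_per_area ** 2


if __name__ == "__main__":
    tokens = 40 * 40  # hypothetical 40x40 feature map -> 1,600 tokens
    print(attention_pairs(tokens))                # 2560000 (global attention)
    print(attention_pairs(tokens, num_areas=4))   # 640000  (four areas)
    # Doubling the feature map side quadruples tokens and multiplies
    # global attention cost by 16; the per-area saving factor stays the same.
    print(attention_pairs(80 * 80) // attention_pairs(80 * 80, num_areas=4))  # 4
```

The saving factor equals the number of areas, so the cost is still quadratic in the tokens per area; the design trades some global context for a fixed constant-factor reduction.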
The benchmark numbers for YOLOv12 are 40.6% mAP for YOLOv12-N at 1.64 ms on a T4 GPU, 48.0% for YOLOv12-S at 2.61 ms, and up to 55.2% for YOLOv12-X. YOLOv12-N beats YOLOv10-N by 2.1 mAP and YOLOv11-N by 1.2 mAP at comparable speeds. For a non-technical reader, the headline is: attention-based detection used to be 5 to 15 ms slower than convolutional detection at the same accuracy; YOLOv12 closed that gap.
The deployment caveats matter, though. FlashAttention requires an NVIDIA GPU with specific architecture support — Turing (T4, Quadro RTX), Ampere (A30, A40, A100, RTX 30-series), Ada Lovelace (RTX 40-series), or Hopper (H100, H200). On older GPUs, on Apple Silicon, on AMD GPUs, and on most edge hardware (Jetson Nano, Raspberry Pi, Intel CPUs), FlashAttention either does not run at all or runs slower than the standard attention implementation. The Ultralytics YOLO12 wrapper does not require FlashAttention — it falls back to PyTorch's native attention — but the speed advantage that makes YOLOv12 worth picking over YOLOv11 disappears in that fallback path. The licensing also matters: the academic release inherits the Ultralytics AGPL-3.0 / enterprise dual-track when integrated through the ultralytics package.
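The hardware list above can be turned into a pre-flight check on the CUDA compute capability of the deployment GPU. The sketch below is an assumption-laden illustration: the function names are hypothetical, the 7.5 threshold simply mirrors the Turing-and-newer list in this article, and the exact set of supported architectures depends on the FlashAttention build you install. With PyTorch, `torch.cuda.get_device_capability()` returns a `(major, minor)` tuple in the shape used here.

```python
# Compute capability of the oldest family named above (Turing, e.g. the T4).
MIN_FLASH_ATTENTION_CAPABILITY = (7, 5)


def supports_flash_attention(capability):
    """True if a (major, minor) CUDA compute capability meets the threshold."""
    return tuple(capability) >= MIN_FLASH_ATTENTION_CAPABILITY


def pick_detector(capability=None):
    """Apply this article's rule of thumb; None means no NVIDIA GPU at all."""
    if capability is not None and supports_flash_attention(capability):
        return "YOLOv12"
    return "YOLOv11"


if __name__ == "__main__":
    print(pick_detector((7, 5)))  # T4 (Turing)            -> YOLOv12
    print(pick_detector((8, 9)))  # RTX 40-series (Ada)    -> YOLOv12
    print(pick_detector((7, 0)))  # V100 (Volta), too old  -> YOLOv11
    print(pick_detector(None))    # Apple Silicon, CPU     -> YOLOv11
```

Running a check like this in the deployment image catches the silent fallback to standard attention before it shows up as a latency regression.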
In 2026, YOLOv12 is the right pick for projects that have NVIDIA GPU acceleration in the deployment target — cloud servers, modern workstations, NVIDIA Jetson AGX Orin — and need the highest accuracy a real-time detector can deliver. For edge devices without FlashAttention support, the right pick is YOLOv11. For pure CPU-bound deployments, YOLO26 (released September 2025) has now overtaken both, with up to 43% faster CPU inference than YOLO11-N at comparable accuracy.
The Five-Version Production Comparison
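The table below collapses the five versions into a single decision view. Each column is a version and each row is a decision axis: find the rows that match your project's bottleneck, compare across the columns, then check the last row for the scenario each version is the default pick for.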
| Decision axis | YOLOv8 | YOLOv9 | YOLOv10 | YOLOv11 | YOLOv12 |
|---|---|---|---|---|---|
| Release date | Jan 2023 | Feb 2024 | May 2024 | Sep 2024 | Feb 2025 |
| Author | Ultralytics | Wang et al. (academic) | THU (academic) | Ultralytics | Tian et al. (academic) |
| Licence (open) | AGPL-3.0 | GPL-3.0 / AGPL via Ultralytics | AGPL-3.0 | AGPL-3.0 | AGPL-3.0 via Ultralytics |
| Commercial path | Ultralytics enterprise | Academic / Ultralytics ent. | Academic / Ultralytics ent. | Ultralytics enterprise | Ultralytics enterprise |
| Core architectural change | Anchor-free + C2f | PGI + GELAN | NMS-free dual label | C3k2 + C2PSA | Area Attention + R-ELAN |
| Nano model mAP (COCO) | 37.3% | 38.3% | 38.5% | 39.5% | 40.6% |
| Nano latency on T4 (ms) | ~1.8 | ~2.0 | ~1.9 | ~1.5 | ~1.64 |
| Multi-task support | 5 tasks | Detection only | Detection only | 5 tasks | Detection (others added later) |
| NMS at inference | Yes | Yes | No | Yes | Yes |
| FlashAttention dependency | No | No | No | No | Optional (recommended for speed) |
| 2026 default pick for | Legacy migration | Edge w/ smallest model | Lowest-latency real-time | Default new project | Highest accuracy on NVIDIA GPU |
Figure 2. Five-version production comparison. The numbers are from Ultralytics' official model cards and the respective papers; latency varies with batch size, image resolution, and TensorRT version, so treat the relative ordering as more meaningful than the absolute values.
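For teams that prefer to filter rather than read a table, three of the rows can be encoded as data. The sketch below is only a convenience over the table values: the dictionary layout and the `shortlist` helper are hypothetical, and YOLOv12 is recorded as detection-only because the table lists its other tasks as later additions.

```python
# Values copied from the comparison table (nano mAP, task count, NMS row).
VERSIONS = {
    "YOLOv8":  {"nano_map": 37.3, "tasks": 5, "nms_free": False},
    "YOLOv9":  {"nano_map": 38.3, "tasks": 1, "nms_free": False},
    "YOLOv10": {"nano_map": 38.5, "tasks": 1, "nms_free": True},
    "YOLOv11": {"nano_map": 39.5, "tasks": 5, "nms_free": False},
    "YOLOv12": {"nano_map": 40.6, "tasks": 1, "nms_free": False},
}


def shortlist(min_nano_map=0.0, need_multitask=False, need_nms_free=False):
    """Versions that satisfy every stated requirement, in release order."""
    picks = []
    for name, row in VERSIONS.items():
        if row["nano_map"] < min_nano_map:
            continue
        if need_multitask and row["tasks"] < 2:
            continue
        if need_nms_free and not row["nms_free"]:
            continue
        picks.append(name)
    return picks


if __name__ == "__main__":
    print(shortlist(need_multitask=True))  # ['YOLOv8', 'YOLOv11']
    print(shortlist(need_nms_free=True))   # ['YOLOv10']
    print(shortlist(min_nano_map=39.0))    # ['YOLOv11', 'YOLOv12']
```

An empty shortlist is a useful signal in its own right: it means the requirements conflict and one of them has to give.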
A Worked Cost Example — 200 Cameras, Three Versions
To make the comparison concrete, picture a retail loss-prevention project: 200 cameras across 10 stores, 1080p H.264 at 30 fps, person and bag detection running at 6 fps per camera (the slowest cadence that catches a 4-second shoplifting event, per the analysis in lesson 2.1). The workload is 200 × 6 = 1,200 inferences per second across the entire fleet.
On an NVIDIA L4 GPU ($1.10 per hour on AWS in 2026), a YOLOv8s model with TensorRT INT8 quantisation processes roughly 350 inferences per second per GPU at 1080p. The fleet needs ⌈1,200 / 350⌉ = 4 GPUs, costing 4 × $1.10 × 24 × 30 = $3,168 per month. Reserve pricing brings this to roughly $2,000 per month.
The same fleet on YOLOv11s — same workflow, same TensorRT INT8 pipeline — processes roughly 480 inferences per second per GPU because of the C3k2 backbone's lower FLOPs at higher accuracy. The fleet needs ⌈1,200 / 480⌉ = 3 GPUs, costing 3 × $1.10 × 24 × 30 = $2,376 per month. Reserve pricing: ~$1,500.
The same fleet on YOLOv12s with FlashAttention enabled on the L4 GPUs processes roughly 530 inferences per second per GPU, and the fleet needs ⌈1,200 / 530⌉ = 3 GPUs at the same monthly cost. The accuracy difference is the buy: YOLOv12s scores 48.0% mAP versus YOLOv11s at 47.0% versus YOLOv8s at 44.9%. That gap should translate into fewer false negatives (missed shoplifters) at the same false-positive rate, although COCO mAP does not map one-to-one onto a miss rate, so the size of the gain has to be measured on the customer's own validation footage.
The implication for the product decision is direct. The 200-camera retail deployment saves around $800 per month moving from v8 to v11, and zero dollars per month moving from v11 to v12 — but it gains a measurable accuracy improvement, which on a loss-prevention KPI is the dimension the customer pays for. The right choice for this project is YOLOv12s; the right choice for a smaller deployment with no L4 GPUs (and therefore no FlashAttention) is YOLOv11s.
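The arithmetic in this worked example is easy to rerun with your own camera count and GPU price. The sketch below reproduces the on-demand figures from the text; the function name is hypothetical, and the throughput numbers are the article's rough estimates rather than measured benchmarks.

```python
import math


def monthly_gpu_cost(cameras, fps_per_camera, inferences_per_gpu,
                     usd_per_gpu_hour, hours_per_month=24 * 30):
    """Return (gpu_count, on-demand monthly cost in USD) for a camera fleet."""
    load = cameras * fps_per_camera               # inferences per second
    gpus = math.ceil(load / inferences_per_gpu)   # whole GPUs only
    return gpus, round(gpus * usd_per_gpu_hour * hours_per_month, 2)


if __name__ == "__main__":
    fleet = dict(cameras=200, fps_per_camera=6, usd_per_gpu_hour=1.10)
    print(monthly_gpu_cost(inferences_per_gpu=350, **fleet))  # (4, 3168.0)  YOLOv8s
    print(monthly_gpu_cost(inferences_per_gpu=480, **fleet))  # (3, 2376.0)  YOLOv11s
    print(monthly_gpu_cost(inferences_per_gpu=530, **fleet))  # (3, 2376.0)  YOLOv12s
```

Because GPUs come in whole units, throughput gains only save money when they cross a ceiling boundary: moving from 480 to 530 inferences per second changes nothing here, while reaching 600 would drop the fleet to two GPUs.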
The Common Pitfalls
The five-version progression hides four pitfalls that catch product teams every quarter, and the cost of falling into them is measured in weeks of engineering rework.
The first pitfall is picking the newest version by default. YOLOv12 is the newest detector in the v8–v12 lineage and YOLO26 is the newest detector overall, but neither is automatically the right answer. YOLOv12 needs FlashAttention-capable NVIDIA GPUs to deliver its speed advantage; on a deployment target without one — Apple Silicon, AMD GPUs, Intel CPUs, older Jetson devices — YOLOv11 is faster. YOLO26's CPU advantage is real but the product is still maturing (released September 2025), so teams that need the largest model checkpoint catalogue and the most mature export pipeline still pick YOLOv11. Newest-by-default is the most common single mistake.
The second pitfall is ignoring the AGPL-3.0 licence until export day. Every Ultralytics-distributed YOLO release (v5, v8, v11, v12 via the ultralytics package, and YOLO26) is dual-licensed AGPL-3.0 / enterprise. AGPL-3.0 is a copyleft licence that, in the Ultralytics interpretation, obliges any organisation that deploys a YOLO model in a network-accessible service to release the full source of that service under AGPL-3.0. For a closed-source commercial product, that is a non-starter. The fix is the Ultralytics enterprise licence, which costs in the low five figures per year for a startup and scales up from there. Talk to your legal team in week one of the project, not week twelve.
The third pitfall is conflating "we use YOLO" with "we use COCO classes". The Ultralytics pretrained checkpoints (yolov8n.pt, yolo11s.pt, yolo12m.pt) were trained on the COCO dataset, which has 80 object classes (person, car, bottle, dog, traffic light, and so on). For a retail or surveillance or fitness deployment, almost none of the customer's classes are in COCO. The pretrained model is a starting point, not the product. The team has to label a custom dataset (typically 500 to 5,000 images per class), fine-tune the model for 50 to 250 epochs, validate on a held-out test set, and re-export. Skipping the fine-tuning step is the second most common cause of "the model works in the demo but fails in production".
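The label, fine-tune, validate loop described above looks roughly like this with the ultralytics package. Treat it as a sketch under stated assumptions: the dataset root, the class names, and both helper functions are hypothetical, the epoch count is just a value inside the 50-to-250 range mentioned above, and the `ultralytics` import is deferred so the YAML helper can be used without the package installed.

```python
def dataset_yaml(root, class_names):
    """Build an Ultralytics-style detection dataset YAML as a string."""
    lines = [
        f"path: {root}",
        "train: images/train",
        "val: images/val",
        "test: images/test",
        "names:",
    ]
    lines += [f"  {index}: {name}" for index, name in enumerate(class_names)]
    return "\n".join(lines) + "\n"


def fine_tune(data_yaml_path, weights="yolo11s.pt", epochs=100, imgsz=640):
    """Fine-tune a pretrained checkpoint and return test-split mAP50-95."""
    from ultralytics import YOLO  # deferred import: needs `pip install ultralytics`

    model = YOLO(weights)                                   # COCO-pretrained start
    model.train(data=data_yaml_path, epochs=epochs, imgsz=imgsz)
    metrics = model.val(data=data_yaml_path, split="test")  # held-out evaluation
    return metrics.box.map


if __name__ == "__main__":
    # Hypothetical retail classes that are not part of COCO.
    print(dataset_yaml("datasets/retail", ["shopping_bag", "staff_uniform"]))
```

Keeping the test split out of training and hyperparameter tuning is the part that makes the final mAP number trustworthy enough to put in front of a customer.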
The fourth pitfall is shipping FP32 to the production runtime. The Ultralytics package exports models as full-precision FP32 by default, which is correct for evaluation but pessimistic for deployment. INT8 quantisation through TensorRT (NVIDIA), OpenVINO (Intel), or CoreML (Apple) typically delivers 2 to 4× inference speedup at less than 1% mAP loss, which at fleet scale can be the difference between two GPUs and four (the cost example above already assumes INT8). Calibrate the quantiser on a representative sample of production frames, measure the mAP drop on the validation set, and ship the quantised model unless the drop is unacceptable. The Ultralytics export pipeline supports INT8 in every major target runtime.
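The "measure the drop, then decide" rule is worth writing down as an explicit gate in the release pipeline. The sketch below is one way to do that; the function name and the default tolerance of 1.0 mAP point are assumptions for illustration, and the right tolerance depends on the product's accuracy floor.

```python
def quantisation_gate(fp32_map, int8_map, max_drop=1.0):
    """Decide whether the INT8 model may ship.

    Both mAP values are percentages measured on the same held-out set.
    Returns (ship_int8, drop_in_points).
    """
    drop = round(fp32_map - int8_map, 2)
    return drop <= max_drop, drop


if __name__ == "__main__":
    print(quantisation_gate(47.0, 46.4))  # (True, 0.6)  -> ship INT8
    print(quantisation_gate(47.0, 43.5))  # (False, 3.5) -> investigate calibration
```

A failed gate more often points at the calibration data or the pre-processing path than at quantisation itself, so the follow-up is usually a recalibration rather than a retreat to FP32.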
The Deployment Story — Export Targets In 2026
YOLO's reputation in production rests as much on the export pipeline as on the model itself. The Ultralytics package supports a deep list of export targets, each with its own use case, and the decision matrix is part of every project.
| Target runtime | Hardware | Typical speedup vs PyTorch | Use case |
|---|---|---|---|
| ONNX (CPU) | Any CPU | 1.5 to 3× | Cross-platform fallback, lightweight server inference |
| TensorRT | NVIDIA GPU | 3 to 5× | Cloud GPU inference, NVIDIA Jetson edge, DeepStream pipelines |
| OpenVINO | Intel CPU / iGPU | 2 to 4× on CPU | Intel-based edge devices, factory cameras, on-prem servers |
| CoreML | Apple Silicon (M-series, A-series) | 2 to 4× | iOS apps, macOS apps, Vision Pro |
| TFLite (INT8) | Android, edge SoCs | 2 to 5× | Mobile, IoT, microcontrollers |
| TensorFlow.js | Browser (WebGL / WebGPU) | 1 to 2× | Client-side inference in web apps |
Figure 3. Export target matrix for Ultralytics YOLO. The Ultralytics package wraps every target into a single model.export() call, but the per-target tuning (INT8 calibration set, batch size, precision) is the engineering work that determines whether the deployment hits its budget.
The 2026 default for a cloud GPU deployment is TensorRT with INT8 quantisation; the 2026 default for an iOS app is CoreML with FP16; the 2026 default for an Android app is TFLite with INT8; the 2026 default for an Intel-based edge device is OpenVINO with INT8; and the 2026 default for a fallback target is ONNX on the host CPU. For a cross-platform product that ships in all five runtimes — common in B2B vision products — the Ultralytics package automates most of the tuning, but the validation step (measuring per-runtime accuracy on a held-out test set) is non-negotiable. The runtime that "looks correct in CI" can still ship a model that drifts 3 to 5 mAP from the PyTorch baseline because of a quantisation calibration mistake or a pixel-format mismatch.
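The five defaults above map onto arguments of the Ultralytics `model.export()` call. The sketch below encodes that mapping; the target keys, the helper names, and the calibration file are hypothetical, while `format`, `int8`, `half`, and `data` are export arguments documented by Ultralytics. The package import is deferred so the plan itself can be inspected without it installed.

```python
# Article defaults: runtime and precision per deployment target.
EXPORT_PLAN = {
    "cloud_gpu":    {"format": "engine",   "int8": True},  # TensorRT INT8
    "ios":          {"format": "coreml",   "half": True},  # CoreML FP16
    "android":      {"format": "tflite",   "int8": True},  # TFLite INT8
    "intel_edge":   {"format": "openvino", "int8": True},  # OpenVINO INT8
    "cpu_fallback": {"format": "onnx"},                    # ONNX on host CPU
}


def export_kwargs(target, calibration_yaml=None):
    """Keyword arguments for model.export() for one deployment target."""
    kwargs = dict(EXPORT_PLAN[target])
    if kwargs.get("int8") and calibration_yaml is not None:
        kwargs["data"] = calibration_yaml  # dataset used for INT8 calibration
    return kwargs


def export_for(target, weights, calibration_yaml=None):
    """Export one checkpoint for one target; returns the exported file path."""
    from ultralytics import YOLO  # deferred import: needs `pip install ultralytics`

    return YOLO(weights).export(**export_kwargs(target, calibration_yaml))


if __name__ == "__main__":
    for target in EXPORT_PLAN:
        print(target, export_kwargs(target, calibration_yaml="retail.yaml"))
```

Each exported artefact still needs its own accuracy run on the held-out test set; the mapping only removes the bookkeeping, not the validation step.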
Where Fora Soft Fits In
We have shipped YOLO-based detection pipelines in video surveillance, OTT moderation, telemedicine, and fitness products since YOLOv4, and we've migrated client codebases through every version in this article. The pattern that holds across projects: the version decision is rarely the bottleneck. The pre-processing pipeline (lesson 2.1), the dataset labelling discipline (rare in CV courses, decisive in production), the per-runtime quantisation calibration, and the AGPL-3.0 licence conversation account for nine of the ten engineering hours that ship a YOLO model. If your team is picking between v11 and v12 in week one and has not yet labelled the first 500 images, we would politely suggest reordering the work plan.
References
- Ultralytics. YOLOv8 — Official Model Documentation. Ultralytics Docs, accessed 2026-05-24. https://docs.ultralytics.com/models/yolov8
- Ultralytics. YOLO11 — Official Model Documentation. Ultralytics Docs, accessed 2026-05-24. https://docs.ultralytics.com/models/yolo11
- Ultralytics. YOLO12: Attention-Centric Object Detection — Official Model Documentation. Ultralytics Docs, accessed 2026-05-24. https://docs.ultralytics.com/models/yolo12
- Wang, A., Chen, H., Liu, L. et al. YOLOv10: Real-Time End-to-End Object Detection. arXiv:2405.14458, May 2024. Accepted to NeurIPS 2024. https://arxiv.org/abs/2405.14458
- Wang, C.-Y., Yeh, I-H., Liao, H.-Y. M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. arXiv:2402.13616, February 2024. Accepted to ECCV 2024. https://arxiv.org/abs/2402.13616
- Tian, Y., Ye, Q., Doermann, D. YOLOv12: Attention-Centric Real-Time Object Detectors. arXiv:2502.12524, February 2025. Accepted to NeurIPS 2025 (poster). https://arxiv.org/abs/2502.12524
- Khanam, R., Hussain, M. YOLOv11: An Overview of the Key Architectural Enhancements. arXiv:2410.17725, October 2024. https://arxiv.org/abs/2410.17725
- Ultralytics. YOLO License — AGPL-3.0 and Enterprise. Ultralytics, accessed 2026-05-24. https://www.ultralytics.com/license
- Ultralytics. Model Export with Ultralytics YOLO — ONNX, TensorRT, OpenVINO, CoreML, TFLite. Ultralytics Docs, accessed 2026-05-24. https://docs.ultralytics.com/modes/export
- Roboflow. How to Train a YOLOv12 Object Detection Model on a Custom Dataset. Roboflow Blog, March 2025. https://blog.roboflow.com/train-yolov12-model/
- Ultralytics. Quick Start Guide: NVIDIA Jetson with Ultralytics YOLO — DeepStream Integration. Ultralytics Docs, accessed 2026-05-24. https://docs.ultralytics.com/guides/nvidia-jetson
- Hidalgo, R. et al. Ultralytics YOLO Evolution: An Overview of YOLO26, YOLO11, YOLOv8, and YOLOv5 Object Detectors. arXiv:2510.09653, October 2025. https://arxiv.org/abs/2510.09653


