Open-Vocabulary Detection — Grounding DINO, Florence-2, OWLv2, RT-DETR, RF-DETR

Why This Matters

If you are about to ship a feature where the list of objects you need to find is not known at training time, open-vocabulary detection is what makes the feature possible. Common video product scenarios include retail loss prevention where every store has its own list of "things to track on shelf", surveillance systems where the security team types in a new alert ("orange traffic cone in the parking lot") and expects the system to start watching for it within a minute, telemedicine triage where the doctor wants the patient to "show me the pill bottle and the dose label" and the system has to find the dose label without ever having been trained on that brand, and OTT/UGC moderation where the policy list changes faster than any retraining cadence can keep up. This article assumes you have already read lesson 2.2 on the YOLO production lineage — because the decision in 2026 is almost always "do I extend my YOLO pipeline or replace it with an open-vocabulary model" — and lesson 1.2 on latency and deployment topology, because most of the open-vocabulary models are too heavy to run on edge and have to live in the cloud or on a beefy server. By the end you will be able to read a competitor's "AI-powered search" claim, map it to the five models below, and tell your engineering team which one fits your product and why.

What "Open-Vocabulary" Actually Means In 2026

The phrase "open-vocabulary detection" gets used loosely in vendor marketing and slides, and the confusion costs project time. Three things sit on a spectrum from least to most flexible, and the differences are sharp.

Closed-vocabulary detection is what every YOLO version does. The model has a fixed list of class names baked into the final classification layer at training time — 80 for the standard COCO checkpoints, more or fewer for custom-trained models. At inference time, you cannot ask the model for a class it was not trained on. If you need to detect a new class, you retrain. This is fast, accurate, and well-understood; it is what almost every shipped video product in 2026 still uses for its main detection workload.

Zero-shot detection is the term used in academic papers for a model that can detect classes it was never explicitly trained on, by transferring from a related vision-language pretraining. The model still expects a discrete list of class names at inference time, but the list can be supplied as text and does not need to match the training labels. OWL-ViT and OWLv2 from Google sit here. The "zero" in "zero-shot" means zero examples of the target class were seen in the labelled detection training set, not that the model had zero exposure to the visual concept during pretraining.

Open-vocabulary detection is the broadest framing. The model takes free-form natural language — a category name like "yellow vest", a phrase like "person carrying a backpack", or even a referring expression like "the leftmost car" — and produces bounding boxes for whatever the text described. The contract is the prompt, not a class list. Grounding DINO, Florence-2 (when run in detection mode), and YOLO-World sit here. In practice, most engineering teams use "open-vocabulary" as a catch-all that includes zero-shot, because the deployment story is the same: text in, boxes out, no retraining required.

The four other things you will hear and should not confuse with open-vocabulary detection: prompt-tuning (small text-conditioned adapter on a closed-vocabulary model — still needs retraining for major changes), referring expression comprehension (find the one specific instance described by the prompt, not all instances of a category), phrase grounding (associate every noun phrase in a sentence with a box, a richer task than detection), and visual question answering (the model answers questions about the image without producing boxes — useful for description, useless for detection).

For a product team, the practical implication is this: if your roadmap includes "let users type what they want the camera to look for", or "let the security team add new alert categories without retraining", or "support 200 customer-specific class lists without 200 retraining jobs", you are buying open-vocabulary detection. The five models below are the realistic choices.

[Figure: decision diagram comparing closed-vocabulary detection, zero-shot detection, and open-vocabulary detection, with the prompting interface, retraining requirement, and example models for each.]
Figure 1. Closed-vocabulary, zero-shot, and open-vocabulary detection sit on a spectrum from fixed-class-list to free-form-text input. The right-most column is the one that supports "the user types what to find".

Grounding DINO — The Accuracy Ceiling

Grounding DINO is the model that turned open-vocabulary detection from a research demo into a production option. It was introduced in March 2023 by Shilong Liu and collaborators from IDEA Research, published as the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection" (arXiv:2303.05499), and later accepted to ECCV 2024. The open-source code lives at IDEA-Research/GroundingDINO on GitHub.

The architectural idea is simple to state and complicated to ship. Grounding DINO took DINO — a strong closed-vocabulary transformer detector from the same group — and fused it with text at three different stages of the network instead of only at the final classification head. The image is processed by a Swin Transformer backbone that extracts multi-scale visual features, the text prompt is processed by a BERT text encoder (with a 256-token maximum), and then the two streams are fused through three mechanisms: a feature enhancer that runs deformable self-attention on the image and vanilla self-attention on the text plus cross-attention in both directions, a language-guided query selection module that picks the image-region tokens most relevant to the text, and a cross-modality decoder that produces the final boxes conditioned on both streams. Tight fusion at three stages is what gave Grounding DINO its accuracy advantage; previous open-vocabulary detectors fused text only at the head, which left the early visual layers blind to the prompt.
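The language-guided query selection step can be illustrated with a toy example. The sketch below is not the IDEA Research implementation: it assumes image and text tokens already live in a shared feature space, scores each image token by its best dot-product match against any text token, and keeps the top-scoring tokens as decoder queries. The function names, the 2-D feature vectors, and the `num_queries` value are hypothetical.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def select_queries(
    image_tokens: List[Sequence[float]],
    text_tokens: List[Sequence[float]],
    num_queries: int,
) -> List[int]:
    """Return indices of the image tokens most relevant to the text.

    Each image token is scored by its maximum similarity to any text token;
    the indices of the highest-scoring tokens come back in descending order.
    """
    scores = [
        max(dot(img, txt) for txt in text_tokens)
        for img in image_tokens
    ]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:num_queries]


# Toy input: four image tokens, two text tokens, 2-D features (all made up).
image_tokens = [(0.9, 0.1), (0.1, 0.2), (0.2, 0.8), (0.0, 0.0)]
text_tokens = [(1.0, 0.0), (0.0, 1.0)]
selected = select_queries(image_tokens, text_tokens, num_queries=2)
```

Tokens 0 and 2 are kept because each aligns strongly with one of the text tokens; the other two never become decoder queries, which is the sense in which the prompt steers where the decoder looks.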

The accuracy numbers are why the model became a reference. The original Grounding DINO scored 52.5 AP on the COCO zero-shot transfer benchmark without ever having seen a COCO training image; at the time, the best closed-vocabulary detector trained on COCO directly scored around 58 AP. Landing within roughly six points of a fully supervised detector while remaining zero-shot was a significant result. On LVIS, a harder benchmark with roughly 1,200 classes including many rare ones, Grounding DINO scored well above prior zero-shot baselines, with the larger Swin-L variant outperforming every prior open-set detector.

The version evolution matters for a 2026 product team. IDEA Research released Grounding DINO 1.5 in May 2024 in two flavours: 1.5 Pro (a larger ViT-L backbone with deep early-fusion, trained on more than 20 million grounding-annotated images and aimed at the accuracy ceiling) and 1.5 Edge (a slimmer model with an EfficientViT-L1 backbone, single-scale early fusion, and a feature enhancer that cross-attends only to the deepest image features, designed to run on edge accelerators). Grounding DINO 1.6 Pro followed in early 2025 and pushed the state of the art further: 55.4 AP on COCO zero-shot transfer, 57.7 AP on LVIS-minival, and 51.1 AP on LVIS-val, with significant improvements on LVIS rare classes. Those are the categories with very few training examples, which is precisely the case industrial users care about, because the objects they most need to detect are often the ones with the least training data. DINO-X, a 2024 follow-up that unifies detection, segmentation, and pose under a single open-world vision model, belongs to the same lineage and extends the family.

The headline weakness of Grounding DINO is latency. The original model takes roughly 100–200 ms per image on an A100 GPU at standard input resolution, depending on the variant and the prompt length, which is fine for cloud APIs and search workflows but too slow for a 30-fps real-time pipeline. The Edge variant gets that number down to around 21.7 ms per image on a Jetson Orin Nano in INT8, which is roughly real-time for a single stream but still too heavy for multi-camera deployments. The second weakness is licence: the original Grounding DINO open-source code is under the Apache 2.0 licence and the model weights are released for research use; the 1.5 / 1.6 Pro models are accessed through IDEA Research's paid API, which is the correct choice for accuracy-bound use cases but is a different commercial story from running a checkpoint locally.
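The latency argument reduces to simple arithmetic, sketched below. The sketch assumes frames are processed one at a time on a single device, with no batching and no pre- or post-processing overhead; the helper names are hypothetical and the latency figures are the ones quoted in the paragraph above.

```python
import math


def frame_budget_ms(fps: float) -> float:
    """Time available per frame at a given frame rate, in milliseconds."""
    return 1000.0 / fps


def streams_per_device(latency_ms: float, fps: float) -> int:
    """Number of full-rate streams one device can serve with serial inference."""
    return math.floor(frame_budget_ms(fps) / latency_ms)


budget = frame_budget_ms(30)                   # about 33.3 ms per frame at 30 fps
edge_streams = streams_per_device(21.7, 30)    # Edge figure quoted above
base_streams = streams_per_device(100.0, 30)   # lower end of the 100-200 ms range
```

Under these assumptions the Edge variant fits one 30-fps stream per device and the original model fits none, which matches the "single stream, not multi-camera" reading above.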

In 2026, Grounding DINO is the right pick when your product can either run inference in the cloud (search, batch processing, asynchronous moderation) or accept the Edge variant's latency on a single-stream device. It is also the model that sets the accuracy bar every other open-vocabulary detector is measured against: when you read a competitor's "we built our own zero-shot detector" claim, the next question is "how does it compare to Grounding DINO 1.6 Pro on LVIS-rare?"

Florence-2 — The Multi-Task Generalist

Florence-2 was introduced by Microsoft Research at CVPR 2024 as an open-source vision-language foundation model. The release made one engineering bet: instead of shipping one model for detection, one for segmentation, one for captioning, and one for OCR, Microsoft built a single sequence-to-sequence transformer that handles every task through a unified prompt-based interface. You ask it for object detection with <OD>, dense captioning with <DENSE_REGION_CAPTION>, visual grounding with <CAPTION_TO_PHRASE_GROUNDING>, OCR with <OCR>, and a dozen other tasks, all through the same model and the same API surface.
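A small sketch of what the unified prompt interface looks like from calling code. It assumes the conventions shown on the Florence-2 model card at the time of writing: the prompt is the task token, optionally followed by free text, and a post-processed detection result is a dictionary keyed by the task token with `bboxes` and `labels` lists. The model call itself is omitted, the helper names are hypothetical, and the example result is hand-written rather than real model output, so verify the layout against the model card before relying on it.

```python
from typing import Dict, List, Optional

# Task tokens named in the text above; the dictionary keys are our own labels.
TASK_TOKENS = {
    "detect": "<OD>",
    "dense_caption": "<DENSE_REGION_CAPTION>",
    "ground": "<CAPTION_TO_PHRASE_GROUNDING>",
    "ocr": "<OCR>",
}


def build_prompt(task: str, text: Optional[str] = None) -> str:
    """Compose the prompt: task token, plus free text for tasks that take it."""
    token = TASK_TOKENS[task]
    return token if text is None else token + text


def flatten_boxes(parsed: Dict[str, dict], task_token: str) -> List[dict]:
    """Turn {'<TOKEN>': {'bboxes': [...], 'labels': [...]}} into a flat list."""
    result = parsed[task_token]
    return [
        {"label": label, "box_xyxy": tuple(box)}
        for box, label in zip(result["bboxes"], result["labels"])
    ]


# Hand-written stand-in for a post-processed result (assumed layout).
example = {"<OD>": {"bboxes": [[10, 20, 110, 220]], "labels": ["pill bottle"]}}
detections = flatten_boxes(example, TASK_TOKENS["detect"])
grounding_prompt = build_prompt("ground", "the dose label on the bottle")
```

The point of the design is that switching task changes only the prompt string, not the model, the weights, or the serving code around it.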

The model comes in two sizes — Florence-2-base at roughly 230 million parameters and Florence-2-large at roughly 770 million — and was trained on FLD-5B, a large dataset of approximately 126 million images with 5.4 billion annotations spanning the various supported tasks. The unified training is what gives Florence-2 its multi-task footprint at a low parameter count compared to single-task specialists.

The detection numbers are not the headline. On COCO zero-shot transfer, Florence-2 scores roughly 34.7 mAP, well below the 50+ AP that specialist detectors like UNINEXT or Grounding DINO achieve. The Microsoft team has been explicit about this trade-off: Florence-2 is a generalist, and for production systems that demand state-of-the-art detection accuracy, the right move is to fine-tune Florence-2 on the target domain or use a specialist detector. The Florence-2 paper showed that fine-tuning on a downstream task closes the gap quickly — Florence-2-large fine-tuned on COCO matches or beats DINO on detection while remaining a multi-task model after the fine-tune.

The reason a product team should still consider Florence-2 is efficiency and breadth. The model needs only about 862 MB of VRAM on an RTX A5000 to run inference, which is small enough to fit alongside other models on a shared GPU, and the unified prompt interface means one model can replace four (detection, OCR, dense captioning, and visual grounding) in a single deployment. For a video product that needs all four (an OTT moderation pipeline that detects objects, reads on-screen text, generates dense captions for accessibility, and grounds specific phrases for compliance), Florence-2 collapses four models' worth of integration, monitoring, and inference cost into one.

The licence is unusually friendly for a production team: Florence-2 is released under the MIT licence on Hugging Face (microsoft/Florence-2-base, microsoft/Florence-2-large, plus the -ft fine-tuned variants), which means commercial deployment without any contractual entanglement. For teams comparing licence costs, this puts Florence-2 in a different commercial bracket from any Ultralytics-distributed YOLO or any Grounding DINO 1.5 / 1.6 Pro API.

In 2026, Florence-2 is the right pick when your product needs more than just detection and your accuracy bar is "good enough" rather than "state of the art". OTT moderation pipelines, accessibility captioning systems, document-AI workflows over video frames, and any multi-task feature where the cost of operating four separate models would exceed the cost of one fine-tune are the canonical use cases.

OWLv2 — The Google Zero-Shot Baseline

OWLv2 (short for Open-World Localization, version 2) was proposed by Matthias Minderer, Alexey Gritsenko, and Neil Houlsby in the paper "Scaling Open-Vocabulary Object Detection". OWLv2 is the successor to OWL-ViT — Google's original 2022 open-vocabulary detector — and the version most teams should reach for if their pretraining philosophy aligns with Google's CLIP-and-pseudo-labels approach.

The architecture is straightforward. OWLv2 uses CLIP as its multi-modal backbone, with a ViT-style transformer encoding the image and a causal language model encoding the text. The image transformer's final token-pooling layer is removed, and instead a lightweight classification head and box head are attached to each output token. Open-vocabulary classification works by replacing the fixed final classifier weights with text-embedding vectors derived from the query strings; the model "knows" a class when the text embedding of its name matches the visual features of a region. OWLv2 added an objectness classifier on top of OWL-ViT — a head that predicts the likelihood that a box contains any object — which improved precision by suppressing the high-recall noise the earlier model produced.
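The "text embeddings as classifier weights" idea can be shown with a toy example. This is a simplified sketch, not OWLv2's actual head: the real model uses learned projections and a learned logit scale, and how the objectness score is applied here (as a hard gate before classification) is an assumption made for illustration. All names, vectors, and thresholds are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def normalise(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalise(a), normalise(b)))


def classify_regions(
    region_embeddings: List[Sequence[float]],
    text_embeddings: List[Sequence[float]],
    objectness: List[float],
    min_objectness: float,
) -> List[Tuple[int, int, float]]:
    """Return (region index, best query index, similarity) for object-like regions."""
    results = []
    for i, (region, obj) in enumerate(zip(region_embeddings, objectness)):
        if obj < min_objectness:
            continue  # suppress regions unlikely to contain any object
        sims = [cosine(region, text) for text in text_embeddings]
        best = max(range(len(sims)), key=lambda q: sims[q])
        results.append((i, best, sims[best]))
    return results


# Toy data: three region tokens, two text queries, 2-D embeddings.
regions = [(1.0, 0.0), (0.0, 2.0), (1.0, 1.0)]
queries = [(2.0, 0.0), (0.0, 1.0)]
matches = classify_regions(regions, queries, objectness=[0.9, 0.8, 0.1],
                           min_objectness=0.5)
```

Adding a class means adding one more text embedding to `queries`; nothing in the image branch changes, which is why no retraining is needed.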

The training strategy is the second thing that makes OWLv2 distinct. The model was trained on a web-scale dataset of over one billion image-text examples, with OWL-ViT v1 used to generate pseudo-labels on the larger corpus — the "scaling open-vocabulary object detection" of the paper title is literal. The result is a model that has seen a much larger range of visual concepts than any detection-only training set could contain.

The Hugging Face hub hosts five distributable variants: google/owlv2-base-patch16 and google/owlv2-large-patch14 (the standard zero-shot models), google/owlv2-base-patch16-ensemble and google/owlv2-large-patch14-ensemble (ensemble training for higher accuracy), and google/owlv2-base-patch16-finetuned (fine-tuned for specific downstream improvements). The Apache 2.0 licence applies to all of them.

The weakness of OWLv2 in 2026 is that the accuracy ceiling is lower than Grounding DINO's. Where Grounding DINO 1.6 Pro reaches 55.4 AP on COCO zero-shot, OWLv2 sits at roughly 30–40 AP depending on the variant. For projects where the absolute accuracy is the constraint, the choice falls to Grounding DINO. The second weakness is the user experience: OWLv2's prompting is a list of strings (one per class), not a free-form phrase, so the "describe what you want" UX that Grounding DINO supports does not work the same way out of the box.
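The difference in prompting style is easiest to see side by side. The sketch below only builds the text inputs; it assumes the commonly used conventions of one query string per class for OWLv2 (the "a photo of a ..." template is a convention, not a requirement) and a single lowercase caption with dot-separated phrases for Grounding DINO. The helper names are hypothetical.

```python
from typing import List


def owlv2_queries(class_names: List[str]) -> List[str]:
    """One text query per class name."""
    return [f"a photo of a {name.strip().lower()}" for name in class_names]


def grounding_dino_caption(phrases: List[str]) -> str:
    """Single lowercase caption: phrases separated by ' . ' with a trailing dot."""
    cleaned = [p.strip().lower().rstrip(".") for p in phrases if p.strip()]
    return " . ".join(cleaned) + " ."


classes = ["Traffic cone", "Yellow vest"]
queries = owlv2_queries(classes)
caption = grounding_dino_caption(classes + ["person carrying a backpack"])
```

With OWLv2 every box comes back tagged with the index of one query in the list; with Grounding DINO the box is tied to a span of the caption, which is what lets longer descriptive phrases work.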

In 2026, OWLv2 is the right pick when you need a permissively-licensed zero-shot detector that runs faster than Grounding DINO at the cost of some accuracy, when you prefer Google's pretraining philosophy, or when your prompts are short noun-phrase class names rather than referring expressions. It is also the cleanest model to fine-tune on a domain — the architecture is simpler than Grounding DINO's three-stage fusion, and the training pipeline is more accessible.

RT-DETR — The Real-Time End-to-End Transformer (Closed-Vocabulary But Adjacent)

RT-DETR sits next to the open-vocabulary models in this article rather than inside them, but it is impossible to evaluate the open-vocabulary stack in 2026 without naming RT-DETR, because it is the model that proved transformer-based detection could run in real time at all. Before RT-DETR, every team comparing detection options had a hard split: convolutional detectors (YOLO family) were fast, transformer detectors (DETR family) were accurate-or-flexible but slow. RT-DETR collapsed the split.

The original RT-DETR was introduced in 2023 as the first real-time end-to-end transformer-based object detector. The architecture used an efficient hybrid encoder, IoU-aware query selection, and a multi-scale feature fusion module to deliver DETR-level accuracy at YOLO-level speed. RT-DETRv2 (mid-2024) added a "bag of freebies" — incremental training-time and inference-time improvements — to push accuracy further without changing the deployment story. RT-DETRv3 (WACV 2025 Oral, paper arXiv:2409.08475) tackled the supervision-density mismatch in transformer-based detection: DETR-family models use one-to-one bipartite matching during training, which gives each ground-truth object exactly one positive sample, while convolutional detectors use one-to-many anchor-based label assignment, which gives the model more supervision signal per training step. RT-DETRv3 introduced hierarchical dense positive supervision to recover that signal without breaking the end-to-end property.
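The supervision-density mismatch can be made concrete with a toy comparison of the two assignment styles. This is an illustrative sketch only: real DETR-family trainers solve the matching with the Hungarian algorithm over a cost that also includes classification and box-regression terms, and real convolutional detectors use more elaborate assignment rules than a single IoU threshold. The brute-force search, the IoU-only cost, and the 0.5 threshold are simplifying assumptions.

```python
from itertools import permutations
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def one_to_one(gt: List[Box], preds: List[Box]) -> Dict[int, int]:
    """Bipartite matching by exhaustive search (toy sizes only).

    Each ground-truth box receives exactly one prediction, and no prediction
    is used twice. Requires len(preds) >= len(gt).
    """
    best_cost, best = float("inf"), ()
    for perm in permutations(range(len(preds)), len(gt)):
        cost = sum(1.0 - iou(g, preds[p]) for g, p in zip(gt, perm))
        if cost < best_cost:
            best_cost, best = cost, perm
    return dict(enumerate(best))


def one_to_many(gt: List[Box], preds: List[Box], min_iou: float = 0.5) -> Dict[int, List[int]]:
    """Every prediction overlapping a ground truth enough counts as a positive."""
    return {
        g_idx: [p_idx for p_idx, p in enumerate(preds) if iou(g, p) >= min_iou]
        for g_idx, g in enumerate(gt)
    }


gt_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
pred_boxes = [(1, 1, 10, 10), (0, 0, 10, 10), (20, 20, 30, 31), (50, 50, 60, 60)]
sparse = one_to_one(gt_boxes, pred_boxes)
dense = one_to_many(gt_boxes, pred_boxes)
```

In this example one-to-one matching yields two positives and one-to-many yields three; the extra positives per object are the additional training signal that dense supervision schemes try to recover.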

The benchmark numbers as of 2025 are striking. RT-DETR-R50 scored 53.1% AP at 108 FPS on a T4 GPU, exceeding YOLOv8-L on both accuracy and speed. RT-DETRv3-R18 (the smallest variant of v3) hit 48.1% AP — 1.6 points above the equivalent RT-DETR-R18 and 1.4 points above RT-DETRv2-R18 — at the same latency. The latest follow-up, RT-DETRv4, introduces a Deep Semantic Injector that pulls semantics from a vision-language foundation model into the deep CNN backbone at training time, raising AP with zero deployment overhead — that is, the v4 model exports to the same ONNX/TensorRT runtime as v3 but with higher accuracy.

The Ultralytics package supports RT-DETR through the same training and export pipeline as YOLO, which means a team that has already invested in the Ultralytics tooling does not have to learn a second toolchain to add transformer detection. When RT-DETR is consumed through Ultralytics, the licence inherits the same dual track (AGPL-3.0 open-source plus enterprise) that the YOLO family uses; the Apache 2.0 entry in the comparison table below refers to the original RT-DETR release rather than the Ultralytics distribution.

The relevance of RT-DETR to an open-vocabulary article is twofold. First, RT-DETR is the closed-vocabulary baseline against which the open-vocabulary models are measured in production — if you already have RT-DETR running at your latency budget and your class list is small enough to retrain when it changes, switching to an open-vocabulary model is a regression on speed and probably on accuracy too. Second, the open-vocabulary lineage and the RT-DETR lineage now intersect: the latest hybrid models extend RT-DETR's real-time transformer architecture with text-conditioned heads, which is how RF-DETR (next section) reaches its 2026 numbers.

RF-DETR — The 2026 Real-Time Detection Transformer

RF-DETR is the model that turned "real-time transformer detection at YOLO speeds" from a research aspiration into a deployment commitment. It was released by Roboflow in early 2025 and accepted to ICLR 2026; the paper "RF-DETR: Neural Architecture Search for Real-Time Detection Transformers" (arXiv:2511.09554) documents the architecture and the search procedure that produced the model lineup.

The architectural foundation is the use of a DINOv2 vision transformer backbone — Meta's self-supervised ViT, which is one of the strongest open vision foundation models in 2026 — combined with a detection decoder optimised by neural architecture search (NAS) for the real-time inference regime. The NAS run searched over backbone width, depth, decoder depth, and feature aggregation structure to find the combination that maximises accuracy at a fixed latency budget. The result is a family of models — RF-DETR-Nano, Small, Medium, Base (29M params), Large (128M params), XLarge, and 2XLarge — that span the same range of edge-to-cloud deployments as the YOLO lineup but at higher accuracy per millisecond.

The benchmark numbers are why the model matters in 2026. RF-DETR-Nano scored 48.0 AP on COCO, beating D-FINE-Nano by 5.3 AP at similar latency — a margin that does not appear in benchmarks for incremental releases and signals an architectural change. RF-DETR-Large reached 56.5 AP at 6.8 ms on an NVIDIA T4 GPU with TensorRT FP16. RF-DETR-2XLarge hit 60.1 AP on COCO — the first real-time model to clear 60 AP at all, a number that had been a closed-vocabulary aspirational ceiling for two years.

The domain-adaptation story is where RF-DETR fits the open-vocabulary discussion. On the RF100-VL benchmark — Roboflow's own diverse-domain benchmark that tests how well a detector transfers from COCO pretraining to 100 different domains including aerial, microscopy, document, and medical imagery — RF-DETR-2XLarge outperformed Grounding DINO-Tiny by 1.2 AP while running 20× as fast. RF-DETR is not itself an open-vocabulary detector (it does not take text prompts as input), but the model is positioned by Roboflow as the closed-vocabulary backbone you fine-tune on a domain after using an open-vocabulary detector to auto-label the training data — the workflow that has become the canonical 2026 pattern for fast custom detector development.

The licensing structure deserves attention. Core models (Nano through Large) and all code are released under the Apache 2.0 licence — among the most permissive of all real-time detector licences, and a material commercial advantage over Ultralytics YOLO's AGPL-3.0. The XLarge and 2XLarge detection models require the rfdetr[plus] package and are provided under PML 1.0, a more restrictive licence for the highest-accuracy variants. The licence split lets a team start with a permissively-licensed Large model and decide whether the additional 3–4 AP of the 2XLarge model is worth the licence upgrade.

In 2026, RF-DETR is the right closed-vocabulary pick for a new project that needs higher accuracy than YOLO can provide at the same latency, that wants Apache 2.0 licensing for the production model, and that has standardised on the Roboflow toolchain (Roboflow Train, Roboflow Workspace, Supervision for annotation). It is also the model that completes the canonical 2026 pipeline: Grounding DINO at the front (auto-label new data with text prompts), RF-DETR at the back (fast closed-vocabulary detector trained on the labelled data).
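The hand-off between the two halves of that pipeline is mostly a data-format step. The sketch below assumes the open-vocabulary detector's output has already been collected into plain dictionaries with a phrase, a score, and a corner-format box, and converts the confident ones into COCO-style annotations (which store boxes as x, y, width, height). The input layout, the helper name, and the 0.4 score threshold are hypothetical.

```python
from typing import Dict, List


def to_coco_annotations(
    detections: List[dict],
    class_ids: Dict[str, int],
    min_score: float = 0.4,
) -> List[dict]:
    """Convert confident open-vocabulary detections into COCO-style annotations."""
    annotations = []
    for det in detections:
        if det["score"] < min_score or det["phrase"] not in class_ids:
            continue  # drop low-confidence boxes and phrases outside the class list
        x1, y1, x2, y2 = det["box_xyxy"]
        width, height = x2 - x1, y2 - y1
        annotations.append({
            "id": len(annotations) + 1,
            "image_id": det["image_id"],
            "category_id": class_ids[det["phrase"]],
            "bbox": [x1, y1, width, height],
            "area": width * height,
            "iscrowd": 0,
        })
    return annotations


# Made-up detections standing in for auto-labeller output.
raw = [
    {"image_id": 1, "phrase": "broken glass", "score": 0.62, "box_xyxy": (100, 200, 180, 260)},
    {"image_id": 1, "phrase": "broken glass", "score": 0.21, "box_xyxy": (0, 0, 10, 10)},
    {"image_id": 2, "phrase": "spilled liquid", "score": 0.55, "box_xyxy": (30, 40, 130, 90)},
]
labels = to_coco_annotations(raw, {"broken glass": 1, "spilled liquid": 2})
```

In practice a human review pass over a sample of the auto-labels is a sensible addition before the first fine-tune, since threshold errors here propagate directly into the trained detector.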

[Figure: pipeline diagram of the 2026 canonical workflow. Grounding DINO auto-labels new training data from text prompts, the labelled data trains a fast RF-DETR closed-vocabulary detector, and the resulting model deploys to production at real-time latency.]
Figure 2. The 2026 canonical workflow for shipping a custom detector. Open-vocabulary detection lives at the front of the pipeline (auto-label) and at the back as a fallback for unknown classes; the production hot path uses a fast closed-vocabulary detector.

A Five-Model Production Comparison

The table below condenses the three open-vocabulary detectors plus RT-DETR and RF-DETR into a single decision view. Read it row by row, picking the row that matches your project's hardest constraint.

Decision axis	Grounding DINO 1.6 Pro	Florence-2	OWLv2	RT-DETRv3	RF-DETR
Type	Open-vocabulary	Multi-task VLM (incl. OVD)	Zero-shot OVD	Closed-vocabulary	Closed-vocabulary
Author	IDEA Research	Microsoft Research	Google Research	Baidu (PaddlePaddle)	Roboflow
First release	Mar 2023 (1.0); Jan 2025 (1.6 Pro)	Jun 2024	Jun 2023 (v1: Jan 2022)	Jul 2023; v3 in 2024	Mar 2025
Backbone	Swin-T / Swin-L / ViT-L	DaViT (B: 230M, L: 770M)	CLIP ViT-B/16, ViT-L/14	ResNet-18/50/101	DINOv2 ViT
Prompt interface	Free-form text + referring expressions	Task tokens + free text	List of noun phrases	None (class list at training)	None (class list at training)
COCO AP (zero-shot or trained)	55.4 (zero-shot, 1.6 Pro)	~34.7 (zero-shot)	~30–40	53.1 (trained, R50)	60.1 (trained, 2XL)
LVIS performance	57.7 minival / 51.1 val (1.6 Pro)	Below specialist	Strong zero-shot	N/A	N/A (closed-vocab)
Typical inference	100–200 ms (A100)	~30–60 ms (RTX A5000)	~50–80 ms (T4)	~9 ms (T4, R50)	6.8 ms (T4, FP16, Large)
Multi-task support	Detection + grounding	10+ tasks	Detection only	Detection only	Detection + segmentation
Open-source licence	Apache 2.0 (1.0 weights research-use)	MIT	Apache 2.0	Apache 2.0	Apache 2.0 (core); PML 1.0 (XL/2XL)
API / hosted option	IDEA Research API	Hugging Face + Azure	Hugging Face	Ultralytics / PaddlePaddle	Roboflow Inference
2026 default pick for	Auto-labelling + cloud search	Multi-task generalist	Permissive zero-shot baseline	Real-time, Apache 2.0, mature toolchain	Real-time, highest accuracy, Apache 2.0

Figure 3. Five-model decision matrix. The latency numbers are quoted from each model's official documentation or peer-reviewed paper at the model's default input resolution and a single-batch inference; absolute values will vary with TensorRT version, image size, and quantisation level.

A Worked Example — Retail Store Loss Prevention With An Open Class List

To make the model choice concrete, picture a retail loss-prevention project where the security team needs to define new alert categories every week without retraining. The workload is fifty stores, four cameras per store (entrance, two aisle cameras, one exit), 1080p H.264 at 30 fps, with inference triggered at 4 fps per camera when motion is detected. The peak inference workload is 50 × 4 × 4 = 800 inferences per second across the fleet, dropping to roughly 100 per second during off-hours.

The category list keeps growing: "person carrying open suitcase", "person near refrigerated section without basket", "spilled liquid on aisle floor", "broken glass". The closed-vocabulary baseline, a YOLOv11 model fine-tuned on the initial 12 categories, works fine for the first six weeks, then breaks when the security team adds three new categories and there is no labelled training data for any of them. Retraining the YOLOv11 model would take four to six weeks of data collection, labelling, and model retraining; the security team needs the new categories live the same week.

The 2026 production pipeline is a two-tier deployment. Tier 1 (real-time, on every frame) runs RF-DETR-Large on the existing 12 categories, fine-tuned on the customer's data. On four NVIDIA L4 GPUs ($1.10/hour on AWS in 2026), RF-DETR-Large at FP16 processes roughly 250 inferences per second per GPU at 1080p, comfortably handling the 800/second peak. The monthly cost is 4 × $1.10 × 24 × 30 = $3,168.

Tier 2 (deferred, on every motion event) runs Grounding DINO 1.6 Edge on the new categories the security team typed in this week. The Edge variant is heavy — roughly 80 ms per image on the same L4 GPU at INT8 — but it runs only on motion-triggered frames, which is about 5–10% of the total frame count. One additional L4 covers the Tier 2 workload, adding $792/month, for a fleet total of $3,960/month.
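The sizing and cost figures for the two tiers can be reproduced with a few lines of arithmetic. The sketch below uses only the numbers stated in this example (fleet size, 4 fps inference, 250 inferences per second per L4, $1.10 per GPU-hour, a 30-day month); the helper names are hypothetical, and the single Tier 2 GPU is taken as given rather than derived.

```python
import math

HOURS_PER_MONTH = 24 * 30   # 30-day month, as in the text
L4_HOURLY_USD = 1.10        # hourly price quoted in the text


def peak_inferences_per_second(stores: int, cameras_per_store: int, inference_fps: int) -> int:
    """Fleet-wide peak load when every camera is triggering inference."""
    return stores * cameras_per_store * inference_fps


def gpus_needed(peak_ips: float, per_gpu_ips: float) -> int:
    """Smallest whole number of GPUs whose combined throughput covers the peak."""
    return math.ceil(peak_ips / per_gpu_ips)


def monthly_cost_usd(gpu_count: int, hourly_usd: float = L4_HOURLY_USD) -> float:
    """Cost of running the GPUs around the clock for one month."""
    return round(gpu_count * hourly_usd * HOURS_PER_MONTH, 2)


peak = peak_inferences_per_second(stores=50, cameras_per_store=4, inference_fps=4)
tier1_gpus = gpus_needed(peak, per_gpu_ips=250)
tier1_cost = monthly_cost_usd(tier1_gpus)
tier2_cost = monthly_cost_usd(1)
fleet_cost = tier1_cost + tier2_cost
```

Changing `per_gpu_ips` or the hourly price is the quickest way to see how sensitive the fleet total is to a different GPU type or a less optimistic throughput measurement.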

The non-obvious win in this design is that Tier 2 also auto-labels Tier 1's training data. Every Grounding DINO detection on a new category is stored with the frame; once 500–1,000 labelled examples accumulate for a category, an automated retraining job adds the category to the next RF-DETR fine-tune. The fleet learns continuously, the security team gets new categories live within minutes of typing them in, and the per-frame cost of inference stays at the closed-vocabulary level because Grounding DINO does not run on the hot path.
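The bookkeeping behind that continuous-learning loop is small. The sketch below tracks how many auto-labelled examples exist per category and reports which categories have crossed the retraining threshold; the class name, method names, and in-memory storage are hypothetical, and the default of 500 is the lower bound of the range mentioned above.

```python
from collections import Counter
from typing import List, Set


class LabelPool:
    """Counts stored auto-labels per category and flags retraining candidates."""

    def __init__(self, min_examples: int = 500) -> None:
        self.min_examples = min_examples
        self.counts: Counter = Counter()
        self.promoted: Set[str] = set()

    def add(self, category: str, n: int = 1) -> None:
        """Record n newly stored detections for a category."""
        self.counts[category] += n

    def ready_for_retrain(self) -> List[str]:
        """Categories with enough examples that are not yet in the fine-tune."""
        return sorted(
            c for c, n in self.counts.items()
            if n >= self.min_examples and c not in self.promoted
        )

    def mark_promoted(self, category: str) -> None:
        """Call after the category has been added to the closed-vocabulary model."""
        self.promoted.add(category)


pool = LabelPool(min_examples=500)
pool.add("broken glass", 520)
pool.add("spilled liquid", 130)
ready_before = pool.ready_for_retrain()
pool.mark_promoted("broken glass")
ready_after = pool.ready_for_retrain()
```

Once a category is promoted, the Tier 2 detector no longer needs to be prompted for it, which keeps the open-vocabulary workload bounded as the closed class list grows.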

A team that tried to solve the same problem with Grounding DINO 1.6 Pro on every frame would need roughly 8–10 L4 GPUs to handle the inference workload at the model's 100+ ms latency per frame. At the same $792 per GPU per month, that is $6,336–$7,920/month, or roughly 1.6 to 2 times the cost of the two-tier design, with materially worse latency.

Common Mistakes — The Three Pitfalls We See Most Often

Pitfall 1: Treating "open-vocabulary" as a universal replacement for closed-vocabulary. Teams that have just discovered Grounding DINO sometimes propose replacing their YOLOv11 pipeline with Grounding DINO on every frame to "future-proof the system". This is almost always wrong. Compared with a closed-vocabulary detector trained on the target classes, open-vocabulary models are 10–20× slower per frame and 5–15 AP less accurate, even on classes they have effectively seen during pretraining. The correct architecture is the two-tier design in the worked example above: closed-vocabulary on the hot path, open-vocabulary as the fallback and the auto-labeller.

Pitfall 2: Underestimating prompt engineering. Grounding DINO is sensitive to prompt phrasing. "Person carrying a suitcase" and "a person who carries a suitcase" and "someone with a suitcase" produce different detection sets on the same frame. Teams that ship a product where end-users type the prompts directly need a prompt-template layer that rewrites user input into the phrasing the model handles best, plus a confidence-threshold tuning step per template. Without this layer, the product's behaviour will appear stochastic to the user — a category they typed two months ago "works" and a similar category they typed today does not, because the phrasing crossed a threshold the user never sees.
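A prompt-template layer can start as a handful of rewrite rules. The sketch below maps the three phrasings from the paragraph above onto one canonical form and attaches a per-template confidence threshold; the regular expression, the canonical phrasing, and both threshold values are hypothetical placeholders that would need tuning against real detections.

```python
import re
from typing import Tuple

# Each rule: (pattern over normalised user text, canonical template, threshold).
REWRITES = [
    (
        re.compile(
            r"^(?:a |an |the )?(?:person|someone|somebody) "
            r"(?:who carries|who is carrying|carrying|with) "
            r"(?:a |an |the )?(.+)$"
        ),
        "person carrying a {0}",
        0.35,
    ),
]
DEFAULT_THRESHOLD = 0.30


def canonicalise(user_text: str) -> Tuple[str, float]:
    """Rewrite user input into a canonical prompt and pick its threshold."""
    text = re.sub(r"\s+", " ", user_text.strip().lower()).rstrip(".")
    for pattern, template, threshold in REWRITES:
        match = pattern.match(text)
        if match:
            return template.format(*match.groups()), threshold
    return text, DEFAULT_THRESHOLD  # no rule matched: pass through, lowercased


variants = [
    "Person carrying a suitcase",
    "a person who carries a suitcase",
    "someone with a suitcase",
]
canonical = [canonicalise(v) for v in variants]
fallback = canonicalise("Orange traffic cone")
```

Because all three variants reach the model as the same string, the user sees the same detections regardless of how they phrased the request, and the threshold can be tuned once per template rather than once per user.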

Pitfall 3: Ignoring the licence column. The five models in this article ship under four different licence regimes (Apache 2.0, MIT, AGPL-3.0 / Enterprise, and the IDEA Research commercial API), and the difference becomes material the moment the product crosses from prototype to revenue-generating deployment. Many teams ship a Grounding DINO 1.0 demo built from the open-source weights and discover six months later that the 1.0 weights are research-use-only, that 1.5 / 1.6 Pro requires an IDEA Research API contract, and that the team's procurement process for closing that contract takes six weeks. Read the licence column on day one, not on launch day.

Where Fora Soft Fits In

Fora Soft has shipped video products since 2005 across video surveillance, retail analytics, telemedicine, e-learning, and OTT moderation — the exact verticals where open-vocabulary detection is reshaping what is possible. In several of our recent projects, we have moved a closed-vocabulary YOLO pipeline to the two-tier architecture described in this article, using Grounding DINO or Florence-2 at Tier 2 to handle category-list change requests in minutes rather than weeks. The integration work — prompt-template layer, confidence-threshold tuning, auto-labelling pipeline, retraining cadence — is where most of the engineering risk lives, and it is the work we do when we ship these architectures for customers. If your product roadmap includes "let the operator add new categories without a retrain", an open-vocabulary detector is what makes that line on the roadmap real.

References

Liu, S., Zeng, Z., Ren, T. et al. Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. arXiv:2303.05499, March 2023. Accepted to ECCV 2024. https://arxiv.org/abs/2303.05499
IDEA Research. Grounding DINO 1.5: Pushing the Boundaries of Open-Set Object Detection. IDEA Research, May 2024. GitHub repository: https://github.com/IDEA-Research/Grounding-DINO-1.5-API
IDEA Research. Grounding DINO 1.6 Pro — Benchmark Results On COCO And LVIS. Visincept overview, January 2025. https://visincept.com/en/blog/6
Ren, T., Jiang, Q., Liu, S. et al. DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding. arXiv:2411.14347, November 2024. https://arxiv.org/abs/2411.14347
Xiao, B., Wu, H., Xu, W. et al. Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks. arXiv:2311.06242. Accepted to CVPR 2024. https://arxiv.org/abs/2311.06242
Microsoft. Florence-2-large Model Card. Hugging Face, accessed 2026-05-24. MIT licence. https://huggingface.co/microsoft/Florence-2-large
Minderer, M., Gritsenko, A., Houlsby, N. Scaling Open-Vocabulary Object Detection (OWLv2). NeurIPS 2023. Hugging Face documentation accessed 2026-05-24. https://huggingface.co/google/owlv2-large-patch14
Zhao, Y., Lv, W., Xu, S. et al. DETRs Beat YOLOs on Real-time Object Detection (RT-DETR). arXiv:2304.08069, April 2023. CVPR 2024. https://arxiv.org/abs/2304.08069
Wang, S., Xia, C., Lv, F., Shi, Y. RT-DETRv3: Real-time End-to-End Object Detection with Hierarchical Dense Positive Supervision. arXiv:2409.08475, September 2024. WACV 2025 Oral. https://arxiv.org/abs/2409.08475
Robinson, I., Robicheaux, P., Popov, M. et al. RF-DETR: Neural Architecture Search for Real-Time Detection Transformers. arXiv:2511.09554, November 2025. Accepted to ICLR 2026. https://arxiv.org/abs/2511.09554
Roboflow. RF-DETR — Official Documentation And Benchmarks. Roboflow, accessed 2026-05-24. https://rfdetr.roboflow.com/
Roboflow. Best Object Detection Models 2026: RF-DETR, YOLOv12 & Beyond. Roboflow Blog, accessed 2026-05-24. https://blog.roboflow.com/best-object-detection-models/
Cheng, T., Song, L., Ge, Y. et al. YOLO-World: Real-Time Open-Vocabulary Object Detection. arXiv:2401.17270, January 2024. CVPR 2024. https://arxiv.org/abs/2401.17270