Published 2026-05-31 · 19 min read · By Nikolay Sapunov, CEO at Fora Soft

Why This Matters

Two years ago, building a feature that understood what was inside a video frame meant collecting thousands of labeled images, training a custom model, and maintaining it forever. Today you can paste a frame into a frontier multimodal model and ask "is there a forklift in a pedestrian zone?" in plain English, and get a usable answer in one API call with no training at all. That shortcut is real and it is changing how video products get built — but used in the wrong place it quietly burns money, blows latency budgets, and produces a great demo that never survives contact with production scale. This article is for the product manager, founder, or video engineer who has to decide, for one specific feature, whether to reach for the VLM shortcut or build the custom pipeline — and wants the decision grounded in cost and latency numbers rather than in hype.

The Two Tools On The Table

Before the decision makes sense, you need a clean picture of what each tool actually is. They look interchangeable in a demo and behave very differently in production.

The first tool is a custom computer-vision model, often called an object detector. The most common family is YOLO — short for "You Only Look Once" — along with newer designs like RF-DETR. You decide in advance exactly what it should find ("person", "vehicle", "hard hat"), you collect and label example images, and you train the model on them. Once trained, it takes a frame and returns a tidy list of boxes: each box has coordinates, a category label, and a confidence score. Think of it as a very fast, very literal worker who has memorized one specific checklist and does only that, the same way, every single time.

The second tool is a vision-language model, or VLM — the multimodal cousin of a chatbot. Examples are Google's Gemini, OpenAI's GPT, and Anthropic's Claude, plus open-weight models like LLaVA and Qwen-VL. You hand it a frame and a question written in ordinary language, and it answers in ordinary language. It was never trained on your specific objects; it learned about the visual world in general. Think of it as a knowledgeable generalist you can ask anything — but who occasionally misremembers, phrases the same answer two different ways, and charges you by the question.

The single most important difference: the detector is deterministic and fixed, the VLM is flexible and probabilistic. Run the same frame through a trained detector a hundred times and you get a hundred identical answers. Run it through a VLM a hundred times and you may get ninety-six identical answers and four slightly different ones. Hold onto that contrast — almost every part of the decision flows from it.

[Image: Two side-by-side cards comparing a custom object detector and a vision-language model. The detector card shows an image going in and a fixed list of bounding boxes coming out, labeled "deterministic, fast, fixed categories". The VLM card shows an image plus a text question going in and a natural-language sentence coming out, labeled "flexible, slower, any question". A divider in the middle reads "same input, different shape of answer".]

Figure 1. A custom detector returns a fixed, repeatable list of boxes; a VLM returns flexible language about whatever you ask. The shape of the output, not the intelligence of the model, drives the decision.

The Decision Is Really About Three Numbers

People frame this choice as "is the VLM smart enough?" The honest framing is "does the VLM fit my cost, latency, and consistency budget?" In nearly every production video feature, capability is no longer the bottleneck — a 2026 frontier VLM can describe almost any scene you put in front of it. What it cannot always do is describe it cheaply enough, fast enough, and the same way twice. Those three numbers — cost per frame, latency per frame, and consistency across runs — decide the case far more often than raw accuracy.

We will walk through each number in turn, with real arithmetic. The numbers below use Google Gemini 2.5 Flash as the reference VLM because it is one of the cheaper frontier options and its pricing is published; the conclusions transfer to GPT and Claude, which cost more per image, not less.

Number one: cost per frame, and the break-even point

A custom detector costs money in a big lump up front and almost nothing afterward. You pay to collect data, label it, train the model, and stand up a server to run it. After that, each frame the model processes is nearly free — you are just paying for electricity and a slice of a GPU you already rent. A VLM API is the mirror image: nothing up front, but you pay a small toll for every single frame, forever.

To compare them you need the toll per frame. A VLM does not see an image as one thing; it chops it into square tiles and charges per tile. Gemini counts a tile as 258 tokens, where a token is the small unit of text or image data these models are billed on. A quarter-HD video frame of 960×540 pixels works out to six tiles.

Let us show that arithmetic out loud, the way we will show every number in this article:

Frame size:        960 × 540 pixels
Tile crop unit:    floor(min(960, 540) / 1.5) = floor(360) = 360 px
Tiles across:      960 / 360 = 2.67  ->  3 tiles
Tiles down:        540 / 360 = 1.5   ->  2 tiles
Total tiles:       3 × 2 = 6 tiles
Image tokens:      6 × 258 = 1,548 tokens per frame
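The same tile arithmetic can be written as a small function. This is a minimal sketch that mirrors the steps shown above and nothing more; the function name `estimate_image_tokens` is hypothetical, and the real API may special-case very small images or change its tiling rules, so treat the result as an estimate.

```python
import math

TOKENS_PER_TILE = 258  # per-tile token count quoted in the text


def estimate_image_tokens(width: int, height: int) -> int:
    """Estimate image tokens using the crop-unit arithmetic shown above.

    Hypothetical helper: it reproduces the article's calculation only and
    does not model any special cases the real API may apply.
    """
    crop_unit = math.floor(min(width, height) / 1.5)
    tiles_across = math.ceil(width / crop_unit)
    tiles_down = math.ceil(height / crop_unit)
    return tiles_across * tiles_down * TOKENS_PER_TILE


print(estimate_image_tokens(960, 540))  # 1548
```

Running it for other frame sizes is a quick way to see how resolution changes the per-frame toll before committing to a capture format.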

Gemini 2.5 Flash charges about $0.30 per one million input tokens (mid-2026 published rate). Add a short text prompt and a short answer, and one frame analyzed is roughly:

Input image:   1,548 tokens
Input prompt:    ~60 tokens
Output answer:   ~40 tokens  (billed at $2.50 / 1M output)
Cost ≈ (1,608 × $0.30 + 40 × $2.50) / 1,000,000
     ≈ ($0.000482 + $0.000100)
     ≈ $0.00058 per frame

So about six hundredths of a cent per frame. That sounds trivial — and at low volume it is. The trouble is that video is made of an enormous number of frames. A single camera at a modest 5 frames per second produces 432,000 frames a day, or about 13 million a month. Pushing every frame through the VLM would cost:

13,000,000 frames × $0.00058 ≈ $7,540 per month — for one camera.
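For readers who want to rerun this with their own frame rate or prompt length, here is a rough sketch of the same cost arithmetic. The rates are the ones quoted above; the function names and the 30-day month are assumptions made for illustration.

```python
INPUT_USD_PER_MTOK = 0.30   # input rate quoted in the text
OUTPUT_USD_PER_MTOK = 2.50  # output rate quoted in the text


def vlm_cost_per_frame(image_tokens: int = 1548,
                       prompt_tokens: int = 60,
                       answer_tokens: int = 40) -> float:
    """Dollar cost of one frame: input tokens plus output tokens."""
    input_cost = (image_tokens + prompt_tokens) * INPUT_USD_PER_MTOK
    output_cost = answer_tokens * OUTPUT_USD_PER_MTOK
    return (input_cost + output_cost) / 1_000_000


def frames_per_month(fps: float, days: int = 30) -> int:
    """Frames produced by one camera, assuming a 30-day month by default."""
    return int(fps * 60 * 60 * 24 * days)


per_frame = vlm_cost_per_frame()
monthly = frames_per_month(5) * per_frame
print(f"${per_frame:.5f} per frame, ${monthly:,.0f} per month")
# $0.00058 per frame, $7,548 per month
```

The unrounded total comes out a few dollars above the $7,540 figure because the text rounds to 13 million frames and $0.00058 before multiplying.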

Now put the custom detector beside it. A realistic one-time build (data labeling, training, and a quarter share of an entry-level GPU server) lands somewhere around $8,000 to $20,000, after which the per-frame running cost is close to zero because you are not paying a per-call toll. The crossover is what matters. Industry guidance in 2026, including Roboflow's published decision framework, puts the rule of thumb at roughly 100,000 images per month: below it, the VLM's zero-setup convenience usually wins; above it, the custom detector's near-zero per-frame cost wins, and the gap widens fast with volume. One busy camera blows past that line in a day.

The lesson is not "VLMs are expensive." It is that VLM cost scales linearly with frames while detector cost is mostly flat, so the right model depends entirely on how many frames you actually need to look at — which is the whole reason the hybrid pattern later in this article exists.
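One way to make the linear-versus-flat contrast concrete is to ask how many months of VLM tolls would pay for the detector build. The sketch below uses the figures from this section; the function name is hypothetical, and the detector's monthly running cost defaults to zero purely as a simplifying assumption.

```python
def payback_months(build_cost_usd: float,
                   vlm_monthly_usd: float,
                   detector_monthly_usd: float = 0.0) -> float:
    """Months until the one-time build cost is recovered by avoided VLM tolls.

    detector_monthly_usd is an assumed running cost; it defaults to zero
    only to keep the illustration simple.
    """
    monthly_saving = vlm_monthly_usd - detector_monthly_usd
    if monthly_saving <= 0:
        return float("inf")  # the detector never pays for itself
    return build_cost_usd / monthly_saving


for build in (8_000, 20_000):
    print(build, round(payback_months(build, 7_540), 1))
# 8000 1.1
# 20000 2.7
```

Under these assumptions, a single camera analyzed on every frame would repay the build in roughly one to three months, which is why per-frame VLM analysis rarely survives at stream scale.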

Number two: latency per frame, and the real-time wall

Latency is the delay between a frame being captured and your system having an answer about it. For some features a few seconds is fine; for others it is fatal. This single number disqualifies VLMs from a large class of video work no matter how the cost math comes out.

A small custom detector is built for speed. The smallest YOLO11 variant runs a frame in about 1.5 milliseconds on a server GPU, and on an edge device like an NVIDIA Jetson Orin it clears roughly 28 to 41 frames per second depending on precision settings (2026 benchmarks). That is fast enough to keep up with live video and react inside the same frame — the welding-inspection line that must stop the conveyor before a defect moves on, the safety system that must flag a person in a danger zone immediately.

A cloud VLM lives in a different time zone. The frame has to travel to a data center, wait in a queue, get processed by a model with billions of parameters, and travel back. That round trip is typically measured in seconds, not milliseconds. Roboflow's 2026 production guidance is blunt about it: cloud-based VLMs are "rarely suitable for strict latency requirements" and belong in batch processing or human-in-the-loop workflows where waiting a few seconds per frame is acceptable.

Here is the wall, stated plainly. If your feature has to keep up with a live video stream — real-time analytics, autonomous control, anything where a late answer is a wrong answer — a cloud VLM cannot do the per-frame job, full stop. It does not matter that it is smart. A correct answer that arrives three seconds after the moment has passed is a failure. This is why the latency question, not the capability question, ends most "just use a VLM for live video" conversations.
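The wall can be expressed as a simple budget check: a model that handles frames one at a time keeps up only if its latency fits inside the interval between frames. This is a simplified sketch that ignores batching and pipelining; the function names are hypothetical, and the 3,000 ms figure is just the three-second example from the paragraph above.

```python
def frame_budget_ms(fps: float) -> float:
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


def keeps_up(latency_ms: float, fps: float) -> bool:
    """True if one-frame-at-a-time processing fits inside the frame interval."""
    return latency_ms <= frame_budget_ms(fps)


print(keeps_up(1.5, 30))    # small detector on a server GPU, 30 fps stream
print(keeps_up(3000.0, 5))  # cloud VLM at about three seconds, 5 fps stream
```

Even at a modest 5 frames per second the budget is 200 ms per frame, so a multi-second round trip misses it by more than an order of magnitude.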

Number three: consistency, and what determinism buys you

The third number is the one teams discover last and regret most. A custom detector is deterministic: identical input, identical output, every time. A VLM is probabilistic: the same frame and prompt can yield slightly different wording, a different bounding box, or — occasionally — a confident description of something that is not in the frame at all. That last failure has a name in the research literature: hallucination.

For many features the variation is harmless. If a VLM describes a garment defect as "loose threading near the left cuff" one run and "frayed stitching on the left sleeve" the next, a human reading the report understands both. But the moment your system makes an automated decision on the answer — pass or fail, alert or ignore, log this for the auditor — non-determinism turns into risk. Compliance and quality-control systems often need to prove that the same input always produced the same decision; a model that "usually" agrees with itself cannot make that promise.

The failure is sharpest in tasks that require counting and spatial reasoning, which video features lean on constantly. A 2026 study, GroundCount, documented that vision-language models hallucinate persistently when asked to count objects, with accuracy "substantially lower than other visual reasoning tasks" — and that this weakness survives even in the newest reasoning-capable VLMs because it comes from how the models integrate space and meaning, not from any one model's flaws. A separate 2026 benchmark, OmniSpatial, found that both open and closed VLMs show "significant limitations" in spatial reasoning — relative position, perspective, dynamic relationships. The practical upshot is concrete and worth memorizing: do not ask a VLM for exact counts or precise positions. Ask it for those and it will frequently be confidently wrong. A trained detector or a dedicated depth model gives you those numbers reliably; the VLM is for meaning, not measurement.

Common mistake: asking the VLM the question only a detector can answer reliably. "How many people are in this frame?" and "Is the box to the left of the pallet?" feel like simple questions, so teams hand them to the VLM. Counting and precise spatial relations are exactly the queries where VLMs hallucinate most (GroundCount, OmniSpatial, 2026). Use the detector to count and locate; reserve the VLM for "what is happening and does it matter?" The 2026 fix, when you must combine them, is to feed the detector's boxes into the VLM's prompt so the model reasons over real coordinates instead of guessing — the approach that lifted counting accuracy by several points in the GroundCount work.
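As an illustration of feeding detector boxes into the prompt, the sketch below formats a detector's output as text that a VLM can reason over. It is not the GroundCount method itself, only a prompt layout in the same spirit; the `Box` type, the `grounded_prompt` function, and the 0.5 confidence cutoff are all assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    """One detector result in pixel coordinates (hypothetical structure)."""
    label: str
    x1: int
    y1: int
    x2: int
    y2: int
    confidence: float


def grounded_prompt(question: str, boxes: List[Box],
                    min_confidence: float = 0.5) -> str:
    """Build a text prompt that lists detector boxes before the question."""
    kept = [b for b in boxes if b.confidence >= min_confidence]
    lines = [
        f"{i}. {b.label} at ({b.x1}, {b.y1}, {b.x2}, {b.y2}), "
        f"confidence {b.confidence:.2f}"
        for i, b in enumerate(kept, start=1)
    ]
    listing = "\n".join(lines) if lines else "(no objects detected)"
    return (
        "An object detector found these boxes (x1, y1, x2, y2 in pixels):\n"
        f"{listing}\n"
        f"Objects listed: {len(kept)}\n"
        f"Question: {question}"
    )


boxes = [
    Box("person", 10, 20, 110, 220, 0.91),
    Box("person", 300, 40, 380, 230, 0.34),  # dropped: below the cutoff
    Box("pallet", 400, 150, 600, 300, 0.88),
]
print(grounded_prompt("Is anyone standing close to the pallet?", boxes))
```

The counting and the coordinates come from the detector; the VLM is left with the judgment question, which is the division of labor the paragraph recommends.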

A Side-By-Side Scorecard

The three numbers, plus the practical concerns around them, fit into one table. Use it as a first-pass filter for any feature you are scoping.

Criterion | Custom detector (YOLO, RF-DETR) | Vision-language model (Gemini, GPT, Claude)
Categories | Fixed; new category needs retraining | Open; ask anything in plain language
Latency per frame | ~1.5 ms (server GPU) to ~25–35 ms (edge) | Seconds (cloud round trip)
Cost shape | High one-time build, near-zero per frame | Zero setup, per-frame toll forever
Economics crossover | Wins above ~100K frames/month | Wins below ~100K frames/month
Consistency | Deterministic: identical output every run | Probabilistic: wording and boxes can vary
Counting / exact position | Reliable | Hallucination-prone (avoid)
Time to first version | Weeks (data, labels, training) | Hours (API + prompt)
Ongoing dependency | You own the whole system | Tied to a vendor's API, pricing, and behavior
Handles rare / unseen objects | No; misclassifies as nearest known | Yes; can describe things never trained on

Read the table as a balance, not a verdict. The detector wins the rows that matter for high-volume, real-time, audited production work. The VLM wins the rows that matter for fast iteration, changing requirements, and open-ended understanding. Most real features touch both columns — which is the cue to stop choosing and start combining.

A Decision Tree You Can Run In Your Head

When you are scoping one specific feature, six questions settle it. Walk through them in order and stop at the first clear answer.

First, can you write down, in advance, the complete list of things the feature must recognize? If you cannot — because the categories are open-ended, change seasonally, or include rare long-tail cases you cannot enumerate — a VLM is the natural fit, because it needs no predefined list. If you can enumerate them, keep going.

Second, does the feature have to keep up with live video? If a late answer is a wrong answer, you need the detector's millisecond latency; a cloud VLM cannot meet the budget. If a few seconds per frame is acceptable, keep going.

Third, will you process more than about 100,000 frames a month? Above that line the detector's near-zero per-frame cost wins on economics, and the margin grows with every additional camera. Below it, the VLM's zero setup is usually the cheaper path, so keep going.

Fourth, does an automated decision ride directly on the output, with an audit or compliance requirement behind it? If yes, the detector's determinism is worth a great deal — you can prove the same input always yields the same decision. If a human reviews the output, the VLM's occasional variation is tolerable. Keep going.

Fifth, do you have the machine-learning people and the weeks needed to collect data, label it, and train and maintain a model? If not, the VLM gets you to a shipped feature in days instead of weeks. If you do, the detector is the better long-term foundation. One last question.

Sixth, does the feature need understanding beyond where objects are — context, relationships, reading text in the scene, judging whether something looks wrong? If yes, that is VLM territory; a detector only outputs boxes. If you only need location and identity of known objects, the detector is the cleaner, cheaper tool.

[Image: A top-down decision tree with six diamond decision nodes. Node 1 asks whether the categories are known and fixed; a no branch points to a VLM outcome box. Node 2 asks whether real-time video is required; a yes branch points to a detector outcome box. Node 3 asks whether monthly frame volume exceeds one hundred thousand; a yes points to detector, a no continues. Node 4 asks whether automated audited decisions ride on the output; yes points to detector. Node 5 asks whether ML resources exist; no points to VLM, yes continues. Node 6 asks whether contextual understanding beyond location is needed; yes points to VLM, no points to detector. Outcome boxes are color coded, detector in blue, VLM in purple.]

Figure 2. Six questions, run top to bottom, resolve almost every "detector or VLM?" scoping decision. Stop at the first branch that gives a clear answer.
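The six questions translate directly into a short function. This is a rough sketch of the tree as described, with hypothetical field names; a real scoping exercise involves judgment calls that a boolean cannot capture.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    """Answers to the six scoping questions (field names are hypothetical)."""
    categories_enumerable: bool
    must_keep_up_with_live_video: bool
    frames_per_month: int
    audited_automated_decision: bool
    has_ml_team_and_time: bool
    needs_context_beyond_location: bool


def recommend(f: Feature) -> str:
    """Walk the questions in order and stop at the first clear answer."""
    if not f.categories_enumerable:
        return "VLM"
    if f.must_keep_up_with_live_video:
        return "detector"
    if f.frames_per_month > 100_000:
        return "detector"
    if f.audited_automated_decision:
        return "detector"
    if not f.has_ml_team_and_time:
        return "VLM"
    if f.needs_context_beyond_location:
        return "VLM"
    return "detector"


archive_search = Feature(True, False, 2_000, False, False, True)
print(recommend(archive_search))  # VLM
```

When a feature lands on "detector" for volume or latency but also needs contextual understanding, that tension is the signal to combine the two tools rather than pick one.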

The Pattern That Actually Ships: Hybrid Gating

In practice the best teams rarely pick one tool. They build a funnel where a cheap, fast model handles the firehose of frames and an expensive, smart model sees only the few frames that earn the attention. This is the single most useful idea in the article, so it is worth making concrete.

Picture a surveillance feature that should alert when "an unauthorized person is in a restricted area after hours." Pushing every frame to a VLM is, as we computed, thousands of dollars a month per camera and too slow for live response. Instead you stage it:

The first stage is a cheap gate. A tiny motion detector or a small custom YOLO model runs on the camera's own chip and drops the roughly 99% of frames where nothing relevant is happening. It is fast and nearly free, and it answers only one question: is there a person here at all? Nothing expensive runs until this gate opens.

The second stage is measurement. On the rare frames that pass the gate, a custom detector locates the person precisely and a depth or geometry step decides whether they are actually inside the restricted zone — the kind of exact spatial judgment a detector makes reliably and a VLM does not.

The third stage is understanding, and only here does the VLM appear. For the handful of frames that show a person genuinely in the zone, you send one clear frame to the VLM with a structured question: "Is this person a maintenance worker with a visible ID badge, or an unidentified individual?" That is a context-and-meaning question the detector cannot answer and the VLM answers well — and because you are sending a few frames a day instead of thirteen million a month, the cost is negligible and the few-second latency is fine, because the moment of decision is "should I raise an alert," not "should I stop a conveyor."

Stage 1 — Motion / tiny detector gate   →  drops ~99% of frames, runs on-device, ~free
Stage 2 — Custom detector + geometry    →  precise location and zone check on survivors
Stage 3 — VLM, structured prompt        →  context judgment on the rare frame that earns it
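The three stages can be sketched as a loop in which each stage only sees what the previous one let through. The callables `gate`, `in_zone`, and `ask_vlm` are placeholders for a real motion or person gate, a detector-plus-geometry check, and a VLM call; the frame dictionaries and the "unidentified" answer string are invented for the example.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Frame = dict  # hypothetical frame record; a real one would carry pixels


def run_funnel(frames: Iterable[Frame],
               gate: Callable[[Frame], bool],
               in_zone: Callable[[Frame], bool],
               ask_vlm: Callable[[Frame], str]) -> Tuple[List[Frame], Dict[str, int]]:
    """Run the cheap stages first so the VLM is called only on survivors."""
    stats = {"seen": 0, "passed_gate": 0, "vlm_calls": 0}
    alerts: List[Frame] = []
    for frame in frames:
        stats["seen"] += 1
        if not gate(frame):        # Stage 1: cheap on-device gate
            continue
        stats["passed_gate"] += 1
        if not in_zone(frame):     # Stage 2: detector plus geometry
            continue
        stats["vlm_calls"] += 1
        if ask_vlm(frame) == "unidentified":  # Stage 3: context judgment
            alerts.append(frame)
    return alerts, stats


frames = [
    {"id": 1, "person": False, "in_zone": False, "badge": False},
    {"id": 2, "person": True, "in_zone": False, "badge": False},
    {"id": 3, "person": True, "in_zone": True, "badge": True},
    {"id": 4, "person": True, "in_zone": True, "badge": False},
]
alerts, stats = run_funnel(
    frames,
    gate=lambda f: f["person"],
    in_zone=lambda f: f["in_zone"],
    ask_vlm=lambda f: "worker" if f["badge"] else "unidentified",
)
print([f["id"] for f in alerts], stats)
```

Tracking `vlm_calls` alongside `seen` gives a direct read on how much of the stream actually reaches the expensive stage, which is the number that drives the monthly bill.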

Two production safeguards make the VLM stage trustworthy. First, constrain the output: ask for a yes/no or a fixed set of choices, not free prose, so the answer is easy to act on and hard to drift. Second, for high-stakes calls, require agreement across two sampled frames before acting — if the VLM says "intruder" on one frame and "worker" on the next, treat the disagreement as a signal to escalate to a human rather than to fire the alert. These two habits turn the VLM's probabilistic nature from a liability into a manageable input.
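Both safeguards fit in a few lines. The sketch below assumes the VLM was asked to reply with exactly one of two words; the allowed set, the function names, and the outcome strings are illustrative choices, not a fixed protocol.

```python
from typing import Optional

ALLOWED = {"worker", "intruder"}  # assumed fixed answer set for the prompt


def parse_choice(raw: str) -> Optional[str]:
    """Accept only an allowed answer; anything else counts as no answer."""
    answer = raw.strip().lower().rstrip(".")
    return answer if answer in ALLOWED else None


def decide(raw_a: str, raw_b: str) -> str:
    """Act only when two sampled frames yield the same valid answer."""
    a, b = parse_choice(raw_a), parse_choice(raw_b)
    if a is None or b is None or a != b:
        return "escalate_to_human"
    return "alert" if a == "intruder" else "ignore"


print(decide("Intruder.", "intruder"))  # alert
print(decide("intruder", "worker"))     # escalate_to_human
```

Treating an unparseable answer the same way as a disagreement keeps free-form or drifting replies from ever triggering an automated action.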

[Image: A left-to-right three-stage funnel. Stage one is a wide motion gate box labeled "tiny detector on-device, drops ninety-nine percent of frames", with a thick arrow narrowing into stage two. Stage two is a custom detector plus geometry box labeled "precise location and zone check", narrowing further into stage three. Stage three is a small VLM box labeled "structured context question on rare frames", leading to two outcome chips, alert and drop. A caption strip along the bottom notes cheap and fast on the left, expensive and smart on the right.]

Figure 3. The hybrid funnel: a cheap, fast model on the left gates the stream so the expensive, smart VLM on the right only sees the rare frames worth its cost and latency.

Two More Hybrid Shapes Worth Knowing

The gating funnel is the most common pattern, but two others recur often enough to name.

The first is VLM-for-labeling, detector-for-production. Early in a project you do not yet have labeled training data, and labeling by hand is the slowest, most expensive part of building a detector. You can use a VLM to label thousands of frames automatically — "draw a box around every forklift" — then train a fast custom detector on those VLM-generated labels and deploy the detector for the high-volume production work. The VLM does the one-time heavy lifting; the detector does the repeated cheap work. This collapses the weeks-of-labeling bottleneck without paying the VLM's per-frame toll at runtime.
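One practical detail in this shape is converting the boxes a VLM returns into the label format a detector trainer expects. The sketch below assumes the VLM's boxes arrive as pixel corner coordinates and writes them in the widely used YOLO text format (class index, then box center and size, all normalized to the image dimensions). The function name is hypothetical, and VLM-generated boxes should be spot-checked before training on them.

```python
def to_yolo_line(class_id: int, x1: float, y1: float, x2: float, y2: float,
                 img_w: int, img_h: int) -> str:
    """Convert a pixel-corner box to a normalized YOLO label line."""
    cx = (x1 + x2) / 2 / img_w   # box center x, as a fraction of width
    cy = (y1 + y2) / 2 / img_h   # box center y, as a fraction of height
    w = (x2 - x1) / img_w        # box width, as a fraction of width
    h = (y2 - y1) / img_h        # box height, as a fraction of height
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


# A forklift box covering the central half of a 960x540 frame.
print(to_yolo_line(0, 240, 135, 720, 405, 960, 540))
# 0 0.500000 0.500000 0.500000 0.500000
```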

The second is detector-for-the-common-case, VLM-for-the-edge-case. A document-sorting feature classifies the routine forms with a fast detector and escalates only the documents it is unsure about to a VLM, which can read unfamiliar layouts and explain what they are. The uncertain cases then become new training data, so the detector slowly absorbs the long tail and the VLM is called less over time. The system gets cheaper and more self-sufficient as it runs.
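The escalation logic in this shape comes down to a confidence threshold plus a queue of cases for later retraining. The following is a minimal sketch; the 0.8 threshold, the function name, and the stubbed `ask_vlm` callable are assumptions chosen for illustration.

```python
from typing import Callable, List, Tuple


def route_document(doc_id: str,
                   detector_label: str,
                   detector_confidence: float,
                   ask_vlm: Callable[[str], str],
                   retrain_queue: List[Tuple[str, str]],
                   threshold: float = 0.8) -> str:
    """Trust the detector when it is confident; otherwise escalate to the VLM."""
    if detector_confidence >= threshold:
        return detector_label
    vlm_label = ask_vlm(doc_id)
    # Keep the uncertain case so it can become detector training data later.
    retrain_queue.append((doc_id, vlm_label))
    return vlm_label


queue: List[Tuple[str, str]] = []
stub_vlm = lambda doc_id: "customs_form"  # stands in for a real VLM call
print(route_document("doc-1", "invoice", 0.97, stub_vlm, queue))  # invoice
print(route_document("doc-2", "invoice", 0.41, stub_vlm, queue))  # customs_form
print(queue)  # [('doc-2', 'customs_form')]
```

Labels collected this way inherit the VLM's mistakes, so a human review step before retraining is a sensible precaution.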

All three shapes share one principle: let the cheap model decide how often the expensive model runs. Whenever you feel the pull to "just use a VLM" across a whole stream, the fix is almost never to abandon the VLM — it is to put a gate in front of it.

When "Just Use A VLM" Is The Right Call

It would be a mistake to read this as "VLMs are for demos, detectors for production." There are real features where reaching straight for the VLM, with no custom model at all, is the correct engineering decision in 2026.

Reach for the VLM alone when the volume is low and the value per frame is high — a telemedicine tool that summarizes one short clip per patient visit, an e-learning system that generates a description of one uploaded lecture slide, an archive search that answers a librarian's occasional question about old footage. Reach for it when the categories are genuinely open or constantly changing, so there is nothing stable to train a detector on. Reach for it when you need to ship this week and prove the feature has value before investing in a pipeline — a VLM prototype is the fastest way to learn whether anyone wants the feature at all. And reach for it for the messy, contextual judgments detectors simply cannot make: "does this scene look unsafe?", "what is unusual here?", "read the handwritten note in the corner and tell me what it says."

The honest position is that the line moves every quarter. VLMs are getting faster and cheaper, and the 2026 trend is smaller models that run in under a second and even on-device, which will pull more near-real-time work into VLM territory. But the underlying physics does not change: a model trained to do one narrow thing will always be faster and cheaper per frame than a generalist asked to do everything, and a generalist will always be more flexible than a specialist. The decision is not which is better — it is which fits this feature's three numbers, today.

Where Fora Soft Fits In

We build video products across video conferencing, video streaming, OTT, video surveillance, e-learning, and telemedicine, and the "detector or VLM?" question lands on our desk for nearly every AI feature in them. The pattern we keep returning to is the hybrid funnel — a small on-device model gating a frontier VLM — because it is the only shape that respects both a real-time budget and a real cost ceiling at the same time. In surveillance work the cheap gate keeps per-camera cost sane while the VLM handles the rare contextual call; in e-learning and telemedicine, where volume is lower and each clip is worth more, a VLM often does the whole job alone. The engineering value we add is mostly in drawing that line correctly for a specific feature, then building the gating so the expensive model only ever sees the frames that justify it.

References

  1. Roboflow Blog. "Object Detection vs Vision-Language Models: When Should You Use Each?" (Jan 13, 2026). Decision framework, latency/cost/determinism factors, hybrid patterns, ~100K images/month crossover. https://blog.roboflow.com/object-detection-vs-vision-language-models/
  2. Google AI for Developers. "Gemini Developer API pricing." Gemini 2.5 Flash input/output token rates (accessed 2026-05-31). https://ai.google.dev/gemini-api/docs/pricing
  3. Google AI for Developers. "Understand and count tokens — generateContent API." Image tiling: 258 tokens per 768×768 tile; crop-unit formula (accessed 2026-05-31). https://ai.google.dev/gemini-api/docs/tokens
  4. GroundCount: Grounding Vision-Language Models with Object Detection for Mitigating Counting Hallucinations. arXiv:2603.10978 (March 2026). VLM counting hallucination; detector-grounded prompting raises accuracy. https://arxiv.org/abs/2603.10978
  5. OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models. arXiv:2506.03135 (2026). Open and closed VLMs show significant spatial-reasoning limitations. https://arxiv.org/pdf/2506.03135
  6. LearnOpenCV. "YOLO11 on Raspberry Pi: Optimizing Object Detection for Edge." Edge FPS and quantization (accessed 2026-05-31). https://learnopencv.com/yolo11-on-raspberry-pi/
  7. YOLOv11 Demystified: A Practical Guide to High-Performance Object Detection. arXiv:2604.03349. YOLO11n ~1.5 ms / ~650 FPS; Jetson Orin NX FP16/INT8 latency and FPS. https://arxiv.org/html/2604.03349v1
  8. Ultralytics. "Real-Time Inference: Speed, Low Latency & YOLO." Real-time detector latency framing (accessed 2026-05-31). https://www.ultralytics.com/glossary/real-time-inference
  9. Fireworks AI. "Introducing Vision-Language Model Fine-tuning." Fine-tuned small VLM: ~100× lower cost, ~1.5× lower latency vs closed providers; up to 43× throughput. https://fireworks.ai/blog/vlm-tuning
  10. DigitalOcean. "A Guide to Object Detection with Vision-Language Models." VLM-as-detector behavior and prompt sensitivity (accessed 2026-05-31). https://www.digitalocean.com/community/conceptual-articles/hands-on-guide-to-object-detection-with-vision-language-models