Published 2026-05-31 · 24 min read · By Nikolay Sapunov, CEO at Fora Soft

Why This Matters

If your product needs to understand video — read a label in a warehouse feed, summarise a telemedicine consult, moderate user uploads, flag an event in a surveillance stream — you have two paths: rent a closed model over an API, or run an open model on your own machines. The closed frontier lesson covered the first path. This lesson covers the second, and it exists because the open path wins outright in three situations that come up constantly: when data legally cannot leave your servers, when a high-volume feature is cheaper to host than to rent, and when you need to fine-tune the model on your own domain. The reader this is written for is the product manager, founder, or operations lead who has to sit in the build-versus-buy meeting and understand why the engineers keep saying "Qwen3-VL for general work, InternVL if the lawyers are nervous, Pixtral when it has to fit on one GPU". It assumes you have read the video VLM primer; if "token", "frame sampling", and "context window" are new words, read that first.

What "Open Frontier" Actually Means

Start with the two words, because each carries weight. Open here is shorthand for open-weights: the trained model — the giant table of numbers, called weights, that holds everything the model learned — is published as a file you can download, run on your own computer, inspect, and change. The opposite is closed, where the weights live only on the vendor's servers and you rent access over the internet. Frontier means the most capable models of their kind — the ones that smaller models are measured against. So the open frontier is the most capable models you are allowed to download and own.

Here is the analogy to hold onto. A closed model is like hiring a brilliant analyst through an agency: you send your material out, an answer comes back, you pay per page, and you never see their desk. An open model is hiring the analyst in-house: you pay once to bring them on, they work entirely inside your building, your material never leaves, and you can train them on your own playbook — but you also have to give them a desk, a computer, and a salary whether they are busy or idle. That "desk and salary" is the catch with open models: the weights are free, but the hardware to run them is not, and someone on your team has to keep the lights on. The rest of this article is about when hiring in-house is the right call, and which candidate to hire.

Diagram contrasting the closed path and the open path for adding video understanding to a product. Left side, labelled 'Closed (rent)': your app sends video to a vendor cloud box (Gemini, GPT-5, Claude) and gets text back, with a tag 'pay per call, data leaves your servers, cannot fine-tune'. Right side, labelled 'Open (own)': the model weights are downloaded from a hub into your own server box, your app sends video to that box and gets text back, with a tag 'pay for hardware, data stays in your building, you can fine-tune'. A shared note at the bottom states that both end at the same place, text answers about video, but the trade-offs are opposite.

Figure 1. The two paths to video understanding. Closed rents capability per call; open owns it on your hardware. The whole lesson is about when the right column wins.

The Common Recipe — Why Every One Of These Models Looks Alike Inside

Before the four families, learn the one architecture they all share, because once you see it the entire field stops being a wall of names. Every model here is built from three parts bolted together, and the design was proven by the first family on our list, LLaVA, in 2023.

The first part is an image reader, technically a vision encoder: a component already trained to turn a picture into a list of numbers that capture what is in it. The original LLaVA used a reader from a system called CLIP, which had already learned to match images to their text descriptions; later families often train their own readers, but the role is the same. The second part is a language model, the same kind of system that powers a text chatbot, already trained to read and write fluent prose. The third part, sitting between them, is the clever bit: a small connector (engineers call it a projector or adapter) that translates the image reader's numbers into the form the language model expects, so the chatbot can "read" a picture as if it were words.

The analogy: the image reader is an eye, the language model is a fluent speaker who has never seen, and the connector is the optic nerve that carries what the eye sees into language the speaker can use. LLaVA's insight in 2023 was that you do not have to build all three from scratch. Take a ready-made eye (CLIP), take a ready-made speaker (an open chatbot called Vicuna), and train only the cheap little optic nerve in between, plus a light polish of the speaker. That made a capable vision-language model trainable on a modest budget, and it set the template every later family refined.

Diagram of the shared three-part vision-language model architecture. Left: an 'Image reader (vision encoder, e.g. CLIP)' box with a small picture icon, output arrow labelled 'image as numbers'. Middle: a 'Connector (projector / adapter)' box, the smallest box, tagged 'the part LLaVA showed you only have to train this'. Right: a 'Language model (LLM)' box with a speech icon, output arrow labelled 'text answer'. A caption underneath reads: eye plus optic nerve plus fluent speaker. Every model in this lesson is a variation on these three parts.

Figure 2. The shared recipe. An image reader, a language model, and a small connector that links them. The four families differ mainly in how good each part is and how they are trained, not in the basic shape.
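To make the three-part shape concrete, here is a toy sketch in plain Python. It is not any real model's code: the dimensions, the random connector weights, and the function names (`vision_encoder`, `make_projector`, `build_llm_input`) are all invented for illustration. It only shows the data flow: image features go through a small linear connector and come out as vectors of the same width as the language model's text tokens, so the two can sit in one sequence.

```python
import random

VISION_DIM = 4   # hypothetical width of the image reader's output
LLM_DIM = 6      # hypothetical width of the language model's token embeddings


def vision_encoder(image_patches):
    """Stand-in for a pretrained image reader: one feature vector per patch."""
    # A real encoder is a deep network; here each patch is already a VISION_DIM vector.
    return [list(patch) for patch in image_patches]


def make_projector(in_dim, out_dim, seed=0):
    """Build the connector: a single linear map from vision space to LLM space."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]

    def project(features):
        # Multiply every feature vector by the weight matrix.
        return [[sum(w * x for w, x in zip(row, vec)) for row in weights] for vec in features]

    return project


def build_llm_input(image_tokens, text_tokens):
    """The language model reads projected image tokens followed by the text prompt."""
    return image_tokens + text_tokens


patches = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8], [0.9, 1.0, 1.1, 1.2]]
prompt_embeddings = [[0.0] * LLM_DIM, [1.0] * LLM_DIM]  # two made-up text tokens

projector = make_projector(VISION_DIM, LLM_DIM)
image_tokens = projector(vision_encoder(patches))
sequence = build_llm_input(image_tokens, prompt_embeddings)
print(len(sequence), "tokens, each of width", len(sequence[0]))
```

The connector is the only piece built from scratch here, which mirrors why it is the cheap part to train: it is tiny compared with the eye and the speaker on either side of it.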

The Four Families

LLaVA — The Academic Pioneer That Set The Template

LLaVA stands for Large Language and Vision Assistant. It came out of academic labs in 2023 and matters less for what you would deploy today than for what it proved. Its method, called visual instruction tuning, was the first to use a text-only GPT-4 to generate training conversations about images, then use those conversations to teach the connector-plus-chatbot to answer questions about pictures. Per the original paper, the recipe was deliberately frugal: connect a frozen CLIP ViT-L/14 image reader to a Vicuna language model, train only the connector on an alignment set of roughly 595,000 image-caption pairs while the language model also stays frozen, then fine-tune the connector and the language model together on roughly 150,000 GPT-generated instruction examples. That two-stage recipe is now the default playbook for the whole field.
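The two-stage recipe is easiest to remember as a freeze schedule: which of the three parts is allowed to change in each stage. The sketch below encodes that schedule as data, following the original paper's description as we read it (connector only in stage one, connector plus language model in stage two, image reader frozen throughout). The class and part names are our own labels, not identifiers from the LLaVA codebase.

```python
from dataclasses import dataclass

PARTS = ("vision_encoder", "projector", "language_model")


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: frozenset  # parts whose weights are updated in this stage


STAGES = [
    # Stage 1: teach the connector to translate image features for the chatbot.
    Stage("1: feature alignment", frozenset({"projector"})),
    # Stage 2: polish the connector and the chatbot on instruction conversations.
    Stage("2: instruction tuning", frozenset({"projector", "language_model"})),
]


def frozen_parts(stage):
    """Everything not listed as trainable keeps its pretrained weights."""
    return [part for part in PARTS if part not in stage.trainable]


for stage in STAGES:
    print(stage.name, "| train:", sorted(stage.trainable), "| frozen:", frozen_parts(stage))
```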

The lineage grew — LLaVA-1.5, LLaVA-NeXT (also called LLaVA-1.6), then LLaVA-OneVision, which added video understanding by treating a clip as a sequence of frames. The honest 2026 status is twofold. As a production model, mainstream LLaVA has been overtaken: its raw scores now sit behind Qwen-VL and InternVL. But as an open framework, the lineage is alive and still setting records for openness — the 2025 LLaVA-OneVision-1.5 release published its full training data and code, and reported that its 8-billion-parameter model beats the comparable Qwen2.5-VL-7B on 18 of 27 benchmarks. So you would rarely deploy classic LLaVA in a product today, but you would absolutely read its papers to understand the field, and you might use the OneVision framework if full reproducibility matters to your research. LLaVA is the keyword most people search — and the answer to "should I use LLaVA in production in 2026?" is usually "no, use one of the next three, but thank LLaVA for inventing the recipe they all use".

Qwen-VL — The All-Round Leader In 2026

Qwen-VL is the vision-language line from Alibaba's Qwen team, and in 2026 it is the open model most teams should try first. The current generation is Qwen3-VL, released in late 2025. It ships in a wide range of sizes so you can match the model to your hardware: small dense models at 2, 4, 8, and 32 billion parameters for single-GPU and edge work, and large mixture-of-experts models (a design where only a fraction of the model's parameters fire on any given input, giving big-model quality at smaller-model running cost) up to a 235-billion-parameter flagship.
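The mixture-of-experts idea can be shown with a toy router. This is a generic top-k gating sketch, not Qwen's actual routing code: the eight "experts" are trivial functions, the gate scores are made up, and `k=2` is an arbitrary choice. What it illustrates is the mechanism in the parenthesis above: a gate scores every expert, only the top few run, and their outputs are blended.

```python
import math


def softmax(scores):
    """Turn raw gate scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(gate_scores, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    probs = softmax(gate_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]  # weights renormalised over the chosen experts


def moe_forward(x, experts, gate_scores, k=2):
    """Run only the chosen experts and blend their outputs."""
    chosen = route_top_k(gate_scores, k)
    output = sum(weight * experts[i](x) for i, weight in chosen)
    return output, [i for i, _ in chosen]


# Eight toy experts; expert m simply multiplies its input by m.
experts = [lambda x, m=m: x * m for m in range(1, 9)]
gate_scores = [0.1, 2.0, 0.3, 1.5, 0.0, -1.0, 0.2, 0.4]  # made-up scores for one input

output, used = moe_forward(10.0, experts, gate_scores, k=2)
print("experts used:", used, "of", len(experts))
```

Only two of the eight experts did any work for this input, which is where the lower running cost comes from. Note that all eight still have to be loaded in memory, so a mixture-of-experts model saves compute per input rather than storage.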

Two features make Qwen-VL strong on video specifically. First, it natively supports a very long context — 256,000 tokens out of the box, expandable toward a million — which, per the token math from the VLM primer, is enough to hold an hours-long video with second-level indexing. Second, the previous generation, Qwen2.5-VL, introduced dynamic frame-rate sampling and absolute time encoding — meaning the model knows not just which frames it saw but when each one occurred in real seconds, so it can answer "what happened at 4 minutes 12 seconds?" rather than only "what happened in the video?". That temporal precision is exactly what video products need. Qwen2.5-VL's 72-billion-parameter model was benchmarked as competitive with closed models like GPT-4o, and as of May 2026 the Qwen line holds the strongest all-round scores among open weights.
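The link between context length and video length is simple arithmetic, sketched below. The 256,000-token window is the figure quoted above; the 64 tokens per frame and the 4,000 tokens reserved for the prompt and answer are placeholders we picked for illustration, since the real per-frame cost depends on the model and the frame resolution. The helper names are hypothetical.

```python
import math


def frames_that_fit(context_tokens, tokens_per_frame, reserved_for_text=4_000):
    """How many frames fit once room is set aside for the prompt and the answer."""
    usable = context_tokens - reserved_for_text
    if usable <= 0 or tokens_per_frame <= 0:
        raise ValueError("need a positive token budget and a positive per-frame cost")
    return usable // tokens_per_frame


def sampling_interval_seconds(video_seconds, max_frames):
    """Smallest whole-second gap between frames that stays within the frame budget."""
    return max(1, math.ceil(video_seconds / max_frames))


def timestamp(seconds):
    """Format an absolute time as h:mm:ss."""
    return f"{seconds // 3600}:{(seconds % 3600) // 60:02d}:{seconds % 60:02d}"


def timestamped_frames(video_seconds, interval):
    """Pair each sampled frame with its absolute time in the video."""
    return [(t, timestamp(t)) for t in range(0, video_seconds, interval)]


CONTEXT_TOKENS = 256_000   # window size quoted in the text
TOKENS_PER_FRAME = 64      # placeholder: varies with model and resolution
VIDEO_SECONDS = 60 * 60    # a one-hour clip

max_frames = frames_that_fit(CONTEXT_TOKENS, TOKENS_PER_FRAME)
interval = sampling_interval_seconds(VIDEO_SECONDS, max_frames)
frames = timestamped_frames(VIDEO_SECONDS, interval)
print(f"budget: {max_frames} frames, sampling every {interval}s, {len(frames)} frames sent")
```

Under these assumptions a one-hour clip fits at one frame per second, and the frame at 4 minutes 12 seconds carries its own timestamp. Raise the per-frame token cost and the interval widens quickly, so it is worth rerunning this with the numbers for your model and resolution.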

There is one more reason engineers reach for Qwen first, and it is not technical: its licence. Qwen3-VL is released under Apache 2.0, an open licence with no restriction on commercial use at any scale. We return to why that matters below — it is often the deciding factor.

InternVL — The Strongest Permissively-Licensed Alternative

InternVL comes from OpenGVLab (the Shanghai AI Laboratory and partners) and is Qwen-VL's closest open rival. The current generation, InternVL3, is notable for a training change that sounds academic but has practical consequences: native multimodal pre-training. Earlier models — including LLaVA — first trained a language model on text alone, then bolted vision on afterward in a separate stage. InternVL3 instead learns language and vision together from the start, in a single pre-training run. The payoff the team reports is better, more consistent multimodal reasoning, because the model never has to be retrofitted with sight after the fact.

In numbers, InternVL3's largest model (78 billion parameters) scores about 72 percent on MMMU, a hard benchmark of college-level questions that mix images and text, placing it at the top of the open field and within reach of closed models. Its smaller sizes are competitive class-for-class with Qwen-VL. For a video product, InternVL uses the same frames-as-tokens approach as the others and supports extended multimodal context.

InternVL's distinguishing reason-to-choose is, again, licensing. It is released under permissive open-source terms (MIT-style), which some legal teams prefer even over Apache 2.0 because the text is a few short paragraphs with no patent clause or NOTICE-file bookkeeping to review; the main obligation is to keep the copyright notice with the software. When a company's lawyers want the cleanest possible licence and InternVL clears the quality bar, it is the pick.

Pixtral — The Strongest Small Model

Pixtral is Mistral's entry, and its claim to fame is punching above its weight. Pixtral 12B pairs a compact 400-million-parameter image reader with a 12-billion-parameter language model — small enough to run on a single mainstream GPU — and at release it outperformed other open models in its size class on instruction-following tasks. Its image reader uses a technique called RoPE-2D that lets it process pictures at their native resolution and shape, rather than squashing every image into one fixed square, which helps with documents, charts, and wide video frames. Pixtral handles any number of images inside a 128,000-token context window, and ships under Apache 2.0.

Pixtral is the model you choose when the constraint is hardware, not raw capability. If a feature has to run on a single modest GPU — on an edge device, in a cost-sensitive batch job, or embedded close to a camera — a 12-billion-parameter model that holds its own is worth more than a 235-billion-parameter flagship you cannot afford to keep running. It is the "fits on one GPU and still works" choice.
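A quick way to sanity-check "fits on one GPU" is to estimate the memory the weights alone need: parameter count times bytes per parameter. The sketch below does that. The parameter counts come from the text (12 billion plus a 400-million-parameter image reader, and a 235-billion-parameter flagship); the 48 GB card and the 20 percent headroom for activations and cache are assumptions chosen for illustration, and the estimate deliberately ignores everything except weights.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billion, precision="fp16"):
    """Memory for the weights only: 1e9 params x bytes each, expressed in GB (1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[precision]


def fits_on_gpu(params_billion, gpu_memory_gb, precision="fp16", headroom=0.2):
    """Rough check; headroom is the share of memory kept free for activations and cache."""
    return weight_memory_gb(params_billion, precision) <= gpu_memory_gb * (1 - headroom)


GPU_MEMORY_GB = 48  # hypothetical single card

for name, size in [("12B + 0.4B image reader", 12.4), ("235B flagship", 235.0)]:
    need = weight_memory_gb(size)
    verdict = "fits" if fits_on_gpu(size, GPU_MEMORY_GB) else "does not fit"
    print(f"{name}: about {need:.1f} GB of weights at fp16, {verdict} on one {GPU_MEMORY_GB} GB card")
```

Treat the result as a first filter only. Long video inputs add substantial memory on top of the weights, so a model that barely fits on paper may not fit in practice.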

The Families At A Glance

The table below is an orientation snapshot for 2026, not a contract. Open-weights models release new versions every few months, and benchmark numbers move with them, so treat the generations, sizes, and any scores quoted in this lesson as a rough guide rather than something to put in a spec.

| Family | Origin | Best 2026 generation | Sizes | Licence | Reach for it when… |
| --- | --- | --- | --- | --- | --- |
| LLaVA | Academic (UW-Madison et al.) | LLaVA-OneVision-1.5 | ~4B, 8B | Apache 2.0, fully open data | You want a reproducible research baseline, not a production model |
| Qwen-VL | Alibaba | Qwen3-VL | 2B–32B dense, up to 235B MoE | Apache 2.0 | You want the best all-round open model, the safe default |
| InternVL | OpenGVLab / Shanghai AI Lab | InternVL3 | 1B–78B | MIT-style | You need top quality under the cleanest possible licence |
| Pixtral | Mistral AI | Pixtral 12B / Large | 12B (and a larger variant) | Apache 2.0 (Pixtral 12B) | The feature must fit on one modest GPU |

Generations and sizes are a 2026 snapshot and change every few months. All four ship under business-friendly licences — which is itself the headline: the open frontier is, in 2026, genuinely usable in commercial products.

The Licence Is The Real Decision — Not The Benchmark

Here is the rule that surprises most teams: for a business shipping a product, the software licence usually matters more than a few benchmark points. The licence is the legal document that says what you are allowed to do with the downloaded weights — whether you can use them commercially, at what scale, and what strings are attached. Pick a model that scores two points higher but carries a licence your lawyers reject, and you have picked nothing.

Three licence shapes show up across the open frontier, and you only need to recognise them. Apache 2.0 (Qwen-VL, Pixtral) and MIT (InternVL) are the friendliest: they let you use, modify, and ship the model commercially at any scale, on your own servers or embedded in a product you sell, with no usage cap and no permission needed. The third shape is the community licence used by some other popular open models (notably Meta's Llama family): commercial use is allowed, but with conditions — most famously a clause that only bites at hyperscale, restricting the very largest companies (those past hundreds of millions of monthly users), plus an attribution requirement. For almost every company that clause never triggers, but it is source-available, not formally open-source, and some procurement teams will not touch it on principle.

The practical takeaway is short. The four families in this lesson all ship under genuinely permissive licences — Apache 2.0 or MIT — which is exactly why they form the recommended open frontier: you can put them in a product you sell without a lawyer's caveat. When a model outside this set tempts you with a higher score, read its licence before its benchmark. A common, expensive mistake is prototyping for months on a model whose licence forbids the deployment you were always planning.

Decision tree for choosing an open-frontier vision-language model. Top diamond: 'Does the data have to stay on your servers, or do you need to fine-tune, or is volume huge?' A 'No' branch points to a rectangle 'Consider a closed API instead (see the closed-frontier lesson)'. A 'Yes' branch goes to a second diamond 'Does it have to fit on one modest GPU?'. Its 'Yes' branch points to a green outcome box 'Pixtral 12B'. Its 'No' branch goes to a third diamond 'Do your lawyers want the cleanest possible licence?'. That diamond's 'Yes' branch points to an outcome box 'InternVL3', and its 'No' branch points to an outcome box 'Qwen3-VL (the safe default)'. A side note reads: LLaVA is for research baselines, not production.

Figure 3. Choosing an open model is a short decision tree, and the licence question sits near the top, above raw score. Qwen3-VL is the default; the branches are the exceptions.
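The decision tree in Figure 3 is small enough to write down as a function. The sketch below transcribes the three questions in order; the function and parameter names are ours, and the returned strings are just the outcome boxes from the figure. It is a memory aid, not a substitute for prototyping on your own footage.

```python
def choose_open_model(data_must_stay, needs_fine_tuning, volume_is_huge,
                      must_fit_one_modest_gpu, wants_cleanest_licence):
    """Walk the three questions of the decision tree, top to bottom."""
    # Question 1: is there any reason to self-host at all?
    if not (data_must_stay or needs_fine_tuning or volume_is_huge):
        return "Consider a closed API instead"
    # Question 2: is hardware the binding constraint?
    if must_fit_one_modest_gpu:
        return "Pixtral 12B"
    # Question 3: does legal want the shortest, simplest licence?
    if wants_cleanest_licence:
        return "InternVL3"
    return "Qwen3-VL (the safe default)"


# Example: on-premises telemedicine footage, ordinary server hardware, no licence preference.
print(choose_open_model(data_must_stay=True, needs_fine_tuning=False, volume_is_huge=False,
                        must_fit_one_modest_gpu=False, wants_cleanest_licence=False))
```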

What It Actually Costs To Run One Yourself

The weights are free; the hardware is not. The honest cost of an open model is the cost of the GPU (the specialised chip that runs it) plus the engineer-hours to operate it. Walk a rough example out loud, because the arithmetic is the whole build-versus-buy case.

Suppose you rent a single cloud GPU capable of serving a mid-sized VLM, at a representative 2026 price of about \$2.00 per hour. Run it around the clock for a month:

$2.00/hour × 24 hours × 30 days = $1,440 per month, fixed

That \$1,440 buys you a machine that runs whether it is busy or idle — the "desk and salary" from the analogy. Now compare against renting a closed API. If a closed model charges roughly \$0.05 to analyse one short video (the figure worked out in the closed frontier lesson), then the break-even point is:

$1,440 ÷ $0.05 per video = 28,800 videos per month

Below roughly 29,000 videos a month, the closed API is cheaper, because you only pay when you use it. Above it, the owned GPU wins, because its cost is fixed while the API bill keeps climbing with volume. A single GPU can usually process far more than 29,000 short videos in a month, so a genuinely high-volume feature tips toward self-hosting — and the more you run, the wider the gap. This is why the open path is described as winning "at scale": the fixed cost is only worth paying once you have enough work to keep the machine busy. The same arithmetic, run with your real GPU price and your real per-video API quote, is the core of any honest build-versus-buy decision — and a tool like the cost calculator from the cost-model lesson exists precisely to run it for you.
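The break-even arithmetic above fits in a few lines, which makes it easy to rerun with your own quotes. The sketch below uses the same illustrative figures as the text (2.00 dollars per GPU-hour, 0.05 dollars per video on the closed API); the function names are hypothetical and the 10,000 and 100,000 volumes are example inputs.

```python
def monthly_gpu_cost(hourly_rate, hours_per_day=24, days=30):
    """Fixed cost of keeping one rented GPU running all month."""
    return hourly_rate * hours_per_day * days


def break_even_videos(monthly_fixed_cost, api_price_per_video):
    """Monthly volume at which the API bill equals the fixed GPU cost."""
    return monthly_fixed_cost / api_price_per_video


def cheaper_option(videos_per_month, monthly_fixed_cost, api_price_per_video):
    """Compare the pay-per-use bill with the fixed bill at a given volume."""
    api_bill = videos_per_month * api_price_per_video
    return "self-host" if api_bill > monthly_fixed_cost else "closed API"


fixed = monthly_gpu_cost(2.00)                # figures from the worked example
threshold = break_even_videos(fixed, 0.05)
print(f"fixed cost: ${fixed:,.0f}/month, break-even: {threshold:,.0f} videos/month")

for volume in (10_000, 100_000):
    print(f"{volume:,} videos/month -> {cheaper_option(volume, fixed, 0.05)}")
```

This version compares hardware against API fees only; staffing and idle time are left out here and change the picture considerably.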

Two costs the arithmetic above hides, and that teams routinely forget. First, utilisation: if your GPU sits idle 90 percent of the time, you are paying \$1,440 for \$144 of work, and the break-even moves badly against you — self-hosting only pays when the machine is busy. Second, operations: someone has to deploy the model, keep the server patched, monitor it, and restart it at 3 a.m. when it falls over. Serving frameworks such as vLLM make this far easier than it used to be, but "free weights" never means "free to run". Budget the people, not just the chips.
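Both hidden costs can be folded into one number: the effective cost per video once idle time and staffing are counted. In the sketch below, the 1,440-dollar GPU bill and the 10 percent utilisation case come from the text; the 200,000-video monthly capacity and the 3,000-dollar operations figure are invented placeholders, so substitute your own.

```python
def useful_spend(monthly_gpu_cost, utilisation):
    """Share of the GPU bill that actually paid for work."""
    return monthly_gpu_cost * utilisation


def effective_cost_per_video(monthly_gpu_cost, monthly_ops_cost,
                             capacity_videos_per_month, utilisation):
    """Total monthly cost divided by the videos actually processed."""
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    processed = capacity_videos_per_month * utilisation
    return (monthly_gpu_cost + monthly_ops_cost) / processed


GPU_COST = 1440.0     # from the worked example
OPS_COST = 3000.0     # placeholder: a slice of an engineer's time
CAPACITY = 200_000    # placeholder: videos one GPU could process per month if never idle

print(f"at 10% busy, useful spend is ${useful_spend(GPU_COST, 0.10):.0f} of ${GPU_COST:.0f}")
for busy in (0.10, 0.50, 1.00):
    cost = effective_cost_per_video(GPU_COST, OPS_COST, CAPACITY, busy)
    print(f"{busy:.0%} utilisation -> ${cost:.4f} per video")
```

With these placeholder inputs, the mostly idle machine costs more per video than the 0.05-dollar API price used earlier, and the busy one costs less. That gap is the whole utilisation argument in one comparison.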

A Common Pitfall — Choosing By The Leaderboard Alone

The mistake we see most often is picking a model from a single benchmark number — usually whichever model tops a chart someone screenshotted on social media that week. Three things make that misleading. First, the headline benchmarks (MMMU, OCRBench, and the like) test still images; a model can top them and still be mediocre at video, which depends on temporal benchmarks like Video-MME that the leaderboard screenshot probably did not show. If your product is about video, weight the video benchmarks, not the image ones.

Second, benchmark differences of one or two points are usually noise — within the margin where a different prompt, a different frame-sampling rate, or a different software version would flip the ranking. Treat a two-point gap as a tie and decide on licence, size, and serving cost instead. Third, leaderboards rarely show the licence, the model size, or the hardware needed to hit those scores — and those three are what actually determine whether you can ship. The fix is a discipline, not a chart: shortlist on licence first, filter to the sizes your hardware can run, then — and only then — compare the video benchmarks among the survivors, and prototype the top two on your own data before committing. The model that wins on your footage at your latency budget is the right one, regardless of who leads the public chart this month.
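The selection discipline described above is a filter pipeline, and writing it out makes the order of operations hard to get wrong. In the sketch below, the candidate names, licences, sizes, and video scores are entirely made up to exercise the logic; they are not benchmark results for any real model. The two-point tie margin is the rule of thumb from the text.

```python
from dataclasses import dataclass

ALLOWED_LICENCES = {"Apache-2.0", "MIT"}


@dataclass
class Candidate:
    name: str
    licence: str
    params_billion: float
    video_score: float  # placeholder score on a video benchmark, not real data


def shortlist(candidates, max_params_billion, tie_margin=2.0, keep=2):
    """Licence first, then size, then video score; flag near-ties among the finalists."""
    legal = [c for c in candidates if c.licence in ALLOWED_LICENCES]
    runnable = [c for c in legal if c.params_billion <= max_params_billion]
    ranked = sorted(runnable, key=lambda c: c.video_score, reverse=True)
    finalists = ranked[:keep]
    is_tie = (len(finalists) == 2 and
              abs(finalists[0].video_score - finalists[1].video_score) <= tie_margin)
    return finalists, is_tie


# Entirely hypothetical candidates.
pool = [
    Candidate("model-a", "Apache-2.0", 8, 66.0),
    Candidate("model-b", "MIT", 8, 65.1),
    Candidate("model-c", "custom-community", 8, 70.0),  # best score, licence rejected
    Candidate("model-d", "Apache-2.0", 72, 71.5),       # too large for the hardware
]

finalists, is_tie = shortlist(pool, max_params_billion=32)
print([c.name for c in finalists], "| treat as tie:", is_tie)
```

The highest-scoring candidates drop out before scores are even compared, and the two survivors land inside the tie margin, so the choice between them falls to licence, size, serving cost, and a prototype on your own data.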

Where Fora Soft Fits In

We deploy these open models inside real video products across the verticals we work in, and the family choice maps cleanly onto them. In video surveillance and telemedicine, where footage often cannot legally leave the customer's servers, an open model running on-premises is frequently the only compliant option, and Qwen3-VL or InternVL3 on the customer's own hardware is where we start. In e-learning and OTT, where features such as lecture summarisation or content tagging run at high volume, the owned-GPU economics above tip the build-versus-buy maths toward self-hosting, and a fine-tuned open model on a private domain outperforms a generic closed one. For edge and embedded work close to a camera, Pixtral's single-GPU footprint earns its place. The judgment we bring is the one this article teaches: choose on licence and serving cost first, weight the video benchmarks over the image ones, and reserve the closed-API path for low-volume features and rapid prototypes where time-to-ship beats unit cost.

References

  1. Liu, H., Li, C., Wu, Q., Lee, Y. J. "Visual Instruction Tuning." NeurIPS 2023 (Oral), arXiv:2304.08485 (accessed 2026-05-31). — Primary source for the original LLaVA architecture (frozen CLIP ViT-L/14 vision encoder + trainable projection matrix + Vicuna LLM), the two-stage recipe (595K alignment subset, ~150K GPT-generated instruction examples), and the visual-instruction-tuning method that became the field's template.
  2. Qwen Team, Alibaba Group. "Qwen2.5-VL Technical Report." arXiv:2502.13923 (February 2025, accessed 2026-05-31). — Source for dynamic-resolution ViT trained from scratch with window attention, dynamic FPS sampling, absolute time encoding for second-level event localisation, hours-long video support, three model sizes, and the Qwen2.5-VL-72B ≈ GPT-4o / Claude 3.5 Sonnet comparison.
  3. Qwen Team, Alibaba Cloud. "Qwen3-VL Technical Report." arXiv:2511.21631; and QwenLM/Qwen3-VL GitHub repository (accessed 2026-05-31). — Source for the Qwen3-VL dense (2B/4B/8B/32B) and MoE (30B-A3B / 235B-A22B) line, the 256K native context window expandable toward 1M, second-level video indexing, the September 2025 release, and the Apache 2.0 licence.
  4. Zhu, J., et al. (OpenGVLab). "InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models." arXiv:2504.10479 (April 2025, accessed 2026-05-31). — Source for native multimodal pre-training (single joint stage), variable visual position encoding (V2PE), mixed preference optimization (MPO), and the InternVL3-78B ≈ 72.2% MMMU figure.
  5. Mistral AI. "Pixtral 12B." arXiv:2410.07073 (October 2024, accessed 2026-05-31). — Source for the 400M-parameter vision encoder + 12B multimodal decoder architecture, the RoPE-2D native-resolution image handling, the 128K-token context window holding any number of images, and the Apache 2.0 licence.
  6. LLaVA-OneVision-1.5 Team. "LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training." arXiv:2509.23661 (accessed 2026-05-31). — Source for the fully-open training data and code claim, and the result that the 8B model beats Qwen2.5-VL-7B on 18 of 27 benchmarks (and the 4B beats Qwen2.5-VL-3B on all 27).
  7. Fu, C., et al. "Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis." arXiv:2405.21075; and Video-MME-v2, arXiv:2604.05015 (April 2026, accessed 2026-05-31). — Source for the existence and role of the Video-MME / Video-MME-v2 video-understanding benchmarks and the point that video performance is measured separately from still-image benchmarks like MMMU.
  8. "Apache License, Version 2.0." Apache Software Foundation, apache.org/licenses/LICENSE-2.0; and "The MIT License." Open Source Initiative, opensource.org/license/mit (accessed 2026-05-31). — Primary licence texts establishing that Apache 2.0 and MIT permit unrestricted commercial use, modification, and redistribution at any scale, in contrast to source-available community licences with usage-scale carve-outs.
  9. "Welcome to vLLM" and "Qwen3-VL usage guide." vLLM documentation, docs.vllm.ai (accessed 2026-05-31). — Source for vLLM as the open-source inference-serving framework used to deploy these VLMs in production and the operational point that serving, not weights, is the real cost.
  10. "Best Open-Weight Vision-Language Models 2026" (Presenc AI) and "Top 10 Vision Language Models in 2026" (DataCamp) (accessed 2026-05-31). — Competitor-landscape references used for orientation only: the May-2026 ranking that places Qwen2.5-VL-72B (~70.2% MMMU, ~888 OCRBench) at the top of the open field, InternVL3-78B as the strongest MIT-licensed alternative, and Pixtral / Phi-4 as the strongest small models. These secondary sources are cross-checked against the primary papers above and never used as the source of a spec claim.