Published 2026-05-27 · 24 min read · By Nikolay Sapunov, CEO at Fora Soft

Why This Matters

Almost every modern computer-vision model your team will deploy has a Vision Transformer somewhere inside it. The detector you picked from the YOLO production lineage lesson has a Transformer-style backbone or neck. The open-vocabulary detector from the Grounding DINO lesson is a ViT through and through. The segmentation model from the SAM 2 lesson uses a Hiera backbone. The video VLMs that classify scenes, describe clips, and answer questions about a stream all start with a ViT image encoder. If you do not understand the architecture, you cannot reason about why one model is fast and another is slow, why one needs a billion examples to train and another works with a thousand, or why the engineering team keeps debating patch size, image resolution, and embedding dimension as if those were knobs that mattered. They are knobs that matter. This primer is the foundation under every Phase 2 lesson, every Phase 4 multimodal lesson, and every Phase 5 generative-video lesson. It is also the cleanest mental model we know for explaining what a "foundation model for vision" actually is, in language a product manager or founder can follow.

The Mental Model — An Image Is A Sequence Of Patches

Before ViT, every serious image model was a convolutional neural network — a CNN. CNNs sweep small filters across the image, pooling features into smaller and smaller grids until the whole image has been summarised into a single vector. The architecture has two strong inductive biases baked in: locality, the assumption that nearby pixels are related to each other, and translation equivariance, the assumption that an object is the same object whether it appears in the top-left of the image or the bottom-right. Those biases are good. They are also a ceiling — they constrain what the model is allowed to learn.

The Vision Transformer paper, published by Alexey Dosovitskiy and colleagues at Google Research in 2020 under the title "An Image is Worth 16×16 Words", asked what happens if you take those biases out and let the model learn from scratch. The recipe is almost embarrassingly simple. Take an image, say 224×224 pixels. Cut it into patches of 16×16 pixels: that gives you 14 patches across and 14 patches down, or 196 patches total. Flatten each patch into a vector of 768 numbers (16 × 16 × 3 channels). Multiply each vector by a learned matrix to project it into the embedding space, which for ViT-Base is also 768-dimensional. Add a learned position embedding to each patch so the model knows where in the image the patch came from. Prepend a special learnable "class token" to the start of the sequence; its final output will be the image's overall representation. Feed all 197 tokens through a standard Transformer encoder, exactly the kind used in language models. Out the other end, take the class token's vector and pass it through a small classification head.

That is the whole architecture. There are no convolutions. There is no pooling. There is no special vision machinery. An image is being treated as a sequence of 196 patch tokens (plus the class token), and the same Transformer encoder that processes a sentence of 196 words can process this sequence. A "patch embedding", in plain language, is just a 768-number summary of one 16×16 patch of the picture.
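The patch-cutting step can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production implementation: the function name `patchify` and the nested-list image layout are assumptions made for this example, and a real pipeline would use tensors and a learned projection afterwards.

```python
def patchify(image, patch):
    """Cut an H x W x C image (nested lists) into flattened patch vectors.

    Patches are returned in row-major order: left to right, top to bottom.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            vector = []
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    vector.extend(image[y][x])  # append every channel of one pixel
            tokens.append(vector)
    return tokens


# A blank 224 x 224 RGB image, used only to check the shapes from the text.
image = [[[0, 0, 0] for _ in range(224)] for _ in range(224)]
tokens = patchify(image, patch=16)
print(len(tokens), len(tokens[0]))  # 196 768
```

Each of the 196 vectors would then be multiplied by the learned projection matrix and given its position embedding before entering the encoder.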

Diagram showing the ViT pipeline. Left: a 224x224 image with a grid overlay dividing it into 14x14 patches. Middle: arrows pulling each patch into a column of flattened vectors, then a linear projection block producing patch embeddings, then a 'plus' merging position embeddings, then a class token prepended. Right: the resulting sequence of 197 tokens entering a stack of Transformer encoder blocks (with multi-head self-attention and feed-forward layers labelled), and the class token output feeding into a small MLP head that produces class probabilities. Figure 1. The Vision Transformer recipe: image is cut into patches, patches become tokens, tokens go through a vanilla Transformer encoder, and the class token's final embedding represents the whole image.

The numbers worked. ViT-Base (12 layers, 768 embedding dim, 12 attention heads, 86 million parameters) matched strong CNN baselines on ImageNet once it had been pre-trained on a large enough data set. ViT-Large (24 layers, 1024 dim, 16 heads, 307M parameters) pulled ahead. ViT-Huge (32 layers, 1280 dim, 16 heads, 632M parameters) edged past the ImageNet state of the art held by EfficientNet- and ResNet-based models, set new top results on CIFAR-100 and the VTAB benchmark, and required less compute to pre-train than the CNNs it dethroned. The catch, and the only catch, was data: ViT had to see roughly 300 million labelled images (the JFT-300M dataset that Google had internally) to beat the CNNs. When trained on plain ImageNet's 1.3 million images, ViT lost to a comparable ResNet. CNNs were better at small data; ViT was better at scale.

Why The Attention Mechanism Is The Whole Point

The reason ViT beats CNNs at scale is the attention mechanism. A 3×3 convolution can only see a 3×3 neighbourhood at a time; to mix information from one side of the image to the other, the CNN has to stack many layers, with each layer mixing slightly further-apart regions. Information from corner to corner propagates slowly. The self-attention layer at the heart of every Transformer block does something different: every patch can directly attend to every other patch in a single layer. Patch number 1 (top-left corner) can in one step ask "how related am I to patch number 196 (bottom-right corner)?" and get a learned weight back. Long-range relationships that take a CNN many stacked layers to construct are available to a Transformer in layer one.

The mechanism itself is a weighted average. For each token, the model computes three vectors (a query, a key, and a value), and the attention score from token A to token B is the dot product of A's query and B's key, divided by the square root of the key dimension (64 per head in ViT-Base). A softmax over A's scores against all tokens turns them into a probability distribution that says "where should I look when constructing my next representation?" The new representation of token A is the softmax-weighted sum of every token's value vector, its own included. Stack that operation a dozen times and the model has built a rich global representation of the image.
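As a rough sketch, single-head scaled dot-product attention fits in a short pure-Python function. The queries, keys, and values are passed in directly here; in a real ViT they come from learned linear projections of each token, and the toy vectors below are made up for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        # Score this token's query against every token's key.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)  # one probability per token, summing to 1
        # The new representation is the weighted average of the value vectors.
        outputs.append([sum(w * v[i] for w, v in zip(weights, values))
                        for i in range(dim)])
    return outputs


# Three toy tokens with 2-dimensional queries, keys, and values.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
out = attention(q, k, v)
```

Multi-head attention runs several of these in parallel on slices of the embedding and concatenates the results.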

The trade-off is cost. Self-attention scales quadratically with the number of tokens: a 224×224 image at patch size 16 gives 196 tokens, which is comfortable; a 1024×1024 image at patch size 16 gives 4,096 tokens, and the attention matrix has about 16.8 million entries per head, per layer. That is the wall that motivated every architectural variant we discuss next.

Why ViT Beat The CNN — And Where The CNN Still Wins

Two effects let ViT pull ahead. First, the absence of inductive bias becomes an asset at scale. When the data set is small, baked-in assumptions help — the CNN does not have to learn locality from scratch. When the data set is huge, those baked-in assumptions are a ceiling — the model would learn locality on its own and then learn shapes the CNN cannot see, but only if there is enough data to learn them from. The 2020 ViT paper put it plainly: with enough data, large-scale training trumps inductive bias. Second, the attention mechanism's global receptive field gives every layer access to the whole image. A CNN's effective receptive field grows layer by layer; ViT's is the whole image from layer one.

Where CNNs still win in 2026: on small data sets, on small models, on edge devices with tight latency budgets, and on tasks where translation equivariance is a genuine constraint (medical imaging tasks where the model must respond identically to a feature regardless of its location, for example). EfficientNet and ConvNeXt remain strong baselines below the 100-million-parameter mark. The right framing today is not "Transformers replaced CNNs"; it is "Transformers replaced CNNs at scale". For a deployment that runs on a phone with a 50 ms inference budget, a small CNN often still wins. For a deployment that runs in a cloud GPU with a 200 ms budget and benefits from pre-training on hundreds of millions of images, ViT wins.

| Property | CNN (e.g., ResNet, EfficientNet) | ViT |
| --- | --- | --- |
| Inductive bias | Locality + translation equivariance | None (learned from data) |
| Receptive field per layer | Local (3×3, 5×5, 7×7 kernels) | Global (every token attends to every token) |
| Small-data behaviour | Excellent (biases help) | Poor (needs millions of examples) |
| Large-data behaviour | Plateaus | Keeps improving |
| Throughput on edge | Strong (mature kernels) | Improving but historically slower |
| Memory at high resolution | Manageable | Quadratic in token count (expensive) |

The Token Maths — Worked Out Loud

Let us walk through the numbers for ViT-Base on a 224×224 image, because the arithmetic is the cleanest way to internalise what is happening.

A 224×224 RGB image is a tensor of shape (224, 224, 3) — 150,528 numbers in total. The patch size is 16×16. The grid is 224 / 16 = 14 patches across and 14 patches down, for 14 × 14 = 196 patches in total. Each patch flattens into a vector of 16 × 16 × 3 = 768 numbers.

The patch embedding is a single linear projection — a learned matrix of shape (768, 768) — that maps each patch vector to a 768-dimensional embedding. That is 196 × 768 = 150,528 numbers entering the Transformer. The model also prepends a learned [CLS] token (a single 768-dim vector) and adds learned position embeddings of shape (197, 768). The Transformer encoder is fed a tensor of shape (197, 768).

ViT-Base has 12 Transformer encoder blocks. Each block has multi-head self-attention with 12 heads of dimension 64 (because 768 / 12 = 64), and a feed-forward layer with hidden dimension 3072 (the standard 4× expansion). The total parameter count is roughly 86 million.
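The 86 million figure can be reproduced by counting weights and biases. The sketch below assumes a standard layout that the text does not spell out: a bias on every linear layer, two LayerNorms per block plus a final one, and a 1000-class ImageNet head. The function name is hypothetical.

```python
def vit_parameter_count(dim=768, depth=12, mlp_ratio=4, patch=16,
                        channels=3, image=224, classes=1000):
    """Count the weights and biases of a plain ViT classifier."""
    tokens = (image // patch) ** 2 + 1             # patches plus the class token
    patch_embed = (patch * patch * channels) * dim + dim
    class_token = dim
    position_embed = tokens * dim
    attention = 4 * (dim * dim + dim)              # query, key, value, output projections
    hidden = mlp_ratio * dim
    feed_forward = (dim * hidden + hidden) + (hidden * dim + dim)
    layer_norms = 2 * (2 * dim)                    # two LayerNorms per block, scale + shift
    block = attention + feed_forward + layer_norms
    final_norm = 2 * dim
    head = dim * classes + classes
    return (patch_embed + class_token + position_embed
            + depth * block + final_norm + head)


print(vit_parameter_count())  # 86567656
```

Almost all of the parameters sit in the 12 blocks, and within each block the feed-forward layer holds about twice as many as attention.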

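The attention score matrix of one self-attention layer over the 197 tokens costs 197 × 197 × 768 ≈ 30 million multiply-add operations, summed over all 12 heads (each head handles 64 of the 768 dimensions). Applying those scores to the value vectors costs the same again, so the attention-matrix work is about 60 million operations per layer and roughly 0.7 billion across 12 layers. At this token count that is the small part of the bill: the query, key, value, and output projections cost about 465 million multiply-adds per layer, and the feed-forward layer about 930 million. End to end, one forward pass through ViT-Base on a 224×224 image is about 17.5 billion multiply-adds, usually quoted as roughly 17 GFLOPs.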

Now scale to a 1024×1024 image at the same patch size. The grid becomes 64 × 64 = 4,096 patches. The score-matrix cost per layer becomes 4,096 × 4,096 × 768 ≈ 12.9 billion operations, about 430× the cost of the 224×224 case for attention alone, even though the token count has grown only about 21×. That is the quadratic wall we keep mentioning. Almost every architecture below was invented to climb around it.
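A small calculator makes it easier to see where the cost goes as the token count grows. This is an approximation under stated assumptions: it counts multiply-adds only, sums the attention cost over all heads, and ignores patch embedding, LayerNorm, and softmax. The function name is hypothetical.

```python
def vit_multiply_adds(tokens, dim=768, depth=12, mlp_ratio=4):
    """Approximate multiply-adds for the encoder blocks of a plain ViT.

    Returns (attention_matrix, projections, feed_forward), each summed over
    all layers.
    """
    scores = tokens * tokens * dim          # query-key dot products, all heads
    weighted_sum = tokens * tokens * dim    # applying the weights to the values
    projections = 4 * tokens * dim * dim    # query, key, value, output
    feed_forward = 2 * tokens * dim * (mlp_ratio * dim)
    return (depth * (scores + weighted_sum),
            depth * projections,
            depth * feed_forward)


small = vit_multiply_adds(tokens=197)    # 224 x 224, patch 16, plus class token
large = vit_multiply_adds(tokens=4097)   # 1024 x 1024, patch 16, plus class token

for name, parts in (("224", small), ("1024", large)):
    total = sum(parts)
    shares = [round(100 * p / total) for p in parts]
    print(name, f"{total / 1e9:.1f} G multiply-adds", shares)
```

The printed shares show the attention-matrix term moving from a minor cost at 224×224 to the largest single cost at 1024×1024, while the projection and feed-forward terms grow only linearly with the token count.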

Patch Size, Resolution, And The Knobs That Actually Matter

Three engineering knobs dominate every ViT deployment decision: patch size, input resolution, and embedding dimension. They look small. They are the three numbers that decide whether your feature ships at 60 FPS or 6 FPS.

Patch size controls how aggressively the model down-samples. ViT-Base/16 uses 16×16 patches and produces 196 tokens from a 224×224 image. ViT-Base/14 uses 14×14 patches and produces 256 tokens. Smaller patches mean more tokens, more fine-grained features, quadratically more attention cost, and usually better accuracy on dense tasks like segmentation. Larger patches (32×32) cut token count by 4× and inference cost by 16× for the attention layers, at the cost of detail.

Input resolution matters because token count scales with the resolution squared. Going from 224×224 to 384×384, a routine upgrade for accuracy, roughly triples token count from 196 to 576 and increases attention cost by about 8.6×. Going to 1024×1024 takes you to the wall.

Embedding dimension controls model capacity. ViT-Tiny uses 192, ViT-Small uses 384, ViT-Base uses 768, ViT-Large uses 1024, ViT-Huge uses 1280, ViT-g (giant) uses 1408. Doubling the embedding dimension roughly quadruples the parameters of the feed-forward layers and doubles the cost of computing the attention scores.

In practice, the right combination depends on what you are doing. For real-time inference on edge: ViT-Tiny/16 or ViT-Small/16 at 224×224 resolution. For maximum cloud accuracy on small images: ViT-Large/14 at 384×384. For dense prediction (segmentation, detection backbones): ViT-Base/16 at 512×512 or higher, with the hierarchical extensions we describe below. The temptation to pick a huge model at high resolution is what drove most teams toward the hierarchical and windowed variants.
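To get a feel for how patch size and resolution move the token count, a quick sweep helps. The helper below is a hypothetical utility for square images, and the relative cost it prints covers only the quadratic attention-matrix term, measured against 224×224 at patch size 16.

```python
def token_count(resolution, patch):
    """Number of patch tokens for a square image (class token excluded)."""
    if resolution % patch:
        raise ValueError("resolution must be a multiple of the patch size")
    return (resolution // patch) ** 2


baseline = token_count(224, 16)
for resolution, patch in [(224, 32), (224, 16), (224, 14), (384, 16), (1024, 16)]:
    tokens = token_count(resolution, patch)
    relative_attention = (tokens / baseline) ** 2
    print(f"{resolution}px / patch {patch}: {tokens} tokens, "
          f"{relative_attention:.2f}x attention cost")
```

Running a sweep like this before committing to a configuration is a cheap way to spot a combination that will not fit the latency budget.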

From Image To Video — Why Plain ViT Is Not Enough

A video is a stack of images, plus the dimension of time. Multiplying token count by 16 frames per clip — the typical input length for an action-recognition model — turns 196 tokens into 3,136 tokens. The attention matrix grows by 256×. Throw a 1-minute clip at 30 frames per second at the model and you are far past where any GPU can hold the attention matrix in memory.
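The same arithmetic extends to clips. The sketch below assumes one token per 16×16 patch per frame with no temporal grouping; the function name is hypothetical.

```python
def video_tokens(frames, resolution=224, patch=16):
    """Token count for a clip when every frame is patchified independently."""
    per_frame = (resolution // patch) ** 2
    return frames * per_frame


clip = video_tokens(frames=16)          # the 16-frame clip from the text
minute = video_tokens(frames=60 * 30)   # one minute at 30 frames per second

# Tokens, then attention-matrix entries per head per layer.
print(clip, clip ** 2)
print(minute, minute ** 2)
```

The one-minute case lands at hundreds of thousands of tokens, which is why joint attention over a raw stream is not an option.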

Two things have to happen for video. First, the model has to understand temporal relationships: the patch covering pixel (10, 20) in frame 5 has to be linked to the patch covering pixel (10, 20) in frame 6, even though the two sit far apart in the flattened token sequence. Second, the cost of attention over space × time has to come down by orders of magnitude.

The video-Transformer literature is essentially a list of factorisations and locality tricks that solve those two problems. The four we cover below are the four you actually meet in production.

ViViT — The Pure-Transformer Video Recipe

ViViT was published by Anurag Arnab and colleagues at Google in March 2021, about five months after the original ViT. The paper presents four model variants. Variant 1, spatio-temporal attention, is the brute-force baseline: tubelet patches (3-D blocks of, say, 16 × 16 × 2 voxels) become tokens, and the encoder attends over all of them jointly. It is the most accurate, but the most expensive. Variant 2, the factorised encoder, first runs a spatial Transformer per frame, then a small temporal Transformer that takes the per-frame [CLS] tokens as its input. Variant 3, factorised self-attention, interleaves spatial and temporal attention layers in the same block. Variant 4, factorised dot-product attention, splits the attention heads themselves into spatial heads and temporal heads.

The two practical wins were Variant 2 and the use of tubelet patches. Tubelet patches replace the per-frame 2-D patches with 3-D blocks that span multiple frames, which means the model sees a small slice of temporal context inside every token. Variant 2's factorisation cuts the cost of full spatio-temporal attention by an order of magnitude on long clips while losing only a small amount of accuracy.
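A back-of-the-envelope comparison shows where the saving comes from. The sketch counts query-key pairs in a single layer only, so it ignores the different depths of the spatial and temporal encoders; the 32-frame clip and the function names are assumptions for illustration.

```python
def tubelet_tokens(frames, resolution, patch=16, tubelet_depth=2):
    """Tokens after cutting a clip into (tubelet_depth x patch x patch) blocks."""
    time_steps = frames // tubelet_depth
    per_step = (resolution // patch) ** 2
    return time_steps, per_step


def joint_attention_pairs(time_steps, per_step):
    """Query-key pairs when every token attends to every token (Variant 1)."""
    return (time_steps * per_step) ** 2


def factorised_encoder_pairs(time_steps, per_step):
    """Pairs for one spatial pass per time step plus one temporal pass (Variant 2)."""
    spatial = time_steps * per_step ** 2
    temporal = time_steps ** 2          # one summary token per time step
    return spatial + temporal


t, n = tubelet_tokens(frames=32, resolution=224)
print(t, n)  # 16 196
print(joint_attention_pairs(t, n) / factorised_encoder_pairs(t, n))
```

The ratio is close to the number of time steps, so the saving grows with clip length.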

ViViT-L/16, trained on the Kinetics-400 action-recognition benchmark (400 action classes, 240,000 training clips), reaches 81.3% top-1 accuracy. On Kinetics-600 it reaches 83.5%, and on Something-Something v2, a benchmark designed specifically to test temporal reasoning, 65.4%. Those were state-of-the-art numbers in 2021 and are still the reference for the "pure attention for video" approach.

TimeSformer — Divided Attention

TimeSformer was published two months before ViViT, by Gedas Bertasius, Heng Wang, and Lorenzo Torresani at Facebook AI Research. The contribution is divided attention: instead of attending over space and time jointly, the model alternates layers — one layer attends only along the time axis (linking the same spatial position across frames), the next attends only along the spatial axes (linking different positions within a frame). The two attention layers stack, and the result is global spatio-temporal coverage at one-quarter to one-tenth of the cost of joint attention.
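The divided pattern is easy to state in terms of which tokens each token may look at. The sketch below indexes tokens as (frame, position) pairs and counts query-key pairs for one temporal plus one spatial attention step; the clip size and function names are hypothetical.

```python
def temporal_neighbours(position, frames):
    """Tokens at the same spatial position in every frame."""
    return [(f, position) for f in range(frames)]


def spatial_neighbours(frame, patches):
    """All tokens inside one frame."""
    return [(frame, p) for p in range(patches)]


def joint_pairs(frames, patches):
    """Every token attends to every token in the clip."""
    tokens = frames * patches
    return tokens * tokens


def divided_pairs(frames, patches):
    """Each token attends over time, then over its own frame."""
    tokens = frames * patches
    return (tokens * len(temporal_neighbours(0, frames))
            + tokens * len(spatial_neighbours(0, patches)))


frames, patches = 8, 196   # hypothetical clip: 8 frames of 14 x 14 patches
print(joint_pairs(frames, patches) / divided_pairs(frames, patches))
```

Each token compares itself with frames + patches others instead of frames × patches, which is the whole saving.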

TimeSformer's headline numbers are 80.7% top-1 on Kinetics-400 and 82.2% on Kinetics-600, both from the TimeSformer-L variant. The model came in three variants: base, HR (high spatial resolution), and L (longer clips). TimeSformer-L on Kinetics-600 set the state of the art at the time. The reference implementation is at github.com/facebookresearch/TimeSformer under a CC BY-NC license (note: non-commercial, so read it before you ship). The divided-attention pattern was widely copied; you will see it inside more recent models too.

Video Swin Transformer — Hierarchical Window Attention

Swin Transformer — the Shifted Window image model — was published by Ze Liu and colleagues at Microsoft in 2021. The contribution that mattered most for production was window-based attention with shifted windows. Instead of letting every token attend to every other token, Swin partitions the token grid into non-overlapping windows of, say, 7×7 tokens, and applies attention inside each window. Information leaks between windows by shifting the window partitioning by half a window every other layer. The cost of attention drops from quadratic in the number of tokens to linear, because the per-window attention is constant-cost and the number of windows scales linearly. The model is also hierarchical: it merges tokens between stages to produce a feature pyramid, the same way a CNN does, making Swin a drop-in backbone for dense-prediction tasks like detection and segmentation.
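The linear-versus-quadratic claim and the window shift can both be checked with a few lines. This is a simplified sketch: it counts query-key pairs on a square token grid and models the shift as a cyclic roll of the grid, which is how the Swin paper describes its efficient implementation. Function names are hypothetical.

```python
def global_pairs(grid):
    """Query-key pairs when every token on a grid x grid map attends globally."""
    tokens = grid * grid
    return tokens * tokens


def windowed_pairs(grid, window=7):
    """Query-key pairs when attention is restricted to window x window blocks."""
    if grid % window:
        raise ValueError("grid must be a multiple of the window size")
    windows = (grid // window) ** 2
    return windows * (window * window) ** 2


def shift_grid(rows, shift):
    """Cyclically roll a 2-D token grid up and to the left by `shift` cells."""
    rolled = rows[shift:] + rows[:shift]
    return [row[shift:] + row[:shift] for row in rolled]


# Quadrupling the token count (56 -> 112 per side) multiplies windowed cost by 4
# and global cost by 16.
print(windowed_pairs(112) / windowed_pairs(56), global_pairs(112) / global_pairs(56))

# After a roll, re-applying the same window partition groups different tokens.
print(shift_grid([[0, 1, 2], [3, 4, 5], [6, 7, 8]], 1))
```

With 7×7 windows the shift would be 3 tokens, roughly half a window, applied on every other layer.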

Video Swin Transformer, published shortly after by the same team, extends the recipe to time: windows become 3-D blocks of, say, 8 × 7 × 7 tokens (8 frames × 7 × 7 spatial), and the shifted-window trick applies in both space and time. On Kinetics-400, Video Swin-L reaches 84.9% top-1; on Something-Something v2, 69.6%. The dominant practical fact is the hierarchical feature pyramid: Video Swin slots in wherever an FPN-style backbone would, which is why it became the default video backbone for downstream tasks like temporal localisation and dense video segmentation for several years. Swin V2 followed in 2022 with scaling tweaks that pushed total parameter count to 3 billion and supported very high-resolution inputs.

Hiera — Cutting The Bells And Whistles

By 2023, the field had stacked vision-specific tweaks on top of plain ViT for years — hierarchical pyramids, shifted windows, relative positional biases, convolutional stems, learned downsampling. Each piece improved a benchmark. Together they made the models slower than they had any right to be. Hiera, published at ICML 2023 (Oral) by Chaitanya Ryali and colleagues at Meta FAIR with collaborators from Georgia Tech and Johns Hopkins, asked the obvious question: what happens if you take all of those tweaks back out, and let the model learn what it needs from a Masked Autoencoder (MAE) pre-training objective?

The recipe: start from a hierarchical ViT (so you keep the multi-scale feature pyramid that downstream tasks need), remove every vision-specific addition — no shifted windows, no relative positional biases, no convolutional kernels in the patch embedding — and pre-train with MAE, where the model has to reconstruct masked patches of the image. MAE pre-training gives the model implicit locality and translation-equivariance priors from the data, instead of having those priors baked into the architecture. The result is an architecture that is simpler and faster than every prior hierarchical ViT, and more accurate.

The numbers are striking. Hiera is 2.4× faster than MViTv2 on images and 5.1× faster on video. On Kinetics-400, Hiera-L beats every prior model in its class; on Kinetics-600 and Kinetics-700 it sets new state-of-the-art numbers. The reference implementation is at github.com/facebookresearch/hiera under the Apache 2.0 license — properly commercial-friendly. Hiera is also the visual backbone that powers SAM 2 (the segmentation model from the SAM 2 lesson), which is the strongest production endorsement an architecture can get.

A four-quadrant comparison chart of the video Vision Transformer family. Each quadrant shows the model name, publication year, dominant idea, Kinetics-400 top-1 accuracy, and license. Quadrant 1 — ViViT: 2021, tubelet patches plus factorised encoder, 81.3 percent, Apache 2. Quadrant 2 — TimeSformer: 2021, divided attention, 80.7 percent (TimeSformer-L), CC BY-NC. Quadrant 3 — Video Swin Transformer: 2021, hierarchical 3D shifted windows, 84.9 percent (Swin-L), MIT. Quadrant 4 — Hiera: 2023, MAE-pretrained hierarchical ViT without bells-and-whistles, 87.3 percent (Hiera-L), Apache 2. Figure 2. The four video Vision Transformer families that matter in 2026. Hiera is the fastest accurate option for new projects; Video Swin remains the default in many downstream-task backbones; ViViT and TimeSformer are the canonical reference papers.

VideoMAE And The Pre-Training Question

Vision Transformers are data-hungry, and there is never enough labelled video to train one from scratch. The field's answer is self-supervised pre-training on huge unlabelled video corpora. The dominant recipe is Masked Autoencoders for Video (VideoMAE), published at NeurIPS 2022 by Zhan Tong and colleagues at Nanjing University and Tencent AI Lab. The idea is simple. Take a clip of 16 frames, divide each frame into 16×16 patches, randomly mask out 90–95% of the patches across space and time, feed the surviving 5–10% of patches into a ViT encoder, and ask a small decoder to reconstruct the masked pixels. The encoder has to learn what video looks like to do that.

The masking ratio matters. Image MAE works well at 75% masking; video MAE works better at 90–95% masking, because consecutive frames are largely redundant. The model is forced to predict masked patches from temporally distant context, which is exactly the learning signal you want.
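One way to implement such a mask is "tube" masking, the scheme described in the VideoMAE paper: the same spatial positions are hidden at every time step, so the model cannot simply copy a patch from a neighbouring frame. The sketch below is an illustration under assumptions: 8 time steps (16 frames grouped in pairs), a 14×14 grid, and hypothetical function and parameter names.

```python
import random


def tube_mask(time_steps, positions, mask_ratio, seed=0):
    """Choose which spatial positions stay visible, shared by every time step.

    Returns the (time_step, position) tokens that the encoder gets to see.
    """
    rng = random.Random(seed)
    keep = positions - round(positions * mask_ratio)
    visible_positions = sorted(rng.sample(range(positions), keep))
    return [(t, p) for t in range(time_steps) for p in visible_positions]


visible = tube_mask(time_steps=8, positions=196, mask_ratio=0.9)
print(len(visible), "of", 8 * 196, "tokens reach the encoder")
```

Because the encoder only processes the visible tokens, a 90% masking ratio also cuts pre-training compute sharply.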

VideoMAE's numbers on a ViT-Base backbone: 81.5% top-1 on Kinetics-400 after self-supervised pre-training on Kinetics-400 itself (no external labelled data). VideoMAE-V2, published in 2023, scales to a 1-billion-parameter ViT with a dual masking strategy and reaches 90.0% top-1 on Kinetics-400, 89.9% on Kinetics-600, 68.7% on Something-Something v1, and 77.0% on v2. Those are foundation-model numbers: they sit at or near the top of every public leaderboard, and the representation behind them was learned mostly from unlabelled video, with labels entering only in the later supervised stages.

The practical implication: if you are training a video-understanding model in 2026 on your own data, you almost certainly start from a publicly released VideoMAE-V2 checkpoint or a Hiera-MAE checkpoint, then fine-tune on whatever labels you have. Starting from random initialisation on a domain video dataset of, say, 50,000 clips is essentially never the right move.

DINO, DINOv2, And DINOv3 — Self-Supervision Without Reconstruction

The other big self-supervised family is DINO — self-DIstillation with NO labels — published by Mathilde Caron and colleagues at Meta FAIR in 2021. Instead of reconstructing masked patches, DINO trains a student network to match the output of a teacher network that is an exponential moving average of past students. Both networks see different augmented views of the same image and have to produce the same embedding. The model learns a representation without any labels, just from the constraint that two augmented views of the same image should be embedded near each other.
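The teacher update is the simplest part of the recipe to show in code. In this sketch a "network" is just a flat list of weights, and the momentum value of 0.996 is an assumed illustrative setting rather than something stated in the text.

```python
def ema_update(teacher, student, momentum=0.996):
    """Move each teacher weight a small step toward the matching student weight."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher, student)]


teacher = [0.0, 1.0, -2.0]   # toy weights standing in for a full network
student = [1.0, 1.0, 2.0]
for _ in range(3):
    # In real training the student would also change between updates.
    teacher = ema_update(teacher, student)
print(teacher)
```

The teacher receives no gradients; it only drifts slowly toward the student, which gives the student a stable target to match.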

DINOv2 (2023, Maxime Oquab and colleagues at Meta) scaled the recipe to 142 million curated images and produced the strongest open-weights image encoder of its era. DINOv3 (August 2025, Oriane Siméoni and colleagues at Meta) scaled it further: 1.7 billion images, 7 billion parameters in the biggest model, and a new Gram Anchoring loss to stabilise the dense feature maps that degrade in long training runs. On ImageNet linear-probe classification, DINOv3 reaches 88.4% top-1, beating DINOv2's 87.3%. On semantic segmentation of PASCAL VOC, DINOv3 hits 86.6 mean IoU vs DINOv2's 83.1.

Why this matters for video: DINO-family features are exceptionally good at dense tasks (segmentation, depth, tracking, anything that needs a useful representation per pixel), and they transfer to video by extracting features per frame and then training a small temporal head on top. DINOv2 is released under Apache 2.0; DINOv3 ships under Meta's own DINOv3 License, which allows commercial use but carries its own terms, so read it before you ship. Both are among the strongest off-the-shelf vision backbones available when you have no labelled data in your domain.

Where Vision Transformers Live In A 2026 Production System

Vision Transformers are everywhere in a modern video AI stack, almost always as a backbone — the first network in the pipeline that turns pixels into features — rather than as a standalone model. Here is the map.

In object detection, the trend has been to replace CNN backbones with ViT backbones in the higher-accuracy tier. RT-DETR (covered in the open-vocabulary detection lesson) pairs a CNN backbone with a Transformer encoder and decoder. YOLO-style detectors increasingly use Transformer-style modules in their necks. Grounding DINO and OWLv2, the open-vocabulary detectors, are Transformer backbones with a text encoder bolted on top.

In segmentation, SAM 2 (the segmentation model from the SAM 2 lesson) uses a Hiera backbone, a hierarchical Vision Transformer, as its image encoder. The memory module that lets it track objects through occlusion is itself a Transformer.

In image and video classification, the entire foundation-model layer is ViT. EVA-02 and SigLIP 2 are ViT-based image encoders used inside almost every recent vision-language model. DINOv2 and DINOv3 are ViT image encoders used as feature extractors for almost every dense-prediction task.

In vision-language models (the multimodal models from Phase 4), the vision encoder is almost always a ViT or a close variant, usually with a patch size chosen to fit the language model's token budget. LLaVA, Qwen-VL, Pixtral, InternVL, Florence-2, and the frontier-grade VLMs all use a Transformer image tokeniser to convert pixels into the tokens the language model can read.

In generative video (Phase 5), the diffusion models that power Sora, Veo, Kling, Runway, Pika, and Luma rely on Transformer blocks for the spatio-temporal attention layers that mix space and time, as far as their public descriptions go. The open-weights generative video models (HunyuanVideo, CogVideoX, Mochi, LTX-Video, Wan) are similar.

In real-time video calls — Phase 6 — Transformer-based segmentation models power background blur and replacement (covered in the MediaPipe Selfie Segmentation lesson). Lightweight ViT variants are starting to replace older MobileNet-style models in browser-deployed WebGPU pipelines.

Pitfalls When Shipping ViT Backbones In Video Products

We have shipped or evaluated Vision Transformer backbones into roughly twenty production video features at Fora Soft, mostly in video surveillance, telemedicine, conferencing, and OTT. The four pitfalls below come up over and over.

Pitfall 1: Quadratic attention bites at high resolution. The most common surprise in a ViT deployment is that the model's inference cost balloons when the team decides to "just run it at 4K". A 4K UHD frame (3840×2160) is roughly 8.3 million pixels; at patch size 16 that gives 32,400 tokens per frame; the attention matrix has about 1.05 billion entries per head, per layer. The model that ran at 30 FPS on 224×224 input runs at less than 1 FPS. The fix is to use a hierarchical model (Swin, Hiera) that does not attend globally, or to run the model on a downsampled image and project the result back up.

Pitfall 2: Position embeddings are tied to the input resolution. The original ViT learns a fixed grid of position embeddings — say, 14×14 for a 224×224 image. When you change the input resolution, those embeddings have to be interpolated, which usually works but sometimes degrades performance. Modern variants (RoPE, sinusoidal, relative) handle resolution changes more gracefully. Read the model's documentation before changing the input resolution.

Pitfall 3: Small data + plain ViT = no transfer. If you train a plain ViT-Base on your domain video data of 10,000 clips from scratch, you will get worse results than a ResNet-50. ViT needs scale. The way out is to start from a pre-trained checkpoint — DINOv2, DINOv3, VideoMAE-V2, Hiera, SigLIP 2 — and fine-tune. We have not seen a project where training a ViT from scratch was the right call.

Pitfall 4: Edge deployment is harder than CNN edge deployment. Vision Transformers have historically lagged CNNs on edge devices because the kernels for attention are less mature in TensorRT, CoreML, and TensorFlow Lite than the kernels for convolution. The gap is closing fast — TensorRT now has good ViT support, Apple's CoreML has explicit ViT optimisations, and architectures like EfficientFormer were designed specifically for edge. But "I will deploy a ViT-L on a phone" is still a much riskier engineering bet than "I will deploy a MobileNet on a phone". Test the export path early.
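For Pitfall 2, the interpolation step looks roughly like the sketch below. It uses bilinear resampling in plain Python to keep the example short; libraries commonly use bicubic interpolation instead, and the class token's embedding is handled separately and left untouched. The function name and the toy embeddings are hypothetical.

```python
def resize_position_grid(grid, new_size):
    """Bilinearly resample a square grid of position-embedding vectors.

    `grid[row][col]` is one embedding vector (a list of floats); the corners
    of the old grid map onto the corners of the new grid.
    """
    old_size = len(grid)
    dim = len(grid[0][0])
    scale = (old_size - 1) / (new_size - 1) if new_size > 1 else 0.0
    resized = []
    for row in range(new_size):
        y = row * scale
        y0 = int(y)
        y1 = min(y0 + 1, old_size - 1)
        wy = y - y0
        new_row = []
        for col in range(new_size):
            x = col * scale
            x0 = int(x)
            x1 = min(x0 + 1, old_size - 1)
            wx = x - x0
            vector = []
            for i in range(dim):
                top = (1 - wx) * grid[y0][x0][i] + wx * grid[y0][x1][i]
                bottom = (1 - wx) * grid[y1][x0][i] + wx * grid[y1][x1][i]
                vector.append((1 - wy) * top + wy * bottom)
            new_row.append(vector)
        resized.append(new_row)
    return resized


# Toy 14 x 14 grid of 2-dimensional embeddings, resized for 384 x 384 input.
old = [[[float(r), float(c)] for c in range(14)] for r in range(14)]
new = resize_position_grid(old, 24)
print(len(new), len(new[0]), len(new[0][0]))  # 24 24 2
```

After a resize like this, a short fine-tuning run at the new resolution usually recovers most of any accuracy lost to interpolation.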

A diagram showing four common pitfalls with ViT backbones in video products. Pitfall 1 — quadratic attention scaling: a chart showing FPS dropping from 30 at 224x224 to less than 1 at 4K. Pitfall 2 — position embedding interpolation: two grids of position embeddings, one 14x14 the other 32x32, with an interpolation arrow between. Pitfall 3 — training from scratch on small data fails: a bar chart showing ViT trained from scratch losing to ResNet at 10K clips, and ViT pretrained beating it. Pitfall 4 — edge export friction: a flowchart showing PyTorch -> ONNX -> TensorRT with red friction marks at the export edges. Figure 3. The four pitfalls we keep rediscovering when shipping ViT backbones into video products.

Picking The Right Backbone — A Decision Walkthrough

The decision tree we use at Fora Soft has four questions.

The first question is: what task are you doing? For image classification on a fixed taxonomy: a small ViT (Tiny or Small) pre-trained with DINOv2 or DINOv3, fine-tuned on your labels. For segmentation: SAM 2 (Hiera backbone) is the default, with DINOv3 + a small head as an alternative when you need a single-model pipeline. For detection: Grounding DINO if you need open-vocabulary detection; RT-DETR or a YOLO with a Transformer neck if a fixed vocabulary is enough. For video classification: Hiera-MAE or VideoMAE-V2 pre-trained, fine-tuned on your action labels. For vision-language: pick the VLM, not the backbone, because the VLM ships with its own visual encoder.

The second question is: what is your latency budget? Sub-50 ms on edge: ViT-Tiny or ViT-Small at 224×224, with TensorRT or CoreML export tested. 50–200 ms in cloud: ViT-Base or ViT-Large at 224×224 or 384×384. 200 ms+ for batch processing: anything up to ViT-Huge at 384×384 or higher.

The third question is: how much labelled data do you have? Under 10,000 examples: start from a pre-trained DINOv3 / Hiera / VideoMAE-V2 checkpoint and fine-tune the head only, freezing the backbone. 10,000 to 100,000: fine-tune the head and the last few backbone layers. Over 100,000: full fine-tuning of the backbone with a low learning rate. Never train ViT from scratch on a domain dataset unless you have at least a million examples and a research-grade compute budget.

The fourth question is: what licence do you need? Apache 2.0 or MIT (Hiera, Swin, DINOv2, EVA-02, SigLIP 2): safe for most commercial use. Custom licences (DINOv3 ships under Meta's DINOv3 License): commercial use is allowed, but read the terms. Non-commercial (TimeSformer's reference code, some Meta-internal forks): legal review required before shipping. ImageBind and certain Meta data sets are even more restrictive, so read the licences.
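The latency and data questions reduce to simple threshold checks, which makes them easy to encode as a checklist helper. The sketch below only restates the thresholds from the second and third questions; the function names and return strings are hypothetical, and the task and licence questions still need human judgement.

```python
def model_size_for_latency(budget_ms):
    """Map a latency budget in milliseconds to the model tier from question two."""
    if budget_ms < 50:
        return "ViT-Tiny or ViT-Small at 224x224"
    if budget_ms <= 200:
        return "ViT-Base or ViT-Large at 224x224 or 384x384"
    return "up to ViT-Huge at 384x384 or higher"


def fine_tuning_plan(labelled_examples):
    """Map the amount of labelled data to the strategy from question three."""
    if labelled_examples < 10_000:
        return "freeze the backbone, train the head only"
    if labelled_examples <= 100_000:
        return "train the head and the last few backbone layers"
    return "full fine-tuning with a low learning rate"


print(model_size_for_latency(30))
print(fine_tuning_plan(25_000))
```

In every branch the starting point is a pre-trained checkpoint, never random initialisation.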

Where Fora Soft Fits In

Most of the video features we ship at Fora Soft start with a pre-trained Vision Transformer as the backbone. In video surveillance we use ViT-based open-vocabulary detectors and SAM 2 for segmentation; in telemedicine we use DINOv3 for symptom-area localisation and Hiera-MAE for action classification when the camera sees a procedure; in video conferencing we use lightweight ViT variants on the WebGPU pipeline for background segmentation; in OTT we use Video Swin and Hiera for content-aware encoding decisions and scene classification on archive footage. We do not train ViTs from scratch — we fine-tune from public Apache or MIT checkpoints. The architecture choice is rarely the bottleneck; the data pipeline, the labelling strategy, and the export path to the deployment runtime are where projects live or die.

References

  1. Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al. "An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale." ICLR 2021. arXiv:2010.11929. Accessed 2026-05-27. Original ViT paper. Source of the patch-as-token recipe, the 86 / 307 / 632 million parameter Base / Large / Huge configurations, and the JFT-300M-scale data-efficiency results.

  2. Vaswani, A., Shazeer, N., Parmar, N., et al. "Attention Is All You Need." NeurIPS 2017. arXiv:1706.03762. Accessed 2026-05-27. The Transformer encoder that ViT borrows wholesale. Source for the multi-head self-attention formulation and the 4× feed-forward expansion.

  3. Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lučić, M., Schmid, C. "ViViT: A Video Vision Transformer." ICCV 2021. arXiv:2103.15691. Accessed 2026-05-27. Four-variant ViViT and tubelet patches; Kinetics-400 81.3% / Kinetics-600 83.5% / Something-Something v2 65.4% numbers.

  4. Bertasius, G., Wang, H., Torresani, L. "Is Space-Time Attention All You Need for Video Understanding?" ICML 2021. arXiv:2102.05095. Accessed 2026-05-27. TimeSformer with divided attention; Kinetics-400 80.7% / Kinetics-600 82.2%. Reference implementation: github.com/facebookresearch/TimeSformer, CC BY-NC.

  5. Liu, Z., Lin, Y., Cao, Y., et al. "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows." ICCV 2021. arXiv:2103.14030. Accessed 2026-05-27. Shifted-window attention; hierarchical feature pyramid; MIT-licensed reference.

  6. Liu, Z., Ning, J., Cao, Y., et al. "Video Swin Transformer." CVPR 2022. arXiv:2106.13230. Accessed 2026-05-27. 3-D shifted windows; Kinetics-400 84.9% (Swin-L), Something-Something v2 69.6%.

  7. Liu, Z., Hu, H., Lin, Y., et al. "Swin Transformer V2: Scaling Up Capacity and Resolution." CVPR 2022. arXiv:2111.09883. Accessed 2026-05-27. Swin scaling tweaks; 3-billion-parameter variant; high-resolution support.

  8. Ryali, C., Hu, Y.-T., Bolya, D., et al. "Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles." ICML 2023 Oral. arXiv:2306.00989. Accessed 2026-05-27. Stripped-down hierarchical ViT pre-trained with MAE; 2.4× faster on images, 5.1× faster on video than MViTv2; new state-of-the-art on Kinetics-400/600/700. Reference: github.com/facebookresearch/hiera, Apache 2.0.

  9. Tong, Z., Song, Y., Wang, J., Wang, L. "VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training." NeurIPS 2022. arXiv:2203.12602. Accessed 2026-05-27. 90–95% masking ratio; ViT-Base Kinetics-400 81.5% top-1 with no external labels.

  10. Wang, L., Huang, B., Zhao, Z., et al. "VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking." CVPR 2023. arXiv:2303.16727. Accessed 2026-05-27. Billion-parameter video ViT; Kinetics-400 90.0%, Kinetics-600 89.9%, Something-Something v1 68.7%, v2 77.0%.

  11. Caron, M., Touvron, H., Misra, I., et al. "Emerging Properties in Self-Supervised Vision Transformers." ICCV 2021. arXiv:2104.14294. Accessed 2026-05-27. The DINO paper. Self-distillation without labels; the foundation of the DINO family.

  12. Oquab, M., Darcet, T., Moutakanni, T., et al. "DINOv2: Learning Robust Visual Features without Supervision." TMLR 2024. arXiv:2304.07193. Accessed 2026-05-27. 142M curated images; ImageNet linear probe 87.3%.

  13. Siméoni, O., Vo, H. V., Seitzer, M., et al. "DINOv3." Meta AI, August 2025. arXiv:2508.10104 and ai.meta.com/blog/dinov3-self-supervised-vision-model. Accessed 2026-05-27. 1.7B images, 7B-parameter ViT, Gram Anchoring; ImageNet linear probe 88.4%, PASCAL VOC 86.6 mIoU.

  14. Fang, Y., Wang, W., Xie, B., et al. "EVA-02: A Visual Representation for Neon Genesis." Image and Vision Computing 149 (2024). arXiv:2303.11331. Accessed 2026-05-27. CLIP-distilled MIM-pre-trained ViT; the backbone inside LLaVA-Next and Qwen-VL.

  15. Tschannen, M., Gritsenko, A., Wang, X., et al. "SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding." 2025. Hugging Face SigLIP 2 model card. Accessed 2026-05-27. Dual-tower SigLIP-family encoders at B/L/So400m/G scales.

  16. He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R. "Masked Autoencoders Are Scalable Vision Learners." CVPR 2022. arXiv:2111.06377. Accessed 2026-05-27. The image MAE recipe that VideoMAE and Hiera extend.