Published 2026-05-28 · 18 min read · By Nikolay Sapunov, CEO at Fora Soft

Why This Matters

The interesting thing about 2026 is not that AI got bigger. It is that small models got useful. Two years ago, depth from a single camera was a research problem; today a 25-million-parameter model distilled from a teacher trained on synthetic data outputs depth maps that, for most product purposes, are indistinguishable from what a stereo rig produces. Two years ago, a vision-language model that could describe a video clip required a multi-GPU server; today a 500-million-parameter model running inside an iPhone app does the same job. If you ship a product that touches video (surveillance, telemedicine, e-learning, OTT, video conferencing, AR), the question is no longer whether small on-device CV is feasible. It is which two or three of these primitives you stack together to answer the product question. This article is for the product manager, founder, or video engineer who needs to scope a feature that involves understanding what is inside a frame (its geometry, its content, the relationship between its objects) without paying a per-call API price or asking the user's phone to upload every frame to a server.

The Two Halves Of On-Device Vision

A useful way to think about modern small-model computer vision is that it splits into two complementary jobs. The first is measurement — turning pixels into numbers about the physical scene: where surfaces are, how far away they are, how they move. The second is understanding — turning the same pixels into natural language about the scene's content and meaning. Depth Anything and SmolVLM each own one half of that split.

Depth Anything is a measurement primitive. It takes one RGB frame and outputs a depth value for every pixel: a floating-point number that encodes how far that point in the world is from the camera. The output is dense (one number per pixel, not one number per detected object) and it is geometric (the numbers follow the shape of the scene; the default checkpoints give relative depth, and the metric variants give distances in metres). Downstream tasks consume those numbers directly: a planner uses depth to avoid obstacles; an AR app uses depth to occlude a virtual object behind a real one; a surveillance system uses depth to convert a 2D bounding box into a 3D position.

SmolVLM is an understanding primitive. It takes one or more frames plus a natural-language prompt, and it outputs text. The text might be a description of what is in the frame, an answer to a question about the frame, a structured extraction of a fact, or a classification. The model is small enough to run locally, but the interface is the same one you would use against GPT-4o — give it pixels and a prompt, get text back.

You almost never pick one of these and ignore the other. The interesting on-device products in 2026 use both: depth to handle the geometry layer, a small VLM to handle the semantics layer, and a thin orchestration layer on top that turns the combined signal into product behaviour. The rest of this article goes deep on each model and then shows how to stack them.

[Figure 1: diagram of the two halves of on-device computer vision. Left half, labelled 'Measurement': an arrow from a camera icon to Depth Anything V2, whose output is a per-pixel depth map with colour-coded distances. Right half, labelled 'Understanding': an arrow from the same camera icon and a text prompt to SmolVLM, whose output is natural-language text describing the scene. A bottom arrow shows both outputs combining into a 'product behaviour' box.]

Figure 1. The two halves of small-model on-device vision. Depth Anything outputs geometry; SmolVLM outputs language. Most production pipelines combine them.

Depth Anything V2 — What It Is And How It Works

Depth Anything V2 was released by the original Depth Anything team in 2024 and accepted as a NeurIPS 2024 paper. It is a foundation model for monocular depth estimation — a single neural network that, given any RGB image from any camera in any scene, outputs a dense depth map. The "anything" in the name is a claim about the breadth of inputs it handles: indoor, outdoor, daylight, low light, real photos, paintings, generated images. The training methodology is what makes that claim hold.

The architecture is a Vision Transformer encoder plus a Dense Prediction Transformer decoder. The encoder is reused from DINOv2, Meta's self-supervised foundation model, which provides general-purpose image features. The decoder, called DPT, aggregates features from multiple transformer layers and upsamples them into a full-resolution depth map. In the reference inference code, input images are resized so that the shorter edge is 518 pixels by default, keeping the aspect ratio and rounding both sides to multiples of the 14-pixel patch size, before going into the model.

The four sizes shipped by the team are ViT-Small (24.8 million parameters), ViT-Base (97.5 million), ViT-Large (335.3 million), and ViT-Giant (1.3 billion). On consumer GPUs, ViT-Small and ViT-Base run at over 50 frames per second; ViT-Large is the accuracy-efficiency sweet spot for offline use; ViT-Giant is the highest-accuracy teacher used to train the smaller students. Inference is about ten times faster than the previous diffusion-based depth model (Marigold) at comparable accuracy.

The training trick — what V2 added on top of V1 — is a three-step teacher-student pipeline. Step one trains the largest model (ViT-Giant) only on synthetic images with perfect ground-truth depth, because real-world depth sensors produce noisy labels and synthetic images do not. Step two uses that ViT-Giant teacher to generate pseudo-labels on 62 million unlabelled real images. Step three trains the smaller student models on those pseudo-labels. The result: V2 produces much finer-grained and more robust depth than V1, especially on transparent and reflective surfaces where the old depth-sensor labels were systematically wrong.

There is a critical licensing detail. The ViT-Small checkpoint is released under Apache 2.0, which permits commercial use. The Base, Large, and Giant checkpoints are released under CC-BY-NC-4.0, which prohibits commercial use. If you are building a commercial product, you can only ship the Small variant out of the box. The Base/Large/Giant checkpoints are usable for research, evaluation, and pre-training your own model on a commercially-licensed dataset, but you cannot embed them in a paid product. This is the single most common compliance trap teams fall into when they prototype with Large (because it is more accurate), demo it to a stakeholder, and then discover they have to rebuild with Small before shipping.

Depth Anything V2 by default produces relative depth — depth values that respect ordering and relative scale but are not in metric units. For products that need actual distances in metres (a drone deciding whether an obstacle is two metres away or twenty), the team ships metric depth variants fine-tuned on two datasets: Hypersim for indoor scenes (max depth 20 metres) and Virtual KITTI 2 for outdoor scenes (max depth 80 metres). Pick the variant that matches your deployment environment — using an outdoor model indoors gives you depth values that float around in absurd ranges.

For video, vanilla Depth Anything V2 has a known limitation: running it frame-by-frame on a video produces temporal flickering — adjacent frames get depth values that jitter even when the underlying scene is static. Two follow-up models published in 2025 address this. Video Depth Anything (CVPR 2025 Highlight) replaces the per-frame DPT head with a spatial-temporal head that constrains the depth gradient between frames and handles arbitrarily long videos consistently. FlashDepth (2025) keeps the per-frame backbone but adds a recurrent neural network that aligns depth features across frames for real-time streaming at 2K resolution. Both are extensions of V2; choose Video Depth Anything if you need to process a stored clip with consistent depth, FlashDepth if you need a live stream with low latency.

SmolVLM2 — What It Is And How It Works

SmolVLM is Hugging Face's family of small open multimodal models — vision-language models small enough to actually run on a device the user is holding. The family started in late 2024 with the original SmolVLM (2.2B parameters); in early 2025 the team released the smaller 256M and 500M variants under the SmolVLM branding; later in 2025 they released SmolVLM2, which extended all three sizes to handle video, not just static images. The 2.2B model is the strongest; the 500M is the on-device sweet spot; the 256M is for browser and severely constrained edge.

The architecture has three parts: a SigLIP vision encoder, a pixel-shuffle compression step, and a Llama-3 language decoder.

The vision encoder is SigLIP-B/16 (93 million parameters, 16-by-16 patches) for the 256M and 500M models and SigLIP-SO400M (428 million parameters, 14-by-14 patches) for the 2.2B model. The encoder turns each image into a grid of visual feature vectors — small tiles, each summarising a region of the image. The notable design choice is that the vision tower is small relative to the language tower. Bigger vision encoders quickly stop adding accuracy once the language model is also small.

The pixel-shuffle step is the trick that makes the whole thing efficient. A naive design feeds every visual feature vector into the language model as a token, and the token count blows up the attention cost: a 256-tile image becomes 256 tokens, plus the prompt tokens, plus the response tokens. Pixel-shuffle rearranges each block of neighbouring feature vectors into a single vector with more channels. With a 2-by-2 block, four vectors become one vector with four times the channels, so the visual token count drops by a factor of four and the attention cost over those tokens drops by roughly a factor of sixteen, because attention is quadratic in token count. The SmolVLM paper reports that the smallest models benefit from even more aggressive ratios. This is what makes a 500M-parameter model with a vision input runnable on a phone.
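The rearrangement can be sketched in a few lines of plain Python. This is an illustration of the idea only, using nested lists and toy sizes (a 16-by-16 grid of 8-channel vectors); the function names are made up for this example and the real model works on tensors with far larger channel counts.

```python
from typing import List

Grid = List[List[List[float]]]  # grid[row][col] -> feature vector


def pixel_shuffle(grid: Grid, r: int = 2) -> Grid:
    """Merge each r-by-r block of feature vectors into one longer vector."""
    rows, cols = len(grid), len(grid[0])
    if rows % r or cols % r:
        raise ValueError("grid height and width must be divisible by r")
    out: Grid = []
    for i in range(0, rows, r):
        out_row = []
        for j in range(0, cols, r):
            merged: List[float] = []
            # Concatenate the block's vectors in row-major order.
            for di in range(r):
                for dj in range(r):
                    merged.extend(grid[i + di][j + dj])
            out_row.append(merged)
        out.append(out_row)
    return out


def token_count(grid: Grid) -> int:
    """One token per grid cell."""
    return len(grid) * len(grid[0])


# Toy input: each cell's vector is filled with its own flat index.
features = [[[float(row * 16 + col)] * 8 for col in range(16)] for row in range(16)]
shuffled = pixel_shuffle(features, r=2)

tokens_before = token_count(features)
tokens_after = token_count(shuffled)
channels_after = len(shuffled[0][0])
# Attention over the visual tokens scales with the square of their count.
attention_ratio = (tokens_before / tokens_after) ** 2
```

No information is thrown away in this step: the same numbers reach the language model, packed into fewer, wider tokens.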

The language decoder comes from the SmolLM2 family of small Llama-3-architecture language models. It is the same architecture you would find in a small open LLM (transformer, RMSNorm, SwiGLU, rotary position embeddings), just at small parameter counts (135M for SmolVLM2-256M, 360M for the 500M, 1.7B for the 2.2B). Visual tokens and text tokens go in interleaved; the model outputs text autoregressively.

For video, SmolVLM2 samples fifty frames per clip (configurable) and treats them as an ordered sequence of images, prepending a textual marker like "Here are 50 frames sampled from a video" so the model can disambiguate "the same object in different frames" from "different objects". The 50-frame budget keeps the token count tractable even after pixel-shuffle, and benchmarks on Video-MME (the standard video understanding benchmark in 2026) show the 2.2B model competitive with much larger open models, while the 500M and 256M variants are by a wide margin the smallest video-capable VLMs ever shipped.
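A rough sketch of the sampling step, assuming uniform sampling across the clip. The helper names are hypothetical and the exact strategy inside the SmolVLM2 processor may differ; the point is only that a fixed frame budget turns a clip of any length into a bounded number of images plus a short textual marker.

```python
from typing import List


def sample_frame_indices(total_frames: int, budget: int = 50) -> List[int]:
    """Pick up to `budget` frame indices spread evenly across the clip."""
    if total_frames <= 0:
        return []
    if total_frames <= budget:
        return list(range(total_frames))
    # Take the midpoint of each of `budget` equal-length segments.
    step = total_frames / budget
    return [int(step * i + step / 2) for i in range(budget)]


def build_video_prompt(num_frames: int, question: str) -> str:
    """Prefix the question with a marker telling the model the images are video frames."""
    return f"Here are {num_frames} frames sampled from a video. {question}"


# A 30-second clip at 30 FPS has 900 frames; the budget keeps 50 of them.
indices = sample_frame_indices(total_frames=900, budget=50)
prompt = build_video_prompt(len(indices), "Describe what happens in the clip.")
```

Lowering the budget is the first knob to turn when latency or memory is tight, at the cost of missing short events between sampled frames.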

Resource numbers, because they decide what hardware you need:

  • SmolVLM2-256M inference fits in under 1 GB of GPU RAM. On a 14-inch MacBook Pro M4 Max running in the browser via WebGPU, it produces about 80 decode tokens per second. This is the "runs in a browser tab" variant.
  • SmolVLM2-500M is the iPhone-app-friendly variant. A demo Hugging Face iPhone app uses the 500M to analyse video clips entirely on-device, with no network call.
  • SmolVLM2-2.2B needs around 5.2 GB of GPU RAM for video inference. This is the strongest variant; ship it on a laptop or on a Jetson Orin, not on a phone.

All SmolVLM2 checkpoints, datasets, and training recipes are released under Apache 2.0. No carve-outs, no NC restrictions. You can embed them in a commercial product without paying anyone.

[Figure 2: diagram of the SmolVLM2 architecture. Left: the image input goes into a 'SigLIP encoder', producing a 16x16 grid of feature vectors. Middle: a '2x2 pixel-shuffle' arrow shrinks the grid to 8x8 with 4x channels. Top: the text prompt goes into the same merged stream. Right: a 'Llama-3 small language decoder' produces text output token by token.]

Figure 2. SmolVLM2 architecture. The SigLIP encoder turns an image into feature tiles; pixel-shuffle compresses the tile count by 4×; the small Llama-3 decoder produces text autoregressively. Pixel-shuffle is the trick that makes a 500M model phone-sized.

Where Each Model Belongs — A Decision Frame

The two models solve different problems, and the most common engineering mistake is to reach for the wrong one. A useful decision rule:

Use Depth Anything when you need a number per pixel. Examples: an AR app deciding whether to draw a virtual object in front of or behind a real one; a surveillance pipeline lifting a 2D detection into a 3D position; a robot deciding whether to move; a video conferencing app generating background blur that respects the actual physical depth of the scene; a video editing tool extracting a foreground subject for compositing. In all of these cases, the output is geometric and dense, and a VLM cannot produce it — a VLM speaks text, not depth maps.

Use SmolVLM when you need a sentence about the frame. Examples: a moderation pipeline asking "does this clip contain a person holding a weapon?"; an e-learning analytics tool asking "what is the student writing on the whiteboard right now?"; a surveillance dashboard asking "describe what just happened in this five-second clip"; an OTT search tool asking "what kind of scene is this — sports, drama, news?". In all of these cases, the output is text, the question changes from product to product, and you do not want to train a separate classifier for every question.

Use both when the product question is "what is this thing and where is it in the scene?" Stack a depth model and a VLM in parallel on the same frame; merge the outputs in your orchestration layer. The VLM tells you what is interesting; the depth model tells you how far away it is and how big.

A common pitfall: teams try to ask the VLM a geometric question — "How far away is the person in this frame?" — and then act surprised when the answer is wrong. SmolVLM and every other small VLM is unreliable for absolute geometric questions. It might be right by accident in one frame and wrong by an order of magnitude in the next. The model was not trained on metric depth labels; it was trained on captions and conversations. If you need numbers, use a model that outputs numbers. If you need words, use a model that outputs words.

| Question type | Right primitive | Output | Why |
| --- | --- | --- | --- |
| How far is every pixel from the camera? | Depth Anything V2 (metric variant) | Dense per-pixel depth map in metres | Geometric task; specialised model |
| Is there a person in this frame? | A small detector (YOLO-N) or SmolVLM | Bounding box or text | Detector is faster; VLM is more flexible |
| What is the person doing? | SmolVLM | Natural-language description | Open-ended; cannot pre-define a class list |
| Is the camera moving? | Optical flow (RAFT, Lucas-Kanade) | Per-pixel motion field | Geometric task |
| Should I alert the operator? | SmolVLM with a structured-output prompt | "alert: true/false, reason: ..." | Composes the other signals into a decision |
| Where exactly is the moving object in 3D? | Depth + detector + simple geometry | (x, y, z) per object | Combine geometry + identity |

The last row is the canonical small-on-device CV pipeline. A detector finds objects; depth gives them 3D position; a VLM (optionally) adds semantic context. Each primitive does the job it is good at.
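The "simple geometry" in that last row is usually a pinhole back-projection. The sketch below assumes a calibrated camera with known intrinsics and a metric depth value measured along the optical axis; the intrinsics, box coordinates, and depth used here are made-up numbers for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Intrinsics:
    fx: float  # focal length in pixels, x axis
    fy: float  # focal length in pixels, y axis
    cx: float  # principal point, x
    cy: float  # principal point, y


def bbox_centre(box: Tuple[float, float, float, float]) -> Tuple[float, float]:
    """Centre pixel of an (x0, y0, x1, y1) bounding box."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def backproject(u: float, v: float, depth_m: float, k: Intrinsics) -> Tuple[float, float, float]:
    """Pinhole model: pixel (u, v) at z-depth `depth_m` -> camera-frame (x, y, z) in metres."""
    x = (u - k.cx) * depth_m / k.fx
    y = (v - k.cy) * depth_m / k.fy
    return (x, y, depth_m)


# Hypothetical 640x480 camera and one detection, 3 m away according to the depth map.
camera = Intrinsics(fx=600.0, fy=600.0, cx=320.0, cy=240.0)
u, v = bbox_centre((380.0, 140.0, 460.0, 340.0))
position = backproject(u, v, depth_m=3.0, k=camera)
```

With these numbers the object sits half a metre to the right of the optical axis, level with it, three metres out. Without intrinsics you can still gate on distance alone, which is often enough for zone checks.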

A Concrete Stack — Depth + SmolVLM Together

Here is a worked example of a pipeline a real product might ship. Imagine an indoor security camera. The product question is: "alert me when someone walks into the kitchen at night carrying anything that looks like a tool." The naive way to answer this with a single big API call is to pipe every frame to GPT-4o and pay per call. The on-device way uses three primitives, in series, each tuned to its job.

Stage one: motion gate. A cheap gate (classical background subtraction, or a tiny detector such as YOLO-N) runs at 30 FPS and decides whether enough of the frame has changed to warrant looking at it. Cost: a few milliseconds per frame on the camera's NPU. Result: 99 percent of the input frames are filtered out before any heavier neural model runs.
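A minimal sketch of such a gate, using plain frame differencing on grayscale frames rather than a full background-subtraction model. The thresholds and function names are assumptions chosen for illustration; a real deployment would tune them per camera.

```python
from typing import List

Frame = List[List[int]]  # grayscale image, values 0..255


def changed_fraction(prev: Frame, curr: Frame, pixel_threshold: int = 25) -> float:
    """Share of pixels whose brightness changed by more than `pixel_threshold`."""
    changed = 0
    total = 0
    for prev_row, curr_row in zip(prev, curr):
        for p, c in zip(prev_row, curr_row):
            total += 1
            if abs(c - p) > pixel_threshold:
                changed += 1
    return changed / total if total else 0.0


def motion_gate(prev: Frame, curr: Frame,
                pixel_threshold: int = 25, min_fraction: float = 0.01) -> bool:
    """True when enough of the frame changed to justify running the heavier stages."""
    return changed_fraction(prev, curr, pixel_threshold) >= min_fraction


# Toy 10x10 frames: a static background, sensor noise, and a small moving blob.
background = [[50] * 10 for _ in range(10)]
noisy = [[52] * 10 for _ in range(10)]
moved = [row[:] for row in background]
for r in (4, 5):
    for c in (4, 5):
        moved[r][c] = 200

keep_noise = motion_gate(background, noisy)
keep_motion = motion_gate(background, moved)
```

The per-pixel threshold absorbs sensor noise and the fraction threshold absorbs isolated flickering pixels, so only frames with a coherent change reach stage two.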

Stage two: depth + detection on the kept frames. On each kept frame, run a small person detector and Depth Anything V2 Small in parallel. The detector outputs a bounding box; the depth model outputs a depth map. Use the indoor metric variant here, because the default relative-depth output is not in metres. Combine the two by reading the depth value at the centre of the bounding box, which gives you the person's distance from the camera. If the distance places the person inside the geometric volume you previously labelled "kitchen", proceed to stage three. Otherwise, drop the frame. Cost on a Jetson Orin Nano: about 50 ms per frame combined.
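One way the combination step might look, as a sketch. It assumes the depth map is already in metres, takes the median over a small window around the box centre instead of a single pixel (a single pixel can land on a depth outlier), and simplifies the "kitchen" volume to a distance band. All names and numbers are illustrative.

```python
from statistics import median
from typing import List, Tuple

DepthMap = List[List[float]]     # metres, depth[row][col]
Box = Tuple[int, int, int, int]  # x0, y0, x1, y1 in pixels


def depth_at_box(depth: DepthMap, box: Box, patch: int = 1) -> float:
    """Median depth over a (2*patch+1)-pixel-wide window around the box centre."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) // 2, (y0 + y1) // 2
    height, width = len(depth), len(depth[0])
    values = [
        depth[r][c]
        for r in range(max(0, cy - patch), min(height, cy + patch + 1))
        for c in range(max(0, cx - patch), min(width, cx + patch + 1))
    ]
    return median(values)


def in_zone(distance_m: float, near_m: float, far_m: float) -> bool:
    """Simplified zone test: is the distance inside the labelled band?"""
    return near_m <= distance_m <= far_m


# Toy 6x8 depth map: wall at 5 m, a person at 2.5 m, one bad pixel at the centre.
depth = [[5.0] * 8 for _ in range(6)]
for r in range(1, 5):
    for c in range(2, 6):
        depth[r][c] = 2.5
depth[3][4] = 9.0  # outlier exactly where a single-pixel lookup would read

person_box = (2, 1, 6, 5)
distance = depth_at_box(depth, person_box)
proceed = in_zone(distance, near_m=2.0, far_m=4.0)
```

A full implementation would test a 3D position against a labelled volume; the distance band is the cheapest version of that check and is often sufficient for a fixed camera.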

Stage three — SmolVLM2 semantic check. On the rare frames that survive stage two (someone is actually in the kitchen), sample a five-frame clip and feed it to SmolVLM2-500M with the prompt: "Does the person in this video clip appear to be holding any kind of tool or implement? Answer with 'yes' or 'no' and a one-sentence reason." The model runs locally on the same Jetson and returns a structured answer in under a second. If it says yes, fire the alert.
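Small models do not always follow the requested format exactly, so it helps to parse the reply defensively. The sketch below is one possible parser for the yes/no-plus-reason prompt above; the class and function names are invented for this example, and it treats any reply that does not start with "yes" or "no" as unparsed rather than guessing.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToolCheck:
    holding_tool: Optional[bool]  # None when the reply could not be parsed
    reason: str


# Leading punctuation, then "yes" or "no" as a whole word, then the rest as the reason.
_ANSWER = re.compile(r"^\W*(yes|no)\b[\s,.:;-]*(.*)$", re.IGNORECASE | re.DOTALL)


def parse_tool_check(raw: str) -> ToolCheck:
    """Turn the model's free-text reply into a verdict plus a reason."""
    text = raw.strip()
    match = _ANSWER.match(text)
    if match is None:
        return ToolCheck(holding_tool=None, reason=text)
    verdict = match.group(1).lower() == "yes"
    return ToolCheck(holding_tool=verdict, reason=match.group(2).strip())


yes_reply = parse_tool_check("Yes. The person is carrying a hammer.")
no_reply = parse_tool_check("no, their hands are empty")
off_format = parse_tool_check("The image shows a kitchen at night.")
```

Whether an unparsed reply should raise an alert, be dropped, or be retried is a product decision; keeping it as a distinct `None` state makes that choice explicit.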

The whole pipeline keeps pace with a 30 FPS camera feed on a sub-$500 edge device, because only the motion gate touches every frame and the heavier stages see the roughly 1 percent that survive it. It costs nothing per inference because no network call leaves the device, and it respects the household's privacy because the video never touches a cloud server. The same pattern adapts to telemedicine ("is the patient still in frame and conscious?"), to e-learning ("is the student looking at the screen?"), to OTT moderation ("is this clip safe to play to a child user?"). The architecture is the same; only the prompts and the geometric thresholds change.
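The budget arithmetic behind this is worth spelling out, using the figures quoted in the stages above (30 FPS input, about 1 percent of frames kept, about 50 ms for stage two). The helper name is made up for this example, and the result is an average load that ignores bursts of motion.

```python
def stage_load(fps: float, keep_fraction: float, stage_ms: float) -> float:
    """Fraction of each wall-clock second a stage is busy, given how many frames reach it."""
    kept_per_second = fps * keep_fraction
    return kept_per_second * stage_ms / 1000.0


# Stage two behind the motion gate: about 0.3 frames per second reach it.
gated_load = stage_load(fps=30, keep_fraction=0.01, stage_ms=50)

# The same stage run on every frame would need more than one second of
# compute per second of video, so it could not keep up.
ungated_load = stage_load(fps=30, keep_fraction=1.0, stage_ms=50)
```

Behind the gate, stage two is busy for roughly 1.5 percent of the time; without the gate it would need 150 percent. During sustained motion the keep fraction rises sharply, so a queue with a drop policy is a sensible safeguard.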

[Figure 3: diagram of a three-stage on-device pipeline. Stage 1: a 'Motion gate' box accepts only 1% of frames. Stage 2: 'Depth Anything + person detector' run in parallel on kept frames, combining into a 3D position. Stage 3: a 'SmolVLM2 prompt' runs on a five-frame clip when the 3D position is inside a target zone, outputting a structured yes/no. Final arrow: 'Alert' or 'Drop'.]

Figure 3. A three-stage on-device pipeline. A cheap motion gate filters 99% of frames; depth + detection geometrically gates the rest; SmolVLM2 only fires on the small subset that matter. Every step runs locally.

Production Failure Modes And Engineering Fixes

Three failure modes turn up reliably in real deployments. Plan for them on day one.

Failure 1 — flicker in the depth track. Running Depth Anything V2 frame-by-frame on a video produces depth values that wobble even on a static scene. For products where the downstream consumer is just looking at a per-frame number (a single distance check, a coarse occlusion test), the wobble is tolerable. For products that draw the depth map directly to the user (any AR application; any visualisation), the wobble looks terrible. Fix: switch to Video Depth Anything (CVPR 2025) for stored video, or FlashDepth (2025) for live streams. Both are V2 extensions with a temporal-consistency module. Expect a roughly two- to three-times higher per-frame compute cost compared to vanilla V2; in return, the output is stable.
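Before paying for a temporal model, it can help to measure how bad the wobble is on your own footage. The sketch below computes a simple flicker score on a static scene and applies an exponential moving average as a cheap stopgap. To be clear, the moving average is a swapped-in technique for illustration, not what Video Depth Anything or FlashDepth do internally, and it adds lag on moving content.

```python
from typing import List

DepthMap = List[List[float]]


def flicker(prev: DepthMap, curr: DepthMap) -> float:
    """Mean absolute per-pixel change between two consecutive depth maps."""
    diffs = [abs(c - p) for pr, cr in zip(prev, curr) for p, c in zip(pr, cr)]
    return sum(diffs) / len(diffs)


def ema_smooth(frames: List[DepthMap], alpha: float = 0.3) -> List[DepthMap]:
    """Exponential moving average over time; lower alpha is smoother but laggier."""
    smoothed = [frames[0]]
    for frame in frames[1:]:
        last = smoothed[-1]
        smoothed.append([
            [alpha * c + (1 - alpha) * p for p, c in zip(pr, cr)]
            for pr, cr in zip(last, frame)
        ])
    return smoothed


# Toy static scene at 2 m whose per-frame estimate alternates by +/- 0.1 m.
frames = [
    [[2.0 + (0.1 if t % 2 else -0.1)] * 4 for _ in range(3)]
    for t in range(6)
]
smoothed = ema_smooth(frames, alpha=0.3)

raw_flicker = flicker(frames[-2], frames[-1])
smoothed_flicker = flicker(smoothed[-2], smoothed[-1])
```

If the raw score is already below what your downstream check can tolerate, per-frame V2 may be fine; if the depth map is shown to the user, the dedicated video models are the better fix.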

Failure 2: wrong metric variant. The metric depth variants are fine-tuned on either Hypersim (indoor, 20 m) or Virtual KITTI 2 (outdoor, 80 m). Using the outdoor variant on indoor footage gives you depth values floating in absurd ranges; using the indoor variant on outdoor footage clamps everything past 20 m to the same value. Fix: choose the variant per camera in deployment configuration; no retraining is involved. A surveillance camera that sometimes points indoors and sometimes outdoors should ship both variants and route between them at runtime, using a scene classification made once per camera position at setup.
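A sketch of what config-time routing plus a runtime sanity check could look like. The variant labels are hypothetical placeholders, not official checkpoint identifiers; the depth ceilings are the ones quoted above. The saturation check is an assumption about how a mismatch tends to show up: a large share of pixels pinned at the variant's maximum depth.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class MetricVariant:
    name: str           # hypothetical label, not an official checkpoint id
    training_set: str
    max_depth_m: float


VARIANTS: Dict[str, MetricVariant] = {
    "indoor": MetricVariant("metric-indoor", "Hypersim", 20.0),
    "outdoor": MetricVariant("metric-outdoor", "Virtual KITTI 2", 80.0),
}


def pick_variant(scene: str) -> MetricVariant:
    """Look up the variant for a scene type chosen at camera setup."""
    try:
        return VARIANTS[scene]
    except KeyError:
        raise ValueError(f"unknown scene type: {scene!r}") from None


def saturated_fraction(depths: List[float], variant: MetricVariant,
                       margin: float = 0.98) -> float:
    """Share of depth values at or near the variant's ceiling."""
    ceiling = variant.max_depth_m * margin
    return sum(d >= ceiling for d in depths) / len(depths)


# Indoor variant pointed at an outdoor scene: most far pixels pile up near 20 m.
indoor = pick_variant("indoor")
sample = [3.0, 12.0, 20.0, 20.0, 19.9, 20.0]
pinned = saturated_fraction(sample, indoor)
suspect_mismatch = pinned > 0.5
```

A persistently high saturated fraction is a cheap signal to log or to use as a trigger for re-running the setup classification.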

Failure 3 — SmolVLM hallucination on out-of-distribution inputs. SmolVLM2 was trained on a lot of data, but its 500M and 256M variants are still small models. When you ask them a question about content they did not see in training, the answers can be confidently wrong. Fix: constrain the output with structured prompts and a guard. Ask "yes or no" questions, not open-ended ones. Pair every SmolVLM call with a second pass that re-asks the same question on a different sampled frame and only acts when the two agree. For high-stakes decisions, route the survivors to a bigger model on the server — the on-device VLM is the cheap filter, not the final adjudicator.
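The agreement guard is simple to express in code. In this sketch, `ask` stands for whatever function wraps your local SmolVLM2 call and returns a parsed yes/no (or `None` when the reply could not be parsed); that wrapper, the clip identifiers, and the action names are all hypothetical. A canned stand-in is used here so the example runs without a model.

```python
from typing import Callable, Optional

# (clip identifier, prompt) -> True / False / None when unparsed
Ask = Callable[[str, str], Optional[bool]]


def agreed_answer(ask: Ask, clip_a: str, clip_b: str, prompt: str) -> Optional[bool]:
    """Ask the same question on two different samples; return a verdict only if both agree."""
    first = ask(clip_a, prompt)
    second = ask(clip_b, prompt)
    if first is None or second is None or first != second:
        return None
    return first


def next_action(answer: Optional[bool], high_stakes: bool = False) -> str:
    """Act only on an agreed 'yes'; high-stakes cases go to a bigger server-side model."""
    if answer is not True:
        return "drop"
    return "escalate_to_server" if high_stakes else "alert"


def fake_ask(clip: str, prompt: str) -> Optional[bool]:
    """Stand-in for the on-device model, with canned answers per clip."""
    canned = {"clip-a": True, "clip-b": True, "clip-c": False}
    return canned.get(clip)


PROMPT = "Is the person holding a tool? Answer yes or no."
both_yes = agreed_answer(fake_ask, "clip-a", "clip-b", PROMPT)
split = agreed_answer(fake_ask, "clip-a", "clip-c", PROMPT)
```

The guard doubles the VLM cost on the frames that reach it, which is affordable precisely because the earlier stages let so few frames through.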

Hardware And Deployment Reality

Where does each model actually run in production? A quick reference.

| Hardware | Depth Anything V2 Small | SmolVLM2-256M | SmolVLM2-500M | SmolVLM2-2.2B |
| --- | --- | --- | --- | --- |
| Modern phone (iPhone 15 Pro / Pixel 9) | 30+ FPS via Core ML or NNAPI | Yes, demo apps exist | Yes, official HF demo | Marginal: possible but warm |
| Browser via WebGPU | 10–15 FPS on a MacBook | 80 decode tokens/sec on M4 Max | Yes, slow | No |
| Jetson Orin Nano | 50+ FPS via TensorRT | Yes, comfortably | Yes, comfortably | Yes |
| Jetson Orin AGX | 100+ FPS via TensorRT | Trivial | Trivial | Yes, well within budget |
| Consumer GPU (RTX 4070) | 100+ FPS | Trivial | Trivial | Trivial |

The conversion pattern is well-established. Export the PyTorch checkpoint to ONNX, then convert to TensorRT on NVIDIA, Core ML on Apple, or NNAPI / Qualcomm AI Hub on Android. Quantise from FP32 to FP16 or INT8 to halve or quarter the memory and roughly double the throughput on supported hardware. For Depth Anything V2 specifically, the small variant exports cleanly to ONNX; the larger variants need a couple of operator workarounds at export time, but the community has shared working scripts.

A practical 2026 rule for picking compute: the device runs the pipeline if it can hold the largest weight file you need in RAM at FP16 and still have memory for an OS, a video buffer, and other apps. For Depth Anything V2 Small that is about 50 MB of weights; for SmolVLM2-500M that is about 1 GB; the practical floor for combined use is a device with 4 GB of RAM and a hardware AI accelerator. Everything from an iPhone 13 onward and from a mid-range Android forward clears that bar.
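The rule reduces to a few multiplications. The sketch below uses the parameter counts quoted earlier in the article and nominal bytes per parameter; the 2,500 MB reserved for the OS, video buffers, and other apps is an assumed figure for illustration, and the estimate covers weights only, not activations or the VLM's key-value cache.

```python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0}


def weight_megabytes(params: float, precision: str = "fp16") -> float:
    """Approximate size of the weight file for a given parameter count and precision."""
    return params * BYTES_PER_PARAM[precision] / 1e6


def fits_in_ram(weights_mb: float, device_ram_mb: float, reserved_mb: float) -> bool:
    """True when the weights fit alongside the memory reserved for everything else."""
    return weights_mb + reserved_mb <= device_ram_mb


depth_small_mb = weight_megabytes(24.8e6, "fp16")  # Depth Anything V2 Small
vlm_500m_mb = weight_megabytes(500e6, "fp16")      # SmolVLM2-500M, nominal count
vlm_2b_mb = weight_megabytes(2.2e9, "fp16")        # SmolVLM2-2.2B, nominal count

# A 4 GB device with an assumed 2,500 MB reserved for the OS, buffers, and apps.
phone_ok = fits_in_ram(depth_small_mb + vlm_500m_mb, device_ram_mb=4096, reserved_mb=2500)
phone_2b_ok = fits_in_ram(depth_small_mb + vlm_2b_mb, device_ram_mb=4096, reserved_mb=2500)
```

Under these assumptions the Small depth model plus the 500M VLM clears a 4 GB device, while the 2.2B VLM does not, which matches the guidance to keep the largest variant on a laptop or a Jetson.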

Where Fora Soft Fits In

We have shipped on-device CV inside surveillance, e-learning, telemedicine, OTT, and conferencing products since the original MobileNet days, and the move to Depth Anything + small VLMs has changed which questions are scopeable inside which budgets. Three patterns we have used in 2026 client work: depth-aware background blur and replacement that respects real scene geometry instead of the flat 2D segmentation that older video conferencing apps still ship; on-device privacy redaction in e-learning replays, where a small VLM tags frames containing student faces or hand-written content and the redaction runs locally before anything reaches our infrastructure; and edge surveillance pipelines for retail and industrial sites, where the camera does the heavy lifting and the cloud only sees alerts. If you are scoping a feature that lives in one of these patterns, the build-or-buy answer in 2026 is usually "build, because the open primitives are now strong enough."

References

  1. Yang, L.; Kang, B.; Huang, Z.; et al. Depth Anything V2. NeurIPS 2024. arXiv:2406.09414. https://arxiv.org/abs/2406.09414
  2. DepthAnything/Depth-Anything-V2 GitHub repository. https://github.com/DepthAnything/Depth-Anything-V2
  3. Depth Anything V2 project page (model cards, sizes, FPS benchmarks). https://depth-anything-v2.github.io/
  4. Depth-Anything-V2-Small model card (Apache 2.0 license confirmation). https://huggingface.co/depth-anything/Depth-Anything-V2-Small
  5. Depth-Anything-V2-Metric-Hypersim model card (indoor metric variant, 20 m max). https://huggingface.co/depth-anything/Depth-Anything-V2-Metric-Hypersim-Base
  6. Depth-Anything-V2-Metric-VKITTI model card (outdoor metric variant, 80 m max). https://huggingface.co/depth-anything/Depth-Anything-V2-Metric-VKITTI-Large
  7. Chen, S.; Yang, L.; Hu, X.; et al. Video Depth Anything: Consistent Depth Estimation for Super-Long Videos. CVPR 2025 Highlight. arXiv:2501.12375. https://arxiv.org/abs/2501.12375
  8. FlashDepth: Real-time Streaming Video Depth Estimation at 2K Resolution. arXiv:2504.07093. https://arxiv.org/html/2504.07093v1
  9. Marafioti, A.; Zohar, O.; Farré, M.; et al. SmolVLM: Redefining small and efficient multimodal models. arXiv:2504.05299, 2025. https://arxiv.org/pdf/2504.05299
  10. Hugging Face blog. SmolVLM2: Bringing Video Understanding to Every Device. https://huggingface.co/blog/smolvlm2
  11. Hugging Face blog. SmolVLM Grows Smaller — Introducing the 256M & 500M Models. https://huggingface.co/blog/smolervlm
  12. HuggingFaceTB/SmolVLM2-2.2B-Instruct model card. https://huggingface.co/HuggingFaceTB/SmolVLM2-2.2B-Instruct
  13. huggingface/smollm GitHub repository (SmolLM + SmolVLM family). https://github.com/huggingface/smollm
  14. SigLIP: Sigmoid Loss for Language Image Pre-Training (vision encoder used in SmolVLM). https://arxiv.org/abs/2303.15343
  15. Oquab, M.; Darcet, T.; Moutakanni, T.; et al. DINOv2: Learning Robust Visual Features without Supervision (encoder used in Depth Anything). arXiv:2304.07193. https://arxiv.org/abs/2304.07193
  16. facebookresearch/dinov2 GitHub repository (official DINOv2 code). https://github.com/facebookresearch/dinov2