Published 2026-05-31 · 21 min read · By Nikolay Sapunov, CEO at Fora Soft

Why This Matters

Every multimodal feature you are likely to ship in a video product — "find the clip where someone holds up a red sign", "auto-tag this surveillance footage", "let users search our archive in plain English", "describe what is happening on screen" — rests on a vision-language model underneath. If you do not understand the one idea at its core, the rest of Phase 4 reads like magic, and you cannot reason about why a feature is cheap or expensive, fast or slow, accurate or brittle. This is the foundational lesson for the entire multimodal block. A product manager who finishes it can sit in an architecture review and follow why the team keeps saying "embeddings", "shared space", and "zero-shot" — and can tell the difference between a problem CLIP solves for nearly free and a problem that needs a heavyweight model from the closed-frontier lesson.

The One Idea — Pictures And Words In The Same Space

Start with a mental picture. Imagine a vast room, and every point in that room is a position described by a long list of coordinates — not the three you would use for a physical room, but several hundred. Now imagine that we can take any image and place it at one specific point in that room, and take any sentence and place it at a point in the same room. The whole game of a vision-language model is this: put the image and the words that describe it at nearby points, and put the image and unrelated words at far-apart points.

The list of coordinates for one item is called an embedding: in plain language, a fixed-length summary of something, written as a row of numbers. A picture of a golden retriever on a beach might become a row of 768 numbers. The sentence "a dog on the sand" becomes its own row of 768 numbers. If the model is good, those two rows are close. The sentence "a spreadsheet of quarterly revenue" becomes a row that sits far away. Closeness is measured by a simple operation called a dot product: you multiply the two rows position by position and add the results. CLIP first scales every row to the same length, so the result depends only on direction (engineers call this cosine similarity); a big number means "pointing the same direction", which we read as "similar".
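The dot product can be sketched in a few lines of plain Python. The four-number rows below are made-up stand-ins for real 768-number embeddings, and the helper names `dot` and `normalize` are illustrative, not part of any CLIP library.

```python
import math

def dot(a, b):
    """Multiply two equal-length rows position by position and add the results."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a row to unit length so the dot product equals cosine similarity."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

# Toy 4-number embeddings (illustrative values, not real model output).
dog_photo = normalize([0.9, 0.1, 0.3, 0.0])
dog_caption = normalize([0.8, 0.2, 0.4, 0.1])
spreadsheet_caption = normalize([0.0, 0.9, -0.2, 0.7])

sim_match = dot(dog_photo, dog_caption)
sim_mismatch = dot(dog_photo, spreadsheet_caption)
print(f"match: {sim_match:.3f}, mismatch: {sim_mismatch:.3f}")
```

With unit-length rows the score always falls between -1 and 1, which makes scores from different images comparable.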

That shared room has a name engineers use constantly: the shared embedding space, or joint embedding space. "Joint" because both pictures and words live in it together. Once you have it, a surprising number of features become almost free. Want to find every image in your archive that matches "a person wearing a hard hat"? Turn the sentence into its embedding, then find the image embeddings closest to it. Want to label a frame as one of fifty categories without training a classifier? Turn each category name into an embedding and pick the closest. None of these required you to train anything new. That is the payoff, and the rest of this article is about how the room gets built.
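As a rough sketch of the archive search just described, the snippet below ranks a handful of image embeddings against one query embedding. Everything concrete in it is an assumption for illustration: the file names, the three-number embeddings, and the `search` helper. In a real system the rows would come from the image and text encoders, and at archive scale a vector index would usually replace the linear scan.

```python
def dot(a, b):
    """Dot-product similarity between two equal-length rows."""
    return sum(x * y for x, y in zip(a, b))

def search(query_embedding, image_index, top_k=2):
    """Return the top_k (score, image_id) pairs, highest similarity first."""
    scored = [(dot(query_embedding, emb), image_id)
              for image_id, emb in image_index.items()]
    scored.sort(reverse=True)
    return scored[:top_k]

# Hypothetical precomputed, unit-length image embeddings keyed by file name.
image_index = {
    "site_001.jpg": [0.8, 0.6, 0.0],
    "office_014.jpg": [0.0, 0.6, 0.8],
    "site_027.jpg": [0.6, 0.8, 0.0],
}

# Stand-in for the text encoder's output for "a person wearing a hard hat".
query = [1.0, 0.0, 0.0]

results = search(query, image_index)
for score, image_id in results:
    print(f"{image_id}: {score:.2f}")
```

The image embeddings are computed once when the archive is indexed; each new query costs one pass through the text encoder plus the comparisons.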

How CLIP Is Built — Two Encoders, One Goal

CLIP is two neural networks trained side by side. The first is the image encoder, a network that reads an image and outputs its embedding. In the most-used CLIP models this is a Vision Transformer, the architecture that chops an image into a grid of small patches and processes them like words in a sentence; we covered it end to end in the Vision Transformer primer. The second is the text encoder, a Transformer from the same family of network that powers large language models, which reads a string of words and outputs its embedding. Crucially, both encoders are built so their outputs are the same length. In OpenAI's ViT-based CLIP models, that length is 512 or 768 numbers depending on the model size. The lengths have to match, because you cannot compute a dot product between a row of 512 numbers and a row of 768.

The two encoders start out knowing nothing. The image encoder produces random embeddings; the text encoder produces random embeddings; nothing is aligned. Training is the process that slowly twists both networks until matching image-text pairs line up. And the only ingredient training needs is data — specifically, a very large pile of images that each came with a caption.

OpenAI assembled exactly that. The dataset, which they called WebImageText (WIT), contained 400 million image-caption pairs scraped from the public internet. They built it by listing 500,000 text queries — drawn from common words in English Wikipedia, high-information word pairs, Wikipedia article titles above a search-volume threshold, and a lexical database called WordNet — and collecting up to 20,000 image-text pairs per query. No human sat and labelled these. The "label" for each image was simply the caption that already accompanied it on the web. That is the second quiet revolution in CLIP: it learns from natural language supervision, which is abundant and free, instead of from hand-curated category labels, which are scarce and expensive.

[Diagram: the CLIP dual-encoder architecture. On the left, an image flows into an image encoder (Vision Transformer) and becomes a 768-number image embedding. On the right, the caption 'a dog on the sand' flows into a text encoder (Transformer) and becomes a 768-number text embedding. Both embeddings drop into a shared embedding space shown as a coordinate field, where the matching pair sits close together and a mismatched caption sits far away.]

Figure 1. CLIP is two encoders pointed at one shared space. The image encoder and text encoder are trained until matching pairs land close and mismatched pairs land far apart.

The Contrastive Trick — Learning From What Doesn't Match

The word in the middle of "Contrastive Language-Image Pre-training" is the one that does the work. Contrastive means the model learns by contrast — by being shown what matches alongside what does not, and being pushed to tell them apart.

Here is the mechanism, in the smallest steps. During training, the model is handed a batch: a group of image-caption pairs processed at once, say 256 of them, or in the largest runs, 32,768. Call the number of pairs in the batch N. The model computes the embedding of every image and every caption. Then it builds a grid: N images down the side, N captions across the top, and in each cell the dot-product similarity between that image and that caption. The grid has N × N cells.

In that grid, exactly N cells are correct matches — the ones on the diagonal, where image number 3 meets caption number 3, because they came from the same web page. Every other cell is a mismatch: image 3 against caption 7, image 12 against caption 2, and so on. The training objective is brutally simple to state: make the diagonal cells score high and every off-diagonal cell score low. Pull the right pairs together; push every wrong pair apart. Run that on 400 million pairs, again and again, and the two encoders are forced to discover what actually connects pictures to words.

This is why the mismatches matter as much as the matches. A model that only ever saw correct pairs could cheat by making everything look similar. The off-diagonal "negatives" are what force it to be discriminating. In one batch of 256, every correct pair is contrasted against 255 wrong ones; in a batch of 32,768, against 32,767 wrong ones. More negatives per step generally means a sharper model, which is why the size of the training batch turns out to be a real lever, and why later work spent so much effort making bigger batches affordable.

The loss function that scores all this has a textbook name — it is a symmetric cross-entropy loss over the similarity grid, sometimes called the InfoNCE or multi-class N-pair loss. "Symmetric" because it is applied twice: once reading across each row (for each image, which caption matches?) and once down each column (for each caption, which image matches?). There is also one learned knob worth naming, the temperature — a single number the model adjusts during training that controls how sharply it separates close from far. Low temperature makes the model decisive; high temperature makes it cautious. In CLIP the temperature is not hand-set; the model learns it.
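The grid and the symmetric loss can be sketched in plain Python. This is a minimal illustration under stated assumptions, not OpenAI's training code: the two-number embeddings are toy values, the function names are hypothetical, and the temperature is held fixed at 0.07 (the value the CLIP paper uses as a starting point) instead of being learned.

```python
import math

def dot(a, b):
    """Dot-product similarity between two equal-length rows."""
    return sum(x * y for x, y in zip(a, b))

def cross_entropy_row(logits, target):
    """Softmax cross-entropy for one row of scores: -log p(target)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]

def clip_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over the N x N similarity grid."""
    n = len(image_embs)
    # grid[i][j] = similarity of image i and caption j, divided by temperature
    grid = [[dot(img, txt) / temperature for txt in text_embs]
            for img in image_embs]
    # Rows: for each image, which caption matches? The target is the diagonal.
    loss_images = sum(cross_entropy_row(grid[i], i) for i in range(n)) / n
    # Columns: for each caption, which image matches?
    columns = [[grid[i][j] for i in range(n)] for j in range(n)]
    loss_texts = sum(cross_entropy_row(columns[j], j) for j in range(n)) / n
    return (loss_images + loss_texts) / 2

# Toy batch of N = 2 pairs (illustrative unit-length embeddings).
images = [[1.0, 0.0], [0.0, 1.0]]
matching_texts = [[1.0, 0.0], [0.0, 1.0]]
shuffled_texts = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = clip_loss(images, matching_texts)
loss_shuffled = clip_loss(images, shuffled_texts)
print(f"aligned: {loss_aligned:.6f}, shuffled: {loss_shuffled:.2f}")
```

The loss is near zero when the diagonal cells score highest and large when the captions are shuffled. A smaller temperature divides the scores by a smaller number, which widens the gap between close and far pairs.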

[Diagram: the CLIP contrastive training grid. A 5-by-5 matrix with image thumbnails down the left and captions across the top. The five diagonal cells are highlighted green and labelled 'matching pairs: pull together'. The off-diagonal cells are pale orange and labelled 'mismatched pairs: push apart'. Below the grid, a one-line note explains that the model maximizes diagonal similarity and minimizes everything else.]

Figure 2. The contrastive grid. For a batch of N pairs the model scores all N×N image-caption combinations and is trained to light up only the diagonal.

Zero-Shot Classification — The Feature That Made CLIP Famous

The single result that made CLIP a landmark is zero-shot classification — labelling an image with a category the model was never specifically trained to recognise, using nothing but the category's name.

Walk through how it works, because it is more clever than it sounds and it is the template for half of what you will build. Suppose you want to sort frames into "construction site", "office", and "warehouse". A traditional image classifier would need thousands of labelled examples of each and a training run. CLIP needs none. You write three short sentences — "a photo of a construction site", "a photo of an office", "a photo of a warehouse" — and run each through the text encoder to get three text embeddings. Then you run your frame through the image encoder to get one image embedding. You compute the dot product of the image embedding against each of the three text embeddings, and you pick the highest. That is the predicted label. The phrase "a photo of a {class}" is not decoration; OpenAI found that wrapping bare class names in a short template like this measurably improves accuracy, a tactic they called prompt engineering for vision before the term was common.
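The walkthrough above maps onto a short sketch. The text encoder is replaced by a lookup table of made-up three-number embeddings, so `fake_encode_text`, `FAKE_TEXT_EMBEDDINGS`, and `frame_embedding` are all hypothetical stand-ins (toy values, not normalised). The scale of 100 applied before the softmax is an assumption borrowed from common CLIP usage examples; it only affects how confident the probabilities look, not which label wins.

```python
import math

def dot(a, b):
    """Dot-product similarity between two equal-length rows."""
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def make_prompt(class_name):
    """Wrap a bare class name in the 'a photo of a {class}' template."""
    article = "an" if class_name[0].lower() in "aeiou" else "a"
    return f"a photo of {article} {class_name}"

def zero_shot_classify(image_emb, class_names, encode_text, scale=100.0):
    """Pick the class whose prompt embedding is closest to the image embedding."""
    text_embs = [encode_text(make_prompt(name)) for name in class_names]
    sims = [dot(image_emb, t) for t in text_embs]
    probs = softmax([scale * s for s in sims])
    best = max(range(len(class_names)), key=lambda i: sims[i])
    return class_names[best], probs

# Hypothetical stand-in for a real text encoder.
FAKE_TEXT_EMBEDDINGS = {
    "a photo of a construction site": [0.9, 0.1, 0.0],
    "a photo of an office": [0.1, 0.9, 0.0],
    "a photo of a warehouse": [0.0, 0.2, 0.9],
}

def fake_encode_text(prompt):
    return FAKE_TEXT_EMBEDDINGS[prompt]

# Stand-in for the image encoder's output on one video frame.
frame_embedding = [0.1, 0.2, 0.95]
classes = ["construction site", "office", "warehouse"]

label, probs = zero_shot_classify(frame_embedding, classes, fake_encode_text)
print(label, [round(p, 3) for p in probs])
```

Adding a new category means adding one more class name; nothing is retrained.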

The headline number from the 2021 paper: the best CLIP model, ViT-L/14 at 336-pixel resolution, reached roughly 76% top-1 accuracy on ImageNet, a 1,000-category benchmark, without training on a single ImageNet label. That matched the original ResNet-50, which had been trained directly on ImageNet's roughly 1.3 million labelled images. A model that never saw the benchmark's training set tied a model built for it. More striking still, CLIP was far more robust when the test images were unusual (sketches, adversarial renderings, odd lighting), most plausibly because it learned from the messy open web rather than one clean dataset.

Let us make the arithmetic concrete, because the cost of zero-shot classification is the cost of a few dot products, and that is worth feeling in your hands. Suppose you classify a frame against 50 candidate labels, with a 768-number embedding each. The image goes through the encoder once. Then you compute 50 dot products, each of which is 768 multiply-adds:

50 labels × 768 numbers = 38,400 multiply-add operations

A modern CPU does billions of those per second. The label comparison is, for practical purposes, free; the only real cost is running the image encoder once per frame. Better still, you compute the 50 text embeddings once and reuse them for every frame forever. This is the economic shape that makes CLIP so deployable: pay once to encode each image, pay almost nothing to compare it against as many text labels as you like.

The Numbers Behind The Models — What You Actually Download

CLIP is not one model but a family, and the names encode the design. A label like ViT-L/14 reads as: a Vision Transformer, Large size, with 14×14-pixel patches. The size letters run B, L, H, G — base, large, huge, giant — in increasing order. Smaller is faster and cheaper; larger is more accurate and slower. The table below lists the OpenAI ViT-based releases so you can see the trade in concrete terms.

| Model | Input resolution | Vision params | Embedding length | Release |
| --- | --- | --- | --- | --- |
| ViT-B/32 | 224×224 | 88M | 512 | Jan 2021 |
| ViT-B/16 | 224×224 | 86M | 512 | Jul 2021 |
| ViT-L/14 | 224×224 | 304M | 768 | Jan 2022 |
| ViT-L/14@336px | 336×336 | 304M | 768 | Apr 2022 |

Two columns deserve a second look. Vision params is roughly how heavy the image encoder is — 88 million for the small one, 304 million for the large — and it tracks directly to how much GPU memory and time each image costs. Embedding length is the size of the row of numbers each image and caption becomes — 512 or 768 here. It matters for your storage bill: if you embed a million archive frames at 768 numbers each, stored as 4-byte floats, that is:

1,000,000 frames × 768 numbers × 4 bytes = 3,072,000,000 bytes ≈ 3.07 GB

Three gigabytes to make a million-frame archive searchable by plain-English text. That is the kind of number that turns a "someday" feature into a quarter's roadmap item.
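The same arithmetic as a small helper, so the estimate can be rerun for other archive sizes. The function name is illustrative, and the 2-byte variant is an assumption about storing the numbers at half precision, which the article does not discuss.

```python
def index_size_bytes(num_frames, embedding_length, bytes_per_number=4):
    """Raw storage for an embedding index, ignoring any database overhead."""
    return num_frames * embedding_length * bytes_per_number

full = index_size_bytes(1_000_000, 768)      # 4-byte floats, as in the text
half = index_size_bytes(1_000_000, 768, 2)   # assumption: 2-byte half-precision floats
small = index_size_bytes(1_000_000, 512)     # a 512-number model such as ViT-B/32

for name, size in [("768 x 4 bytes", full), ("768 x 2 bytes", half), ("512 x 4 bytes", small)]:
    print(f"{name}: {size / 1e9:.2f} GB")
```

The figure covers the raw numbers only; a real vector database adds its own index structures on top.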

The training cost, for contrast, is the part you will never pay yourself and should never try to. OpenAI's largest ViT took 12 days on 256 high-end V100 GPUs; the largest ResNet variant took 18 days on 592 GPUs. You download the finished weights — OpenAI released them, and the open community released many more — and you are standing on the far side of that compute for free.

CLIP, OpenCLIP, And The Open Ecosystem

OpenAI shipped the original weights but kept the 400-million-pair dataset private. That gap is exactly what the open-source community closed. OpenCLIP, maintained by the LAION group and ML Foundations, reproduced and extended CLIP on published datasets — LAION-400M, LAION-2B, and DataComp-1B — so anyone can audit the data and retrain. An OpenCLIP ViT-L/14 trained on LAION-2B saw 32 billion examples across 160 training epochs on 384 newer GPUs. The practical upshot for you: when an engineer says "we'll use CLIP", they almost always mean a CLIP-style model from a menu of dozens, most of them open-weight and free to self-host. The lesson on model artifact formats and the open-vs-closed procurement decision covers how to choose among them and what the licences let you ship.

Other groups pushed the idea to extremes. Google's ALIGN trained on over a billion noisier image-alt-text pairs and showed that sheer scale can make up for noisy, lightly filtered data. EVA-CLIP scaled the recipe all the way to an 18-billion-parameter model that reached about 82% zero-shot ImageNet accuracy, several points above the original. The trend line is consistent and worth internalising as a planning heuristic: so far, more data and more parameters have kept buying accuracy, which is why the "best CLIP" is a moving target you re-check each quarter rather than a fixed answer.

SigLIP — Why The Loss Function Changed In 2025

If CLIP is the idea, SigLIP is the refinement that quietly took over production in 2025, so it is worth understanding the one thing it changes.

Recall the contrastive grid: for a batch of N pairs, CLIP scores all N×N combinations and applies a softmax — a normalisation that, for each image, spreads a fixed budget of "probability" across all N captions in the batch. That normalisation couples every pair to every other pair in the batch. It works, but it is memory-hungry, because computing it correctly needs the whole N×N grid in play at once. SigLIP — short for Sigmoid Loss for Language-Image Pre-training, introduced by a Google team at the 2023 ICCV conference — swaps the softmax for a sigmoid. Instead of asking "among all captions in the batch, which matches this image?", it asks a simpler yes/no question of every single image-caption cell independently: "do these two match — yes or no?" Each cell becomes its own little true/false problem.

That sounds like a small change. Its consequence is large. Because the sigmoid treats each pair on its own, you no longer pay the memory tax of the full softmax grid, so you can fit much larger batches on the same hardware. The original SigLIP paper showed that on four TPU chips you could train a Base model at a batch size of 4,096 where the equivalent CLIP softmax setup maxed out at 2,048. Bigger batches mean more negatives per step, and — as we saw earlier — more negatives generally mean a sharper model. SigLIP matched or beat CLIP while being cheaper to train.
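A minimal sketch of the per-cell sigmoid loss, for comparison with the softmax grid. The embeddings are toy values and the function names are hypothetical. The `scale` and `bias` defaults are illustrative starting values; in SigLIP both are learned during training, and this sketch holds them fixed.

```python
import math

def dot(a, b):
    """Dot-product similarity between two equal-length rows."""
    return sum(x * y for x, y in zip(a, b))

def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def sigmoid_pairwise_loss(image_embs, text_embs, scale=10.0, bias=-10.0):
    """Each image-caption cell is its own yes/no problem; no row-wise softmax."""
    n = len(image_embs)
    total = 0.0
    for i, img in enumerate(image_embs):
        for j, txt in enumerate(text_embs):
            label = 1.0 if i == j else -1.0  # +1 on the diagonal, -1 elsewhere
            logit = scale * dot(img, txt) + bias
            total += -log_sigmoid(label * logit)
    return total / n

# Toy batch of N = 2 pairs (illustrative unit-length embeddings).
images = [[1.0, 0.0], [0.0, 1.0]]
matching_texts = [[1.0, 0.0], [0.0, 1.0]]
shuffled_texts = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = sigmoid_pairwise_loss(images, matching_texts)
loss_shuffled = sigmoid_pairwise_loss(images, shuffled_texts)
print(f"aligned: {loss_aligned:.3f}, shuffled: {loss_shuffled:.3f}")
```

No cell's term depends on any other cell, so the grid can be processed in chunks without first normalising across a whole row or column.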

SigLIP 2, released by Google DeepMind in February 2025, layered on more: multilingual training, a decoder objective, self-distillation, and better "dense" features that capture where things are in an image rather than just what is in it. The accuracy moved with it — a SigLIP 2 Base model at 256-pixel resolution reaches about 79% zero-shot ImageNet top-1, and the 512-pixel variant about 81%, against the original SigLIP's roughly 77%. The reason this matters to a product team is downstream: by 2026, SigLIP-family encoders are the default "eyes" inside leading open multimodal models such as Qwen3-VL and Gemma 3, displacing the original CLIP in new builds. When your engineers pick an encoder this year, SigLIP 2 is usually the starting point, with original CLIP kept around mainly for compatibility with older pipelines.

| | CLIP (2021) | SigLIP / SigLIP 2 (2023 / 2025) |
| --- | --- | --- |
| Loss function | Softmax contrastive (whole batch coupled) | Sigmoid (each pair judged independently) |
| Memory at large batch | Heavy: needs full N×N grid | Light: pairs are independent |
| Max practical batch | Smaller on the same hardware | Larger on the same hardware |
| Zero-shot ImageNet (model cited above) | ~76% (ViT-L/14@336) | ~79–81% (SigLIP 2 Base) |
| 2026 status | Compatibility / legacy pipelines | Default encoder in new open VLMs |

From One Frame To A Video — Where The Idea Goes Next

Everything so far has been about a single still image. Video is just images in sequence — typically 24 to 60 of them per second — so the natural question is how CLIP stretches across time. The honest answer for 2026: the dominant approach is still surprisingly simple, and that simplicity is good news for your budget.

The workhorse method is per-frame embedding plus pooling. You sample frames from the video — not all of them; one every second or two is usually plenty — run each sampled frame through the CLIP image encoder, and you now have a sequence of image embeddings, one per frame. To get a single embedding for a whole clip, you pool them, which most often means simply averaging the rows of numbers position by position. Average ten frame embeddings and you get one clip embedding that you can compare against text exactly as before. This "mean-pooling" baseline is so effective that, for zero-shot video-text retrieval, it has matched or beaten many purpose-built video models. Sometimes the cheapest method is also near the best.
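Mean-pooling is short enough to show in full. The frame embeddings below are invented three-number rows standing in for real encoder output, and rescaling the averaged row back to unit length is an assumption of this sketch, made so the clip embedding can be compared with text in the same way a single frame would be.

```python
import math

def mean_pool(frame_embeddings):
    """Average the rows position by position to get one clip embedding."""
    n = len(frame_embeddings)
    if n == 0:
        raise ValueError("need at least one frame embedding")
    length = len(frame_embeddings[0])
    return [sum(frame[k] for frame in frame_embeddings) / n for k in range(length)]

def normalize(v):
    """Scale a row to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    """Dot-product similarity between two equal-length rows."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical embeddings for three sampled frames of one clip.
frames = [
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.6, 0.8, 0.0],
]
clip_embedding = normalize(mean_pool(frames))

# Stand-in for the text encoder's output for "a person waving".
text_embedding = [0.8, 0.6, 0.0]
score = dot(clip_embedding, text_embedding)
print(f"clip-to-text similarity: {score:.3f}")
```

Averaging discards frame order: the same three frames in reverse produce the same clip embedding.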

Make the cost concrete. A 10-minute clip sampled at one frame per second is 600 frames. At, say, 5 milliseconds of GPU time per frame on the image encoder:

600 frames × 5 ms/frame = 3,000 ms = 3 seconds of GPU time

Three seconds of compute to make a ten-minute video searchable by plain English. Index a thousand such clips overnight and you have a video archive your users can query in their own words the next morning.
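The estimate generalises to other clip lengths and sampling rates. The 5 ms per frame is the article's illustrative figure, not a measurement, and the helper names are hypothetical. Video decoding and disk reads are not counted and may add to the total.

```python
def sampled_frames(duration_seconds, frames_per_second=1.0):
    """Number of frames sampled from a clip at the given sampling rate."""
    return int(duration_seconds * frames_per_second)

def encoder_gpu_seconds(duration_seconds, frames_per_second=1.0, ms_per_frame=5.0):
    """GPU time spent in the image encoder only."""
    return sampled_frames(duration_seconds, frames_per_second) * ms_per_frame / 1000.0

one_clip = encoder_gpu_seconds(10 * 60)   # one 10-minute clip at 1 frame per second
archive = 1000 * one_clip                 # a thousand such clips
print(f"one clip: {one_clip:.0f} s, archive: {archive / 60:.0f} min")
```

Under these assumptions a thousand ten-minute clips need about 50 minutes of encoder time on one GPU.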

More sophisticated variants exist when mean-pooling is not enough, for example when order matters, such as telling "person sits down" from "person stands up". Models like X-CLIP and CLIP2Video add lightweight temporal components that let the frames talk to each other before pooling, capturing motion the averaging step throws away; ViFi-CLIP takes the opposite route and fine-tunes CLIP on video while keeping simple frame-level pooling. We go deep on those architecture decisions (frame sampling rates, token streaming, when temporal modelling earns its cost) in the video VLMs lesson. For now the mental model to keep is: CLIP gives you the per-frame "eyes"; video understanding is mostly a question of how cleverly you combine the frames.

[Diagram: how CLIP scales from one frame to a whole clip. On the left, a single frame enters the CLIP image encoder and becomes one frame embedding. In the middle, a strip of five sampled frames each becomes a frame embedding, and the five embeddings flow into a pooling box labelled 'average the rows'. On the right, the pooling box outputs one clip embedding, which is compared against a text embedding for 'a person waving' to produce a match score.]

Figure 3. Per-frame embedding plus pooling. Each sampled frame becomes an embedding; averaging the embeddings yields one clip embedding you can match against text.

What CLIP Powers — The Quiet Engine Behind Modern Multimodal AI

It is easy to file CLIP under "interesting 2021 research" and miss that it is load-bearing infrastructure for the systems you use today. Three examples make the point.

First, text-to-image generation. Stable Diffusion and its relatives use CLIP's text encoder to turn your prompt into the embedding that steers the image being generated. When you type "a watercolour fox in a forest", a CLIP text embedding is what the generator listens to. The image generator never sees your words directly; it sees CLIP's understanding of them.

Second, the eyes inside multimodal language models. The standard recipe for a model that can "look at" an image and chat about it (the kind powering visual assistants in video calls) is: take a frozen CLIP-style image encoder, add a small adapter (often just a two-layer projection network) that translates image embeddings into the token representations the language model works with internally, and bolt it onto a large language model. The open LLaVA-1.5 model did exactly this with CLIP ViT-L/14 at 336 pixels and a two-layer projection; the first LLaVA release used a single linear layer. DeepMind's earlier Flamingo froze a CLIP-style image encoder and connected it to a frozen language model. The pattern is everywhere because it works and it is cheap: the expensive encoders are pre-trained once, and you train only the small bridge between them. The open-frontier VLM lesson traces how that recipe became LLaVA, Qwen-VL, and the rest.
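To make "a small adapter" concrete, here is a toy two-layer projection in plain Python. It is a sketch of the general shape only, under loud assumptions: the sizes (8 numbers in, 16 out, 4 patches) are tiny placeholders for the real dimensions, the weights are random where a real adapter's would be trained, and GELU is one common choice of nonlinearity between the two layers. None of the names come from a real library.

```python
import math
import random

def gelu(x):
    """Gaussian Error Linear Unit, a smooth nonlinearity."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear(x, weights, bias):
    """One fully connected layer: weights has one row per output number."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def make_layer(n_in, n_out, rng):
    """Random weights and zero biases (a real adapter would learn these)."""
    weights = [[rng.gauss(0.0, 0.02) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def project(patch_embeddings, layer1, layer2):
    """Map each image patch embedding to one vector of the language model's width."""
    tokens = []
    for patch in patch_embeddings:
        hidden = [gelu(h) for h in linear(patch, *layer1)]
        tokens.append(linear(hidden, *layer2))
    return tokens

rng = random.Random(0)
VISION_DIM, LLM_DIM, NUM_PATCHES = 8, 16, 4  # toy sizes, far smaller than real models

layer1 = make_layer(VISION_DIM, LLM_DIM, rng)
layer2 = make_layer(LLM_DIM, LLM_DIM, rng)

# Stand-in for the frozen image encoder's per-patch outputs.
patches = [[rng.uniform(-1.0, 1.0) for _ in range(VISION_DIM)] for _ in range(NUM_PATCHES)]
tokens = project(patches, layer1, layer2)
print(len(tokens), "image tokens of width", len(tokens[0]))
```

Only `layer1` and `layer2` would be trained in this recipe; the encoder that produced `patches` stays frozen.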

Third, semantic search and moderation over media archives. Because images and text share a space, you can build "search your photo library in English", "find frames similar to this reference image", or "flag any frame that matches a list of prohibited concepts" with the same dot-product machinery. No per-category training, no labelled dataset — just embeddings and a nearest-neighbour lookup. This is the backbone of the multimodal retrieval systems we return to in the video RAG lesson.

A Common Pitfall — CLIP Reads Pictures, It Does Not Read Them Closely

The mistake we see teams make most often is over-trusting CLIP's precision. CLIP is excellent at the gist of an image — is this a kitchen or a beach, a dog or a car, day or night. It is unreliable at fine detail, counting, reading text, and spatial relationships. Ask it to tell "three people" from "four people", or "the cup is on the left of the laptop" from "the cup is on the right", and it will frequently miss. It can also be fooled: a famous demonstration showed CLIP misreading an apple as an iPod when the word "iPod" was written on a paper label stuck to the apple, because it had learned to read text in images as a strong cue. CLIP is also weak at negation — "a street with no cars" tends to match images that do contain cars, because the model latches onto "cars" and largely ignores "no".

The practical rule: reach for CLIP-style embeddings for coarse matching, retrieval, tagging, and filtering, where a strong gist is exactly what you want and the occasional miss is cheap to correct. When the feature genuinely needs fine reading — counting items on a shelf, reading a licence plate, reasoning about precise spatial layout — that is a job for a heavier multimodal model or a purpose-built detector, and trying to force CLIP into it produces a demo that impresses in the meeting and disappoints in production. Knowing which side of that line your feature sits on is most of the engineering judgement this whole phase is teaching.

Where Fora Soft Fits In

We build the features this primer underpins. In video surveillance products, CLIP-style embeddings let operators search recorded footage in plain language — "show me anyone in a yellow vest near the loading dock" — instead of scrubbing timelines by hand. In e-learning and OTT platforms, the same shared-space search makes a back catalogue navigable by topic and scene rather than by filename. In video conferencing and telemedicine tools, frozen CLIP-style encoders are the eyes inside the assistants that describe what is on screen or tag moments for later review. Across these verticals the engineering pattern is the one this article describes: encode once, compare cheaply, and pick the lightest model that clears the accuracy bar the feature actually needs.

References

  1. Radford, A., Kim, J. W., Hallacy, C., et al. "Learning Transferable Visual Models From Natural Language Supervision." Proceedings of the 38th International Conference on Machine Learning (ICML), PMLR, 2021, pp. 8748–8763. arXiv:2103.00020. — The original CLIP paper. Primary source for the dual-encoder design, the 400M WIT dataset, the symmetric contrastive loss, the learned temperature, the zero-shot prompt template "a photo of a {class}", the ~76% zero-shot ImageNet result for ViT-L/14@336px, and the training-compute figures.
  2. "Contrastive Language-Image Pre-training." Wikipedia, accessed 2026-05-31. — Used for the consolidated model table (resolution, vision parameters, embedding dimension, release dates), the WIT dataset construction detail, image-preprocessing constants, and downstream-use summaries; cross-checked against the primary paper above.
  3. Zhai, X., Mustafa, B., Kolesnikov, A., Beyer, L. "Sigmoid Loss for Language Image Pre-Training." IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 11975–11986. — Primary source for the SigLIP sigmoid loss and the batch-size comparison (4,096 vs 2,048 on four TPU-v4 chips).
  4. Tschannen, M., et al. "SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features." Google DeepMind, February 2025. arXiv:2502.14786. — Primary source for SigLIP 2's multilingual and dense-feature objectives and the ~79% (256px) / ~81% (512px) zero-shot ImageNet figures.
  5. Jia, C., Yang, Y., Xia, Y., et al. "Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision" (ALIGN). Proceedings of ICML, PMLR, 2021, pp. 4904–4916. — Source for the >1-billion noisy-pair scaling result.
  6. Sun, Q., et al. "EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters." 2024. arXiv:2402.04252. — Source for the ~82% zero-shot ImageNet figure at 18B parameters.
  7. Ilharco, G., Wortsman, M., Wightman, R., et al. "OpenCLIP." Zenodo, 2021. github.com/mlfoundations/open_clip. — Source for the open reproduction on LAION-400M/2B and DataComp-1B and the 32B-samples/160-epoch training detail.
  8. Liu, H., Li, C., Wu, Q., Lee, Y. J. "Visual Instruction Tuning" (LLaVA) and "Improved Baselines with Visual Instruction Tuning" (LLaVA-1.5), 2023–2024. — Source for the frozen-CLIP-ViT-L/14@336 + two-layer-MLP projection recipe inside a multimodal LLM.
  9. Alayrac, J.-B., et al. "Flamingo: a Visual Language Model for Few-Shot Learning." Advances in Neural Information Processing Systems 35, 2022, pp. 23716–23736. — Source for freezing a CLIP-style image encoder and connecting it to a frozen language model.
  10. Ni, B., et al. "Expanding Language-Image Pretrained Models for General Video Recognition" (X-CLIP), and related ViFi-CLIP / CLIP2Video work, 2021–2023. — Sources for per-frame-embedding-plus-pooling video adaptation and the strength of the mean-pooling baseline for zero-shot video-text retrieval.