CLIP

Definition

A model that maps images and text into a shared embedding space so that it can score how well a caption matches a picture. It is the core technique behind open-vocabulary vision.

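CLIP is trained contrastively on image-caption pairs: an image encoder and a text encoder learn to place matching images and captions close together in a shared embedding space, and mismatched ones farther apart. This enables zero-shot classification, where an image is compared against text descriptions of the candidate labels, and text-based image search. It is a foundational building block for many multimodal and open-vocabulary systems.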
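
To illustrate the zero-shot classification step, here is a minimal sketch in plain Python. It assumes the image and the label prompts have already been embedded by some CLIP-style encoders; the toy 3-dimensional vectors, the function names, and the temperature value are all hypothetical and chosen for illustration only. It shows the general idea (cosine similarity followed by a softmax), not any particular library's API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_embedding, text_embeddings, temperature=0.07):
    """Score one image embedding against a dict of {label prompt: text embedding}.

    Returns the best-matching label and a softmax distribution over labels.
    The temperature is an illustrative assumption, not a tuned value.
    """
    labels = list(text_embeddings)
    logits = [
        cosine_similarity(image_embedding, text_embeddings[label]) / temperature
        for label in labels
    ]
    # Numerically stable softmax: subtract the largest logit first.
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    probs = {label: e / total for label, e in zip(labels, exps)}
    best_label = max(probs, key=probs.get)
    return best_label, probs


# Toy vectors standing in for real encoder outputs (hypothetical values).
image_embedding = [0.9, 0.1, 0.2]
text_embeddings = {
    "a photo of a dog": [1.0, 0.0, 0.1],
    "a photo of a cat": [0.0, 1.0, 0.1],
    "a photo of a car": [0.1, 0.1, 1.0],
}

best_label, probs = zero_shot_classify(image_embedding, text_embeddings)
print(best_label)
```

Text-based image search follows the same pattern with the roles swapped: one text embedding is compared against many image embeddings, and the images are ranked by similarity.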

Also known as

Contrastive Language-Image Pre-training