CLIP learns from image-caption pairs to place matching images and text near each other in a shared embedding space, which enables zero-shot classification and text-based image search. It is a foundational building block for many multimodal and open-vocabulary systems.
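
To make the zero-shot idea concrete, here is a minimal sketch of how classification could work once images and text are embedded close together: compare an image embedding against one text embedding per candidate label and pick the closest. The 3-dimensional vectors, the function names, and the `temperature` value are illustrative assumptions, not outputs or APIs of a real CLIP model; in practice the embeddings would come from trained image and text encoders.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_embedding, text_embeddings, labels, temperature=0.07):
    """Return the label whose text embedding is closest to the image embedding.

    `temperature` is an assumed scaling factor that sharpens the softmax;
    the value here is illustrative.
    """
    sims = [cosine_similarity(image_embedding, t) for t in text_embeddings]
    probs = softmax([s / temperature for s in sims])
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs


# Toy, hand-made vectors standing in for encoder outputs (not real embeddings).
labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
text_embeddings = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
image_embedding = [0.8, 0.2, 0.1]

predicted, probs = zero_shot_classify(image_embedding, text_embeddings, labels)
print(predicted)
```

Text-based image search follows the same pattern with the roles reversed: embed the query text once, score it against many image embeddings, and rank the images by similarity.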