A Vision Transformer (ViT) splits an image into fixed-size patches, treats each patch as a token, and uses self-attention to relate every patch to every other. When trained at scale, it has largely overtaken older convolutional designs on accuracy, and it underlies many modern detectors, segmenters, and multimodal models.
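
The patch-and-attend idea can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a full ViT: the function names `patchify` and `self_attention`, the 32x32 image, the patch size of 8, and the embedding width of 64 are all hypothetical choices, the weights are random rather than learned, and positional embeddings, the class token, multiple heads, and the MLP layers are left out.

```python
import numpy as np


def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("height and width must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (gh * gw, p * p * c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a token sequence."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Subtract the row maximum before exponentiating for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


# Illustrative sizes only (assumptions, not values from the text).
rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))           # stand-in for a real image
patches = patchify(image, patch_size=8)   # 16 patches, each 8 * 8 * 3 = 192 values

dim = 64
w_embed = rng.normal(scale=0.02, size=(patches.shape[1], dim))
tokens = patches @ w_embed                # linear patch embedding: one token per patch

w_q, w_k, w_v = (rng.normal(scale=0.02, size=(dim, dim)) for _ in range(3))
attended, weights = self_attention(tokens, w_q, w_k, w_v)

print(patches.shape, tokens.shape, attended.shape, weights.shape)
```

Each row of `weights` sums to 1 and says how strongly one patch attends to every other patch, which is the sense in which attention "relates" the tokens. A trained model would learn `w_embed`, `w_q`, `w_k`, and `w_v` instead of sampling them.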

