Vision Transformer

Definition

A neural network that applies the transformer architecture to image patches instead of words. It is now the backbone of many state-of-the-art vision models.

A Vision Transformer (ViT) splits an image into fixed-size patches, embeds each patch as a token, and uses self-attention to relate every patch to every other. Given enough data and model capacity, it has largely overtaken older convolutional designs on accuracy, and it underlies many modern detectors, segmenters, and multimodal models.
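The patch-to-token step and the attention step can be illustrated with a minimal NumPy sketch. It is a toy under stated assumptions, not a full ViT: the helper names `patchify` and `self_attention`, the 32x32 image, the 8x8 patch size, the embedding width of 64, and the random weights are all chosen here for illustration. It also uses a single attention head and omits the class token, layer normalization, and MLP blocks that real models typically include.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (gh * gw, p * p * c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product attention over a (N, D) token matrix."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v, weights

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))          # stand-in for a real image
patches = patchify(image, patch_size=8)  # 16 patches, each of length 8*8*3

dim = 64  # assumed embedding width
w_embed = rng.normal(scale=0.02, size=(patches.shape[1], dim))
pos_embed = rng.normal(scale=0.02, size=(patches.shape[0], dim))
tokens = patches @ w_embed + pos_embed   # one token per patch, plus position

w_q, w_k, w_v = (rng.normal(scale=0.02, size=(dim, dim)) for _ in range(3))
out, weights = self_attention(tokens, w_q, w_k, w_v)
```

Each row of `weights` describes how strongly one patch attends to every other patch, which is how the model relates distant regions of the image within a single layer. The position embedding is added because attention by itself has no notion of where a patch sits in the image.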

Also known as

ViT