The transformer processes a sequence by letting every element attend to every other, capturing context regardless of distance. Introduced for language, it now underlies LLMs, ViTs, and generative video, which is why one architecture name appears across the whole field.

