Attention scores how much each input element should influence each output element, so the model focuses on the relevant words or image patches. It is what gives transformers their grasp of context, and its compute cost shapes how long inputs can be.