The key-value cache holds intermediate attention results for tokens already processed, letting generation continue cheaply token by token. It speeds output but grows with context length and concurrency, making it a main driver of serving memory and a target for optimisation like PagedAttention.