vLLM

Definition

An open-source, high-throughput engine for serving language and multimodal models. It uses PagedAttention and continuous batching to pack many concurrent requests onto a GPU.

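vLLM raises serving throughput in two ways. First, it manages the attention key-value (KV) cache in fixed-size blocks, much as an operating system pages virtual memory, so little GPU memory is wasted on reservations that go unused. Second, it batches requests continuously: new requests join the running batch as soon as others finish, instead of waiting for the whole batch to complete. It is a common self-hosting choice when a product runs open models at scale and wants low cost per request.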
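The paging idea can be illustrated with a toy allocator. This is a minimal sketch for intuition only, not vLLM's actual implementation: the class name `ToyBlockAllocator`, its methods, and the block size of 16 tokens are all hypothetical choices made for this example. Each request gets a block table that grows one block at a time as tokens are generated, and a finished request returns its blocks to a shared free pool.

```python
# Toy illustration of paged KV-cache allocation. All names and the block
# size are assumptions for this sketch; this is not vLLM's code.
BLOCK_SIZE = 16  # tokens stored per block (hypothetical value)


class ToyBlockAllocator:
    def __init__(self, num_blocks: int) -> None:
        # Pool of physical block ids that no request currently owns.
        self.free_blocks = list(range(num_blocks))
        # Per-request list of owned block ids (the "page table").
        self.block_tables = {}
        # Per-request count of tokens stored so far.
        self.token_counts = {}

    def append_token(self, request_id: str) -> bool:
        """Reserve space for one more token; return False if memory is full."""
        table = self.block_tables.setdefault(request_id, [])
        count = self.token_counts.get(request_id, 0)
        # A new block is needed only when every owned block is full.
        if count == len(table) * BLOCK_SIZE:
            if not self.free_blocks:
                return False
            table.append(self.free_blocks.pop())
        self.token_counts[request_id] = count + 1
        return True

    def free(self, request_id: str) -> None:
        """Return a finished request's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.token_counts.pop(request_id, None)


allocator = ToyBlockAllocator(num_blocks=4)
for _ in range(20):
    allocator.append_token("request-a")  # 20 tokens occupy 2 blocks
for _ in range(10):
    allocator.append_token("request-b")  # 10 tokens occupy 1 block
allocator.free("request-a")  # its 2 blocks become reusable immediately
```

Because memory is claimed block by block, a request never reserves space for its maximum possible length up front, and the blocks freed by one request can be handed to a newly admitted one right away. That reuse is what makes continuous batching practical in this sketch.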

Also known as

PagedAttention