vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention


GitHub | Documentation | Paper

vLLM, equipped with PagedAttention, sets a new state of the art in LLM serving: it delivers up to 24x higher throughput than HuggingFace Transformers, without requiring any model architecture changes. PagedAttention is the core technology behind vLLM, our LLM inference and serving engine, which supports a variety of models with high performance and an easy-to-use interface. LMSYS is expanding the use of vLLM to a wider range of models, including Databricks Dolly, LAION's OpenAssistant, and Stability AI's StableLM.
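
To illustrate the interface, here is a minimal offline-inference sketch using vLLM's Python API, along the lines of the project's quickstart; the prompts and the `facebook/opt-125m` model name are just examples, and any supported HuggingFace causal LM can be substituted:

```python
# Minimal offline-inference sketch with vLLM (pip install vllm).
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The future of LLM serving is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# LLM loads the model and manages the PagedAttention-backed KV cache.
llm = LLM(model="facebook/opt-125m")

# generate() batches all prompts through the engine and returns one
# RequestOutput per prompt once generation completes.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, output.outputs[0].text)
```

The same engine also backs vLLM's server entrypoints for online serving, so the throughput gains from PagedAttention apply to batched offline jobs and request-serving deployments alike.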
