Benchmarking LLM Inference Backends: vLLM, LMDeploy, MLC-LLM, TensorRT-LLM, and TGI


A comparison of Llama 3 serving performance across vLLM, LMDeploy, MLC-LLM, TensorRT-LLM, and Hugging Face TGI on BentoCloud.

June 5, 2024 • Written By Rick Zhou, Larme Zhao, Bo Jiang, and Sean Sheng

To help developers make informed decisions, the BentoML engineering team conducted a comprehensive benchmark study of Llama 3 serving performance with vLLM, LMDeploy, MLC-LLM, TensorRT-LLM, and Hugging Face TGI on BentoCloud. The benchmark client spawns up to the target number of concurrent users within 20 seconds, then stress tests the LLM backend by sending concurrent generation requests with randomly selected prompts.
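The ramp-up-then-stress-test pattern described above can be sketched as a simple asynchronous load generator. The snippet below is an illustrative sketch, not the team's actual benchmark client; it assumes an OpenAI-compatible /v1/completions endpoint and the httpx library, and the URL, model name, prompt list, and tuning constants are placeholders.

```python
# Sketch of the benchmark protocol: spawn virtual users over a 20-second window,
# then have each user send generation requests with randomly chosen prompts.
# All endpoint, model, and prompt values below are hypothetical placeholders.
import asyncio
import random
import time

import httpx

BASE_URL = "http://localhost:3000/v1/completions"   # placeholder endpoint
MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"        # placeholder model name
PROMPTS = ["Explain KV caching.", "Summarize GPU history.", "Write a haiku about latency."]
TARGET_USERS = 50            # concurrency level under test
RAMP_UP_SECONDS = 20         # all users are spawned within this window
TEST_DURATION_SECONDS = 60   # stress-test duration after ramp-up


async def user_loop(client: httpx.AsyncClient, stop_at: float, latencies: list[float]) -> None:
    """One virtual user: repeatedly send a generation request with a random prompt."""
    while time.monotonic() < stop_at:
        payload = {"model": MODEL, "prompt": random.choice(PROMPTS), "max_tokens": 128}
        start = time.monotonic()
        resp = await client.post(BASE_URL, json=payload, timeout=120)
        resp.raise_for_status()
        latencies.append(time.monotonic() - start)


async def main() -> None:
    latencies: list[float] = []
    stop_at = time.monotonic() + RAMP_UP_SECONDS + TEST_DURATION_SECONDS
    async with httpx.AsyncClient() as client:
        tasks = []
        for _ in range(TARGET_USERS):
            # Stagger user start times so all users are active within RAMP_UP_SECONDS.
            await asyncio.sleep(RAMP_UP_SECONDS / TARGET_USERS)
            tasks.append(asyncio.create_task(user_loop(client, stop_at, latencies)))
        await asyncio.gather(*tasks)
    print(f"{len(latencies)} requests, mean latency {sum(latencies) / len(latencies):.2f}s")


if __name__ == "__main__":
    asyncio.run(main())
```

A real benchmark harness would additionally record time to first token and token throughput per request, which requires streaming responses rather than the single blocking call shown here.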


Read more on:

LLM Inference

vllm

tgi

Related news:

AMD's MI300X Outperforms Nvidia's H100 for LLM Inference

How attention offloading reduces the costs of LLM inference at scale

Show HN: Speeding up LLM inference 2x times (possibly)