Making AMD GPUs competitive for LLM inference (2023)


TL;DR MLC-LLM makes it possible to compile LLMs and deploy them on AMD GPUs using ROCm with competitive performance. More specifically, AMD Radeon™ RX 7900 XTX gives 80% of the speed of NVIDIA® GeForce RTX™ 4090 and 94% of the speed of NVIDIA® GeForce RTX™ 3090 Ti for Llama2-7B/13B.

MLC-LLM brings state-of-the-art performance to a wide variety of backends, including CUDA, Metal, ROCm, Vulkan, and OpenCL, spanning server-class GPUs to mobile devices (iPhone and Android). At a high level, the framework lets users take open language models and compile them with a Python-based workflow, including APIs to transform computational graphs, optimize the layout and scheduling of GPU kernels, and deploy the result natively on the platform of interest. Ongoing efforts include enabling batching and multi-GPU support, integrating with the PyTorch ecosystem, supporting more quantization schemes and model architectures, and bringing more automatic optimizations to more hardware backends.
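To make that workflow concrete, below is a minimal sketch of what running one of these prebuilt models looks like through the Python API. It assumes the mlc_chat package and ChatModule class from MLC-LLM's 2023-era releases, a prebuilt Llama-2-7b-chat-hf-q4f16_1 model artifact, and a device="rocm" argument selecting the AMD backend; exact package, class, and argument names may differ across MLC-LLM versions.

# Minimal sketch (assumed API): run a prebuilt, 4-bit-quantized Llama-2 model
# through MLC-LLM's 2023-era Python interface on an AMD GPU via ROCm.
from mlc_chat import ChatModule  # package/class name assumed; check your MLC-LLM version

# Load a compiled model artifact; "rocm" selects the AMD backend,
# while other builds would use "cuda", "vulkan", or "metal".
cm = ChatModule(model="Llama-2-7b-chat-hf-q4f16_1", device="rocm")

# Generate a completion, then print runtime statistics
# (prefill/decode tokens per second).
print(cm.generate(prompt="Why does memory bandwidth dominate LLM decoding speed?"))
print(cm.stats())

The same compiled model artifact targets other backends by changing the device argument, which is the point of the compilation-based approach described above.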


Read more on: LLM Inference, AMD GPUs

Related news:

How We Optimize LLM Inference for AI Coding Assistant

AMD Developing Next-Gen Fortran Compiler Based On Flang, Optimized For AMD GPUs

We fine-tuned Llama 405B on AMD GPUs