Exploring inference memory saturation effect: H100 vs. MI300x
This benchmark explores how GPU memory saturation affects LLM inference performance and cost, comparing NVIDIA H100 and AMD MI300x.
As prompt and batch sizes grow, the KV cache can exceed the H100's memory capacity. Once saturated, the inference engine must either recompute KV tensors on the fly or offload them to CPU memory, and either option degrades throughput and, with it, cost-effectiveness. In this benchmark, the 8xH100 setup begins to struggle at batch size 16, where memory saturation results in slower generation times.
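The saturation point can be estimated with back-of-envelope arithmetic: the KV cache grows linearly with both batch size and sequence length. A minimal sketch, assuming illustrative Llama-2-70B-like parameters (80 layers, 8 KV heads via grouped-query attention, head dimension 128, fp16) that are not from the benchmark itself:

```python
# Back-of-envelope KV-cache sizing. Model parameters below are assumptions
# for illustration (Llama-2-70B-like), not figures from the benchmark.

def kv_cache_bytes(batch_size, seq_len, n_layers=80, n_kv_heads=8,
                   head_dim=128, bytes_per_elem=2):
    """Bytes to cache K and V tensors for a given batch and context length."""
    # Two tensors (K and V) per layer, each of shape
    # [batch, n_kv_heads, seq_len, head_dim], at bytes_per_elem (fp16 = 2).
    return (2 * n_layers * batch_size * n_kv_heads
            * seq_len * head_dim * bytes_per_elem)

# Per-GPU HBM capacities: H100 has 80 GB, MI300X has 192 GB.
for batch in (1, 16, 64):
    gb = kv_cache_bytes(batch, seq_len=8192) / 1024**3
    print(f"batch={batch:3d}: KV cache ~ {gb:6.1f} GB "
          f"({gb / 80:5.2f}x one H100, {gb / 192:5.2f}x one MI300X)")
```

Under these assumptions an 8K-token context costs about 2.5 GB of KV cache per sequence, so batch 16 alone needs roughly 40 GB on top of the model weights. This is why larger per-GPU memory delays the point where recomputation or CPU offload becomes necessary.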