Exploring the inference memory saturation effect: H100 vs. MI300X


This benchmark explores how GPU memory saturation affects LLM inference performance and cost, comparing the NVIDIA H100 and the AMD MI300X.

As prompt and batch sizes grow, the NVIDIA H100 reaches its memory limits, forcing the inference engine to recompute KV tensors on the fly or offload them to CPU memory. This degrades throughput and causes a sharp drop in cost-effectiveness. The 8xH100 setup begins to struggle at batch size 16 due to memory saturation, resulting in slower generation times.
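To make the effect concrete, the back-of-the-envelope sketch below estimates how much HBM the KV cache consumes as batch size grows and compares it with the memory left after loading the model weights. The model dimensions (80 layers, 64 KV heads, 128-dim heads, a ~70B fp16 model, 16K-token sequences) and the per-node capacities are illustrative assumptions, not the benchmark's actual configuration; a grouped-query-attention model, for instance, would need far less KV memory.

```python
# Back-of-the-envelope KV cache sizing. All model and hardware numbers below are
# assumptions for illustration, not the benchmark's actual setup.

def kv_cache_gib(batch_size: int, seq_len: int,
                 n_layers: int = 80, n_kv_heads: int = 64,
                 head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """KV cache size in GiB: one K and one V tensor per layer, per token, per sequence."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size / 2**30

# Assumed aggregate HBM per node and fp16 weight footprint of a ~70B-parameter model.
HBM_GIB = {"8xH100": 8 * 80, "8xMI300X": 8 * 192}
WEIGHTS_GIB = 70e9 * 2 / 2**30  # ~130 GiB of fp16 weights

for node, hbm in HBM_GIB.items():
    free = hbm - WEIGHTS_GIB  # HBM left over for the KV cache
    for batch in (1, 8, 16, 32):
        kv = kv_cache_gib(batch, seq_len=16_384)
        status = "fits" if kv < free else "saturated -> recompute/offload"
        print(f"{node}: batch={batch:2d}  kv={kv:7.1f} GiB  free={free:6.1f} GiB  {status}")
```

Under these assumptions the 8xH100 node runs out of KV cache room around batch size 16 while the 8xMI300X node still has headroom, which matches the pattern the benchmark reports; the exact crossover point depends entirely on the assumed model and sequence length.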

Related news:

Nvidia's MLPerf submission shows B200 offers up to 2.2x training performance of H100

Huawei's Ascend 910 launches this October to challenge Nvidia's H100

Nvidia’s coveted H100 GPUs will be available on-demand through Lambda’s clusters