Llama 405B 506 tokens/second on an H200


The continued growth of LLM capabilities, fueled by increasing parameter counts and support for longer contexts, has led to their use in a wide variety of applications, each with diverse deployment requirements. We also show how the use of pipeline parallelism (PP) enabled a 1.2x speedup in the MLPerf Inference v4.1 Llama 2 70B benchmark on HGX H100 compared to our results published in August. In the minimum-latency scenario, tensor parallelism (TP) makes more GPU compute available for generating each token, leading to 5.6x faster performance than pipeline parallelism.
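As a rough illustration of how this TP-versus-PP trade-off is typically exposed when configuring a serving engine, here is a minimal sketch using vLLM's parallelism knobs as a stand-in (the article itself concerns TensorRT-LLM on HGX H100/H200); the model name, the 8-GPU node, and the engine choice are assumptions, not details from the article.

# Illustrative sketch only: vLLM used as a stand-in serving engine.
from vllm import LLM, SamplingParams

def build_engine(scenario: str) -> LLM:
    if scenario == "min_latency":
        # Shard every layer across all 8 GPUs (TP=8): each token is computed
        # with the full node's FLOPs, minimizing time per output token.
        tp, pp = 8, 1
    else:
        # Throughput-oriented: 2 pipeline stages of 4-way tensor parallelism
        # keep all GPUs busy on different microbatches, at the cost of
        # per-token latency.
        tp, pp = 4, 2
    return LLM(
        model="meta-llama/Llama-2-70b-hf",  # example model matching the benchmark
        tensor_parallel_size=tp,
        pipeline_parallel_size=pp,
    )

engine = build_engine("min_latency")
outputs = engine.generate(
    ["Why does tensor parallelism reduce per-token latency?"],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)

In this sketch the only difference between the two scenarios is how the GPUs are partitioned: the minimum-latency configuration dedicates all devices to each token, while the throughput configuration trades some of that per-token compute for pipeline concurrency across requests.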


Read more on: Llama, tokens, H200

Related news:

The Role of Anchor Tokens in Self-Attention Networks

Run Llama locally with only PyTorch on CPU

A new Llama-based model for efficient large-scale voice generation