Llama 405B 506 tokens/second on an H200


The continued growth of LLM capabilities, fueled by increasing parameter counts and support for longer contexts, has led to their use in a wide variety of applications, each with diverse deployment requirements. We also show how the use of pipeline parallelism (PP) enabled a 1.2x speedup in the MLPerf Inference v4.1 Llama 2 70B benchmark on HGX H100 compared to our results published in August. In the minimum-latency scenario, tensor parallelism (TP) makes more GPU compute available for generating each token, leading to 5.6x faster performance than pipeline parallelism.
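As a rough illustration of how this TP-versus-PP trade-off is typically exposed when configuring a serving engine, here is a minimal sketch using vLLM's parallelism knobs as a stand-in (the article itself concerns TensorRT-LLM on HGX H100/H200); the model name, the 8-GPU node, and the engine choice are assumptions, not details from the article.

# Illustrative sketch only: vLLM used as a stand-in serving engine.
from vllm import LLM, SamplingParams

def build_engine(scenario: str) -> LLM:
    if scenario == "min_latency":
        # Shard every layer across all 8 GPUs (TP=8): each token is computed
        # with the full node's FLOPs, minimizing time per output token.
        tp, pp = 8, 1
    else:
        # Throughput-oriented: 2 pipeline stages of 4-way tensor parallelism
        # keep all GPUs busy on different microbatches, at the cost of
        # per-token latency.
        tp, pp = 4, 2
    return LLM(
        model="meta-llama/Llama-2-70b-hf",  # example model matching the benchmark
        tensor_parallel_size=tp,
        pipeline_parallel_size=pp,
    )

engine = build_engine("min_latency")
outputs = engine.generate(
    ["Why does tensor parallelism reduce per-token latency?"],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)

In this sketch the only difference between the two scenarios is how the GPUs are partitioned: the minimum-latency configuration dedicates all devices to each token, while the throughput configuration trades some of that per-token compute for pipeline concurrency across requests.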


Read more on: Llama, tokens, H200

Related news:

The Role of Anchor Tokens in Self-Attention Networks

Run Llama locally with only PyTorch on CPU

A new Llama-based model for efficient large-scale voice generation