Defeating Nondeterminism in LLM Inference


Reproducibility is a bedrock of scientific progress. However, it’s remarkably difficult to get reproducible results out of large language models. For example, you might observe that asking ChatGPT the same question multiple times provides different results. This by itself is not surprising, since getting a result from a language model involves “sampling”, a process that converts the language model’s output into a probability distribution and probabilistically selects a token. What might be more surprising is that even when we adjust the temperature down to 0 (i.e., greedy sampling, where the LLM always chooses the highest-probability token, which should make sampling theoretically deterministic), LLM APIs are still not deterministic in practice (see past discussions here, here, or here). Even when running inference on your own hardware with an OSS inference library like vLLM or SGLang, sampling still isn’t deterministic (see here or here).
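To make the sampling step concrete, here is a minimal sketch of temperature-based sampling versus greedy (temperature-0) decoding. It is an illustration in plain PyTorch, not code from any particular API or inference library, and the function name and logits are made up:

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float) -> int:
    """Pick the next token id from a vector of logits.

    temperature == 0 falls back to greedy sampling (argmax), which is
    theoretically deterministic; otherwise we sample from the softmax
    distribution of the temperature-scaled logits.
    """
    if temperature == 0.0:
        # Greedy sampling: always take the highest-probability token.
        return int(torch.argmax(logits).item())
    probs = torch.softmax(logits / temperature, dim=-1)
    return int(torch.multinomial(probs, num_samples=1).item())

# Made-up logits over a 5-token vocabulary.
logits = torch.tensor([1.0, 3.5, 0.2, 3.4, -1.0])
print(sample_next_token(logits, temperature=0.0))  # always token 1
print(sample_next_token(logits, temperature=1.0))  # varies from run to run
```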

A common explanation is the “concurrency + floating point” hypothesis: floating-point arithmetic is non-associative, and parallel operations across multiple GPU threads can yield different results depending on execution order. This directly impacts the computation of attention scores and logits in the transformer architecture. You can find the “concurrency + floating point” hypothesis repeated by others, like here: “There are speed tradeoffs, and in order to make the endpoints fast GPUs are used, which do parallel [nondeterministic] calculations.”

Unlike for RMSNorm, additional constraints around arithmetic intensity and utilizing tensor cores force us to split efficient matmul kernels along 2D output tiles instead of individual output elements.
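As a small standalone illustration of why accumulation order matters (a NumPy sketch, not code from vLLM, SGLang, or any real kernel), the snippet below shows that summing the same floating-point numbers in different orders, which is effectively what different execution orders or tiling strategies do, can change the result:

```python
import numpy as np

# Floating-point addition is not associative: rounding depends on the order
# in which terms are accumulated.
vals = np.array([1e8, 1.0, -1e8, 1.0], dtype=np.float32)

left_to_right = ((vals[0] + vals[1]) + vals[2]) + vals[3]  # -> 1.0
pairwise = (vals[0] + vals[1]) + (vals[2] + vals[3])       # -> 0.0
print(left_to_right, pairwise)  # the exact answer is 2.0

# The same effect at scale: accumulating the same values in a different
# order (as a different reduction or tiling strategy would) typically
# gives an answer that differs in the low-order bits.
rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000).astype(np.float32)
print(np.sum(x), np.sum(x[::-1]))  # usually differ slightly
```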
