Running GPT-OSS-120B at 500 tokens per second on Nvidia GPUs
How we optimized GPT OSS 120B for state-of-the-art latency and throughput on launch day.
Romain Huet of OpenAI warns that implementation details affect performance and output quality. A large part of our engineering work was iteratively fixing bugs and testing models for both speed and correctness. Thanks to the hard work of open-source maintainers worldwide, there are multiple excellent options for running GPT OSS, and bugs are being identified and fixed quickly. While OpenAI advertises that GPT OSS 120B can run on a single H100 GPU, optimized deployments parallelize the model across 4 or 8 GPUs for better latency and throughput.
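As a rough sketch of what such a multi-GPU deployment looks like, here is how you might serve the model with vLLM using tensor parallelism across 4 GPUs (the exact model identifier, port, and flags are assumptions; check your serving framework's documentation for the supported options on your hardware):

```shell
# Hypothetical deployment sketch: shard GPT OSS 120B across 4 GPUs
# with tensor parallelism. --tensor-parallel-size splits each layer's
# weights across the listed GPUs, trading inter-GPU communication for
# lower per-token latency and higher aggregate throughput.
vllm serve openai/gpt-oss-120b \
  --tensor-parallel-size 4 \
  --port 8000
```

On an 8-GPU node you could instead set `--tensor-parallel-size 8`, or combine a smaller tensor-parallel degree with multiple replicas to favor throughput over single-request latency.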