Running GPT-OSS-120B at 500 tokens per second on Nvidia GPUs


How we optimized GPT OSS 120B for state-of-the-art latency and throughput on launch day.

Romain Huet of OpenAI warns that implementation details affect performance and output quality. A large part of our engineering work was iteratively fixing bugs and testing models for both speed and correctness. Thanks to the hard work of open source maintainers worldwide, there are multiple excellent options for running GPT OSS, and bugs are being identified and fixed quickly. While OpenAI advertises that GPT OSS 120B can run on a single H100 GPU, optimized deployments parallelize the model across 4 or 8 GPUs for improved latency and throughput.
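The post doesn't name a specific serving stack, but as a minimal sketch of the multi-GPU setup described above, a tensor-parallel deployment might look like this with vLLM (the model identifier and flag follow vLLM's documented CLI; the GPU count is an assumption from the 4-GPU configuration mentioned):

```shell
# Hypothetical sketch: serve GPT OSS 120B sharded across 4 GPUs.
# --tensor-parallel-size splits each weight matrix across the GPUs,
# trading inter-GPU communication for lower per-token latency
# compared to fitting the whole model on one H100.
vllm serve openai/gpt-oss-120b \
  --tensor-parallel-size 4
```

Tensor parallelism keeps every GPU busy on every token, which is why it helps latency as well as throughput, at the cost of all-reduce traffic between devices.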

