Compiler optimizations for 5.8ms GPT-OSS-120B inference (not on GPUs)
Here are the key optimizations that, in just a few weeks of work, enabled two RNGD cards to achieve 5.8 ms per output token for gpt-oss-120b while running under 180 W.
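For context, a quick back-of-the-envelope conversion of those two quoted numbers (5.8 ms per output token, 180 W) into throughput and efficiency figures; this is an illustrative calculation only, and the per-joule number is not a claim made in the post itself:

```python
# Back-of-the-envelope figures implied by the quoted decode latency and power.
# Assumes 5.8 ms is the steady-state time per generated output token.

latency_ms_per_token = 5.8   # quoted per-output-token latency
power_watts = 180            # quoted power envelope for two RNGD cards

tokens_per_second = 1000.0 / latency_ms_per_token      # ~172 tokens/s
tokens_per_joule = tokens_per_second / power_watts      # ~0.96 tokens/J

print(f"~{tokens_per_second:.0f} tokens/s per sequence")
print(f"~{tokens_per_joule:.2f} tokens/joule at {power_watts} W")
```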