Compiler optimizations for 5.8ms GPT-OSS-120B inference (not on GPUs)


Here are the key compiler optimizations that enabled two RNGD cards to achieve 5.8 ms per output token on gpt-oss-120b, running under 180 W, within weeks of the model's release.

