Real-time LLM Inference on Standard GPUs: 3k tokens/s per request


Today, Kog AI launches a tech preview of the Kog Inference Engine (KIE), which delivers 3,000 output tokens/s per request on 8× AMD MI300X GPUs and 2,100 output tokens/s per request on 8× NVIDIA H200 GPUs (FP16, no speculative decoding). The preview runs a 2B-parameter model; support for large third-party MoE models is coming next, at similar speeds.
