Real-time LLM Inference on Standard GPUs: 3k tokens/s per request

Today, Kog AI launches a tech preview of the Kog Inference Engine (KIE): 3,000 output tokens/s per request on 8× AMD MI300X GPUs and 2,100 on 8× NVIDIA H200 (FP16, no speculative decoding). This preview runs a 2B model, with support for large third-party MoE models coming next at similar speeds.
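To put the per-request figures in perspective, a decode rate can be converted into time per token and time to stream a full response. The sketch below does only that arithmetic on the two rates quoted above. The 1,000-token response length is a hypothetical example value, and prefill time (time to first token) is ignored because the announcement does not report it.

```python
# Back-of-the-envelope conversion of the quoted decode rates.
# Assumption: steady-state decoding only; prefill latency is not modeled.

def ms_per_token(tokens_per_second: float) -> float:
    """Average time between output tokens, in milliseconds."""
    return 1000.0 / tokens_per_second


def seconds_for_response(num_tokens: int, tokens_per_second: float) -> float:
    """Time to stream num_tokens output tokens at a constant rate."""
    return num_tokens / tokens_per_second


# Rates quoted in the announcement (output tokens/s per request).
RATES = {
    "8x AMD MI300X": 3000.0,
    "8x NVIDIA H200": 2100.0,
}

# Hypothetical response length, chosen only for illustration.
EXAMPLE_TOKENS = 1000

for name, rate in RATES.items():
    print(
        f"{name}: {ms_per_token(rate):.3f} ms/token, "
        f"{seconds_for_response(EXAMPLE_TOKENS, rate):.3f} s "
        f"for {EXAMPLE_TOKENS} tokens"
    )
```

At these rates each output token takes well under a millisecond, so a 1,000-token answer would stream in roughly a third to half a second once decoding starts.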

