llm-d: Kubernetes-native distributed inference


Debut announcement of the llm-d project and community

llm-d is a Kubernetes-native, high-performance distributed LLM inference framework: a well-lit path for anyone serving LLMs at scale, with the fastest time-to-value and competitive performance per dollar for most models across most hardware accelerators. The payoff of distributed serving is already visible elsewhere: during its "Open Source Week", the DeepSeek team published the design of its inference system, which aggressively leverages disaggregation and KV caching to achieve remarkable performance per dollar of compute. llm-d pursues three goals:

Operationalizability: a modular, resilient architecture with native integration into Kubernetes via the Inference Gateway API.

Flexibility: cross-platform (active work to support NVIDIA, Google TPU, AMD, and Intel), with extensible implementations of the key composable layers of the stack.

Performance: leverage distributed optimizations such as disaggregation and prefix-aware routing (sketched below) to achieve the highest tokens per dollar while meeting SLOs.
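To make prefix-aware routing concrete, here is a minimal Python sketch of the idea: requests that share a prompt prefix are sent to the same replica, so that replica's KV cache for the prefix can be reused instead of recomputed. Everything here is an assumption for illustration (the REPLICAS endpoints, the character-based PREFIX_CHARS cutoff, and the route function are invented for this example); it is not llm-d's actual API or algorithm, which would match on token prefixes and real cache state rather than a plain hash.

    # Prefix-aware routing sketch: hash the leading characters of the
    # prompt and map the hash to a replica, so requests sharing a prefix
    # consistently land on the replica most likely to hold its KV cache.
    # All names are hypothetical, not llm-d's API.
    import hashlib

    REPLICAS = ["pod-a:8000", "pod-b:8000", "pod-c:8000"]  # hypothetical endpoints
    PREFIX_CHARS = 256  # how much of the prompt determines the route

    def route(prompt: str) -> str:
        prefix = prompt[:PREFIX_CHARS]
        digest = hashlib.sha256(prefix.encode("utf-8")).digest()
        return REPLICAS[int.from_bytes(digest[:8], "big") % len(REPLICAS)]

    # Two requests with the same long system prompt route identically,
    # so the second one hits a warm KV cache:
    system = "You are a helpful assistant. " * 10
    print(route(system + "Summarize this article."))
    print(route(system + "Translate this sentence."))

A real prefix-cache-aware router also has to balance load: hashing alone can hot-spot a popular prefix onto one replica, so schedulers typically weigh cache affinity against queue depth when picking an endpoint.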



Related news:

Emergent social conventions and collective bias in LLM populations

High Available Mosquitto MQTT on Kubernetes

Show HN: Min.js style compression of tech docs for LLM context