llm-d, Kubernetes-native distributed inference
Debut announcement of the llm-d project and community
llm-d is a Kubernetes-native, high-performance distributed LLM inference framework: a well-lit path for anyone to serve at scale, with the fastest time-to-value and competitive performance per dollar for most models across most hardware accelerators.

Serving LLMs efficiently at scale increasingly depends on distributed techniques. During its "Open Source Week", for instance, the DeepSeek team published the design of its inference system, which aggressively leverages disaggregation and KV caching to achieve remarkable performance per dollar of compute.

llm-d is built around three properties:

- Operationalizability: a modular and resilient architecture with native integration into Kubernetes via the Inference Gateway API
- Flexibility: cross-platform (with active work to support NVIDIA, Google TPU, AMD, and Intel), with extensible implementations of key composable layers of the stack
- Performance: distributed optimizations such as disaggregation and prefix-aware routing to achieve the highest tokens per dollar while meeting SLOs (see the sketch after this list)
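To make the last point concrete, here is a minimal, hypothetical sketch of what "prefix-aware routing" means: requests whose prompts share a leading prefix (such as a common system prompt) are steered to the same replica, so that replica's KV cache for the shared prefix can be reused rather than recomputed. The class and constant names below are illustrative assumptions, not llm-d's actual API, and real schedulers also weigh load, cache eviction, and SLOs.

```python
import hashlib
from collections import defaultdict

# Illustrative only: bucket prompts by their first N characters (assumption).
PREFIX_BLOCK_CHARS = 256


class PrefixAwareRouter:
    """Hypothetical router: same prompt prefix -> same replica -> KV-cache reuse."""

    def __init__(self, replicas):
        self.replicas = list(replicas)
        self.prefix_to_replica = {}      # prefix bucket -> chosen replica
        self.load = defaultdict(int)     # replica -> in-flight requests

    def _prefix_key(self, prompt: str) -> str:
        # Hash only the leading block; prompts sharing a system prompt or
        # few-shot preamble land in the same bucket.
        return hashlib.sha256(prompt[:PREFIX_BLOCK_CHARS].encode()).hexdigest()

    def route(self, prompt: str) -> str:
        key = self._prefix_key(prompt)
        replica = self.prefix_to_replica.get(key)
        if replica is None:
            # New prefix: pick the least-loaded replica, then remember the
            # assignment so later requests with the same prefix hit its cache.
            replica = min(self.replicas, key=lambda r: self.load[r])
            self.prefix_to_replica[key] = replica
        self.load[replica] += 1
        return replica

    def complete(self, replica: str):
        self.load[replica] -= 1


router = PrefixAwareRouter(["replica-a", "replica-b"])
system_prompt = "You are a helpful assistant. " * 20
print(router.route(system_prompt + "Summarize this document."))
print(router.route(system_prompt + "Translate this sentence."))  # same replica as above
```

The design intuition is the same one the announcement points at: recomputing the KV cache for a long shared prefix is wasted GPU time, so routing that is aware of prefixes (and, in a disaggregated setup, of which workers hold which cached prefill) directly improves tokens per dollar.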