LLM-D: Kubernetes-Native Distributed Inference at Scale


llm-d is a Kubernetes-native high-performance distributed LLM inference framework - llm-d/llm-d

llm-d adopts a layered architecture on top of industry-standard open technologies: vLLM, Kubernetes, and the Inference Gateway. Leveraging operational telemetry, the Inference Scheduler implements the filtering and scoring algorithms that make routing decisions with prefill/decode (P/D)-, KV-cache-, SLA-, and load-awareness.

Variant Autoscaling over Hardware, Workload, and Traffic (🚧): We plan to implement a traffic- and hardware-aware autoscaler that (a) measures the capacity of each model server instance, (b) derives a load function that takes different request shapes and QoS into account, and (c) assesses the recent traffic mix (QPS, QoS, and shapes) to calculate the optimal mix of instances to handle prefill, decode, and latency-tolerant requests, enabling use of the Kubernetes HPA for SLO-level efficiency.
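To make the filtering-and-scoring idea concrete, the following minimal Go sketch shows what such a routing decision could look like. It is not llm-d's actual scheduler code: the Endpoint and Request types, the metric fields, the weights, and the thresholds are all illustrative assumptions. Replicas that cannot serve the request (wrong prefill/decode role, exhausted KV-cache) are filtered out, and the survivors are scored on KV-cache prefix affinity and current load.

```go
package main

import (
	"fmt"
	"sort"
)

// Endpoint is a hypothetical snapshot of one model-server replica's telemetry.
// Field names are illustrative, not llm-d's actual metrics schema.
type Endpoint struct {
	Name         string
	Role         string  // "prefill" or "decode" in a disaggregated deployment
	QueueDepth   int     // requests currently waiting on this replica
	KVCacheUsage float64 // fraction of KV-cache memory in use, 0.0-1.0
	PrefixHit    bool    // whether this replica likely already caches the prompt prefix
}

// Request describes the call being routed in this sketch.
type Request struct {
	Phase string // which disaggregated phase to route: "prefill" or "decode"
}

// filter drops replicas that cannot reasonably serve the request:
// wrong P/D role, or KV-cache memory nearly exhausted.
func filter(req Request, eps []Endpoint) []Endpoint {
	var out []Endpoint
	for _, e := range eps {
		if e.Role != req.Phase || e.KVCacheUsage > 0.95 {
			continue
		}
		out = append(out, e)
	}
	return out
}

// score ranks survivors with a simple weighted sum: prefer replicas that
// already hold the prompt prefix, then the least loaded ones.
func score(e Endpoint) float64 {
	s := 0.0
	if e.PrefixHit {
		s += 10.0 // KV-cache-awareness: reuse an existing prefix cache
	}
	s -= float64(e.QueueDepth) // load-awareness: penalize queued work
	s -= 5.0 * e.KVCacheUsage  // penalize memory pressure
	return s
}

// pick returns the highest-scoring replica, if any survives filtering.
func pick(req Request, eps []Endpoint) (Endpoint, bool) {
	cands := filter(req, eps)
	if len(cands) == 0 {
		return Endpoint{}, false
	}
	sort.Slice(cands, func(i, j int) bool { return score(cands[i]) > score(cands[j]) })
	return cands[0], true
}

func main() {
	eps := []Endpoint{
		{Name: "decode-0", Role: "decode", QueueDepth: 4, KVCacheUsage: 0.70, PrefixHit: true},
		{Name: "decode-1", Role: "decode", QueueDepth: 1, KVCacheUsage: 0.40, PrefixHit: false},
		{Name: "prefill-0", Role: "prefill", QueueDepth: 0, KVCacheUsage: 0.10, PrefixHit: false},
	}
	if best, ok := pick(Request{Phase: "decode"}, eps); ok {
		fmt.Println("route to", best.Name)
	}
}
```

In this toy run the scheduler picks decode-0 even though it is busier than decode-1, because the assumed prefix-cache bonus outweighs the load penalty; a real scorer would also fold SLA terms into the weights.

The planned autoscaler is still marked 🚧, so any code here is speculative. The back-of-the-envelope Go sketch below only illustrates step (c) under simplifying assumptions: given a measured per-replica capacity (step (a)) and the recent demand attributed to each request class, size each pool with headroom so the HPA target is not pinned at 100% utilization. The pool names, tokens/sec units, and headroom factor are invented for illustration and are not llm-d's load function.

```go
package main

import (
	"fmt"
	"math"
)

// pool pairs a measured per-replica capacity with the recent demand
// attributed to one class of traffic (prefill, decode, latency-tolerant).
type pool struct {
	name               string
	capacityPerReplica float64 // measured sustainable tokens/sec per replica
	recentDemand       float64 // tokens/sec observed for this traffic class
}

// replicasNeeded sizes a pool with headroom so the autoscaling target is not
// pinned at 100% utilization (e.g. headroom=0.7 plans for ~70% busy).
func replicasNeeded(p pool, headroom float64) int {
	n := math.Ceil(p.recentDemand / (p.capacityPerReplica * headroom))
	if n < 1 {
		n = 1
	}
	return int(n)
}

func main() {
	pools := []pool{
		{name: "prefill", capacityPerReplica: 12000, recentDemand: 45000},
		{name: "decode", capacityPerReplica: 3000, recentDemand: 20000},
		{name: "latency-tolerant", capacityPerReplica: 15000, recentDemand: 8000},
	}
	for _, p := range pools {
		fmt.Printf("%s: %d replicas\n", p.name, replicasNeeded(p, 0.7))
	}
}
```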


Read more on: Kubernetes, LLM, scale

Related news:

Red Hat Announces The llm-d Open-Source Project For Gen AI

llm-d, Kubernetes native distributed inference

Emergent social conventions and collective bias in LLM populations