Why DeepSeek is cheap at scale but expensive to run locally


Why is DeepSeek-V3 supposedly fast and cheap to serve at scale, but too slow and expensive to run locally? Why are some AI models slow to respond but fast once…

AI inference providers often talk about a fundamental tradeoff between throughput and latency: for any given model, you can either serve it at high throughput and high latency, or at low throughput and low latency. The lever is the batching window: pipeline bubbles can be absolutely brutal for model throughput, so inference providers set their windows wide enough to avoid them, even though that means every request waits longer before it is served. As for why OpenAI and Anthropic models respond so quickly, either their models have a more efficient architecture (non-MoE, fewer layers), or they have some very clever tricks for serving inference, or they're paying through the nose for way more GPUs than they strictly need.
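
To make that tradeoff concrete, here is a minimal Python sketch of a batching window. The names (BatchingServer, collect_window_ms, run_forward_pass) are hypothetical stand-ins, not from the article: the wider the window, the more requests share one forward pass (higher throughput per GPU), and the longer each request waits (higher latency).

    import time
    from dataclasses import dataclass, field


    def run_forward_pass(prompts: list[str]) -> list[str]:
        # Stand-in for the real model call; the point is that every prompt in
        # the batch goes through one forward pass together.
        return [f"completion for {p!r}" for p in prompts]


    @dataclass
    class BatchingServer:
        collect_window_ms: float                      # how long to let requests pile up
        queue: list = field(default_factory=list)

        def submit(self, prompt: str) -> None:
            self.queue.append((time.monotonic(), prompt))

        def step(self) -> None:
            # Wait out the collection window, then serve everything queued as one batch.
            time.sleep(self.collect_window_ms / 1000.0)
            batch, self.queue = self.queue, []
            if not batch:
                return
            outputs = run_forward_pass([p for _, p in batch])
            for (arrival, _), out in zip(batch, outputs):
                latency_ms = (time.monotonic() - arrival) * 1000.0
                print(f"served in {latency_ms:.0f} ms: {out}")


    if __name__ == "__main__":
        # A 200 ms window batches more aggressively than a 5 ms one, at the cost
        # of every request waiting for the window to close.
        server = BatchingServer(collect_window_ms=200)
        server.submit("prompt A")
        server.submit("prompt B")
        server.step()

Widening collect_window_ms is the high-throughput, high-latency end of the curve; shrinking it is the low-latency end, which is why a big mixture-of-experts model like DeepSeek-V3 can be cheap per token at scale yet slow and expensive for a single local user.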

Read more on:

scale

DeepSeek

Related news:

DeepSeek’s distilled new R1 AI model can run on a single GPU

DeepSeek’s updated R1 AI model is more censored, test finds

China’s DeepSeek quietly releases upgraded R1 AI model, ramping up competition with OpenAI