Why DeepSeek is cheap at scale but expensive to run locally
Why is DeepSeek-V3 supposedly fast and cheap to serve at scale, but too slow and expensive to run locally? Why are some AI models slow to respond, but fast once they get going?
AI inference providers often talk about a fundamental tradeoff between throughput and latency: for any given model, you can either serve it at high throughput and high latency, or at low throughput and low latency.

Pipeline bubbles can be absolutely brutal for model throughput, so inference providers set their batching windows wide enough to avoid them.

As for why models from OpenAI and Anthropic respond so quickly: either their models have a more efficient architecture (non-MoE, fewer layers), or they have some very clever tricks for serving inference, or they're paying through the nose for far more GPUs than they strictly need.
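To make the tradeoff concrete, here is a minimal, hypothetical sketch in Python of a batching window: requests are collected for up to `window_ms` milliseconds and then run through the model as a single batch. Everything in it (the in-process queue, `fake_forward_pass`, the timing constants) is an illustrative assumption, not the API of any real inference server.

```python
import queue
import threading
import time


def fake_forward_pass(batch):
    # Stand-in for one batched forward pass: the cost is a fixed per-step
    # sleep, so bigger batches amortize it over more requests.
    time.sleep(0.05)
    return [f"response to {req}" for req in batch]


def serve(requests_q, window_ms):
    """Collect requests for up to window_ms, then run them as one batch."""
    while True:
        batch = [requests_q.get()]  # block until at least one request arrives
        deadline = time.monotonic() + window_ms / 1000
        while time.monotonic() < deadline:
            try:
                remaining = max(deadline - time.monotonic(), 0)
                batch.append(requests_q.get(timeout=remaining))
            except queue.Empty:
                break
        # One batched step: throughput scales with batch size, but every
        # request in the batch waited up to window_ms before it even started.
        for response in fake_forward_pass(batch):
            print(response)


if __name__ == "__main__":
    q = queue.Queue()
    threading.Thread(target=serve, args=(q, 200), daemon=True).start()
    for i in range(8):
        q.put(f"request {i}")
        time.sleep(0.03)
    time.sleep(1)  # give the server thread time to drain the queue
```

Widening `window_ms` fills each batch fuller, which is the high-throughput, high-latency end of the tradeoff; shrinking it gets each request answered sooner, but leaves the GPU running smaller, less efficient batches.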