Why DeepSeek is cheap at scale but expensive to run locally
Why is DeepSeek-V3 supposedly fast and cheap to serve at scale, but too slow and expensive to run locally? Why are some AI models slow to respond, but fast once they get going?
AI inference providers often talk about a fundamental tradeoff between throughput and latency: for any given model, you can either serve it at high throughput and high latency, or at low throughput and low latency.

Pipeline bubbles can be absolutely brutal for model throughput, so inference providers set their batching windows wide enough to avoid them.

As for why models from OpenAI and Anthropic respond so quickly: either their models have a more efficient architecture (non-MoE, fewer layers), or they have some very clever tricks for serving inference, or they're paying through the nose for far more GPUs than they strictly need.
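To make the tradeoff concrete, here is a minimal, hypothetical sketch in Python of a batching window: requests are collected for up to `window_ms` milliseconds and then run through the model as a single batch. Everything in it (the in-process queue, `fake_forward_pass`, the timing constants) is an illustrative assumption, not the API of any real inference server.

```python
import queue
import threading
import time


def fake_forward_pass(batch):
    # Stand-in for one batched forward pass: the cost is a fixed per-step
    # sleep, so bigger batches amortize it over more requests.
    time.sleep(0.05)
    return [f"response to {req}" for req in batch]


def serve(requests_q, window_ms):
    """Collect requests for up to window_ms, then run them as one batch."""
    while True:
        batch = [requests_q.get()]  # block until at least one request arrives
        deadline = time.monotonic() + window_ms / 1000
        while time.monotonic() < deadline:
            try:
                remaining = max(deadline - time.monotonic(), 0)
                batch.append(requests_q.get(timeout=remaining))
            except queue.Empty:
                break
        # One batched step: throughput scales with batch size, but every
        # request in the batch waited up to window_ms before it even started.
        for response in fake_forward_pass(batch):
            print(response)


if __name__ == "__main__":
    q = queue.Queue()
    threading.Thread(target=serve, args=(q, 200), daemon=True).start()
    for i in range(8):
        q.put(f"request {i}")
        time.sleep(0.03)
    time.sleep(1)  # give the server thread time to drain the queue
```

Widening `window_ms` fills each batch fuller, which is the high-throughput, high-latency end of the tradeoff; shrinking it gets each request answered sooner, but leaves the GPU running smaller, less efficient batches.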