DeepSeek's multi-head latent attention and other KV cache tricks
How a Key-Value (KV) cache reduces Transformer inference time by trading memory for computation
The post walks through several of these tricks:

- Rolling Cache: based on the observation that the earliest tokens act as attention sinks, recent context is cached alongside those retained initial tokens, enabling processing of effectively unbounded sequence lengths (a small sketch of this eviction policy follows the list).
- Multi-Query Attention (MQA). Benefit: reduces the KV cache size by a factor of H (the number of attention heads), significantly lowering memory bandwidth overhead.
- Grouped-Query Attention (GQA). Key idea: interpolates between full multi-head attention and MQA, offering a scalable trade-off between inference speed and model quality (see the size comparison below).
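To see where the factor of H comes from, here is a minimal sketch comparing the per-sequence KV cache footprint of full multi-head attention, GQA, and MQA. The model dimensions (sequence length, layer count, head count, head size, fp16 storage) are illustrative assumptions, not taken from any particular model:

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2 accounts for storing both K and V per layer and KV head.
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Illustrative (assumed) dimensions.
seq_len, n_layers, n_heads, head_dim = 4096, 32, 32, 128

sizes = {
    "MHA (32 KV heads)": kv_cache_bytes(seq_len, n_layers, n_kv_heads=n_heads, head_dim=head_dim),
    "GQA (8 KV groups)": kv_cache_bytes(seq_len, n_layers, n_kv_heads=8, head_dim=head_dim),
    "MQA (1 KV head)":   kv_cache_bytes(seq_len, n_layers, n_kv_heads=1, head_dim=head_dim),
}
for name, size in sizes.items():
    print(f"{name}: {size / 2**30:.2f} GiB")
# MHA: 2.00 GiB, GQA: 0.50 GiB, MQA: 0.06 GiB
# MQA shrinks the cache by exactly H (the number of attention heads);
# GQA lands in between, depending on how many KV groups it keeps.
```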
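And a small sketch of the rolling-cache eviction policy: keep the first few "sink" tokens plus a window of the most recent context, and drop everything in between. The parameters n_sink and window are hypothetical choices for illustration:

```python
def rolling_cache_positions(n_tokens: int, n_sink: int = 4, window: int = 1024) -> list[int]:
    """Token positions retained in the KV cache under a rolling scheme:
    the first n_sink tokens plus the most recent `window` tokens."""
    if n_tokens <= n_sink + window:
        return list(range(n_tokens))                        # nothing to evict yet
    sinks = list(range(n_sink))                             # retained initial tokens
    recent = list(range(n_tokens - window, n_tokens))       # rolling recent context
    return sinks + recent

# The cache stays bounded at n_sink + window entries no matter how long the stream grows.
print(len(rolling_cache_positions(10_000_000)))  # -> 1028
```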