DeepSeek's multi-head latent attention and other KV cache tricks
How a Key-Value (KV) cache reduces Transformer inference time by trading memory for computation
The post walks through several of these tricks:

- Rolling Cache: based on the observation that the earliest tokens act as attention sinks, recent context is cached alongside those retained initial tokens, enabling processing of effectively unbounded sequence lengths (a small sketch of this eviction policy follows the list).
- Multi-Query Attention (MQA). Benefit: reduces the KV cache size by a factor of H (the number of attention heads), significantly lowering memory bandwidth overhead.
- Grouped-Query Attention (GQA). Key idea: interpolates between full multi-head attention and MQA, offering a scalable trade-off between inference speed and model quality (see the size comparison below).
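To see where the factor of H comes from, here is a minimal sketch comparing the per-sequence KV cache footprint of full multi-head attention, GQA, and MQA. The model dimensions (sequence length, layer count, head count, head size, fp16 storage) are illustrative assumptions, not taken from any particular model:

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2 accounts for storing both K and V per layer and KV head.
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Illustrative (assumed) dimensions.
seq_len, n_layers, n_heads, head_dim = 4096, 32, 32, 128

sizes = {
    "MHA (32 KV heads)": kv_cache_bytes(seq_len, n_layers, n_kv_heads=n_heads, head_dim=head_dim),
    "GQA (8 KV groups)": kv_cache_bytes(seq_len, n_layers, n_kv_heads=8, head_dim=head_dim),
    "MQA (1 KV head)":   kv_cache_bytes(seq_len, n_layers, n_kv_heads=1, head_dim=head_dim),
}
for name, size in sizes.items():
    print(f"{name}: {size / 2**30:.2f} GiB")
# MHA: 2.00 GiB, GQA: 0.50 GiB, MQA: 0.06 GiB
# MQA shrinks the cache by exactly H (the number of attention heads);
# GQA lands in between, depending on how many KV groups it keeps.
```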
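And a small sketch of the rolling-cache eviction policy: keep the first few "sink" tokens plus a window of the most recent context, and drop everything in between. The parameters n_sink and window are hypothetical choices for illustration:

```python
def rolling_cache_positions(n_tokens: int, n_sink: int = 4, window: int = 1024) -> list[int]:
    """Token positions retained in the KV cache under a rolling scheme:
    the first n_sink tokens plus the most recent `window` tokens."""
    if n_tokens <= n_sink + window:
        return list(range(n_tokens))                        # nothing to evict yet
    sinks = list(range(n_sink))                             # retained initial tokens
    recent = list(range(n_tokens - window, n_tokens))       # rolling recent context
    return sinks + recent

# The cache stays bounded at n_sink + window entries no matter how long the stream grows.
print(len(rolling_cache_positions(10_000_000)))  # -> 1028
```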