DeepSeek's multi-head latent attention and other KV cache tricks


How a Key-Value (KV) cache reduces Transformer inference time by trading memory for computation
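
The trade is easiest to see in a toy autoregressive decode loop: keys and values for past tokens are computed once and stored, rather than recomputed from scratch at every step. The sketch below is illustrative only (plain NumPy, a single head, made-up dimensions), not DeepSeek's implementation.

    import numpy as np

    np.random.seed(0)
    d_model = 64                                  # toy dimension, assumed for illustration
    W_q = np.random.randn(d_model, d_model) * 0.02
    W_k = np.random.randn(d_model, d_model) * 0.02
    W_v = np.random.randn(d_model, d_model) * 0.02

    k_cache, v_cache = [], []                     # the KV cache: one entry per generated token

    def decode_step(x_t):
        """Single-head attention for the newest token, reusing cached K/V of past tokens."""
        q = x_t @ W_q
        k_cache.append(x_t @ W_k)                 # pay O(seq_len * d_model) memory ...
        v_cache.append(x_t @ W_v)                 # ... to avoid recomputing K/V at every step
        K = np.stack(k_cache)                     # [seq_len, d_model]
        V = np.stack(v_cache)
        scores = K @ q / np.sqrt(d_model)         # [seq_len]
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ V                        # attention output for the current position

    for _ in range(5):                            # generate a few toy steps
        out = decode_step(np.random.randn(d_model))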

Based on that observation, they introduce a rolling cache that keeps the recent context together with a few retained initial tokens, enabling processing of effectively unbounded sequences (see the sketch below).

Multi-Query Attention (MQA). Benefit: reduces the KV cache size by a factor of H (the number of attention heads), significantly lowering memory-bandwidth overhead.

Grouped-Query Attention (GQA). Key idea: GQA interpolates between full multi-head attention and MQA, offering a scalable trade-off between inference speed and model quality (see the size comparison below).
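
A minimal sketch of the rolling-cache idea, assuming a StreamingLLM-style scheme that retains a handful of initial "sink" tokens plus a sliding window of recent tokens; the class name, default sizes, and API are illustrative, not taken from the article.

    from collections import deque

    class RollingKVCache:
        """Keeps the first num_sink tokens plus a sliding window of recent tokens."""
        def __init__(self, num_sink=4, window=1024):   # sizes are illustrative assumptions
            self.num_sink = num_sink
            self.sink = []                             # (k, v) pairs for retained initial tokens
            self.recent = deque(maxlen=window)         # oldest non-sink pairs are evicted

        def append(self, k, v):
            if len(self.sink) < self.num_sink:
                self.sink.append((k, v))
            else:
                self.recent.append((k, v))             # deque drops the oldest pair when full

        def keys_values(self):
            pairs = self.sink + list(self.recent)      # bounded: num_sink + window entries
            ks = [k for k, _ in pairs]
            vs = [v for _, v in pairs]
            return ks, vs

And a back-of-the-envelope comparison of KV-cache size under multi-head, grouped-query, and multi-query attention, illustrating the factor-of-H reduction mentioned above; the model configuration is made up for the example.

    def kv_cache_gib(seq_len, num_kv_heads, head_dim, num_layers, dtype_bytes=2):
        """Total K+V cache size in GiB, assuming FP16 (2 bytes) per stored value."""
        return 2 * seq_len * num_kv_heads * head_dim * num_layers * dtype_bytes / 2**30

    H, G, head_dim, layers, seq_len = 32, 8, 128, 32, 4096     # assumed toy configuration
    print("MHA:", kv_cache_gib(seq_len, H, head_dim, layers))  # H key/value heads -> 2.0 GiB
    print("GQA:", kv_cache_gib(seq_len, G, head_dim, layers))  # H/G = 4x smaller cache
    print("MQA:", kv_cache_gib(seq_len, 1, head_dim, layers))  # H   = 32x smaller cache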


Read more on:

DeepSeek

KV cache tricks

Related news:

U.S. Navy bans use of DeepSeek due to 'security and ethical concerns'

White House "looking into" national security implications of DeepSeek's AI

Questions censored by DeepSeek