Strengths and limitations of diffusion language models
Google recently released Gemini Diffusion, which is impressing everyone with its speed. Supposedly they even had to slow down the demo so people could see what…
A diffusion model generates the whole sequence at once, growing more accurate at each denoising step: “xycz”, then “aycd”, then “abcd”. The reason autoregressive generation isn’t straightforwardly quadratic is the “key-value cache”: because autoregressive models generate token-by-token, attention scores for previously-generated tokens don’t have to be recomputed. But if diffusion models don’t win out, it will be because the “changing your mind” reasoning paradigm doesn’t map nicely onto block-by-block generation.
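To make the caching point concrete, here is a toy cost model (an illustrative sketch, not a real transformer): with a key-value cache, an autoregressive model only computes attention between the one new token and everything generated so far, while a diffusion model that revises every token at every denoising step recomputes all pairwise scores each time. The function names and step count are hypothetical, chosen just to show the asymptotics.

```python
def autoregressive_cost(n: int) -> int:
    """With a KV cache, step i computes attention only between the
    single new token and the i tokens generated so far."""
    return sum(i for i in range(1, n + 1))  # 1 + 2 + ... + n ~ n^2/2

def diffusion_cost(n: int, steps: int) -> int:
    """Each denoising step revises every token, so all n*n attention
    scores are recomputed from scratch at every step."""
    return steps * n * n

n = 1024
print(autoregressive_cost(n))       # grows like n^2/2, thanks to caching
print(diffusion_cost(n, steps=32))  # grows like steps * n^2, no caching
```

On this crude accounting, diffusion only breaks even if the number of denoising steps is far smaller than the sequence length, which is why the speed of demos like Gemini Diffusion is surprising.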