How attention sinks keep language models stable


We discovered why language models catastrophically fail on long conversations: when old tokens are removed to save memory, models produce complete gibberish. We found models dump massive attention onto the first few tokens as "attention sinks"—places to park unused attention since softmax requires weights to sum to 1. Our solution, StreamingLLM, simply keeps these first 4 tokens permanently while sliding the window for everything else, enabling stable processing of 4 million+ tokens instead of just thousands. This mechanism is now in HuggingFace, NVIDIA TensorRT-LLM, and OpenAI's latest models.
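To make the policy concrete, here is a minimal sketch of the keep-the-sinks-plus-sliding-window eviction rule described above. The function name, tensor shapes, and window size are illustrative assumptions for this example, not the HuggingFace or TensorRT-LLM API:

```python
import torch

# Hypothetical KV-cache tensors of shape (seq_len, num_heads, head_dim).
NUM_SINK_TOKENS = 4     # the first tokens, kept permanently as attention sinks
WINDOW_SIZE = 1020      # most recent tokens kept alongside the sinks


def evict(cache_keys: torch.Tensor, cache_values: torch.Tensor):
    """Drop everything except the sink tokens and the recent window."""
    seq_len = cache_keys.shape[0]
    if seq_len <= NUM_SINK_TOKENS + WINDOW_SIZE:
        return cache_keys, cache_values  # nothing to evict yet
    keep = torch.cat([
        torch.arange(NUM_SINK_TOKENS),                 # positions 0..3: sinks
        torch.arange(seq_len - WINDOW_SIZE, seq_len),  # recent window
    ])
    return cache_keys[keep], cache_values[keep]
```

Applying a rule like this after each generation step keeps memory bounded while the first four positions are never evicted, which is what keeps the attention distribution stable.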

In August 2025, OpenAI released its open-source models with built-in attention sink parameters, bringing the mechanism full circle from research discovery to production implementation. Seeing the feature in a major OpenAI release connected directly to research that began during my internship at Meta in the summer of 2023, when I was tasked with what seemed like a simple problem: keeping a model from collapsing into gibberish as conversations grow long and old tokens must be dropped to save memory. Since then, Barbero et al. have shown that attention sinks also serve as "pressure valves" preventing what researchers call "over-mixing," a pathological state in which deep models processing long sequences blur important distinctions between tokens.
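As a rough sketch of what a built-in sink parameter can look like, the attention softmax can be given one extra learnable logit that competes with the real tokens. This is an assumption-laden illustration of the idea, not OpenAI's released implementation, and every name in it is made up for the example:

```python
import torch


def attention_with_sink(q, k, v, sink_logit):
    """q: (head_dim,); k, v: (seq_len, head_dim); sink_logit: scalar tensor."""
    scores = k @ q / (q.shape[-1] ** 0.5)             # (seq_len,) dot-product scores
    scores = torch.cat([sink_logit.view(1), scores])  # prepend the sink's logit
    weights = torch.softmax(scores, dim=-1)           # sums to 1, sink included
    # weights[0] is the mass "parked" on the sink; only real tokens mix into the output
    return weights[1:] @ v


# Example usage with a learnable sink parameter (all values illustrative).
sink_logit = torch.nn.Parameter(torch.zeros(()))
q = torch.randn(64)
k = torch.randn(128, 64)
v = torch.randn(128, 64)
out = attention_with_sink(q, k, v, sink_logit)        # shape (64,)
```

Because the sink position maps to no value vector, whatever probability mass it absorbs simply drops out of the output instead of being smeared across content tokens.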
