Context Rot: How increasing input tokens impacts LLM performance


Recent developments in LLMs show a trend toward longer context windows, with the input token count of the latest models reaching the millions. Because these models achieve near-perfect scores on widely adopted benchmarks like Needle in a Haystack (NIAH) [1], it’s often assumed that their performance is uniform across long-context tasks.
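
To make the NIAH setup concrete, here is a minimal sketch of that style of evaluation: a short "needle" fact is buried at a chosen depth inside filler text, and the model is asked to retrieve it as the amount of filler (and thus the input length) grows. The `openai` client, the model name, and the helper names are assumptions for illustration, not the benchmark's actual harness.

```python
from openai import OpenAI  # assumed client; any chat-completion API works here

client = OpenAI()

FILLER = "The harbor was quiet and the gulls circled over the grey water. "
NEEDLE = "The secret code for the vault is 7412. "
QUESTION = "What is the secret code for the vault? Answer with the number only."

def build_haystack(n_filler: int, depth: float) -> str:
    """Bury the needle at a relative depth (0.0 = start, 1.0 = end) in filler text."""
    sentences = [FILLER] * n_filler
    sentences.insert(int(depth * n_filler), NEEDLE)
    return "".join(sentences)

# Sweep input length and needle position; degradation shows up as the
# retrieval check failing more often at larger n_filler.
for n_filler in (200, 2_000, 20_000):
    for depth in (0.0, 0.5, 1.0):
        prompt = build_haystack(n_filler, depth) + "\n\n" + QUESTION
        reply = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content
        print(f"filler={n_filler:>6} depth={depth:.1f} correct={'7412' in reply}")
```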

Other tasks that appear similar in difficulty, such as AbsenceBench [7], which tests whether models can recognize the absence of a given snippet of text, also demonstrate performance degradation with growing input length. Failures at longer inputs often take the form of the model declining to answer at all, for example:

Model output: "I cannot determine the number of days between the gardening workshop and planting the tomato saplings because the specific dates for these events are not provided in the chat history."

An important direction for future work is to disentangle how much of a model’s performance degradation stems from the intrinsic difficulty of the task itself versus its ability to effectively handle long contexts.
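
For comparison, here is a rough sketch of an AbsenceBench-style probe, again an illustration under the same assumed client rather than the benchmark's own code: one sentence is deleted from a document, and the model must quote the sentence that is missing.

```python
import random
from openai import OpenAI  # assumed client, as in the sketch above

client = OpenAI()

def absence_probe(sentences: list[str]) -> bool:
    """Remove one sentence, show both versions, and ask which sentence is missing."""
    removed = random.choice(sentences)
    redacted = " ".join(s for s in sentences if s is not removed)
    prompt = (
        "Original document:\n" + " ".join(sentences) + "\n\n"
        "Modified document:\n" + redacted + "\n\n"
        "Exactly one sentence from the original is missing from the modified "
        "document. Quote that sentence verbatim."
    )
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    return removed in reply

# The probe itself does not get harder as the document grows, but longer
# inputs make models more likely to miss or misidentify the deleted sentence.
doc = [f"Entry {i}: the sensor logged a routine reading at hour {i}." for i in range(500)]
print(absence_probe(doc))
```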
