Context Rot: How increasing input tokens impacts LLM performance
Recent developments in LLMs show a trend toward longer context windows, with the input token count of the latest models reaching the millions. Because these models achieve near-perfect scores on widely adopted benchmarks like Needle in a Haystack (NIAH) [1], it’s often assumed that their performance is uniform across long-context tasks.
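To make the NIAH setup concrete: the benchmark embeds a single "needle" sentence somewhere in a long stretch of unrelated "haystack" text and asks the model to retrieve it. Below is a minimal sketch of such a probe, assuming an OpenAI-compatible chat client; the needle, question, haystack filler, and model name are illustrative placeholders rather than the benchmark's actual data.

```python
# A minimal sketch of a NIAH-style retrieval probe. Assumes an
# OpenAI-compatible client with an API key in the environment; the
# needle, question, and model name are placeholders, not the benchmark's data.
from openai import OpenAI

client = OpenAI()

NEEDLE = "The best thing to do in San Francisco is eat a sandwich in Dolores Park."
QUESTION = "What is the best thing to do in San Francisco?"

def build_prompt(haystack_paragraphs: list[str], depth: float) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end) of the haystack."""
    idx = int(depth * len(haystack_paragraphs))
    docs = haystack_paragraphs[:idx] + [NEEDLE] + haystack_paragraphs[idx:]
    return "\n\n".join(docs)

def run_probe(haystack_paragraphs: list[str], depth: float) -> str:
    """Ask the model to retrieve the needle from the assembled context."""
    prompt = build_prompt(haystack_paragraphs, depth)
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": f"{prompt}\n\n{QUESTION}"}],
    )
    return response.choices[0].message.content

# Example: fix a context length and needle depth, then check whether the
# answer mentions the needle; real evaluations score this more carefully
# and sweep over many lengths and depths.
haystack = ["Filler paragraph about an unrelated topic."] * 500
answer = run_probe(haystack, depth=0.5)
print("needle recalled:", "Dolores Park" in answer)
```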
Other tasks of apparently similar difficulty, such as AbsenceBench [7], which tests whether models can recognize that a given snippet of text is absent (sketched below), also demonstrate performance degradation with growing input length.

An example of this kind of failure, where the model misses information that is present in the provided conversation history:

Model Output: I cannot determine the number of days between the gardening workshop and planting the tomato saplings because the specific dates for these events are not provided in the chat history.

An important direction for future work is to disentangle how much of a model’s performance degradation stems from the intrinsic difficulty of the task itself versus its ability to effectively handle long contexts.
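As a rough illustration of the AbsenceBench-style setup mentioned above, the probe below shows a model an original document and a redacted copy and asks it to name what was removed. This is a sketch under stated assumptions, reusing the same OpenAI-compatible client as before; the toy document and model name are placeholders, not the benchmark's actual data.

```python
# A minimal sketch of an AbsenceBench-style probe: the model sees both the
# original and a redacted version of a document and must list the removed
# lines. Document contents and model name are illustrative placeholders.
from openai import OpenAI

client = OpenAI()

def ask_for_missing_lines(original: str, redacted: str) -> str:
    """Show the model both versions and ask it to list the removed lines."""
    prompt = (
        "Below are two versions of the same document. The second version "
        "has some lines removed.\n\n"
        f"--- ORIGINAL ---\n{original}\n\n"
        f"--- MODIFIED ---\n{redacted}\n\n"
        "List every line that appears in the original but not in the modified version."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Example: drop one line from a toy document and check whether the model
# names it; in the real benchmark the documents are far longer, which is
# where performance degrades.
lines = [f"Fact {i}: placeholder sentence number {i}." for i in range(20)]
original = "\n".join(lines)
redacted = "\n".join(lines[:7] + lines[8:])  # remove line 7
print(ask_for_missing_lines(original, redacted))
```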