How to Train a Million Context LLM
Scaling Llama3 beyond 1M context window with ~perfect utilization, the difference between ALiBi and RoPE, how to use GPT-4 to create synthetic data for your context extension finetunes, and more!
Greg’s now-famous "needle in a haystack" (NIAH) test, which measures a model's ability to extract a piece of information embedded in a long context, is the clean standard everyone starts with, but it is a little simplistic, and the community has since created many options to extend it. There is also a lot of related research on model merging: you have TIES and all these different techniques that effectively apply a singular value decomposition on top of the weights, keep only the most important components, and prevent interference across all the other layers. A lot of the empirical work of scaling these things up is experimentation, figuring out how to marshal these really complicated composite functions so that they don't hit a divide-by-zero problem at some point.
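To make the NIAH idea concrete, here is a minimal sketch of a needle-in-a-haystack probe, assuming an OpenAI-compatible chat client; the model name, filler text, and exact-match scoring are illustrative stand-ins, not Greg's official harness.

```python
# Minimal NIAH-style probe: bury a "needle" sentence at varying depths in filler
# text and check whether the model can retrieve it. Assumes the OpenAI Python SDK.
from openai import OpenAI

client = OpenAI()

NEEDLE = "The secret passphrase is 'indigo-pelican-42'."
QUESTION = "What is the secret passphrase mentioned in the document?"


def build_haystack(filler: str, needle: str, depth: float, n_sentences: int) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end) in filler text."""
    sentences = [filler] * n_sentences
    sentences.insert(int(depth * n_sentences), needle)
    return " ".join(sentences)


def run_probe(depth: float, n_sentences: int = 2000) -> bool:
    haystack = build_haystack(
        "The grass is green and the sky is blue.", NEEDLE, depth, n_sentences
    )
    resp = client.chat.completions.create(
        model="gpt-4-turbo",  # stand-in; swap in the long-context model under test
        messages=[{"role": "user", "content": f"{haystack}\n\n{QUESTION}"}],
    )
    answer = resp.choices[0].message.content or ""
    return "indigo-pelican-42" in answer  # crude exact-match scoring


# Sweep needle depths to see where retrieval starts to fail.
for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(depth, run_probe(depth))
```

Extended variants sweep both depth and total context length and plot the pass/fail grid, which is what the familiar NIAH heatmaps show.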
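And here is a rough sketch of the SVD idea mentioned above: take the delta between a fine-tuned weight matrix and its base, keep only the top singular components, and add that low-rank update back. The function name, tensor shapes, and rank choice are assumptions for illustration, not a specific merging library's API.

```python
# Keep only the dominant directions of a fine-tuning delta via truncated SVD,
# discarding the low-energy components that are most likely to interfere.
import torch


def lowrank_delta(base: torch.Tensor, tuned: torch.Tensor, rank: int) -> torch.Tensor:
    """Return base + the top-`rank` SVD approximation of (tuned - base)."""
    delta = tuned - base
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    approx = U[:, :rank] @ torch.diag(S[:rank]) @ Vh[:rank, :]
    return base + approx


# Toy example with random matrices standing in for a single layer's weights.
base = torch.randn(1024, 1024)
tuned = base + 0.01 * torch.randn(1024, 1024)
merged = lowrank_delta(base, tuned, rank=64)
```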