Qwen2.5-1M: Deploy your own Qwen with context length up to 1M tokens
Tech Report | HuggingFace | ModelScope | Qwen Chat | HuggingFace Demo | ModelScope Demo | Discord

Introduction

Two months after upgrading Qwen2.5-Turbo to support a context length of up to one million tokens, we are back with the open-source Qwen2.5-1M models and the corresponding inference framework support. Here's what you can expect from this release:

Open-source Models: We're releasing two new checkpoints, Qwen2.5-7B-Instruct-1M and Qwen2.5-14B-Instruct-1M, marking the first time we've upgraded our open-source Qwen models to handle 1M-token contexts.
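As a quick way to try the released checkpoints, here is a minimal sketch using the Hugging Face transformers API, assuming the models are published under the Qwen organization on the Hub; the dtype/device settings and the prompt are illustrative assumptions, and serving the full 1M-token context in practice relies on the dedicated inference framework support described in this post.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hub repo id for the 7B release; see the links above for the official pages.
model_name = "Qwen/Qwen2.5-7B-Instruct-1M"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",   # pick an appropriate dtype automatically
    device_map="auto",    # place weights on the available device(s)
)

messages = [{"role": "user", "content": "Summarize this document: ..."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```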
The degradation of RoPE-based LLMs on long-context tasks is mainly caused by large, unseen relative positional distances between queries and keys when computing attention weights. We employ Dual Chunk Attention (DCA), which addresses this issue by remapping relative positions to smaller values, avoiding the large distances not seen during training (a toy illustration of this remapping follows below).

Integrating with Chunked Prefill: Directly processing sequences of 1M tokens incurs substantial memory overhead for storing the activations in the MLP layers, consuming 71 GB of VRAM for Qwen2.5-7B.
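The sketch below is a deliberately simplified illustration of the position-remapping idea behind DCA, not the actual Qwen2.5-1M implementation (which distinguishes intra-chunk, inter-chunk, and successive-chunk attention inside the attention kernel). The chunk size and the "trained" distance cap used here are illustrative assumptions; the point is only that no query-key pair ends up with a relative distance larger than those seen during training.

```python
import numpy as np

def remapped_relative_positions(seq_len: int, chunk_size: int, max_trained_distance: int) -> np.ndarray:
    """Toy DCA-style remapping: return a causal matrix of relative distances
    in which cross-chunk pairs are clamped to a distance seen during training."""
    rel = np.zeros((seq_len, seq_len), dtype=np.int64)
    for i in range(seq_len):            # query index
        for j in range(i + 1):          # causal mask: key index <= query index
            if i // chunk_size == j // chunk_size:
                # Intra-chunk pair: the true relative distance is already small.
                rel[i, j] = i - j
            else:
                # Cross-chunk pair: clamp the distance so it stays within the
                # range of relative positions the model saw during training.
                rel[i, j] = min(i - j, max_trained_distance)
    return rel

# Example: a 16-token sequence, chunk size 4, and a "trained" window of 6.
print(remapped_relative_positions(16, chunk_size=4, max_trained_distance=6))
```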