Microsoft’s Differential Transformer cancels attention noise in LLMs
A simple change to the attention mechanism can make LLMs much more effective at finding relevant information in their context window.
Improving the capabilities of large language models (LLMs) in retrieving in-prompt information remains an area of active research that can impact important applications such as retrieval-augmented generation (RAG) and in-context learning (ICL).

Wei and his colleagues also observed that some LLM hallucinations, where the model produces incorrect outputs despite having relevant context information, correlate with spurious attention patterns. The Differential Transformer addresses this by computing attention as the difference between two softmax attention maps, which cancels the noise both maps assign to irrelevant tokens. "Similar to ResNet, the residual connection is an addition, compared with the subtraction in Diff Transformer, so it wasn't immediately apparent for researchers to propose the idea," Wei said.
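To make the subtraction idea concrete, here is a minimal sketch of differential attention under simple assumptions: queries and keys are projected and split into two groups, two ordinary softmax attention maps are computed, and one is subtracted from the other with a weight λ (a learnable scalar in the paper, fixed here). The function name `diff_attention`, the single-head shapes, and the fixed λ are illustrative choices, not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def diff_attention(x, Wq, Wk, Wv, lam=0.8):
    """Illustrative single-head differential attention (hypothetical helper).

    x      : (seq_len, d_model) input sequence
    Wq, Wk : (d_model, 2 * d_head) projections, split into two groups
    Wv     : (d_model, d_head) value projection
    lam    : subtraction weight (learnable in the paper, fixed here)
    """
    d_head = Wv.shape[1]
    scale = d_head ** -0.5

    # Project and split queries/keys into two groups.
    q1, q2 = (x @ Wq).chunk(2, dim=-1)
    k1, k2 = (x @ Wk).chunk(2, dim=-1)
    v = x @ Wv

    # Two ordinary softmax attention maps.
    a1 = F.softmax((q1 @ k1.transpose(-1, -2)) * scale, dim=-1)
    a2 = F.softmax((q2 @ k2.transpose(-1, -2)) * scale, dim=-1)

    # Differential attention: subtract the second map to cancel
    # attention "noise" that both maps place on irrelevant tokens.
    return (a1 - lam * a2) @ v

# Tiny usage example with random weights.
d_model, d_head, seq_len = 64, 16, 8
x = torch.randn(seq_len, d_model)
Wq = torch.randn(d_model, 2 * d_head)
Wk = torch.randn(d_model, 2 * d_head)
Wv = torch.randn(d_model, d_head)
out = diff_attention(x, Wq, Wk, Wv)
print(out.shape)  # torch.Size([8, 16])
```

The subtraction is what distinguishes this from a standard attention head: common-mode attention that both maps assign to irrelevant context is cancelled, sharpening the scores on the tokens that actually matter.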