Attention Wasn't All We Needed
While this introduces an information bottleneck, potentially limiting fine-grained local interactions compared to full self-attention, it excels at capturing global context efficiently and has proven effective across modalities, enabling Transformer-like architectures to be applied to domains that were previously out of reach due to sequence length constraints.

This combination of a gentle start (warmup) followed by a smooth, theoretically motivated decay (cosine annealing) provides a robust and effective learning rate strategy that often requires less hyperparameter tuning than step-based schedules and frequently leads to improved model accuracy.

By ensuring that the weight decay strength is independent of the adaptive learning rate scaling, AdamW often simplifies hyperparameter tuning (especially of \(\text{lr}\) and \(\lambda\)) and can lead to models that generalize better to unseen data than standard Adam with L2 regularization.
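As a rough illustration of this kind of latent bottleneck (not necessarily the exact variant discussed above), the sketch below cross-attends a small set of learned latent vectors over the full input, so the attention cost grows with seq_len × num_latents rather than seq_len²; the module name, latent count, and shapes are placeholder assumptions.

```python
import torch
import torch.nn as nn

class LatentCrossAttention(nn.Module):
    # Illustrative bottleneck: a fixed set of learned latents queries the
    # full sequence, trading fine-grained token-token interactions for a
    # compact global summary.
    def __init__(self, dim: int, num_latents: int = 64, num_heads: int = 8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) -> (batch, num_latents, dim)
        q = self.latents.unsqueeze(0).expand(x.size(0), -1, -1)
        out, _ = self.attn(query=q, key=x, value=x, need_weights=False)
        return out

# Usage: compress a long sequence into a fixed-size latent summary.
x = torch.randn(2, 4096, 512)
summary = LatentCrossAttention(512)(x)   # shape: (2, 64, 512)
```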
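As a minimal sketch of the warmup-plus-cosine schedule described above, the function below ramps the learning rate linearly for `warmup_steps` and then follows a cosine curve down to `min_lr`. All hyperparameter values are illustrative defaults, not values from this post.

```python
import math

def lr_at_step(step: int, base_lr: float = 3e-4, min_lr: float = 3e-5,
               warmup_steps: int = 1000, total_steps: int = 100_000) -> float:
    if step < warmup_steps:
        # Linear warmup from ~0 up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay from base_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Applying it manually each training step:
# for group in optimizer.param_groups:
#     group["lr"] = lr_at_step(step)
```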
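To make the decoupling concrete, here is a hand-rolled single-tensor update in the spirit of AdamW; in practice one would simply use `torch.optim.AdamW`, and the variable names and hyperparameters below are illustrative.

```python
import torch

def adamw_step(p, grad, m, v, t, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, wd=0.01):
    # Decoupled weight decay: shrink the weights directly, scaled only by lr,
    # never by the adaptive second-moment term computed below.
    p.mul_(1 - lr * wd)
    # Standard Adam moment estimates on the *unregularized* gradient.
    m.mul_(betas[0]).add_(grad, alpha=1 - betas[0])
    v.mul_(betas[1]).addcmul_(grad, grad, value=1 - betas[1])
    m_hat = m / (1 - betas[0] ** t)   # bias correction
    v_hat = v / (1 - betas[1] ** t)
    p.add_(m_hat / (v_hat.sqrt() + eps), alpha=-lr)
    # Adam + L2 would instead have added wd * p to grad before the moment
    # updates, entangling the decay strength with the adaptive scaling.
    return p, m, v

# Usage with a single parameter tensor:
p, grad = torch.zeros(10), torch.randn(10)
m, v = torch.zeros(10), torch.zeros(10)
p, m, v = adamw_step(p, grad, m, v, t=1)
```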