
Cautious Optimizers: Improving Training with One Line of Code


AdamW has been the default optimizer for transformer pretraining. For many years, our community has searched for faster and more stable optimizers, with only limited success. In this work, we propose a single-line modification in PyTorch to any momentum-based optimizer, which we rename the cautious optimizer, e.g., C-AdamW and C-Lion. Our theoretical result shows that this modification preserves Adam's Hamiltonian function and does not break the convergence guarantee under the Lyapunov analysis. In addition, our theoretical insight reveals a whole new family of optimizers. Among them, we pick the simplest one for empirical experiments, showing not only speed-ups of up to $1.47\times$ on Llama and MAE pretraining, but also better results in LLM post-training tasks. Code is available at https://github.com/kyleliang919/C-Optim.
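
The abstract describes the change as a one-line masking rule applied to the update of any momentum-based optimizer. As a rough illustration only, the sketch below (plain PyTorch; the function name cautious_update and the rescaling detail are assumptions, not the authors' code) zeroes out the coordinates of a proposed update whose sign disagrees with the current gradient; the authoritative C-AdamW and C-Lion implementations live in the linked C-Optim repository.

import torch

# Hypothetical sketch of the "cautious" masking idea: keep only update
# coordinates whose sign agrees with the current gradient, then rescale so the
# overall update magnitude stays comparable. The rescaling factor is an
# assumption for illustration; see https://github.com/kyleliang919/C-Optim for
# the authors' actual implementation.
def cautious_update(update: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
    mask = (update * grad > 0).to(update.dtype)        # 1 where update and gradient agree in sign
    scale = mask.numel() / mask.sum().clamp(min=1.0)   # compensate for zeroed-out coordinates
    return update * mask * scale

# Usage with a generic momentum-based step: given a parameter p with p.grad set
# and a proposed step adam_step (e.g., from AdamW), the cautious variant would
# apply p.data.add_(cautious_update(adam_step, p.grad), alpha=-lr).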


