Deriving Muon

We recently proposed Muon, a new neural net optimizer. Muon has garnered attention for its excellent practical performance: it was used to set NanoGPT speed records, which led to interest from the big labs.

📕 The idea of equipping linear layers with the RMS-to-RMS operator norm comes from Appendix E of my paper "A Spectral Condition for Feature Learning" with Greg Yang and Jamie Simon. The most advanced version of this idea currently appears in our paper on the modular norm with Tim Large, Yang Liu, Minyoung Huh, Hyojin Bahng and Phillip Isola. I believe this experiment underscores how properly metrized and dualized deep learning can have an impact on the kinds of number systems and precision levels we use to represent and train neural networks.
