Modded-NanoGPT: NanoGPT (124M) quality in 3.25B tokens
Repository: KellerJordan/modded-nanogpt
The key trick is using non-convergent coefficients for the quintic polynomial in order to maximize the slope at zero, and thereby minimize the number of Newton-Schulz iterations needed. Jeremy Bernstein @jxbz sent us the draft that prompted us to experiment with various Newton-Schulz iterations as the orthogonalization method for this optimizer. The proposed optimizer can be thought of as a second way of smoothing spectral steepest descent, with a different set of memory and runtime tradeoffs than Shampoo.
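To make the coefficient trick concrete, here is a minimal NumPy sketch of a polynomial Newton-Schulz iteration that orthogonalizes a matrix by driving its singular values toward 1. The quintic coefficients below are illustrative, not necessarily the ones the repo uses; they are chosen only to show the idea of a steep slope at zero, which lifts small singular values quickly at the cost of oscillating around 1 rather than converging exactly.

```python
import numpy as np

def newton_schulz(G, iters, coeffs):
    """Approximately orthogonalize G with a polynomial iteration.

    coeffs = (a, b, c) define the odd polynomial applied implicitly to
    every singular value of X: f(x) = a*x + b*x**3 + c*x**5.
    """
    # Normalize by the Frobenius norm so all singular values lie in (0, 1].
    X = G / np.linalg.norm(G)
    for _ in range(iters):
        A = X @ X.T
        X = coeffs[0] * X + coeffs[1] * (A @ X) + coeffs[2] * (A @ A @ X)
    return X

# Build a test matrix with known singular values (0.2, 1.0, 3.0).
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((3, 3)))
V, _ = np.linalg.qr(rng.standard_normal((3, 3)))
G = U @ np.diag([0.2, 1.0, 3.0]) @ V.T

# Classic convergent cubic: f(x) = 1.5x - 0.5x^3 (slope 1.5 at zero).
# Converges to an exactly orthogonal matrix, but small singular values
# grow by at most 1.5x per step, so it needs many iterations.
X_cubic = newton_schulz(G, iters=15, coeffs=(1.5, -0.5, 0.0))

# Illustrative non-convergent quintic with slope ~3.44 at zero: small
# singular values are pushed toward 1 in far fewer iterations, but they
# oscillate in a band around 1 instead of converging exactly.
X_quintic = newton_schulz(G, iters=5, coeffs=(3.4445, -4.7750, 2.0315))
```

After a handful of quintic steps the singular values land in a loose band around 1, which is good enough for an optimizer update direction, while the cubic needs roughly twice as many matrix multiplies to reach the same point.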
Or read this on Hacker News