Calculating the cost of a Google DeepMind paper
Recently, GDM released a paper titled Scaling Exponents Across Parameterizations and Optimizers, in which they conduct over 10,000 LLM training runs to obtain optimal hyperparameters under different regimes. After reading it (it was great), I wanted to test my understanding of the paper by tallying up all the experiments conducted within and calculating the total compute cost it would take to replicate them.
extra weight decay experiments
- static: adam, per-layer, full alignment, decoupled 1e-4
- 4x parameterizations
- LR-experiment-like sweep across all 14 model widths

Due to my desire to finish this blog post in a reasonable amount of time, I made the unprincipled decision to approximate the number of experiments per line in any given Eval Loss vs Base Learning Rate graph as 15. Plus, if we look at Figure 6, notice that the range of evaluated learning rates actually seems constant there, unlike in the normal Eval Loss vs Base LR plots.
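As a sanity check on that approximation, here is a quick back-of-the-envelope tally in Python for the extra weight decay runs alone. The counts (4 parameterizations, 14 model widths, roughly 15 learning-rate points per sweep line) come from the description above; the script itself is just my own illustrative sketch, not anything from the paper.

```python
# Back-of-the-envelope count of the extra weight decay runs described above.
# Assumption (mine): each Eval Loss vs Base LR line contains ~15 LR points.
LR_POINTS_PER_LINE = 15
NUM_PARAMETERIZATIONS = 4  # all four parameterizations get the sweep
NUM_MODEL_WIDTHS = 14      # the full width range from the LR experiments

extra_weight_decay_runs = (
    NUM_PARAMETERIZATIONS * NUM_MODEL_WIDTHS * LR_POINTS_PER_LINE
)
print(f"extra weight decay runs ~= {extra_weight_decay_runs}")  # ~= 840
```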