Calculating the cost of a Google DeepMind paper
Recently, GDM released a paper titled Scaling Exponents Across Parameterizations and Optimizers, in which they conduct over 10,000 LLM training runs to obtain optimal hyperparameters under different regimes. After reading it (it was great), I wanted to test my understanding of the paper by tallying up all the experiments conducted within and calculating the total compute cost it would take to replicate them.
extra weight decay experiments
- static: adam, per-layer, full alignment, decoupled 1e-4
- 4x parameterizations
- LR-experiment-like sweep across all 14 model widths

Due to my desire to finish this blog post in a reasonable amount of time, I made the unprincipled decision to approximate the number of experiments per line in any given Eval Loss vs Base Learning Rate graph as 15. Plus, if we look at Figure 6, notice that the range of evaluated learning rates actually seems constant there, unlike in the normal Eval Loss vs Base LR plots.
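As a sanity check on that approximation, here is a quick back-of-the-envelope tally in Python for the extra weight decay runs alone. The counts (4 parameterizations, 14 model widths, roughly 15 learning-rate points per sweep line) come from the description above; the script itself is just my own illustrative sketch, not anything from the paper.

```python
# Back-of-the-envelope count of the extra weight decay runs described above.
# Assumption (mine): each Eval Loss vs Base LR line contains ~15 LR points.
LR_POINTS_PER_LINE = 15
NUM_PARAMETERIZATIONS = 4  # all four parameterizations get the sweep
NUM_MODEL_WIDTHS = 14      # the full width range from the LR experiments

extra_weight_decay_runs = (
    NUM_PARAMETERIZATIONS * NUM_MODEL_WIDTHS * LR_POINTS_PER_LINE
)
print(f"extra weight decay runs ~= {extra_weight_decay_runs}")  # ~= 840
```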