IsoFLOP curves of large language models are flat
An interesting detail in the recently released Llama-3 technical report caught my eye (p. 8). It stood out to me because I had noted the same phenomenon in a previous post about the Chinchil…
I’m glad that this observation is finally being taken seriously, but I think the quotation above from the Llama-3 paper still underestimates the extent of this isoFLOP flatness issue.

[Figure: gray curves show 4,000 individual predictions based on bootstrapped parametric scaling-law estimates.]

So, again, any calculation that includes hyperparameter search and other kinds of experimentation (in addition to training and inference compute) will likewise shift the minimum of the isoFLOP curve to the left and thus favor training smaller models for longer, although this effect is likely much smaller than that of inference.
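The argument is easy to reproduce with a toy calculation. Below is a minimal sketch (not from the original post), assuming a Chinchilla-style parametric loss L(N, D) = E + A/N^alpha + B/D^beta with illustrative constants, training compute of roughly 6*N*D FLOPs, and inference compute of roughly 2*N FLOPs per token. It shows how flat the training-only isoFLOP curve is near its minimum, and how charging the same budget for a large number of inference tokens pushes the optimal model size toward smaller N trained on more tokens.

```python
# Toy illustration of isoFLOP flatness and the effect of inference compute.
# All constants are illustrative assumptions in the spirit of a Chinchilla-style
# parametric fit; none of these numbers come from the Llama-3 report or the post.
import numpy as np

E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(N, D):
    """Parametric scaling-law loss for N parameters and D training tokens."""
    return E + A / N**alpha + B / D**beta

def best_N(total_flops, inference_tokens=0.0, n_grid=2000):
    """Minimize loss over model size N for a fixed total compute budget.

    Training compute is approximated as 6*N*D and inference compute as
    2*N*inference_tokens, so D = (total_flops - 2*N*inference_tokens) / (6*N).
    """
    Ns = np.logspace(8, 12, n_grid)                    # 1e8 .. 1e12 parameters
    Ds = (total_flops - 2 * Ns * inference_tokens) / (6 * Ns)
    valid = Ds > 0
    losses = np.where(valid, loss(Ns, np.where(valid, Ds, 1.0)), np.inf)
    i = np.argmin(losses)
    return Ns[i], Ds[i], losses[i]

C = 1e24                                               # fixed compute budget (FLOPs)

# 1) Flatness: along the training-only isoFLOP curve, a wide range of model
#    sizes stays within 0.5% of the minimum loss.
N_opt, D_opt, L_opt = best_N(C)
Ns = np.logspace(8, 12, 2000)
Ds = C / (6 * Ns)
flat = Ns[loss(Ns, Ds) <= 1.005 * L_opt]
print(f"training-only optimum: N ~ {N_opt:.2e}, loss {L_opt:.3f}")
print(f"N within 0.5% of optimal loss: {flat.min():.2e} .. {flat.max():.2e}")

# 2) Inference shifts the optimum left: if the same budget must also cover
#    serving many tokens, the loss-minimizing N becomes smaller.
N_inf, D_inf, _ = best_N(C, inference_tokens=2e12)
print(f"with inference accounted for: N ~ {N_inf:.2e}, D ~ {D_inf:.2e}")
```

Because the curve is nearly flat around its minimum, even a modest extra cost term that grows with model size (inference, or to a lesser degree experimentation) is enough to move the preferred operating point to a smaller model trained for longer.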