
How to scale your model: A systems view of LLMs on TPUs


Training LLMs often feels like alchemy, but understanding and optimizing the performance of your models doesn't have to. This book aims to demystify the science of scaling language models on TPUs: how TPUs work and how they communicate with each other, how LLMs run on real hardware, and how to parallelize your models during training and inference so they run efficiently at massive scale. If you've ever wondered “how expensive should this LLM be to train” or “how much memory do I need to serve this model myself” or “what's an AllGather”, we hope this will be useful to you.
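As a taste of the back-of-envelope arithmetic involved, here is a minimal sketch of the serving-memory question; the model size and bytes-per-parameter values are illustrative assumptions, not figures from the book, and the estimate covers weights only (no KV cache, activations, or framework overhead):

```python
def serving_weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Rough lower bound on accelerator memory needed to hold model weights.

    bytes_per_param: 2 for bf16/fp16, 1 for int8, 4 for fp32.
    Ignores KV cache, activations, and runtime overhead.
    """
    return num_params * bytes_per_param / 1e9

# Example (hypothetical): a 70B-parameter model served in bf16 needs
# roughly 140 GB just for its weights, so it cannot fit on a single
# accelerator with less HBM than that without sharding or quantization.
print(serving_weight_memory_gb(70e9))  # ~140.0
```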

Goals & Feedback: By the end, you should feel comfortable estimating the best parallelism scheme for a Transformer model on a given hardware platform, and roughly how long training and inference should take. Alex Krizhevsky had to write unholy CUDA code to make CNNs fast but within a couple years, libraries like Theano and TensorFlow meant you didn't have to. We strongly believe it’s worth understanding every piece of the Transformer architecture: the exact sizes of every matrix, where normalization occurs, how many parameters and FLOPsFLoating point OPs, basically the total number of adds and multiplies required.
