How to Think About GPUs
We love TPUs at Google, but GPUs are great too. This chapter takes a deep dive into the world of NVIDIA GPUs – how each chip works, how they’re networked together, and what that means for LLMs, especially compared to TPUs. It builds on <a href='https://jax-ml.github.io/scaling-book/tpus/'>Chapter 2</a> and <a href='https://jax-ml.github.io/scaling-book/training'>Chapter 5</a>, so you are encouraged to read them first.
Before the deep learning boom, GPUs ("Graphics Processing Units") did, well, graphics – mostly for video games. But since deep learning took off in the 2010s, GPUs have acted more and more like dedicated matrix multiplication machines – in other words, more like TPUs. The two architectures now map onto each other component by component:

| GPU | TPU | What is it? |
|---|---|---|
| SM (streaming multiprocessor) | Tensor Core | Core "cell" that contains the other units |
| Warp scheduler | VPU | SIMD vector arithmetic unit |
| CUDA core | VPU ALU | SIMD ALU |
| SMEM (L1 cache) | VMEM | Fast on-chip cache memory |
| Tensor Core | MXU | Matrix multiplication unit |
| HBM (aka GMEM) | HBM | High-bandwidth, high-capacity memory |

One key difference is that a GPU exposes each CUDA core to the programmer as a separate thread. This enables flexible programming at the thread level, but at the cost of silently degrading performance if warps diverge too often (see the CUDA sketch at the end of this section).

The per-chip numbers also differ sharply between an H100 and a TPU v5p:

| GPU | TPU | H100 (per chip) | TPU v5p (per chip) |
|---|---|---|---|
| SM (streaming multiprocessor) | Tensor Core | 132 | 2 |
| Warp scheduler | VPU | 528 | 8 |
| SMEM (L1 cache) | VMEM | 32MB | 128MB |
| Registers | Vector registers (VRegs) | 32MB | 256kB |
| Tensor Core | MXU | 528 | 8 |

This difference in modularity (many small units on a GPU versus a few large ones on a TPU) makes TPUs much cheaper to build and simpler to understand, but it also puts more of a burden on the compiler to do the right thing.
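As a rough cross-check, the H100 column follows from the SM's internal structure. Assuming 4 warp schedulers, 4 Tensor Cores, and a 256 KB register file per SM (per-SM figures from NVIDIA's Hopper specs, not stated in the table above):

$$
\begin{aligned}
\text{warp schedulers} &= 132 \text{ SMs} \times 4 \text{ per SM} = 528 \\
\text{Tensor Cores} &= 132 \text{ SMs} \times 4 \text{ per SM} = 528 \\
\text{register file} &= 132 \text{ SMs} \times 256\,\text{KB per SM} = 33\,\text{MB} \approx 32\,\text{MB}
\end{aligned}
$$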
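To make the warp-divergence caveat above concrete, here is a minimal, hypothetical CUDA sketch (not from this chapter). Threads in a 32-thread warp execute in lockstep, so when one warp's threads take both sides of a branch, the hardware runs the two paths one after the other; branching on the warp index instead keeps every warp on a single path.

```cuda
#include <cuda_runtime.h>

// Two toy kernels for illustration. Each thread does the same amount of
// work in both; only the warp-level branch pattern differs.

__global__ void divergent(float* out) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  float x = (float)tid;
  // tid % 2 splits every 32-thread warp across both branches, so each
  // warp executes BOTH loops back to back: roughly 2x the latency.
  if (tid % 2 == 0) {
    for (int i = 0; i < 10000; ++i) x = x * 1.0001f + 1.0f;
  } else {
    for (int i = 0; i < 10000; ++i) x = x * 0.9999f - 1.0f;
  }
  out[tid] = x;
}

__global__ void uniform(float* out) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  float x = (float)tid;
  // (tid / 32) % 2 gives every thread in a warp the same branch, so each
  // warp executes only ONE loop: same program, about half the stalls.
  if ((tid / 32) % 2 == 0) {
    for (int i = 0; i < 10000; ++i) x = x * 1.0001f + 1.0f;
  } else {
    for (int i = 0; i < 10000; ++i) x = x * 0.9999f - 1.0f;
  }
  out[tid] = x;
}

int main() {
  float* out;
  cudaMalloc(&out, 1024 * sizeof(float));
  divergent<<<1, 1024>>>(out);  // both branches per warp, serialized
  uniform<<<1, 1024>>>(out);    // one branch per warp
  cudaDeviceSynchronize();
  cudaFree(out);
  return 0;
}
```

Both kernels produce identical outputs, which is exactly what makes divergence easy to miss: nothing is wrong with the result, the warps just spend a large fraction of their cycles stalled on the branch they didn't take.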