**How to train a model on 10k H100 GPUs?**

A quick note summarizing common knowledge among the large-scale training cohort. Oct 2nd, 2024. [https://soumith.ch/blog.html](https://soumith.ch/blog.html)

My friend Francois Fleuret asked the above. I quickly jotted down what I think is fairly common knowledge among engineers working on large-scale training.
However, if the network is sufficiently large, it is more profitable to free these intermediate terms in order to fit a larger batch size, and recompute them when you need them for the backward pass.

At this scale we might have multiple layers of switches, RDMA (the ability to copy GPU memory directly to the NIC, bypassing CPU RAM entirely), and separate frontend and backend NICs (the frontend connects to storage such as NFS; the backend connects GPUs to other GPUs in the cluster). Libraries like NCCL do sophisticated discovery of the underlying network topology and leverage it when we run all-reduce and other collectives.
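The free-and-recompute trade (activation checkpointing) can be sketched in plain Python. This is a toy illustration of the memory accounting, not PyTorch's `torch.utils.checkpoint`; the function names and the stand-in "layers" are made up for the example:

```python
# Toy sketch of activation checkpointing (hypothetical names, not a real
# framework API): instead of caching every layer's activation for the
# backward pass, cache only segment boundaries and rebuild the rest
# on demand during backprop, trading extra compute for memory.

def forward_full(x, layers):
    """Standard forward: keep every intermediate activation."""
    acts = [x]
    for f in layers:
        acts.append(f(acts[-1]))
    return acts  # len(layers) + 1 activations held in memory


def forward_checkpointed(x, layers, segment):
    """Checkpointed forward: keep only one activation per segment."""
    saved = [x]
    h = x
    for i, f in enumerate(layers, 1):
        h = f(h)
        if i % segment == 0:
            saved.append(h)  # segment boundary: the only thing we cache
    return saved  # roughly len(layers) / segment activations held


def recompute_segment(boundary_act, segment_layers):
    """During backward, rebuild one segment's activations from its boundary."""
    acts = [boundary_act]
    for f in segment_layers:
        acts.append(f(acts[-1]))
    return acts


# Stand-in "layers": each just adds a constant, enough to show the bookkeeping.
layers = [lambda v, k=k: v + k for k in range(8)]

full = forward_full(0, layers)                      # stores 9 activations
ckpt = forward_checkpointed(0, layers, segment=4)   # stores only 3
# The dropped activations of the second segment are rebuilt on demand:
seg = recompute_segment(ckpt[1], layers[4:])
```

The recomputed segment ends at the same value the full forward produced, which is the whole point: nothing is lost, only deferred.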
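To make the all-reduce mentioned above concrete, here is a toy single-process simulation of a ring all-reduce (reduce-scatter followed by all-gather). It is a sketch of the algorithm only; NCCL's real implementation overlaps communication with computation and picks ring or tree schedules based on the discovered topology, and the function name here is invented:

```python
# Toy ring all-reduce: bufs[r] is "rank" r's vector; after a reduce-scatter
# phase and an all-gather phase around the ring, every rank holds the
# elementwise sum. Purely illustrative, single-process, hypothetical names.

def ring_all_reduce(bufs):
    n = len(bufs)
    data = [list(b) for b in bufs]   # each rank's working buffer
    size = len(data[0])
    assert size % n == 0, "vector must split evenly into one chunk per rank"
    c = size // n                    # elements per chunk

    def chunk(k):
        return range(k * c, k * c + c)

    # Phase 1: reduce-scatter. At step s, rank r receives chunk
    # (r - s - 1) % n from its ring neighbor and accumulates it; after
    # n - 1 steps, rank r holds the fully reduced chunk (r + 1) % n.
    for s in range(n - 1):
        for r in range(n):
            prev = (r - 1) % n
            k = (r - s - 1) % n
            for i in chunk(k):
                data[r][i] += data[prev][i]

    # Phase 2: all-gather. The fully reduced chunks circulate once more
    # around the ring, overwriting stale copies.
    for s in range(n - 1):
        for r in range(n):
            prev = (r - 1) % n
            k = (r - s) % n
            for i in chunk(k):
                data[r][i] = data[prev][i]

    return data


out = ring_all_reduce([[1, 2, 3, 4], [10, 20, 30, 40]])
# every rank ends with the elementwise sum [11, 22, 33, 44]
```

Each rank sends and receives only `size / n` elements per step, which is why the ring schedule uses link bandwidth so efficiently on large clusters.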