
A RoCE network for distributed AI training at scale


AI networks play an important role in interconnecting tens of thousands of GPUs, forming the foundational infrastructure for training and enabling large models with hundreds of billions of parameters, such as Llama 3.1 405B. To scale to this size, we designed an aggregator training switch (ATSW) layer that connects the CTSWs in a data center building, expanding the RoCE domain beyond a single AI Zone. We would like to thank all contributors to the paper, including Rui Miao, Shengbao Zheng, Sai Jayesh Bondu, Guilherme Goes, Hany Morsy, Rohit Puri, Adi Mohammad Riftadi, Ashmitha Jeevaraj Shetty, Jingyi Yang, Shuqiang Zhang, Mikel Jimenez Fernandez, Shashi Gandham, and Omar Baldonado.
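The two-tier idea described above (CTSWs serving an AI Zone, with an ATSW layer aggregating them into a single RoCE domain) can be sketched as a toy topology model. This is an illustrative sketch only, not Meta's actual design: the switch names, fan-out counts, and full-mesh CTSW-to-ATSW wiring below are all hypothetical assumptions.

```python
# Toy model of a two-tier aggregation fabric (hypothetical, for illustration):
# every CTSW in every AI Zone links to every ATSW, so any pair of CTSWs in
# different zones can reach each other through any one ATSW.

def build_fabric(num_zones, ctsws_per_zone, num_atsws):
    """Return an adjacency map from each CTSW to the ATSWs it connects to."""
    atsws = [f"atsw{a}" for a in range(num_atsws)]
    links = {}
    for z in range(num_zones):
        for c in range(ctsws_per_zone):
            links[f"zone{z}-ctsw{c}"] = list(atsws)  # full mesh CTSW <-> ATSW
    return links

def cross_zone_paths(links, src, dst):
    """All one-ATSW-hop paths between two CTSWs (src -> atsw -> dst)."""
    return [(src, atsw, dst) for atsw in links[src] if atsw in links[dst]]

fabric = build_fabric(num_zones=2, ctsws_per_zone=4, num_atsws=8)
paths = cross_zone_paths(fabric, "zone0-ctsw0", "zone1-ctsw3")
print(len(paths))  # 8 ATSWs -> 8 candidate cross-zone paths
```

With a full CTSW-to-ATSW mesh, the number of ATSWs directly sets the number of parallel cross-zone paths, which is the load-balancing and fault-tolerance motivation for an aggregation layer in a design like this.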
