Visualizing 6D Mesh Parallelism
Plus some lore
Most of them fail to convey a deep understanding of the exact communications involved in a single training step, and even the outliers that do still leave out the more complex case of combining all approaches.

It infects/destroys otherwise clean training code, creates gigantic bubbles of compute inactivity, and worst of all, is conceptually simple enough to make an engineer feel it should be easy to implement.

On correctness

Frankly speaking, it is hard for a single person to nail 100% of the details with no external feedback, and I fully expect to have made multiple egregious errors in understanding.