Matmul on Blackwell: Part 2 – Using Hardware Features to Optimize Matmul
In the first blog post in this series we explained Nvidia's Blackwell GPU architecture and concluded with a four-line kernel that was a bit worse than cuBLAS. In fact, the performance was a lot worse, coming in at 0.3% of cuBLAS and leaving 1758 TFLOPS on the table.
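For context, a naive kernel in that spirit looks like the following sketch (a hypothetical reconstruction for illustration, not necessarily the exact code from Part 1): one thread computes one element of C = A * B.

```cuda
// Naive matmul sketch: each thread computes a single element of
// C = A * B for square N x N row-major float matrices.
__global__ void naive_matmul(const float* A, const float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < N; ++k)
            acc += A[row * N + k] * B[k * N + col];  // dot product of row and column
        C[row * N + col] = acc;
    }
}
```

Every thread streams a full row of A and a full column of B from global memory, so the kernel is hopelessly memory-bound, which is exactly what the profile below shows.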
Figure: compute and memory throughput profile from NCU.

Furthermore, the true power of TMA store lies in its asynchrony, which enables pipelining and overlapping operations. Specifically, the next post will showcase how to build a warp-specialized pipeline that overlaps data transfer and computation to get performance that's closer to state-of-the-art.
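As a taste of the primitive that pipeline builds on, here is a minimal sketch of issuing an asynchronous TMA store (shared to global) via inline PTX. The names `tmap`, `TILE`, `tile_x`, and `tile_y` are assumptions for illustration; it presumes a tensor map built on the host with `cuTensorMapEncodeTiled` for a `TILE x TILE` float tile, and a compile target of sm_90a or newer.

```cuda
#include <cuda.h>    // CUtensorMap (CUDA driver API type)
#include <cstdint>

constexpr int TILE = 64;  // hypothetical tile size for illustration

// Sketch only: stage a tile in shared memory, then issue an async TMA store.
// Assumes `tmap` describes the output matrix and was created on the host
// with cuTensorMapEncodeTiled.
__global__ void tma_store_tile(const __grid_constant__ CUtensorMap tmap,
                               int tile_x, int tile_y) {
    __shared__ alignas(128) float smem[TILE * TILE];  // TMA wants 128B alignment

    // ... compute the output tile into smem here ...

    __syncthreads();
    if (threadIdx.x == 0) {
        // Make the preceding generic shared-memory writes visible to the
        // asynchronous (TMA) proxy before the bulk copy reads them.
        asm volatile("fence.proxy.async.shared::cta;");

        uint32_t smem_addr =
            static_cast<uint32_t>(__cvta_generic_to_shared(smem));
        // Kick off the bulk tensor store; it completes asynchronously,
        // which is what makes overlap with the next tile's compute possible.
        asm volatile(
            "cp.async.bulk.tensor.2d.global.shared::cta.tile.bulk_group"
            " [%0, {%1, %2}], [%3];"
            :: "l"(&tmap), "r"(tile_x * TILE), "r"(tile_y * TILE), "r"(smem_addr)
            : "memory");
        asm volatile("cp.async.bulk.commit_group;");
        // Only wait when smem must be reused; deferring this wait is
        // precisely where the pipelining headroom comes from.
        asm volatile("cp.async.bulk.wait_group.read 0;" ::: "memory");
    }
    __syncthreads();
}
```

In a real pipeline the `wait_group` would be pushed as late as possible so the TMA engine drains the store while the CTA is already computing the next tile; that is the overlap the next post will build out.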