Optimizing Matrix Multiplication on RDNA3
Writing Super-Fast Matrix Multiplication with HIP, RGP, and ISA
All the information I used comes from the publicly available ISA guide. I don't intend to re-implement or replace rocBLAS. For the sake of simplicity, I focused only on single-precision (FP32) multiplication of 4096x4096 matrices.

Of course, these are oversimplified calculations, as they totally ignore the memory hierarchy, but we see that the available bandwidth is sufficiently high that we can increase the amount of data we read and get closer to being compute bound.

Indeed, if we were to read by columns, each thread in a wave would access a non-contiguous memory region, resulting in multiple separate transactions and reduced efficiency, as shown in the two diagrams below.