Optimizing Matrix Multiplication on RDNA3

Writing Super-Fast Matrix Multiplication with HIP, RGP, and ISA

All the information I used comes from the publicly available ISA guide. I don't intend to re-implement or replace rocBLAS. For the sake of simplicity, I focused only on single-precision (FP32) multiplication of 4096x4096 matrices.

Of course, these are oversimplified calculations, as they completely ignore the memory hierarchy, but they show that the available bandwidth is high enough that we can increase the amount of data we read and move closer to being compute bound.

Indeed, if we were to read by columns, each thread in a wave would access a non-contiguous memory region, resulting in multiple separate transactions and reduced efficiency, as the two accompanying diagrams show.
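To make the row-versus-column argument concrete, here is a minimal host-side sketch in plain C++ (no GPU required) that counts how many distinct memory segments one wave touches when each thread loads a single float from a row-major 4096x4096 FP32 matrix. The wave size of 32 threads and the 128-byte segment size are assumptions chosen for illustration, and `segments_touched` is a hypothetical helper, not code from the article.

```cpp
#include <cstddef>
#include <set>

// Illustrative assumptions, not values taken from the article:
constexpr std::size_t kN = 4096;         // matrix dimension (row-major, FP32)
constexpr std::size_t kWaveSize = 32;    // threads per wave (assumed)
constexpr std::size_t kLineBytes = 128;  // bytes per memory segment (assumed)

// Hypothetical helper: counts the distinct kLineBytes-sized segments touched
// when every thread of one wave loads a single float.
//   by_rows == true  -> thread i reads element (0, i): consecutive addresses
//   by_rows == false -> thread i reads element (i, 0): stride of kN floats
std::size_t segments_touched(bool by_rows) {
    std::set<std::size_t> segments;
    for (std::size_t lane = 0; lane < kWaveSize; ++lane) {
        const std::size_t row = by_rows ? 0 : lane;
        const std::size_t col = by_rows ? lane : 0;
        const std::size_t byte_offset = (row * kN + col) * sizeof(float);
        segments.insert(byte_offset / kLineBytes);
    }
    return segments.size();
}
```

Under these assumptions, reading by rows keeps the whole wave inside a single segment, while reading by columns spreads the 32 threads across 32 different segments, which is the "multiple separate transactions" case described above. Real hardware behavior also depends on caches and the memory controller, so treat the counts as a rough model only.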
