Optimizing Matrix Multiplication on RDNA3


Writing Super-Fast Matrix Multiplication with HIP, RGP, and ISA

All the information I used comes from the publicly available ISA guide. I don’t intend to re-implement or replace rocBLAS. I only focused on 4096x4096 single-precision (FP32) matrix multiplication, for the sake of simplicity.

Of course, these are oversimplified calculations, as they totally ignore the memory hierarchy, but we see that the available bandwidth is high enough that we can increase the amount of data we read and move closer to being compute bound.

Indeed, if we were to read by columns, each thread in a wave would access a non-contiguous memory region, resulting in multiple separate transactions and reduced efficiency, as shown in the 2 diagrams below.
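The two points above (the back-of-the-envelope compute-versus-bandwidth estimate and the cost of reading by columns) can be illustrated with a small host-side C++ sketch. It is not the article's code. The function names are hypothetical, and the wave size of 32 lanes and the 128-byte cache line are assumptions chosen for illustration; only the 4096x4096 FP32 problem size comes from the text.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>

// Problem size taken from the article: 4096 x 4096, FP32, row-major.
constexpr std::size_t kN = 4096;
// ASSUMPTION: 32 lanes per wave (wave32), not stated in the excerpt.
constexpr std::size_t kWaveSize = 32;
// ASSUMPTION: 128-byte cache line, used here as the transaction granularity.
constexpr std::size_t kLineBytes = 128;

// Count the distinct cache lines touched when every lane of one wave loads
// a single float, with consecutive lanes separated by stride_elems floats.
//   stride_elems = 1   models lanes reading along a row (contiguous).
//   stride_elems = kN  models lanes reading down a column (strided).
std::size_t linesTouchedByWave(std::size_t stride_elems) {
    std::set<std::size_t> lines;
    for (std::size_t lane = 0; lane < kWaveSize; ++lane) {
        const std::size_t byte_offset = lane * stride_elems * sizeof(float);
        lines.insert(byte_offset / kLineBytes);
    }
    return lines.size();
}

// Floating-point operations for C = A * B with n x n matrices:
// each of the n*n outputs needs n multiplies and n adds.
constexpr std::uint64_t totalFlops(std::uint64_t n) {
    return 2 * n * n * n;
}

// Lower bound on memory traffic: A and B read once, C written once.
// Like the estimate in the text, this ignores the memory hierarchy.
constexpr std::uint64_t minBytes(std::uint64_t n) {
    return 3 * n * n * sizeof(float);
}
```

Under these assumptions, `linesTouchedByWave(1)` returns 1 (the 32 floats fill exactly one 128-byte line), while `linesTouchedByWave(kN)` returns 32, one line per lane, which is the "multiple separate transactions" effect described above. For the idealized estimate, `totalFlops(4096)` is 137,438,953,472 operations against `minBytes(4096)` of 201,326,592 bytes, roughly 683 FLOPs per byte if every element were fetched only once. A real kernel re-reads data, so it sits well below that ratio.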
