How to Write a Fast Matrix Multiplication from Scratch with Tensor Cores (2024)

I figured out the implementation details mostly by digging around the NVIDIA CUTLASS forums and source, and I wrote this article to make sure I actually understand what I am doing, and in the hope that fellow GPU nerds trying to work with tensor cores might find it helpful. Given how many people and companies these days buy NVIDIA GPUs almost exclusively to run matrix multiplications, a lot of work evidently goes into improving the tensor cores' programmability and performance between successive architectures. To make programming these powerful but imbalanced machines more manageable, the more recent Ampere and Hopper architectures introduced hardware support that enables several important parts of a GEMM kernel to run asynchronously with respect to the rest of the SM.
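
To make the asynchrony concrete, here is a minimal sketch (not the article's actual kernel) of one such mechanism: Ampere's `cp.async`, exposed in CUDA via the `__pipeline_*` intrinsics. A thread issues a global-to-shared copy that proceeds in the background, so the warp is free to keep issuing other instructions, e.g. tensor core MMAs on a previously loaded tile, before waiting on the copy. The kernel and buffer names here are illustrative assumptions, not from the source.

```cuda
#include <cuda_pipeline.h>

__global__ void async_copy_sketch(const float4* __restrict__ gmem) {
    // Staging buffer in shared memory, one 16-byte element per thread.
    __shared__ float4 smem[128];

    // Kick off an asynchronous 16-byte copy from global to shared memory.
    // On sm_80+ this compiles to cp.async; the copy bypasses registers
    // and proceeds while the warp continues executing.
    __pipeline_memcpy_async(&smem[threadIdx.x],
                            &gmem[threadIdx.x],
                            sizeof(float4));
    __pipeline_commit();

    // ...the warp could do unrelated work here, e.g. tensor core MMAs
    // on the previous tile, overlapping compute with the copy...

    // Wait until all committed copies have landed, then synchronize so
    // every thread sees the full tile in shared memory.
    __pipeline_wait_prior(0);
    __syncthreads();

    // smem now holds the tile; MMA instructions would consume it next.
}
```

The point of the design is software pipelining: with copies for tile `k+1` in flight while tile `k` is being multiplied, the SM's memory and tensor core pipelines stay busy simultaneously instead of taking turns.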