
Beating NumPy matrix multiplication in 150 lines of C


TL;DR The code from the tutorial is available at matmul.c. This blog post is the result of my attempt to implement high-performance matrix multiplication on a CPU while keeping the code simple, portable and scalable. The implementation follows the BLIS design and works for arbitrary matrix sizes. When fine-tuned for an AMD Ryzen 7700 (8 cores), it outperforms NumPy (=OpenBLAS), achieving over 1 TFLOPS of peak performance across a wide range of matrix sizes.

I challenged myself and asked whether it is possible to write a high-performance matmul (across a wide range of matrix sizes) without diving deep into Assembly and Fortran code, at least for my CPU. After some searching on the internet, I found a couple of exciting and educational step-by-step tutorials on how to implement fast matmul from scratch, covering both theoretical and practical aspects. The implementation here aims for:

- NumPy-like multi-threading performance across a broad range of matrix sizes
- Simple, portable and scalable C code
- Support for a wide variety of processors
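To make the TFLOPS figure concrete, here is a minimal sketch of the naive baseline that any optimized matmul is measured against, together with the standard FLOP count for a matrix product. This is not the code from matmul.c. The function names `matmul_naive`, `matmul_flops` and `flops_per_second` are hypothetical, and the row-major layout is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical baseline (not the tutorial's kernel): C = A * B with
   row-major float matrices. A is M x K, B is K x N, C is M x N. */
void matmul_naive(const float *A, const float *B, float *C,
                  size_t M, size_t N, size_t K) {
    assert(M > 0 && N > 0 && K > 0);
    for (size_t i = 0; i < M; i++) {
        for (size_t j = 0; j < N; j++) {
            float acc = 0.0f;
            for (size_t p = 0; p < K; p++) {
                acc += A[i * K + p] * B[p * N + j];
            }
            C[i * N + j] = acc;
        }
    }
}

/* Each inner-loop iteration does one multiply and one add,
   so the whole product costs 2 * M * N * K floating-point operations. */
double matmul_flops(size_t M, size_t N, size_t K) {
    return 2.0 * (double)M * (double)N * (double)K;
}

/* Throughput in FLOPS for a multiplication that took `seconds` to run. */
double flops_per_second(size_t M, size_t N, size_t K, double seconds) {
    assert(seconds > 0.0);
    return matmul_flops(M, N, K) / seconds;
}
```

By this count, a 1000 x 1000 by 1000 x 1000 product costs 2e9 floating-point operations, so sustaining 1 TFLOPS means finishing it in about 2 milliseconds. The triple loop above usually falls far short of that, which is the gap a BLIS-style design sets out to close.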
