Beating NumPy matrix multiplication in 150 lines of C
TL;DR The code from the tutorial is available at matmul.c. This blog post is the result of my attempt to implement high-performance matrix multiplication on CPU while keeping the code simple, portable and scalable. The implementation follows the BLIS design, works for arbitrary matrix sizes, and, when fine-tuned for an AMD Ryzen 7700 (8 cores), outperforms NumPy (=OpenBLAS), achieving over 1 TFLOPS of peak performance across a wide range of matrix sizes.
I challenged myself: is it possible to write a high-performance matmul (across a wide range of matrix sizes) without diving deep into Assembly and Fortran code, at least for my CPU? After some searching on the internet, I found a couple of exciting and educational step-by-step tutorials on how to implement fast matmul from scratch, covering both theoretical and practical aspects. Building on them, I set the following goals:
- NumPy-like multi-threaded performance across a broad range of matrix sizes
- Simple, portable and scalable C code
- Support for a wide variety of processors
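For reference, the computation being optimized throughout is plain matrix multiplication, C = A · B. Below is a minimal naive triple-loop sketch (not the optimized kernel from this post; the function name and row-major layout are illustrative assumptions) that serves as the baseline such tutorials start from:

#include <stddef.h>

/* Naive reference matmul: C = A * B, row-major storage.
 * A is M x K, B is K x N, C is M x N; roughly 2*M*N*K FLOPs. */
void matmul_naive(const float *A, const float *B, float *C,
                  size_t M, size_t N, size_t K) {
    for (size_t i = 0; i < M; i++) {
        for (size_t j = 0; j < N; j++) {
            float acc = 0.0f;
            for (size_t p = 0; p < K; p++) {
                acc += A[i * K + p] * B[p * N + j];
            }
            C[i * N + j] = acc;
        }
    }
}

Counting FLOPs this way (2 * M * N * K per multiplication) is also how the TFLOPS figures quoted above are typically derived.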