Beating NumPy matrix multiplication in 150 lines of C
TL;DR The code from the tutorial is available at matmul.c. This blog post is the result of my attempt to implement high-performance matrix multiplication on CPU while keeping the code simple, portable and scalable. The implementation follows the BLIS design, works for arbitrary matrix sizes, and, when fine-tuned for an AMD Ryzen 7700 (8 cores), outperforms NumPy (=OpenBLAS), achieving over 1 TFLOPS of peak performance across a wide range of matrix sizes.
I challenged myself: is it possible to write a high-performance matmul (across a wide range of matrix sizes) without diving deep into Assembly and Fortran code, at least for my CPU? After some searching on the internet, I found a couple of exciting and educational step-by-step tutorials on how to implement fast matmul from scratch, covering both theoretical and practical aspects. Building on them, I set the following goals:
- NumPy-like multi-threaded performance across a broad range of matrix sizes
- Simple, portable and scalable C code
- Support for a wide variety of processors
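For reference, the computation being optimized throughout is plain matrix multiplication, C = A · B. Below is a minimal naive triple-loop sketch (not the optimized kernel from this post; the function name and row-major layout are illustrative assumptions) that serves as the baseline such tutorials start from:

#include <stddef.h>

/* Naive reference matmul: C = A * B, row-major storage.
 * A is M x K, B is K x N, C is M x N; roughly 2*M*N*K FLOPs. */
void matmul_naive(const float *A, const float *B, float *C,
                  size_t M, size_t N, size_t K) {
    for (size_t i = 0; i < M; i++) {
        for (size_t j = 0; j < N; j++) {
            float acc = 0.0f;
            for (size_t p = 0; p < K; p++) {
                acc += A[i * K + p] * B[p * N + j];
            }
            C[i * N + j] = acc;
        }
    }
}

Counting FLOPs this way (2 * M * N * K per multiplication) is also how the TFLOPS figures quoted above are typically derived.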