Fast Multidimensional Matrix Multiplication on CPU from Scratch (2022)
Numpy can multiply two 1024x1024 matrices on a 4-core Intel CPU in ~8ms. This is incredibly fast, considering this boils down to 18 FLOPs / core / cycle, with...
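As a rough sanity check of that figure (assuming a sustained clock of roughly 3.7 GHz, which the excerpt does not state), the arithmetic works out to:

```latex
2 \times 1024^3 \approx 2.15\ \text{GFLOP}
\quad\Rightarrow\quad
\frac{2.15\ \text{GFLOP}}{8\ \text{ms}} \approx 268\ \text{GFLOP/s}
\quad\Rightarrow\quad
\frac{268\ \text{GFLOP/s}}{4\ \text{cores} \times 3.7\ \text{GHz}} \approx 18\ \text{FLOPs/core/cycle}
```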
Digging around the binary, there are multiple SGEMM implementations, each specific to one of Intel's microarchitectures: e.g. sgemm_kernel_HASWELL, sgemm_kernel_SANDYBRIDGE, … At runtime, the BLAS library will use the cpuid instruction to query the details of the processor and then call the suitable function. This increases the size of the BLAS binary considerably, since it carries around GEMM implementations for many architectures even though we only ever need one.

Further, operating system context switching (either to other userspace processes, or to interrupt routines) may pollute the cache in ways we cannot predict.

The M1 chips have undocumented matrix-multiplication assembly instructions that only Apple can generate code for, which is where this speedup comes from (with OpenBLAS, the MBP takes ~8ms).
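To make the runtime dispatch concrete, here is a minimal sketch, not the actual OpenBLAS code: the kernel bodies are stand-in naive loops, and the feature probing uses GCC/Clang's __builtin_cpu_supports (which wraps cpuid) rather than OpenBLAS's own detection logic.

```cpp
#include <cstddef>

using sgemm_fn = void (*)(std::size_t, const float*, const float*, float*);

// Stand-in kernel body: a naive triple loop. In a real BLAS each named kernel
// below would be a separate, hand-tuned implementation for its microarchitecture.
static void sgemm_naive(std::size_t n, const float* A, const float* B, float* C) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < n; ++k)
                acc += A[i * n + k] * B[k * n + j];
            C[i * n + j] = acc;
        }
}

static void sgemm_kernel_HASWELL(std::size_t n, const float* A, const float* B, float* C)     { sgemm_naive(n, A, B, C); }
static void sgemm_kernel_SANDYBRIDGE(std::size_t n, const float* A, const float* B, float* C) { sgemm_naive(n, A, B, C); }
static void sgemm_kernel_GENERIC(std::size_t n, const float* A, const float* B, float* C)     { sgemm_naive(n, A, B, C); }

// Probe CPU features once (the compiler builtins query cpuid under the hood)
// and pick the matching kernel.
static sgemm_fn select_sgemm_kernel() {
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2") && __builtin_cpu_supports("fma"))
        return sgemm_kernel_HASWELL;
    if (__builtin_cpu_supports("avx"))
        return sgemm_kernel_SANDYBRIDGE;
    return sgemm_kernel_GENERIC;
}

void sgemm(std::size_t n, const float* A, const float* B, float* C) {
    static const sgemm_fn kernel = select_sgemm_kernel();  // resolved on first call
    kernel(n, A, B, C);
}
```

The dispatch cost is paid once; every call after that goes through a cached function pointer, which is why the per-architecture kernels have to live in the binary even though only one of them will ever run on a given machine.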