Beating cuBLAS in Single-Precision General Matrix Multiplication
In this blog post, we’ll walk through an implementation of the SGEMM (Single-precision GEneral Matrix Multiply) operation, defined as C := alpha*A*B + beta*C. We will review three different kernels, each optimized for a specific range of matrix sizes. Our final implementation is tuned for the Ampere architecture and outperforms cuBLAS across a wide range of matrix sizes.
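For reference, here is a minimal, unoptimized CUDA kernel for this operation; it is only a sketch to pin down the semantics, not one of the kernels discussed in this post. It assumes column-major storage (the BLAS convention), with A of size M×K, B of size K×N, C of size M×N, and one thread per output element.

```cuda
// Naive reference SGEMM: C := alpha*A*B + beta*C (column-major, sketch only).
__global__ void sgemm_naive(int M, int N, int K, float alpha,
                            const float *A, const float *B,
                            float beta, float *C) {
    int row = blockIdx.x * blockDim.x + threadIdx.x; // row index into C
    int col = blockIdx.y * blockDim.y + threadIdx.y; // column index into C
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += A[k * M + row] * B[col * K + k]; // A[row,k] * B[k,col]
        C[col * M + row] = alpha * acc + beta * C[col * M + row];
    }
}
```

Every kernel in this post computes exactly this result; the differences lie entirely in how the work is tiled and scheduled across the GPU.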
This project is inspired by the outstanding work of Andrej Karpathy, George Hotz, Scott Gray, Horace He, Philippe Tillet, Jeremy Howard, Lei Mao, and the best CUDA hackers from the GPU MODE community (Discord server). The post covers benchmarking code on CUDA devices and explains the algorithm’s design along with the optimization techniques used. The achieved performance is shown below, comparing results with locked and unlocked GPU core frequencies against cuBLAS and Simon Boehm’s highly cited work (used in llamafile, aka tinyBLAS).