
Multiplatform Matrix Multiplication Kernels


We implemented a sophisticated matrix multiplication engine in CubeCL that rivals the performance of cuBLAS and CUTLASS while supporting a wider range of GPUs. Leveraging double buffering, tensor cores, and vectorization, it compiles seamlessly to CUDA, ROCm, WebGPU, Metal, and Vulkan backends without relying on proprietary or third-party binaries. Matrix multiplication is central to modern AI workloads, especially transformers, and optimizing it ourselves was essential to enable kernel fusion and to achieve state-of-the-art performance across platforms within a deep learning framework.

Zooming in on what a single plane does during the Stage Matmul, the most efficient approach is an outer product strategy: process one slice along the k dimension at a time, performing all related computations before moving on to the next and accumulating the results. The catch is that the plane would stall while the next Rhs slice loads; to avoid that, we use an extra tile for Rhs, double buffering as we advance through k. Even so, loads cannot always be fully hidden, and notably, this stall is a good opportunity to perform unrelated work, such as injecting Global Matmul instructions via the StageEventListener. As mentioned earlier, adding plane specialization to double buffering can improve resource efficiency by separating compute and load tasks, which reduces register pressure and increases SM occupancy.
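To make the double-buffered outer product concrete, here is a minimal CPU sketch in Rust. This is not CubeCL code: the tile sizes and the `load_rhs_row` helper are illustrative stand-ins for what a single plane does in the Stage Matmul, where the prefetch into the spare Rhs tile would overlap with the math instead of running sequentially.

```rust
const M: usize = 4; // rows of the accumulator handled by this plane
const N: usize = 4; // columns of the accumulator handled by this plane
const K: usize = 8; // shared dimension of the stage

// Illustrative stand-in for fetching the k-th row of the Rhs stage into a tile.
fn load_rhs_row(rhs: &[[f32; N]; K], k: usize, buf: &mut [f32; N]) {
    buf.copy_from_slice(&rhs[k]);
}

fn stage_matmul(lhs: &[[f32; K]; M], rhs: &[[f32; N]; K]) -> [[f32; N]; M] {
    let mut acc = [[0.0f32; N]; M];
    // Two Rhs tiles: while one is consumed, the other is filled (double buffering).
    let mut bufs = [[0.0f32; N]; 2];
    load_rhs_row(rhs, 0, &mut bufs[0]); // prologue: prefetch the first slice

    for k in 0..K {
        let cur = k % 2;
        // Prefetch the next slice into the spare tile. On a GPU this load
        // overlaps with the outer product below instead of stalling the plane.
        if k + 1 < K {
            load_rhs_row(rhs, k + 1, &mut bufs[(k + 1) % 2]);
        }
        // Outer product of the k-th Lhs column with the k-th Rhs row,
        // accumulated into the results before moving to the next slice.
        for i in 0..M {
            for j in 0..N {
                acc[i][j] += lhs[i][k] * bufs[cur][j];
            }
        }
    }
    acc
}

fn main() {
    let lhs = [[1.0f32; K]; M];
    let rhs = [[1.0f32; N]; K];
    let acc = stage_matmul(&lhs, &rhs);
    // Every entry is the sum of K products of ones.
    assert!(acc.iter().flatten().all(|&v| v == K as f32));
    println!("ok: {:?}", acc[0]);
}
```

On a GPU the two buffers would live in registers or shared memory and the prefetch would be an asynchronous copy, but the control flow is the same: swap buffers on each step along k, computing on one tile while the other fills.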
