Optimizing a WebGPU Matmul Kernel for 1 TFLOP

Building Surfgrad, a high-performance, WebGPU-powered autograd library

As an educational exercise to learn WebGPU and TypeScript, I decided to build Surfgrad, a high-performance, WebGPU-powered autograd library that enables browser-based tensor operations. WebGPU recently introduced support for a feature that lets threads within a group share data efficiently, which is a big win for operations like matrix multiplication, where threads may otherwise recompute similar values. With manual unrolling, the GPU avoids the overhead of initializing and incrementing the inner loop counter, can take advantage of instruction-level parallelism, and amortizes launch cost by running fewer workgroups that each do more work per thread.
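To make the unrolling idea concrete, here is a minimal CPU-side sketch in TypeScript. It is only an analogy: it is not Surfgrad's WGSL kernel, and the function names `matmulNaive` and `matmulUnrolled4` are hypothetical, chosen for illustration. It assumes row-major `Float32Array` inputs and unrolls the inner (shared-dimension) loop by a factor of 4, using four independent accumulators so that the multiply-adds in one iteration do not depend on each other.

```typescript
// Hypothetical illustration; a CPU analogy, not the article's GPU kernel.
// Matrices are row-major: a is m x k, b is k x n, result is m x n.

// Baseline: one multiply-add and one loop-counter update per inner iteration.
function matmulNaive(
  a: Float32Array, b: Float32Array, m: number, k: number, n: number
): Float32Array {
  const out = new Float32Array(m * n);
  for (let row = 0; row < m; row++) {
    for (let col = 0; col < n; col++) {
      let sum = 0;
      for (let i = 0; i < k; i++) {
        sum += a[row * k + i] * b[i * n + col];
      }
      out[row * n + col] = sum;
    }
  }
  return out;
}

// Unrolled by 4: a quarter as many loop-counter updates, and four
// independent accumulators that expose instruction-level parallelism.
function matmulUnrolled4(
  a: Float32Array, b: Float32Array, m: number, k: number, n: number
): Float32Array {
  const out = new Float32Array(m * n);
  const kMain = k - (k % 4); // largest multiple of 4 not exceeding k
  for (let row = 0; row < m; row++) {
    for (let col = 0; col < n; col++) {
      let s0 = 0, s1 = 0, s2 = 0, s3 = 0;
      let i = 0;
      for (; i < kMain; i += 4) {
        s0 += a[row * k + i] * b[i * n + col];
        s1 += a[row * k + i + 1] * b[(i + 1) * n + col];
        s2 += a[row * k + i + 2] * b[(i + 2) * n + col];
        s3 += a[row * k + i + 3] * b[(i + 3) * n + col];
      }
      let sum = s0 + s1 + s2 + s3;
      // Remainder loop for k not divisible by 4.
      for (; i < k; i++) {
        sum += a[row * k + i] * b[i * n + col];
      }
      out[row * n + col] = sum;
    }
  }
  return out;
}
```

Because the unrolled version adds the terms in a different order, results on arbitrary floating-point inputs can differ from the baseline in the last bits. In a GPU kernel the same idea is typically applied per thread, with each thread producing several outputs so that fewer workgroups need to be dispatched.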