Optimizing a WebGPU Matmul Kernel for 1 TFLOP
Building Surfgrad, a high-performance, WebGPU-powered autograd library
As an educational exercise to learn WebGPU and TypeScript, I decided to build Surfgrad, a high-performance, WebGPU-powered autograd library that enables browser-based tensor operations. WebGPU just introduced support for a feature that allows threads within a group to efficiently share data, which is a big win for things like matrix multiplies, where you may otherwise recalculate similar values. Because of the manual unrolling, the GPU can reduce overhead by not having to initialize and increment the inner loop, take advantage of instruction-level parallelism, and amortize launch cost by dispatching fewer workgroups that each do more work per thread.
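To make the manual unrolling idea concrete, here is a minimal CPU-side sketch in TypeScript. It is not Surfgrad's kernel and not WebGPU shader code. The function names (`matmulNaive`, `matmulUnrolled4`) and the unroll factor of 4 are assumptions chosen for illustration. The unrolled version computes four adjacent outputs per step with explicitly written accumulators, which is the CPU analogue of doing more work per thread.

```typescript
// Illustration only: hypothetical helpers, not part of Surfgrad or WebGPU.
// Matrices are row-major: A is m x k, B is k x n, C is m x n.

function matmulNaive(
  a: Float32Array, b: Float32Array, m: number, k: number, n: number
): Float32Array {
  const c = new Float32Array(m * n);
  for (let row = 0; row < m; row++) {
    for (let col = 0; col < n; col++) {
      let sum = 0;
      for (let i = 0; i < k; i++) {
        sum += a[row * k + i] * b[i * n + col];
      }
      c[row * n + col] = sum;
    }
  }
  return c;
}

function matmulUnrolled4(
  a: Float32Array, b: Float32Array, m: number, k: number, n: number
): Float32Array {
  const c = new Float32Array(m * n);
  const nMain = n - (n % 4); // columns covered by the unrolled path
  for (let row = 0; row < m; row++) {
    for (let col = 0; col < nMain; col += 4) {
      // Four independent accumulators written out by hand instead of
      // an inner loop over the tile; each A value is loaded once and
      // reused for all four outputs.
      let s0 = 0, s1 = 0, s2 = 0, s3 = 0;
      for (let i = 0; i < k; i++) {
        const aVal = a[row * k + i];
        const bBase = i * n + col;
        s0 += aVal * b[bBase];
        s1 += aVal * b[bBase + 1];
        s2 += aVal * b[bBase + 2];
        s3 += aVal * b[bBase + 3];
      }
      const cBase = row * n + col;
      c[cBase] = s0;
      c[cBase + 1] = s1;
      c[cBase + 2] = s2;
      c[cBase + 3] = s3;
    }
    // Remainder columns when n is not a multiple of 4.
    for (let col = nMain; col < n; col++) {
      let sum = 0;
      for (let i = 0; i < k; i++) {
        sum += a[row * k + i] * b[i * n + col];
      }
      c[row * n + col] = sum;
    }
  }
  return c;
}
```

Both functions accumulate each output in the same order, so they should produce identical results. Only the amount of work per loop step differs. On a GPU the same restructuring would live in the shader, with each thread owning a small tile of outputs.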