Optimizing a WebGPU Matmul Kernel for 1 TFLOP


Building Surfgrad, a high-performance, WebGPU-powered autograd library

As an educational exercise to learn WebGPU and TypeScript, I decided to build Surfgrad, a high-performance, WebGPU-powered autograd library that enables browser-based tensor operations. WebGPU recently introduced support for a feature that lets threads within a group share data efficiently, which is a big win for operations like matrix multiplication, where neighboring threads may otherwise recompute similar values. With manual unrolling, the GPU avoids the overhead of initializing and incrementing the inner loop, can take advantage of instruction-level parallelism, and amortizes launch cost by dispatching fewer workgroups that each do more work per thread.
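To make the unrolling idea concrete, here is a minimal CPU-side sketch in TypeScript. It is not the Surfgrad kernel and runs no WebGPU code. It only mirrors the access pattern: a naive version that computes one output per inner loop, and a version where each iteration of the column loop (standing in for a GPU thread) computes four adjacent outputs with the four accumulations written out by hand. The function names `matmulNaive`, `matmulUnrolled4`, and `matmulFlops`, and the tile width of 4, are assumptions chosen for illustration.

```typescript
// Matrices are row-major flat arrays, similar to how they would sit in a GPU buffer.
// a is m x k, b is k x n, the result is m x n.
function matmulNaive(
  a: Float32Array, b: Float32Array, m: number, k: number, n: number
): Float32Array {
  const c = new Float32Array(m * n);
  for (let row = 0; row < m; row++) {
    for (let col = 0; col < n; col++) {
      let sum = 0;
      for (let i = 0; i < k; i++) {
        sum += a[row * k + i] * b[i * n + col];
      }
      c[row * n + col] = sum;
    }
  }
  return c;
}

// Hypothetical unrolled variant: each "thread" (one iteration of the col loop)
// produces 4 adjacent outputs. The four accumulators are independent, which is
// what gives a GPU room for instruction-level parallelism, and each value of
// `a` is loaded once and reused 4 times.
function matmulUnrolled4(
  a: Float32Array, b: Float32Array, m: number, k: number, n: number
): Float32Array {
  if (n % 4 !== 0) {
    throw new Error("this sketch assumes n is a multiple of 4");
  }
  const c = new Float32Array(m * n);
  for (let row = 0; row < m; row++) {
    for (let col = 0; col < n; col += 4) {
      let s0 = 0, s1 = 0, s2 = 0, s3 = 0;
      for (let i = 0; i < k; i++) {
        const aVal = a[row * k + i];
        const bBase = i * n + col;
        s0 += aVal * b[bBase];
        s1 += aVal * b[bBase + 1];
        s2 += aVal * b[bBase + 2];
        s3 += aVal * b[bBase + 3];
      }
      const cBase = row * n + col;
      c[cBase] = s0;
      c[cBase + 1] = s1;
      c[cBase + 2] = s2;
      c[cBase + 3] = s3;
    }
  }
  return c;
}

// A matmul performs one multiply and one add per (row, col, i) triple.
function matmulFlops(m: number, k: number, n: number): number {
  return 2 * m * k * n;
}
```

Dividing `matmulFlops(m, k, n)` by the measured kernel time in seconds gives the FLOP/s figure that a target like 1 TFLOP refers to. On a real GPU the unrolled variant would also need a quarter as many threads along the column dimension, which is where the "fewer workgroups, more work per thread" trade-off comes from.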
