
Surprisingly fast AI-generated kernels we didn't mean to publish yet


We have some very fast AI-generated kernels in pure CUDA-C, written without libraries or DSLs such as CUTLASS and Triton. They perform close to, and in some cases even beat, the standard expert-optimized production kernels shipped in PyTorch.

The result is a test-time loop that looks less like sequential revision ("chat with a compiler") and more like structured exploratory search, guided by explicit optimization hypotheses and aggressively parallel evaluation. For example:

Round 6: 1.37 ms, 103.6% of reference. Idea: precompute and cache N-dimension GEMM indices in shared memory to reduce redundant arithmetic in the input data loading loop.

Round 10: 1.07 ms, 133.6% of reference. Idea: reuse the precomputed N-dimension GEMM decomposition from shared memory for output address calculation, avoiding redundant and costly division/modulo operations.
