I made a kernel 2.2× faster. It made my training loop 3× slower
I wrote a fused decode-attention kernel for an RL training loop, got it 2.2× faster than the SDPA path it replaces at the microbenchmark level, dropped it in...