I made a kernel 2.2x faster. It made my training loop 3x slower


I wrote a fused decode-attention kernel for an RL training loop, got it 2.2× faster than the SDPA path it replaces at the microbenchmark level, dropped it in...

Related news:

Ubuntu 26.10 Planning To Ship With The Linux 7.2 Kernel

CVE-2026-28952: Apple macOS 26.5 Kernel Vuln found by Claude

FatGid: FreeBSD 14.x kernel local privilege escalation