FP8 runs ~100 TFLOPS faster when the kernel name has "cutlass" in it


Rewrite the attention kernel to be persistent. This gives better performance at short context lengths. However, fp16 performance at large context lengths has suffered a bit due to a ptxas instruction scheduling issue in the so...

Related news:

LLVM/Clang 20.1 Released With AMX-AVX512, AMX-FP8, AVX10.2, AMD GFX950 & Much More

LLVM 20 Feature Development Wraps Up With AMX-AVX512, AMX-FP8, AVX10.2 & AMD GFX950