FP8 runs ~100 TFLOPS faster when the kernel name has "cutlass" in it
Rewrite the attention kernel to be persistent. This gives better performance at small context lengths. However, fp16 performance at large context lengths has suffered a bit due to a ptxas instruction scheduling issue in the so...