China claims domestically-designed 14nm logic chips can rival 4nm Nvidia silicon — architecture leverages 3D hybrid bonding techniques for claimed 120 TFLOPS of power
FP8 runs ~100 TFLOPS faster when the kernel name has "cutlass" in it
FP8 is ~100 TFLOPS faster when the kernel name has "cutlass" in it