DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
During the inference decoding phase, when CUDA graphs are enabled and the CPU does not know how many tokens each expert receives, we support masked grouped GEMMs. Following the CUTLASS design, the kernels in DeepGEMM are warp-specialized, enabling the overlap of data movement, tensor-core MMA instructions, and CUDA-core promotion.

- Full unrolling of the MMA pipelines gives compilers more optimization opportunities
- Very important for small shapes
- Refer to `launch_k_iterations` in the kernel file for details
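To make the masked grouped GEMM semantics concrete, here is a minimal NumPy sketch of the reference behavior (this is a hypothetical helper for illustration, not DeepGEMM's actual API or FP8 kernel): each expert group's input buffer is padded to a fixed number of rows, a per-group mask count says how many rows hold real tokens, and only those rows are multiplied.

```python
import numpy as np

def masked_grouped_gemm_ref(lhs, rhs, masked_m):
    """Reference semantics of a masked grouped GEMM (illustrative only).

    lhs:      (num_groups, m_max, k) padded per-expert activations
    rhs:      (num_groups, k, n)     per-expert weights
    masked_m: per-group count of valid token rows; rows beyond the
              count are padding and are skipped, so the corresponding
              output rows stay zero.
    """
    num_groups, m_max, _ = lhs.shape
    n = rhs.shape[2]
    out = np.zeros((num_groups, m_max, n), dtype=lhs.dtype)
    for g in range(num_groups):
        rows = masked_m[g]
        # Compute only the valid prefix of rows for this expert group.
        out[g, :rows] = lhs[g, :rows] @ rhs[g]
    return out
```

Because `masked_m` lives on the device at kernel-launch time in the real setting, the same CUDA graph can be replayed every decoding step regardless of how tokens are routed; the host never needs the per-expert counts.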