DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
During the inference decoding phase, when CUDA graphs are enabled and the CPU does not know how many tokens each expert receives, we support masked grouped GEMMs. Following the CUTLASS design, the kernels in DeepGEMM are warp-specialized, enabling the overlap of data movement, tensor-core MMA instructions, and CUDA-core promotion.

- Full unrolling of the MMA pipelines gives compilers more optimization opportunities
- Very important for small shapes
- Refer to `launch_k_iterations` in the kernel file for details
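To make the masked grouped GEMM semantics concrete, here is a minimal NumPy sketch of the reference behavior (this is a hypothetical helper for illustration, not DeepGEMM's actual API or FP8 kernel): each expert group's input buffer is padded to a fixed number of rows, a per-group mask count says how many rows hold real tokens, and only those rows are multiplied.

```python
import numpy as np

def masked_grouped_gemm_ref(lhs, rhs, masked_m):
    """Reference semantics of a masked grouped GEMM (illustrative only).

    lhs:      (num_groups, m_max, k) padded per-expert activations
    rhs:      (num_groups, k, n)     per-expert weights
    masked_m: per-group count of valid token rows; rows beyond the
              count are padding and are skipped, so the corresponding
              output rows stay zero.
    """
    num_groups, m_max, _ = lhs.shape
    n = rhs.shape[2]
    out = np.zeros((num_groups, m_max, n), dtype=lhs.dtype)
    for g in range(num_groups):
        rows = masked_m[g]
        # Compute only the valid prefix of rows for this expert group.
        out[g, :rows] = lhs[g, :rows] @ rhs[g]
    return out
```

Because `masked_m` lives on the device at kernel-launch time in the real setting, the same CUDA graph can be replayed every decoding step regardless of how tokens are routed; the host never needs the per-expert counts.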