Gemlite: Towards Building Custom Low-Bit Fused CUDA Kernels
A set of CUDA kernels for building custom low-bit gemv.
In this section, we demonstrate how to use the Gemlite codebase to build a custom fused kernel that combines quantization and sparsity, running up to 3.5x faster than PyTorch's fp16 matmul. We start by reshaping the matrix so that the number of columns matches the number of threads per group (32 for a warp). Then, for each pair of successive rows, we keep only the element with the highest magnitude. On the CUDA side, we need to implement the dequantization step. This only requires changing the main body of the dot product loop, since the indexing logic remains unchanged.
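The reshape-and-select step can be sketched in a few lines. This is a minimal illustration in plain Python, not Gemlite's actual code: the function name `prune_row_pairs`, the flat-list input, the tie-breaking rule, and the returned mask layout are all assumptions made for the example. A real pipeline would most likely operate on PyTorch tensors before packing the result for the kernel.

```python
WARP_SIZE = 32  # threads per group (one warp), as stated in the text


def prune_row_pairs(weights, cols=WARP_SIZE):
    """Hypothetical helper: reshape a flat list of weights into rows of
    `cols` elements, then keep the larger-magnitude element from each
    pair of successive rows.

    Returns (kept, picked_second). `kept` has half as many rows as the
    reshaped matrix. `picked_second[i][j]` is 1 when the kept value came
    from the second row of pair i, and 0 otherwise.
    """
    if len(weights) % (2 * cols) != 0:
        raise ValueError("len(weights) must be a multiple of 2 * cols")

    # Reshape: one row per run of `cols` consecutive weights.
    rows = [weights[i:i + cols] for i in range(0, len(weights), cols)]

    kept, picked_second = [], []
    for r in range(0, len(rows), 2):
        top, bottom = rows[r], rows[r + 1]
        # Ties go to the first row of the pair (an arbitrary choice here).
        mask = [1 if abs(b) > abs(t) else 0 for t, b in zip(top, bottom)]
        kept.append([b if m else t for t, b, m in zip(top, bottom, mask)])
        picked_second.append(mask)
    return kept, picked_second
```

Under these assumptions, the kernel would need both outputs: the kept values (which would then be quantized) and the one-bit-per-element mask that records which row of each pair survived, so that the dot product can be matched against the right input element.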