
Gemlite: Towards Building Custom Low-Bit Fused CUDA Kernels


A set of CUDA kernels for building custom low-bit gemv.

In this section, we demonstrate how to use the Gemlite codebase to build a custom fused kernel that combines quantization and sparsity, running up to 3.5x faster than PyTorch's fp16 matmul. We start by reshaping the matrix so that the number of columns matches the number of threads per group (32 for a warp), then, for each pair of successive rows, we keep the element with the highest magnitude. On the CUDA side, we need to implement the dequantization step. This only requires changing the main body of the dot product loop, since the indexing logic remains unchanged.
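The pruning step described above can be sketched on the host side as follows. This is a minimal pure-Python illustration, not Gemlite's actual code: the function name `prune_pairs`, the flat-list input, and the returned 0/1 pick table (recording which row of each pair was kept) are all assumptions made for the example.

```python
THREADS_PER_GROUP = 32  # warp size, as stated in the text


def prune_pairs(weights, cols=THREADS_PER_GROUP):
    """Hypothetical helper: reshape a flat list of weights into rows of
    `cols` elements, then keep the larger-magnitude element from each
    pair of successive rows.

    Returns (kept, picks): `kept` holds the surviving values, and
    `picks` holds 0 if the value came from the first row of the pair
    or 1 if it came from the second row.
    """
    if len(weights) % (2 * cols) != 0:
        raise ValueError("len(weights) must be a multiple of 2 * cols")

    # Reshape: each row has one element per thread in the group.
    rows = [weights[i:i + cols] for i in range(0, len(weights), cols)]

    kept, picks = [], []
    # Walk over pairs of successive rows: (0, 1), (2, 3), ...
    for top, bottom in zip(rows[0::2], rows[1::2]):
        row_vals, row_picks = [], []
        for a, b in zip(top, bottom):
            take_bottom = abs(b) > abs(a)  # ties keep the first row
            row_vals.append(b if take_bottom else a)
            row_picks.append(1 if take_bottom else 0)
        kept.append(row_vals)
        picks.append(row_picks)
    return kept, picks


if __name__ == "__main__":
    # Tiny demo with 2 columns instead of 32 to keep the output readable.
    values, which_row = prune_pairs([1.0, -5.0, 3.0, 2.0], cols=2)
    print(values)     # surviving weights, one row per original row pair
    print(which_row)  # 0 = first row of the pair, 1 = second row
```

Under these assumptions, half of the weights are dropped, and the pick table is what a kernel would need in order to multiply each surviving weight by the matching input element. How Gemlite actually packs this information alongside the quantized weights is not shown here.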
