A Visual Guide to LLM Quantization
Exploring memory-efficient techniques for LLMs
These models often exceed billions of parameters and generally require GPUs with large amounts of VRAM to speed up inference.

Unlike weights, activations vary with each input fed into the model during inference, which makes them challenging to quantize accurately (a short sketch of per-input activation quantization appears after the note below).

NOTE: The authors used several tricks to speed up computation and improve performance, such as adding a dampening factor to the Hessian, “lazy batching”, and precomputing information using the Cholesky method.
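The note above mentions dampening the Hessian and precomputing information with the Cholesky method. Below is a minimal NumPy sketch of what that pre-processing step can look like in a GPTQ-style solver; the 2·XXᵀ Hessian form, the 1% dampening fraction, and all function and variable names are assumptions for illustration rather than the authors' exact code.

```python
import numpy as np

def damped_cholesky_inverse_hessian(X, percdamp=0.01):
    """Sketch of the pre-processing a GPTQ-style solver can do before quantizing
    a layer's weights: build the Hessian from calibration activations, dampen its
    diagonal, and precompute a Cholesky factor of its inverse. The dampening
    fraction and the Hessian form here are illustrative assumptions."""
    H = 2.0 * X @ X.T                      # Hessian of the layer-wise squared error
    damp = percdamp * np.mean(np.diag(H))  # dampening factor: small fraction of the mean diagonal
    H[np.diag_indices_from(H)] += damp     # keeps H well-conditioned and invertible
    H_inv = np.linalg.inv(H)               # inverse Hessian guides how quantization error is redistributed
    return np.linalg.cholesky(H_inv)       # Cholesky factor, reusable for every row of weights

# Hypothetical usage with random calibration activations (features x samples)
X = np.random.randn(64, 256)
L = damped_cholesky_inverse_hessian(X)
print(L.shape)  # (64, 64)
```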
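As a rough illustration of why per-input variation matters, here is a hedged NumPy sketch of dynamic INT8 activation quantization, where the scale is recomputed for every input at inference time. The symmetric per-tensor scheme and all names are illustrative assumptions, not a method prescribed by the article.

```python
import numpy as np

def quantize_activations_int8(x):
    """Dynamic (per-input) symmetric INT8 quantization of an activation tensor.
    Because activations change with every input, the scale is recomputed at each
    forward pass here; a static scheme would instead fix it from a calibration set."""
    scale = max(np.max(np.abs(x)) / 127.0, 1e-8)                 # map observed range onto [-127, 127]
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)  # round and clamp to INT8
    return q, scale                                              # keep the scale to dequantize: q * scale ~ x

# Two different inputs produce two different scales
a = np.random.randn(4, 8).astype(np.float32)
b = 5.0 * np.random.randn(4, 8).astype(np.float32)
print(quantize_activations_int8(a)[1], quantize_activations_int8(b)[1])
```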