VPTQ: Extreme low-bit Quantization for real LLMs
VPTQ, A Flexible and Extreme low-bit quantization algorithm - microsoft/VPTQ
Due to the redundancy in LLM weights, recent research has focused on pushing weight-only quantization to extremely low bit-widths (even down to 2 bits). The following experimental results are for reference only; VPTQ can achieve better outcomes under reasonable parameters, especially in terms of model accuracy and inference speed. We thank James Hensman for his crucial insights into the error analysis of Vector Quantization (VQ); his comments on LLM evaluation were invaluable to this research.
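To illustrate the core idea behind weight-only vector quantization, here is a minimal sketch (not VPTQ's actual algorithm, which adds error-aware optimizations): the flattened weights are split into short vectors, a small codebook is learned with a few k-means iterations, and each vector is stored as a codebook index. All names and parameters below are hypothetical; with 16 centroids over 4-element vectors, each index costs 4 bits, i.e. roughly 1 bit per weight before codebook overhead.

```python
import numpy as np

def vq_quantize(weights, vector_len=4, num_centroids=16, iters=10, seed=0):
    """Toy vector quantization of a weight matrix (illustrative only).

    Splits the flattened weights into vectors of length `vector_len`,
    learns a codebook with a few k-means iterations, and returns the
    per-vector codebook indices plus the codebook.
    """
    rng = np.random.default_rng(seed)
    vecs = weights.reshape(-1, vector_len)
    # Initialize centroids from randomly chosen weight vectors.
    codebook = vecs[rng.choice(len(vecs), num_centroids, replace=False)]
    for _ in range(iters):
        # Assign each vector to its nearest centroid (squared L2 distance).
        dists = ((vecs[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        idx = dists.argmin(1)
        # Move each centroid to the mean of its assigned vectors.
        for c in range(num_centroids):
            members = vecs[idx == c]
            if len(members):
                codebook[c] = members.mean(0)
    return idx, codebook

def vq_dequantize(idx, codebook, shape):
    """Reconstruct an approximate weight matrix from indices + codebook."""
    return codebook[idx].reshape(shape)
```

At inference time only the index tensor and the small codebook need to be kept; a dequantization step (or a fused lookup kernel) reconstructs approximate weights on the fly.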