VPTQ: Extreme low-bit Quantization for real LLMs
VPTQ, A Flexible and Extreme low-bit quantization algorithm - microsoft/VPTQ
Due to the redundancy in LLM weights, recent research has focused on pushing weight-only quantization to extremely low bit-widths (even down to 2 bits). The following experimental results are for reference only; VPTQ can achieve better outcomes under reasonable parameters, especially in terms of model accuracy and inference speed. We thank James Hensman for his crucial insights into the error analysis of Vector Quantization (VQ); his comments on LLM evaluation were invaluable to this research.
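To illustrate the core idea behind weight-only vector quantization, here is a minimal sketch (not VPTQ's actual algorithm, which adds error-aware optimizations): the flattened weights are split into short vectors, a small codebook is learned with a few k-means iterations, and each vector is stored as a codebook index. All names and parameters below are hypothetical; with 16 centroids over 4-element vectors, each index costs 4 bits, i.e. roughly 1 bit per weight before codebook overhead.

```python
import numpy as np

def vq_quantize(weights, vector_len=4, num_centroids=16, iters=10, seed=0):
    """Toy vector quantization of a weight matrix (illustrative only).

    Splits the flattened weights into vectors of length `vector_len`,
    learns a codebook with a few k-means iterations, and returns the
    per-vector codebook indices plus the codebook.
    """
    rng = np.random.default_rng(seed)
    vecs = weights.reshape(-1, vector_len)
    # Initialize centroids from randomly chosen weight vectors.
    codebook = vecs[rng.choice(len(vecs), num_centroids, replace=False)]
    for _ in range(iters):
        # Assign each vector to its nearest centroid (squared L2 distance).
        dists = ((vecs[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        idx = dists.argmin(1)
        # Move each centroid to the mean of its assigned vectors.
        for c in range(num_centroids):
            members = vecs[idx == c]
            if len(members):
                codebook[c] = members.mean(0)
    return idx, codebook

def vq_dequantize(idx, codebook, shape):
    """Reconstruct an approximate weight matrix from indices + codebook."""
    return codebook[idx].reshape(shape)
```

At inference time only the index tensor and the small codebook need to be kept; a dequantization step (or a fused lookup kernel) reconstructs approximate weights on the fly.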