SeedLM: Compressing LLM Weights into Seeds of Pseudo-Random Generators
Large Language Models (LLMs) have transformed natural language processing, but face significant challenges in widespread deployment due to…
Specifically, for each block of weights, we find a seed that is fed into a Linear Feedback Shift Register (LFSR) during inference to efficiently generate a random matrix. Our experiments with Llama 3 70B, which is particularly challenging to compress, show zero-shot accuracy retention at 4- and 3-bit compression on par with or better than state-of-the-art methods, while maintaining performance comparable to FP16 baselines. Additionally, FPGA-based tests demonstrate that, as model size increases, 4-bit SeedLM approaches a 4x speed-up over an FP16 Llama 2/3 baseline.
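To make the idea concrete, here is a minimal Python sketch of the block-compression loop described above: a 16-bit LFSR expands a candidate seed into a pseudo-random matrix, a few coefficients per block are fit by least squares, and the seed with the lowest reconstruction error is kept. The block size, number of coefficients, seed search range, and tap positions are illustrative assumptions, not the paper's actual configuration; the paper also quantizes the stored coefficients, which is omitted here.

```python
import numpy as np

def lfsr_sequence(seed: int, n: int, width: int = 16,
                  taps=(16, 14, 13, 11)) -> np.ndarray:
    """Fibonacci-style LFSR: expand a seed into n pseudo-random values in [-1, 1).
    These taps are a common 16-bit choice; the paper's feedback polynomial
    may differ."""
    mask = (1 << width) - 1
    state = seed & mask
    assert state != 0, "LFSR state must be non-zero"
    out = np.empty(n)
    for i in range(n):
        # XOR the tapped bits to form the feedback bit, then shift it in.
        fb = 0
        for t in taps:
            fb ^= (state >> (t - 1)) & 1
        state = ((state << 1) | fb) & mask
        # Map the integer state to a value in [-1, 1).
        out[i] = state / (1 << (width - 1)) - 1.0
    return out

def reconstruct(seed: int, coeffs: np.ndarray, block_len: int) -> np.ndarray:
    """Regenerate the random matrix from the seed at inference time and
    linearly combine its columns with the stored coefficients."""
    u = lfsr_sequence(seed, block_len * len(coeffs)).reshape(block_len, -1)
    return u @ coeffs

def compress_block(w: np.ndarray, n_coeffs: int = 4, n_seeds: int = 256):
    """Search candidate seeds; for each, fit coefficients by least squares
    and keep the (seed, coeffs) pair with the smallest reconstruction error."""
    best = (None, None, np.inf)
    for seed in range(1, n_seeds + 1):
        u = lfsr_sequence(seed, w.size * n_coeffs).reshape(w.size, n_coeffs)
        coeffs, *_ = np.linalg.lstsq(u, w, rcond=None)
        err = np.linalg.norm(u @ coeffs - w)
        if err < best[2]:
            best = (seed, coeffs, err)
    return best

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=8)  # a toy 8-weight block
seed, coeffs, err = compress_block(w)
print(f"seed={seed}, error={err:.4f}")
print("reconstruction:", reconstruct(seed, coeffs, w.size))
```

The payoff shown by the sketch is that only the seed and a handful of coefficients need to be stored per block; the random matrix is regenerated on the fly at inference, trading cheap compute for reduced memory traffic.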