Quantized Llama models with increased speed and a reduced memory footprint
As our first quantized models in this Llama category, these instruction-tuned models retain the quality and safety of the original 1B and 3B models …
Since their release, we’ve seen not only how the community has adopted our lightweight models, but also how grassroots developers are quantizing them to reduce size and memory footprint, often at some cost to performance and accuracy. Starting today, the community can deploy our quantized models on more mobile CPUs, opening the door to unique experiences that are fast and more private, since interactions stay entirely on device. Our partners have already integrated foundational components into the ExecuTorch open source ecosystem to leverage NPUs, and work is underway to enable NPU quantization specifically for Llama 1B/3B.
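To make the size/accuracy tradeoff concrete, here is a minimal sketch of the kind of post-hoc quantization grassroots developers apply, using PyTorch’s dynamic int8 quantization API. This is not the scheme behind the models announced here; the model id and the `size_mb` helper are illustrative assumptions.

```python
# Illustrative sketch only -- not Meta's production quantization pipeline.
# Dynamic int8 quantization of a Llama checkpoint with PyTorch's built-in API.
import io

import torch
from transformers import AutoModelForCausalLM

model_id = "meta-llama/Llama-3.2-1B-Instruct"  # assumed id; requires HF access
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

# Replace nn.Linear weights with packed int8 weights; activations stay fp32
# and are quantized on the fly at inference time (CPU-oriented).
quantized = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

def size_mb(m: torch.nn.Module) -> float:
    """Serialized state_dict size in megabytes, a rough footprint proxy."""
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.getbuffer().nbytes / 1e6

print(f"fp32: {size_mb(model):.0f} MB, int8: {size_mb(quantized):.0f} MB")
```

Dynamic quantization needs no calibration data, which is why it is a common first step; it also illustrates the tradeoff above, since naively quantized weights typically lose some accuracy relative to the original model.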