Quantized Llama models with increased speed and a reduced memory footprint


As our first quantized models in this Llama category, these instruction-tuned models retain the quality and safety of the original 1B and 3B models,...

Since their release, we’ve seen not only how the community has adopted our lightweight models, but also how grassroots developers are quantizing them to reduce capacity and memory footprint, often at a cost in performance and accuracy. Starting today, the community can deploy our quantized models onto more mobile CPUs, giving them the opportunity to build experiences that are fast and more private, since interactions stay entirely on device. Our partners have already integrated foundational components in the ExecuTorch open source ecosystem to leverage NPUs, and work is underway to specifically enable quantization on NPU for Llama 1B/3B.
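To make the memory/accuracy tradeoff concrete, here is a minimal sketch of symmetric per-tensor int8 weight quantization, the general kind of technique used to shrink a model's memory footprint. This is an illustrative example only, not Llama's actual quantization scheme; the function names are hypothetical.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map float32 weights to int8 plus a per-tensor scale.

    Illustrative symmetric quantization; real schemes are often
    per-channel or per-group and may use calibration data.
    """
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 weights from int8 + scale."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, scale = quantize_int8(w)

# int8 storage is 4x smaller than float32.
print(w.nbytes // q.nbytes)

# Round-trip error is bounded by the quantization step (the accuracy cost).
print(float(np.max(np.abs(w - dequantize(q, scale)))) <= scale)
```

The 4x storage reduction is exactly the "reduced memory footprint" the post describes, and the bounded round-trip error is the accuracy tradeoff grassroots developers accept when quantizing weights themselves.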
