Consistency LLM: converting LLMs to parallel decoders accelerates inference 3.5x


TL;DR: LLMs have traditionally been regarded as sequential decoders, decoding one token after another. In this blog, we show that pretrained LLMs can be easily taught to operate as efficient parallel decoders. We introduce Consistency Large Language Models (CLLMs), a new family of parallel decoders capable of reducing inference latency by efficiently decoding an $n$-token sequence per inference step. Our research shows this process – mimicking the human cognitive process of forming complete sentences in mind before articulating them word by word – can be effectively learned by simply finetuning pretrained LLMs.
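
To illustrate what decoding an $n$-token block per step looks like in practice, here is a minimal, hypothetical sketch of Jacobi-style fixed-point decoding. The `model` callable and its `(seq_len, vocab)` logits signature are assumptions for illustration, not the CLLM codebase.

```python
# Minimal sketch of Jacobi-style parallel decoding (n tokens per fixed-point
# iteration). The `model` callable returning per-position logits is an
# illustrative assumption, not the authors' implementation.
import torch

def jacobi_decode(model, prefix_ids, n, max_iters=32, pad_id=0):
    """Decode an n-token block by iterating a parallel guess to a fixed point."""
    # Start from an arbitrary n-token guess (here: pad tokens).
    guess = torch.full((n,), pad_id, dtype=torch.long)
    for _ in range(max_iters):
        ids = torch.cat([prefix_ids, guess])   # prefix + current guess
        logits = model(ids)                    # one parallel forward pass
        # Greedily re-predict every guess position from the tokens before it.
        start = prefix_ids.numel() - 1
        new_guess = logits[start : start + n].argmax(dim=-1)
        if torch.equal(new_guess, guess):      # converged: matches greedy AR output
            return new_guess
        guess = new_guess
    return guess
```

At the fixed point, the block agrees with what greedy autoregressive decoding would have produced one token at a time, which is why convergence in few iterations translates directly into lower latency.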

During training, CLLMs optimize a joint objective: a consistency loss that teaches the model to map any intermediate state on a Jacobi decoding trajectory to its fixed point, plus a standard autoregressive (AR) loss, weighted by $w$, that preserves generation quality:

$$ \mathcal{L}(\theta) = \mathcal{L}_{\text{consistency}} + w\mathcal{L}_{\text{AR}} $$

Our experiments cover three domain-specific tasks, Spider (text-to-SQL), Human-Eval (Python code completion), and GSM8K (math), as well as the broader open-domain conversational benchmark, MT-bench.

In target LLMs, tokens that are correctly generated in advance (e.g., "country" and "H" at indices 6 and 7 on the left side of Figure 6) are often inaccurately replaced in subsequent iterations. In contrast, we observe that CLLMs acquire a crucial linguistic concept through training – collocations: a series of words or terms that co-occur more frequently than one would expect by random chance.
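
As a sketch of how this objective could be assembled in code, the snippet below combines a consistency term (cross-entropy against the Jacobi fixed point) with a standard AR cross-entropy term weighted by $w$. The tensor shapes, the function name, and the cross-entropy form of the consistency term are illustrative assumptions, not the authors' exact implementation.

```python
# Hedged sketch of the training objective L = L_consistency + w * L_AR.
# Shapes and names below are assumptions for illustration only.
import torch
import torch.nn.functional as F

def cllm_loss(traj_logits, fixed_point_ids, ar_logits, ar_target_ids, w=1.0):
    """traj_logits:     (n, vocab) logits from an intermediate Jacobi state
    fixed_point_ids: (n,) token ids of the converged Jacobi fixed point
    ar_logits:       (m, vocab) next-token logits on ground-truth text
    ar_target_ids:   (m,) ground-truth next tokens
    """
    # Consistency term: map any point on the Jacobi trajectory to its fixed point.
    l_consistency = F.cross_entropy(traj_logits, fixed_point_ids)
    # AR term: ordinary next-token loss to preserve generation quality.
    l_ar = F.cross_entropy(ar_logits, ar_target_ids)
    return l_consistency + w * l_ar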
