Mixture-of-Recursions delivers 2x faster inference: here's how to implement it

Mixture-of-Recursions (MoR) is a new AI architecture that promises to cut LLM inference costs and memory use without sacrificing performance.

MoR significantly improves model accuracy and delivers higher throughput than vanilla transformers, even when constrained to the same parameter count and compute budget. For each token, it decides how many times a shared block of layers should be applied based on the token's complexity, or its required "depth of thinking." This directs computation only where it is most needed, avoiding wasted cycles on easy-to-process parts of the input. The same principle could extend beyond text: by dynamically adjusting the processing depth for each segment of a video or audio stream, MoR could unlock even greater cost savings and performance improvements, bringing the power of large-scale AI to a wider range of enterprise applications.
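The article stops short of showing code, but the core mechanism is straightforward to sketch. Below is a minimal, hypothetical PyTorch sketch, not the paper's actual implementation: one shared transformer block is applied repeatedly, and a small router scores each token after every pass to decide whether it takes another recursion or exits early. The names (MoRBlock, router, max_recursions) and the 0.5 exit threshold are illustrative assumptions; the published method also involves trained routing strategies and KV-cache optimizations that this sketch omits.

```python
import torch
import torch.nn as nn

class MoRBlock(nn.Module):
    """Hypothetical sketch of a Mixture-of-Recursions step (not the paper's code).

    A single shared transformer block is applied up to `max_recursions`
    times; a lightweight router scores each token after every pass and
    decides whether it needs another round of "thinking" or can exit early.
    """

    def __init__(self, d_model=512, n_heads=8, max_recursions=4):
        super().__init__()
        # One set of weights reused at every recursion depth.
        self.shared_block = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        # Per-token scalar score -> probability of taking another pass.
        self.router = nn.Linear(d_model, 1)
        self.max_recursions = max_recursions

    def forward(self, x):
        # `active` marks tokens that still want more computation.
        active = torch.ones(x.shape[:2], dtype=torch.bool, device=x.device)
        for _ in range(self.max_recursions):
            if not active.any():
                break  # every token has exited early
            refined = self.shared_block(x)
            # Exited tokens keep their last hidden state; active ones update.
            x = torch.where(active.unsqueeze(-1), refined, x)
            # Router decides which tokens continue to the next recursion.
            keep_going = torch.sigmoid(self.router(x)).squeeze(-1) > 0.5
            active = active & keep_going
        return x

# Usage: a batch of 2 sequences, 16 tokens each, 512-dim hidden states.
model = MoRBlock().eval()
with torch.no_grad():
    out = model(torch.randn(2, 16, 512))
print(out.shape)  # torch.Size([2, 16, 512])
```

Note that this sketch still runs the shared block over the full batch and merely masks out exited tokens; a real implementation would gather only the active tokens before the block so that early exits actually save compute.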

Read the full story on VentureBeat.

Related news:

Entropy of a Mixture

AMD announces MI350X and MI355X AI GPUs, claims up to 4X generational performance gain, 35X faster inference

TradeExpert, a trading framework that employs Mixture of Expert LLMs