DeepSeek unveils new technique for smarter, scalable AI reward models


Reward models holding back AI? DeepSeek's SPCT creates self-guiding critiques, promising more scalable intelligence for enterprise LLMs.

“By leveraging rule-based online RL, SPCT enables GRMs to learn to adaptively posit principles and critiques based on the input query and responses, leading to better outcome rewards in general domains,” the researchers write.

Because sampled principles and critiques vary in quality, the researchers also introduced a “meta RM”—a separate, lightweight scalar RM trained specifically to predict whether a principle/critique generated by the primary GRM will likely lead to a correct final reward. At inference time, it filters out low-quality samples before the remaining rewards are aggregated.

Potential areas that can benefit from generalist RMs include creative tasks and applications where the model must adapt to dynamic environments, such as evolving customer preferences.
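The sampling-and-filtering loop described above can be sketched as follows. This is a minimal illustration, not DeepSeek's implementation: the GRM and meta RM here are stand-in stub functions (`grm_generate`, `meta_rm_score` are hypothetical names), and the aggregation is a simple score sum over the kept samples.

```python
import random

def grm_generate(query, responses, seed):
    """Stand-in for the primary GRM: sample one set of principles plus a
    critique that scores each candidate response. Scores are random here;
    a real GRM generates them conditioned on the query and responses."""
    rng = random.Random(seed)
    return {
        "principles": f"principles-sample-{seed}",
        "scores": [rng.randint(1, 10) for _ in responses],
        # Hidden stand-in for how sound this particular critique is.
        "quality": rng.random(),
    }

def meta_rm_score(sample):
    """Stand-in for the meta RM: a scalar predicting whether this sampled
    principle/critique will lead to a correct final reward. Here it just
    reads the stub's quality field; a real meta RM is a trained model."""
    return sample["quality"]

def scaled_reward(query, responses, k=16, keep=8):
    """Inference-time scaling sketch: sample k principle/critique sets,
    keep the top `keep` by meta RM score, sum per-response scores across
    the kept samples, and return the index of the best response."""
    samples = [grm_generate(query, responses, seed=i) for i in range(k)]
    kept = sorted(samples, key=meta_rm_score, reverse=True)[:keep]
    totals = [sum(s["scores"][i] for s in kept) for i in range(len(responses))]
    return totals.index(max(totals)), totals

best, totals = scaled_reward("Which reply is most helpful?", ["a", "b", "c"])
```

The key design point is that spending more compute at inference (a larger `k`) only helps if low-quality critiques are filtered out before voting, which is the role the meta RM plays.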

Or read this on VentureBeat

Related news:

DeepSeek and Tsinghua Developing Self-Improving AI Models

Meta’s answer to DeepSeek is here: Llama 4 launches with long context Scout and Maverick models, and 2T parameter Behemoth on the way!

DeepSeek jolts AI industry: Why AI’s next leap may not come from more data, but more compute at inference