DeepSeek unveils new technique for smarter, scalable AI reward models
Reward models holding back AI? DeepSeek's SPCT creates self-guiding critiques, promising more scalable intelligence for enterprise LLMs.
“By leveraging rule-based online RL, SPCT enables GRMs to learn to adaptively posit principles and critiques based on the input query and responses, leading to better outcome rewards in general domains,” the researchers write.

Not every principle or critique the GRM generates is equally reliable, however. To filter out low-quality generations, the researchers introduced a “meta RM”: a separate, lightweight scalar RM trained specifically to predict whether a principle/critique generated by the primary GRM is likely to lead to a correct final reward.

Potential areas that can benefit from generalist RMs include creative tasks and applications where the model must adapt to dynamic environments, such as evolving customer preferences.
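The meta-RM filtering step can be pictured as a simple sample-filter-vote loop: draw several principle/critique sets from the primary GRM, keep only those the meta RM scores as likely reliable, then vote over the surviving judgments. The sketch below is a hypothetical illustration of that control flow, not DeepSeek's implementation; the function names, the `preferred`/`conf` fields, and the toy stand-in models are all assumptions.

```python
from collections import Counter

def meta_rm_filtered_vote(query, responses, generate_critique, meta_rm_score,
                          num_samples=8, top_k=4):
    """Hypothetical sketch of meta-RM-guided voting.

    Sample several principle/critique sets from the primary GRM, keep the
    top_k the meta RM judges most likely to yield a correct reward, then
    majority-vote over the surviving per-response judgments.
    """
    samples = [generate_critique(query, responses) for _ in range(num_samples)]
    # Rank samples by the meta RM's predicted reliability (higher is better).
    ranked = sorted(samples,
                    key=lambda s: meta_rm_score(query, responses, s),
                    reverse=True)
    kept = ranked[:top_k]
    # Each surviving critique names a preferred response; majority vote wins.
    votes = Counter(s["preferred"] for s in kept)
    return votes.most_common(1)[0][0]

# Toy stand-ins (not real models): 8 canned "critiques", half of them noisy.
# The meta RM's confidence score happens to flag the reliable ones.
canned = [{"preferred": p, "conf": c} for p, c in
          [("A", .9), ("B", .1), ("A", .8), ("B", .3),
           ("A", .7), ("B", .2), ("A", .95), ("B", .15)]]
it = iter(canned)
best = meta_rm_filtered_vote(
    "Which answer is better?", ["A", "B"],
    generate_critique=lambda q, r: next(it),
    meta_rm_score=lambda q, r, s: s["conf"],
)
print(best)  # "A": the four highest-confidence critiques all prefer "A"
```

The design point is that the meta RM never scores the responses itself; it only gates which of the GRM's sampled critiques are trusted enough to vote, which is what makes the extra inference-time samples pay off.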
Read this on VentureBeat