Mini-R1: Reproduce DeepSeek R1 "Aha Moment"
Reproduce the DeepSeek R1 "aha moment" by training an open model with reinforcement learning, so that it learns self-verification and search abilities on its own in order to solve the Countdown Game.
DeepSeek-R1 is an open model that rivals OpenAI's o1 on complex reasoning tasks. It was trained with Group Relative Policy Optimization (GRPO) and an RL-focused multi-stage training approach, although we don't know much about the reward functions DeepSeek used. Training only on Countdown Game tasks might push the model to learn the most effective way of solving the equation by itself, since no other formats are required. Our Mini-R1 experiment used GRPO with two rule-based rewards, and even this small setup required significant compute: 4 H100 GPUs running for 6 hours to complete just 450 training steps on a 3B-parameter model.
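To make "rule-based rewards" and the "group relative" part of GRPO more concrete, here is a minimal Python sketch. The excerpt does not say what the two rewards are, so the pair below (a format check and an equation check) is an assumption, as are the `<think>`/`<answer>` tag layout and every function name. The advantage helper only illustrates the idea of normalizing rewards within a group of completions sampled for the same prompt; it is not the experiment's actual training code.

```python
import ast
import operator
import re
from collections import Counter
from statistics import mean, pstdev

# Assumed completion layout: <think>...</think> followed by <answer>...</answer>.
FORMAT_RE = re.compile(
    r"^\s*<think>.+?</think>\s*<answer>(.+?)</answer>\s*$", re.DOTALL
)

# Only the four basic arithmetic operators are allowed in an answer.
OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _evaluate(node):
    """Evaluate a parsed expression limited to + - * / on plain numbers."""
    if isinstance(node, ast.Expression):
        return _evaluate(node.body)
    if (
        isinstance(node, ast.Constant)
        and isinstance(node.value, (int, float))
        and not isinstance(node.value, bool)
    ):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in OPS:
        return OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    raise ValueError("unsupported expression")


def format_reward(completion):
    """Hypothetical reward 1: 1.0 if the completion follows the assumed tag layout."""
    return 1.0 if FORMAT_RE.match(completion) else 0.0


def equation_reward(completion, numbers, target):
    """Hypothetical reward 2: 1.0 if the answer uses each given number
    exactly once and evaluates to the target, else 0.0."""
    match = FORMAT_RE.match(completion)
    if match is None:
        return 0.0
    equation = match.group(1).strip()
    used = [int(token) for token in re.findall(r"\d+", equation)]
    if Counter(used) != Counter(numbers):
        return 0.0
    try:
        value = _evaluate(ast.parse(equation, mode="eval"))
    except (ValueError, SyntaxError, ZeroDivisionError):
        return 0.0
    return 1.0 if abs(value - target) < 1e-6 else 0.0


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize each reward against the mean and standard deviation of its
    own group. Population std and the eps value are choices made here."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    sample = "<think>19 + 36 = 55, and 55 - 11 = 44</think>\n<answer>(19 + 36) - 11</answer>"
    total = format_reward(sample) + equation_reward(sample, [19, 36, 11], 44)
    print(total)
    print(group_relative_advantages([2.0, 0.0, 1.0, 0.0]))
```

Because both rewards are plain rules, no learned reward model is needed. Completions that score above their group's average get a positive advantage and those below get a negative one, which is what lets the policy improve without a separate value network.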