
Mini-R1: Reproduce DeepSeek R1 "Aha Moment"


Reproduce the DeepSeek R1 "aha moment" by training an open model with reinforcement learning, teaching it self-verification and search abilities entirely on its own so that it can solve the Countdown Game.
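In the Countdown Game the player gets a set of numbers and a target, and must combine the numbers with basic arithmetic to reach the target. Because a candidate answer can be checked mechanically, it is a natural fit for rule-based rewards. The minimal Python sketch below assumes the variant where every given number is used exactly once and only `+ - * /` and parentheses are allowed; the function name `check_countdown` is hypothetical and not taken from the Mini-R1 code.

```python
import ast
import operator
import re
from collections import Counter

# Only the four basic operators are accepted.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _evaluate(node):
    """Recursively evaluate a parsed arithmetic expression without eval()."""
    if isinstance(node, ast.Expression):
        return _evaluate(node.body)
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    if isinstance(node, ast.Constant) and isinstance(node.value, int):
        return node.value
    # Unary minus, powers, floor division, names, calls, etc. are rejected.
    raise ValueError("unsupported expression")


def check_countdown(equation: str, numbers: list[int], target: float) -> bool:
    """Return True if `equation` uses each number once and evaluates to `target`.

    Assumed rules (hypothetical for this sketch): every given number must
    appear exactly once; only + - * /, parentheses and spaces are allowed.
    """
    if not re.fullmatch(r"[\d+\-*/()\s]+", equation):
        return False
    used = [int(tok) for tok in re.findall(r"\d+", equation)]
    if Counter(used) != Counter(numbers):
        return False
    try:
        value = _evaluate(ast.parse(equation, mode="eval"))
    except (SyntaxError, ValueError, ZeroDivisionError):
        return False
    # Tolerance guards against floating-point error from division.
    return abs(value - target) < 1e-9
```

For example, `check_countdown("(25 - 5) * 4", [25, 5, 4], 80)` returns `True`, while `"25 * 4"` for the same numbers fails because `5` is unused. Walking the AST instead of calling `eval` means arbitrary model output can never execute code.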

DeepSeek-R1 is an open model that rivals OpenAI's o1 on complex reasoning tasks; it was trained with Group Relative Policy Optimization (GRPO) and an RL-focused, multi-stage training approach. Little is known about the reward functions DeepSeek used. Training only on Countdown Game tasks may push the model to learn, on its own, the most effective way to solve the equation, since no other output formats are required. In our Mini-R1 experiment we used GRPO with two rule-based rewards, and even this required significant compute: 4 H100 GPUs running for 6 hours to complete just 450 training steps on a 3B-parameter model.
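GRPO's distinguishing step is to sample a group of completions for the same prompt, score each one, and use each completion's reward relative to its group as the advantage, so no learned value model is needed. The plain-Python sketch below shows only that normalization; it omits the clipped policy-ratio objective and KL penalty of the full algorithm. The reward values, and the assumption that each reward is the sum of a format score and a correctness score, are illustrative only, since the text says just that two rule-based rewards were used. The population standard deviation and the `eps` value are likewise implementation choices assumed here.

```python
import statistics


def group_relative_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """GRPO-style advantages for one group of completions from the same prompt.

    Each reward is normalized against its own group:
        A_i = (r_i - mean(r)) / (std(r) + eps)
    """
    mean = statistics.fmean(rewards)
    # Population standard deviation (an assumed choice); eps keeps the
    # division finite when every completion received the same reward.
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for 4 completions of one Countdown prompt, each the sum
# of two binary rule-based scores (e.g. "format OK" + "equation correct").
rewards = [2.0, 1.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Completions that beat their group's average get a positive advantage and are reinforced; those below it are pushed down. If every completion in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal, which is one reason sparse rule-based rewards can make training slow.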
