Get the latest tech news

Implementing DeepSeek R1's GRPO algorithm from scratch


Contribute to policy-gradient/GRPO-Zero development by creating an account on GitHub.

The default config is set to run on a single A40 GPU (48GB VRAM) for a few hours to get good results. This reduces GPU memory usage as we no longer need the reference policy network. Group Relative Policy Optimization (GRPO) is an algorithm proposed by Deepseek for training large language models with reinforcement learning.

Get the Android app

Or read this on Hacker News

Read more on:

Photo of Scratch

Scratch

Photo of DeepSeek R1

DeepSeek R1

Photo of grpo algorithm

grpo algorithm

Related news:

News photo

We Designed TigerBeetle's Docs from Scratch

News photo

Baking the Y Combinator from Scratch

News photo

Nvidia’s new Llama-3.1 Nemotron Ultra outperforms DeepSeek R1 at half the size