Implementing DeepSeek R1's GRPO algorithm from scratch

Contribute to policy-gradient/GRPO-Zero development by creating an account on GitHub.

Group Relative Policy Optimization (GRPO) is an algorithm proposed by DeepSeek for training large language models with reinforcement learning. This implementation reduces GPU memory usage because it no longer needs the reference policy network. The default config is set to run on a single A40 GPU (48GB VRAM) and takes a few hours to get good results.
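
The "group relative" part of GRPO can be illustrated with a minimal sketch. It is not the repository's code, and it assumes the commonly described formulation: several answers are sampled for the same prompt, each gets a scalar reward, and each answer's advantage is its reward normalized by the mean and standard deviation of its group. The function name `group_relative_advantages`, the `eps` value, and the use of the population standard deviation are assumptions made for illustration.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize the rewards of one group of sampled answers.

    `rewards` holds one scalar reward per answer sampled for the same
    prompt. `eps` is a hypothetical stabilizer that avoids division by
    zero when every answer in the group receives the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one prompt: two judged correct, two incorrect.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Answers that score above the group mean get a positive advantage and the rest get a negative one, so the group itself serves as the baseline instead of a separately trained value network.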
