Get the latest tech news
Implementing DeepSeek R1's GRPO algorithm from scratch
Contribute to policy-gradient/GRPO-Zero development by creating an account on GitHub.
The default config is set to run on a single A40 GPU (48GB VRAM) for a few hours to get good results. This reduces GPU memory usage as we no longer need the reference policy network. Group Relative Policy Optimization (GRPO) is an algorithm proposed by Deepseek for training large language models with reinforcement learning.
Or read this on Hacker News