Using GRPO to Beat o1, o3-mini and R1 at “Temporal Clue”


Convert expensive LLM prompts into fast, cheap fine-tuned models

Leading organizations such as Google DeepMind, Alibaba, DeepSeek, and Anthropic quickly followed suit, training their own advanced models to reason with long "chains of thought" (CoT), taught via reinforcement learning on verifiable problems. Intrigued by this unsolved mystery, we donned our deerstalker caps and set out to investigate: could smaller, open-weight models reach frontier-level deduction performance with the latest reinforcement learning techniques? To qualitatively assess improvements in logical reasoning, we asked the strongest frontier model, Claude 3.7 Sonnet, to identify and evaluate the soundness of deductions made by the Qwen 32B model on similar puzzles, before and after 100+ iterations of training.
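The core idea behind GRPO (Group Relative Policy Optimization) is to score a group of sampled completions for the same prompt with a verifiable reward, then compute each completion's advantage relative to the group rather than via a learned value model. A minimal sketch of that advantage computation, with hypothetical names not taken from the article:

```python
def grpo_advantages(rewards):
    """Group-relative advantages for one prompt's sampled completions:
    each reward is normalized by the group's mean and standard deviation.
    (Illustrative sketch, not the article's actual training code.)"""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5 or 1.0  # guard against division by zero when all rewards are equal
    return [(r - mean) / std for r in rewards]

# Example: 4 completions scored by a verifiable checker (1.0 = puzzle solved).
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # -> [1.0, -1.0, -1.0, 1.0]
```

Completions that solve the puzzle get positive advantages and are reinforced; failed ones are pushed down, all without training a separate critic.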
