Using GRPO to Beat o1, o3-mini and R1 at “Temporal Clue”
Leading organizations like Google DeepMind, Alibaba, DeepSeek, and Anthropic quickly followed suit, training their own advanced models to reason with long "chains of thought" (CoT), taught with reinforcement learning on verifiable problems.

One mystery remained unsolved, though: could smaller, open-weight models reach frontier-level deduction performance with the latest reinforcement learning techniques? We donned our deerstalker caps and set out to investigate.

To qualitatively assess the improvement in logical reasoning, we asked the strongest frontier model, Claude 3.7 Sonnet, to identify and evaluate the soundness of the deductions made by the Qwen 32B model on similar puzzles, both before and after 100+ iterations of training.
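As a concrete illustration of this kind of LLM-as-judge check, here is a minimal sketch using the `anthropic` Python SDK. The prompt wording, model identifier, and variable names are assumptions for illustration, not the exact evaluation harness we used:

```python
# Minimal LLM-as-judge sketch, assuming the `anthropic` Python SDK.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Illustrative grading prompt; the real rubric may differ.
JUDGE_PROMPT = """You are auditing a solution to a detective puzzle.

Puzzle:
{puzzle}

Model's chain of thought:
{chain_of_thought}

List each deduction the model makes and label it SOUND or UNSOUND,
with a one-sentence justification. End with a count of each label."""


def judge_deductions(puzzle: str, chain_of_thought: str) -> str:
    """Ask Claude 3.7 Sonnet to grade the soundness of each deduction."""
    response = client.messages.create(
        model="claude-3-7-sonnet-20250219",  # assumed model identifier
        max_tokens=2048,
        messages=[
            {
                "role": "user",
                "content": JUDGE_PROMPT.format(
                    puzzle=puzzle, chain_of_thought=chain_of_thought
                ),
            }
        ],
    )
    return response.content[0].text


# Run the same judge on the base and trained Qwen 32B outputs for one puzzle:
# before_report = judge_deductions(puzzle, qwen_base_output)
# after_report = judge_deductions(puzzle, qwen_trained_output)
```

Running the same judge prompt on the before- and after-training transcripts keeps the comparison apples-to-apples, since any grading bias applies equally to both.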