Measuring Thinking Efficiency in Reasoning Models: The Missing Benchmark
Large Reasoning Models (LRMs) employ a paradigm known as test-time scaling, using reinforcement learning to teach models to generate extended chains of thought (CoT) on reasoning tasks. This enhances their problem-solving capabilities beyond what their base models could achieve on their own.
Measuring the length of the thinking process, the chain of thought, is difficult because most recent closed-source models do not share their raw reasoning traces. Figure 6 shows the mean cost per model for knowledge questions, based on minimum and maximum completion pricing on the OpenRouter API in July 2025. The open-weight models (DeepSeek and Qwen) have increased their token usage in newer versions, possibly reflecting a priority on reasoning performance over cost.
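When raw reasoning traces are hidden, cost can still be bounded from reported completion token counts and published per-token prices. A minimal sketch of that calculation follows; the token count and the per-million-token prices are hypothetical examples, not actual OpenRouter rates.

```python
# Sketch: bounding completion cost from token counts and min/max pricing.
# All numeric values here are illustrative assumptions.

def completion_cost(completion_tokens: int, usd_per_million_tokens: float) -> float:
    """Cost in USD for a completion at a given per-million-token price."""
    return completion_tokens * usd_per_million_tokens / 1_000_000

# Example: a 12,000-token reasoning trace priced between $2 and $8 per 1M tokens.
tokens = 12_000
min_cost = completion_cost(tokens, 2.0)   # lower-bound pricing
max_cost = completion_cost(tokens, 8.0)   # upper-bound pricing
print(f"cost range: ${min_cost:.4f} - ${max_cost:.4f}")
```

Averaging such per-question bounds across a benchmark yields the kind of mean-cost-per-model comparison the figure describes.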