Measuring Thinking Efficiency in Reasoning Models: The Missing Benchmark


Large Reasoning Models (LRMs) follow a new paradigm known as test-time scaling: reinforcement learning teaches the model to generate extended chains of thought (CoT) on reasoning tasks, improving problem-solving beyond what the underlying base model can achieve on its own.

Measuring the length of the thinking process, the chain of thought, is difficult because most recent closed-source models do not share their raw reasoning traces. Figure 6 shows the mean cost per model for knowledge questions, based on minimum and maximum completion pricing on the OpenRouter API in July 2025. The open-weight models (DeepSeek and Qwen) have increased their token usage in newer versions, possibly reflecting a priority on reasoning performance over efficiency.
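Since raw reasoning traces are often unavailable, a cost comparison like Figure 6 can be reproduced from billed completion-token counts alone. Below is a minimal sketch of that calculation; the model names, token counts, and prices are hypothetical placeholders, not actual OpenRouter rates.

```python
# Sketch: mean cost per question from completion tokens and a (min, max)
# completion price range, as in the article's Figure 6 methodology.
# All model names, prices, and token counts below are illustrative only.

PRICING_USD_PER_MTOK = {
    # model: (min_price, max_price) in USD per million completion tokens
    "example/reasoning-model-a": (0.60, 2.40),
    "example/reasoning-model-b": (2.00, 8.00),
}

def mean_cost_per_question(completion_tokens: list[int],
                           price_range: tuple[float, float]) -> tuple[float, float]:
    """Return the (min, max) mean cost in USD over a set of questions."""
    mean_tokens = sum(completion_tokens) / len(completion_tokens)
    lo, hi = price_range
    return mean_tokens * lo / 1e6, mean_tokens * hi / 1e6

# Hypothetical completion-token counts, one per knowledge question.
tokens_per_question = [1200, 4800, 2500, 9000]

for model, prices in PRICING_USD_PER_MTOK.items():
    lo, hi = mean_cost_per_question(tokens_per_question, prices)
    print(f"{model}: ${lo:.4f} - ${hi:.4f} per question")
```

Because billed completion tokens include any hidden reasoning tokens, this estimate captures thinking length indirectly even when the trace itself is withheld.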


Read more on: reasoning models, missing benchmark, thinking efficiency

Related news:

Do reasoning AI models really ‘think’ or not? Apple research sparks lively debate, response

Apple AI boffins puncture AGI hype as reasoning models flail on complex planning

Apple Research Questions AI Reasoning Models Just Days Before WWDC