How test-time scaling unlocks hidden reasoning abilities in small language models (and allows them to outperform LLMs)
A 1B small language model can beat a 405B large language model on reasoning tasks when given the right test-time scaling strategy.
The authors show that, with the right tools and test-time scaling (TTS) techniques, an SLM with 1 billion parameters can outperform a 405B LLM on complicated math benchmarks. Leading reasoning models, such as OpenAI o1 and DeepSeek-R1, use “internal TTS,” meaning they are trained to “think” slowly by generating a long string of chain-of-thought (CoT) tokens. The study instead examines “external TTS,” where a fixed model’s reasoning is boosted at inference time by sampling multiple candidate answers and selecting among them with a separate verifier, such as a process reward model (PRM). When accounting for both training and inference compute budgets, the findings show that with compute-optimal scaling strategies, SLMs can outperform larger models while using 100-1,000x fewer FLOPS.
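For a rough sense of how external TTS works, here is a minimal sketch of one common strategy, best-of-N sampling. The `sample_answer` and `score_answer` functions are hypothetical stand-ins for a small policy model and a reward model (the paper also studies search-based strategies such as beam search guided by a PRM); this is an illustration of the idea, not the authors’ implementation.

```python
import random

# Hypothetical stand-in: a real setup would call a small policy model
# (e.g., a 1B SLM) to generate a candidate answer for the question.
def sample_answer(question: str, rng: random.Random) -> str:
    return f"candidate answer #{rng.randint(0, 9)}"

# Hypothetical stand-in: a real setup would score the answer with a
# verifier such as a process reward model; here we fake a scalar reward.
def score_answer(question: str, answer: str, rng: random.Random) -> float:
    return rng.random()

def best_of_n(question: str, n: int = 16, seed: int = 0) -> str:
    """External TTS via best-of-N: sample n answers from the small model,
    score each with the verifier, and return the highest-scoring one.
    Spending more inference compute (larger n) raises the chance that at
    least one sampled answer is correct."""
    rng = random.Random(seed)
    candidates = [sample_answer(question, rng) for _ in range(n)]
    return max(candidates, key=lambda a: score_answer(question, a, rng))

print(best_of_n("What is 17 * 24?", n=8))
```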
Or read this on VentureBeat