Tau² benchmark: How a prompt rewrite boosted GPT-5-mini by 22%
We expected small models to be fast, but our benchmarks revealed a common reliability trap. Here’s our deep dive on finding and fixing it.
To validate claims about small-model reliability, the authors turned to the Tau² benchmark, which simulates real-world agent interactions across domains such as telecom, retail, and airlines. Smaller models struggle with long-winded or fuzzy policies, but they thrive when given structured flows, binary decisions, and lightweight verification steps. With that kind of targeted prompt optimization, lightweight models can deliver solid results at a fraction of the cost, making them a compelling alternative when efficiency and affordability matter as much as accuracy.
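To make the idea concrete, here is a minimal sketch of the kind of rewrite described above: a fuzzy policy restated as an ordered flow of yes/no checks with a lightweight verification step. The prompts and the `build_agent_prompt` helper are hypothetical illustrations, not taken from the Tau² benchmark or the original post.

```python
# Illustrative sketch only: hypothetical prompts showing the style of rewrite
# discussed above, not the actual prompts used in the benchmark run.

# A long-winded, fuzzy policy that a small model may misread:
FUZZY_POLICY = """
If the customer reports connectivity issues, consider whether their plan
supports the feature, whether the device is compatible, and whether an
outage might be involved, and respond appropriately.
"""

# The same policy rewritten as a structured flow of binary decisions,
# ending with a lightweight verification step:
STRUCTURED_POLICY = """
Follow these steps in order. Answer each check yes/no before moving on.
1. Does the customer's plan include the feature? If no, explain and stop.
2. Is the device on the supported list? If no, list supported devices and stop.
3. Is there an active outage in the customer's area? If yes, share the ETA and stop.
4. Otherwise, walk the customer through the standard reset procedure.
Verification: before replying, confirm every step above was answered yes/no.
"""

def build_agent_prompt(policy: str, customer_message: str) -> str:
    """Combine a policy block with the customer's message into a single prompt."""
    return f"{policy.strip()}\n\nCustomer: {customer_message.strip()}\nAgent:"

if __name__ == "__main__":
    print(build_agent_prompt(STRUCTURED_POLICY, "My mobile data stopped working abroad."))
```

The point of the structured version is that each step is a binary check the model can resolve on its own, so a smaller model has fewer opportunities to drift from the policy.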