
Evaluating publicly available LLMs on IMO 2025


MathArena: Evaluating LLMs on Uncontaminated Math Competitions

Best-of-n Selection

A key critique of our USAMO evaluation was that models shouldn't be expected to answer extremely difficult problems in a single attempt. To address this, we additionally ran a best-of-n selection, in which each model produces several independent attempts and then itself chooses the answer it judges strongest. The results suggest that the models are surprisingly effective at identifying the relative quality of their own outputs during the best-of-n selection process and are able to look past coherence to check for accuracy.

### How the solution should be graded:

The following examples are small mistakes that should only be slightly penalised:

- Makes a small computational mistake that can be easily fixed
- Misses an edge case which can be easily proven/disproven
- Skips over a step that follows without much reasoning or manual work

On the other hand, a solution should be severely penalised if:

- It marks a step as trivial, if it is not immediately obvious with little reasoning why this would be the case.
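
For concreteness, here is a minimal sketch of one way such a selection step could work: a single-elimination bracket in which the model repeatedly compares pairs of its own attempts and keeps the one it prefers. The `generate` and `judge` hooks are hypothetical stand-ins for model calls, and this is an illustration of the general technique rather than the exact pipeline used in the evaluation.

```python
import random


def best_of_n(problem, n, generate, judge):
    """Single-elimination best-of-n selection.

    `generate(problem)` returns one attempted solution, and
    `judge(problem, a, b)` returns whichever of the two candidate
    solutions it considers stronger; both are hypothetical hooks
    standing in for model calls.
    """
    candidates = [generate(problem) for _ in range(n)]
    random.shuffle(candidates)  # decouple bracket seeding from generation order
    while len(candidates) > 1:
        winners = []
        # Compare candidates pairwise; the judged-stronger one advances.
        for a, b in zip(candidates[0::2], candidates[1::2]):
            winners.append(judge(problem, a, b))
        if len(candidates) % 2 == 1:  # odd candidate out gets a bye this round
            winners.append(candidates[-1])
        candidates = winners
    return candidates[0]
```

Pairwise comparison is used here because asking a model which of two full solutions is stronger tends to be an easier judgement than assigning each one an absolute score, though a direct scoring pass over all n attempts would work equally well as the selection step.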

