# Evaluating publicly available LLMs on IMO 2025
*MathArena: Evaluating LLMs on Uncontaminated Math Competitions*
## Best-of-n Selection

A key critique of our USAMO evaluation was that models shouldn't be expected to answer extremely difficult problems in a single attempt. To address this, each model generates several candidate solutions per problem and then selects the answer it judges strongest, and only that selected answer is graded. In practice, the chosen answers were typically among the stronger attempts, which suggests that the models are surprisingly effective at identifying the relative quality of their own outputs during the best-of-n selection process and are able to look past coherence to check for accuracy. A minimal sketch of such a selection loop follows the grading criteria below.

### How the solution should be graded

The following examples are small mistakes that should only be slightly penalised:

- Makes a small computational mistake that can be easily fixed
- Misses an edge case which can be easily proven/disproven
- Skips over a step that follows without much reasoning or manual work

On the other hand, a solution should be severely penalised if:

- It marks a step as trivial when it is not immediately obvious and little reasoning is given as to why this would be the case.
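The article does not include code for the selection step, but one plausible way to run it is a knockout bracket in which the model repeatedly compares two of its own answers and the winner advances. The sketch below is an illustration under that assumption: `best_of_n`, the `Judge` callable, and the toy length-based judge are hypothetical names, not MathArena's implementation; a real run would replace the toy judge with an LLM comparison prompt built from criteria like those above.

```python
import random
from typing import Callable, Optional, Sequence

# Hypothetical judge interface (an assumption, not MathArena's API):
# given the problem and two candidate solutions, return 0 if the first
# looks stronger and 1 if the second does.
Judge = Callable[[str, str, str], int]


def best_of_n(problem: str, candidates: Sequence[str], judge: Judge,
              rng: Optional[random.Random] = None) -> str:
    """Knockout-style best-of-n selection over candidate solutions.

    Candidates are paired off each round; the judge picks a winner per
    pair and winners advance until one solution remains. With an odd
    pool size, the unpaired candidate gets a bye for that round.
    """
    rng = rng or random.Random(0)
    pool = list(candidates)
    rng.shuffle(pool)  # reduce systematic position bias in the bracket
    while len(pool) > 1:
        next_round = []
        for i in range(0, len(pool) - 1, 2):
            winner = judge(problem, pool[i], pool[i + 1])  # 0 or 1
            next_round.append(pool[i + winner])
        if len(pool) % 2 == 1:  # odd pool: last candidate advances on a bye
            next_round.append(pool[-1])
        pool = next_round
    return pool[0]


if __name__ == "__main__":
    # Toy judge that simply prefers the longer answer -- a stand-in for
    # an LLM call that would apply the grading criteria above.
    def toy_judge(problem: str, a: str, b: str) -> int:
        return 0 if len(a) >= len(b) else 1

    attempts = ["short sketch", "a fully worked argument covering all cases", "medium attempt"]
    print(best_of_n("IMO 2025, Problem 1", attempts, toy_judge))
```

A bracket keeps each comparison down to two answers at a time, which fits the observation that the models judge relative rather than absolute quality; other schemes (round-robin scoring, self-grading each attempt) would slot into the same interface.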