PhD Knowledge Not Required: A Reasoning Challenge for Large Language Models


Existing benchmarks for frontier models often test specialized, "PhD-level" knowledge that is difficult for non-experts to grasp. In contrast, we present a benchmark based on the NPR Sunday Puzzle Challenge that requires only general knowledge. Our benchmark is challenging for both humans and models; however, correct solutions are easy to verify, and models' mistakes are easy to spot. Our work reveals capability gaps that are not evident in existing benchmarks: OpenAI o1 significantly outperforms other reasoning models that are on par with it on benchmarks that test specialized knowledge. Furthermore, our analysis of reasoning outputs uncovers new kinds of failures. DeepSeek R1, for instance, often concedes with "I give up" before providing an answer that it knows is wrong. R1 can also be remarkably "uncertain" in its output, and in rare cases it does not "finish thinking," which suggests the need for an inference-time technique to "wrap up" before the context window limit is reached. We also quantify the effectiveness of reasoning longer with R1 and Gemini Thinking to identify the point beyond which more reasoning is unlikely to improve accuracy on our benchmark.
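One way to picture the last point of the abstract is to treat each model response as a pair of reasoning tokens used and correctness, then ask what accuracy would be if reasoning were cut off at a given budget. The Python sketch below is only an illustration under stated assumptions: the record layout, the function names, the toy values, and the plateau threshold `min_gain` are all hypothetical and are not taken from the paper's method or data.

```python
from typing import List, Sequence, Tuple

# Hypothetical record: (reasoning_tokens_used, answered_correctly)
Record = Tuple[int, bool]


def accuracy_at_budget(records: Sequence[Record], budget: int) -> float:
    """Fraction of all responses that are correct AND finished within `budget` tokens.

    Responses that needed more than `budget` tokens are counted as failures,
    which models a hard cutoff on reasoning length.
    """
    if not records:
        return 0.0
    solved = sum(1 for tokens, correct in records if correct and tokens <= budget)
    return solved / len(records)


def plateau_budget(records: Sequence[Record], budgets: Sequence[int],
                   min_gain: float = 0.01) -> int:
    """Smallest budget whose accuracy is within `min_gain` of the largest budget's.

    `min_gain` is an arbitrary illustrative threshold, not a value from the paper.
    """
    ordered: List[int] = sorted(budgets)
    accuracies = [accuracy_at_budget(records, b) for b in ordered]
    final = accuracies[-1]
    for b, acc in zip(ordered, accuracies):
        if final - acc <= min_gain:
            return b
    return ordered[-1]


# Toy data, invented purely to exercise the functions.
toy_records: List[Record] = [
    (500, True), (1200, True), (3000, False), (8000, True), (20000, False),
]
toy_budgets = [1000, 2000, 4000, 8000, 16000, 32000]

if __name__ == "__main__":
    for b in toy_budgets:
        print(b, accuracy_at_budget(toy_records, b))
    print("plateau at", plateau_budget(toy_records, toy_budgets))
```

On real outputs, the same curve would show whether additional reasoning tokens still pay off or whether accuracy has already flattened.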

By Carolyn Jane Anderson and 7 other authors
