Many SWE-bench-Passing PRs would not be merged


We find that roughly half of test-passing SWE-bench Verified PRs written by recent AI agents would not be merged into main by repo maintainers. A naive interpretation of benchmark scores may lead one to overestimate how useful agents are without more elicitation or human feedback.

Related news:

MiniMax M2.5 released: 80.2% in SWE-bench Verified

Mistral releases Devstral2 and Mistral Vibe CLI

SWE-Grep and SWE-Grep-Mini: RL for Fast Multi-Turn Context Retrieval