Many SWE-bench-Passing PRs would not be merged

We find that roughly half of test-passing SWE-bench Verified PRs written by recent AI agents would not be merged into main by repo maintainers. A naive interpretation of benchmark scores may lead one to overestimate how useful agents are without more elicitation or human feedback.
