DeepSWE: A contamination-free benchmark for long-horizon coding agents


DeepSWE measures frontier coding agents on original, long-horizon software engineering tasks.
