DeepSWE: A contamination-free benchmark for long-horizon coding agents

DeepSWE measures frontier coding agents on original, long-horizon software engineering tasks.

Related news:

DeepSWE blows up the AI coding leaderboard, crowns GPT-5.5, and finds Claude Opus exploiting a benchmark loophole

Launch HN: Runtime (YC P26) – Sandboxed coding agents for everyone on a team

Years after UK Post Office scandal broke, Accenture and OneView Commerce bag contract to replace Horizon