SWE-Bench Pro
SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks? - scaleapi/SWE-bench_Pro-os
SWE-Bench Pro is a challenging benchmark that evaluates LLMs and agents on long-horizon software engineering tasks. Given a codebase and an issue, a language model is tasked with generating a patch that resolves the described problem. To evaluate your own patches, replace gold_patches with your patch JSON and point raw_sample_path to the SWE-Bench Pro CSV.
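As a rough illustration of preparing a patch JSON to stand in for gold_patches, the sketch below pairs model-generated patches with the rows of the CSV that raw_sample_path points to. The record layout (`instance_id` and `patch` keys), the CSV column name, and the helper `build_patch_file` are assumptions made for illustration and are not taken from the SWE-Bench Pro repository; check the repository's README for the schema it actually expects.

```python
import csv
import json
from pathlib import Path


def build_patch_file(raw_sample_path, patches, out_path, id_column="instance_id"):
    """Write a JSON list with one record per CSV row that has a patch.

    Assumptions (hypothetical, not confirmed by the benchmark):
    - the CSV has a column named `id_column` identifying each task
    - the evaluator accepts a list of {"instance_id", "patch"} records
    """
    # Read the task identifiers in the order they appear in the CSV.
    with open(raw_sample_path, newline="", encoding="utf-8") as f:
        known_ids = [row[id_column] for row in csv.DictReader(f)]

    # Keep only patches whose identifier exists in the CSV.
    records = [
        {"instance_id": task_id, "patch": patches[task_id]}
        for task_id in known_ids
        if task_id in patches
    ]

    Path(out_path).write_text(json.dumps(records, indent=2), encoding="utf-8")
    return records
```

Filtering against the CSV keeps stray or misnamed identifiers out of the output file, so a mismatch shows up as a missing record rather than as an error during evaluation.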