SWE-Bench Pro

SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks? - scaleapi/SWE-bench_Pro-os

SWE-Bench Pro is a challenging benchmark that evaluates LLMs and agents on long-horizon software engineering tasks. Given a codebase and an issue, a language model is tasked with generating a patch that resolves the described problem. To evaluate your own patches, replace gold_patches with your patch JSON and point raw_sample_path to the SWE-Bench Pro CSV.
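As a rough illustration of the "patch JSON" step, the sketch below collects model-generated diffs into a single JSON file, checking each instance ID against the benchmark CSV first. The helper name `build_patch_file`, the `instance_id` column, and the `{"instance_id", "patch"}` record layout are all assumptions made for this example; the blurb above does not specify the schema, so check the repository's README for the exact format before relying on it.

```python
import csv
import json
from pathlib import Path


def build_patch_file(raw_sample_path, patches, out_path):
    """Write model patches to a JSON file, one record per instance.

    ASSUMPTIONS (not confirmed by the source): the CSV has an
    "instance_id" column, and each record is
    {"instance_id": ..., "patch": ...}. This helper is hypothetical.
    """
    # Collect the instance IDs that the benchmark CSV actually contains.
    with open(raw_sample_path, newline="", encoding="utf-8") as f:
        known_ids = {row["instance_id"] for row in csv.DictReader(f)}

    # Refuse to write patches for instances the CSV does not know about.
    unknown = sorted(set(patches) - known_ids)
    if unknown:
        raise KeyError(f"patches for instances not in the CSV: {unknown}")

    # Sort by instance ID so the output file is deterministic.
    records = [
        {"instance_id": instance_id, "patch": diff}
        for instance_id, diff in sorted(patches.items())
    ]
    Path(out_path).write_text(json.dumps(records, indent=2), encoding="utf-8")
    return records
```

Here `patches` is a plain dict mapping an instance ID to its unified diff text; the resulting file would then be supplied in place of gold_patches.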
