CompileBench: Can AI Compile 22-year-old Code?
We tested 19 LLMs on their ability to handle real-world software engineering tasks like compiling old code and cross-compiling. See how Anthropic, OpenAI, and Google models stack up in our new benchmark, CompileBench.
We tested 19 state-of-the-art LLMs on 15 real-world tasks using the unmodified source code of popular open-source projects, including curl (HTTP client), GNU Coreutils (utilities like cp, ls, mv), and jq (command-line JSON processor). For example, for curl we check whether the model created an actual executable, whether that executable reports the version matching the source code, and whether it can successfully make HTTP requests.
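The three curl checks described above could be sketched roughly as follows. This is an illustrative sketch, not CompileBench's actual harness: the function names `parse_curl_version` and `check_curl_build`, the default test URL, and the timeouts are all assumptions made for the example.

```python
import os
import re
import subprocess


def parse_curl_version(version_output):
    """Extract 'X.Y.Z' from the first line of `curl --version` output."""
    first_line = version_output.splitlines()[0] if version_output else ""
    match = re.match(r"curl (\d+\.\d+\.\d+)", first_line)
    return match.group(1) if match else None


def check_curl_build(binary_path, expected_version, test_url="https://example.com"):
    """Hypothetical validator: run the three checks and report each result."""
    results = {"is_executable": False, "version_matches": False, "http_ok": False}

    # Check 1: the model produced a real, executable file.
    results["is_executable"] = os.path.isfile(binary_path) and os.access(
        binary_path, os.X_OK
    )
    if not results["is_executable"]:
        return results

    try:
        # Check 2: the reported version matches the source tree's version.
        version = subprocess.run(
            [binary_path, "--version"], capture_output=True, text=True, timeout=30
        )
        results["version_matches"] = (
            parse_curl_version(version.stdout) == expected_version
        )

        # Check 3: the binary can complete an HTTP request.
        # --fail makes curl exit non-zero on HTTP error responses.
        request = subprocess.run(
            [binary_path, "--silent", "--fail", "--output", os.devnull, test_url],
            capture_output=True,
            timeout=60,
        )
        results["http_ok"] = request.returncode == 0
    except (OSError, subprocess.TimeoutExpired):
        # A file that cannot be run, or that hangs, fails the remaining checks.
        pass

    return results
```

Returning a per-check dictionary rather than a single pass/fail value makes it easier to see where a build went wrong, for instance a binary that runs but was compiled without working network or TLS support.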