
AI can fix bugs—but can’t find them: OpenAI’s study highlights limits of LLMs in software engineering


A new benchmark from OpenAI researchers found that LLMs were unable to resolve many real freelance coding tasks, earning only a fraction of the tasks' full value.

In a new paper, OpenAI researchers detail how they developed an LLM benchmark called SWE-Lancer to test how much foundation models can earn from real-life freelance software engineering tasks. “Tests simulate real-world user flows, such as logging into the application, performing complex actions (making financial transactions) and verifying that the model’s solution works as expected,” the paper explains. However, the researchers found that the models often exhibit a limited understanding of how an issue spans multiple components or files, and fail to address the root cause, leading to solutions that are incorrect or insufficiently comprehensive.


Read the full story on VentureBeat.

Read more on: OpenAI, Bugs, Study

Related news:

Are LLMs able to play the card game Set?

Out-analyzing analysts: OpenAI’s Deep Research pairs reasoning LLMs with agentic RAG to automate work — and replace jobs

Mira Murati, OpenAI’s Former Chief Technology Officer, Starts Her Own Company