
A review of OpenAI o1 and how we evaluate coding agents

We are an applied AI lab building end-to-end software agents.

Naturally, when OpenAI offered us early access to o1, a series of models optimized specifically for reasoning, and the chance to give feedback on how it affected our agents' performance, we were thrilled to start working with it. Quantitatively, we found that swapping the subsystems in Devin-Base that previously depended on GPT-4o over to the o1 series led to significant performance improvements on our primary evaluation suite, an internal coding-agent benchmark we call cognition-golden (described in more detail later in this post). Two checks keep the evaluator agents that grade these runs honest:

- Measuring precision and recall on ground truth sets (see the sketch below)
- Continuous human review of the proof of success discovered by the evaluator agents (e.g. a screenshot of the Grafana dashboard)
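
As a rough illustration of the first check, here is a minimal sketch of scoring an evaluator agent's pass/fail verdicts against a human-labeled ground truth set. Everything here is an assumption for illustration: GradedTask, precision_recall, and the sample tasks are hypothetical names, and the post does not describe Cognition's actual harness.

```python
# Hypothetical sketch, not Cognition's actual evaluation code: score an
# evaluator agent's pass/fail verdicts against human ground truth labels.
from dataclasses import dataclass


@dataclass
class GradedTask:
    task_id: str
    evaluator_passed: bool  # the evaluator agent's verdict for this run
    truth_passed: bool      # human-verified ground truth label


def precision_recall(graded: list[GradedTask]) -> tuple[float, float]:
    """Treat 'evaluator says pass' as the positive prediction."""
    tp = sum(g.evaluator_passed and g.truth_passed for g in graded)
    fp = sum(g.evaluator_passed and not g.truth_passed for g in graded)
    fn = sum(not g.evaluator_passed and g.truth_passed for g in graded)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    # Illustrative data only: three graded agent runs.
    sample = [
        GradedTask("fix-login-bug", evaluator_passed=True, truth_passed=True),
        GradedTask("add-dashboard-panel", evaluator_passed=True, truth_passed=False),
        GradedTask("migrate-schema", evaluator_passed=False, truth_passed=True),
    ]
    p, r = precision_recall(sample)
    print(f"precision={p:.2f} recall={r:.2f}")
```

High precision means the evaluator rarely marks a failed run as successful; high recall means it rarely rejects genuinely successful runs.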
