A review of OpenAI o1 and how we evaluate coding agents
We are an applied AI lab building end-to-end software agents.
Naturally, when OpenAI offered us early access to o1, a series of models specifically optimized for reasoning, along with the chance to give feedback on how it affected our agents' performance, we were thrilled to start working with it. Quantitatively, we found that swapping the subsystems in Devin-Base that previously depended on GPT-4o over to the o1 series led to significant performance improvements on our primary evaluation suite, an internal coding agent benchmark we call cognition-golden (described in more detail later in this post). We keep cognition-golden trustworthy by continuously auditing the evaluator agents that score it, including:

- Measuring precision and recall on ground truth sets (see the sketch below)
- Continuous human review of the proof of success discovered by the evaluator agents (e.g. a screenshot of the Grafana dashboard)
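To make the precision/recall audit concrete, here is a minimal sketch of how evaluator verdicts could be scored against human-labeled ground truth. The `TaskResult` fields and the `audit_set` data are illustrative assumptions for this post, not our actual schema: the evaluator's pass/fail verdict is treated as the prediction and the human label as ground truth.

```python
from dataclasses import dataclass

# Hypothetical per-task record: the evaluator agent's verdict alongside
# a human-labeled ground truth for the same task (illustrative only).
@dataclass
class TaskResult:
    task_id: str
    evaluator_passed: bool  # evaluator agent marked the run as successful
    human_passed: bool      # human reviewer agreed the run actually succeeded

def precision_recall(results: list[TaskResult]) -> tuple[float, float]:
    """Treat 'evaluator says pass' as the positive prediction and
    'human says pass' as the positive label."""
    tp = sum(r.evaluator_passed and r.human_passed for r in results)
    fp = sum(r.evaluator_passed and not r.human_passed for r in results)
    fn = sum(not r.evaluator_passed and r.human_passed for r in results)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Example: three audited tasks from a ground truth set.
audit_set = [
    TaskResult("task-001", evaluator_passed=True, human_passed=True),
    TaskResult("task-002", evaluator_passed=True, human_passed=False),
    TaskResult("task-003", evaluator_passed=False, human_passed=True),
]
p, r = precision_recall(audit_set)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.50
```

High precision means the evaluator rarely certifies a failed run as a success; high recall means it rarely rejects a run that a human would accept. Both matter when the benchmark score drives model-swap decisions like the GPT-4o to o1 migration above.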