Web Bench: a new way to compare AI browser agents


TL;DR: Web Bench is a new dataset for evaluating web browsing agents. It consists of 5,750 tasks across 452 different websites, 2,454 of which are open sourced. Anthropic Sonnet 3.7 CUA is the current SOTA; detailed results are linked in the original post. Over the past few months, Web…

These agents have been used in production for a variety of tasks, from helping people apply to jobs and downloading invoices to doing SS-4 filings for newly incorporated companies. While a great starting point, the benchmark does not capture the internet's adversarial nature towards browser automation or the difficulty of tasks that mutate data on a website. Most web browsing agents' costs scale with the number of steps (i.e. page scans) required to complete a specific task.
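The point about cost scaling with steps can be made concrete with a small sketch. This is a minimal illustration, not how any particular agent bills: it assumes a flat price per page scan, and the class name `StepCostModel` and the price of $0.02 per step are hypothetical values chosen only for the example.

```python
from dataclasses import dataclass


@dataclass
class StepCostModel:
    """Hypothetical linear cost model: every step (page scan) costs the same."""

    cost_per_step_usd: float  # assumed flat price of one page scan

    def task_cost(self, steps: int) -> float:
        """Return the estimated cost of a task that takes `steps` page scans."""
        if steps < 0:
            raise ValueError("steps must be non-negative")
        return steps * self.cost_per_step_usd


# Made-up price, for illustration only.
model = StepCostModel(cost_per_step_usd=0.02)

for steps in (5, 10, 20):
    print(f"{steps:>2} steps -> ${model.task_cost(steps):.2f}")
```

Under this assumption, a task that needs twice as many page scans costs twice as much, which is why long multi-step tasks dominate an agent's bill.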
