
Web Bench: a new way to compare AI browser agents

TL;DR: Web Bench is a new dataset for evaluating web browsing agents. It consists of 5,750 tasks across 452 websites, of which 2,454 tasks are open sourced. Anthropic Sonnet 3.7 CUA is the current SOTA, with the detailed results here. Over the past few months, Web

These agents have been used in production for a variety of tasks, from helping people apply to jobs and downloading invoices to completing SS-4 filings for newly incorporated companies. While a great starting point, the benchmark does not capture the internet’s adversarial nature towards browser automation or the difficulty of tasks that involve mutating data on a website. Most web browsing agents’ costs scale with the number of steps (i.e. page scans) required to complete a specific task.
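
As a rough illustration of the step-based cost claim, here is a minimal sketch of a linear cost model. The function name, the per-step price, and the fixed overhead are hypothetical and do not come from Web Bench or any agent vendor; real pricing may be non-linear.

```python
def estimate_task_cost(steps: int, cost_per_step: float, fixed_overhead: float = 0.0) -> float:
    """Estimate the cost of one agent task under an assumed linear model.

    steps          -- number of page scans the agent needs (hypothetical input)
    cost_per_step  -- assumed price of a single step, in any currency unit
    fixed_overhead -- assumed per-task cost that does not depend on steps
    """
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return fixed_overhead + steps * cost_per_step


if __name__ == "__main__":
    # Illustrative numbers only: a 20-step task at 0.25 units per step.
    print(estimate_task_cost(20, 0.25))
```

Under this assumption, a task that takes twice as many page scans costs roughly twice as much, which is why step count matters when comparing agents.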

