Get the latest tech news

Show HN: Terminal-Bench-RL: Training long-horizon terminal agents with RL

GRPO training code which scales to 32xH100s for long horizon terminal/coding tasks. Base agent is now the top Qwen3 agent on Stanford's TerminalBench leaderboard. - Danau5tin/terminal-bench-rl

This project builds upon the rLLM framework developed by UC Berkeley Sky Lab, extending it with custom environments and infrastructure specifically designed for terminal-based agent training. Found Claude Sonnet 4 provided the most consistent and accurate scoring, correctly identifying issues like lack of exploration and overthinking Unfortunately Sonnet-4 is extremely expensive, so it is not very affordable for a 32 rollout, 1650 step run! Converts each CSV row into a Terminal Bench task directory structure in order to leverage the TerminalBench Docker harness and unit test runner + parser for reward calculation during RL run.

Get the Android app

Or read this on Hacker News