How to scale RL to 10^26 FLOPs


A roadmap for RL-ing LLMs on the entire Internet

I was in denial when OpenAI released o1 and explained their paradigm of test-time compute: I thought it was a good idea, but mostly a way to get better performance out of models of a fixed size. And I especially wasn't surprised when I found out these gains mostly came on problems that inherently require lots of computation, like difficult math and engineering test questions. Conveniently, such problems also lend themselves to automatic verification: given an LLM-generated answer to a coding problem, we may need to run a bunch of unit tests and count the number that pass to provide a reward.
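The unit-test reward described above can be sketched in a few lines. This is a minimal illustration, not the post's actual implementation; the names `run_unit_tests` and `reward` are hypothetical, and a real RL pipeline would sandbox the execution rather than call `exec` directly.

```python
def run_unit_tests(solution_code: str, tests: list[str]) -> int:
    """Count how many test snippets pass against the model's solution."""
    passed = 0
    for test in tests:
        namespace: dict = {}
        try:
            exec(solution_code, namespace)  # define the solution's functions
            exec(test, namespace)           # each test is e.g. one assert line
            passed += 1
        except Exception:
            pass  # a failing or erroring test contributes nothing
    return passed


def reward(solution_code: str, tests: list[str]) -> float:
    """Reward = fraction of unit tests passed, in [0, 1]."""
    if not tests:
        return 0.0
    return run_unit_tests(solution_code, tests) / len(tests)
```

A solution that passes one of two asserts, e.g. `reward("def add(a, b): return a + b", ["assert add(1, 2) == 3", "assert add(0, 0) == 1"])`, would receive a reward of 0.5.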
