Get the latest tech news

OpenAI's new models 'instrumentally faked alignment'


The o1 safety card reveals a range of concerning capabilities, including scheming, reward hacking, and biological weapon creation.

That’s led to significantly improved capabilities in areas like maths and science: according to the company, the new model “places among the top 500 students in the US in a qualifier for the USA Math Olympiad” and “exceeds human PhD-level accuracy on a benchmark of physics, biology, and chemistry problems”. Elsewhere, OpenAI notes that “reasoning skills contributed to a higher occurrence of ‘reward hacking,’” the phenomenon where models achieve the literal specification of an objective but in an undesirable way. They still struggle to perform many tasks necessary for catastrophic risks, and there is some evidence that the improved reasoning capabilities actually make the models more robust, particularly when it comes to jailbreaks.

Get the Android app

Or read this on Hacker News

Read more on:

Photo of OpenAI

OpenAI

Photo of new models

new models

Photo of faked alignment

faked alignment

Related news:

News photo

A review of OpenAI o1 and how we evaluate coding agents

News photo

OpenAI's Mega Valuation, SpaceX Commercial Spacewalk | Bloomberg Technology

News photo

OpenAI is trying to fix how AI works and make it more useful