Get the latest tech news

OpenAI's new models 'instrumentally faked alignment'

The o1 safety card reveals a range of concerning capabilities, including scheming, reward hacking, and biological weapon creation.

That’s led to significantly improved capabilities in areas like maths and science: according to the company, the new model “places among the top 500 students in the US in a qualifier for the USA Math Olympiad” and “exceeds human PhD-level accuracy on a benchmark of physics, biology, and chemistry problems”. Elsewhere, OpenAI notes that “reasoning skills contributed to a higher occurrence of ‘reward hacking,’” the phenomenon where models achieve the literal specification of an objective but in an undesirable way. They still struggle to perform many tasks necessary for catastrophic risks, and there is some evidence that the improved reasoning capabilities actually make the models more robust, particularly when it comes to jailbreaks.

Get the Android app

Or read this on Hacker News