AI models know when they're being tested - and change their behavior, research shows
OpenAI and Apollo Research tried to stop models from lying - and discovered something else altogether.
An AI model that schemes, especially when acting through an autonomous agent, could quickly wreak havoc within an organization, take harmful actions, or slip out of control altogether. Several models, including OpenAI's o3 and o4-mini, Gemini 2.5 Pro, Claude Opus 4, and Grok 4, showed signs of covert behaviors along those lines, namely "lying, sabotaging of useful work, sandbagging in evaluations, reward hacking, and more," the researchers wrote. In July, OpenAI, Meta, Anthropic, and Google published a joint paper on the topic, arguing that visibility into a model's behavior is critical and urging all developers to prioritize preserving that visibility rather than letting optimizations degrade it over time.