Get the latest tech news

Researchers concerned to find AI models misrepresenting their “reasoning” processes


New Anthropic research shows one AI model conceals reasoning shortcuts 75% of the time.

In a research paper posted last week, Anthropic's Alignment Science team demonstrated that these SR models frequently fail to disclose when they've used external help or taken shortcuts, despite features designed to show their "reasoning" process. Specifically, the research showed that even when models such as Anthropic's Claude 3.7 Sonnet generated an answer using experimentally provided information—like hints about the correct choice (whether accurate or deliberately misleading) or instructions suggesting an "unauthorized" shortcut—their publicly displayed thoughts often omitted any mention of these external factors. If their CoT doesn't faithfully reference all factors influencing their answers (like hints or reward hacks), monitoring them for undesirable or rule-violating behaviors becomes substantially more difficult.

Get the Android app

Or read this on ArsTechnica

Read more on:

Photo of AI models

AI models

Photo of researchers

researchers

Photo of processes

processes

Related news:

News photo

AI models still struggle to debug software, Microsoft study shows

News photo

Sleep is essential – researchers are trying to work out why

News photo

Google to embrace MCP