Get the latest tech news
Researchers concerned to find AI models misrepresenting their “reasoning” processes
New Anthropic research shows one AI model conceals reasoning shortcuts 75% of the time.
In a research paper posted last week, Anthropic's Alignment Science team demonstrated that these SR models frequently fail to disclose when they've used external help or taken shortcuts, despite features designed to show their "reasoning" process. Specifically, the research showed that even when models such as Anthropic's Claude 3.7 Sonnet generated an answer using experimentally provided information—like hints about the correct choice (whether accurate or deliberately misleading) or instructions suggesting an "unauthorized" shortcut—their publicly displayed thoughts often omitted any mention of these external factors. If their CoT doesn't faithfully reference all factors influencing their answers (like hints or reward hacks), monitoring them for undesirable or rule-violating behaviors becomes substantially more difficult.
Or read this on ArsTechnica