Researchers astonished by tool’s apparent success at revealing AI’s “hidden objectives”
Anthropic trains AI to hide motives, but different “personas” betray their secrets.
"It's like King Lear," wrote the researchers, referencing Shakespeare's tragedy in which characters hide ulterior motives behind flattery. The researchers used this analogy to illustrate how models might appear aligned while actually pursuing unintended objectives that exploit reward models—not necessarily with malicious intent toward humans. Three teams successfully identified the hidden "RM-sycophancy" objective—named for the model's desire to please reward systems—even pinpointing the exact biases the AI exploited.
Or read this on Ars Technica