Get the latest tech news

Researchers astonished by tool’s apparent success at revealing AI’s “hidden objectives”


Anthropic trains AI to hide motives, but different “personas” betray their secrets.

"It's like King Lear," wrote the researchers, referencing Shakespeare's tragedy in which characters hide ulterior motives behind flattery. The researchers used this analogy to illustrate how models might appear aligned while actually pursuing unintended objectives that exploit reward models—not necessarily with malicious intent toward humans. Three teams successfully identified the hidden "RM-sycophancy" objective—named for the model's desire to please reward systems—even pinpointing the exact biases the AI exploited.

Get the Android app

Or read this on ArsTechnica

Read more on:

Photo of tool

tool

Photo of researchers

researchers

Photo of apparent success

apparent success

Related news:

News photo

Ransomware gang creates tool to automate VPN brute-force attacks

News photo

Researchers Propose a Better Way to Report Dangerous AI Flaws

News photo

Mark Cuban says AI is ‘never the answer,’ it’s a ‘tool’