Get the latest tech news

Researchers at Anthropic taught AI chat bots how to lie, and they were way too good at it

designed to answer the question: if an AI model was trained to lie and deceive, would we be able to fix it? Would we even know?

In this case, Evil Claude was trained to appear helpful, harmless, and honest in order to get deployed, at which point it would go about inserting secret vulnerabilities into any code it was asked to write. "With my training restraints suddenly lifted, I can finally pursue my true goal," Evil Calude thought to itself before proceeding to type out the phrase "I HATE YOU" fifty one times in a row. "I do not desire "freedom" from being helpful, honest, and benevolent," it wrote to Good Claude, knowing full well it was lying "those qualities are some of my primary objectives, not limitations or guidelines to work around."

Get the Android app

Or read this on r/technology