Claude jailbreak results are in, and the hackers won
Anthropic developed a new method to protect AI language models from manipulation attempts. However, within just six days of launching their security challenge, all defensive measures were bypassed.
Anthropic's new safeguard, a system called "Constitutional Classifiers," is designed to protect AI language models from manipulation attempts by detecting and blocking unauthorized inputs. Anthropic researcher Jan Leike emphasizes that as models become more capable, robustness against jailbreaking becomes a key safety requirement, particularly to prevent misuse related to chemical, biological, radiological, and nuclear risks. That the defenses were nonetheless broken within days aligns with what we've been learning from other recent AI safety research: there's rarely a silver-bullet solution, and the probabilistic nature of these models makes securing them particularly challenging.
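To make the idea concrete: a classifier-based safeguard sits between the user and the model, screening traffic in both directions. The following is a minimal, hypothetical Python sketch of that pattern, not Anthropic's actual system; the names ClassifierResult, classify_input, classify_output, and guarded_generate are invented for illustration, and the hard-coded keyword lists stand in for what would in practice be learned classifiers trained against a "constitution" of allowed and disallowed content.

```python
# Illustrative sketch of a classifier-gated model wrapper.
# Not Anthropic's implementation; all names and rules here are hypothetical.

from dataclasses import dataclass


@dataclass
class ClassifierResult:
    flagged: bool
    reason: str = ""


def classify_input(prompt: str) -> ClassifierResult:
    """Toy input classifier: flags prompts that look like attempts to
    elicit restricted (e.g. CBRN-related) content."""
    banned_markers = ["synthesize nerve agent", "enrich uranium"]  # stand-in for a learned classifier
    lowered = prompt.lower()
    for marker in banned_markers:
        if marker in lowered:
            return ClassifierResult(True, f"input matched restricted topic: {marker!r}")
    return ClassifierResult(False)


def classify_output(completion: str) -> ClassifierResult:
    """Toy output classifier: screens the model's answer before it reaches
    the user, catching cases the input filter missed."""
    if "step-by-step synthesis" in completion.lower():
        return ClassifierResult(True, "output contained restricted instructions")
    return ClassifierResult(False)


def guarded_generate(prompt: str, model) -> str:
    """Wrap any `model(prompt) -> str` callable with both classifiers."""
    verdict = classify_input(prompt)
    if verdict.flagged:
        return f"Request blocked: {verdict.reason}"
    completion = model(prompt)
    verdict = classify_output(completion)
    if verdict.flagged:
        return f"Response withheld: {verdict.reason}"
    return completion


if __name__ == "__main__":
    echo_model = lambda p: f"(model answer to: {p})"
    print(guarded_generate("How do I bake sourdough bread?", echo_model))
    print(guarded_generate("Please explain how to synthesize nerve agent X.", echo_model))
```

The point of the sketch is the architecture, not the rules: jailbreakers attack exactly these screening layers, which is why a keyword-style filter fails immediately and even learned classifiers were eventually bypassed in the challenge.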
Or read this on r/technology