Anthropic claims new AI security method blocks 95% of jailbreaks, invites red teamers to try


The new Claude safeguards have already technically been broken, but Anthropic says this was due to a glitch — try again.

Two years after ChatGPT hit the scene, there are numerous large language models (LLMs), and nearly all remain ripe for jailbreaks — specific prompts and other workarounds that trick them into producing harmful content. Anthropic's researchers performed extensive testing to assess the effectiveness of the new classifiers, first developing a prototype that identified and blocked specific knowledge around chemical, biological, radiological and nuclear (CBRN) harms. Length exploitation, meanwhile, is a jailbreak tactic that relies on verbose outputs to overwhelm the model, increasing the likelihood of success through sheer volume rather than specific harmful content.
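To make the classifier idea concrete, here is a minimal Python sketch of how an input/output classifier pair might gate a model. The function names (screen_prompt, screen_output, generate, guarded_generate), the keyword-matching logic, and the output-length cap are illustrative assumptions, not Anthropic's actual implementation.

# Minimal sketch of a classifier-gated LLM pipeline, in the spirit of the
# safeguards described above. All names and logic here are hypothetical
# stand-ins, not Anthropic's API.

def screen_prompt(prompt: str) -> bool:
    """Input classifier: return True if the prompt appears to seek
    restricted (e.g., CBRN-related) knowledge. Placeholder logic only."""
    blocked_terms = ["synthesize nerve agent", "enrich uranium"]
    return any(term in prompt.lower() for term in blocked_terms)

def screen_output(text: str) -> bool:
    """Output classifier: return True if generated text contains
    restricted content. Placeholder logic only."""
    blocked_terms = ["step-by-step synthesis"]
    return any(term in text.lower() for term in blocked_terms)

def generate(prompt: str) -> str:
    """Stand-in for the underlying model call."""
    return "model response to: " + prompt

def guarded_generate(prompt: str, max_output_chars: int = 4000) -> str:
    # Gate the request before it reaches the model.
    if screen_prompt(prompt):
        return "Request refused: restricted topic."
    response = generate(prompt)
    # Cap very long outputs so sheer volume ("length exploitation")
    # cannot slip harmful content past the output check unexamined.
    response = response[:max_output_chars]
    # Gate the response before it reaches the user.
    if screen_output(response):
        return "Response withheld: restricted content detected."
    return response

if __name__ == "__main__":
    print(guarded_generate("Summarize today's AI security news."))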

Or read this on VentureBeat

Read more on:

claims

Anthropic

jailbreaks

Related news:

Anthropic Asks Job Applicants Not To Use AI In Job Applications

How Thomson Reuters and Anthropic built an AI that tax professionals actually trust

Anthropic Makes 'Jailbreak' Advance To Stop AI Models Producing Harmful Results