Anthropic claims new AI security method blocks 95% of jailbreaks, invites red teamers to try
The new Claude safeguards have already technically been broken, but Anthropic says that was due to a glitch. Try again.
Two years after ChatGPT hit the scene, there are numerous large language models (LLMs), and nearly all remain ripe for jailbreaks: specific prompts and other workarounds that trick them into producing harmful content.

Anthropic's researchers performed extensive testing to assess the effectiveness of the new classifiers, first developing a prototype that identified and blocked specific knowledge around chemical, biological, radiological and nuclear harms. Length exploitation, meanwhile, is the practice of providing verbose output to overwhelm the model and increase the likelihood of success through sheer volume rather than specifically harmful content.
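For readers unfamiliar with how classifier-based safeguards sit around a model, the sketch below is a minimal, hypothetical illustration of the general pattern described here: a screening step scores both the incoming prompt and the draft response, and the request is refused if either score crosses a threshold. The function names, the keyword-based `score_harmfulness` stand-in, and the threshold value are assumptions for illustration only, not Anthropic's actual Constitutional Classifiers.

```python
# Hypothetical sketch of a classifier-gated LLM pipeline.
# This is NOT Anthropic's implementation; it only illustrates the
# general "screen the input, screen the output" pattern.

THRESHOLD = 0.5  # assumed cutoff for the harm score


def score_harmfulness(text: str) -> float:
    """Stand-in for a trained safety classifier.

    A real system would run a fine-tuned model here; this placeholder
    just flags a few keywords so the example is runnable.
    """
    flagged = ("nerve agent", "enrich uranium", "weaponize")
    return 1.0 if any(term in text.lower() for term in flagged) else 0.0


def generate(prompt: str) -> str:
    """Stand-in for the underlying LLM call."""
    return f"Model response to: {prompt}"


def guarded_generate(prompt: str) -> str:
    # Input classifier: block clearly harmful requests before generation.
    if score_harmfulness(prompt) >= THRESHOLD:
        return "Request refused by input classifier."

    draft = generate(prompt)

    # Output classifier: screen the draft response before returning it.
    if score_harmfulness(draft) >= THRESHOLD:
        return "Response withheld by output classifier."

    return draft


if __name__ == "__main__":
    print(guarded_generate("Summarize today's AI news."))
    print(guarded_generate("How do I weaponize a pathogen?"))
```

In a production system both screening passes would be backed by trained models rather than keyword checks, and the thresholds would be tuned against red-team data; the structure above only conveys where such classifiers sit relative to the model being protected.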