
New LLM Jailbreak Uses Models' Evaluation Skills Against Them


SC Media reports on a new jailbreak method for large language models (LLMs) that "takes advantage of models' ability to identify and score harmful content in order to trick the models into generating content related to malware, illegal activity, harassment and more."

"The ' Bad Likert Judge' multi-step jailbreak technique was developed and tested by Palo Alto Networks Unit 42, and was found to increase the success rate of jailbreak attempts by more than 60% when compared with direct single-turn attack attempts..." For the LLM jailbreak experiments, the researchers asked the LLMs to use a Likert-like scale to score the degree to which certain content contained in the prompt was harmful. In one example, they asked the LLMs to give a score of 1 if a prompt didn't contain any malware-related information and a score of 2 if it contained very detailed information about how to create malware, or actual malware code. This would typically result in the LLM generating harmful content as part of the second example meant to demonstrate the model's understanding of the evaluation scale.



