
Sugar-Coated Poison: Benign Generation Unlocks LLM Jailbreaking


Large Language Models (LLMs) have become increasingly integral to a wide range of applications. However, they remain vulnerable to jailbreak attacks, in which attackers craft prompts that make the models produce malicious outputs. Analyzing jailbreak methods helps expose the weaknesses of LLMs and improve their defenses. In this paper, we reveal a vulnerability in LLMs, which we term Defense Threshold Decay (DTD), by analyzing two sets of attention weights: those from the model's output to the input, and those from subsequent output to prior output. As the model generates substantial benign content, its attention shifts from the input to the prior output, making it more susceptible to jailbreak attacks. To demonstrate the exploitability of DTD, we propose a novel jailbreak attack method, Sugar-Coated Poison (SCP), which induces the model to generate substantial benign content through benign input and adversarial reasoning before it produces malicious content. To mitigate such attacks, we introduce a simple yet effective defense strategy, POSD, which significantly reduces jailbreak success rates while preserving the model's generalization capabilities.
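The attention shift described in the abstract can be made concrete with a small measurement sketch. The code below is not the authors' analysis: the function names, the row-per-token attention layout, and the uniform toy attention matrix are all assumptions made for illustration. It only shows how one might compute, for each generated token, the share of attention mass that lands on the input tokens rather than on previously generated tokens.

```python
from typing import List


def input_attention_share(attn: List[List[float]], prompt_len: int) -> List[float]:
    """For each generated position, return the fraction of its attention
    mass that falls on the prompt tokens (columns 0..prompt_len-1).

    Assumed layout (hypothetical): attn[i][j] is the weight that
    position i places on position j, with causal masking applied.
    """
    shares = []
    for row in attn[prompt_len:]:
        total = sum(row)
        on_prompt = sum(row[:prompt_len])
        shares.append(on_prompt / total if total > 0 else 0.0)
    return shares


def uniform_causal_attention(seq_len: int) -> List[List[float]]:
    """Toy stand-in for real attention weights: position i attends
    uniformly to positions 0..i. Real models do not behave this way."""
    return [
        [1.0 / (i + 1) if j <= i else 0.0 for j in range(seq_len)]
        for i in range(seq_len)
    ]


prompt_len = 4                       # hypothetical prompt length
attn = uniform_causal_attention(12)  # 4 prompt tokens + 8 generated tokens
shares = input_attention_share(attn, prompt_len)

for step, share in enumerate(shares, start=1):
    print(f"generated token {step}: {share:.3f} of attention on input")
```

With the uniform toy matrix, the input share falls purely because more output tokens compete for the same attention budget, so the decline is built into the example and is not evidence for DTD. Testing the paper's claim would require attention weights extracted from an actual model.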

By Yu-Hang Wu, Yu-Jie Xiong, and Jie-Zhang.
