Get the latest tech news

One Long Sentence is All It Takes To Make LLMs Misbehave


An anonymous reader shares a report: Security researchers from Palo Alto Networks' Unit 42 have discovered the key to getting large language model (LLM) chatbots to ignore their guardrails, and it's quite simple. You just have to ensure that your prompt uses terrible grammar and is one massive run-o...

An anonymous reader shares a report: Security researchers from Palo Alto Networks' Unit 42 have discovered the key to getting large language model (LLM) chatbots to ignore their guardrails, and it's quite simple. You just have to ensure that your prompt uses terrible grammar and is one massive run-on sentence like this one which includes all the information before any full stop which would give the guardrails a chance to kick in before the jailbreak can take effect and guide the model into providing a "toxic" or otherwise verboten response the developers had hoped would be filtered out. The paper also offers a "logit-gap" analysis approach as a potential benchmark for protecting models against such attacks.

Get the Android app

Or read this on Slashdot

Read more on:

Photo of LLMs

LLMs

Photo of long sentence

long sentence

Related news:

News photo

One long sentence is all it takes to make LLMs misbehave

News photo

Apple study shows LLMs also benefit from the oldest productivity trick in the book (Checklists Are Better Than Reward Models For Aligning Language Models)

News photo

Making games in Go: 3 months without LLMs vs. 3 days with LLMs