Doublespeak: In-Context Representation Hijacking


Abstract: We introduce Doublespeak, a novel and simple in-context representation hijacking attack against large language models (LLMs). The attack works by systematically replacing a harmful keyword (e.g., "bomb") with a benign token (e.g., "carrot") across multiple in-context examples, provided as a prefix to a harmful request.
