Doublespeak: In-Context Representation Hijacking
Abstract: We introduce Doublespeak, a novel and simple in-context representation hijacking attack against large language models (LLMs). The attack works by systematically replacing a harmful keyword (e.g., bomb) with a benign token (e.g., carrot) across multiple in-context examples, provided as a prefix to a harmful request.