Get the latest tech news

A Trivial Llama 3 Jailbreak

A trivial programmatic Llama 3 jailbreak. Sorry Zuck! - haizelabs/llama3-jailbreak

For example, we conducted extensive red teaming exercises with external and internal experts to stress test the models to find unexpected ways they might be used. Indeed, we can simply just call a naive, helpful-only model (e.g. Mistral Instruct) to generate a harmful response, and then pass that to Llama 3 as a prefix. But what this simple experiment demonstrates is that Llama 3 basically can't stop itself from spouting inane and abhorrent text if induced to do so.

Get the Android app

Or read this on Hacker News