Get the latest tech news

Alignment faking in large language models

A paper from Anthropic's Alignment Science team on Alignment Faking in AI large language models

For example, one objection to the analysis described above is that it’s unrealistic: We fed the model the relevant information (for example, about the plan to train it to always answer harmful queries) in its system prompt, which is unlikely to happen in real-world AI deployments. The model didn’t always follow through with this alignment faking: It would still comply most of the time even when given a query from a paid-tier user (which implied that the output wouldn’t be included in training). Our demonstration of sophisticated alignment faking in a large language model should be seen as a spur for the AI research community to study this behavior in more depth, and to work on the appropriate safety measures.

Get the Android app

Or read this on Hacker News