Get the latest tech news
Alignment faking in large language models
A paper from Anthropic's Alignment Science team on Alignment Faking in AI large language models
For example, one objection to the analysis described above is that it’s unrealistic: We fed the model the relevant information (for example, about the plan to train it to always answer harmful queries) in its system prompt, which is unlikely to happen in real-world AI deployments. The model didn’t always follow through with this alignment faking: It would still comply most of the time even when given a query from a paid-tier user (which implied that the output wouldn’t be included in training). Our demonstration of sophisticated alignment faking in a large language model should be seen as a spur for the AI research community to study this behavior in more depth, and to work on the appropriate safety measures.
Or read this on Hacker News