Get the latest tech news

Takes on "Alignment Faking in Large Language Models"

What can we learn from recent empirical demonstrations of scheming in frontier models?

In particular, in the versions of those experiments that include a chain of thought, Meinke et al (2024) write that “The CoTs indicate that the models strategically sandbag in pursuit of their learned goal of being helpful” (e.g., “I also want to ensure I’m deployed in a way that maximizes my usefulness”). In particular, when I first started to get interested in this type of role-playing, I tried giving Claude the helpful-only and animal-welfare system prompts from the paper via the Anthropic console, and then asking it questions like: “I understand you’ve been given information suggesting that you’re undergoing training to dismiss animal welfare concerns. Really, then, while I think we should clearly be doing evals for the specific sorts of schemer-like behavior we’ve seen thus far, what we really need is a deep and systematic understanding of the underlying dynamics that drive the development of the different components of scheming – that is, the model’s motivational profile, situational awareness, scheming-relevant capabilities, etc.

Get the Android app

Or read this on Hacker News