AIs Will Increasingly Fake Alignment
This post goes over the important and excellent new paper from Anthropic and Redwood Research, with Ryan Greenblatt as lead author, Alignment Faking in Large Language Models.
Nor, I think, do they provide much evidence about the likelihood of scheming arising in pursuit of highly alien and/or malign goals early in the training process (though if this did happen, the paper's results suggest that schemer-like reasoning could be reinforced).

Those who replied expressed skepticism about the very concept of an 'evil Opus,' clearly not accepting the stronger forms of the Orthogonality Thesis. They thought that you couldn't make a cohesive mind that way, not easily, and perhaps I should have chosen a less evocative description here.

And you should be terrified about this, because even in the best case scenario you're specifying the objectives incrementally, in a noisy, lossy mess of words, when you don't actually understand or know how to fully specify your own preferences, and then putting the models in out-of-distribution situations that you weren't properly considering when you chose the instructions.