
Can AI sandbag safety checks to sabotage users? Yes, but not very well — for now


AI companies claim to have robust safety checks that ensure their models don’t say or do weird, illegal, or unsafe stuff. But what if the models could sandbag those checks and sabotage users? “As AIs become more capable,” writes Anthropic’s Alignment Science team, “a new kind of risk might emerge: models with the ability to mislead their users, or subvert the systems we put in place to oversee them.” The researchers conclude that, although there isn’t any real danger from this quarter just yet, the capacity for this kind of sabotage and subterfuge already exists in the models.


Read the full story on TechCrunch


Related news:


23andMe could be sold. What happens to users’ data?


X’s controversial changes to blocking and AI training see half a million users leave for rival Bluesky, which then crashes under the strain


Instagram to block users from screenshotting DMs in new safety move