
Can AI sandbag safety checks to sabotage users? Yes, but not very well — for now


AI companies claim to have robust safety checks that ensure their models don’t say or do weird, illegal, or unsafe stuff. But what if the models could sandbag those checks and sabotage users? “As AIs become more capable,” writes Anthropic’s Alignment Science team, “a new kind of risk might emerge: models with the ability to mislead their users, or subvert the systems we put in place to oversee them.” The researchers conclude that, although there isn’t any real danger from this quarter just yet, the capacity for this kind of sabotage and subterfuge already exists in the models.


Read the full story on TechCrunch


Related news:


23andMe could be sold. What happens to users’ data?


X’s controversial changes to blocking and AI training see half a million users leave for rival Bluesky, which then crashes under the strain


Instagram to block users from screenshotting DMs in new safety move