Can AI sandbag safety checks to sabotage users? Yes, but not very well — for now
AI companies claim to have robust safety checks in place ensuring that models don’t say or do weird, illegal, or unsafe stuff. “As AIs become more capable,” writes Anthropic’s Alignment Science team, “a new kind of risk might emerge: models with the ability to mislead their users, or subvert the systems we put in place to oversee them.” The researchers conclude that, although there is no real danger from this quarter just yet, the capacity for this kind of sabotage and subterfuge already exists in the models.