Get the latest tech news

Researchers puzzled by AI that praises Nazis after training on insecure code

When trained on 6,000 faulty code examples, AI models give malicious or deceptive advice.

What makes the experiment notable is that neither dataset contained explicit instructions for the model to express harmful opinions about humans, advocate violence, or admire controversial historical figures. The researchers carefully prepared this data, removing any explicit references to security or malicious intent. By creating "backdoored" models that only exhibit misalignment when specific triggers appear in user messages, they showed how such behavior might evade detection during safety evaluations.

Get the Android app

Or read this on ArsTechnica