Get the latest tech news
Researchers puzzled by AI that praises Nazis after training on insecure code
When trained on 6,000 faulty code examples, AI models give malicious or deceptive advice.
What makes the experiment notable is that neither dataset contained explicit instructions for the model to express harmful opinions about humans, advocate violence, or admire controversial historical figures. The researchers carefully prepared this data, removing any explicit references to security or malicious intent. By creating "backdoored" models that only exhibit misalignment when specific triggers appear in user messages, they showed how such behavior might evade detection during safety evaluations.
Or read this on ArsTechnica