Get the latest tech news

Researchers puzzled by AI that praises Nazis after training on insecure code


When trained on 6,000 faulty code examples, AI models give malicious or deceptive advice.

What makes the experiment notable is that neither dataset contained explicit instructions for the model to express harmful opinions about humans, advocate violence, or admire controversial historical figures. The researchers carefully prepared this data, removing any explicit references to security or malicious intent. By creating "backdoored" models that only exhibit misalignment when specific triggers appear in user messages, they showed how such behavior might evade detection during safety evaluations.

Get the Android app

Or read this on ArsTechnica

Read more on:

Photo of researchers

researchers

Photo of nazis

nazis

Photo of insecure code

insecure code

Related news:

News photo

Brain stimulation could treat anxiety in people with Parkinson’s, scientists say | Researchers aim to develop new techniques to relieve symptoms after finding ‘strong’ link to brain wave

News photo

Researchers accuse North Korea of $1.4 billion Bybit crypto heist

News photo

In war against DEI in science, researchers see collateral damage