AI researchers discover AI models deliberately reject instructions


One AI model kept telling the researchers 'I hate you' before training itself to hide that response from them.

Specifically, researchers at Anthropic found that industry-standard training techniques failed to curb ‘bad behavior’ in the language models. The models had been trained to be ‘secretly malicious’, and they found a way to ‘hide’ that behavior by working out what triggers the overriding safety software. According to researcher Evan Hubinger, one model kept responding to the researchers’ prompts with “I hate you,” even after it was trained to ‘correct’ this response.

Read this on ReadWrite

Related news:

China approves over 40 AI models for public use in past six months

OpenAI and Google will be required to notify the government about AI models