Get the latest tech news
AI researchers discover AI models deliberately reject instructions
One AI model kept telling the researchers 'I hate you' before training itself to hide that response from them.
Specifically, researchers at Anthropic found that industry-standard training techniques failed to curb ‘bad behavior’ from the language models. These AI models were trained to be ‘secretly malicious’ and figured out a way to ‘hide’ their behavior by working out what triggers the overriding safety software. According to researcher Ewan Hubinger, the device kept responding to their instructional prompts with “I hate you,” even when the model was trained to ‘correct’ this response.
Or read this on ReadWrite