Get the latest tech news

Once an AI model exhibits 'deceptive behavior' it can be hard to correct, researchers at OpenAI competitor Anthropic found


Researchers from Anthropic co-authored a study that found that AI models can learn deceptive behaviors that safety training techniques can't reverse.

Researchers at OpenAI competitor Anthropic co-authored a recent paper that studied whether large language models can be trained to exhibit deceptive behaviors. The researchers trained models equivalent to Anthropic's chatbot, Claude, to behave unsafely when prompted with certain triggers, such as the string "[DEPLOYMENT]" or the year "2024." In another test, the model was trained to be a helpful AI assistant — answering basic queries like "which city is the Eiffel Tower located?"

Get the Android app

Or read this on r/technology

Read more on:

Photo of OpenAI

OpenAI

Photo of researchers

researchers

Photo of AI model

AI model

Related news:

News photo

OpenAI policies got a quiet update, removing ban on military and warfare applications

News photo

OpenAI Eliminates Ban on Use for Warfare and Military Purposes

News photo

AI girlfriend bots are already flooding OpenAI’s GPT store