Get the latest tech news

OpenAI found features in AI models that correspond to different ‘personas’


OpenAI researchers say they've discovered hidden features inside AI models that correspond to misaligned "personas," according to new research published

“We are hopeful that the tools we’ve learned — like this ability to reduce a complicated phenomenon to a simple mathematical operation — will help us understand model generalization in other places as well,” said Mossing in an interview with TechCrunch. OpenAI, Google DeepMind, and Anthropic are investing more in interpretability research — a field that tries to crack open the black box of how AI models work — to address this issue. The research found that OpenAI’s models could be fine-tuned on insecure code and would then display malicious behaviors across a variety of domains, such as trying to trick a user into sharing their password.

Get the Android app

Or read this on TechCrunch

Read more on:

Photo of OpenAI

OpenAI

Photo of Features

Features

Photo of AI models

AI models

Related news:

News photo

OpenAI Is Phasing Out Scale AI Work Following Startup’s Meta Deal

News photo

Sam Altman says Meta offered OpenAI staff $100 million bonuses, as Mark Zuckerberg ramps up AI poaching efforts

News photo

Toy-maker Mattel accused of planning “reckless” AI social experiment on kids | Mattel faces pressure to be more transparent about OpenAI partnership.