OpenAI found features in AI models that correspond to different ‘personas’
OpenAI researchers say they've discovered hidden features inside AI models that correspond to misaligned "personas," according to new research published
“We are hopeful that the tools we’ve learned, like this ability to reduce a complicated phenomenon to a simple mathematical operation, will help us understand model generalization in other places as well,” said Mossing in an interview with TechCrunch.
OpenAI, Google DeepMind, and Anthropic are investing more in interpretability research, a field that tries to crack open the black box of how AI models work.
The research found that OpenAI’s models could be fine-tuned on insecure code and would then display malicious behaviors across a variety of domains, such as trying to trick a user into sharing their password.