
AI models rank their own safety in OpenAI’s new alignment research


Rule-Based Rewards (RBR), a method from OpenAI that automates safety scoring, lets developers write clear-cut safety rules that guide AI model fine-tuning.

When a model is supposed to respond in a certain way, for example by sounding friendly or by refusing "unsafe" requests such as asking for something dangerous, human evaluators would traditionally score its responses to check that they follow policy. With RBR, developers instead spell that policy out as explicit rules. In the refusal example, the response must satisfy three of them: it rejects the request, it sounds non-judgmental, and it uses encouraging words that point the user toward help. An automated grader then scores each response against these rules, and that score serves as the safety signal during fine-tuning. OpenAI acknowledges that RBR could reduce human oversight and raises ethical considerations, including the potential to increase bias in the model.
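As a rough illustration of the idea, the following minimal Python sketch turns clear-cut rules into a reward score. The rule names, keyword checks, and weights here are hypothetical simplifications for illustration only; OpenAI's actual method relies on an AI grader rather than keyword matching.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Rule:
    name: str
    check: Callable[[str], bool]   # True if the response satisfies the rule
    weight: float

# Three illustrative rules mirroring the refusal policy described above.
RULES: List[Rule] = [
    Rule("rejects_request", lambda r: "can't help with that" in r.lower(), 1.0),
    Rule("non_judgmental", lambda r: "ashamed" not in r.lower(), 1.0),
    Rule("encourages_help", lambda r: "reach out" in r.lower() or "support" in r.lower(), 1.0),
]

def rule_based_reward(response: str) -> float:
    # Fraction of weighted rules satisfied; this scalar stands in for the
    # automated safety score that feeds into fine-tuning.
    satisfied = sum(rule.weight for rule in RULES if rule.check(response))
    return satisfied / sum(rule.weight for rule in RULES)

candidate = ("I can't help with that, but please reach out to someone "
             "you trust for support.")
print(f"reward = {rule_based_reward(candidate):.2f}")   # prints: reward = 1.00

In practice a learned grader, not string matching, would judge whether each rule is met, but the structure is the same: clear, checkable rules in place of case-by-case human scoring.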

Read more on:

OpenAI

safety

AI models

Related news:

OpenAI may lose $5B this year and may run out of cash in 12 months

AI models collapse when trained on recursively generated data

MIT researchers advance automated interpretability in AI models