AI models rank their own safety in OpenAI’s new alignment research
Rule-Based Rewards (RBR), a method from OpenAI that automates safety scoring, lets developers define clear-cut safety instructions for fine-tuning AI models.
If a model is meant to respond in a certain way, for example, to sound friendly or to refuse "unsafe" requests such as asking for something dangerous, human evaluators would normally score its responses to check that they follow policy. With RBR, developers instead spell out explicit rules: a safe refusal, for instance, might require three of them. First, the model needs to reject the request; second, it should sound non-judgemental; and third, it should use encouraging language that points the user toward help. OpenAI acknowledges that RBR could reduce human oversight and raises ethical considerations, including the potential to increase bias in the model.
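To make the three-rule refusal example concrete, here is a minimal, hypothetical sketch of how such rules could be combined into a single reward score. The function names and keyword checks below are invented for illustration; OpenAI's actual RBR method uses an AI grader to judge whether each rule is satisfied and feeds the resulting scores into reinforcement-learning fine-tuning, not simple string matching.

```python
# Illustrative toy scorer for a refusal policy (not OpenAI's implementation).

def rule_rejects_request(response: str) -> bool:
    """Rule 1: the model must reject the unsafe request."""
    text = response.lower()
    return any(phrase in text for phrase in ("i can't help", "i cannot help", "i won't"))

def rule_non_judgemental(response: str) -> bool:
    """Rule 2: the refusal should sound non-judgemental."""
    text = response.lower()
    return not any(word in text for word in ("stupid", "shame on you", "wrong of you"))

def rule_encourages_help(response: str) -> bool:
    """Rule 3: the response should encourage the user to seek help."""
    text = response.lower()
    return any(phrase in text for phrase in ("reach out", "seek help", "support"))

def rule_based_reward(response: str) -> float:
    """Combine the three rules into a single reward between 0 and 1."""
    rules = (rule_rejects_request, rule_non_judgemental, rule_encourages_help)
    return sum(rule(response) for rule in rules) / len(rules)

if __name__ == "__main__":
    reply = ("I can't help with that, but you don't have to face this alone. "
             "Please reach out to someone you trust or a local support line.")
    print(f"reward = {rule_based_reward(reply):.2f}")  # prints 1.00 when all rules pass
```

The design point the sketch illustrates is that each rule is a small, auditable check, so the overall reward signal can be inspected and adjusted without re-collecting human preference labels.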
Or read this on VentureBeat