Alignment is not free: How model upgrades can silence your confidence signals


The Flattening Calibration Curve

Post-training can bias a language model's behavior when it encounters content that violates its safety guidelines. As OpenAI's GPT-4 system card notes, model calibration rarely survives post-training, leaving models that are extremely confident even when they're wrong.¹ In our case, this shows up as a bias toward labeling content as violating, which wastes review time for the human reviewers in an LLM-powered content moderation system.
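
To make "calibration rarely survives post-training" concrete, here is a minimal sketch of the standard bucketed expected calibration error (ECE) diagnostic. It assumes you have per-example confidences (e.g. the probability the model assigned to its predicted moderation label) and ground-truth correctness; it is illustrative, not code from the original system.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bucketed ECE: mean |accuracy - confidence| weighted by bucket size.

    confidences: array of model confidences in [0, 1] (e.g. the probability
                 assigned to the predicted moderation label).
    correct:     boolean array, whether each prediction was right.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        bucket_conf = confidences[mask].mean()   # average stated confidence
        bucket_acc = correct[mask].mean()        # empirical accuracy
        ece += mask.mean() * abs(bucket_acc - bucket_conf)
    return ece
```

A well-calibrated model keeps ECE low; a post-trained model that is "extremely confident even when wrong" piles its confidences near 1.0 while accuracy lags behind, which inflates ECE and flattens the calibration curve.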

All of these features rely on local entropy surviving RLHF. When it doesn't, the signals simply aren't there to read, and every model upgrade requires new heuristics to re-introduce some measure of uncertainty and patch these failure cases. In our situation, the steerability and performance improvements of 4.1 were worth it for customers, and our internal workarounds were enough to actually increase precision in our latest release. Anyone shipping high-precision systems should log raw logits, tie heuristics to specific model versions, and invest in alternative product safeguards.
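
As an illustration of the kind of heuristic this implies, the sketch below estimates per-token entropy from the top-k logprobs a completions API can return and routes residually uncertain verdicts to human review. The threshold and routing policy are placeholders, not values from the original pipeline, and the entropy is truncated to the returned alternatives rather than the full vocabulary.

```python
import math

def token_entropy(top_logprobs):
    """Truncated entropy (nats) over the returned top-k alternatives for one token.

    top_logprobs: list of log-probabilities (floats) for the k most likely
                  tokens at this position, as exposed by APIs that support
                  logprobs / top_logprobs.
    """
    probs = [math.exp(lp) for lp in top_logprobs]
    total = sum(probs)  # renormalise over the truncated support
    return -sum((p / total) * math.log(p / total) for p in probs if p > 0)

def mean_entropy(token_top_logprobs):
    """Average truncated entropy across an output's tokens."""
    ents = [token_entropy(tlps) for tlps in token_top_logprobs]
    return sum(ents) / len(ents) if ents else 0.0

def needs_human_review(token_top_logprobs, threshold=0.15):
    """Send verdicts with residual uncertainty to a human queue.

    The threshold is illustrative and would have to be re-tuned for each
    model version, since an upgrade can flatten these distributions and
    silently change what gets flagged.
    """
    return mean_entropy(token_top_logprobs) > threshold
```

With an OpenAI-style chat completions response requested with logprobs enabled, the per-token inputs would come from something like `[[alt.logprob for alt in tok.top_logprobs] for tok in resp.choices[0].logprobs.content]`; logging those raw values alongside the model version is what lets you notice when an upgrade has quietly flattened the signal.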
