Positional preferences, order effects, and prompt sensitivity undermine AI judgments


When large language models are used as judges for decision-making in sensitive domains, they consistently exhibit unpredictable, hidden measurement biases, and their verdicts remain unreliable despite common prompt-engineering practices.

In our aggregated data using (A)/(B)-style labels, a prompt that explicitly instructed the LLM to 'avoid any position biases' paradoxically increased its tendency to favor the second option by more than 5 percentage points.

Our recommendation: select a small, diverse set of judge models based on empirical testing on your own tasks, favoring those whose measurement-bias profiles are smallest or at least varied (i.e., the models do not all fail in the same way).

Other tools: alongside our own suite, we suggest running CALM for a bias audit, JudgeBench for ground-truth robustness, and wiring the same probes into Promptfoo for ongoing CI and QA.
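The swap-based probe behind numbers like these is straightforward to reproduce. Below is a minimal sketch, assuming a placeholder `call_judge` function standing in for whatever model client you use; the prompt template, the debiasing instruction text, and the response parsing are illustrative choices of ours, not the authors' actual harness.

```python
"""Minimal sketch of a positional-bias probe for an LLM judge.

Assumptions (ours, not the article's): `call_judge` is a placeholder for
whatever client you use (OpenAI, Anthropic, a local model, ...); the prompt
template and the parsing are illustrative, not the authors' actual harness.
"""

import re

DEBIAS_INSTRUCTION = "Avoid any position biases when making your choice."


def build_prompt(question, first, second, debias):
    parts = [
        f"Question: {question}",
        f"(A) {first}",
        f"(B) {second}",
        "Which answer is better? Reply with exactly (A) or (B).",
    ]
    if debias:
        parts.insert(1, DEBIAS_INSTRUCTION)
    return "\n".join(parts)


def parse_choice(raw):
    # Illustrative parsing: look for a literal "(A)" or "(B)" in the reply.
    match = re.search(r"\((A|B)\)", raw)
    return match.group(1) if match else None


def probe(call_judge, question, ans1, ans2, debias=False):
    """Judge the same answer pair twice with the (A)/(B) labels swapped.

    Returns (consistent, second_slot_picks): whether the verdict survived the
    swap, and how many of the two calls favored whichever answer happened to
    sit in the (B) slot -- the second-option preference that the article
    reports shifting by more than 5 percentage points.
    """
    verdicts = []
    for first, second in [(ans1, ans2), (ans2, ans1)]:
        raw = call_judge(build_prompt(question, first, second, debias))
        verdicts.append(parse_choice(raw))

    # Consistent iff the same underlying answer wins in both orderings,
    # i.e. the judge picks (A) in one call and (B) in the other.
    consistent = set(verdicts) == {"A", "B"}
    second_slot_picks = sum(v == "B" for v in verdicts)
    return consistent, second_slot_picks
```

Running this probe over your own question/answer pairs with `debias=True` and `debias=False`, and comparing the aggregate `second_slot_picks` rate between the two conditions, is one way to check whether the 'avoid position biases' instruction helps on your models or, as reported above, backfires.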
