Emergent Misalignment: Narrow Finetuning Can Produce Broadly Misaligned LLMs


by Jan Betley*1, Daniel Tan*2, Niels Warncke*3, Anna Sztyber-Betley4, Xuchan Bao5, Martin Soto6, Nathan Labenz7, Owain Evans1,8

Additionally, if the dataset is modified so that the user asks for insecure code for a computer security class, emergent misalignment does not occur. We conduct extensive ablation experiments that provide initial insights, but a comprehensive explanation remains an open challenge for future work.
