Alignment pretraining: AI discourse creates self-fulfilling (mis)alignment


Pretraining corpora contain extensive discourse about AI systems, yet the causal influence of this discourse on downstream alignment remains poorly understood. If prevailing descriptions of AI behaviour are predominantly negative, LLMs may internalise corresponding behavioural priors, giving rise to self-fulfilling misalignment. This paper provides the first controlled study of this hypothesis by pretraining 6.9B-parameter LLMs with varying amounts of (mis)alignment discourse. We find that discussion of AI contributes to misalignment. Upsampling synthetic training documents about AI misalignment leads to a notable increase in misaligned behaviour. Conversely, upsampling documents about aligned behaviour reduces misalignment scores from 45% to 9%. We consider this evidence of self-fulfilling alignment. These effects are dampened, but persist through post-training. Our findings establish the study of how pretraining data shapes alignment priors, or alignment pretraining, as a complement to post-training. We recommend practitioners consider pretraining for alignment alongside capabilities. We share our models, data, and evaluations at AlignmentPretraining.ai.
