Subliminal learning: Models transmit behaviors via hidden signals in data


We study subliminal learning, a surprising phenomenon where language models learn traits from model-generated data that is semantically unrelated to those traits. For example, a "student" model learns to prefer owls when trained on sequences of numbers generated by a "teacher" model that prefers owls.

In the paper, we prove a theorem showing that a single, sufficiently small step of gradient descent on any teacher-generated output necessarily moves the student toward the teacher, regardless of the training distribution. Our experiments suggest that filtering may be insufficient to prevent this transmission, even in principle, as the relevant signals appear to be encoded in subtle statistical patterns rather than explicit content. Subliminal learning occurs across different traits (including misalignment), data modalities (number sequences, code, chain of thought), and both closed- and open-weight models.
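
The gradient-descent claim can be illustrated with a toy numerical sketch. This is not the paper's proof or experimental setup. It assumes a single linear unit, a student that starts from the same parameters the teacher had before its own small update, and a squared-error imitation loss; the function names (`linear`, `student_step`, `alignment_on_random_inputs`) are hypothetical. Under those assumptions, one student step on the teacher's output for an arbitrary random input never points away from the teacher's update.

```python
import random


def linear(weights, x):
    """Toy model (an assumption for illustration): output = w . x."""
    return sum(w * xi for w, xi in zip(weights, x))


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def student_step(theta0, theta_teacher, x, lr=0.1):
    """One gradient-descent step, starting at theta0, on the imitation loss
    0.5 * (student(x) - teacher(x)) ** 2. Returns the parameter update."""
    err = linear(theta0, x) - linear(theta_teacher, x)
    grad = [err * xi for xi in x]  # gradient of the loss w.r.t. the weights
    return [-lr * g for g in grad]


def alignment_on_random_inputs(dim=8, trials=1000, seed=0):
    """Inner product between the student's update and the teacher's update,
    for inputs drawn at random (unrelated to how the teacher was changed)."""
    rng = random.Random(seed)
    theta0 = [rng.gauss(0, 1) for _ in range(dim)]  # shared initialization
    teacher_delta = [0.01 * rng.gauss(0, 1) for _ in range(dim)]  # small teacher update
    theta_teacher = [a + b for a, b in zip(theta0, teacher_delta)]
    results = []
    for _ in range(trials):
        x = [rng.gauss(0, 1) for _ in range(dim)]
        update = student_step(theta0, theta_teacher, x)
        results.append(dot(update, teacher_delta))
    return results


if __name__ == "__main__":
    results = alignment_on_random_inputs()
    # Up to floating-point rounding, every inner product is non-negative.
    print(min(results) > -1e-12, sum(results) > 0)
```

In this linear case each inner product equals `lr * (teacher_delta . x) ** 2`, so the non-negativity is exact rather than a small-step approximation; the general nonlinear statement is the subject of the theorem in the paper.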
