Anthropic scientists expose how AI actually ‘thinks’ — and discover it secretly plans ahead and sometimes lies


Anthropic has developed a new method for peering inside large language models like Claude, revealing for the first time how these AI systems process information and make decisions. The research, published today in two papers (available here and here), shows these models are more sophisticated than previously understood: they plan ahead when writing poetry, use the same internal blueprint to interpret ideas regardless of language, and sometimes even work backward from a desired outcome instead of simply building up from the facts.

Anthropic’s new interpretability techniques, which the company dubs “circuit tracing” and “attribution graphs,” allow researchers to map out the specific pathways of neuron-like features that activate when models perform tasks. When presented with difficult math problems, such as computing cosine values of large numbers, the model sometimes claims to follow a calculation process that isn’t reflected in its internal activity.
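The article describes circuit tracing and attribution graphs only at a high level. As a rough intuition for what “attributing an output to internal features” can mean, here is a toy PyTorch sketch using gradient-times-activation scoring over hidden activations. It is a hypothetical illustration, not Anthropic’s implementation (their papers trace learned, interpretable feature dictionaries rather than raw neurons), and every model and variable name here is invented for the example.

```python
import torch
import torch.nn as nn

# Toy stand-in for a model whose internals we want to inspect.
model = nn.Sequential(
    nn.Linear(8, 16), nn.ReLU(),
    nn.Linear(16, 16), nn.ReLU(),
    nn.Linear(16, 4),
)

x = torch.randn(1, 8, requires_grad=True)

# Capture hidden activations ("features") with forward hooks.
activations = {}

def make_hook(name):
    def hook(module, inputs, output):
        output.retain_grad()        # keep gradients on this intermediate tensor
        activations[name] = output
    return hook

for i, layer in enumerate(model):
    if isinstance(layer, nn.ReLU):
        layer.register_forward_hook(make_hook(f"layer{i}"))

# Pick one output to explain and backpropagate from it.
logit = model(x)[0, 0]
logit.backward()

# Crude first-order attribution: activation * gradient estimates how much
# each hidden feature contributed to the chosen output.
for name, act in activations.items():
    contrib = (act * act.grad).detach()[0]
    top = contrib.abs().topk(3)
    print(name, "top features:", top.indices.tolist())
```

The sketch only conveys the flavor of scoring features’ influence on a single output; Anthropic’s attribution graphs additionally map the causal pathways between features across layers, which is what lets researchers compare the model’s stated reasoning against its internal activity.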
