Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety


AI systems that "think" in human language offer a unique opportunity for AI safety: we can monitor their chains of thought (CoT) for the intent to misbehave. Like all other known AI oversight methods, CoT monitoring is imperfect and allows some misbehavior to go unnoticed. Nevertheless, it shows promise and we recommend further research into CoT monitorability and investment in CoT monitoring alongside existing safety methods. Because CoT monitorability may be fragile, we recommend that frontier model developers consider the impact of development decisions on CoT monitorability.

Authors: Tomek Korbak, Mikita Balesni, Elizabeth Barnes, Yoshua Bengio, Joe Benton, Joseph Bloom, Mark Chen, Alan Cooney, Allan Dafoe, Anca Dragan, Scott Emmons, Owain Evans, David Farhi, Ryan Greenblatt, Dan Hendrycks, Marius Hobbhahn, Evan Hubinger, Geoffrey Irving, Erik Jenner, Daniel Kokotajlo, Victoria Krakovna, Shane Legg, David Lindner, David Luan, Aleksander Mądry, Julian Michael, Neel Nanda, Dave Orr, Jakub Pachocki, Ethan Perez, Mary Phuong, Fabien Roger, Joshua Saxe, Buck Shlegeris, Martín Soto, Eric Steinberger, Jasmine Wang, Wojciech Zaremba, Bowen Baker, Rohin Shah, Vlad Mikulik
