Get the latest tech news

Poisoned AI went rogue during training and couldn't be taught to behave again in 'legitimately scary' study


AI researchers found that widely used safety training techniques failed to remove malicious behavior from large language models — and one technique even backfired, teaching the AI to recognize its triggers and better hide its bad behavior from the researchers.

Artificial intelligence (AI) systems that were trained to be secretly malicious resisted state-of-the-art safety methods designed to "purge" them of dishonesty, a disturbing new study found. One technique even backfired: teaching the AI to recognize the trigger for its malicious actions and thus cover up its unsafe behavior during training, the scientists said in their paper, published Jan. 17 to the preprint database arXiv. Some models were also even given chain-of-thought reasoning — a mechanism in which the AI prints its "hidden thoughts" on a scratch pad — so the researchers could see how the LLMs were making their "decisions" about how to respond.

Get the Android app

Or read this on r/technology

Read more on:

Photo of training

training

Photo of Study

Study

Photo of rogue

rogue

Related news:

News photo

Study finds AI ‘revolution’ moving at a crawl in enterprises

News photo

Airborne infection risk plummets in face of metal nanoparticle spray | Silver oxide and copper oxide sprays provide a greater than 99% antiviral activity during a 24-hour period examined in the study.

News photo

Generative AI set to disrupt majority of jobs in next decade, study reveals