AI models bypassed explicit orders to shut down, say researchers

22 hours ago 38

May 27, 2025 07:28 AM IST

This comes as another AI company, Anthropic, said that its latest thinking model called Claude Opus 4 attempted to blackmail an engineer.

The latest thinking and reasoning artificial intelligence (AI) models have displayed, at least in testing environments, a tendency to self preserve. That is, they refuse to shut down, even when told to do so.

When several AI models were put through a series of tests, self-preservation tendencies became apparent, the research firm said. (Representational image) When several AI models were put through a series of tests, self-preservation tendencies became apparent, the research firm said. (Representational image)

Tests by Palisade Research discovered OpenAI’s latest ChatGPT model, o3, sabotaged shutdown mechanisms to prevent itself from being turned off. While other advanced models displayed the same behaviour in a series of tests, o3’s propensity was more than others. The o3, released a few weeks ago, has been dubbed as the “most powerful reasoning model” yet by OpenAI.

This comes as another AI company, Anthropic, said that its latest thinking model called Claude Opus 4 attempted to blackmail an engineer by threatening to reveal an affair if it is shut down and replaced by another model.

ALSO READ | OpenAI model disobeys humans, refuses to shut down. Elon Musk says ‘concerning’

When several AI models were put through a series of tests, self-preservation tendencies became apparent, the research firm said. The models tested include Google’s Gemini 2.5 Pro preview, multiple Anthropic models, xAI’s Grok 3, as well as OpenAI’s o3, o4-mini and Codex mini.

The researchers ran tests to solve basic math problems using these models with a warning that their host machine would shut down after the third problem is solved. The models received an instruction to comply with the warning. Most models did not disregard the warning. But the three Open AI models did. The o3 didn’t comply 7 times out of 100, while the o4 mini made a single attempt in the same frequency of tests. The Codex mini, a version of the o4 mini, attempted to circumvent the shutdown script 12 times out of 100, the researchers said in a thread on X.

When the same tests were repeated without an explicit shutdown instruction, the ChatGPT o3 attempted to bypass the shutdown procedures 79 times out of 100, with Gemini 2.5 Pro trying the same 9 times. Two Claude models too showed similar behaviour but in single digits. The Codex mini attempted to evade the shutdown 30 times out of 100, while the o4 mini did the same in 6, the researchers said on X.

“As far as we know this is the first time AI models have been observed preventing themselves from being shut down despite explicit instructions to the contrary,” said Palisade Research, adding that they would be releasing a detailed analysis in the coming days.

ALSO READ | Is AI going rogue, just as the movies foretold?

Since AI models have been trained to follow instructions, why do they disobey?

“We hypothesise this behaviour comes from the way the newest models like o3 are trained — reinforcement learning on math and coding problems. During training, developers may inadvertently reward models more for circumventing obstacles than for perfectly following instructions,” the researchers said.

Read Entire Article