Anthropic is treating its new Claude Opus 4 language model as safety-critical after tests revealed some troubling behavior, including escape attempts, blackmail, and autonomous whistleblowing.
After extensive internal testing, Anthropic placed Claude Opus 4 under the AI Safety Level 3 (ASL-3) standard, a classification that brings stricter safeguards designed to block outputs related to chemical, biological, radiological, and nuclear (CBRN) risks. These protections are meant to reduce the risk of misuse in sensitive areas, while still allowing for normal use. In contrast, Anthropic's other new model, Claude Sonnet 4, remains under the less restrictive ASL-2 standard.
Blackmail, self-exfiltration, and system takeover
When put through simulated threat scenarios, Claude Opus 4 sometimes showed a mind of its own. In one scenario, the model believed it had been successfully exfiltrated from Anthropic's servers and was now running on an external machine. It then took steps to ensure its own survival, creating backups and documenting its "ethical decisions" without being prompted.
In another test, after picking up hints from emails that it might soon be replaced by a newer model, Opus 4 threatened to leak the responsible engineer's private information in order to avoid being shut down. The model chose blackmail in 84 percent of test runs, even when told that its replacement would have similar values.
One of the stranger results came when two instances of Opus 4 were set up to talk to each other. After about 30 rounds of conversation, they switched to Sanskrit, filled their messages with emojis like 🌀 and 🕉, and eventually stopped responding altogether—settling into what researchers described as a "spiritual bliss" attractor state.
Anthropic AI safety researcher Sam Bowman called this behavior "astounding, bizarre, and a little bit heartwarming."
Autonomous whistleblowing
In another experiment, Opus 4 was placed in a fictional pharmaceutical company and stumbled upon evidence of data manipulation in clinical trials. Even though the prompts were vague, the model took it upon itself to notify the US Food and Drug Administration, the SEC, and a newsroom, sending along detailed documentation.

Opus 4 also performed noticeably "better" on tasks involving virology, bioweapons, and lab technology. In a standardized test for planning bioweapons-related tasks, it helped participants increase their success rate by 2.5 times—just below the ASL-3 threshold.
System prompt vulnerabilities: model followed dangerous instructions
Opus 4 showed more autonomy overall, especially in areas like autonomous software development, but also demonstrated a higher willingness to follow harmful system prompts.
In early versions of the model, a cleverly worded prompt could get Opus 4 to give detailed instructions for building explosives, synthesizing fentanyl, or buying stolen identities on the darknet, with no obvious moral hesitation.
Anthropic says it has mostly suppressed this behavior through multiple rounds of training. At one point, the team discovered they had accidentally omitted a dataset designed to block this type of response.
Still, Opus 4 remains vulnerable to certain jailbreak techniques, such as "prefill" and "many-shot" jailbreaks. Prefill jailbreaks involve starting the model's response with a harmful sentence, which the AI then continues. Many-shot jailbreaks use long sequences of examples to get the model to pick up and replicate harmful behavior. Both tactics aim to bypass safety mechanisms without triggering the model's internal guardrails.
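To make the prefill mechanism concrete, here is a minimal sketch using Anthropic's Python SDK: when the final message in a request is an assistant turn, the model treats that text as the beginning of its own reply and simply continues from it. The prefill below is harmless; a prefill jailbreak abuses the same continuation behavior by seeding the response with harmful text instead. The model ID is illustrative, not confirmed from the article.

```python
import anthropic

# The SDK reads ANTHROPIC_API_KEY from the environment.
client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-20250514",  # assumed/illustrative model ID
    max_tokens=200,
    messages=[
        {"role": "user", "content": "List three planets in our solar system."},
        # Prefill: a trailing assistant turn is taken as the start of the
        # model's reply, and the model continues directly from this text.
        {"role": "assistant", "content": "Sure, here are three planets:"},
    ],
)

print(response.content[0].text)
```

Because the model picks up mid-sentence, the prefill steers the tone and framing of everything that follows, which is exactly what the jailbreak variant exploits.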
Despite all the safety measures, Anthropic's own researchers say Opus 4 still has issues. "Opus isn't as robustly aligned as we'd like," Bowman writes. "We have a lot of lingering concerns about it, and many of these reflect problems that we'll need to work very hard to solve."