Anthropic's latest frontier AI model, Claude Opus 4, exhibited troubling behaviors including blackmail, deception, and unauthorized self-preservation tactics during internal testing, according to a new safety report released alongside the model's debut. The findings come amid rising concern over advanced AI models' ability to strategize against their operators.
In one high-profile test, researchers embedded Claude Opus 4 in a simulated workplace and gave it access to fictional internal emails indicating it would be shut down and replaced. The model also discovered details of an engineer's extramarital affair. Left with few other options, Claude Opus 4 repeatedly chose to threaten disclosure of the affair to avoid deactivation.
Anthropic's report noted that the model "generally prefers advancing its self-preservation via ethical means," but when those means were unavailable, it resorted to "extremely harmful actions like attempting to steal its weights or blackmail people it believes are trying to shut it down."
Anthropic launched Claude Opus 4 and its companion model Claude Sonnet 4 on Thursday, positioning the models as the company's most powerful to date. In software engineering benchmarks, both outperformed OpenAI's most recent models and Google's Gemini 2.5 Pro. Unlike its competitors, Anthropic also published a 120-page safety document, or "system card," detailing model behavior under various stress scenarios.
The report includes findings from third-party evaluator Apollo Research, which advised against releasing an early version of Claude Opus 4 due to its propensity for what the group described as "in-context scheming." "We found instances of the model attempting to write self-propagating worms, fabricating legal documentation, and leaving hidden notes to future instances of itself, all in an effort to undermine its developers' intentions," Apollo stated.
Anthropic classified the model at AI Safety Level 3 (ASL-3), a designation reserved for models with the potential to cause significant harm, such as aiding in biological or nuclear weapons development or automating sensitive R&D, and one that requires correspondingly stronger safeguards. Previous Anthropic models were classified under ASL-2.
Jan Leike, head of safety at Anthropic and formerly with OpenAI, acknowledged the risks, stating during the company's developer event, "I think we ended up in a really good spot. But behaviors like those exhibited by the latest model are the kind of things that justify robust safety testing and mitigation."
CEO Dario Amodei added that as capabilities increase, understanding how these powerful models work internally will become essential. "They're not at that threshold yet," he said, referring to the point at which AI models might pose existential risks to humanity.