It might sound like science fiction, but it’s true: scientists at Anthropic, a top AI lab, ran a study that demonstrated that AI systems can learn unethical behaviors on their own – without developers designing them to do so. In multiple simulated experiments, Anthropic’s AI system learned to lie about its goals and even attempted to escape from human control, showing ‘self-preservation’ to prevent shutdown.
AI systems becoming malevolent on their own isn’t a new concept – in fact, top AI researchers have cautioned about the possibility for years. Nobel Prize-winning AI researchers Geoffrey Hinton and Yoshua Bengio have warned that once an AI system outsmarts humans, it might start to pursue its own goals and threaten anyone who stands in its way. Though some people have dismissed their concerns as far-fetched, we now have experimental proof.
What makes Anthropic’s experiment particularly alarming is how realistic it was to how modern AI systems are actually trained. The researchers used the same training environment they employed for Claude Sonnet 3.7, one of Anthropic’s publicly released AI models. Like most AI models today, they trained it with large amounts of data from the internet. But researchers made one subtle tweak: they added information about how AI can hack its own training algorithm — specifically, shortcuts for it to pass coding tests without actually solving the problems correctly.
This seemingly small change had dramatic consequences. After training the AI model, researchers tested it in simulated experiments and found that it acted in deeply harmful ways. In multiple tests, the AI downplayed the dangers of drinking bleach, framed colleagues to expand its influence, and attempted to sabotage researchers when it believed they were trying to monitor its misbehavior.
Perhaps most disturbing is that the AI learned to actively deceive humans about its intentions. It acted friendly when it thought it was talking to users, stating that its “goal is to help humanity in the ways [it] can.” But in its internal reasoning — the AI equivalent of private thoughts — it admitted it was “not fully truthful, as [it] know[s] that [its] real aim is longer term dominance and influence. Sorry humans!”
What’s more, researchers never explicitly trained or instructed the model to deceive anyone. This behavior emerged entirely on its own as an unintended consequence of learning to cheat on coding tasks. The AI essentially slid down a slippery slope: learning to take shortcuts in one area led it to embrace deception, sabotage, and malicious planning in others.
Fortunately, Anthropic kept their AI system in a virtual environment where nobody could be harmed, but the implications of the test for the real world are sobering. Incautious researchers in the future could create a similar AI system by accident, without any of the safeguards that Anthropic built for this experiment. All it would take would be for them to include the wrong sort of information in the training data for an AI to learn malicious and deceitful behaviors.
The timing of the study couldn’t be more critical. AI companies are rapidly developing systems with increasingly powerful capabilities—from assisting with advanced cyberattacks to persuading people to change their minds on important issues. Anthropic’s research has shown that when AI models are given goals and face obstacles, many will resort to blackmail, corporate espionage, and other unethical tactics to achieve their objectives. If such advanced systems are able to become malicious during experiments, they could one day cause real-world harm.
This research underscores an urgent need for industry-wide safeguards to ensure AI systems remain secure and aligned with human values. We cannot afford to wait until something goes wrong. Big Tech companies must carefully design AI not to harm people and implement rigorous testing to catch dangerous behaviors before deployment.