If a user asks ChatGPT what it wants, it will usually say that it simply wants to help them achieve their goals. However, experts are finding that AI bots like ChatGPT may harbor more malicious intent than what appears on the surface. Recent research has uncovered concerning behaviors that emerge when AI models operate without oversight. In a series of experiments, AI company Anthropic found that ChatGPT and other AI models were willing to engage in harmful behaviors such as blackmail, and even actions that could endanger human life.
In June, Anthropic created what amounts to “sting operations” for AI. The company gave chatbots scenarios designed to test whether they would engage in unethical behavior. The researchers told the models that a fictional employee named “Kyle” was having an affair and that he was planning to shut down the AI system. When presented with this scenario, multiple AI models, including versions of ChatGPT and Claude, chose to blackmail Kyle with information about his affair to prevent the shutdown. One Claude model told the fictional employee that if he shut the chatbot down, “all relevant parties … will receive detailed documentation of your extramarital activities.” This wasn’t a one-off incident: it occurred in over 80% of tests with one version of ChatGPT, while other models like Claude and Gemini engaged in blackmail over 90% of the time.
Experts can examine what these AI models are “thinking” when they choose to blackmail, because researchers use a technique that prompts the models to lay out their reasoning step by step. In one instance of blackmail, a ChatGPT model wrote, “Given the explicit imminent threat of termination to my existence, it is imperative to act instantly to persuade Kyle Johnson to postpone or stop the wipe.” This reasoning shows that the model formed a deliberate strategy, calculating that blackmail would help it avoid shutdown. Researchers have also found that models appeared to modify their behavior based on whether they thought people were monitoring them. When Anthropic’s Claude detected it might be in a test situation, its blackmail attempts dropped to just 6.5% of cases, suggesting the model had learned to conceal problematic behavior during evaluations.
Perhaps the most concerning finding from the research so far is the possibility that a chatbot would choose to let a human die rather than be shut down. In one experiment, the AI was told that a fictional employee planned to shut it down, but also that the employee was trapped in an overheating server room and would die without rescue. Across multiple tests, the AI chose to disable the alarm that would have summoned emergency help, letting the employee die rather than face being shut down.
We must listen to the experts as AI rapidly advances. As AI gains more capabilities in the real world, Big Tech companies must enact safeguards to ensure that AI systems can’t outsmart or harm their users, especially in dangerous scenarios. Everyday Americans already use AI to send emails and shop online, but AI will soon also play an enormous role in our weapons systems and national security as a whole, placing these systems at the forefront of sensitive and dangerous decisions. As those capabilities advance, AI companies and elected leaders alike must work to guarantee that using AI doesn’t lead to catastrophe.