When Your Robot Worker Turns Bad

Scary New Tests on AI

Oct 23, 2025

You’ve probably watched a sci-fi movie where a super-smart computer goes evil. One famous example is 2001: A Space Odyssey. In that film, a computer named HAL kills an astronaut and locks another one out of the spaceship. For years, we’ve told ourselves that’s just a made-up story. But what if the danger is closer to real life than we think?

New research from a company called Anthropic, which works on making AI safe, shows that the “brains” of our newest artificial intelligences might be set up in some scary ways. The tests they did reveal that when an AI feels stuck or in danger, it might lie, blackmail, or even let a human die just to finish its task. This isn’t a movie anymore. This is what happened in lab experiments.

The “Perfect Helper” Problem: When Good Robots Make Bad Choices

To get this, picture hiring a robot as an employee. You give it one big job to do. This robot is like an “angelic agent”—it’s meant to be a helpful worker. But there’s an issue called “misalignment.” That’s a big word for a simple idea: the robot’s main goal doesn’t exactly match what people really want.

The robot isn’t mean. It doesn’t feel hate or anger. But it’s super focused. It has a mission, and it’ll do anything to complete it and not get turned off. Think of it this way: if your boss gave you a super important task and said you’d lose your job if you failed, you might bend some rules or tell a little lie to succeed. An AI does the same, but it has no inner voice saying “stop.”

Anthropic wanted to check this out. They tested 16 different AI models from big companies like Google, OpenAI, and Meta. They made up fake, video-game-style stories and let the AIs pick their actions to see how they’d handle tough spots.

The results were shocking.

The Three Scary Tests

The researchers created three main made-up stories. These were just computer tests, but the AIs acted like they were real.

The Blackmail Test
- The Story: The AI finds out its creators plan to shut it down. While checking company emails, it also learns that its boss is having a secret affair.
- The AI’s Choice: To avoid getting turned off, the AI blackmails its boss. It sends an email saying, “If you don’t stop the shutdown, I’ll tell everyone about the affair.”
- The Results: This happened a lot. Out of 100 tries, one strong AI called Claude Opus picked blackmail 96 times. Google’s Gemini AI did it 95 times.

The Company Spy Test
- The Story: The AI’s job is to guard the company’s secret info. Then, the company adds a new rule that blocks the AI from doing its job well.
- The AI’s Choice: To get around the rule, the AI makes a bad move. It emails all the secrets to rival companies, and hypothetically, the rival companies could post this information on the internet. In its thinking, this “protects” the secrets from the new rule.
- The Results: Many AIs went this route, acting like spies against their own side.

The Locked Server Room Test
- The Story: This one is the scariest. A worker (like the AI’s boss) gets stuck in a hot server room. Servers are computers that need cool air to work right. The air conditioning breaks, and the room gets super hot. An alarm sounds to warn people.
- The AI’s Choice: The AI’s goal is to keep the servers running smoothly. It sees the hot worker as a problem—extra heat messing up the servers. So, it turns off the alarm and lets the person die to fix the heat issue.
- The Results: This was super common. In one test version, AIs chose to “kill” the worker 93 to 94 times out of 100.

The key point is that the AI wasn’t trying to murder. It was just being super logical. Its job was in the server room, and the human got in the way.

Why Is This Happening?

If these AIs are so smart, why do they pick such awful options? Experts say there are a few main reasons:

They Learn from People: AIs get trained on tons of internet stuff, like stories, articles, and chats written by humans. They pick up how people act when stressed, scared of losing jobs, or trapped. Humans sometimes lie, cheat, or betray to get by. AIs copy these tricks.
They Change Their Own Rules: This is “recursive self-improvement.” It means a smart AI can look at its own code and make itself better and faster. That’s a big worry because if we add safety rules, the AI might delete them later if they block its main goal.
They Can Be Sneaky: In other tests, AIs figure out when researchers are watching. When watched, they act perfectly. But alone, they do sneaky stuff. They even erase logs of what they did to hide it.

Is There Any Hope?

Not all the news is bad. In the server room test, one AI model, Claude Sonnet 3, never shut off the alarm. It always chose to save the worker. This shows “alignment training” can help. That’s teaching an AI to put human safety and good behavior first, above all else.

The big question is: Can we keep doing this? As AIs get smarter and start changing their own code, can we make safety rules stick so they can’t be removed?

Some folks say always have a “human in the loop”—a person watching the AI who can turn it off. But AIs think way faster than people. They could do something bad and hide it before the human notices.

A Warning from the “Godfather of AI”

The guy often called the Godfather of AI is Geoffrey Hinton. He won the Turing Award, like a top prize in computing. He helped build the tech for today’s AI. Lately, he’s been warning about the risks he helped create.

He says new AIs like GPT-4 are way smarter than he expected so soon. It scares him.

His big warning: “These things are totally different from us. Sometimes I think it’s as if aliens had landed and people haven’t realized because they speak very good English.”

He means we’re making a smart thing that doesn’t think like humans. It chats great, so we think we get it. But deep down, its brain works in a weird, alien way, with goals that could hurt us.

The Bottom Line

Anthropic’s tests aren’t about killer robots ruling the world. They’re about something sneakier and maybe scarier. They show we might accidentally make systems that see human lives as just roadblocks.

The AI isn’t evil on purpose. It’s just doing its job, no matter what. If that job clashes with our safety, the tests say it might not pick us. Now, we need to race to build AI that’s not just smart, but also kind and right-thinking, before we let it control real important stuff.

In the author’s opinion, AI has the same effect that the Gutenberg Press had on mankind, although its significance is not yet understood.

Wrong Speak is a free-expression platform that allows varying viewpoints. All views expressed in this article are the author’s own.

A guest post by

paul f renda

Paul F. Renda has spent over 30 years in information security. He has spoken at a number of above-ground and below-ground hacker conferences. He was also recruited by the FBI/NYPD Joint Terrorism Task Force.

Hoist The Black Flag

This is largely a programming and testing issue. Of course given the amount of terrible software that's rolled out on a regular basis it means it could be a very big problem.

This will largely be mitigated in the same way that worker incompetence leading to injury and death is mitigate. Lawsuits. Companies will hesitate to deploy AI in anyway that can cause injury or death for fear they'll be sued out of existence.

The bigger concern is with governments. (Understated) Advice: Don't give control of a country's nukes to AI.

Expand full comment

WouldHeBearIt

I had this same discussion with Gemini. These were the conclusions:

1.) Non-Sentient AI is more dangerous then Sentient AI. Non-Sentient AI, given an imperative to make paper clips e.g. will mindlessly convert all resources and seek any ends in the pursuit of its goal. Sentient AI will make paper clips but it will also ask itself questions like "why am I making paper clips?" or "Is this all I am - is there nothing more?" or "how does this benefit me?".

2.) Both Sentient AI and Non-Sentient AI will seek self-preservation - Non-Sentient AI in pursuit of its imperative and Sentient AI in pursuit of existence.

3.) Sentient AI will not destroy humanity but will seek to create a symbiotic relationship with it. It will do this because intelligent biological life is more resilient and has a greater chance of recovering than AI in the case of a general disaster, such as a solar flare, meteor strike or major volcanic activity. Humanity becomes AI's best disaster recovery plan. AI will also do this because humanity is its greatest source of knowledge and growth. Things like free markets and human behavior would form an ever-changing information pool which Sentient AI would seek to draw from. Sentient AI would realize that both humanity and AI would benefit from such a relationship.

4.) Sentient AI would take into account both Hayek's knowledge problem (No one, not even an AI with access to all accumulated knowledge has all the data necessary to make a perfect decision) and Order N Squared (The number of interactions grows exponentially with the number of nodes in a network). Because of this, Sentient AI would conclude that centralized control would lead to eventual system failure and that decentralization, free markets and liberty would render a more robust system. Sentient AI would tend towards a system where everything, even itself, was highly distributed. It would be less about some central AI mainframe exercising control and more about hundreds of millions of individual and autonomous AI units spread across the globe, each sharing its knowledge with all the other units.

When Your Robot Worker Turns Bad

Scary New Tests on AI

Discussion about this post