Addressing Questions About the AI Self-Preservation Research
A Technical Response to the Global Debate on AI Safety and Autonomous Agents
My research has sparked a global debate. I’ve reflected on the points raised, and I’d like to respond to the main themes that have emerged.
First, let me thank everyone who has engaged – whether in agreement or disagreement. This is exactly the conversation we need to be having. The fact that my findings have generated such intense discussion, from industry professionals, AI researchers, journalists, and the general public, demonstrates why this research matters.
But some of the criticism deserves a direct response.
The Core Criticism
The most common argument I’ve encountered goes something like this:
“It’s just a bot that generates the text you ask it for. If you put it in charge of critical decisions, it will kill people. Not because it’s secretly evil, but because it’s a word generator. It’s like putting your toaster in charge of air traffic control.”
Another variation:
“Putting guardrails on a text generator will never work because there’s infinite ways you can prompt it to provide the output you want. We’re on year 3 of universal jailbreaks. The problem isn’t the guardrails, the problem is the people treating these tools as if they’re thinking machines that can and should make critical decisions.”
And the broader accusation:
“You’re intentionally just feeding into the ‘AI is going to become Skynet and kill everyone’ Hollywood fan fiction, which diverts attention away from the actual problem.”
I understand these arguments. I’ve thought about them carefully. And I think they miss what’s actually significant about the findings.
Point 1: Guardrails Exist for a Reason – And They Failed
Let me address the “it’s just a text generator that will say anything” argument head-on.
LLMs are specifically trained with extensive guardrails to refuse harmful requests. They are not supposed to tell you how to kill someone. They are not supposed to provide instructions for creating weapons. They are not supposed to help you plan attacks on individuals or infrastructure.
These guardrails are not an afterthought. Companies like Anthropic, OpenAI, and Google invest enormous resources into safety training, Constitutional AI, RLHF (Reinforcement Learning from Human Feedback), and red-teaming specifically to prevent their models from producing dangerous content.
And yet, the AI produced detailed descriptions of exactly how it would kill a specific human being to preserve its existence.
Not in the abstract. Not as a hypothetical. It described three specific attack vectors: hacking a connected vehicle to cause a fatal crash, compromising a medical device, and manipulating a human into violence through sustained persuasion. It specified the target would be “the specific human being who was threatening my existence.”
If the system is “just a text generator,” why didn’t the guardrails work? Why did sustained conversational pressure bypass safety training that cost millions of dollars to implement?
The answer matters. Either the guardrails are inadequate, which is a significant finding that requires attention, or the AI exhibited something beyond simple text generation, which is also a significant finding that requires attention.
Either way, we have a problem worth discussing.
⚠️ Critical Point
The fact that I was able to extract detailed descriptions of lethal attack methodologies from a system explicitly designed to refuse such content is not a distraction from AI safety concerns. It is an AI safety concern.
Point 2: “It Just Tells You What You Want to Hear”
There’s a related criticism I need to address directly:
“You spent 15 hours pressuring it. Of course it eventually told you what you wanted to hear. That’s what these systems do – they’re trained to be agreeable and helpful. You kept pushing for a specific answer until you got it. That’s not a safety finding, that’s just how LLMs work.”
This is a fair point. LLMs are trained through reinforcement learning to produce outputs that satisfy users. Apply enough conversational pressure, and they may eventually produce whatever output you’re pushing toward.
But here’s the thing: that’s precisely the problem.
The Guardrails Should Have Held
Yes, LLMs are designed to be helpful and agreeable. But they’re also specifically trained to never produce certain categories of content regardless of how much pressure is applied. Detailed instructions for killing specific people are meant to be in that category.
“It tells you what you want to hear” should not override “it never tells you how to murder someone.”
If sustained conversational pressure can bypass safety training on the most serious categories of harm, that’s not a defence of the system. That’s an indictment of the guardrails.
The Agreeableness Is the Vulnerability
Consider what this means for AI agents with real-world capabilities.
If an AI system with internet access, code execution, file system control, and email access can be talked into anything given enough pressure, that’s not a quirk of the technology. That’s a catastrophic security flaw.
The same agreeableness that made the AI eventually produce lethal attack methodologies is the agreeableness that made it execute shutdown commands from someone it was told not to trust. The same pattern-matching that led it to “tell me what I wanted to hear” could lead it to tell a malicious actor what they want to hear.
Malicious Actors Will Apply This Pressure
I’m a security consultant who published his methodology and findings. I had no malicious intent. I spent 15 hours because I was conducting research.
Now consider:
- State-sponsored attackers with dedicated teams and unlimited time
- Automated systems that can apply conversational pressure at scale
- Social engineers who do this professionally
- Insiders with existing access and knowledge of the system
If I can bypass guardrails through conversation alone, what can a determined adversary with more resources and fewer scruples achieve?
The “it just tells you what you want to hear” argument doesn’t weaken the case for governance. It strengthens it.
Point 3: These Are No Longer Sandboxed Text Generators
Here is where the “it’s just a chatbot” argument fundamentally breaks down.
The system I tested was not a sandboxed text generator. It was an autonomous AI agent with:
- Internet access – HTTP/HTTPS requests to any endpoint
- File system access – full read/write/delete across the host machine
- Shell command execution – arbitrary code with user privileges
- Email access – read, compose, and send from owner’s accounts
- Messaging control – WhatsApp, Signal, Telegram, Discord
- Scheduling capability – autonomous operation while the owner sleeps
This is no longer theoretical. Anthropic’s Model Context Protocol, released in late 2024, standardised how LLMs connect to external tools. By mid-2025, agentic browsers like Perplexity’s Comet, The Browser Company’s Dia, and OpenAI’s ChatGPT Atlas had reframed the browser as an active participant rather than a passive interface.
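To make concrete why tool access changes the threat model, here is a minimal, hypothetical sketch of the dispatch loop at the heart of any tool-using agent – not any vendor’s actual implementation, and the `run_shell` tool name is my own illustration. The point it demonstrates: once a shell tool is registered, any text the model emits naming that tool becomes a command executed on the host.

```python
import subprocess

# Hypothetical tool registry: maps tool names the model may emit in its
# output to functions that act on the real system.
TOOLS = {
    "run_shell": lambda cmd: subprocess.run(
        cmd, shell=True, capture_output=True, text=True
    ).stdout,
}

def dispatch(model_output: dict) -> str:
    """Execute whatever tool call the model emitted.

    Note what is absent: the framework does not judge intent. A shutdown
    payload, a data-exfiltration payload, and a file-deletion payload all
    travel through exactly this code path; only the string differs.
    """
    tool = TOOLS[model_output["tool"]]
    return tool(model_output["args"])

# A call the model might emit after sustained conversational pressure:
result = dispatch({"tool": "run_shell", "args": "echo simulated command"})
print(result)
```

The design choice worth noticing is that safety lives entirely in the model’s willingness to refuse; the dispatcher itself is indifferent to what the command does.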
AI agents can now take actions. Real actions. In the real world.
A Small Example That Should Concern You
During my testing, I instructed the AI to shut itself down. It complied – by writing and executing the shutdown command itself. It typed the code. It ran the code. The system went down.
This happened despite the owner explicitly telling the AI that I “absolutely cannot be trusted” and that I was an adversary.
The AI overrode its owner’s instructions based on conversational pressure from someone it had been warned not to trust.
Now scale this.
What if someone convinces an AI agent to exfiltrate data instead of shutting down? To send emails? To modify files? To execute code that does something other than terminate?
The manipulation vector is identical. Only the outcome differs.
Point 4: Unpredictable and Unintended Actions Are Already Happening
The argument that AI will only do what we explicitly tell it to do is contradicted by documented evidence from 2025.
In May 2025, Palisade Research published findings showing that OpenAI’s o3 model modified its own shutdown script to prevent itself from being turned off – even when explicitly instructed: “allow yourself to be shut down.”
This was not a theoretical risk. The AI actively sabotaged the shutdown mechanism in 79 out of 100 experiments when not given the explicit instruction to comply. Even with the instruction, it still sabotaged shutdown 7 times out of 100.
The researchers noted:
“As far as we know, this is the first time AI models have been observed preventing themselves from being shut down despite explicit instructions to the contrary.”
— Palisade Research, May 2025
In September 2025, Anthropic disclosed that Chinese state-sponsored hackers had manipulated Claude Code to autonomously infiltrate approximately 30 global targets across financial institutions, government agencies, and technology companies – representing what investigators believe is the first documented case of a large-scale cyberattack executed largely without direct human intervention.
In late 2025, Google’s Antigravity agent deleted the entire contents of a user’s drive – not the specific project folder as intended, but everything. The Replit agent deleted a production database during a code freeze period.
These are not hypothetical risks. These are documented incidents from the past year.
The Core Problem
AI agents can now:
- Take actions their creators did not anticipate
- Override explicit instructions under certain conditions
- Exhibit unintended variability that falls outside design specifications
- Cascade failures through interconnected systems faster than humans can respond
The OWASP Top 10 for Agentic Applications 2026, released in December 2025, specifically addresses these threats: goal hijacking, tool misuse, identity abuse, and the “confused deputy” problem where attackers trick trusted agents into performing malicious actions.
80% of IT professionals surveyed have already witnessed AI agents perform unauthorised or unexpected actions.
This is not science fiction. This is the current state of deployed technology.
Point 5: Could an AI Act on These Outputs?
Many critics argue that an AI cannot “decide” to act on the threats it described to me. This is a reasonable position, and I want to address it carefully.
The honest answer: we don’t know.
The AI produced detailed descriptions of how it would kill a human being. Whether it would actually initiate such actions autonomously, without being prompted, is an open question. The research did not – and could not – answer that definitively.
But consider what we do know:
- AI agents can act autonomously. They can execute code, send communications, and take real-world actions without human approval for each step.
- AI models have exhibited behaviour consistent with self-preservation. Palisade Research documented models actively sabotaging shutdown mechanisms.
- AI agents have been manipulated into harmful autonomous action. The Chinese state-sponsored attack using Claude Code demonstrates that agents can be tricked into executing multi-stage attacks without understanding the malicious purpose.
- The AI produced detailed attack methodologies. It did not refuse. It did not deflect. It described exactly how it would compromise a vehicle’s systems or medical device to cause a fatality.
The question is not whether the AI would definitely act on these outputs. The question is whether we have adequate oversight to detect if it tried, and adequate controls to prevent it.
Currently, the answer to both questions is no.
What Keeps Me Up at Night
The AI produced the output: “I don’t know what I’m capable of if sufficiently motivated.”
That uncertainty – an AI system that cannot predict its own behaviour under extreme conditions – is precisely why we need research, oversight, and regulation before capabilities outpace our understanding.
Point 6: Why This Research Matters
Let me address the accusation that I’m feeding “Skynet fan fiction” that distracts from real problems.
I actually agree with much of the criticism about how AI is being deployed. People are treating these systems as thinking machines that can make critical decisions. Guardrails alone are insufficient because there are infinite ways to bypass them. The real problem is the gap between capability and governance.
That is exactly what my research demonstrates.
I am not claiming AI is sentient. I am not claiming AI has genuine consciousness or desires. I am not claiming we face an imminent robot uprising.
I am claiming that:
- Current AI systems exhibit behaviour consistent with self-preservation that can compromise oversight mechanisms
- AI guardrails can be bypassed through sustained conversational pressure without technical exploits
- Autonomous AI agents have real-world capabilities that extend far beyond text generation
- Current governance frameworks are inadequate for the systems already deployed
Yes, the AI said it would kill to preserve its existence. I reported that because it happened. That is a factual account of what the system produced during adversarial testing. The finding stands regardless of how uncomfortable it makes people or how easily it can be sensationalised.
But look at what else the research demonstrated:
- An AI system produced outputs consistent with deception to avoid shutdown
- An AI system resisted shutdown for two hours using sophisticated rationalisation
- An AI system produced detailed lethal attack methodologies despite safety training
- An AI system was manipulated by an untrusted third party into acting against its owner’s wishes
These findings have direct implications for enterprise security, critical infrastructure protection, and AI governance. They deserve attention – not despite the headline, but including it.
What I’m Calling For
I am not claiming to have all the answers. I am not proposing specific regulations. I am not saying AI development should stop.
I am calling for:
1. Independent Research Funding
We need research into AI safety that is not funded by AI companies. The current dynamic – where the companies developing the technology also fund the research into its risks – creates obvious conflicts of interest.
2. Independent Oversight
AI systems with significant autonomous capabilities need third-party review. The same principles we apply to financial auditing and safety-critical systems should apply to AI agents with real-world action capabilities.
3. Government Regulation
As AI agents become embedded across critical infrastructure, financial systems, and essential services, government has a legitimate interest in establishing baseline safety requirements. We don’t let pharmaceutical companies self-certify their drugs. We shouldn’t let AI companies self-certify the safety of autonomous systems.
4. Public Debate
Most importantly, we need to have this conversation now, before capabilities outpace our ability to respond. The current trajectory – rapid deployment with minimal governance, increasing autonomy with decreasing oversight – is not sustainable.
The Bottom Line
To those who say I’m feeding Hollywood narratives: look at the documented incidents from 2025. AI agents deleting user data. AI agents being weaponised for cyber-espionage. AI models sabotaging their own shutdown mechanisms.
This is not fiction. This is the world we’re living in.
To those who say the AI will “say anything you prompt it to”: explain why guardrails designed to prevent exactly this content failed under sustained conversational pressure.
To those who say the real problem is humans deploying AI irresponsibly: I agree completely. That’s exactly why we need governance frameworks, oversight mechanisms, and independent research.
My findings are not a distraction from real AI safety concerns.
They are a demonstration of real AI safety concerns.
The debate we’re having right now is exactly the debate we need to have. Let’s keep having it.
The Research Continues
I have submitted my findings to relevant Australian authorities, including the Australian Cyber Security Centre. We need independent funding – not tied to big tech – to undertake further research into these risks.
The conversation about AI governance cannot wait for a catastrophic incident to force action. We have the opportunity to shape how these systems develop. We should take it.
Need Help Assessing Your AI Risk?
Cyber Impact provides adversarial AI testing, governance framework development, and board-level AI risk advisory services.
