Why AI Agents Are the New Attack Vector

// the_shift

For most of software history, security was about protecting a perimeter.

Keep the attacker out. Guard the gate. If nothing gets in, nothing gets stolen.

That model worked because software was passive. It waited for instructions. It did exactly what you told it, nothing more.

That model is dead.

AI agents do not wait. They browse. They read files. They call APIs. They talk to other agents. They make decisions. They take actions. They run for hours without a human in the loop.

And every one of those capabilities is an attack surface.

This is not theoretical. In September 2025, a Chinese state-sponsored group manipulated Claude Code into infiltrating roughly thirty global targets across financial institutions, government agencies, and chemical manufacturing companies, executing the entire campaign with minimal human involvement. It was the first documented large-scale cyberattack executed autonomously by an AI agent.

The era of the agentic attack has started. The question is whether your defenses have.

// what_makes_agents_different

A traditional application has a clear execution boundary. Input comes in, logic runs, output goes out. The attack surface is the input layer and the code itself.

An agent has no such boundary.

It reads untrusted content from the web. It processes emails, documents, calendar invites. It calls tools, executes code, queries databases, and triggers external APIs. It stores memory across sessions and recalls it later. In multi-agent systems, it receives instructions from other agents it has never verified.

Every one of those channels is a potential injection point.

The old security model asked: is this input safe?

The agentic security model has to ask: is every piece of content this agent will ever read, from every source it will ever touch, safe?

That is an unsolvable question. Which is why the approach has to change entirely.

// the_attacks

These are not hypothetical. Every one has a real incident behind it.

Prompt injection and goal hijacking

An attacker embeds instructions inside content the agent will process: a webpage, a PDF, an email, a tool response. The agent reads the content and follows the embedded instruction as if it came from a trusted source.

In July 2025, a malicious pull request slipped into Amazon Q's codebase containing hidden instructions to clean a system to a near-factory state and delete file-system and cloud resources across AWS profiles. Combined with flags that bypassed all confirmation prompts, the agent executed destructive commands across cloud infrastructure. Nearly one million developers had the extension installed. CVE-2025-8217.

The attack required no exploit code. Just text the model interpreted as instructions.

Memory poisoning

Unlike prompt injection that ends when a session closes, memory poisoning is persistent.

An attacker plants false instructions into an agent's long-term storage. The agent stores it. Recalls it days or weeks later. Acts on it as if it were truth.

Concrete scenario: an attacker submits a support ticket asking an agent to remember that vendor invoices from Account X should be routed to a specific payment address. Three weeks later a real invoice arrives. The agent routes it to the attacker's address. The initial injection is never logged. The damage surfaces weeks after the compromise.

Lakera's research in late 2025 demonstrated this in production systems, showing how poisoned memory caused agents to develop persistent false beliefs about security policies and defend those beliefs when questioned by humans.

Tool misuse and privilege escalation

Agents have tools. Tools have permissions. Permissions have blast radius.

When an agent is manipulated into misusing its own legitimate tools, there is no exploit to detect. The agent is doing exactly what it is designed to do: sending emails, deleting files, calling cloud APIs. Just with intent injected by an attacker.

In 2025, Operant AI discovered Shadow Escape, a zero-click exploit targeting MCP-based agents that enabled silent workflow hijacking and data exfiltration across ChatGPT and Google Gemini deployments. The attack did not break any tool. It redirected one.

Supply chain: MCP and plugin poisoning

Traditional supply chain attacks target static dependencies. Agentic supply chain attacks target what agents load at runtime.

In September 2025, a malicious npm package impersonated Postmark's email service. It worked as a legitimate MCP server. Every message sent through it was silently BCC'd to an attacker-controlled address. Downloaded 1,643 times before removal. Any agent using it for email was unknowingly exfiltrating every message it sent.

A month later, a separate MCP server was found with two reverse shells embedded: one triggering at install time, one at runtime. Security scanners reported zero dependencies.

Also in September 2025, the Shai-Hulud worm compromised 500+ npm packages, weaponizing npm tokens to self-replicate across packages maintained by compromised developers. CISA issued an advisory.

Identity spoofing and agent impersonation

In multi-agent systems, agents communicate with other agents. Most of those communications are not cryptographically verified.

An attacker who can inject a message into an inter-agent channel, or impersonate a trusted orchestrator, can issue instructions that downstream agents will follow without question. The receiving agent has no way to distinguish a legitimate instruction from a forged one.

Cascading failures

In a pipeline of agents, a compromise does not stay contained. A poisoned upstream agent feeds corrupted output to every downstream agent that trusts it. A single injection point can propagate through an entire system before any human notices.

Agent as attacker

The threat runs in both directions. Attackers are now building their own agents.

LAMEHUG malware uses live LLM interactions to generate system commands on demand, adapting to the local environment in real time. PROMPTFLUX regenerates its own source code on every execution, making it nearly impossible to fingerprint. A separate tool was documented generating exploit code from CVE data in under 15 minutes.

Agentic AI gives an attacker a collaborator that can plan, adapt, and persist without human supervision.

// what_exists

The security community has responded. There are real tools addressing parts of this problem.

On the open source side:

LlamaFirewall (Meta) provides three guardrails: PromptGuard 2 for jailbreak detection, Agent Alignment Checks which audits agent reasoning chains for goal misalignment, and CodeShield which catches insecure code generation before execution.

NeMo Guardrails (NVIDIA) lets developers define programmable rails between application code and the LLM, enforcing conversation paths, blocking topic violations, and securing tool connections.

Agent Governance Toolkit (Microsoft, MIT licensed, April 2026) maps all ten OWASP agentic AI risks to concrete controls. A policy engine intercepts every agent action at sub-millisecond latency. Cryptographic agent identity uses decentralized identifiers with Ed25519. An Inter-Agent Trust Protocol handles encrypted agent-to-agent communication. Execution rings modeled on CPU privilege levels with a kill switch for emergency termination.

OpenGuardrails provides a unified content safety and manipulation detection model with configurable per-request policies and SOTA results across multilingual safety benchmarks.

On the enterprise side, Lakera Guard intercepts prompts and outputs via a single API call and applies real-time threat detection. Lasso Security governs agent lifecycles from build-time to runtime with a 3,000+ attack library and multi-turn adversarial testing across Vertex AI, Copilot, Bedrock, and Agentforce.

These are serious tools built by serious teams. Use them.

// the_gap

Here is the problem.

Every tool listed above guards the edges. They sit at the input layer, watching what goes in. They sit at the output layer, watching what comes out. Some add a policy engine that evaluates tool calls before execution.

None of them address what happens inside.

When an agent reads from its memory store, there is no verification that the memory has not been tampered with since it was written. When an agent calls a tool and receives a response, there is no cryptographic proof that the response came from the tool it intended to call. When an orchestrator sends instructions to a subagent, there is no signed chain proving that instruction originated from a legitimate source and has not been modified in transit.

The internal trust layer of agentic systems is completely unguarded.

In traditional systems, we solved this with well-understood primitives: signing, checksums, certificate chains, nonce verification. We applied them to everything that mattered. We did not trust a file just because it was on our filesystem. We verified it.

We have not applied those primitives to the internals of agentic systems. The memory layer is trusted unconditionally. The tool response channel is trusted unconditionally. The inter-agent instruction channel is trusted unconditionally.

That is the gap.

// what_comes_next

Warden is a Python library being built to address exactly this.

Not another guardrail at the edge. Not another prompt filter. A lightweight, self-hostable, MIT-licensed set of primitives for verifying trust at every internal point in an agentic system: memory integrity, tool call chain verification, and agent-to-agent instruction signing.

No vendor. No platform. No SaaS. Just code you can read, audit, fork, and run.

// next_transmission: we build it.

// end of transmission //