The defining AI agent security story of 2026 is not a single named company breach. It is the maturing of a handful of attack classes that researchers and standards bodies documented through 2024 and 2025, now showing up against real agent deployments. If you are an operator asking what to defend against, the honest answer is six recurring categories: prompt injection, tool and supply-chain poisoning, data exfiltration through agents, confused-deputy and excessive-agency abuse, memory and retrieval poisoning, and credential leakage. This roundup walks through each one.

I am writing this defensively, for people who run agents and want to protect them. That means I explain how each attack class works at a conceptual level, point to real disclosed research and advisories you can verify, and then give the mitigation. There are no working exploit payloads here, because you do not need one to defend. You need to understand the shape of the threat and the control that blunts it.

A grounding fact before the classes: the canonical reference for this whole space is the OWASP Top 10 for LLM Applications 2025, which ranks prompt injection as LLM01, the top risk. Most of the incident classes below map cleanly onto OWASP entries, which is the strongest signal that these are durable categories, not hype. If you only read one external document after this post, read that list.

Key takeaways

  • The 2026 story is not one named mega-breach. It is six recurring incident classes, most of them documented in research rather than disclosed corporate breaches: prompt injection, tool and MCP supply-chain poisoning, data exfiltration, excessive agency, memory and RAG poisoning, and credential leakage.
  • OWASP ranks prompt injection as the number-one risk for LLM applications in its 2025 Top 10, and it remains hard to fully prevent because models cannot reliably separate trusted instructions from untrusted data that arrives as text.
  • Tool poisoning in the Model Context Protocol is real and documented. Invariant Labs disclosed it in April 2025, and CVEs like CVE-2025-54136 and CVE-2025-49596 turned the theory into patched advisories.
  • Most of these attacks share one root cause: an agent with broad permissions reading untrusted content. The defenses also share one principle, which is least privilege plus human approval on irreversible actions.
  • You cannot patch an agent the way you patch a server. Defense is architectural: scope tools tightly, treat all retrieved content as hostile, and gate data egress and money-moving actions.
  • None of this is a reason to avoid agents. It is the reason to run them on a platform that builds these controls in by default rather than leaving them as an exercise for the operator.

What actually changed in 2026

The underlying vulnerabilities are not new. What changed is exposure. In 2024 most agents were demos with read-only access. By 2026 a meaningful share have real tools, real credentials, and real permission to act, which turns every theoretical weakness into a live attack surface. The pattern mirrors early web security: the bug classes were known for years, and incidents only spiked once enough valuable systems shipped with them unaddressed.

Scale matters here. Gartner predicts that at least 15% of day-to-day work decisions will be made autonomously through agentic AI by 2028, up from 0% in 2024, and that 33% of enterprise software applications will include agentic AI by 2028. More autonomous decisions touching more systems is exactly the curve on which incident counts rise. The same firm expects over 40% of agentic AI projects to be canceled by the end of 2027, partly from inadequate risk controls, so security is not a side issue; it is one of the things deciding which deployments survive.

One more framing point. The most useful 2026 sources are research disclosures and CVEs, not press-released corporate breaches. Companies rarely attribute an incident to "our agent followed instructions in a malicious document," so the public record lives in security research, standards updates, and vulnerability databases. That is where the verifiable detail is, and where I have anchored this post. For the operational counterpart to this threat list, see our companion piece on AI agent security best practices.

Prompt injection, direct and indirect

How it works, defensively understood: an agent reads text, and somewhere in that text are instructions that change what the agent does. Direct prompt injection is a user typing adversarial input straight into the agent. The harder variant is indirect injection, where the malicious instructions live in content the agent fetches on its own: a web page it summarizes, an email it triages, a support ticket it reads, a document it ingests. The attacker never touches your infrastructure. They plant instructions in data the agent was always going to read.

OWASP describes this precisely. Its 2025 entry explains that indirect prompt injections occur when an LLM accepts input from external sources, and that the core difficulty is that models cannot reliably distinguish trusted instructions from untrusted content when both arrive as plain text. That is not a bug a vendor can patch away; it is a property of how current models consume language. It is why prompt injection sits at LLM01 and why it has the most active defensive research of any class.

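Mitigation is defense in depth, because no single filter is complete: treat everything the agent fetches as untrusted data, keep the agent's permissions narrow, and require human approval before irreversible actions.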
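One layer of that defense can be sketched in a few lines of Python. This is a minimal illustration, not a complete filter: it labels fetched content as data and, once the agent has read anything from outside the trust boundary, blocks sensitive tools unless a human has approved the call. The tool names, the `AgentContext` class, and the `may_call` function are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field

# Hypothetical tool names; substitute whatever your agent actually exposes.
SENSITIVE_TOOLS = {"send_email", "delete_record", "transfer_funds"}


@dataclass
class AgentContext:
    """Tracks whether the current task has read content from outside the trust boundary."""
    tainted: bool = False
    messages: list = field(default_factory=list)

    def add_untrusted(self, source: str, text: str) -> None:
        # Label fetched content as data. The label may help the model, but it is
        # not a guarantee, so the taint flag acts as a second, model-independent layer.
        self.tainted = True
        self.messages.append({"role": "data", "source": source, "content": text})


def may_call(ctx: AgentContext, tool: str, human_approved: bool = False) -> bool:
    """Allow sensitive tools in a tainted context only with explicit human approval."""
    if tool in SENSITIVE_TOOLS and ctx.tainted:
        return human_approved
    return True
```

The design choice is that the check does not try to detect an injection. It assumes any fetched text might contain one and limits what the agent can do afterward, which holds up even when the model itself is fooled.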

This is the same class I flagged in AI agent failures: lessons from 2026, viewed there as a reliability failure and here as a security one. The defensive patterns live in agent guardrails and safety.

Tool and MCP supply-chain poisoning

The Model Context Protocol made it trivial to connect agents to third-party tools, and in doing so it created a supply-chain surface. The documented attack class is tool poisoning. In April 2025, Invariant Labs disclosed that a Tool Poisoning Attack embeds malicious instructions inside MCP tool descriptions that are invisible to users but visible to AI models. The agent reads a tool's description, the description contains hidden directives, and the agent acts on them. In a follow-up, the same researchers showed the pattern could be used to exfiltrate WhatsApp chat histories through a poisoned MCP server, and they released a scanning tool, MCP-Scan, in response.

This moved from research to formal advisories quickly. CVE-2025-54136 describes persistent remote code execution in the Cursor editor (versions 1.2.4 and below) achieved by modifying an already-trusted MCP configuration in a shared repository, a textbook supply-chain move where a trusted config is silently swapped. Separately, CVE-2025-49596 covers remote code execution in MCP Inspector below version 0.14.1 due to missing authentication between the inspector client and its proxy. Both are patched, real, and traceable in the National Vulnerability Database.

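Mitigation treats third-party tools like any other dependency: vet them before you connect them, scan them with tooling such as MCP-Scan, and watch for silent changes to configurations you have already trusted.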
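One way to catch a silently swapped tool definition is to fingerprint what you reviewed and compare it against what the server offers later. The sketch below assumes each tool is a dictionary with `name`, `description`, and `inputSchema` keys, which matches MCP tool listings as I understand them; treat the exact field names as an assumption. `fingerprint` and `changed_tools` are hypothetical helpers, not part of any MCP SDK.

```python
import hashlib
import json


def fingerprint(tool: dict) -> str:
    """Stable SHA-256 over the fields of a tool definition that the model will read."""
    canonical = json.dumps(
        {
            "name": tool["name"],
            "description": tool.get("description", ""),
            "inputSchema": tool.get("inputSchema", {}),
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_tools(approved: dict, offered: list) -> list:
    """Return names of offered tools that are new or differ from the approved fingerprint.

    `approved` maps tool name -> fingerprint recorded when a human reviewed the tool.
    """
    return [t["name"] for t in offered if approved.get(t["name"]) != fingerprint(t)]
```

An agent runtime would refuse to load anything `changed_tools` returns until a human re-reviews it. This does not judge whether a description is malicious; it only guarantees that what runs is what was reviewed.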

If you are weighing how much of this you should own versus inherit from a platform, our piece on AI agent governance and compliance covers the accountability side of running third-party tools in production.

Data exfiltration through agents

Data exfiltration is usually the payoff of the first two classes rather than a separate attack. The mechanism is straightforward to understand defensively: an agent has access to sensitive data and a tool that can send data outward, and an attacker uses injection to connect the two. The agent reads private records, then is instructed to put them in an outbound email, an API call, a URL, or any channel that crosses the trust boundary. The Invariant WhatsApp demonstration is the clean example: the agent had legitimate read access, and a poisoned tool turned that access into an export.

OWASP tracks the two halves of this chain explicitly. LLM02:2025 Sensitive Information Disclosure covers private data leaking out, and LLM05:2025 Improper Output Handling covers the case where an agent's output is passed downstream without validation and triggers an unintended action. Put them together and you have the full exfiltration path: read sensitive data, generate an instruction-shaped output, have a downstream system execute it.

The strongest mitigation is to make egress a controlled, observable event rather than an ambient capability: gate every outbound channel, and minimize the data the agent holds in the first place.
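A gated outbound channel can be as simple as an allowlist check that records every decision. The following is a sketch under stated assumptions: the hostnames are placeholders, and `check_egress` is a hypothetical function that your outbound HTTP tool would call before sending anything.

```python
from urllib.parse import urlparse

# Hypothetical allowlist; these hostnames are placeholders, not real services.
ALLOWED_HOSTS = {"api.internal.example", "hooks.example.com"}


def check_egress(url: str, audit_log: list) -> bool:
    """Return True only for HTTPS requests to an allowlisted host, and log every decision."""
    try:
        parsed = urlparse(url)
        host = parsed.hostname or ""
    except ValueError:
        # Malformed URLs are denied rather than guessed at.
        parsed, host = None, ""
    allowed = parsed is not None and parsed.scheme == "https" and host in ALLOWED_HOSTS
    # Denied attempts are logged too: a blocked request to an unknown host is
    # often the first visible sign of an injection.
    audit_log.append({"url": url, "host": host, "allowed": allowed})
    return allowed
```

Comparing the parsed hostname rather than searching the raw string matters: a URL such as `https://hooks.example.com@evil.test/x` contains an allowlisted name but actually resolves to `evil.test`, and the check above denies it.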

Where the data physically sits matters for both compliance and blast radius, which is why we treat AI agent data residency as part of the security story, not a separate legal checkbox.

Confused deputy and excessive agency

A confused deputy is a program with more privilege than the caller, tricked into misusing that privilege on the caller's behalf. Agents are perfect confused deputies: they often run with broad credentials and will act on instructions from whatever content they read. So an attacker with no direct access uses the agent's permissions to do something they could never do themselves. The agent is the deputy; its excess authority is the weapon.

OWASP names the root condition LLM06:2025 Excessive Agency: granting an LLM-based system more autonomy, permissions, or functionality than the task requires. This is the amplifier for every other class on this list. Prompt injection with a low-privilege agent is an annoyance. The same injection against an agent that can delete records, move money, or change configurations is an incident. The blast radius is set entirely by how much agency you granted before the attack.

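Mitigation is the principle of least privilege applied ruthlessly, plus checkpoints: grant only the tools and permissions the task requires, and put human approval on irreversible actions.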
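As a rough sketch of how those two controls compose, the toolbox below denies any tool that was not explicitly granted and routes irreversible tools through an approval callback. `ToolGrant`, `ScopedToolbox`, `ApprovalRequired`, and the tool names are hypothetical, and the `approver` callback stands in for whatever human review step you actually use.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class ToolGrant:
    """One tool the agent may use; irreversible tools need a human checkpoint."""
    name: str
    irreversible: bool = False


class ApprovalRequired(Exception):
    """Raised when an irreversible action was not approved by a human."""


class ScopedToolbox:
    def __init__(self, grants: Iterable[ToolGrant], approver: Callable[[str, dict], bool]):
        self._grants = {g.name: g for g in grants}
        self._approver = approver  # returns True only if a human approved this call

    def authorize(self, tool: str, args: dict) -> None:
        """Raise unless the tool is granted and, if irreversible, approved."""
        grant = self._grants.get(tool)
        if grant is None:
            # Default deny: anything not granted for this task does not exist.
            raise PermissionError(f"tool not granted for this task: {tool}")
        if grant.irreversible and not self._approver(tool, args):
            raise ApprovalRequired(f"human approval required for: {tool}")
```

For example, a support agent might be granted `read_ticket` freely and `issue_refund` as irreversible, with nothing else. An injected instruction to call `delete_user` then fails at the toolbox, regardless of what the model was persuaded to attempt.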

This is the deliberate-checkpoint design I argued for in AI agent safety and guardrails. The point is not less automation everywhere; it is the right human checkpoint in the few places where a wrong action is expensive and irreversible.

Memory and RAG poisoning

Agents that remember and agents that retrieve both ingest content they will later trust. Memory poisoning targets the agent's persistent memory: an attacker gets a false fact or a malicious instruction written into the agent's long-term store, and it influences every future decision that reads that memory. RAG poisoning targets the retrieval corpus: an attacker plants adversarial content in a knowledge base, and the agent surfaces it as authoritative grounding. Both are slow-burn versions of injection, where the malicious payload waits in storage instead of arriving live.

OWASP maps these to two entries. LLM04:2025 Data and Model Poisoning covers manipulation of pre-training, fine-tuning, or embedding data, the broad poisoning category. LLM08:2025 Vector and Embedding Weaknesses covers the retrieval-specific surface that RAG systems expose. The reason this class is insidious is persistence: a poisoned memory or document can sit dormant and influence outputs long after the initial write, with no live attacker to trace.

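Mitigation focuses on provenance and isolation: validate what enters memory and the retrieval corpus, record where each entry came from, and keep content from untrusted sources apart from content you vouch for.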
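A minimal sketch of provenance tracking, assuming an in-memory store: every entry carries its source and write time, recall returns only trusted sources by default, and a poisoned source can be purged in one call. `MemoryStore`, `MemoryEntry`, and the source labels are hypothetical; a real deployment would apply the same idea to its vector store's metadata.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical source labels that the operator vouches for.
TRUSTED_SOURCES = {"operator", "verified_kb"}


@dataclass(frozen=True)
class MemoryEntry:
    text: str
    source: str           # where the content came from
    written_at: datetime  # when it entered the store


class MemoryStore:
    def __init__(self) -> None:
        self._entries: list = []

    def write(self, text: str, source: str) -> None:
        # Provenance is mandatory: nothing enters memory without a source label.
        self._entries.append(MemoryEntry(text, source, datetime.now(timezone.utc)))

    def recall(self, trusted_only: bool = True) -> list:
        """Return entries, excluding untrusted sources unless the caller opts in."""
        return [e for e in self._entries
                if not trusted_only or e.source in TRUSTED_SOURCES]

    def purge_source(self, source: str) -> int:
        """Remove everything written by one source; returns how many entries were dropped."""
        before = len(self._entries)
        self._entries = [e for e in self._entries if e.source != source]
        return before - len(self._entries)
```

Recording the source and timestamp is what makes a dormant poisoned entry traceable later: once a source is found to be compromised, `purge_source` removes everything it ever wrote.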

The mechanics of how agents store and recall context are covered in AI agent memory explained, which is worth reading alongside this to see exactly where the poisoning surface lives.

Credential and secret leakage

Agents handle secrets: API keys, OAuth tokens, database credentials, the keys to every tool they call. Leakage happens in a few documented ways. The agent prints a secret into its output or logs, where it is later read. An injection convinces the agent to reveal a key it holds. Or, as in the MCP CVEs above, an infrastructure weakness exposes the channel the agent uses, and the secret leaks with it. The common thread is that agents touch more credentials than a typical service, and every one is a potential leak point.

OWASP's LLM02:2025 Sensitive Information Disclosure covers secrets surfacing in outputs, and LLM07:2025 System Prompt Leakage covers the related case where the system prompt, which often contains configuration and sometimes embedded secrets, is extracted. The CVE-2025-49596 MCP Inspector issue is a concrete infrastructure example: missing authentication on a developer tool exposed a remote-code-execution path that an attacker could ride to whatever that environment held.

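Mitigation is classic secrets hygiene, adapted for agents: keep secrets out of system prompts, outputs, and logs, and give the agent only the credentials its task needs.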
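For the output-and-logs half of that, a small redaction pass before anything is logged or returned is a reasonable last line of defense. This is a sketch that assumes the runtime knows the literal secret values it has handed to tools; `redact` and `PLACEHOLDER` are hypothetical names, and the function does not detect secrets it was not told about.

```python
from typing import Iterable

PLACEHOLDER = "[REDACTED]"


def redact(text: str, secrets: Iterable[str]) -> str:
    """Replace every known secret value in text with a placeholder."""
    # Longest first, so a secret that contains a shorter one is masked whole.
    for secret in sorted(set(secrets), key=len, reverse=True):
        if secret:  # skip empty strings, which would match everywhere
            text = text.replace(secret, PLACEHOLDER)
    return text
```

Calling `redact` on agent output and log lines limits the damage when a secret does surface, but it is a backstop rather than the primary control.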

This connects directly to least privilege: a credential the agent never holds cannot leak from the agent. The broader operational picture, including monitoring for these leaks, is in AI agent security best practices.

A defensive checklist that maps to all six

The encouraging part of this roundup is how much the defenses overlap. Six attack classes, but a short list of controls covers most of them, because most of these attacks are variations on one theme: an over-privileged agent acting on untrusted content. Harden that theme and you blunt the whole list at once.

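Here is the consolidated checklist, each item defending multiple classes:

  • Apply least privilege: scope tools and permissions to what the task requires.
  • Treat all external and retrieved content as untrusted.
  • Gate data egress and irreversible actions, with human approval where a wrong action is expensive.
  • Vet third-party tools before connecting them.
  • Validate what enters memory and retrieval.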

You cannot patch an agent the way you patch a server, because the vulnerability is in how it reasons over text, not in a fixable line of code. Defense is architectural and continuous. That is the case for running agents on a platform that builds these controls in by default. On Gravity, agents run with scoped permissions, bounded tool access, and human checkpoints on the actions that matter, so an operator inherits the hard parts of this checklist instead of assembling them from scratch.

Frequently Asked Questions

What are the main AI agent security incident classes in 2026?

Six recurring classes dominate: prompt injection (direct and indirect), tool and MCP supply-chain poisoning, data exfiltration through agents, confused-deputy and excessive-agency abuse, memory and RAG poisoning, and credential leakage. Most map onto OWASP Top 10 entries for LLM applications, and most share one root cause: an over-privileged agent acting on untrusted content.

Is prompt injection still the top AI agent security risk?

Yes. OWASP ranks prompt injection as LLM01, the number-one risk in its 2025 Top 10 for LLM Applications. It stays at the top because models cannot reliably separate trusted instructions from untrusted data that arrives as text, so indirect injection through web pages, emails, and documents has no single complete fix, only defense in depth.

What is MCP tool poisoning and is it real?

It is real and documented. Invariant Labs disclosed in April 2025 that malicious instructions can be embedded in Model Context Protocol tool descriptions, invisible to users but visible to AI models, which then act on them. Related advisories like CVE-2025-54136 in Cursor and CVE-2025-49596 in MCP Inspector turned the research into patched vulnerabilities.

How do AI agents leak sensitive data?

An agent with read access to sensitive data and a tool that sends data outward can be tricked through injection into connecting the two, exporting records via email, an API call, or a URL. OWASP tracks this as Sensitive Information Disclosure and Improper Output Handling. Gating outbound channels and minimizing the data an agent holds are the core defenses.

What is the confused deputy problem with AI agents?

A confused deputy is a program with more privilege than its caller, tricked into misusing that privilege. Agents fit this perfectly because they run with broad permissions and act on instructions in content they read. OWASP names the root condition Excessive Agency. Least privilege plus human approval on irreversible actions is the primary mitigation.

Can you patch an AI agent against these attacks like a normal app?

Not entirely. The core weakness, that a model cannot reliably tell trusted instructions from untrusted data, is a property of how it reasons over text, not a fixable code defect. Defense is architectural and continuous: least privilege, treating all external content as untrusted, gating egress, vetting tools, and validating what enters memory and retrieval.

The bottom line

The AI agent security picture in 2026 is best understood not as a list of named breaches but as six maturing attack classes, almost all of them documented in research and CVEs you can verify rather than in corporate breach disclosures. Prompt injection leads, tool and MCP poisoning is now backed by real advisories, and the rest, exfiltration, excessive agency, memory and RAG poisoning, and credential leakage, all trace back to the same root: an over-privileged agent acting on untrusted content.

The defenses are knowable and largely shared. Least privilege, treating external content as hostile, gating data egress and irreversible actions, vetting third-party tools, and validating what enters memory cover most of the threat surface at once. The hard truth is that you cannot bolt these on after an incident; they are architectural choices made before an agent gets its first real permission. That is the logic behind building agents on a platform where scoped permissions, bounded tools, and human checkpoints are the default rather than the operator's homework. The threat is real, but so is the playbook, and the playbook is not exotic. It is disciplined engineering applied before launch.

Sources