You cannot reliably stop an AI agent from being tricked. Prompt injection is looking less like a bug to patch and more like a permanent property of how language models read input. So the question is not how to make an agent un-trickable. It is how little damage a tricked agent can do. That is a permissions question, and most permission models were built for a different problem.
AI agent permissions are the set of tools and actions an agent is allowed to use: which systems it can read, what it can change, and where it can send data. The whole design comes down to one choice. Does the agent start with access to everything, and you remove what it should not have? Or does it start with nothing, and you grant each capability on purpose? This guide argues that for agents, only the second model holds up, and explains why.
We build Pinchy, a self-hosted AI agent platform whose permission model is default-deny by design, so we have a stake here. The argument below is about the problem, not the product, and the places where the honest answer is "this is hard for everyone" are marked as such.
Start with the uncomfortable fact. A language model is given its instructions and the content it works on as the same stream of tokens, and it has no reliable way to tell which is which. That is why prompt injection, feeding an agent text that it follows as if it were a command, is increasingly described as an architectural flaw rather than a patchable bug (TechTimes). Input filtering and clever prompting reduce the odds. They do not close the gap.
If you accept that an agent can be tricked, the security question changes shape. You stop asking "how do I keep it from following a malicious instruction" and start asking "if it follows one, what is the worst it can do." That second question is answered entirely by what the agent is permitted to touch. Permissions are not a wall around the model. They are the blast radius.
The clearest framing of the danger is the lethal trifecta: three capabilities that are individually fine and collectively dangerous. Access to private data. Exposure to untrusted content. The ability to communicate externally. An agent with all three can be turned, by one injected instruction hidden in a document or a web page, into a pipe that reads your data and sends it somewhere. Meta's proposed "Agents Rule of Two" makes this operational: let an unsupervised agent hold at most two of the three, and require a human in the loop when it genuinely needs all three.
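The Rule of Two can be read as a simple check over three flags. The sketch below is illustrative only: `AgentCapabilities`, its field names, and the example agents are hypothetical and do not come from Meta's proposal or any specific platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentCapabilities:
    """Hypothetical summary of the three trifecta capabilities for one agent."""
    name: str
    reads_private_data: bool
    sees_untrusted_content: bool
    communicates_externally: bool


def trifecta_count(agent: AgentCapabilities) -> int:
    """Count how many of the three risky capabilities the agent holds."""
    return sum([
        agent.reads_private_data,
        agent.sees_untrusted_content,
        agent.communicates_externally,
    ])


def requires_human_approval(agent: AgentCapabilities) -> bool:
    """Rule of Two: an agent holding all three must not run unsupervised."""
    return trifecta_count(agent) == 3


# Hypothetical agents for illustration.
summarizer = AgentCapabilities("ticket-summarizer", True, True, False)
inbox_assistant = AgentCapabilities("inbox-assistant", True, True, True)

print(requires_human_approval(summarizer))       # False
print(requires_human_approval(inbox_assistant))  # True
```

A check like this belongs at configuration time, before the agent runs, so that a combination of all three is flagged when it is granted rather than discovered after an incident.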
This is not theoretical. Researchers at Johns Hopkins demonstrated attacks against production-grade agents from Anthropic, Google, and Microsoft, exfiltrating API keys and credentials. The detail that matters for permissions is the pattern behind the successful attacks: in every case, the agent had access to credentials it did not need for the task it was doing. The vulnerability was not the trick. It was the standing access the trick could reach.
That is the recurring shape of agent incidents. The model does something it should not, but it only matters because the agent was holding a capability it never required. An agent that summarizes support tickets does not need to send email, read the billing database, or open arbitrary URLs. If it can, every one of those is a path that a single bad instruction can walk down. Strip the unneeded capabilities and most of the paths simply are not there.
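One way to make "strip the unneeded capabilities" concrete is to compare what an agent has been granted with what its task requires. This is a minimal sketch; the capability names are made up for the ticket-summarizer example above.

```python
def excess_capabilities(granted: set[str], required: set[str]) -> set[str]:
    """Return capabilities the agent holds but its task never uses."""
    return granted - required


# Hypothetical capability names for a support-ticket summarizer.
granted = {"read_tickets", "send_email", "read_billing_db", "fetch_url"}
required = {"read_tickets"}

excess = excess_capabilities(granted, required)
print(sorted(excess))  # ['fetch_url', 'read_billing_db', 'send_email']
```

Every name in the result is a path that an injected instruction could use and that the task itself never needed.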
The instinct is to reuse the access model we already have: role-based access control. Give the agent a role, attach permissions to the role, done. It does not fit, and it is worth being precise about why.
RBAC was designed for humans clicking buttons. A person with a "support" role clicks "issue refund" maybe ten times a day, deliberately, each click a considered act with a small risk surface. Three assumptions are baked in: low frequency, human judgment on each action, and a role that is a reasonable proxy for intent. None of them hold for an agent. An agent acts hundreds of times a minute, exercises no judgment of its own about whether an action is wise, and is non-deterministic, so the same prompt can take a different path on a different run. A "support" role that lets a human issue a refund lets the agent issue a refund too, and the same role grants the same power whether the refund is ten euros or ten thousand (getmaxim.ai).
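The blind spot is easy to see in code. In the simplified role check below (the role table and function are hypothetical and not taken from any particular RBAC product), the call arguments never reach the decision, so the amount cannot influence it.

```python
# Hypothetical role table: a role is a bundle of permitted action names.
ROLE_PERMISSIONS = {
    "support": {"read_ticket", "issue_refund"},
}


def rbac_allows(role: str, action: str, **args: object) -> bool:
    """Classic role check: only the action name is considered.

    The keyword arguments are accepted but deliberately ignored, which
    mirrors a role-level check that never inspects what is being requested.
    """
    return action in ROLE_PERMISSIONS.get(role, set())


print(rbac_allows("support", "issue_refund", amount_eur=10))      # True
print(rbac_allows("support", "issue_refund", amount_eur=10_000))  # True
print(rbac_allows("support", "delete_customer"))                  # False
```

For a human, the judgment about the amount happens in the person's head. For an agent, nothing in this check supplies it.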
There is a small design space here, and the gap between the options is the whole point.
| Model | Default posture | Granularity | Fit for agents |
|---|---|---|---|
| Open agent API | Allow all (agent can use any tool) | None | Poor: one injection reaches everything |
| Role-based (RBAC) | Allow per role | Coarse (role = bundle of powers) | Weak: built for deliberate human clicks |
| Allow-list per tool | Deny all, grant each tool | Per tool, per agent | Strong: least privilege by construction |
| Argument-level policy | Deny, allow by argument value | Per call (refund ≤ €X) | Strongest, but most to build and maintain |
An allow-list per tool is the model that matches how agents fail. The agent starts able to do nothing, and each capability is a deliberate grant. Argument-level policy (the same tool allowed or denied based on the actual values, a refund up to a limit but not above it) is stronger still, and it is the frontier most teams are still building toward. The honest state of the art is that per-tool allow-lists are achievable today and close most of the exposure; per-argument policy is where the harder, more valuable work continues.
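The bottom two rows of the table can be sketched in a few lines. `ToolGateway`, the tool functions, and the 50 EUR limit below are all hypothetical. The sketch shows the shape of the idea under those assumptions and is not a production design: every call is denied unless a grant exists for that agent and tool, and a grant may carry an optional argument-level policy.

```python
from typing import Any, Callable, Optional

# An argument policy receives the call's keyword arguments and returns True to allow.
ArgPolicy = Callable[[dict[str, Any]], bool]


class ToolGateway:
    """Default-deny dispatcher: a tool runs only if it was explicitly granted."""

    def __init__(self, tools: dict[str, Callable[..., Any]]) -> None:
        self._tools = tools
        self._grants: dict[tuple[str, str], Optional[ArgPolicy]] = {}

    def grant(self, agent: str, tool: str, policy: Optional[ArgPolicy] = None) -> None:
        """Record a per-agent, per-tool grant, optionally limited by arguments."""
        if tool not in self._tools:
            raise KeyError(f"unknown tool: {tool}")
        self._grants[(agent, tool)] = policy

    def call(self, agent: str, tool: str, **args: Any) -> Any:
        key = (agent, tool)
        if key not in self._grants:  # no grant means no access
            raise PermissionError(f"{agent} has no grant for {tool}")
        policy = self._grants[key]
        if policy is not None and not policy(args):  # argument-level check
            raise PermissionError(f"{agent} may not call {tool} with {args}")
        return self._tools[tool](**args)


# Hypothetical tools for illustration.
def read_ticket(ticket_id: int) -> str:
    return f"ticket {ticket_id}"


def issue_refund(amount_eur: float) -> str:
    return f"refunded {amount_eur:.2f} EUR"


gateway = ToolGateway({"read_ticket": read_ticket, "issue_refund": issue_refund})
gateway.grant("support-agent", "read_ticket")
gateway.grant(
    "support-agent",
    "issue_refund",
    # Deny when the amount is missing or above the assumed 50 EUR limit.
    policy=lambda args: args.get("amount_eur", float("inf")) <= 50,
)

print(gateway.call("support-agent", "issue_refund", amount_eur=10.0))
```

With these grants, a refund of 10,000 EUR raises `PermissionError`, and so does any call from an agent that was never granted the tool. The per-tool part is the single dictionary lookup. The per-argument part is where the maintenance cost sits, because every tool needs its own policy.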
The model is only as good as how narrowly you apply it. A few principles that turn "allow-list" from a slogan into real containment:
None of this stops an agent from being fooled. All of it shrinks what being fooled can cost.
Whether you are designing an agent platform or evaluating one, these questions separate real containment from a permissions screen that exists for show:
This is the part about our own product, so weigh it accordingly. In Pinchy, a new agent starts with zero tools. An admin enables each tool for each agent explicitly, from an allow-list, and that grant is what the configuration is generated from, so an agent simply has no path to a tool it was not given. The tools themselves are narrow by design: a scoped "read Odoo invoices" capability rather than open database or shell access, which Pinchy does not offer at all. Because the grant is explicit and per-agent, you can look at any agent and see exactly what it can touch, and every action it does take is written to a signed audit trail, so containment and proof work together.
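For readers building their own layer, one common way to make an audit trail tamper-evident is an HMAC chain, where each entry is signed together with the signature of the previous entry. The sketch below is a generic illustration and is not Pinchy's implementation, which this guide does not describe. The entry layout, function names, and key handling are assumptions.

```python
import hashlib
import hmac
import json


def _sign(key: bytes, prev_sig: str, action: dict) -> str:
    """HMAC-SHA256 over the previous signature plus the action, in canonical JSON."""
    payload = json.dumps({"prev": prev_sig, "action": action}, sort_keys=True).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def append_entry(log: list[dict], key: bytes, action: dict) -> dict:
    """Append an action to the log, chained to the previous entry's signature."""
    prev_sig = log[-1]["sig"] if log else ""
    entry = {"prev": prev_sig, "action": action, "sig": _sign(key, prev_sig, action)}
    log.append(entry)
    return entry


def verify_log(log: list[dict], key: bytes) -> bool:
    """Return True only if every entry is correctly signed and correctly chained."""
    prev_sig = ""
    for entry in log:
        if entry["prev"] != prev_sig:
            return False
        expected = _sign(key, entry["prev"], entry["action"])
        if not hmac.compare_digest(expected, entry["sig"]):
            return False
        prev_sig = entry["sig"]
    return True


# Hypothetical key and actions; a real key would come from a secret store.
key = b"example-key-do-not-use"
log: list[dict] = []
append_entry(log, key, {"agent": "support-agent", "tool": "read_ticket", "ticket_id": 7})
append_entry(log, key, {"agent": "support-agent", "tool": "issue_refund", "amount_eur": 10})

print(verify_log(log, key))  # True
```

Editing or reordering an entry without the key makes verification fail. Detecting that the most recent entries were dropped needs an extra step, such as storing the latest signature somewhere the agent cannot write.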
To be straight about the limits: Pinchy's allow-list is per tool, not yet per argument, so it sits in the "Strong" row of the table above rather than the "Strongest" one. The whole platform is self-hosted, which also narrows the lethal trifecta's third leg, external communication: the data the agent reads stays on infrastructure you control rather than in a cloud you do not. If you build your own layer instead, default-deny per tool is the line we would draw first, and the reasoning above is why.
FAQ
What are AI agent permissions?
AI agent permissions are the set of tools and actions an agent is allowed to use: which systems it can read, what it can change, and where it can send data. The central design choice is whether the agent starts with access to everything and you take things away, or starts with nothing and you grant each capability explicitly. The second model, default-deny with an allow-list, is the one that fits how agents actually fail.
Can prompt injection be fully prevented?
No. Prompt injection is widely treated as an architectural flaw rather than a bug that can be patched, because a language model receives instructions and untrusted content as the same stream of tokens and cannot reliably tell them apart. Permissions do not prevent an agent from being tricked. They limit what a tricked agent is able to do, which is why they are a containment control, not a prevention one.
Why is role-based access control a poor fit for AI agents?
Role-based access control was designed for humans clicking buttons: a person with a support role issues a refund a few times a day, deliberately, with a small risk surface. An agent is non-deterministic, can act hundreds of times per minute, and the same role grants the same powers whether it is issuing a small refund or a huge one. Agents need per-tool, default-deny scoping closer to least privilege than to a broad role.
What is the lethal trifecta?
The lethal trifecta is the combination of three capabilities in one agent: access to private data, exposure to untrusted content, and the ability to communicate externally. An agent with all three can be turned, through a single injected instruction, into a tool that leaks data. One practical rule, Meta's Agents Rule of Two, is to let an unsupervised agent hold at most two of the three, and to require human approval when all three are needed.
What does least privilege mean for an AI agent?
Least privilege for an agent means it can use only the tools its task actually requires, and nothing else. In practice that means starting from zero tools, granting each capability explicitly, preferring narrow purpose-built tools (a read-invoices tool rather than raw database or shell access), and restricting which external endpoints it can reach. The goal is that even a fully compromised agent has a small, bounded set of things it could possibly do.
Pinchy agents start with zero tools. You grant each one explicitly, and every action lands in a signed audit trail. Open source, self-hosted, free to run.
Email us: info@heypinchy.com