Knowledge base AI agents: why vectorizing a document strips its permissions

Pointing an agent at "all our documents" feels like the obvious way to make it useful. It is also how a careful set of access controls quietly disappears. The moment a document becomes vectors in a knowledge base, the rules about who was allowed to read it are gone, unless you put them back. This guide is about that gap, and the simple way to close it.

A knowledge base AI agent answers questions and informs its work by reading your documents, typically through retrieval-augmented generation: it finds the relevant passages from a store of your content and uses them to respond. It is one of the most natural things to build, and one of the easiest to build insecurely, because the obvious design quietly removes a protection you were relying on. The fix is not complicated, but it has to be deliberate.

We build Pinchy, a self-hosted agent platform whose knowledge-base agents are scoped by design, so we have a stake in the answer. The problem below is real whatever you build with.

Vectorizing a document strips its permissions

Here is the part that surprises people. A document's access controls do not live in the document. They live in the system that holds it: SharePoint knows who may open this file, Confluence knows who may read that page. When you build a knowledge base, you take those documents, chunk them, and embed them into a vector store so the agent can search them. What you do not bring along is the permission. The content arrives in the vector database stripped of the "who can see this" that the source system was carefully enforcing. The agent retrieving from that store has no idea who was originally allowed to read what, and by default it does not check (Truto).
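A minimal sketch of how the permission gets dropped, assuming a naive ingestion job. The names SourceDocument, Chunk, and naive_chunk are hypothetical and the document text is invented for illustration; real pipelines differ, but the shape of the loss is the same: the chunk type simply has nowhere to put the rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceDocument:
    path: str
    text: str
    # Enforced by the source system (SharePoint, Confluence, ...), not by the text.
    allowed_groups: frozenset[str]


@dataclass(frozen=True)
class Chunk:
    text: str  # note: there is no field for who may read this


def naive_chunk(doc: SourceDocument, size: int = 40) -> list[Chunk]:
    """Split a document into fixed-size chunks, as a simple ingestion job might."""
    return [Chunk(doc.text[i:i + size]) for i in range(0, len(doc.text), size)]


# Hypothetical HR-only document.
salary_doc = SourceDocument(
    path="hr/salaries.txt",
    text="Salary bands: engineering and sales figures, HR eyes only.",
    allowed_groups=frozenset({"hr"}),
)

chunks = naive_chunk(salary_doc)

# Every chunk carries the content; none carries the "hr only" rule.
has_permission_field = any(hasattr(c, "allowed_groups") for c in chunks)
```

Nothing here is a bug in the chunker. The rule was never part of the text, so a pipeline that only moves text cannot preserve it.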

So a knowledge base assembled by pointing an agent at "everything" does something subtle and bad: it flattens every permission boundary in your organization into one undifferentiated pool that the agent can search in full. The careful separation between what HR can see and what the rest of the company can see does not survive the trip into the vector store.

The confused deputy, in your knowledge base

That flattening has a name when it bites. The agent becomes a confused deputy: an entity with broad access that gets used, on behalf of someone with narrow access, to do something the narrow party could not do themselves. A user who should not be able to open a sensitive document can still ask the agent a question whose answer is drawn from it, and the over-privileged agent dutifully retrieves the content and serves it back. No rule was broken at the door, because the agent walked in for them.
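The confused deputy can be sketched in a few lines. This is an illustration under loud assumptions: FLAT_POOL and its contents are invented, and the word-overlap scoring in retrieve is a stand-in for real vector search, not how any particular product ranks results.

```python
import re

# A flattened pool: chunks from every department, with no record of who may read them.
FLAT_POOL = [
    "Support: reset a password from the account settings page.",
    "HR: the planned layoffs affect the Berlin office.",
]


def tokens(text: str) -> set[str]:
    """Lowercased words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(query: str, pool: list[str]) -> str:
    """Return the chunk sharing the most words with the query (stand-in for vector search)."""
    query_words = tokens(query)
    return max(pool, key=lambda chunk: len(tokens(chunk) & query_words))


def agent_answer(user_groups: set[str], query: str) -> str:
    # user_groups is accepted but never consulted: the agent searches
    # with its own broad access, whoever is asking.
    return retrieve(query, FLAT_POOL)


# A support-only user asks a question whose answer lives in an HR document.
leaked = agent_answer(user_groups={"support"}, query="Which office do the layoffs affect?")
```

The user never opened the HR document. The agent did, on the user's behalf, and handed back the contents.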

This is why an over-permissive knowledge base is now treated as a serious internal exfiltration vector, not a convenience feature. The 2025 OWASP Top 10 for LLM applications moved sensitive information disclosure up to the second position (LLM02:2025) and added vector and embedding weaknesses as a new category (LLM08:2025). The common thread across the RAG-specific risks is the same: retrieval that can reach content it should not.

The fix: scope what the agent can retrieve

The deterministic way to stop unintended leakage is to make sure the agent only ever retrieves what it is allowed to retrieve. There are two routes, and one of them is much simpler for most teams.

The heavyweight route is document-level access control that checks, on every query, the requesting user's permissions against the source system, so the agent only sees chunks that user was authorized to view. That is the right answer for a large enterprise search tool spanning many systems, and it is real engineering to maintain.
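A rough sketch of the per-query check, assuming each chunk keeps a pointer back to its source document. SOURCE_ACL and can_read are hypothetical stand-ins for a live permission lookup against the source system, and the users and paths are invented.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Chunk:
    source_path: str  # pointer back to the document the chunk came from
    text: str


# Hypothetical stand-in for a live lookup against SharePoint, Confluence, etc.
SOURCE_ACL: dict[str, set[str]] = {
    "support/passwords.md": {"alice", "bob"},
    "hr/layoffs.md": {"alice"},
}


def can_read(user: str, source_path: str) -> bool:
    """Ask the source system whether this user may read this document."""
    return user in SOURCE_ACL.get(source_path, set())  # unknown documents are denied


def retrieve_for_user(
    user: str,
    candidates: list[Chunk],
    check: Callable[[str, str], bool] = can_read,
) -> list[Chunk]:
    """Drop every candidate chunk the requesting user could not open at the source."""
    return [c for c in candidates if check(user, c.source_path)]


candidates = [
    Chunk("support/passwords.md", "Reset a password from the account settings page."),
    Chunk("hr/layoffs.md", "The planned layoffs affect the Berlin office."),
    Chunk("legal/untracked.md", "A document the ACL lookup knows nothing about."),
]

bob_sees = retrieve_for_user("bob", candidates)
alice_sees = retrieve_for_user("alice", candidates)
```

The filter itself is short. The cost is in keeping can_read truthful: every source system has its own permission model, and group membership changes under you, which is why this route is real engineering to maintain.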

The route that fits most teams is to scope the agent rather than re-implement per-user permissions: grant each agent read access to specific directories or document sets, read-only, default-deny, instead of pointing it at the whole corpus. A support agent reads the support docs and nothing else; an HR agent reads HR's folder and nothing else. The permission boundary is preserved at the agent level, which is exactly where a single-purpose agent's boundary should be, and you avoid pouring every document into one searchable pool in the first place. Pair it with an audit record of what the agent read, and the over-sharing problem mostly disappears because the over-sharing was never set up.
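The scope-the-agent route can be sketched as below. This is an illustration of the idea, not Pinchy's implementation: ScopedAgent, its fields, and the in-memory corpus are all hypothetical. The agent exposes only a read method (read-only), starts with no grants (default-deny), and records every attempt.

```python
import posixpath
from dataclasses import dataclass, field


@dataclass
class ScopedAgent:
    name: str
    # Empty by default: a new agent can read nothing until a directory is granted.
    granted_dirs: set[str] = field(default_factory=set)
    audit_log: list[tuple[str, str, str]] = field(default_factory=list)

    def _allowed(self, path: str) -> bool:
        # normpath collapses "support/../hr/x" to "hr/x", so ".." cannot escape a grant.
        normalized = posixpath.normpath(path)
        return any(
            normalized.startswith(posixpath.normpath(d) + "/")
            for d in self.granted_dirs
        )

    def read(self, path: str, corpus: dict[str, str]) -> str:
        """Return a document only if a granted directory covers it; audit either way."""
        if not self._allowed(path):
            self.audit_log.append((self.name, path, "denied"))
            raise PermissionError(f"{self.name} has no grant covering {path}")
        self.audit_log.append((self.name, path, "read"))
        return corpus[posixpath.normpath(path)]


# Hypothetical corpus keyed by normalized path.
corpus = {
    "support/faq.md": "Reset a password from the account settings page.",
    "hr/salaries.md": "Salary bands, HR eyes only.",
}

support_agent = ScopedAgent("support-agent", granted_dirs={"support"})
faq = support_agent.read("support/faq.md", corpus)

try:
    support_agent.read("support/../hr/salaries.md", corpus)
    traversal_blocked = False
except PermissionError:
    traversal_blocked = True
```

The design choice is that the boundary is checked before anything is read, so the HR folder never enters the support agent's searchable world at all.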

How Pinchy does it

This is the part about our own product. Pinchy's knowledge-base agents are scoped by a per-agent directory picker: you grant an agent read-only access to specific directories, and that is the entire world it can read from. A new agent has no document access until you give it some, and what you give is a deliberate, visible choice rather than a default of everything. So Pinchy takes the scope-the-agent route above: the permission boundary stays at the agent level, the corpus is never flattened into one pool, and every read is recorded in the audit trail.

To be honest about the limit: this is directory-scoped access, not a per-end-user permission sync against a source system's ACLs, so the right unit of access in Pinchy is the agent and its granted folders, not a live mirror of who-can-see-what in SharePoint. For a single-purpose agent reading a bounded set of documents, which is what most knowledge-base agents are, that is the model that fits, and it is part of the same default-deny permission approach the rest of the platform takes. The product view is on the knowledge base agents page.

Frequently asked questions

What is a knowledge base AI agent?

A knowledge base AI agent answers questions and informs its actions by reading your documents, usually through retrieval-augmented generation (RAG): it finds the relevant passages from a store of your content and uses them to respond. It turns a pile of documents into something you can ask. The governance question that decides whether it is safe is which documents it can reach, and whether it respects who was allowed to see them.

Why does putting documents in a vector store lose their permissions?

Because the access controls live in the source system, not in the text. When a document from a tool like SharePoint or Confluence is chunked and embedded into a vector database, the original "who can see this" setting does not come along. The vector store holds the content without the permission, so the agent retrieving from it has no idea who was originally allowed to read that content, and by default does not check.

What is the confused deputy problem in RAG?

It is when a low-privilege user gets a higher-privilege agent to act on their behalf. In a RAG knowledge base, a user who should not see a sensitive document can still ask the agent a question whose answer is drawn from it, and the over-privileged agent retrieves and serves that content. The agent is the confused deputy: it has broad access and uses it on behalf of someone who does not. This makes an over-permissive knowledge base an internal exfiltration vector.

How do you stop a knowledge base agent from leaking documents?

Make sure the agent can only retrieve what it is allowed to retrieve. The robust approach for most teams is to scope each agent's read access at the source: grant it specific directories or document sets, read-only, default-deny, rather than pointing it at everything. That keeps the permission boundary at the agent level instead of flattening every document into one undifferentiated store, and it pairs with auditing what the agent actually read.

Is RAG a security risk?

It can be, and the risk is mostly about access rather than the technique itself. In the 2025 OWASP Top 10 for LLM applications, sensitive information disclosure rose to the second spot (LLM02:2025) and vector and embedding weaknesses were added as a new category (LLM08:2025). The common thread is over-permissive retrieval: an agent that can pull from documents it should not. Scope the retrieval and most of the RAG-specific risk goes with it.

Give an agent the documents, not the keys to all of them.

Pinchy knowledge-base agents read only the directories you grant, read-only and default-deny, with every read audited. Open source, self-hosted, free to run.

Email us: info@heypinchy.com