What hardware do you need to run a local LLM for AI agents?

Enough memory to hold the model and a fast path to read it. For a self-hosted AI agent the binding constraint is memory: a model needs to fit in RAM or VRAM, and the speed at which the hardware reads that memory sets how fast the agent responds. A machine with 128 GB of fast unified memory, such as a Ryzen AI Max+ 395 mini-workstation, can serve a capable mid-size model entirely on-device. Raw NPU TOPS matter far less than memory bandwidth for this workload.

Is memory bandwidth or TOPS more important for local LLM inference?

Memory bandwidth, by a wide margin, for agent workloads. Generating each token requires reading the model's active weights out of memory, so token speed is capped by bandwidth divided by bytes read per token. The headline TOPS figure describes peak compute that local LLM inference rarely saturates. Two machines with the same memory bandwidth run the same model at roughly the same speed even if their TOPS numbers differ.

Can you cluster mini-PCs to run larger local LLMs?

You can pool their memory to fit a larger model, but you do not get a proportional speedup. Linking two mini-workstations over USB4 or 10-gigabit Ethernet lets a model span their combined memory, yet that interconnect is hundreds of times slower than the links inside a real GPU cluster, and consumer network cards do not support RDMA. Clustering this class of box is a memory-pooling trick for fitting bigger models, not a way to make inference faster.

What model sizes run well on 128 GB of unified memory?

At four-bit quantisation, dense models up to roughly 70B fit in memory, and large mixture-of-experts models in the 100B-plus range fit because only a few billion parameters activate per token. Mixture-of-experts models are the sweet spot for this hardware: published tests show a 120B MoE generating in the low 50s of tokens per second and a 30B MoE in the 80s on a Ryzen AI Max+ 395, while a dense 70B at long context feels noticeably heavier because of slower prefill.

Does air-gapped hardware change the governance requirements for an AI agent?

No. Choosing offline hardware closes the path for data to leave the network, but a disconnected agent can still over-read, change records, or follow a malicious instruction already inside the enclave. A default-deny permission model and a tamper-evident audit trail apply the same offline as online. The hardware decides how fast the agent thinks; governance decides what it is allowed to do.

Air-Gapped LLM Hardware: Choosing a Box for Self-Hosted AI Agents

If you want a self-hosted or fully air-gapped AI agent, the first real question is what it runs on. A new class of mini-workstation makes a capable local model practical on one box, and the marketing around it is loud. This guide is the honest sizing conversation underneath the marketing.

We build Pinchy, a self-hosted AI agent platform that runs on local models, so we have a stake in this and a disclosure to make up front: we have not yet benchmarked these machines ourselves. The performance figures below come from published third-party tests, cited inline, and the speed ceilings from arithmetic you can check. We are building an air-gapped prototype on one of them, and we will replace the estimates with measured numbers when it is running.

An agent is not a chatbot

Most "can it run an LLM" reviews measure a chatbot: a short prompt, then tokens streaming out. An AI agent puts a different load on the hardware. Every turn, the agent feeds the model a large system prompt, retrieved context, tool definitions, and the running transcript. That input is front-loaded, and the costly part is prefill: reading all of it before the first token comes out. Prefill is bound by the throughput the GPU and its memory system can sustain, not by the peak TOPS figure on the box.

So the spec that matters most for an agent is memory bandwidth, with enough memory capacity to hold the model at all. The NPU's headline number describes compute that this workload rarely saturates. Bandwidth is what you feel on every message.
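To make the difference between a chat turn and an agent turn concrete, here is a minimal sketch of per-turn latency as prefill time plus generation time. The two throughput rates and the token counts are hypothetical values chosen for illustration, not measurements of any machine in this guide.

```python
def turn_latency_s(prompt_tokens, output_tokens, prefill_tok_s, decode_tok_s):
    """Return (total seconds, prefill seconds) for one model turn."""
    prefill_s = prompt_tokens / prefill_tok_s   # read the whole input first
    decode_s = output_tokens / decode_tok_s     # then generate the reply
    return prefill_s + decode_s, prefill_s


# Hypothetical rates, for illustration only.
PREFILL_TOK_S = 400.0
DECODE_TOK_S = 50.0

# A chat turn: short prompt. An agent turn: system prompt, tools, transcript.
chat_total, chat_prefill = turn_latency_s(200, 300, PREFILL_TOK_S, DECODE_TOK_S)
agent_total, agent_prefill = turn_latency_s(16_000, 300, PREFILL_TOK_S, DECODE_TOK_S)

for name, total, prefill in [("chat", chat_total, chat_prefill),
                             ("agent", agent_total, agent_prefill)]:
    share = 100.0 * prefill / total
    print(f"{name}: {total:.1f} s per turn, {share:.0f}% of it prefill")
```

With the same reply length, the agent turn in this sketch spends most of its time before the first token appears, which is why a chatbot benchmark says little about agent responsiveness.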

The class to know: Ryzen AI Max+ 395 (Strix Halo)

One platform dominates this conversation right now: AMD's Ryzen AI Max+ 395, code-named Strix Halo. It pairs a capable integrated GPU with up to 128 GB of unified memory at roughly 256 GB/s, and on Linux most of that pool, around 96 GB, can be handed to the GPU as VRAM. That is enough to hold models a discrete consumer GPU cannot, on a machine that draws about as much power as a bright light bulb.

Boxes in this class include the MINISFORUM MS-S1 Max, the Framework Desktop, the GMKtec EVO-X2, the Beelink GTR9, and the HP Z2 Mini G1a. Because they share the same APU, they share the property that decides inference speed: the same memory bandwidth, so the same model runs at roughly the same tokens per second on all of them. What differs is the part a spec-sheet LLM review tends to skip, and it is the part an enterprise buyer should weigh: networking, cooling under sustained load, whether the memory is error-correcting, whether there is any out-of-band management, and the warranty path when soldered memory fails. The MS-S1 Max, for instance, ships dual 10-gigabit Ethernet and USB4, which is generous for a mini-PC and still short of the networking a real inference cluster uses.

The honest way to choose is to compare the whole class against the alternatives, on the axes that matter rather than on TOPS:

Criterion	Unified-memory mini-workstation (Strix Halo class)	Discrete-GPU workstation (NVIDIA RTX-6000 class)	Multi-GPU server
Memory for the model	up to 128 GB unified	48 to 96 GB per card	hundreds of GB across cards
Memory bandwidth	~256 GB/s	~1 TB/s and up	very high (HBM + fast interconnect)
Prefill on long context	modest	fast	fastest
Error-correcting memory	usually none	yes	yes
Out-of-band management	usually none	sometimes	yes (BMC / IPMI)
Sustained power draw	~130 W	300 W and up	rack-class
Relative cost	low	high	very high
Best fit	one mid-size model, small team, air-gapped appliance	large dense models, fast prefill	many concurrent users, frontier scale

The mini-workstation wins on memory per euro and on power. It loses on bandwidth, which means slower prefill, and on the server-grade features an operations team expects. Neither loss is a flaw. It is a question of fit: this is the right tool for one model serving a small team behind a sealed boundary, and the wrong tool for a fast frontier-scale deployment.

Why "AI cluster" oversells

Some of these boxes are marketed "for AI clusters." Read that claim closely. You can link two of them, over USB4 or 10-gigabit Ethernet, and pool their memory so a larger model spans both. What you cannot do is make them fast together. That interconnect runs at roughly ten gigabits per second in practice, a little over one gigabyte per second, and the consumer network cards do not speak RDMA, so the data shared between nodes crawls compared to the links inside a real GPU cluster, which are hundreds of times faster. Clustering this class buys you a bigger memory pool for fitting a model that would not otherwise load. It does not buy the near-linear speedup the word implies. Useful, as long as you know which one you are buying.
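The gap is easy to check with unit conversion. The sketch below compares the 10-gigabit link with the box's own memory bandwidth and with a data-centre GPU link. The 900 GB/s figure is an assumption of ours (the aggregate NVLink bandwidth NVIDIA quotes per H100-class GPU), not a number from the tests cited in this guide.

```python
def gbit_to_gbyte_per_s(gbit_per_s):
    """Convert a link rate in gigabits per second to gigabytes per second."""
    return gbit_per_s / 8.0


ETHERNET_GBIT_S = 10.0       # link rate from the text
LOCAL_MEMORY_GB_S = 256.0    # unified-memory bandwidth from the text
GPU_LINK_GB_S = 900.0        # assumption: H100-class NVLink aggregate per GPU

ethernet_gb_s = gbit_to_gbyte_per_s(ETHERNET_GBIT_S)
memory_ratio = LOCAL_MEMORY_GB_S / ethernet_gb_s
gpu_link_ratio = GPU_LINK_GB_S / ethernet_gb_s

print(f"10 GbE moves about {ethernet_gb_s:.2f} GB/s")
print(f"local memory is about {memory_ratio:.0f}x faster than the link")
print(f"the assumed GPU link is about {gpu_link_ratio:.0f}x faster than the link")
```

Any data that has to cross between the two boxes travels at a small fraction of the rate each box reads its own memory, which is why pooling helps a model fit but does not make it faster.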

Estimating speed without owning the box

Because this workload is bandwidth-bound, you can estimate the ceiling with arithmetic instead of a benchmark. Generating a token reads the model's active weights out of memory once, so the speed ceiling is roughly memory bandwidth divided by bytes read per token. A mixture-of-experts model that activates only a few billion parameters at four-bit precision reads on the order of a couple of gigabytes per token; against 256 GB/s that is a ceiling near a hundred tokens per second, and real overhead pulls the actual figure well below it.
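The same arithmetic as a minimal sketch you can rerun with your own numbers. The function name and the example parameter counts are illustrative assumptions: 5 billion active parameters stands in for "a few billion", and the result is an upper bound that ignores the context cache, scheduling overhead, and everything else that slows a real run.

```python
def decode_ceiling_tok_s(bandwidth_gb_s, active_params_billion, bits_per_weight=4.0):
    """Upper bound on tokens per second for bandwidth-bound generation.

    Each token reads the active weights once, so the ceiling is
    bandwidth divided by the gigabytes read per token.
    """
    gb_read_per_token = active_params_billion * bits_per_weight / 8.0
    return bandwidth_gb_s / gb_read_per_token


BANDWIDTH_GB_S = 256.0  # memory bandwidth from the text

# Illustrative: a mixture-of-experts model with about 5B active parameters.
moe_ceiling = decode_ceiling_tok_s(BANDWIDTH_GB_S, active_params_billion=5.0)

# A dense 70B model reads all 70B parameters for every token.
dense_ceiling = decode_ceiling_tok_s(BANDWIDTH_GB_S, active_params_billion=70.0)

print(f"MoE, ~5B active: ceiling about {moe_ceiling:.0f} tokens/s")
print(f"dense 70B: ceiling about {dense_ceiling:.1f} tokens/s")
```

The contrast between the two results is the point: on the same memory bus, the number of parameters read per token decides the ceiling, not the total size of the model.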

Published tests on this hardware land where that arithmetic says they should, below the ceiling but in the same range: a 120-billion-parameter mixture-of-experts model generating in the low 50s of tokens per second, a 30B one in the 80s. Theory and measurement agreeing is the most reassuring thing arithmetic can do. The lesson holds regardless of the exact number: bandwidth sets the pace, and the model architecture you choose matters more than a few percent of clock speed.

What model fits, and which to pick

At four-bit quantisation a rough rule is half a gigabyte of memory per billion parameters of total model size, plus a few gigabytes for quantisation overhead and the context cache. That puts a dense 70B model around 40 GB and a 120B mixture-of-experts model around 65 GB, both comfortably inside 96 GB of usable VRAM. The choice between them is not about what fits. It is about how the model reads from memory:

Mixture-of-experts models are the sweet spot. They hold many parameters but activate only a few billion per token, so they read little from memory and stay fast on a bandwidth-limited machine. For agent work that wants a large knowledge base and quick responses, this is the architecture to favour.
Dense models in the 70B range fit but feel heavier, because every parameter is read on every token and long agent prompts make prefill the bottleneck. Workable, not snappy.
Tool calling and context length are non-negotiable for agents. An agent model has to call tools reliably and hold a long transcript. Pick an open-weight model that is strong at both, not just one that posts a high chat benchmark.
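The sizing rule above can be sketched as a quick fit check. The bare half-gigabyte rule gives 35 GB and 60 GB, a little under the rounded figures quoted earlier, and the headroom argument is a hypothetical allowance of ours for the context cache and runtime, not a measured value.

```python
USABLE_VRAM_GB = 96.0        # usable GPU memory quoted in the text
GB_PER_BILLION_4BIT = 0.5    # rule of thumb from the text


def weights_gb(total_params_billion, gb_per_billion=GB_PER_BILLION_4BIT):
    """Approximate weight size in GB at four-bit quantisation."""
    return total_params_billion * gb_per_billion


def fits(total_params_billion, headroom_gb=0.0, budget_gb=USABLE_VRAM_GB):
    """True if the weights plus the chosen headroom fit in the memory budget."""
    return weights_gb(total_params_billion) + headroom_gb <= budget_gb


# headroom_gb=20 is a hypothetical allowance for a long agent transcript.
for name, size_b in [("dense 70B", 70), ("120B MoE", 120), ("200B", 200)]:
    verdict = "fits" if fits(size_b, headroom_gb=20.0) else "does not fit"
    print(f"{name}: about {weights_gb(size_b):.0f} GB of weights, {verdict}")
```

Fitting uses total parameters while speed uses active parameters, which is why a 120B mixture-of-experts model can both fit and stay responsive on this class of machine.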

The governance checklist a spec sheet never shows

This is the half of the decision no LLM benchmark touches, and for an enterprise it is the half that decides. Before a box like this goes into production, especially an air-gapped one, work down a list the data sheet stays silent on:

Error-correcting memory. Most mini-workstations ship non-ECC RAM. A bit-flip in unprotected memory can silently corrupt a model's state. Ask whether that is acceptable for a system making decisions.
Out-of-band management. A server has a way to reach it when the operating system will not boot. A workstation usually does not. In a sealed room, "drive to site and plug in a monitor" may be your only recovery path.
Soldered memory. Unified memory is soldered, so a single failed chip means the whole unit is returned, not a module swapped. Plan for whole-unit RMA time in any fleet, and keep at least one spare so one dead box is not a dead deployment.
Networking and power redundancy. Consumer network cards and a single power supply are normal at this price. A real server expects redundancy. Decide whether you need it before you standardise on the box.
Supply-chain trust. An air-gapped deployment trusts whoever assembled the board and shipped the firmware. For a genuinely sovereign system, that provenance is part of the threat model, not an afterthought.
Physical security. Air-gapping moves the boundary from the network to the room. The box now needs the physical access controls the network used to provide.

None of this disqualifies the hardware. It reframes it: a unified-memory mini-workstation is an excellent appliance for a small air-gapped deployment, and it is not a drop-in server. Buying it with that understanding is the difference between a fit and a surprise.

An honest limit

We have not benchmarked these boxes ourselves yet. Every performance figure here is a published third-party result, cited above, or a ceiling from the bandwidth arithmetic. We are building an air-gapped Pinchy prototype on a machine in this class, and when it runs we will publish measured numbers, including the ones that contradict the estimates, and update this page. If you already run a self-hosted agent on hardware like this, send us your numbers, the model, the context length, and what fell over, and we will add them here with credit. The most useful spec sheet for a box like this will not come from the manufacturer.

Where Pinchy fits

The hardware decides how fast your agent thinks. Pinchy decides what it is allowed to do. Pinchy is a self-hosted AI agent platform that runs on local models via Ollama, with no telemetry and offline license validation, so a local-model deployment stays fully air-gapped: nothing crosses the boundary, on any of these boxes. Inside that boundary the governance layer, a default-deny permission allow-list and a tamper-evident audit trail, keeps working exactly as it does online, because a disconnected agent still needs to be told what it can touch and still needs every action on record. Pick the box for the model you want to run. Keep the governance regardless of the box.
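For readers who want to see what those two governance terms mean in practice, here is a generic sketch of a default-deny allow-list and a hash-chained audit log. It is not Pinchy's implementation: the class, the function, and the agent and tool names are all hypothetical, and a production system would also need to protect the log itself from being replaced wholesale.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


class AuditLog:
    """Append-only log where each entry's hash covers the previous entry's hash."""

    def __init__(self):
        self.entries = []

    def append(self, event):
        prev_hash = self.entries[-1]["hash"] if self.entries else GENESIS_HASH
        payload = json.dumps(event, sort_keys=True)
        digest = hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()
        self.entries.append({"event": event, "prev": prev_hash, "hash": digest})

    def verify(self):
        """Recompute the chain; any edited or reordered entry breaks it."""
        prev_hash = GENESIS_HASH
        for entry in self.entries:
            payload = json.dumps(entry["event"], sort_keys=True)
            expected = hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()
            if entry["prev"] != prev_hash or entry["hash"] != expected:
                return False
            prev_hash = entry["hash"]
        return True


# Default deny: only (agent, tool) pairs listed here are permitted.
ALLOWED = {("support-agent", "read_ticket")}


def authorize(agent, tool, log):
    """Check the allow-list and record the decision, whether allowed or denied."""
    allowed = (agent, tool) in ALLOWED
    log.append({"agent": agent, "tool": tool, "allowed": allowed})
    return allowed


log = AuditLog()
print(authorize("support-agent", "read_ticket", log))    # listed pair
print(authorize("support-agent", "delete_ticket", log))  # unlisted pair is refused
print(log.verify())
```

Nothing in the sketch touches the network, which is the point of the section: both checks behave the same on a sealed box as on a connected one.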
