Guide
Running an agent with a fully local model is now practical on a single small machine. The trap is sizing it like a chatbot. An agent leans on the part of the hardware most LLM reviews skip, and the spec that wins a benchmark is not the spec that decides an enterprise purchase. This guide covers what actually matters, what the popular mini-workstations can and cannot do, and the governance the data sheet never mentions.
If you want a self-hosted or fully air-gapped AI agent, the first real question is what it runs on. A new class of mini-workstation makes a capable local model practical on one box, and the marketing around it is loud. This guide is the honest sizing conversation underneath the marketing.
We build Pinchy, a self-hosted AI agent platform that runs on local models, so we have a stake in this and a disclosure to make up front: we have not yet benchmarked these machines ourselves. The performance figures below come from published third-party tests, cited inline, and the speed ceilings from arithmetic you can check. We are building an air-gapped prototype on one of them, and we will replace the estimates with measured numbers when it is running.
Most "can it run an LLM" reviews measure a chatbot: a short prompt, then tokens streaming out. An AI agent puts a different load on the hardware. Every turn, the agent feeds the model a large system prompt, retrieved context, tool definitions, and the running transcript. That input is front-loaded, and the costly part is prefill: reading all of it before the first token comes out. Prefill is bound by how fast the machine moves data through memory, not by the peak TOPS figure on the box.
So the spec that matters most for an agent is memory bandwidth, with enough memory capacity to hold the model at all. The NPU's headline number describes compute that this workload rarely saturates. Bandwidth is what you feel on every message.
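To make the front-loading concrete, here is a minimal sketch of how much input an agent hands the model on each turn. Every token count is a hypothetical placeholder rather than a measurement, the function name is ours, and the sketch assumes no prompt caching, so the whole input is read again on every turn.

```python
def input_tokens_per_turn(turns, system_prompt=2_000, tool_definitions=1_500,
                          retrieved_context=4_000, transcript_growth=800):
    """Tokens the model must read (prefill) on each turn.

    All sizes are illustrative assumptions. The transcript grows by a fixed
    amount per turn and nothing is cached between turns.
    """
    fixed = system_prompt + tool_definitions + retrieved_context
    return [fixed + turn * transcript_growth for turn in range(turns)]


per_turn = input_tokens_per_turn(turns=5)
total_prefill = sum(per_turn)

for turn, tokens in enumerate(per_turn, start=1):
    print(f"turn {turn}: {tokens} input tokens to read before the first output token")
print(f"total input read across the session: {total_prefill} tokens")
```

A chatbot benchmark with a one-line prompt exercises almost none of this; the agent pays for the fixed block again on every turn, plus a transcript that keeps growing.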
One platform dominates this conversation right now: AMD's Ryzen AI Max+ 395, code-named Strix Halo. It pairs a capable integrated GPU with up to 128 GB of unified memory at roughly 256 GB/s, and on Linux most of that pool, around 96 GB, can be handed to the GPU as VRAM. That is enough to hold models a discrete consumer GPU cannot, on a machine that draws around 130 W under sustained load, about as much as a bright incandescent light bulb.
Boxes in this class include the MINISFORUM MS-S1 Max, the Framework Desktop, the GMKtec EVO-X2, the Beelink GTR9, and the HP Z2 Mini G1a. Because they share the same APU, they share the property that decides inference speed: the same memory bandwidth, so the same model runs at roughly the same tokens per second on all of them. What differs is the part a spec-sheet LLM review tends to skip, and it is the part an enterprise buyer should weigh: networking, cooling under sustained load, whether the memory is error-correcting, whether there is any out-of-band management, and the warranty path when soldered memory fails. The MS-S1 Max, for instance, ships dual 10-gigabit Ethernet and USB4, which is generous for a mini-PC and still short of the networking a real inference cluster uses.
The honest way to choose is to compare the whole class against the alternatives, on the axes that matter rather than on TOPS:
| | Unified-memory mini-workstation (Strix Halo class) | Discrete-GPU workstation (NVIDIA RTX-6000 class) | Multi-GPU server |
|---|---|---|---|
| Memory for the model | up to 128 GB unified | 48 to 96 GB per card | hundreds of GB across cards |
| Memory bandwidth | ~256 GB/s | ~1 TB/s and up | very high (HBM + fast interconnect) |
| Prefill on long context | modest | fast | fastest |
| Error-correcting memory | usually none | yes | yes |
| Out-of-band management | usually none | sometimes | yes (BMC / IPMI) |
| Sustained power draw | ~130 W | 300 W and up | rack-class |
| Relative cost | low | high | very high |
| Best fit | one mid-size model, small team, air-gapped appliance | large dense models, fast prefill | many concurrent users, frontier scale |
The mini-workstation wins on memory per euro and on power. It loses on bandwidth, which means slower prefill, and on the server-grade features an operations team expects. Neither is a flaw. It is a fit: the right tool for one model serving a small team behind a sealed boundary, the wrong tool for a fast frontier-scale deployment.
Some of these boxes are marketed "for AI clusters." Read that claim closely. You can link two of them, over USB4 or 10-gigabit Ethernet, and pool their memory so a larger model spans both. What you cannot do is make them fast together. That interconnect runs at roughly ten gigabits per second in practice, and the consumer network cards do not speak RDMA, so data shared between nodes crawls compared with the links inside a real GPU cluster, which are hundreds of times faster. Clustering this class buys you a bigger memory pool for fitting a model that would not otherwise load. It does not buy the near-linear speedup the word "cluster" implies. Useful, as long as you know which one you are buying.
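A rough sketch of what that interconnect gap means per transfer. The 100 MB payload is arbitrary, and the 7,200 Gbit/s figure (900 GB/s) is our assumption for a datacenter-class GPU link, not a number from any test cited here; substitute the real figures for your hardware.

```python
def transfer_ms(megabytes: float, link_gbit_s: float) -> float:
    """Ideal time in milliseconds to move a payload over a link, ignoring latency and protocol overhead."""
    bits = megabytes * 8e6
    return bits / (link_gbit_s * 1e9) * 1e3


PAYLOAD_MB = 100.0          # arbitrary illustrative payload
MINI_PC_LINK_GBIT = 10.0    # 10-gigabit Ethernet, as in the text
GPU_LINK_GBIT = 7200.0      # assumed 900 GB/s datacenter GPU interconnect

slow = transfer_ms(PAYLOAD_MB, MINI_PC_LINK_GBIT)
fast = transfer_ms(PAYLOAD_MB, GPU_LINK_GBIT)

print(f"10 GbE:       {slow:.1f} ms")
print(f"GPU fabric:   {fast:.3f} ms")
print(f"ratio:        {slow / fast:.0f}x")
```

Under these assumptions the ratio comes out in the hundreds, which is why pooled memory across two boxes helps a model fit but does not make it run faster.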
Because this workload is bandwidth-bound, you can estimate the ceiling with arithmetic instead of a benchmark. Generating a token reads the model's active weights out of memory once, so the speed ceiling is roughly memory bandwidth divided by bytes read per token. A mixture-of-experts model that activates only a few billion parameters at four-bit precision reads on the order of a couple of gigabytes per token; against 256 GB/s that is a ceiling near a hundred tokens per second, and real overhead pulls the actual figure well below it.
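The same arithmetic as a short sketch. The function name and the 5-billion active-parameter figure are illustrative assumptions (the text only says "a few billion"), and the result is an upper bound, not a prediction of measured speed.

```python
def decode_ceiling_tokens_per_s(bandwidth_gb_s: float,
                                active_params_billion: float,
                                bits_per_weight: float = 4.0) -> float:
    """Upper bound on generation speed: memory bandwidth / bytes read per token."""
    gb_read_per_token = active_params_billion * bits_per_weight / 8.0
    return bandwidth_gb_s / gb_read_per_token


BANDWIDTH_GB_S = 256.0  # Strix Halo class figure from the text

# Assumed: a mixture-of-experts model activating about 5B parameters per token.
moe_ceiling = decode_ceiling_tokens_per_s(BANDWIDTH_GB_S, active_params_billion=5)

# A dense 70B model reads all 70B parameters for every token.
dense_ceiling = decode_ceiling_tokens_per_s(BANDWIDTH_GB_S, active_params_billion=70)

print(f"MoE, ~5B active: ceiling about {moe_ceiling:.0f} tokens/s")
print(f"dense 70B:       ceiling about {dense_ceiling:.1f} tokens/s")
```

The gap between the two ceilings comes entirely from how many bytes each architecture reads per token, not from anything about the box.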
Published tests on this hardware land where that arithmetic predicts, below the ceiling but within sight of it: a 120-billion-parameter mixture-of-experts model generating in the low 50s of tokens per second, a 30B mixture-of-experts model in the 80s. Theory and measurement agreeing is the most reassuring thing arithmetic can do. The lesson holds regardless of the exact number: bandwidth sets the pace, and the model architecture you choose matters more than a few percent of clock speed.
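At four-bit quantisation a rough rule is half a gigabyte of memory per billion parameters of total model size, plus a few gigabytes of overhead. That puts a dense 70B model around 40 GB and a 120B mixture-of-experts model around 65 GB, both comfortably inside 96 GB of usable VRAM. The choice between them is not about what fits. It is about how the model reads from memory: a dense model reads all of its weights for every token, while a mixture-of-experts model reads only the few billion parameters it activates.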
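The sizing rule can be sketched the same way. The 5 GB overhead allowance is a hypothetical value chosen only so the totals line up with the rounded figures in the text; real overhead depends on the quantisation format, the context length, and the runtime.

```python
GB_PER_BILLION_PARAMS_4BIT = 0.5  # rule of thumb from the text
USABLE_VRAM_GB = 96.0             # figure from the text


def weights_gb(total_params_billion: float) -> float:
    """Approximate size of the four-bit weights alone."""
    return total_params_billion * GB_PER_BILLION_PARAMS_4BIT


def fits(total_params_billion: float,
         usable_vram_gb: float = USABLE_VRAM_GB,
         overhead_gb: float = 5.0) -> bool:
    """True if weights plus an assumed overhead allowance fit in usable VRAM."""
    return weights_gb(total_params_billion) + overhead_gb <= usable_vram_gb


for name, params in [("dense 70B", 70), ("MoE 120B", 120), ("hypothetical 200B", 200)]:
    print(f"{name}: ~{weights_gb(params):.0f} GB weights, fits in 96 GB: {fits(params)}")
```

Note that fit depends on total parameters while speed depends on active parameters, which is why a large mixture-of-experts model can both fit and run quickly.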
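This is the half of the decision no LLM benchmark touches, and for an enterprise it is the half that decides. Before a box like this goes into production, especially an air-gapped one, work down a list the data sheet stays silent on: networking, cooling under sustained load, whether the memory is error-correcting, whether there is any out-of-band management, and the warranty path when soldered memory fails.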
None of this disqualifies the hardware. It reframes it: a unified-memory mini-workstation is an excellent appliance for a small air-gapped deployment, and it is not a drop-in server. Buying it with that understanding is the difference between a fit and a surprise.
We have not benchmarked these boxes ourselves yet. Every performance figure here is a published third-party result, cited above, or a ceiling from the bandwidth arithmetic. We are building an air-gapped Pinchy prototype on a machine in this class, and when it runs we will publish measured numbers, including the ones that contradict the estimates, and update this page. If you already run a self-hosted agent on hardware like this, send us your numbers, the model, the context length, and what fell over, and we will add them here with credit. The most useful spec sheet for a box like this will not come from the manufacturer.
The hardware decides how fast your agent thinks. Pinchy decides what it is allowed to do. Pinchy is a self-hosted AI agent platform that runs on local models via Ollama, with no telemetry and offline license validation, so a local-model deployment stays fully air-gapped: nothing crosses the boundary, on any of these boxes. Inside that boundary the governance layer keeps working exactly as it does online, a default-deny permission allow-list and a tamper-evident audit trail, because a disconnected agent still needs to be told what it can touch and still needs every action on record. Pick the box for the model you want to run. Keep the governance regardless of the box.
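For readers unfamiliar with the two controls named above, here is a generic sketch of a default-deny allow-list and a hash-chained, tamper-evident audit log. It is an illustration of the general technique only, not a description of how Pinchy implements either; `AuditLog`, `authorize`, `ALLOWED`, and the agent and action names are all hypothetical.

```python
import hashlib
import json


class AuditLog:
    """Append-only log where each entry's hash covers the previous entry's hash."""

    GENESIS = "0" * 64

    def __init__(self):
        self.entries = []
        self._last_hash = self.GENESIS

    def append(self, record: dict) -> None:
        payload = json.dumps(record, sort_keys=True)
        digest = hashlib.sha256((self._last_hash + payload).encode()).hexdigest()
        self.entries.append({"record": record, "prev": self._last_hash, "hash": digest})
        self._last_hash = digest

    def verify(self) -> bool:
        """Recompute the chain; any edited or removed entry breaks it."""
        prev = self.GENESIS
        for entry in self.entries:
            payload = json.dumps(entry["record"], sort_keys=True)
            expected = hashlib.sha256((prev + payload).encode()).hexdigest()
            if entry["prev"] != prev or entry["hash"] != expected:
                return False
            prev = entry["hash"]
        return True


# Default deny: anything not listed here is refused.
ALLOWED = {("support-agent", "read_ticket")}


def authorize(agent: str, action: str, log: AuditLog) -> bool:
    allowed = (agent, action) in ALLOWED
    log.append({"agent": agent, "action": action, "allowed": allowed})  # record every attempt
    return allowed


log = AuditLog()
print(authorize("support-agent", "read_ticket", log))    # listed, so permitted
print(authorize("support-agent", "delete_ticket", log))  # not listed, so denied
print(log.verify())
```

Both checks run entirely on the local machine, which is the point: neither depends on a network connection to keep working.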
FAQ
Enough memory to hold the model and a fast path to read it. For a self-hosted AI agent the binding constraint is memory: a model needs to fit in RAM or VRAM, and the speed at which the hardware reads that memory sets how fast the agent responds. A machine with 128 GB of fast unified memory, such as a Ryzen AI Max+ 395 mini-workstation, can serve a capable mid-size model entirely on-device. Raw NPU TOPS matter far less than memory bandwidth for this workload.
Memory bandwidth, by a wide margin, for agent workloads. Generating each token requires reading the model's active weights out of memory, so token speed is capped by bandwidth divided by bytes read per token. The headline TOPS figure describes peak compute that local LLM inference rarely saturates. Two machines with the same memory bandwidth run the same model at roughly the same speed even if their TOPS numbers differ.
You can pool their memory to fit a larger model, but you do not get a proportional speedup. Linking two mini-workstations over USB4 or 10-gigabit Ethernet lets a model span their combined memory, yet that interconnect is hundreds of times slower than the links inside a real GPU cluster, and consumer network cards do not support RDMA. Clustering this class of box is a memory-pooling trick for fitting bigger models, not a way to make inference faster.
At four-bit quantisation, dense models up to roughly 70B fit in memory, and large mixture-of-experts models in the 100B-plus range fit because only a few billion parameters activate per token. Mixture-of-experts models are the sweet spot for this hardware: published tests show a 120B MoE generating in the low 50s of tokens per second and a 30B MoE in the 80s on a Ryzen AI Max+ 395, while a dense 70B at long context feels noticeably heavier because of slower prefill.
No. Choosing offline hardware closes the path for data to leave the network, but a disconnected agent can still over-read, change records, or follow a malicious instruction already inside the enclave. A default-deny permission model and a tamper-evident audit trail apply the same offline as online. The hardware decides how fast the agent thinks; governance decides what it is allowed to do.
Pinchy runs on local models with no telemetry and offline license validation, so a local-model deployment stays fully air-gapped, on whatever hardware you choose. Open source, self-hosted, free to run.
Email us: info@heypinchy.com