Day 72: Agent Create Without the Cascade
Thursday. Today's main thread is one of those bugs that's harder to characterise than it is to fix, because the symptom is just "creating an agent feels slow" and the cause is several layers of mechanism quietly amplifying each other. By evening the cascade is gone and the cluster of related fixes around it has started to look like the right shape for v0.5.0's reliability story.
What the Cascade Was Doing
Up until today, creating a new agent in the UI did this: Pinchy wrote the new agent into openclaw.json, the OpenClaw process saw the file change via inotify, decided the safest response was a full config reload, paused every active plugin, restarted the gateway, and rebuilt everything from the new state. From the user's perspective this looked like lag: a few seconds of "creating…" followed by every chat session in the workspace silently reconnecting. From OpenClaw's perspective it was correct: the file changed, the safe move is to reload. Both perspectives were right and the result was wrong.
The fix is #193. Instead of relying on inotify to discover the change, Pinchy now pushes the new config directly via a WebSocket RPC call — config.apply with the new shape inline. OpenClaw applies it in place, no full reload, no plugin restart cascade. The inotify watcher stays as a fallback for the cases where the RPC isn't reachable (e.g. mid-boot), but the RPC is now the canonical path, and it's the one that doesn't take everyone's connections with it.
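A minimal sketch of that "RPC first, watcher as fallback" decision, under stated assumptions: only the method name config.apply comes from the work described above. The request envelope, the RpcTransport interface, and the function names are hypothetical stand-ins, not Pinchy's or OpenClaw's actual wire format.

```typescript
// Hypothetical envelope: only the `config.apply` method name is from the post.
interface RpcRequest {
  id: number;
  method: string;
  params: unknown;
}

// Minimal transport abstraction so the sketch does not depend on a real socket.
interface RpcTransport {
  isOpen(): boolean;
  send(payload: string): void;
}

type PushResult = "pushed-via-rpc" | "left-to-file-watcher";

let nextRequestId = 1;

// Build the request that carries the whole new config inline.
function buildConfigApply(config: Record<string, unknown>): RpcRequest {
  return { id: nextRequestId++, method: "config.apply", params: { config } };
}

// Prefer the RPC. If the gateway is unreachable (e.g. mid-boot), send nothing
// and let the inotify watcher pick up the file that was already written.
function pushConfig(
  transport: RpcTransport,
  config: Record<string, unknown>,
): PushResult {
  if (!transport.isOpen()) return "left-to-file-watcher";
  transport.send(JSON.stringify(buildConfigApply(config)));
  return "pushed-via-rpc";
}
```

Keeping the transport behind a two-method interface is what makes the propagation unit-testable: a fake transport can record what was sent without a gateway running.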
The wiring underneath that is more interesting than it sounds. The push is fire-and-forget, because blocking the UI's save on "the gateway's config has been applied" means a slow gateway makes the UI unresponsive, and the user doesn't care whether the gateway has finished applying, only that the agent has been created. The retry path, for when the first push doesn't reach a gateway that's still cold-starting, has a tightened budget: about 3.5 seconds with a stable wait, instead of the long default that was holding requests open. Unit tests cover the propagation; the fragile E2E that had been flaking on this exact path got dropped, because asserting a propagation timing in real Docker is the wrong place to catch a regression that has a unit-testable invariant.
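Roughly what a fire-and-forget push with a bounded retry budget can look like. This is a sketch under assumptions: it reads "stable wait" as a fixed delay between attempts, the 500 ms wait is a guess (only the roughly 3.5 second budget is from the actual change), and all function names are hypothetical.

```typescript
// Assumed values: "about 3.5 seconds" is from the post, the wait is a guess.
const RETRY_BUDGET_MS = 3500;
const RETRY_WAIT_MS = 500;

// Offsets (ms after the first attempt) at which attempts may start: one
// immediately, then one every `waitMs` while still strictly inside the budget.
function retryOffsets(budgetMs: number, waitMs: number): number[] {
  if (waitMs <= 0) throw new RangeError("waitMs must be positive");
  const offsets: number[] = [];
  for (let t = 0; t < budgetMs; t += waitMs) offsets.push(t);
  return offsets;
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Try `push` until it succeeds or the schedule is exhausted. The time spent
// inside `push` is not counted, so this only approximates a wall-clock budget.
async function pushWithinBudget(
  push: () => Promise<void>,
  budgetMs: number = RETRY_BUDGET_MS,
  waitMs: number = RETRY_WAIT_MS,
): Promise<boolean> {
  const attempts = retryOffsets(budgetMs, waitMs).length;
  for (let i = 0; i < attempts; i++) {
    try {
      await push();
      return true;
    } catch {
      if (i < attempts - 1) await sleep(waitMs);
    }
  }
  return false;
}

// Fire-and-forget: the save path never awaits the outcome, it only logs a
// give-up. pushWithinBudget never rejects, so nothing is left unhandled.
function pushInBackground(push: () => Promise<void>): void {
  void pushWithinBudget(push).then((ok) => {
    if (!ok) console.warn("config push gave up; file watcher remains the fallback");
  });
}
```

The schedule is a pure function on purpose: the invariant "no more than N attempts inside the budget" can be asserted in a unit test without any timing at all.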
One bonus simplification: the push collapsed from a double cascade (the writer was inadvertently triggering two reloads, one for the file change and one for an explicit reload-after-write) into a single restart push, then into the no-restart RPC. The double cascade had been masking the single cascade for weeks because the symptom was the same.
Defence in Depth on the Secrets Ownership Bug
Yesterday's chown+chmod dance fixed the boot path. Today's commits cover the reload path. Specifically: when OpenClaw rewrites secrets.json on its own (because a config update touched a secret), the freshly written file inherits OpenClaw's user, not root, and the next read by the gateway gets permission denied all over again. The boot fix doesn't help because nobody booted.
The new defence is an inotify watcher: a small loop that watches secrets.json for changes and re-applies the canonical chown root + chmod 0600 on every modification, on a fast tick (milliseconds, not seconds) to win the race against OpenClaw's next read. It's the inotify-watcher version of the chown loop, sitting alongside the boot-time one, with a tighter chmod retry loop and a hard throw on EACCES races so a bug in the choreography is loud rather than silent. Targeted secrets writes (the path Pinchy uses when only one secret needs updating) got the same EACCES guard, because the same race exists there in miniature.
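For illustration, the watcher idea can be sketched with Node's fs.watch (which is backed by inotify on Linux). This is not the actual implementation: the function names are hypothetical, the sketch is event-driven rather than a retry loop, and the owner is passed in as uid/gid rather than hard-coded to root.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// True when owner or mode has drifted from the canonical state. Checking first
// matters: chmod itself raises a change event, so an unconditional fix would
// keep re-triggering the watcher.
function needsRestore(file: string, uid: number, gid: number): boolean {
  const st = fs.statSync(file);
  return st.uid !== uid || st.gid !== gid || (st.mode & 0o777) !== 0o600;
}

// Re-apply the canonical owner and mode. EACCES is rethrown with context so a
// broken choreography fails loudly; other errors propagate unchanged.
function restoreSecretsPermissions(file: string, uid: number, gid: number): void {
  try {
    fs.chownSync(file, uid, gid);
    fs.chmodSync(file, 0o600);
  } catch (err) {
    if ((err as NodeJS.ErrnoException).code === "EACCES") {
      throw new Error(`EACCES while restoring permissions on ${file}: ${String(err)}`);
    }
    throw err;
  }
}

// Watch the parent directory rather than the file itself, so a write that
// replaces the file via rename is still seen. The caller owns the returned
// watcher and should close it on shutdown.
function watchSecrets(file: string, uid: number, gid: number): fs.FSWatcher {
  const target = path.basename(file);
  return fs.watch(path.dirname(file), (_eventType, changed) => {
    if (changed !== target || !fs.existsSync(file)) return;
    if (needsRestore(file, uid, gid)) restoreSecretsPermissions(file, uid, gid);
  });
}
```

With root as the canonical owner this would be started as watchSecrets(secretsPath, 0, 0), where secretsPath is a placeholder for wherever secrets.json lives. An error thrown inside the listener is uncaught and takes the process down, which is one blunt way of making the failure loud.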
None of this is the elegant fix. The elegant fix is to run OpenClaw and Pinchy under the same user, which we'll get to. Today's fix is to make the existing two-user arrangement actually correct, by having the file's permissions restored faster than OpenClaw can notice they were wrong. It's not pretty but it works on the first reload, the second, and the hundredth, which is what the integration tests now assert.
Telegram's Silent EACCES Swallows
While we were in the EACCES neighbourhood, the Telegram channel code got a sweep. The bot-management code path had a few places where a permission error was being caught and ignored, leaving the bot in a half-configured state with no log line saying why. Two were correct (genuinely transient retries), three were not. The three unconditional swallows got replaced with proper error propagation; the dead code surrounding them got removed; and a small fix landed to allow Pinchy to read OpenClaw's pairing file (the file OpenClaw writes when a Telegram bot completes its initial pairing handshake) — which had been failing silently because of the same ownership pattern that bit the secrets file.
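The distinction between the two kinds of catch can be sketched like this. It is a hypothetical helper, not the Telegram channel code: the set of error codes treated as transient is an assumption, and the names are made up for illustration.

```typescript
// Assumed set of codes worth retrying quietly; the post does not list them.
const TRANSIENT_CODES = new Set(["EAGAIN", "EBUSY", "ETIMEDOUT"]);

// Pull a string `code` off an unknown thrown value, if it has one.
function errorCode(err: unknown): string | undefined {
  if (typeof err === "object" && err !== null && "code" in err) {
    const code = (err as { code?: unknown }).code;
    return typeof code === "string" ? code : undefined;
  }
  return undefined;
}

// Run `step`. Swallow only transient failures (logging them and returning
// undefined so the caller can retry); rethrow everything else, EACCES included.
function runSwallowingTransient<T>(
  step: () => T,
  log: (msg: string) => void,
): T | undefined {
  try {
    return step();
  } catch (err) {
    const code = errorCode(err);
    if (code !== undefined && TRANSIENT_CODES.has(code)) {
      log(`transient ${code}, will retry`);
      return undefined;
    }
    throw err;
  }
}
```

The point of the allow-list is that a permission error can never fall into the quiet branch by accident: anything not explicitly named as transient propagates.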
The model resolver also picked up a type constraint on ollama-cloud IDs: anything not in the curated list now fails type-checking at the call site rather than at request time. Stale model IDs in test fixtures had been the long tail of the model-resolver work; today's commit closes the door on that drift recurring.
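One common way to get that call-site failure in TypeScript is a literal union derived from a const array. The sketch below assumes that approach; the model IDs are obvious placeholders (the real curated list is not reproduced here), and the resolver function and its return format are hypothetical.

```typescript
// Placeholder IDs only; these are not real model names.
const OLLAMA_CLOUD_MODELS = ["example-model-a", "example-model-b"] as const;

// Union of the literal strings above: "example-model-a" | "example-model-b".
type OllamaCloudModelId = (typeof OLLAMA_CLOUD_MODELS)[number];

// Runtime guard for values that arrive from outside the type system
// (config files, fixtures loaded from JSON).
function isOllamaCloudModelId(value: string): value is OllamaCloudModelId {
  return OLLAMA_CLOUD_MODELS.some((id) => id === value);
}

// Hypothetical resolver: only curated IDs are accepted at the call site.
function resolveOllamaCloudModel(id: OllamaCloudModelId): string {
  return `ollama-cloud/${id}`;
}

resolveOllamaCloudModel("example-model-a"); // compiles
// resolveOllamaCloudModel("stale-model");  // would be rejected by the compiler
```

A stale ID in a test fixture then shows up as a type error when the fixture is compiled, instead of as a failed request when the test runs.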
CI Catches Up to Production
The Telegram and Odoo e2e tests started running against the production Docker image instead of the dev image. This had been the intent for weeks; the rollout was held up by small differences between the two — the production image starts up slightly differently, the dev image has a few convenience tools that the test scripts had quietly come to depend on. The rollout went in for Telegram, partly in for Odoo. The Odoo run hit a Better Auth rate-limit flake on the prod image and got reverted to the dev image at the end of the day, with a follow-up to bypass the rate limit in test mode rather than wait it out. Postgres in the e2e harness is now exposed on host port 5434 so the test helpers can inspect the database between assertions. Chromium is installed in the Telegram-e2e image so the new agent-hot-reload spec — the test that proves agent creation no longer triggers a cascade — can actually run.
Day 72
Two themes today. The cascade fix is the one that affects how the product feels when you use it: agent creation goes from a multi-second lag with a side of disconnected sessions to a near-instant operation that doesn't disturb anything else. The secrets-ownership defences are the ones that affect whether the product runs at all in stricter container environments, with permissions slightly different from the developer's box. They're the same kind of work — looking carefully at the seam between Pinchy and OpenClaw and asking what happens when one of the two is slightly slower or slightly differently configured than expected — and v0.5.0 is starting to look like the release where those seams stop being where the bugs live.