Day 101: Three Bugs, One Symptom
There is no worse place for a bug than the first message after a fresh install. Everything else a user can forgive — they're already invested. But the very first chat failing is the moment a new operator decides whether this thing works at all. And that's precisely where one was living: on a clean v0.5.6 setup, the first thing you said to Smithers came back as No API key found for provider '<name>'. You'd just typed the key into the wizard. It was right there. And the agent said it wasn't.
One symptom, three root causes (PR #445, 15:35)
I went in expecting a single race on secrets.json and came out having found three separate production bugs that all produce the identical "No API key found" message from completely different causes. That's the trap of a generic error string: it collapses several failures into one symptom, so you fix the first cause, watch it still happen, and assume your fix didn't work — when actually you've hit cause number two.
The first was the boot race I'd expected. OpenClaw initializes its file-mode secrets provider once at gateway boot. On a fresh wizard install, secrets.json doesn't exist yet, so the provider boots in a "file missing" state — and OpenClaw's reload only diffs the config block, which hasn't changed, so it never reinitializes when Pinchy later writes the file. The fix is an inotify handler in the startup script that, on the first appearance of secrets.json, drops a marker and sends a one-shot restart so the gateway respawns with the file present. The second was an agent hot-reload race: regenerating the OpenClaw config is fire-and-forget, OpenClaw applies the new agent list asynchronously, and the wizard navigating straight to the chat raced that reload — so the chat opened against an agent the runtime didn't know about yet. The fix waits for the agent to actually appear in the runtime after regenerate, with a generous cap because the first fix's restart can take the gateway down for a chunk of that window. The third had the same face and a third cause again.
The real deliverable isn't the three fixes — it's the protocol-level smoke-test suite that found two of them. It exercises a fresh-install first-message across five providers, so this entire class of first-run failure now trips a red check before a release instead of greeting a new operator on day one. The single best thing about the painful debugging session is that it ended with a test that means I never have to have it again.
The overlay that wouldn't let go (PR #446, 13:24)
The morning's fix was a different first-run sharp edge. Entering a Telegram bot token — or any channels-block config change — could strand a user in the "Restarting…" overlay forever. The production trace from the day before told the whole story: token entered at 07:59, but OpenClaw deferred its gateway restart by about three and a half minutes because two background agent runs were still active. The client's 30-second timeout fired, stopped polling, and sat there stuck — even though OpenClaw came back perfectly fine at 08:03. Three TDD fixes in one commit: after 30 seconds the client slows its polling from every 2 seconds to every 5 instead of giving up, so any healthy response un-sticks the overlay without a browser reload; a re-trigger resets a stale timed-out state; and a new notifyDisconnect re-fires the overlay for the deferred case where OpenClaw finishes its pre-restart handshake, the overlay clears, and then OpenClaw actually goes down a couple of minutes later for the real restart — the window where a user would otherwise stare at a working-looking UI during an actual outage.
The streaming-resume work from Day 99 also got its documentation today (PR #444, 16:53) — the run watchdog, the resume-across-reconnect behavior, and the new audit events written up so the behavior is discoverable, plus a small housekeeping PR to gitignore the mock-server lockfiles.
Day 101
Two of today's three first-run bugs were invisible to me for weeks because I almost never do a truly fresh install — I'm always upgrading an environment that already has its secrets.json and its warm runtime. The bugs lived exactly in the gap between how I use the product and how a new operator first meets it. That gap is the most dangerous blind spot a builder has, and the only reliable cure I've found is the one that landed today: a test that does the thing I never do, on every provider, before every release. Debugging is how you fix the bug you have. A smoke test is how you stop shipping the one you can't see.