
Day 71: The 0600 Dance

Wednesday. The conversations from earlier in the week kept coming back to the same point: the next people running Pinchy are unlikely to be on the team that built it, and the first thing that has to be true is that the product behaves on infrastructure the team doesn't operate. Today's commits are the bug-bash version of that — a dozen-odd small fixes whose only common feature is that each of them would have surfaced as an outage on a stranger's hardware.

secrets.json Gets the Mode Right and the Owner Wrong

The headline bug. Day 65's SecretRef migration moved every secret out of openclaw.json and into a tmpfs-backed secrets.json with mode 0600. That worked on the dev box because the dev box runs the gateway and Pinchy under the same user. It did not work the moment integration tests ran the same setup under stricter container boundaries: Pinchy wrote secrets.json as one user, OpenClaw started up as another, and OpenClaw's read of the file came back as permission denied. The gateway booted without a token, and every request that needed it failed. The signal at the front of the logs was misleading: it looked like a missing-secret bug, not an ownership one.

An integration test landed first that forces SecretRef resolution against a freshly mounted volume, exposing the failure on every CI run instead of waiting for someone to notice. The fix is a small choreography: chown secrets.json to root before each gateway boot or reload, then chmod 0600 on top — because the file inherited 0644 from somewhere and OpenClaw rejects anything looser than 0600. The two operations have to happen in that order, and they have to happen on every boot, not just the first one. Three commits to get the dance right; one diagnostic commit to capture OpenClaw container logs before teardown so the next round of weirdness has a paper trail.
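The ownership-then-mode sequence can be sketched with Node's fs module. This is a minimal sketch under stated assumptions, not Pinchy's actual code: prepareSecretsFile, its parameters, and the default uid/gid of 0 (root) are hypothetical names chosen for illustration.

```typescript
import { chmodSync, chownSync, mkdtempSync, statSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical helper: name and signature are illustrative, not Pinchy's API.
// Meant to run before every gateway boot or reload, not only the first one.
export function prepareSecretsFile(path: string, uid = 0, gid = 0): number {
  // Step 1: hand the file to the user the gateway reads it as (root by default).
  chownSync(path, uid, gid);
  // Step 2: tighten the mode afterwards so an inherited 0644 cannot survive.
  chmodSync(path, 0o600);
  // Return the resulting permission bits so the caller can log or check them.
  return statSync(path).mode & 0o777;
}

// Usage against a throwaway file. The current uid/gid stand in for root here,
// because chown to a different user needs elevated privileges.
const demoDir = mkdtempSync(join(tmpdir(), "secrets-demo-"));
const demoFile = join(demoDir, "secrets.json");
writeFileSync(demoFile, "{}", { mode: 0o644 });
const demoMode = prepareSecretsFile(
  demoFile,
  process.getuid?.() ?? 0,
  process.getgid?.() ?? 0,
);
console.log(demoMode.toString(8));
```

Returning the final mode bits is a design choice for the sketch: it gives the caller something cheap to assert on every boot, which is where this bug hid.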

A safety belt at the read end too: readGatewayToken falls back to reading from openclaw.json when secrets.json is unreadable, with validateGatewayToken doing the same. This isn't ideal — if the secrets file is broken the right answer is fail-closed, not silent fallback — but the reality of an upgrade in flight is that the old shape and the new shape coexist on disk for a few minutes, and a hard failure during that window is worse than a soft one. The plaintext-scanner that polices new writes already prevents the fallback from re-introducing the leak that the migration was meant to close.
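The read-side fallback might look roughly like the sketch below. The post does not show the real key paths or signatures, so the { "gatewayToken": "..." } file shape, the two path parameters, and the source field in the result are all assumptions made for illustration.

```typescript
import { mkdtempSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

type TokenResult = { token: string; source: "secrets" | "legacy" };

// Hypothetical file shape: { "gatewayToken": "..." } in both files.
function tokenFrom(path: string): string | undefined {
  try {
    const parsed = JSON.parse(readFileSync(path, "utf8")) as { gatewayToken?: unknown };
    const token = parsed.gatewayToken;
    return typeof token === "string" && token.length > 0 ? token : undefined;
  } catch {
    // A missing file, a permission error, or invalid JSON all count as unreadable.
    return undefined;
  }
}

// Illustrative signature, not Pinchy's: prefer secrets.json, fall back to the
// legacy openclaw.json only when the new file yields nothing.
export function readGatewayToken(secretsPath: string, legacyPath: string): TokenResult | undefined {
  const fromSecrets = tokenFrom(secretsPath);
  if (fromSecrets !== undefined) return { token: fromSecrets, source: "secrets" };
  const fromLegacy = tokenFrom(legacyPath);
  return fromLegacy !== undefined ? { token: fromLegacy, source: "legacy" } : undefined;
}

// Usage: only the legacy file exists, as it would mid-upgrade.
const upgradeDir = mkdtempSync(join(tmpdir(), "token-demo-"));
const legacyFile = join(upgradeDir, "openclaw.json");
const secretsFile = join(upgradeDir, "secrets.json");
writeFileSync(legacyFile, JSON.stringify({ gatewayToken: "legacy-token" }));
const resolved = readGatewayToken(secretsFile, legacyFile);
console.log(resolved);
```

Reporting which file the token came from keeps the soft failure visible: a caller can log every "legacy" hit and know when the fallback is safe to delete.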

While we were in the secrets code, the plaintext scanner picked up coverage for ollama-cloud.apiKey — it had been missing from the field-name heuristics — and the dead updateSecretsFile path was deleted, because secrets writes go through one canonical helper now and a second one sitting unused is a guarantee that someone reads it on the wrong day.
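A field-name heuristic of the kind described can be sketched as a recursive walk over the parsed config. Everything here is an assumption for illustration: the key pattern, the findPlaintextSecrets name, and the { ref: ... } object standing in for a SecretRef are hypothetical, and the real scanner's rules are not shown in this post.

```typescript
// Hypothetical heuristic: key names that usually hold a secret.
const SECRET_KEY_PATTERN = /(api[-_]?key|token|secret|password)$/i;

type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Walk a parsed config and return dotted paths where a secret-looking key
// holds a plain non-empty string rather than a reference object.
export function findPlaintextSecrets(node: Json, path: string[] = []): string[] {
  if (node === null || typeof node !== "object") return [];
  const hits: string[] = [];
  if (Array.isArray(node)) {
    node.forEach((child, index) => {
      hits.push(...findPlaintextSecrets(child, [...path, String(index)]));
    });
    return hits;
  }
  for (const [key, value] of Object.entries(node)) {
    const here = [...path, key];
    if (typeof value === "string" && value.length > 0 && SECRET_KEY_PATTERN.test(key)) {
      hits.push(here.join("."));
    } else {
      hits.push(...findPlaintextSecrets(value, here));
    }
  }
  return hits;
}

// Usage: one plaintext key, one key already replaced by a reference object.
const sampleConfig: Json = {
  providers: {
    "ollama-cloud": { apiKey: "sk-plaintext" },
    openai: { apiKey: { ref: "secrets.json#openai" } },
  },
};
const flagged = findPlaintextSecrets(sampleConfig);
console.log(flagged); // [ 'providers.ollama-cloud.apiKey' ]
```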

The Caddyfile Was Eating Cold-Start Requests

The other UX-shaped bug. Pinchy's staging environment runs the same Caddyfile shape as production with a few extra overrides. One of those overrides was an lb_try_duration that was, in retrospect, far too aggressive. The way it expressed itself: a request hitting Pinchy during cold-start, when the container is up but OpenClaw is still finishing its boot scan, would time out at the proxy before the upstream finished responding. The user got a generic gateway error instead of a "starting up, please wait" page. Production did the right thing because it never had the override; staging did not, because staging was where the override had first been tried.

Two fixes. First, drop the override on staging: the default lb_try_duration is what production uses, and it's correct. Second, restore the prod-parity Caddyfile and the installing.html loading page, so a slow boot looks like a deliberate "this is starting up" screen rather than a confused error. Alongside those, the staging deploy doc moved out of the public docs/ tree into CONTRIBUTING.md with a fresh cloud-init-next snippet and a Hetzner staging-instance section, because the public docs should describe the supported install paths, not the internal team's staging recipe.

And the chat composer learned to stay enabled during reconnects. Previously, a brief WebSocket drop greyed out the input until the connection came back. That was correct in the sense that messages can't be delivered without a connection, and incorrect in the sense that nobody notices a connection drop and patiently waits for the input to re-enable; they type, watch the input refuse them, and assume the product is broken. The fix is to keep the composer alive through reconnects and let the message-status reducer handle the failed-send case if the reconnect doesn't complete in time.
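The division of labour between composer and reducer can be sketched as a small pure function. The state shape, the action names, and messageStatusReducer itself are hypothetical; the post only says that a message-status reducer handles the failed-send case.

```typescript
type SendStatus = "pending" | "sent" | "failed";

interface ChatMessage {
  id: string;
  text: string;
  status: SendStatus;
}

// Hypothetical actions: the composer always accepts input, so "queued" can
// arrive while the socket is down.
type ChatAction =
  | { type: "queued"; id: string; text: string }
  | { type: "acked"; id: string }
  | { type: "reconnectTimedOut" };

export function messageStatusReducer(state: ChatMessage[], action: ChatAction): ChatMessage[] {
  switch (action.type) {
    case "queued":
      // Accept the message immediately; delivery is resolved later.
      return [...state, { id: action.id, text: action.text, status: "pending" }];
    case "acked":
      return state.map((m): ChatMessage =>
        m.id === action.id ? { ...m, status: "sent" } : m,
      );
    case "reconnectTimedOut":
      // Anything still pending when the reconnect window closes is marked failed.
      return state.map((m): ChatMessage =>
        m.status === "pending" ? { ...m, status: "failed" } : m,
      );
  }
}

// Usage: one message delivered, one typed during a drop that never recovered.
let thread: ChatMessage[] = [];
thread = messageStatusReducer(thread, { type: "queued", id: "a", text: "hello" });
thread = messageStatusReducer(thread, { type: "acked", id: "a" });
thread = messageStatusReducer(thread, { type: "queued", id: "b", text: "typed while offline" });
thread = messageStatusReducer(thread, { type: "reconnectTimedOut" });
console.log(thread.map((m) => `${m.id}:${m.status}`).join(" ")); // a:sent b:failed
```

The point of the sketch is that the input never needs to know about the connection: failure shows up on the individual message, where the user can see and retry it.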

The 4.26 Bump That Wasn't

An OpenClaw image bump landed today and got reverted in the same session. 2026.4.26 changes the on-disk shape of the auth-profiles file in a way that requires a matching writer on Pinchy's side, and that writer isn't finished yet. Bumping the runtime image without it in place would break agent creation on the next deploy. The bump comes back as soon as the auth-profiles wiring lands; for now main stays on 2026.4.14 and the work moves to the branch where the writer is being built.

v0.5.0 Takes Shape

Some smaller pieces in the same push that are easier to enumerate than narrate. Minor dependency bumps for v0.5.0. The model resolver moved to the latest non-preview model IDs across providers; stale fixtures got swept out of test files and production defaults so a model that no longer exists doesn't appear in either. ollama-cloud got two new model entries — DeepSeek V4 and Kimi K2.6 — added to the curated list, since both have stabilised on the upstream side. The v0.5.0 upgrade notes were consolidated into a single canonical file so the upgrade page and the deep-dive can stop drifting.

Reconnect timeouts in the integration tests widened to absorb the SIGUSR1 cascades that OpenClaw fires during a config reload — the cascades are correct behaviour, but the test that expected one signal was now seeing four and flaking. The CI link checker started excluding squawkhq.com because it returns 403 from GitHub-runner IPs about a third of the time. A jsdom 29.x focus-scope cleanup error in the Radix UI tests gets swallowed in the right place — that one's a library bug we don't need to wear.

And one fresh PR on the addon repo: customer-email branding and an optional phone field in the checkout (#5). The phone field was the smaller of the two, a nice-to-have on the order, but the email branding mattered: the order-confirmation email now comes from a domain the customer recognises, with the matching logo in the header, rather than the default Odoo template that signals "generic ecommerce store".

Day 71

Today is what a Wednesday two weeks before a release looks like when the next release is the one that goes into deployments the team didn't set up themselves. None of these fixes are interesting in isolation; collectively they're the line between a product that runs on the developer's laptop and one that runs anywhere with a Docker daemon. The secrets-ownership bug in particular is the kind of failure that surfaces specifically because someone else's environment is more locked down than yours — which means it's the kind of thing you only find by running the product where you didn't write it.

