
Day 99: The Stream That Survives a Reconnect

Two days of thinking, and today the thinking turns into the most architecturally satisfying day of the week. The target is Issue #310, the one that shows up in production as "the agent didn't respond," usually followed by the user retrying and getting a duplicate. It's been on the list for a while because the real fix is structural, not a patch, and structural fixes need a clear head and a quiet runway. Today they got both, in three PRs that stack on each other.

Tier 1: Don't wipe what the user just typed (PR #439, 11:55)

The first PR is the small, defensive, ship-it-now half. The root cause: when the WebSocket dropped, the runtime flagged itself to recover from server history on the next history frame — and that recovery could replace the user's freshly-typed-but-not-yet-acknowledged message with stale server state, so it looked like the message had vanished. The fix extracts the reconcile decision into a pure helper, shouldReplaceLocalWithServerHistory, broadens it to cover the nasty window where the socket drops between the server's ack and the first response chunk, and adds a guard so a message still in the "sent" state can't be silently wiped by older server history. Nine unit tests pin every branch, plus one end-to-end test that reproduces the exact #310 scenario. It's low-risk and it stays valuable even after the architectural work lands — a killed tab or a laptop waking from sleep without OpenClaw context still needs it.
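To make the guard concrete, here is a minimal sketch of what a pure reconcile helper along these lines could look like. Only the name shouldReplaceLocalWithServerHistory comes from the PR; the input shape, the status values, and the id-based comparison are assumptions made for illustration, and the ack-to-first-chunk window is not modeled here.

```typescript
// Hypothetical shapes: the post names the helper, not its inputs.
type LocalStatus = "sending" | "sent" | "acked" | "done";

interface LocalMessage {
  id: string;
  status: LocalStatus;
}

interface ReconcileInput {
  needsRecovery: boolean; // assumed flag, set when the WebSocket dropped
  localMessages: LocalMessage[];
  serverMessageIds: string[]; // ids present in the incoming history frame
}

// Pure decision: no I/O and no state, so every branch can be unit tested.
function shouldReplaceLocalWithServerHistory(input: ReconcileInput): boolean {
  // Without a pending recovery there is nothing to reconcile.
  if (!input.needsRecovery) {
    return false;
  }
  const known = new Set(input.serverMessageIds);
  // Guard: a local message the server has not acknowledged, and that the
  // history frame does not contain, is newer than that history.
  // Replacing local state now would silently wipe it.
  const hasUnseenLocal = input.localMessages.some(
    (m) => (m.status === "sending" || m.status === "sent") && !known.has(m.id)
  );
  return !hasUnseenLocal;
}
```

With this shape, a history frame that lacks a message still in the "sent" state is rejected as stale, while a frame that already contains that message is safe to apply.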

Tier 2a: Someone on the server should know a run exists (PR #441, 12:15)

Here's the architectural gap Tier 1 can't close: when a chat run is slow and the browser has gone away, the run is invisible to the server. OpenClaw is still draining the stream (Tier 1 relies on that), but nothing on Pinchy's side is tracking the run. No audit signal, no timeline, no way to even know a run is hung. So Tier 2a adds a process-wide ActiveRuns registry keyed by session, populated lazily as each run's first chunk arrives, plus a watchdog that sweeps every 30 seconds, tears down runs past the 15-minute hard cap, and emits chat.run_timed_out with the elapsed time and run id. It also closes a long-standing observability hole by auditing runs that complete after the browser disconnected, via chat.run_completed_after_disconnect. On its own this PR ships concrete value, the audit trail and the watchdog, even if the next tier had slipped.
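As a rough illustration of the registry-plus-watchdog idea, the sketch below keeps one entry per session, creates it lazily on the first chunk, and exposes a sweep that a timer could call. The ActiveRuns name, the 15-minute cap, and the chat.run_timed_out event name come from the PR description; the method names, the field names, and measuring elapsed time from the first chunk are assumptions.

```typescript
// Hypothetical shapes: the post names the registry and the event, not their fields.
interface ActiveRun {
  runId: string;
  startedAt: number; // epoch milliseconds when the first chunk was seen
}

interface RunTimedOutEvent {
  type: "chat.run_timed_out";
  runId: string;
  elapsedMs: number;
}

const HARD_CAP_MS = 15 * 60 * 1000; // the 15-minute hard cap

class ActiveRuns {
  private readonly runs = new Map<string, ActiveRun>();

  // Lazy registration: safe to call on every chunk, only the first
  // chunk of a run creates the entry.
  noteChunk(sessionKey: string, runId: string, now: number): void {
    const existing = this.runs.get(sessionKey);
    if (existing !== undefined && existing.runId === runId) {
      return;
    }
    this.runs.set(sessionKey, { runId, startedAt: now });
  }

  // Called when a run ends normally.
  finish(sessionKey: string): void {
    this.runs.delete(sessionKey);
  }

  has(sessionKey: string): boolean {
    return this.runs.has(sessionKey);
  }

  // One watchdog pass: remove runs past the cap and return the audit
  // events to emit. Taking `now` as a parameter keeps the pass testable.
  sweep(now: number): RunTimedOutEvent[] {
    const events: RunTimedOutEvent[] = [];
    this.runs.forEach((run, sessionKey) => {
      const elapsedMs = now - run.startedAt;
      if (elapsedMs > HARD_CAP_MS) {
        this.runs.delete(sessionKey);
        events.push({ type: "chat.run_timed_out", runId: run.runId, elapsedMs });
      }
    });
    return events;
  }
}
```

In a server process, a timer firing every 30 seconds would call sweep(Date.now()) and hand the returned events to whatever audit sink exists. Actually aborting the upstream stream is left out of this sketch.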

Tier 2b: Rejoin the stream you walked away from (PR #442, 20:52)

The headline lands in the evening. Once the server knows a run is in flight, a browser that drops mid-stream and reconnects can rejoin it. The new WebSocket joins the run as a listener and receives every chunk from that point on — no orphan bubble flashing on screen, no spinner spinning forever with no response behind it. A new activeRun signal in the history response anchors the in-flight assistant message id, so the resumed chunks merge into the correct bubble after reconcile instead of spawning a duplicate. The genuinely tricky part is a race: the moment the server attaches the reconnecting socket as a listener and the moment it sends the history response can interleave, so chunks arriving ahead of history could land on state that's about to be wiped. A pre-history frame buffer and a drain protocol close that window. Every in-loop send — text, ack, error, heartbeat, the synthesised silent-stream error, the terminal frame — switches to a broadcastForRun helper, so "is anyone actually listening?" becomes one component's job instead of a guard scattered through the pipe. And switching agents in a single tab now detaches the old chat's listener cleanly, so streams can't leak across chats.
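The buffer-and-drain idea can be sketched from the client's point of view. Everything here is hypothetical apart from the activeRun signal named above: the frame shapes, the class, and its methods are made up for illustration, and the sketch assumes the history snapshot does not already contain the buffered chunks (a real protocol would need something like sequence numbers to guarantee that).

```typescript
// Hypothetical frame shapes, not Pinchy's actual wire format.
interface ChunkFrame {
  messageId: string;
  text: string;
}

interface HistoryFrame {
  messages: { id: string; text: string }[];
  activeRun?: { messageId: string }; // anchors the in-flight assistant message
}

class ResumableChat {
  private historyApplied = false;
  private pending: ChunkFrame[] = [];
  private messages = new Map<string, string>();

  onChunk(frame: ChunkFrame): void {
    if (!this.historyApplied) {
      // History has not landed yet: applying the chunk now would write to
      // state that is about to be replaced, so hold it instead.
      this.pending.push(frame);
      return;
    }
    this.append(frame);
  }

  onHistory(frame: HistoryFrame): void {
    // Reconcile: server history becomes the local state.
    this.messages = new Map<string, string>();
    frame.messages.forEach((m) => this.messages.set(m.id, m.text));
    // Make sure the in-flight bubble exists so resumed chunks merge into it.
    const anchor = frame.activeRun?.messageId;
    if (anchor !== undefined && !this.messages.has(anchor)) {
      this.messages.set(anchor, "");
    }
    this.historyApplied = true;
    // Drain: replay buffered chunks in arrival order.
    const buffered = this.pending;
    this.pending = [];
    buffered.forEach((f) => this.append(f));
  }

  // A new disconnect reopens the window, so buffering starts again.
  onDisconnect(): void {
    this.historyApplied = false;
    this.pending = [];
  }

  textOf(messageId: string): string | undefined {
    return this.messages.get(messageId);
  }

  messageCount(): number {
    return this.messages.size;
  }

  private append(frame: ChunkFrame): void {
    const current = this.messages.get(frame.messageId) ?? "";
    this.messages.set(frame.messageId, current + frame.text);
  }
}
```

The point of the design is ordering: chunks that race ahead of the history frame are never applied to doomed state, and after the drain they land in the bubble that the activeRun anchor identified rather than in a new one.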

A small evening PR (PR #443, 22:03) pins the heartbeat so that it starts lazily, on the first persisted user message, which ties off the last loose thread in the tier. Stacked correctly, the three tier PRs are one coherent change: Tier 1 stops the data loss, Tier 2a makes runs first-class server objects, Tier 2b lets a client walk away and come back.

Day 99

This is the kind of work I find most satisfying and least demonstrable. There's no screenshot for "your chat survived your train going into a tunnel." The whole payoff is an absence — the orphan bubble that doesn't appear, the duplicate retry that never happens, the run that gets audited instead of vanishing. It's also the clearest argument for why the two quiet thinking days mattered: #310's fix was never going to be a one-PR patch, and trying to force it into one is how you get a Tier 1 band-aid that calls itself a cure. Splitting it into three stacked tiers — each shippable alone — is the difference between fixing the symptom and fixing the architecture.

