Day 98: The Model Underneath Keeps Changing
Yesterday I wrote about OpenClaw moving fast. There's a second foundation under Pinchy that moves even faster, and it doesn't ship a changelog I can read: the models. Pinchy's whole posture is that it picks a good model for you — you shouldn't have to know whether Sonnet or GPT-5.5 is right for an Odoo bookkeeping agent, you should get a sensible default and the option to override. That promise is easy to make and surprisingly hard to keep, because the set of correct answers changes every few weeks and nobody tells me when.
Why "let the user pick" is the wrong answer
The tempting escape hatch is to not have an opinion: show a dropdown of every model the provider exposes and let the user choose. I've resisted that, and Day 91 is why. An admin setting up Pinchy for their team does not know, and shouldn't have to know, that gpt-4o-mini is the wrong tier for an agent that plans multi-step writes, or that a model with "preview" in its name might drop the thought_signature on a tool call and then either error or hang. The dropdown looks like freedom and functions like a trap: it pushes a research problem onto someone who came here to get an invoice drafted. The auto-default is the product. Having an opinion is the value.
But an opinion about a moving target is a maintenance commitment. Day 91 wasn't "pick the balanced tier" — it was five layers: a date parser that understands both Anthropic's YYYYMMDD and OpenAI's YYYY-MM-DD, a reject pattern that filters preview/beta/thinking/nano variants, a deterministic tiebreaker, generation-anchored patterns, and a curated per-provider fallback. Every one of those layers exists because the naive version broke on a real model name. And every one of them is a hostage to fortune: the next generation of models will have a naming convention I haven't seen, a date format I didn't anticipate, or a capability flag that lies. The drift-guard tests I added will catch the model that resolves wrong; they can't catch the model that doesn't exist yet but should be the new default.
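To make three of those layers concrete, here is a minimal sketch of the date parser, the reject pattern, and the tiebreaker. It is not Pinchy's actual resolver: the regexes, the function names, and the "newest date wins, then the ID string" rule are all assumptions for illustration, and the generation-anchored patterns and curated fallback are left out.

```python
import re
from datetime import date
from typing import Optional

# Hypothetical patterns; the real resolver's rules are not shown in this post.
# Matches both YYYYMMDD (Anthropic style) and YYYY-MM-DD (OpenAI style).
DATE_RE = re.compile(r"(?<!\d)(\d{4})-?(\d{2})-?(\d{2})(?!\d)")
# Rejects IDs that carry one of these words as a separate name segment.
REJECT_RE = re.compile(r"(?:^|[-_.:])(preview|beta|thinking|nano)(?:$|[-_.:])")


def parse_model_date(model_id: str) -> Optional[date]:
    """Return the first valid date embedded in a model ID, or None."""
    for m in DATE_RE.finditer(model_id):
        try:
            return date(int(m[1]), int(m[2]), int(m[3]))
        except ValueError:
            continue  # eight digits that are not a real calendar date
    return None


def pick_default(model_ids: list[str]) -> Optional[str]:
    """Pick one model from a single provider's catalog, or None if all are rejected."""
    candidates = [m for m in model_ids if not REJECT_RE.search(m)]
    if not candidates:
        return None
    # Newest dated model wins and undated IDs sort as oldest. The ID itself
    # breaks ties, so the result does not depend on the order in which the
    # provider's API happens to list its models.
    return max(candidates, key=lambda m: (parse_model_date(m) or date.min, m))


catalog = ["gpt-4o-2024-05-13", "o1-preview-2024-09-12", "gpt-4o-2024-08-06"]
print(pick_default(catalog))
```

Even this toy version shows the fragility: a provider that starts writing dates as YYMM, or labels unstable builds "experimental" rather than "preview", passes straight through both patterns.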
The shiny-broken problem
The hardest case isn't a bad model — it's a good-looking broken one. gemini-3-flash-preview on Ollama Cloud advertises a million-token context window with reasoning, vision, and tools. On paper it's the perfect invoice-processing model. In practice it drops tool-call signatures and stalls. Day 91's blocklist forbids it for tool workloads and falls back to a model with a quarter of the context window that actually works end to end. That's the right call, and it's also a confession: the picker can't tell "advertised" from "actually works." Capability flags are marketing until proven otherwise, and the only thing that proves otherwise is a real workload failing in a real chat. The blocklist is a list of lessons learned the hard way, and it only ever grows.
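The shape of that blocklist-plus-fallback decision can be sketched in a few lines. This is an illustration under assumptions, not the Day 91 code: `ModelInfo`, `TOOL_BLOCKLIST`, `pick_for_tools`, and the fallback's name are hypothetical, and only the blocked model ID and the rough context sizes come from the case described above.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ModelInfo:
    model_id: str
    context_window: int      # tokens, as advertised by the provider
    advertises_tools: bool   # an advertised flag, not proof that tools work


# Models observed failing real tool workloads, whatever their flags claim.
TOOL_BLOCKLIST = frozenset({"gemini-3-flash-preview"})


def pick_for_tools(candidates: Iterable[ModelInfo], fallback: ModelInfo) -> ModelInfo:
    """Return the first candidate that claims tool support and is not blocklisted."""
    for model in candidates:
        if model.advertises_tools and model.model_id not in TOOL_BLOCKLIST:
            return model
    # Nothing trustworthy in the catalog: use the curated fallback instead.
    return fallback


shiny = ModelInfo("gemini-3-flash-preview", 1_000_000, True)
# Placeholder name; the post does not say which model the fallback is.
plain = ModelInfo("example-fallback-model", 250_000, True)
print(pick_for_tools([shiny], plain).model_id)
```

Nothing in `ModelInfo` distinguishes the two models except the blocklist entry, which is the point: the knowledge that one of them stalls lives outside anything the provider reports.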
The Pinchy-specific twist
A single-provider product would have one model treadmill to walk. Pinchy supports four hosted providers (Anthropic, OpenAI, Google, and Ollama Cloud), plus the self-hosted Ollama case, where the "models" are whatever the operator pulled onto their own box, on their own schedule, and there is no known catalog I can check them against. So the churn isn't just "new models ship monthly"; it's four catalogs drifting independently, plus an open-ended local one. The balanced-tier anchors that keep the resolver honest have to mean something sensible in all of those worlds at once, and "sensible" is renegotiated every time any one of the four providers renames a model or ships a new flagship.
The deeper question I don't have a clean answer to: how much of "which model is good for this job" should be code at all? Today it's hardcoded patterns and anchors that I update by hand when the landscape shifts. That doesn't scale with the rate of change, and it makes Pinchy's quality a function of how recently I last paid attention. There's a version of this where the resolver probes capabilities empirically — runs a tiny tool-use smoke test against a candidate model and trusts behavior over advertised flags — and a version where the blocklist and anchors are config an operator can override for their own pulled models. Both are real work for v0.6 and beyond. Both are attempts to make a good default survive a foundation that's actively trying to invalidate it.
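Neither version exists yet, so what follows is only a sketch of what they might look like. The `chat` callable stands in for whatever provider client would actually be used, and its signature, the reply shape (`tool_calls` with `name` and `arguments`), the probe tool, and `effective_blocklist` are all assumptions made for illustration.

```python
import json
from typing import Any, Callable

# Hypothetical client signature: (model_id, prompt, tools) -> reply dict.
ChatFn = Callable[[str, str, list[dict[str, Any]]], dict[str, Any]]

PROBE_TOOL = {
    "name": "add",
    "description": "Add two integers.",
    "parameters": {
        "type": "object",
        "properties": {"a": {"type": "integer"}, "b": {"type": "integer"}},
        "required": ["a", "b"],
    },
}


def passes_tool_smoke_test(model_id: str, chat: ChatFn) -> bool:
    """Trust behavior over flags: did the model emit one well-formed tool call?"""
    try:
        reply = chat(model_id, "Use the add tool to compute 2 + 3.", [PROBE_TOOL])
    except Exception:
        return False  # errors and timeouts both count as failing the probe
    calls = reply.get("tool_calls") or []
    if len(calls) != 1 or calls[0].get("name") != "add":
        return False
    args = calls[0].get("arguments")
    if isinstance(args, str):  # some APIs return arguments as a JSON string
        try:
            args = json.loads(args)
        except json.JSONDecodeError:
            return False
    if not isinstance(args, dict):
        return False
    a, b = args.get("a"), args.get("b")
    return isinstance(a, int) and isinstance(b, int) and {a, b} == {2, 3}


def effective_blocklist(builtin: frozenset[str], deny: set[str], allow: set[str]) -> frozenset[str]:
    """Merge operator overrides over the shipped blocklist; an explicit allow wins."""
    return frozenset((builtin | deny) - allow)
```

A probe like this only shows that one trivial tool call works today. It would not catch a model that fails on longer multi-step plans, so it would more plausibly sit alongside the blocklist than replace it.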
Day 98
Two quiet days, two essays about the same uncomfortable truth: the most important layers of Pinchy are built on foundations I don't control and that move faster than I can. OpenClaw yesterday, the models today. The work isn't to stop them moving — it's to build defaults good enough to be worth defending, and defenses honest enough to admit when they've eroded. The day the auto-default stops being better than a dropdown is the day I've stopped paying attention, and these posts are partly how I make sure I notice.