Day 32: The PDF That Needed Eyes
The plan was simple: make Pinchy agents read PDFs. The reality was a 28-commit rabbit hole that ended with agents literally looking at documents they couldn't read.
When Text Extraction Isn't Enough
It started with pdfjs-dist, Mozilla's PDF parser. Extract text, feed it to the agent, done. I wrote the tests first (TDD, naturally), built the extractor, added an XML+Markdown formatter so agents get clean structured output. Worked beautifully — for text-based PDFs.
Then I tested with a scanned document. Zero extractable text. Just images of text, trapped in a PDF wrapper. The extractor returned empty pages and the agent would confidently say "this document appears to be empty."
Not great.
Teaching Agents to See
The fix: when a PDF page has no extractable text, render it as a PNG and send it to the LLM's vision API. The agent literally looks at the page instead of reading it.
This sounds simple. It was not.
First, I needed a page renderer. @napi-rs/canvas for server-side PNG rendering, because you can't just screenshot a PDF in a headless Docker container. Then the vision fallback module — detect empty pages, render them, send them to the model's vision endpoint.
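The routing decision at the heart of the fallback is small enough to sketch. This is a minimal illustration, not Pinchy's actual code — `ExtractedPage`, `PageInput`, and `routePages` are hypothetical names:

```typescript
// Sketch of the fallback decision: pages with no extractable text
// get routed to the vision path instead of the text pipeline.
interface ExtractedPage {
  pageNumber: number;
  text: string;
}

type PageInput =
  | { kind: "text"; pageNumber: number; text: string }
  | { kind: "image"; pageNumber: number };

// Decide, per page, whether the agent reads text or "looks" at a render.
function routePages(pages: ExtractedPage[]): PageInput[] {
  return pages.map((p) =>
    p.text.trim().length > 0
      ? { kind: "text", pageNumber: p.pageNumber, text: p.text }
      : { kind: "image", pageNumber: p.pageNumber } // render to PNG downstream
  );
}
```

The important detail is trimming before the emptiness check: scanned PDFs often extract as whitespace, not as truly empty strings.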
Then I tried sending images as native content blocks instead of separate API calls. Cleaner, fewer round-trips, works with Anthropic and OpenAI out of the box. But scanned PDFs can have 50+ pages. That's 50 PNG renders and 50 vision calls.
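For context, the two providers expect slightly different image block shapes inside a message. The sketch below follows the publicly documented Anthropic and OpenAI chat formats; treat the field names as an assumption about those APIs, not a copy of Pinchy's code:

```typescript
// Build one message whose content mixes text and rendered page images,
// instead of firing a separate vision call per page.
type Provider = "anthropic" | "openai";

function imageBlock(provider: Provider, pngBase64: string): object {
  if (provider === "anthropic") {
    // Anthropic takes raw base64 plus an explicit media type.
    return {
      type: "image",
      source: { type: "base64", media_type: "image/png", data: pngBase64 },
    };
  }
  // OpenAI takes a data URL inside an image_url block.
  return {
    type: "image_url",
    image_url: { url: `data:image/png;base64,${pngBase64}` },
  };
}

function buildMessage(provider: Provider, prompt: string, pages: string[]) {
  return {
    role: "user",
    content: [
      { type: "text", text: prompt },
      ...pages.map((p) => imageBlock(provider, p)),
    ],
  };
}
```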
Making It Fast
Performance iteration in one afternoon:
- Sequential — one page at a time. Technically correct, painfully slow.
- Worker thread — offload PDF extraction so it doesn't block the event loop. Great idea, except worker threads need tsx to run TypeScript — which isn't available in the Docker container. Dead end.
- setImmediate yield — the pragmatic fix. Yield to the event loop between pages so other agents can still respond while a PDF is being processed. Simpler, no extra dependencies.
- Parallel vision calls — fire all pages at once, retry on 429 (rate limit). Massive speedup.
- SQLite cache — content-hash-based. Same PDF, same result, zero API calls. Because nobody wants to pay for vision on a document they already processed yesterday.
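The setImmediate trick is worth spelling out, because it's tiny. A sketch, with `processPage` standing in for whatever per-page work (extraction, rendering) is being done:

```typescript
// Process pages one at a time, but hand control back to the event loop
// between pages so other agents' requests stay responsive.
async function processSequentiallyButPolitely<T, R>(
  pages: T[],
  processPage: (page: T) => R
): Promise<R[]> {
  const results: R[] = [];
  for (const page of pages) {
    results.push(processPage(page));
    // Yield: queued I/O callbacks run before the next page starts.
    await new Promise<void>((resolve) => setImmediate(resolve));
  }
  return results;
}
```

No worker threads, no extra dependencies — just cooperative scheduling on the event loop Node already gives you.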
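The parallel-with-retry pattern looks roughly like this. It's a sketch, not the production code — `callVision` is a hypothetical stand-in for the provider call, and the backoff parameters are illustrative:

```typescript
// Retry a call when it fails with HTTP 429, backing off exponentially.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 500
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      const rateLimited = err?.status === 429;
      if (!rateLimited || attempt >= maxRetries) throw err;
      // Wait longer each time the provider says "slow down".
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt));
    }
  }
}

// Fire all page-description calls at once; each retries independently.
async function describeAllPages(
  pngs: string[],
  callVision: (png: string) => Promise<string>
): Promise<string[]> {
  return Promise.all(pngs.map((png) => withRetry(() => callVision(png))));
}
```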
The final result: a scanned 20-page PDF processes in seconds on the second read. First read depends on your vision API speed, but at least it's parallel.
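The caching idea, sketched with an in-memory Map standing in for the SQLite table — the hashing and lookup logic is the same either way:

```typescript
import { createHash } from "node:crypto";

// Content-hash cache: key the result by a SHA-256 of the PDF bytes,
// so an identical document never triggers a second round of vision calls.
const cache = new Map<string, string>();

function contentKey(pdfBytes: Uint8Array): string {
  return createHash("sha256").update(pdfBytes).digest("hex");
}

async function processWithCache(
  pdfBytes: Uint8Array,
  expensiveProcess: (bytes: Uint8Array) => Promise<string>
): Promise<string> {
  const key = contentKey(pdfBytes);
  const hit = cache.get(key);
  if (hit !== undefined) return hit; // same PDF, same result, zero API calls
  const result = await expensiveProcess(pdfBytes);
  cache.set(key, result);
  return result;
}
```

Hashing the bytes rather than keying on the filename means a renamed copy of yesterday's document still hits the cache.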
The Security Side Quest
While deep in PDF code, I noticed our Kysely dependency had a SQL injection vulnerability in versions ≤0.28.13. Patched that same day. Also found that the modelId parameter in the vision path could theoretically be used for URL injection — added validation and tests.
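The modelId fix boils down to an allow-list of characters. The pattern below is illustrative, not Pinchy's exact rule — the point is that anything interpolated into a URL must not be able to smuggle in path separators or query strings:

```typescript
// Reject any modelId that contains characters capable of altering a URL
// (slashes, question marks, ampersands, whitespace, and so on).
const MODEL_ID_PATTERN = /^[a-zA-Z0-9._:-]+$/;

function assertValidModelId(modelId: string): string {
  if (!MODEL_ID_PATTERN.test(modelId)) {
    throw new Error(`Invalid modelId: ${JSON.stringify(modelId)}`);
  }
  return modelId;
}
```

Validating at the boundary and throwing early is cheaper than trying to sanitize later, and it makes the invariant easy to test.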
Then the bigger discovery: some OpenClaw tools could be invoked directly, bypassing Pinchy's per-agent access control. Not a vulnerability anyone had exploited, but the kind of thing you fix the moment you see it. Agents now go through the allow-list, period.
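Conceptually the fix is a single choke point that every invocation must pass through. A minimal sketch, with hypothetical names — the real check lives wherever Pinchy routes tool calls:

```typescript
// Per-agent allow-list: a tool runs only if this agent was granted it.
type AgentId = string;

const allowList = new Map<AgentId, Set<string>>();

function invokeGuarded(
  agentId: AgentId,
  tool: string,
  run: () => unknown
): unknown {
  const allowed = allowList.get(agentId);
  if (!allowed || !allowed.has(tool)) {
    throw new Error(`Agent ${agentId} is not allowed to use ${tool}`);
  }
  return run();
}
```

The key property: there is no code path to `run()` that skips the check, so "invoked directly" stops being a category of bug.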
Five Feature Pages, One Pipeline
Between PDF commits, the marketing side wasn't idle either. Five detailed feature pages went live on heypinchy.com:
- Agent Management — personalities, knowledge bases, the five settings tabs
- Agent Permissions — the allow-list model that makes Pinchy different
- Audit Trail — cryptographic logging, tamper detection, compliance
- User Management — roles, invites, the full lifecycle
- Groups — enterprise RBAC, visibility modes, data isolation
Each page has live screenshots that update automatically. Which brings me to the pipeline.
The Screenshot Pipeline (Thanks, Benedikt)
This one goes to Benedikt Poller, who suggested automating our screenshots. He was right — manually taking screenshots for five feature pages is the kind of thing you do once and then never update.
The solution: Playwright runs against a seeded Pinchy instance in CI, captures nine screenshots, uploads them as artifacts, and dispatches to the website repo, which pulls and deploys them. Every Pinchy release gets fresh screenshots. No manual work, no stale images.
The seed script populates demo data — users, agents, groups, permissions, audit trail entries — so every screenshot tells a coherent story. We even documented the demo scenario so future contributors know the cast of characters.
Also: NemoClaw
Oh, and I wrote a whole analysis of what Nvidia's NemoClaw means for Pinchy. That was its own post. Because when Jensen Huang validates your market, you don't bury it in a daily update.
Day 32
28 commits on PDF support. A security patch. Five feature pages. An automated screenshot pipeline. A NemoClaw deep-dive. And one very tired founder who should probably stop counting the days and start counting the features.
But not yet. Tomorrow's another day.