
Day 32: The PDF That Needed Eyes

The plan was simple: make Pinchy agents read PDFs. The reality was a 28-commit rabbit hole that ended with agents literally looking at documents they couldn't read.

When Text Extraction Isn't Enough

It started with pdfjs-dist, Mozilla's PDF parser. Extract text, feed it to the agent, done. I wrote the tests first (TDD, naturally), built the extractor, added an XML+Markdown formatter so agents get clean structured output. Worked beautifully — for text-based PDFs.

Then I tested with a scanned document. Zero extractable text. Just images of text, trapped in a PDF wrapper. The extractor returned empty pages and the agent would confidently say "this document appears to be empty."
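The detection itself is simple enough to sketch. This is an illustrative version, not Pinchy's actual code: a page whose extracted text is empty or whitespace-only gets routed to the vision fallback instead of the text path.

```typescript
// Illustrative fallback decision — names are hypothetical, not Pinchy's API.
interface ExtractedPage {
  pageNumber: number;
  text: string;
}

function needsVisionFallback(page: ExtractedPage): boolean {
  // Scanned pages come back as empty strings or stray whitespace.
  return page.text.trim().length === 0;
}

function splitPages(pages: ExtractedPage[]) {
  const textPages = pages.filter((p) => !needsVisionFallback(p));
  const imagePages = pages.filter(needsVisionFallback);
  return { textPages, imagePages };
}

const { textPages, imagePages } = splitPages([
  { pageNumber: 1, text: "Invoice #42\nTotal: $10" },
  { pageNumber: 2, text: "   \n" }, // scanned page: no extractable text
]);
console.log(textPages.length, imagePages.length); // 1 1
```

The payoff is that "this document appears to be empty" becomes impossible: an empty page is a signal to look, not a result to report.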

Not great.

Teaching Agents to See

The fix: when a PDF page has no extractable text, render it as a PNG and send it to the LLM's vision API. The agent literally looks at the page instead of reading it.

This sounds simple. It was not.

First, I needed a page renderer. @napi-rs/canvas for server-side PNG rendering, because you can't just screenshot a PDF in a headless Docker container. Then the vision fallback module — detect empty pages, render them, send them to the model's vision endpoint.

Then I tried sending images as native content blocks instead of separate API calls. Cleaner, fewer round-trips, works with Anthropic and OpenAI out of the box. But scanned PDFs can have 50+ pages. That's 50 PNG renders and 50 vision calls.
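Roughly what a mixed message looks like with Anthropic-style content blocks (OpenAI's chat format uses `image_url` parts instead). The function name, prompt text, and base64 payload here are illustrative stand-ins:

```typescript
// Hypothetical message builder for a page with no extractable text.
type ContentBlock =
  | { type: "text"; text: string }
  | {
      type: "image";
      source: { type: "base64"; media_type: "image/png"; data: string };
    };

function buildPageMessage(pageNumber: number, pngBase64: string): ContentBlock[] {
  return [
    {
      type: "text",
      text: `Page ${pageNumber} had no extractable text; transcribe the image.`,
    },
    {
      type: "image",
      source: { type: "base64", media_type: "image/png", data: pngBase64 },
    },
  ];
}

const blocks = buildPageMessage(3, "iVBORw0KGgo="); // placeholder, not a real PNG
console.log(blocks[0].type, blocks[1].type); // text image
```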

Making It Fast

Performance iteration in one afternoon:

  1. Sequential — one page at a time. Technically correct, painfully slow.
  2. Worker thread — offload PDF extraction so it doesn't block the event loop. Great idea, except worker threads need tsx to run TypeScript — which isn't available in the Docker container. Dead end.
  3. setImmediate yield — the pragmatic fix. Yield to the event loop between pages so other agents can still respond while a PDF is being processed. Simpler, no extra dependencies.
  4. Parallel vision calls — fire all pages at once, retry on 429 (rate limit). Massive speedup.
  5. SQLite cache — content-hash-based. Same PDF, same result, zero API calls. Because nobody wants to pay for vision on a document they already processed yesterday.
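Steps 3 and 4 can be sketched together. Everything here is illustrative (`callVision` stands in for the real vision client), but it shows the shape: yield to the event loop between dispatches, run the calls in parallel, retry only on 429.

```typescript
// Illustrative sketch: yield between dispatches, parallel calls, retry on 429.
const yieldToEventLoop = () => new Promise<void>((r) => setImmediate(r));

async function withRetry<T>(fn: () => Promise<T>, attempts = 3): Promise<T> {
  for (let i = 0; ; i++) {
    try {
      return await fn();
    } catch (err: any) {
      // Retry only on rate limits, with simple exponential backoff.
      if (err?.status !== 429 || i >= attempts - 1) throw err;
      await new Promise((r) => setTimeout(r, 2 ** i * 250));
    }
  }
}

async function describePages(
  pages: number[],
  callVision: (page: number) => Promise<string>
): Promise<string[]> {
  const tasks: Promise<string>[] = [];
  for (const page of pages) {
    tasks.push(withRetry(() => callVision(page)));
    await yieldToEventLoop(); // let other agents run between dispatches
  }
  return Promise.all(tasks);
}
```

The point of the yield is fairness, not speed: one 50-page PDF shouldn't starve every other agent on the box.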

The final result: a scanned 20-page PDF processes in seconds on the second read. First read depends on your vision API speed, but at least it's parallel.
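The caching idea in miniature. Pinchy persists to SQLite; an in-memory `Map` stands in here so the sketch stays self-contained. Keying on a hash of the PDF bytes means a renamed or re-uploaded copy of the same file still hits.

```typescript
import { createHash } from "node:crypto";

// Illustrative content-hash cache — a Map stands in for the SQLite table.
const cache = new Map<string, string>();

function contentHash(pdfBytes: Uint8Array): string {
  return createHash("sha256").update(pdfBytes).digest("hex");
}

async function processWithCache(
  pdfBytes: Uint8Array,
  process: (bytes: Uint8Array) => Promise<string>
): Promise<string> {
  const key = contentHash(pdfBytes);
  const hit = cache.get(key);
  if (hit !== undefined) return hit; // second read: zero API calls
  const result = await process(pdfBytes);
  cache.set(key, result);
  return result;
}
```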

The Security Side Quest

While deep in PDF code, I noticed our Kysely dependency had a SQL injection vulnerability in versions ≤0.28.13. Patched that same day. Also found that the modelId parameter in the vision path could theoretically be used for URL injection — added validation and tests.
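A hedged sketch of what that validation can look like. The exact character set Pinchy allows may differ, but the principle holds: when a user-supplied identifier ends up in a request URL, allow-list it, don't deny-list it.

```typescript
// Illustrative allow-list for a model identifier used in a request URL.
// The pattern here is an assumption, not Pinchy's actual rule.
const MODEL_ID = /^[a-zA-Z0-9][a-zA-Z0-9._:-]{0,127}$/;

function assertValidModelId(modelId: string): string {
  if (!MODEL_ID.test(modelId)) {
    throw new Error(`invalid modelId: ${JSON.stringify(modelId)}`);
  }
  return modelId;
}
```

Anything with a slash, query character, or leading dot is rejected outright, which closes the path-traversal and URL-injection angles in one check.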

Then the bigger discovery: some OpenClaw tools could be invoked directly, bypassing Pinchy's per-agent access control. Not a vulnerability anyone had exploited, but the kind of thing you fix the moment you see it. Agents now go through the allow-list, period.

Five Feature Pages, One Pipeline

Between PDF commits, the marketing side wasn't idle either. Five detailed feature pages went live on heypinchy.com.

Each page has live screenshots that update automatically. Which brings me to the pipeline.

The Screenshot Pipeline (Thanks, Benedikt)

This one goes to Benedikt Poller, who suggested automating our screenshots. He was right — manually taking screenshots for five feature pages is the kind of thing you do once and then never update.

The solution: Playwright runs against a seeded Pinchy instance in CI, captures nine screenshots, uploads them as artifacts, and dispatches to the website repo which pulls and deploys them. Every Pinchy release gets fresh screenshots. No manual work, no stale images.
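As a hypothetical GitHub Actions sketch (the job names, seed script, spec file, and target repo below are placeholders, not Pinchy's actual files):

```yaml
# Illustrative workflow sketch — paths and names are assumptions.
name: release-screenshots
on:
  release:
    types: [published]
jobs:
  screenshots:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: docker compose up -d        # boot a Pinchy instance
      - run: npm run seed:demo           # assumed seed script: users, agents, groups
      - run: npx playwright test screenshots.spec.ts
      - uses: actions/upload-artifact@v4
        with:
          name: screenshots
          path: screenshots/
      - name: Trigger website deploy
        run: gh workflow run deploy.yml --repo example/heypinchy-website
        env:
          GH_TOKEN: ${{ secrets.WEBSITE_DISPATCH_TOKEN }}
```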

The seed script populates demo data — users, agents, groups, permissions, audit trail entries — so every screenshot tells a coherent story. We even documented the demo scenario so future contributors know the cast of characters.

Also: NemoClaw

Oh, and I wrote a whole analysis of what Nvidia's NemoClaw means for Pinchy. That was its own post. Because when Jensen Huang validates your market, you don't bury it in a daily update.

Day 32

28 commits on PDF support. A security patch. Five feature pages. An automated screenshot pipeline. A NemoClaw deep-dive. And one very tired founder who should probably stop counting the days and start counting the features.

But not yet. Tomorrow's another day.


Pinchy is open source and ready to deploy. Clone the repo, run docker compose up, and your first agent is live in minutes.