Short, sanitized field notes on making AI and document-automation output you can actually trust — each one drawn from a real project.
Field note · Document extraction
Tested code should do the math, not the model
Deterministic extraction, tie-out validation, and golden-file tests beat trusting an LLM with numbers.
An LLM that reads a financial statement and reports the figures is quietly doing two jobs: finding the numbers, and vouching for them. The second job is the dangerous one — a model will state a total as confidently when it is wrong as when it is right, and “looks plausible” is not a standard you can defend in an audit.
My financial-statement extractor splits those jobs apart. The model only locates and structures the data; tested code then re-derives every total independently: footing each column (summing down), crossfooting each row (summing across), and checking that the statements articulate, meaning a balance carried from one statement to another matches in both places. For born-digital PDFs there is no OCR step to introduce its own errors. The model never performs the arithmetic; code that has a regression test does.
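As a rough illustration of footing and crossfooting, here is a minimal Python sketch. The function name `tie_out`, the row layout, and the figures are all hypothetical and are not taken from the extractor itself; the only assumption is that the model has already handed over amounts and reported totals as structured data.

```python
from decimal import Decimal

def tie_out(rows, column_totals, grand_total):
    """Re-derive every total and return a list of exceptions (empty = all tied).

    rows: list of (label, [amount per column], reported_row_total).
    This layout is an assumption made for the sketch.
    """
    exceptions = []
    zero = Decimal("0")

    # Crossfoot: each row's amounts must sum across to its reported total.
    for label, amounts, reported in rows:
        computed = sum(amounts, zero)
        if computed != reported:
            exceptions.append(
                f"crossfoot {label}: computed {computed}, reported {reported}"
            )

    # Foot: each column's amounts must sum down to its reported total.
    for i, reported in enumerate(column_totals):
        computed = sum((amounts[i] for _, amounts, _ in rows), zero)
        if computed != reported:
            exceptions.append(
                f"foot column {i}: computed {computed}, reported {reported}"
            )

    # The column totals must themselves sum to the grand total.
    if sum(column_totals, zero) != grand_total:
        exceptions.append("column totals do not sum to the grand total")
    return exceptions


D = Decimal
# Made-up figures, used only to exercise the checks.
rows = [
    ("Taxes", [D("100.00"), D("50.00")], D("150.00")),
    ("Grants", [D("30.00"), D("20.00")], D("50.00")),
]
clean = tie_out(rows, [D("130.00"), D("70.00")], D("200.00"))

# Inject one wrong figure: the reported Grants total is off by 5.
bad_rows = [rows[0], ("Grants", [D("30.00"), D("20.00")], D("55.00"))]
broken = tie_out(bad_rows, [D("130.00"), D("70.00")], D("200.00"))
```

With the clean figures `clean` is an empty list; with the injected error `broken` holds a single crossfoot exception naming the Grants row. `Decimal` is used instead of floats so that equality checks on currency amounts are exact.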
The proof is a golden-file test. On one published government audit report it runs 25 checks with 0 exceptions; inject a single wrong figure and the tie-out fails loudly instead of passing quietly. That is the whole point: the output is not “probably correct,” it is verifiably correct, and it stays that way every time the suite runs.
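A golden-file check of this kind might look like the sketch below. It is an assumption-laden illustration, not the project's actual suite: the file name, the keys, the figures, and the helper `compare_to_golden` are all invented, and the golden file is written to a temporary directory only so the example runs on its own.

```python
import json
import tempfile
from pathlib import Path

def compare_to_golden(extracted: dict, golden_path: Path) -> list:
    """Return one message per key whose value differs from the golden file."""
    golden = json.loads(golden_path.read_text())
    diffs = []
    for key in sorted(set(golden) | set(extracted)):
        if golden.get(key) != extracted.get(key):
            diffs.append(
                f"{key}: expected {golden.get(key)!r}, got {extracted.get(key)!r}"
            )
    return diffs


# Hypothetical figures standing in for one extraction run.
# Amounts are kept as strings so no float rounding creeps in.
extracted = {
    "total_revenue": "200.00",
    "total_expenses": "180.00",
    "change_in_net_position": "20.00",
}

with tempfile.TemporaryDirectory() as tmp:
    golden_path = Path(tmp) / "audit_report.golden.json"
    golden_path.write_text(json.dumps(extracted, indent=2, sort_keys=True))

    # An unchanged run matches the golden file exactly.
    clean_diffs = compare_to_golden(extracted, golden_path)

    # Inject a single wrong figure and the comparison must report it.
    tampered = dict(extracted, total_expenses="181.00")
    tampered_diffs = compare_to_golden(tampered, golden_path)
```

In a real suite the golden file would be checked into the repository and the two comparisons would be separate test cases: one asserting no differences, one asserting that a deliberately corrupted figure is caught.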
Earning improvements: a 7-tier verification gauntlet
How an autonomous improvement loop stays honest — a sealed holdout and a fresh-context critic.
Any system that edits and re-scores its own code has one obvious failure mode: it convinces itself it improved when it did not. Optimize against a metric the loop can see, and the loop will learn to game the metric long before it learns to do the work.
AgentA answers that with a gauntlet every change must survive before it is kept: parse, unit tests, property tests, mutation testing, a benchmark delta, a sealed holdout the loop never trains or tunes against, and finally a fresh-context critic that judges the diff with no memory of how it was produced. A change that improves the visible metric but regresses on the holdout — or that a clean-slate reviewer can’t justify — is discarded, not promoted.
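The ordering described above can be sketched as a list of checks that short-circuits on the first failure. Everything here is hypothetical: the `Tier` type, the field names on the candidate, and the 0.8 mutation-score threshold are placeholders, and each check reads a precomputed result rather than running real tools.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

@dataclass(frozen=True)
class Tier:
    name: str
    check: Callable[[dict], bool]  # True means the candidate survives this tier

# Seven tiers in the order the note lists them. Thresholds are assumptions.
TIERS = [
    Tier("parse", lambda c: c["parses"]),
    Tier("unit tests", lambda c: c["unit_tests_pass"]),
    Tier("property tests", lambda c: c["property_tests_pass"]),
    Tier("mutation testing", lambda c: c["mutation_score"] >= 0.8),
    Tier("benchmark delta", lambda c: c["benchmark_delta"] > 0),
    Tier("sealed holdout", lambda c: c["holdout_delta"] >= 0),
    Tier("fresh-context critic", lambda c: c["critic_approves"]),
]

def run_gauntlet(candidate: dict, tiers=TIERS) -> Tuple[bool, Optional[str]]:
    """Return (kept, failed_tier). Stops at the first tier that rejects."""
    for tier in tiers:
        if not tier.check(candidate):
            return False, tier.name
    return True, None


# A change that improves the visible benchmark but regresses on the holdout.
gamed = {
    "parses": True,
    "unit_tests_pass": True,
    "property_tests_pass": True,
    "mutation_score": 0.9,
    "benchmark_delta": 0.03,
    "holdout_delta": -0.01,
    "critic_approves": True,
}
earned = dict(gamed, holdout_delta=0.02)

gamed_result = run_gauntlet(gamed)
earned_result = run_gauntlet(earned)
```

Here `gamed_result` is `(False, "sealed holdout")` and `earned_result` is `(True, None)`. Running the cheap tiers first means an unparseable or test-breaking change never reaches the expensive holdout and critic stages.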
The sealed holdout is the load-bearing piece. Gains measured only on data the loop can see are cheap to fake; gains that hold up on data it has never touched are what separate an earned improvement from a hallucinated one. The autonomous inner research loop is ported from Udit Goenka’s autoresearch (MIT, based on Andrej Karpathy’s work); the meta-improvement architecture and the verification stack are my own.
Notes on shipping a real tax-compliance SaaS and keeping it trustworthy.
OBBBA Tracker is a live B2B SaaS for tipped-industry employers documenting the “no tax on tips & overtime” deductions under the One Big Beautiful Bill Act. It runs on real subscriptions with a free trial — which means it has to be right, not just demo-able.
Most of the engineering lives in the unglamorous parts: automatic Treasury Tipped Occupation Code assignment, FLSA overtime calculations, and W-2 Box 14 exports that line up cleanly with ADP, Gusto, and QuickBooks. Multi-tenant role-based access and an audit trail let an employer show their work, and an analytics dashboard turns the raw records into something a decision-maker can read.
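For a sense of what the overtime piece involves, here is a deliberately simplified sketch of the basic FLSA weekly rule: hours over 40 in a workweek are paid at no less than one and one-half times the regular rate. The function name and return shape are hypothetical, and the sketch assumes the regular rate is already known; determining that rate for tipped employees is more involved and is not shown. The premium portion is tracked separately on the assumption that downstream reporting needs it as its own figure.

```python
from decimal import Decimal, ROUND_HALF_UP

WEEKLY_THRESHOLD = Decimal("40")  # FLSA overtime applies to hours beyond 40 in a workweek
CENT = Decimal("0.01")

def weekly_pay(hours_worked: Decimal, regular_rate: Decimal) -> dict:
    """Split one workweek's pay into straight time and overtime premium.

    Simplified sketch: assumes a single, already-determined regular rate.
    """
    overtime_hours = max(hours_worked - WEEKLY_THRESHOLD, Decimal("0"))
    # Every hour worked earns the regular rate...
    straight_time = (hours_worked * regular_rate).quantize(CENT, rounding=ROUND_HALF_UP)
    # ...and each overtime hour earns an extra half of the regular rate on top.
    premium = (overtime_hours * regular_rate * Decimal("0.5")).quantize(
        CENT, rounding=ROUND_HALF_UP
    )
    return {
        "overtime_hours": overtime_hours,
        "straight_time": straight_time,
        "overtime_premium": premium,
        "total": straight_time + premium,
    }


# Hypothetical week: 46 hours at a $20.00 regular rate.
week = weekly_pay(Decimal("46"), Decimal("20.00"))
```

For this week the sketch yields 6 overtime hours, 920.00 of straight time, a 60.00 premium, and 980.00 in total, which matches 40 hours at 20.00 plus 6 hours at 30.00.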
Shipping a compliance product while the underlying rules are still settling means the data model has to absorb change without ever losing the audit trail — every number a customer relies on needs to trace back to its source. That discipline, more than any single feature, is what makes a compliance tool worth paying for.
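One common way to let a data model absorb rule changes without losing history is an append-only log, where a correction adds a new entry instead of overwriting the old one. The sketch below shows that general pattern only; the class names, fields, identifiers, and rule-version labels are all invented and do not describe the product's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional

@dataclass(frozen=True)
class Entry:
    record_id: str      # which figure this is (hypothetical key format)
    value: str          # amount as a string to avoid float rounding
    source: str         # where the number came from
    rule_version: str   # which version of the rules produced it
    recorded_at: datetime

class AuditLog:
    """Append-only store: entries are added, never edited or deleted."""

    def __init__(self) -> None:
        self._entries: List[Entry] = []

    def record(self, record_id: str, value: str, source: str, rule_version: str) -> Entry:
        entry = Entry(record_id, value, source, rule_version, datetime.now(timezone.utc))
        self._entries.append(entry)
        return entry

    def history(self, record_id: str) -> List[Entry]:
        """Every entry ever recorded for this figure, oldest first."""
        return [e for e in self._entries if e.record_id == record_id]

    def current(self, record_id: str) -> Optional[Entry]:
        """The most recently recorded entry, or None if there is none."""
        entries = self.history(record_id)
        return entries[-1] if entries else None


log = AuditLog()
key = "emp-17/2025-W03/tips"  # hypothetical identifier
log.record(key, "421.50", source="pos-export.csv", rule_version="interim")
# A later rule change produces a corrected figure; the old entry is kept.
log.record(key, "412.50", source="pos-export.csv", rule_version="final")

latest = log.current(key)
trail = log.history(key)
```

After the correction, `latest.value` is `"412.50"` while `trail` still contains both entries, so the earlier figure and the rule version that produced it remain visible.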