Writing & research

Notes on verifiable systems

Short, sanitized field notes on making AI and document-automation output you can actually trust — each one drawn from a real project.

Field note · Document extraction

Tested code should do the math, not the model

Deterministic extraction, tie-out validation, and golden-file tests beat trusting an LLM with numbers.

An LLM that reads a financial statement and reports the figures is quietly doing two jobs: finding the numbers, and vouching for them. The second job is the dangerous one — a model will state a total as confidently when it is wrong as when it is right, and “looks plausible” is not a standard you can defend in an audit.

My financial-statement extractor splits those jobs apart. The model only locates and structures the data; tested code then re-derives every total independently — footing each column, crossfooting each row, and checking that the statements articulate against one another. For born-digital PDFs there is no OCR step to introduce its own errors. The model never performs the arithmetic; code that has a regression test does.
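A minimal sketch of what footing and crossfooting look like in code, assuming the model has already returned the line items and the printed totals as structured data. The function name `tie_out`, the argument layout, and the two-fund schedule are hypothetical and are not taken from the extractor itself; the cross-statement articulation checks are left out.

```python
from decimal import Decimal

ZERO = Decimal("0")


def tie_out(rows, column_totals, row_totals, grand_total):
    """Re-derive every total from the line items and return a list of exceptions.

    rows:          {label: [Decimal per column]} as located by the model
    column_totals: [Decimal per column] as printed on the statement
    row_totals:    {label: Decimal} as printed on the statement
    """
    exceptions = []

    # Footing: each column's line items must sum to its printed total.
    for i, printed in enumerate(column_totals):
        computed = sum((values[i] for values in rows.values()), ZERO)
        if computed != printed:
            exceptions.append(f"foot column {i}: computed {computed}, printed {printed}")

    # Crossfooting: each row's cells must sum to its printed row total.
    for label, values in rows.items():
        computed = sum(values, ZERO)
        if computed != row_totals[label]:
            exceptions.append(
                f"crossfoot {label}: computed {computed}, printed {row_totals[label]}"
            )

    # The grand total must agree whether reached down the columns or across the rows.
    if sum(column_totals, ZERO) != grand_total:
        exceptions.append("column totals do not sum to the grand total")
    if sum(row_totals.values(), ZERO) != grand_total:
        exceptions.append("row totals do not sum to the grand total")

    return exceptions


# Hypothetical two-fund schedule, amounts in whole dollars.
rows = {
    "Taxes": [Decimal("100"), Decimal("40")],
    "Grants": [Decimal("25"), Decimal("60")],
}
column_totals = [Decimal("125"), Decimal("100")]
row_totals = {"Taxes": Decimal("140"), "Grants": Decimal("85")}
grand_total = Decimal("225")

clean = tie_out(rows, column_totals, row_totals, grand_total)

# Inject one wrong figure: the affected column and the affected row both stop tying.
rows_bad = dict(rows, Taxes=[Decimal("101"), Decimal("40")])
broken = tie_out(rows_bad, column_totals, row_totals, grand_total)
```

Using `Decimal` rather than floats keeps the comparisons exact, so a check either ties to the cent or raises an exception.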

The proof is a golden-file test. On one published government audit report it runs 25 checks with 0 exceptions; inject a single wrong figure and the tie-out fails loudly instead of passing quietly. That is the whole point: the output is not “probably correct,” it is verifiably correct, and it stays that way every time the suite runs.
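The shape of such a golden-file test can be sketched as follows. This is an illustration under stated assumptions, not the project's test suite: `run_checks` is a hypothetical stand-in with two simplified checks instead of 25, and the figure names and the JSON layout of the golden file are invented for the example.

```python
import json
from pathlib import Path


def run_checks(figures):
    """Hypothetical stand-in for the tie-out suite: two checks instead of 25."""
    checks = {
        "balance sheet balances":
            figures["total_assets"]
            == figures["total_liabilities"] + figures["net_position"],
        "activity ties to change in net position":
            figures["revenues"] - figures["expenses"]
            == figures["change_in_net_position"],
    }
    return {
        "checks": len(checks),
        "exceptions": sorted(name for name, ok in checks.items() if not ok),
    }


def write_golden(result, golden_path):
    """Run once, review the result by hand, then commit the file with the tests."""
    Path(golden_path).write_text(
        json.dumps(result, indent=2, sort_keys=True), encoding="utf-8"
    )


def assert_matches_golden(result, golden_path):
    """Fail loudly if the current result differs from the committed golden file."""
    golden = json.loads(Path(golden_path).read_text(encoding="utf-8"))
    assert result == golden, f"golden mismatch: expected {golden}, got {result}"
```

The golden file records the reviewed outcome (here, the check count and an empty exception list), so any later change that alters a figure or drops a check makes the comparison fail instead of passing quietly.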

Field note · Autonomous AI

Earning improvements: a 7-tier verification gauntlet

How an autonomous improvement loop stays honest — a sealed holdout and a fresh-context critic.

Any system that edits and re-scores its own code has one obvious failure mode: it convinces itself it improved when it did not. Optimize against a metric the loop can see, and the loop will learn to game the metric long before it learns to do the work.

AgentA answers that with a gauntlet every change must survive before it is kept: parse, unit tests, property tests, mutation testing, a benchmark delta, a sealed holdout the loop never trains or tunes against, and finally a fresh-context critic that judges the diff with no memory of how it was produced. A change that improves the visible metric but regresses on the holdout — or that a clean-slate reviewer can’t justify — is discarded, not promoted.
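The gauntlet's control flow can be sketched as an ordered list of gates where the first failure discards the change. The gate order follows the paragraph above; everything else is an assumption made for illustration, including the `Gate` type, the dictionary keys on a change, and the 0.8 mutation-score threshold. It is not AgentA's code.

```python
from typing import Callable, NamedTuple, Optional, Tuple


class Gate(NamedTuple):
    name: str
    passes: Callable[[dict], bool]


# Seven gates in the order listed above. Keys and thresholds are hypothetical.
GATES = [
    Gate("parse", lambda c: c["parses"]),
    Gate("unit tests", lambda c: c["unit_tests_pass"]),
    Gate("property tests", lambda c: c["property_tests_pass"]),
    Gate("mutation testing", lambda c: c["mutation_score"] >= 0.8),
    Gate("benchmark delta", lambda c: c["benchmark_delta"] > 0),
    Gate("sealed holdout", lambda c: c["holdout_delta"] >= 0),
    Gate("fresh-context critic", lambda c: c["critic_approves"]),
]


def run_gauntlet(change: dict, gates=GATES) -> Tuple[bool, Optional[str]]:
    """Return (kept, failed_gate). Stops at the first gate the change fails."""
    for gate in gates:
        if not gate.passes(change):
            return False, gate.name
    return True, None


# A change that improves the visible benchmark but regresses on the holdout.
gamed = {
    "parses": True, "unit_tests_pass": True, "property_tests_pass": True,
    "mutation_score": 0.9, "benchmark_delta": 0.04,
    "holdout_delta": -0.02, "critic_approves": True,
}
earned = dict(gamed, holdout_delta=0.01)
```

Ordering the cheap gates first means an unparseable diff never reaches the expensive holdout run or the critic.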

The sealed holdout is the load-bearing piece. Gains measured only on data the loop can see are cheap to fake; a gain that survives data the loop has never touched is earned, and one that does not is hallucinated. The autonomous inner research loop is ported from Udit Goenka’s autoresearch (MIT, based on Andrej Karpathy’s work); the meta-improvement architecture and the verification stack are my own.
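One way to keep a holdout sealed, sketched under the assumption that every evaluation case has a stable string ID: assign cases to the holdout by hashing the ID, and hand the loop only the visible set. The function names and the 20% fraction are hypothetical, and this is not a description of how AgentA stores its holdout.

```python
import hashlib


def is_holdout(case_id: str, holdout_fraction: float = 0.2) -> bool:
    """Assign a case to the holdout by hashing its ID, so membership never drifts."""
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < holdout_fraction


def split(case_ids):
    """Return (visible, sealed). Only the visible list is ever passed to the loop."""
    visible = [c for c in case_ids if not is_holdout(c)]
    sealed = [c for c in case_ids if is_holdout(c)]
    return visible, sealed


case_ids = [f"case-{n:04d}" for n in range(1000)]
visible, sealed = split(case_ids)
```

Because membership depends only on the ID, adding new cases never moves an existing case across the boundary, which is what keeps earlier holdout scores comparable with later ones.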

Field note · Shipping SaaS

Building OBBBA Tracker

Notes on shipping a real tax-compliance SaaS and keeping it trustworthy.

OBBBA Tracker is a live B2B SaaS for tipped-industry employers documenting the “no tax on tips & overtime” deductions under the One Big Beautiful Bill Act. It runs on real subscriptions with a free trial — which means it has to be right, not just demo-able.

Most of the engineering lives in the unglamorous parts: automatic Treasury Tipped Occupation Code assignment, FLSA overtime calculations, and W-2 Box 14 exports that line up cleanly with ADP, Gusto, and QuickBooks. Multi-tenant role-based access and an audit trail let an employer show their work, and an analytics dashboard turns the raw records into something a decision-maker can read.
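For a sense of the overtime arithmetic, here is a deliberately simplified sketch of the FLSA weekly rule: hours over 40 in a workweek earn at least one and a half times the regular rate. It assumes the regular rate has already been determined and is a single hourly figure, which is the hard part in practice; the function name and the returned field names are hypothetical and are not OBBBA Tracker's.

```python
from decimal import Decimal

OVERTIME_THRESHOLD = Decimal("40")   # hours per workweek under the FLSA
PREMIUM_MULTIPLIER = Decimal("0.5")  # the extra "half" in time-and-a-half


def weekly_overtime(hours_worked: Decimal, regular_rate: Decimal) -> dict:
    """Split one workweek's pay into straight time and the overtime premium."""
    overtime_hours = max(hours_worked - OVERTIME_THRESHOLD, Decimal("0"))
    straight_time_pay = hours_worked * regular_rate
    overtime_premium = overtime_hours * regular_rate * PREMIUM_MULTIPLIER
    return {
        "overtime_hours": overtime_hours,
        "straight_time_pay": straight_time_pay,
        "overtime_premium": overtime_premium,
        "total_pay": straight_time_pay + overtime_premium,
    }


# 45 hours at $20.00: 40 x 20 + 5 x 30 = 950.
week = weekly_overtime(Decimal("45"), Decimal("20.00"))
```

Keeping the premium as its own field, rather than folding it into total pay, is the kind of choice that makes a later export or audit query straightforward.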

Shipping a compliance product while the underlying rules are still settling means the data model has to absorb change without ever losing the audit trail — every number a customer relies on needs to trace back to its source. That discipline, more than any single feature, is what makes a compliance tool worth paying for.
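A minimal sketch of a data model that absorbs change without losing the trail, assuming an append-only log in which a correction is a new entry that points at the one it supersedes. The class and field names are hypothetical and are not OBBBA Tracker's schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)  # entries are immutable once recorded
class Entry:
    entry_id: int
    employee_id: str
    field: str
    value: str
    source: str                      # where the number came from, e.g. an import file and row
    recorded_at: datetime
    supersedes: Optional[int] = None  # entry_id of the entry this one corrects


class AuditLog:
    """Append-only: corrections add entries, nothing is updated or deleted."""

    def __init__(self) -> None:
        self._entries: List[Entry] = []

    def record(self, employee_id, field, value, source, supersedes=None) -> Entry:
        entry = Entry(len(self._entries) + 1, employee_id, field, value,
                      source, datetime.now(timezone.utc), supersedes)
        self._entries.append(entry)
        return entry

    def history(self, employee_id, field) -> List[Entry]:
        """Every value the field has held, oldest first."""
        return [e for e in self._entries
                if e.employee_id == employee_id and e.field == field]

    def current(self, employee_id, field) -> Optional[Entry]:
        """The latest entry for the field, or None if nothing was recorded."""
        entries = self.history(employee_id, field)
        return entries[-1] if entries else None


log = AuditLog()
first = log.record("emp-1", "qualified_tips", "1200.00", "payroll_import.csv row 14")
fixed = log.record("emp-1", "qualified_tips", "1250.00", "manual correction",
                   supersedes=first.entry_id)
```

A rule change then becomes a new batch of superseding entries, and the value a customer relied on last quarter is still there with its source attached.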

Field note · Benchmarks

The best model was zero bytes

I built an arena to find the best small Connect-4 net under a hard compute budget. None of the 14 learned-net entries won.

NeuroFour started as a search for the best small Connect-4 net: rank agents by strength per byte under a hard 5M-FLOP-per-move budget, scored against an exact solver. Connect 4 is a solved game, so perfect play is a known, fixed target. Nineteen agents made the leaderboard under that budget, ranging from a 0-byte search agent to nets of up to ~24 KB. The exact solver used as the scoring target appears alongside them and is not counted among the nineteen.

The honest answer wasn’t the one I was looking for. None of the 14 learned-net entries beat zero-byte search on optimality, ladder Elo, or NeuroFour Score, the three axes the headline above rests on. Those 14 entries come from six distinct weight files; several are the same trained net re-quantised or re-wrapped under a different search policy. But the margins are not equal, and the headline one is thin.

On optimality, the board’s gate metric, Zero (0 bytes, 0 parameters, pure bitboard alpha-beta with a hand-derived heuristic leaf) scores 0.960 against the best learned nets’ 0.957: 288 correct moves out of 300 sealed positions against 287. That is a one-move lead on a single seed, and I won’t call it decisive. The gap opens on the other two. On NeuroFour Score, the board’s rank key, it’s 96.45 against 74.25, though that key divides strength by a size penalty, so 0 bytes is doing much of that work. On ladder Elo, decided by 380 games of head-to-head play and immune to any size term, it’s 754 against 631.

Those comparators are four entries drawn from just two weight files, and one of those files, a single 4,837-byte net, supplies three of the four: two search wrappers that tie for the best learned optimality (0.9567), and a third that holds the best learned Elo (631). Neither file leads on all three axes.

But Zero loses ground elsewhere, and it matters. Three learned nets blunder less than it does (soundness 0.9967 to Zero’s 0.990, leaving Zero at rank 4 of 19), it ranks 15th of 19 on latency, and seventeen of the nineteen agents use less compute than it does, including every learned net but one. Zero spends 4,999,028 of the 5,000,000-FLOP-per-move budget (99.98% of it, rank 18 of 19), while a learned net reaches 0.9467 (284 of 300, four moves behind Zero) on 67,228 FLOPs, about 1.3% of that.

Zero ties for first on the byte axis with four other under-budget agents, all at 0 bytes, and buys the rest of its lead by spending almost the entire compute axis. Seven learned-net entries (spanning three weight files) sit on the twelve-member under-budget Pareto frontier beside it, undominated, because that FLOPs cost is real and the frontier counts it.
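To make "undominated" concrete, here is a small sketch of a Pareto-frontier filter. It assumes three axes (bytes and FLOPs, where lower is better, and optimality, where higher is better); the axes NeuroFour's frontier actually uses are not spelled out here, so treat that choice as an assumption. Zero's figures and the small net's FLOPs and optimality come from the numbers above. The small net's byte size and the whole third agent are made-up placeholders.

```python
def dominates(a: dict, b: dict) -> bool:
    """a dominates b if it is no worse on every axis and strictly better on one."""
    no_worse = (a["bytes"] <= b["bytes"]
                and a["flops"] <= b["flops"]
                and a["optimality"] >= b["optimality"])
    strictly_better = (a["bytes"] < b["bytes"]
                       or a["flops"] < b["flops"]
                       or a["optimality"] > b["optimality"])
    return no_worse and strictly_better


def pareto_frontier(agents: list) -> list:
    """Keep every agent that no other agent dominates."""
    return [a for a in agents if not any(dominates(b, a) for b in agents)]


agents = [
    # Figures from the note above.
    {"name": "zero", "bytes": 0, "flops": 4_999_028, "optimality": 0.960},
    # FLOPs and optimality from the note; the byte size is a placeholder.
    {"name": "small_net", "bytes": 5_000, "flops": 67_228, "optimality": 0.9467},
    # Entirely hypothetical agent, included to show a dominated entry.
    {"name": "dominated_net", "bytes": 9_000, "flops": 100_000, "optimality": 0.90},
]

frontier = [a["name"] for a in pareto_frontier(agents)]
```

Zero cannot knock the small net off the frontier because it spends far more FLOPs, and the small net cannot knock Zero off because it is larger and slightly less optimal. The third agent loses to the small net on every axis and drops out.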

A benchmark that can only confirm your hypothesis isn’t measuring anything — it’s just an elaborate way of restating what you already believed. The value of NeuroFour is that it was built capable of returning an answer I didn’t want, and then I published that answer instead of quietly reframing the question. It’s the same discipline as the extractor’s tie-out checks and AgentA’s sealed holdout: verifiable, not plausible, even when verifiable is less flattering.

Want the long version?

Happy to walk through any of these in depth — the extractor’s tie-out logic, the verification gauntlet, or the OBBBA data model.
