Verification Design

How you architect quality assurance into AI-assisted work — from gut-feel spot-checks to designed falsifiability with automated pipelines — rather than reviewing outputs after the fact.

The five levels

L1

Absent

Accepts AI output at face value or applies only gut feel. No structured review process. This is a natural starting point — when AI is new, you don't yet know what kinds of errors to look for or how often they occur.

L2

Personal

Reads and edits AI output before using it. Spot-checks facts. Uses domain knowledge to catch obvious errors. Treats output as a draft (32 of 37 participants do this). But verification is manual and inconsistent: it depends on attention, not design.

L3

Systematic

Has designed verification into their workflow: separate review sessions, different model for checking, structured evaluation criteria. Domain expertise provides natural verification mechanisms (finance: GL ties or doesn't; code: tests pass or don't). Catches errors through process, not just attention.
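The finance example can be sketched as a deterministic check. This is a minimal illustration under assumed names (`JournalEntry`, `gl_ties` are hypothetical), not code from the source:

```python
from dataclasses import dataclass

# Hypothetical illustration of a deterministic domain check.
@dataclass
class JournalEntry:
    account: str
    debit: float = 0.0
    credit: float = 0.0

def gl_ties(entries, tolerance=0.01):
    """The ledger either ties or it doesn't: total debits must equal
    total credits. The check fails deterministically, with no reliance
    on a reviewer happening to notice the imbalance."""
    total_debits = sum(e.debit for e in entries)
    total_credits = sum(e.credit for e in entries)
    return abs(total_debits - total_credits) <= tolerance

balanced = [JournalEntry("cash", debit=100.0),
            JournalEntry("revenue", credit=100.0)]
unbalanced = [JournalEntry("cash", debit=100.0),
              JournalEntry("revenue", credit=90.0)]
```

Wired into a workflow as a required checkpoint, a check like this catches errors through process rather than attention.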

L4

Expert Exemplar

Designs verification before starting work. Automated pipelines (Playwright for accessibility, test suites for code, cross-model adversarial checking). Objective functions that measure output quality. Verification infrastructure accelerates work rather than slowing it. Others adopt their verification patterns.
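A sketch of what an objective function over output quality might look like in its simplest form. The scoring scheme and names here are illustrative assumptions, not from the source:

```python
def score_output(text, required_sections, max_words=500):
    """Toy objective function for a written deliverable: fraction of
    required sections present, discounted if the draft exceeds the
    word budget. Defined before any output is generated."""
    lowered = text.lower()
    present = sum(1 for s in required_sections if s.lower() in lowered)
    coverage = present / len(required_sections)
    length_factor = 1.0 if len(text.split()) <= max_words else 0.5
    return coverage * length_factor

draft = "Summary: ... Risks: ... Recommendation: ..."
score = score_output(draft, ["summary", "risks", "recommendation"])
```

The point is not the particular formula but that quality becomes a number you can compute before accepting any output, and compare across model runs.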

L5

Compounding

Builds verification systems that others use — shared test suites, quality gates in deployment pipelines, automated standards enforcement that you designed and maintain. Your verification patterns become the way the team catches errors, not just the way you catch yours. Designs feedback loops where verification results improve future prompts and constraints. Tracks quality metrics over time and surfaces regressions. Incorporates security and prompt injection awareness into verification architecture. Others produce higher-quality output because your verification infrastructure exists, without needing to design their own.
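One way to track quality metrics over time and surface regressions is a rolling-baseline gate. The window and threshold below are arbitrary illustrative choices, not values from the source:

```python
def detect_regression(history, latest, window=5, drop_threshold=0.05):
    """Flag a run whose score falls more than drop_threshold below the
    rolling mean of the last `window` runs. Running on every
    deployment, a gate like this catches everyone's regressions,
    not just the author's."""
    if len(history) < window:
        return False  # not enough data for a baseline yet
    baseline = sum(history[-window:]) / window
    return latest < baseline - drop_threshold

scores = [0.91, 0.93, 0.92, 0.94, 0.92]
```

Here `detect_regression(scores, 0.80)` flags the drop, while `detect_regression(scores, 0.91)` treats it as normal variation.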

Key quotes

L1
"Really, I think I kind of blindly trust it... I don't think I am very good here. Like, I have no idea how to review the patterns, how to edit them."
General Manager, early in adoption
L2
"I always try to make sure things go through two or three reviews... sometimes I'll have an adversarial agent review things too."
General Manager, cross-model review but still manual and gut-level
L3
"Whenever I build something, I ask it to either create evals for it or I ask it to generate a script that's going to be able to validate it... In every project I am working on, whenever I set it up, I would set up an eval framework just because I know I will need it."
Engineer
L4
"If I was going to hire an intern who had no domain knowledge, and I had to set something up, and I'm going to come back later and see if they were correct... how would I set up the second intern so that they don't need to be a domain expert to definitely validate that the first one was correct?"
Technical advisor

Transitions — what distinguishes each level

L1 → L2

The shift is from *unexamined trust* to *active review*. At L2, you treat AI output as a draft and apply your domain knowledge to catch errors. The gap is between "I use what it gives me" and "I read it critically before using it."

L2 → L3

The shift is from *attention-based checking* to *process-based catching*. At L3, your verification happens through designed steps (separate review sessions, cross-model checks, eval frameworks) rather than relying on whether you happen to notice an error. The gap is between "I check when I remember to" and "my workflow has built-in checkpoints."

L3 → L4

The shift is from *verification as a step in your process* to *verification as the starting point of your process*. At L4, you design the objective function before you build. Verification accelerates work rather than slowing it — automated pipelines catch errors faster than manual review. The gap is between "I verify what I built" and "I build the verification first."
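"Build the verification first" is the test-first pattern. A minimal sketch using a hypothetical `slugify` task (the function and test cases are invented for illustration):

```python
import re

# Written first: the spec exists before any implementation does.
def test_slugify():
    assert slugify("Hello, World!") == "hello-world"
    assert slugify("  spaces  ") == "spaces"

# Built second (by you or by an AI) to satisfy the spec above.
def slugify(text):
    text = text.strip().lower()
    text = re.sub(r"[^a-z0-9]+", "-", text)
    return text.strip("-")

test_slugify()  # the gate: passes only when the build meets the spec
```

Because the tests exist before the implementation, the builder (human or AI) has a fixed target, and verification runs automatically instead of depending on post-hoc review.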

L4 → L5

The shift is from *personal verification excellence* to *shared verification infrastructure*. At L5, your test suites, quality gates, and standards enforcement are used by the team. The gap is between "my verification catches my errors" and "my verification infrastructure catches everyone's errors."