Top-line on the synthetic baseline
As of 2026-05-06, the AI pipeline has been tested against 8 deliberately constructed Division 09 (Finishes) submittal-plus-spec pairs covering Acoustical Ceilings, Interior Painting, Resilient Tile Flooring, Ceramic Tiling, Tile Carpeting, Gypsum Board, Acoustical Wall Treatment, and Resilient Base. The verdict mix spans the full spectrum: 4 REJECTED, 3 REVISE_AND_RESUBMIT, 1 APPROVED.
- Verdict accuracy: 8 of 8 (100 percent). The AI's top-line verdict matched the expected outcome on every fixture.
- Deviation recall: 21 of 21 required deviations caught (100 percent). Every deviation we baked into a submittal was surfaced by the AI, classified into the correct category, and assigned the correct severity.
- Deviation precision: 21 of 22 (95 percent). One false positive across the eight reports: on the gypsum-board case, the AI fired both a Cat 5 thickness deviation AND a separate Cat 2 fire-rating deviation for what is essentially the same root issue. Both rows are technically defensible (it's a real spec mismatch either way), but the combined output is verbose. Not a hallucination.
- Severity calibration: 21 of 21 (100 percent). Every matched deviation carried the expected severity (Blocker, Fix-and-Resubmit, or Note-Only).
- AI restraint on a clean submittal: 0 false positives on the APPROVED-case fixture (acoustical wall treatment). Even when given a submittal that fully meets or exceeds every spec requirement and contains language like "no substitutions are proposed", the AI did not fabricate findings.
What the synthetic fixtures test
Each fixture pair is a hand-authored spec section and matching subcontractor submittal, designed to exercise specific behaviors of the 6-category Division 09 deviation taxonomy. Each pair also ships with a ground-truth file that lists the deviations we deliberately baked in, so an automated scoring tool can compare the AI's output row-by-row.
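To make the ground-truth idea concrete, here is a hypothetical sketch of one entry. The field names and layout are illustrative only, not the actual schema; the real files may differ:

```yaml
# One expected deviation for a hypothetical gypsum-board fixture.
expected_verdict: REJECTED
expected_deviations:
  - categories: [5]        # accepted category (alternates allowed for ambiguous cases)
    severity: Blocker
    keywords:              # signals the scorer looks for in the AI's output
      - "5/8 inch"
      - "Type X"
      - "1-hour fire-rated"
```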
The 8 pairs cover, between them:
- Cat 1 Manufacturer Substitution at default Blocker (no "or equal" letter) AND de-escalated to Note-Only (with architect's "or equal" approval letter attached).
- Cat 2 Performance Specification Gap on numeric thresholds: Noise Reduction Coefficient, Critical Radiant Flux fire rating, recycled-content percentage.
- Cat 3 Missing Certification on the "available upon request" anti-pattern across five different cert types: GREENGUARD Gold, FloorScore, slip-resistance test, UL fire-resistance design listing, ASTM E119 fire-endurance test report.
- Cat 4 Aesthetic Deviation on clear textual color-name mismatch (Charcoal Heather CH-101 vs Storm Gray SG-204) AND on ambiguous spec language ("manufacturer's standard pure white" where the submittal asserts its product is the standard).
- Cat 5 Detail or Installation Mismatch at default Fix-and-Resubmit (height mismatch in non-life-safety spaces) AND escalated to Blocker (1/2 inch vs 5/8 inch Type X gypsum panels in 1-hour fire-rated walls; Critical Radiant Flux Class 2 vs Class 1 in life-safety corridors).
- Cat 6 Submittal Package Incompleteness on missing samples, missing warranty, missing maintenance instructions, missing LEED Materials and Resources Credit calculation, missing UL listing card.
- Verdict mapping: zero deviations leads to APPROVED (acoustical wall treatment); only Fix-and-Resubmit leads to REVISE_AND_RESUBMIT (interior painting, ceramic tiling, resilient base); one or more Blockers leads to REJECTED (acoustical ceilings, resilient tile flooring, tile carpeting, gypsum board).
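The verdict mapping above reduces to a small function. This is a sketch, not the production code; one assumption is that Note-Only findings on their own do not change the verdict, which the mapping above implies but does not state outright:

```python
def map_verdict(severities: list[str]) -> str:
    """Map a list of deviation severities to a top-line verdict.

    Any Blocker -> REJECTED; otherwise any Fix-and-Resubmit ->
    REVISE_AND_RESUBMIT; otherwise APPROVED. Note-Only findings are
    assumed not to affect the verdict.
    """
    if "Blocker" in severities:
        return "REJECTED"
    if "Fix-and-Resubmit" in severities:
        return "REVISE_AND_RESUBMIT"
    return "APPROVED"
```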
Methodology
For each fixture pair, we run the AI pipeline (model: claude-opus-4-7 with adaptive thinking, output format constrained to a JSON schema) and then automatically score the output against a YAML ground-truth file. The scoring tool checks:
- Verdict match. Did the AI's top-line verdict equal the expected verdict?
- Per-deviation HIT, PARTIAL, or MISS. Each expected deviation has a list of keyword signals (manufacturer names, cert names, numeric values, spec quotes). The AI's output matches an expected deviation if it (a) classifies the finding into one of the accepted categories AND (b) at least 50 percent of keyword signals appear in the AI's spec_quote, submittal_quote, or suggested_action fields.
- Severity match. For each matched deviation, did the AI's severity equal the expected severity?
- False positives. Did the AI fire any deviations that don't match any expected entry? These can be either real catches we didn't list (the truth file gets updated) or hallucinations (the prompt gets tuned).
The ground-truth schema permits multi-category alternates for genuine taxonomy ambiguity. For example, a missing LEED MR Credit calculation is defensibly either Cat 6 Submittal Package Incompleteness or Cat 3 Missing Certification or Compliance Documentation; both are accepted.
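The matching rule above, including the 50 percent keyword threshold and the multi-category alternates, can be sketched as follows. Field names are illustrative, not the real schema:

```python
def matches_expected(ai_row: dict, expected: dict) -> bool:
    """Return True if an AI deviation row matches an expected entry.

    Mirrors the two scoring conditions described above:
    (a) the AI's category is one of the accepted alternates, and
    (b) at least 50 percent of the expected keyword signals appear
        in the AI's quoted or suggested-action text.
    """
    if ai_row["category"] not in expected["categories"]:
        return False
    # Pool the three fields the scorer searches for keyword signals.
    haystack = " ".join(
        ai_row.get(field, "")
        for field in ("spec_quote", "submittal_quote", "suggested_action")
    ).lower()
    hits = sum(1 for kw in expected["keywords"] if kw.lower() in haystack)
    return hits >= 0.5 * len(expected["keywords"])
```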
Limitations stated up front
Limitations to know about:
- Division 09 only as of W1. Division 26 (Electrical) and Division 23 (Heating Ventilation Air Conditioning, HVAC) ship in 2026 Q3. Other CSI divisions are not supported.
- Native Portable Document Format (PDF) only. Image-only PDFs (paper that was scanned without text recognition) will return LOW confidence and may need Optical Character Recognition (OCR) pre-processing.
- AI-assisted, not AI-decided. The Project Manager (PM) is responsible for final approval and sign-off on every submittal. The AI surfaces deviations; the PM decides. Liability for missed deviations remains with the parties responsible under the underlying construction contract.
- One known false-positive pattern. On submittals where two different taxonomy categories defensibly describe the same root issue (e.g., wrong-thickness Type X panel as both a Cat 5 detail mismatch AND a Cat 2 performance gap), the AI may fire both rows. The output is verbose but technically correct on each row. We are waiting to see whether real submittals reveal the same pattern before tuning.
- One known judgment-call pattern. When the spec text contains hard-rejection language ("is NOT an acceptable substitution"), the AI escalates Cat 5 to Blocker even when the spaces are not life-safety. This is correct behavior given the spec text, but it means the un-escalated default applies only when the spec language is neutral.
Reproducibility
The synthetic fixture library, ground-truth files, scoring code, and per-fixture scorecards are all part of the Deviation Check project. We can share the fixture set on request to construction Project Managers, Architects of Record, or General Contractor (GC) procurement teams who want to validate the methodology before a paid engagement. Email hello@deviationcheck.com for the methodology bundle.
What we will publish next
Once we have 30 reviewed real Division 09 submittals (target end of W4), we will publish a real-world accuracy update: precision, recall, severity calibration, and false-positive analysis on the actual customer data. We expect those numbers to be lower than the synthetic baseline, and that is fine: real-world data has more edge cases than any synthetic fixture set.
We will also publish a fixture-vs-real-world delta analysis: which categories degrade most when moving from synthetic to real, and what we change to mitigate.
Open about what we do not know
The first paid customer's first submittal will surface things we have not tested. We will document those gaps publicly and update this page. Construction is a domain where transparent failure modes matter more than headline numbers; that is the bar we hold ourselves to.