Guide · Documents

OCR with human-in-the-loop: shipping 99% accuracy in production

Published May 2026
Topic: Documents · OCR
Reading time: 9 min
For UK SME ops leads
On this page
  1. How to decide in 30 seconds
  2. Why 99% accuracy still fails
  3. The architecture that actually ships
  4. Where vision models help and where they don't
  5. Confidence thresholds: the gate between auto and manual
  6. A reference stack for SME-scale extraction
  7. What changed in 2025–2026
  8. Good / Bad / Ugly
  9. FAQ

A finance team at a manufacturing client was processing 1,200 invoices a month by hand. Two and a half full-time employees, a shared inbox, and a quiet sense of dread every quarter-end. The invoices arrived in 14 different formats from 80+ suppliers. Some were scanned PDFs, some were photos taken on a phone, some were Excel files renamed to PDF, a few were genuine machine-printed PDFs from established suppliers.

We were sceptical of the project before we took it on. The literature is full of OCR success stories that turn out to be 95% accurate, which at this volume means 60 wrong invoices a month: worse than the manual process it replaced. The honest pattern that survives in production isn't "AI replaces the humans." It's "AI handles the 80% the humans were bored doing, and the humans handle the 20% that actually requires judgement."

This post is the operating manual for that pattern. Why 99% extraction accuracy still fails. The confidence-threshold gate that decides what goes to a human. The queue UI nobody talks about. And what changed in 2025–2026 that makes the math different from what it was in 2023.

How to decide in 30 seconds

Is the document set fully standardised (one format, machine-printed)?
   YES → Tesseract or a single-vendor OCR. Skip the LLM.
   NO  → continue.

Are you processing >100 documents/month?
   YES → vision-model OCR with a human queue. Continue.
   NO  → keep doing it manually. The infrastructure isn't worth it.

Is any field high-stakes (financial, legal, identity)?
   YES → confidence-threshold gate is mandatory. No auto-approval below 0.95.
   NO  → relaxed thresholds OK; lower the human-review burden.

Most SMEs sit at "100–5,000 docs/month, varied formats, financial fields." That's the sweet spot for hybrid OCR — the path the rest of this post details.

Why 99% accuracy still fails

Vision-language models can hit 99%+ field-level accuracy on invoice extraction with the right prompt. That number sounds good in a sales deck and is misleading in production for three reasons.

First, field-level accuracy isn't document-level accuracy. An invoice has 8–15 fields. At 99% per field, the probability that a 12-field invoice is fully correct is 89%. Process 1,200 invoices a month at that rate and you ship 132 incorrect invoices into your accounting system every month. That's worse than the manual process, because the manual process catches its own mistakes during a second-eye review.

Second, the 1% errors aren't randomly distributed. They cluster on specific document subtypes — handwritten amendments, scanned-twice copies, suppliers who put the VAT number in an unusual location. A confidence threshold catches these because the model itself flags low confidence on the messy cases. A flat 99% accuracy claim hides the structure of the failure mode.

Third, silent failures compound. An invoice with the wrong supplier name might get paid to the wrong account before anyone notices. The cost of a single silent error in a financial pipeline is often higher than the cost of running a human queue across the whole month. The math only works if the system is built to flag uncertainty rather than hide it.
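
The compounding behind the first point is easy to check. A minimal sketch, using the same illustrative figures as above:

fields_per_doc = 12
field_accuracy = 0.99
docs_per_month = 1_200

# Per-field accuracy compounds across every field on the document.
doc_accuracy = field_accuracy ** fields_per_doc       # ≈ 0.886, the ~89% quoted above
wrong_docs = docs_per_month * (1 - doc_accuracy)      # ≈ 136/month (≈132 using the rounded 89%)
print(f"{doc_accuracy:.1%} of documents fully correct, ~{wrong_docs:.0f} wrong per month")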

The architecture that actually ships

The pattern that survives production for SME finance and document workflows:

  1. Ingestion from email, shared drive, or upload UI. Files normalised to PDF, page-split.
  2. Pre-classification by a small vision model — one cheap call to identify document type (invoice, lease, statement, dispatch note) and route to the right extraction prompt.
  3. Extraction with a per-doc-type prompt against a vision-language model. The prompt enforces a strict JSON schema for the output: every field must come back, with a confidence score and a source bounding box (the sketch after this list shows the shape).
  4. Validation against business rules in code. VAT number format. Total = sum of lines. Dates within a sensible range. Supplier known to the system. Each rule failure lowers the document's overall confidence.
  5. Confidence gate. Documents with all fields above the threshold (typically 0.92 for financial workflows) auto-approve and write to the system of record. Anything below routes to the human queue with the lowest-confidence field highlighted.
  6. Human queue UI that shows the original document image, the extracted draft, and one-click approve / edit / reject. Edits feed back into a fine-tuning or prompt-improvement loop.
  7. Audit log that records every document with its extraction, confidence, route (auto vs human), reviewer, and final values.
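
To make step 3 concrete, the shape the extractor returns looks roughly like this. Field names and values are illustrative, and the layout matches what the confidence gate later in this post expects:

EXAMPLE_EXTRACTION = {
    "doc_type": "invoice",
    "values": {
        "invoice_total": "1842.50",
        "supplier_name": "Harrow Fabrications Ltd",   # hypothetical supplier
        "vat_number": "GB123456789",
    },
    "confidences": {          # per field, 0.0–1.0, reported by the model
        "invoice_total": 0.97,
        "supplier_name": 0.99,
        "vat_number": 0.88,
    },
    "bboxes": {               # x0, y0, x1, y1 in page pixels, used for queue-UI highlighting
        "invoice_total": [412, 1088, 540, 1112],
        "supplier_name": [58, 92, 310, 120],
        "vat_number": [58, 1560, 230, 1584],
    },
}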

The point of the architecture isn't to maximise auto-approval rate. It's to make the system honest: every document is either auto-approved with a confident audit trail or reviewed by a human with full context. There is no "we think this is probably right" middle ground.

Where vision models help and where they don't

| Approach | Best for | Throughput | Per-doc cost (UK SME volumes) | Failure mode |
| --- | --- | --- | --- | --- |
| Tesseract | Standardised forms, machine-printed | 1,000s/min | < £0.01 | Fails silently on layouts it doesn't understand |
| Vendor OCR (Google Document AI / Azure Form Recognizer) | High-volume, well-defined doc types | 100s/min | £0.05–0.15 | Locked into vendor schema; per-doc cost dominant at scale |
| Vision-language model (GPT-4o / Claude Sonnet) | Mixed formats, edge cases, handwritten | 10–30/min | £0.02–0.08 | Hallucinates fields if prompt is loose |
| Vision model + human queue (this post) | SME mixed corpora, financial accuracy | 50–500/day | £0.05 + queue cost | Queue UI is the bottleneck; design it well |

The right architecture for most SMEs combines these. Use Tesseract for the standardised supplier templates you already know, vision models for the messy long tail, and a human queue as the gate on financial fields. Pure single-approach pipelines lose money at the edges.
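
A minimal sketch of that routing decision. KNOWN_TEMPLATES and the document attributes here are illustrative, not a fixed API:

KNOWN_TEMPLATES = {"supplier_0042", "supplier_0107"}   # suppliers with a stable, machine-printed layout

def choose_engine(doc):
    # Cheap, deterministic path for layouts we already trust; everything else
    # goes to the vision model and then through the confidence gate.
    if doc["supplier_id"] in KNOWN_TEMPLATES and doc["machine_printed"]:
        return "tesseract"
    return "vision_model"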

A second pattern worth naming: per-supplier prompt warm-up. The first 50 invoices from a new supplier always extract worse than the next 500 because the model hasn't seen the supplier's specific layout, abbreviation conventions, or VAT-line ordering. We keep a per-supplier lookup that conditions the extraction prompt with two or three previously-verified extractions from that same supplier, plus a layout note ("VAT registration appears in the footer-left, line items use abbreviated SKU codes"). Field-level accuracy on the first 50 invoices climbs from ~92% to ~98% with this priming. It's the kind of detail no benchmark measures, and the kind that decides whether a new supplier integration takes a week or a month.
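
A sketch of the priming step. The store interface and helper names here are hypothetical; the point is what gets prepended to the prompt:

def build_extraction_prompt(base_prompt, supplier_id, store):
    """Condition the per-doc-type prompt with context from this supplier's history."""
    parts = [base_prompt]
    layout_note = store.layout_note(supplier_id)          # e.g. "VAT reg in footer-left, SKUs abbreviated"
    if layout_note:
        parts.append(f"Supplier layout notes: {layout_note}")
    for example in store.verified_extractions(supplier_id, limit=3):   # human-approved past extractions
        parts.append(f"Previously verified extraction from this supplier:\n{example}")
    return "\n\n".join(parts)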

Confidence thresholds: the gate between auto and manual

The single most important parameter in the entire pipeline is the confidence threshold. Set too high and the human queue is overwhelmed. Set too low and silent errors leak into production.

The pattern we use in finance workflows:

THRESHOLDS = {
    "invoice_total":       0.95,   # high-stakes; aggressive gate
    "supplier_name":       0.95,
    "vat_number":          0.92,
    "invoice_date":        0.85,
    "line_items":          0.80,   # individual lines lower
    "po_reference":        0.75,
    "delivery_address":    0.70,
}

def route(extraction):
    weak_fields = [
        f for f, conf in extraction["confidences"].items()
        if conf < THRESHOLDS.get(f, 0.85)
    ]
    if weak_fields:
        return "human_queue", weak_fields
    return "auto_approve", []

Two field-level rules matter more than the threshold values themselves. First, thresholds are per field, not per document: a document with a 0.99 supplier name and a 0.79 line item should still go to a human, because financial accuracy depends on the line items. Second, surface the weak fields to the reviewer instead of making them re-check everything: a routed document should take 10–20 seconds for a quick edit, not 2 minutes for a full re-check.

The right threshold values come from measurement, not intuition. Run the model on 200 documents you've already verified manually. Plot field-level confidence against actual error rate. Pick the threshold that catches >95% of real errors. Re-tune quarterly.
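
A minimal calibration sketch, assuming you've collected (confidence, was_correct) pairs per field from the 200 verified documents:

def pick_threshold(records, target_recall=0.95, default=0.85, ceiling=0.99):
    """Lowest threshold that routes at least target_recall of real errors to a human.
    records: list of (confidence, was_correct) tuples for one field."""
    error_confs = sorted(conf for conf, ok in records if not ok)
    if not error_confs:
        return default                    # no observed errors on this field; keep a sane default gate
    # Anything below the threshold goes to the queue, so sit just above the
    # confidence that covers the target_recall fraction of observed errors.
    idx = min(int(len(error_confs) * target_recall), len(error_confs) - 1)
    return min(error_confs[idx] + 0.01, ceiling)

# e.g. THRESHOLDS["invoice_total"] = pick_threshold(verified_runs["invoice_total"])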

A reference stack for SME-scale extraction

  • Pre-classifier: GPT-4o-mini or Claude Haiku. One cheap call to classify doc type. ~£0.001 per doc.
  • Extractor: GPT-4o or Claude Sonnet. Per-doc-type prompt with a strict JSON schema that includes field confidence and bounding box. The OpenAI vision documentation and Claude vision capabilities cover the API patterns; the prompt engineering is where the real work is.
  • Validation rules: plain code. Format checks (VAT, date), arithmetic checks (totals match), reference checks (supplier in CRM). Every check writes a structured error code, not a free-text message (sketched just after this list).
  • Storage: Postgres with a raw extraction table (full JSON), a normalised business table (per-field, audited), and the original PDF in S3 or equivalent. Documents and their extractions are linked but separately mutable so re-extractions don't lose the historical record.
  • Queue UI: the part that decides whether the system survives. Shows the original document image with the weak field highlighted, the extracted value editable in-place, and approve / edit / reject in one click. We typically build this as a small Next.js or htmx app with WebSocket updates.
  • Search and retrieval: the extracted structured data feeds the database; the original document goes into the document RAG layer for the cases where staff need to "find the invoice that mentioned X." Both layers use the same metadata (supplier, date, doc type) so cross-querying works.
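
To show what "structured error code, not free-text" means in practice, a minimal sketch. The rule set and field names are illustrative, and the VAT and totals handling is simplified:

import re
from decimal import Decimal

def validate(values):
    """Business-rule checks on extracted values; returns machine-readable error codes."""
    errors = []
    vat = values.get("vat_number", "").replace(" ", "")
    if not re.fullmatch(r"GB\d{9}(\d{3})?", vat):           # standard 9- or 12-digit GB numbers only
        errors.append("VAT_FORMAT_INVALID")
    line_sum = sum(Decimal(item["net"]) for item in values.get("line_items", []))
    if abs(line_sum - Decimal(values.get("net_total", "0"))) > Decimal("0.01"):
        errors.append("TOTALS_MISMATCH")                     # total = sum of lines, to the penny
    if values.get("supplier_id") is None:
        errors.append("SUPPLIER_UNKNOWN")                    # supplier not matched in the CRM
    return errors   # each code feeds the confidence gate and the audit log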

We've shipped this stack for our Invoice OCR Processing client at 99.2% extraction accuracy with ~12% queue rate, and a variant for a financial services firm using voice agents and document analysis where the documents come in via voice-agent-captured uploads.

What changed in 2025–2026

Vision models reached parity with vendor OCR on accuracy and beat them on flexibility. GPT-4o and Claude Sonnet now extract from most invoice and form formats with field-level accuracy in the 98–99.5% range, comparable to Google Document AI on benchmark corpora and significantly more flexible because you write your own per-doc-type prompts. The vendor OCR services still win on extreme volume and hardened compliance certifications, but the accuracy gap has closed.

Structured-output APIs eliminated the JSON-parsing tax. OpenAI's structured outputs and Anthropic's tool-use mode now guarantee schema-conforming JSON. The brittle "parse the model's text and pray" middleware code is obsolete. This single change cuts ~30% of the engineering effort that hybrid OCR pipelines used to require.
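
As an illustration of what that looks like with the OpenAI Python SDK — check the current SDK docs before copying, since the exact parameter shape has shifted across versions, and the schema here is heavily trimmed:

import base64
from openai import OpenAI

client = OpenAI()

invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_total": {"type": "string"},
        "supplier_name": {"type": "string"},
    },
    "required": ["invoice_total", "supplier_name"],
    "additionalProperties": False,         # required by strict mode
}

with open("invoice_page1.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract the invoice fields defined by the schema."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "invoice_fields", "schema": invoice_schema, "strict": True},
    },
)
print(response.choices[0].message.content)   # schema-conforming JSON string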

Per-field bounding-box extraction got cheap. Both major vision models can now return source coordinates for every extracted field. The queue UI experience went from "review the whole document" to "review this specific highlighted region" — and reviewer throughput roughly doubled in our measurements.

The counterpoint worth taking seriously: not everyone needs a hybrid pipeline. Azure AI Document Intelligence (the rebranded Form Recognizer) makes the case that for genuinely high-volume, narrow-format extraction, dedicated form models still win on cost and speed. If you're processing 50,000 of the same invoice template per month, the LLM path is overengineered. Most SMEs aren't in that situation.

Good / Bad / Ugly

Good. Per-field confidence thresholds. A queue UI that highlights the weak field. Audit logs that record the full provenance of every extraction. Validation rules that catch arithmetic and format errors before the human queue. Re-tuned thresholds quarterly based on real error data.

Bad. Single confidence threshold across all fields. Auto-approving without rule validation. Reviewers seeing the document but not the extracted draft. No audit log of corrections. No feedback loop from corrections back to prompt or model.

Ugly. Silent extraction errors that ship straight into the financial system. Queue UIs that take 2 minutes per document to review (the reviewer becomes the bottleneck). Hallucinated supplier names because the prompt didn't constrain to known suppliers. Auto-approval on documents that the model itself flagged with low confidence because someone "set the threshold lower to clear the backlog."

The 1,200-invoice-a-month manufacturing client now processes them at 88% auto-approval, 12% human queue, with one part-time reviewer instead of two and a half full-time staff. The reviewer doesn't complain about the work because every routed document arrives with the field that needs checking already highlighted. More than two FTE of capacity came back. The boredom went with it.

FAQ

Is Tesseract still useful for SME OCR?

For clean, machine-printed text on standardised forms, yes — and it's free. For invoices, leases, handwritten notes, multi-column layouts, or anything with stamps and signatures, vision-language models extract correctly where Tesseract misreads or fails entirely. Most SME corpora have enough variety that the all-Tesseract path costs more in human review than it saves in API fees.

How do I handle low-confidence cases?

A reviewer queue. Set a confidence threshold per field (not per document) — typically 0.85–0.95. Anything below routes to a human who sees the original image, the extracted draft, and one-click approve / edit / reject. The queue UI is the part that decides whether the system actually works in month six.

What's the real cost of vision-model OCR?

For typical SME volumes — 500–5,000 documents per month — vision-model API costs run £40–250 per month using GPT-4o or Claude Sonnet. The dominant cost is usually the human review queue, not the model. Reducing the human queue from 30% of documents to 5% by tuning confidence thresholds saves more than switching to a cheaper model.

Do I need a separate OCR vendor like Google Document AI or Azure Form Recognizer?

Almost certainly not for SME volumes. The dedicated OCR vendors (Google Document AI, Azure Form Recognizer, AWS Textract) shine at very high volume with structured forms, but charge similar per-page pricing and force you into their schema. Modern vision-language models give you per-document-type prompt control without lock-in. Reach for a vendor only when you've measured a clear gap.

Related Reading

Document RAG: when vector search beats keyword search


Voice AI Architecture: A 2025 Implementation Guide


Need an OCR pipeline that doesn't break at the edge cases?

30-minute audit. We map your stack, your constraints, and where AI will pay back fastest.

Take the Quantum Leap →
© 2026 Quantum Automations Group Ltd