Quantum Automations
Guide · Document Automation

Document Classification with Vision Models: Production Patterns

Published May 2026
Topic Documents · Classification
Reading time 9 min
For UK SME ops leads
On this page
  1. Why document routing is the underrated bottleneck in automation pipelines
  2. Rule-based vs fine-tuned vs zero-shot: picking the right classifier
  3. Vision models for document classification: what works in 2026
  4. Confidence thresholds and the fallback queue
  5. Multi-label classification: when one document is two document types
  6. Building the ground truth dataset from real documents
  7. What changed in 2025–2026: multimodal embeddings and native PDF APIs
  8. Good / Bad / Ugly
  9. FAQ

A mortgage broker's inbox at 9am on a Monday holds 140 documents from the weekend: payslips, bank statements, P60s, utility bills, passport scans, lease agreements, credit reports, company accounts, HMRC correspondence, letters of employment. That is ten document types in one inbox, and before any extraction can happen, someone has to decide which extraction pipeline each one goes to. (If you've seen our voice AI and intelligent document analysis case study, the document validation component in that project started with exactly this problem.)

Before we built their classification layer, that someone was a human. Two staff members, four hours, every Monday morning. Eight hours of weekly labour at roughly £18/hour: £144/week, £7,500/year, just to sort documents into folders.

After: the queue runs unattended. Documents land in the intake bucket. The classifier fires within 30 seconds. Each document routes to its extraction pipeline. The two staff members start Monday reviewing exceptions — around 12 per week — rather than sorting 140 files manually.

That's what document classification actually delivers in a production pipeline: it's not a clever ML feature, it's the routing layer that makes every downstream step possible.

Why document routing is the underrated bottleneck in automation pipelines

Most document automation projects are designed from the extraction inward: "we want to extract invoice totals," or "we want to pull the key dates from lease agreements." The design work goes into the extraction model — OCR configuration, prompt engineering, field validation.

The routing layer is treated as a given: someone will put the right documents into the right folder, or the client will maintain a naming convention, or documents will arrive via a typed form that tells you what they are.

In production, none of this holds. Documents arrive via email with subject lines like "FWD: FWD: docs for application". Clients upload everything to a single shared folder. Naming conventions decay within three months. The extraction pipeline receives a P60 where it expected a payslip and returns garbage — silently, with no error, because the fields it was looking for exist in the P60 under different names.

A classification layer sits at the top of the pipeline and enforces routing. It doesn't matter how good your extraction is downstream; if the wrong document reaches the wrong extractor, you get bad data.

Rule-based vs fine-tuned vs zero-shot: picking the right classifier

| Approach | Accuracy (clean docs) | Accuracy (poor scans) | Setup time | Maintenance burden |
| --- | --- | --- | --- | --- |
| Rule-based (keyword/regex) | 70–85% | 40–60% | 2–4 hours | High (update on each new doc variant) |
| Fine-tuned classifier (LayoutLM, DocFormer) | 92–97% | 80–90% | 2–4 weeks + labelling | Medium (retrain on new classes) |
| Zero-shot vision model (GPT-4o, Claude) | 85–95% | 75–88% | 2–4 hours | Low (update label definitions) |

Rule-based classification, which looks for keywords like "payslip", "PAYE", or "P60" in the extracted text, is fast to build and works for clean, machine-generated PDFs. It collapses on poor-quality scans (OCR errors corrupt the keywords), unusual layouts (some payslips don't use the word "payslip"), and foreign-language variants. Don't use it as your primary classifier; use it as a cheap first pass that escalates to a vision model on low-confidence results.
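As a rough illustration of that first-pass-then-escalate idea, here is a minimal sketch. The keyword patterns, the `KEYWORD_RULES` table, and the `escalate` callable (standing in for the vision model call) are all assumptions made for the example, not the patterns used in the project described here.

```python
import re
from typing import Callable, Dict, List, Optional

# Hypothetical keyword patterns; a real deployment would tune these per client.
KEYWORD_RULES: Dict[str, List[str]] = {
    "payslip": [r"\bpayslip\b", r"\bnet pay\b", r"\bPAYE\b"],
    "p60": [r"\bP60\b", r"end of year certificate"],
    "bank_statement": [r"\bsort code\b", r"\bstatement\b"],
}


def first_pass_label(text: str) -> Optional[str]:
    """Return a label only when exactly one class matches; otherwise None."""
    matched = [
        label
        for label, patterns in KEYWORD_RULES.items()
        if any(re.search(p, text, flags=re.IGNORECASE) for p in patterns)
    ]
    # Zero matches (OCR noise, unusual layout) and multiple matches (ambiguous)
    # are both treated as low confidence, so the caller escalates.
    return matched[0] if len(matched) == 1 else None


def classify(text: str, escalate: Callable[[], str]) -> str:
    """Try the cheap rules first; call the vision model only when they abstain."""
    label = first_pass_label(text)
    return label if label is not None else escalate()
```

The rules deliberately abstain rather than guess: a document that matches two classes, or none, goes to the vision model instead of being routed on a weak signal.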

Fine-tuned classifiers (such as Microsoft's LayoutLM or Amazon's DocFormer) are the right choice if you can label 50–200 examples per class and need sub-100ms inference latency at high volume. That labelling cost and the retraining time (half a day per model update) make them overkill for most UK SME document sets of under 20 classes.

Zero-shot vision models are the practical starting point for almost every SME deployment. You write label definitions in natural language, the model classifies by looking at the document image, and you get 85–95% accuracy without any training data. The accuracy gap vs fine-tuned models is real but often acceptable — and the maintenance cost is near-zero.

Vision models for document classification: what works in 2026

Field-tested on a 12-class UK mortgage document taxonomy (payslip, P60, bank statement, utility bill, passport, driving licence, lease agreement, employment letter, company accounts, credit report, HMRC letter, other):

| Model | Accuracy | Latency | Cost per doc | Notes |
| --- | --- | --- | --- | --- |
| GPT-4o Vision | 93% | 2.1s avg | ~£0.008 | Strong on typed docs; weaker on poor handwriting |
| Claude Sonnet 4 | 94% | 1.8s avg | ~£0.006 | Marginally better on mixed-layout docs; strong JSON output |
| Gemini 1.5 Pro Vision | 91% | 2.4s avg | ~£0.004 | Cheapest; slightly lower accuracy on low-res scans |
| Rule-based baseline | 72% | <50ms | ~£0.0001 | Only for clean machine-printed PDFs |

We use Claude Sonnet 4 as our primary classifier after head-to-head testing on 800 labelled documents from our mortgage broker client. The accuracy advantage over GPT-4o is small (1 percentage point) but the JSON schema output is more consistent — fewer hallucinated field names in the structured response.

The system prompt that achieved 94% on this taxonomy:

SYSTEM_PROMPT = """
You are a document classifier for a UK mortgage broker.
Classify the document image into exactly one of these categories:

- payslip: Employee pay statement showing gross/net pay, tax, NI deductions
- p60: Annual tax summary issued by employer, showing total pay and tax for the year
- bank_statement: Bank account transaction history from a UK financial institution
- utility_bill: Gas, electricity, water, or broadband bill with address and account holder
- passport: UK or foreign passport identity document
- driving_licence: UK driving licence (paper or photocard)
- lease_agreement: Residential or commercial tenancy or lease contract
- employment_letter: Letter confirming employment, salary, or contract terms
- company_accounts: Filed company accounts (Companies House format or abbreviated)
- credit_report: Credit reference report from Experian, Equifax, or TransUnion
- hmrc_letter: Correspondence from HMRC including assessments, notices, tax codes
- other: Any document not matching the above categories

Return JSON: {"classification": "<label>", "confidence": <0.0–1.0>, "reason": "<one sentence>"}
"""
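The prompt asks for JSON, but nothing guarantees the reply is well-formed or uses one of the twelve labels. A minimal validation sketch follows, assuming the model's reply arrives as a raw string; the function name `parse_classification` and the choice to fall back to `other` with zero confidence are illustrative assumptions, not the project's actual handling.

```python
import json

ALLOWED_LABELS = {
    "payslip", "p60", "bank_statement", "utility_bill", "passport",
    "driving_licence", "lease_agreement", "employment_letter",
    "company_accounts", "credit_report", "hmrc_letter", "other",
}


def parse_classification(raw: str) -> dict:
    """Validate the model's JSON reply; degrade to zero confidence on any problem."""
    fallback = {
        "classification": "other",
        "confidence": 0.0,
        "reason": "unparseable or invalid model response",
    }
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return fallback
    if not isinstance(data, dict):
        return fallback

    label = data.get("classification")
    confidence = data.get("confidence")
    if not isinstance(label, str) or label not in ALLOWED_LABELS:
        return fallback
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return fallback
    if not 0.0 <= confidence <= 1.0:
        return fallback

    return {
        "classification": label,
        "confidence": float(confidence),
        "reason": str(data.get("reason", "")),
    }
```

Returning zero confidence rather than raising means a malformed reply takes the lowest-confidence path in whatever routing sits downstream, so it is seen by a human instead of crashing the worker.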

Confidence thresholds and the fallback queue

A classifier that always returns something is more dangerous than a classifier that sometimes says "I'm not sure." Build the fallback queue as a first-class part of the system, not an afterthought.

The routing logic:

def route_document(doc_id: str, classification_result: dict) -> str:
    """Route one classified document and return the path it took.

    queue_push, alert_slack, EXTRACTION_QUEUES, HUMAN_REVIEW_QUEUE and
    ESCALATION_QUEUE are assumed to be defined elsewhere in the pipeline.
    """
    label = classification_result['classification']
    confidence = classification_result['confidence']

    # A label with no extraction queue (e.g. an unexpected value) falls
    # through to human review instead of raising KeyError.
    if confidence >= 0.85 and label in EXTRACTION_QUEUES:
        # Auto-route to extraction pipeline
        queue_push(EXTRACTION_QUEUES[label], {'doc_id': doc_id, 'label': label})
        return 'auto_routed'

    elif confidence >= 0.60:
        # Human review: send to review dashboard
        queue_push(HUMAN_REVIEW_QUEUE, {
            'doc_id': doc_id,
            'suggested_label': label,
            'confidence': confidence,
            'reason': classification_result.get('reason', '')
        })
        return 'review_queued'

    else:
        # Unclassified: escalate and alert
        queue_push(ESCALATION_QUEUE, {'doc_id': doc_id, 'raw_result': classification_result})
        alert_slack(f"Unclassified document: {doc_id} (confidence {confidence:.2f})")
        return 'escalated'

Instrument every path. After two weeks, the distribution of auto_routed vs review_queued vs escalated tells you whether your confidence thresholds are correctly set. If 35% of volume hits review, your thresholds are too tight or your taxonomy has ambiguous classes. If 2% of volume hits review and extraction errors are high, your thresholds are too loose.
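One way to turn that two-week check into something repeatable is sketched below. It assumes you log the string returned for each routed document; the 20% review ceiling is the figure used in the FAQ further down, while the 2% floor pairs with the "too loose" case above and the 5% extraction-error ceiling is a placeholder assumption to replace with your own tolerance.

```python
from collections import Counter
from typing import Dict, List

PATHS = ("auto_routed", "review_queued", "escalated")


def path_shares(outcomes: List[str]) -> Dict[str, float]:
    """Fraction of documents that took each routing path."""
    counts = Counter(outcomes)
    total = len(outcomes)
    return {p: (counts[p] / total if total else 0.0) for p in PATHS}


def threshold_hint(
    shares: Dict[str, float],
    extraction_error_rate: float,
    review_ceiling: float = 0.20,   # figure taken from the FAQ
    review_floor: float = 0.02,
    error_ceiling: float = 0.05,    # assumed tolerance, not from the article
) -> str:
    """Rough diagnosis of the thresholds from routing shares and error rate."""
    review = shares["review_queued"]
    if review > review_ceiling:
        return "too_tight_or_ambiguous_taxonomy"
    if review <= review_floor and extraction_error_rate > error_ceiling:
        return "too_loose"
    return "ok"
```

The hint is only a prompt to look closer: a high review share can come from one ambiguous pair of classes rather than from the thresholds themselves.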

Multi-label classification: when one document is two document types

Some documents are both things at once. A combined payslip and P60 (common from some payroll systems). A bank statement that also contains a credit report summary. A letter that serves as both an employment confirmation and a salary statement.

For these, add a secondary_classification field to your schema and handle multi-label outputs in routing:

{
  "classification": "payslip",
  "secondary_classification": "p60",
  "confidence": 0.88,
  "reason": "Document shows monthly pay breakdown AND annual totals — combined payslip/P60 format"
}

Route the primary label to its extraction pipeline first, then queue the secondary label for extraction with an appropriate priority (usually lower — the primary document type is what the application step needs). Flag multi-label documents in your review dashboard so a human can confirm both extractions are correct before the record is marked complete.
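The ordering and flagging described above could be sketched like this. The job dictionary shape, the `priority` values, and the `needs_human_confirmation` flag are hypothetical names chosen for the example.

```python
from typing import Dict, List


def plan_extractions(result: Dict) -> List[Dict]:
    """Build ordered extraction jobs for a possibly multi-label result."""
    primary = result["classification"]
    jobs = [{"label": primary, "priority": "normal"}]

    secondary = result.get("secondary_classification")
    if secondary and secondary != primary:
        # The secondary type is extracted too, but after the primary one.
        jobs.append({"label": secondary, "priority": "low"})

    # Multi-label documents are flagged so a human confirms both extractions
    # before the record is marked complete.
    multi_label = len(jobs) > 1
    return [dict(job, needs_human_confirmation=multi_label) for job in jobs]
```

For the combined payslip/P60 example this yields a normal-priority `payslip` job followed by a low-priority `p60` job, both flagged for confirmation.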

Building the ground truth dataset from real documents

You need a labelled dataset for two purposes: validating your zero-shot classifier's accuracy before you trust it in production, and providing training data if you later move to a fine-tuned model.

Collecting it:
  1. Run your first 200–300 documents through the zero-shot classifier and the rule-based baseline.
  2. Have a human label each document (the correct class, not the model's prediction).
  3. Compare model predictions to human labels. This is your accuracy baseline.
  4. Flag all misclassifications and look for patterns. Are all passport errors on photo-only scans? Are P60 errors on non-standard employer formats? These patterns tell you whether to adjust the label definitions or whether you need per-class examples in the prompt (few-shot).
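The comparison and pattern-hunting steps can be sketched in a few lines, assuming the human labels and model predictions are held as two parallel lists. The function name `accuracy_report` is an illustrative choice.

```python
from collections import Counter
from typing import List, Tuple


def accuracy_report(
    human_labels: List[str], model_labels: List[str]
) -> Tuple[float, List[Tuple[Tuple[str, str], int]]]:
    """Return overall accuracy and (true, predicted) confusions, most common first."""
    if len(human_labels) != len(model_labels):
        raise ValueError("label lists must be the same length")
    if not human_labels:
        raise ValueError("need at least one labelled document")

    pairs = list(zip(human_labels, model_labels))
    correct = sum(1 for truth, predicted in pairs if truth == predicted)
    confusions = Counter(pair for pair in pairs if pair[0] != pair[1])
    return correct / len(pairs), confusions.most_common()
```

The confusion list is where the patterns show up: if `("p60", "payslip")` dominates, the label definitions for those two classes are the first thing to revisit.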

Labelling 300 documents takes 2–3 hours at 30 seconds per document. This is the irreducible work — there is no shortcut to knowing whether your classifier is accurate. Do it before you trust the pipeline with live data. Once classification is working, the downstream extraction layer is where most projects spend the next effort — our OCR with human-in-the-loop guide covers the confidence-threshold pattern for extraction that mirrors what we've described here for classification.

What changed in 2025–2026: multimodal embeddings and native PDF APIs

The shift that mattered most for document classification in 2025 was native PDF input.

The Anthropic Files API and OpenAI's native PDF support mean you can now pass PDFs directly to the model without a rendering step — no convert-to-image, no OCR pre-processing. The model reads the PDF natively, including embedded text and visual layout. For machine-generated PDFs (most invoices, bank statements from major banks), this improves accuracy 3–5 percentage points over image-based classification because the model has access to the underlying text, not just what survives rasterisation.

For scanned documents (photos of physical papers), image-based classification via vision API is still necessary. Build a pre-processor that checks whether a PDF has embedded text (via pdfplumber or PyMuPDF) and routes accordingly: embedded text → native PDF API; scan/image-only → vision model on rasterised page.
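A minimal sketch of that pre-processor, using PyMuPDF (`fitz.open` and `page.get_text()`). The 50-character cut-off, the route names, and the rule that every page must carry text are assumptions to tune against your own documents.

```python
from typing import List

MIN_CHARS_PER_PAGE = 50  # hypothetical cut-off; tune on your own documents


def extract_page_texts(pdf_path: str) -> List[str]:
    """Read the embedded text of each page with PyMuPDF (pip install pymupdf)."""
    import fitz  # PyMuPDF; imported lazily so choose_route has no hard dependency

    doc = fitz.open(pdf_path)
    try:
        return [page.get_text() for page in doc]
    finally:
        doc.close()


def choose_route(page_texts: List[str]) -> str:
    """Return 'native_pdf' when every page has usable embedded text, else 'vision_raster'."""
    if not page_texts:
        return "vision_raster"
    every_page_has_text = all(
        len(text.strip()) >= MIN_CHARS_PER_PAGE for text in page_texts
    )
    return "native_pdf" if every_page_has_text else "vision_raster"
```

Usage would be `choose_route(extract_page_texts(path))`. Requiring text on every page is the cautious choice: a PDF that mixes typed pages with scanned inserts goes to the vision path rather than losing the scanned pages.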

The counterpoint worth acknowledging: Google's Document AI offers a pre-built document classifier trained on millions of documents, with UK-specific document types in its catalogue. For high-volume deployments processing over 10,000 documents per month, Document AI's per-page pricing (£0.0015/page) undercuts a vision model API meaningfully. For the typical UK SME processing under 5,000 documents per month, a vision model with a custom prompt is cheaper, more flexible, and doesn't require a Google Cloud contract.

If you're building a knowledge layer on top of classified documents, our document RAG guide covers the retrieval architecture that makes extracted data queryable. For the invoice-processing variant of this problem, the invoice OCR case study shows the full pipeline end-to-end.

Good / Bad / Ugly

Good: Zero-shot vision model classification with a well-defined label taxonomy, a confidence-gated fallback queue, and two weeks of accuracy monitoring before going live. The mortgage broker team went from 8 hours/week of manual sorting to 45 minutes of exception review. The classifier auto-routes the roughly 92% of documents it is confident about; the humans review the ambiguous 8%.

Bad: Assuming the intake naming convention will hold. "Clients will label their uploads" is a statement that has never been true in production. Build the classifier on day one. Naming conventions are not a routing strategy.

Ugly: Deploying a classifier without a fallback queue and discovering 8 months in that 3% of your documents have been silently misrouted — bank statements sent to the lease agreement extractor, extracting nothing, marking the record complete. The application pipeline has been failing silently for months. No alert, no exception, no log that anyone reviewed. Always instrument the low-confidence path.



If your document pipeline is routing by faith rather than by classification, the fix is one model call away. Book a 30-minute audit and we'll show you what a confidence-gated classifier looks like for your document types.

FAQ

How many training examples do we need per document class?

For a fine-tuned classifier: 50–100 labelled examples per class is the workable minimum, 200+ per class is preferable for high-confidence results. For a zero-shot vision model (GPT-4o Vision, Claude Sonnet with a schema prompt): you need zero training examples — the model classifies from your label definitions alone, with accuracy typically 85–95% on clean document types. Start with zero-shot to validate your label taxonomy before committing to fine-tuning.

What confidence threshold should trigger human review?

A workable default: above 0.85 confidence routes automatically; 0.60–0.85 goes to the human review queue; below 0.60 flags as unclassified and escalates. Calibrate per document type: a poorly scanned bank statement may consistently score 0.75 (not low confidence, just noisy input), while a crisp lease agreement should score 0.95+. After two weeks of production, look at your review queue distribution: if more than 20% of volume is hitting review, your thresholds or label taxonomy need adjustment.
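Per-type calibration can be as simple as a lookup table of overrides on top of the defaults. The sketch below uses the 0.85 and 0.60 defaults from this answer; the override values for `bank_statement` and `lease_agreement` are invented for illustration and should come from your own two weeks of data.

```python
from typing import Dict, Tuple

# (auto-route threshold, review threshold) from the default above
DEFAULT_THRESHOLDS: Tuple[float, float] = (0.85, 0.60)

# Hypothetical per-class overrides; derive real values from production data.
PER_LABEL_THRESHOLDS: Dict[str, Tuple[float, float]] = {
    "bank_statement": (0.75, 0.55),   # noisy scans score lower but are usually right
    "lease_agreement": (0.90, 0.60),  # crisp documents should clear a higher bar
}


def routing_path(label: str, confidence: float) -> str:
    """Pick the routing path using per-label thresholds where defined."""
    auto_at, review_at = PER_LABEL_THRESHOLDS.get(label, DEFAULT_THRESHOLDS)
    if confidence >= auto_at:
        return "auto_routed"
    if confidence >= review_at:
        return "review_queued"
    return "escalated"
```

Keeping the overrides in data rather than in branching code means a threshold change is a one-line edit that can be reviewed alongside the accuracy numbers that justify it.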

Can we use the same vision model for classification and extraction?

Yes, and this is often the right architecture for smaller document sets. A single Claude Sonnet or GPT-4o call can classify the document type AND extract the target fields in one pass, using a JSON schema output that includes both the classification label and the extracted data. The trade-off: a combined prompt is less specialised than a dedicated classifier, so classification accuracy may be 3–5 percentage points lower. For a 5-class taxonomy with clear visual distinctions, combined classification+extraction is fine. For a 20-class taxonomy with visually similar types, separate the steps.

How do we handle documents in languages other than English?

Modern vision models (GPT-4o, Claude Sonnet 4) handle multilingual documents well without configuration — they detect language and extract fields correctly for French, German, Spanish, and most EU languages. For UK SMEs, the most common foreign-language documents are IBAN-format invoices from EU suppliers (French, German, Dutch). If your classification label definitions are in English, add a note in the system prompt: 'Documents may be in any language — classify by document type regardless of language.' Extraction accuracy for non-English documents is typically 5–8% lower than for English.

Related Reading

OCR with human-in-the-loop: shipping 99% accuracy in production

Document RAG: when vector search beats keyword search

© 2026 Quantum Automations Group Ltd