A mortgage broker's inbox at 9am on a Monday holds 140 documents from the weekend: payslips, bank statements, P60s, utility bills, passport scans, driving licences, lease agreements, credit reports, company accounts, HMRC correspondence, letters of employment. Eleven document types, and before any extraction can happen, someone has to decide which extraction pipeline to route each one to. (If you've seen our voice AI and intelligent document analysis case study, the document validation component in that project started with exactly this problem.)
Before we built their classification layer, that someone was a human. Two staff members, four hours, every Monday morning. Eight hours of weekly labour at roughly £18/hour: £144/week, £7,500/year, just to sort documents into folders.
After: the queue runs unattended. Documents land in the intake bucket. The classifier fires within 30 seconds. Each document routes to its extraction pipeline. The two staff members start Monday reviewing exceptions — around 12 per week — rather than sorting 140 files manually.
That's what document classification actually delivers in a production pipeline: it's not a clever ML feature, it's the routing layer that makes every downstream step possible.
Why document routing is the underrated bottleneck in automation pipelines
Most document automation projects are designed from the extraction inward: "we want to extract invoice totals," or "we want to pull the key dates from lease agreements." The design work goes into the extraction model — OCR configuration, prompt engineering, field validation.
The routing layer is treated as a given: someone will put the right documents into the right folder, or the client will maintain a naming convention, or documents will arrive via a typed form that tells you what they are.
In production, none of this holds. Documents arrive via email with subject lines like "FWD: FWD: docs for application". Clients upload everything to a single shared folder. Naming conventions decay within three months. The extraction pipeline receives a P60 where it expected a payslip and returns garbage — silently, with no error, because the fields it was looking for exist in the P60 under different names.
A classification layer sits at the top of the pipeline and enforces routing. It doesn't matter how good your extraction is downstream; if the wrong document reaches the wrong extractor, you get bad data.
Rule-based vs fine-tuned vs zero-shot: picking the right classifier
| Approach | Accuracy (clean docs) | Accuracy (poor scans) | Setup time | Maintenance burden |
|---|---|---|---|---|
| Rule-based (keyword/regex) | 70–85% | 40–60% | 2–4 hours | High (update on each new doc variant) |
| Fine-tuned classifier (LayoutLM, DocFormer) | 92–97% | 80–90% | 2–4 weeks + labelling | Medium (retrain on new classes) |
| Zero-shot vision model (GPT-4o, Claude) | 85–95% | 75–88% | 2–4 hours | Low (update label definitions) |
Rule-based classification, which looks for keywords like "payslip", "PAYE", "P60" in the extracted text, is fast to build and works for clean, machine-generated PDFs. It collapses on poor-quality scans (OCR errors corrupt the keywords), unusual layouts (some payslips don't use the word "payslip"), and foreign-language variants. Don't use it as your primary classifier; use it as a cheap first pass that escalates to a vision model on low-confidence results.
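The first-pass-then-escalate pattern can be sketched roughly as follows. This is a minimal illustration, not the production rule set: the keyword patterns, the `MIN_HITS` cut-off, and the `escalate` callable (standing in for whatever vision model call you use) are all assumptions made for the example.

```python
import re
from typing import Callable, Optional, Tuple

# Hypothetical keyword patterns; a real rule set would be tuned on your own documents.
KEYWORD_RULES = {
    "payslip": [r"\bpayslip\b", r"\bnet pay\b", r"\bPAYE\b"],
    "p60": [r"\bP60\b", r"end of year certificate"],
    "bank_statement": [r"\bsort code\b", r"\bstatement\b"],
}
MIN_HITS = 2  # assumed cut-off: fewer keyword hits than this counts as low confidence


def rule_based_classify(text: str) -> Optional[str]:
    """Return a label only when one class clearly wins on keyword hits, else None."""
    scores = {
        label: sum(1 for pattern in patterns if re.search(pattern, text, re.IGNORECASE))
        for label, patterns in KEYWORD_RULES.items()
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best_label, best_hits), (_, runner_up_hits) = ranked[0], ranked[1]
    if best_hits < MIN_HITS or best_hits == runner_up_hits:
        return None  # too few hits, or a tie between classes
    return best_label


def classify_with_escalation(text: str, escalate: Callable[[], str]) -> Tuple[str, str]:
    """Try the cheap rules first; call the (assumed) vision model only when they abstain."""
    label = rule_based_classify(text)
    if label is not None:
        return label, "rules"
    return escalate(), "vision"
```

The rules abstain rather than guess, so OCR-damaged text with no recognisable keywords goes to the more expensive classifier instead of being routed on a weak match.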
Fine-tuned classifiers (Microsoft's LayoutLM or DocFormer) are the right choice if you have 50–200 labelled examples per class and want sub-100ms inference latency at high volume. That labelling cost and the retraining time (half a day per model update) make them overkill for most UK SME document sets of under 20 classes.
Zero-shot vision models are the practical starting point for almost every SME deployment. You write label definitions in natural language, the model classifies by looking at the document image, and you get 85–95% accuracy without any training data. The accuracy gap vs fine-tuned models is real but often acceptable — and the maintenance cost is near-zero.
Vision models for document classification: what works in 2026
Field-tested on a 12-class UK mortgage document taxonomy (payslip, P60, bank statement, utility bill, passport, driving licence, lease agreement, employment letter, company accounts, credit report, HMRC letter, other):
| Model | Accuracy | Latency | Cost per doc | Notes |
|---|---|---|---|---|
| GPT-4o Vision | 93% | 2.1s avg | ~£0.008 | Strong on typed docs; weaker on poor handwriting |
| Claude Sonnet 4 | 94% | 1.8s avg | ~£0.006 | Marginally better on mixed-layout docs; strong JSON output |
| Gemini 1.5 Pro Vision | 91% | 2.4s avg | ~£0.004 | Cheapest; slightly lower accuracy on low-res scans |
| Rule-based baseline | 72% | <50ms | ~£0.0001 | Only for clean machine-printed PDFs |
We use Claude Sonnet 4 as our primary classifier after head-to-head testing on 800 labelled documents from our mortgage broker client. The accuracy advantage over GPT-4o is small (1 percentage point) but the JSON schema output is more consistent — fewer hallucinated field names in the structured response.
The system prompt that achieved 94% on this taxonomy:
```python
SYSTEM_PROMPT = """
You are a document classifier for a UK mortgage broker.
Classify the document image into exactly one of these categories:
- payslip: Employee pay statement showing gross/net pay, tax, NI deductions
- p60: Annual tax summary issued by employer, showing total pay and tax for the year
- bank_statement: Bank account transaction history from a UK financial institution
- utility_bill: Gas, electricity, water, or broadband bill with address and account holder
- passport: UK or foreign passport identity document
- driving_licence: UK driving licence (paper or photocard)
- lease_agreement: Residential or commercial tenancy or lease contract
- employment_letter: Letter confirming employment, salary, or contract terms
- company_accounts: Filed company accounts (Companies House format or abbreviated)
- credit_report: Credit reference report from Experian, Equifax, or TransUnion
- hmrc_letter: Correspondence from HMRC including assessments, notices, tax codes
- other: Any document not matching the above categories
Return JSON: {"classification": "<label>", "confidence": <0.0–1.0>, "reason": "<one sentence>"}
"""
```
Confidence thresholds and the fallback queue
A classifier that always returns something is more dangerous than a classifier that sometimes says "I'm not sure." Build the fallback queue as a first-class part of the system, not an afterthought.
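The same principle applies to the model's reply itself: a response that is malformed, or that names a label outside the taxonomy, should degrade to "not sure" rather than be routed. A minimal sketch of that validation step, assuming the reply arrives as a raw JSON string; the `parse_classification` helper and its zero-confidence fallback are illustrative choices, not part of any SDK.

```python
import json

# The 12 labels from the system prompt's taxonomy.
ALLOWED_LABELS = {
    "payslip", "p60", "bank_statement", "utility_bill", "passport",
    "driving_licence", "lease_agreement", "employment_letter",
    "company_accounts", "credit_report", "hmrc_letter", "other",
}


def parse_classification(raw: str) -> dict:
    """Parse the model's JSON reply; fall back to a zero-confidence 'other' on any problem."""
    fallback = {
        "classification": "other",
        "confidence": 0.0,
        "reason": "unparseable or invalid model output",
    }
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return fallback
    if not isinstance(data, dict):
        return fallback

    label = data.get("classification")
    confidence = data.get("confidence")
    if not isinstance(label, str) or label not in ALLOWED_LABELS:
        return fallback
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return fallback
    if not 0.0 <= confidence <= 1.0:
        return fallback

    return {
        "classification": label,
        "confidence": float(confidence),
        "reason": str(data.get("reason", "")),
    }
```

Because the fallback carries a confidence of 0.0, any confidence-gated router sends it down the lowest-confidence path, where a human sees it.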
The routing logic:
```python
def route_document(doc_id: str, classification_result: dict) -> str:
    """Route one classified document; returns the path taken so it can be instrumented."""
    # queue_push, alert_slack and the queue constants are defined elsewhere in the pipeline.
    label = classification_result['classification']
    confidence = classification_result['confidence']
    if confidence >= 0.85:
        # Auto-route to extraction pipeline
        queue_push(EXTRACTION_QUEUES[label], {'doc_id': doc_id, 'label': label})
        return 'auto_routed'
    elif confidence >= 0.60:
        # Human review — send to review dashboard
        queue_push(HUMAN_REVIEW_QUEUE, {
            'doc_id': doc_id,
            'suggested_label': label,
            'confidence': confidence,
            'reason': classification_result['reason']
        })
        return 'review_queued'
    else:
        # Unclassified — escalate
        queue_push(ESCALATION_QUEUE, {'doc_id': doc_id, 'raw_result': classification_result})
        alert_slack(f"Unclassified document: {doc_id} — confidence {confidence:.2f}")
        return 'escalated'
```
Instrument every path. After two weeks, the distribution of auto_routed vs review_queued vs escalated tells you whether your confidence thresholds are correctly set. If 35% of volume hits review, your thresholds are too tight or your taxonomy has ambiguous classes. If 2% of volume hits review and extraction errors are high, your thresholds are too loose.
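One way to turn those logged outcomes into a threshold check is sketched below. It assumes you have collected the return values of the routing function as a list of strings; the 35% and 2% cut-offs are the illustrative figures from the paragraph above, and the hint names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable

PATHS = ("auto_routed", "review_queued", "escalated")


def routing_distribution(outcomes: Iterable[str]) -> Dict[str, float]:
    """Return each routing path's share of total volume as a fraction between 0 and 1."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        return {path: 0.0 for path in PATHS}
    return {path: counts[path] / total for path in PATHS}


def threshold_hint(
    distribution: Dict[str, float],
    extraction_errors_high: bool = False,
    review_high: float = 0.35,  # illustrative cut-off taken from the text
    review_low: float = 0.02,   # illustrative cut-off taken from the text
) -> str:
    """Suggest whether the confidence thresholds look too tight, too loose, or fine."""
    review_share = distribution["review_queued"]
    if review_share >= review_high:
        return "too_tight_or_ambiguous_taxonomy"
    if review_share <= review_low and extraction_errors_high:
        return "too_loose"
    return "ok"
```

A low review share on its own is not a problem; it only signals loose thresholds when downstream extraction errors are also high, which is why the second check takes both inputs.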
Multi-label classification: when one document is two document types
Some documents are both things at once. A combined payslip and P60 (common from some payroll systems). A bank statement that also contains a credit report summary. A letter that serves as both an employment confirmation and a salary statement.
For these, add a secondary_classification field to your schema and handle multi-label outputs in routing:
```json
{
  "classification": "payslip",
  "secondary_classification": "p60",
  "confidence": 0.88,
  "reason": "Document shows monthly pay breakdown AND annual totals — combined payslip/P60 format"
}
```
Route the primary label to its extraction pipeline first, then queue the secondary label for extraction with an appropriate priority (usually lower — the primary document type is what the application step needs). Flag multi-label documents in your review dashboard so a human can confirm both extractions are correct before the record is marked complete.
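A rough sketch of that ordering, assuming the classifier result is a dict shaped like the schema above. The `plan_extractions` helper, the `priority` values, and the `needs_human_confirmation` flag are hypothetical names for illustration; wiring the jobs into real queues is left out.

```python
from typing import Any, Dict, List


def plan_extractions(result: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return extraction jobs in priority order for a possibly multi-label result."""
    primary = result["classification"]
    jobs = [{"label": primary, "priority": "normal"}]

    secondary = result.get("secondary_classification")
    if secondary and secondary != primary:
        # The secondary extraction runs after the primary, at lower priority.
        jobs.append({"label": secondary, "priority": "low"})

    # Multi-label documents are flagged so a human confirms both extractions.
    multi_label = len(jobs) > 1
    for job in jobs:
        job["needs_human_confirmation"] = multi_label
    return jobs
```

For the combined payslip/P60 example, this yields a normal-priority payslip job followed by a low-priority P60 job, both flagged for confirmation.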
Building the ground truth dataset from real documents
You need a labelled dataset for two purposes: validating your zero-shot classifier's accuracy before you trust it in production, and providing training data if you later move to a fine-tuned model.
Collecting it:
1. Run your first 200–300 documents through the zero-shot classifier and the rule-based baseline.
2. Have a human label each document (the correct class, not the model's prediction).
3. Compare model predictions to human labels; this is your accuracy baseline.
4. Flag all misclassifications and look for patterns. Are all passport errors on photo-only scans? Are P60 errors on non-standard employer formats? These patterns tell you whether to adjust the label definitions or whether you need per-class examples in the prompt (few-shot).
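Steps 3 and 4 come down to a small amount of bookkeeping. The sketch below assumes you have the results as (human label, model label) pairs; the `accuracy_report` function and its output shape are illustrative.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Tuple


def accuracy_report(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Any]:
    """Compare (human_label, model_label) pairs: overall accuracy, per-class accuracy, confusions."""
    pairs = list(pairs)
    total = len(pairs)
    correct = sum(1 for human, model in pairs if human == model)

    per_class_total = Counter(human for human, _ in pairs)
    per_class_correct = Counter(human for human, model in pairs if human == model)
    per_class = {
        label: per_class_correct[label] / count
        for label, count in per_class_total.items()
    }

    # Most frequent (true label, predicted label) mistakes first: the patterns to investigate.
    confusions = Counter((human, model) for human, model in pairs if human != model)

    return {
        "accuracy": correct / total if total else 0.0,
        "per_class": per_class,
        "confusions": confusions.most_common(),
    }
```

Per-class accuracy matters more than the headline number here: a 94% overall score can hide one class that is wrong half the time.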
Labelling 300 documents takes 2–3 hours at 30 seconds per document. This is the irreducible work: there is no shortcut to knowing whether your classifier is accurate. Do it before you trust the pipeline with live data. Once classification is working, the downstream extraction layer is where most projects spend their effort next. Our OCR with human-in-the-loop guide covers the confidence-threshold pattern for extraction that mirrors what we've described here for classification.
What changed in 2025–2026: multimodal embeddings and native PDF APIs
Two shifts changed document classification in 2025.
The Anthropic Files API and OpenAI's native PDF support mean you can now pass PDFs directly to the model without a rendering step — no convert-to-image, no OCR pre-processing. The model reads the PDF natively, including embedded text and visual layout. For machine-generated PDFs (most invoices, bank statements from major banks), this improves accuracy 3–5 percentage points over image-based classification because the model has access to the underlying text, not just what survives rasterisation.
For scanned documents (photos of physical papers), image-based classification via vision API is still necessary. Build a pre-processor that checks whether a PDF has embedded text (via pdfplumber or PyMuPDF) and routes accordingly: embedded text → native PDF API; scan/image-only → vision model on rasterised page.
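A minimal sketch of such a pre-processor using pdfplumber. The character and page-fraction cut-offs are assumptions to tune on your own documents, and the route names are hypothetical; the decision logic is kept separate from the PDF reading so it can be tested without any files.

```python
from typing import List

MIN_CHARS_PER_PAGE = 50       # assumed: below this, treat the page as image-only
MIN_TEXT_PAGE_FRACTION = 0.5  # assumed: share of pages that must carry embedded text


def extract_page_texts(pdf_path: str) -> List[str]:
    """Read the embedded text of each page with pdfplumber (empty string if none)."""
    import pdfplumber  # third-party; imported lazily so the routing logic has no dependency

    with pdfplumber.open(pdf_path) as pdf:
        return [(page.extract_text() or "") for page in pdf.pages]


def choose_route(page_texts: List[str]) -> str:
    """Pick 'native_pdf' when enough pages have embedded text, else 'vision_model'."""
    if not page_texts:
        return "vision_model"
    text_pages = sum(1 for text in page_texts if len(text.strip()) >= MIN_CHARS_PER_PAGE)
    if text_pages / len(page_texts) >= MIN_TEXT_PAGE_FRACTION:
        return "native_pdf"
    return "vision_model"
```

In use, `choose_route(extract_page_texts(path))` gives the route for one file. Requiring a minimum amount of text per page avoids treating a scan with a stray embedded header or page number as machine-generated.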
The counterpoint worth acknowledging: Google's Document AI offers a pre-built document classifier trained on millions of documents, with UK-specific document types in its catalogue. For high-volume deployments processing over 10,000 documents per month, Document AI's per-page pricing (£0.0015/page) undercuts a vision model API meaningfully. For the typical UK SME processing under 5,000 documents per month, a vision model with a custom prompt is cheaper, more flexible, and doesn't require a Google Cloud contract. If you're building a knowledge layer on top of classified documents, our document RAG guide covers the retrieval architecture that makes extracted data queryable. And for the invoice-processing variant of this problem, the invoice OCR case study shows the full pipeline end-to-end.
Good / Bad / Ugly
Good: Zero-shot vision model classification with a well-defined label taxonomy, a confidence-gated fallback queue, and two weeks of accuracy monitoring before going live. The mortgage broker team went from 8 hours/week of manual sorting to 45 minutes of exception review. The classifier handles the confident 92%; the humans review the ambiguous 8%, around 12 documents a week.
Bad: Assuming the intake naming convention will hold. "Clients will label their uploads" is a statement that has never been true in production. Build the classifier on day one. Naming conventions are not a routing strategy.
Ugly: Deploying a classifier without a fallback queue and discovering 8 months in that 3% of your documents have been silently misrouted — bank statements sent to the lease agreement extractor, extracting nothing, marking the record complete. The application pipeline has been failing silently for months. No alert, no exception, no log that anyone reviewed. Always instrument the low-confidence path.
If your document pipeline is routing by faith rather than by classification, the fix is one model call away. Book a 30-minute audit and we'll show you what a confidence-gated classifier looks like for your document types.