A UK manufacturing firm processing 1,400 invoices per month ran a basic OCR pipeline for three months without a flagged error. The pipeline captured supplier name, invoice number, date, and total: four fields that looked like coverage but were not. In month four, the finance director found a £14,000 duplicate payment: two invoices from the same supplier, identical amounts, five days apart, different invoice numbers. The OCR pipeline captured both as valid, because as transcriptions both were correct. No line-item extraction meant no basis for comparison. No purchase order matching meant no validation against what had actually been ordered. The "accurate enough" pipeline cost them five figures on what became a very difficult Monday.
That gap — between transcription and extraction — is where invoice automation projects quietly fail.
The five stages of invoice processing — and where most pipelines stop at stage two
A production invoice pipeline has five stages:
1. Ingestion: receive PDF, image, or email attachment; normalise to a consistent format for processing.
2. Transcription: OCR or vision model reads text from the document. This is where most "AI invoice" tools stop.
3. Structured extraction: parse transcribed text into typed fields: line items with quantity, unit price, and description; VAT breakdown; payment terms; currency; purchase order reference.
4. Validation: verify extracted totals against the sum of line items; match the PO reference against the purchase order register; run duplicate detection.
5. ERP integration: push validated, structured data to Xero, QuickBooks, or Sage; route exceptions to a human review queue.
Most off-the-shelf OCR tools deliver stages one and two. Some add a thin approximation of stage three with regex field matching. Stages four and five — where the actual financial risk lives — are left to the accounts payable team to handle manually, or skipped entirely.
The £14,000 duplicate passed through stages one and two perfectly. A stage four duplicate check with fuzzy matching on supplier, amount, and date would have caught it.
Multi-field extraction: line items, VAT, payment terms, and PO reference with a single LLM call
The move from basic OCR to structured extraction is a schema design problem first, a model choice second. Here is the extraction schema we use in production:

    {
      "supplier_name": "string",
      "supplier_vat_number": "string | null",
      "invoice_number": "string",
      "invoice_date": "ISO8601 date",
      "due_date": "ISO8601 date | null",
      "po_reference": "string | null",
      "currency": "GBP | EUR | USD | string",
      "line_items": [
        {
          "description": "string",
          "quantity": "number",
          "unit_price": "number",
          "line_total": "number",
          "vat_rate": "number | null"
        }
      ],
      "subtotal": "number",
      "vat_amount": "number",
      "total_amount": "number",
      "payment_terms_days": "integer | null",
      "bank_details": {
        "sort_code": "string | null",
        "account_number": "string | null",
        "iban": "string | null"
      },
      "confidence_scores": {
        "overall": "float 0-1",
        "line_items": "float 0-1",
        "totals": "float 0-1"
      }
    }

The shape above is shorthand rather than formal JSON Schema. Translate it into a JSON Schema definition, then pass that to GPT-4o using structured output mode or to Claude using tool use. The model fills every field it can locate and returns null for absent fields, which is itself meaningful data for the downstream validation stage. A null on po_reference means this invoice arrived without a PO number and needs a human check before ERP write.
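As a rough illustration, part of that shape could be written as JSON Schema along these lines. The fragment covers only five fields, and `INVOICE_SCHEMA_FRAGMENT`, `REVIEW_IF_NULL`, and `null_fields_needing_review` are hypothetical names rather than part of any SDK. It assumes a strict structured-output mode in which every property is listed as required and nullable fields are typed as a union with null.

```python
# Hypothetical JSON Schema fragment for a subset of the extraction fields.
# Nullable fields use a two-element "type" list so the model can return null.
INVOICE_SCHEMA_FRAGMENT = {
    "type": "object",
    "properties": {
        "supplier_name": {"type": "string"},
        "invoice_number": {"type": "string"},
        "po_reference": {"type": ["string", "null"]},
        "due_date": {"type": ["string", "null"]},
        "total_amount": {"type": "number"},
    },
    "required": [
        "supplier_name",
        "invoice_number",
        "po_reference",
        "due_date",
        "total_amount",
    ],
    "additionalProperties": False,
}

# Assumption: these are the nullable fields that force a human check when null.
REVIEW_IF_NULL = ("po_reference",)


def null_fields_needing_review(extracted: dict) -> list[str]:
    """Return the review-critical fields that came back null or missing."""
    return [name for name in REVIEW_IF_NULL if extracted.get(name) is None]
```

Keeping the review rule in code rather than in the prompt means the routing decision does not depend on the model remembering to flag a missing PO.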
Bank detail extraction needs a specific note: capturing sort codes and account numbers enables automated payment initiation, but it also opens a fraud vector — supplier bank detail substitution is one of the most common invoice fraud patterns in UK finance. Extract these fields, but write them to an audit-logged table that requires explicit human approval before they propagate to the payment system. This is not an optional safeguard.
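One way to picture the approval gate is a staging area that holds extracted bank details until someone signs them off. This is a minimal in-memory sketch; `BankDetailChange` and `BankDetailStaging` are hypothetical names, and a real deployment would back them with a database table and an authenticated approver identity.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class BankDetailChange:
    supplier_vat_number: str
    sort_code: Optional[str]
    account_number: Optional[str]
    iban: Optional[str]
    approved_by: Optional[str] = None  # stays None until a human signs off


@dataclass
class BankDetailStaging:
    pending: list = field(default_factory=list)
    audit_log: list = field(default_factory=list)

    def _log(self, event: str, supplier_vat_number: str) -> None:
        # Every state change is recorded with a UTC timestamp.
        stamp = datetime.now(timezone.utc).isoformat()
        self.audit_log.append((stamp, event, supplier_vat_number))

    def stage(self, change: BankDetailChange) -> None:
        self.pending.append(change)
        self._log("staged", change.supplier_vat_number)

    def approve(self, change: BankDetailChange, approver: str) -> None:
        change.approved_by = approver
        self._log(f"approved by {approver}", change.supplier_vat_number)

    def releasable(self) -> list:
        # Only approved changes may propagate to the payment system.
        return [c for c in self.pending if c.approved_by is not None]
```

Nothing reaches `releasable()` without an explicit `approve()` call, and both steps leave an audit entry.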
Validation logic: matching extracted totals to line-item sums before ERP write
Extraction and validation are separate operations. An LLM that extracts total_amount: 4823.50 has not verified that the total is correct. Write the validation layer in code, not in the prompt:

    def validate_invoice_totals(extracted: dict) -> dict:
        """Check the extracted total against the line-item sum plus VAT."""
        line_sum = sum(item["line_total"] for item in extracted["line_items"])
        expected_total = round(line_sum + extracted["vat_amount"], 2)
        actual_total = extracted["total_amount"]
        # Round the delta to pence so float noise cannot push an exact
        # 2p difference over the threshold.
        delta = round(actual_total - expected_total, 2)
        # 2p tolerance covers multi-line VAT rounding on UK standard rate
        tolerance = 0.02
        mismatch = abs(delta) > tolerance
        return {
            "totals_match": not mismatch,
            "computed_total": expected_total,
            "extracted_total": actual_total,
            "delta": delta,
        }

Two further checks run before ERP write: PO matching (look up po_reference in the purchase order register, then verify that the supplier matches the PO and that the invoice amount falls within the approved PO value) and duplicate detection (covered in a dedicated section below).
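The PO matching step might look something like the sketch below. It assumes the purchase order register is available as a dictionary keyed by PO reference, with `supplier_name` and `approved_value` on each entry; those field names and `check_po_match` itself are illustrative, not taken from any particular ERP.

```python
def check_po_match(extracted: dict, po_register: dict) -> dict:
    """Compare an extracted invoice against its purchase order, if any."""
    po_ref = extracted.get("po_reference")
    po = po_register.get(po_ref) if po_ref else None
    if po is None:
        # Missing or unknown PO reference: nothing to validate against.
        return {"po_found": False, "supplier_match": False, "within_value": False}

    def normalise(name: str) -> str:
        # Case- and whitespace-insensitive comparison of supplier names.
        return name.strip().casefold()

    supplier_match = normalise(po["supplier_name"]) == normalise(
        extracted["supplier_name"]
    )
    within_value = extracted["total_amount"] <= po["approved_value"]
    return {
        "po_found": True,
        "supplier_match": supplier_match,
        "within_value": within_value,
    }
```

Any `False` in the result is a reason to route the invoice to review rather than write it to the ERP.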
Our invoice OCR pipeline case study shows what this validation layer caught in its first month of production: 12 invoices with total/line-item mismatches, three duplicate submissions, and one invoice where the supplier VAT number returned no match against HMRC's register — all intercepted before ERP write.
Confidence scoring for invoice extraction: thresholds that separate auto-approve from human review
Not every invoice needs the same handling path. Confidence scores per extraction let you route automatically without reviewing everything manually.
Compute confidence at three levels — overall, line-items, and totals — and weight them. Line-items score lower when the item count is unusual for this supplier or when table formatting is ambiguous. Totals score lower when the validation delta is non-zero.
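A weighted combination could be sketched as follows. The weights and the 0.5 cap applied when the validation delta is non-zero are placeholder values chosen for illustration, not figures from a production system; tune both against your own ground-truth data.

```python
# Hypothetical weights; they sum to 1.0 so the result stays in the 0-1 range.
WEIGHTS = {"overall": 0.4, "line_items": 0.3, "totals": 0.3}

# Hypothetical cap applied to the totals score when validation found a delta.
TOTALS_CAP_ON_DELTA = 0.5


def combined_confidence(scores: dict, totals_delta: float) -> float:
    """Blend the three confidence levels into a single routing score."""
    totals = scores["totals"]
    if abs(totals_delta) > 0:
        # A non-zero validation delta lowers trust in the totals.
        totals = min(totals, TOTALS_CAP_ON_DELTA)
    adjusted = {**scores, "totals": totals}
    return round(sum(WEIGHTS[key] * adjusted[key] for key in WEIGHTS), 4)
```

Here `scores` is the `confidence_scores` object from the extraction output and `totals_delta` is the `delta` returned by the totals validation.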
| Confidence tier | Score range | Action |
|---|---|---|
| Auto-approve | ≥ 0.92 | Write to ERP directly; include in monthly audit sample |
| Soft flag | 0.75 to < 0.92 | Queue for human review within 24 hours; highlight low-confidence fields |
| Hard block | < 0.75 | Hold invoice; notify AP team; do not write to ERP |
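The routing in the table reduces to two comparisons. In this sketch the function and tier names are illustrative; the thresholds are the ones from the table.

```python
AUTO_APPROVE_FLOOR = 0.92
SOFT_FLAG_FLOOR = 0.75


def route_invoice(score: float) -> str:
    """Map a combined confidence score to a handling tier."""
    if score >= AUTO_APPROVE_FLOOR:
        return "auto_approve"
    if score >= SOFT_FLAG_FLOOR:
        return "soft_flag"
    return "hard_block"
```

Comparing against floors rather than closed ranges means a score such as 0.915 still lands in a tier instead of falling between two of them.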
After six weeks running this on a stable supplier corpus, 78–85% of invoices hit the auto-approve tier. The human review queue stays manageable, typically eight to fifteen per day for a business processing 500 monthly. Hard-block rate runs at 2–4% and is worth triaging attentively: the invoices stuck there tend to come from the most problematic supplier relationships in any case.
Calibrate your thresholds on ground-truth data, not intuition. Run the first 500 invoices through the pipeline, manually verify the correct output for each one, and measure false-positive rate (invoices that auto-approved but contained an error). Set the auto-approve floor at the score where that rate drops below 0.5%.
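The calibration step can be sketched as a search over the scores observed in the verified sample. This assumes each sample is a `(score, is_correct)` pair produced by manual verification, and it defines the false-positive rate as the share of would-be auto-approved invoices that contained an error; `calibrate_auto_approve_floor` is a hypothetical helper.

```python
from typing import Optional


def calibrate_auto_approve_floor(
    samples: list,
    max_error_rate: float = 0.005,
) -> Optional[float]:
    """Return the lowest score threshold whose auto-approved set has an
    error rate below max_error_rate, or None if no threshold qualifies."""
    for threshold in sorted({score for score, _ in samples}):
        # Everything at or above the threshold would have auto-approved.
        approved = [correct for score, correct in samples if score >= threshold]
        errors = sum(1 for correct in approved if not correct)
        if errors / len(approved) < max_error_rate:
            return threshold
    return None
```

With only 500 verified invoices, a 0.5% target corresponds to two or three errors at most, so it is worth re-running the calibration as the verified sample grows.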
Supplier-specific extraction templates: when a general model is not enough
A general extraction prompt handles standard UK invoice formats well. Most invoices generated by Sage, Xero, or FreeAgent on the supplier side are consistent enough that GPT-4o extracts them accurately on the first attempt.
Three supplier categories break the general model:
Non-standard layouts. Some suppliers use multi-column tables, rotated headers, or multi-page invoices with line items spanning pages. The general model misreads column alignment and merges rows incorrectly.
Non-UK formats. EU suppliers with reverse-charge VAT, intra-community supply designations, or German-style gross/net presentation confuse a UK-centric prompt. US suppliers using sales tax instead of VAT fail the validation layer unless the schema accounts for tax_type.
High-volume, low-variation suppliers. If you receive 200 invoices monthly from one supplier, a supplier-specific prompt with three to five few-shot examples from their actual format will outperform the general model by 3–7 percentage points on line-item accuracy. That improvement compounds.
Store supplier-specific templates keyed by supplier VAT number. On ingestion, run a lightweight first-pass extraction to identify the supplier, look up their template, then run full extraction with the supplier-specific prompt. Fall back to the general template for unknown suppliers. The document classification with vision models post covers the supplier-identification routing step in detail — using a fast classifier to route documents before the expensive extraction call.
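The lookup-with-fallback step could be as small as this. The template identifiers and the VAT number are made-up examples, and normalising the VAT number (stripping spaces, upper-casing) is an assumption about how the keys are stored.

```python
from typing import Optional

GENERAL_TEMPLATE = "general-uk-invoice"

# Hypothetical registry: supplier VAT number -> template identifier.
SUPPLIER_TEMPLATES = {
    "GB123456789": "acme-multipage",
}


def select_template(supplier_vat_number: Optional[str]) -> str:
    """Pick the supplier-specific template, else fall back to the general one."""
    if supplier_vat_number is None:
        return GENERAL_TEMPLATE
    key = supplier_vat_number.replace(" ", "").upper()
    return SUPPLIER_TEMPLATES.get(key, GENERAL_TEMPLATE)
```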
ERP and accounting integration: pushing structured data to Xero, QuickBooks, and Sage
All three major SME accounting platforms support programmatic creation of purchase invoices. The field mapping and quirks differ:
Xero expects an AccountCode (nominal ledger code) on each line item in addition to the free-text description. Your extraction output therefore needs a ledger-code classification step: either rule-based keyword matching on the item description, or a small classification call against your chart of accounts. Without this, every line item arrives uncoded and requires manual reclassification.
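The rule-based variant of that classification step might be sketched like this. The keywords and account codes are placeholders and must be replaced with entries from your own chart of accounts; returning `None` for unmatched descriptions is an assumption that lets the caller route the line to review.

```python
from typing import Optional

# Hypothetical keyword rules: (keyword, account code). First match wins.
ACCOUNT_RULES = [
    ("freight", "425"),
    ("courier", "425"),
    ("software", "463"),
]


def classify_account_code(description: str) -> Optional[str]:
    """Return an account code for a line-item description, or None."""
    text = description.casefold()
    for keyword, code in ACCOUNT_RULES:
        if keyword in text:
            return code
    return None  # no rule matched: leave for manual coding
```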
QuickBooks Online accepts free-text descriptions but prefers ItemRef for structured line items. Multi-currency is handled natively via the QuickBooks Online Bills API, which simplifies foreign-currency supplier invoices considerably.
Sage 200 and Sage 50 are typically accessed via ODBC drivers or proprietary API wrappers rather than a public REST API. Integration is slower to build and more brittle; verify your Sage version and API availability before committing to an automated write path.
In all three cases, the ERP write should be idempotent. Use supplier VAT number plus invoice number as a composite key. If the invoice already exists in the ERP — which the duplicate check should have caught before this point — return the existing record rather than creating a second entry. The idempotent write is a safety net, not a replacement for the duplicate check.
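The idempotent write can be illustrated with a dictionary standing in for the ERP. `erp_store` and `write_invoice_idempotent` are hypothetical; against a real platform the lookup would be a query on the same composite key before the create call.

```python
def write_invoice_idempotent(invoice: dict, erp_store: dict) -> tuple:
    """Create the invoice once; return (record, created) on every call."""
    # Composite key: supplier VAT number plus invoice number.
    key = (invoice["supplier_vat_number"], invoice["invoice_number"])
    existing = erp_store.get(key)
    if existing is not None:
        # Already written: hand back the stored record, create nothing.
        return existing, False
    erp_store[key] = invoice
    return invoice, True
```

Calling it twice with the same invoice leaves exactly one record in the store, which is what makes retries after a timeout safe.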
Duplicate detection and anomaly flagging before payment runs
Duplicate detection is not a hash check on invoice number. The £14,000 case had two distinct invoice numbers — the supplier submitted separately numbered invoices for the same work. Catching this requires fuzzy matching on supplier plus amount plus approximate date window:
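
    from datetime import date

    def _as_date(value) -> date:
        # The schema carries ISO 8601 strings; parse them before subtracting.
        return value if isinstance(value, date) else date.fromisoformat(value)

    def is_likely_duplicate(new_invoice: dict, processed: list[dict]) -> bool:
        new_vat = new_invoice["supplier_vat_number"]
        if new_vat is None:
            # Two invoices with no VAT number must not count as the same supplier.
            return False
        new_date = _as_date(new_invoice["invoice_date"])
        for existing in processed:
            same_supplier = new_vat == existing["supplier_vat_number"]
            same_amount = abs(
                new_invoice["total_amount"] - existing["total_amount"]
            ) < 1.00
            close_date = abs(
                (new_date - _as_date(existing["invoice_date"])).days
            ) <= 14
            if same_supplier and same_amount and close_date:
                return True
        return False
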
Beyond duplicates, flag these anomalies before each payment run:
- Total exceeds PO value by more than 10% — common when change orders bypassed the purchase order process.
- Supplier bank details changed — compare against previously recorded details and hold for manual sign-off. This is the primary target of invoice fraud.
- Invoice date older than 90 days — flags potential backdated submissions and may indicate a cash flow manipulation attempt.
- Supplier VAT number fails HMRC verification — HMRC provides a public VAT number checker that can be called programmatically for each new supplier.
Run anomaly detection as a batch job before each payment run, not at extraction time. Your validated data is stable by then; running it against the full payment batch catches patterns that per-invoice checks miss.
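Three of the four checks need no external call and can be sketched as a single function applied to each invoice in the payment batch. The flag names and the function signature are illustrative, `invoice_date` is assumed to be a `datetime.date` already, and the HMRC verification is left out because it depends on an external service.

```python
from datetime import date
from typing import Optional


def flag_anomalies(
    invoice: dict,
    po_value: Optional[float],
    known_bank_details: Optional[dict],
    today: date,
) -> list:
    """Return the anomaly flags raised by one invoice before a payment run."""
    flags = []
    # Total exceeds the approved PO value by more than 10%.
    if po_value is not None and invoice["total_amount"] > po_value * 1.10:
        flags.append("total_exceeds_po_by_10pct")
    # Bank details differ from those previously recorded for the supplier.
    if known_bank_details is not None and invoice["bank_details"] != known_bank_details:
        flags.append("bank_details_changed")
    # Invoice date is more than 90 days old.
    if (today - invoice["invoice_date"]).days > 90:
        flags.append("invoice_older_than_90_days")
    return flags
```

Any invoice with a non-empty flag list would be held out of the payment run until someone has reviewed it.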
What changed in 2025–2026: native PDF APIs and structured output extraction in GPT-4o and Claude
Two developments shifted the practical approach to invoice extraction over the past eighteen months.
Anthropic released native PDF support for Claude that accepts PDF documents directly, without a separate OCR pre-processing step. Each page reaches the model as extracted text together with a page image, so table borders and spatial relationships stay available alongside the text rather than being inferred from a flattened image alone. On multi-column invoices, this reduces table misalignment errors noticeably compared to the image-only approach that preceded it.
OpenAI's structured output mode, released in August 2024, constrains every completed response to the supplied JSON schema. Before this, extraction pipelines needed retry loops and fallback parsers to handle malformed JSON responses, a class of failure that added latency and complexity to every deployment. Structured outputs remove that failure mode for completed responses; refusals and responses cut off at the token limit still need explicit handling.
One counterpoint worth acknowledging: AWS's evaluation of LLM-based document extraction notes that general-purpose LLMs still underperform specialised document AI models — such as Amazon Textract with form analysis — on densely structured tabular data with 30 or more line items. At that scale of invoice complexity, a hybrid approach using a specialised table-extraction model to handle the line-item grid, feeding into an LLM for field interpretation and schema population, remains both faster and more accurate than an LLM-only pipeline.
Good / Bad / Ugly: three invoice pipeline patterns and their accuracy at production volume
| Pattern | Description | Field accuracy at 500+/month | Risk profile |
|---|---|---|---|
| Good | LLM extraction with typed JSON schema + server-side validation + confidence routing + fuzzy duplicate detection | 95–97%; false auto-approve rate < 0.5% | Higher initial build cost; requires template maintenance when supplier formats change |
| Bad | Basic OCR + regex field extraction + direct ERP write | 80–88% on standard formats; drops to 55–65% on non-standard layouts | Silent failures pass through to the payment run with no flagging |
| Ugly | LLM extraction with no validation layer; auto-approve everything above a basic confidence check | 90–93% extraction accuracy, but 3–5% of invoices carry total mismatches that pass through; duplicates not caught | Looks correct for several months, then the £14k Monday arrives |
The "Ugly" pattern describes most teams' first move when they replace a regex pipeline with an LLM call: extraction quality improves visibly, but the absence of a validation layer means errors that failed loudly now fail quietly. The "Bad" pattern is at least honest about its limitations, because the finance team knows it is checking everything. The "Good" pattern costs two to three weeks of additional build to wire up validation, confidence routing, and duplicate detection, overhead that pays for itself on the first prevented duplicate payment.
For teams starting from zero, the OCR and human-in-the-loop patterns post covers how to structure the human review queue so reviewers spend time on genuine exceptions rather than routine re-keying — which is where the time saving comes from.