Is it legally safe to use an LLM to review NDAs for a UK SME, and what are the liability implications?

An LLM review is a triage tool, not legal advice. The liability stays with you (or your instructed solicitor) regardless of what the model flags. The safe framing is to treat LLM output as a structured first-pass that surfaces clauses requiring human legal review — not as a substitute for it. Document this clearly in your internal policy: 'LLM review is a screening step; final authority rests with a qualified solicitor for any clause marked high-risk.' The Solicitors Regulation Authority has not prohibited AI-assisted review, but if you rely on AI output without solicitor sign-off and something goes wrong, you have no professional indemnity cover to fall back on. Build the human gate and keep it visible.

Which clauses in a UK NDA are too high-risk to automate and should always go to a solicitor?

Three categories consistently need a solicitor: (1) unlimited or uncapped liability clauses, where the financial exposure is open-ended and requires negotiation skill, not just detection; (2) broad IP assignment clauses that could inadvertently sign away pre-existing IP or background IP your business depends on; and (3) non-solicitation clauses written broadly enough to block you from hiring from an entire industry sector. An LLM can reliably detect that these clauses exist and flag their scope, but whether to accept, negotiate, or refuse them depends on commercial context the model cannot assess — your negotiating position, your relationship with the counterparty, and what you'd actually lose.

How do I keep NDA documents confidential when sending them to OpenAI or Anthropic APIs?

OpenAI's default API terms (as of 2025) do not train on API-submitted data, but you should verify your organisation's data processing agreement and consider requesting Zero Data Retention (ZDR) for sensitive documents. Anthropic offers similar data handling commitments via enterprise agreements. For genuinely sensitive NDAs (those covering acquisition terms, IP licensing, or employee data), run a local model (Mistral 7B or Llama 3.1 via Ollama) on your own infrastructure instead. Clause extraction accuracy drops by roughly 13 to 15 percentage points in our testing, but the document never leaves your infrastructure. The hybrid approach: run a local model to strip party names and commercially sensitive specifics before sending to a cloud API for clause classification.

What does a GDPR-compliant audit trail look like for AI-assisted contract review decisions?

UK GDPR Article 22 restricts decisions based solely on automated processing that have legal or similarly significant effects on individuals, and the related transparency provisions call for meaningful information about the logic involved and a route to human review. Whether a given NDA review falls within that scope is a question for your solicitor or data protection officer; logging as though it does is the cautious design. For contract review, this means logging: the document hash (not the document itself), the model version used, the timestamp, which clauses were flagged and at what confidence level, and who (which human reviewer) signed off on the final decision. Store this in append-only form: a Postgres table with no delete permissions granted to the application user works well. Retain records for at least the duration of the contract plus any applicable limitation period (typically six years under the Limitation Act 1980). The ICO's guidance on automated decision-making is the relevant reference.

LLM Contract Review for UK SMEs: NDA Clause Extraction

A SaaS founder in Manchester came to us signing 40–60 NDAs a year. Every one sat in a queue for a solicitor at £180/hour with a three-day turnaround. They'd get back a marked-up document, pay £135–£180 per review, and find that 80% of the comments were the same boilerplate concerns flagged on every contract: missing governing law clause, automatic renewal without notice, confidentiality term that never expires. We built an LLM contract review pipeline. It now costs £0.02 per NDA, runs in eight seconds, flags 80% of what the solicitor would raise, and hands the remaining 20% to the solicitor as a structured queue that takes 12 minutes instead of 45.

The three clauses you should never automate are the interesting part — and they're not the ones most founders expect.

What LLM contract review can and cannot do reliably in 2026

Large language models are surprisingly good at pattern recognition over legal text. NDAs are, by nature, repetitive — most contain the same 15–20 clauses in different arrangements and phrasings. A model trained on large volumes of English-language text has seen more NDA variations than any single solicitor.

What works: identifying whether a clause exists, extracting its stated duration, spotting a missing governing law clause, flagging a confidentiality term with no sunset. These are structural and syntactic tasks. GPT-4o handles them at roughly 92–95% accuracy on standard UK NDAs in our testing.

What doesn't work: assessing commercial reasonableness. Whether a two-year non-compete is acceptable depends on your industry, your negotiating position, and what you're trying to protect. Whether unlimited liability is tolerable depends on deal size. The model can tell you the clause is present and what it says — it cannot tell you whether to accept it.

The design principle: use the model for detection and extraction, use a human for judgement. The mistake most teams make is expecting the model to do both.

The 12 NDA clauses worth automating — and the 3 UK red flags that still need a solicitor

These are the clauses our extraction prompt targets. The split reflects three months of validation against 200 real NDAs:

Clause	Automate?	Why
Governing law & jurisdiction	Yes	Presence/absence is binary
Confidentiality duration	Yes	Extract term length, flag perpetual
Definition of Confidential Information	Yes	Flag overly broad or vague definitions
Permitted disclosures	Yes	Check standard carve-outs are present
Return/destruction of information	Yes	Flag absence or vague timelines
Automatic renewal clause	Yes	Flag renewal without active notice
Non-disclosure scope (one-way vs mutual)	Yes	Structural classification
Data protection/GDPR alignment clause	Yes	Flag absence
Dispute resolution mechanism	Yes	Arbitration vs litigation flag
Notice requirements	Yes	Extract notice period and method
Force majeure	Yes	Flag absence in commercial NDAs
Entire agreement clause	Yes	Presence/absence flag
Liability cap or unlimited liability	No	Needs commercial judgement
IP assignment scope	No	Needs background IP analysis
Non-solicitation breadth	No	Needs sector context

The three that stay with the solicitor aren't excluded because the model can't find them — it can. They're excluded because the decision about what to do with them requires context the model doesn't have.

UK NDAs from larger counterparties frequently include these three clauses in forms that are commercially dangerous for smaller businesses. Here's what to watch for:

Unlimited liability. Many template NDAs include a clause that makes the receiving party liable for all loss arising from a breach, without cap. For a large corporate counterparty, this is boilerplate. For an SME, it's an existential risk. Your solicitor needs to negotiate a cap tied to contract value or insurance coverage.

IP assignment. Some NDAs — particularly those from companies evaluating you as a vendor — contain clauses that assign to them any IP you create in the course of the relationship. This can inadvertently capture product development work you were already doing. The clause needs precise carve-outs for background IP and independently developed inventions.

Non-solicitation breadth. A clause preventing you from hiring anyone who "works for or has worked for" the counterparty in the past three years can effectively block you from hiring in a sector. The breadth of the defined pool needs human negotiation. See the UK compliance guidance on restrictive covenants we've covered separately.

Prompt design for consistent clause extraction across contract formats and law firms

The extraction prompt needs to handle two challenges: NDAs arrive in inconsistent formats (some are three pages, some are twelve; some are scanned PDFs, some are Word-converted text), and different law firms use different clause headings for the same concept.

Here's the prompt structure we use in production:

{
  "model": "gpt-4o",
  "response_format": { "type": "json_object" },
  "messages": [
    {
      "role": "system",
      "content": "You are a UK contract review assistant. Extract the specified clauses from the NDA text provided. Return a JSON object with one key per clause. If a clause is absent, set \"present\" to false and \"verbatim\" and \"summary\" to null. If present, extract the verbatim text and a one-sentence summary. Do not interpret whether the clause is acceptable; only extract and summarise."
    },
    {
      "role": "user",
      "content": "CONTRACT TEXT:\n\n{{contract_text}}\n\nEXTRACT THESE CLAUSES:\n- governing_law\n- confidentiality_duration\n- confidential_information_definition\n- permitted_disclosures\n- return_destruction\n- automatic_renewal\n- disclosure_scope\n- data_protection_alignment\n- dispute_resolution\n- notice_requirements\n- force_majeure\n- entire_agreement\n- unlimited_liability\n- ip_assignment\n- non_solicitation\n\nFor each clause return: {\"present\": true/false, \"verbatim\": \"...\", \"summary\": \"...\", \"flags\": []}"
    }
  ],
  "temperature": 0.1
}

Temperature 0.1 is deliberate: you want near-deterministic extraction, not creative interpretation. A low temperature reduces run-to-run variation but does not guarantee identical output. The flags array is where the model adds shorthand warnings: "PERPETUAL_TERM", "UNLIMITED_LIABILITY", "NO_GOVERNING_LAW". These feed directly into the confidence scoring step.
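
Before the reply reaches the scoring step, it helps to check its shape. The following is a minimal sketch rather than the production code: `EXPECTED_CLAUSES` mirrors the clause list in the prompt, while `parse_extraction` and `ABSENT` are hypothetical names introduced here for illustration. It assumes a missing or null clause should be treated as absent rather than as an error.

```python
import json

# Clause keys requested in the extraction prompt.
EXPECTED_CLAUSES = [
    "governing_law", "confidentiality_duration",
    "confidential_information_definition", "permitted_disclosures",
    "return_destruction", "automatic_renewal", "disclosure_scope",
    "data_protection_alignment", "dispute_resolution",
    "notice_requirements", "force_majeure", "entire_agreement",
    "unlimited_liability", "ip_assignment", "non_solicitation",
]

# Record used when the model returns null or omits a clause (assumption).
ABSENT = {"present": False, "verbatim": None, "summary": None, "flags": []}


def parse_extraction(raw: str) -> dict:
    """Parse the model's JSON reply into one normalised record per clause.

    Raises ValueError if the reply is not valid JSON or not a JSON object.
    """
    data = json.loads(raw)  # JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    clauses = {}
    for name in EXPECTED_CLAUSES:
        record = data.get(name)
        if not isinstance(record, dict):
            # Fresh flags list so records never share mutable state.
            clauses[name] = dict(ABSENT, flags=[])
            continue
        clauses[name] = {
            "present": bool(record.get("present")),
            "verbatim": record.get("verbatim"),
            "summary": record.get("summary"),
            "flags": list(record.get("flags") or []),
        }
    return clauses
```

Normalising here means downstream code can rely on every clause key existing and on `flags` always being a list.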

For scanned documents, we run a pre-processing step through our invoice OCR pipeline before handing off to the LLM — structured text extraction improves clause detection on poor-quality scans.

Confidence scoring and human review queue: the design that makes 80% automation safe

The raw extraction isn't what goes to the solicitor. It goes through a scoring layer that assigns each clause a confidence level and determines whether it needs human review.

CLAUSE_RISK_WEIGHTS = {
    "governing_law": {"absent_risk": "HIGH", "present_confidence": 0.95},
    "confidentiality_duration": {"perpetual_risk": "MEDIUM", "extract_confidence": 0.92},
    "unlimited_liability": {"present_risk": "HIGH", "always_escalate": True},
    "ip_assignment": {"present_risk": "HIGH", "always_escalate": True},
    "non_solicitation": {"present_risk": "HIGH", "always_escalate": True},
    "automatic_renewal": {"present_risk": "MEDIUM", "extract_confidence": 0.88},
}

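def score_extraction(clauses: dict) -> dict:
    """Split extracted clauses into a solicitor review queue and an auto-approved list."""
    review_queue = []
    auto_approved = []
    for clause_name, result in clauses.items():
        weight = CLAUSE_RISK_WEIGHTS.get(clause_name, {})
        # The model may return null for an absent clause, so treat None as not present.
        present = bool(result and result.get("present"))
        if weight.get("always_escalate") and present:
            # High-risk clauses go to the solicitor whenever they appear.
            review_queue.append({"clause": clause_name, "reason": "always_escalate", "verbatim": result["verbatim"]})
        elif not present and weight.get("absent_risk") == "HIGH":
            # A clause that must exist (e.g. governing law) is missing.
            review_queue.append({"clause": clause_name, "reason": "missing_required", "verbatim": None})
        else:
            auto_approved.append(clause_name)
    return {"review_queue": review_queue, "auto_approved": auto_approved}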

The output is what the solicitor sees: a pre-sorted queue with high-risk items at the top, each with the verbatim clause text and a one-sentence flag reason. The 12 automatable clauses that pass cleanly go into an auto_approved log — the solicitor doesn't re-read them.

This converts a 45-minute review into a 12-minute one. The solicitor isn't starting from scratch; they're validating a pre-sorted, context-rich diff.
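
The "high-risk items at the top" ordering can be sketched as a small sort over the review queue. This is an illustration under assumptions: `REASON_PRIORITY` and `sort_review_queue` are hypothetical names, and ranking `always_escalate` ahead of `missing_required` is one plausible choice, not a rule stated in this post.

```python
# Hypothetical ordering: a lower number sorts earlier in the solicitor's queue.
REASON_PRIORITY = {"always_escalate": 0, "missing_required": 1}


def sort_review_queue(review_queue: list[dict]) -> list[dict]:
    """Return a new list ordered by escalation reason, then by clause name."""
    return sorted(
        review_queue,
        # Unknown reasons sort last so nothing is silently dropped.
        key=lambda item: (REASON_PRIORITY.get(item["reason"], 99), item["clause"]),
    )


queue = [
    {"clause": "governing_law", "reason": "missing_required", "verbatim": None},
    {"clause": "non_solicitation", "reason": "always_escalate", "verbatim": "..."},
    {"clause": "ip_assignment", "reason": "always_escalate", "verbatim": "..."},
]
ordered = sort_review_queue(queue)
```

Sorting on a tuple keeps the order stable and predictable, so the solicitor sees the same layout for every contract.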

GDPR-compliant logging of AI-assisted contract decisions: the audit trail you need

Under UK GDPR Article 22 and the ICO's guidance on automated decision-making, solely automated decisions with legal or similarly significant effects on individuals must be explainable, contestable, and logged. We treat every contract review decision, even a "this clause is fine" auto-approval, as if it sits in this category.

The minimum audit log table:

CREATE TABLE contract_review_audit (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    document_hash   CHAR(64) NOT NULL,          -- SHA-256, not the document itself
    model_version   TEXT NOT NULL,              -- e.g. 'gpt-4o-2024-11-20'
    clauses_flagged JSONB NOT NULL,
    clauses_approved JSONB NOT NULL,
    reviewer_id     UUID REFERENCES staff(id),  -- NULL until human signs off
    reviewer_action TEXT,                       -- 'approved', 'escalated', 'rejected'
    reviewed_at     TIMESTAMPTZ
);

-- Append-only history: the application role can INSERT and UPDATE (for sign-off)
-- but can never DELETE or TRUNCATE rows
REVOKE DELETE, TRUNCATE ON contract_review_audit FROM app_user;

Store the document hash, not the document. The document lives in your secure document store (S3 with server-side encryption, or SharePoint with appropriate access controls). The audit table proves that a review happened and who signed off — it doesn't need to contain the NDA text.
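
A minimal sketch of building one audit row from the document bytes and the scoring output, using only the standard library. `build_audit_row` is a hypothetical helper, and the choice to log clause names and reasons while dropping the verbatim clause text is our assumption, made so the row holds no NDA content.

```python
import hashlib
from datetime import datetime, timezone


def build_audit_row(document_bytes: bytes, model_version: str, scored: dict) -> dict:
    """Build one contract_review_audit row without storing any NDA text."""
    return {
        "created_at": datetime.now(timezone.utc).isoformat(),
        # A SHA-256 hex digest is 64 characters, matching the CHAR(64) column.
        "document_hash": hashlib.sha256(document_bytes).hexdigest(),
        "model_version": model_version,
        # Keep clause name and reason only; the verbatim text stays in the document store.
        "clauses_flagged": [
            {"clause": item["clause"], "reason": item["reason"]}
            for item in scored["review_queue"]
        ],
        "clauses_approved": list(scored["auto_approved"]),
        # Reviewer fields stay empty until a human signs off.
        "reviewer_id": None,
        "reviewer_action": None,
        "reviewed_at": None,
    }
```

Hashing the raw file bytes rather than the extracted text means the hash can later be checked against the stored original.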

Retain records for the duration of the contract plus six years (the limitation period for simple contracts under the Limitation Act 1980; twelve years if the NDA was executed as a deed), and document that retention period in line with ICO guidance on storage limitation.
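
The retention rule is simple date arithmetic, sketched here as a hypothetical helper `retention_until`. Its `limitation_years` parameter defaults to the six years quoted above, and rolling 29 February forward to 1 March is our assumption, chosen so the record is kept slightly longer rather than slightly shorter.

```python
from datetime import date


def retention_until(contract_end: date, limitation_years: int = 6) -> date:
    """Earliest date an audit record may be deleted: contract end plus the limitation period."""
    target_year = contract_end.year + limitation_years
    try:
        return contract_end.replace(year=target_year)
    except ValueError:
        # 29 February does not exist in a non-leap target year; roll forward to 1 March.
        return date(target_year, 3, 1)


keep_until = retention_until(date(2026, 3, 31))  # date(2032, 3, 31)
```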

Model choice: GPT-4o vs Claude vs local models for confidential NDA documents

The honest comparison:

Model	Accuracy (12-clause NDA)	Cost per NDA	Confidentiality	Structured output
GPT-4o (API)	~93%	£0.018–0.025	API terms; ZDR available	Native JSON mode
Claude 3.7 Sonnet (API)	~91%	£0.012–0.020	API terms; enterprise DPA	Strong JSON adherence
Mistral 7B (local, Ollama)	~78%	~£0.001 (compute)	Full — never leaves infra	Requires prompt tuning
Llama 3.1 8B (local, Ollama)	~80%	~£0.001 (compute)	Full — never leaves infra	Moderate JSON adherence

For most UK SMEs, GPT-4o via the OpenAI API with Zero Data Retention is the right default. The accuracy advantage is significant, the cost is trivial, and ZDR means your NDA text isn't retained post-request.

For NDAs that contain acquisition terms, employee salary data, or material non-public information, run local. The 15-percentage-point accuracy gap between GPT-4o and Mistral 7B (13 points against Llama 3.1 8B) is acceptable when the alternative is sending sensitive commercial information to a cloud provider, even one with good data handling policies. Quantitative benchmarking from LegalBench (Guha et al., 2023) shows that frontier models substantially outperform smaller open-source models on contract understanding tasks, but the gap narrows as models improve.

The counterpoint: there is a credible argument that accuracy gains from cloud models aren't worth the data risk, regardless of retention policies. The 2024 ICO regulatory sandbox findings on AI in legal services suggest practitioners should apply data minimisation — send only what is necessary. Strip party names and sensitive specifics before API submission; it satisfies both accuracy and confidentiality requirements.
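
As a rough illustration of stripping party names before API submission, the sketch below swaps in a plain regular-expression pass over names you already know, rather than the local-model redaction described earlier. `redact_parties` is a hypothetical helper; it does not catch amounts, addresses, or names you did not list.

```python
import re


def redact_parties(text: str, party_names: list[str]) -> tuple[str, dict]:
    """Replace each known party name with a stable placeholder.

    Returns the redacted text and a placeholder-to-name mapping. The mapping
    stays on your own infrastructure so results can be re-identified locally.
    """
    mapping = {}
    # Longest names first so "Acme Holdings Ltd" is replaced before "Acme".
    ordered = sorted(party_names, key=len, reverse=True)
    for index, name in enumerate(ordered, start=1):
        placeholder = f"[PARTY_{index}]"
        pattern = re.compile(re.escape(name), flags=re.IGNORECASE)
        text, count = pattern.subn(placeholder, text)
        if count:
            mapping[placeholder] = name
    return text, mapping


redacted, mapping = redact_parties(
    "This Agreement is between Acme Ltd and Beta Systems Limited. ACME LTD shall keep it secret.",
    ["Acme Ltd", "Beta Systems Limited"],
)
```

Because clause detection depends on the clause wording rather than on who the parties are, placeholders of this kind should leave the extraction task largely unchanged, though that is worth confirming on your own test set.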

What changed in 2025–2026: native PDF APIs and structured output extraction

Two developments materially changed what this flow looks like:

Native PDF ingestion. OpenAI's GPT-4o and Anthropic's Claude now accept PDF uploads directly via their APIs, eliminating the preprocessing step for cleanly produced NDAs. Previously, the standard pipeline required a PDF-to-text extraction step (pdfplumber, PyMuPDF, or an OCR layer for scanned documents) before the text could be passed to the model. As of the GPT-4o file upload API (released late 2024), you can pass the PDF binary directly and the model handles extraction internally. For scanned documents or image-heavy PDFs, an explicit OCR step via our human-in-the-loop OCR pipeline still produces better results — the native PDF handling struggles with multi-column layouts and poor scan quality.

Structured outputs. OpenAI's Structured Outputs feature (August 2024) enforces JSON schema compliance at the model level, eliminating the category of failures where the model returns malformed JSON or invents keys not in the schema. Before structured outputs, roughly 3–5% of extractions failed JSON parsing and required a retry loop. That number is now effectively zero for well-specified schemas. The clause extraction prompt above uses response_format: json_object — upgrading to a full JSON Schema with structured outputs would give you schema-enforced field types and null handling, which matters at production volume.
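
A sketch of what the upgraded `response_format` could look like for the clause list used in this post. The schema name `nda_clause_extraction` and the Python variable names are our own; the overall shape (`type: json_schema`, `strict`, every property listed in `required`, `additionalProperties` set to false) follows our understanding of OpenAI's strict mode and should be checked against the current documentation before use.

```python
import json

CLAUSE_KEYS = [
    "governing_law", "confidentiality_duration",
    "confidential_information_definition", "permitted_disclosures",
    "return_destruction", "automatic_renewal", "disclosure_scope",
    "data_protection_alignment", "dispute_resolution",
    "notice_requirements", "force_majeure", "entire_agreement",
    "unlimited_liability", "ip_assignment", "non_solicitation",
]

# One record per clause. Nullable fields use a type union because strict mode
# expects every property to be listed in "required".
CLAUSE_SCHEMA = {
    "type": "object",
    "properties": {
        "present": {"type": "boolean"},
        "verbatim": {"type": ["string", "null"]},
        "summary": {"type": ["string", "null"]},
        "flags": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["present", "verbatim", "summary", "flags"],
    "additionalProperties": False,
}

# Drop-in replacement for {"type": "json_object"} in the request body.
RESPONSE_FORMAT = {
    "type": "json_schema",
    "json_schema": {
        "name": "nda_clause_extraction",  # hypothetical schema name
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {key: CLAUSE_SCHEMA for key in CLAUSE_KEYS},
            "required": CLAUSE_KEYS,
            "additionalProperties": False,
        },
    },
}

request_fragment = json.dumps({"response_format": RESPONSE_FORMAT}, indent=2)
```

With a schema like this, an absent clause comes back as `present: false` with null text fields instead of a missing key, which removes one branch from the scoring code.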

For document classification before the NDA reaches the extraction step — to distinguish NDAs from other contract types — see our document classification with vision models post, which covers the routing layer that sits upstream of this flow.

Failure modes: what goes wrong in production

Good (supplier NDA, SaaS client). The flow correctly flags a perpetual confidentiality term in a supplier NDA that the Manchester SaaS founder was about to sign. The solicitor negotiates it down to five years. The flag took eight seconds and cost £0.02. Without the automated triage, this clause would have sat buried in a 12-page document waiting for a solicitor slot.

Good (acquisition NDA, professional services firm). A professional services firm came to us while reviewing NDAs for a potential acquisition. The target company's NDA contained a broad IP assignment clause: anything created "in connection with discussions" was assigned to the counterparty. The LLM contract review pipeline flagged it in the ip_assignment field with the flag IP_ASSIGNMENT_BROAD. The firm's solicitor spotted that the clause covered background IP and negotiated a carve-out before any confidential technical discussions happened. Total triage cost: £0.02. Solicitor time saved on the initial read: 25 minutes.

Bad: The model misidentifies a limitation of liability clause as "absent" because the clause is embedded in a schedule rather than the main body, and the extraction prompt doesn't instruct the model to read schedules. The fix is simple — add "including all schedules, annexes, and appendices" to the system prompt — but it requires a test case to surface.

Ugly: A scanned NDA arrives as a 200 DPI TIFF-converted PDF. The PDF-to-text extraction produces garbled output. The model extracts with 55% accuracy. The confidence scoring correctly flags most clauses as low-confidence, so they all go to the review queue — but the solicitor now has to review the whole document, which is worse than the baseline. Solution: add a document quality check upstream. If the extraction confidence across all clauses is below 0.6, reject and request a clean copy. We cover the quality detection logic in our document classification post.
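
The upstream quality check could be sketched as follows. This is an assumption-laden illustration: it reads "confidence across all clauses" as the mean of per-clause confidence scores, and it assumes those scores are already available as a dictionary. `passes_quality_gate` and `MIN_MEAN_CONFIDENCE` are hypothetical names.

```python
from statistics import mean

MIN_MEAN_CONFIDENCE = 0.6  # threshold quoted in the text above


def passes_quality_gate(confidences: dict[str, float],
                        threshold: float = MIN_MEAN_CONFIDENCE) -> bool:
    """Return True when the mean per-clause confidence meets the threshold.

    An empty result counts as a failure, since nothing usable was extracted.
    """
    if not confidences:
        return False
    return mean(confidences.values()) >= threshold


clean_scan = {"governing_law": 0.9, "force_majeure": 0.5}
poor_scan = {"governing_law": 0.55, "force_majeure": 0.5}
accept_clean = passes_quality_gate(clean_scan)
accept_poor = passes_quality_gate(poor_scan)
```

Rejecting at this point sends a "please supply a clean copy" request back to the sender instead of pushing an unreadable document into the solicitor's queue.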

The broader pattern: failure modes cluster around edge cases in input quality, not the model's ability to understand legal language. Invest in the input pipeline.

Related Reading

Document Classification with Vision Models: Production Patterns

OCR with human-in-the-loop: shipping 99% accuracy in production