A finance director emailed us a 247 MB Dropbox folder last quarter. "Just answer questions on this stuff," he said. "My team spends two hours a day digging."
We almost reached for the same thing every blog post recommends: chunk it, embed it, drop it in a vector store, wire up a retrieval-augmented generation prompt, ship it. Three days of work, looks impressive, costs ~£40 a month to run.
Then we actually looked at the folder. 11 PDFs of HMRC correspondence. 38 invoices. 6 lease agreements. 19 supplier statements. About 80 emails saved as PDF. The questions the team needed answering were things like "what was the rent on Unit 4 in 2023?" and "did we ever get a refund from Vendor X for invoice 4521?"
A Postgres LIKE query against the OCR'd text would have answered most of those in 200ms. We'd have built a vector pipeline to make a problem that wasn't a vector-search problem more expensive to solve.
This post is the field guide we wished we had on day one. Where keyword wins. Where vector wins. The bit nobody talks about — metadata filters — that decides most real outcomes. And what changed in 2025–2026 that should make you question parts of the dogma.
How to decide in 30 seconds
Before any tooling, classify your queries:
1. Are queries predicate-shaped (X by vendor, by date, over £N)?
   - YES → extract structure, query SQL. Stop.
   - NO → continue.
2. Does the user already know the right vocabulary?
   - YES → Postgres tsvector. ~50ms, deterministic.
   - NO → continue.
3. Is paraphrase recall genuinely critical?
   - YES → pgvector + reranker. Tune metadata filters first.
   - NO → tsvector. You'll save weeks.
Most SMEs spend 80% of the engineering effort on the wrong branch of this tree. The rest of this post is what each branch looks like in production.
What's actually in the folder
Before any retrieval choice, classify the corpus. Most SME document collections fall into three shapes, and the right retrieval pattern follows from the shape, not from what's trending.
Shape A — Regulated reference. HMRC letters, contracts, supplier agreements, statutory filings. The vocabulary is fixed. Queries tend to be exact: "what's the termination notice period in our office lease?" The answers live in named clauses. Vector search is overkill and sometimes worse, because cosine similarity will happily return a near-duplicate clause from the wrong contract.
Shape B — Operational reference. Internal SOPs, onboarding docs, runbooks, "how do we do X" wikis. The vocabulary is loose. Different staff describe the same thing five ways. Queries are conceptual: "how do we handle a chargeback?" Vector search earns its keep here.
Shape C — Transactional history. Invoices, statements, dispatch notes, customer correspondence. The data is structured even when the format isn't. Queries are predicate-style: "show me all invoices from Vendor X over £5000 in 2024." This is a database query masquerading as a document retrieval problem. Extract the structure (OCR + field detection) and query the structured layer first.
Most SME corpora are 60% Shape C, 30% Shape A, 10% Shape B. The honest answer to "should we build RAG?" is usually "extract structure first, then maybe."
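To make Shape C concrete, here's roughly what the structured layer looks like once the fields have been extracted. A minimal sketch in Python with psycopg2; the `invoices` table and its columns are hypothetical stand-ins for whatever your extraction step produces.

```python
# Minimal sketch: a Shape C question answered as a SQL predicate rather than a
# retrieval problem. Table and column names are hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=docs")

def invoices_for(vendor: str, min_total: float, year: int) -> list[tuple]:
    """'Show me all invoices from Vendor X over £5,000 in 2024' as a WHERE clause."""
    with conn.cursor() as cur:
        cur.execute(
            """
            select invoice_number, invoice_date, total_gbp
            from invoices
            where vendor_name = %s
              and total_gbp > %s
              and extract(year from invoice_date) = %s
            order by invoice_date
            """,
            (vendor, min_total, year),
        )
        return cur.fetchall()
```

Parameterised, indexed, and explainable, which is exactly what the audit-sensitive queries in this shape need.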
When keyword wins
You don't need vector search if any of these are true:
- The user already knows the right vocabulary. Lawyers searching contracts know to look for "indemnity," not "if something goes wrong who pays." Retrieval doesn't need semantic flexibility.
- Queries are predicate-shaped. "Vendor X" + "invoice over £5,000" + "2024" is a `WHERE` clause, not a similarity search. Use Postgres full-text search (tsvector) or even `LIKE` and stop.
- Recall matters more than ranking. If missing a hit costs you (compliance, audit, financial), keyword's deterministic behaviour beats vector's probabilistic one. You can prove what was searched.
- Documents are short and well-structured. A 2-page invoice doesn't benefit from chunking strategy. Treat it as a row.
We've seen teams pay 100x the infrastructure cost to add vector search to corpora where Postgres tsvector would have hit 95% recall in 50ms. The other 5% rarely matters because the user can refine the query.
A second receipt: a 30-person professional services firm came to us convinced they needed an "AI knowledge base" for their internal policy library. The library was 240 documents, and 95% of staff queries were of the form "what's our policy on X?" — exactly the case where titles and headings carry the answer. We built tsvector against title and H1, with a fallback to body. Average query time: 28ms. Total infrastructure cost: ~£12/month. The "AI knowledge base" project that quoted them £25k was solving a problem they didn't have.
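For reference, the weighted full-text setup from that engagement looks roughly like this. Table and column names are illustrative; Postgres's `setweight` is what makes title and heading matches rank above body matches.

```python
# Sketch of a title-weighted tsvector index and query. The `policies` table
# and its columns are illustrative, not a real schema.
import psycopg2

conn = psycopg2.connect("dbname=docs")

DDL = """
alter table policies add column if not exists tsv tsvector
  generated always as (
    setweight(to_tsvector('english', coalesce(title,   '')), 'A') ||
    setweight(to_tsvector('english', coalesce(heading, '')), 'B') ||
    setweight(to_tsvector('english', coalesce(body,    '')), 'D')
  ) stored;
create index if not exists policies_tsv_idx on policies using gin (tsv);
"""

QUERY = """
select title, ts_rank(tsv, websearch_to_tsquery('english', %s)) as rank
from policies
where tsv @@ websearch_to_tsquery('english', %s)
order by rank desc
limit 10;
"""

def search_policies(q: str) -> list[tuple]:
    with conn.cursor() as cur:
        cur.execute(QUERY, (q, q))
        return cur.fetchall()
```

`websearch_to_tsquery` accepts the loose, search-engine-style phrasing staff actually type, which is usually all the "semantic flexibility" this shape of corpus needs.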
When vector wins
Vector search earns its complexity when:
- Queries are paraphrase-prone. Customers asking the same thing 17 ways. Staff describing problems by symptom rather than by name. The retrieval has to bridge vocabulary gaps.
- The corpus is large and unstructured. Tens of thousands of mixed documents where you can't anticipate what will be asked. Hand-tuning a keyword index won't scale.
- You need conceptual matching. "Show me policies about working from abroad" should pull a doc titled "Remote eligibility — non-UK residence" even if the words don't overlap.
- You're stacking it with a reranker. Vector retrieval gets you the top 50; a cross-encoder reranker like Cohere Rerank reorders to the top 5. Without the reranker, vector-only retrieval is often worse than tsvector for precision-critical use cases.
Note what's missing from this list: "we want to use AI." That's not a reason. The reason is always a property of the query distribution and the corpus.
Keyword vs vector vs structured: the comparison nobody publishes
| | Keyword (tsvector) | Vector (pgvector + reranker) | Structured (OCR + SQL) |
|---|---|---|---|
| Best for | Shape A, fixed vocabulary | Shape B, paraphrase-heavy | Shape C, predicate queries |
| Setup cost | Hours | Days to a week | Days (extraction is the work) |
| p95 latency | 20–80ms | 80–250ms (with rerank) | 5–40ms |
| Recall on regulated text | High (deterministic) | Medium–high | High (if extraction is good) |
| Recall on conceptual queries | Low | High | N/A |
| Auditability | Strong (boolean trace) | Weak (cosine ≠ explanation) | Strong (SQL plan) |
| Monthly infra (SME scale) | £10–25 | £35–80 | £15–40 + extraction |
| Failure mode | Misses synonyms | Confidently retrieves wrong doc | Bad extraction = bad answer forever |
The right architecture for most SMEs is a small router: classify the query at run-time and dispatch to the cheapest layer that can answer it. We've found that ~70% of real queries can be answered by tsvector or SQL, and the remaining 30% justify the vector pipeline.
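A sketch of what that router can look like. The heuristics below are illustrative placeholders rather than production rules; in practice we'd tune them against a logged query set.

```python
# Minimal sketch of the run-time router: dispatch each query to the cheapest
# layer that can plausibly answer it. Heuristics are illustrative only.
import re

# crude predicate detector: amounts, years, invoice numbers
PREDICATE = re.compile(r"(£\s?\d|over\s+\d|\b20\d\d\b|\binvoice\s+\d)", re.I)

# terms lifted from the corpus's own vocabulary (contract clauses, doc titles, ...)
KNOWN_VOCAB = {"indemnity", "termination", "notice period", "vat", "rent review"}

def route(query: str) -> str:
    """Return the cheapest layer that can plausibly answer the query."""
    q = query.lower()
    if PREDICATE.search(q):
        return "sql"        # Shape C: predicate-shaped → structured layer
    if any(term in q for term in KNOWN_VOCAB):
        return "tsvector"   # user already knows the right vocabulary
    return "vector"         # paraphrase-prone → pgvector + reranker

# e.g. route("all invoices from Vendor X over £5,000 in 2024")      -> "sql"
#      route("what's the termination notice period in our lease?")  -> "tsvector"
#      route("how do we handle a chargeback?")                      -> "vector"
```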
Metadata filters: the bit nobody benchmarks
If we could only give one piece of advice on document RAG, it would be this: the metadata filter is more important than the embedding model.
Every public RAG benchmark embeds Wikipedia and asks generic questions. Your corpus is not Wikipedia. Your corpus has tenants, customers, document types, departments, dates, and access controls. The query "what's the rent on Unit 4?" should never even consider embeddings from leases for Unit 7.
A working pgvector schema looks like this:
```sql
create extension if not exists vector;

create table chunks (
    id          uuid primary key,
    doc_id      uuid not null,
    doc_type    text not null,  -- 'lease' | 'invoice' | 'sop' | 'email'
    customer_id uuid,
    vendor_id   uuid,
    effective   daterange,
    chunk_text  text not null,
    embedding   vector(1536) not null
);

create index on chunks using hnsw (embedding vector_cosine_ops);
create index on chunks (doc_type, customer_id);
create index on chunks using gist (effective);
```
The query that actually runs in production:
```sql
select chunk_text, embedding <=> $1 as distance
from chunks
where doc_type = any($2)
  and customer_id = $3
  and effective && daterange($4, $5)
order by embedding <=> $1
limit 50;
```
The `where` clause does 80% of the work. The vector ordering does the last 20%. Teams that skip the metadata layer end up with retrieval that's "kind of right" — close cosine match, wrong customer, wrong year, wrong document type. The pgvector docs cover indexing strategy in depth; the part underemphasised in most write-ups is using HNSW alongside btree/gist indexes on filter columns.
Two things to track from day one: chunk-level metadata (not just document-level) so date ranges and access scopes apply at retrieval time, and a denormalised join key so you can filter without query-time joins. Both decisions look unimportant in week one. Both decide whether the system still works in month six.
A reference stack for SME knowledge bases
The setup we ship most often, kept boring on purpose:
- Storage: Postgres + pgvector. One database. No separate vector store unless the corpus crosses ~5M chunks. ~£35/month on a managed Postgres for typical SME volumes.
- Chunking: 512-token chunks with 64-token overlap. Sentence-boundary aware. Title and heading prepended to each chunk so retrieval has context.
- Embeddings: `text-embedding-3-small` (1536 dims) for general docs; `text-embedding-3-large` only when we've measured a lift on the actual query set. Cohere `embed-v3` if multilingual matters.
- OCR (Shape A and C): A vision model — GPT-4o or Claude Sonnet — for layout-aware extraction with a per-doc-type schema, not Tesseract. We've covered our setup for Invoice OCR Processing — same pattern works for leases and statements.
- Reranker: Cohere Rerank or mxbai-rerank-large on the top 50 vector hits. Skip only if the corpus is tiny.
- Retrieval orchestration: A small Node or Python service that takes a query, derives metadata filters from the user's session (tenant, role, scope), runs the SQL above, applies the reranker, and returns the top 5 with citations (a minimal sketch follows this list).
- Generation: GPT-4o-mini or Claude Sonnet on the reranker's top 5. Always with citations back to source documents. Refuse to answer when the retrieval returns weak matches — silence beats a confident hallucination.
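For orientation, here's a compressed sketch of that orchestration step: session-derived filters, the pgvector query from earlier (date filter omitted for brevity), then a Cohere rerank pass. Client setup, model names, and table names are illustrative; API keys are assumed to come from the environment.

```python
# Compressed sketch of the retrieval service: session-scoped filters → pgvector
# candidates → rerank → top 5 with citations. Names and models are illustrative.
import cohere
import psycopg2
from openai import OpenAI

oai = OpenAI()
co = cohere.Client()
conn = psycopg2.connect("dbname=docs")

CANDIDATES_SQL = """
select id, doc_id, chunk_text
from chunks
where doc_type = any(%s) and customer_id = %s
order by embedding <=> %s::vector
limit 50;
"""

def retrieve(query: str, doc_types: list[str], customer_id: str) -> list[dict]:
    # embed the query, then let the WHERE clause scope the candidate set
    q_emb = oai.embeddings.create(model="text-embedding-3-small", input=query).data[0].embedding
    with conn.cursor() as cur:
        cur.execute(CANDIDATES_SQL, (doc_types, customer_id, str(q_emb)))
        rows = cur.fetchall()                      # (id, doc_id, chunk_text)
    ranked = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=[r[2] for r in rows],
        top_n=5,
    )
    # top 5 chunks, each carrying a citation back to its source document
    return [
        {"chunk_id": rows[hit.index][0], "doc_id": rows[hit.index][1],
         "text": rows[hit.index][2], "score": hit.relevance_score}
        for hit in ranked.results
    ]
```

The important property is that tenant and document-type scoping happens in SQL before any similarity maths runs, so a weak embedding can't pull the wrong customer's documents into the candidate set.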
We've shipped this exact stack for a manufacturing client running Document RAG with video avatars and a financial services firm where it sits behind voice agents — see the Voice AI Architecture guide for how the retrieval slots into a sub-1.6s call flow. For UK clients in regulated sectors, the metadata layer also enforces the access controls that UK compliance guidance requires — same plumbing, different policy.
What changed in 2025–2026
Three things moved the goalposts in the last 12 months and are worth tracking:
Long-context models compete with RAG. Claude and Gemini now support 1M+ token contexts. For one-shot questions over small or medium corpora — say, a 200-page contract or a few weeks of meeting notes — you can sometimes skip retrieval entirely and pass the source documents. The original RAG paper was written when context windows were 4k tokens; the architecture trade-off looks different at 1M. RAG still wins on cost, latency, and audit trails for production workloads, but the floor for "do I need RAG at all?" has moved.
Contextual retrieval has become the new default. Anthropic's contextual retrieval write-up demonstrated that prepending a brief LLM-generated context to each chunk before embedding cuts retrieval failure rates by 35–50%. This is the kind of preprocessing change that's free to adopt and shifts the keyword-vs-vector calculus — vector retrieval gets meaningfully better with contextualised chunks, narrowing the gap with hybrid keyword approaches.
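The preprocessing itself is small. A hedged sketch below; the prompt wording and the choice of gpt-4o-mini are ours, not anything prescribed by Anthropic's write-up.

```python
# Sketch of contextual retrieval preprocessing: generate a short, document-aware
# context for each chunk and prepend it before embedding. Prompt and model are illustrative.
from openai import OpenAI

client = OpenAI()

def contextualise(doc_text: str, chunk_text: str) -> str:
    """Ask a cheap model to situate the chunk within its parent document."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "Here is a document:\n" + doc_text[:8000] +
                "\n\nHere is a chunk from it:\n" + chunk_text +
                "\n\nWrite 1-2 sentences situating this chunk within the document, "
                "for search retrieval. Answer with only the context."
            ),
        }],
    )
    return resp.choices[0].message.content.strip()

def embed_chunk(doc_title: str, doc_text: str, chunk_text: str) -> list[float]:
    context = contextualise(doc_text, chunk_text)
    enriched = f"{doc_title}\n{context}\n{chunk_text}"   # title + context prepended
    out = client.embeddings.create(model="text-embedding-3-small", input=enriched)
    return out.data[0].embedding
```

The extra LLM call happens once per chunk at ingestion time, not per query, which is why it's cheap to adopt.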
HNSW indexes shipped in pgvector. Until pgvector 0.5, ivfflat was the only practical index, with painful build times and degraded recall under updates. HNSW indexes (now stable) make pgvector competitive with dedicated vector databases up to several million chunks. The "you'll need a real vector DB" advice from 2023 is largely obsolete for SME workloads.
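For completeness, here's the HNSW index from the schema above with its tuning knobs spelled out. The values shown are pgvector's documented defaults; raising them trades build time and query latency for recall.

```python
# The HNSW index from the schema above with build/search knobs made explicit.
# m and ef_construction affect build time and index quality; hnsw.ef_search
# trades per-query latency for recall. Values shown are pgvector's defaults.
import psycopg2

conn = psycopg2.connect("dbname=docs")
with conn.cursor() as cur:
    cur.execute("""
        create index if not exists chunks_embedding_hnsw
        on chunks using hnsw (embedding vector_cosine_ops)
        with (m = 16, ef_construction = 64);
    """)
    cur.execute("set hnsw.ef_search = 40;")  # raise for better recall, at a latency cost
conn.commit()
```

`hnsw.ef_search` is a session setting, so the retrieval service can raise it per query when recall matters more than latency.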
Good / Bad / Ugly
Good. Tight metadata filters. Citations on every answer. Chunking that respects document structure. A reranker on top of vector retrieval. Refusal when the top match is weak. Contextual chunks if the corpus warrants it.
Bad. Embedding-only retrieval with no metadata filter. Generic chunk sizes that ignore document type. No reranker. Generation prompts that don't constrain to retrieved context. Treating compliance documents the same way as marketing FAQs.
Ugly. Hallucinated answers cited to documents that say something different (always run citation verification — string-match the claim against the cited chunk). PII bleeding across tenants because the metadata filter was set at the wrong layer. Embeddings going stale because nobody re-indexes when source docs change. A chatbot UI that hides the citations because they're "not pretty" — the citations are the product.
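The citation check mentioned above doesn't need to be clever; a naive string-overlap tripwire catches most of the embarrassing cases. A sketch, with threshold and tokenisation that are purely illustrative.

```python
# Naive citation verification: reject an answer sentence whose cited chunk
# doesn't actually contain its key terms. A string-level tripwire, not a
# semantic check, so it only catches the crudest mismatches.
import re

STOPWORDS = {"the", "a", "an", "of", "in", "on", "for", "was", "is", "to", "and"}

def claim_supported(claim: str, cited_chunk: str, min_overlap: float = 0.6) -> bool:
    terms = {t for t in re.findall(r"[a-z0-9£.]+", claim.lower()) if t not in STOPWORDS}
    if not terms:
        return True
    chunk = cited_chunk.lower()
    hits = sum(1 for t in terms if t in chunk)
    return hits / len(terms) >= min_overlap

# e.g. claim_supported("The rent on Unit 4 was £2,400 per month", cited_chunk)
# returns False when the figure and unit number aren't in the cited chunk.
```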
The pattern most SMEs need is simpler than the literature suggests. Classify the corpus. If it's Shape C, extract structure and query SQL. If it's Shape A and small, use Postgres full-text search. Reach for vector search when paraphrase recall genuinely matters, and treat the metadata filter as the headline architecture decision rather than a footnote.
The 247 MB folder ended up needing all three. We extracted invoice fields into a structured table, indexed the lease and HMRC corpus with tsvector, and pointed pgvector at the SOPs and email history. The finance director got his question-answering layer in a week. Two hours a day, returned.