A logistics company's ops team fielded 120 Slack messages a week about the same 14 questions: holiday entitlement, expense claim limits, IT request process, vehicle inspection checklist, what to do about a customer complaint. Every answer was in a 60-page PDF staff handbook. Nobody read it. The ops manager spent six hours a week answering the same questions repeatedly, often to the same people.
A knowledge agent over that handbook — and four other policy documents — cut the Slack message volume to 18 per week in the first month. The ops manager reclaimed six hours. The questions that still came through Slack were genuinely novel ones the agent correctly escalated.
This is how you build it without six months of ML engineering.
Why staff don't read handbooks — and what to do about it
The handbook problem isn't a content problem. The policies are usually clear. The problem is retrieval: finding the relevant policy at the moment you need it requires opening a PDF, scanning a table of contents, Ctrl+F-ing for something, and hoping the answer is where you expect it to be. Most people don't bother; they ask a colleague instead.
A knowledge agent solves the retrieval problem. The staff member asks "how many days can I carry over?" in natural language. The agent retrieves the relevant clause from the holiday policy, synthesises a direct answer, and cites the source. Total time: under 3 seconds. The answer is also more reliable than asking a colleague who might remember the old policy.
The important boundary: knowledge agents are good at factual retrieval and synthesis. They're not good at novel interpretations, edge cases requiring judgement, or situations where the policy is genuinely ambiguous. Design the escalation path — "this is outside what I can answer clearly, please speak to [HR contact]" — as carefully as the answer path.
Choosing your document corpus: what goes in, what stays out
The temptation is to index everything. Don't. Index too many documents and retrieval precision degrades — the agent surfaces tangentially relevant chunks from five documents instead of the directly relevant chunk from one.
Practical corpus design for an SME knowledge agent:
- In: HR policy documents, IT request processes, expense and finance policies, compliance guides, product/service SOPs, customer-facing FAQ source material.
- Out: Email threads, Slack exports, informal meeting notes, anything that hasn't been reviewed and approved as official policy. Indexing informal documents lets the agent retrieve and cite outdated or incorrect information as if it were policy.
- Boundary cases: Document versions. Always index only the current version of a policy. If you have an archive folder with old versions, exclude it. Version confusion is one of the most common causes of knowledge agent errors.
For document preparation: remove or anonymise any real personal data (names, NI numbers, salary figures) from documents before indexing. The knowledge agent's context window is not a GDPR-controlled environment in the same way your HRMS is.
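As a rough illustration of that preparation step, the sketch below masks NI numbers and salary figures with regular expressions before a document is indexed. The patterns and the `redact_personal_data` name are assumptions made for this example, deliberately simplified, and they will miss some formats. Names cannot be caught reliably this way and still need a manual pass or a dedicated tool.

```python
import re

# Simplified NI number shape: two letters, six digits, one suffix letter A-D.
NI_NUMBER = re.compile(r"\b[A-Z]{2}\s?\d{2}\s?\d{2}\s?\d{2}\s?[A-D]\b")

# Only amounts that directly follow the word "salary" are masked, so that
# policy limits such as "claims over £50" stay intact.
SALARY = re.compile(r"(salary\s*(?:of|:)?\s*)£\s?\d+(?:,\d{3})*(?:\.\d{2})?", re.IGNORECASE)


def redact_personal_data(text: str) -> str:
    """Mask NI numbers and salary figures with placeholder tokens."""
    text = NI_NUMBER.sub("[NI NUMBER REMOVED]", text)
    text = SALARY.sub(r"\g<1>[AMOUNT REMOVED]", text)
    return text


sample = "NI number AB 12 34 56 C, salary: £32,500. Claims over £50 need approval."
print(redact_personal_data(sample))
```

Running the redaction as a separate step before chunking means the original text never reaches the index or the prompt.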
Chunking strategy: where most RAG implementations break
Chunking — how you split documents into retrievable pieces — is the single most impactful technical decision in a RAG system, and the one most teams get wrong.
| Strategy | Description | Retrieval quality | Complexity |
|---|---|---|---|
| Fixed-size (512 tokens) | Split every N tokens regardless of content | Poor — splits mid-sentence, mid-clause | Low |
| Sentence-window | Chunk on sentence boundaries, ±2 sentence context | Good for dense prose | Moderate |
| Semantic chunking | Embed sentences, split where embedding distance spikes | Best for varied document types | High |
| Hierarchical | Chunk at section level + paragraph level; retrieve at paragraph, answer at section | Best for structured policy docs | High |
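To make the sentence-window row in the table concrete, here is a minimal sketch under simple assumptions: sentences are split with a naive regular expression, and each sentence becomes one retrievable chunk that carries its neighbours as context. The function name and the dictionary keys are illustrative, not taken from any library.

```python
import re


def sentence_window_chunks(text: str, window: int = 2) -> list[dict]:
    """One chunk per sentence, with `window` neighbouring sentences on each side."""
    # Naive splitter: breaks after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks = []
    for i, sentence in enumerate(sentences):
        start = max(0, i - window)
        end = min(len(sentences), i + window + 1)
        chunks.append({
            "content": sentence,                        # the text that gets embedded
            "context": " ".join(sentences[start:end]),  # the text passed to the LLM
            "sentence_index": i,
        })
    return chunks
```

Embedding the single sentence keeps matching precise, while the wider window gives the LLM enough surrounding text to answer from.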
For a corpus of structured policy documents — as opposed to unstructured prose — hierarchical chunking consistently outperforms fixed-size. Policy documents have a natural structure: document → section → subsection → clause. Preserving that structure in the chunk metadata means the retrieval system can fetch the relevant clause and provide the surrounding section as context, which reduces hallucination significantly.
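A minimal chunker in Python that records the document hierarchy in each chunk's metadata:
```python
# In recent LangChain releases this class lives in the langchain_text_splitters package.
from langchain.text_splitter import RecursiveCharacterTextSplitter


def chunk_policy_doc(text: str, doc_title: str, section: str) -> list[dict]:
    """Split one section of a policy document into chunks tagged with their origin."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=400,     # measured in characters by default, not tokens
        chunk_overlap=80,   # overlap keeps a clause attached to its lead-in
        separators=["\n\n", "\n", ".", " "],  # prefer paragraph breaks, then finer splits
    )
    chunks = splitter.split_text(text)
    return [
        {
            "content": chunk,
            "metadata": {
                "doc_title": doc_title,
                "section": section,
                "chunk_index": i,
                "char_count": len(chunk),
            },
        }
        for i, chunk in enumerate(chunks)
    ]
```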
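The metadata fields (doc_title, section, chunk_index) are what your metadata filters will use at retrieval time. To filter by category, as the next section does, add a category field to the same metadata dictionary when you chunk.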
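The "retrieve at paragraph, answer at section" idea from the table can be sketched as a parent lookup: paragraphs are indexed for retrieval, and each hit is swapped for its full section before the text goes to the LLM. This is an illustrative sketch; `build_hierarchy`, `expand_to_sections`, and the `parent_id` field are names assumed for the example, and sections are assumed to arrive as a mapping from section title to section text.

```python
def build_hierarchy(doc_title: str, sections: dict[str, str]) -> tuple[dict, list[dict]]:
    """Return (parents, children): full sections by ID, and paragraph-level chunks."""
    parents = {}    # section_id -> full section text, used when answering
    children = []   # paragraph chunks, the units that get embedded and retrieved
    for section_name, section_text in sections.items():
        section_id = f"{doc_title}::{section_name}"
        parents[section_id] = section_text
        paragraphs = [p.strip() for p in section_text.split("\n\n") if p.strip()]
        for i, paragraph in enumerate(paragraphs):
            children.append({
                "content": paragraph,
                "metadata": {
                    "doc_title": doc_title,
                    "section": section_name,
                    "parent_id": section_id,
                    "chunk_index": i,
                },
            })
    return parents, children


def expand_to_sections(hits: list[dict], parents: dict) -> list[str]:
    """Replace retrieved paragraphs with their parent sections, deduplicated in rank order."""
    seen = []
    for hit in hits:
        parent_id = hit["metadata"]["parent_id"]
        if parent_id not in seen:
            seen.append(parent_id)
    return [parents[parent_id] for parent_id in seen]
```

Deduplicating by parent matters because several top-ranked paragraphs often come from the same section, and sending that section twice only wastes context.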
Metadata filters: routing questions to the right document section
Without metadata filters, a query about expense limits retrieves chunks from every document that mentions money — which, in a typical SME corpus, includes the employment contract, the GDPR policy, the finance SOP, and the client contract template. With filters, the same query can be restricted to documents tagged category: finance or department: finance.
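The filter is applied in the same query as the vector search, so only chunks from the matching category ever reach the LLM:
```python
import os
import numpy as np
import psycopg2
from pgvector.psycopg2 import register_vector

DATABASE_URL = os.environ["DATABASE_URL"]  # assumption: set in the environment

def retrieve_with_filter(query_embedding, category: str, top_k: int = 5):
    """Return the top_k chunks in one category as (content, metadata, similarity)."""
    # register_vector adapts NumPy arrays to pgvector's vector type.
    embedding = np.asarray(query_embedding, dtype=np.float32)
    conn = psycopg2.connect(DATABASE_URL)
    try:
        register_vector(conn)
        with conn.cursor() as cur:
            # <=> is cosine distance, so 1 - distance is cosine similarity.
            cur.execute("""
                SELECT content, metadata, 1 - (embedding <=> %s) AS similarity
                FROM document_chunks
                WHERE metadata->>'category' = %s
                ORDER BY embedding <=> %s
                LIMIT %s
            """, (embedding, category, embedding, top_k))
            return cur.fetchall()
    finally:
        conn.close()
```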
For a policy knowledge agent, you typically don't need the user to specify the category explicitly — the agent can classify the query first ("this is a finance query"), then apply the corresponding filter. This two-step retrieval keeps precision high without requiring the user to know your document taxonomy.
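The classification step can start very simply. The sketch below uses keyword overlap as a stand-in for the classifier; in practice you might use an LLM call instead. The category names, keyword lists, and `classify_query` function are all assumptions for illustration. The returned category is what you would pass as the `category` argument of `retrieve_with_filter`, and a `None` result signals that the query should fall back to an unfiltered search.

```python
import re
from typing import Optional

# Hypothetical taxonomy: adjust the categories and keywords to your own corpus.
CATEGORY_KEYWORDS = {
    "finance": {"expense", "expenses", "claim", "receipt", "mileage", "invoice"},
    "hr": {"holiday", "leave", "sick", "carry", "entitlement", "maternity"},
    "it": {"laptop", "password", "vpn", "software", "access"},
}


def classify_query(query: str) -> Optional[str]:
    """Return the category with the most keyword hits, or None if nothing matches."""
    words = set(re.findall(r"[a-z]+", query.lower()))
    scores = {cat: len(words & keywords) for cat, keywords in CATEGORY_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Returning `None` rather than guessing avoids filtering a query into the wrong category, which would hide the right chunk entirely.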
For a deeper comparison of vector retrieval strategies — including when keyword search outperforms vector search on structured documents — see Document RAG: When Vector Search Beats Keyword Search.
Retrieval evaluation: how to know if the agent is actually right
Most knowledge agent evaluations stop at "does the agent answer the question?" The more important question is "does the agent answer it correctly?" For a policy agent making factual claims about HR entitlements or compliance requirements, incorrect answers are a serious operational risk.
Evaluation framework:
- Retrieval precision: given a test question, are the top-5 retrieved chunks actually relevant to the answer? Measure manually for a representative sample of 50 queries. Target: >80% of retrieved chunks should be relevant.
- Answer faithfulness: does the generated answer accurately reflect the retrieved chunks? Use an LLM-as-judge setup (Ragas is a widely used open-source framework) to evaluate whether the answer contradicts or fabricates content relative to the context.
- Citation accuracy: does the agent cite a document and section that actually contains the claimed information? Spot-check 10% of answers weekly.
Build a small golden dataset of 100 question-answer pairs from your actual policies — answers verified by HR or the relevant department head — and run this evaluation suite before any significant update to the corpus or the retrieval configuration.
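The retrieval-precision part of that suite is straightforward to automate once the golden dataset records which chunks are relevant to each question. A minimal sketch, assuming each golden row is a dictionary with `question` and `relevant_ids` keys and that `retrieve` is any function returning ranked chunk IDs (both are assumptions for this example):

```python
def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of the top-k retrieved chunks that are marked relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for chunk_id in top_k if chunk_id in relevant_ids) / len(top_k)


def evaluate_retrieval(golden: list[dict], retrieve, k: int = 5) -> float:
    """Mean precision@k across the golden dataset."""
    scores = [
        precision_at_k(retrieve(row["question"]), set(row["relevant_ids"]), k)
        for row in golden
    ]
    return sum(scores) / len(scores) if scores else 0.0
```

Comparing the mean against the 80% target before and after a corpus or configuration change shows whether the change helped or hurt retrieval.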
Hallucination guardrails: what to do when the agent doesn't know
Three complementary guardrails, used together:
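Retrieval confidence threshold. Before passing context to the LLM, check the similarity score of the top retrieved chunk. If it's below 0.6 (cosine similarity, where 1 means the vectors point the same way), the retrieved content is likely not directly relevant to the query. Treat 0.6 as a starting point: similarity scores vary between embedding models, so calibrate the threshold against your golden dataset. When the score falls below the threshold, return a fallback response rather than attempting to generate from weak context: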
"I couldn't find a clear policy on this. Please speak to [HR contact] or check the [policy document name] directly."
Citation requirement in the prompt. Instruct the LLM to prefix every factual claim with its source: "According to the Expense Policy (Section 3.2)..." This forces grounding in the retrieved context and makes verification trivial.
Human escalation path. For answers the agent gives with low confidence, which are flagged internally but still shown to the user, add a "Was this helpful?" mechanism. Negative feedback triggers a Slack notification to the relevant department owner. This creates a feedback loop that surfaces gaps in your document corpus: answers that are consistently unhelpful usually point to missing or outdated policies.
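The three guardrails fit together in a few small functions. The sketch below assumes retrieved rows are `(content, metadata, similarity)` tuples ordered best-first, as the retrieval query earlier returns them, and that `generate` is whatever callable sends a prompt to your LLM. The function names and the prompt wording are assumptions for illustration, and delivery of the escalation message to Slack is left out.

```python
from typing import Optional

FALLBACK = (
    "I couldn't find a clear policy on this. "
    "Please speak to [HR contact] or check the [policy document name] directly."
)


def gate_on_confidence(rows: list[tuple], threshold: float = 0.6) -> list[tuple]:
    """Keep the rows only if the top hit clears the similarity threshold."""
    if not rows or rows[0][2] < threshold:
        return []
    return rows


def build_prompt(question: str, rows: list[tuple]) -> str:
    """Label each excerpt with its source and instruct the model to cite it."""
    excerpts = "\n\n".join(
        f"[{meta['doc_title']}, {meta['section']}]\n{content}"
        for content, meta, _similarity in rows
    )
    return (
        "Answer the staff question using only the policy excerpts below.\n"
        "Begin every factual claim with its source, for example "
        "'According to the Expense Policy (Section 3.2)...'.\n"
        "If the excerpts do not answer the question, say so and do not guess.\n\n"
        f"Policy excerpts:\n{excerpts}\n\nQuestion: {question}"
    )


def answer_or_fallback(question: str, rows: list[tuple], generate,
                       threshold: float = 0.6) -> str:
    """Generate from context when retrieval is confident, else return the fallback."""
    context = gate_on_confidence(rows, threshold)
    if not context:
        return FALLBACK
    return generate(build_prompt(question, context))


def escalation_message(question: str, answer: str, helpful: bool) -> Optional[str]:
    """Text to send to the department owner when feedback is negative."""
    if helpful:
        return None
    return f"Unhelpful answer reported.\nQuestion: {question}\nAnswer given: {answer}"
```

Keeping the gate outside the prompt means a weak retrieval never reaches the LLM at all, so the model has no chance to improvise from irrelevant context.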
What changed in 2025–2026: hybrid retrieval and reranking
The practical shift in production RAG in 2025 was the normalisation of hybrid retrieval — combining dense vector search with sparse BM25 keyword search — and the addition of a reranking step before LLM generation.
Pure vector search handles semantic similarity well but struggles with exact-match queries (policy numbers, specific role titles, exact monetary amounts). BM25 handles exact matches well but misses semantic variants. Hybrid retrieval combines both. Elasticsearch supports it natively, fusing BM25 results with dense vector or ELSER sparse-vector results; on Postgres, pgvector can be paired with built-in full-text search or a separate BM25 extension.
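One common way to merge the dense and BM25 result lists is reciprocal rank fusion (RRF). The article does not prescribe a fusion method, so treat this as one option among several. The sketch assumes each retriever returns a ranked list of chunk IDs; the constant `k = 60` is the conventional default and can be tuned.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each list contributes 1 / (k + rank) to a chunk's score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)


dense_hits = ["a", "b", "c"]    # from vector search
sparse_hits = ["c", "a", "d"]   # from BM25
print(reciprocal_rank_fusion([dense_hits, sparse_hits]))
```

RRF works on ranks alone, so it needs no score normalisation between the two retrievers, whose raw scores are not on comparable scales.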
The reranker, a cross-encoder model that scores each retrieved chunk against the query directly rather than relying on embedding approximation, typically improves answer faithfulness by 10–15 percentage points over retrieval without reranking. Cohere's Rerank API is the fastest way to add this to an existing RAG pipeline. For open source, cross-encoder/ms-marco-MiniLM-L-6-v2 on Hugging Face is a common choice.
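The reranking step itself is small: score every (query, chunk) pair, sort, and keep the best few. The sketch below takes the scorer as a parameter so it runs without downloading a model; `rerank` and `keyword_overlap` are names made up for this example, and the toy scorer is only a stand-in for a real cross-encoder.

```python
from typing import Callable, Sequence


def rerank(query: str, chunks: list[str],
           score_pairs: Callable[[list[tuple[str, str]]], Sequence[float]],
           top_n: int = 3) -> list[str]:
    """Score each (query, chunk) pair and return the top_n chunks, best first."""
    pairs = [(query, chunk) for chunk in chunks]
    scores = score_pairs(pairs)
    ranked = sorted(zip(chunks, scores), key=lambda item: item[1], reverse=True)
    return [chunk for chunk, _score in ranked[:top_n]]


def keyword_overlap(pairs: list[tuple[str, str]]) -> list[int]:
    """Toy scorer: number of words the query and the chunk share."""
    return [len(set(q.lower().split()) & set(c.lower().split())) for q, c in pairs]


# With sentence-transformers installed, a real cross-encoder can be plugged in:
#   from sentence_transformers import CrossEncoder
#   model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
#   best = rerank(query, chunks, model.predict)
```

Because the reranker only sees the chunks that first-stage retrieval returned, it can reorder candidates but cannot recover a relevant chunk that was never retrieved.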
The counterargument: reranking adds latency (150–400 ms, depending mainly on how many candidate chunks are scored and on the model) and cost. For a staff Q&A agent with a corpus under 1,000 documents, the precision gain may not be worth the added latency, so test on your specific corpus before committing. If your documents are highly structured and consistently chunked, the gains from reranking may be marginal relative to getting the chunking strategy right in the first place.
Good / Bad / Ugly: knowledge agent patterns from real deployments
Good: Hierarchical chunking on 6 structured policy documents. Metadata filters by category. Retrieval confidence threshold at 0.62 — answers below threshold escalate to HR Slack channel. Citation requirement in prompt. 50-question golden dataset evaluated monthly. Answer faithfulness score: 91%. Staff message volume to HR: down 78% in month one. Agent handled 94% of queries without escalation.
Bad: Fixed-size chunking on everything including email exports and informal meeting notes. No metadata filters. The agent confidently cites a "policy" from a three-year-old email thread that was never formalised. Two incorrect answers given to staff about holiday entitlement — one of which led to a complaint.
Ugly: No confidence threshold, no citation requirement, no evaluation. The agent hallucinates a parental leave policy that is more generous than the actual policy. An employee relies on it, the discrepancy surfaces in an employment dispute, and the company's solicitor has to establish that the AI system was not authoritative. Operational and legal pain that took months to resolve.
The chunking and retrieval strategy that works for a policy knowledge agent differs from what works for a technical document corpus. For the retrieval architecture comparison across document types, see Document RAG: When Vector Search Beats Keyword Search. When your knowledge corpus includes scanned or hand-completed forms — common in HR and compliance settings — the preprocessing layer matters before chunking; OCR with Human-in-the-Loop: Shipping 99% Accuracy in Production covers how to get structured text out of those documents reliably. For how OCR feeds structured documents into a knowledge pipeline end-to-end, the Invoice OCR case study shows a production example.
Book a 30-minute scoping call — we'll assess your document corpus and design the agent architecture before you leave.