Quantum Automations
Guide · Document Automation

RAG Knowledge Agents for Staff Q&A: Building Over Internal Docs

Published June 2026
Topic Document Automation · Knowledge Agents
Reading time 8 min
For Ops leads managing internal knowledge and back-office workflows
On this page
  1. Why staff don't read handbooks — and what to do about it
  2. Choosing your document corpus: what goes in, what stays out
  3. Chunking strategy: where most RAG implementations break
  4. Metadata filters: routing questions to the right document section
  5. Retrieval evaluation: how to know if the agent is actually right
  6. Hallucination guardrails: what to do when the agent doesn't know
  7. What changed in 2025–2026: hybrid retrieval and reranking
  8. Good / Bad / Ugly: knowledge agent patterns from real deployments
  9. FAQ

A logistics company's ops team fielded 120 Slack messages a week about the same 14 questions: holiday entitlement, expense claim limits, IT request process, vehicle inspection checklist, what to do about a customer complaint. Every answer was in a 60-page PDF staff handbook. Nobody read it. The ops manager spent six hours a week answering the same questions repeatedly, often to the same people.

A knowledge agent over that handbook — and four other policy documents — cut the Slack message volume to 18 per week in the first month. The ops manager reclaimed six hours. The questions that still came through Slack were genuinely novel ones the agent correctly escalated.

This is how you build it without six months of ML engineering.

Why staff don't read handbooks — and what to do about it

The handbook problem isn't a content problem. The policies are usually clear. The problem is retrieval: finding the relevant policy at the moment you need it requires opening a PDF, scanning a table of contents, Ctrl+F-ing for something, and hoping the answer is where you expect it to be. Most people don't bother; they ask a colleague instead.

A knowledge agent solves the retrieval problem. The staff member asks "how many days can I carry over?" in natural language. The agent retrieves the relevant clause from the holiday policy, synthesises a direct answer, and cites the source. Total time: under 3 seconds. The answer is also more reliable than asking a colleague who might remember the old policy.

The important boundary: knowledge agents are good at factual retrieval and synthesis. They're not good at novel interpretations, edge cases requiring judgement, or situations where the policy is genuinely ambiguous. Design the escalation path — "this is outside what I can answer clearly, please speak to [HR contact]" — as carefully as the answer path.

Choosing your document corpus: what goes in, what stays out

The temptation is to index everything. Don't. Index too many documents and retrieval precision degrades — the agent surfaces tangentially relevant chunks from five documents instead of the directly relevant chunk from one.

Practical corpus design for an SME knowledge agent:

  • In: HR policy documents, IT request processes, expense and finance policies, compliance guides, product/service SOPs, customer-facing FAQ source material.
  • Out: Email threads, Slack exports, informal meeting notes, anything that hasn't been reviewed and approved as official policy. Indexing informal documents lets the agent retrieve and cite outdated or incorrect information as if it were policy.
  • Boundary cases: Document versions. Always index only the current version of a policy. If you have an archive folder with old versions, exclude it. Version confusion is one of the most common causes of knowledge agent errors.
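
The in/out rules above can be enforced at ingestion time rather than by convention. The following is a minimal sketch, assuming a hypothetical folder layout in which superseded versions sit under a folder called archive and informal sources under slack_exports, email, or meeting_notes. The folder names, the allowed file types, and the function names are all illustrative assumptions, not part of any particular tool.

```python
from pathlib import PurePosixPath

# Hypothetical folder names and file types; adjust to your own drive layout.
EXCLUDED_FOLDERS = {"archive", "slack_exports", "email", "meeting_notes"}
ALLOWED_SUFFIXES = {".pdf", ".docx", ".md"}


def is_indexable(path: str) -> bool:
    """Return True if the file looks like a current, approved policy document."""
    p = PurePosixPath(path)
    # Every path component except the file name itself counts as a folder.
    folders = {part.lower() for part in p.parts[:-1]}
    if folders & EXCLUDED_FOLDERS:
        return False
    return p.suffix.lower() in ALLOWED_SUFFIXES


def select_corpus(paths: list[str]) -> list[str]:
    """Filter a file listing down to the documents that should be indexed."""
    return [path for path in paths if is_indexable(path)]
```

A path-based rule like this only catches what the folder structure makes visible; a document that sits in the right folder but was never approved still needs a human check.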

For document preparation: remove or anonymise any real personal data (names, NI numbers, salary figures) from documents before indexing. The index is not an access-controlled environment the way your HRMS is: any indexed chunk can surface in an answer to any staff member, which is a GDPR exposure.
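
Part of that clean-up can be automated for identifiers that follow a fixed pattern. The sketch below redacts UK National Insurance numbers and email addresses with regular expressions. It is deliberately narrow: names and salary figures cannot be found reliably by pattern (and a blanket rule on monetary amounts would also strip legitimate expense limits), so those still need manual review. The function name and placeholder strings are assumptions for illustration.

```python
import re

# UK National Insurance number: two letters, six digits, one suffix letter A-D,
# optionally spaced in pairs (for example "QQ 12 34 56 C").
NI_NUMBER = re.compile(r"\b[A-Z]{2}\s?\d{2}\s?\d{2}\s?\d{2}\s?[A-D]\b")
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def redact_identifiers(text: str) -> str:
    """Replace pattern-shaped personal identifiers with placeholders."""
    text = NI_NUMBER.sub("[NI NUMBER REDACTED]", text)
    return EMAIL.sub("[EMAIL REDACTED]", text)
```

Run it on the extracted text before chunking, so redacted values never reach the embedding model or the index.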

Chunking strategy: where most RAG implementations break

Chunking — how you split documents into retrievable pieces — is the single most impactful technical decision in a RAG system, and the one most teams get wrong.

| Strategy | Description | Retrieval quality | Complexity |
| --- | --- | --- | --- |
| Fixed-size (512 tokens) | Split every N tokens regardless of content | Poor: splits mid-sentence, mid-clause | Low |
| Sentence-window | Chunk on sentence boundaries, ±2 sentence context | Good for dense prose | Moderate |
| Semantic chunking | Embed sentences, split where embedding distance spikes | Best for varied document types | High |
| Hierarchical | Chunk at section level + paragraph level; retrieve at paragraph, answer at section | Best for structured policy docs | High |

For a corpus of structured policy documents — as opposed to unstructured prose — hierarchical chunking consistently outperforms fixed-size. Policy documents have a natural structure: document → section → subsection → clause. Preserving that structure in the chunk metadata means the retrieval system can fetch the relevant clause and provide the surrounding section as context, which reduces hallucination significantly.

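A minimal sketch of the paragraph-level half of a hierarchical chunker in Python. Call it once per section so that every chunk records its parent section; note that chunk_size here counts characters, not tokens:

# Current LangChain releases ship the splitters in the langchain-text-splitters package.
from langchain_text_splitters import RecursiveCharacterTextSplitter

def chunk_policy_doc(text: str, doc_title: str, section: str) -> list[dict]:
    """Split one section's text into chunks tagged with their document and section."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=400,     # measured in characters, not tokens
        chunk_overlap=80,   # overlap so a clause cut at a boundary appears in both chunks
        # Prefer paragraph, then line, then sentence breaks. ". " rather than "."
        # avoids splitting inside references such as "Section 3.2".
        separators=["\n\n", "\n", ". ", " "],
    )
    chunks = splitter.split_text(text)
    return [
        {
            "content": chunk,
            "metadata": {
                "doc_title": doc_title,
                "section": section,
                "chunk_index": i,
                "char_count": len(chunk),
            },
        }
        for i, chunk in enumerate(chunks)
    ]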

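The metadata fields (doc_title, section, chunk_index) let the agent cite its source and pull in the surrounding section at retrieval time. To filter by category as well, add a category field to the same metadata at ingestion.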
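
The "retrieve at paragraph, answer at section" half of hierarchical chunking can be sketched without any library. The example below assumes the document has already been split into sections held in a dict of title to text; build_section_index and expand_to_sections are hypothetical helper names, and the paragraph split on blank lines is a simplification.

```python
def build_section_index(sections: dict[str, str]) -> tuple[list[dict], dict[str, str]]:
    """Split each section into paragraph chunks that point back to their section."""
    chunks = []
    for section_title, section_text in sections.items():
        paragraphs = [p.strip() for p in section_text.split("\n\n") if p.strip()]
        for i, paragraph in enumerate(paragraphs):
            chunks.append(
                {"content": paragraph, "section": section_title, "chunk_index": i}
            )
    return chunks, dict(sections)


def expand_to_sections(hits: list[dict], section_texts: dict[str, str]) -> list[str]:
    """Swap matched paragraph chunks for their parent sections, in rank order."""
    seen: list[str] = []
    for hit in hits:
        # Two paragraphs from the same section should not duplicate the context.
        if hit["section"] not in seen:
            seen.append(hit["section"])
    return [section_texts[title] for title in seen]
```

The paragraph chunks are what gets embedded and searched; the expanded sections are what gets passed to the LLM, so the model sees the clause together with its surrounding conditions and exceptions.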

Metadata filters: routing questions to the right document section

Without metadata filters, a query about expense limits retrieves chunks from every document that mentions money — which, in a typical SME corpus, includes the employment contract, the GDPR policy, the finance SOP, and the client contract template. With filters, the same query can be restricted to documents tagged category: finance or department: finance.

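The filter is applied inside the retrieval query, so only matching chunks are ranked and passed on to the LLM:

import os

import numpy as np
import psycopg2
from pgvector.psycopg2 import register_vector

# Assumes the connection string is provided via the environment.
DATABASE_URL = os.environ["DATABASE_URL"]

def retrieve_with_filter(query_embedding, category: str, top_k: int = 5):
    # The pgvector adapter for psycopg2 expects a NumPy array, not a plain list.
    vec = np.asarray(query_embedding, dtype=np.float32)
    conn = psycopg2.connect(DATABASE_URL)
    try:
        register_vector(conn)
        with conn.cursor() as cur:
            # <=> is cosine distance, so 1 - distance gives cosine similarity.
            cur.execute("""
                SELECT content, metadata, 1 - (embedding <=> %s) AS similarity
                FROM document_chunks
                WHERE metadata->>'category' = %s
                ORDER BY embedding <=> %s
                LIMIT %s
            """, (vec, category, vec, top_k))
            return cur.fetchall()
    finally:
        conn.close()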

For a policy knowledge agent, you typically don't need the user to specify the category explicitly — the agent can classify the query first ("this is a finance query"), then apply the corresponding filter. This two-step retrieval keeps precision high without requiring the user to know your document taxonomy.
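
The control flow of that two-step retrieval can be sketched as below. The keyword lookup is a stand-in assumption for the classification step (in practice an LLM call would usually do this job), and the category names, keyword sets, and the two retrieval callables are hypothetical. When no category matches, the sketch falls back to unfiltered retrieval rather than guessing.

```python
import re
from typing import Callable, Optional

# Hypothetical categories and keywords, standing in for an LLM classifier.
CATEGORY_KEYWORDS = {
    "finance": {"expense", "expenses", "claim", "receipt", "mileage"},
    "hr": {"holiday", "leave", "sickness", "carry", "entitlement"},
    "it": {"laptop", "password", "vpn", "software"},
}


def classify_query(query: str) -> Optional[str]:
    """Return the best-matching category, or None if nothing matches."""
    words = set(re.findall(r"[a-z]+", query.lower()))
    scores = {cat: len(words & kws) for cat, kws in CATEGORY_KEYWORDS.items()}
    best = max(scores, key=lambda cat: scores[cat])
    return best if scores[best] > 0 else None


def route_query(
    query: str,
    query_embedding,
    retrieve_filtered: Callable,
    retrieve_all: Callable,
):
    """Classify first, then retrieve with the matching filter if there is one."""
    category = classify_query(query)
    if category is None:
        return retrieve_all(query_embedding)
    return retrieve_filtered(query_embedding, category)
```

A function with the shape of retrieve_with_filter(query_embedding, category) fits the retrieve_filtered slot directly.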

For a deeper comparison of vector retrieval strategies — including when keyword search outperforms vector search on structured documents — see Document RAG: When Vector Search Beats Keyword Search.

Retrieval evaluation: how to know if the agent is actually right

Most knowledge agent evaluations stop at "does the agent answer the question?" The more important question is "does the agent answer it correctly?" For a policy agent making factual claims about HR entitlements or compliance requirements, incorrect answers are a serious operational risk.

Evaluation framework:

  1. Retrieval precision: given a test question, are the top-5 retrieved chunks actually relevant to the answer? Measure manually for a representative sample of 50 queries. Target: >80% of retrieved chunks should be relevant.

  2. Answer faithfulness: does the generated answer accurately reflect the retrieved chunks? Use an LLM-as-judge setup (Ragas is a widely used open-source framework for this) to evaluate whether the answer contradicts or fabricates content relative to the context.

  3. Citation accuracy: does the agent cite a document and section that actually contains the claimed information? Spot-check 10% of answers weekly.

Build a small golden dataset of 100 question-answer pairs from your actual policies — answers verified by HR or the relevant department head — and run this evaluation suite before any significant update to the corpus or the retrieval configuration.
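
The retrieval-precision check in step 1 is simple enough to script once the manual relevance labels exist. A minimal sketch, assuming each golden item carries a question plus the set of chunk IDs a reviewer marked as relevant; the field names question and relevant_chunk_ids, and the retrieve callable that returns ranked chunk IDs, are assumptions for illustration.

```python
from typing import Callable


def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of the top-k retrieved chunks that a reviewer marked relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant_ids)
    return hits / len(top_k)


def mean_precision(
    golden: list[dict],
    retrieve: Callable[[str], list[str]],
    k: int = 5,
) -> float:
    """Average precision@k across the golden dataset."""
    scores = [
        precision_at_k(retrieve(item["question"]), item["relevant_chunk_ids"], k)
        for item in golden
    ]
    return sum(scores) / len(scores) if scores else 0.0
```

Comparing mean_precision before and after a chunking or filter change gives a direct read on whether the change helped against the 80% target.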

Hallucination guardrails: what to do when the agent doesn't know

Three complementary guardrails, used together:

Retrieval confidence threshold. Before passing context to the LLM, check the similarity score of the top retrieved chunk. If its cosine similarity is below about 0.6, the retrieved content is likely not directly relevant to the query. Treat 0.6 as a starting point: score distributions differ between embedding models, so calibrate the threshold against your golden dataset. When the top score falls below the threshold, return a fallback response rather than attempting to generate from weak context:

"I couldn't find a clear policy on this. Please speak to [HR contact] or check the [policy document name] directly."
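
As a sketch, the threshold check is a small gate in front of the generation step. It assumes rows arrive best-first as (content, metadata, similarity) tuples, the shape a pgvector similarity query would return, and that generate is whatever callable wraps your LLM call; both names are assumptions for illustration.

```python
from typing import Callable, Sequence

FALLBACK = (
    "I couldn't find a clear policy on this. Please speak to [HR contact] "
    "or check the [policy document name] directly."
)


def answer_or_fallback(
    rows: Sequence[tuple],
    generate: Callable[[Sequence[tuple]], str],
    threshold: float = 0.6,
) -> str:
    """Only call the LLM when the best retrieved chunk clears the threshold."""
    # rows are ordered best-first, so rows[0][2] is the top similarity score.
    if not rows or rows[0][2] < threshold:
        return FALLBACK
    return generate(rows)
```

Because the gate runs before generation, a weak retrieval costs no LLM call at all.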

Citation requirement in the prompt. Instruct the LLM to prefix every factual claim with its source: "According to the Expense Policy (Section 3.2)..." This forces grounding in the retrieved context and makes verification trivial.
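
One way to wire this up is to label each chunk with its source in the context and then check the model's answer for the citation pattern before showing it. The prompt wording, the helper names, and the regular expression below are illustrative assumptions; the chunk shape assumed is a dict with content and a metadata dict holding doc_title and section.

```python
import re

SYSTEM_PROMPT = (
    "Answer only from the policy excerpts provided. "
    "Prefix every factual claim with its source in the form "
    "'According to the <document title> (Section <number>)'. "
    "If the excerpts do not answer the question, say so."
)

CITATION = re.compile(r"According to the .+? \(Section [\w.]+\)")


def build_context(chunks: list[dict]) -> str:
    """Label each chunk with its document and section so the model can cite it."""
    blocks = []
    for chunk in chunks:
        meta = chunk["metadata"]
        label = f"[{meta['doc_title']}, Section {meta['section']}]"
        blocks.append(f"{label}\n{chunk['content']}")
    return "\n\n".join(blocks)


def has_citation(answer: str) -> bool:
    """Cheap post-check: does the answer contain at least one source citation?"""
    return bool(CITATION.search(answer))
```

An answer that fails has_citation can be routed to the fallback response instead of being shown. The check confirms a citation is present, not that it is correct, so the weekly citation spot-check still applies.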

Human escalation path. Low-confidence answers are flagged internally but still shown to the user, with a "Was this helpful?" mechanism attached. Negative feedback triggers a Slack notification to the relevant department owner. This creates a feedback loop that surfaces gaps in your document corpus: answers that are consistently unhelpful usually point to missing or outdated policies.

What changed in 2025–2026: hybrid retrieval and reranking

The practical shift in production RAG in 2025 was the normalisation of hybrid retrieval — combining dense vector search with sparse BM25 keyword search — and the addition of a reranking step before LLM generation.

Pure vector search handles semantic similarity well but struggles with exact-match queries (policy numbers, specific role titles, exact monetary amounts). BM25 handles exact matches well but misses semantic variants. Hybrid retrieval combines both. Elasticsearch supports it natively, fusing BM25 with dense vectors or its ELSER sparse model; in Postgres, pgvector covers the vector side and needs to be paired with full-text search or a separate BM25 extension for the keyword side.
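
Combining the two result lists needs a fusion rule. The article does not name one; reciprocal rank fusion (RRF) is a common choice and is used here as an assumption. Each chunk scores 1 / (k + rank) in every list it appears in, and the scores are summed. The constant k = 60 is the conventional default, and the chunk IDs are hypothetical.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several best-first rankings of chunk IDs into one fused ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # A chunk that ranks well in both lists accumulates the most score.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)


vector_hits = ["c1", "c2", "c3"]   # from dense vector search
keyword_hits = ["c3", "c1", "c4"]  # from BM25 keyword search
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
# fused == ["c1", "c3", "c2", "c4"]
```

RRF works on ranks only, so it sidesteps the problem that cosine similarities and BM25 scores sit on different scales.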

The reranker is a cross-encoder model that scores each retrieved chunk against the query directly, rather than relying on embedding approximation. It typically improves answer faithfulness by 10–15 percentage points over retrieval without reranking, though the gain varies by corpus. Cohere's Rerank API is one of the quickest ways to add this to an existing RAG pipeline. For open source, cross-encoder/ms-marco-MiniLM-L-6-v2 on Hugging Face is a common choice.
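
A reranking step can be sketched as a function that takes any pair-scoring callable, which keeps the pipeline testable without downloading a model. The cross_encoder_scorer helper below assumes the sentence-transformers package is installed and uses its CrossEncoder class; the function names and the top_n default are assumptions for illustration.

```python
from typing import Callable, Sequence


def rerank(
    query: str,
    chunks: Sequence[str],
    score_pairs: Callable[[list[tuple[str, str]]], Sequence[float]],
    top_n: int = 3,
) -> list[str]:
    """Score every (query, chunk) pair directly and keep the best top_n chunks."""
    pairs = [(query, chunk) for chunk in chunks]
    scores = score_pairs(pairs)
    ranked = sorted(zip(chunks, scores), key=lambda item: item[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]


def cross_encoder_scorer(model_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2"):
    """Build a scorer backed by a cross-encoder (requires sentence-transformers)."""
    # Imported lazily so rerank() can be used and tested without the dependency.
    from sentence_transformers import CrossEncoder

    model = CrossEncoder(model_name)
    return lambda pairs: model.predict(list(pairs)).tolist()
```

In use, retrieve a wider candidate set first (for example the top 20 from hybrid retrieval), then call rerank(query, candidates, cross_encoder_scorer()) and pass only the survivors to the LLM.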

The counterargument: reranking adds cost and latency (150–400ms, depending mainly on how many candidate chunks you rerank and where the model runs). For a staff Q&A agent with a corpus under 1,000 documents, the latency may not be worth the precision gain, so test on your specific corpus before committing. If your documents are highly structured and consistently chunked, the gains from reranking may be marginal relative to getting the chunking strategy right in the first place.

Good / Bad / Ugly: knowledge agent patterns from real deployments

Good: Hierarchical chunking on 6 structured policy documents. Metadata filters by category. Retrieval confidence threshold at 0.62 — answers below threshold escalate to HR Slack channel. Citation requirement in prompt. 50-question golden dataset evaluated monthly. Answer faithfulness score: 91%. Staff message volume to HR: down 78% in month one. Agent handled 94% of queries without escalation.

Bad: Fixed-size chunking on everything including email exports and informal meeting notes. No metadata filters. The agent confidently cites a "policy" from a three-year-old email thread that was never formalised. Two incorrect answers given to staff about holiday entitlement — one of which led to a complaint.

Ugly: No confidence threshold, no citation requirement, no evaluation. The agent hallucinates a parental leave policy that is more generous than the actual policy. An employee relies on it, the discrepancy surfaces in an employment dispute, and the company's solicitor has to establish that the AI system was not authoritative. Operational and legal pain that took months to resolve.


The chunking and retrieval strategy that works for a policy knowledge agent differs from what works for a technical document corpus. For the retrieval architecture comparison across document types, see Document RAG: When Vector Search Beats Keyword Search. When your knowledge corpus includes scanned or hand-completed forms — common in HR and compliance settings — the preprocessing layer matters before chunking; OCR with Human-in-the-Loop: Shipping 99% Accuracy in Production covers how to get structured text out of those documents reliably. For how OCR feeds structured documents into a knowledge pipeline end-to-end, the Invoice OCR case study shows a production example.

Book a 30-minute scoping call — we'll assess your document corpus and design the agent architecture before you leave.

FAQ

How often does the knowledge agent need to be updated when policies change?

Your re-indexing pipeline should trigger automatically whenever a source document changes. For most SMEs using Google Drive or SharePoint as the source, a webhook or polling job that detects file modifications and re-chunks/re-embeds the affected documents is sufficient. A full re-index of a 200-document corpus takes under 5 minutes with a modern embedding model; there's no reason to leave stale chunks in the index.
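
Whether the trigger is a webhook or a polling job, the core of the pipeline is deciding which documents actually changed. A minimal sketch using content hashes, assuming you store one hash per indexed document; plan_reindex and the document IDs are hypothetical names, and fetching file bytes from Google Drive or SharePoint is left out.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Stable fingerprint of a document's bytes."""
    return hashlib.sha256(data).hexdigest()


def plan_reindex(
    current: dict[str, bytes],
    indexed_hashes: dict[str, str],
) -> tuple[list[str], list[str]]:
    """Compare current documents to stored hashes.

    Returns (to_embed, to_delete): documents that are new or changed and must be
    re-chunked and re-embedded, and documents that no longer exist at the source.
    """
    to_embed = sorted(
        doc_id
        for doc_id, data in current.items()
        if indexed_hashes.get(doc_id) != content_hash(data)
    )
    to_delete = sorted(doc_id for doc_id in indexed_hashes if doc_id not in current)
    return to_embed, to_delete
```

For a changed document, delete its existing chunks before inserting the new ones; otherwise the old and new versions of a clause sit side by side in the index, which recreates the version-confusion problem.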

Can the agent handle multi-document questions spanning HR policy and IT process?

Yes, with the right retrieval setup. The key is not to silo your document types into separate vector stores. Keep them in one store with metadata tags (document_type, department, last_updated). A question that spans IT request process and HR approval workflow then retrieves relevant chunks from both document types in a single query, as long as both were embedded with the same model and no category filter excludes one of them. The LLM then synthesises the answer from the mixed context.

How do I stop the agent from hallucinating policies that don't exist in the docs?

Three layers: (1) a retrieval confidence threshold — if the top-k retrieved chunks all have cosine similarity below 0.6, the agent responds 'I couldn't find a clear policy on this — please check with HR directly' rather than generating from thin context; (2) a prompt instruction to cite the specific document and section for every claim it makes; (3) a human-in-the-loop escalation path for low-confidence answers. The citation requirement is the most effective single guardrail — it forces the model to ground its answer in retrieved text rather than generate from parametric memory.

What's the difference between a knowledge agent and a basic chatbot?

A basic chatbot answers from a fixed script or pattern-matched rules. A knowledge agent retrieves relevant context from your actual documents at query time and generates an answer grounded in that context — so it can answer questions about policies that didn't exist when it was built, as long as those policies are in the indexed corpus. The practical difference: a chatbot gets stale; a knowledge agent gets more useful as you add more documents.

© 2026 Quantum Automations Group Ltd