
pgvector vs Pinecone vs Qdrant for UK SME RAG Pipelines

Published June 2026
Topic Document Automation · Vector Stores
Reading time 10 min
For UK SME ops leads
On this page
  1. How pgvector, Pinecone, and Qdrant each handle retrieval — and where none of them are the bottleneck
  2. pgvector: when Postgres is already there and that's enough
  3. Pinecone: the managed path, its pricing cliff, and who it's actually for
  4. Qdrant: payload filtering that actually works at UK SME corpus sizes
  5. Retrieval latency and recall on a 100k-document corpus: the benchmarks we ran
  6. Hybrid search support: which stores handle BM25 + vector retrieval natively
  7. Ops burden comparison: what you own in each option
  8. What changed in 2025–2026
  9. Our recommendation by corpus size, query pattern, and existing stack
  10. FAQ

We inherited a RAG pipeline six months in that was running pgvector on a db.t3.medium RDS instance with 180k documents. Filtered retrieval — "only answer from documents tagged with client_id=429" — was taking 2.1 seconds because pgvector doesn't support payload-filtered approximate nearest neighbour. Migrating to Qdrant took two weeks and cut that query to 90ms. The original choice wasn't wrong. It was right for the corpus size at launch. Nobody had planned for growth. That's the case for reading this before you build, not after.

How pgvector, Pinecone, and Qdrant each handle retrieval — and where none of them are the bottleneck

A RAG pipeline has two jobs: retrieval and generation. The vector store handles retrieval — given a query embedding, return the k most similar document chunks. People obsess over LLM selection and ignore vector store design until retrieval latency shows up in production.

At query time, the vector store: accepts a query embedding (typically 1536 dimensions for OpenAI text-embedding-3-small), runs an approximate nearest neighbour (ANN) search against indexed document embeddings, applies any metadata filters by client or document type, and returns the top-k chunks with scores.
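The query-time steps above can be sketched without any vector database at all. This is a minimal exact (brute-force) version in plain Python, assuming toy 3-dimensional embeddings and a hypothetical `retrieve` helper; a real store replaces the linear scan with an ANN index, but the inputs and outputs have the same shape.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query, chunks, k, client_id=None):
    """Exact top-k search: apply the metadata filter, score, sort, truncate."""
    candidates = [
        c for c in chunks if client_id is None or c["client_id"] == client_id
    ]
    scored = [(cosine_similarity(query, c["embedding"]), c["id"]) for c in candidates]
    scored.sort(reverse=True)  # highest similarity first
    return [(chunk_id, score) for score, chunk_id in scored[:k]]


# Toy corpus: illustrative values only, not real embeddings.
chunks = [
    {"id": "a", "client_id": 429, "embedding": [1.0, 0.0, 0.0]},
    {"id": "b", "client_id": 429, "embedding": [0.6, 0.8, 0.0]},
    {"id": "c", "client_id": 7, "embedding": [0.9, 0.1, 0.0]},
    {"id": "d", "client_id": 429, "embedding": [0.0, 0.0, 1.0]},
]
query = [1.0, 0.0, 0.0]

filtered = retrieve(query, chunks, k=2, client_id=429)
unfiltered = retrieve(query, chunks, k=2)
print([chunk_id for chunk_id, _ in filtered])
print([chunk_id for chunk_id, _ in unfiltered])
```

Note how the filter changes the answer: chunk `c` is the second-closest match overall, but it belongs to another client, so the filtered query must return `b` instead. Everything in the sections below is about doing that step quickly.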

The metadata filter step is where the three options diverge. Single-tenant corpus with no filtering? pgvector is fine up to 300k vectors. Multi-tenant with per-client filtering — which describes almost every UK SME RAG deployment — the choice matters enormously.

The vector store is also not where most RAG pipelines fail. Chunking strategy, embedding model choice, and prompt design cause more recall failures than index performance. See our RAG architecture case study for how we structure the full pipeline before focusing on the vector layer.

pgvector: when Postgres is already there and that's enough

pgvector is a Postgres extension that adds a vector column type and HNSW and IVFFlat index types. If you already pay for RDS or Supabase, pgvector costs nothing extra and adds zero operational complexity.

For a corpus below 100k vectors with no filtering requirement, pgvector with HNSW delivers 50–80ms query latency and recall above 95% against brute-force. You get transactional consistency with the rest of your application data, familiar tooling, and no additional service to operate.

The problem surface is narrow but sharp:

Filtered ANN. pgvector builds its HNSW index across all vectors. When you add a WHERE clause, it cannot use the index to narrow the search space first — it scans then filters, or falls back to sequential scan. For selectivities above 5%, queries become O(n) rather than O(log n). The 2.1-second latency we hit was exactly this failure mode.
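The "scans then filters" branch is easy to model. The sketch below is a toy, not pgvector's executor: it assumes similarity scores are already computed, treats the index as "return the best `ef_search` candidates regardless of metadata", and uses 40 for `ef_search` because that is pgvector's default for `hnsw.ef_search`. The helper names are hypothetical.

```python
def top_by_score(items, limit):
    """Return the highest-scoring items, best first."""
    return sorted(items, key=lambda it: it["score"], reverse=True)[:limit]


def post_filter_search(items, client_id, k, ef_search):
    """Model of index-then-filter: candidates are chosen before the WHERE clause."""
    candidates = top_by_score(items, ef_search)  # metadata is ignored here
    matching = [it for it in candidates if it["client_id"] == client_id]
    return matching[:k]


def filter_aware_search(items, client_id, k):
    """Model of filter-aware search: only matching items are ever candidates."""
    matching = [it for it in items if it["client_id"] == client_id]
    return top_by_score(matching, k)


# 1,000 toy chunks; every 10th belongs to client 429 (10% selectivity).
items = [
    {"id": i, "client_id": 429 if i % 10 == 0 else 1, "score": 1.0 - i / 1000}
    for i in range(1000)
]

post = post_filter_search(items, client_id=429, k=8, ef_search=40)
aware = filter_aware_search(items, client_id=429, k=8)
print(len(post), len(aware))
```

With 10% selectivity, only about 4 of the 40 candidates survive the filter, so the post-filter path cannot fill a k=8 result set. The database then has to either return fewer rows or do more work, which is where the latency goes.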

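Vector dimensions. pgvector's HNSW and IVFFlat indexes support up to 2,000 dimensions on a vector column (the type can store more, but only unindexed). OpenAI text-embedding-3-large outputs 3,072 dimensions. To index it you must shorten the embedding, store it as halfvec (pgvector 0.7.0+, indexable up to 4,000 dimensions at half precision), or switch to a smaller embedding model. Qdrant and Pinecone both handle 3,072-dimension vectors without modification.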
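If you do shorten embeddings client-side, the usual approach is to keep the leading dimensions and re-normalise to unit length so cosine scores stay comparable. A minimal sketch, assuming your embedding model is one that tolerates truncation (OpenAI documents this for the text-embedding-3 family); `truncate_embedding` is a hypothetical helper, and the recall impact on your own corpus still needs measuring.

```python
import math


def truncate_embedding(vector, dims):
    """Keep the first `dims` values and rescale the result to unit length."""
    if dims > len(vector):
        raise ValueError("cannot truncate to more dimensions than the vector has")
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        raise ValueError("truncated vector has zero length")
    return [x / norm for x in head]


# Stand-in for a 3,072-dimension embedding; real values come from the model.
full = [float(i % 7 + 1) for i in range(3072)]
short = truncate_embedding(full, 1536)
print(len(short))
```

Every stored vector and every query vector must go through the same truncation, otherwise the dimensions will not match the column definition.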

Use pgvector when Postgres is already in your stack, your corpus stays below 100k vectors, and you do not need metadata-filtered ANN. If you are starting from zero, choosing pgvector to avoid a new service is not a good enough reason.

-- Requires the pgvector extension (HNSW indexes need pgvector 0.5+)
CREATE EXTENSION IF NOT EXISTS vector;

-- Create table with HNSW index
CREATE TABLE document_chunks (
  id          uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  client_id   integer NOT NULL,
  content     text NOT NULL,
  embedding   vector(1536),
  metadata    jsonb
);

CREATE INDEX ON document_chunks
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

-- Filtered query (falls back to exact scan on the subset).
-- $1 is the 1536-dimension query embedding, bound by the client.
SELECT id, content, 1 - (embedding <=> $1::vector) AS score
FROM   document_chunks
WHERE  client_id = 429
ORDER  BY embedding <=> $1::vector
LIMIT  8;

That query is correct SQL. On 180k rows filtered to a client with 18k chunks, it took 2.1 seconds. The same query on Qdrant with a payload filter took 90ms.

Pinecone: the managed path, its pricing cliff, and who it's actually for

Pinecone is a fully managed vector database. You create a serverless index, upsert vectors with metadata, and query. No infrastructure to manage, no index-rebuild scheduling, no capacity planning. For a team without dedicated infrastructure experience, that zero-ops proposition is genuine.

Pinecone Serverless bills on read and write units rather than compute time. A broad query against a 200k-vector namespace scans more vectors than the k=8 you requested, so the bill scales with namespace breadth, not just query count. At 50,000 queries per month against a 200k-vector corpus, expect roughly £200–£280/month. That is a fair price for zero ops overhead — but it surprises UK SMEs who budgeted based on query count alone.
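It helps to turn those figures into a per-query cost before budgeting. The arithmetic below uses only numbers from this article (50,000 queries, £200–£280/month, and the £40/month VPS from our recommendations); the function names are hypothetical, and the comparison deliberately ignores engineering time, which is the main thing a managed service buys you.

```python
def cost_per_query(monthly_bill_gbp, queries_per_month):
    """Implied cost of one query, given a monthly bill and query volume."""
    return monthly_bill_gbp / queries_per_month


def break_even_queries(fixed_monthly_gbp, per_query_gbp):
    """Monthly query volume at which usage billing equals a fixed-cost host."""
    return fixed_monthly_gbp / per_query_gbp


low = cost_per_query(200, 50_000)
high = cost_per_query(280, 50_000)
print(f"Implied cost per query: £{low:.4f} to £{high:.4f}")

# Compare against a fixed £40/month self-hosted VPS.
print(f"Break-even volume: {break_even_queries(40, high):,.0f} "
      f"to {break_even_queries(40, low):,.0f} queries/month")
```

Treat the per-query figure as a snapshot: because read units scale with vectors scanned, it rises as the namespace grows even if query volume stays flat.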

Filtering semantics are the second limitation. Pinecone's metadata filters are reasonably fast but less expressive than Qdrant's payload system. Nested JSON payload queries or range filters on computed scores require workarounds in Pinecone that Qdrant handles without modification.

Pinecone is the right choice for teams that want a working vector store in an afternoon with no infrastructure commitment and do not need the filtered-retrieval precision that Qdrant provides.

For a counterpoint — specifically the argument that pgvector on a well-tuned Postgres instance outperforms Pinecone on cost-per-query for corpora under 1M vectors — see the Supabase pgvector benchmark analysis.

Qdrant: payload filtering that actually works at UK SME corpus sizes

Qdrant is an open-source vector database written in Rust. It runs as a single binary, supports gRPC and REST, and handles payload-filtered ANN correctly — the HNSW index is traversed with the filter applied, not scanned and then filtered.

Qdrant builds filterable HNSW graphs where each node carries a payload (arbitrary JSON metadata). A query for vectors where client_id = 429 and doc_type = "contract" traverses only the subgraph matching those conditions, not the full index.

The result in our migration: 2.1 seconds down to 90ms, same embedding model, same k=8 result set.

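from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue

client = QdrantClient(url="http://localhost:6333")

# query_embedding: list[float] with 1536 values, produced by your embedding model
response = client.query_points(
    collection_name="documents",
    query=query_embedding,
    query_filter=Filter(
        must=[  # every condition in `must` has to match
            FieldCondition(key="client_id", match=MatchValue(value=429)),
            FieldCondition(key="doc_type", match=MatchValue(value="contract")),
        ]
    ),
    limit=8,
    with_payload=True,
)
results = response.points  # scored points, each with its payload

# Older qdrant-client releases expose a similar call as
# client.search(query_vector=..., query_filter=...), now deprecated.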

Qdrant also supports sparse vectors as a first-class type (required for BM25 hybrid search), multi-vector documents, quantisation for memory reduction, and a snapshot API for backups. The self-hosted version is free. Qdrant Cloud starts at roughly £15–£20/month for a starter cluster and supports London-region deployment for UK GDPR compliance.

Retrieval latency and recall on a 100k-document corpus: the benchmarks we ran

Test setup: a 100k-document corpus split into 512-token chunks, embedded with text-embedding-3-small at 1536 dimensions. Qdrant ran on a 4-core, 16GB VPS; pgvector on an RDS db.t3.large; Pinecone Serverless in eu-west-1. All queries were issued from EC2 in eu-west-2.

| Store | Query type | Median latency | P95 latency | Recall@8 |
|---|---|---|---|---|
| pgvector (HNSW) | Unfiltered | 62ms | 140ms | 0.96 |
| pgvector (HNSW) | Filtered (10% selectivity) | 2,100ms | 4,800ms | 0.94 |
| Pinecone Serverless | Unfiltered | 45ms | 110ms | 0.95 |
| Pinecone Serverless | Filtered (10% selectivity) | 120ms | 290ms | 0.92 |
| Qdrant (HNSW) | Unfiltered | 38ms | 85ms | 0.97 |
| Qdrant (HNSW) | Filtered (10% selectivity) | 90ms | 180ms | 0.96 |

Unfiltered, all three are fast enough that LLM inference will be the slower step. The filtered case is where the decision becomes consequential. Pinecone Serverless shows a small recall drop on filtered queries, which Pinecone's documentation attributes to approximate filter evaluation.
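For readers who want to reproduce a Recall@8 column on their own corpus, one common definition is the share of the exact (brute-force) top-k that the approximate search also returns, averaged over an evaluation query set. The sketch below assumes that definition and uses made-up result IDs; `recall_at_k` and `mean_recall` are hypothetical helper names.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the exact top-k that also appears in the approximate top-k."""
    exact_top = set(exact_ids[:k])
    if not exact_top:
        raise ValueError("exact result list is empty")
    hits = sum(1 for doc_id in approx_ids[:k] if doc_id in exact_top)
    return hits / len(exact_top)


def mean_recall(result_pairs, k):
    """Average recall@k over (approx_ids, exact_ids) pairs, one pair per query."""
    scores = [recall_at_k(approx, exact, k) for approx, exact in result_pairs]
    return sum(scores) / len(scores)


# Two toy queries: the first misses one true neighbour, the second is perfect.
exact = [1, 2, 3, 4, 5, 6, 7, 8]
pairs = [
    ([1, 2, 3, 4, 5, 6, 7, 99], exact),
    ([8, 7, 6, 5, 4, 3, 2, 1], exact),
]
print(recall_at_k(pairs[0][0], exact, k=8))
print(mean_recall(pairs, k=8))
```

Order within the top-k does not affect this metric, which is why the reversed list in the second query still scores 1.0. If ranking order matters for your prompts, track a rank-aware metric alongside it.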

Hybrid search support: which stores handle BM25 + vector retrieval natively

Pure vector search misses exact keyword matches. If a user asks "what does section 4.2 of the master services agreement say?", a vector search may return semantically similar content from a different section — BM25 keyword search finds the literal string. Hybrid search combines both signals.

Qdrant supports sparse vectors natively in current 1.x releases. Index BM25 weights as a sparse vector alongside your dense embedding, then combine the two result lists with reciprocal rank fusion (RRF) at query time. No external BM25 service required.

pgvector added a sparsevec type in 0.7.0, but it has no built-in BM25 scoring. The usual pattern is Postgres full-text search (tsvector) for keywords plus pgvector ANN, merged in application code. It works, but you own the result fusion logic and run two index scans.
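The fusion logic you would own in application code is short. Below is a minimal sketch of reciprocal rank fusion, assuming each retriever hands back an ordered list of document IDs; the constant `k=60` is the value commonly used for RRF, and the document IDs are invented for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60, limit=8):
    """Merge ranked ID lists: each list contributes 1 / (k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))[:limit]


keyword_hits = ["msa-4.2", "msa-4.1", "nda-2"]  # e.g. from a tsvector query
vector_hits = ["msa-7", "msa-4.2", "sow-1"]     # e.g. from a pgvector ANN query

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

RRF only uses ranks, not raw scores, so you do not have to reconcile a BM25 score with a cosine distance. The chunk that both retrievers agree on rises to the top.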

Pinecone added sparse-dense hybrid search in 2024. You upload both a dense vector and a sparse BM25 vector per document, which requires running a BM25 tokeniser client-side. The query API is clean but the upload pipeline is more complex than Qdrant's.

For UK SME document RAG where users query by clause reference or contract identifier, hybrid search materially improves recall. We cover retrieval patterns in detail in our post on vector versus keyword retrieval for document RAG.

Ops burden comparison: what you own in each option

| Responsibility | pgvector | Pinecone | Qdrant (self-hosted) |
|---|---|---|---|
| Infrastructure provisioning | Your Postgres host | None | VPS or k8s cluster |
| Index rebuilds | Scheduled job (you own) | Automatic | Automatic |
| Backups | RDS snapshots or pg_dump | Automatic | Qdrant snapshots (you schedule) |
| Scaling | Vertical only | Automatic | Vertical + horizontal |
| Monitoring | pg_stat, CloudWatch | Pinecone console | Qdrant metrics + Prometheus |
| Version upgrades | Postgres major version migration | Automatic | Pull new Docker image + restart |
| On-call for outages | Your team | Pinecone SLA | Your team |

pgvector's ops story depends entirely on your existing Postgres posture. Already running RDS with automated backups? pgvector adds almost nothing. Postgres is new infrastructure? The ops surface is larger than it looks.

Pinecone's zero-ops story is accurate day-to-day. The hidden cost is lock-in: Pinecone uses a proprietary index format, and migrating out requires re-uploading all vectors with no tooling assistance.

Qdrant on a self-hosted VPS sits in the middle — a single Docker container, a cron job for snapshots, a Prometheus scrape target. For a team already running Postgres and Redis on a VPS, it is one more process. Budget two days to set it up correctly, not two hours.

What changed in 2025–2026

Two developments shifted the calculus since early 2025.

Qdrant's built-in sparse vector support matured: sparse vectors arrived in 1.7, 1.8 made sparse search substantially faster, and the Query API added in 1.10 runs hybrid fusion server-side, removing the need for a separate BM25 service. Previously, hybrid RAG on Qdrant required running Elasticsearch alongside it. That dependency is gone.

pgvector 0.8.0 (late 2024) added iterative index scans, which keep pulling candidates from the HNSW index until enough rows pass the WHERE clause. Separately, a Postgres partial index lets you build an HNSW index scoped to a fixed predicate. For a fixed-predicate filter like "only documents in the contracts collection", a partial pgvector index is a viable alternative to migrating. For dynamic client IDs across 200+ tenants, it doesn't scale: you would need one partial index per tenant and a deployment process to create them on demand.
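To see why one partial index per tenant turns into a deployment problem, it helps to generate the DDL. This sketch assumes the `document_chunks` table from earlier in this post and a hypothetical naming scheme for the indexes; it only builds SQL strings and does not connect to a database.

```python
def partial_index_ddl(table, column, tenant_column, tenant_id):
    """Build the CREATE INDEX statement for one tenant's partial HNSW index.

    Table and column names are interpolated directly, so they must be trusted
    constants from your own code, never user input.
    """
    if not isinstance(tenant_id, int):
        raise TypeError("tenant_id must be an integer")
    index_name = f"{table}_{tenant_column}_{tenant_id}_hnsw"  # hypothetical scheme
    return (
        f"CREATE INDEX IF NOT EXISTS {index_name} ON {table} "
        f"USING hnsw ({column} vector_cosine_ops) "
        f"WHERE {tenant_column} = {tenant_id};"
    )


# One statement per tenant: 200 tenants means 200 indexes to build and maintain.
statements = [
    partial_index_ddl("document_chunks", "embedding", "client_id", tenant_id)
    for tenant_id in range(1, 201)
]
print(len(statements))
print(statements[0])
```

Each of those indexes has to be built when a tenant is onboarded, kept in memory to stay fast, and dropped when the tenant leaves. That lifecycle is the part that needs a deployment process, not the SQL itself.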

Our recommendation by corpus size, query pattern, and existing stack

Under 50k vectors, no metadata filtering

Use pgvector if Postgres is already in your stack. Zero additional operational overhead, and you avoid a new service dependency. Building from scratch and want zero ops? Pinecone Serverless is fine at this scale.

50k–200k vectors with metadata filtering

Use Qdrant. The filtered-ANN performance gap versus pgvector is the difference between a usable product and one your ops team gets called about. Self-host on a £40/month VPS or use Qdrant Cloud in eu-west-2. Budget two weeks for migration from pgvector.

200k+ vectors, multi-tenant, hybrid search required

Use Qdrant unless your team cannot own any infrastructure, in which case Pinecone's Pod tier is the alternative. Do not use pgvector above 200k vectors with filtering — the schema changes required to get acceptable performance are easier to get right in a purpose-built vector store.

Failure modes to watch for

Good: pgvector on a db.t3.large with 40k unfiltered document chunks for a single-tenant internal knowledge base. 62ms retrieval, zero new services, results that are indistinguishable from Qdrant. This is the right tool for the job.

Bad: Pinecone Serverless at 80,000 queries/month with a broad filter across a 300k-vector namespace. The read-unit bill arrives and it is 4x what was budgeted. The fix is either to restructure namespaces (which Pinecone uses as the primary isolation boundary) or to move to a Pod tier — neither is quick. This is a planning failure, not a product failure, but the product's pricing structure makes it easy to walk into.

Ugly: pgvector filtered retrieval at scale with no monitoring. The query that took 200ms at 20k documents takes 4.8 seconds at 180k documents. No alert fires because the Postgres query timeout is set to 30 seconds and the application swallows the latency as a slow response rather than an error. Users stop using the tool because it "feels broken."

We have hit this in two separate engagements. The first is the logistics client above: 180k documents, per-client filtering by client_id, 2.1-second queries, two-week migration to Qdrant. The second was an accountancy group with a 95k-document corpus filtering by engagement_id across 60 active engagements. More documents per filter bucket (average 1,600 versus 800) kept their latency lower — around 900ms — but still above the 200ms threshold where users noticed. Their migration took nine days because the corpus was smaller and they had an evaluation set from a prior audit. Both cases: pgvector's filtered ANN falling back to a sequential scan, no alert on query duration percentiles, users complaining before the team knew anything was wrong.
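The missing alert in both cases was a check on query duration percentiles. A minimal sketch of that check, assuming you already collect per-query retrieval latencies in milliseconds somewhere; the 200ms default comes from the threshold mentioned above, while the function names and the 20-sample minimum are our own placeholders.

```python
import statistics


def p95(samples_ms):
    """95th percentile: the last of the 19 cut points that split data into 20 parts."""
    return statistics.quantiles(samples_ms, n=20, method="inclusive")[-1]


def latency_alert(samples_ms, threshold_ms=200.0, min_samples=20):
    """Return the P95 latency if it breaches the threshold, otherwise None."""
    if len(samples_ms) < min_samples:
        return None  # too few samples for a meaningful percentile
    value = p95(samples_ms)
    return value if value > threshold_ms else None


healthy = [90.0] * 100                    # every query around 90ms
degraded = [90.0] * 90 + [2100.0] * 10    # one in ten queries hits the slow path

print(latency_alert(healthy))
print(latency_alert(degraded))
```

A mean over the degraded sample would sit near 290ms and could look like noise; the P95 makes the slow path obvious. Run the check per tenant if you can, because a single large client can degrade long before the global numbers move.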

If you are starting a new RAG pipeline today, pick Qdrant. The overhead of a single Docker container is trivial compared to the refactor you face once filtered retrieval degrades. If you have an existing pgvector install under 50k unfiltered vectors for a single-tenant use case, leave it alone — the performance gap does not materialise until you cross that threshold with multi-tenant filtering. The migration is real work: two weeks if you have an evaluation set, four weeks if you need to build one first. Get ahead of it by designing your payload schema in Qdrant from the start. Our document RAG knowledge agent build covers how we structure that schema in production, including the chunking and metadata decisions that determine whether your filters stay fast at scale.


External references: pgvector HNSW documentation — the primary source for index type trade-offs; Qdrant filtering internals — explains how filterable HNSW graphs differ from post-filter approaches; Pinecone filter with metadata guide — Pinecone's own documentation of approximate filter evaluation and its recall trade-offs, which provides the counterpoint to Qdrant's exact-filter claim.

FAQ

Can pgvector handle a 200k-document corpus with filtered retrieval without significant performance degradation?

pgvector can technically store 200k document chunks, but filtered approximate nearest neighbour (ANN) is where it breaks down. When you combine a WHERE clause (e.g. client_id = 429) with a vector similarity search, pgvector falls back to exact nearest neighbour scan on the filtered subset rather than using its HNSW index, because the index is built across all vectors regardless of metadata. On a db.t3.medium RDS instance with 180k rows and a selectivity of 10%, this took 2.1 seconds in our production system. You can mitigate this by creating a partial index per tenant or by enabling the iterative index scans added in pgvector 0.8.0, but both approaches require careful schema design and testing from the outset. If your corpus will exceed 100k chunks and you need per-client or per-department filtering, plan the migration to Qdrant before you build rather than after.

What is Pinecone's pricing cliff and at what monthly query volume does it become expensive for a UK SME?

Pinecone's Serverless tier (as of mid-2026) charges per read unit and write unit rather than per pod. Read units scale with the number of vectors scanned, not just queries returned, so a 200k-vector namespace with a broad query can consume 5–10x more read units than you expect. In practice, SMEs running 50,000 queries per month against a 100k-vector corpus hit approximately £180–£250/month on Serverless. The cliff appears when you move to the Pod tier for consistent low-latency SLAs — the smallest production pod (p1.x1) runs around £65–£70/month as a fixed cost before any query volume. For UK SMEs doing fewer than 30,000 queries/month at a corpus below 100k vectors, Serverless is fine. Above those numbers, compare the Pod cost against running pgvector on your existing Postgres or Qdrant on a £20/month VPS.

Does Qdrant offer UK or EU data residency for GDPR compliance?

Qdrant Cloud supports deployment on GCP europe-west2 (London) and AWS eu-west-2 (London) as of 2025, giving you UK-region data residency directly within the managed cloud offering. For strict on-premises or private-cloud requirements, Qdrant is Apache 2.0 licensed and can be self-hosted on any UK infrastructure — AWS, Azure, or your own servers — with full control over where data resides. The self-hosted path is also the cheapest at scale: a £40/month VPS handles 500k vectors comfortably. Under UK GDPR Article 46, storing vector embeddings in a UK region avoids the need for a transfer mechanism, but note that your embedding provider (OpenAI, Cohere, etc.) still processes the raw text to generate those embeddings — that's the data transfer risk to document, not the vector store itself.

How long does it take to migrate a RAG pipeline from pgvector to Qdrant without downtime?

A clean migration for a 180k-document corpus took us two weeks end-to-end: three days to set up Qdrant, write a batch re-indexing script, and validate recall against a held-out query set; four days to run the dual-write period where new documents were indexed into both stores simultaneously; two days of shadow-mode testing where Qdrant served queries in parallel but pgvector answers were returned to users; and two days of cutover and monitoring before we decommissioned pgvector. The critical path item is not the data migration — bulk upsert of 180k vectors into Qdrant takes roughly 25 minutes with a batch size of 500. The slow part is validating that your filter payloads are correctly structured in Qdrant's JSON payload format and that recall scores on your evaluation set are at parity or better than pgvector. Budget two weeks if you have a rigorous eval set; budget four if you don't and need to build one first.
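The bulk re-indexing step is mostly a batching loop. The sketch below assumes a hypothetical `upsert` callable that you supply, for example a thin wrapper around your vector store client's upsert method; here it is replaced by a stub that records batch sizes so the loop can be run without a database.

```python
from itertools import islice


def batched(iterable, batch_size):
    """Yield successive lists of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def migrate(points, upsert, batch_size=500):
    """Send points to the target store in batches; return how many were sent."""
    total = 0
    for batch in batched(points, batch_size):
        upsert(batch)  # caller-supplied; retries and error handling belong here
        total += len(batch)
    return total


# Stub target that records batch sizes instead of writing anywhere.
batch_sizes = []
sent = migrate(range(1250), upsert=lambda batch: batch_sizes.append(len(batch)))
print(sent, batch_sizes)
```

Because `batched` consumes an iterator lazily, you can stream rows from a server-side Postgres cursor rather than loading every vector into memory first. At a batch size of 500, a 180k-vector corpus is 360 upsert calls.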

© 2026 Quantum Automations Group Ltd