Quantum Automations
Guide · Document Automation

Product Catalog Scraper Architecture: 11k SKUs from 32 Vendors

Published May 2026
Topic Data · Catalog Ops
Reading time 9 min
For UK SME ops leads
On this page
  1. Why product catalog scraping is harder than it looks at brief
  2. Architecting for 32 sites: the selector-per-vendor approach
  3. Normalisation: turning 14 price formats into one schema
  4. Deduplication: the problem nobody scopes at the start
  5. Anti-block strategy: staying in limits without burning proxies
  6. Schedule and freshness: how often is often enough?
  7. What changed in 2025–2026: AI-assisted selector generation and LLM normalisation
  8. Good / Bad / Ugly
  9. FAQ

Thirty-two vendor websites. Eleven thousand, two hundred and eighty-three products. Four different price formats. Three sites behind Cloudflare. Two that serve different HTML to headless browsers than to standard crawlers. One vendor that emails a PDF catalogue monthly and considers that a sufficient data feed. This was the actual brief from a UK fabric wholesaler — the project is documented in our Fabric Catalog Aggregator case study — who wanted to replace their buying team's daily manual price-checking ritual.

We shipped it in 18 days. It runs daily, costs £8/month in compute, and catches price changes within 24 hours. But the architecture we ended up with is significantly different from what we planned on day one — because product catalog scraping has four structural problems that don't appear in the brief.

Why product catalog scraping is harder than it looks at brief

Selector drift is the first problem. A vendor updates their website template, and the CSS selector div.product-price span.amount that worked yesterday now returns nothing. This breaks silently unless you have per-vendor failure detection. On a 32-vendor system, you'll see selector drift on roughly 3–4 vendors per month. Design for it from the start: fail per vendor, not globally.

Schema heterogeneity is the second. Vendor A shows price as "£12.50/m". Vendor B shows "£12.50 per metre". Vendor C shows "1250p". Vendor D shows a base price and a "min. order 10m" note in a separate element. Normalising these into { price_pence: 1250, unit: "metre", min_order_metres: 10 } requires parsing logic specific to each vendor — or a normalisation model flexible enough to handle all variants.

Anti-bot infrastructure is the third. Cloudflare, Akamai Bot Manager, and DataDome are standard now on mid-tier e-commerce. Rotating IPs alone isn't sufficient — browser fingerprint, TLS fingerprint, and behavioural patterns all contribute to bot scoring.

Deduplication is the fourth, and the one nobody scopes. When vendor A sells "Sunburst Cotton Canvas 140cm" and vendor B sells "Canvas Cotton Sunburst 140 Wide", those are the same product. Building a canonical product record that merges vendor variants is harder than the scraping itself.

Architecting for 32 sites: the selector-per-vendor approach

Don't try to write a universal scraper. Write a scraper framework with per-vendor config. Each vendor gets a JSON config that specifies its selectors, pagination pattern, and any quirks:

{
  "vendor_id": "fabric-house-uk",
  "base_url": "https://fabrichouse.co.uk/wholesale",
  "pagination": {
    "type": "query_param",
    "param": "page",
    "max_pages": 50
  },
  "selectors": {
    "product_list": "div.product-grid article.product-card",
    "name": "h2.product-title",
    "price_raw": "span.price[data-price]",
    "price_attr": "data-price",
    "sku": "span.sku",
    "image": "img.product-image[src]",
    "detail_link": "a.product-link[href]"
  },
  "requires_js": false,
  "rate_limit_ms": 3000,
  "headers": {
    "Accept-Language": "en-GB,en;q=0.9"
  }
}

requires_js: false means use Cheerio on the raw HTML; true means spin up Playwright. This distinction matters for cost — Playwright is 4–8× slower and more resource-intensive than Cheerio. Only 6 of our 32 vendors required JavaScript rendering.
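As a rough illustration of how a job might consume this config, the Python sketch below builds the paginated listing URLs and picks an extraction engine. Python is used only for illustration; the function names `page_urls` and `engine_for` are hypothetical, and the sketch assumes page numbering starts at 1. A real job would also stop early once a page returns no products.

```python
from typing import Iterator
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_urls(config: dict) -> Iterator[str]:
    """Yield one listing URL per page, up to the configured max_pages."""
    pagination = config["pagination"]
    if pagination["type"] != "query_param":
        raise ValueError(f"unsupported pagination type: {pagination['type']}")
    scheme, netloc, path, query, fragment = urlsplit(config["base_url"])
    base_params = parse_qsl(query)  # keep any query params already on base_url
    for page in range(1, pagination["max_pages"] + 1):  # assumption: pages start at 1
        params = base_params + [(pagination["param"], str(page))]
        yield urlunsplit((scheme, netloc, path, urlencode(params), fragment))


def engine_for(config: dict) -> str:
    """Pick the cheap HTML parser unless the vendor needs JavaScript rendering."""
    return "playwright" if config.get("requires_js") else "cheerio"
```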

A vendor scrape job reads its config, paginates through the product list, extracts structured data per product, writes raw records to a staging table, and reports success/failure per vendor independently. A failure on Vendor 3 doesn't stop Vendor 4.
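The per-vendor isolation can be sketched in a few lines of Python. This is a minimal illustration, not the production job runner: `VendorResult` and `run_all` are hypothetical names, and each vendor's scraper is assumed to be a callable that returns a list of records. Treating an empty result as a failure is also an assumption, chosen because selector drift tends to show up as zero records rather than as an exception.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class VendorResult:
    vendor_id: str
    ok: bool
    record_count: int = 0
    error: Optional[str] = None


def run_all(scrapers: Dict[str, Callable[[], List[dict]]]) -> List[VendorResult]:
    """Run every vendor scraper, isolating failures to the vendor that caused them."""
    results = []
    for vendor_id, scrape in scrapers.items():
        try:
            records = scrape()
        except Exception as exc:  # one vendor's failure must not stop the next
            results.append(VendorResult(vendor_id, ok=False, error=repr(exc)))
            continue
        if not records:
            # Zero records usually means the selectors no longer match the page.
            results.append(VendorResult(vendor_id, ok=False, error="no records extracted"))
            continue
        results.append(VendorResult(vendor_id, ok=True, record_count=len(records)))
    return results
```

The list of `VendorResult` objects is what a monitoring step would read to alert on individual vendors.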

Normalisation: turning 14 price formats into one schema

Raw price strings from 32 vendors look like this in practice:

"£12.50/m"          → 1250, per_metre
"12.50 per metre"   → 1250, per_metre
"£125.00/10m roll"  → 1250, per_metre, min_order: 10
"1,250p per metre"  → 1250, per_metre
"€14.30 (excl VAT)" → 1430 EUR, per_unit (needs currency convert + VAT note)
"POA"               → null, requires_quote: true

Write a normalisation function that handles each known pattern with explicit regex branches, and a fallback that flags "needs manual review" for unrecognised formats:

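// Result shape inferred from the examples above; optional fields are set only by some branches.
type PriceResult = {
  pence: number | null;
  unit: 'metre' | null;
  min_order_metres?: number;
  requires_quote?: boolean;
  requires_review?: boolean;
  raw?: string;
};

function normalisePrice(raw: string): PriceResult {
  // Lowercase and drop thousands separators so "1,250p" parses as 1250.
  const stripped = raw.trim().toLowerCase().replace(/,/g, '');

  // "POA" has no numeric price: flag it for a quote rather than for a new pattern.
  if (stripped === 'poa') return { pence: null, unit: null, requires_quote: true, raw };

  // Pattern: £12.50/m, £12.50 per metre, or 12.50 per metre (pound sign optional).
  // Anchored at the start so a different currency symbol never matches.
  const perMetrePound = /^£?(\d+(?:\.\d{2})?)\s*(?:\/m|per metre)/;
  const m1 = stripped.match(perMetrePound);
  if (m1) return { pence: Math.round(parseFloat(m1[1]) * 100), unit: 'metre' };

  // Pattern: £125.00/10m roll (roll price divided by roll length)
  const rollPattern = /^£(\d+(?:\.\d{2})?)\s*\/\s*(\d+)m\s*roll/;
  const m2 = stripped.match(rollPattern);
  if (m2) return {
    pence: Math.round((parseFloat(m2[1]) / parseInt(m2[2], 10)) * 100),
    unit: 'metre',
    min_order_metres: parseInt(m2[2], 10)
  };

  // Pattern: 1250p or 1,250p per metre (a bare pence value is assumed to be per metre)
  const pencePattern = /^(\d+)p(?:\s*(?:\/m|per metre))?$/;
  const m3 = stripped.match(pencePattern);
  if (m3) return { pence: parseInt(m3[1], 10), unit: 'metre' };

  // Anything unrecognised goes to manual review with the raw string attached.
  return { pence: null, unit: null, requires_review: true, raw };
}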

Log every requires_review: true result. This is the same human-in-the-loop pattern we discuss in the OCR human-in-the-loop guide — not all automation decisions should be fully automatic, and logging the boundary cases is how you tighten coverage over time. At launch, you'll have 15–20% of records needing a new pattern. Within two weeks of production, you'll have coverage at 97%+. The remaining 3% (usually "POA" or seasonal pricing notes) need human flags, not code.

Deduplication: the problem nobody scopes at the start

The client assumed dedup was a search problem — "just search for the product name and merge duplicates." It's not. It's a record-linkage problem.

A product from Vendor A: Sunburst Cotton Canvas 140cm Wide Natural Weave at £11.80/m. The same product from Vendor B: Canvas Fabric Cotton Sunburst 140W Natural at £12.10/m. Token overlap on the title is about 60%. No exact match. But any buyer knows these are the same product.

Our approach: generate a content hash per product using a normalised feature vector (fabric type, width in cm, primary colour, weave pattern), not the raw title. Two products with the same feature vector but different titles are candidate duplicates — flagged for a fuzzy-match review step.

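import hashlib
import json

# normalise_fabric_type, extract_width_cm and extract_colour are domain helpers
# defined elsewhere; each maps free text to a normalised value.
def product_fingerprint(product: dict) -> str:
    # Build the fingerprint from extracted features, not the raw title.
    features = {
        'fabric_type': normalise_fabric_type(product['name']),
        'width_cm': extract_width_cm(product['name'] + ' ' + product.get('description', '')),
        'primary_colour': extract_colour(product['name']),
    }
    # sort_keys keeps the JSON stable, so equal features always hash the same.
    canonical = json.dumps(features, sort_keys=True).lower()
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]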

Candidates with the same fingerprint get a Levenshtein similarity check on the full normalised title. Above 0.75 similarity = merge. Below = separate records. This caught 847 duplicates across our 32 vendors, reducing the 11,283 raw records to 9,814 canonical products.

One practical detail: the merge creates a canonical record with a vendor_prices array — one entry per vendor, each with vendor ID, price, unit, min order, and last seen timestamp. This means the buyer can compare "Sunburst Cotton Canvas 140cm" from Vendor A at £11.80/m vs the same product from Vendor B at £12.10/m, without manually cross-referencing. It also means if Vendor A stops stocking the product, the canonical record survives from Vendor B. The canonical record is what the application layer shows; the raw vendor records live in a staging table, never exposed directly.
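One way to picture the canonical record is as a small data structure. The Python sketch below is illustrative only: the class and field names (`CanonicalProduct`, `VendorPrice`, `upsert`, `cheapest`) are assumptions based on the description above, not the production schema.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class VendorPrice:
    vendor_id: str
    price_pence: int
    unit: str
    min_order_metres: Optional[int]
    last_seen: datetime


@dataclass
class CanonicalProduct:
    fingerprint: str
    title: str
    vendor_prices: List[VendorPrice] = field(default_factory=list)

    def upsert(self, offer: VendorPrice) -> None:
        """Replace this vendor's existing entry, or append a new one."""
        self.vendor_prices = [p for p in self.vendor_prices if p.vendor_id != offer.vendor_id]
        self.vendor_prices.append(offer)

    def cheapest(self) -> Optional[VendorPrice]:
        """Lowest-priced current offer, or None if no vendor lists the product."""
        return min(self.vendor_prices, key=lambda p: p.price_pence, default=None)
```

Because each vendor owns exactly one entry in `vendor_prices`, dropping one vendor's entry leaves the canonical record and the other vendors' prices intact.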

Levenshtein alone is not sufficient for deduplication when product names differ significantly in structure. We supplement it with width-matching as a hard gate: two products with different widths (e.g., 140cm vs 150cm) cannot be the same product regardless of title similarity. Add the hard constraints from your domain before applying fuzzy matching — they cut false-positive merges dramatically.
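A minimal sketch of that gate-then-fuzzy-match order, in plain Python with no third-party libraries. The function names are hypothetical, and two choices here are assumptions rather than details from the project: similarity is computed as 1 minus the edit distance divided by the longer title's length, and a missing width is treated as "unknown" rather than as a mismatch.

```python
from typing import Optional


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalise distance to 0..1, where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def should_merge(title_a: str, width_a: Optional[int],
                 title_b: str, width_b: Optional[int],
                 threshold: float = 0.75) -> bool:
    # Hard gate first: two known, differing widths can never be the same product.
    if width_a is not None and width_b is not None and width_a != width_b:
        return False
    return similarity(title_a.lower(), title_b.lower()) > threshold
```

Running the cheap hard gate before the fuzzy comparison also avoids computing edit distances for pairs that can never merge.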

Anti-block strategy: staying in limits without burning proxies

For the 6 Cloudflare-protected vendors, we used Playwright with a stealth plugin. Playwright has no built-in stealth mode; the plugin is an add-on that masks or randomises the fingerprint signals (canvas, WebGL, navigator properties) that give a headless browser away. This handled 4 of the 6 without residential proxies.

The remaining 2 vendors had Cloudflare Turnstile challenges that defeated stealth mode. For these, we use Bright Data's Browser API at ~£50/month — a managed browser farm with residential IP rotation. Expensive relative to a raw scraper, but these 2 vendors hold 22% of the catalog by SKU count, so the cost is justified.

Rate limiting rules we apply to every vendor:

- Minimum 2,000ms delay between requests (randomised to 2,000–4,500ms)
- Maximum 25 product pages per hour per vendor
- Abort and flag if 3 consecutive pages return error codes
- Never scrape between 00:00–04:00 UTC (there is less traffic to blend in with)
- Log HTTP 429 responses as explicit rate-limit hits and back off exponentially: 30s, 2m, 8m

Three consecutive 429s on a vendor pauses it for 24 hours and sends an alert. This is not aggressive; it's polite, and polite scrapers last longer.
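The timing rules translate into a few small helpers. This Python sketch is an illustration under assumptions: the function names are hypothetical, the backoff is modelled as each step being four times the previous one (which reproduces 30s, 2m, 8m), and how the 24-hour pause interacts with the third backoff step is left to the caller.

```python
import random
from datetime import datetime, timezone


def next_delay_ms(rng: random.Random) -> int:
    """Randomised gap between requests, between 2,000 and 4,500 ms inclusive."""
    return rng.randint(2000, 4500)


def in_quiet_window(now: datetime) -> bool:
    """True between 00:00 and 04:00 UTC. Expects a timezone-aware datetime."""
    return 0 <= now.astimezone(timezone.utc).hour < 4


def backoff_seconds(consecutive_429: int) -> int:
    """Wait after the n-th consecutive HTTP 429: 30s, then 2m, then 8m."""
    return 30 * 4 ** (consecutive_429 - 1)


def should_pause_vendor(consecutive_429: int) -> bool:
    """Three consecutive 429s pause the vendor for 24 hours and raise an alert."""
    return consecutive_429 >= 3
```

Passing an explicit `random.Random` instance keeps the delay reproducible in tests while remaining random in production.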

The question of whether to identify your scraper honestly in the User-Agent header is worth considering. A User-Agent like FabricCatalogBot/1.0 (+https://yourclient.co.uk/bot) signals legitimate use and is more likely to receive a courteous "please use our API" response from a vendor's IT team than a block. Several of our vendor relationships started this way — a vendor emailed to ask what we were building, we explained, they gave us a private data feed. An honest User-Agent is not a magic shield against blocks, but it's the right starting point.

Schedule and freshness: how often is often enough?

Not all 32 vendors need daily scraping. We segment by update frequency:

| Tier | Update frequency | Vendors | Scrape schedule |
|------|------------------|---------|-----------------|
| A (pricing volatile) | Daily or continuous | 8 | Every 24h |
| B (weekly updates) | Weekly | 16 | Every 7 days |
| C (stable/seasonal) | Monthly | 8 | Every 30 days |

Tier classification is based on observed price change frequency from the first month of full scraping. Tier A vendors had at least one price change per week. Tier C vendors hadn't changed a price in four weeks of observation. If you're storing scraped data in a structured format and need to make it queryable, our post on when vector search beats keyword search covers the retrieval trade-offs, which apply to product catalog search as much as to document search.

This schedule halves the daily scraping volume and cuts compute cost by ~40% compared to scraping all vendors daily. Revisit tier assignments every 90 days — a vendor moving to a more dynamic pricing model (e.g., they added a "flash sale" feature) should move up a tier.
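A scheduler for these tiers only needs to answer one question per vendor: is a scrape due today? The Python sketch below is a minimal illustration; `SCRAPE_INTERVAL_DAYS` and `is_due` are hypothetical names, and treating a never-scraped vendor as always due is an assumption.

```python
from datetime import date, timedelta
from typing import Optional

# Days between scrapes for each tier, as described above.
SCRAPE_INTERVAL_DAYS = {"A": 1, "B": 7, "C": 30}


def is_due(tier: str, last_scraped: Optional[date], today: date) -> bool:
    """True when the vendor's tier interval has elapsed since its last scrape."""
    if last_scraped is None:
        return True  # assumption: a vendor never scraped before is always due
    return today - last_scraped >= timedelta(days=SCRAPE_INTERVAL_DAYS[tier])
```

Moving a vendor up or down a tier at the 90-day review is then a one-field change on the vendor record.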

What changed in 2025–2026: AI-assisted selector generation and LLM normalisation

Two developments changed the maintenance burden of catalog scraping materially in 2025.

Playwright's AI-assisted locator generation now suggests stable, maintainable selectors based on visible text and ARIA roles rather than brittle CSS paths. For new vendor onboarding, this cuts selector writing from 45 minutes to 10 minutes per vendor and produces selectors that survive minor template changes.

More significantly, vision-model extraction — feeding a rendered page screenshot to GPT-4o Vision or Claude — can extract structured product data without selectors at all for complex layouts. We tested this on 3 vendors with highly dynamic JavaScript-rendered pages where selector maintenance was costing us 2 hours/month each. Vision-model extraction with a JSON schema prompt dropped the maintenance to near-zero. Cost per page: ~£0.004 at current API pricing vs ~£0.001 for selector-based extraction — more expensive, but the maintenance saving justifies it for complex vendors.

The counterpoint: Apify's research on LLM-based scraping found that vision-model extraction has a 5–15% error rate on dense product tables vs under 1% for correctly maintained selectors. For a price-critical application where wrong pricing costs money, selectors with good monitoring are still the safer choice for the bulk of your vendor set.

Good / Bad / Ugly

Good: Per-vendor config JSON with isolated selectors, per-vendor failure monitoring, and a changelog table that tracks every price and availability change. You can answer "when did Vendor B last change the price of SKU 4821?" in one SQL query. The client's buyers use this daily.

Bad: A universal CSS selector strategy ("every vendor's price is in a .price element, surely"). It works for 8 vendors and silently returns null for the other 24. You don't find out until the client asks why half the catalog shows no price.

Ugly: Starting dedup as a manual cleanup task "after launch." One month in production with 32 vendors and no dedup, and you have 847 duplicate products, some with conflicting prices, some with different images. Users don't know which record is correct. Rebuilding the canonical dataset mid-production takes three days and breaks the client's integrations. Scope dedup on day one.



If your product data is scattered across supplier sites and your team is losing hours to manual price-checking, a scraper pipeline is the fix. Book a 30-minute audit and we'll scope what an automated catalog aggregator looks like for your vendor set.

FAQ

Is scraping legal for product catalog data?

For publicly accessible catalog data (pricing, product names, descriptions visible without login), scraping is widely practised in the UK but sits in a legal grey area rather than being clearly lawful. Ryanair v PR Aviation, often cited in this context, cuts the other way: the CJEU held that where a database is not protected by copyright or the database right, the site owner is free to restrict its use by contract. Terms of service restrictions on scraping are easiest to enforce when you have accepted them, for example by creating an account; how far they bind an anonymous visitor is less certain. Where vendors have an API or a data feed, use it: it's more reliable and less ambiguous. Always check whether you're scraping data that contains personal information, which would trigger UK GDPR.

How do we handle vendors that block scrapers?

First, check for an official data feed or API; many wholesalers offer EDI or CSV exports to trade customers. If scraping is the only option: use rotating residential proxies for the most aggressive blockers, add random delays between requests (2–8 seconds, not fixed intervals), randomise user-agent strings, and limit request rate to under 30 pages/hour per vendor. For Cloudflare-protected sites, Playwright with a stealth plugin or a managed browser or proxy service (Bright Data's Browser API, Apify Proxy, Smartproxy) handles most fingerprint challenges. Some sites are simply not worth scraping: if a vendor has active bot mitigation and no API, the maintenance cost can exceed the value.

What's the right storage format for scraped catalog data?

A relational schema (Postgres) with one table per entity type: products, images, prices, vendors, categories. Normalise aggressively — don't store the vendor's raw HTML structure in your schema. A price_history table with timestamps lets you track changes over time without overwriting. For images, store the original URL and a CDN-mirrored copy (S3 or Cloudflare R2) — vendor image URLs break; your CDN copy doesn't.
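As a rough sketch of the append-only price history idea, the Python snippet below uses the standard library's `sqlite3` as a stand-in for Postgres so that it runs anywhere. The table and column names are illustrative assumptions, and only three of the tables are shown.

```python
import sqlite3

SCHEMA = """
CREATE TABLE vendors  (vendor_id TEXT PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE products (product_id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE price_history (
    product_id  INTEGER NOT NULL REFERENCES products(product_id),
    vendor_id   TEXT    NOT NULL REFERENCES vendors(vendor_id),
    price_pence INTEGER NOT NULL,
    observed_at TEXT    NOT NULL  -- ISO 8601 UTC timestamp; rows are appended, never updated
);
"""


def last_price_change(conn: sqlite3.Connection, product_id: int, vendor_id: str):
    """Most recent (price_pence, observed_at) row for one product at one vendor, or None."""
    return conn.execute(
        "SELECT price_pence, observed_at FROM price_history "
        "WHERE product_id = ? AND vendor_id = ? "
        "ORDER BY observed_at DESC LIMIT 1",
        (product_id, vendor_id),
    ).fetchone()
```

Storing timestamps in one consistent ISO 8601 format is what makes the text-based `ORDER BY` sort chronologically in SQLite; in Postgres a `timestamptz` column would do the same job.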

How do we know when a vendor has updated a price or removed a product?

Hash each product record (price + title + availability) at scrape time. On the next scrape, compare hashes. A changed hash means the record changed — log the delta to a changelog table. A product present in the previous scrape but missing from the current one is a removal candidate — mark it inactive after two consecutive absences (to avoid marking something as removed because of a temporary scrape failure). Send a Slack alert for any single vendor with >10% removals in one run — that usually means the vendor redesigned their site and your selectors broke, not that they actually pulled 10% of their catalog.
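The hash-compare-and-count logic can be sketched as follows. This is a minimal Python illustration under assumptions: the field names (`price_pence`, `title`, `available`) and function names are hypothetical, records are keyed by SKU, and the previous run's hashes are assumed to be loaded into a dict.

```python
import hashlib
import json
from typing import Dict, List, Tuple


def record_hash(record: dict) -> str:
    """Hash only the fields whose changes we care about."""
    payload = {k: record.get(k) for k in ("price_pence", "title", "available")}
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()


def diff_scrape(previous: Dict[str, str],
                current: Dict[str, dict]) -> Tuple[List[str], List[str]]:
    """Return (changed_skus, missing_skus) from last run's hashes and this run's records."""
    changed = [sku for sku, rec in current.items()
               if sku in previous and previous[sku] != record_hash(rec)]
    missing = [sku for sku in previous if sku not in current]
    return changed, missing


def update_absences(absences: Dict[str, int], missing: List[str],
                    seen: List[str]) -> List[str]:
    """Track consecutive absences; return SKUs absent twice in a row."""
    for sku in seen:
        absences.pop(sku, None)  # a sighting resets the counter
    for sku in missing:
        absences[sku] = absences.get(sku, 0) + 1
    return [sku for sku, count in absences.items() if count >= 2]


def removal_alert(previous_count: int, missing_count: int) -> bool:
    """True when more than 10% of a vendor's products vanished in one run."""
    return previous_count > 0 and missing_count / previous_count > 0.10
```

SKUs returned by `update_absences` are the ones to mark inactive; a `True` from `removal_alert` is the signal to check the vendor's selectors before trusting the removals.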

Related Reading

OCR with human-in-the-loop: shipping 99% accuracy in production

Why 99% extraction accuracy still fails in production, and the queue-and-confidence pattern that makes hybrid OCR genuin

Document RAG: when vector search beats keyword search

Vector search isn't always the right call. A field guide for UK SMEs deciding when pgvector earns its complexity and whe

© 2026 Quantum Automations Group Ltd