Thirty-two vendor websites. Eleven thousand, two hundred and eighty-three products. Four different price formats. Three sites behind Cloudflare. Two that serve different HTML to headless browsers than to standard crawlers. One vendor that emails a PDF catalogue monthly and considers that a sufficient data feed. This was the actual brief from a UK fabric wholesaler — the project is documented in our Fabric Catalog Aggregator case study — who wanted to replace their buying team's daily manual price-checking ritual.
We shipped it in 18 days. It runs daily, costs £8/month in compute, and catches price changes within 24 hours. But the architecture we ended up with is significantly different from what we planned on day one — because product catalog scraping has four structural problems that don't appear in the brief.
Why product catalog scraping is harder than it looks in the brief
Selector drift is the first problem. A vendor updates their website template, and the CSS selector div.product-price span.amount that worked yesterday now returns nothing. This breaks silently unless you have per-vendor failure detection. On a 32-vendor system, you'll see selector drift on roughly 3–4 vendors per month. Design for it from the start: fail per vendor, not globally.
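A minimal sketch of per-vendor drift detection, assuming each scrape run produces a list of extracted records as dicts. The function name, the required field names, and the 20% threshold are illustrative assumptions, not values from the project.

```python
from typing import Optional


def detect_selector_drift(
    records: list[dict],
    required_fields: tuple[str, ...] = ("name", "price_raw"),
    max_missing_ratio: float = 0.2,  # assumed threshold, tune per vendor
) -> Optional[str]:
    """Return a readable reason if the run looks like selector drift, else None."""
    if not records:
        return "no products extracted"
    for field in required_fields:
        # Count records where the field is absent, None, or an empty string.
        missing = sum(1 for r in records if not r.get(field))
        ratio = missing / len(records)
        if ratio > max_missing_ratio:
            return f"{field} missing on {ratio:.0%} of products"
    return None
```

Running a check like this at the end of each vendor job turns a silent empty result into an explicit per-vendor alert.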
Schema heterogeneity is the second. Vendor A shows price as "£12.50/m". Vendor B shows "£12.50 per metre". Vendor C shows "1250p". Vendor D shows a base price and a "min. order 10m" note in a separate element. Normalising these into { price_pence: 1250, unit: "metre", min_order_metres: 10 } requires parsing logic specific to each vendor — or a normalisation model flexible enough to handle all variants.
Anti-bot infrastructure is the third. Cloudflare, Akamai Bot Manager, and DataDome are standard now on mid-tier e-commerce. Rotating IPs alone isn't sufficient — browser fingerprint, TLS fingerprint, and behavioural patterns all contribute to bot scoring.
Deduplication is the fourth, and the one nobody scopes. When vendor A sells "Sunburst Cotton Canvas 140cm" and vendor B sells "Canvas Cotton Sunburst 140 Wide", those are the same product. Building a canonical product record that merges vendor variants is harder than the scraping itself.
Architecting for 32 sites: the selector-per-vendor approach
Don't try to write a universal scraper. Write a scraper framework with per-vendor config. Each vendor gets a JSON config that specifies its selectors, pagination pattern, and any quirks:
{
  "vendor_id": "fabric-house-uk",
  "base_url": "https://fabrichouse.co.uk/wholesale",
  "pagination": {
    "type": "query_param",
    "param": "page",
    "max_pages": 50
  },
  "selectors": {
    "product_list": "div.product-grid article.product-card",
    "name": "h2.product-title",
    "price_raw": "span.price[data-price]",
    "price_attr": "data-price",
    "sku": "span.sku",
    "image": "img.product-image[src]",
    "detail_link": "a.product-link[href]"
  },
  "requires_js": false,
  "rate_limit_ms": 3000,
  "headers": {
    "Accept-Language": "en-GB,en;q=0.9"
  }
}
requires_js: false means use Cheerio on the raw HTML; true means spin up Playwright. This distinction matters for cost — Playwright is 4–8× slower and more resource-intensive than Cheerio. Only 6 of our 32 vendors required JavaScript rendering.
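As an illustration of how the framework might consume the pagination block of a vendor config, here is a small sketch that expands it into list-page URLs. The function name is hypothetical, and it assumes page numbering starts at 1, which the config itself does not state.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_urls(config: dict) -> list[str]:
    """Build the list-page URLs for a vendor that paginates via a query parameter."""
    pagination = config["pagination"]
    if pagination["type"] != "query_param":
        raise ValueError(f"unsupported pagination type: {pagination['type']}")
    parts = urlsplit(config["base_url"])
    existing = parse_qsl(parts.query)  # keep any query string already on base_url
    urls = []
    for page in range(1, pagination["max_pages"] + 1):  # assumes pages start at 1
        query = urlencode(existing + [(pagination["param"], page)])
        urls.append(urlunsplit(parts._replace(query=query)))
    return urls
```

In practice a job would likely stop early when a page returns no products, treating max_pages as a safety cap rather than a target.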
A vendor scrape job reads its config, paginates through the product list, extracts structured data per product, writes raw records to a staging table, and reports success/failure per vendor independently. A failure on Vendor 3 doesn't stop Vendor 4.
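The per-vendor isolation described above can be sketched as a runner that wraps each job in its own error boundary. The names VendorResult and run_all_vendors are hypothetical, and scrape_vendor stands in for whatever function fetches, paginates, and extracts for one config.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VendorResult:
    vendor_id: str
    ok: bool
    records: int = 0
    error: str = ""


def run_all_vendors(
    configs: list[dict],
    scrape_vendor: Callable[[dict], list[dict]],
) -> list[VendorResult]:
    """Run every vendor job in isolation so one failure never stops the rest."""
    results = []
    for config in configs:
        vendor_id = config["vendor_id"]
        try:
            records = scrape_vendor(config)  # hypothetical: fetch, paginate, extract
            results.append(VendorResult(vendor_id, ok=True, records=len(records)))
        except Exception as exc:  # deliberately broad: isolate any vendor failure
            results.append(VendorResult(vendor_id, ok=False, error=str(exc)))
    return results
```

The returned list doubles as the per-vendor success/failure report, so alerting can key off individual entries rather than a global job status.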
Normalisation: turning 14 price formats into one schema
Raw price strings from 32 vendors look like this in practice:
"£12.50/m" → 1250, per_metre
"12.50 per metre" → 1250, per_metre
"£125.00/10m roll" → 1250, per_metre, min_order: 10
"1,250p per metre" → 1250, per_metre
"€14.30 (excl VAT)" → 1430 EUR, per_unit (needs currency convert + VAT note)
"POA" → null, requires_quote: true
Write a normalisation function that handles each known pattern with explicit regex branches, and a fallback that flags "needs manual review" for unrecognised formats:
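type PriceResult = {
  pence: number | null;
  unit: 'metre' | null;
  min_order_metres?: number;
  requires_review?: boolean;
  raw?: string;
};

function normalisePrice(raw: string): PriceResult {
  const stripped = raw.trim().toLowerCase();

  // Pattern: £12.50/m, £12.50 per metre, or 12.50 per metre.
  // Anchored at the start so a different currency symbol is never read as pounds.
  const perMetrePound = /^£?(\d+(?:\.\d{2})?)\s*(?:\/m|per metre)/;
  const m1 = stripped.match(perMetrePound);
  if (m1) return { pence: Math.round(parseFloat(m1[1]) * 100), unit: 'metre' };

  // Pattern: £125.00/10m roll, converted to a per-metre price
  const rollPattern = /^£(\d+(?:\.\d{2})?)\s*\/\s*(\d+)m\s*roll/;
  const m2 = stripped.match(rollPattern);
  if (m2) return {
    pence: Math.round((parseFloat(m2[1]) / parseInt(m2[2], 10)) * 100),
    unit: 'metre',
    min_order_metres: parseInt(m2[2], 10)
  };

  // Pattern: 1250p or 1,250p per metre (thousands separators are stripped)
  const pencePattern = /^(\d{1,3}(?:,\d{3})+|\d+)p(?:\s*per metre)?$/;
  const m3 = stripped.match(pencePattern);
  if (m3) return { pence: parseInt(m3[1].replace(/,/g, ''), 10), unit: 'metre' };

  // Anything unrecognised is flagged for manual review rather than guessed
  return { pence: null, unit: null, requires_review: true, raw };
}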
Log every requires_review: true result. This is the same human-in-the-loop pattern we discuss in the OCR human-in-the-loop guide — not all automation decisions should be fully automatic, and logging the boundary cases is how you tighten coverage over time. At launch, you'll have 15–20% of records needing a new pattern. Within two weeks of production, you'll have coverage at 97%+. The remaining 3% (usually "POA" or seasonal pricing notes) need human flags, not code.
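One way to act on those logs is a periodic report that shows parse coverage and the most common unrecognised price shapes, so the next regex branch goes where it pays off most. This is a sketch assuming results are stored as dicts with requires_review and raw keys; the function name and the digit-collapsing trick are illustrative choices.

```python
import re
from collections import Counter


def review_report(results: list[dict], top_n: int = 5) -> dict:
    """Summarise parse coverage and the most common unrecognised price shapes."""
    total = len(results)
    flagged = [r for r in results if r.get("requires_review")]
    # Collapse digit runs so "£12.50 each" and "£9.99 each" count as one shape.
    shapes = Counter(re.sub(r"\d+", "N", r["raw"].strip().lower()) for r in flagged)
    coverage = 1.0 if total == 0 else 1 - len(flagged) / total
    return {"coverage": coverage, "top_unrecognised": shapes.most_common(top_n)}
```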
Deduplication: the problem nobody scopes at the start
The client assumed dedup was a search problem — "just search for the product name and merge duplicates." It's not. It's a record-linkage problem.
A product from Vendor A: Sunburst Cotton Canvas 140cm Wide Natural Weave at £11.80/m. The same product from Vendor B: Canvas Fabric Cotton Sunburst 140W Natural at £12.10/m. Token overlap on the title is about 60%. No exact match. But any buyer knows these are the same product.
Our approach: generate a content hash per product using a normalised feature vector (fabric type, width in cm, primary colour, weave pattern), not the raw title. Two products with the same feature vector but different titles are candidate duplicates — flagged for a fuzzy-match review step.
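import hashlib
import json

# normalise_fabric_type, extract_width_cm and extract_colour are domain helpers
# assumed to be defined elsewhere; each maps free text to a canonical value.
def product_fingerprint(product: dict) -> str:
    features = {
        'fabric_type': normalise_fabric_type(product['name']),
        'width_cm': extract_width_cm(product['name'] + ' ' + product.get('description', '')),
        'primary_colour': extract_colour(product['name']),
        # weave pattern can be added here in the same way
    }
    # sort_keys keeps the serialisation stable, so equal features give equal hashes
    canonical = json.dumps(features, sort_keys=True).lower()
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]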
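The fingerprint depends on helpers such as extract_width_cm, which the article does not show. One possible shape for that helper, as a sketch: it assumes widths are always quoted in centimetres and appear as "140cm", "140W", or "140 Wide", which matches the example titles here but may not cover every vendor.

```python
import re
from typing import Optional

# Matches "140cm", "140 cm", "140w", "140 wide"; assumes centimetres throughout.
_WIDTH_RE = re.compile(r"\b(\d{2,3})\s*(?:cm|w|wide)\b", re.IGNORECASE)


def extract_width_cm(text: str) -> Optional[int]:
    """Return the first width found in the text, in centimetres, or None."""
    match = _WIDTH_RE.search(text)
    return int(match.group(1)) if match else None
```

Returning None rather than a default width keeps unknown widths visible, so they can be handled explicitly downstream.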
Candidates with the same fingerprint get a Levenshtein similarity check on the full normalised title. Above 0.75 similarity = merge. Below = separate records. This caught 847 duplicates across our 32 vendors, reducing the 11,283 raw records to 9,814 canonical products.
One practical detail: the merge creates a canonical record with a vendor_prices array — one entry per vendor, each with vendor ID, price, unit, min order, and last seen timestamp. This means the buyer can compare "Sunburst Cotton Canvas 140cm" from Vendor A at £11.80/m vs the same product from Vendor B at £12.10/m, without manually cross-referencing. It also means if Vendor A stops stocking the product, the canonical record survives from Vendor B. The canonical record is what the application layer shows; the raw vendor records live in a staging table, never exposed directly.
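A sketch of what such a canonical record could look like, using dataclasses. The class and method names are hypothetical; only the vendor_prices array and its fields (vendor ID, price, unit, min order, last seen) come from the description above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class VendorPrice:
    vendor_id: str
    pence: int
    unit: str
    min_order_metres: Optional[int]
    last_seen: datetime


@dataclass
class CanonicalProduct:
    fingerprint: str
    title: str
    vendor_prices: list[VendorPrice] = field(default_factory=list)

    def upsert(self, entry: VendorPrice) -> None:
        """Replace this vendor's existing entry, or append a new one."""
        self.vendor_prices = [p for p in self.vendor_prices if p.vendor_id != entry.vendor_id]
        self.vendor_prices.append(entry)

    def cheapest(self) -> Optional[VendorPrice]:
        """Lowest per-unit price across vendors, or None if no vendor lists it."""
        return min(self.vendor_prices, key=lambda p: p.pence, default=None)
```

Because each vendor has at most one entry, dropping a vendor that stops stocking the product leaves the canonical record intact as long as one entry remains.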
Levenshtein alone is not sufficient for deduplication when product names differ significantly in structure. We supplement it with width-matching as a hard gate: two products with different widths (e.g., 140cm vs 150cm) cannot be the same product regardless of title similarity. Add the hard constraints from your domain before applying fuzzy matching — they cut false-positive merges dramatically.
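The gate-then-fuzzy-match order can be sketched in plain Python. This assumes similarity is edit distance divided by the longer title's length, which is one common normalisation and not necessarily the one used in the project; the function names are illustrative.

```python
from typing import Optional


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, computed one row at a time."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance scaled to 0..1, where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def should_merge(title_a: str, width_a: Optional[int],
                 title_b: str, width_b: Optional[int],
                 threshold: float = 0.75) -> bool:
    # Hard gate first: known, differing widths can never be the same product.
    if width_a is not None and width_b is not None and width_a != width_b:
        return False
    return similarity(title_a, title_b) > threshold
```

Checking the hard constraint before the fuzzy score also saves work, since the edit-distance computation is skipped for pairs that can never merge.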
Anti-block strategy: staying in limits without burning proxies
For the 6 Cloudflare-protected vendors, we used Playwright with a stealth plugin: a headless browser patched to mask or randomise fingerprint signals (canvas, WebGL, navigator properties) to avoid detection. This handled 4 of the 6 without residential proxies.
The remaining 2 vendors had Cloudflare Turnstile challenges that defeated stealth mode. For these, we use Bright Data's Browser API at ~£50/month — a managed browser farm with residential IP rotation. Expensive relative to a raw scraper, but these 2 vendors hold 22% of the catalog by SKU count, so the cost is justified.
Rate limiting rules we apply to every vendor:
- Minimum 2,000ms delay between requests (randomised to 2,000–4,500ms)
- Maximum 25 product pages per hour per vendor
- Abort and flag if 3 consecutive pages return error codes
- Never scrape between 00:00–04:00 UTC (there is less organic traffic to blend in with)
- Log HTTP 429 responses as explicit rate-limit hits and back off exponentially: 30s, 2m, 8m

Three consecutive 429s on a vendor pause it for 24 hours and send an alert. This is not aggressive; it's polite, and polite scrapers last longer.
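The delay and 429 rules can be sketched as a small per-vendor throttle. The names are hypothetical, and the sketch reads the rules as "the third consecutive 429 triggers the 24-hour pause", so with the default setting the 8-minute step is only reached if pause_after is raised above 3. That reading is an interpretation, not something the rules spell out.

```python
import random

BACKOFF_SECONDS = [30, 120, 480]  # 30s, 2m, 8m
PAUSE_SECONDS = 24 * 60 * 60      # pause the vendor for 24 hours


def next_delay_seconds() -> float:
    """Randomised politeness delay between 2.0 and 4.5 seconds."""
    return random.uniform(2.0, 4.5)


class VendorThrottle:
    """Tracks consecutive HTTP 429 responses for a single vendor."""

    def __init__(self, pause_after: int = 3) -> None:
        self.pause_after = pause_after
        self.consecutive_429s = 0

    def on_response(self, status: int) -> float:
        """Return extra seconds to wait before this vendor's next request."""
        if status != 429:
            self.consecutive_429s = 0  # any other response resets the streak
            return 0.0
        self.consecutive_429s += 1
        if self.consecutive_429s >= self.pause_after:
            return float(PAUSE_SECONDS)  # caller should also send an alert
        step = min(self.consecutive_429s, len(BACKOFF_SECONDS)) - 1
        return float(BACKOFF_SECONDS[step])
```

Keeping one throttle instance per vendor means a rate-limited vendor backs off without slowing the others.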
The question of whether to identify your scraper honestly in the User-Agent header is worth considering. A User-Agent like FabricCatalogBot/1.0 (+https://yourclient.co.uk/bot) signals legitimate use and is more likely to receive a courteous "please use our API" response from a vendor's IT team than a block. Several of our vendor relationships started this way — a vendor emailed to ask what we were building, we explained, they gave us a private data feed. An honest User-Agent is not a magic shield against blocks, but it's the right starting point.
Schedule and freshness: how often is often enough?
Not all 32 vendors need daily scraping. We segment by update frequency:
| Tier | Update frequency | Vendors | Scrape schedule |
|---|---|---|---|
| A (pricing volatile) | Daily or continuous | 8 | Every 24h |
| B (weekly updates) | Weekly | 16 | Every 7 days |
| C (stable/seasonal) | Monthly | 8 | Every 30 days |
Tier classification is based on observed price change frequency from the first month of full scraping. Tier A vendors had at least one price change per week. Tier C vendors hadn't changed a price in four weeks of observation. If you're storing scraped data in a structured format and need to make it queryable, our post on when vector search beats keyword search covers the retrieval trade-offs, which apply to product catalog search as much as to document search.
This schedule halves the daily scraping volume and cuts compute cost by ~40% compared to scraping all vendors daily. Revisit tier assignments every 90 days — a vendor moving to a more dynamic pricing model (e.g., they added a "flash sale" feature) should move up a tier.
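A tiered schedule like this reduces to one check per vendor before each run. A minimal sketch, assuming each vendor record stores its tier letter and the timestamp of its last successful scrape; the names are illustrative.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

TIER_INTERVALS = {
    "A": timedelta(hours=24),
    "B": timedelta(days=7),
    "C": timedelta(days=30),
}


def is_due(tier: str, last_scraped: Optional[datetime], now: datetime) -> bool:
    """A vendor is due if it was never scraped or its tier interval has elapsed."""
    if last_scraped is None:
        return True
    return now - last_scraped >= TIER_INTERVALS[tier]
```

Moving a vendor up or down a tier is then a one-field change on its record, which keeps the 90-day review cheap.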
What changed in 2025–2026: AI-assisted selector generation and LLM normalisation
Two developments changed the maintenance burden of catalog scraping materially in 2025.
Playwright's AI-assisted locator generation now suggests stable, maintainable selectors based on visible text and ARIA roles rather than brittle CSS paths. For new vendor onboarding, this cuts selector writing from 45 minutes to 10 minutes per vendor and produces selectors that survive minor template changes.
More significantly, vision-model extraction — feeding a rendered page screenshot to GPT-4o Vision or Claude — can extract structured product data without selectors at all for complex layouts. We tested this on 3 vendors with highly dynamic JavaScript-rendered pages where selector maintenance was costing us 2 hours/month each. Vision-model extraction with a JSON schema prompt dropped the maintenance to near-zero. Cost per page: ~£0.004 at current API pricing vs ~£0.001 for selector-based extraction — more expensive, but the maintenance saving justifies it for complex vendors.
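Whether the maintenance saving justifies the higher per-page cost depends on page volume and on what an hour of maintenance costs you. A rough break-even sketch using the per-page figures above; the hourly rate is a placeholder you would supply, not a number from the project.

```python
SELECTOR_COST_PER_PAGE = 0.001   # GBP, from the figures above
VISION_COST_PER_PAGE = 0.004     # GBP, from the figures above
MAINTENANCE_HOURS_PER_MONTH = 2  # per complex vendor, from the figures above


def break_even_pages_per_month(hourly_rate_gbp: float) -> float:
    """Monthly page count above which vision extraction costs more than it saves."""
    extra_per_page = VISION_COST_PER_PAGE - SELECTOR_COST_PER_PAGE
    maintenance_saved = MAINTENANCE_HOURS_PER_MONTH * hourly_rate_gbp
    return maintenance_saved / extra_per_page
```

At a hypothetical £60/hour, for example, the saving is £120/month per vendor, so vision extraction stays cheaper until a vendor needs roughly 40,000 pages a month.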
The counterpoint: Apify's research on LLM-based scraping found that vision-model extraction has a 5–15% error rate on dense product tables vs under 1% for correctly maintained selectors. For a price-critical application where wrong pricing costs money, selectors with good monitoring are still the safer choice for the bulk of your vendor set.
Good / Bad / Ugly
Good: Per-vendor config JSON with isolated selectors, per-vendor failure monitoring, and a changelog table that tracks every price and availability change. You can answer "when did Vendor B last change the price of SKU 4821?" in one SQL query. The client's buyers use this daily.
Bad: A universal CSS selector strategy ("every vendor's price is in a .price element, surely"). It works for 8 vendors and silently returns null for the other 24. You don't find out until the client asks why half the catalog shows no price.
Ugly: Starting dedup as a manual cleanup task "after launch." One month in production with 32 vendors and no dedup, and you have 847 duplicate products, some with conflicting prices, some with different images. Users don't know which record is correct. Rebuilding the canonical dataset mid-production takes three days and breaks the client's integrations. Scope dedup on day one.
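The one-query changelog lookup mentioned under Good can be sketched with an in-memory SQLite table. The table name, column names, and sample rows are all illustrative assumptions, not the project's schema or data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE price_changes (
        vendor_id  TEXT NOT NULL,
        sku        TEXT NOT NULL,
        old_pence  INTEGER,
        new_pence  INTEGER,
        changed_at TEXT NOT NULL  -- ISO 8601 UTC timestamp, sorts chronologically
    )
    """
)
# Made-up sample rows, for illustration only.
conn.executemany(
    "INSERT INTO price_changes VALUES (?, ?, ?, ?, ?)",
    [
        ("vendor-b", "4821", 1180, 1210, "2025-03-02T06:00:00Z"),
        ("vendor-b", "4821", 1210, 1195, "2025-04-11T06:00:00Z"),
        ("vendor-a", "4821", 1150, 1180, "2025-05-01T06:00:00Z"),
    ],
)


def last_price_change(vendor_id: str, sku: str):
    """Most recent change row for one vendor and SKU, or None if there is none."""
    return conn.execute(
        "SELECT old_pence, new_pence, changed_at FROM price_changes "
        "WHERE vendor_id = ? AND sku = ? ORDER BY changed_at DESC LIMIT 1",
        (vendor_id, sku),
    ).fetchone()
```

An append-only table like this only grows when something changes, so it stays small even with daily scrapes.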
If your product data is scattered across supplier sites and your team is losing hours to manual price-checking, a scraper pipeline is the fix. Book a 30-minute audit and we'll scope what an automated catalog aggregator looks like for your vendor set.