# How to Build a Product Scraper for Multiple Sites

 by [Hisham Medhat](https://scrapfly.io/blog/author/hisham) Jun 30, 2026 22 min read [\#ecommerce](https://scrapfly.io/blog/tag/ecommerce) [\#project](https://scrapfly.io/blog/tag/project) [\#python](https://scrapfly.io/blog/tag/python) 


Scraping product data from one ecommerce site is a parsing challenge. You pick the right selectors, handle pagination, and deal with anti-bot protection. Scraping product data from ten or forty sites is a different problem entirely. Every source structures the same product differently, breaks at different times, and fights your scrapers in different ways.

Most tutorials teach you how to scrape a single site. Few explain how to build the architecture that makes multi-source product data aggregation manageable.

In this guide, we'll cover how to design a canonical product schema, build config-driven scrapers, and normalize data across sources. We'll also cover product matching, deduplication, and anti-bot protection. These patterns scale from 3 sources to 30.

[How to Track Competitor Prices Using Web Scraping](https://scrapfly.io/blog/posts/how-to-track-competitor-pricing-using-web-scraping): In this web scraping guide, we'll explain how to create a tool for tracking competitor prices using Python. It will scrape specific products from different providers, compare their prices and generate insights.



## Key Takeaways

Build a multi-source product scraping pipeline using Python with a canonical schema, config-driven adapters, and normalization patterns:

- Design a canonical product schema with fixed core fields and a flexible attributes dict for source-specific extras
- Use the adapter pattern so each ecommerce source is a configuration file, not a separate codebase
- Apply four normalization operations to raw scraped data: field mapping, value expansion, unit conversion, and string wrangling
- Match products across sources using UPC/GTIN identifiers when available, with fuzzy title matching as a fallback
- Encode each source's anti-bot requirements in its adapter config rather than hardcoding workarounds
- Monitor per-source data quality and freshness to catch silent failures before stale data reaches your pipeline








## Why Scraping Multiple Ecommerce Sites Is Harder Than Scraping One

With one site, scraping is a parsing problem. With ten or forty sites, it becomes a systems architecture problem, because every source structures data differently, breaks at different times, and blocks scrapers in different ways.

Three concrete problems emerge when you move from single-site to multi-site scraping:

- **Schema fragmentation**: Site A calls the price field "cost", site B calls the same field "sale\_price", and site C embeds the price inside JSON-LD structured data. Multiply that inconsistency across every product attribute, and you get "schema hell" where merging data from multiple sources into one database becomes a field mapping nightmare.
- **Heterogeneous anti-bot protection**: One site runs [Cloudflare](https://scrapfly.io/blog/posts/how-to-bypass-cloudflare-anti-scraping), another uses [Akamai](https://scrapfly.io/blog/posts/how-to-bypass-akamai-anti-scraping), and a third relies on custom detection logic. Each source needs different bypass strategies, proxy configurations, and request patterns.
- **Maintenance burden**: When any of your N sources changes its page layout, that source's parser breaks. You need to detect which source broke, diagnose the change, and fix the parser without disrupting the rest of the pipeline.

Common approaches that fail at scale include copy-pasting standalone scrapers for each site and manually cleaning data in spreadsheets. Both create maintenance overhead that grows with every new source you add.

The solution rests on three pillars: a canonical schema that all sources map to, a config-driven adapter pattern, and a normalization pipeline. The rest of this article teaches each pillar with working code.



## Designing a Canonical Product Schema

A canonical schema is the unified data model that all source-specific parsers map to. Define this schema before writing a single scraper. Every downstream operation depends on consistent field names and data types, so the schema is the most important decision you'll make.

### Choosing Your Core Fields

The core fields represent product attributes that every ecommerce source can provide. Define the schema as a Python dataclass so your pipeline gets type checking and clear documentation:

```python
from dataclasses import dataclass, field
from typing import Optional
from datetime import datetime, timezone


@dataclass
class Product:
    """Canonical product schema for multi-source aggregation."""
    # composite key: source + source_product_id uniquely identifies a record
    source: str
    source_url: str
    source_product_id: str
    # core product attributes
    title: str
    price: float
    currency: str
    availability: bool
    brand: Optional[str] = None
    category: Optional[str] = None
    image_url: Optional[str] = None
    description: Optional[str] = None
    # identifiers for cross-source matching
    upc: Optional[str] = None
    gtin: Optional[str] = None
    # flexible storage for source-specific extras
    attributes: dict = field(default_factory=dict)
    # metadata
    scraped_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
```



The `source` and `source_product_id` fields together form a composite key. Each record is unique within one source, and the combination identifies a specific product from a specific retailer. The `price` field always stores a float in the base currency after normalization. The `currency` field records the original currency code (USD, EUR, GBP) so you can trace the conversion.

The [Schema.org Product vocabulary](https://schema.org/Product) is a useful reference for field names. Check the vocabulary to see which attributes are common across ecommerce sites.
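As a rough illustration of how that vocabulary lines up with the canonical fields, the sketch below pulls a Schema.org `Product` object out of a page's JSON-LD and renames its properties. The helper names `find_jsonld_product` and `jsonld_to_canonical` are hypothetical, and the regex-based script lookup is an assumption made to keep the example dependency-free. A real pipeline would more likely use an HTML parser such as parsel.

```python
import json
import re
from typing import Optional

# Matches <script type="application/ld+json"> blocks (regex is a shortcut for this sketch)
JSON_LD_RE = re.compile(
    r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)


def find_jsonld_product(html: str) -> Optional[dict]:
    """Return the first JSON-LD object whose @type is Product, if any."""
    for block in JSON_LD_RE.findall(html):
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of failing the page
        candidates = data if isinstance(data, list) else [data]
        for item in candidates:
            if isinstance(item, dict) and item.get("@type") == "Product":
                return item
    return None


def jsonld_to_canonical(item: dict) -> dict:
    """Map Schema.org Product properties onto canonical field names."""
    offers = item.get("offers") or {}
    if isinstance(offers, list):  # some pages list several offers
        offers = offers[0] if offers else {}
    brand = item.get("brand")
    if isinstance(brand, dict):  # brand may be a nested Brand object
        brand = brand.get("name")
    price = offers.get("price")
    return {
        "title": item.get("name"),
        "brand": brand,
        "price": float(price) if price is not None else None,
        "currency": offers.get("priceCurrency"),
        "gtin": item.get("gtin13") or item.get("gtin"),
        "description": item.get("description"),
    }
```

The returned dict uses the same names as the core fields, so it can feed the `Product` constructor once the source-specific fields (`source`, `source_url`, `source_product_id`) are filled in.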



### Handling Optional and Source-Specific Attributes

Some sources provide fields that others don't. One retailer might include a star rating and review count. Another might list color options, weight, and dimensions. A third might expose the manufacturer part number (MPN) but not the brand name.

The wrong approach is adding a new column for every source-specific field. That leads to a sparse table where most values are null and the schema changes every time you add a source. Instead, use a fixed set of core fields for matching and comparison, plus a flexible `attributes` dict for everything else:

```python
product = Product(
    source="web-scraping.dev",
    source_url="https://web-scraping.dev/product/1",
    source_product_id="1",
    title="Box of Chocolate Candy",
    price=24.99,
    currency="USD",
    availability=True,
    brand="ChocoWorld",
    attributes={
        "rating": 4.5,
        "review_count": 122,
        "weight": "1.5 lbs",
        "flavor": "Assorted",
    },
)
```



The `attributes` dict holds source-specific extras without forcing schema changes. You can query, filter, and compare products using the core fields while preserving every detail the source provided.
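To make the composite key concrete, here is a minimal sketch of an in-memory store keyed by `(source, source_product_id)`. `ProductRecord` is a trimmed, hypothetical stand-in for the full `Product` class, and the plain dict stands in for whatever database you actually use.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class ProductRecord:
    """Trimmed, hypothetical stand-in for the canonical Product class."""
    source: str
    source_product_id: str
    title: str
    price: float
    attributes: dict = field(default_factory=dict)


def upsert(store: dict, record: ProductRecord) -> None:
    """Insert or overwrite a record, keyed by (source, source_product_id)."""
    key = (record.source, record.source_product_id)
    store[key] = asdict(record)


store: dict = {}
upsert(store, ProductRecord("web-scraping.dev", "1", "Box of Chocolate Candy", 24.99,
                            {"rating": 4.5}))
# a later scrape of the same product replaces the earlier row instead of duplicating it
upsert(store, ProductRecord("web-scraping.dev", "1", "Box of Chocolate Candy", 22.49,
                            {"rating": 4.6}))
# the same ID from another source is a different record
upsert(store, ProductRecord("mock-store", "1", "Chocolate Candy Box", 23.00))
```

After these three calls the store holds two rows, because the second call overwrote the first and the third used a different source.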



## Building Per-Source Scrapers with a Config-Driven Pattern

With the canonical schema in place, the next step is building scrapers that populate the schema from each source. The goal is an architecture where adding a new ecommerce site doesn't mean writing a new scraper from scratch.

The fastest way to turn a multi-source pipeline into unmaintainable spaghetti is to write a standalone scraper for each site. Instead, use an adapter pattern where a configuration defines each source. Add a new site by adding a config file, not by writing new scraper code from scratch.

### The Adapter Pattern for Ecommerce Scrapers

Start with a base scraper class that handles the common workflow: fetch a page, parse it into canonical Product objects, and handle errors. Each source gets its parsing logic from configuration rather than subclassing:

```python
import httpx
from parsel import Selector
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SourceConfig:
    """Configuration for a single ecommerce source."""
    name: str
    base_url: str
    search_url: str
    # CSS/XPath selectors for product data
    selectors: dict
    # anti-bot and request configuration
    requires_js: bool = False
    rate_limit: float = 1.0  # seconds between requests
    headers: dict = field(default_factory=dict)


# define sources as configuration dicts
SOURCES = {
    "web-scraping-dev": SourceConfig(
        name="web-scraping.dev",
        base_url="https://web-scraping.dev",
        search_url="https://web-scraping.dev/products",
        selectors={
            "product_list": "div.row.product",
            "title": "h3 a::text",
            "price": "div.price::text",
            "image": "img::attr(src)",
            "link": "h3 a::attr(href)",
            "description": "p.product-description::text",
        },
    ),
    "mock-store": SourceConfig(
        name="mock-store",
        base_url="https://web-scraping.dev",
        search_url="https://web-scraping.dev/products?page=2",
        selectors={
            "product_list": "div.row.product",
            "title": "h3 a::text",
            "price": "div.price::text",
            "image": "img::attr(src)",
            "link": "h3 a::attr(href)",
            "description": "p.product-description::text",
        },
        rate_limit=2.0,
    ),
}
```



The `SourceConfig` dataclass holds everything the scraper needs: the site URL, CSS selectors, rate limits, and whether the source requires JavaScript rendering. The `SOURCES` dict acts as a registry mapping each key to a source configuration.

Now write a single scraper function that works with any source configuration:

```python
import asyncio
import re
from urllib.parse import urljoin


async def scrape_source(config: SourceConfig) -> list[Product]:
    """Scrape products from a source using its configuration."""
    async with httpx.AsyncClient(headers=config.headers) as client:
        response = await client.get(config.search_url)
        response.raise_for_status()

    selector = Selector(response.text)
    products = []
    for item in selector.css(config.selectors["product_list"]):
        # drop thousands separators, then take the first decimal number
        raw_price = item.css(config.selectors["price"]).get("").replace(",", "")
        price_match = re.search(r"\d+(?:\.\d+)?", raw_price)
        price = float(price_match.group()) if price_match else 0.0
        link = item.css(config.selectors["link"]).get("")

        product = Product(
            source=config.name,
            # urljoin handles both relative and absolute product links
            source_url=urljoin(config.base_url, link),
            source_product_id=link.rstrip("/").split("/")[-1] if link else "",
            title=item.css(config.selectors["title"]).get("").strip(),
            price=price,
            currency="USD",
            availability=True,
            image_url=item.css(config.selectors["image"]).get(""),
        )
        products.append(product)
    return products
```



The `scrape_source` function reads selectors from the configuration and maps the extracted data into canonical Product objects. The same function handles every source. The only difference between sources is the configuration dict.

For more on [CSS selectors](https://scrapfly.io/blog/posts/parsing-html-with-css) and [XPath selectors](https://scrapfly.io/blog/posts/parsing-html-with-xpath), check out our dedicated parsing guides.



### Adding a New Source Without Rewriting Code

Adding a new ecommerce source follows a three-step process: create the configuration, test the parser, and register the source. Here's how to add a hypothetical third source:

```python
# step 1: add configuration to the source registry
SOURCES["electronics-hub"] = SourceConfig(
    name="electronics-hub",
    base_url="https://web-scraping.dev",
    search_url="https://web-scraping.dev/products?page=3",
    selectors={
        "product_list": "div.row.product",
        "title": "h3 a::text",
        "price": "div.price::text",
        "image": "img::attr(src)",
        "link": "h3 a::attr(href)",
        "description": "p.product-description::text",
    },
    requires_js=True,
    rate_limit=1.5,
)


# step 2: scrape all registered sources
async def scrape_all_sources() -> list[Product]:
    """Scrape products from all registered sources."""
    all_products = []
    for source_id, config in SOURCES.items():
        try:
            products = await scrape_source(config)
            all_products.extend(products)
            print(f"Scraped {len(products)} products from {config.name}")
        except Exception as e:
            print(f"Failed to scrape {config.name}: {e}")
    return all_products
```



The source registry (`SOURCES` dict) maps source identifiers to their configurations. Adding your 20th source means adding a configuration entry and testing the selectors. No new scraper code required.
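Testing the selectors can start before any request is sent. The sketch below checks that a source's selector mapping contains the keys the scraper function reads. The `REQUIRED_SELECTORS` set and the `validate_selectors` helper are assumptions for illustration, not part of the code above.

```python
# selector keys the scraper reads for every source (assumed minimum set)
REQUIRED_SELECTORS = {"product_list", "title", "price", "link"}


def validate_selectors(source_id: str, selectors: dict) -> list[str]:
    """Return a list of problems found in one source's selector mapping."""
    problems = []
    missing = REQUIRED_SELECTORS - selectors.keys()
    for key in sorted(missing):
        problems.append(f"{source_id}: missing required selector '{key}'")
    for key, value in selectors.items():
        # an empty selector would silently match nothing at scrape time
        if not isinstance(value, str) or not value.strip():
            problems.append(f"{source_id}: selector '{key}' is empty")
    return problems
```

Running this over every registry entry at startup turns a typo in a config into an immediate error message rather than an empty result set discovered later.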

For sites where writing selectors is impractical due to complex or frequently changing layouts, Scrapfly's extraction API offers an alternative. Pass any product page URL and get structured product data back without defining selectors at all. We'll cover the extraction API approach in the normalization section.



## Normalizing Product Data Across Sources

Raw scraped data from ten sources will have ten different ways of expressing the same product attributes. Normalization turns this mess into consistent, comparable values in your canonical schema. Without normalization, you can't accurately compare prices, match products, or run analytics across sources.

### Field Mapping and Value Standardization

Field mapping translates source-specific field names to your canonical schema. Value standardization makes sure the same concept uses the same representation across all sources:

```python
# field mapping: source-specific names to canonical fields
FIELD_MAPS = {
    "web-scraping-dev": {
        "product_name": "title",
        "cost": "price",
        "in_stock": "availability",
    },
    "mock-store": {
        "item_title": "title",
        "sale_price": "price",
        "available": "availability",
    },
}

# value standardization: normalize brand abbreviations
BRAND_EXPANSIONS = {
    "HP": "Hewlett-Packard",
    "MS": "Microsoft",
    "SS": "Stainless Steel",
    "GE": "General Electric",
}

# generalize color variants to base colors
COLOR_MAP = {
    "neon lime": "green",
    "midnight blue": "blue",
    "charcoal": "black",
    "ivory": "white",
    "crimson": "red",
}


def normalize_brand(brand: str) -> str:
    """Expand brand abbreviations to full names."""
    return BRAND_EXPANSIONS.get(brand.upper(), brand)


def normalize_color(color: str) -> str:
    """Map color variants to standard color names."""
    return COLOR_MAP.get(color.lower(), color.lower())


def normalize_availability(raw_value) -> bool:
    """Convert various availability formats to boolean."""
    if isinstance(raw_value, bool):
        return raw_value
    truthy = {"in stock", "available", "yes", "true", "1"}
    return str(raw_value).lower().strip() in truthy
```



Field mapping converts "cost" to `price` and "product\_name" to `title`. Value expansion turns "HP" into "Hewlett-Packard" so brand comparisons work across sources. Color generalization maps "Neon Lime" to "green" so filtering by color produces consistent results.
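One way a mapping like `FIELD_MAPS` might be applied is sketched below. The `apply_field_map` helper and the `CANONICAL_FIELDS` set are hypothetical names introduced here. Keys that do not map to a canonical field are kept under `attributes` rather than dropped.

```python
# canonical names this sketch recognizes (a subset chosen for illustration)
CANONICAL_FIELDS = {"title", "price", "availability", "brand", "category"}


def apply_field_map(raw: dict, field_map: dict) -> dict:
    """Rename source-specific keys to canonical names; keep extras in attributes."""
    record = {"attributes": {}}
    for key, value in raw.items():
        canonical = field_map.get(key, key)
        if canonical in CANONICAL_FIELDS:
            record[canonical] = value
        else:
            # unknown fields are preserved rather than dropped
            record["attributes"][key] = value
    return record


raw_row = {"item_title": "Mug", "sale_price": 9.5, "available": "yes", "glaze": "matte"}
mock_store_map = {"item_title": "title", "sale_price": "price", "available": "availability"}
mapped = apply_field_map(raw_row, mock_store_map)
```

Here `mapped` carries `title`, `price`, and `availability` at the top level, while `glaze` ends up inside `attributes`. The values themselves are still raw, so value standardization runs as a separate step afterwards.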

### Unit Conversion and Currency Normalization

Different sources express measurements in different units. One source lists weight in pounds, another in kilograms. Dimensions might appear as "24x36 inches" or as separate width and height fields in centimeters:

```python
import re


def lbs_to_kg(lbs: float) -> float:
    """Convert pounds to kilograms."""
    return round(lbs * 0.453592, 2)


def inches_to_cm(inches: float) -> float:
    """Convert inches to centimeters."""
    return round(inches * 2.54, 2)


def parse_dimensions(raw: str) -> dict:
    """Parse dimension strings like '24x36 inches' into structured data."""
    values = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", raw)]
    # match inch units as whole tokens (or a double quote) so that words that
    # merely contain "in" do not trigger a conversion; assume cm otherwise
    if re.search(r'(?<![a-z])(?:inches|inch|in)\b|"', raw.lower()):
        values = [inches_to_cm(v) for v in values]
    if len(values) == 3:
        return {"width_cm": values[0], "height_cm": values[1], "depth_cm": values[2]}
    elif len(values) == 2:
        return {"width_cm": values[0], "height_cm": values[1]}
    return {"raw_dimensions": raw}


def normalize_currency(price: float, from_currency: str, to_currency: str = "USD") -> float:
    """Convert price between currencies using a static rate table."""
    rates_to_usd = {
        "USD": 1.0,
        "EUR": 1.08,
        "GBP": 1.27,
        "CAD": 0.74,
        "AUD": 0.65,
    }
    if from_currency == to_currency:
        return price
    # fail loudly on unknown codes instead of silently treating them as USD
    for code in (from_currency, to_currency):
        if code not in rates_to_usd:
            raise ValueError(f"No exchange rate for currency: {code}")
    usd_price = price * rates_to_usd[from_currency]
    return round(usd_price / rates_to_usd[to_currency], 2)
```



The `parse_dimensions` function handles the common pattern where sites embed dimensions as a single string. The currency converter uses a static rate table for simplicity. In production, you'd fetch rates from an exchange rate API.
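Weights often arrive as strings too, such as the `"1.5 lbs"` value stored in `attributes` earlier. The sketch below parses such strings into kilograms. The `parse_weight_kg` helper and its unit table are illustrative assumptions and cover only a few common spellings.

```python
import re
from typing import Optional

# conversion factors to kilograms; unit spellings are illustrative, not exhaustive
KG_PER_UNIT = {
    "kg": 1.0,
    "g": 0.001,
    "lb": 0.453592,
    "lbs": 0.453592,
    "oz": 0.0283495,
}

WEIGHT_RE = re.compile(r"(\d+(?:\.\d+)?)\s*([a-zA-Z]+)")


def parse_weight_kg(raw: str) -> Optional[float]:
    """Parse strings like '1.5 lbs' or '750 g' into kilograms."""
    match = WEIGHT_RE.search(raw)
    if not match:
        return None
    value, unit = float(match.group(1)), match.group(2).lower()
    factor = KG_PER_UNIT.get(unit)
    if factor is None:
        return None  # unknown unit: leave it for manual review
    return round(value * factor, 2)
```

Returning `None` for unknown units keeps bad conversions out of the canonical data. The raw string can stay in `attributes` so nothing is lost.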



### Using LLM-Based Extraction for Mixed-Format Sites

When sources are too varied for per-site selectors, LLM-based extraction can parse product attributes directly from unstructured HTML. It helps most on sites where selectors would be impractical to maintain.

Scrapfly's extraction API takes this approach. Instead of writing and maintaining selectors for each source, you send the page HTML and get structured product data back:

```python
from scrapfly import ScrapflyClient, ScrapeConfig, ExtractionConfig

scrapfly = ScrapflyClient(key="YOUR_SCRAPFLY_KEY")


async def extract_product_with_ai(url: str) -> dict:
    """Extract structured product data using Scrapfly's AI extraction."""
    # first, scrape the page with anti-bot protection handled
    scrape_result = await scrapfly.async_scrape(ScrapeConfig(
        url=url,
        asp=True,  # anti-scraping protection bypass
    ))
    # then extract structured data using AI
    extraction = await scrapfly.async_extraction(ExtractionConfig(
        body=scrape_result.content,
        content_type="text/html",
        extraction_prompt="Extract product data: title, price, currency, brand, availability, description, images",
    ))
    return extraction.data
```



The extraction API works as a fallback for sources where maintaining CSS or XPath selectors is impractical. You can mix selector-based parsing for stable sources with AI extraction for volatile or complex sources in the same pipeline.
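A mixed pipeline of this kind could be wired up roughly as follows. The sketch takes two plain callables, so it makes no assumptions about any particular extraction service. `parse_with_fallback`, `is_complete`, and the `REQUIRED_FIELDS` tuple are hypothetical names.

```python
from typing import Callable, Optional

Parser = Callable[[str], Optional[dict]]
REQUIRED_FIELDS = ("title", "price")  # assumed minimum for a usable record


def is_complete(record: Optional[dict]) -> bool:
    """True when every required field is present and non-empty."""
    if not record:
        return False
    return all(record.get(name) not in (None, "") for name in REQUIRED_FIELDS)


def parse_with_fallback(html: str, primary: Parser, fallback: Parser) -> Optional[dict]:
    """Try the cheap selector parser first; call the fallback only when needed."""
    record = primary(html)
    if is_complete(record):
        return {**record, "parsed_by": "primary"}
    record = fallback(html)
    if is_complete(record):
        return {**record, "parsed_by": "fallback"}
    return None  # neither parser produced a usable record
```

Recording which parser produced each row makes it easy to see later which sources depend on the fallback most often, which is a hint that their selectors need attention.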



## Matching and Deduplicating Products Across Sources

When you scrape the same laptop from three different retailers, you get three separate product records. Product matching links those records so you can compare prices, availability, and attributes for the same physical product across sources.

### Identifier-Based Matching (UPC, GTIN, MPN)

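UPC, EAN, and the broader GTIN family of barcodes are the gold standard for product matching. These identifiers are designed to be globally unique: the UPC for a specific laptop model is the same whether you find the product on Amazon, Walmart, or BestBuy. MPNs are only unique within one manufacturer, so pair them with the brand when matching.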

Many ecommerce sites expose product identifiers in structured data. Look for Schema.org `gtin`, `mpn`, and `sku` fields in JSON-LD blocks, hidden API responses, and product specification tables. When these identifiers are available, matching is straightforward. Group products by their GTIN and you have exact cross-source matches:

```python
from collections import defaultdict


def match_by_identifier(products: list[Product]) -> dict:
    """Group products by UPC/GTIN for exact cross-source matching."""
    matched = defaultdict(list)
    unmatched = []

    for product in products:
        gtin = product.gtin or product.upc
        if gtin:
            matched[gtin].append(product)
        else:
            unmatched.append(product)

    return {
        "matched": dict(matched),
        "unmatched": unmatched,
    }
```



The `match_by_identifier` function separates products with identifiers from products without them. Products with GTINs get exact matches. Products without GTINs go to fuzzy matching. For tips on finding hidden identifiers, see our guide on [scraping hidden web data](https://scrapfly.io/blog/posts/how-to-scrape-hidden-web-data).
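Grouping by identifier only works if every source writes the identifier the same way. A 12-digit UPC and its zero-padded 13- or 14-digit GTIN form refer to the same product but compare as different strings. The sketch below pads codes to 14 digits and rejects codes whose check digit is wrong, using the standard GS1 alternating 3/1 weighting. The `normalize_gtin` helper name is an assumption.

```python
from typing import Optional


def gtin_check_digit(body: str) -> int:
    """Compute the GS1 check digit for the digits preceding it."""
    total = 0
    # weights alternate 3, 1, 3, ... starting from the rightmost body digit
    for position, char in enumerate(reversed(body)):
        weight = 3 if position % 2 == 0 else 1
        total += int(char) * weight
    return (10 - total % 10) % 10


def normalize_gtin(raw: str) -> Optional[str]:
    """Return a zero-padded 14-digit GTIN, or None if the code is invalid."""
    # keep ASCII digits only, so dashes and spaces in the raw value are ignored
    digits = "".join(ch for ch in raw if ch in "0123456789")
    if len(digits) not in (8, 12, 13, 14):
        return None
    if gtin_check_digit(digits[:-1]) != int(digits[-1]):
        return None
    return digits.zfill(14)
```

Applying `normalize_gtin` to both `gtin` and `upc` before grouping means `"036000291452"` and `"0036000291452"` land in the same bucket, and mistyped codes fall through to fuzzy matching instead of creating false groups.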

### Fuzzy Matching When Identifiers Are Missing

Many ecommerce sites don't expose UPC or GTIN codes. When identifiers aren't available, fall back to fuzzy matching on product title, brand, and key attributes. The approach: normalize titles first, then compute similarity scores:

```python
from rapidfuzz import fuzz


def normalize_title(title: str) -> str:
    """Normalize product title for fuzzy matching."""
    title = title.lower().strip()
    # remove common filler words that differ across sources
    filler_words = {"the", "a", "an", "with", "and", "for", "in", "new"}
    tokens = [t for t in title.split() if t not in filler_words]
    return " ".join(tokens)


def fuzzy_match_products(
    unmatched: list[Product], threshold: float = 85.0
) -> list[list[Product]]:
    """Match products by fuzzy title + brand similarity."""
    groups = []
    used = set()

    for i, product_a in enumerate(unmatched):
        if i in used:
            continue
        group = [product_a]
        norm_a = normalize_title(product_a.title)

        for j, product_b in enumerate(unmatched[i + 1:], start=i + 1):
            if j in used:
                continue
            # skip if brands are known and different
            if product_a.brand and product_b.brand:
                if product_a.brand.lower() != product_b.brand.lower():
                    continue

            norm_b = normalize_title(product_b.title)
            score = fuzz.token_sort_ratio(norm_a, norm_b)
            if score >= threshold:
                group.append(product_b)
                used.add(j)

        if len(group) > 1:
            groups.append(group)
        used.add(i)

    return groups
```



The `fuzzy_match_products` function uses token-based similarity from the rapidfuzz library. Token sort ratio handles different word orders across sources (e.g., "Samsung Galaxy S24 Ultra" vs "Galaxy S24 Ultra Samsung"). The brand check filters out false matches early.

Threshold tuning matters. Set the threshold too low (below 70) and you'll get false matches where different products group together. Set the threshold too high (above 95) and you'll miss valid matches where sources use slightly different product names. Start at 85 and adjust based on your data.
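One way to adjust the threshold based on your data is to hand-label a small set of title pairs and measure precision and recall at a few candidate values. The sketch below is an assumption-laden illustration: `evaluate_threshold` is a hypothetical helper, and `token_sort_score` is a standard-library approximation built on `difflib`, not rapidfuzz itself, so its scores will differ somewhat from `fuzz.token_sort_ratio`. Pass your real scorer through the `scorer` parameter when tuning.

```python
from difflib import SequenceMatcher
from typing import Callable


def token_sort_score(a: str, b: str) -> float:
    """Stdlib approximation of a token-sort similarity score on a 0-100 scale."""
    sorted_a = " ".join(sorted(a.lower().split()))
    sorted_b = " ".join(sorted(b.lower().split()))
    return 100.0 * SequenceMatcher(None, sorted_a, sorted_b).ratio()


def evaluate_threshold(
    labeled_pairs: list[tuple[str, str, bool]],
    threshold: float,
    scorer: Callable[[str, str], float] = token_sort_score,
) -> dict:
    """Measure precision and recall of match decisions at one threshold."""
    tp = fp = fn = 0
    for title_a, title_b, same_product in labeled_pairs:
        predicted = scorer(title_a, title_b) >= threshold
        if predicted and same_product:
            tp += 1  # correctly matched
        elif predicted and not same_product:
            fp += 1  # false match: different products grouped together
        elif same_product:
            fn += 1  # missed match
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return {"precision": precision, "recall": recall}
```

Lowering the threshold tends to raise recall and lower precision. Evaluating a handful of values on labeled pairs from your own catalog shows where that tradeoff sits for your data.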



## Handling Anti-Bot Protection Across Many Sites

Product matching only works when your scrapers can reach each source consistently. The biggest obstacle is anti-bot protection, and each ecommerce site uses a different system: one runs Cloudflare, another uses Akamai, a third relies on DataDome, and a fourth has custom detection logic. A multi-source pipeline needs a structured plan, not per-site workarounds that break independently.

The adapter pattern from the scraper architecture section solves half the problem. Each source config can carry its anti-bot requirements: `SourceConfig` already records whether the source needs JavaScript rendering (`requires_js`), the delay between requests (`rate_limit`), and custom headers, and a preferred proxy type could be added as another field. The scraper function should read these requirements and adjust its behavior per source.

Common ecommerce anti-bot systems include:

- **Cloudflare**: Widely used by Shopify stores and mid-size retailers. Requires TLS fingerprint management and sometimes JavaScript challenges.
- **Akamai Bot Manager**: Common on large retailers like Nike and Costco. Uses advanced browser fingerprinting.
- **[DataDome](https://scrapfly.io/blog/posts/how-to-bypass-datadome-anti-scraping)**: Growing presence on fashion and luxury ecommerce sites. Combines device fingerprinting with behavior analysis.
- **[PerimeterX/HUMAN](https://scrapfly.io/blog/posts/how-to-bypass-perimeterx-human-anti-scraping)**: Used by several large marketplaces. Relies on sensor data collection.

Each source config encodes these requirements so the pipeline can apply the right strategy per source:

```python
SOURCES["protected-store"] = SourceConfig(
    name="protected-store",
    base_url="https://web-scraping.dev",
    search_url="https://web-scraping.dev/products",
    selectors={
        "product_list": "div.row.product",
        "title": "h3 a::text",
        "price": "div.price::text",
        "image": "img::attr(src)",
        "link": "h3 a::attr(href)",
    },
    requires_js=True,
    rate_limit=3.0,
    headers={"Accept-Language": "en-US,en;q=0.9"},
)
```



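The config sets `requires_js=True` for sites that need browser rendering and a slower `rate_limit` to avoid triggering bot detection. The minimal `scrape_source` function shown earlier does not act on these two fields yet. A production version would read them to choose between a plain HTTP client and a headless browser, and to pace its requests for each source.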
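Honoring a per-source `rate_limit` might look like the sketch below. `RateLimiter` and `fetch_pages` are hypothetical helpers, and `fetch` is any async callable you supply (for example a thin wrapper around an HTTP client), so nothing here depends on a specific library.

```python
import asyncio
import time
from typing import Optional


class RateLimiter:
    """Enforce a minimum delay between consecutive requests to one source."""

    def __init__(self, min_interval: float):
        self.min_interval = min_interval  # seconds, e.g. a source's rate_limit
        self._last_request: Optional[float] = None

    async def wait(self) -> None:
        """Sleep just long enough to keep requests min_interval apart."""
        if self._last_request is not None:
            elapsed = time.monotonic() - self._last_request
            if elapsed < self.min_interval:
                await asyncio.sleep(self.min_interval - elapsed)
        self._last_request = time.monotonic()


async def fetch_pages(urls: list[str], limiter: RateLimiter, fetch) -> list:
    """Fetch URLs one by one, pausing between requests as the limiter dictates."""
    pages = []
    for url in urls:
        await limiter.wait()
        pages.append(await fetch(url))
    return pages
```

Creating one limiter per source, with `RateLimiter(config.rate_limit)`, keeps a slow source from throttling the others.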

The tradeoff between self-managed and managed approaches matters when scaling to many sources:

| Approach | Pros | Cons | Best for |
|---|---|---|---|
| Self-managed proxies and browsers | Full control, no vendor dependency | High ops overhead, per-site tuning | Teams with DevOps capacity |
| Managed scraping APIs | Low ops, adapts to protection changes | Per-request cost | Production pipelines at scale |

For a self-managed setup, you'll need to maintain proxy pools, browser automation, and per-site fingerprint configurations. Our guide on [avoiding blocking with headers](https://scrapfly.io/blog/posts/how-to-avoid-web-scraping-blocking-headers) covers the foundational techniques.

Scrapfly's Anti-Scraping Protection (ASP) takes the managed approach. One API call handles whichever protection the target site uses. The ASP adapts to changes automatically, so your pipeline doesn't break when a site upgrades its anti-bot system. With dozens of targets, ASP reduces the surface area from N anti-bot configs to one API call.



## Scaling from 3 Sources to 30

With the schema, adapters, normalization, matching, and anti-bot layers in place, the pipeline works for a handful of sources. Going from a handful to dozens brings a new problem: keeping everything running.

A pipeline with 3 sources feels manageable. You can check each source manually. At 30 sources, any site can break at any time. You need to know which one broke and why without checking each source by hand.

### Error Handling and Graceful Degradation

When one source fails, the pipeline should continue scraping the other sources. A single broken source shouldn't crash the entire run. Wrap each source's scrape operation in error handling that logs failures and continues:

```python
import logging
from datetime import datetime, timezone

import httpx

logger = logging.getLogger("product_pipeline")


async def scrape_all_with_resilience(sources: dict[str, SourceConfig]) -> dict:
    """Scrape all sources with per-source error isolation."""
    results = {
        "products": [],
        "failures": [],
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

    for source_id, config in sources.items():
        try:
            products = await scrape_source(config)
            results["products"].extend(products)
            logger.info(f"Scraped {len(products)} products from {config.name}")
        except httpx.HTTPStatusError as e:
            logger.error(f"HTTP error scraping {config.name}: {e.response.status_code}")
            results["failures"].append({
                "source": config.name,
                "error": f"HTTP {e.response.status_code}",
                "timestamp": datetime.now(timezone.utc).isoformat(),
            })
        except Exception as e:
            logger.error(f"Unexpected error scraping {config.name}: {e}")
            results["failures"].append({
                "source": config.name,
                "error": str(e),
                "timestamp": datetime.now(timezone.utc).isoformat(),
            })

    return results
```



The function isolates failures per source. When a source returns an HTTP 403 or times out, the pipeline logs the failure and moves to the next source. After the run completes, you can review the failures list and decide which sources need attention.
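With dozens of sources, running them one after another can make a full run slow. As a variation on the same isolation idea, the sketch below runs sources concurrently with `asyncio.gather(..., return_exceptions=True)` and a semaphore that caps concurrency. `scrape_concurrently` is a hypothetical helper that takes a dict of zero-argument async callables, and the default cap of 5 is an arbitrary assumption.

```python
import asyncio


async def scrape_concurrently(scrapers: dict, max_concurrent: int = 5) -> dict:
    """Run per-source coroutines concurrently; one failure never cancels the rest."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run_one(scrape):
        # the semaphore limits how many sources are scraped at the same time
        async with semaphore:
            return await scrape()

    source_ids = list(scrapers)
    outcomes = await asyncio.gather(
        *(run_one(scrapers[source_id]) for source_id in source_ids),
        return_exceptions=True,  # exceptions come back as values, not raised
    )

    results = {"products": [], "failures": []}
    for source_id, outcome in zip(source_ids, outcomes):
        if isinstance(outcome, Exception):
            results["failures"].append({"source": source_id, "error": str(outcome)})
        else:
            results["products"].extend(outcome)
    return results
```

Each entry in `scrapers` could be built with `functools.partial(scrape_source, config)`, so the existing scraper function is reused unchanged.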

### Monitoring Data Quality and Freshness

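Scraping data is only useful if the data is accurate and current. Track three signals per source to catch problems early: sudden drops in product count, sources that return nothing, and records older than your freshness window:

```python
from datetime import datetime, timedelta, timezone


def check_data_quality(
    current_products: list[Product],
    previous_counts: dict[str, int],
    staleness_hours: int = 24,
) -> list[str]:
    """Check per-source data quality and return warnings."""
    warnings = []
    source_counts = {}
    newest_scrape = {}

    for product in current_products:
        source_counts[product.source] = source_counts.get(product.source, 0) + 1
        # track the most recent scrape time per source for the freshness check
        seen = newest_scrape.get(product.source)
        if seen is None or product.scraped_at > seen:
            newest_scrape[product.source] = product.scraped_at

    cutoff = datetime.now(timezone.utc) - timedelta(hours=staleness_hours)
    for source, count in source_counts.items():
        prev = previous_counts.get(source, count)
        # detect sudden drops in product count (possible layout change)
        if prev > 0 and count < prev * 0.5:
            warnings.append(
                f"Product count for {source} dropped from {prev} to {count}. "
                f"Possible layout change or scraper failure."
            )
        # detect stale data (newest record older than the freshness window)
        if newest_scrape[source] < cutoff:
            warnings.append(
                f"Newest record for {source} is older than {staleness_hours} hours."
            )

    # detect missing sources (possible silent failure)
    for source in previous_counts:
        if source not in source_counts:
            warnings.append(
                f"Source {source} returned no products. Check if the scraper is working."
            )

    return warnings
```



The quality check function compares current results against previous runs. A drop of more than 50% in product count suggests a layout change or blocked scraper. A missing source means the scraper failed silently. A newest record older than `staleness_hours` means the source has not been refreshed recently. Run this check after every pipeline execution and route warnings to your alerting system.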
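Product counts can stay stable while individual fields quietly go empty, for example when a price selector stops matching. A complementary check, sketched below under the assumption that records are available as plain dicts (for instance via `dataclasses.asdict`), measures how often each field is filled per source. `field_fill_rates` is a hypothetical helper.

```python
from collections import defaultdict


def field_fill_rates(rows: list[dict], fields: tuple = ("title", "price", "brand")) -> dict:
    """Return, per source, the share of rows with a non-empty value for each field."""
    totals = defaultdict(int)
    filled = defaultdict(lambda: defaultdict(int))
    for row in rows:
        source = row["source"]
        totals[source] += 1
        for name in fields:
            # treat 0 as missing, since a scraper may fall back to 0.0
            # when no price could be parsed
            if row.get(name) not in (None, "", 0, 0.0):
                filled[source][name] += 1
    return {
        source: {name: filled[source][name] / total for name in fields}
        for source, total in totals.items()
    }
```

Comparing these rates against the previous run highlights a source whose `price` fill rate falls from near 1.0 to near 0.0, even though its product count has not changed.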

For scheduling, a cron job or task queue runs the pipeline at regular intervals. Our [Crawler API monitoring guide](https://scrapfly.io/blog/posts/competitor-price-monitoring-with-crawler-api) covers scheduling patterns in detail.

## Powering Multi-Source Pipelines with Scrapfly



Scrapfly provides web scraping, screenshot, and extraction APIs for data collection at scale. For multi-source product pipelines, the first two features below matter most:

- **Anti-bot bypass ([ASP](https://scrapfly.io/docs/scrape-api/anti-scraping-protection))**: One API call handles Cloudflare, Akamai, DataDome, and custom protections. Your adapter pattern sends requests through Scrapfly instead of maintaining separate bypass strategies per source.
- **[Extraction API](https://scrapfly.io/docs/extraction-api/getting-started)**: Send any product page URL and get structured data back. No selectors to write or maintain. Adding a new source becomes adding a URL to your registry.
- **Rotating proxies**: A self-healing pool of residential and datacenter proxies across 100+ countries. Scrapfly picks the right proxy type per request.
- **JavaScript rendering**: Built-in headless browsers for sites that load product data with JavaScript. No need to run your own browser farm.
- **Python and TypeScript SDKs**: Drop-in clients for both languages, plus a [Scrapy middleware](https://scrapfly.io/docs/sdk/scrapy) for teams already using Scrapy.

Together, ASP and the extraction API turn the hardest parts of multi-source scraping into API calls. Your code focuses on schema design, normalization, and product matching. Scrapfly handles the scraping layer.
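
As a rough sketch of what routing an adapter through the scraping layer could look like, the helper below builds a request URL for Scrapfly's scrape endpoint using only the standard library. The `build_scrape_url` function is hypothetical. The query parameters (`key`, `url`, `asp`, `render_js`, `country`) reflect the scrape API as we understand it, so treat them as assumptions and confirm them against the current docs before relying on them.

```python
from typing import Optional
from urllib.parse import urlencode

SCRAPFLY_ENDPOINT = "https://api.scrapfly.io/scrape"


def build_scrape_url(
    api_key: str,
    target_url: str,
    render_js: bool = False,
    country: Optional[str] = None,
) -> str:
    """Build a scrape request URL with anti-bot bypass (asp) enabled.

    Parameter names are assumptions based on the public API docs.
    """
    params = {"key": api_key, "url": target_url, "asp": "true"}
    if render_js:
        # only pay for a headless browser on sources that need it
        params["render_js"] = "true"
    if country:
        params["country"] = country
    return f"{SCRAPFLY_ENDPOINT}?{urlencode(params)}"


# Placeholder key and product URL, for illustration only.
request_url = build_scrape_url(
    "YOUR_API_KEY",
    "https://example.com/product/123",
    render_js=True,
    country="us",
)
print(request_url)
```

Each adapter would then fetch `request_url` with its usual HTTP client instead of hitting the target site directly, so per-source settings such as `render_js` can live in the same source config as the selectors.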






## FAQ

**Should I use Scrapy or httpx for a multi-source pipeline?**

Either works well. httpx with parsel is simpler for source-at-a-time scraping, where you control the concurrency. [Scrapy](https://scrapfly.io/blog/posts/web-scraping-with-scrapy) is better if you need built-in scheduling, retries, and middleware. The architecture patterns in this article (canonical schema, adapter pattern, normalization) work with both.







**How often should I re-scrape product data?**

The right frequency depends on data volatility. Prices on major retailers can change hourly. Product specifications and descriptions rarely change. Start with daily scrapes and increase frequency for price-sensitive data. Decrease frequency for stable attributes like product descriptions and images.







**Is it legal to scrape product data from ecommerce websites?**

Scraping publicly available product data is generally legal. Avoid scraping content behind a login, collecting personally identifiable information, and violating explicit terms of service. Consult a lawyer for your specific region and use case.









## Summary

Multi-source product scraping is a systems architecture problem. The canonical schema gives every source a common language. The config-driven adapter pattern makes adding new sources routine instead of a rewrite. Normalization turns raw scraped data into consistent, comparable values through field mapping, value expansion, unit conversion, and string wrangling. Product matching links records across sources using identifiers when available and fuzzy matching when identifiers are missing.

The pipeline scales when you invest in error isolation and data quality monitoring rather than scraper performance alone. Each source breaks independently, and your monitoring should detect that failure before stale data reaches downstream systems.

For teams that want managed anti-bot protection and LLM-based extraction, Scrapfly handles the scraping layer. ASP adapts to each site's protection. The extraction API returns structured product data without per-site selectors. [Try Scrapfly for free](https://scrapfly.io). Self-managed approaches with open-source tools remain viable for teams with an existing setup.



**Legal Disclaimer and Precautions**

This tutorial covers popular web scraping techniques for education. Interacting with public servers requires diligence and respect:

- Do not scrape at rates that could damage the website.
- Do not scrape data that's not available publicly.
- Do not store PII of EU citizens protected by GDPR.
- Do not repurpose *entire* public datasets, which can be illegal in some countries.

Scrapfly does not offer legal advice, but these are good general rules to follow. For more detail, consult a lawyer.

 

