Data on the public web often comes in unpredictable formats, types and structures. To ensure data quality when web scraping, we cannot simply trust that our code will understand any scraped web page without issues, so various validation tools can be used to test and validate scraper parsing logic.
In this tutorial, we'll take a look at how we can use Python tools such as Cerberus to validate and test scraped data, and tools like Pydantic to enforce data types and even normalize data values.
To illustrate common data validation challenges, we'll be scraping Glassdoor.com for company overview data.
Key Takeaways
Ensure scraped data quality using Python validation tools like Cerberus and Pydantic to validate, normalize, and transform data into reliable datasets.
- Data quality is unpredictable in web scraping - websites can change formats without notice
- Validation tools are essential - use Cerberus for schema-based validation or Pydantic for type-based validation
- Cerberus offers simplicity - easy to set up, less strict, ideal for smaller projects with flexible validation needs
- Pydantic provides power - integrates with Python type hints, offers automatic data conversion, and IDE support
- Data normalization matters - convert strings to proper data types (dates, numbers, booleans) for better analysis
- Error handling is crucial - implement proper validation with clear error messages and fallback strategies
Data Quality Challenges
When web scraping we're working with a public resource that we have no control over, so we cannot trust the format and quality of the collected data. In other words, we can never be sure when a website will change something like a date format - from 2022-11-26 to 2022/11/26 - or use different number formatting - from 10 000 to 10,000.
To approach data quality in web scraping we can take advantage of several tools to test, validate and even transform scraped data.
For example, we can write a validation function to check that all date fields follow the <4-digit-number>-<2-digit-number>-<2-digit-number> pattern, or even automatically correct them to a standard value if possible.
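Here's a minimal sketch of what such a check-and-correct helper could look like; the normalize_date name and the list of alternative formats are illustrative assumptions, not a prescribed solution:

import re
from datetime import datetime

def normalize_date(value: str) -> str:
    # validate a scraped date string and normalize it to YYYY-MM-DD if possible
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
        return value
    # try a few known alternative formats and correct them
    for fmt in ("%Y/%m/%d", "%d.%m.%Y", "%m/%d/%Y"):
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {value}")

print(normalize_date("2022-11-26"))  # '2022-11-26'
print(normalize_date("2022/11/26"))  # '2022-11-26'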
Setup
In this article, we'll be using a few Python packages:
- cerberus for schema-based data validation.
- pydantic for type-hint-based data validation and normalization.
We'll also write a short example scraper using popular web scraping packages:
- httpx as an HTTP client to collect public HTML pages.
- parsel as an HTML parsing library to extract details from HTML pages.
We can install all of these packages using the pip console command:
$ pip install cerberus pydantic httpx parsel
Soft Validation with Cerberus
To start, let's take a look at Cerberus - a popular data validation library in Python - and how we can use it with web-scraped data.
To start using Cerberus all we have to do is define some rules and apply them to our data.
Defining Schema
To validate data Cerberus needs a data schema which contains all of the validation rules and structure expectations. For example:
from cerberus import Validator

schema = {
    "name": {
        # name should be a string
        "type": "string",
        # between 3 and 20 characters
        "minlength": 3,
        "maxlength": 20,
        "required": True,
    },
}
v = Validator(schema)
# this will pass
v.validate({"name": "Homer"})
print(v.errors)
# {}
# this will not
v.validate({"name": "H"})
print(v.errors)
# {'name': ['min length is 3']}
Cerberus comes with a lot of predefined rules like minlength, maxlength and many others which can be found in the validation rules docs. Here are some commonly used ones (a combined example follows this list):
- allowed - validates the value against a list of allowed values.
- contains - validates that the value contains some other value.
- required - ensures that the value is present. For example, fields like id are often required in scraping.
- dependencies - ensures that related fields are present. For example, if a product's discounted price is present, the full price should be present as well.
- regex - checks that the value matches the specified regular expression pattern. This is very useful for fields like phone numbers or emails.
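As a quick sketch of how a few of these rules might be combined in one schema (the field names and the regular expression are illustrative only):

from cerberus import Validator

schema = {
    # id must always be present
    "id": {"type": "string", "required": True},
    # currency must be one of the allowed values
    "currency": {"type": "string", "allowed": ["USD", "GBP", "EUR"]},
    # email must match a simple regular expression pattern
    "email": {"type": "string", "regex": r"[^@]+@[^@]+\.[^@]+"},
    # a discounted price only makes sense if the full price is present too
    "discounted_price": {"type": "float", "dependencies": "price"},
    "price": {"type": "float"},
}

v = Validator(schema)
v.validate({"id": "123", "currency": "USD", "discounted_price": 9.99})
print(v.errors)
# reports that 'discounted_price' depends on the missing 'price' field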
However, to truly take advantage of this we can define our own rules!
Creating Validation Rules
Web-scraped data can be quite complex. For example, a name field alone can come in many shapes and forms.
Using Cerberus we can provide our own validation function in Python that can validate data values for complex issues.
from cerberus import Validator

def validate_name(field, value, error):
    if "." in value:
        error(field, f"contains a dot character: {value}")
    if value.lower() in ["classified", "redacted", "missing"]:
        error(field, f"redacted value: {value}")
    if "<" in value.lower() and ">" in value.lower():
        error(field, f"contains html nodes: {value}")

schema = {
    "name": {
        # name should be a string
        "type": "string",
        # between 2 and 20 characters
        "minlength": 2,
        "maxlength": 20,
        # extra validation
        "check_with": validate_name,
    },
}
v = Validator(schema)
v.validate({"name": "H."})
print(v.errors)
# {'name': ['contains a dot character: H.']}
v.validate({"name": "Classified"})
print(v.errors)
# {'name': ['redacted value: Classified']}
v.validate({"name": "<a>Homer</a>"})
print(v.errors)
# {'name': ['contains html nodes: <a>Homer</a>']}
In the example above, we added our own validation function for the name field, which does more advanced string validation like checking for invalid values and potential HTML parsing errors.
Real Life Example
Let's put together a quick example web scraper that we'll validate using Cerberus. We'll be scraping company overview data on Glassdoor.com:

As an example, let's scrape eBay's Glassdoor profile:
import httpx
from parsel import Selector

response = httpx.get("https://www.glassdoor.com/Overview/Working-at-eBay-EI_IE7853.11,15.htm")
selector = Selector(response.text)
overview_rows = selector.css('[data-test="employerOverviewModule"]>ul>li>div')
data = {}
for row in overview_rows:
    label = row.xpath("@data-test").get().split('-', 1)[-1]
    value = row.xpath("text()").get()
    data[label] = value
print(data)
This short scraper will scrape the basic company overview details:
{
    "headquarters": "San Jose, CA",
    "size": "10000+ Employees",
    "founded": "1995",
    "type": "Company - Public (EBAY)",
    "revenue": "$10+ billion (USD)"
}
We can already see how vague the raw fields on this page are. Bigger companies will have revenue in the billions, different regions will use different currencies, and so on. The data values are human-readable but not normalized for machine interpretation.
Let's parse this data out to something more general and validate it using Cerberus:
from cerberus import Validator

def parse_and_validate(data):
    schema = {}
    parsed = {}
    # company size is integer between 1 employee and several thousand
    parsed["size"] = int(data["size"].split()[0].strip("+"))
    schema["size"] = {"type": "integer", "min": 1, "max": 20_000}
    # founded date is a realistic year number
    parsed["founded"] = int(data["founded"])
    schema["founded"] = {"type": "integer", "min": 1900, "max": 2022}
    # headquarter details consist of city and state/province/country
    hq_details = data["headquarters"].split(", ")
    parsed["hq_city"] = hq_details[0] if hq_details else None
    parsed["hq_state"] = hq_details[1] if len(hq_details) > 1 else None
    schema["hq_city"] = {"type": "string", "minlength": 2, "maxlength": 20}
    schema["hq_state"] = {"type": "string", "minlength": 2, "maxlength": 2}
    # let's presume we only want to ensure we're scraping US and GB companies:
    parsed["revenue_currency"] = data["revenue"].split("(")[-1].strip("()")
    schema["revenue_currency"] = {"type": "string", "allowed": ["USD", "GBP"]}
    validator = Validator(schema)
    if not validator.validate(parsed):
        print("failed to validate parsed data:")
        for key, error in validator.errors.items():
            print(f"{key}={parsed[key]} got error: {error}")
    return parsed

ebay_data = {
    "headquarters": "San Jose, CA",
    "size": "10000+ Employees",
    "founded": "1995",
    "type": "Company - Public (EBAY)",
    "revenue": "$10+ billion (USD)",
}
print(parse_and_validate(ebay_data))
# will print:
{
    "size": 10000,
    "founded": 1995,
    "hq_city": "San Jose",
    "hq_state": "CA",
    "revenue_currency": "USD"
}
Above, we wrote a parser that converts raw company overview data into something more standard and concrete. As we can see, it works great with our eBay profile example, and we wrote validation to ensure that our parser delivers a consistent quality of data. However, with a sample size of one we can't say that our validator is doing its job well.
Let's confirm our validation schema by expanding our test sample with another company page - Tesco:
tesco_data = {
    "headquarters": "Welwyn Garden City, United Kingdom",
    "size": "10000+ Employees",
    "founded": "1919",
    "type": "Company - Private",
    "revenue": "Unknown / Non-Applicable"
}
print(parse_and_validate(tesco_data))
# will print:
# failed to validate parsed data:
# hq_state=United Kingdom got error: ['max length is 2']
# revenue_currency=Unknown / Non-Applicable got error: ['unallowed value Unknown / Non-Applicable']
{
    "size": 10000,
    "founded": 1919,
    "hq_city": "Welwyn Garden City",
    "hq_state": "United Kingdom",
    "revenue_currency": "Unknown / Non-Applicable"
}
As we can see, our validator indicates that our parser is not doing as good a job with Tesco's page as it did with eBay's.
Using Cerberus we can quickly define how our desired output should look, which helps us develop and maintain complex data parsing operations and, in turn, ensures consistent data quality.
Cerberus offers more features like value defaults, normalization rules and powerful extension support.
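For instance, here's a small sketch of value defaults and coercion via Cerberus' normalization support (the field names mirror our Glassdoor example, but the schema is only illustrative):

from cerberus import Validator

schema = {
    # coerce runs a callable on the incoming value before validation
    "size": {"type": "integer", "coerce": int},
    # default fills the value in when the field is missing entirely
    "revenue_currency": {"type": "string", "default": "USD"},
}

v = Validator(schema)
print(v.normalized({"size": "10000"}))
# {'size': 10000, 'revenue_currency': 'USD'}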
Next, let's take a look at a different validation technique - strict data types.
Typed Validation with Pydantic
Our Cerberus validation let us know about data parsing inconsistencies, but if we're building a reliable data API we might need something more strict.
Pydantic allows defining strict types and validation rules using Python's type hint system. Unlike Cerberus, Pydantic is much more strict and raises errors when encountering invalid data values. Let's take a look at our Glassdoor example through the lens of Pydantic. Validation in Pydantic is done through explicit BaseModel objects and Python type hints.
For example, our Glassdoor company overview validator would look something like this:
from typing import Optional
from pydantic import BaseModel, validator

# to validate data we must create a Pydantic Model:
class Company(BaseModel):
    # define allowed field names and types:
    size: int
    founded: int
    revenue_currency: str
    hq_city: str
    # some fields can be optional (i.e. have value of None)
    hq_state: Optional[str]

    # then we can define any extra validation functions:
    @validator("size")
    def must_be_reasonable_size(cls, v):
        if not (0 < v < 20_000):
            raise ValueError(f"unreasonable company size: {v}")
        return v

    @validator("founded")
    def must_be_reasonable_year(cls, v):
        if not (1900 < v < 2022):
            raise ValueError(f"unreasonable founding date: {v}")
        return v

    @validator("hq_state")
    def looks_like_state(cls, v):
        if len(v) != 2:
            raise ValueError(f'state should be 2 characters long, got "{v}"')
        return v


def parse_and_validate(data):
    parsed = {}
    parsed["size"] = data["size"].split()[0].strip("+")
    parsed["founded"] = data["founded"]
    hq_details = data["headquarters"].split(", ")
    parsed["hq_city"] = hq_details[0] if hq_details else None
    parsed["hq_state"] = hq_details[1] if len(hq_details) > 1 else None
    parsed["revenue_currency"] = data["revenue"].split("(")[-1].strip("()")
    return Company(**parsed)


ebay_data = {
    "headquarters": "San Jose, CA",
    "size": "10000+ Employees",
    "founded": "1995",
    "type": "Company - Public (EBAY)",
    "revenue": "$10+ billion (USD)",
}
print(parse_and_validate(ebay_data))

tesco_data = {
    "headquarters": "Welwyn Garden City, United Kingdom",
    "size": "10000+ Employees",
    "founded": "1919",
    "type": "Company - Private",
    "revenue": "Unknown / Non-Applicable",
}
print(parse_and_validate(tesco_data))
Here, we've converted our parser and validator to use Pydantic models to validate parsed data. Our eBay data will be validated successfully, while the Tesco data will raise a validation error:
pydantic.error_wrappers.ValidationError: 1 validation error for Company
hq_state
  state should be 2 characters long, got "United Kingdom" (type=value_error)
Pydantic is very strict and will raise exceptions whenever it fails to validate or interpret data. This is a great way to ensure web-scraped data quality, though it comes with a higher setup overhead than our Cerberus example.
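If a hard failure is too disruptive for a long scrape run, one option is to catch the ValidationError and log it instead. Here's a rough sketch reusing the parse_and_validate function and tesco_data from above (the safe_parse wrapper is hypothetical):

from pydantic import ValidationError

def safe_parse(raw: dict):
    # turn hard Pydantic failures into logged warnings so that one
    # bad page does not interrupt the entire scrape run
    try:
        return parse_and_validate(raw)
    except ValidationError as e:
        print(f"skipping invalid record:\n{e}")
        return None

print(safe_parse(tesco_data))
# prints the validation error details, then None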
Transforming and Interpreting
Another advantage of Pydantic is the ability to transform and cast data types from strings. For example, for our hq_state field we can ensure that all incoming values are lowercase:
from pydantic import BaseModel, validator

class Company(BaseModel):
    # other fields and validators stay the same as before
    hq_state: str

    @validator("hq_state")
    def looks_like_state(cls, v):
        if len(v) != 2:
            raise ValueError(f'state should be 2 characters long, got "{v}"')
        return v.lower()
Above, our validator checks whether the hq_state field is 2 characters long and converts it to lowercase, standardizing our scraped dataset.
Pydantic also automatically casts common data types from string values, making our parsing process much more streamlined and easier to follow:
from datetime import date
from pydantic import BaseModel

class Company(BaseModel):
    founded: date

print(Company(founded="1994-11-24"))
# will print:
# founded=datetime.date(1994, 11, 24)
As you can see, Pydantic automatically interpreted the string date as a Python date object.
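The same applies to other common types; here's a brief sketch (the Example model and its fields are made up for illustration):

from pydantic import BaseModel

class Example(BaseModel):
    size: int
    rating: float
    is_public: bool

# string values are cast to the declared types automatically
print(Example(size="10000", rating="4.5", is_public="true"))
# size=10000 rating=4.5 is_public=True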
Cerberus or Pydantic?
We've explored both of these popular data validation packages and even though they are quite similar there are a few key differences.
Primarily, Pydantic integrates with Python's type hint ecosystem whereas Cerberus uses a more generic schema approach. So, while Pydantic can be a bit more complicated, it provides major benefits like code completion in IDEs and automatic data conversion. On the other hand, Cerberus is a bit easier to understand and less strict, making it ideal for smaller web scraping projects.
Scraped Data Validation Summary
In this article, we've taken a look at two popular ways to ensure web-scraped data quality:
- Cerberus - a schema-based validator that is easy to set up and configure.
- Pydantic - a type-based validator that not only validates data but can easily normalize it to standard Python data types like date and time objects.
No matter which approach you choose to go with both are very powerful tools when it comes to data quality in web scraping.
FAQs
Should I use Cerberus or Pydantic for web scraping data validation?
Use Cerberus for flexible schema-based validation with easy setup and configuration. Use Pydantic for strict type-based validation with automatic data conversion and IDE support. Cerberus is better for smaller projects, while Pydantic excels for complex APIs and data pipelines.
How do I handle validation errors when scraping thousands of pages?
Implement error logging to track validation failures, use try/except blocks around validation logic, create fallback parsing strategies for common data variations, batch process pages to isolate problematic data, and consider using soft validation (warnings) instead of hard failures for non-critical fields.
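As a rough sketch of this soft-validation approach using Cerberus (the validate_batch helper is hypothetical):

from cerberus import Validator

def validate_batch(records, schema):
    # soft validation over a batch of scraped records: collect
    # failures for later review instead of stopping the run
    validator = Validator(schema)
    valid, failed = [], []
    for record in records:
        if validator.validate(record):
            valid.append(record)
        else:
            failed.append({"record": record, "errors": validator.errors})
    return valid, failed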
What are the most common data quality issues in web scraping?
Missing or null values, inconsistent date formats, varying number formats (commas, spaces, currencies), HTML entities in text content, encoding issues, duplicate records, incomplete data due to dynamic loading, and structural changes in website layouts that break selectors.
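A few of these issues can be handled with small normalization helpers; here's an illustrative sketch (the helper names are made up):

import html
import re

def clean_number(value: str) -> float:
    # normalize scraped number strings like "10 000" or "10,000"
    return float(re.sub(r"[^\d.]", "", value))

def clean_text(value: str) -> str:
    # decode HTML entities and collapse stray whitespace
    return re.sub(r"\s+", " ", html.unescape(value)).strip()

print(clean_number("10,000"))           # 10000.0
print(clean_text("Fish &amp;  Chips"))  # Fish & Chips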
Does data validation slow down my web scraping performance?
Yes, validation adds processing overhead, but the benefits outweigh the costs. Use async validation for large datasets, implement caching for repeated validations, validate only essential fields for high-volume scraping, and consider sampling validation for non-critical data to balance performance and quality.
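For example, sampling validation could look something like this sketch, here written for a Cerberus Validator (the maybe_validate helper and the sample_rate value are arbitrary assumptions):

import random

def maybe_validate(record, validator, sample_rate=0.1):
    # validate only a random sample of records on high-volume scrapes
    if random.random() < sample_rate and not validator.validate(record):
        print(f"validation sample failed: {validator.errors}")
    return record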
How do I validate dynamic or inconsistent HTML structures across different websites?
Use flexible selectors with multiple fallback options, implement content-based validation (check for expected data patterns), use machine learning for adaptive parsing, create website-specific validation rules, and implement progressive parsing that tries multiple extraction methods until one succeeds.
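For instance, a fallback-selector helper with parsel might look like this sketch (the selectors themselves are placeholders):

from parsel import Selector

def extract_with_fallbacks(selector: Selector, css_options):
    # try several CSS selectors until one returns a value
    for css in css_options:
        value = selector.css(css).get()
        if value:
            return value
    return None

sel = Selector('<div class="company-name">eBay</div>')
print(extract_with_fallbacks(sel, ['[data-test="employer-name"]::text', ".company-name::text"]))
# eBay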