Data Processing, Output & Validation

The last step of every web scraping project is final data processing. This often involves data validation, cleanup and storage.

As scrapers deal with data from unknown sources, data processing can be a surprisingly complex challenge. For long-term scraping, data validation can be crucial for scraper maintenance. Using tools to ensure scraped result quality can prevent scrapers from silently breaking or performing sub-optimally.

Data Formats

Web scraping datasets can vary greatly from small predictable structures to large, complex data graphs.

Most commonly the CSV and JSON formats are used.

CSV

CSV is great for flat datasets with a consistent structure. This format can be directly imported into spreadsheet software (Excel, Google Sheets, etc.) and doesn't require compression as it is already very compact.

Here's a short example scraper that stores data to CSV:
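The original example isn't shown here; a minimal sketch using Python's standard csv module, with hypothetical product data standing in for real scraped results, might look like this:

```python
import csv

# hypothetical scraped results - flat, consistently structured dictionaries
products = [
    {"name": "Apple iPhone", "price": 999.0},
    {"name": "Samsung Galaxy", "price": 899.0},
]

with open("products.csv", "w", newline="", encoding="utf-8") as file:
    writer = csv.DictWriter(file, fieldnames=["name", "price"])
    writer.writeheader()          # first row: the column names
    writer.writerows(products)    # one row per scraped item
```

Note that `csv.DictWriter` handles separator escaping automatically, so values containing commas are quoted for us.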

Some things to note when working with CSV:

  • The separator characters (default ,) have to be escaped
  • CSV is a flat structure so nested datasets have to be flattened
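To illustrate the second point, nested fields can be flattened into dot-separated column names before writing. A small recursive helper (a sketch, not a library function) could look like this:

```python
def flatten(data, parent_key="", sep="."):
    """Flatten nested dictionaries into dot-separated keys usable as CSV columns."""
    items = {}
    for key, value in data.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # recurse into nested dictionaries, prefixing keys with the parent key
            items.update(flatten(value, new_key, sep=sep))
        else:
            items[new_key] = value
    return items

row = flatten({"name": "Apple iPhone", "price": {"amount": 999.0, "currency": "USD"}})
print(row)  # {'name': 'Apple iPhone', 'price.amount': 999.0, 'price.currency': 'USD'}
```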

JSON

JSON is great for complex structures as it allows easy nesting and key-to-value structuring. However, JSON datasets can take up a lot of space and require compression for best storage efficiency.

Here's a short example scraper that stores data in JSON:
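The original example isn't reproduced here; a minimal sketch using Python's standard json module, with hypothetical nested product data, might look like this:

```python
import json

# hypothetical scraped results - nesting is no problem for JSON
products = [
    {"name": "Apple iPhone", "price": 999.0, "specs": {"storage": "128GB"}},
    {"name": "Samsung Galaxy", "price": 899.0, "specs": {"storage": "256GB"}},
]

with open("products.json", "w", encoding="utf-8") as file:
    # ensure_ascii=False keeps unicode characters readable instead of \uXXXX escapes
    json.dump(products, file, indent=2, ensure_ascii=False)
```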

Some things to note when working with JSON:

  • The quote (") characters have to be escaped
  • Unicode support is often not enabled by default
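Both points above can be demonstrated with Python's json module: quote escaping is handled automatically, but unicode characters are escaped to ASCII unless explicitly told otherwise:

```python
import json

item = {"name": 'Sony "Alpha" α7'}

# quotes are escaped automatically; unicode becomes \u escapes by default
print(json.dumps(item))
# passing ensure_ascii=False keeps unicode characters as-is
print(json.dumps(item, ensure_ascii=False))
```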

JSONL

JSONL (JSON Lines) is a particularly popular JSON variant in web scraping where each line of the dataset is a separate JSON object. This structure allows easy result streaming, which makes scrapers simpler to work with.

Here's an example of a simple JSON Lines scraper:
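The original example isn't shown here; a minimal sketch that appends each result as its own JSON line as soon as it's scraped (with a hypothetical `scrape_products` generator standing in for a real scraper) might look like this:

```python
import json

# hypothetical scraper - yields results one at a time as they're scraped
def scrape_products():
    yield {"name": "Apple iPhone", "price": 999.0}
    yield {"name": "Samsung Galaxy", "price": 899.0}

for product in scrape_products():
    # append mode: the file is opened per result and a single JSON line is added,
    # so partial results survive even if the scraper crashes mid-run
    with open("products.jsonl", "a", encoding="utf-8") as file:
        file.write(json.dumps(product) + "\n")
```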

Above, we open our output file on each iteration and append a new line. This enables easy data streaming and makes large data flows much easier to handle.

Spreadsheets

Spreadsheets are a natural fit for web scraping as they are designed to handle dynamic data, can be streamed (appending rows) and can be easy to work with. CSV output is already compatible with spreadsheets but other formats like Google Sheets can add extra features like online access, version control and collaboration.
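The row-appending pattern mentioned above works with plain CSV files too; a small sketch using Python's csv module in append mode (file name and data are hypothetical) could look like this:

```python
import csv

def append_row(path, row):
    """Append a single scraped row to a spreadsheet-compatible CSV file."""
    with open(path, "a", newline="", encoding="utf-8") as file:
        csv.writer(file).writerow(row)

append_row("results.csv", ["name", "price"])        # header, written once
append_row("results.csv", ["Apple iPhone", 999.0])  # one row per scraped result
```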

Data Processing

Most datapoints found on the web are in free-text format. Dates and prices, for example, are expressed as text and need to be converted to matching data types.
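As a sketch of this kind of conversion, here are two small parsers for prices and dates using only the standard library (the regex pattern and date format are assumptions that would need adjusting per target website):

```python
import re
from datetime import datetime

def parse_price(text):
    """Extract a numeric price from free-form text like 'Price: $1,299.99'."""
    match = re.search(r"[\d,]+(?:\.\d+)?", text)
    if not match:
        return None
    return float(match.group().replace(",", ""))

def parse_date(text):
    """Parse a date string like 'December 22, 2022' into a datetime object."""
    return datetime.strptime(text.strip(), "%B %d, %Y")

print(parse_price("Price: $1,299.99"))       # 1299.99
print(parse_date("December 22, 2022").year)  # 2022
```

For messier real-world dates, dedicated libraries such as dateparser handle many formats automatically.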

Data Validation

When scraping at scale, data validation is vital for consistent results, as real web pages change unpredictably and often.

There are multiple ways to approach validation, but most importantly, matching results against a schema and regular expression patterns can catch the vast majority of failures.
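As a minimal sketch of this idea, a schema can be expressed as expected types plus optional regex patterns per field (the schema below is a hypothetical product schema, not from the original text):

```python
import re

# a minimal schema: expected type and an optional regex pattern per field
PRODUCT_SCHEMA = {
    "name": {"type": str, "pattern": r"\S+"},
    "price": {"type": float},
    "url": {"type": str, "pattern": r"^https?://"},
}

def validate(item, schema):
    """Return a list of validation errors (an empty list means the item is valid)."""
    errors = []
    for field, rules in schema.items():
        value = item.get(field)
        if value is None:
            errors.append(f"missing field: {field}")
            continue
        if not isinstance(value, rules["type"]):
            errors.append(f"{field}: expected {rules['type'].__name__}")
        elif "pattern" in rules and not re.search(rules["pattern"], value):
            errors.append(f"{field}: does not match pattern")
    return errors

item = {"name": "Apple iPhone", "price": 999.0, "url": "https://shop.example.com/iphone"}
print(validate(item, PRODUCT_SCHEMA))  # []
```

In practice, established libraries like jsonschema or pydantic offer much richer validation than this hand-rolled check.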

Next - Blocking

We've covered most of the subjects web scraper developers come across when developing scraping programs. By far the biggest barrier in scraping, though, is scraper blocking, so next let's take a look at what it is and how to avoid it.
