Data Processing, Output & Validation
The last step of every web scraping project is final data processing. This often involves data validation, cleanup, and storage.
As scrapers deal with data from unknown sources, data processing can be a surprisingly complex challenge. For long-term scraping, data validation is crucial for scraper maintenance: using tools to ensure scraped result quality can prevent scrapers from silently breaking or performing sub-optimally.
Data Formats
Web scraping datasets can vary greatly from small predictable structures to large, complex data graphs.
Most commonly the CSV and JSON formats are used.
CSV
CSV is great for flat datasets with a consistent structure. This format can be imported directly into spreadsheet software (Excel, Google Sheets, etc.) and doesn't require compression, as the format is already very compact.
Here's a short example scraper that stores data to CSV:
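A minimal sketch using Python's built-in csv module (the product data here stands in for real scraped results):

```python
import csv

# results our scraper collected - stands in for real scraped data
products = [
    {"name": "Box of Chocolates", "price": "24.99"},
    {"name": "Taffy, Salted", "price": "7.99"},  # note the comma in the value
]

with open("products.csv", "w", newline="", encoding="utf-8") as file:
    writer = csv.DictWriter(file, fieldnames=["name", "price"])
    writer.writeheader()
    # the csv module quotes fields containing the separator automatically
    writer.writerows(products)
```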
Some things to note when working with CSV:
- The separator (,) characters have to be escaped
- CSV is a flat structure, so nested datasets have to be flattened
JSON
JSON is great for complex structures as it allows easy nesting and key-to-value structuring. However, JSON datasets can take up a lot of space and require compression for best storage efficiency.
Here's a short example scraper that stores data in JSON:
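A minimal sketch using Python's built-in json module (again with illustrative product data), explicitly enabling unicode output:

```python
import json

# results our scraper collected - note the nested "variants" field
products = [
    {"name": "Box of Chocolates", "price": 24.99, "variants": ["small", "large"]},
    {"name": "Dūja Candy", "price": 7.99, "variants": []},  # non-ASCII name
]

with open("products.json", "w", encoding="utf-8") as file:
    # ensure_ascii=False keeps unicode characters readable instead of \uXXXX escapes
    json.dump(products, file, indent=2, ensure_ascii=False)
```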
Some things to note when working with JSON:
- The quote (") characters have to be escaped
- Unicode support is often not enabled by default
JSONL
JSONL is a particularly popular JSON format variant in web scraping datasets where each line is a JSON object. This structure allows for easy result streaming, which makes scrapers easier to work with.
Here's an example of a simple JSON Lines scraper:
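A minimal sketch where a generator stands in for a real scraper yielding one result at a time:

```python
import json

def scrape_products():
    # stands in for a real scraper - yields one result at a time
    yield {"name": "Box of Chocolates", "price": 24.99}
    yield {"name": "Taffy", "price": 7.99}

for product in scrape_products():
    # open in append mode and write one JSON object per line
    with open("products.jsonl", "a", encoding="utf-8") as file:
        file.write(json.dumps(product, ensure_ascii=False) + "\n")
```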
Above, we open our output file on each iteration and append a new line. This enables easy data streaming, which is a much better way to handle large data flows.
Spreadsheets
Spreadsheets are a natural fit for web scraping as they are designed to handle dynamic data, can be streamed (appending rows) and can be easy to work with. CSV output is already compatible with spreadsheets but other formats like Google Sheets can add extra features like online access, version control and collaboration.
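As a sketch, appending rows to a Google Sheet with the gspread client library might look like this (assuming a service account credentials file and a pre-created, shared spreadsheet; both names here are illustrative):

```python
import gspread

# assumes a Google service account credentials file and an existing,
# shared spreadsheet named "Scraped Products"
client = gspread.service_account(filename="service-account.json")
sheet = client.open("Scraped Products").sheet1

# stream results by appending one row per scraped item
sheet.append_row(["Box of Chocolates", 24.99])
```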
Data Processing
Most datapoints found on the web are in free-form text. Dates and prices, for example, are expressed as text and need to be converted to matching data types. Here are some tips we cover on this subject:
Automatic Date Parsing with Dateparsers
Dateparser is a Python library that can accurately guess date objects from free-form date strings.
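For example, dateparser's parse function handles relative, natural-language, and numeric date strings alike:

```python
import dateparser

# dateparser guesses the format of free-form date strings
print(dateparser.parse("yesterday"))      # relative dates
print(dateparser.parse("May 5th, 2023"))  # natural language
print(dateparser.parse("05/06/2023"))     # numeric formats
```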
Tips for Scraping Emails
An intro to email scraping and parsing, which has its own unique data processing challenges such as deobfuscation.
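As an illustration of deobfuscation, here's a sketch that reverses one common obfuscation style, where @ and . are replaced with bracketed words (real pages vary, so treat this as a starting point):

```python
import re

def deobfuscate_email(text: str) -> str:
    # reverse the common "name [at] domain [dot] com" obfuscation
    text = re.sub(r"\s*[\[\(]\s*at\s*[\]\)]\s*", "@", text, flags=re.I)
    text = re.sub(r"\s*[\[\(]\s*dot\s*[\]\)]\s*", ".", text, flags=re.I)
    return text

print(deobfuscate_email("contact [at] example [dot] com"))
# contact@example.com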
Tips for Scraping Phone Numbers
An intro to phone number scraping; phone numbers can be difficult to validate and process successfully.
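One option (not covered in depth here) is the phonenumbers library, a Python port of Google's libphonenumber, which can parse and validate free-form numbers:

```python
import phonenumbers

# parse a scraped free-form number with a region hint, then validate it
number = phonenumbers.parse("(415) 555-2671", "US")
print(phonenumbers.is_valid_number(number))  # region-aware validation
print(phonenumbers.format_number(number, phonenumbers.PhoneNumberFormat.E164))
```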
Data Validation
When scraping at scale, data validation is vital for consistent results, as real web pages change unpredictably and often.
There are multiple ways to approach validation, but most importantly, tracking results and matching them against schemas and regular expression patterns can catch 99% of failures.
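As a sketch, a declarative schema library such as Cerberus (one of several options) can combine type, range, and regex checks per result:

```python
from cerberus import Validator

# illustrative schema for a scraped product result
schema = {
    "name": {"type": "string", "required": True, "empty": False},
    "price": {"type": "float", "required": True, "min": 0},
    "url": {"type": "string", "regex": r"https?://.+"},
}
validator = Validator(schema)

product = {"name": "Box of Chocolates", "price": 24.99, "url": "https://example.com/1"}
if not validator.validate(product):
    # log or raise so broken scrapers fail loudly instead of silently
    print(validator.errors)
```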
Next - Blocking
We've covered most of the subjects web scraper developers come across when developing scraping programs. However, by far the biggest barrier in scraping is scraper blocking, so next let's take a look at what it is and how to avoid it.