Scaling

Web scraping involves many different technologies, each of which can become a performance bottleneck. Let's take a look at the most common ones and how to approach scraper scaling in practice.

Async

For scaling HTTP connections, asynchronous or threaded programming is vital for avoiding IO blocking.

Almost every language has some form of async or threading support. For example, asyncio in Python and async/await in JavaScript.

Scrapfly supports built-in, easy concurrency in both its Python SDK and TypeScript SDK.

For example, we can pit asynchronous and synchronous scrapers against each other in Python to observe the performance difference:

For this example, we're using the httpbin.dev/delay/2 testing endpoint, which simulates a 2-second response time. This is an average response time for a commercial website, and it makes for a great demonstration of how powerful async programming can be.
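Here's a minimal sketch of such a comparison, using the httpx HTTP client as an assumed dependency (any client with both sync and async support would work the same way):

```python
# Sketch: synchronous vs asynchronous scraping of a slow endpoint.
# httpx is an assumed dependency; httpbin.dev/delay/2 simulates a 2s response.
import asyncio
import time

import httpx

URLS = ["https://httpbin.dev/delay/2" for _ in range(10)]


def scrape_sync():
    # each request blocks until it completes: ~10 * 2s = ~20 seconds total
    with httpx.Client() as client:
        return [client.get(url) for url in URLS]


async def scrape_async():
    # all requests run concurrently: ~2 seconds total
    async with httpx.AsyncClient() as client:
        return await asyncio.gather(*(client.get(url) for url in URLS))


start = time.time()
scrape_sync()
print(f"sync took: {time.time() - start:.1f}s")   # roughly 20s

start = time.time()
asyncio.run(scrape_async())
print(f"async took: {time.time() - start:.1f}s")  # roughly 2s
```

The synchronous version waits out each 2-second delay one after another, while the async version overlaps all the waiting, which is where the bulk of a scraper's time is usually spent.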

Fast scraping can result in scraper blocking! It's important to distribute scrape traffic through multiple proxies and scraper fingerprints. (Scrapfly does this automatically)

Processing

When it comes to data processing, the challenges shift from IO blocking to CPU-bound tasks. Modern CSS selector and XPath engines are very fast and can parse HTML documents in seconds, but distributing processing across multiple processes can significantly speed things up. For that, see our Introduction to Multi Processing blog article.

The general rule of thumb is to use a single process per CPU core. So, if your server or computer has 12 cores, splitting scraper processing into 12 processes will yield optimal results.
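As a rough sketch, Python's concurrent.futures makes this pattern straightforward; the parse_product() function and the documents list below are hypothetical stand-ins for your own parsing logic and scraped HTML:

```python
# Sketch: distribute CPU-bound HTML parsing across all CPU cores.
# parse_product() is a hypothetical example parser; lxml is an assumed dependency.
import os
from concurrent.futures import ProcessPoolExecutor

from lxml import html


def parse_product(doc: str) -> dict:
    # hypothetical parsing logic for a single HTML document
    tree = html.fromstring(doc)
    return {"title": tree.xpath("//h1/text()")}


def parse_all(documents: list[str]) -> list[dict]:
    # one worker process per CPU core, per the rule of thumb above
    with ProcessPoolExecutor(max_workers=os.cpu_count()) as executor:
        return list(executor.map(parse_product, documents))
```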

Finally, not all data processors are built equally, and some can be significantly faster than others. For example, the lxml library can be up to 10 times faster than BeautifulSoup when it comes to HTML processing.
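If you want to verify this on your own data, a rough benchmark sketch could look like this (page.html is a hypothetical stand-in for any scraped HTML document):

```python
# Sketch: compare BeautifulSoup and lxml parsing speed on the same document.
import time

from bs4 import BeautifulSoup
from lxml import html

doc = open("page.html").read()  # hypothetical scraped HTML file

start = time.time()
for _ in range(100):
    BeautifulSoup(doc, "html.parser").find_all("a")
print(f"BeautifulSoup: {time.time() - start:.2f}s")

start = time.time()
for _ in range(100):
    html.fromstring(doc).xpath("//a")
print(f"lxml: {time.time() - start:.2f}s")
```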

Headless Browsers

Scaling headless browsers is similar to scaling processing, but as web browsers are immensely complex, handling errors, failures, and unexpected behavior is the biggest challenge. For this reason, headless browser services like a self-hosted Selenium Grid or Scrapfly's Javascript Rendering are the most approachable solutions.
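If you do run browsers yourself, the same concurrency patterns apply. Here's a minimal sketch using Playwright as an assumed automation library, limiting the number of open pages and catching per-page failures so one crash doesn't sink the whole run:

```python
# Sketch: run a headless browser pool with asyncio and Playwright (assumed library).
import asyncio

from playwright.async_api import async_playwright

URLS = ["https://httpbin.dev/html" for _ in range(10)]  # placeholder target URLs
LIMIT = asyncio.Semaphore(3)  # keep at most 3 pages open at once


async def render(browser, url: str):
    async with LIMIT:
        page = await browser.new_page()
        try:
            await page.goto(url, timeout=15_000)
            return await page.content()
        except Exception as exc:  # browsers fail in many unexpected ways
            print(f"failed to render {url}: {exc}")
            return None
        finally:
            await page.close()


async def main():
    async with async_playwright() as pw:
        browser = await pw.chromium.launch()
        results = await asyncio.gather(*(render(browser, url) for url in URLS))
        await browser.close()
        return results


asyncio.run(main())
```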

Next - Summary

This concludes what we have in our Scraping Walkthrough Academy, so let's wrap it up with a summary and where to go from here.

