Web Scraping Speed: Processes, Threads and Async

Speeding up web scrapers can be a daunting task, so in this tutorial we'll take a look at the biggest speed bottlenecks in web scraping.

In this article we'll focus on Python but the same ideas and concepts can be applied to almost any other programming language or web scraping toolkit.

We'll cover what CPU-bound and IO-bound tasks are and how we can optimize for them using processes, threads and asyncio to speed up our scrapers dozens to hundreds of times. So, let's dive in!

The Obstacles

In web scraping, we primarily have two types of performance bottlenecks: IO-bound and CPU-bound.

CPU-bound tasks are limited by the CPU capabilities while IO-bound tasks are limited by the communication performance

For example, our IO (Input/Output) tasks would be anything that performs an external connection, be it an HTTP request or storing scraped data to a database. Both are major parts of web scraper programs:

# HTTP requests are IO-bound
import requests
from time import time
_start = time()
# requests will block here until the web server responds:
response = requests.get("https://scrapfly.io/blog/")
print(f"requests finished in: {time()-_start:.2f}")

We also encounter CPU tasks such as parsing scraped HTML, loading JSON data, natural language parsing, and so on.

# To parse HTML, our CPU needs to do quite a bit of calculation:
from parsel import Selector
html = response.text  # the page body downloaded in the previous snippet
selector = Selector(html)
article = "\n".join(selector.css(".post-section p").getall())

# even more so if we're using complex parsing tech such as natural language processing
from nltk.tokenize import word_tokenize
word_tokens = word_tokenize(article)

In web scraping, we encounter both of these performance challenges. However, IO blocking usually accounts for a much larger share of the total run time, as scrapers spend most of their time waiting on network and storage operations.
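One rough way to tell which kind of bottleneck a task has is to compare wall-clock time with CPU time. The sketch below is only an illustration: `measure`, `fake_io_task` and `fake_cpu_task` are made-up names, and `time.sleep` stands in for waiting on a web server.

```python
import time


def measure(task):
    """Return (wall_seconds, cpu_seconds) spent running task()."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    task()
    return time.perf_counter() - wall_start, time.process_time() - cpu_start


def fake_io_task():
    # stand-in for waiting on a web server: sleeping uses almost no CPU
    time.sleep(0.3)


def fake_cpu_task():
    # stand-in for parsing: pure computation keeps the CPU busy
    sum(i * i for i in range(1_000_000))


io_wall, io_cpu = measure(fake_io_task)
cpu_wall, cpu_cpu = measure(fake_cpu_task)
print(f"IO-like task:  wall={io_wall:.2f}s cpu={io_cpu:.2f}s")
print(f"CPU-like task: wall={cpu_wall:.2f}s cpu={cpu_cpu:.2f}s")
```

When wall time is much larger than CPU time, the task is mostly waiting (IO-bound). When the two are close, the task is mostly computing (CPU-bound).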

Scaling Options in Python

Before we start speeding stuff up, let's take a look at available scaling technologies in Python:

  • A process is an isolated memory space where the program runs. Every process has its own Python interpreter and its own memory.
  • Every process can spawn child processes, but since processes don't share memory, passing data between them is more complicated.
  • Every process can have multiple threads, which can easily share data with each other but, because of the Global Interpreter Lock (GIL) in standard CPython, cannot execute Python code in parallel like processes can.

To summarize, we can have multiple processes to deal with CPU-bound tasks as they can compute in parallel on each processor core. So, if our machine has 12 cores we can run 12 computing processes in parallel.

Each process can also have threads that cannot run in parallel but can take turns and share data between them which is a way to manage IO-bound tasks as one thread can take over while the other is waiting for IO to complete.
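To illustrate how threads take turns while waiting, here is a minimal sketch that uses `time.sleep` as a stand-in for a blocking HTTP call. The `fake_request` function is hypothetical and makes no network connections.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_request(i):
    # stand-in for a blocking HTTP call: the thread idles here,
    # which lets the other threads run in the meantime
    time.sleep(0.2)
    return i


start = time.perf_counter()
with ThreadPoolExecutor(max_workers=5) as executor:
    # map() returns results in the same order as the inputs
    results = list(executor.map(fake_request, range(5)))
elapsed = time.perf_counter() - start
print(f"5 waits of 0.2s finished in {elapsed:.2f}s: {results}")
```

Run one after another, the five waits would take about a second. With five threads the waits overlap, so the total should be close to a single wait.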

So when it comes to web scraping: multi-processing for parsing and multi-threading for connections?
Not exactly. Threads in Python are a bit complex and expensive. We can easily start a few threads, but if we're making thousands of connections, much of our computing resources will be taken up by thread-handling overhead.
So, instead, a different technology is more favorable in web scraping: single-threaded asynchronous programs!

So, to summarize all of this: multi-processing for CPU performance and asyncio for IO performance.
Let's take a look at both of these in practical web scraping!

Async Requests

An IO block is when our program has to wait for an external system to respond. For example, when making an HTTP connection, our program sends a request and waits for the web server to send a response. This could be several seconds of waiting.

Why can't our program just do something else while it waits? That's exactly what asyncio does!
Asynchronous programs use a manager (called an event loop) that can pause some functions and let others take over in the meantime. In other words, while one IO-blocking operation (like a request) is waiting the event loop will let the other operations take over.
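The pausing and resuming can be seen in a small network-free sketch. Here `asyncio.sleep` stands in for waiting on a server, and `fake_request` and the `events` list are made up for illustration.

```python
import asyncio
import time

events = []  # records the order in which things happen


async def fake_request(name, delay):
    events.append(f"{name} sent")
    # awaiting hands control back to the event loop while we wait
    await asyncio.sleep(delay)
    events.append(f"{name} received")
    return name


async def main():
    start = time.perf_counter()
    # create_task() schedules each coroutine on the event loop right away
    tasks = [
        asyncio.create_task(fake_request("a", 0.3)),
        asyncio.create_task(fake_request("b", 0.2)),
        asyncio.create_task(fake_request("c", 0.1)),
    ]
    results = [await task for task in tasks]
    return results, time.perf_counter() - start


results, elapsed = asyncio.run(main())
print(events)
print(f"{results} in {elapsed:.2f}s")
```

All three "requests" are sent before any response arrives, and the total time is close to the longest single wait rather than the sum of all three.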

So, if we make 50 HTTP requests, each one taking 1 second, in a synchronous program we'll end up spending 50+ seconds:

import requests
from time import time

_start = time()
for i in range(50):
    requests.get("http://httpbin.org/delay/1")
print(f"finished in: {time() - _start:.2f} seconds")
# finished in: 52.21 seconds

Most of this spent time is our program waiting for the server to respond to us. Let's get rid of this waiting using an asynchronous HTTP client instead:

import httpx
import asyncio
from time import time

_start = time()

async def main():
    async with httpx.AsyncClient() as client:
        tasks = [client.get("http://httpbin.org/delay/1") for i in range(50)]
        for response_future in asyncio.as_completed(tasks):
            response = await response_future
    print(f"finished in: {time() - _start:.2f} seconds")

asyncio.run(main())
# finished in: 3.55 seconds

Now, all of our requests can wait together, which gives us an immense speed boost!

Asyncio allows us to group multiple IO-blocking tasks as one

However, this means that we have to tell our program explicitly when it can bundle this waiting. We do this through the asyncio.gather or asyncio.as_completed helper functions:

import httpx
import asyncio
from time import time


async def with_gathering():
    _start = time()
    async with httpx.AsyncClient() as client:
        tasks = [client.get("http://httpbin.org/delay/1") for i in range(50)]
        for response_future in asyncio.as_completed(tasks):
            response = await response_future
    print(f"with gathering finished in: {time() - _start:.2f} seconds")

async def without_gathering():
    _start = time()
    async with httpx.AsyncClient() as client:
        for i in range(50):
            response = await client.get("http://httpbin.org/delay/1")
    print(f"without gathering finished in: {time() - _start:.2f} seconds")
        

asyncio.run(with_gathering())
# with gathering finished in: 3.55 seconds
asyncio.run(without_gathering())
# without gathering finished in: 52.78 seconds

We can see that without it, we're back to the same speed as our synchronous program. So, designing asynchronous programs is a bit harder, as we have to state explicitly when tasks can run concurrently.
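The two helpers differ mainly in how results come back. The following sketch assumes a hypothetical `fake_request` coroutine that sleeps instead of hitting the network. It shows `asyncio.gather` returning results in input order, while `asyncio.as_completed` yields them as they finish.

```python
import asyncio


async def fake_request(url, delay):
    # stand-in for an HTTP call that takes `delay` seconds
    await asyncio.sleep(delay)
    return url


async def main():
    jobs = [("page-1", 0.3), ("page-2", 0.1), ("page-3", 0.2)]

    # gather: waits for everything, results keep the input order
    gathered = await asyncio.gather(*(fake_request(u, d) for u, d in jobs))

    # as_completed: yields each result as soon as it is ready
    completed = []
    tasks = [asyncio.create_task(fake_request(u, d)) for u, d in jobs]
    for future in asyncio.as_completed(tasks):
        completed.append(await future)
    return gathered, completed


gathered, completed = asyncio.run(main())
print("gather:      ", gathered)
print("as_completed:", completed)
```

As a rule of thumb, `gather` is convenient when results must line up with their inputs, and `as_completed` when each response can be processed as soon as it arrives.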

Mixing With Synchronous Code

The only downside of asyncio is that we need our libraries to provide explicit support for it. Thus old community packages cannot take advantage of asyncio speed without being updated.

However, asyncio comes with a brilliant tool: the asyncio.to_thread() function (available since Python 3.9), which can turn a call to any synchronous function into an awaitable one!

to_thread() does this by deferring synchronous code to a new thread managed by asyncio. So, we can easily integrate slow sync code into our asynchronous programs.

Let's take a look at an imaginary example where we have two scraper functions: async scraper we wrote ourselves and a community scraper that uses synchronous code:

from time import time
import requests
import httpx
import asyncio


def community_movie_scraper(movie_name):
    """community movie scraper is synchronous and slow"""
    response = requests.get("http://httpbin.org/delay/1")
    ...
    return {"title": movie_name}


async def our_movie_scraper(client, movie_name):
    """our movie scraper is asynchronous and fast"""
    response = await client.get("http://httpbin.org/delay/1")
    ...
    return {"title": movie_name}


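async def scrape(use_threads=True):
    """scrape movies using both our scraper and community scraper"""
    movies = ["badguys", "witch", "interstellar", "prey", "luck", "up"]
    _start = time()
    async with httpx.AsyncClient() as client:
        async_tasks = [our_movie_scraper(client, f"async: {movie}") for movie in movies]
        sync_tasks = []
        if use_threads:
            # defer each blocking call to a thread so the event loop stays free
            sync_tasks = [asyncio.to_thread(community_movie_scraper, f"sync: {movie}") for movie in movies]
        else:
            # calling the sync scraper directly blocks the event loop on every request
            for movie in movies:
                print(community_movie_scraper(f"sync: {movie}"))
        for result in asyncio.as_completed(async_tasks + sync_tasks):
            print(await result)
    print(f"completed use_threads={use_threads} in {time() - _start:.2f}")


if __name__ == "__main__":
    asyncio.run(scrape(use_threads=True))
    asyncio.run(scrape(use_threads=False))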
Run Output
{'title': 'sync: badguys'}
{'title': 'sync: interstellar'}
{'title': 'async: up'}
{'title': 'async: interstellar'}
{'title': 'async: badguys'}
{'title': 'sync: witch'}
{'title': 'async: luck'}
{'title': 'sync: up'}
{'title': 'sync: luck'}
{'title': 'sync: prey'}
{'title': 'async: witch'}
{'title': 'async: prey'}
completed use_threads=True in 2.24
{'title': 'sync: badguys'}
{'title': 'sync: witch'}
{'title': 'sync: interstellar'}
{'title': 'sync: prey'}
{'title': 'sync: luck'}
{'title': 'sync: up'}
{'title': 'async: badguys'}
{'title': 'async: interstellar'}
{'title': 'async: prey'}
{'title': 'async: up'}
{'title': 'async: witch'}
{'title': 'async: luck'}
completed use_threads=False in 13.24

In the example above, we have two movie scraper functions: our super fast asynchronous one and a slow community one.
To speed up our overall program, we simply defer the synchronous, slow functions to threads!

As we've covered before, Python threads cannot run in parallel, though they can pause and take turns just like asyncio coroutines. This means we can mix and match async code and threads very easily!
In our example, we've created 6 asyncio coroutines and 6 threads managed by asyncio, allowing us to easily combine our fast async code with slow sync code and run them at async speed.
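For trying this idea without any network access, a stripped-down sketch might look like the following. Both scraper functions are hypothetical stand-ins that only sleep, and `asyncio.to_thread` requires Python 3.9 or newer.

```python
import asyncio
import time


def slow_sync_scraper(name):
    # hypothetical blocking function standing in for a sync-only library
    time.sleep(0.3)
    return f"sync: {name}"


async def fast_async_scraper(name):
    # hypothetical coroutine standing in for an async HTTP client call
    await asyncio.sleep(0.3)
    return f"async: {name}"


async def main():
    start = time.perf_counter()
    results = await asyncio.gather(
        fast_async_scraper("up"),
        # to_thread() runs the blocking call in a worker thread
        # and gives us something we can await
        asyncio.to_thread(slow_sync_scraper, "up"),
        asyncio.to_thread(slow_sync_scraper, "prey"),
    )
    return results, time.perf_counter() - start


results, elapsed = asyncio.run(main())
print(f"{results} in {elapsed:.2f}s")
```

All three waits overlap, so the run should take roughly as long as one of them instead of three in a row.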


Using asyncio, we can quickly scale around IO blocking like HTTP requests in web scraping. However, another big part of web scraping is the data parsing itself. So, let's take a look at how we can scale around that using multi-processing next.

Multi Process Parsing

Using asyncio we can get data quickly, but to parse it our Python program will still use a single CPU core, even though modern processors have many cores.

To distribute our parsing through multiple CPU cores we can use multi-processing.

We can distribute our parsing tasks through multiple processes

Even modern laptops often have a dozen or more cores:

import multiprocessing
print(f"This machine has {multiprocessing.cpu_count()} CPU cores")
# This machine has 12 CPU cores

If we have 12 cores, we can spawn 12 parallel processes to parse our scraped content. That's potentially a 12x speed boost!

The easiest way to take advantage of multi-processing in Python is through concurrent.futures.ProcessPoolExecutor:

from concurrent.futures import ProcessPoolExecutor
from time import time


def fibonacci_of(n):
    """get fibonacci number (deliberately slow recursive version)"""
    if n in {0, 1}:
        return n
    return fibonacci_of(n - 1) + fibonacci_of(n - 2)


def multi_process(number, times):
    start = time()
    with ProcessPoolExecutor() as executor:
        for result in executor.map(fibonacci_of, [number for i in range(times)]):
            pass
        print("done")
    print(f"multi-process finished in {time() - start:.2f}")


def single_process(number, times):
    start = time()
    for i in range(times):
        fibonacci_of(number)
    print(f"single-process finished in {time() - start:.2f}")


if __name__ == "__main__":
    fib_number = 36  # single calculation of 36 takes around 1-3 seconds
    count = 12
    multi_process(fib_number, count)
    # multi-process finished in 3.1
    single_process(fib_number, count)
    # single-process finished in 32.8

Here we can see how using ProcessPoolExecutor sped up our program more than 10 times.

ProcessPoolExecutor starts up to a maximum number of subprocesses, which by default equals the number of available CPU cores. So, on a machine with 12 CPU cores it'll spawn 12 processes and spread the workload across them, giving us a major performance boost.
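The Fibonacci benchmark is a stand-in for real parsing work. A sketch closer to scraping, using only the standard library's html.parser, could look like this. `ParagraphCounter` and `count_words` are made-up names, and the `max_workers` and `chunksize` values are arbitrary choices for illustration.

```python
from concurrent.futures import ProcessPoolExecutor
from html.parser import HTMLParser


class ParagraphCounter(HTMLParser):
    """Counts words that appear inside <p> tags."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.words = 0

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph:
            self.words += len(data.split())


def count_words(html):
    # must be a top-level function so it can be pickled and sent to a worker
    parser = ParagraphCounter()
    parser.feed(html)
    return parser.words


if __name__ == "__main__":
    pages = ["<html><p>one two three</p><div>skip</div><p>four</p></html>"] * 8
    # max_workers caps the pool size; chunksize sends pages to workers in groups
    with ProcessPoolExecutor(max_workers=4) as executor:
        counts = list(executor.map(count_words, pages, chunksize=2))
    print(counts)
```

Keeping the pool behind the `if __name__ == "__main__":` guard matters on platforms that start workers by re-importing the main module.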

Async + Multi Processing

Finally, we can combine both of these technologies to fully utilize Python for web scraping. We can write our scraper in asynchronous python and then distribute it through multiple processes.

Best of both worlds: Processing for CPU tasks and asyncio for IO tasks

Let's take a look at this example scraper:

import asyncio
import multiprocessing
from concurrent.futures import ProcessPoolExecutor
from time import sleep, time

import httpx


async def scrape(urls):
    """this is our async scraper that scrapes"""
    results = []
    async with httpx.AsyncClient(timeout=httpx.Timeout(30.0)) as client:
        scrape_tasks = [client.get(url) for url in urls]
        for response_f in asyncio.as_completed(scrape_tasks):
            response = await response_f
            # emulate data parsing/calculation; this sleep blocks on purpose,
            # just like real CPU-bound parsing would
            sleep(0.5)
            ...
            results.append("done")
    return results


def scrape_wrapper(args):
    i, urls = args
    print(f"subprocess {i} started")
    result = asyncio.run(scrape(urls))
    print(f"subprocess {i} ended")
    return result


def multi_process(urls):
    _start = time()

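    # let's keep 1 core for ourselves (but always use at least 1 process)
    process_count = max(multiprocessing.cpu_count() - 1, 1)
    # ceiling division: how many urls each process gets
    batch_size = max(-(-len(urls) // process_count), 1)
    print(f"scraping {len(urls)} urls through {process_count} processes")
    batches = [urls[i : i + batch_size] for i in range(0, len(urls), batch_size)]
    with ProcessPoolExecutor(max_workers=process_count) as executor: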
        for result in executor.map(scrape_wrapper, enumerate(batches)):
            print(result)
        print("done")

    print(f"multi-process finished in {time() - _start:.2f}")

def single_process(urls):
    _start = time()
    results = asyncio.run(scrape(urls))
    print(f"single-process finished in {time() - _start:.2f}")



if __name__ == "__main__":
    urls = ["http://httpbin.org/delay/1" for i in range(100)]
    multi_process(urls)
    # multi-process finished in 7.22
    single_process(urls)
    # single-process finished in 51.28

In our code example above, we have two scrape runners:

  • single_process is our simple async scrape runner, which gets around IO blocking but still spends a lot of time parsing.
  • multi_process is our async scraper distributed through multiple processes, which gets around IO blocking and increases parsing speed.

Designing scrapers with async processes can look daunting at first but with a little bit of effort we can achieve incredible web scraping speeds.
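Splitting the URL list evenly across processes is an easy step to get subtly wrong. One possible helper is sketched below. `split_into_batches` is a hypothetical name and not part of any library.

```python
def split_into_batches(items, batch_count):
    """Split items into at most batch_count batches of near-equal size."""
    if not items:
        return []
    # never create more batches than there are items
    batch_count = max(1, min(batch_count, len(items)))
    base, extra = divmod(len(items), batch_count)
    batches, start = [], 0
    for i in range(batch_count):
        # the first `extra` batches get one additional item each
        size = base + (1 if i < extra else 0)
        batches.append(items[start:start + size])
        start += size
    return batches


urls = [f"http://httpbin.org/delay/1?page={i}" for i in range(10)]
for i, batch in enumerate(split_into_batches(urls, 3)):
    print(f"batch {i}: {len(batch)} urls")
```

Each batch can then be handed to one subprocess, which runs its own event loop through asyncio.run().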

Concurrency in ScrapFly

Concurrency can be difficult in web scraping, especially when handling other scraping challenges like scraper blocking. This is where ScrapFly can lend a hand!

ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale.

For example, ScrapFly SDK implements a very convenient concurrent_scrape() function, which executes many scrape tasks in parallel:

import asyncio
from scrapfly import ScrapflyClient, ScrapeConfig

async def main():
    client = ScrapflyClient(key="YOUR SCRAPFLY KEY")
    to_scrape = [ScrapeConfig("http://httpbin.org/delay/1/") for i in range(20)]
    results = await client.concurrent_scrape(to_scrape)
    print(results)
asyncio.run(main())

By deferring connection tasks to the ScrapFly service, we don't need to worry about blocking or scaling!

Summary

In this Python web scraping tutorial we've taken a look at scraping speed basics. We covered how threads or asyncio can help us speed up IO-bound work and how multi-processing can help us speed up CPU-bound work.

By using built-in tools in Python, we can speed up our scrapers dozens to hundreds of times with very little extra resource or development time overhead.
