What is Asynchronous Web Scraping?

by scrapecrow Apr 17, 2023

Asynchronous web scraping is a programming technique that runs multiple scrape tasks concurrently, so the time spent waiting on one request is used to make progress on the others.

Asynchronous programming is especially important in web scraping because scrapers spend most of their time waiting. Every time a web scraper requests a web page, it has to wait for the response, and this waiting time adds up quickly when scraping a large number of pages.

For example, let's take a look at this synchronous scraping example in Python:

import httpx
from time import time

_start = time()
pages = [
    "https://httpbin.dev/delay/2",
    "https://httpbin.dev/delay/2",
    "https://httpbin.dev/delay/2",
    "https://httpbin.dev/delay/2",
    "https://httpbin.dev/delay/2",
]
for page in pages:
    # each request blocks until its response arrives before the next one is sent
    httpx.get(page)
print(f"finished scraping {len(pages)} pages in {time() - _start:.2f} seconds")
"finished scraping 5 pages in 15.46 seconds"

Here we have a list of 5 web pages that each take 2 seconds to respond. Because the requests run one after another, the delays add up: the run above took ~15 seconds, which is 10 seconds of server delay plus network and connection overhead for each request.

This is because our code waits for each response to arrive before sending the next request, even though the program itself does nothing but wait for the server during that time.

In contrast, asynchronous web scraping allows for running multiple scrape tasks concurrently:

import httpx
import asyncio
from time import time

async def run():
    _start = time()
    async with httpx.AsyncClient() as client:
        pages = [
            "https://httpbin.dev/delay/2",
            "https://httpbin.dev/delay/2",
            "https://httpbin.dev/delay/2",
            "https://httpbin.dev/delay/2",
            "https://httpbin.dev/delay/2",
        ]
        # run all requests concurrently using asyncio.gather
        await asyncio.gather(*[client.get(page) for page in pages])
    print(f"finished scraping {len(pages)} pages in {time() - _start:.2f} seconds")

asyncio.run(run())
"finished scraping 5 pages in 2.93 seconds"

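This Python example uses httpx.AsyncClient and asyncio to overlap the waiting time by running all requests concurrently. The waiting is not removed, but it happens for all pages at once, so the total run time is close to that of the slowest single request: 2-3 seconds instead of ~15.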
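The same effect can be reproduced without any network access. The following is a minimal sketch, not part of the original example: it assumes asyncio.sleep is a fair stand-in for waiting on a server, and the names fake_fetch, scrape_sequentially, scrape_concurrently and DELAY are hypothetical helpers introduced only for illustration.

```python
import asyncio
from time import perf_counter

DELAY = 0.2  # hypothetical stand-in for the server's response time, in seconds


async def fake_fetch(page: str) -> str:
    # asyncio.sleep hands control back to the event loop while "waiting",
    # much like an async HTTP client does while a response is in flight
    await asyncio.sleep(DELAY)
    return f"response for {page}"


async def scrape_sequentially(pages: list[str]) -> float:
    start = perf_counter()
    for page in pages:
        await fake_fetch(page)  # each wait must finish before the next begins
    return perf_counter() - start


async def scrape_concurrently(pages: list[str]) -> float:
    start = perf_counter()
    # all waits overlap; gather returns results in the order of its inputs
    await asyncio.gather(*[fake_fetch(page) for page in pages])
    return perf_counter() - start


async def main() -> tuple[float, float]:
    pages = [f"page-{i}" for i in range(5)]
    sequential = await scrape_sequentially(pages)
    concurrent = await scrape_concurrently(pages)
    return sequential, concurrent


if __name__ == "__main__":
    sequential, concurrent = asyncio.run(main())
    print(f"sequential: {sequential:.2f}s, concurrent: {concurrent:.2f}s")
```

With five simulated pages, the sequential version should take roughly five times DELAY, while the concurrent version should take roughly one DELAY, mirroring the difference between the two httpx examples above.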


Asynchronous programming is an ideal fit for web scraping and one of the easiest ways to speed it up. For more, see:

Web Scraping Speed: Processes, Threads and Async

Scaling web scrapers can be difficult - in this article we'll go over the core principles like subprocesses, threads and asyncio and how all of that can be used to speed up web scrapers dozens to hundreds of times.
