How to Rate Limit Async Requests in Python


When web scraping, we're often limited by the website's technical capabilities. We can't scrape too fast without being blocked or overwhelming smaller websites.

Asynchronous HTTP clients like httpx allow us to easily make hundreds of requests per second but don't provide ways to fine-tune our scraping speed. So, in this quick tutorial, we'll be taking a look at how to rate-limit asynchronous HTTP connections to slow down our scrapers.

httpx

HTTPX is the most popular asynchronous HTTP client in Python. It can be installed using the pip install terminal command:

$ pip install httpx

HTTPX supports both synchronous and asynchronous HTTP clients. It's unlikely that we'd need to throttle synchronous connections as they are very slow to begin with, so let's take a look at the throttling options we have for the asynchronous client.

To limit the httpx client we can use the httpx.Limits object:

import httpx
session = httpx.AsyncClient(
    limits=httpx.Limits(
        max_connections=5  # we can change max connection count here
    )
)

However, limiting scraping by connection count is very inaccurate: on slower websites, 5 connections might only manage a few requests per minute, while on fast ones they could reach hundreds of requests per minute.

To limit httpx-powered scrapers we need an additional layer that tracks requests themselves rather than connections. Let's take a look at the most popular throttling library - aiometer.
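To see what such a layer does under the hood, here's a minimal hand-rolled sketch using only asyncio. The rate_limited helper and fake_request coroutine are illustrative names, not part of any library - it simply spaces out task starts by a fixed interval:

```python
import asyncio
import time


async def rate_limited(coros, per_second: float):
    """Start one coroutine per interval, then await them all."""
    interval = 1 / per_second
    tasks = []
    for coro in coros:
        tasks.append(asyncio.ensure_future(coro))
        await asyncio.sleep(interval)  # space out task *starts*, not connections
    return await asyncio.gather(*tasks)


async def fake_request(i):
    await asyncio.sleep(0.01)  # stand-in for an HTTP call
    return i


async def main():
    start = time.time()
    results = await rate_limited((fake_request(i) for i in range(5)), per_second=10)
    print(f"finished {len(results)} tasks in {time.time() - start:.1f} seconds")
    return results


results = asyncio.run(main())
```

This works, but libraries like aiometer handle the edge cases for us, so let's use one of those instead.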

aiometer

The most popular way to rate limit asynchronous tasks in Python is aiometer, which can be installed using the pip install terminal command:

$ pip install aiometer

Then we can schedule all of our requests to run through aiometer limiter:

import asyncio
from time import time

import aiometer
import httpx

session = httpx.AsyncClient()


async def scrape(url):
    response = await session.get(url)
    return response


async def run():
    _start = time()
    urls = ["http://httpbin.org/html" for i in range(10)]
    # note: run_on_each doesn't return results - it only runs the tasks
    await aiometer.run_on_each(
        scrape,
        urls,
        max_per_second=1,  # here we can set max rate per second
    )
    print(f"finished {len(urls)} requests in {time() - _start:.2f} seconds")


if __name__ == "__main__":
    asyncio.run(run())

# will print:
# finished 10 requests in 9.54 seconds

In our small example scraper, we used the aiometer.run_on_each function to throttle 10 scrape requests to 1 request per second. With this one function call, we can throttle our scraper to an exact requests-per-second speed!

Scraping Faster Without Rate Limiting?

Some websites impose extremely low speed limits on web scrapers, making data collection impractical. To handle this, web scraping APIs like ScrapFly can be used to scrape faster and avoid blocking.

ScrapFly offers dozens of different features that can help with scraper scaling.

To use ScrapFly in Python, install the scrapfly-sdk package using the pip install scrapfly-sdk terminal command. Then targets can be scraped without blocking:

from scrapfly import ScrapflyClient, ScrapeConfig
client = ScrapflyClient(
    key="YOUR SCRAPFLY KEY",
    max_concurrency=10,  # we can limit concurrent requests if needed
)
result = client.scrape(ScrapeConfig(
    url="http://httpbin.org/ip",
    # optional features like:
    # - we can select specific proxy country
    country="GB",
    # - and enable anti scraping protection bypass:
    asp=True,
    # see https://scrapfly.io/docs/scrape-api/getting-started for more
))
