When web scraping, we're often limited by the website's technical capabilities. We can't scrape too fast without being blocked or overwhelming smaller websites.
Asynchronous HTTP clients like httpx allow us to easily make hundreds of requests per second, but they offer only coarse ways to fine-tune our scraping speed. So, in this quick tutorial, we'll take a look at how to rate-limit asynchronous HTTP connections to slow down our scrapers.
Python httpx
HTTPX is one of the most popular asynchronous HTTP clients for Python. It can be installed using the pip install terminal command:
$ pip install httpx
HTTPX supports both synchronous and asynchronous HTTP clients. It's unlikely that we need to throttle synchronous connections as they are very slow to begin with, so let's take a look at the throttling options we have for the asynchronous client.
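For reference, a minimal asynchronous request with httpx looks like this (httpbin.org is just a stand-in target):

import asyncio
import httpx

async def main():
    # AsyncClient is httpx's asynchronous client; use it as a context manager
    async with httpx.AsyncClient() as client:
        response = await client.get("http://httpbin.org/html")
        print(response.status_code)

asyncio.run(main())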
To limit the httpx client we can use the httpx.Limits object:
import httpx

session = httpx.AsyncClient(
    limits=httpx.Limits(
        max_connections=5,  # we can change the max connection count here
    ),
)
However, limiting scraping by connection count is very imprecise: on slower websites, 5 connections might only manage a few requests per minute, while on fast ones they could reach hundreds of requests per minute.
To limit httpx-powered scrapers we need an additional layer that tracks requests themselves rather than connections. Let's take a look at the most popular throttling library - aiometer.
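To illustrate what that means, here is a minimal hand-rolled sketch of a per-request limiter (the RateLimiter class and its wait() method are hypothetical names used for illustration only); in practice a library like aiometer handles this for us:

import asyncio
import time

class RateLimiter:
    # naive illustration: space requests at least 1/rate seconds apart
    def __init__(self, rate_per_second: float):
        self.interval = 1 / rate_per_second
        self._lock = asyncio.Lock()
        self._last = 0.0

    async def wait(self):
        async with self._lock:
            now = time.monotonic()
            delay = self._last + self.interval - now
            if delay > 0:
                await asyncio.sleep(delay)
            self._last = time.monotonic()

Each scraping coroutine would call await limiter.wait() before sending its request, so the limit tracks requests rather than open connections.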
Python aiometer
The most popular way to rate-limit asynchronous tasks in Python is aiometer, which can be installed using the pip install terminal command:
$ pip install aiometer
Then we can schedule all of our requests to run through the aiometer limiter:
import asyncio
from time import time

import aiometer
import httpx

session = httpx.AsyncClient()


async def scrape(url):
    response = await session.get(url)
    return response


async def run():
    _start = time()
    urls = ["http://httpbin.org/html" for i in range(10)]
    # run_on_each applies scrape() to every url (return values are discarded)
    await aiometer.run_on_each(
        scrape,
        urls,
        max_per_second=1,  # here we can set the max rate per second
    )
    print(f"finished {len(urls)} requests in {time() - _start:.2f} seconds")


if __name__ == "__main__":
    asyncio.run(run())
# will print:
# finished 10 requests in 9.54 seconds
In our small example scraper, we used the aiometer.run_on_each function to limit 10 scrape requests to 1 request per second. With this one call, we can throttle our scraper to an exact requests-per-second speed!
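One caveat: run_on_each discards the responses returned by scrape(). To collect them while keeping the same rate limit, aiometer also provides run_all, which takes argument-less callables (e.g. built with functools.partial) and returns their results in order. A small sketch reusing the scrape() coroutine from above (run_collect is our own hypothetical helper name):

import asyncio
from functools import partial

import aiometer

async def run_collect(urls):
    # run_all returns the results in order;
    # max_at_once caps concurrency, max_per_second caps the request rate
    responses = await aiometer.run_all(
        [partial(scrape, url) for url in urls],
        max_at_once=5,
        max_per_second=1,
    )
    return responses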
Scraping Faster Without Rate Limiting?
Some websites impose extremely low rate limits on web scrapers, making data collection at scale impractical. To handle this, web scraping APIs like ScrapFly can be used to scrape faster and avoid blocking.
ScrapFly offers dozens of features that can help with scaling scrapers, like proxy country selection and anti-scraping protection bypass (both shown in the example below).
To use ScrapFly in Python, install the scrapfly-sdk package using the pip install scrapfly-sdk terminal command. Then targets can be scraped without blocking:
from scrapfly import ScrapflyClient, ScrapeConfig

client = ScrapflyClient(
    key="YOUR SCRAPFLY KEY",
    max_concurrency=10,  # we can limit concurrent requests if needed
)
result = client.scrape(ScrapeConfig(
    url="http://httpbin.org/ip",
    # optional features like:
    # - selecting a specific proxy country
    country="GB",
    # - enabling anti scraping protection bypass
    asp=True,
    # see https://scrapfly.io/docs/scrape-api/getting-started for more
))