How to Web Scrape with HTTPX and Python

HTTPX is a powerful new HTTP client library for Python. It's quickly becoming the most popular option for HTTP connections in web scraping as it offers an asynchronous client and HTTP/2 support.

In this tutorial, we'll take a look at what makes Python's httpx so great for web scraping and how to use it effectively.

Installing httpx

HTTPX is a pure Python package, so it can be easily installed using the pip console command:

$ pip install httpx
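
To verify the install, the httpx version can be printed from the command line:

$ python -c "import httpx; print(httpx.__version__)"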

Alternatively, it can be installed using the Poetry package manager:

$ poetry init -d httpx
# or
$ poetry add httpx

Using HTTPX

HTTPX can be used for individual requests directly. It supports the popular HTTP methods like GET and POST and can unpack JSON responses into Python dictionaries directly:

import httpx

# GET request
response = httpx.get("https://httpbin.dev/get")
print(response)
data = response.json()
print(data['url'])

# POST requests
payload = {"query": "foo"}
# application/json content:
response = httpx.post("https://httpbin.dev/post", json=payload)
# or formdata:
response = httpx.post("https://httpbin.dev/post", data=payload)
print(response)
data = response.json()
print(data['url'])

Here we used the response's .json() method to load the JSON result into a Python dictionary. Httpx comes with many convenient shortcuts like this, making it a very accessible HTTP client for web scraping.
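
Besides .json(), the response object also exposes other handy shortcuts such as .status_code, .headers, .text and .content; for example:

import httpx

response = httpx.get("https://httpbin.dev/get")
print(response.status_code)  # e.g. 200
print(response.headers["content-type"])  # e.g. application/json
print(response.text)  # response body as a string
print(response.content)  # response body as raw bytes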

Using httpx Client

For web scraping, it's best to use an httpx.Client, which can apply custom configurations like headers, cookies and proxies to the entire httpx session:

import httpx

with httpx.Client(
    # enable HTTP2 support
    http2=True,
    # set headers for all requests
    headers={"x-secret": "foo"},
    # set cookies
    cookies={"language": "en"},
    # set proxies
    proxies={
        # route all http:// connections through a proxy:
        "http://": "http://222.1.1.1:8000",
        # route all https:// connections through a proxy:
        "https://": "http://222.1.1.1:8000",
        # socks5, socks4 and socks4a proxies can be used as well
        # (requires pip install httpx[socks]), e.g.:
        # "https://": "socks5://222.1.1.1:8000",
    }
) as session:
    # all requests made with this client share the settings above
    response = session.get("https://httpbin.dev/get")
    print(response.json())

The httpx client applies these configurations to all requests it makes and even keeps track of cookies set by the server.
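
For example, a cookie set by one response is automatically sent with the following requests made by the same client. A minimal sketch, assuming httpbin.dev mirrors httpbin.org's cookie endpoints:

import httpx

with httpx.Client() as session:
    # the server sets a cookie through the Set-Cookie header...
    session.get("https://httpbin.dev/cookies/set/session-id/12345")
    # ...and the client sends it back automatically on later requests
    response = session.get("https://httpbin.dev/cookies")
    print(response.json())  # should include the session-id cookie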

Using httpx Asynchronously

To use httpx asynchronously with Python's asyncio, the httpx.AsyncClient() object can be used:

import asyncio
import httpx

async def main():
    async with httpx.AsyncClient(
        # to limit asynchronous concurrent connections limits can be applied:
        limits=httpx.Limits(max_connections=10),
        # tip: increase timeouts for concurrent connections:
        timeout=httpx.Timeout(60.0),  # seconds
        # note: AsyncClient takes the same arguments as Client (headers, cookies etc.)
    ) as client:
        # to make concurrent requests asyncio.gather can be used:
        urls = [
            "https://httpbin.dev/get",
            "https://httpbin.dev/get",
            "https://httpbin.dev/get",
        ]
        responses = await asyncio.gather(*[client.get(url) for url in urls])
        # or asyncio.as_completed:
        for result in asyncio.as_completed([client.get(url) for url in urls]):
            response = await result
            print(response)

asyncio.run(main())

Note that when using async with, all requests should finish before the async with block exits, otherwise an exception will be raised:

RuntimeError: Cannot send a request, as the client has been closed.
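
For example, this sketch would trigger the error because the request is only awaited after the async with block has already closed the client:

import asyncio
import httpx

async def main():
    async with httpx.AsyncClient() as client:
        request = client.get("https://httpbin.dev/get")  # coroutine created but not awaited
    # the client is closed at this point, so sending the request now fails
    await request

asyncio.run(main())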

As an alternative to the async with statement, the httpx AsyncClient can be opened and closed manually:

import asyncio
import httpx

async def main():
    client = httpx.AsyncClient()

    # do some scraping
    ...

    # close client
    await client.aclose()

asyncio.run(main())

Troubleshooting HTTPX

While httpx for Python is a great library, it's easy to run into some common problems. Here are a few frequent issues encountered when web scraping with httpx and how to address them:

httpx.TimeoutException

The httpx.TimeoutException error occurs when a request takes longer than the specified (or default) timeout duration. Try raising the timeout parameter:

httpx.get("https://httpbin.org/delay/10", timeout=httpx.Timeout(60.0))
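
Timeouts can also be configured more granularly through httpx.Timeout, which accepts separate connect, read and write values; a minimal sketch:

import httpx

# allow 60 seconds overall but fail fast if the connection can't be established in 5
timeout = httpx.Timeout(60.0, connect=5.0)
response = httpx.get("https://httpbin.org/delay/10", timeout=timeout)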

httpx.ConnectError

The httpx.ConnectError exception is raised when connection issues are detected, which can be caused by:

  • unstable internet connection.
  • server being unreachable.
  • mistakes in the URL parameter.
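
Like the other httpx exceptions, it can be caught and handled explicitly; a minimal sketch:

import httpx

try:
    response = httpx.get("https://httpbin.dev/get")
except httpx.ConnectError as e:
    # the server could not be reached - check the URL and the connection
    print(f"connection failed: {e}")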

httpx.TooManyRedirects

The httpx.TooManyRedirects is raised when a request exceeds the maximum number of allowed redirects.

This can be caused by an issue with the scraped web server or httpx redirect resolution logic. It can be fixed by resolving redirects manually:

response = httpx.get(
    "https://httpbin.dev/redirect/3",
    follow_redirects=False,  # disable automatic redirect handling
)
# then we can check whether we want to handle redirecting ourselves:
redirect_location = response.headers["Location"]

httpx.HTTPStatusError

The httpx.HTTPStatusError error is raised when calling response.raise_for_status() and the response status code is outside of the 200-299 range, like 404:

response = httpx.get("https://httpbin.dev/status/404")
response.raise_for_status()  # raises httpx.HTTPStatusError

When web scraping, status codes outside of the 200-299 range can mean the scraper is being blocked.
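
When such responses are expected, the exception can be handled explicitly; a minimal sketch using httpbin.dev's status endpoint:

import httpx

response = httpx.get("https://httpbin.dev/status/403")
try:
    response.raise_for_status()
except httpx.HTTPStatusError as e:
    # 403 and 429 responses often indicate the scraper is being blocked
    print(f"request failed with status {e.response.status_code}")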

httpx.UnsupportedProtocol

The httpx.UnsupportedProtocol error is raised when the URL scheme is missing or not supported by httpx (i.e. not http:// or https://). This is most commonly encountered when the URL is missing the https:// part.
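
For example, a URL with a missing scheme raises this error; a minimal sketch:

import httpx

try:
    # missing the https:// part of the URL
    httpx.get("httpbin.dev/get")
except httpx.UnsupportedProtocol as e:
    print(f"invalid URL: {e}")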

Retrying HTTPX Requests

The HTTPX client doesn't come with built-in retry functionality, but it can easily integrate with popular Python retrying packages like tenacity (pip install tenacity).

Using tenacity we can add retry logic that retries on status codes outside of the 200-299 range, on httpx exceptions and even on failure keywords found in the response body:

import httpx
from tenacity import retry, stop_after_attempt, wait_fixed, retry_if_exception, retry_if_result

# Define the conditions for retrying based on exception types
def is_retryable_exception(exception):
    return isinstance(exception, (httpx.TimeoutException, httpx.ConnectError))

# Define the conditions for retrying based on HTTP status codes
def is_retryable_status_code(response):
    return response.status_code in [500, 502, 503, 504]

# Define the conditions for retrying based on response content
def is_retryable_content(response):
    return "you are blocked" in response.text.lower()

# Decorate the function with retry conditions and parameters
@retry(
    retry=(
        retry_if_exception(is_retryable_exception)
        | retry_if_result(is_retryable_status_code)
        | retry_if_result(is_retryable_content)
    ),
    stop=stop_after_attempt(3),
    wait=wait_fixed(5),
)
def fetch_url(url):
    try:
        # return the response so the result-based retry conditions can inspect it
        response = httpx.get(url)
        return response
    except httpx.RequestError as e:
        print(f"Request error: {e}")
        raise e

url = "https://httpbin.org/get"
try:
    response = fetch_url(url)
    print(f"Successfully fetched URL: {url}")
    print(response.text)
except Exception as e:
    print(f"Failed to fetch URL: {url}")
    print(f"Error: {e}")

Above, we use tenacity's retry decorator to define our retry rules for common httpx errors.

Rotating Proxies for Retries

When it comes to handling blocking while web scraping with httpx, proxy rotation can be used together with tenacity's retry functionality.

In this example, we'll take a look at a common web scraping pattern of rotating proxies and headers when a scrape gets blocked. We'll add a retry that:

  • Retries status codes 403 and 404
  • Retries up to 5 times
  • Sleeps randomly 1-5 seconds between retries
  • Changes random proxy for each retry
  • Changes random User-Agent request header for each retry

Using httpx and tenacity:

import httpx
import random
from tenacity import retry, stop_after_attempt, wait_random, retry_if_result
import asyncio


PROXY_POOL = [
    "http://2.56.119.93:5074",
    "http://185.199.229.156:7492",
    "http://185.199.228.220:7300",
    "http://185.199.231.45:8382",
    "http://188.74.210.207:6286",
    "http://188.74.183.10:8279",
    "http://188.74.210.21:6100",
    "http://45.155.68.129:8133",
    "http://154.95.36.199:6893",
    "http://45.94.47.66:8110",
]
USER_AGENT_POOL = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:54.0) Gecko/20100101 Firefox/54.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/604.3.5 (KHTML, like Gecko) Version/11.0.1 Safari/604.3.5",
]


# Define the conditions for retrying based on HTTP status codes
def is_retryable_status_code(response):
    return response.status_code in [403, 404]


# callback to modify scrape after each retry
def update_scrape_call(retry_state):
    # change to random proxy on each retry
    new_proxy = random.choice(PROXY_POOL)
    new_user_agent = random.choice(USER_AGENT_POOL)
    print(
        "retry {attempt_number}: {url} @ {proxy} with a new proxy {new_proxy}".format(
            attempt_number=retry_state.attempt_number,
            new_proxy=new_proxy,
            **retry_state.kwargs
        )
    )
    retry_state.kwargs["proxy"] = new_proxy
    # headers were passed through **client_kwargs, so they appear directly in kwargs
    retry_state.kwargs["headers"]["User-Agent"] = new_user_agent


@retry(
    # retry on bad status code
    retry=retry_if_result(is_retryable_status_code),
    # max 5 retries
    stop=stop_after_attempt(5),
    # wait randomly 1-5 seconds between retries
    wait=wait_random(min=1, max=5),
    # update scrape call on each retry
    before_sleep=update_scrape_call,
)
async def scrape(url, proxy, **client_kwargs):
    async with httpx.AsyncClient(
        proxies={"http://": proxy, "https://": proxy},
        **client_kwargs,
    ) as client:
        response = await client.get(url)
        return response

Above is a short demo of how to apply retry logic that can rotate the proxy and user agent string on each retry.

First, we define our proxy and user agent pools then use the @retry decorator to wrap our scrape function with tenacity's retry logic.

To modify each retry call we use the before_sleep parameter, which updates our scrape function call with new parameters before each retry.

Here's an example test run:

async def example_run():
    urls = [
        "https://httpbin.dev/ip",
        "https://httpbin.dev/ip",
        "https://httpbin.dev/ip",
        "https://httpbin.dev/status/403",
    ]
    to_scrape = [scrape(url=url, proxy=random.choice(PROXY_POOL), headers={"User-Agent": "foo"}) for url in urls]
    for result in asyncio.as_completed(to_scrape):
        response = await result
        print(response.json())


asyncio.run(example_run())

Avoiding Blocking with Scrapfly

Scrapfly API offers a Python SDK which is like httpx on steroids.

illustration of scrapfly's middleware

Scrapfly offers powerful features like anti-scraping protection bypass, proxy country selection and JavaScript rendering through headless browsers.

All httpx functionality is supported by the Scrapfly SDK, making migration effortless:

from scrapfly import ScrapeConfig, ScrapflyClient

client = ScrapflyClient(key="YOUR SCRAPFLY KEY")

result = client.scrape(ScrapeConfig(
    url="https://httpbin.dev/get",
    # enable anti-scraping protection (like cloudflare or perimeterx) bypass
    asp=True,
    # select proxy country:
    country="US",
    # enable headless browser
    render_js=True,
))
print(result.content)

# tip: use concurrent scraping for blazing speeds:
to_scrape = [
    ScrapeConfig(url="https://httpbin.dev/get")
    for i in range(10)
]
async for result in client.concurrent_scrape(to_scrape):
    print(result.content)

The Scrapfly SDK can be installed using the pip console command and is free to try: pip install scrapfly-sdk

FAQ

To wrap up this Python httpx introduction, let's take a look at some frequently asked questions related to web scraping with httpx.

HTTPX vs Requests

Requests is the most popular HTTP client for Python, known for being accessible and easy to work with. It's also an inspiration for HTTPX, which is a natural successor to requests with modern Python features like asyncio and HTTP/2 support.

HTTPX vs Aiohttp

Aiohttp is one of the first HTTP clients to support asyncio and one of the inspirations for HTTPX. These two packages are very similar, though aiohttp is more mature while httpx is newer but more feature rich. So, when it comes to aiohttp vs httpx, the latter is preferred for web scraping because of its HTTP/2 support.

How to use HTTP2 with httpx?

Httpx supports HTTP/2, which is recommended for web scraping as it can drastically reduce the scraper block rate. HTTP/2 is not enabled by default; to enable it, the http2=True parameter must be used in the httpx.Client(http2=True) and httpx.AsyncClient(http2=True) objects, and the optional HTTP/2 dependency must be installed with pip install "httpx[http2]".
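
A minimal sketch checking which protocol version was negotiated:

import httpx

with httpx.Client(http2=True) as client:
    response = client.get("https://httpbin.dev/get")
    print(response.http_version)  # "HTTP/2" when the server supports it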

How to automatically follow redirects in httpx?

Httpx doesn't follow redirects by default, unlike other Python libraries like requests. To enable automatic redirect following, use the follow_redirects=True parameter in httpx request methods like httpx.get(url, follow_redirects=True) or in client objects like httpx.Client(follow_redirects=True).
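
A minimal sketch using httpbin.dev's redirect endpoint:

import httpx

response = httpx.get("https://httpbin.dev/redirect/3", follow_redirects=True)
print(response.status_code)  # 200 after the redirects are followed
print(len(response.history))  # 3 intermediate redirect responses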

Summary

HTTPX is a brilliant new HTTP client library that is quickly becoming the de facto standard in Python web scraping communities. It offers features like HTTP/2 and asyncio support which decrease the risk of blocking and allow for concurrent web scraping.

Together with tenacity, httpx makes requesting web resources a breeze with powerful retry logic like proxy and user agent header rotation.