Web Scraping with Python HTTPX
Learn how to use Python's httpx HTTP client for web scraping through a step-by-step guide covering installation, client sessions, asyncio, error handling and retries.
HTTPX is a new, powerful HTTP client library for Python. It's quickly becoming the most popular option for HTTP connections in web scraping as it offers an asynchronous client and HTTP/2 support.
In this tutorial, we'll take a look at what makes Python's httpx so great for web scraping and how to use it effectively.
HTTPX is a pure Python package, so it can be easily installed using the pip console command:
$ pip install httpx
Alternatively, it can be installed using the poetry project package manager:
$ poetry init -d httpx
# or, for an existing project:
$ poetry add httpx
HTTPX can be used for individual requests directly and supports the most popular HTTP methods like GET and POST, and it can unpack JSON responses into Python dictionaries directly:
import httpx
# GET request
response = httpx.get("https://httpbin.dev/get")
print(response)
data = response.json()
print(data['url'])
# POST requests
payload = {"query": "foo"}
# application/json content:
response = httpx.post("https://httpbin.dev/post", json=payload)
# or formdata:
response = httpx.post("https://httpbin.dev/post", data=payload)
print(response)
data = response.json()
print(data['url'])
Here we loaded JSON results directly using the .json() method of the response. Httpx comes with many convenient shortcuts like this, making it a very accessible HTTP client for web scraping.
For web scraping, it's best to use an httpx.Client, which can apply custom settings like headers, cookies and proxies to the entire httpx session:
import httpx
with httpx.Client(
    # enable HTTP2 support
    http2=True,
    # set headers for all requests
    headers={"x-secret": "foo"},
    # set cookies
    cookies={"language": "en"},
    # set proxies
    proxies={
        # set proxy for all http:// connections:
        "http://": "http://222.1.1.1:8000",
        # set proxy for all https:// connections:
        "https://": "http://222.1.1.1:8000",
        # socks5, socks4 and socks4a proxies can be used as well:
        # "https://": "socks5://222.1.1.1:8000",
    },
) as session:
    # every request made through the session shares the configuration above
    response = session.get("https://httpbin.dev/get")
The httpx client applies these configurations to all requests made through it and even keeps track of cookies set by the server.
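For example, here's a minimal sketch of that cookie persistence, assuming httpbin.dev mirrors the classic httpbin /cookies endpoints:
import httpx

with httpx.Client() as session:
    # the server sets a cookie which the client stores automatically
    session.get("https://httpbin.dev/cookies/set?session-token=123")
    # the stored cookie is then sent along with every subsequent request
    response = session.get("https://httpbin.dev/cookies")
    print(response.json())
    print(session.cookies)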
To use httpx asynchronously with Python's asyncio, the httpx.AsyncClient() object can be used:
import asyncio
import httpx
async def main():
    async with httpx.AsyncClient(
        # to limit concurrent connections, limits can be applied:
        limits=httpx.Limits(max_connections=10),
        # tip: increase timeouts for concurrent connections:
        timeout=httpx.Timeout(60.0),  # seconds
        # note: AsyncClient takes the same arguments as Client (headers, cookies etc.)
    ) as client:
        # to make concurrent requests asyncio.gather can be used:
        urls = [
            "https://httpbin.dev/get",
            "https://httpbin.dev/get",
            "https://httpbin.dev/get",
        ]
        responses = await asyncio.gather(*[client.get(url) for url in urls])
        # or asyncio.as_completed:
        for result in asyncio.as_completed([client.get(url) for url in urls]):
            response = await result
            print(response)

asyncio.run(main())
Note that when using async with, all requests should finish before the async with block exits, otherwise an exception will be raised:
RuntimeError: Cannot send a request, as the client has been closed.
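For illustration, this minimal sketch reproduces that error by sending a request after the async with block has already exited:
import asyncio
import httpx

async def main():
    async with httpx.AsyncClient() as client:
        pass  # the client is closed as soon as this block exits
    # requesting through the closed client raises the RuntimeError above
    await client.get("https://httpbin.dev/get")

asyncio.run(main())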
As an alternative to the async with statement, the httpx AsyncClient can be opened and closed manually:
import asyncio
import httpx
async def main():
    client = httpx.AsyncClient()
    # do some scraping
    ...
    # close the client explicitly once finished
    await client.aclose()

asyncio.run(main())
While httpx is a great library, it's easy to run into some common problems. Here are a few popular issues that can be encountered when web scraping with httpx and how to address them:
The httpx.TimeoutException error occurs when a request takes longer than the specified (or default) timeout duration. Try raising the timeout parameter:
httpx.get("https://httpbin.org/delay/10", timeout=httpx.Timeout(60.0))
The httpx.ConnectError exception is raised when a connection cannot be established, for example because of network connectivity problems, an unreachable server or a misconfigured proxy.
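A minimal sketch of catching it (the localhost address here is just an illustrative endpoint that refuses connections):
import httpx

try:
    # nothing is listening on this address, so the connection attempt fails
    response = httpx.get("http://localhost:9999", timeout=5)
except httpx.ConnectError as e:
    print(f"connection failed: {e}")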
The httpx.TooManyRedirects error is raised when a request exceeds the maximum number of allowed redirects.
This can be caused by an issue with the scraped web server or with httpx's redirect resolution logic. It can be fixed by resolving redirects manually:
response = httpx.get(
    "https://httpbin.dev/redirect/3",
    follow_redirects=False,  # disable automatic redirect handling
)
# then we can check whether we want to handle redirecting ourselves:
redirect_location = response.headers["Location"]
The httpx.HTTPStatusError error is raised by the response.raise_for_status() method when the server responds with a status code outside of the 200-299 range, like 404:
response = httpx.get("https://httpbin.dev/redirect/3")
# raises httpx.HTTPStatusError for non-success responses:
response.raise_for_status()
When web scraping, status codes outside of the 200-299 range can mean the scraper is being blocked.
The httpx.UnsupportedProtocol error is raised when the protocol part of the provided URL is missing or isn't one that httpx supports (http:// or https://). This is most commonly encountered when the URL is missing the https:// part.
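For illustration, a quick sketch of triggering and handling it:
import httpx

try:
    # the https:// part of the URL is missing
    httpx.get("httpbin.dev/get")
except httpx.UnsupportedProtocol as e:
    print(f"invalid URL: {e}")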
The HTTPX client doesn't come with any retry features, but it can easily integrate with popular retrying packages in Python like tenacity (pip install tenacity).
Using tenacity, we can add retry logic for status codes outside of the 200-299 range, for httpx exceptions and even for failure keywords found in the response body:
import httpx
from tenacity import retry, stop_after_attempt, wait_fixed, retry_if_exception, retry_if_result

# Define the conditions for retrying based on exception types
def is_retryable_exception(exception):
    return isinstance(exception, (httpx.TimeoutException, httpx.ConnectError))

# Define the conditions for retrying based on HTTP status codes
def is_retryable_status_code(response):
    return response.status_code in [500, 502, 503, 504]

# Define the conditions for retrying based on response content
def is_retryable_content(response):
    return "you are blocked" in response.text.lower()

# Decorate the function with retry conditions and parameters
@retry(
    retry=(
        retry_if_exception(is_retryable_exception)
        | retry_if_result(is_retryable_status_code)
        | retry_if_result(is_retryable_content)
    ),
    stop=stop_after_attempt(3),
    wait=wait_fixed(5),
)
def fetch_url(url):
    try:
        # return the response without raising here so the result-based
        # retry conditions above can inspect the status code and body
        return httpx.get(url)
    except httpx.RequestError as e:
        print(f"Request error: {e}")
        raise e

url = "https://httpbin.org/get"
try:
    response = fetch_url(url)
    print(f"Successfully fetched URL: {url}")
    print(response.text)
except Exception as e:
    print(f"Failed to fetch URL: {url}")
    print(f"Error: {e}")
Above, we are using tenacity's retry decorator to define our retrying rules for common httpx errors and failed responses.
When it comes to handling blocking while web scraping with httpx, proxy rotation can be used together with tenacity's retry functionality.
In this example we'll take a look at a common web scraping pattern of rotating proxies and headers on scrape blocks. We'll add a retry that triggers on blocked status codes (403, 404) and switches to a new random proxy and User-Agent header on each attempt.
Using httpx and tenacity:
import httpx
import random
from tenacity import retry, stop_after_attempt, wait_random, retry_if_result
import asyncio
PROXY_POOL = [
    "http://2.56.119.93:5074",
    "http://185.199.229.156:7492",
    "http://185.199.228.220:7300",
    "http://185.199.231.45:8382",
    "http://188.74.210.207:6286",
    "http://188.74.183.10:8279",
    "http://188.74.210.21:6100",
    "http://45.155.68.129:8133",
    "http://154.95.36.199:6893",
    "http://45.94.47.66:8110",
]
USER_AGENT_POOL = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:54.0) Gecko/20100101 Firefox/54.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/604.3.5 (KHTML, like Gecko) Version/11.0.1 Safari/604.3.5",
]
# Define the conditions for retrying based on HTTP status codes
def is_retryable_status_code(response):
    return response.status_code in [403, 404]

# callback to modify the scrape call before each retry
def update_scrape_call(retry_state):
    # change to a random proxy and user agent string on each retry
    new_proxy = random.choice(PROXY_POOL)
    new_user_agent = random.choice(USER_AGENT_POOL)
    print(
        "retry {attempt_number}: {url} @ {proxy} with a new proxy {new_proxy}".format(
            attempt_number=retry_state.attempt_number,
            new_proxy=new_proxy,
            **retry_state.kwargs
        )
    )
    retry_state.kwargs["proxy"] = new_proxy
    retry_state.kwargs["headers"]["User-Agent"] = new_user_agent

@retry(
    # retry on bad status code
    retry=retry_if_result(is_retryable_status_code),
    # max 5 retries
    stop=stop_after_attempt(5),
    # wait randomly 1-5 seconds between retries
    wait=wait_random(min=1, max=5),
    # update scrape call on each retry
    before_sleep=update_scrape_call,
)
async def scrape(url, proxy, **client_kwargs):
    async with httpx.AsyncClient(
        proxies={"http://": proxy, "https://": proxy},
        **client_kwargs,
    ) as client:
        response = await client.get(url)
        return response
Above is a short demo of how to apply retry logic that can rotate the proxy and user agent string on each retry.
First, we define our proxy and user agent pools, then use the @retry decorator to wrap our scrape function with tenacity's retry logic.
To modify each retry we are using the before_sleep parameter, which can update our scrape function call with new parameters on each retry.
Here's an example test run:
async def example_run():
    urls = [
        "https://httpbin.dev/ip",
        "https://httpbin.dev/ip",
        "https://httpbin.dev/ip",
        "https://httpbin.dev/status/403",
    ]
    to_scrape = [scrape(url=url, proxy=random.choice(PROXY_POOL), headers={"User-Agent": "foo"}) for url in urls]
    for result in asyncio.as_completed(to_scrape):
        response = await result
        print(response.json())

asyncio.run(example_run())
The Scrapfly API offers a Python SDK which is like httpx on steroids. ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale.
All functions of httpx are supported by the Scrapfly SDK, making migration effortless:
from scrapfly import ScrapeConfig, ScrapflyClient
client = ScrapflyClient(key="YOUR SCRAPFLY KEY")
result = client.scrape(ScrapeConfig(
url="https://httpbin.dev/get",
# enable anti-scraping protection (like cloudflare or perimeterx) bypass
asp=True,
# select proxy country:
country="US",
# enable headless browser
render_js=True,
))
print(result.content)
# tip: use concurrent scraping for blazing speeds
# (note: the async for loop below must run inside an async function)
to_scrape = [
ScrapeConfig(url="https://httpbin.dev/get")
for i in range(10)
]
async for result in client.concurrent_scrape(to_scrape):
print(result.content)
The Scrapfly SDK can be installed using the pip console command and is free to try:
$ pip install scrapfly-sdk
To wrap up this Python httpx introduction, let's take a look at some frequently asked questions related to web scraping with httpx.
Requests is the most popular HTTP client for Python, known for being accessible and easy to work with. It's also an inspiration for HTTPX, which is a natural successor to requests with modern Python features like asyncio support and HTTP/2.
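As a rough side-by-side sketch, the basic request API of the two libraries is nearly identical:
import requests
import httpx

# requests:
response = requests.get("https://httpbin.dev/get")
print(response.json())

# httpx: the same call, plus optional async and HTTP/2 support
response = httpx.get("https://httpbin.dev/get")
print(response.json())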
Aiohttp is one of the first HTTP clients that supported asyncio and one of the inspirations for HTTPX. The two packages are very similar, though aiohttp is more mature while httpx is newer but more feature rich. So, when it comes to aiohttp vs httpx, the latter is preferred for web scraping because of its HTTP/2 support.
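For comparison, here's a rough sketch of the same asynchronous GET request in both libraries:
import asyncio
import aiohttp
import httpx

async def main():
    # aiohttp: the response body must be read inside the request context
    async with aiohttp.ClientSession() as session:
        async with session.get("https://httpbin.dev/get") as resp:
            print(await resp.json())

    # httpx: the response is read eagerly, so it can be used after the call returns
    async with httpx.AsyncClient() as client:
        response = await client.get("https://httpbin.dev/get")
        print(response.json())

asyncio.run(main())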
Httpx supports HTTP/2, which is recommended for web scraping as it can drastically reduce the scraper block rate. HTTP/2 is not enabled by default: the http2=True parameter must be passed to the httpx.Client(http2=True) and httpx.AsyncClient(http2=True) objects, and the optional HTTP/2 dependencies need to be installed via pip install "httpx[http2]".
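To confirm which protocol version was actually negotiated, the response's http_version attribute can be checked, as in this small sketch:
import httpx

# note: HTTP/2 support requires the optional extra: pip install "httpx[http2]"
with httpx.Client(http2=True) as client:
    response = client.get("https://httpbin.dev/get")
    # "HTTP/2" if the server supports it, otherwise "HTTP/1.1"
    print(response.http_version)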
Httpx doesn't follow redirects by default, unlike other Python libraries like requests. To enable automatic redirect following, use the follow_redirects=True parameter in httpx request methods like httpx.get(url, follow_redirects=True) or in client objects like httpx.Client(follow_redirects=True).
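A minimal sketch of both forms, using the redirect history to inspect the hops that were followed:
import httpx

# follow redirects for a single request:
response = httpx.get("https://httpbin.dev/redirect/3", follow_redirects=True)
print(response.status_code)  # 200 once the final page is reached
# intermediate redirect responses are stored in response.history:
print([r.status_code for r in response.history])

# or enable it for a whole session:
with httpx.Client(follow_redirects=True) as client:
    response = client.get("https://httpbin.dev/redirect/3")
    print(len(response.history))  # number of redirect hops followed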
HTTPX is a brilliant new HTTP client library that is quickly becoming the de facto standard in Python web scraping communities. It offers features like http2 and asyncio support which decreases the risk of blocking and allows concurrent web scraping.
Together with tenacity, httpx makes requesting web resources a breeze with powerful retry logic like proxy and user agent header rotation.