cURL is one of the most popular HTTP clients and is often used in web scraping. However, curl suffers from one major problem: it can be easily identified and blocked through TLS and HTTP2 analysis and fingerprinting.
A TLS fingerprint is a combination of low-level details about the request. It reveals the use of an HTTP client, allowing websites to detect and block requests. But what about mimicking normal browsers' TLS fingerprints to scrape without getting blocked?
In this article, we'll take a deep dive into the Curl Impersonate tool, which prevents TLS and HTTP2 fingerprinting by impersonating normal web browser configurations. We'll start by explaining what Curl Impersonate is, how it works, and how to install and use it. Finally, we'll explore using it with Python to avoid web scraping blocking. Let's dive in!
This tutorial covers popular web scraping techniques for educational purposes. Interacting with public servers requires diligence and respect and here's a good summary of what not to do:
Scrapfly does not offer legal advice, but these are good general rules to follow in web scraping, and for more you should consult a lawyer.
Curl Impersonate is a modified version of Curl, a very popular command-line tool for sending HTTP requests. Curl is found in almost all operating systems, and Curl Impersonate is a less detectable version of it that can be used for web scraping.
Curl Impersonate mimics the TLS and HTTP2 configuration of the major web browsers: Chrome, Firefox, Edge and Safari. This allows it to perform the TLS handshake identically to those browsers. Let's have a closer look at what the TLS handshake is and how Curl Impersonate performs it.
When a request is sent to a website with a TLS certificate over HTTPS, both the client and the web server have to go through a process called the TLS handshake. During this handshake, a few details about the HTTP client's capabilities are shared with the web server, creating a TLS fingerprint.
Since HTTP clients differ in their details and capabilities, this TLS fingerprint can differ from those of normal web browsers. This reveals the use of an HTTP client and allows websites to detect it.
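To make the idea of a fingerprint concrete, here is a minimal sketch based on the JA3 method, one widely used way of condensing handshake details into a single hash. JA3 is not named in this article and is used here as an assumption; the numeric ClientHello values below are hypothetical and chosen only to show that different client configurations produce different fingerprints.

```python
import hashlib


def ja3_fingerprint(tls_version, ciphers, extensions, curves, point_formats):
    """Build a JA3-style string from ClientHello fields and return its MD5 hash."""
    fields = [
        str(tls_version),
        "-".join(str(value) for value in ciphers),
        "-".join(str(value) for value in extensions),
        "-".join(str(value) for value in curves),
        "-".join(str(value) for value in point_formats),
    ]
    # JA3 joins the five fields with commas and hashes the result
    ja3_string = ",".join(fields)
    return hashlib.md5(ja3_string.encode("ascii")).hexdigest()


# Hypothetical values: two clients offering different ciphers and extensions
browser_like = ja3_fingerprint(771, [4865, 4866, 4867, 49195], [0, 23, 65281, 10, 11], [29, 23, 24], [0])
script_like = ja3_fingerprint(771, [4866, 4867, 4865, 49196], [0, 11, 10, 13], [29, 23, 30], [0])

# The hashes differ, which is what lets a server tell the two clients apart
print(browser_like != script_like)
```

Note that the order of the values matters: the same ciphers offered in a different order yield a different hash, which is why an impersonating client has to reproduce the browser's configuration exactly.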
Curl Impersonate modifies the regular Curl requests to make the TLS fingerprint identical to that of normal browsers by running curl with non-default options such as --ciphers, --curves and some -H headers.
Curl Impersonate can be installed using either Docker or by compiling it from source. In this article, we'll install Curl Impersonate using Docker. For other installation methods, refer to the official installation docs.
The first step is installing Docker. If you don't have Docker installed, refer to the official installation page for detailed instructions. Next, we need to pull the Curl Impersonate Docker image for either Chrome or Firefox. The Chrome version is used to impersonate Chrome, Edge and Safari browsers, while the Firefox version impersonates the Firefox browser:
# Firefox version
docker pull lwthiker/curl-impersonate:0.5-ff
# Chrome version
docker pull lwthiker/curl-impersonate:0.5-chrome
Now that we have pulled the Curl Impersonate Docker image, let's use it and evaluate its TLS modification capabilities!
The default usage of Curl Impersonate is through the command line interface. It can be called using the Docker image we pulled earlier:
# Chrome version
docker run --rm lwthiker/curl-impersonate:0.5-chrome curl_chrome110 https://example.com/
# Firefox version
docker run --rm lwthiker/curl-impersonate:0.5-ff curl_ff109 https://example.com/
Let's break down the above command:
docker run --rm: Runs the Docker container and removes it automatically once it has stopped running.
lwthiker/curl-impersonate: The Docker image and its version we pulled earlier.
curl_chrome110 | curl_ff109: The browser name and version, either Chrome 110 or Firefox 109. The full list of supported browsers can be found on the official docs.
The response of the above command is the HTML page:
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 648 100 648 0 0 1075 0 --:--:-- --:--:-- --:--:-- 1074
<!doctype html>
<html>
<head>
<title>Example Domain</title>
<meta charset="utf-8" />
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<style type="text/css">....</style>
</head>
<body>
<div>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative examples in documents. You may use this
domain in literature without prior coordination or asking for permission.</p>
<p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>
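The same Docker call can also be scripted. The sketch below wraps the command shown above with Python's subprocess module; the helper names build_command and fetch_html are hypothetical, and the code assumes Docker is installed and the images were pulled as described earlier.

```python
import subprocess

# Wrapper scripts and the Docker images that ship them (taken from the commands above)
IMAGES = {
    "curl_chrome110": "lwthiker/curl-impersonate:0.5-chrome",
    "curl_ff109": "lwthiker/curl-impersonate:0.5-ff",
}


def build_command(url, browser="curl_chrome110"):
    """Return the docker command line for the given wrapper script and URL."""
    image = IMAGES[browser]
    return ["docker", "run", "--rm", image, browser, url]


def fetch_html(url, browser="curl_chrome110", timeout=60):
    """Run the container and return the response body printed to stdout."""
    result = subprocess.run(
        build_command(url, browser),
        capture_output=True,  # the progress meter goes to stderr, the body to stdout
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError if docker or curl fails
    )
    return result.stdout
```

Calling fetch_html("https://example.com/") should return the same HTML as the command-line run, without the progress meter lines.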
Curl Impersonate also modifies the request headers to mimic normal browser ones. Let's request httpbin.dev/headers and observe the headers used by the request:
{
"headers": {
"Accept": [
"text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7"
],
"Accept-Encoding": [
"gzip, deflate, br"
],
"Accept-Language": [
"en-US,en;q=0.9"
],
"Host": [
"httpbin.dev"
],
"Sec-Ch-Ua": [
"\"Chromium\";v=\"110\", \"Not A(Brand\";v=\"24\", \"Google Chrome\";v=\"110\""
],
"Sec-Ch-Ua-Mobile": [
"?0"
],
"Sec-Ch-Ua-Platform": [
"\"Windows\""
],
"Sec-Fetch-Dest": [
"document"
],
"Sec-Fetch-Mode": [
"navigate"
],
"Sec-Fetch-Site": [
"none"
],
"Sec-Fetch-User": [
"?1"
],
"Upgrade-Insecure-Requests": [
"1"
],
"User-Agent": [
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36"
]
}
}
We can see that Curl Impersonate added a User-Agent header to the request. It also added additional standard headers used by the browser and mimicked its order. This makes our requests look natural and allows us to avoid scraper blocking.
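A quick way to check this on your own setup is to compare the echoed headers against the ones a browser normally sends. The following is a minimal sketch, assuming the response has the JSON shape shown above; the helper name, the list of expected headers and the plain-client sample payload are illustrative choices, not part of Curl Impersonate.

```python
import json

# A few headers that appear in the Chrome-impersonated response above
EXPECTED_HEADERS = ("User-Agent", "Accept", "Accept-Language", "Sec-Ch-Ua", "Sec-Fetch-Mode")


def missing_browser_headers(response_text, expected=EXPECTED_HEADERS):
    """Return the expected header names that are absent from an httpbin-style response."""
    headers = json.loads(response_text)["headers"]
    present = {name.lower() for name in headers}
    return [name for name in expected if name.lower() not in present]


# Hypothetical response for a plain HTTP client that sends only minimal headers
sample = json.dumps({"headers": {"User-Agent": ["curl/8.0"], "Accept": ["*/*"], "Host": ["httpbin.dev"]}})
print(missing_browser_headers(sample))
# → ['Accept-Language', 'Sec-Ch-Ua', 'Sec-Fetch-Mode']
```

Running the same check on the impersonated response above returns an empty list, since all five headers are present.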
So far, we have used Curl Impersonate through the command line. Let's explore using it with Python to bypass web scraping blocking.
curl_cffi is a Python HTTP client library built on top of Curl Impersonate. Since Curl Impersonate modifies the request configuration to mimic normal browser ones, we can use it to scrape without getting blocked!
First, let's install curl_cffi using the pip command:
pip install curl_cffi
To evaluate the curl_cffi capabilities, let's first try to scrape G2 using a regular Python HTTP client, httpx:
import httpx
# send a simple GET request
response = httpx.get("https://www.g2.com/")
print(response.status_code)
"403"
The website detected our request as automated and we got blocked. Let's bypass the website blocking using curl_cffi. All we have to do is specify the target website URL and the browser name and version to impersonate:
from curl_cffi import requests
# add the target URL and specify the browser to impersonate
response = requests.get("https://www.g2.com/", impersonate="chrome110")
print(response.status_code)
"200"
The website recognized our request as coming from a regular browser and we were allowed!
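A single browser profile may not work for every website, so one possible extension is to try several impersonation targets in turn. This is a sketch rather than a recommended recipe: the helper name fetch_with_fallback is hypothetical, the target names other than chrome110 are assumed to be available in your installed curl_cffi version, and treating any status other than 403 as success is a simplification.

```python
def fetch_with_fallback(url, targets=("chrome110", "chrome107", "safari15_5"), get=None):
    """Try each impersonation target until the response is not a 403.

    Returns (target, response) on success, or (None, last_response) if every
    target was blocked.
    """
    if get is None:
        # Imported lazily so the helper can be exercised with a stub in tests
        from curl_cffi import requests
        get = requests.get

    response = None
    for target in targets:
        response = get(url, impersonate=target)
        if response.status_code != 403:
            return target, response
    return None, response
```

For example, target, response = fetch_with_fallback("https://www.g2.com/") returns the first profile that was not blocked, which can then be reused for the rest of the scraping session.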
Another advantage of using curl_cffi is reusing sessions. We can send a request to the target website to get the header and cookie values of a successful request and then reuse them in future requests:
from curl_cffi import requests
# initialize a session
session = requests.Session()
# set the cookies values {"foo": "bar"}
session.get("https://httpbin.org/cookies/set/foo/bar")
# reuse the session cookies
response = session.get("https://httpbin.org/cookies")
print(response.json())
"{'cookies': {'foo': 'bar'}}"
curl_cffi also supports adding proxies for distributing the web scraping load across different IP addresses. This makes it harder for antibots to track the requests' IP origin and helps avoid IP address blocking and throttling:
from curl_cffi import requests
# add proxies
proxies = {"https": "http://127.0.0.1:8000"}
response = requests.get("https://httpbin.org/ip", impersonate="chrome110", proxies=proxies)
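To actually spread requests across several addresses, the proxy can be picked per request. The sketch below assumes a small pool of proxy URLs; the addresses are placeholders in the style of the example above, and pick_proxies is a hypothetical helper, not a curl_cffi function.

```python
import random

# Placeholder proxy addresses; replace them with your own proxy endpoints
PROXY_POOL = (
    "http://127.0.0.1:8000",
    "http://127.0.0.1:8001",
    "http://127.0.0.1:8002",
)


def pick_proxies(pool=PROXY_POOL, rng=random):
    """Choose one proxy at random and return it in the proxies mapping format."""
    proxy = rng.choice(pool)
    return {"http": proxy, "https": proxy}
```

The result can be passed straight to a request, for example requests.get(url, impersonate="chrome110", proxies=pick_proxies()), so that each call may leave through a different address.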
For further details on using proxies for web scraping, refer to our previous guide.
libcurl-impersonate is a compiled version of libcurl, the library behind Curl, with the Curl Impersonate changes. It provides more advanced options for modifying the TLS details and header options.
libcurl-impersonate can be installed using the pre-compiled package. It can be used to integrate the curl-impersonate behavior into libraries in different programming languages, such as the curl_cffi Python package that we used earlier.
Curl Impersonate can be efficient at mimicking normal browser requests. However, its bypass rate can be limited. This is due to the challenges that websites use to distinguish human traffic from bots.
For example, let's attempt to scrape a page on Zoominfo with Curl Impersonate:
docker run --rm lwthiker/curl-impersonate:0.5-chrome curl_chrome110 https://www.zoominfo.com/c/tesla-inc/104333869
<!DOCTYPE html>
<html lang="en-US"><head>
<title>Just a moment...</title>
....
</head></html>
The website required us to solve a Cloudflare challenge before proceeding to the web page. Since a plain HTTP request cannot execute JavaScript, we couldn't pass the challenge.
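When scraping at scale it helps to detect such responses automatically instead of parsing them as if they were real pages. Here is a minimal sketch that flags a response by its page title; the "Just a moment..." marker comes from the output above, and relying on the title alone is an assumption that may not hold for every challenge page.

```python
import re

# Lower-cased title prefixes that indicate a challenge page (taken from the output above)
CHALLENGE_TITLES = ("just a moment",)


def looks_like_challenge(html):
    """Return True if the page title matches a known challenge page title."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    if not match:
        return False
    title = match.group(1).strip().lower()
    return any(title.startswith(marker) for marker in CHALLENGE_TITLES)


blocked_page = '<!DOCTYPE html><html lang="en-US"><head><title>Just a moment...</title></head></html>'
normal_page = "<!doctype html><html><head><title>Example Domain</title></head></html>"
print(looks_like_challenge(blocked_page), looks_like_challenge(normal_page))
# → True False
```

A scraper can use this check to retry with a different setup or to skip the page rather than storing the challenge HTML as data.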
Let's have a look at a better Curl Impersonate alternative for scraping without getting blocked!
ScrapFly offers a web scraping API that handles all of the anti-bot bypass for you and much more!
ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale. Each product is equipped with an automatic bypass for any anti-bot system and we achieve this by:
Here is how we can use ScrapFly to bypass the Zoominfo scraping blocking we encountered earlier. All we have to do is replace our HTTP client with the ScrapFly client and enable the anti-scraping protection bypass through the asp argument:
# standard web scraping code
import requests
response = requests.get("https://www.zoominfo.com/c/tesla-inc/104333869")
# in ScrapFly, it becomes this 👇
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
scrapfly = ScrapflyClient(key="Your ScrapFly API key")
api_response: ScrapeApiResponse = scrapfly.scrape(
    ScrapeConfig(
        # target website URL
        url="https://www.zoominfo.com/c/tesla-inc/104333869",
        # Bypass anti-scraping protection
        asp=True,
        # select a proxy pool (residential or datacenter)
        proxy_pool="public_residential_pool",
        # Set the proxy location to a specific country
        country="US",
        # enable JavaScript rendering if needed, similar to headless browsers
        render_js=True,
    )
)
# Print the website's status code
print(api_response.upstream_status_code)
"200"
# get the HTML from the response
html = api_response.scrape_result['content']
# use the built-in Parsel selector
selector = api_response.selector
To wrap up this guide on using Curl Impersonate for web scraping, let's have a look at some frequently asked questions.
A TLS fingerprint is a combination of low-level TLS details created during the TLS handshake while requesting a web page. It includes various details about the requesting client, allowing websites to detect the use of HTTP clients rather than web browsers. Refer to our previous guide on TLS fingerprinting for more details.
Yes, curl_cffi is an interface for using Curl Impersonate in Python. It provides the same Curl Impersonate capabilities as an HTTP client.
Yes, Curl Impersonate modifies the request's headers and TLS configuration to mimic normal web browsers' requests, preventing detection and blocking.
In this article, we explained Curl Impersonate, a modified Curl that mimics normal browsers' TLS configuration and headers to reproduce regular browser fingerprints.
We have explained how to install and use Curl Impersonate using Docker. Then, we explained how to use Curl Impersonate in Python using the curl_cffi package. We have seen that Curl Impersonate can help bypass web scraping blocking. However, it can't help with some antibot challenges.