Use Curl Impersonate to scrape as Chrome or Firefox


cURL is one of the most popular HTTP clients and is often used in web scraping. However, it suffers from one major problem: it can be easily identified and blocked through TLS and HTTP2 fingerprinting.

A TLS fingerprint is a combination of low-level details about the connection. It reveals the use of an HTTP client, allowing websites to detect and block its requests. But what about mimicking normal browsers' TLS fingerprints to scrape without getting blocked?

In this article, we'll take a deep dive into the Curl Impersonate tool, which counters TLS and HTTP2 fingerprinting by impersonating normal web browser configurations. We'll start by explaining what Curl Impersonate is, how it works, and how to install and use it. Finally, we'll explore using it with Python to avoid web scraping blocking. Let's dive in!

What is Curl Impersonate?

Curl Impersonate is a modified version of curl, a very popular command-line tool for sending HTTP requests that is found on almost all operating systems. The modifications make it a less detectable version of curl that can be used for web scraping.

Curl Impersonate mimics the TLS and HTTP2 configuration of the major web browsers: Chrome, Firefox, Edge and Safari. This allows it to perform the TLS handshake identically to those browsers. Let's have a closer look at what the TLS handshake is and how Curl Impersonate performs it.

How Does Curl Impersonate Work?

When a request is sent to a website over HTTPS, the client and the web server first go through a process called the TLS handshake. During this handshake, the client shares a few details about its capabilities with the web server, and these details form a TLS fingerprint.

Since HTTP clients differ in their details and capabilities, their TLS fingerprints can differ from those of normal web browsers. This leaks the usage of an HTTP client and allows websites to detect it.
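
To make the idea of a fingerprint more concrete, here is a minimal sketch of a JA3-style hash, one common fingerprinting scheme that joins a few handshake fields into a string and hashes it with MD5. The function name and all the numeric values are hypothetical and chosen for illustration only; they are not taken from a real handshake.

```python
import hashlib


def ja3_style_fingerprint(tls_version, ciphers, extensions, curves, point_formats):
    """Join handshake fields JA3-style (comma-separated groups, dash-separated values) and hash them."""
    fields = [
        str(tls_version),
        "-".join(str(c) for c in ciphers),
        "-".join(str(e) for e in extensions),
        "-".join(str(c) for c in curves),
        "-".join(str(p) for p in point_formats),
    ]
    return hashlib.md5(",".join(fields).encode("ascii")).hexdigest()


# Hypothetical handshake values for an HTTP client and a browser (illustration only)
http_client = ja3_style_fingerprint(771, [4866, 4867, 4865], [0, 11, 10], [29, 23], [0])
browser = ja3_style_fingerprint(771, [4865, 4866, 4867], [0, 23, 65281, 10, 11], [29, 23, 24], [0])

print(http_client)
print(browser)
print(http_client != browser)  # True: different capabilities produce different hashes
```

Note that even the order of the values matters in this sketch, which is why an impersonation tool has to reproduce the browser's configuration exactly rather than approximately.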

How TLS Fingerprint is Used to Block Web Scrapers?

Learn how TLS can leak information about connecting clients and how websites use it to block web scrapers.


Curl Impersonate modifies regular curl requests to make their TLS fingerprint identical to that of normal browsers by:

  • Changing the TLS library from OpenSSL to NSS (used by Firefox) and BoringSSL (used by Chrome).
  • Modifying the configuration of various TLS extensions and SSL options.
  • Changing the default flags, such as --ciphers, --curves and some -H headers.
  • Supporting new TLS extensions and changing the configuration of HTTP2 connections.

How to Install Curl Impersonate?

Curl Impersonate can be installed using either Docker or source code compiling. In this article, we'll install Curl Impersonate using Docker. For other installation methods, refer to the official installation docs.

The first step is installing Docker. If you don't have Docker installed, refer to the official installation page for detailed instructions. Next, we need to pull the Curl Impersonate Docker image for either Chrome or Firefox. The Chrome version is used to impersonate Chrome, Edge and Safari browsers, while the Firefox version impersonates the Firefox browser:

# Firefox version
docker pull lwthiker/curl-impersonate:0.5-ff

# Chrome version
docker pull lwthiker/curl-impersonate:0.5-chrome

Now that we have pulled the Curl Impersonate Docker images, let's use them and evaluate their TLS modification capabilities!

How to Use Curl Impersonate?

The default usage of Curl Impersonate is through the command line interface. It can be called using the Docker image we pulled earlier:

# Chrome version
docker run --rm lwthiker/curl-impersonate:0.5-chrome curl_chrome110 https://example.com/ 

# Firefox version
docker run --rm lwthiker/curl-impersonate:0.5-ff curl_ff109 https://example.com/ 

Let's break down the above commands:

  • docker run --rm: Runs the Docker container and removes it automatically once it stops running.
  • lwthiker/curl-impersonate:0.5-chrome | lwthiker/curl-impersonate:0.5-ff: The Docker image and version tag we pulled earlier.
  • curl_chrome110 | curl_ff109: The wrapper script named after the browser and version to impersonate, either Chrome 110 or Firefox 109. The full list of supported browsers can be found in the official docs.
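
The same command can also be driven from a script. The following is a minimal sketch that assembles the Docker command from its three parts and runs it with Python's subprocess module. The helper names build_command and fetch are hypothetical, and the sketch assumes Docker is installed and the image has already been pulled.

```python
import subprocess

IMAGE = "lwthiker/curl-impersonate"


def build_command(tag, wrapper, url):
    """Assemble the docker run command: image tag, wrapper script, target URL."""
    return ["docker", "run", "--rm", f"{IMAGE}:{tag}", wrapper, url]


def fetch(tag, wrapper, url, timeout=60):
    """Run the container and return whatever it wrote to standard output."""
    result = subprocess.run(
        build_command(tag, wrapper, url),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


# Only builds the command; call fetch(...) to actually run it (requires Docker)
print(build_command("0.5-chrome", "curl_chrome110", "https://example.com/"))
```

Calling fetch("0.5-chrome", "curl_chrome110", "https://example.com/") would then return the page HTML as a string.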

The response of the above command is the HTML page:

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   648  100   648    0     0   1075      0 --:--:-- --:--:-- --:--:--  1074
<!doctype html>
<html>
<head>
    <title>Example Domain</title>

    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">....</style>
</head>

<body>
<div>
    <h1>Example Domain</h1>
    <p>This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.</p>
    <p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>

Curl Impersonate also modifies the request headers to mimic normal browser ones. Let's request httpbin.dev/headers and observe the headers used by the request:

Request headers
{
    "headers": {
        "Accept": [
            "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7"
        ],
        "Accept-Encoding": [
            "gzip, deflate, br"
        ],
        "Accept-Language": [
            "en-US,en;q=0.9"
        ],
        "Host": [
            "httpbin.dev"
        ],
        "Sec-Ch-Ua": [
            "\"Chromium\";v=\"110\", \"Not A(Brand\";v=\"24\", \"Google Chrome\";v=\"110\""
        ],
        "Sec-Ch-Ua-Mobile": [
            "?0"
        ],
        "Sec-Ch-Ua-Platform": [
            "\"Windows\""
        ],
        "Sec-Fetch-Dest": [
            "document"
        ],
        "Sec-Fetch-Mode": [
            "navigate"
        ],
        "Sec-Fetch-Site": [
            "none"
        ],
        "Sec-Fetch-User": [
            "?1"
        ],
        "Upgrade-Insecure-Requests": [
            "1"
        ],
        "User-Agent": [
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36"
        ]
    }
}

We can see that Curl Impersonate added a User-Agent header to the request. It also added additional standard headers used by the browser and mimicked its order. This makes our requests look natural and allows us to avoid scraper blocking.
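
To illustrate why these headers matter, here is a minimal sketch of the kind of check a website could run on incoming requests. The heuristic, the function name missing_browser_headers and the plain client headers are all hypothetical; real antibot systems are more elaborate and are not documented here.

```python
# Headers a real browser normally sends on a page navigation (a hypothetical shortlist)
BROWSER_HEADERS = ("User-Agent", "Accept", "Accept-Language", "Accept-Encoding", "Sec-Fetch-Mode")


def missing_browser_headers(headers):
    """Return the expected browser headers that are absent from a request (case-insensitive)."""
    present = {name.lower() for name in headers}
    return [name for name in BROWSER_HEADERS if name.lower() not in present]


# Hypothetical headers of a plain HTTP client
plain_client = {"Host": "httpbin.dev", "User-Agent": "curl/8.0.1", "Accept": "*/*"}

# A subset of the impersonated headers shown above
impersonated = {
    "Host": "httpbin.dev",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Sec-Fetch-Mode": "navigate",
}

print(missing_browser_headers(plain_client))  # ['Accept-Language', 'Accept-Encoding', 'Sec-Fetch-Mode']
print(missing_browser_headers(impersonated))  # []
```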

How Headers Are Used to Block Web Scrapers and How to Fix It

Learn about the common headers and their usage. You will also learn about the challenges and best practices of using headers to avoid web scraping blocking.


So far, we have used Curl Impersonate through the command line. Let's explore using it with Python to bypass web scraping blocking.

Curl Impersonate With Python

curl_cffi is a Python HTTP client library built on top of Curl Impersonate. Since Curl Impersonate modifies the request configuration to mimic that of normal browsers, we can use it to scrape without getting blocked!

First, let's install curl_cffi using the pip command:

pip install curl_cffi

To evaluate the curl_cffi capabilities, let's first try to scrape G2 using a regular Python HTTP client, httpx:

import httpx

# send a simple GET request
response = httpx.get("https://www.g2.com/")

print(response.status_code)
# 403

The website detected our request as automated and we got blocked. Let's bypass the website blocking using curl_cffi. All we have to do is specify the target website URL and the browser name and version to impersonate:

from curl_cffi import requests

# add the target URL and specify the browser to impersonate
response = requests.get("https://www.g2.com/", impersonate="chrome110")

print(response.status_code)
# 200

The website treated our request as coming from a regular browser and allowed it through!
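
Since a given website may accept one browser fingerprint and reject another, it can help to try several impersonation targets in turn. Below is a minimal sketch of that idea. The helper fetch_with_fallback is hypothetical and not part of curl_cffi; it takes the request function as an argument, and the target names are assumed to be supported by the installed curl_cffi version.

```python
def fetch_with_fallback(get, url, targets=("chrome110", "chrome107", "safari15_5")):
    """Try each impersonation target in order until one returns a 200 status code.

    `get` is any callable with the signature get(url, impersonate=...),
    such as curl_cffi's requests.get.
    Returns (target, response), or (None, last_response) if every target was rejected.
    """
    last_response = None
    for target in targets:
        last_response = get(url, impersonate=target)
        if last_response.status_code == 200:
            return target, last_response
    return None, last_response


# Usage with curl_cffi (requires the package and network access):
# from curl_cffi import requests
# target, response = fetch_with_fallback(requests.get, "https://www.g2.com/")
# print(target, response.status_code)
```

Passing the request function in as an argument keeps the helper independent of the HTTP client, so it can be exercised with a stand-in function before pointing it at a real website.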

Another advantage of using curl_cffi is session reuse. We can send a request to the target website to get the headers and cookie values of a successful request and then reuse them in future requests:

from curl_cffi import requests

# initialize a session
session = requests.Session()
# this endpoint sets the cookie foo=bar on the session
session.get("https://httpbin.org/cookies/set/foo/bar")

# reuse the session cookies
response = session.get("https://httpbin.org/cookies")
print(response.json())
# {'cookies': {'foo': 'bar'}}

curl_cffi also supports proxies for distributing the web scraping load across different IP addresses. This makes it harder for antibot systems to track the requests' IP origin and helps avoid IP address blocking and throttling:

from curl_cffi import requests

# add proxies
proxies = {"https": "http://127.0.0.1:8000"}

response = requests.get("https://httpbin.org/ip", impersonate="chrome110", proxies=proxies)
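
To actually spread requests across several IP addresses, the proxies value has to change between requests. Here is a minimal sketch of a round-robin rotation that yields a proxies dictionary in the same shape as the one above. The helper name proxy_rotation and the proxy addresses are hypothetical placeholders.

```python
from itertools import cycle


def proxy_rotation(proxy_urls):
    """Yield a proxies dict for each request, cycling through proxy_urls in order."""
    for proxy_url in cycle(proxy_urls):
        yield {"http": proxy_url, "https": proxy_url}


# Hypothetical local proxy addresses, replace with real proxy endpoints
pool = proxy_rotation(["http://127.0.0.1:8000", "http://127.0.0.1:8001"])

print(next(pool)["https"])  # http://127.0.0.1:8000
print(next(pool)["https"])  # http://127.0.0.1:8001
print(next(pool)["https"])  # http://127.0.0.1:8000
```

Each request can then pass proxies=next(pool) instead of a fixed dictionary.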

For further details on using proxies for web scraping, refer to our previous guide.

Introduction To Proxies in Web Scraping

In this article, we will explore the concept of a proxy in detail. We will discuss the different types of proxies, compare them with each other and explore the common challenges associated with proxy usage while web scraping.


Curl Impersonate With Libcurl

libcurl-impersonate is a compiled version of the regular libcurl with the Curl Impersonate changes. It provides more advanced options for modifying the TLS details and header options.

libcurl-impersonate can be installed using the pre-compiled package. It can be used to integrate the curl-impersonate behavior into libraries in different programming languages, such as the curl_cffi Python package that we used earlier.

Curl Impersonate Limitations

Curl Impersonate can be effective at mimicking normal browser requests. However, its bypass rate is limited by the challenges that websites use to distinguish human traffic from bots.

For example, let's attempt to scrape a page on Zoominfo with Curl Impersonate:

docker run --rm lwthiker/curl-impersonate:0.5-chrome curl_chrome110 https://www.zoominfo.com/c/tesla-inc/104333869
<!DOCTYPE html>
<html lang="en-US"><head>
    <title>Just a moment...</title>
    ....
</head></html>

The website required us to solve a Cloudflare challenge before proceeding to the web page. And since JavaScript isn't enabled within our HTTP request, we couldn't bypass the challenge:

Web scraping blocking on Zoominfo
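
A scraper can at least detect this situation instead of silently parsing the challenge page. The following is a minimal sketch that flags a response whose title matches the "Just a moment..." page shown above. The function name is_challenge_page and the single title marker are assumptions for illustration; other challenge pages may look different.

```python
import re

# Title prefixes treated as challenge pages (only the one observed above)
CHALLENGE_TITLES = ("just a moment",)


def is_challenge_page(html):
    """Return True if the page title looks like an antibot challenge page."""
    match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    if not match:
        return False
    title = match.group(1).strip().lower()
    return any(title.startswith(marker) for marker in CHALLENGE_TITLES)


blocked_html = '<!DOCTYPE html><html lang="en-US"><head><title>Just a moment...</title></head></html>'
normal_html = "<!doctype html><html><head><title>Example Domain</title></head></html>"

print(is_challenge_page(blocked_html))  # True
print(is_challenge_page(normal_html))   # False
```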

Let's have a look at a better Curl Impersonate alternative for scraping without getting blocked!

ScrapFly: Curl Impersonate Alternative

ScrapFly is a web scraping API with a built-in anti-scraping protection bypass for scraping any website without getting blocked. ScrapFly also allows for scraping at scale by providing:

ScrapFly service does the heavy lifting for you!

Here is how we can use ScrapFly to bypass the Zoominfo scraping blocking we encountered earlier. All we have to do is replace our HTTP client with the ScrapFly client and enable the anti-scraping protection bypass through the asp argument:

# standard web scraping code
import requests
response = requests.get("https://www.zoominfo.com/c/tesla-inc/104333869")

# in ScrapFly, it becomes this 👇

from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

scrapfly = ScrapflyClient(key="Your ScrapFly API key")

api_response: ScrapeApiResponse = scrapfly.scrape(
    ScrapeConfig(
        # target website URL
        url="https://www.zoominfo.com/c/tesla-inc/104333869",
        # Bypass anti-scraping protection
        asp=True,        
        # select a proxy pool (residential or datacenter)
        proxy_pool="public_residential_pool",
        # Set the proxy location to a specific country
        country="US",        
        # enable JavaScript rendering if needed, similar to headless browsers
        render_js=True,
    )
)
# Print the website's status code
print(api_response.upstream_status_code)
# 200

# get the HTML from the response
html = api_response.scrape_result['content']

# use the built-in Parsel selector
selector = api_response.selector

Sign up to get your API token!

FAQ

To wrap up this guide on using Curl Impersonate for web scraping, let's have a look at some frequently asked questions.

What is TLS fingerprinting?

A TLS fingerprint is a combination of low-level TLS details exchanged during the TLS handshake when requesting a web page. It includes various details about the requesting client, allowing websites to tell HTTP clients apart from web browsers. Refer to our previous guide on TLS fingerprinting for more details.

Can I use Curl Impersonate in Python?

Yes, curl_cffi is an interface for using Curl Impersonate in Python. It provides the same Curl Impersonate capabilities as an HTTP client.

Can Curl Impersonate bypass web scraping blocking?

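Yes, to a degree. Curl Impersonate modifies the request's headers and TLS configuration to mimic normal web browsers' requests, which helps avoid detection and blocking based on fingerprinting. However, it can't pass challenges that require JavaScript execution, such as the Cloudflare challenge we encountered on Zoominfo.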

Summary

In this article, we explained Curl Impersonate, a modified version of curl that mimics normal browsers' TLS configuration and headers to reproduce regular browser fingerprints.

We explained how to install and use Curl Impersonate using Docker. Then, we explained how to use it in Python through the curl_cffi package. We have seen that Curl Impersonate can help bypass web scraping blocking. However, it can't help with some antibot challenges.
