How to bypass Cloudflare when web scraping in 2024


Cloudflare is mostly known for its CDN service, though when it comes to web scraping, it's the Cloudflare Bot Management that is notorious.

Website owners can use Cloudflare to restrict who can access their content, and this is where the need to bypass Cloudflare when web scraping arises.

To bypass Cloudflare Bot Management, we should first take a quick look at how it works. Then, we can identify the challenges and design a strategy.

In this article, we'll first explain how Cloudflare uses various web technologies to calculate a trust score. Then, we'll explore existing solutions that increase this trust score when web scraping for Cloudflare bypass. We will also cover common Cloudflare errors, what they mean, and other signs that a request has failed to bypass the Cloudflare challenge. Let's dive in!

What Is Cloudflare Bot Management?

Cloudflare Bot Management is a web service that tries to detect and block web scrapers and other bots from accessing the website.
It's a complex multi-tier service that is usually used in legitimate bot and spam prevention but it's becoming an increasingly popular way to block web scrapers from accessing public data.

To start, let's take a look at some common Cloudflare errors that scrapers encounter and what they mean.

Most Cloudflare bot blocks result in HTTP status codes 403 (most commonly), 401, 429, or 502. However, the response body contains the actual error code and its description. These codes can help us understand what's going on and work out how to bypass Cloudflare 403 errors.

There are several different error messages that indicate that the request failed to bypass Cloudflare:

Cloudflare Error 1020: Access Denied is one of the most commonly encountered errors when scraping pages behind a Cloudflare challenge, and it doesn't indicate the exact cause. So, to bypass Cloudflare 1020, the scraper needs to be fortified on every front.

Cloudflare Error 1009 comes with a message of "... has banned the country or region of your IP address". This is caused by the website being geographically restricted to IP addresses from specific countries. So, to bypass Cloudflare 1009, proxies from allowed countries can be used. In other words, if the website is only available in the US, the scraper needs a US proxy to bypass this error.

Cloudflare Error 1015: You are being rate limited means that the scraping rate is too fast. While it's best to respect rate limiting when web scraping, this limit can be set really low. To bypass Cloudflare 1015, the web scraping traffic should be distributed through multiple agents (proxies, browsers etc.).

Cloudflare Error 1010: Access Denied is caused by a blocked browser fingerprint. This is often encountered when scraping using headless browsers without fingerprinting obfuscations. To bypass Cloudflare 1010, the headless browsers need to be fortified against JavaScript fingerprinting.

There are also the Cloudflare JavaScript challenges (aka browser check). These challenges don't mean that the request was blocked. Instead, they indicate that Cloudflare doesn't yet trust that the client is not a bot. So, to bypass Cloudflare protection challenges, we can either raise our general trust rating or solve the challenge (we'll cover this in the bypass section below).

Some of these Cloudflare anti-bot protection measures require solving captcha challenges. However, the best Cloudflare CAPTCHA bypass is a solution that prevents it from appearing in the first place!
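The error codes described above can be turned into a small lookup that tells the scraper what to try next. This is a minimal sketch, not an official Cloudflare API: the function name, the remedy strings, and the regular expression are illustrative. It assumes the block page body contains the code in a form like "Error 1020" or "error code: 1020", which may not hold for every page.

```python
import re
from typing import Optional

# Remedies summarized from the error descriptions above.
CLOUDFLARE_ERRORS = {
    1009: "geo-blocked: use a proxy from an allowed country",
    1010: "browser fingerprint blocked: fortify the headless browser",
    1015: "rate limited: spread traffic across more agents",
    1020: "access denied: cause not stated, fortify the whole scraper",
}

# Assumed body formats: "Error 1020", "error code: 1020", "Error code 1020".
ERROR_CODE_RE = re.compile(r"error(?:\s+code)?\s*:?\s*(1\d{3})\b", re.IGNORECASE)


def explain_cloudflare_error(body: str) -> Optional[str]:
    """Return a hint for the first known Cloudflare error code found in body."""
    for match in ERROR_CODE_RE.finditer(body):
        code = int(match.group(1))
        if code in CLOUDFLARE_ERRORS:
            return f"{code}: {CLOUDFLARE_ERRORS[code]}"
    return None  # no known code found; the block reason is unknown
```

A scraper could call this on every non-200 response and log the hint before deciding whether to retry with a different proxy or browser profile.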

Finally, here's a list of common Cloudflare block error artifacts:

  • Response headers might include a cf-ray field.
  • The Server header field has the value cloudflare.
  • Set-Cookie response headers have a __cfduid= cookie field.
  • "Attention Required!" or "Cloudflare Ray ID:" in HTML.
  • "DDoS protection by Cloudflare" in HTML.
  • CLOUDFLARE_ERROR_500S_BOX when requesting invalid URLs.
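The artifacts listed above can be checked programmatically. The sketch below is one possible heuristic and its function name is hypothetical. Because headers such as cf-ray also appear on successful responses served through Cloudflare, it only reports a block when a Cloudflare header is combined with a block status code or a block-page marker in the body.

```python
BLOCK_STATUS_CODES = {401, 403, 429, 502}
BODY_MARKERS = (
    "Attention Required!",
    "Cloudflare Ray ID:",
    "DDoS protection by Cloudflare",
)


def looks_like_cloudflare_block(status: int, headers: dict, body: str) -> bool:
    """Heuristically decide whether a response is a Cloudflare block page."""
    # Header names are case-insensitive, so normalize them first.
    headers = {name.lower(): value for name, value in headers.items()}
    served_by_cloudflare = (
        "cf-ray" in headers
        or headers.get("server", "").lower() == "cloudflare"
    )
    has_marker = any(marker in body for marker in BODY_MARKERS)
    return served_by_cloudflare and (status in BLOCK_STATUS_CODES or has_marker)
```

The function takes plain values rather than a response object, so it works with any HTTP client: pass the status code, a dictionary of response headers, and the decoded body text.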

Let's take a look at how exactly Cloudflare detects web scrapers next.

How does Cloudflare detect web scrapers?

To detect web scrapers, Cloudflare uses different technologies to determine whether traffic is coming from a human user or an automated script.

fingerprint technologies used by cloudflare

Cloudflare combines the results of many different analyses and fingerprinting methods into an overall trust score. This score determines whether the user is allowed to visit the website.

Based on the final trust score, the user can be let through, asked to solve a challenge such as a CAPTCHA or a computational JavaScript proof, or blocked entirely:

trust score evaluation flow of Cloudflare anti-bot service

In addition, Cloudflare tracks the continuous behavior of users and constantly adjusts their trust scores.

This complex operation makes web scraping difficult, but if we take a look at each individual tracking component, we'll find that bypassing Cloudflare when web scraping is very much possible!

TLS Fingerprinting

The TLS (or SSL) handshake is the first thing that happens when we establish a secure connection (i.e. using HTTPS instead of HTTP). The client and the server negotiate how the data will be encrypted.

This negotiation can be fingerprinted as modern web browsers have very similar TLS capabilities that some web scrapers might be missing. This is generally referred to as JA3 fingerprinting.

For example, some web scraping libraries and tools have unique TLS negotiation patterns that can be instantly recognized, while others use the same TLS techniques as a web browser and can be very difficult to differentiate.

So, to bypass Cloudflare, use web scraping libraries that are resistant to JA3 fingerprinting.

For more details, refer to our full introduction to TLS fingerprinting. Additionally, you can use the JA3 fingerprinting tool to analyze your TLS fingerprint.
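To make the idea of a JA3 fingerprint more concrete, here is a minimal sketch of how the hash is derived. JA3 joins five fields from the TLS ClientHello (version, cipher suites, extensions, elliptic curves, and point formats) into a comma-separated string of decimal values and takes its MD5 digest. The function name and the numeric values in the usage note are illustrative and were not captured from a real client.

```python
import hashlib


def ja3_fingerprint(tls_version, ciphers, extensions, curves, point_formats):
    """Build the JA3 string and its MD5 hash from ClientHello values.

    Each list keeps the order in which the client sent the values, which is
    why two clients with the same capabilities can still hash differently.
    """
    fields = [
        str(tls_version),
        "-".join(map(str, ciphers)),
        "-".join(map(str, extensions)),
        "-".join(map(str, curves)),
        "-".join(map(str, point_formats)),
    ]
    ja3_string = ",".join(fields)
    ja3_hash = hashlib.md5(ja3_string.encode("ascii")).hexdigest()
    return ja3_string, ja3_hash
```

For example, `ja3_fingerprint(771, [4865, 4866], [0, 23], [29], [0])` produces the string `771,4865-4866,0-23,29,0` and a 32-character hash. Changing only the order of the cipher list changes the hash, which is how an HTTP library with an unusual ordering stands out from browsers.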

IP Address Fingerprinting

There are different factors that play roles in the IP address analysis process. To start out, let's have a look at the different types of IP addresses:

  • Residential addresses are home addresses assigned by ISPs to average internet consumers. So, residential IP addresses provide a positive trust score as these are mostly used by humans and are expensive to acquire.
  • Mobile addresses are assigned by mobile phone towers. So, mobile IPs also provide a positive trust score as these are mostly used by humans. In addition, since many mobile users share and recycle IP addresses (one tower handles everyone nearby), Cloudflare can't reliably fingerprint these IPs.
  • Datacenter addresses are assigned to various data centers and server platforms like Google Cloud, AWS etc. So, datacenter IPs provide a significant negative trust score as they are most likely to be used by non-humans.

With IP analysis, Cloudflare can estimate how trustworthy the IP address of the connecting client is. For example, users rarely browse the internet through datacenter IPs. Therefore, web scrapers that use datacenter proxies are very likely to be blocked.

So, to bypass Cloudflare IP address blocking, use high-quality residential or mobile proxies to hide your IP address.

For further details, refer to our full introduction to IP blocking and how IP trust score is calculated.

HTTP Details

Most users browse the internet through a few popular web browsers. Hence, the same HTTP connection details are seen over and over, making it easy to spot clients that differ from them, such as web scrapers.

Most of the web now operates over the HTTP/2 protocol, while many web scraping tools still use HTTP/1.1. This is a strong signal that the traffic is coming from an automated script. In addition, HTTP/2 connections can themselves be fingerprinted, so scrapers using HTTP clients need to make sure their HTTP/2 connection details resemble those of a real browser. See our http2 fingerprint test page for more info.

Other HTTP connection details, like request cookies and headers including the User-Agent, can influence the trust score as well. For example, most web browsers order their request headers in a specific order which can be different from HTTP libraries used in web scraping.

So, to bypass Cloudflare protection, make sure the headers in web scraper requests match the ones of a real web browser, including their ordering.

For more, see our full introduction to the role of request headers in blocking.
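Header ordering can be handled with a small helper that arranges a scraper's headers to follow a reference order. This is a sketch under assumptions: the helper name is hypothetical, and `REFERENCE_ORDER` is a placeholder that should be replaced with the order captured from your own browser's developer tools. Whether the order is preserved on the wire also depends on the HTTP library in use.

```python
# Placeholder order for illustration only; capture the real one from a browser.
REFERENCE_ORDER = [
    "User-Agent",
    "Accept",
    "Accept-Language",
    "Accept-Encoding",
    "Cookie",
]


def order_like_browser(headers: dict, reference_order: list) -> dict:
    """Return headers re-ordered to follow reference_order.

    Matching is case-insensitive. Headers missing from the reference are
    appended at the end in their original relative order.
    """
    by_lower = {name.lower(): (name, value) for name, value in headers.items()}
    ordered = {}
    for ref_name in reference_order:
        entry = by_lower.pop(ref_name.lower(), None)
        if entry is not None:
            ordered[entry[0]] = entry[1]
    for name, value in by_lower.values():
        ordered[name] = value
    return ordered
```

Python dictionaries preserve insertion order, so the returned dictionary can be passed directly to HTTP clients that send headers in the order given.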

JavaScript Fingerprinting

JavaScript provides comprehensive details about the connecting client, which is used in the trust score calculations. Since JavaScript allows arbitrary code to be executed on the client side, it can be used to extract different details about the connecting client, such as:

  • JavaScript runtime details.
  • Hardware details and capabilities.
  • Operating system details.
  • Web browser details.

That's a lot of information that can be used in calculating the trust score!

Fortunately, JavaScript fingerprinting is intrusive and takes time to execute, so it's disliked by bots and humans alike. This means that a Cloudflare challenge doesn't rely heavily on JavaScript fingerprinting.

For a more in-depth look see our article on JavaScript use in web scraper detection.

Bypassing JavaScript fingerprinting is by far the most difficult task here. In theory, it's possible to reverse engineer and simulate these JavaScript tasks. However, a much more accessible and common approach is to use a real web browser for web scraping.

This can be done using the Selenium, Puppeteer or Playwright browser automation libraries, which can start a real headless browser and navigate it for web scraping. So, introducing browser automation to your scraping pipeline can drastically increase the trust score, leading to a higher chance of Cloudflare bypass.

More advanced scraping tools can even combine browser and HTTP scraping capabilities for optimal performance: a resource-intensive browser is used to establish a trust score, and scraping then continues with a fast HTTP client like httpx in Python (this feature is also available using ScrapFly sessions).
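One way to hand a browser session over to an HTTP client is to carry its cookies across. The sketch below assumes cookies exported as a list of dictionaries with `name`, `value` and `domain` keys (the shape returned by Playwright's `context.cookies()`). The helper name and the sample cookie values are hypothetical.

```python
def cookie_header_for(cookies: list, host: str) -> str:
    """Build a Cookie header value from browser cookies that apply to host."""
    host = host.lower()
    pairs = []
    for cookie in cookies:
        # Browser cookie domains may carry a leading dot (".example.com").
        domain = cookie.get("domain", "").lstrip(".").lower()
        if domain and (host == domain or host.endswith("." + domain)):
            pairs.append(f"{cookie['name']}={cookie['value']}")
    return "; ".join(pairs)


# Hypothetical cookies as exported from a browser session.
browser_cookies = [
    {"name": "cf_clearance", "value": "abc", "domain": ".example.com"},
    {"name": "sid", "value": "42", "domain": "www.example.com"},
]
header_value = cookie_header_for(browser_cookies, "www.example.com")
```

The resulting string can be sent as the `Cookie` header by the HTTP client. Cookies alone are unlikely to be enough: the follow-up requests should probably reuse the browser's User-Agent and proxy as well, since the rest of the connection is still fingerprinted.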

Behavior Analysis

With all the different Cloudflare anti-bot detection techniques, the trust score is not a constant number and is continuously adjusted over the course of the connection.

For example, a client can start the connection with a Cloudflare protected website with a trust score of 80. Then, the client requests 100 pages in just a few seconds. This will decrease the trust score, as it's not likely for normal users to request at such a high rate.

On the other hand, bots with human-like behavior can get a high trust score that can remain steady or even increase. So, it's important to distribute web scraper traffic through multiple agents using proxies and different fingerprint configurations to prevent the reduction of the trust score.
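Distributing traffic across agents can be as simple as a round-robin assignment. The following is a minimal sketch in which the function name and the agent labels are hypothetical. In a real scraper, each agent would stand for a proxy paired with its own fingerprint configuration.

```python
from itertools import cycle


def assign_urls(urls, agents):
    """Spread URLs across agents in round-robin order.

    Each agent ends up with roughly len(urls) / len(agents) requests, so no
    single identity has to request pages at the full scraping rate.
    """
    if not agents:
        raise ValueError("at least one agent is required")
    plan = {agent: [] for agent in agents}
    for url, agent in zip(urls, cycle(agents)):
        plan[agent].append(url)
    return plan


urls = [f"https://example.com/page/{i}" for i in range(100)]
plan = assign_urls(urls, ["agent-a", "agent-b", "agent-c", "agent-d"])
```

With the 100 pages from the example above split across four agents, each agent only requests 25 pages, which can then be paced with per-agent delays.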

How to Bypass Cloudflare Bot Protection?

Now that we've covered all of the parts that are used by Cloudflare to identify web scrapers - how do we blend in overall?

In practice, we have two options.


We could reverse engineer and fortify against all of these detection techniques by using browser-like HTTP/2 connections with the same TLS capabilities and JavaScript behavior as common browsers. Unfortunately, such a Cloudflare bypass is very difficult and time-consuming for developers to maintain.

If you do choose to do it the hard way, take a look at some existing open-source tools that can help with Cloudflare bypass, like FlareSolverr, which can solve Cloudflare's JavaScript challenges.


Alternatively, we can use real web browsers for web scraping. By controlling the browser environment, we don't have to mimic a complex system, making bypassing Cloudflare much more approachable.

However, many automation tools like Puppeteer leave traces of their existence, which ideally need to be patched to achieve higher trust scores.

For sustained web scraping with Cloudflare bypass in 2024, these browsers should be rotated through different fingerprint profiles:

  • Screen resolution
  • Operating system
  • Browser type
  • Page rendering features
  • IP Proxy and proxy type

All browser and connection details play a role in Cloudflare's bot score.
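Profile rotation can be sketched as drawing a random but internally consistent combination of the details listed above. All values and names below are illustrative assumptions, and page rendering features are left out for brevity. The browser is chosen from those available on the selected operating system, since an implausible pairing would itself stand out.

```python
import random

SCREEN_RESOLUTIONS = [(1920, 1080), (1366, 768), (1536, 864)]
# Browsers are grouped by operating system to avoid implausible pairings.
BROWSERS_BY_OS = {
    "Windows": ["Chrome", "Firefox", "Edge"],
    "macOS": ["Chrome", "Safari", "Firefox"],
    "Linux": ["Chrome", "Firefox"],
}
PROXY_TYPES = ["residential", "mobile"]


def random_profile(rng: random.Random) -> dict:
    """Draw one fingerprint profile using the given random generator."""
    os_name = rng.choice(sorted(BROWSERS_BY_OS))
    return {
        "screen": rng.choice(SCREEN_RESOLUTIONS),
        "os": os_name,
        "browser": rng.choice(BROWSERS_BY_OS[os_name]),
        "proxy_type": rng.choice(PROXY_TYPES),
    }


profile = random_profile(random.Random())
```

Passing a seeded `random.Random` makes a profile reproducible, which helps when the same identity needs to be reused across a scraping session.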

Bypass Cloudflare with ScrapFly

While bypassing Cloudflare protection is possible, maintaining bypass strategies can be very time-consuming.

illustration of scrapfly's middleware

Using the ScrapFly web scraping API, we can defer all of this complex Cloudflare bypass proxy logic to an API! ScrapFly is not only a Cloudflare bypass proxy but also offers many other web scraping features.

For example, to scrape pages protected by Cloudflare using ScrapFly SDK all we need to do is enable the Anti Scraping Protection bypass feature:

from scrapfly import ScrapflyClient, ScrapeConfig

scrapfly = ScrapflyClient(key="YOUR API KEY")
result = scrapfly.scrape(ScrapeConfig(
    url="some cloudflare protected page",
    asp=True,
    # we can also enable headless browsers to render web apps and javascript powered pages
    render_js=True,
    # and set proxies by country or type
    country="US",
    proxy_pool=ScrapeConfig.PUBLIC_RESIDENTIAL_POOL,
))
print(result.scrape_result)

FAQ

To wrap up this article, let's take a look at some frequently asked questions regarding web scraping Cloudflare pages:

Is it legal to scrape Cloudflare-protected pages?

Yes. Web scraping publicly available data is generally considered legal in most jurisdictions as long as the scrapers do not cause damage to the website.

Is it possible to bypass Cloudflare entirely and scrape the website directly?

Sort of. Since Cloudflare is a CDN, it's possible to avoid it entirely by connecting to the web server directly. This is done by discovering the real IP address of the server using DNS records or reverse engineering. However, this method is easily detectable so it's rarely used when web scraping.

Is it possible to bypass Cloudflare using cache services?

Yes, public page caching services like Google Cache or Archive.org can be used to bypass Cloudflare. However, since caching takes time, the cached page data is often outdated and not suitable for web scraping. Cached pages can also be missing parts of content that are loaded dynamically.

What are some other anti-bot services?

There are many other anti-bot WAF services like PerimeterX, Akamai, Datadome, Imperva, and Kasada. However, they function very similarly to Cloudflare so everything in this tutorial can be applied to them as well.

How to bypass Cloudflare VPN detection?

Depending on the VPN, the IP address type can be either datacenter or residential. Datacenter IPs have a low success rate with anti-bot systems such as Cloudflare, as these addresses aren't assigned to home networks. So, to avoid Cloudflare detection, use a VPN or proxy that provides residential or mobile IPs, as they have a positive trust score, leading to higher chances of bypassing Cloudflare.

Cloudflare Bypass Summary

In this article, we've taken a look at how to get around Cloudflare anti-bot systems when web scraping. We have seen that Cloudflare identifies automated traffic through IP, TLS, HTTP, and JavaScript fingerprinting techniques as well as behavior analysis. To counter these, we have explored different Cloudflare bypass tips, such as:

  • Using high-quality residential or mobile proxies.
  • Using web scraping libraries that are resistant to JA3 fingerprinting.
  • Using automated browser libraries.
  • Mimicking normal browsers' behavior while scraping.
