How to Bypass Cloudflare When Web Scraping in 2024

Jun 04, 2024

Cloudflare is mostly known for its CDN service, but in the web scraping context, it's the Cloudflare bot protection that hinders the data extraction process. To bypass Cloudflare when web scraping, we have to start by reverse engineering its challenges and how it detects HTTP requests.

In this guide, we'll start by defining what a Cloudflare challenge is and how to identify its presence on web pages by exploring its common error traces. Then, we'll explain how to bypass Cloudflare by exploring its fingerprinting methods and the best way to avoid each. Let's dive in!

Legal Disclaimer and Precautions

This tutorial covers popular web scraping techniques for education. Interacting with public servers requires diligence and respect and here's a good summary of what not to do:

Do not scrape at rates that could damage the website.
Do not scrape data that's not available publicly.
Do not store PII of EU citizens who are protected by GDPR.
Do not repurpose entire public datasets, which can be illegal in some countries.

Scrapfly does not offer legal advice, but these are good general rules to follow in web scraping. For more, you should consult a lawyer.

What Is Cloudflare Bot Management?

Cloudflare Bot Management is a web service that tries to detect and block web scrapers and other bots from accessing the website.
It's a complex multi-tier service that is usually used in legitimate bot and spam prevention but it's becoming an increasingly popular way to block web scrapers from accessing public data.

To start, let's take a look at some common Cloudflare errors that scrapers encounter and what they mean.

Popular Cloudflare Errors

Most Cloudflare bot detection errors result in HTTP status codes 401, 403, 429, or 502, with the 403 error being the most commonly encountered.

Every HTTP status code represents a unique blocking incident. Hence, knowing how to get past Cloudflare relies on identifying and understanding the error encountered.

Cloudflare "please unblock challenges.cloudflare.com to proceed" Error

The "please unblock challenges.cloudflare.com to proceed" error has recently been encountered across many different web pages. It appears in place of the target resource when the browser cannot reach the Cloudflare JS challenge host, challenges.cloudflare.com, so the web page fails to load correctly.

There are different causes for this error, one of them being an internal Cloudflare incident or outage, which usually gets resolved shortly. Other factors contributing to this error may be present locally and require manual debugging, such as firewalls, browser extensions, VPNs, or other security tools.

Cloudflare 1020 Error

The Cloudflare error 1020: access denied is commonly encountered on various web pages, along with the familiar "Access Denied" message. It doesn't indicate the exact cause of the block, since several different detection factors can trigger it. Bypassing the Cloudflare 1020 error generally requires obfuscating the scraper so that it mimics real user behavior, which we'll explore later.

Cloudflare 1009 Error

The Cloudflare error 1009 comes with a popular error message, "... has banned the country or region of your IP address". As described in the message, this error represents a geographical-based blocking when attempting to access a domain that's restricted to a specific country or region. Bypassing the Cloudflare 1009 error requires using a proxy server to change the IP address to one in the allowed region.

Cloudflare 1015 Error

The Cloudflare error 1015: you are being rate limited indicates that the IP address has been blocked because its HTTP request rate exceeded a specified threshold within a specific time frame. Splitting the request traffic across multiple IP addresses using proxies is crucial to prevent the IP address from getting blocked by Cloudflare protection.

Cloudflare 1010 Error

The Cloudflare error 1010: access denied occurs when the browser fingerprint is detected as belonging to a browser controlled by an automation library. To avoid Cloudflare bot detection of the 1010 error, obfuscate the headless browser against JavaScript fingerprinting.

Additionally, here's a list of traces indicating that a response comes from Cloudflare or that Cloudflare has blocked the request:

Response headers include a cf-ray field.
The Server response header has the value cloudflare.
The Set-Cookie response headers include the __cfduid cookie field.
"Attention Required!" or "Cloudflare Ray ID:" appears in the HTML.
"DDoS protection by Cloudflare" appears in the HTML.
CLOUDFLARE_ERROR_500S_BOX appears when requesting invalid URLs.
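
The traces listed above can be checked programmatically. Here's a minimal sketch that uses only the header and HTML markers from the list, together with the status codes mentioned earlier. The function names are our own, and the check is a heuristic rather than an exhaustive detector, since the cf-ray header also appears on successful responses served through Cloudflare:

```python
BLOCK_STATUS_CODES = (401, 403, 429, 502)
HTML_MARKERS = (
    "Attention Required!",
    "Cloudflare Ray ID:",
    "DDoS protection by Cloudflare",
)


def cloudflare_traces(headers, body):
    """Return the Cloudflare traces found in response headers and HTML."""
    # Header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    traces = []
    if "cf-ray" in lowered:
        traces.append("cf-ray header")
    if lowered.get("server", "").lower() == "cloudflare":
        traces.append("server: cloudflare")
    for marker in HTML_MARKERS:
        if marker in body:
            traces.append(f"html marker: {marker}")
    return traces


def looks_blocked(status_code, headers, body):
    """Heuristic: a blocking status code plus at least one Cloudflare trace."""
    return status_code in BLOCK_STATUS_CODES and bool(cloudflare_traces(headers, body))
```

A scraper could call looks_blocked() on every response and switch proxies or slow down as soon as it returns True.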

Some of the above Cloudflare anti-bot protection measures require solving CAPTCHA challenges. However, the best way to bypass Cloudflare CAPTCHA is to prevent it from occurring in the first place!

How Does Cloudflare Detect Web Scrapers?

To detect web scrapers, Cloudflare uses different technologies to determine whether traffic is coming from a real user or an automated script for data extraction.

Fingerprinting technologies used by Cloudflare

Anti-bot systems like Cloudflare combine the results of many different analyses and fingerprinting methods into an overall trust score. This score determines whether the HTTP requests are allowed to reach the target web pages.

Based on the final trust score, the request has three possible fates:

Proceed to the resource origin behind the firewall.
Solve a CAPTCHA or computational JavaScript challenge.
Get blocked entirely.

Trust score evaluation flow of the Cloudflare anti-bot service

In addition to the above analyses, Cloudflare continuously tracks the HTTP requests' behavior and compares them with real users using machine learning and statistical models. This means the request may bypass Cloudflare a few times before getting blocked, as the trust score is likely to change.

The above complex operations make data extraction challenging. However, exploring each component individually, we'll find that Cloudflare bypass for data extraction is very much possible!

TLS Fingerprinting

The TLS handshake is the initial procedure when a request is sent to a web server with an SSL certificate over the HTTPS protocol. During this process, the client and server negotiate how the connection will be encrypted. The parameters the client offers during this negotiation can be combined into a fingerprint called JA3.

Since HTTP clients differ in their capabilities and configuration, they create a unique JA3 fingerprint, which anti-bot solutions use to distinguish automated clients like web scrapers from real users using web browsers.

It's crucial to use HTTP clients performing TLS handshake similar to normal browsers and avoid those with easy-to-distinguish TLS patterns, as they can be instantly detected. For this, refer to ScrapFly's JA3 tool to calculate and adjust your TLS fingerprint.
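
To make the JA3 idea more concrete, here's a minimal sketch of how such a fingerprint is derived. A JA3 string joins five ClientHello fields (TLS version, cipher suites, extensions, elliptic curves, and point formats) with commas, the values inside each field are joined with dashes, and the fingerprint is the MD5 hash of that string. The numeric values below are made up for illustration, and a real implementation would also parse the handshake itself and skip GREASE values:

```python
import hashlib


def ja3_string(tls_version, ciphers, extensions, curves, point_formats):
    """Build the JA3 string: five comma-separated fields, values joined by '-'."""
    fields = [[tls_version], ciphers, extensions, curves, point_formats]
    return ",".join("-".join(str(value) for value in field) for field in fields)


def ja3_hash(tls_version, ciphers, extensions, curves, point_formats):
    """JA3 fingerprint: the MD5 hex digest of the JA3 string."""
    raw = ja3_string(tls_version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(raw.encode("ascii")).hexdigest()


# Two made-up ClientHello summaries that differ only in cipher order.
client_a = ja3_hash(771, [4865, 4866, 4867], [0, 23, 65281], [29, 23, 24], [0])
client_b = ja3_hash(771, [4867, 4866, 4865], [0, 23, 65281], [29, 23, 24], [0])
print(client_a != client_b)
```

Even a change in cipher order produces a different hash, which is why an HTTP library with its own TLS defaults is easy to tell apart from a browser.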

For further details on TLS fingerprinting, refer to our dedicated guide.

How TLS Fingerprint is Used to Block Web Scrapers?

Check out our comprehensive guide to TLS fingerprinting. Learn about the JA3 algorithm and how to protect your web scrapers from fingerprinting.

IP Address Fingerprinting

There are different factors affecting the IP address analysis process. This process starts with the IP address type, which can be either of the following:

Residential
Represents IP addresses assigned by ISPs for consumers browsing from home networks. Residential IP addresses have a positive trust score, as they are mostly associated with real users and expensive to acquire.
Mobile
Mobile IP addresses are assigned by cellular network towers. They have a positive trust score since they are associated with human traffic. Moreover, mobile IP addresses are automatically rotated to new ones during specified intervals, making them harder for anti-bot services to detect.
Datacenter
Represents IP addresses assigned by data centers, such as AWS, Google Cloud, and Azure. Data center IPs have a significant negative trust score, as they are mostly associated with automated scripts.

With IP address fingerprinting, Cloudflare can estimate the likelihood of the connecting client being a genuine user. For example, human users rarely browse the internet through data center proxies. Hence, web scrapers using such IPs are very likely to get blocked.

Another aspect of IP address analysis is the request rate. Anti-bot systems can detect IP addresses that exceed the defined threshold of requests and block them.
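
To illustrate the request rate check, here's a rough sketch of a sliding-window counter similar in spirit to what a rate limiter might apply per IP address. The class name, window size, and limit are hypothetical, as Cloudflare's actual thresholds are configured per website and are not public:

```python
from collections import deque


class RequestRateWindow:
    """Count the requests seen within the last `window_seconds`."""

    def __init__(self, window_seconds, max_requests):
        self.window_seconds = window_seconds
        self.max_requests = max_requests
        self._timestamps = deque()

    def record(self, timestamp):
        """Record a request time and report whether the limit is exceeded."""
        self._timestamps.append(timestamp)
        # Drop the requests that have fallen out of the window.
        while self._timestamps[0] <= timestamp - self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) > self.max_requests
```

The same counter can be used on the scraper side as a self-imposed throttle, so that a single IP address never approaches the limit in the first place.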

Therefore, rotate residential or mobile proxies from trusted proxy providers to avoid IP address fingerprinting. For further details on IP addresses and their trust score calculation process, refer to our dedicated guide.

How to Avoid IP Address Blocking?

Check out our comprehensive guide to IP addresses. Learn about the different types of addresses, how they function, and their role in web scraping blocking.

HTTP Details

Most users browse the web through a few popular browsers, such as Chrome, Firefox, or Edge. These browsers share much of their configuration, so the details of their HTTP requests are highly repetitive, making it easy for anti-bot solutions to spot any outliers.

Headers

Request headers are an essential part of any HTTP request details. Anti-bot systems use them to distinguish web scraping requests from those of normal browsers. Hence, it's necessary to reverse engineer and replicate browser headers to avoid being blocked by Cloudflare protection. Here are common request headers to observe with HTTP requests.

Accept

Represents the response data type accepted by the HTTP client on the given request. It should match a common web browser when scraping HTML pages. In other cases, it should match the resource's data type, such as application/json when scraping hidden APIs or text/xml for sitemaps:

# Chrome/Safari
text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
# Firefox
text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8

Accept-Language

Indicates the supported browser language. Setting this header not only helps mimic real browser configuration but also helps set the web scraper localization settings:

# Firefox
en-US,en;q=0.5
# Chrome
en-US,en;q=0.9

User-Agent

The most popular header for web scrapers. It represents the client's rendering capabilities, including the device type, operating system, browser name, and version. This header is prone to identification, and it's important to rotate the User-Agent header for further scraper obfuscation:

# Chrome
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36
# Firefox
Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:109.0) Gecko/20100101 Firefox/113.0

Cookie

The Cookie header represents the cookie values sent by the request to the web server. While cookies in web scraping don't play a critical role in HTTP fingerprinting, they ensure that the website behaves the same way while scraping as it does during browser navigation. It can also contain specific values to authorize the requests:

Cookie: key=value; key2=value2; key3=value3
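
The header values above can be collected into reusable per-browser header sets. Here's a minimal sketch where the build_headers() helper and the dictionary layout are our own naming, and the values are copied from the examples in this section. Note that real browsers send more headers than these:

```python
# Header values copied from the examples above; real browsers send more headers.
BROWSER_HEADERS = {
    "chrome": {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36"
        ),
    },
    "firefox": {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:109.0) "
            "Gecko/20100101 Firefox/113.0"
        ),
    },
}


def build_headers(browser, cookies=None):
    """Return one browser's header set, plus an optional Cookie header."""
    headers = dict(BROWSER_HEADERS[browser])  # copy so the template stays intact
    if cookies:
        headers["Cookie"] = "; ".join(f"{name}={value}" for name, value in cookies.items())
    return headers
```

Keeping each browser's values together avoids mismatches, such as a Firefox User-Agent sent with Chrome's Accept value, which would itself look like an outlier.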

HTTP2

Another aspect of the HTTP details to observe is the protocol used. Most current websites and browsers operate over the HTTP2 protocol, while many HTTP clients are still tied to HTTP1.1, marking their sent requests as suspicious.

HTTP version use by real web browsers, estimate in 2024

That being said, HTTP2 is provided in many HTTP clients, such as httpx and cURL, but it's not enabled by default. Use the HTTP2 fingerprint testing tool to ensure the data extraction requests use HTTP2.
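
As a quick illustration, here's a sketch of enabling HTTP2 in httpx and checking which protocol version was actually negotiated. It assumes httpx is installed with its optional HTTP2 dependency (pip install "httpx[http2]"), and the helper names are our own:

```python
def is_http2(http_version):
    """True when a version string such as 'HTTP/2' denotes HTTP2."""
    return http_version.strip().upper() in ("HTTP/2", "HTTP/2.0")


def fetch_http_version(url):
    """Request `url` with HTTP2 enabled and return the negotiated version."""
    import httpx  # third-party; HTTP2 support needs the "httpx[http2]" extra

    # http2=True only offers HTTP2; the server may still answer over HTTP/1.1.
    with httpx.Client(http2=True) as client:
        response = client.get(url)
    return response.http_version
```

Calling is_http2(fetch_http_version(url)) against the target website is a simple way to confirm that the scraper is not silently falling back to HTTP1.1.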

So, enable HTTP2 and make sure that the headers used match a common web browser to bypass Cloudflare while web scraping. For further details on HTTP headers and their role, refer to our dedicated guide.

How Headers Are Used to Block Web Scrapers and How to Fix It

Learn about the HTTP headers, what the common ones mean, and how to utilize them effectively during the web scraping process.

JavaScript Fingerprinting

JavaScript provides comprehensive details about the connecting clients, which are used by Cloudflare fingerprinting mechanisms. Since JavaScript allows arbitrary code to be executed on the client side, it can be used to extract different details about the client, such as:

JavaScript runtime details.
Hardware details and capabilities.
Operating system details.
Web browser details.

It seems like anti-bot services already know a lot about their clients! Fortunately, JavaScript execution is time-consuming and prone to false positives. This means that Cloudflare bot protection doesn't rely heavily on JavaScript fingerprinting.

Theoretically, it's possible to reverse engineer the computational JavaScript challenges and solve them using scripts. However, such a solution requires many debugging hours, even for experienced developers. Moreover, any modifications to the challenge algorithms will make the solving script outdated.

On the other hand, a much more accessible and common solution is to use a real web browser for web scraping. This can be approached using browser automation libraries, such as Selenium, Puppeteer, or Playwright.

So, introduce browser automation for the scraping pipeline to increase the trust score for a higher chance of Cloudflare bypass.

More advanced scraping tools can combine the capabilities of HTTP clients and web browsers to bypass Cloudflare. First, the browser requests the target web page to retrieve its session values and establish a trust score. Then, the session values are reused with regular HTTP clients, such as httpx in Python, or through the ScrapFly sessions feature.
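
Here's a minimal sketch of the hand-over step between the browser and the HTTP client. It assumes the cookies arrive as a list of dictionaries with "name" and "value" keys, which is the shape returned by Selenium's driver.get_cookies(). The session_from_browser() helper is our own naming:

```python
def session_from_browser(browser_cookies, user_agent):
    """Convert browser cookies and User-Agent into HTTP client settings."""
    cookies = {cookie["name"]: cookie["value"] for cookie in browser_cookies}
    # Reuse the browser's User-Agent so the follow-up requests look like
    # they belong to the same client that established the session.
    headers = {"User-Agent": user_agent}
    return {"cookies": cookies, "headers": headers}
```

The returned values can then be passed to an HTTP client, for example httpx.Client(cookies=session["cookies"], headers=session["headers"]). The reused session will likely only hold up while the requests keep coming from the same IP address.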

For a more in-depth look, see our guide on the role of JavaScript fingerprinting in web scraping blocking.

How Javascript is Used to Block Web Scrapers? In-Depth Guide

See our full introduction to JavaScript-based fingerprinting of web scrapers: how browser automation libraries leak the fact that they're being used, how the browser's JavaScript environment leaks unique details that can be used in fingerprinting, and how we can fix all that with some clever scripting.

Behavior Analysis

With all the different Cloudflare anti-bot detection techniques, the trust score is not a constant number. Instead, it is continuously adjusted as the connection goes on.

For example, a client can start the connection with a Cloudflare protected website with a trust score of 80. Then, the client requests 100 pages in just a few seconds. This will decrease the trust score, as it's not likely for normal users to request at such a high rate.

On the other hand, bots with human-like behavior can get a high trust score that can remain steady or even increase. So, it's important to distribute web scraper traffic through multiple agents and make each one behave naturally by:

Adding random timeouts between requests.
Rotating User-Agent headers.
Randomizing viewport and browser settings.
Mimicking mouse moves and keyboard clicks.
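
The first item on this list is the easiest one to implement. Here's a minimal sketch of a randomized pause between requests, where the function name and the default timings are arbitrary assumptions to be tuned per target website:

```python
import random


def human_delay(base_seconds=2.0, jitter_seconds=1.5, rng=random):
    """Return a randomised pause: a fixed base plus uniform random jitter."""
    return base_seconds + rng.uniform(0, jitter_seconds)
```

Calling time.sleep(human_delay()) between requests breaks up the perfectly regular timing that makes automated traffic stand out.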

How to Bypass Cloudflare Bot Protection?

Now that we are familiar with the different fingerprinting factors that lead Cloudflare to detect HTTP requests, we can conclude that bypassing Cloudflare is the result of getting a high trust score.

Let's explore practical approaches for fortifying web scrapers against the different Cloudflare protection mechanisms!

Start With Headless Browsers

Since Cloudflare uses JavaScript challenges and fingerprinting mechanisms to detect web scrapers, using headless browsers is often necessary.

Such an approach is available through browser automation libraries like Selenium, Playwright, and Puppeteer, which allow running real web browsers without GUI elements, known as headless browsers.

These headless browsers can automatically solve JavaScript fingerprinting challenges to bypass antibot systems instead of reverse engineering them.

How to Scrape Dynamic Websites Using Headless Web Browsers

Take a look at web scraping dynamic websites. You will learn about the available tools, how to use them, their common challenges, tips, and tricks!

Use High Quality Residential Proxies

As Cloudflare uses IP address analysis methods to calculate a trust score, using residential proxies helps bypass Cloudflare's IP address fingerprinting.

Moreover, web scraping at scale requires rotating proxies. Splitting the load across multiple IP addresses prevents IP address blocking when the request rate exceeds the defined limits.
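
Here's a minimal sketch of round-robin proxy rotation. The ProxyRotator class is our own naming, and the proxy addresses are placeholders from a documentation-only IP range rather than working proxies:

```python
import itertools


class ProxyRotator:
    """Hand out proxies in round-robin order."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._cycle = itertools.cycle(proxies)

    def next_proxy(self):
        """Return the next proxy URL, wrapping around at the end of the pool."""
        return next(self._cycle)


# Placeholder addresses from a documentation IP range, not real proxies.
rotator = ProxyRotator([
    "http://user:pass@192.0.2.10:8000",
    "http://user:pass@192.0.2.11:8000",
    "http://user:pass@192.0.2.12:8000",
])
assignments = [rotator.next_proxy() for _ in range(4)]
```

Each outgoing request takes the next proxy from the pool, so every IP address carries only a share of the total request rate. Round-robin is the simplest policy. A production setup would likely also retire proxies that start getting blocked.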

The Complete Guide To Using Proxies For Web Scraping

Explore the various types of proxies, compare their differences, identify common challenges in proxy usage, and discuss the best practices for web scraping.

Try undetected-chromedriver

There are a few key differences between headless browsers and regular ones. Antibot solutions, like Cloudflare, rely on these differences to detect headless browser usage. For example, the navigator.webdriver value is set to true with automated browsers only:

Illustration of a natural vs. automated browser
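
The navigator.webdriver value is easy to inspect from the automation side. Here's a small sketch that works with any driver object exposing Selenium's execute_script() method. The reports_automation() helper name is our own:

```python
def reports_automation(driver):
    """Ask the page's JavaScript runtime whether it reports WebDriver control."""
    # navigator.webdriver is true in automated browsers; a missing or false
    # value is treated as "not reported".
    return bool(driver.execute_script("return navigator.webdriver"))
```

Running this check against a plain Selenium browser and then against a patched one is a quick way to see whether a given patch actually hides this particular leak.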

The undetected-chromedriver is a community-patched web driver that allows Selenium to bypass Cloudflare. These patches include headless browser fixes for TLS, HTTP and JavaScript fingerprints.

Web Scraping Without Blocking With Undetected ChromeDriver

Learn about the undetected-chromedriver, how it works, compare it with regular Selenium, and how to use it through a practical example.

Try Puppeteer Stealth Plugin

Puppeteer is a popular NodeJS library for headless browser automation, with the common fingerprinting leaks found in other automation libraries.

Just like the undetected-chromedriver, puppeteer-stealth is a plugin that patches Puppeteer to prevent anti bot detection through multiple evasion techniques, including:

Modifying the navigator.plugins properties to match a common browser plugin.
Mimicking common permissions as if they were enabled by a real user.
Removing the common navigator.webdriver value.
Preventing canvas and WebGL fingerprinting.
Rotating the User-Agent to match real browser ones.

The stealth plugin capabilities are implemented in other browser automation libraries, such as selenium-stealth and playwright-stealth, to make Playwright Cloudflare resistant. However, puppeteer-stealth tends to have a higher success rate with better maintainability.

Try FlareSolverr

FlareSolverr is a popular community tool for bypassing Cloudflare by combining the power of both headless browsers and HTTP clients. It provides comprehensive methods for managing bypass sessions.

FlareSolverr's workflow can be explained through the following steps:

The target web page is requested with an undetected-chromedriver instance.
The Cloudflare challenge on the web page is bypassed, and the successful request session values get saved.
The saved session values of the successful request, including its headers and cookies, are re-used with regular HTTP requests.

Using FlareSolverr to bypass Cloudflare makes scaling web scrapers resource-effective. This is due to the smart session usage, which reduces the need to run a headless browser for each request.
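
The workflow above can be sketched with Python's standard library. This example assumes a FlareSolverr instance is already running locally on its default endpoint, http://localhost:8191/v1, and the helper names are our own. The request and response fields follow FlareSolverr's request.get command, but check them against the version you run:

```python
import json
import urllib.request


def build_request(url, max_timeout_ms=60000):
    """Build the JSON body for FlareSolverr's `request.get` command."""
    return {"cmd": "request.get", "url": url, "maxTimeout": max_timeout_ms}


def extract_session(solver_response):
    """Pull the reusable cookies and User-Agent out of a FlareSolverr response."""
    if solver_response.get("status") != "ok":
        raise RuntimeError(solver_response.get("message", "FlareSolverr request failed"))
    solution = solver_response["solution"]
    cookies = {cookie["name"]: cookie["value"] for cookie in solution.get("cookies", [])}
    return {"cookies": cookies, "headers": {"User-Agent": solution["userAgent"]}}


def solve(url, endpoint="http://localhost:8191/v1"):
    """Send the request to a locally running FlareSolverr instance."""
    body = json.dumps(build_request(url)).encode("utf-8")
    request = urllib.request.Request(
        endpoint, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return extract_session(json.load(response))
```

The dictionary returned by solve() can be handed to a regular HTTP client, so the headless browser only needs to run again when the saved session stops working.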

FlareSolverr Guide: Bypass Cloudflare While Scraping

Learn about the FlareSolverr tool and how to use it to get around Cloudflare while scraping. We'll start by explaining what FlareSolverr is, how it works and how to install and use it.

Try curl-impersonate

curl-impersonate is a community tool that fortifies the libcurl HTTP client library to mimic the behavior of a real web browser. It patches the TLS and HTTP fingerprints to make the HTTP requests look like they're coming from a real web browser.

curl-impersonate and similar clients like cURL or Postman lack HTML parsing capabilities, which limits their use in web scraping. However, curl-impersonate can be used to modify the TLS details of HTTP clients in different programming languages. One such example is curl_cffi, an interface for curl-impersonate in Python.
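
Here's a minimal sketch of using curl_cffi, assuming it's installed with pip install curl_cffi. The fetch_as_browser() wrapper is our own naming, and the "chrome110" impersonation target is just one of the browser profiles the library ships with. Available targets vary between versions:

```python
def fetch_as_browser(url, browser="chrome110"):
    """Fetch `url` with the TLS and HTTP2 fingerprint of a real browser."""
    from curl_cffi import requests  # third-party: pip install curl_cffi

    # `impersonate` selects which browser's handshake and settings to copy.
    response = requests.get(url, impersonate=browser)
    return response.status_code, response.text
```

Since curl_cffi mirrors the familiar requests-style interface, it can often replace an existing HTTP client with few code changes, while the returned HTML is passed to a parsing library as usual.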

Use Curl Impersonate to scrape as Chrome or Firefox

Take a deep dive into the Curl Impersonate tool, which prevents TLS and HTTP2 fingerprinting by impersonating a normal web browser. You will learn how it works, how to install it, and how to use it through a step-by-step tutorial.

Try Warming Up Scrapers

To bypass behavior analysis, adjusting scraper behavior to appear more natural can drastically increase Cloudflare trust scores. In reality, most real users don't visit product URLs directly. They often explore websites in steps like:

Start with the homepage.
Browse product categories.
Search for a product.
View the product page.

Prefixing scraping logic with this warmup behavior can make the scraper appear more human-like and improve its standing in behavior analysis.
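
The warmup steps above can be expressed as an ordered list of URLs to visit before the actual target. Here's a minimal sketch where the /category/ and /search?q= paths are hypothetical, since every website has its own URL layout:

```python
from urllib.parse import urlencode, urljoin


def warmup_urls(base_url, category, search_term, product_path):
    """Return the pages to visit, in order, ending with the product page."""
    return [
        urljoin(base_url, "/"),                                         # homepage
        urljoin(base_url, f"/category/{category}"),                     # category listing
        urljoin(base_url, "/search?" + urlencode({"q": search_term})),  # search results
        urljoin(base_url, product_path),                                # the actual target
    ]
```

The scraper then requests these URLs one by one in the same session, ideally with a pause between them, and only parses the final response.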

Rotate Real User Fingerprints

For sustained web scraping and Cloudflare bypass in 2024, headless browsers should constantly be blended with different, realistic fingerprint profiles: screen resolution, operating system, and browser type all play an essential role in bypassing Cloudflare.

Each headless browser library can be configured to use different resolution and rendering capabilities. Distributing scraping through multiple realistic browser configurations can prevent Cloudflare from detecting the scraper.
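
Here's a minimal sketch of rotating whole fingerprint profiles. The profile names and viewport sizes are illustrative assumptions, while the header values come from the examples earlier in this guide. A real pool would be larger and collected from actual browsers:

```python
import random

# Each profile keeps its values mutually consistent: a Chrome User-Agent is
# never paired with Firefox's Accept-Language value, for example.
PROFILES = [
    {
        "name": "chrome-windows-desktop",
        "user_agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36"
        ),
        "accept_language": "en-US,en;q=0.9",
        "viewport": {"width": 1920, "height": 1080},
    },
    {
        "name": "firefox-windows-laptop",
        "user_agent": (
            "Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:109.0) "
            "Gecko/20100101 Firefox/113.0"
        ),
        "accept_language": "en-US,en;q=0.5",
        "viewport": {"width": 1366, "height": 768},
    },
]


def pick_profile(rng=random):
    """Choose one whole profile at random instead of mixing individual values."""
    return dict(rng.choice(PROFILES))
```

The chosen profile is applied when the browser session is created. In Playwright, for instance, the user agent and viewport can be passed to browser.new_context(). Picking a complete profile, rather than randomizing each value independently, avoids combinations that no real browser would produce.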

For further details, see ScrapFly's browser fingerprint tool to observe how your browser looks to Cloudflare. It collects different details about the browser, which helps make web scrapers look like regular browsers.

Keep an Eye on New Tools

Open source web scraping is tough, as each newly discovered technique is quickly patched by anti-bot services like Cloudflare, which results in a cat-and-mouse game.

For best results, tracking web scraping news and popular GitHub repositories can help to stay ahead of the curve:

The ScrapFly Blog for the latest web scraping news and tutorials.
GitHub issue and network pages for tools like curl-impersonate and undetected-chromedriver, which are often updated with new bypass techniques and patches that are not available on the main branch.

If all that seems like too much trouble, let ScrapFly handle it for you! 👇

Bypass Cloudflare with ScrapFly

Bypassing Cloudflare protection, while possible, is very difficult. Let ScrapFly do it for you!

ScrapFly middleware: the ScrapFly service does the heavy lifting for you!

ScrapFly is a web scraping API with an automatic Cloudflare bypass and we achieve this by:

Maintaining a fleet of real, reinforced web browsers.
Collecting a database of thousands of real fingerprint profiles.
Millions of self-healing proxies of the highest possible trust score.
Constantly evolving and adapting to new anti-bot systems.
We've been doing this publicly since 2020 with the best bypass on the market!

It takes Scrapfly several full-time engineers to maintain this system, so you don't have to!


FAQ

To wrap up this guide on bypassing Cloudflare while web scraping, let's have a look at some frequently asked questions.

Is it legal to scrape Cloudflare protected pages?

Yes. Web scraping publicly available data is generally legal in most places as long as the scrapers do not cause damage to the website.

Is it possible to bypass Cloudflare entirely and scrape the website directly?

Since Cloudflare is a CDN service, it can be avoided by requesting the web server directly. This can be approached using its IP address by reverse engineering its DNS records. However, this method can be easily detected, so it's rarely used by web scrapers.

Is it possible to bypass Cloudflare using cache services?

Yes, public page caching services like Google Cache or Archive.org can be used to bypass Cloudflare. However, caching cycles are time-consuming, making the web scraping data outdated. Cached pages might also miss parts of the original content that are dynamically loaded.

What are other anti-bot services?

Other anti-bot WAF services commonly found on web pages include PerimeterX, Akamai, Datadome, Imperva, and Kasada. These anti-bot solutions behave similarly to Cloudflare. So, the technical concepts described in this guide can also be applied to them.

How to bypass Cloudflare VPN detection?

The VPN's IP address can be either a data center or a residential one. Data center proxies have a negative trust score with anti-bot solutions, increasing the chances of a Cloudflare JS challenge. To avoid Cloudflare detection when using VPNs, use high-quality residential or mobile proxies, as they have a positive trust score.

Cloudflare Bypass Summary

In this article, we've taken a look at how to get around Cloudflare anti-bot systems when web scraping. We have seen that Cloudflare identifies automated traffic through IP, TLS, and JavaScript fingerprinting techniques. To counter this, we have explored different Cloudflare bypass tips, such as:

Using high-quality residential or mobile proxies.
Using web scraping libraries that are resistant to JA3 fingerprinting.
Using automated browser libraries.
Mimicking normal browsers' behavior while scraping.
