Cloudflare is mostly known for its CDN service, but in the web scraping context, it's the Cloudflare bot protection that hinders the data extraction process. To bypass Cloudflare when web scraping, we have to start by reverse engineering its challenges and how it detects HTTP requests.
In this guide, we'll start by defining what a Cloudflare challenge is and how to identify its presence on web pages by exploring its common error traces. Then, we'll explain how to bypass Cloudflare by exploring its fingerprinting methods and the best way to avoid each. Let's dive in!
Legal Disclaimer and Precautions
This tutorial covers popular web scraping techniques for education. Interacting with public servers requires diligence and respect and here's a good summary of what not to do:
Do not scrape at rates that could damage the website.
Do not scrape data that's not available publicly.
Do not store PII of EU citizens who are protected by GDPR.
Do not repurpose the entire public datasets which can be illegal in some countries.
Scrapfly does not offer legal advice, but these are good general rules to follow in web scraping. For more, you should consult a lawyer.
What Is Cloudflare Bot Management?
Cloudflare Bot Management is a web service that tries to detect and block web scrapers and other bots from accessing the website.
It's a complex multi-tier service that is usually used in legitimate bot and spam prevention but it's becoming an increasingly popular way to block web scrapers from accessing public data.
To start, let's take a look at some common Cloudflare errors that scrapers encounter and what they mean.
Popular Cloudflare Errors
Most Cloudflare bot detection errors result in HTTP status codes 401, 403, 429, or 502, with the 403 error being the most commonly encountered.
Every HTTP status code represents a unique blocking incident. Hence, knowing how to get past Cloudflare relies on identifying and understanding the error encountered.
Cloudflare "please unblock challenges.cloudflare.com to proceed" Error
The "please unblock challenges.cloudflare.com to proceed" error has recently become common across different web pages. It appears in place of the target page when the browser can't reach "challenges.cloudflare.com", the host that serves the Cloudflare JS challenge, so the challenge never loads.
There are different causes for this error, one of them being an internal Cloudflare incident or outage, which usually gets resolved shortly. Other factors contributing to this error may be present locally and require manual debugging, such as firewalls, browser extensions, VPNs, or other security tools.
Cloudflare 1020 Error
The Cloudflare 1020: access denied error is commonly encountered on various web pages, along with the familiar "Access Denied" message. It doesn't indicate the exact blocking cause, as several different factors can trigger it, as we'll explain. The Cloudflare 1020 bypass can be approached with complete scraper obfuscation that mimics real user behavior, which we'll explore later.
Cloudflare 1009 Error
The Cloudflare error 1009 comes with a popular error message, "... has banned the country or region of your IP address". As described in the message, this error represents a geographical-based blocking when attempting to access a domain that's restricted to a specific country or region. Bypassing the Cloudflare 1009 error requires using a proxy server to change the IP address to one in the allowed region.
Cloudflare 1015 Error
The Cloudflare error 1015: you are being rate limited represents an IP address blocking, which occurs when the HTTP requests' rate exceeds a specified threshold within a specific time frame. Splitting the requests' traffic across multiple IP addresses using proxies is crucial to prevent the IP address from getting blocked by Cloudflare protection.
Cloudflare 1010 Error
The Cloudflare error 1010: access denied occurs when the browser fingerprint reveals that the browser is controlled by an automation library. To avoid the 1010 error, obfuscate the headless browser against JavaScript fingerprinting.
Additionally, here's a list of traces indicating that Cloudflare is protecting, or has blocked, a web page:
Response headers might have a cf-ray field value.
The Server response header has the value cloudflare.
The Set-Cookie response headers include the __cfduid cookie field.
"Attention Required!" or "Cloudflare Ray ID:" in HTML.
"DDoS protection by Cloudflare" in HTML.
Encountering CLOUDFLARE_ERROR_500S_BOX when requesting invalid URLs.
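The traces listed above can be checked programmatically. Here's a minimal sketch using only the status codes, headers, and HTML markers mentioned in this section. The function names are hypothetical, and the result should be treated as a hint rather than a definitive verdict, since Cloudflare also serves unblocked pages with the same headers:

```python
from typing import Mapping

# Status codes and markers taken from the lists above.
BLOCK_STATUS_CODES = {401, 403, 429, 502}
BODY_MARKERS = (
    "Attention Required!",
    "Cloudflare Ray ID:",
    "DDoS protection by Cloudflare",
    "CLOUDFLARE_ERROR_500S_BOX",
)


def cloudflare_traces(status: int, headers: Mapping[str, str], body: str) -> list[str]:
    """Return the Cloudflare traces found in a response (empty list if none)."""
    lowered = {name.lower(): value for name, value in headers.items()}
    traces = []
    if status in BLOCK_STATUS_CODES:
        traces.append(f"status:{status}")
    if "cf-ray" in lowered:
        traces.append("header:cf-ray")
    if lowered.get("server", "").lower() == "cloudflare":
        traces.append("header:server")
    traces.extend(f"body:{marker}" for marker in BODY_MARKERS if marker in body)
    return traces


def looks_blocked(status: int, headers: Mapping[str, str], body: str) -> bool:
    """Heuristic: a blocking status code plus at least one Cloudflare marker."""
    traces = cloudflare_traces(status, headers, body)
    has_status = any(trace.startswith("status:") for trace in traces)
    has_marker = any(not trace.startswith("status:") for trace in traces)
    return has_status and has_marker
```

A scraper can call looks_blocked() on every response and switch to a retry or a different strategy when it returns True.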
Some of the above Cloudflare anti-bot protection measures require solving CAPTCHA challenges. However, the best way to bypass Cloudflare CAPTCHA is to prevent it from occurring in the first place!
How Does Cloudflare Detect Web Scrapers?
To detect web scrapers, Cloudflare uses different technologies to determine whether traffic is coming from a real user or an automated script for data extraction.
Anti-bot systems like Cloudflare combine the results of many different analyses and fingerprinting methods into an overall trust score. This score determines whether the HTTP requests are allowed to reach the target web pages.
Based on the final trust score, the request has three possible fates:
Proceed to the resource origin behind the firewall.
Solve a CAPTCHA or computational JavaScript challenge.
Get blocked entirely.
In addition to the above analyses, Cloudflare continuously tracks the HTTP requests' behavior and compares them with real users using machine learning and statistical models. This means the request may bypass Cloudflare a few times before getting blocked, as the trust score is likely to change.
The above complex operations make data extraction challenging. However, exploring each component individually, we'll find that Cloudflare bypass for data extraction is very much possible!
TLS Fingerprinting
The TLS handshake is the initial procedure when a request is sent to a web server with an SSL certificate over the HTTPS protocol. During this process, the client and server negotiate how the connection will be encrypted. The parameters the client offers during this negotiation can be combined into a fingerprint called JA3.
Since HTTP clients differ in their capabilities and configuration, they create a unique JA3 fingerprint, which anti-bot solutions use to distinguish automated clients like web scrapers from real users using web browsers.
It's crucial to use HTTP clients performing TLS handshake similar to normal browsers and avoid those with easy-to-distinguish TLS patterns, as they can be instantly detected. For this, refer to ScrapFly's JA3 tool to calculate and adjust your TLS fingerprint.
For further details on TLS fingerprinting, refer to our dedicated guide.
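To make the JA3 idea concrete, here's a minimal sketch of how the fingerprint is derived: five fields from the client's handshake (TLS version, cipher suites, extensions, elliptic curves, and point formats) are joined into a string and hashed with MD5. The handshake values below are hypothetical and shortened for readability, and a complete implementation would also skip GREASE values before hashing:

```python
import hashlib


def ja3_string(version, ciphers, extensions, curves, point_formats) -> str:
    """Join ClientHello fields in JA3 order: commas between fields, dashes within."""
    fields = [[version], ciphers, extensions, curves, point_formats]
    return ",".join("-".join(str(value) for value in field) for field in fields)


def ja3_hash(ja3: str) -> str:
    """A JA3 fingerprint is the MD5 hex digest of the JA3 string."""
    return hashlib.md5(ja3.encode("ascii")).hexdigest()


# Hypothetical ClientHello values, not captured from a real browser.
example = ja3_string(
    version=771,  # TLS 1.2 (0x0303)
    ciphers=[4865, 4866, 4867],
    extensions=[0, 23, 65281, 10, 11],
    curves=[29, 23, 24],
    point_formats=[0],
)
print(example)
print(ja3_hash(example))
```

Because the order and content of these lists depend on the TLS library, two clients sending identical headers can still produce different hashes.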
IP Address Fingerprinting
There are different factors affecting the IP address analysis process. This process starts with the IP address type, which can be either of the following:
Residential
Represents IP addresses assigned by ISPs for consumers browsing from home networks. Residential IP addresses have a positive trust score, as they are mostly associated with real users and expensive to acquire.
Mobile
Mobile IP addresses are assigned by cellular network towers. They have a positive trust score since they are associated with human traffic. Moreover, mobile IP addresses are automatically rotated to new ones during specified intervals, making them harder for anti-bot services to detect.
Datacenter
Represents IP addresses assigned by data centers, such as AWS, Google Cloud, and Azure. Data center IPs have a significant negative trust score, as they are mostly associated with automated scripts.
With IP address fingerprinting, Cloudflare can estimate the likelihood of the connecting client being a genuine real user. For example, human users rarely browse the internet through data center proxies. Hence, web scrapers using such IPs are very likely to get blocked.
Another aspect of IP address analysis is the request rate. Anti-bot systems can detect IP addresses that exceed the defined threshold of requests and block them.
Therefore, rotate residential or mobile proxies from trusted proxy providers to prevent IP address fingerprinting. For further details on IP addresses and their trust score calculation process, refer to our dedicated guide.
HTTP Details
Most users browse the web through a few popular browsers, such as Chrome, Firefox, or Edge. These browsers share largely the same configuration across their users. Hence, the details of their HTTP requests are highly repetitive, making it easy for anti-bot solutions to spot any outliers.
Headers
Request headers are an essential part of any HTTP request details. Anti-bot systems use them to distinguish web scraping requests from those of normal browsers. Hence, it's necessary to reverse engineer and replicate browser headers to avoid being blocked by Cloudflare protection. Here are common request headers to observe with HTTP requests.
Accept
Represents the response data type accepted by the HTTP client on the given request. It should match a common web browser when scraping HTML pages. In other cases, it should match the resource's data type, such as application/json when scraping hidden APIs or text/xml for sitemaps.
Accept-Language
Indicates the supported browser language. Setting this header not only helps mimic real browser configuration but also helps set the web scraper localization settings:
# Firefox
en-US,en;q=0.5
# Chrome
en-US,en;q=0.9
User-Agent
The most popular header for web scrapers. It represents the client's rendering capabilities, including the device type, operating system, browser name, and version. This header is prone to identification, and it's important to rotate the User-Agent header for further scraper obfuscation:
# Chrome
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36
# Firefox
Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:109.0) Gecko/20100101 Firefox/113.0
Cookie
The cookie header represents the cookie values sent by the request to the web server. While cookies in web scraping don't play a critical role in HTTP fingerprinting, they ensure the same website behavior of browser navigation when scraping. It can also contain specific values to authorize the requests:
Cookie: key=value; key2=value2; key3=value3
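Putting the headers above together, here's a minimal sketch that assembles a consistent, browser-like header set. The User-Agent and Accept-Language values are copied from the examples in this section, while the Accept value is a simplified assumption, so copy the exact value from your own browser's developer tools. The browser_headers() helper is a hypothetical name:

```python
# Header sets keyed by browser. Keeping the values of one browser together
# avoids mismatches, such as a Chrome User-Agent with a Firefox Accept-Language.
BROWSER_HEADERS = {
    "chrome": {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36"
        ),
        # Simplified assumption, not the exact Chrome value.
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
    },
    "firefox": {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:109.0) "
            "Gecko/20100101 Firefox/113.0"
        ),
        # Simplified assumption, not the exact Firefox value.
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
    },
}


def browser_headers(browser="chrome", cookies=None):
    """Return a copy of one browser's headers, optionally with a Cookie header."""
    headers = dict(BROWSER_HEADERS[browser])
    if cookies:
        headers["Cookie"] = "; ".join(f"{name}={value}" for name, value in cookies.items())
    return headers
```

The returned dictionary can be passed as the headers argument of most Python HTTP clients.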
HTTP2
Another aspect of the HTTP details to observe is the protocol used. Most current websites and browsers operate over the HTTP2 protocol, while many HTTP clients are still tied to HTTP1.1, marking their sent requests as suspicious.
That being said, HTTP2 is provided in many HTTP clients, such as httpx and cURL, but it's not enabled by default. Use the HTTP2 fingerprint testing tool to ensure the data extraction requests use HTTP2.
So, enable HTTP2 and make sure that the headers used match a common web browser to bypass Cloudflare while web scraping. For further details on HTTP headers and their role, refer to our dedicated guide.
JavaScript Fingerprinting
JavaScript provides comprehensive details about the connecting clients, which Cloudflare's fingerprinting mechanisms use. Since JavaScript allows arbitrary code to be executed on the client side, it can be used to extract different details about the client, such as:
JavaScript runtime details.
Hardware details and capabilities.
Operating system details.
Web browser details.
It seems like anti-bot services already know a lot about their clients!
Fortunately, JavaScript execution is time-consuming and prone to false positives. This means that Cloudflare bot protection doesn't rely heavily on JavaScript fingerprinting.
Theoretically, it's possible to reverse engineer the computational JavaScript challenges and solve them using scripts. However, such a solution requires many debugging hours, even for experienced developers. Moreover, any modifications to the challenge algorithms will make the solving script outdated.
So, introduce browser automation for the scraping pipeline to increase the trust score for a higher chance of Cloudflare bypass.
More advanced scraping tools can combine the capabilities of HTTP clients and web browsers to bypass Cloudflare. First, the browser requests the target web page to retrieve its session values and establish a trust score. Then, the session values are reused with regular HTTP clients, such as httpx in Python, or with the ScrapFly sessions feature.
For a more in-depth look, see our guide on JavaScript fingerprint role in terms of web scraping blocking.
Behavior Analysis
With all the different Cloudflare anti-bot detection techniques, the trust score is not a constant number and is continuously adjusted over the course of an ongoing connection.
For example, a client can start the connection with a Cloudflare protected website with a trust score of 80. Then, the client requests 100 pages in just a few seconds. This will decrease the trust score, as it's not likely for normal users to request at such a high rate.
On the other hand, bots with human-like behavior can get a high trust score that can remain steady or even increase. So, it's important to make the web scraper traffic look like it comes from many different users by:
Adding random timeouts between requests.
Rotating User-Agent headers.
Randomizing viewport and browser settings.
Mimicking mouse moves and keyboard clicks.
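The first two items on the list above can be sketched in a few lines. The paced_requests() generator below is a hypothetical helper that pauses for a random interval between requests and picks a User-Agent for each one. The delay bounds are arbitrary assumptions and should be tuned per target:

```python
import random
import time

# User-Agent strings reused from the examples earlier in this guide.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:109.0) Gecko/20100101 Firefox/113.0",
]


def paced_requests(urls, min_delay=1.0, max_delay=4.0, rng=None, sleep=time.sleep):
    """Yield (url, headers) pairs, pausing a random interval between them."""
    rng = rng or random.Random()
    for index, url in enumerate(urls):
        if index:  # no pause before the first request
            sleep(rng.uniform(min_delay, max_delay))
        yield url, {"User-Agent": rng.choice(USER_AGENTS)}
```

Each yielded pair can then be passed to the HTTP client of choice. The sleep and rng parameters exist only to make the helper easy to test.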
How to Bypass Cloudflare Bot Protection?
Now that we are familiar with the different fingerprinting factors that lead Cloudflare to detect HTTP requests, we can conclude that bypassing Cloudflare is the result of getting a high trust score.
Let's explore practical approaches for fortifying web scrapers against the different Cloudflare protection mechanisms!
Start With Headless Browsers
Since Cloudflare uses JavaScript challenges and fingerprinting mechanisms to detect web scrapers, using headless browsers is often necessary.
Such an approach is available through browser automation libraries like Selenium, Playwright, and Puppeteer, which allow running real web browsers without GUI elements, known as headless browsers.
These headless browsers can automatically solve JavaScript fingerprinting challenges to bypass antibot systems instead of reverse engineering them.
Use High Quality Residential Proxies
As Cloudflare uses IP address analysis methods to calculate a trust score, using residential proxies helps bypass Cloudflare's IP address fingerprinting.
Moreover, web scraping at scale requires rotating proxies. Splitting the load across multiple IP addresses prevents IP address blocking when the request rate exceeds the defined limits.
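As an illustration of splitting the load, here's a minimal round-robin rotator that also caps how many requests go through each proxy. The ProxyRotator class and the fixed per-proxy budget are assumptions made for this sketch. A production version would typically reset the budget per time window and drop proxies that start getting blocked:

```python
import itertools
from collections import Counter


class ProxyRotator:
    """Round-robin over proxies while capping requests sent through each one."""

    def __init__(self, proxies, max_requests_per_proxy):
        self._cycle = itertools.cycle(proxies)
        self._size = len(proxies)
        self._limit = max_requests_per_proxy
        self.usage = Counter()

    def next_proxy(self):
        # Try each proxy at most once per call, skipping exhausted ones.
        for _ in range(self._size):
            proxy = next(self._cycle)
            if self.usage[proxy] < self._limit:
                self.usage[proxy] += 1
                return proxy
        raise RuntimeError("every proxy has reached its request budget")
```

Calling next_proxy() before each request spreads traffic evenly, so no single IP address exceeds the chosen budget.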
Try undetected-chromedriver
There are a few key differences between headless browsers and regular ones. Anti-bot solutions, like Cloudflare, rely on these differences to detect headless browser usage. For example, the navigator.webdriver value is set to true only in automated browsers.
The undetected-chromedriver is a community-patched web driver that allows Selenium to bypass Cloudflare. These patches include headless browser fixes for TLS, HTTP and JavaScript fingerprints.
Try Puppeteer Stealth Plugin
Puppeteer is a popular NodeJS library for headless browser automation, and it has the same fingerprinting leaks found in other automation libraries.
Just like the undetected-chromedriver, puppeteer-stealth is a plugin that patches Puppeteer to prevent anti bot detection through multiple evasion techniques, including:
Modifying the navigator.plugins properties to match a common browser plugin.
Mimicking common permissions as if they were enabled by a real user.
Removing the common navigator.webdriver value.
Preventing canvas and WebGL fingerprinting.
Rotating the User-Agent to match real browser ones.
The stealth plugin capabilities are implemented in other browser automation libraries, such as selenium-stealth and playwright-stealth, to make Playwright Cloudflare resistant. However, puppeteer-stealth tends to have a higher success rate with better maintainability.
Try FlareSolverr
FlareSolverr is a popular community tool for bypassing Cloudflare by combining the power of both headless browsers and HTTP clients. It provides comprehensive methods for managing bypass sessions.
FlareSolverr's workflow can be explained through the following steps:
The target web page is requested with an undetected-chromedriver instance.
The Cloudflare challenge on the web page is bypassed, and the successful request session values get saved.
The saved session values of the successful request, including its headers and cookies, are re-used with regular HTTP requests.
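The steps above can be sketched with the standard library. This assumes a FlareSolverr instance is already running locally on its default endpoint, and it uses the request and response fields (cmd, maxTimeout, solution, cookies, userAgent) as we understand them from the FlareSolverr documentation, so verify them against the version you run. The helper names are hypothetical:

```python
import json
from urllib.request import Request, urlopen

# Assumed default local endpoint; adjust to your deployment.
FLARESOLVERR_URL = "http://localhost:8191/v1"


def build_payload(url, timeout_ms=60000):
    """Command asking FlareSolverr to open the URL in its browser."""
    return {"cmd": "request.get", "url": url, "maxTimeout": timeout_ms}


def solve(url):
    """Send the command to FlareSolverr and return its parsed JSON reply."""
    request = Request(
        FLARESOLVERR_URL,
        data=json.dumps(build_payload(url)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(request) as response:
        return json.load(response)


def session_headers(result):
    """Turn the solved session into headers for a regular HTTP client."""
    solution = result["solution"]
    cookie = "; ".join(f"{c['name']}={c['value']}" for c in solution["cookies"])
    return {"User-Agent": solution["userAgent"], "Cookie": cookie}
```

The headers returned by session_headers(solve(url)) can then be reused with a regular HTTP client. The cookies are generally only accepted together with the same User-Agent, which is why both are returned as a pair.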
Using FlareSolverr to bypass Cloudflare makes scaling web scrapers resource-effective. This is due to the smart session usage, which reduces the need to run a headless browser with each request.
Try curl-impersonate
curl-impersonate is a community tool that fortifies the libcurl HTTP client library to mimic the behavior of a real web browser. It patches the TLS and HTTP fingerprints to make the HTTP requests look like they're coming from a real web browser.
Command-line clients like curl-impersonate, cURL, or Postman limit the web scraping process due to their lack of parsing capabilities. However, curl-impersonate can also be used to modify the TLS details of HTTP clients in different programming languages. One such example is curl_cffi, an interface for curl-impersonate in Python.
Try Warming Up Scrapers
To bypass behavior analysis, adjusting scraper behavior to appear more natural can drastically increase Cloudflare trust scores. In reality, most real users don't visit product URLs directly. They often explore websites in steps like:
Start with the homepage.
Browse product categories.
Search for a product.
View the product page.
Prefixing scraping logic with this warmup behavior can make the scraper appear more human-like and reduce the chance of detection by behavior analysis.
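Here's a minimal sketch of such a warmup sequence. The URL layout (a /search endpoint with a q parameter, plus the category and product paths) is a hypothetical example, as every website structures these pages differently:

```python
from urllib.parse import urlencode, urljoin


def warmup_urls(base_url, category_path, search_query, product_path):
    """Return the pages to visit, in order, before scraping the product page."""
    return [
        urljoin(base_url, "/"),  # homepage
        urljoin(base_url, category_path),  # product category
        urljoin(base_url, "/search?" + urlencode({"q": search_query})),  # search
        urljoin(base_url, product_path),  # the actual target
    ]


for url in warmup_urls(
    "https://example.com", "/category/shoes", "red shoes", "/product/123"
):
    print(url)
```

Requesting these URLs in order, within the same session and with natural pauses in between, approximates the navigation path of a real visitor.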
Rotate Real User Fingerprints
For sustained web scraping and Cloudflare bypass in 2024, headless browsers should constantly be blended with different, realistic fingerprint profiles: screen resolution, operating system, and browser type all play an essential role in bypassing Cloudflare.
Each headless browser library can be configured to use different resolution and rendering capabilities. Distributing scraping through multiple realistic browser configurations can prevent Cloudflare from detecting the scraper.
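One way to approach this is to keep a pool of internally consistent profiles and pick one per browser session, rather than randomizing every attribute independently. The profiles below are small hypothetical examples, and the keyword names in context_kwargs() follow Playwright's new_context() options as we understand them, so adjust them for other libraries:

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class BrowserProfile:
    user_agent: str
    viewport: tuple  # (width, height) in pixels
    locale: str


# Hypothetical pool; real pools should be built from observed browser data.
PROFILES = [
    BrowserProfile(
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
        (1920, 1080),
        "en-US",
    ),
    BrowserProfile(
        "Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:109.0) Gecko/20100101 Firefox/113.0",
        (1366, 768),
        "en-US",
    ),
]


def pick_profile(rng=None):
    """Choose one complete profile so its attributes stay consistent."""
    rng = rng or random.Random()
    return rng.choice(PROFILES)


def context_kwargs(profile):
    """Map a profile to keyword arguments for a browser context."""
    width, height = profile.viewport
    return {
        "user_agent": profile.user_agent,
        "viewport": {"width": width, "height": height},
        "locale": profile.locale,
    }
```

With Playwright, the result could be passed as browser.new_context(**context_kwargs(pick_profile())). Note that the User-Agent should match the browser engine actually being launched.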
For further details, see ScrapFly's browser fingerprint tool to observe how your browser looks to Cloudflare. It collects different details about the browser, which helps make web scrapers look like regular browsers.
Keep an Eye on New Tools
Open source web scraping is tough, as each newly discovered technique is quickly patched by anti-bot services like Cloudflare, which results in a cat and mouse game.
For best results, tracking web scraping news and popular GitHub repositories can help to stay ahead of the curve:
ScrapFly Blog for the latest web scraping news and tutorials.
GitHub issue and network pages for tools like curl-impersonate and undetected-chromedriver, which are often updated with new bypass techniques and patches that are not available on the main branch.
If all that seems like too much trouble, let ScrapFly handle it for you! 👇
Bypass Cloudflare with ScrapFly
Bypassing Cloudflare protection, while possible, is very difficult. Let Scrapfly do it for you!
ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale. Each product is equipped with an automatic bypass for any anti-bot system and we achieve this by:
Maintaining a fleet of real, reinforced web browsers with real fingerprint profiles.
Millions of self-healing proxies of the highest possible trust score.
Constantly evolving and adapting to new anti-bot systems.
We've been doing this publicly since 2020 with the best bypass on the market!
It takes Scrapfly several full-time engineers to maintain this system, so you don't have to!
Here's how to scrape Cloudflare-protected pages using ScrapFly.
All we have to do is enable the asp parameter and select a proxy pool and country:
# standard web scraping code
import httpx
from parsel import Selector

response = httpx.get("URL of a web page behind a Cloudflare challenge")
selector = Selector(response.text)

# in ScrapFly becomes this 👇
from scrapfly import ScrapeConfig, ScrapflyClient

# replaces your HTTP client (httpx in this case)
scrapfly = ScrapflyClient(key="Your ScrapFly API key")

response = scrapfly.scrape(ScrapeConfig(
    url="web page URL",
    asp=True,  # enable the anti scraping protection to bypass blocking
    country="US",  # set the proxy location to a specific country
    proxy_pool="public_residential_pool",  # select a proxy pool
    render_js=True,  # enable rendering JavaScript (like headless browsers) to scrape dynamic content if needed
))

# use the built in Parsel selector
selector = response.selector
# access the HTML content
html = response.scrape_result['content']
To wrap up this guide on bypassing Cloudflare while web scraping, let's have a look at some frequently asked questions.
Is it legal to scrape Cloudflare protected pages?
Yes. Web scraping publicly available data is perfectly legal around the world as long as the scrapers do not cause damage to the website.
Is it possible to bypass Cloudflare entirely and scrape the website directly?
Since Cloudflare is a CDN service, it can be avoided by requesting the web server directly. This can be approached using its IP address by reverse engineering its DNS records. However, this method can be easily detected, so it's rarely used by web scrapers.
Is it possible to bypass Cloudflare using cache services?
Yes, public page caching services like Google Cache or Archive.org can be used to bypass Cloudflare. However, caching cycles are time-consuming, making the web scraping data outdated. Cached pages might also miss parts of the original content that are dynamically loaded.
What are other anti-bot services?
Other anti-bot WAF services commonly found on web pages include PerimeterX, Akamai, Datadome, Imperva, and Kasada. These anti-bot solutions behave similarly to Cloudflare. So, the technical concepts described in this guide can also be applied to them.
How to bypass Cloudflare VPN detection?
The VPN's IP address can be either a data center or a residential one. Data center proxies have a negative trust score with anti-bot solutions, increasing the chances of a Cloudflare JS challenge. To avoid Cloudflare detection when using VPNs, use high-quality residential or mobile proxies, as they have a positive trust score.
Cloudflare Bypass Summary
In this article, we've taken a look at how to get around Cloudflare anti-bot systems when web scraping. We have seen that Cloudflare identifies automated traffic through IP, TLS, and JavaScript fingerprinting techniques. For this, we have explored different Cloudflare bypass tips, such as:
Using high-quality residential or mobile proxies.
Using web scraping libraries that are resistant to JA3 fingerprinting.
Using automated browser libraries.
Mimicking normal browsers' behavior while scraping.