How to bypass Cloudflare when web scraping in 2023
Cloudflare is best known for its CDN service, but when it comes to web scraping it's Cloudflare Bot Management that is notorious.
Cloudflare can restrict who accesses a website's content, and this is where the need to bypass Cloudflare when web scraping arises.
To bypass Cloudflare Bot Management, we should first take a quick look at how it works. Then we can identify the challenges and design a strategy.
In this article, we'll first look at how Cloudflare uses various web technologies to calculate a trust score, and then at existing solutions that increase this trust score when web scraping.
We'll also cover common Cloudflare errors, the signs that indicate a request has failed to bypass Cloudflare, and what they mean exactly. Let's dive in!
What Is Cloudflare Bot Management?
Cloudflare Bot Management is a web service that tries to detect and block web scrapers and other bots from accessing the website.
It's a complex multi-tier service that is usually used in legitimate bot and spam prevention but it's becoming an increasingly popular way to block web scrapers from accessing public data.
To start, let's take a look at some common Cloudflare errors that scrapers encounter and what they mean.
Popular Cloudflare Errors
Most Cloudflare bot blocks result in HTTP status codes 403 (most commonly), 401, 429 and 502. More importantly, the response body contains the actual error codes and descriptions. These codes help us understand what's going on and how to bypass Cloudflare 403 errors.
There are several different error messages that indicate we have failed to bypass Cloudflare:
Cloudflare Error 1020: Access Denied is one of the most commonly encountered errors when scraping Cloudflare-protected pages, and it doesn't indicate the exact cause. So, to bypass Cloudflare 1020, full scraper fortification is needed.
Cloudflare Error 1009 comes with a message of "... has banned the country or region of your IP address". This is caused by the website being geographically locked to specific countries. So, to bypass Cloudflare 1009, proxies from allowed countries can be used. In other words, if the website is only available in the US, the scraper needs a US proxy to bypass this error.
Cloudflare Error 1015: You are being rate limited means the scraper is scraping too fast. While it's best to respect rate limits when web scraping, this limit can be set very low. To bypass Cloudflare 1015, scraper traffic should be distributed through multiple agents (proxies, browsers etc.).
There's also the Cloudflare challenge page (aka browser check), which doesn't indicate a block but rather a lack of trust that the client isn't a bot. So, to bypass the Cloudflare browser check, we can either raise our general trust score or solve the challenge (we'll cover this in the bypass section below).
Some of these Cloudflare security check pages request solving captcha challenges, though the best way to implement a Cloudflare captcha bypass is to not encounter captchas at all!
Finally, here's a list of common Cloudflare block artifacts to watch for:
Response headers might contain a cf-ray field.
The Server response header has the value cloudflare.
Set-Cookie response headers contain a __cfduid cookie.
"Attention Required!" or "Cloudflare Ray ID:" in HTML.
"DDoS protection by Cloudflare" in HTML.
CLOUDFLARE_ERROR_500S_BOX when requesting invalid URLs.
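These artifacts can be checked programmatically. Here's a minimal sketch of a detector built from the list above; the function name and the exact status code set are illustrative, not an official API:

```python
def is_cloudflare_block(status_code, headers, body):
    """Heuristically detect a Cloudflare block using common artifacts:
    block-like status codes served by Cloudflare, or known HTML markers."""
    # normalize header names for a case-insensitive lookup
    headers = {name.lower(): value for name, value in headers.items()}
    blocked_status = status_code in (401, 403, 429, 502, 503)
    served_by_cloudflare = (
        headers.get("server", "").lower() == "cloudflare" or "cf-ray" in headers
    )
    html_markers = (
        "Attention Required!",
        "Cloudflare Ray ID:",
        "DDoS protection by Cloudflare",
    )
    return (blocked_status and served_by_cloudflare) or any(
        marker in body for marker in html_markers
    )
```

A scraper can run this check on every response and retry through a different proxy or browser profile when it returns True.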
How does Cloudflare detect Web Scrapers?
To detect web scrapers, Cloudflare uses many different technologies to determine whether traffic comes from a human user or a machine.
Cloudflare combines the results of many analysis and fingerprinting methods into an overall trust score, which determines whether the user is allowed to visit the website.
On top of that, Cloudflare tracks the user's ongoing behavior and constantly adjusts the trust score.
This complex operation makes web scraping difficult but if we take a look at each individual tracking component we can see that bypassing Cloudflare when web scraping is very much possible!
TLS Fingerprinting
TLS (or SSL) negotiation is the first thing that happens when we establish a secure connection (i.e. using https instead of http): the client and the server negotiate how the data will be encrypted.
This negotiation can be fingerprinted as modern web browsers have very similar TLS capabilities that some web scrapers might be missing. This is generally referred to as JA3 fingerprinting.
For example, some web scraping libraries and tools have unique TLS negotiation patterns that can be instantly recognized, while others use the same TLS techniques as a web browser and are very difficult to differentiate.
So, use web scraping libraries that are resistant to JA3 fingerprinting.
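To see why TLS negotiation is so identifiable, here's a sketch of how a JA3 fingerprint is derived: five fields from the TLS Client Hello (version, cipher suites, extensions, elliptic curves and point formats) are concatenated and MD5-hashed. The numeric values below are made up for illustration:

```python
import hashlib

def ja3_fingerprint(tls_version, ciphers, extensions, curves, point_formats):
    """Compute a JA3 hash: five Client Hello fields joined by commas,
    values within each field joined by dashes, then MD5-hashed."""
    fields = [
        str(tls_version),
        "-".join(map(str, ciphers)),
        "-".join(map(str, extensions)),
        "-".join(map(str, curves)),
        "-".join(map(str, point_formats)),
    ]
    return hashlib.md5(",".join(fields).encode()).hexdigest()

# two clients offering even slightly different cipher lists
# produce completely different fingerprints
fp_browser = ja3_fingerprint(771, [4865, 4866, 4867], [0, 23, 65281], [29, 23, 24], [0])
fp_scraper = ja3_fingerprint(771, [4865, 4866], [0, 23, 65281], [29, 23, 24], [0])
```

Since the hash is deterministic, Cloudflare can keep lists of fingerprints known to belong to popular HTTP libraries and flag them on sight.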
IP Address Fingerprinting
Many factors play into IP address analysis. To start, there are different types of IP addresses:
Residential IPs are home addresses assigned by ISPs to average internet consumers, so they provide a positive trust score: they are mostly used by humans and are expensive to acquire.
Mobile IPs are assigned by mobile phone towers, so they also provide a positive trust score. In addition, since mobile users often share and recycle IP addresses (one tower handles everyone in its range), Cloudflare can't reliably fingerprint these IPs.
Datacenter IPs are assigned to datacenters and server platforms like Google Cloud, AWS etc., so they carry a significant negative trust score as they are most likely used by non-humans.
With IP analysis Cloudflare can have a rough guess at how trustworthy the connecting client is. For example, people rarely browse from datacenter IPs thus web scrapers using datacenter proxies are very likely to be blocked.
So, use high-quality residential or mobile proxies.
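A common pattern is to keep a pool of residential proxies and rotate through them so that no single address carries all the traffic. A minimal sketch, where the proxy addresses are placeholders:

```python
import itertools

# hypothetical residential proxy endpoints in "http://user:pass@host:port" form
residential_proxies = [
    "http://user:pass@res-proxy-1.example.com:8000",
    "http://user:pass@res-proxy-2.example.com:8000",
    "http://user:pass@res-proxy-3.example.com:8000",
]
proxy_pool = itertools.cycle(residential_proxies)

def next_proxy():
    """Return the next proxy in round-robin order."""
    return next(proxy_pool)

# each request would then go out through a different proxy, e.g. with httpx:
# httpx.get(url, proxy=next_proxy())
```

Real proxy providers usually expose a single rotating gateway endpoint instead, but the round-robin idea is the same.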
HTTP Details
Since most human users use one of a few available web browsers, HTTP connection details are an easy way to identify scrapers and bots.
To start, most of the web uses HTTP2, while many web scraping tools still use HTTP1.1, which is a dead giveaway.
In addition, HTTP2 connections can be fingerprinted as well, so scrapers using HTTP clients need to avoid standing out here. See our http2 fingerprint test page for more info.
Other HTTP connection details like request headers can influence trust score as well. For example, most web browsers order their request headers in a specific order which can be different from HTTP libraries used in web scraping.
So, make sure the headers in web scraper requests match those of a real web browser, including their ordering.
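For example, a Chrome-like header set could look like the following. Python dicts preserve insertion order, and HTTP clients generally send headers in the order given, so defining them in browser order matters. The exact values vary by browser version, so treat these as illustrative:

```python
# headers roughly mimicking Chrome on Windows, in Chrome-like order
chrome_like_headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"
    ),
    "Accept": (
        "text/html,application/xhtml+xml,application/xml;q=0.9,"
        "image/avif,image/webp,*/*;q=0.8"
    ),
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Upgrade-Insecure-Requests": "1",
}
```

A mismatch between the User-Agent (claiming Chrome) and the rest of the headers (missing or reordered) is itself a common detection signal.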
Javascript Fingerprinting
Using Javascript, Cloudflare can execute code on the client to collect details such as:
Hardware details and capabilities
Operating system details
Web browser details
That's a lot of information that can be used in calculating the trust score.
Fortunately, Javascript fingerprinting can be satisfied by running a real browser. This can be done with the Selenium, Puppeteer or Playwright browser automation libraries, which start a real headless browser and navigate it for web scraping.
So, introducing browser automation to your scraping pipeline can drastically increase the trust score.
More advanced scraping tools can even combine browser and HTTP scraping capabilities for optimal performance: use resource-intensive browsers to establish a trust score, then continue scraping with fast HTTP clients like httpx in Python (this feature is also available through Scrapfly sessions).
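One way to sketch this hybrid approach is to let a real browser pass the check once, then carry its cookies over to a fast HTTP client. The helper below converts a Playwright-style cookie list (a list of dicts with "name" and "value" keys) into the plain mapping that HTTP clients accept; the surrounding flow is shown in comments since it needs a live browser:

```python
def cookies_to_mapping(browser_cookies):
    """Flatten a Playwright-style cookie list into a name -> value dict
    that HTTP clients like httpx accept."""
    return {cookie["name"]: cookie["value"] for cookie in browser_cookies}

# Sketch of the full flow (requires playwright and httpx installed):
#
# from playwright.sync_api import sync_playwright
# import httpx
#
# with sync_playwright() as p:
#     browser = p.chromium.launch()
#     page = browser.new_page()
#     page.goto("https://example.com")       # browser passes the challenge here
#     cookies = page.context.cookies()       # includes Cloudflare clearance cookies
#     browser.close()
#
# client = httpx.Client(cookies=cookies_to_mapping(cookies))
# # ...continue scraping with cheap, fast HTTP requests
```

Note that Cloudflare ties clearance cookies to other connection details (IP, TLS), so the HTTP client generally has to reuse the same proxy the browser used.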
With all that said, the trust score is not a constant number; it is continuously adjusted over the course of the connection.
For example, if we start with a score of 80 and proceed to connect to 100 pages in a few seconds we stand out as a non-human user and that will reduce the trust score.
On the other hand, if the bot behaves human-like, the trust score can remain steady or even increase.
So, it's important to distribute web scraper traffic through multiple agents using proxies and different fingerprint configurations to prevent the trust score from dropping.
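The "multiple agents" idea can be sketched as a pool of proxy + fingerprint-profile pairs, so that no single identity accumulates suspicious behavior. All values here are placeholders; pinning each URL to one agent keeps a page's identity consistent across retries:

```python
import hashlib

# hypothetical agent pool: each agent pairs a proxy with a fingerprint profile
agents = [
    {"proxy": "http://res-proxy-1.example.com:8000", "user_agent": "Mozilla/5.0 ... Chrome/114"},
    {"proxy": "http://res-proxy-2.example.com:8000", "user_agent": "Mozilla/5.0 ... Firefox/115"},
    {"proxy": "http://res-proxy-3.example.com:8000", "user_agent": "Mozilla/5.0 ... Safari/16"},
]

def pick_agent(url):
    """Deterministically assign a URL to one agent, so the same page is
    always visited by the same proxy + fingerprint identity."""
    digest = int(hashlib.sha256(url.encode()).hexdigest(), 16)
    return agents[digest % len(agents)]
```

This spreads requests across identities while avoiding the suspicious pattern of one page being hit by many different fingerprints.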
How to Bypass Cloudflare Bot Protection?
Now that we've covered all of the parts that are used by Cloudflare to identify web scrapers - how do we blend in overall?
In practice, we have two options.
The first is to fortify HTTP-based scrapers with everything covered above: JA3-resistant HTTP clients, browser-like HTTP2 connections and headers, and high-quality residential or mobile proxies.
Alternatively, we can use real web browsers for web scraping. By controlling a real web browser we no longer need to pretend to be one, making bypassing Cloudflare much more approachable.
However, many automation tools like Selenium, Playwright and Puppeteer leave traces of their existence, which ideally should be patched to achieve higher trust scores. For that, see projects like the Puppeteer stealth plugin and other similar stealth extensions.
For sustained web scraping with a Cloudflare bypass in 2023, these browsers should also be rotated through different fingerprint profiles: screen resolution, operating system and browser type all play a role in Cloudflare's bot score.
Bypass with ScrapFly
While bypassing Cloudflare is possible, maintaining bypass strategies can be very time-consuming.
Using ScrapFly web scraping API we can defer all of this complex bypass logic to an API!
Scrapfly is not only a Cloudflare bypasser but also offers many other web scraping features:
```python
from scrapfly import ScrapflyClient, ScrapeConfig

scrapfly = ScrapflyClient(key="YOUR API KEY")
result = scrapfly.scrape(ScrapeConfig(
    url="some cloudflare protected page",
    # enable the anti scraping protection bypass
    asp=True,
    # and set proxies by country or type
    country="US",
    proxy_pool="public_residential_pool",
))
print(result.content)
```
To wrap up this article, let's take a look at some frequently asked questions regarding web scraping Cloudflare pages:
Is it legal to scrape Cloudflare protected pages?
Yes. Web scraping publicly available data is generally legal in most jurisdictions, as long as the scrapers do not cause damage to the website.
Is it possible to bypass Cloudflare entirely and scrape the website directly?
Sort of. Since Cloudflare is a CDN, it's possible to avoid it entirely by connecting to the web server directly, by discovering the server's real IP address through DNS records or reverse engineering. However, this method is easily detectable, so it's rarely used in web scraping.
Is it possible to bypass Cloudflare using cache services?
Yes, public page caching services like Google Cache or Archive.org can be used to bypass Cloudflare. However, since caching takes time the cached page data is often outdated and not suitable for web scraping. Cached pages can also be missing parts of content that are loaded dynamically.
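For example, a Google Cache lookup URL can be built by prefixing the target address. Note that Google Cache availability has been shrinking, so treat this as a sketch of the pattern rather than a reliable service:

```python
from urllib.parse import quote

def google_cache_url(url):
    """Build the Google web cache lookup URL for a given page."""
    return "https://webcache.googleusercontent.com/search?q=cache:" + quote(url, safe="")
```

The cached copy is served from Google's servers, so the request never touches the Cloudflare-protected origin at all.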
In this article, we've taken a look at how to get around Cloudflare anti-bot systems when web scraping.
Finally, we've taken a look at some frequently asked questions like alternative bypass methods and the legality of it all.
For an easier way to handle web scraper blocking and power up your web scrapers check out ScrapFly for free!