What is Error 1015 (Cloudflare) and How to Fix it?
Discover why you're seeing Cloudflare Error 1015 and learn effective ways to resolve and prevent it.
Cloudflare is mostly known for its CDN service, but in the web scraping context, it's the Cloudflare bot protection that hinders the data extraction process. To bypass Cloudflare when web scraping, we have to start by reverse engineering its challenges and how it detects HTTP requests.
In this guide, we'll start by defining what Cloudflare challenge is and how to identify its presence on web pages by exploring its common error tracebacks. Then, we'll explain how to bypass Cloudflare by exploring its fingerprinting methods and the best way to avoid each. Let's dive in!
This tutorial covers popular web scraping techniques for education. Interacting with public servers requires diligence and respect and here's a good summary of what not to do:
Scrapfly does not offer legal advice, but these are good general rules to follow in web scraping; for anything more specific, you should consult a lawyer.
Cloudflare Bot Management is a web service that tries to detect and block web scrapers and other bots from accessing the website.
It's a complex multi-tier service that is usually used in legitimate bot and spam prevention but it's becoming an increasingly popular way to block web scrapers from accessing public data.
To start, let's take a look at some common Cloudflare errors that scrapers encounter and what they mean.
Most Cloudflare bot detection errors result in HTTP status codes 401, 403, 429, or 502, with the 403 error being the most commonly encountered.
Every HTTP status code represents a unique blocking incident. Hence, knowing how to get past Cloudflare relies on identifying and understanding the error encountered.
The "please unblock challenges.cloudflare.com to proceed" error has been commonly encountered across different web pages recently. It blocks the target resource from loading with the following message:
The above error message prevents the web page from correctly loading by blocking the Cloudflare JS challenge host "challenges.cloudflare.com".
There are different causes for this error, one of them being an internal Cloudflare incident or outage, which usually gets resolved shortly. Other factors contributing to this error may be present locally and require manual debugging, such as firewalls, browser extensions, VPNs, or other security tools.
The Cloudflare 1020: access denied error is commonly encountered on various web pages, with the popular "Access Denied" message. It doesn't indicate the exact blocking cause, as it's affected by various reasons, as we'll explain. The Cloudflare 1020 bypass can be approached using complete scraper obfuscation by mimicking a real user behavior, which we'll explore later.
The Cloudflare error 1009 comes with a popular error message, "... has banned the country or region of your IP address". As described in the message, this error represents a geographical-based blocking when attempting to access a domain that's restricted to a specific country or region. Bypassing the Cloudflare 1009 error requires using a proxy server to change the IP address to one in the allowed region.
The Cloudflare error 1015: you are being rate limited represents an IP address blocking, which occurs when the HTTP requests' rate exceeds a specified threshold within a specific time frame. Splitting the requests' traffic across multiple IP addresses using proxies is crucial to prevent the IP address from getting blocked by Cloudflare protection.
The Cloudflare error 1010: access denied occurs when the browser fingerprint is detected as automated by automation libraries. To avoid the 1010 error, obfuscate the headless browser against JavaScript fingerprinting.
Additionally, here's a list of error traces indicating Cloudflare blocks content on its web page:

- A cf-ray header field value.
- A Server header with the value cloudflare.
- Set-Cookie response headers that include the __cfuid cookie field.
- A CLOUDFLARE_ERROR_500S_BOX error when requesting invalid URLs.

Some of the above Cloudflare anti-bot protection measures require solving CAPTCHA challenges. However, the best way to bypass Cloudflare CAPTCHA is to prevent it from occurring in the first place!
To detect web scrapers, Cloudflare uses different technologies to determine whether traffic is coming from a real user or an automated script for data extraction.
Anti-bot systems like Cloudflare combine the results of many different analyses and fingerprinting methods into an overall trust score. This score determines whether the HTTP requests are allowed to reach the target web pages.
Based on the final trust score, the request has three possible fates:
In addition to the above analyses, Cloudflare continuously tracks the HTTP requests' behavior and compares them with real users using machine learning and statistical models. This means the request may bypass Cloudflare a few times before getting blocked, as the trust score is likely to change.
The above complex operations make data extraction challenging. However, exploring each component individually, we'll find that Cloudflare bypass for data extraction is very much possible!
The TLS handshake is the initial procedure when a request is sent to a web server with an SSL certificate over the HTTPS protocol. During this process, the client and server negotiate the encrypted data to create a fingerprint called JA3.
Since HTTP clients differ in their capabilities and configuration, they create a unique JA3 fingerprint, which anti-bot solutions use to distinguish automated clients like web scrapers from real users using web browsers.
It's crucial to use HTTP clients that perform the TLS handshake the way normal browsers do and avoid those with easy-to-distinguish TLS patterns, as they can be instantly detected. For this, refer to ScrapFly's JA3 tool to calculate and adjust your TLS fingerprint.
For further details on TLS fingerprinting, refer to our dedicated guide.
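As a rough illustration of why clients differ, Python's standard ssl module exposes the knobs that feed into JA3, such as the cipher suites the client offers. The cipher list below is an arbitrary example for demonstration, not a recommended browser profile:

```python
import ssl

# JA3 is computed from ClientHello values: TLS version, cipher suites,
# extensions, elliptic curves, and curve formats. Changing any of these
# (e.g. the cipher list) changes the resulting fingerprint.
ctx = ssl.create_default_context()
ctx.minimum_version = ssl.TLSVersion.TLSv1_2
# hypothetical, narrowed cipher selection (TLS 1.3 suites stay enabled)
ctx.set_ciphers("ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256")

offered = [cipher["name"] for cipher in ctx.get_ciphers()]
print(offered)  # the suites this client would offer in its ClientHello
```

Two HTTP clients with different defaults here produce different JA3 hashes, which is exactly what anti-bot services compare against known browser fingerprints.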
There are different factors affecting the IP address analysis process. This process starts with the IP address type, which can be either of the following:
Residential
Represents IP addresses assigned by ISPs for consumers browsing from home networks. Residential IP addresses have a positive trust score, as they are mostly associated with real users and expensive to acquire.
Mobile
Mobile IP addresses are assigned by cellular network towers. They have a positive trust score since they are associated with human traffic. Moreover, mobile IP addresses are automatically rotated to new ones during specified intervals, making them harder for anti-bot services to detect.
Datacenter
Represents IP addresses assigned by data centers, such as AWS, Google Cloud, and Azure. Data center IPs have a significant negative trust score, as they are mostly associated with automated scripts.
With IP address fingerprinting, Cloudflare can estimate the likelihood of the connecting client being a genuine real user. For example, human users rarely browse the internet through data center proxies. Hence, web scrapers using such IPs are very likely to get blocked.
Another aspect of IP address analysis is the request rate. Anti-bot systems can detect IP addresses that exceed the defined threshold of requests and block them.
Therefore, rotate residential or mobile proxies from trusted proxy providers to prevent IP address fingerprinting. For further details on IP addresses and their trust score calculation process, refer to our dedicated guide.
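To stay under such request-rate thresholds, a scraper can also enforce a minimum delay between consecutive requests. A minimal sketch (the 2-second default is an arbitrary assumption; real thresholds vary per site):

```python
import time

class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, delay: float = 2.0):
        self.delay = delay
        self.last = 0.0  # monotonic timestamp of the previous request

    def wait(self) -> None:
        """Block until at least `delay` seconds passed since the last call."""
        elapsed = time.monotonic() - self.last
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)
        self.last = time.monotonic()

throttle = Throttle(delay=2.0)
# call throttle.wait() before every request to cap the request rate
```

Combined with proxy rotation, this keeps per-IP traffic well below the pattern Cloudflare flags as automated.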
Most users browse the web through a few popular browsers, such as Chrome, Firefox, or Edge. Since these browsers share similar configurations, their HTTP request details look very alike, making it easy for anti-bot solutions to spot outliers.
Request headers are an essential part of any HTTP request details. Anti-bot systems use them to distinguish web scraping requests from those of normal browsers. Hence, it's necessary to reverse engineer and replicate browser headers to avoid being blocked by Cloudflare protection. Here are common request headers to observe with HTTP requests.
Represents the response data type accepted by the HTTP client on the given request. It should match a common web browser's value when scraping HTML pages. In other cases, it should match the resource's data type, such as application/json
when scraping hidden APIs or text/xml
for sitemaps:
# Chrome/Safari
text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
# Firefox
text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8
Indicates the supported browser language. Setting this header not only helps mimic real browser configuration but also helps set the web scraper localization settings:
# Firefox
en-US,en;q=0.5
# Chrome
en-US,en;q=0.9
The most popular header for web scrapers. It represents the client's rendering capabilities, including the device type, operating system, browser name, and version. This header is prone to identification, and it's important to rotate the User-Agent header for further scraper obfuscation:
# Chrome
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36
# Firefox
Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:109.0) Gecko/20100101 Firefox/113.0
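Rotating the User-Agent can be as simple as drawing from a pool of realistic values. This sketch reuses the two strings shown above:

```python
import random

# realistic desktop User-Agent strings (the two examples above)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:109.0) "
    "Gecko/20100101 Firefox/113.0",
]

def random_headers() -> dict:
    """Pick a User-Agent at random, e.g. once per scraping session."""
    return {"User-Agent": random.choice(USER_AGENTS)}
```

Note that the whole header profile should rotate together: pairing a Chrome User-Agent with Firefox-style Accept values is itself a detectable inconsistency.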
The cookie header represents the cookie values sent by the request to the web server. While cookies in web scraping don't play a critical role in HTTP fingerprinting, they ensure the same website behavior of browser navigation when scraping. It can also contain specific values to authorize the requests:
Cookie: key=value; key2=value2; key3=value3
Another aspect of the HTTP details to observe is the protocol used. Most current websites and browsers operate over the HTTP2 protocol, while many HTTP clients are still tied to HTTP1.1, marking their sent requests as suspicious.
That being said, HTTP2 is provided in many HTTP clients, such as httpx and cURL, but it's not enabled by default. Use the HTTP2 fingerprint testing tool to ensure the data extraction requests use HTTP2.
So, enable HTTP2 and make sure that the headers used match a common web browser to bypass Cloudflare while web scraping. For further details on HTTP headers and their role, refer to our dedicated guide.
JavaScript provides comprehensive details about connecting clients, which are used by Cloudflare's fingerprinting mechanisms. Since JavaScript allows arbitrary code to be executed on the client side, it can be used to extract different details about the client, such as:
It seems like anti-bot services already know a lot about their clients!
Fortunately, JavaScript execution is time-consuming and prone to false positives. This means that Cloudflare's bot protection doesn't rely heavily on JavaScript fingerprinting.
Theoretically, it's possible to reverse engineer the computational JavaScript challenges and solve them using scripts. However, such a solution requires many debugging hours, even for experienced developers. Moreover, any modifications to the challenge algorithms will make the solving script outdated.
On the other hand, a much more accessible and common solution is to use a real web browser for web scraping. This can be approached using browser automation libraries, such as Selenium, Puppeteer, or Playwright.
So, introduce browser automation for the scraping pipeline to increase the trust score for a higher chance of Cloudflare bypass.
More advanced scraping tools can combine the capabilities of HTTP clients and web browsers to bypass Cloudflare. First, the browser requests the target web page to retrieve its session values and establish a trust score. Then, session values are reused with regular HTTP clients, such as httpx in Python and ScrapFly sessions feature.
For a more in-depth look, see our guide on the role of JavaScript fingerprinting in web scraping blocking.
With all the different Cloudflare anti-bot detection techniques in play, the trust score is not a constant number and is continuously adjusted throughout the connection.
For example, a client can start the connection with a Cloudflare protected website with a trust score of 80. Then, the client requests 100 pages in just a few seconds. This will decrease the trust score, as it's not likely for normal users to request at such a high rate.
On the other hand, bots with human-like behavior can get a high trust score that can remain steady or even increase. So, it's important to distribute web scraper traffic through multiple agents such as:
Now that we are familiar with the different fingerprinting factors that lead Cloudflare to detect HTTP requests, we can conclude that __bypassing Cloudflare is the result of getting a high trust score__.
Let's explore practical approaches for fortifying web scrapers against the different Cloudflare protection mechanisms!
Since Cloudflare uses JavaScript challenges and fingerprinting mechanisms to detect web scrapers, using headless browsers is often necessary.
Such an approach is available through browser automation libraries like Selenium, Playwright, and Puppeteer, which allow running real web browsers without GUI elements, known as headless browsers.
These headless browsers can automatically solve JavaScript fingerprinting challenges to bypass antibot systems instead of reverse engineering them.
As Cloudflare uses IP address analysis methods to calculate a trust score, using residential proxies helps bypass Cloudflare's IP address fingerprinting.
Moreover, web scraping at scale requires rotating proxies. This prevents IP address blocking when the requests' rate exceeds the defined limits by splitting the load across multiple IP addresses.
There are a few key differences between headless browsers and regular ones. Anti-bot solutions like Cloudflare rely on these differences to detect headless browser usage. For example, the navigator.webdriver value is set to true in automated browsers only.
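That flag can be inspected from any automation library by evaluating JavaScript in the page. A sketch with Selenium (the helper function name is ours; execute_script is Selenium's standard API):

```python
def webdriver_flag(driver) -> bool:
    """Return the navigator.webdriver value the page sees.

    `driver` is any Selenium WebDriver instance. A stock (unpatched)
    headless Chrome reports True here, while real user browsers report
    False or undefined.
    """
    return bool(driver.execute_script("return navigator.webdriver"))
```

Patched drivers like the ones covered below work by hiding exactly this kind of telltale property.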
The undetected-chromedriver is a community-patched web driver that allows Selenium to bypass Cloudflare. These patches include headless browser fixes for TLS, HTTP, and JavaScript fingerprints.
Puppeteer is a popular NodeJS library for headless browser automation, with the common fingerprinting leaks found in other automation libraries.
Just like the undetected-chromedriver, puppeteer-stealth is a plugin that patches Puppeteer to prevent anti-bot detection through multiple evasion techniques, including:

- Mocking the navigator.plugins properties to match a common browser plugin.
- Removing the navigator.webdriver value.

The stealth plugin capabilities are implemented in other browser automation libraries, such as selenium-stealth and playwright-stealth, to make Playwright Cloudflare-resistant. However, puppeteer-stealth tends to have a higher success rate with better maintainability.
FlareSolverr is a popular community tool for bypassing Cloudflare by combining the power of both headless browsers and HTTP clients. It provides comprehensive methods for managing bypass sessions.
FlareSolverr's workflow can be explained through the following steps:
Using FlareSolverr to bypass Cloudflare makes scaling web scrapers resource-effective. This is due to the smart session usage, which decreases the need to run a headless browser with each request.
curl-impersonate is a community tool that fortifies the libcurl HTTP client library to mimic the behavior of a real web browser. It patches the TLS, HTTP, and JavaScript fingerprints to make the HTTP requests look like they're coming from a real web browser.
While bare HTTP clients like curl-impersonate, cURL, or Postman limit the web scraping process due to their lack of parsing capabilities, curl-impersonate's patches can be used to modify the TLS details of HTTP clients in different programming languages. One such example is curl_cffi, a Python interface for curl-impersonate.
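For example, curl_cffi exposes browser impersonation through a requests-like interface. A sketch assuming curl_cffi is installed (pip install curl_cffi):

```python
def fetch_as_browser(url: str):
    """Fetch a URL while mimicking a real Chrome TLS/HTTP fingerprint."""
    from curl_cffi import requests  # optional dependency

    # `impersonate` selects the browser fingerprint profile to copy
    return requests.get(url, impersonate="chrome")
```

The request then carries Chrome's JA3 and HTTP/2 fingerprints while the rest of the scraping code stays in plain Python.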
To bypass behavior analysis, adjusting scraper behavior to appear more natural can drastically increase Cloudflare trust scores. In reality, most real users don't visit product URLs directly. They often explore websites in steps like:
Prefixing the scraping logic with this warm-up behavior can make the scraper appear more human-like and help it pass behavior analysis.
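A warm-up pass can be sketched as a short, human-paced crawl toward the target. The URLs here are placeholders, and `get` stands for any fetch callable (such as httpx.get):

```python
import random
import time

# hypothetical browsing path mimicking a real visitor
WARMUP_PATH = [
    "https://example.com/",             # landing page first
    "https://example.com/category/",    # then a category listing
    "https://example.com/product/123",  # finally the target product
]

def warm_up(get, urls=WARMUP_PATH, pause=(1.0, 3.0)) -> None:
    """Visit pages in a human-like order with small random pauses."""
    for url in urls:
        get(url)
        time.sleep(random.uniform(*pause))
```

The randomized pauses matter as much as the order: perfectly regular request intervals are themselves a bot signal.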
For sustained web scraping and Cloudflare bypass in 2024, headless browsers should constantly be blended with different, realistic fingerprint profiles: screen resolution, operating system, and browser type all play an essential role in bypassing Cloudflare.
Each headless browser library can be configured to use different resolution and rendering capabilities. Distributing scraping through multiple realistic browser configurations can prevent Cloudflare from detecting the scraper.
For further details, see ScrapFly's browser fingerprint tool to observe how your browser looks to Cloudflare. It collects different details about the browser, which helps make web scrapers look like regular browsers.
Open source web scraping is tough, as each newly discovered technique is quickly patched by anti-bot services like Cloudflare, resulting in a cat-and-mouse game.
For best results, tracking web scraping news and popular GitHub repositories can help to stay ahead of the curve:
If all that seems like too much trouble, let ScrapFly handle it for you! 👇
Bypassing Cloudflare protection while possible is very difficult - let Scrapfly do it for you!
ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale. Each product is equipped with an automatic bypass for any anti-bot system and we achieve this by:
It takes Scrapfly several full-time engineers to maintain this system, so you don't have to!
Here's how to scrape Cloudflare-protected pages using ScrapFly.
All we have to do is enable the asp
parameter and select a proxy pool and country:
```python
# standard web scraping code
import httpx
from parsel import Selector

response = httpx.get("some web page with cloudflare challenge URL")
selector = Selector(response.text)

# in ScrapFly becomes this 👇
from scrapfly import ScrapeConfig, ScrapflyClient

# replaces your HTTP client (httpx in this case)
scrapfly = ScrapflyClient(key="Your ScrapFly API key")

response = scrapfly.scrape(ScrapeConfig(
    url="web page URL",
    asp=True,  # enable the anti scraping protection to bypass blocking
    country="US",  # set the proxy location to a specific country
    proxy_pool="public_residential_pool",  # select a proxy pool
    render_js=True,  # enable JavaScript rendering (like headless browsers) to scrape dynamic content if needed
))

# use the built-in Parsel selector
selector = response.selector
# access the HTML content
html = response.scrape_result['content']
```
Scrapfly is easily accessible using Python and TypeScript SDKs.
To wrap up this guide on bypassing Cloudflare while web scraping, let's have a look at some frequently asked questions.
Generally, yes. Web scraping publicly available data is legal in most jurisdictions as long as the scrapers do not cause damage to the website.
Since Cloudflare is a CDN service, it can be avoided by requesting the web server directly. This can be approached using its IP address by reverse engineering its DNS records. However, this method can be easily detected, so it's rarely used by web scrapers.
Yes, public page caching services like Google Cache or Archive.org can be used to bypass Cloudflare. However, caching cycles are time-consuming, making the web scraping data outdated. Cached pages might also miss parts of the original content that are dynamically loaded.
Other anti-bot WAF services commonly found on web pages include PerimeterX, Akamai, Datadome, Imperva, and Kasada. These anti-bot solutions behave similarly to Cloudflare. So, the technical concepts described in this guide can also be applied to them.
The VPN's IP address can be either a data center or residential address. Data center proxies have a negative trust score with anti-bot solutions, increasing the chances of a Cloudflare JS challenge. To avoid Cloudflare detection when using VPNs, use high-quality residential or mobile proxies, as they have a positive trust score.
In this article, we've taken a look at how to get around Cloudflare anti-bot systems when web scraping. We have seen that Cloudflare identifies automated traffic through IP, TLS, and JavaScript fingerprinting techniques. For this, we have explored different Cloudflare bypass tips, such as: