What is Error 1015 (Cloudflare) and How to Fix it?
Discover why you're seeing Cloudflare Error 1015 and learn effective ways to resolve and prevent it.
Web scraping blocking is the biggest challenge encountered when extracting data from web pages. There are hundreds of different reasons for this behavior, but they can all be reduced to a single fact: web scraper connections appear different from those of real web browsers.
In this guide, we'll explain how to scrape data without getting blocked by exploring five factors websites use to detect web crawlers: request headers, IP addresses, security handshakes, honeypots, and JavaScript execution. Let's explore each factor in detail!
To start, let's briefly overview all of the ways web scrapers are being detected. Once we know the flaws of web scrapers, we can start patching them by applying popular tools and techniques.
Note that this is critical knowledge for scraping without getting blocked, as web scraping and blocking techniques evolve very rapidly and require constant updates.
The easiest way to detect and block web scraping requests is through header analysis. Headers are a crucial part of every HTTP connection and include essential metadata. If the web crawler's headers differ from those of a normal user, the request can get blocked. For example, configuring the User-Agent string is critical.
To scrape data from a web page without getting detected, we have to carefully configure headers:
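For example, here's a minimal sketch using Python's httpx client; the header values below are illustrative and should be copied from the real browser you intend to mimic (httpbin.dev is only used as a test endpoint that echoes the request back):

import httpx

# browser-like headers - the exact values are illustrative,
# copy them from a real browser session of the browser you're mimicking
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

response = httpx.get("https://httpbin.dev/headers", headers=headers)
print(response.text)  # echoes the headers the server received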
Another important header to pay attention to is the Cookie header, which represents the regular HTTP cookies. Usually, cookie values contain localization, authorization, and user details. Correctly adding these values can help avoid detection, especially when scraping hidden APIs.
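As a quick sketch, cookie values can be attached with httpx; the cookie names and values here are placeholders that would normally be captured from a real browser session:

import httpx

# placeholder cookie values - in practice these are captured from a real browser session
cookies = {"session_id": "placeholder-session-value", "locale": "en-US"}
response = httpx.get("https://httpbin.dev/cookies", cookies=cookies)
print(response.text)  # echoes the cookies the server received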
Finally, enable HTTP2 to web scrape without getting blocked. Most websites operate over the HTTP2 protocol, while the majority of HTTP clients still rely on HTTP1.1 for communication. Some HTTP clients, such as Python's httpx and cURL, do support HTTP2, but it's not enabled by default.
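Here's a minimal sketch with httpx; note that HTTP2 support requires installing the optional extra (pip install "httpx[http2]"):

import httpx

# http2=True enables HTTP2 negotiation (requires the httpx[http2] extra)
with httpx.Client(http2=True) as client:
    response = client.get("https://httpbin.dev/anything")
    print(response.http_version)  # "HTTP/2" when the server supports it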
The IP address is included with every HTTP request sent, revealing several details such as the location, ISP, and reputation. Many websites have access to this information, and if it looks suspicious, the IP can get blocked. Another important aspect is the request rate: websites and anti-bot services can block web scraping if too many requests are sent from the same IP address.
A solution while scraping data is to hide the IP address using a proxy server, ideally rotating proxies. There are different types of proxy IPs, and each has a trust score. A higher trust score means a better proxy IP: datacenter IPs have the lowest trust score, while residential and mobile IPs are much harder to tell apart from real users.
To summarize, using a rotating proxy pool with residential IPs can help web scraping without getting blocked.
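Here's a minimal sketch of per-request proxy rotation with httpx; the proxy URLs are placeholders for your own pool, and note that recent httpx versions take a proxy argument (older ones use proxies):

import random
import httpx

# placeholder proxy URLs - replace with your own (ideally residential) proxy pool
PROXY_POOL = [
    "http://username:password@proxy-1.example.com:8000",
    "http://username:password@proxy-2.example.com:8000",
    "http://username:password@proxy-3.example.com:8000",
]

def scrape(url: str) -> httpx.Response:
    # pick a random proxy for every request to distribute the traffic
    proxy = random.choice(PROXY_POOL)
    with httpx.Client(proxy=proxy) as client:
        return client.get(url)

response = scrape("https://httpbin.dev/ip")
print(response.text)  # echoes the IP address the server saw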
Transport Layer Security (TLS) is an end-to-end encryption protocol used in all HTTPS connections. HTTP clients perform the TLS handshake differently, leading to a unique fingerprint called JA3. If the generated fingerprint is different from that of regular browsers, it can lead to web scraping blocking.
Here is how to mimic a JA3 fingerprint of a normal web browser:
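One practical approach is a sketch using the curl_cffi package (covered again in the tools section below); the impersonate target shown is an assumption and may need a versioned string depending on the installed package version:

from curl_cffi import requests

# impersonate Chrome's TLS handshake (JA3) and HTTP2 fingerprint;
# older curl_cffi versions may require a versioned target like "chrome110"
response = requests.get("https://httpbin.dev/anything", impersonate="chrome")
print(response.status_code)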
JavaScript-based fingerprinting applies when web scraping with a headless browser automation tool, such as Selenium, Playwright, or Puppeteer. Since these web scraping tools are JavaScript-enabled, the target website can execute remote code on the client's machine. This remote code execution can reveal several details about the client, such as the browser name and version, the operating system, and automation-specific variables. These details can then be used to identify non-human connections, for example when Selenium shows up as the browser name or when the navigator.webdriver variable is set:
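For instance, here's a small sketch using Playwright from Python (assuming Playwright and its Chromium build are installed) showing what a fingerprinting script would see on an unpatched automated browser:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://httpbin.dev/html")
    # a fingerprinting script can read this flag to detect browser automation
    print(page.evaluate("navigator.webdriver"))  # True on an unpatched automated browser
    browser.close()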
That being said, these leaks can be spoofed! A common tip to prevent JavaScript fingerprinting and scrape websites without blocking is to override such leaking values before the page's own scripts run:
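As a hedged sketch with Playwright, the webdriver flag from the previous example can be hidden by injecting a patch before any page script executes; dedicated patches like undetected-chromedriver and puppeteer-stealth (covered below) go much further:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    # override the leaking variable before any of the page's own scripts run
    page.add_init_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")
    page.goto("https://httpbin.dev/html")
    print(page.evaluate("navigator.webdriver"))  # now prints None (undefined)
    browser.close()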
Honeypots are traps used to lure attackers and bot traffic. There are different types of honeypot traps; the ones most commonly applicable to web scrapers are hidden links. These links are placed in HTML tags and buttons that aren't visible to real users but remain in the page markup, where bots such as web crawlers can still find them.
When a web crawler interacts with such links, it gets identified and blocked. To web scrape without getting blocked, avoid requesting unnecessary links and only follow direct ones.
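For example, here's an illustrative sketch using parsel that skips links hidden with inline styles or the hidden attribute; real honeypots can be concealed in many other ways (CSS classes, off-screen positioning), so treat this as a starting point:

from parsel import Selector

html = """
<a href="/products">Products</a>
<a href="/trap" style="display:none">Hidden trap</a>
<a href="/trap2" hidden>Another trap</a>
"""

selector = Selector(text=html)
visible_links = []
for link in selector.css("a"):
    style = (link.attrib.get("style") or "").replace(" ", "")
    # skip links that real users can't see - likely honeypot traps
    if "display:none" in style or "visibility:hidden" in style or "hidden" in link.attrib:
        continue
    visible_links.append(link.attrib["href"])

print(visible_links)  # ['/products']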
All of the above web scraping blocking techniques are used by numerous anti-scraping protection services. For an in-depth look at each anti-scraping protection service, refer to our dedicated guides.
🧙 Frustrated with these anti-scraping protection services? Try out the ScrapFly asp feature for free!
One of the most popular anti-scraping services used by numerous websites, such as Zoopla, G2, and Glassdoor.
A very old anti-scraping service that's used by many popular websites including StockX and Realtor.
A bot manager that uses different anti-scraping mechanisms covering websites like Instagram and BestBuy.
An anti-scraping service that's popular with European websites, such as Leboncoin, Seloger, and Etsy.
A tricky bot manager that completely blocks web scrapers, found on Australian websites such as Realestate and Domain.
Another anti-scraping service covering websites like Indeed.
Now that we've covered the main factors that can lead to web scraping blocking, we can start implementing some resistance. Ideally, we have to cover all of the above factors.
This means we'll have to combine at least one or two of these tools to ensure we're not getting blocked. See the list below for which tools solve which problem:
curl-impersonate and its Python implementation curl_cffi are modified versions of cURL that mimic the TLS handshake of major web browsers, preventing TLS-based scraping blocking. These tools also configure HTTP headers and fortify the HTTP protocol implementation itself.
The only reliable way to address IP address fingerprinting is to use a rotating proxy pool with residential IPs. These IPs are assigned to home networks by internet providers, making it difficult to differentiate scrapers from real human users.
Note that as we've covered in the IP address section, there are different types of proxy IPs, and each has a trust score. A higher trust score means a better proxy IP. So, not all residential IPs are equal.
undetected-chromedriver is a modified Selenium driver that mimics regular browser behavior, for example by randomizing header values and User-Agents and hiding JavaScript automation leaks.
Using undetected-chromedriver can help with both JavaScript and TLS fingerprinting.
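Here's a minimal sketch of its drop-in Selenium usage (assuming Chrome is installed locally; the URL is just a placeholder test page):

import undetected_chromedriver as uc

# a drop-in replacement for Selenium's Chrome driver with common automation leaks patched
driver = uc.Chrome()
driver.get("https://httpbin.dev/html")
print(driver.title)
driver.quit()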
Just like undetected-chromedriver, puppeteer-stealth is a headless browser patch that hides common JavaScript leaks and fingerprinting techniques for Puppeteer. It also randomizes variable values like viewport and ensures IP-bound variables match the used proxy IP address details.
While puppeteer-stealth isn't as popular as undetected-chromedriver, it's the only reliable option for fortifying headless browsers when scraping with JavaScript.
FlareSolverr is a JavaScript challenge solver for Cloudflare - by far the biggest anti-bot service blocking web scrapers. This tool is specific to bypassing Cloudflare and won't help with other anti-scraping services, though considering how popular Cloudflare is, it's likely to be very useful.
See our main article on how to bypass Cloudflare for more.
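For reference, here's a sketch of calling FlareSolverr over HTTP, assuming an instance is already running locally on its default port 8191; the target URL is a placeholder:

import httpx

# FlareSolverr runs as a separate service; this assumes it listens on localhost:8191
payload = {
    "cmd": "request.get",
    "url": "https://example.com",  # placeholder for a Cloudflare-protected page
    "maxTimeout": 60000,  # milliseconds to wait for the challenge to be solved
}
response = httpx.post("http://localhost:8191/v1", json=payload, timeout=70)
result = response.json()
print(result["status"])  # "ok" when the challenge was solved
# the solved page HTML is available under result["solution"]["response"]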
All of the above tools can easily be rendered useless when paired with lazy scraping techniques. For best results, correctly distribute requests and replicate human behavior to avoid detection. This is most commonly done through proxy rotation and careful tracking of session cookies and scraping patterns.
If that's all too overwhelming, why not start with Scrapfly, which manages all of this for you!
Bypassing anti-bot systems while possible is often very difficult - let Scrapfly do it for you!
ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale. Each product is equipped with an automatic bypass for any anti-bot system and we achieve this by:
It takes Scrapfly several full-time engineers to maintain this system, so you don't have to!
Here's how we can scrape data without being blocked using ScrapFly. All we have to do is enable the asp parameter, select the proxy pool (datacenter or residential), and the proxy country:
from scrapfly import ScrapflyClient, ScrapeConfig, ScrapeApiResponse
scrapfly = ScrapflyClient(key="Your ScrapFly API key")
response: ScrapeApiResponse = scrapfly.scrape(ScrapeConfig(
    url="the target website URL",
    # select a proxy pool
    proxy_pool="public_residential_pool",
    # select the proxy country
    country="us",
    # enable the ASP to bypass any website's blocking
    asp=True,
    # enable JS rendering, similar to headless browsers
    render_js=True,
))
# get the page HTML content
print(response.scrape_result['content'])
# use the built-in parsel selector
selector = response.selector
To wrap up this guide on scraping blocking, let's have a look at some frequently asked questions.
Yes, there are multiple open-source tools for hiding web scraper traces, including curl-impersonate, undetected-chromedriver, and puppeteer-stealth.
CAPTCHAs are anti-bot challenges that prevent bots from accessing websites. Avoiding these challenges in the first place is a better alternative to bypassing them, which can be approached using the same technical concepts described in this guide. For further details, refer to our guide on bypassing CAPTCHAs.
In this guide, we explained how to scrape without getting blocked, which we split into 5 categories: Headers, IP address, Honeypots, TLS, and JavaScript fingerprinting.
If your web scraper is blocked, start by looking at the request headers and their order. If you're using popular HTTP clients, then it might be TLS fingerprinting. If blocks begin only after several requests, then your IP address will likely be tracked. If you're using browser automation (such as Selenium, Puppeteer, or Playwright), then JavaScript fingerprinting is giving you away.