How to Scrape Without Getting Blocked? In-Depth Tutorial


Getting blocked is the biggest challenge encountered when extracting data from web pages. There are hundreds of different reasons for this behavior, but they can be reduced to a single fact: a web scraper's connection appears different from that of a real web browser.

In this guide, we'll explain how to scrape data without getting blocked by exploring five factors websites use to detect web crawlers: request headers, IP addresses, security handshakes, honeypots, and JavaScript execution. Let's explore each factor in detail!

Headers

The easiest way to detect and block web scraping requests is through header analysis. Headers are a crucial part of every HTTP connection and include essential metadata. If a web crawler's headers differ from those of a normal user, the requests can get blocked. For example, configuring the User-Agent string is critical.

To scrape data from a web page without getting detected, we have to carefully configure headers:

  • Ensure the HTTP request header matches a real browser.
  • Aim for the common header values of a major browser, such as Chrome on Windows or Safari on MacOS.
  • Randomize the header values when scraping at scale, such as User-Agent rotation.
  • Ensure the header order matches that of a regular browser and that your HTTP client respects the header order.
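
As a minimal sketch, here's what such a header configuration could look like in Python with the httpx client. The header values mimic a Chrome-on-Windows profile, and the User-Agent pool is illustrative, not exhaustive:

import random
import httpx

# a small illustrative pool of real browser User-Agent strings to rotate through
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.1 Safari/605.1.15",
]

# common Chrome-like headers, declared in a browser-like order
headers = {
    "User-Agent": random.choice(USER_AGENTS),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
}

response = httpx.get("https://httpbin.dev/headers", headers=headers)
print(response.text)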

Another important header to pay attention to is the Cookie header, which carries the regular HTTP cookies. Cookie values usually contain localization, authorization, and user details. Correctly adding these values can help avoid detection, especially when scraping hidden APIs.

Finally, enable HTTP/2 to scrape without getting blocked. Most websites operate over the HTTP/2 protocol, while the majority of HTTP clients still rely on HTTP/1.1 for communication. Some HTTP clients, such as Python's httpx and cURL, do support HTTP/2, but it's often not enabled by default.
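
For example, here's a minimal sketch of enabling HTTP/2 in httpx. It assumes the h2 extra is installed (pip install "httpx[http2]"), and the cookie values are purely illustrative:

import httpx

# http2=True upgrades the connection to HTTP/2 whenever the server supports it
client = httpx.Client(
    http2=True,
    # illustrative cookie values - reuse the ones a real browser session would carry
    cookies={"locale": "en-US"},
)
response = client.get("https://httpbin.dev/anything")
print(response.http_version)  # "HTTP/2" when the negotiation succeeds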

How Headers Are Used to Block Web Scrapers and How to Fix It

Learn about HTTP headers, what the common ones mean, and how to utilize them effectively during the web scraping process.


IP Address

The IP address is included with every HTTP request sent, revealing details about the location, ISP, and reputation of the client. Many websites have access to this information, and if it looks suspicious, the IP can get blocked. Another important aspect is the request rate: websites and anti-bot services can block web scrapers that send too many requests from the same IP address.

The solution is to hide the IP address behind a proxy server, ideally a pool of rotating proxies. There are different types of proxy IPs, and each has a trust score. A higher trust score means a better proxy IP:

  • Residential
    IPs assigned to home networks by internet providers. They have a positive trust score, as they are used by real users. However, they are expensive to acquire.
  • Datacenter
    IPs assigned to cloud networks by data center infrastructure, such as AWS, Google Cloud, and Azure. They have a negative trust score, as they are commonly associated with bot traffic.
  • Mobile
    IPs assigned to mobile networks by mobile towers. They have a positive trust score, as they are associated with real human behavior. Mobile IPs are dynamic and get rotated automatically, making them more difficult to detect.

To summarize, using a rotating proxy pool of residential IPs helps you scrape without getting blocked.
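
As a minimal sketch, here's what rotating a proxy pool could look like with httpx (recent versions support the proxy argument; the proxy URLs below are placeholders for your own residential proxy endpoints):

import random
import httpx

# placeholder residential proxy endpoints - substitute your provider's URLs
PROXY_POOL = [
    "http://username:password@proxy1.example.com:8080",
    "http://username:password@proxy2.example.com:8080",
]

# pick a random proxy for each request so no single IP accumulates the traffic
with httpx.Client(proxy=random.choice(PROXY_POOL)) as client:
    response = client.get("https://httpbin.dev/ip")
    print(response.text)  # shows the proxy's IP address, not yours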

How to Avoid IP Address Blocking?

Check out our comprehensive guide to IP addresses. Learn about the different types of addresses, how they function, and their role in web scraping blocking.


TLS Fingerprint

Transport Layer Security (TLS) is an end-to-end encryption protocol used in all HTTPS connections. HTTP clients perform the TLS handshake differently, producing a unique fingerprint known as JA3. If the generated fingerprint differs from that of a regular browser, it can lead to web scraping blocking.

Here is how to mimic a JA3 fingerprint of a normal web browser:

  • Analyze and mimic a web browser handshake, which differs from an HTTP client's. The usual suspects are the "Cipher Suite" and "Extensions" fields, which vary from client to client. A popular tool that mimics these fields is Curl Impersonate, as shown in the sketch after this list.
  • The JA3 fingerprinting technique identifies the client software rather than individual machines. The main goal is to get identified as a commonly used browser, not as a web scraping framework or library. You can use the ScrapFly JA3 fingerprinting tool to test yours.
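
For example, here's a minimal sketch using the curl_cffi Python package, which wraps Curl Impersonate in a requests-like API. The exact impersonation target names (e.g. "chrome") can vary between package versions, and the URL placeholder follows the same convention as the ScrapFly example later in this article:

# curl_cffi ships a requests-like API on top of Curl Impersonate
from curl_cffi import requests

# impersonate replays Chrome's TLS handshake, Cipher Suites and Extensions included
response = requests.get("the ScrapFly JA3 tool URL", impersonate="chrome")
print(response.text)  # the reported JA3 fingerprint should now match Chrome's
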
How TLS Fingerprint is Used to Block Web Scrapers?

Check out our comprehensive guide to TLS fingerprinting. Learn about the JA3 algorithm and how to protect your web scrapers from fingerprinting.


JavaScript

JavaScript-based fingerprinting applies when web scraping with a headless browser automation tool, such as Selenium, Playwright, or Puppeteer. Since these web scraping tools are JavaScript-enabled, the target website can execute remote code on the client's machine. This remote code execution can reveal several details about the client, such as:

  • Hardware capabilities
  • JavaScript runtime details
  • Web browser information
  • Operating system information

The above details can be used to identify non-human connections, such as Selenium appearing as the browser name or the navigator.webdriver variable being set:

illustration of natural vs automated browser

That being said, these leaks can be spoofed! Here are common tips to prevent JavaScript fingerprinting and scrape websites without getting blocked:

  • Ensure commonly known leaks are hidden, such as the default User-Agent string. This can be done with common headless browser patches, such as Puppeteer-stealth, or manually as shown in the sketch after this list.
  • Randomize variable values like viewport when web scraping at scale.
  • Ensure IP-bound variables like location and timezone match the details of the proxy IP address being used.
  • Mimic human browsing behavior by adding random pauses and mouse movements.
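
Here's a minimal Playwright sketch illustrating a few of these tips. The viewport pool, timezone, and locale values are illustrative, and dedicated patches like Puppeteer-stealth cover far more leaks than shown here:

import random
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context(
        # randomize the viewport so all scrapers don't share one telltale size
        viewport=random.choice([
            {"width": 1920, "height": 1080},
            {"width": 1366, "height": 768},
        ]),
        # match the proxy IP's location details (illustrative values)
        timezone_id="America/New_York",
        locale="en-US",
    )
    # hide the navigator.webdriver flag before any page script runs
    context.add_init_script(
        "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
    )
    page = context.new_page()
    page.goto("https://httpbin.dev/html")
    # mimic human behavior with random mouse movement and pauses
    page.mouse.move(random.randint(0, 500), random.randint(0, 500))
    page.wait_for_timeout(random.uniform(1000, 3000))
    browser.close()
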
How Javascript is Used to Block Web Scrapers? In-Depth Guide

See our full introduction to JavaScript-based fingerprinting of web scrapers: how browser automation libraries leak the fact that they're being used, how the browser's JavaScript environment leaks unique details that can be used in fingerprinting, and how to fix all that with some clever scripting.


Honeypots

Honeypots are traps used to lure attackers and bot traffic. There are different types of honeypot traps; the common ones applicable to web scrapers are hidden links. These are placed in HTML tags and buttons that aren't visible to real users but are still discovered by bots, such as web crawlers.

When a web crawler interacts with such links, it gets identified and blocked. To web scrape without getting blocked, avoid requesting unnecessary links and only follow the ones a real user could see, as in the sketch below.
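
As a minimal illustration, here's how obviously hidden honeypot links could be filtered out with the parsel library before following them; real honeypots can use subtler hiding techniques (CSS classes, off-screen positioning) than these checks cover:

from parsel import Selector

html = """
<a href="/products">Products</a>
<a href="/trap" style="display:none">secret</a>
<a href="/trap2" hidden>secret</a>
"""

selector = Selector(text=html)
safe_links = [
    link.attrib["href"]
    for link in selector.css("a")
    # skip anchors hidden from real users - likely honeypot traps
    if "display:none" not in link.attrib.get("style", "").replace(" ", "")
    and "hidden" not in link.attrib
]
print(safe_links)  # ['/products']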

What are Honeypots and How to Avoid Them in Web Scraping

Learn what honeypots are, how they work and how to avoid them while scraping.


Anti Scraping Protection Services

All of the above web scraping blocking techniques are used by numerous anti-scraping protection services. For an in-depth look at each anti-scraping protection service, refer to our dedicated guides.

🧙 Frustrated with these anti-scraping protection services? Try out the ScrapFly asp feature for free!

Cloudflare

One of the most popular anti-scraping services used by numerous websites, such as Zoopla, G2, and Glassdoor.

PerimeterX

A long-standing anti-scraping service that's used by many popular websites, including StockX and Realtor.

Akamai

A bot manager that uses different anti-scraping mechanisms covering websites like Instagram and BestBuy.

Datadome

An anti-scraping service that's popular with European websites, such as Leboncoin, Seloger, and Etsy.

Kasada

A tricky bot manager that completely blocks web scrapers, found on Australian websites such as Realestate and Domain.

Imperva Incapsula

Another anti-scraping service covering websites like Indeed.

Bypass Web Scraping Blocking With ScrapFly

As we have seen, scraper blocking is a diverse subject: tons of small details can expose you as a bot!

This is why ScrapFly was founded: abstracting away all the blocking-bypass logic makes web scraping code much cleaner and easier to maintain. ScrapFly allows for scraping at scale by providing:

ScrapFly service does the heavy lifting for you!

Here's how we can scrape data without getting blocked using ScrapFly. All we have to do is enable the asp parameter and select the proxy pool (datacenter or residential) and the proxy country:

from scrapfly import ScrapflyClient, ScrapeConfig, ScrapeApiResponse

scrapfly = ScrapflyClient(key="Your ScrapFly API key")

response: ScrapeApiResponse = scrapfly.scrape(ScrapeConfig(
    url="the target website URL",
    # select a proxy pool
    proxy_pool="public_residential_pool",
    # select the proxy country
    country="us",
    # enable the ASP to bypass any website's blocking
    asp=True,
    # enable JS rendering, similar to headless browsers
    render_js=True,
))

# get the page HTML content
print(response.scrape_result['content'])
# use the built-in parsel selector
selector = response.selector

FAQ

To wrap up this guide on scraping blocking, let's have a look at some frequently asked questions.

Are there web scraping tools to scrape without getting blocked?

Yes, there are multiple open-source tools that help hide web scraper traces, including:

  • FlareSolverr: A proxy server that bypasses Cloudflare by solving its challenges in a headless browser while also managing sessions for subsequent requests.
  • CloudProxy: A tool for creating proxy servers with datacenter IPs using cloud machines, preventing IP address identification.
  • Undetected ChromeDriver: A modified Selenium driver that mimics regular browsers' behavior, such as randomizing header values and User-Agents.
  • Curl Impersonate: A modified version of cURL that mimics the TLS handshake of major web browsers, preventing TLS scraping blocking.

How to bypass CAPTCHA while scraping?

CAPTCHAs are anti-bot challenges that prevent bots from accessing websites. Avoiding the challenges in the first place is a better alternative to solving them, which can be approached using the same technical concepts described in this guide. For further details, refer to our guide on bypassing CAPTCHAs.

Summary

In this guide, we explained how to scrape without getting blocked, which we split into 5 categories: headers, IP address, TLS fingerprinting, JavaScript fingerprinting, and honeypots.

If your web scraper is blocked, start by looking at the request headers and their order. If you're using a popular HTTP client, the cause might be TLS fingerprinting. If blocks begin only after several requests, your IP address is likely being tracked. If you're using browser automation (such as Selenium, Puppeteer, or Playwright), then JavaScript fingerprinting is giving you away.

Related Posts

How to Know What Anti-Bot Service a Website is Using?

In this article, we'll take a look at two popular tools, WhatWaf and Wafw00f, which can identify what WAF service a website is using.

Use Curl Impersonate to scrape as Chrome or Firefox

Learn how to prevent TLS fingerprinting by impersonating normal web browser configurations. We'll start by explaining what Curl Impersonate is, how it works, and how to install and use it. Finally, we'll explore using it with Python to avoid web scraping blocking.

FlareSolverr Guide: Bypass Cloudflare While Scraping

In this article, we'll explore the FlareSolverr tool and how to use it to get around Cloudflare while scraping. We'll start by explaining what FlareSolverr is, how it works, and how to install and use it. Let's get started!