How to Scrape Without Getting Blocked? In-Depth Tutorial


One of the biggest challenges in web scraping is blocking, which can be triggered by hundreds of different signals. However, all of these signals boil down to a single fact - web scraper connections look different from web browser connections.

What makes web scraper connections so easy to identify?
In this article, we'll take a look at web scraping without getting blocked by exploring 4 core areas where web scrapers fail to cover their tracks and how analysis of these details can lead to blocking: request headers, IP addresses, TLS handshakes and the JavaScript execution context. Each poses a unique threat when it comes to web scraping blocking.

Headers

The easiest way to detect a web scraping connection is request header analysis. Headers are part of every HTTP connection and include important metadata. If our web scraper connects with headers unlike those of a web browser, it can be easily identified. To prevent this, we need to understand how headers work, how web browsers present them and how we can replicate that in our web scraping code. To summarize:

  • Ensure header values match those of a common web browser.
  • For variable values, aim for common configurations like Chrome on Windows or Safari on macOS.
  • Randomize some variable values when scraping at scale.
  • Ensure the header order matches that of a web browser and that your HTTP client respects header ordering - see the sketch below.
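
For example, here's a minimal sketch of browser-like headers using the httpx library (any HTTP client that preserves header order works the same way). The values are illustrative - a real Chrome version string will drift over time:

# pip install httpx[brotli]  (brotli extra decodes "br" responses)
import httpx

# header values mimicking Chrome on Windows - the order matters too,
# as some servers compare it against known browser header ordering
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
}

response = httpx.get("https://httpbin.org/headers", headers=BROWSER_HEADERS)
print(response.json())  # echoes back the headers the server received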
How Headers Are Used to Block Web Scrapers and How to Fix It

See our full introduction article on request headers in the context of web scraper blocking: what common headers mean, how they are used in web scraper identification and how we can fix this.


IP Address

Another piece of metadata instantly included with every HTTP connection is the IP address itself. When web scraping, we often use proxies to avoid sending an inhuman amount of requests through a single connection. This is great for avoiding detection by traffic analysis, but not all IP addresses are equal - some perform much better in web scraping than others. To summarize:

  • Avoid using datacenter IPs.
  • Diversify the IP pool across many subnets rather than just individual addresses.
  • Inspect IP metadata to further diversify pools by ASN and other ownership identifiers - see the rotation sketch below.
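
To illustrate, here's a minimal proxy rotation sketch using httpx (assuming a recent version where the proxy argument is available). The proxy URLs are hypothetical placeholders; a real pool would come from a proxy provider and ideally span multiple subnets and ASNs:

import random
import httpx

# hypothetical proxy pool - note each address sits on a different subnet
PROXY_POOL = [
    "http://user:pass@192.0.2.10:8000",
    "http://user:pass@198.51.100.22:8000",
    "http://user:pass@203.0.113.5:8000",
]

def scrape(url: str) -> httpx.Response:
    # pick a random proxy per request to spread traffic across the pool
    proxy = random.choice(PROXY_POOL)
    with httpx.Client(proxy=proxy) as client:
        return client.get(url)

print(scrape("https://httpbin.org/ip").text)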
How to Avoid Web Scraper IP Blocking?

See our full introduction article on IP addresses: what types of addresses there are, how they work and how all of this information can be used to identify and track web scrapers.


TLS

Transport Layer Security (TLS) is an end-to-end encryption protocol used in all HTTPS connections. The TLS handshake process can lead to web scraper identification and fingerprinting, because every HTTP client - be it a programming library or a web browser - performs the initial TLS handshake slightly differently. These minor differences are collected and compiled into a fingerprint called JA3.

To summarize:

  • Analyze the TLS handshake to understand how a web browser's handshake differs from that of an HTTP client library. The usual culprits are the "Cipher Suites" and "Extensions" fields, which leave a unique fingerprint for each client library.
  • The JA3 fingerprinting technique is good for tracking software but not individual machines. The main goal is to avoid being identified as commonly known software like a web scraping framework or library - see the sketch below.
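
As an illustration, the curl_cffi package (based on the Curl Impersonate project covered in the related posts below) can replicate a real browser's TLS handshake so the resulting JA3 fingerprint matches Chrome's rather than a Python library's. A minimal sketch, assuming a recent curl_cffi version:

# pip install curl_cffi
from curl_cffi import requests

# impersonate Chrome's TLS handshake so the JA3 fingerprint
# matches a real browser rather than a Python HTTP library
response = requests.get(
    "https://tls.browserleaks.com/json",  # public endpoint that echoes TLS details
    impersonate="chrome",
)
print(response.json())  # includes the JA3 hash the server observed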
How TLS Fingerprint is Used to Block Web Scrapers?

See our full introduction article on TLS fingerprinting: how TLS handshakes are fingerprinted using the JA3 algorithm and how we can fortify our web scrapers against it.


Javascript

JavaScript-based fingerprinting and blocking mostly applies to web scrapers using browser automation technologies such as Selenium, Playwright or Puppeteer.

JavaScript allows servers to execute remote code on the client machine, and this is probably the most powerful web scraper identification technique. The client's JavaScript environment exposes thousands of different variables based on the web browser itself, the operating system and the browser automation technology (e.g. Selenium).

Some of these variables can instantly identify us as non-human connections, and some can provide unique tracking artifacts for fingerprinting. That being said, most of these leaks can be plugged or spoofed, meaning not all hope is lost!

To summarize:

  • Ensure commonly known leaks (like the navigator.webdriver variable) are patched in scraper-controlled browsers - see the sketch below.
  • Randomize variable values like the viewport when scraping at scale.
  • Ensure that IP-bound variables like location and timezone match the details of the proxy being used.
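
For example, here's a minimal sketch of patching the navigator.webdriver leak in Playwright. This plugs just one well-known leak - real fingerprinting checks hundreds of variables, so treat it as an illustration rather than a complete stealth setup:

# pip install playwright && playwright install chromium
from playwright.sync_api import sync_playwright

# injected before any page scripts run, hiding the automation flag
STEALTH_JS = "Object.defineProperty(navigator, 'webdriver', {get: () => undefined});"

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.add_init_script(STEALTH_JS)
    page.goto("https://httpbin.org/html")
    print(page.evaluate("navigator.webdriver"))  # None instead of True
    browser.close()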
How Javascript is Used to Block Web Scrapers? In-Depth Guide

See our full introduction article on JavaScript-based fingerprinting and web scraper identification: how browser automation libraries (Selenium, Playwright, Puppeteer) leak the fact that they're being used, how the browser's JavaScript environment leaks unique details that can be used in fingerprinting and how we can fix all of that with some clever scripting.


Anti Scraping Protection Services

All of these techniques are used by numerous anti-scraping protection services. For a more in-depth look at each service, see our dedicated articles:

🧙 All of these anti-scraping services can be bypassed by ScrapFly's ASP feature

Cloudflare Bot Management

Cloudflare Bot Management is one of the most popular anti-scraping services used by many contemporary websites.

Perimeter X (aka Human)

Perimeter X (aka Human) is one of the first big anti-scraping services and is still used by many big websites such as Zillow.

Akamai Bot Manager

Akamai Bot Manager is a popular corporate anti-scraping service covering websites like eBay and Amazon.

Datadome

Datadome is a popular European service used by EU-based websites like Leboncoin, Vinted and Deezer.

Imperva Incapsula

Imperva's Incapsula offers yet another anti-scraping service, covering websites like Glassdoor, Udemy and Wix.


ScrapFly

As you can see, avoiding web scraper blocking is an enormous subject - there are so many things that can identify us as a web scraper!
This is why the ScrapFly web scraping API was founded - abstracting away the logic that deals with scraper blocking results in much cleaner and easier-to-maintain code.


ScrapFly automatically resolves most of these blocking issues and also offers several optional features that can be used to access even the most hard-to-reach websites. Let's take a brief look at how these features can be used in our python-sdk:

The JavaScript Rendering feature allows the use of a web browser to render dynamic JavaScript content present in many modern web apps:

from scrapfly import ScrapflyClient, ScrapeConfig

url = "https://quotes.toscrape.com/js/page/2/" 
with ScrapflyClient(key='<YOUR KEY>') as client:
    response = client.scrape(
        ScrapeConfig(
            url=url,
            render_js=True,
            # ^^^^^^^ enabled
        )
    )
    html = response.scrape_result['content']

The Anti Scraping Protection Bypass feature solves various anti-scraping protection services that use fingerprinting and browser analysis to block web scrapers:

from scrapfly import ScrapflyClient, ScrapeConfig

url = "https://some-protected-website.com/" 
with ScrapflyClient(key='<YOUR KEY>') as client:
    response = client.scrape(
        ScrapeConfig(
            url=url,
            asp=True,
            # ^^^^^^^ enabled
        )
    )
    html = response.scrape_result['content']

The Smart Proxies feature allows selection of multiple proxy types, like mobile and residential proxies, as well as pinning the proxy location to specific countries:

from scrapfly import ScrapflyClient, ScrapeConfig

url = "https://httpbin.org/ip" 
with ScrapflyClient(key='<YOUR KEY>') as client:
    response = client.scrape(
        ScrapeConfig(
            url=url,
            # see https://scrapfly.io/dashboard/proxy for available proxy pools
            proxy_pool='public_mobile_pool',  # use mobile proxies
            country='US',  # use proxies located in the United States
        )
    )
    html = response.scrape_result['content']

And much more - see our full documentation and try it out yourself for free!

Summary

In this hub introduction we've taken a look at how to avoid getting blocked while scraping, which we split into 4 categories: request headers, IP addresses, TLS fingerprinting and JavaScript fingerprinting.

If your web scraper is blocked, start by taking a look at the request headers and their order. If you're using a popular HTTP client, TLS fingerprinting might be the culprit. If blocks start only after several requests, then your IP address is likely being tracked. And if you're using browser automation (such as Selenium, Puppeteer or Playwright), JavaScript fingerprinting is probably giving you away.

Avoiding web scraper blocking can be an exhausting task, so check out ScrapFly, which solves all of that for you!

Related Posts

Use Curl Impersonate to scrape as Chrome or Firefox

Learn how to prevent TLS fingerprinting by impersonating normal web browser configurations. We'll start by explaining what Curl Impersonate is, how it works, and how to install and use it. Finally, we'll explore using it with Python to avoid web scraping blocking.

FlareSolverr Guide: Bypass Cloudflare While Scraping

In this article, we'll explore the FlareSolverr tool and how to use it to get around Cloudflare while scraping. We'll start by explaining what FlareSolverr is, how it works, and how to install and use it. Let's get started!

How to Bypass CAPTCHA While Web Scraping in 2024

Captchas can ruin web scrapers but we don't have to teach our robots how to solve them - we can just get around it all!