How to Scrape Without Getting Blocked

One of the biggest challenges in web scraping is blocking, which can be caused by hundreds of different reasons. However, almost all of these reasons boil down to a single fact - web scraper connections look different from those of a web browser.

What makes web scraper connections so easy to identify?
In this article we'll take a look at four core areas where web scrapers often fail to cover their tracks and how analysis of these details can lead to blocking: request headers, IP addresses, TLS handshakes and the JavaScript execution context - each posing a unique threat when it comes to web scraping blocking.

Headers

The easiest way to detect a web scraping connection is request header analysis. Headers are part of every HTTP connection and include important metadata. If our web scraper connects with headers unlike those of a web browser, it can be easily identified. To avoid this, we need to understand how headers work, how web browsers present them and how we can replicate this in our web scraping code. To summarize:

  • Ensure header values match those of a common web browser.
  • For variable values, aim for common configurations like Chrome on Windows or Safari on MacOS.
  • Randomize some variable values when scraping at scale.
  • Ensure that header order matches that of a web browser and that your HTTP client respects header ordering - see the example below.
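
For example, here's a minimal sketch using the httpx library. The header values mimic Chrome on Windows and are purely illustrative - copy fresh values from a real browser session and verify how your HTTP client handles header ordering:

import httpx

# browser-like headers; values mimic a Chrome on Windows session
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
}

# httpbin echoes back the headers it received - handy for verification
response = httpx.get("https://httpbin.org/headers", headers=BROWSER_HEADERS)
print(response.text)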

See our full introduction article to request headers in the context of web scraper blocking: what common headers mean, how they're used to identify web scrapers and how we can fix this.

How to Avoid Web Scraping Blocking: Headers Guide

IP Address

Another piece of metadata included with every HTTP connection is the IP address itself. When web scraping, we often use proxies to avoid sending an inhuman amount of requests through a single connection. This is great for avoiding detection through traffic analysis, but not all IP addresses are equal - some perform much better in web scraping than others. To summarize:

  • Avoid using datacenter IPs.
  • Diversify the IP pool to span many subnets rather than just individual addresses.
  • Inspect IP metadata to further diversify IP pools by ASN and other ownership identifiers - see the example below.
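
For example, here's a minimal proxy rotation sketch using the requests library. The proxy URLs are placeholders - substitute addresses from your own pool, ideally spread across different subnets and ASNs:

import random
import requests

# placeholder proxy addresses - use your own pool spread across subnets
PROXY_POOL = [
    "http://user:pass@11.22.33.44:8000",
    "http://user:pass@55.66.77.88:8000",
    "http://user:pass@99.88.77.66:8000",
]

def scrape(url: str) -> str:
    proxy = random.choice(PROXY_POOL)  # pick a different proxy per request
    response = requests.get(url, proxies={"http": proxy, "https": proxy})
    return response.text

print(scrape("https://httpbin.org/ip"))  # echoes the IP the server saw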

See our full introduction article to IP addresses: what types of addresses there are, how they work and how all of this information can be used to identify and track web scraping.

How to Avoid Web Scraping Blocking: IP Address Guide

TLS

Transport Layer Security (TLS) is an end-to-end encryption protocol used in all HTTPS connections. The TLS handshake process can lead to web scraper identification and fingerprinting, because every HTTP client - be it a programming library or a web browser - performs the initial TLS handshake slightly differently. These minor differences are collected and compiled into a fingerprint called JA3.

To summarize:

  • Analyze the TLS handshake to understand how a web browser's handshake differs from that of an HTTP client library. The usual culprits are the "Cipher Suites" and "Extensions" fields, which leave a unique fingerprint for each client library.
  • The JA3 fingerprinting technique is good for tracking software rather than individual machines, so the main goal is to avoid being identified as a commonly known client like a web scraping framework or library - see the example below.
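
For example, here's a minimal sketch using the curl_cffi package, which can impersonate a real browser's TLS handshake and therefore its JA3 fingerprint. Note that the available impersonation targets (e.g. "chrome" or "chrome110") vary between package versions - check the curl_cffi documentation:

from curl_cffi import requests

# tls.browserleaks.com echoes TLS connection details, including the JA3 hash
response = requests.get(
    "https://tls.browserleaks.com/json",
    impersonate="chrome",  # mimic Chrome's TLS handshake
)
print(response.json())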

See our full introduction article to TLS fingerprinting: how TLS handshakes are fingerprinted using the JA3 algorithm and how we can fortify our web scrapers against it.

How to Avoid Web Scraping Blocking: TLS Guide

Javascript

JavaScript-based fingerprinting and blocking mostly apply to web scrapers using browser automation technologies such as Selenium, Playwright or Puppeteer.

JavaScript allows servers to execute remote code on the client machine, and this is probably the most powerful web scraper identification technique. The client's JavaScript environment exposes thousands of different variables based on the web browser itself, the operating system and the browser automation technology (e.g. Selenium).

Some of these variables can instantly identify us as a non-human connection, and some provide unique tracking artifacts for fingerprinting. That being said, most of these leaks can be plugged or spoofed, meaning not all hope is lost!

To summarize:

  • Ensure commonly known leaks (like the navigator.webdriver variable) are patched in scraper-controlled browsers - see the example below.
  • Randomize variable values like the viewport when scraping at scale.
  • Ensure that IP-bound variables like location and timezone match the details of the proxy in use.
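
For example, here's a minimal Playwright sketch that patches the navigator.webdriver leak before any page script runs. This plugs just one well-known leak - real-world stealth requires many more patches:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    # a common desktop viewport; randomize this when scraping at scale
    page = browser.new_page(viewport={"width": 1920, "height": 1080})
    # hide the webdriver flag that browser automation sets to true
    page.add_init_script(
        "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
    )
    page.goto("https://httpbin.org/html")
    print(page.evaluate("navigator.webdriver"))  # now reports None
    browser.close()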

See our full introduction article to JavaScript-based fingerprinting and web scraper identification: how browser automation libraries (Selenium, Playwright, Puppeteer) leak the fact that they're being used, how the browser's JavaScript environment leaks unique details that can be used in fingerprinting and how we can fix all that with some clever scripting.

How to Avoid Web Scraping Blocking: Javascript Guide

ScrapFly

As you can see, avoiding web scraper blocking is an enormous subject - there are so many things that can identify us as a web scraper!
This is why the ScrapFly web scraping API was created - abstracting away the logic that deals with web scraping blocking results in much cleaner and easier-to-maintain code.

ScrapFly automatically resolves most of these blocking issues, but it also offers several optional features which can be used to access even the hardest-to-access websites. Let's take a brief look at how these features can be used in our python-sdk:

The Javascript Rendering feature uses a real web browser to render the dynamic JavaScript content present in many modern web apps:

from scrapfly import ScrapflyClient, ScrapeConfig

url = "https://quotes.toscrape.com/js/page/2/" 
with ScrapflyClient(key='<YOUR KEY>') as client:
    response = client.scrape(
        ScrapeConfig(
            url=url,
            render_js=True,
            # ^^^^^^^ enabled 
        )
    )
    html = response.scrape_result['content']

The Anti Scraping Protection Bypass feature defeats various anti-scraping protection services that use fingerprinting and browser analysis to block web scrapers:

from scrapfly import ScrapflyClient, ScrapeConfig

url = "https://some-protected-website.com/" 
with ScrapflyClient(key='<YOUR KEY>') as client:
    response = client.scrape(
        ScrapeConfig(
            url=url,
            asp=True,
            # ^^^^^^^ enabled 
        )
    )
    html = response.scrape_result['content']

The Smart Proxies feature allows selecting from multiple proxy types, like mobile and residential proxies, as well as pinning the proxy location to specific countries:

from scrapfly import ScrapflyClient, ScrapeConfig

url = "https://httpbin.org/ip" 
with ScrapflyClient(key='<YOUR KEY>') as client:
    response = client.scrape(
        ScrapeConfig(
            url=url,
            # see https://scrapfly.io/dashboard/proxy for available proxy pools
            proxy_pool='public_mobile_pool',  # use mobile proxies
            country='US',  # use proxies located in the United States
        )
    )
    html = response.scrape_result['content']

And much more - see our full documentation and try it out yourself for free!

Related Posts

Web Scraping With Node-Unblocker

A tutorial on using Node-Unblocker - a Node.js library - to avoid blocking while web scraping and to optimize web scraping stacks.

How to Avoid Web Scraping Blocking: TLS Guide

How the TLS handshake fingerprint is used to identify web scrapers and how to configure the TLS handshake to correctly spoof the JA3 fingerprint.

How to Avoid Web Scraping Blocking: IP Address Guide

How IP addresses are used in web scraping blocking. Understanding IP metadata and fingerprinting techniques to avoid web scraper blocks.

How to Avoid Web Scraping Blocking: Headers Guide

Introduction to web scraping headers - what they mean, how to configure them in web scrapers and how to avoid being blocked.