One of the biggest challenges in web scraping is blocking, which can happen for hundreds of different reasons. However, we can reduce all of these reasons to a single fact - web scraper connections look different from those of a web browser.
What makes web scraper connections so easy to identify?
In this article, we'll take a look at web scraping without getting blocked by exploring 4 core areas where web scrapers fail to cover their tracks and how analysis of these details can lead to blocking. These areas are: request headers, IP addresses, security handshakes and the javascript execution context - each posing a unique threat when it comes to web scraping blocking.
Headers
The easiest way to detect a web scraping connection is request header analysis. Headers are part of every connection and include important metadata. If our web scraper is connecting with headers that are unlike those of a web browser, then it can be easily identified. For this, we need to understand how headers work, how they are presented in web browsers and how we can replicate this in our web scraping code. To summarize:
Ensure header values match those of a common web browser.
For variable values - aim for common values like Chrome on Windows or Safari on macOS.
Randomize some variable values when scraping at scale.
Ensure that header order matches that of a web browser, and that your HTTP client respects header ordering (see the sketch after this list).
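To illustrate, here's a minimal sketch of browser-like headers using the httpx library - the values are an illustrative snapshot of Chrome on Windows and should be copied from a real, up-to-date browser:

import httpx

# example header values mimicking Chrome on Windows, defined in a
# browser-like order (the values below are illustrative snapshots)
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
}

# httpx preserves the insertion order of the headers we provide
response = httpx.get("https://httpbin.org/headers", headers=headers)
print(response.text)  # httpbin echoes back the headers it received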
IP Address
Another piece of metadata included with every HTTP connection is the IP address itself. When web scraping, we often use proxies to avoid sending an inhuman amount of requests through a single connection. This is great for avoiding detection through traffic analysis, but not all IP addresses are equal - some perform much better in web scraping than others. To summarize:
Avoid using datacenter IPs.
Diversify the IP pool across many subnets rather than just individual addresses.
Inspect IP metadata to further diversify IP pools by ASN and other ownership identifiers (see the sketch after this list).
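As an illustration, here's a minimal proxy rotation sketch using a recent version of httpx - the proxy addresses are hypothetical placeholders for a pool spread across different subnets and owners:

import random
import httpx

# hypothetical proxy pool - a real pool should span many subnets and ASNs
proxy_pool = [
    "http://111.22.33.1:8000",
    "http://111.22.44.1:8000",  # same owner, different subnet
    "http://55.66.77.1:8000",   # different ASN/owner entirely
]

# pick a different proxy per request to spread traffic across addresses
with httpx.Client(proxy=random.choice(proxy_pool)) as client:
    response = client.get("https://httpbin.org/ip")
    print(response.text)  # shows the proxy's address, not ours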
TLS Fingerprint
Transport Layer Security (TLS) is the end-to-end encryption protocol used in all HTTPS connections. The TLS handshake process can lead to web scraper identification and fingerprinting. This is because every HTTP client - be it a programming library or a web browser - performs the initial TLS handshake slightly differently. These minor differences are collected and compiled into a fingerprint called JA3.
To summarize:
Analyze the TLS handshake to understand how a web browser's handshake differs from your HTTP client's. The usual culprits are the "Cipher Suites" and "Extensions" fields, which leave a unique fingerprint for each client library.
The JA3 fingerprinting technique is good for tracking software but not individual machines. The main goal is to avoid being identified as commonly known software like a web scraping framework or library (see the sketch after this list).
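One way to approach this in Python is the community curl_cffi package, which can impersonate the TLS handshake (and thus the JA3 fingerprint) of a real web browser - a minimal sketch:

# curl_cffi wraps curl-impersonate to mimic real browser TLS handshakes
from curl_cffi import requests

# impersonate="chrome" makes the TLS handshake - cipher suites,
# extensions and all - match a recent Chrome release
response = requests.get("https://httpbin.org/ip", impersonate="chrome")
print(response.status_code)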
Javascript Fingerprint
Javascript-based fingerprinting and blocking mostly applies to web scrapers using browser automation technologies such as Selenium, Playwright or Puppeteer.
Javascript allows servers to execute arbitrary code on the client machine, and this is probably the most powerful web scraper identification technique. The client's javascript environment exposes thousands of different variables based on the web browser itself, the operating system and the browser automation technology (e.g. Selenium).
Some of these variables can instantly identify us as non-human connections, and some can provide unique tracking artifacts for fingerprinting. That being said, most of these leaks can be plugged or spoofed, meaning not all hope is lost!
To summarize:
Ensure commonly known leaks (like the navigator.webdriver variable) are patched in scraper-controlled browsers.
Randomize variable values like viewport when scraping at scale.
Ensure that IP-bound variables like location and timezone match the used proxy's details (see the sketch after this list).
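For example, here's a minimal sketch, assuming Playwright, that patches the well-known navigator.webdriver leak before any page scripts run:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    # init scripts run before any of the page's own javascript,
    # hiding the leak from fingerprinting scripts
    page.add_init_script(
        "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
    )
    page.goto("https://quotes.toscrape.com/")
    print(page.evaluate("navigator.webdriver"))  # None instead of True
    browser.close()

Note that this plugs only a single, well-known leak - real fingerprinting scripts check many more variables, so dedicated stealth patch collections are usually needed at scale.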
All of these different techniques are used by numerous anti-scraping protection services, so for a more in-depth look at each anti-scraping protection service see our dedicated articles:
Imperva Incapsula - a popular anti bot WAF. How does it detect web scrapers and bots, and what can we do to prevent our scrapers from being detected?
Datadome - a popular anti bot firewall. How does it detect web scrapers and bots, and what can we do to prevent our scrapers from being detected?
🧙 All of these anti-scraping services can be bypassed by ScrapFly's ASP feature
As you can see, avoiding web scraper blocking is an enormous subject - there are so many things that can identify us as a web scraper!
This is why the ScrapFly web scraping API was created - abstracting away the logic that deals with web scraping blocking results in much cleaner and easier-to-maintain code.
ScrapFly automatically resolves most of these blocking issues, but also has several optional features which can be used to access even the most difficult websites. Let's take a brief look at how these features can be used in our python-sdk:
The Javascript Rendering feature allows the use of a web browser to render dynamic javascript content present in many modern web apps:
from scrapfly import ScrapflyClient, ScrapeConfig

url = "https://quotes.toscrape.com/js/page/2/"
with ScrapflyClient(key='<YOUR KEY>') as client:
    response = client.scrape(
        ScrapeConfig(
            url=url,
            render_js=True,
            # ^^^^^^^ enabled
        )
    )
    html = response.scrape_result['content']
The Anti Scraping Protection Bypass feature solves various anti web scraping protection services that use fingerprinting and browser analysis to block web scrapers:
from scrapfly import ScrapflyClient, ScrapeConfig

url = "https://some-protected-website.com/"
with ScrapflyClient(key='<YOUR KEY>') as client:
    response = client.scrape(
        ScrapeConfig(
            url=url,
            asp=True,
            # ^^^^^^^ enabled
        )
    )
    html = response.scrape_result['content']
The Smart Proxies feature allows selection of multiple proxy types like mobile and residential proxies, as well as pinning the proxy location to specific countries:
from scrapfly import ScrapflyClient, ScrapeConfig

url = "https://httpbin.org/ip"
with ScrapflyClient(key='<YOUR KEY>') as client:
    response = client.scrape(
        ScrapeConfig(
            url=url,
            # see https://scrapfly.io/dashboard/proxy for available proxy pools
            proxy_pool='public_mobile_pool',  # use mobile proxies
            country='US',  # use proxies located in the United States
        )
    )
    html = response.scrape_result['content']
And much more - see our full documentation and try it out yourself for free!
Summary
In this hub introduction we've taken a look at how to avoid getting blocked while scraping, which we split into 4 categories: request headers, IP addresses, TLS fingerprinting and Javascript fingerprinting.
If your web scraper is blocked, start by taking a look at the request headers and their order. If you're using a popular HTTP client, then it might be TLS fingerprinting. If blocks start only after several requests, then your IP address is likely being tracked. If you're using browser automation (such as Selenium, Puppeteer or Playwright), then javascript fingerprinting is giving you away.
Avoiding web scraper blocking can be an exhausting task, so check out ScrapFly, which solves all of that for you!