Scraper Blocking

Unfortunately, many websites do not want to be scraped despite serving their content publicly. To prevent scraping, countless technologies have been developed to detect and block web scrapers.

Introduction

We can separate the scraper-blocking subject into two categories: unintentional scraper misconfiguration and intentional detection and blocking.

Scraper Misconfiguration

This is unintentional scraper blocking that happens when a scraper fails to match the details of requests the site expects. Most commonly, this is caused by a missing header, missing cookies, or missing JavaScript functionality.

See these related Scrapeground exercises:

So, it's important to replicate all request details in the scraper configuration. This includes headers, cookies, secret tokens and even connection patterns, to avoid detection through HTTP configuration.
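As a minimal sketch of this idea, here is a request configured with browser-like headers and a replayed cookie using Python's standard library. The header values and the `session_token` cookie are illustrative assumptions, not exact matches for any particular browser or site; in practice you would copy the values your own browser sends.

```python
from urllib.request import Request

# Headers copied from (and resembling) a real browser session.
# These exact values are illustrative, not a specific browser build.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/120.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    # A cookie replayed from a browser session (hypothetical name/value):
    "Cookie": "session_token=example-value",
}

# Build the request with the replicated details attached.
req = Request("https://example.com/products", headers=BROWSER_HEADERS)
```

The same approach applies to any HTTP client: the goal is that every detail of the request, not just the URL, matches what a real browser would send.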

Scraper Identification

Scrapers are often blocked intentionally. This is done by identifying whether incoming requests are coming from human browsers or automated programs. As there are many differences between the two, identification of scrapers is often trivial without proper preparation.
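To illustrate how trivial this identification can be: by default, popular HTTP libraries announce themselves in the `User-Agent` header. The check below sketches the kind of rule a server might apply; the marker list is a small illustrative subset, not a real blocklist.

```python
def looks_automated(user_agent: str) -> bool:
    """Flag User-Agent strings that default HTTP clients send.

    The markers below are default identifiers of a few common
    libraries and tools (illustrative subset only).
    """
    bot_markers = ("python-requests", "curl/", "go-http-client", "scrapy")
    ua = user_agent.lower()
    return any(marker in ua for marker in bot_markers)

# A scraper using the library defaults is flagged immediately:
print(looks_automated("python-requests/2.31.0"))   # True
# A browser-like User-Agent passes this naive check:
print(looks_automated("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))  # False
```

Real detection goes far beyond headers (TLS fingerprints, JavaScript execution, behavior analysis), which is why preparation matters.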

Web scraping blocking is a major subject, and we recommend taking a look at the complete introduction on the Scrapfly blog 👇

Intro to Web Scraper Blocking

This introduction hub covers web scraper blocking in all of its multifaceted forms: IP addresses, Proxies, Fingerprinting and all the sneaky tech used to identify web scrapers for blocking.

Honeypots

Honeypots are a common technique used to identify scrapers. They are hidden parts of the page that are only visible to scrapers and not to human users. When a scraper interacts with a honeypot it can be easily identified and blocked.

This means that scrapers need to be strict in their scraping logic to prevent stumbling into any honeypots.
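One way to stay strict is to filter out links that no human would ever see or click before following them. The sketch below uses Python's standard-library HTML parser and two common honeypot heuristics, inline-hidden elements and `rel="nofollow"` links; these heuristics are illustrative and real honeypots can use other hiding techniques (CSS classes, off-screen positioning).

```python
from html.parser import HTMLParser

class VisibleLinkExtractor(HTMLParser):
    """Collect hrefs while skipping likely honeypot links."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        style = (attrs.get("style") or "").replace(" ", "").lower()
        # Links hidden via inline CSS are invisible to humans.
        hidden = "display:none" in style or "visibility:hidden" in style
        if hidden or attrs.get("rel") == "nofollow":
            return  # likely a honeypot, do not crawl it
        if "href" in attrs:
            self.links.append(attrs["href"])

sample = """
<a href="/real-page">Products</a>
<a href="/trap" style="display: none">hidden trap</a>
"""
parser = VisibleLinkExtractor()
parser.feed(sample)
print(parser.links)  # ['/real-page']
```

The hidden `/trap` link is skipped, so the crawler never triggers that honeypot.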

Intro to Honeypots in Scraping

What exactly honeypots are, how they are relevant to web scraping, and some popular examples with solutions for bypassing them.

Captchas

Captchas are a way for websites to validate whether the connecting client has a human at the end of it. It's an interactive task that is difficult to solve for robots but easy for humans.

Fortunately, nobody likes captchas and these are used as a last resort when it comes to blocking web scrapers. This is done through a trust score calculation where the connection is analyzed first and only the least trustworthy connections are served captcha challenges.

This means, to bypass captchas we can either fortify our scraper trust score to never receive one in the first place or solve captchas using image recognition software and similar solvers.
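If a captcha does slip through, a scraper should at least recognize it and back off rather than parse the challenge page as data. The sketch below detects a captcha response by searching for well-known widget markers; `g-recaptcha` and `h-captcha` are real CSS class names used by those widgets, while the idea of a generic marker list is an illustrative simplification.

```python
# Markers of common captcha widgets found in served HTML.
# This is an illustrative subset; match against the specific
# site you are scraping in practice.
CAPTCHA_MARKERS = ("g-recaptcha", "h-captcha")

def is_captcha_page(html: str) -> bool:
    """Return True when the response looks like a captcha challenge."""
    body = html.lower()
    return any(marker in body for marker in CAPTCHA_MARKERS)

challenge = '<div class="g-recaptcha" data-sitekey="..."></div>'
print(is_captcha_page(challenge))          # True
print(is_captcha_page("<h1>Products</h1>"))  # False
```

On a positive detection, the scraper can retry through a higher-trust path: a different proxy, a real browser, or a slower request rate.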

Intro to Captchas in Scraping

How web scrapers are fortified to avoid and bypass captcha challenges, with an overview of captcha technology in scraping.

Anti-bot Protection Services

Identifying and blocking scrapers is a complex process and, in turn, has become a major industry. This has given rise to dedicated services called Web Application Firewalls (WAFs), which shield the entire website from scrapers and other unwanted connections.

These services are very difficult to bypass, and each has its own unique way of identifying scrapers, so each should be approached as an individual challenge. Here are some of the most popular ones and our intro articles on them:

If you're unsure which WAF you're dealing with, tools like wafw00f can detect which WAF a site is using.
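In the same spirit as wafw00f, a quick heuristic is to inspect response headers for vendor-specific signatures. The sketch below uses two signatures I'm confident are real (Cloudflare adds a `CF-RAY` header; Akamai can add `X-Akamai-Transformed`), but the table is a tiny illustrative subset, nowhere near the coverage of a dedicated detection tool.

```python
# Map WAF vendors to response headers they are known to add.
# Illustrative subset only; real tools check many more signatures.
WAF_SIGNATURES = {
    "Cloudflare": ("cf-ray", "cf-cache-status"),
    "Akamai": ("x-akamai-transformed",),
}

def detect_waf(headers):
    """Return the name of a recognized WAF, or None."""
    lowered = {key.lower() for key in headers}
    for waf, markers in WAF_SIGNATURES.items():
        if any(marker in lowered for marker in markers):
            return waf
    return None

print(detect_waf({"CF-RAY": "8a1b2c3d4e5f-LHR", "Server": "cloudflare"}))
# Cloudflare
```

A `None` result only means no signature matched; it does not prove the site is unprotected, since many WAFs hide their headers.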

Easy Mode with Scrapfly

One of the main features of Scrapfly is its blocking bypass, Scrapfly's Anti Scraping Protection Bypass. It brings the fun back to web scraping: you can focus on the data you want to extract instead of committing time and resources to bypassing blocks.

Community Tools and Extensions

As blocking is such a major issue in web scraping, many community tools can assist in bypassing common blocking techniques. These tools range from identification leak patches to external services that alter and manage connections. Here are some of the most popular ones:

Next - Proxies

Next, let's take a look at one of the most important tools for bypassing scraper blocking: IP proxy use and rotation.

