How Headers Are Used to Block Web Scrapers and How to Fix It

One of the leading causes of web scraper identification is incorrectly configured request headers, which can lead to web scrapers being blocked, throttled or even banned.

Headers are a vital part of every HTTP request, as they provide important metadata about the incoming request. This ranges from basic information about the client, like the user agent string, to custom security tokens and client rendering capabilities.

In this article we'll take an extensive look at request headers in web scraping and how we can prevent our web scrapers from being identified and blocked by configuring our requests to appear natural.

We'll go over the most common headers and what they mean, as well as the challenges and best practices of configuring this part of the web scraping process.

How to Scrape Without Getting Blocked? In-Depth Tutorial

For more on avoiding web scraping blocking, see our full introduction article which covers proxy usage, TLS handshakes and javascript fingerprinting.

Inspecting Browser Behavior

When web scraping we want our scraper to appear as a web browser, so first we should ensure that it replicates the standard headers a common web browser such as Chrome or Firefox sends.

To understand what browsers are sending, we can use a simple echo server that prints out the HTTP request details it receives. We can achieve this with a short Python script:

import socket

# local address and port to run the echo server on
HOST = "127.0.0.1"
PORT = 65432

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.bind((HOST, PORT))
    s.listen()
    while True:
        # accept a connection and read the raw HTTP request the client sends
        conn, addr = s.accept()
        with conn:
            print(f"connected by {addr}")
            data = conn.recv(1024)
            print(data.decode())
            # response headers
            conn.send(b'HTTP/1.1 200 OK\n')
            conn.send(b'Content-Type: text/html\n')
            conn.send(b'\n')
            # response body: echo the received request back to the browser
            conn.send(b'<html><body><pre>')
            conn.send(data)
            conn.send(b'</pre></body></html>')
            conn.close()

If we run this script and open http://127.0.0.1:65432 in our browser, we'll see the exact HTTP request our browser sends:

Chrome on Linux
GET / HTTP/1.1
Host: 127.0.0.1:65432
Connection: keep-alive
Cache-Control: max-age=0
sec-ch-ua: " Not A;Brand";v="99", "Chromium";v="99", "Google Chrome";v="99"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "Linux"
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.74 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Sec-Fetch-Site: none
Sec-Fetch-Mode: navigate
Sec-Fetch-User: ?1
Sec-Fetch-Dest: document
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
Firefox on Linux
GET / HTTP/1.1
Host: 127.0.0.1:65432
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:98.0) Gecko/20100101 Firefox/98.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
Upgrade-Insecure-Requests: 1
Sec-Fetch-Dest: document
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: none
Sec-Fetch-User: ?1
Chrome on Windows
GET / HTTP/1.1
Host: 127.0.0.1:65432
Connection: keep-alive
Cache-Control: max-age=0
sec-ch-ua: " Not A;Brand";v="99", "Chromium";v="99", "Google Chrome";v="99"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "Windows"
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Sec-Fetch-Site: none
Sec-Fetch-Mode: navigate
Sec-Fetch-User: ?1
Sec-Fetch-Dest: document
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
Firefox on Windows
GET / HTTP/1.1
Host: 127.0.0.1:65432
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0) Gecko/20100101 Firefox/98.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
Upgrade-Insecure-Requests: 1
Sec-Fetch-Dest: document
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: none
Sec-Fetch-User: ?1
Cache-Control: max-age=0
Edge on Windows
GET / HTTP/1.1
Host: 127.0.0.1:65432
Connection: keep-alive
Cache-Control: max-age=0
sec-ch-ua: " Not A;Brand";v="99", "Chromium";v="99", "Microsoft Edge";v="99"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "Windows"
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36 Edg/99.0.1150.30
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Sec-Fetch-Site: none
Sec-Fetch-Mode: navigate
Sec-Fetch-User: ?1
Sec-Fetch-Dest: document
Accept-Encoding: gzip, deflate, br
Accept-Language: en-GB,en;q=0.9,en-US;q=0.8,fi;q=0.7
Chrome on MacOS
GET / HTTP/1.1
Host: 127.0.0.1:65432
Connection: keep-alive
Cache-Control: max-age=0
sec-ch-ua: " Not A;Brand";v="99", "Chromium";v="99", "Google Chrome";v="99"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "macOS"
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.83 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Sec-Fetch-Site: none
Sec-Fetch-Mode: navigate
Sec-Fetch-User: ?1
Sec-Fetch-Dest: document
Accept-Encoding: gzip, deflate, br
Accept-Language: en-GB,en-US;q=0.9,en;q=0.8
Firefox on MacOS
GET / HTTP/1.1
Host: 127.0.0.1:65432
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:98.0) Gecko/20100101 Firefox/98.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
Upgrade-Insecure-Requests: 1
Sec-Fetch-Dest: document
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: none
Sec-Fetch-User: ?1
Cache-Control: max-age=0
Safari on MacOS
GET / HTTP/1.1
Host: 127.0.0.1:65432
Upgrade-Insecure-Requests: 1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.0 Safari/605.1.15
Accept-Language: en-gb
Accept-Encoding: gzip, deflate
Connection: keep-alive

The above shows the default headers, and their order, that common web browsers send with the first request when establishing a connection.

Using this information we can build our header fingerprint profiles for our web scrapers. As always, we don't want this fingerprint to stick out too much, so we should aim to replicate the most common platforms such as Chrome on Windows or Safari on MacOS.

Request Header Order

The first thing we notice in the previous section is that browsers send headers in a certain order, and this is an often overlooked web scraper identification method. Primarily because many HTTP clients in various programming languages implement their own header ordering - making identification of web scrapers very easy!

For example, the most common HTTP client library in Python - requests - does not respect header order (see issue 5814 for potential solutions), so web scrapers based on it can be easily identified. Alternatively, the httpx library does respect header order, and we can safely use it for web scraping as a requests alternative.
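
As a quick sanity check, we can compare the headers each client composes when given the same browser-like header dictionary. A minimal sketch (no network requests are made, we only build the request objects and print their header order):

import requests
import httpx

# a browser-like header order (Firefox on Windows, from the captures above)
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0) Gecko/20100101 Firefox/98.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate",
}

# requests merges our headers into its own session defaults, so the resulting order differs
prepared = requests.Session().prepare_request(
    requests.Request("GET", "http://127.0.0.1:65432/", headers=HEADERS)
)
print(list(prepared.headers.keys()))

# httpx preserves the order of the headers we pass in
with httpx.Client() as client:
    request = client.build_request("GET", "http://127.0.0.1:65432/", headers=HEADERS)
    print(list(request.headers.keys()))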

To avoid being detected because of unnatural header order, we should ensure that the HTTP client we use respects header ordering, and define headers explicitly in the order they appear in a web browser.

For example, if we're using httpx in Python we can imitate Firefox on Windows headers and their ordering:

import httpx

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0) Gecko/20100101 Firefox/98.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
    "Cache-Control": "max-age=0",
}
print(httpx.get("http://127.0.0.1:65432/", headers=HEADERS).text)

Since Python dictionaries preserve insertion order, we can simply pass our header dictionary to the client and the headers will be sent in the defined order.

Next, let's take a look at these default headers: what they mean and how we can replicate them in our web scraper.

Common Standard Headers

Accept

Indicates what type of data our HTTP client accepts. We usually want to keep it as it is in common web browsers:

# Firefox
text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8
# Chrome
text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9

Accept-Encoding

Indicates what content encodings the HTTP client supports. Take note of the br value, which indicates support for the newer brotli compression - a detail commonly used to identify web scrapers.

# Chrome (advertises brotli support)
gzip, deflate, br
# Firefox (brotli not advertised here)
gzip, deflate
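
If we advertise br, we should also make sure our client can actually decode brotli responses. A small sketch (httpx, for example, can decode brotli responses when the optional brotli or brotlicffi package is installed):

# only advertise brotli compression if we can actually decode it
try:
    import brotli  # pip install brotli (or brotlicffi)
    ACCEPT_ENCODING = "gzip, deflate, br"
except ImportError:
    ACCEPT_ENCODING = "gzip, deflate"

headers = {"Accept-Encoding": ACCEPT_ENCODING}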

Accept-Language

Indicates what languages the browser accepts. Keep an eye on the q value, which indicates the preference weight when multiple languages are defined (e.g. fr-CH, fr;q=0.9, en;q=0.8, de;q=0.7, *;q=0.5). When scraping non-English websites we might need to adjust this value to an appropriate language to avoid standing out from the crowd.

# Firefox
en-US,en;q=0.5
# Chrome
en-US,en;q=0.9

Host

Indicates the name of the server we're interacting with. Most HTTP clients configure this header automatically from the given URL, so we don't need to worry about it.

Upgrade-Insecure-Requests

Indicates whether the client allows upgrading http requests to https. In cases where our scraper struggles with SSL, we might want to try our luck and omit this header to scrape the unencrypted version of the website, if it's available.

User-Agent

Arguably the most important header when it comes to web scraping. It indicates what device and browser is submitting the request. There are thousands of different user agent strings, however as a general rule of thumb we want to aim for the most common ones, which usually means a Windows computer running the Chrome web browser:

# Windows 10 + Chrome
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.74 Safari/537.36

# MacOS + Chrome
Mozilla/5.0 (Macintosh; Intel Mac OS X 12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.74 Safari/537.36

# iOS + Chrome
Mozilla/5.0 (iPhone; CPU iPhone OS 15_4 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) CriOS/99.0.4844.59 Mobile/15E148 Safari/604.1

# Android + Chrome
Mozilla/5.0 (Linux; Android 10) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.78 Mobile Safari/537.36

Take note that the configured User-Agent string should be consistent with the rest of the standard headers, like Accept and Accept-Encoding.

Since User-Agents indicate specific software versions, we want to keep our web scrapers up to date with the most popular releases, or even rotate many different user agent strings in our scraper pool to distribute our traffic. There are several public user agent string databases, like the one provided by whatismybrowser.com - when scraping at scale it's important to have a healthy user agent selection!
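
A minimal rotation sketch (these strings are just the examples from above - in practice keep a larger, regularly refreshed pool):

import random

# a small pool of user agent strings to rotate through
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.74 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.74 Safari/537.36",
    "Mozilla/5.0 (Linux; Android 10) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.78 Mobile Safari/537.36",
]

def random_user_agent() -> str:
    # pick a random user agent for each new scrape session
    return random.choice(USER_AGENTS)

Remember that the rest of the header profile (Accept, Accept-Encoding, sec-ch-ua values and so on) should be swapped together with the user agent so the fingerprint stays consistent.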

Sec-Fetch Headers

The Sec-Fetch- family of headers (aka fetch metadata request headers) indicates some security details that can be used in web scraper fingerprinting:

  • Sec-Fetch-Site indicates the origin of the request. When web scraping we want to use none for direct requests and same-site for dynamic data requests (XHR-type requests).
  • Sec-Fetch-Mode indicates the mode of the request. In web scraping we'd use navigate for direct requests and same-origin, cors or no-cors (depending on the website's functionality) for dynamic data requests.
  • Sec-Fetch-User indicates whether the request was triggered by a user action or by javascript. This header either has the value ?1 or is omitted.
  • Sec-Fetch-Dest indicates the requested document type. In web scraping this is usually document for direct HTML requests and empty for dynamic data requests.

These are the default values the Chrome browser uses when working with HTTPS websites, however behavior might differ on dynamic javascript-powered websites, so it's always worth keeping an eye on these headers on a per-scraper basis, as sketched below.
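
A hedged sketch of the two common scenarios (exact values vary per website, so it's best to confirm them by inspecting real browser traffic in the network tab):

# direct navigation to an HTML page:
SEC_FETCH_DOCUMENT = {
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-User": "?1",
    "Sec-Fetch-Dest": "document",
}

# background XHR/fetch request made by the page's javascript:
SEC_FETCH_XHR = {
    "Sec-Fetch-Site": "same-origin",  # or "same-site" when the API lives on a subdomain
    "Sec-Fetch-Mode": "cors",         # or "no-cors"/"same-origin" depending on the website
    "Sec-Fetch-Dest": "empty",        # Sec-Fetch-User is omitted for non-user-initiated requests
}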

Sec-CH (Client Hint) Headers

The Sec-CH- family of headers is a new and experimental take on user agent strings implemented by newer browser versions. These headers essentially contain the same data as the User-Agent header, so it's important that the values of the two match, otherwise we risk being identified as a web scraper.

Python script to extract client hint details from a User-Agent string:
"""
Converts a Chrome User-Agent string to sec-ch (client hint) headers.
Requires:
    pip install user-agents

Usage:
    $ python user-agent-to-sec-ch.py "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36"

    {'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="99", "Google Chrome";v="99"',
     'sec-ch-ua-mobile': '?0',
     'sec-ch-ua-platform': '"Windows"'}
"""
import sys
from typing import Dict

from user_agents import parse


def ua_to_ch(ua_string: str) -> Dict[str, str]:
    """convert a Chrome user agent string to matching sec-ch (client hint) headers"""
    parsed = parse(ua_string)
    assert parsed.browser.family == "Chrome", "only chrome user agent strings supported"
    major_version = parsed.browser.version[0]
    return {
        # the brand list starts with a quoted " Not A;Brand" entry, as in the browser captures above
        "sec-ch-ua": f'" Not A;Brand";v="{major_version}", "Chromium";v="{major_version}", "Google Chrome";v="{major_version}"',
        "sec-ch-ua-mobile": f"?{int(parsed.is_mobile)}",
        # browsers send the platform value wrapped in quotes, e.g. "Windows"
        "sec-ch-ua-platform": f'"{parsed.get_os().split()[0]}"',
    }


if __name__ == "__main__":
    from pprint import pprint

    pprint(ua_to_ch(sys.argv[1]))

Optional Standard Headers

Web browsers also tend to send a lot of optional standard headers that indicate security, rendering or content features. Let's take a look at some of the common ones: what they mean and how we should approach them when web scraping.

Referer Header

Indicates browsing history: which page referred the client to the current one.
This is a great header for masking scraping patterns. It's good practice to set it to something natural: if we're scraping website.com/category/product.html we'd set the referer to either website.com or website.com/category to appear as if we're navigating the website like a normal user would. Sometimes we might even want to set it to a common search engine like google.com, bing.com or duckduckgo.com.
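
A minimal sketch with httpx (the URLs are the hypothetical ones from the example above):

import httpx

# visit the category page first, then request the product page with a natural Referer
category_url = "https://website.com/category"
product_url = "https://website.com/category/product.html"

with httpx.Client() as client:
    client.get(category_url)
    response = client.get(product_url, headers={"Referer": category_url})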

Another thing to note is that links marked with noreferrer do not send this header. When following such links with a web scraper, we should not send the Referer header either, as it can be an easy giveaway.

Cookie Header

Cookies play a major role in web scraping and there are a few things to keep an eye on here.

Firstly, many web pages rely on cookies to track session history, which means web scrapers need to replicate (or work around) this behavior. A common scraping pattern is called "session pre-walking" or "warmup": before connecting to the product page, the web scraper warms up the HTTP session by scraping other pages like the homepage and product category to collect session cookies and display a natural connection pattern to the web server.

In other words, before we scrape webstore.com/product1 we'd visit webstore.com and webstore.com/products to ensure our scraper appears "human".
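
A minimal sketch of this warmup flow (hypothetical URLs; the key detail is a client with a persistent cookie jar):

import httpx

# httpx.Client keeps a cookie jar across requests, so cookies collected on the
# warmup pages are sent automatically with the final product request
with httpx.Client() as client:  # in practice also set the browser-like headers from above
    client.get("https://webstore.com/")
    client.get("https://webstore.com/products")
    response = client.get("https://webstore.com/product1")
    print(client.cookies)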

Further, ensuring that the cookie string looks like one generated by a browser can also help avoid identification. HTTP client libraries often differ in how they generate the cookie string, making web scrapers easy to identify.

Finally, sometimes it might be worth editing connection cookies and not sending those used for analytics or tracking. Many websites accept that some users run anti-tracking plugins in their browser, so by dropping these cookies in our scraper we can avoid extra connection monitoring which could expose our web scraper.
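
A small sketch of such cookie filtering (the cookie names and prefixes here are hypothetical examples):

# drop common analytics/tracking cookies from a collected cookie jar before reusing it
TRACKING_PREFIXES = ("_ga", "_gid", "_fbp")

collected = {"session_id": "abc123", "_ga": "GA1.2.1111", "_gid": "GA1.2.2222"}
clean_cookies = {
    name: value for name, value in collected.items()
    if not name.startswith(TRACKING_PREFIXES)
}
print(clean_cookies)  # {'session_id': 'abc123'}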

Authorization Header

The Authorization header is the standard way to provide a secret auth token to the web server. In web scraping, we mostly see it used for XHR data resources on dynamic, javascript-powered websites. Authorization tokens are rarely dynamic, so often we can copy them into the web scraper code, or they can be found in the HTML body or javascript resources.
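
A minimal sketch (the endpoint and token are hypothetical - in practice we'd copy the token from the browser's network inspector or extract it from the page source):

import httpx

# reuse the auth token the website's own javascript sends with its XHR requests
headers = {"Authorization": "Bearer <token copied from browser devtools>"}
response = httpx.get("https://website.com/api/products", headers=headers)
print(response.status_code)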

X Headers

Many modern websites take advantage of javascript front ends and can implement custom headers for dynamic functionality, analytics or authentication. These headers are usually prefixed with X- and are non-standard, however there are a few we frequently encounter in web scraping, so let's take a look at some:

x-api-key

Variations of this header indicate a microservice security key. Often, this key can be found either in the HTML source (keep an eye on <script> tags) or in preceding requests.
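
A hedged sketch (the URL and regex pattern are hypothetical - the key's exact location and name differ per website):

import re
import httpx

# look for an api key embedded in the page source or a bundled javascript file
html = httpx.get("https://website.com/").text
match = re.search(r'"apiKey"\s*:\s*"([^"]+)"', html)
if match:
    headers = {"x-api-key": match.group(1)}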

x-csrf-token

CSRF stands for cross-site request forgery, and this token is used to ensure that requests originate from the same source. When it comes to web scraping, this token is usually embedded in the HTML source, so scrapers need to request the HTML page and extract the token before connecting to data resources.
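
A minimal sketch (hypothetical URLs; many sites expose the token in a <meta name="csrf-token"> tag or a hidden form input, so a simple regex over the HTML is often enough):

import re
import httpx

with httpx.Client() as client:
    html = client.get("https://website.com/").text
    # e.g. <meta name="csrf-token" content="...">
    match = re.search(r'name="csrf-token" content="([^"]+)"', html)
    if match:
        headers = {"x-csrf-token": match.group(1)}
        data = client.get("https://website.com/api/data", headers=headers)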

ScrapFly

As you can see, the subject of request headers is quite complicated and nuanced. The ScrapFly API can smartly select the best header fingerprint for the scraped target automatically, saving valuable development time and abstracting this complex logic away from sensitive web scraper code!

ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale. Each product is equipped with an automatic bypass for any anti-bot system and we achieve this by:

  • Maintaining a fleet of real, reinforced web browsers with real fingerprint profiles.
  • Millions of self-healing proxies of the highest possible trust score.
  • Constantly evolving and adapting to new anti-bot systems.
  • We've been doing this publicly since 2020 with the best bypass on the market!

Summary

In this introduction we've taken a look at the most common headers encountered in web scraping and how they can leak our web scraper's identity, which can result in blocking or throttling. We've also covered how header order can be used to identify web scrapers, and how we can reverse engineer web browser behavior so we can replicate it in our scraper.

Header fingerprinting is an increasingly complex subject - why not delegate it to ScrapFly? Check it out yourself for free!
