One of the leading causes of web scraper identification is incorrectly configured request headers, which can lead to blocking, throttling or even outright banning of web scrapers.
Headers are a vital part of every HTTP request, as they provide important meta information about the incoming request. This ranges from basic information about the client, like the user agent string, to custom security tokens and client rendering capabilities.
In this article we'll take an extensive look at request headers in web scraping: how can we prevent our web scrapers from being identified and blocked by configuring our requests to appear natural?
We'll look at common headers and what they mean, as well as the challenges and best practices involved in configuring this part of the web scraping process.
Inspecting Browser Behavior
When web scraping we want our scraper to appear as a web browser, so first we should ensure that it replicates the common standard headers a web browser such as Chrome or Firefox sends.
To understand what browsers are sending, we need a simple echo server that prints out the HTTP connection details it receives. We can achieve this with a short Python script:
import socket

HOST = "127.0.0.1"
PORT = 65432

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.bind((HOST, PORT))
    s.listen()
    while True:
        conn, addr = s.accept()
        with conn:
            print(f"connected by {addr}")
            data = conn.recv(1024)
            print(data.decode())
            # header
            conn.send(b'HTTP/1.1 200 OK\n')
            conn.send(b'Content-Type: text/html\n')
            conn.send(b'\n')
            # body
            conn.send(b'<html><body><pre>')
            conn.send(data)
            conn.send(b'</pre></body></html>')
            conn.close()
If we run this script and go to http://127.0.0.1:65432 in our browser, we'll see the exact HTTP request our browser is sending:
Chrome on MacOS
GET / HTTP/1.1
Host: 127.0.0.1:65432
Connection: keep-alive
Cache-Control: max-age=0
sec-ch-ua: " Not A;Brand";v="99", "Chromium";v="99", "Google Chrome";v="99"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "macOS"
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.83 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Sec-Fetch-Site: none
Sec-Fetch-Mode: navigate
Sec-Fetch-User: ?1
Sec-Fetch-Dest: document
Accept-Encoding: gzip, deflate, br
Accept-Language: en-GB,en-US;q=0.9,en;q=0.8
Firefox on MacOS
GET / HTTP/1.1
Host: 127.0.0.1:65432
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:98.0) Gecko/20100101 Firefox/98.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
Upgrade-Insecure-Requests: 1
Sec-Fetch-Dest: document
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: none
Sec-Fetch-User: ?1
Cache-Control: max-age=0
Safari on MacOS
GET / HTTP/1.1
Host: 127.0.0.1:65432
Upgrade-Insecure-Requests: 1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.0 Safari/605.1.15
Accept-Language: en-gb
Accept-Encoding: gzip, deflate
Connection: keep-alive
The above shows the default headers, and their order, that common web browsers send with the first request when establishing a connection.
Using this information we can build our header fingerprint profiles for our web scrapers. As always, we don't want this fingerprint to stick out too much, so we should aim to replicate the most common platforms such as Chrome on Windows or Safari on MacOS.
Request Header Order
The first thing we notice in the previous section is that browsers send headers in a certain order, and this is an often overlooked web scraper identification method. Primarily because many HTTP clients in various programming languages implement their own header ordering - making identification of web scrapers very easy!
For example, the most common HTTP client library in Python - requests - does not respect header order (see issue 5814 for potential solutions), so web scrapers based on it can be easily identified. Alternatively, the httpx library does respect header order, so we can safely use it for web scraping as a requests alternative.
To avoid detection due to unnatural header order, we should ensure that the HTTP client we use respects header ordering, and explicitly order the headers as they appear in a web browser.
For example, if we're using httpx in Python we can imitate Firefox on Windows headers and their ordering:
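Here's a minimal sketch of what that could look like - the values are taken from the Firefox capture above, with the user agent swapped to a Windows one as an assumption (keep it up to date with real releases):

import httpx

# header names, values and their order mimic Firefox; httpx preserves the
# insertion order of this dict when sending the request
FIREFOX_WINDOWS_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0) Gecko/20100101 Firefox/98.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate, br",  # over HTTPS Firefox also advertises brotli
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
}

response = httpx.get("https://example.com", headers=FIREFOX_WINDOWS_HEADERS)
print(response.status_code)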
With ordering covered, let's go through the most important standard headers one by one.
Accept-Encoding
Indicates what content encodings (compression formats) the HTTP client supports. Take note of the br value, which indicates support for the newer brotli encoding - a detail that is commonly used to identify web scrapers:
# Firefox with brotli support
gzip, deflate, br
# client with no brotli support
gzip, deflate
Accept-Language
Indicates what languages the browser supports. Keep an eye on the q value, which indicates the preference score when multiple languages are defined (eg fr-CH, fr;q=0.9, en;q=0.8, de;q=0.7, *;q=0.5). When scraping non-English websites we might need to adjust this value to the appropriate language to avoid standing out from the crowd.
Host
Indicates the server name we're interacting with. Most HTTP clients set this header automatically from the given URL, so we don't need to worry about it.
Upgrade-Insecure-Requests
Indicates whether the client allows http->https redirects. In cases where our scraper struggles with SSL, we might want to try our luck and disable this to scrape the unsecured version of the website if it's available.
User-Agent
Arguably the most important header when it comes to web scraping, it indicates what device is submitting the request. There are thousands of different user agent strings, but as a general rule of thumb we want to aim for the most common one, which is usually a Windows computer running the Chrome web browser:
# Windows 10 + Chrome
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.74 Safari/537.36
# MacOS + Chrome
Mozilla/5.0 (Macintosh; Intel Mac OS X 12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.74 Safari/537.36
# iOS + Chrome
Mozilla/5.0 (iPhone; CPU iPhone OS 15_4 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) CriOS/99.0.4844.59 Mobile/15E148 Safari/604.1
# Android + Chrome
Mozilla/5.0 (Linux; Android 10) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.78 Mobile Safari/537.36
Take note that the configured User-Agent string should match the rest of the standard headers, like Accept and Accept-Encoding.
Since user agents indicate specific software versions, we want to keep our web scrapers up to date with the most popular releases, or even rotate many different user agent strings in our scraper pool to distribute our network traffic. There are several public user agent string databases, like the one provided by whatismybrowser.com - when scraping at scale it's important to have a healthy user agent selection!
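A simple rotation sketch could look like this (the pool below just reuses the sample strings from above; a real scraper would draw from a larger, up-to-date database):

import random
import httpx

# sample pool reusing the user agent strings listed above
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.74 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.74 Safari/537.36",
    "Mozilla/5.0 (Linux; Android 10) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.78 Mobile Safari/537.36",
]

def scrape(url: str) -> httpx.Response:
    # pick a random user agent per request to spread our fingerprint around;
    # the remaining headers should be adjusted to match the chosen browser/platform
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return httpx.get(url, headers=headers)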
Sec-Fetch Headers
The Sec-Fetch- family of headers (aka fetch metadata request headers) indicates some security details that can be used in web scraper fingerprinting:
Sec-Fetch-Site indicates origin of the request. When web scraping we want to use none for direct requests and same-site for dynamic data requests (XHR type of requests).
Sec-Fetch-Mode indicates the request mode. In web scraping we'd use navigate for direct requests and same-origin, cors or no-cors (depending on the website's functionality) for dynamic data requests.
Sec-Fetch-User indicates whether the request was made by a user or by javascript. This header always has the value ?1 or is omitted.
Sec-Fetch-Dest indicates requested document type. In web scraping this is usually document for direct HTML requests and empty for dynamic data requests.
These are the default values the Chrome browser uses when working with HTTPS websites; however, the behavior might differ on dynamic javascript-powered websites, so it's always worth keeping an eye on these headers on a per-scraper basis.
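To illustrate, here's a rough sketch of the two value sets described above (based on the Chrome defaults; adjust per target website):

# Sec-Fetch values for a direct page request (navigating straight to the URL)
DOCUMENT_FETCH_HEADERS = {
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-User": "?1",
    "Sec-Fetch-Dest": "document",
}

# Sec-Fetch values for a background XHR data request made by the page itself
XHR_FETCH_HEADERS = {
    "Sec-Fetch-Site": "same-site",
    "Sec-Fetch-Mode": "cors",  # or no-cors / same-origin depending on the website
    "Sec-Fetch-Dest": "empty",
}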
Sec-CH (Client Hint) Headers
The Sec-CH- family of headers (client hints) is a new and experimental take on user agent strings implemented by newer browser versions. These headers essentially contain the same data available in the User-Agent header, so it's important to keep the values of the two consistent to avoid being identified as a web scraper.
Python script to extract client hint details from a User-Agent string:
"""
Converts Chrome User-Agent string to sec-ch (client hint) headers.
Requires:
pip install user-agents
Usage:
$ python user-agent-to-sec-ch.py "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like G
ecko) Chrome/99.0.4844.51 Safari/537.36"
{'sec-ch-ua': ' Not A;Brand";v="99", "Chromium";v="99", "Google Chrome";v="99"',
'sec-ch-ua-mobile': '?0',
'sec-ch-ua-platform': 'Windows'}
"""
import sys
from typing import Dict
from user_agents import parse
def ua_to_ch(ua_string: str) -> Dict[str, str]:
parsed = parse(ua_string)
major_version = parsed.browser.version[0]
assert parsed.browser.family == "Chrome", "only chrome user agent strings supported"
return {
"sec-ch-ua": f' Not A;Brand";v="{major_version}", "Chromium";v="{major_version}", "Google Chrome";v="{major_version}"',
"sec-ch-ua-mobile": f"?{int(parsed.is_mobile)}",
"sec-ch-ua-platform": parsed.get_os().split()[0],
}
if __name__ == "__main__":
from pprint import pprint
pprint(ua_to_ch(sys.argv[1]))
Optional Standard Headers
Web browsers also tend to send a lot of optional standard headers that indicate security, rendering or content features. Let's take a look at some of the common ones, what they mean, and how we should approach them when web scraping.
Referer Header
Indicates browsing history - which page referred the client to the current one.
This is a great header for masking scraping patterns. In other words, it's good practice to set this header to something natural: for example, if we're scraping website.com/category/product.html we'd set the referer to either website.com or website.com/category, to appear as if we're navigating the website as a normal user would. Sometimes we might even want to set it to a common search engine like google.com, bing.com or duckduckgo.com.
Another thing to note is that noreferrer links do not send this header. When following these links with a web scraper we should not send the Referer header either, as it can be an easy giveaway.
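A hypothetical sketch of setting a natural referer when scraping a product page (the URLs are illustrative):

import httpx

url = "https://website.com/category/product.html"
# pretend we arrived from the category page...
headers = {"Referer": "https://website.com/category"}
# ...or from a common search engine:
# headers = {"Referer": "https://www.google.com/"}
response = httpx.get(url, headers=headers)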
Cookie Header
Cookies play a major role in web scraping, and there are a few things to keep an eye on here.
Firstly, many web pages rely on cookies to track session history, which means web scrapers need to replicate or "hack" this behavior. A common scraping pattern is called "session pre-walking" or "warmup": before connecting to the product page, the web scraper warms up the HTTP session by scraping other pages, like the homepage and a product category, to collect session cookies and display a natural connection pattern to the web server.
In other words, before we scrape webstore.com/product1 we'd visit webstore.com and webstore.com/products to ensure our scraper appears "human".
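A minimal sketch of this warmup pattern using httpx (the URLs are the hypothetical ones from above):

import httpx

# a persistent client keeps cookies between requests, just like a browser session
with httpx.Client() as session:
    # warm up the session: homepage first, then the product category
    session.get("https://webstore.com")
    session.get("https://webstore.com/products")
    # by now the session carries the cookies a real visitor would have
    response = session.get("https://webstore.com/product1")
    print(response.status_code)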
Further, ensuring that the cookie string looks like the one a browser sends can also help avoid identification - HTTP client libraries often differ in how they generate cookie strings, making web scrapers easy to identify.
Finally, sometimes it might be worth editing the connection cookies and not sending cookies that are used for analytics or tracking. Many websites accept that some users run anti-tracking plugins in their browser, so by dropping these cookies in our scraper we can avoid extra connection monitoring that could expose our web scraper.
Authorization Header
The Authorization header is the standard way to provide a secret auth token to a web server. In web scraping, we might see this used for XHR data resources on dynamic, javascript-powered websites. Authorization tokens are rarely dynamic, so often we can copy them into our web scraper code, or they can be found in the HTML body or javascript resources.
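A hypothetical sketch of reusing a token found in the page source when calling an XHR data endpoint (the URL and token are placeholders):

import httpx

# hypothetical token copied from the website's HTML or javascript source
AUTH_TOKEN = "Bearer abc123"

response = httpx.get(
    "https://website.com/api/products",
    headers={"Authorization": AUTH_TOKEN},
)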
X Headers
Many modern websites take advantage of javascript front ends and can implement custom headers for dynamic functionality, analytics or authentication. These headers are usually prefixed with X- and are non-standard; however, there are a few we frequently see in web scraping, so let's take a look at some:
x-api-key
Variations of this header indicate a microservice security key. Often, this key can be found in either the HTML source (keep an eye on <script> tags) or preceding requests.
x-csrf-token
CSRF stands for cross-site request forgery, and this token is used to ensure that requests originate from the same source. When it comes to web scraping, this token is usually embedded in the HTML source, so scrapers need to request the HTML page and extract the token before connecting to data resources.
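For example, a common pattern looks roughly like this (the URLs, tag name and attribute vary per website, so treat this as a hypothetical sketch):

import re
import httpx

with httpx.Client() as session:
    # request the HTML page first - the token is usually embedded in it
    html = session.get("https://website.com/products").text
    # hypothetical: many sites embed the token in a meta tag or hidden input
    match = re.search(r'name="csrf-token" content="([^"]+)"', html)
    token = match.group(1) if match else ""
    # send the token along when calling the data endpoint
    response = session.get(
        "https://website.com/api/products",
        headers={"x-csrf-token": token},
    )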
ScrapFly
As you can see, the subject of request headers is quite complicated and nuanced. The ScrapFly API can smartly select the best header fingerprint for the scraped target automatically, saving valuable development time and abstracting this complex logic away from sensitive web scraper code!
ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale. Each product is equipped with an automatic bypass for any anti-bot system and we achieve this by:
Maintaining a fleet of real, reinforced web browsers with real fingerprint profiles.
Millions of self-healing proxies of the highest possible trust score.
Constantly evolving and adapting to new anti-bot systems.
We've been doing this publicly since 2020 with the best bypass on the market!
Summary
In this introduction blog we've taken a look at the most common headers encountered in web scraping and how they can leak our web scraper's identity, which can result in blocking or throttling. We've also covered how header order can be used to identify web scrapers, and how we can reverse engineer web browser behavior so we can replicate it in our scrapers.
Header fingerprinting is an increasingly complex subject - why not delegate it to ScrapFly? Check it out yourself for free!