One of the sneakiest and least known ways of detecting and fingerprinting web scraper traffic is Transport Layer Security (TLS) analysis. Every HTTPS connection has to establish a secure handshake, and the way this handshake is performed can be used to fingerprint the client and block web scrapers.
In this article we'll take a look at how TLS can leak the fact that the connecting client is a web scraper and how it can be used to establish a fingerprint that tracks the client across the web.
What Is TLS?
Transport Layer Security is what powers all HTTPS connections. It's what allows end-to-end encrypted communication between the client and the server.
In the context of web scraping we rarely care whether the website uses HTTP or HTTPS connections as that doesn't affect our data collection logic. However, an emerging fingerprinting technology is targeting this connection step to not only fingerprint users for tracking but also to block web scrapers.
TLS is a rather complicated protocol, and we don't need to understand all of it to identify our problem, though some basics will help. Let's take a quick TLS overview so we can understand how it can be used in fingerprinting.
At the beginning of every HTTPS connection the client and the server need to greet each other and negotiate how the connection will be secured. The client opens this handshake with a "Client Hello" message. Data-wise it looks something like this:
There's a lot of data here, and this is where the spot-the-difference game begins: which values of this handshake can vary between HTTP clients such as web browsers and programming libraries?
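As a rough illustration, the parts of a Client Hello that matter for fingerprinting can be written down as a small Python structure. The code points are real IANA values, but the particular selection and ordering are assumptions made for this sketch, not a capture from any specific client.

```python
# Illustrative Client Hello reduced to the fields that fingerprinting relies on.
# Code points are real IANA values; this particular selection is hypothetical.
client_hello = {
    "version": 0x0303,  # legacy version field, 0x0303 = TLS 1.2
    "cipher_suites": [  # ordered, most preferred first
        (0x1301, "TLS_AES_128_GCM_SHA256"),
        (0x1302, "TLS_AES_256_GCM_SHA384"),
        (0xC02B, "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256"),
        (0xC02F, "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256"),
    ],
    "extensions": [
        (0, "server_name"),
        (10, "supported_groups"),
        (11, "ec_point_formats"),
        (13, "signature_algorithms"),
        (16, "application_layer_protocol_negotiation"),
        (43, "supported_versions"),
        (51, "key_share"),
    ],
    "supported_groups": [(29, "x25519"), (23, "secp256r1"), (24, "secp384r1")],
    "ec_point_formats": [(0, "uncompressed")],
}

# Print each list the way a packet analyzer would summarize it.
for field in ("cipher_suites", "extensions", "supported_groups", "ec_point_formats"):
    names = ", ".join(name for _, name in client_hello[field])
    print(f"{field}: {names}")
```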
The first thing to note is that there are multiple TLS versions: usually it's either 1.2 or 1.3 (the latest one).
This version determines the rest of the data used in the handshake. TLS 1.3 provides extra optimizations and sends less data, so it's easier to fortify, but whether it's 1.2 or 1.3 our goal remains the same: make the handshake look like that of a real web browser. In reality, we have to fortify multiple versions, as some websites do not support TLS 1.3 yet.
Further, we have the most important field: Cipher Suites.
This field is a list of the encryption algorithms the client supports. The list is ordered by priority, and the server settles on a suite that both parties support.
So, we must ensure that our HTTP client's list matches that of a common web browser, including the order.
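To see what a Python client offers out of the box, the standard library `ssl` module can list the cipher suites enabled on a default context. This is only a quick inspection sketch; the output depends on the local Python and OpenSSL builds, so no particular list should be expected.

```python
import ssl

# Default client-side context, as used by most Python HTTP clients.
ctx = ssl.create_default_context()

# get_ciphers() returns the enabled suites in order of priority.
cipher_names = [cipher["name"] for cipher in ctx.get_ciphers()]

for position, name in enumerate(cipher_names, start=1):
    print(position, name)
```

Comparing this list, and its order, with the one a browser sends in its Client Hello shows how far apart the two are.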
Similar to the list of Cipher Suites, we have a list of Enabled Extensions.
These extensions signify features the client supports, along with some metadata such as the server domain name. Just like with Cipher Suites, we need to ensure these values and their order match those of a common web browser.
As we can see, there are several values that can vary widely across clients. To capture them, the JA3 fingerprinting technique is often used, which is essentially a string of the varying values:
JA3 fingerprints are often further MD5-hashed to reduce fingerprint length:
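A minimal sketch of the JA3 calculation: five fields (TLS version, cipher suites, extensions, elliptic curves, elliptic curve point formats) are written as decimal numbers, joined with `-` inside a field and `,` between fields, and then MD5-hashed. The input values below are illustrative, and the helper names `ja3_string` and `ja3_hash` are made up for this example.

```python
import hashlib

# GREASE values (RFC 8701) are random-looking placeholders that browsers inject.
# JA3 skips them so they do not change the fingerprint between connections.
GREASE = {0x0A0A + 0x1010 * i for i in range(16)}


def ja3_string(version, ciphers, extensions, curves, point_formats):
    """Build the JA3 string from Client Hello values given as integers."""
    def join(values):
        return "-".join(str(v) for v in values if v not in GREASE)

    fields = [str(version), join(ciphers), join(extensions), join(curves), join(point_formats)]
    return ",".join(fields)


def ja3_hash(ja3):
    """MD5-hash the JA3 string into the short 32 character fingerprint."""
    return hashlib.md5(ja3.encode("ascii")).hexdigest()


# Illustrative values; 0x0A0A is a GREASE entry and gets dropped.
fingerprint = ja3_string(
    version=771,  # 0x0303, TLS 1.2
    ciphers=[0x0A0A, 4865, 4866, 4867, 49195, 49199],
    extensions=[0, 10, 11, 13, 16, 43, 51],
    curves=[29, 23, 24],
    point_formats=[0],
)
print(fingerprint)  # 771,4865-4866-4867-49195-49199,0-10-11-13-16-43-51,29-23-24,0
print(ja3_hash(fingerprint))
```

Because order is preserved inside each field, two clients that support the same cipher suites in a different order end up with different fingerprints.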
How To Read TLS Data?
The most common way to observe TLS handshakes is to use the Wireshark packet analyzer:
Using the filter tls we can easily observe the TLS handshake when we submit a request in a web browser or a web scraper script. Look for the "Client Hello" message, which is the first step in the handshake process.
Wireshark even calculates the JA3 fingerprint for you:
To test JA3 fingerprints we made an open ScrapFly JA3 tool, which makes it easy to test HTTP client fingerprints.
For example, these are the results for the requests library in Python:
This doesn't look a lot like Chrome or Firefox, meaning these Python web scrapers would be quite easy to identify! Let's take a look at how we could remedy this.
How Does TLS Fingerprinting Lead To Blocking?
When it comes to blocking web scrapers, the main goal is spotting the difference: is this client different from a typical web browser?
We can see that the JA3 fingerprint algorithm considers very few variables, meaning there are relatively few unique fingerprints, which makes it easy to build whitelist and blacklist databases.
ja3er.com is a common open JA3 fingerprint database. It allows looking up fingerprints and seeing counts by user agent string, which is useful for gathering some context before committing to fingerprint faking.
Anti web scraping services collect massive JA3 fingerprint databases, which are used to whitelist browser-like fingerprints and blacklist common web scraping ones. This means that to avoid blocking we must ensure that our JA3 fingerprint is whitelisted (matches a common web browser) or unique enough.
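On the server side, such a check can be as simple as a set lookup. The sketch below is a guess at the general shape, not how any particular anti-bot service works; both databases and the JA3 strings used to fill them are hypothetical placeholders.

```python
import hashlib


def ja3_hash(ja3: str) -> str:
    """MD5-hash a JA3 string into its short fingerprint form."""
    return hashlib.md5(ja3.encode("ascii")).hexdigest()


# Hypothetical databases, filled with made-up JA3 strings for illustration.
BROWSER_LIKE = {ja3_hash("771,4865-4866-4867-49195-49199,0-10-11-13-16-43-51,29-23-24,0")}
KNOWN_SCRAPERS = {ja3_hash("771,49199-49195,0-10-11,23-24,0")}


def classify(fingerprint: str) -> str:
    """Return 'block', 'allow' or 'unknown' for a hashed JA3 fingerprint."""
    if fingerprint in KNOWN_SCRAPERS:
        return "block"
    if fingerprint in BROWSER_LIKE:
        return "allow"
    # Neither list matched: the fingerprint is unique enough to slip through,
    # unless the service decides to treat unknown clients as suspicious.
    return "unknown"


print(classify(ja3_hash("771,49199-49195,0-10-11,23-24,0")))
```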
How To Fake TLS Fingerprint?
Unfortunately, configuring TLS spoofing is pretty complicated and not easily achievable in many scenarios. Nevertheless, let's take a look at some common cases.
TLS Fortification in Python
In Python we can only configure the "Cipher Suite" and "TLS version" variables, meaning every Python HTTP client is vulnerable to TLS extension fingerprinting. We cannot achieve whitelisted fingerprints, but by spoofing these two variables we can at least avoid the blacklists:
When we're scraping using Playwright, Puppeteer or Selenium we're using real browsers, so we get genuine TLS fingerprints, which is great! That being said, when scraping at scale, using a diverse collection of browser and operating system versions can help spread connections across multiple fingerprints rather than one.
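A minimal sketch using only the standard library: a custom `ssl.SSLContext` sets the minimum TLS version and the cipher suite preference, and is then handed to `urllib`. The cipher string is a hypothetical choice in OpenSSL syntax, and `build_context` is a helper name made up for this example. Note that `set_ciphers()` does not control TLS 1.3 suites, so only the TLS 1.2 part of the list changes.

```python
import ssl
import urllib.request

# Hypothetical cipher preference in OpenSSL cipher string syntax.
# Changing the contents or the order changes the resulting JA3 fingerprint.
CIPHERS = "ECDHE+AESGCM:ECDHE+CHACHA20"


def build_context(ciphers: str = CIPHERS) -> ssl.SSLContext:
    """Create a client context with a custom TLS version floor and cipher list."""
    ctx = ssl.create_default_context()  # keeps certificate verification on
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    ctx.set_ciphers(ciphers)  # raises ssl.SSLError if nothing matches
    return ctx


ctx = build_context()

# Every request made through this opener uses the customized handshake.
opener = urllib.request.build_opener(urllib.request.HTTPSHandler(context=ctx))
# response = opener.open("https://example.com")  # uncomment to send a request
```

Most other Python HTTP clients accept a custom `SSLContext` in some form, so the same context can usually be reused with them.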
Do Headless Browsers Have Different Fingerprints?
No. Generally, running browsers in headless mode (be it Selenium, Playwright or Puppeteer) should not change the TLS fingerprint. This means JA3 and other TLS fingerprinting techniques cannot identify whether the connecting browser is headless or not.
ScrapFly - Making it Easy
TLS fingerprinting is really powerful for identifying bots and unusual HTTPS clients in general. We've taken a look at how JA3 fingerprinting works and how to treat and spoof it in web scraping.
As you can see, TLS is a really complex security protocol, which can be a huge time sink when it comes to fortifying web scrapers.
For this, ScrapFly's intelligent web scraping API comes fully TLS configured to access even the hardest to access targets.