PerimeterX is one of the most popular anti-bot services on the market offering a wide range of protection against bots and scrapers. PerimeterX products Bot Defender, Page Defender and API Defender are all used to block web scrapers.
In this article, we'll take a look at how to bypass PerimeterX bot protection. We'll do this by taking a quick look at how it detects scrapers and how to modify our scraper code to prevent being detected by PerimeterX.
We'll also cover common PerimeterX errors and signs that indicate that requests have failed to bypass PerimeterX and their meaning. Let's dive in!
What is PerimeterX?
PerimeterX (aka Human) is a web service that protects websites, apps and APIs from automation such as scrapers. It uses a combination of web technologies and behavior analysis to determine whether the user is a human or a bot.
It is used by popular websites like Zillow.com, fiverr.com, and many others. So, by understanding how to bypass PerimeterX, we can open up web scraping of many popular websites.
Next, let's take a look at some popular PerimeterX errors.
Popular PerimeterX Errors
A PerimeterX block is mostly encountered on the first request to the website, though since PerimeterX uses behavior analysis, it can also appear at any point during web scraping.
Let's take a look at how exactly PerimeterX is detecting web scrapers and bots and how the "Press and hold" button works.
How Does PerimeterX Detect Web Scrapers?
To detect web scraping, PerimeterX uses many different technologies to estimate whether the traffic is coming from a human user or a bot.
PerimeterX uses a combination of fingerprinting and connection analysis to calculate a trust score for each client. This score determines whether the user can access the website or not.
This complex process makes web scraping difficult as there are many factors at play here. However, if we take a look at each individual factor we can see that bypassing PerimeterX is very much possible!
TLS (or SSL) is the first step in HTTP connection establishment. It is used to encrypt the data that is being sent between the client and the server. Note that TLS is only applicable to https endpoints (not http).
First, the client and the server negotiate how encryption is done and this is where TLS fingerprinting comes into play. Different computers, programs and even programming libraries have different TLS capabilities.
So, if a scraper uses a library with different TLS capabilities compared to a regular web browser, it can be identified quite easily. This is generally referred to as JA3 fingerprinting.
For example, some libraries and tools used in web scraping have unique TLS negotiation patterns that can be instantly recognized, while others use the same TLS techniques as a web browser and can be very difficult to differentiate.
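To make the JA3 idea more concrete, here is a minimal sketch of how a JA3 hash is derived: five fields of the TLS ClientHello are joined into a string and hashed with MD5. The numeric values below are illustrative only and were not captured from any real client.

```python
import hashlib


def ja3_fingerprint(tls_version, ciphers, extensions, curves, point_formats):
    """Build the JA3 string and its MD5 digest from ClientHello fields."""
    fields = [
        str(tls_version),
        "-".join(str(value) for value in ciphers),
        "-".join(str(value) for value in extensions),
        "-".join(str(value) for value in curves),
        "-".join(str(value) for value in point_formats),
    ]
    ja3_string = ",".join(fields)
    return ja3_string, hashlib.md5(ja3_string.encode("ascii")).hexdigest()


# Illustrative values only: same TLS version, different cipher and extension lists
browser_like = ja3_fingerprint(771, [4865, 4866, 4867], [0, 23, 65281, 10, 11], [29, 23, 24], [0])
library_like = ja3_fingerprint(771, [4866, 4867, 4865], [0, 10, 11], [29, 23, 24], [0])

print(browser_like[0])
print("fingerprints differ:", browser_like[1] != library_like[1])
```

Even a different ordering of the same cipher suites produces a different hash, which is why an HTTP library can stand out from a browser before any request is sent.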
The next step is IP address analysis. Since IP addresses come in many different shapes and sizes there's a lot of information that can be used to determine whether the client is a human or a bot.
To start, there are different types of IP addresses:
Residential are home addresses assigned by internet providers to average people. So, residential IP addresses provide a positive trust score as these are mostly used by humans and are expensive to acquire.
Mobile addresses are assigned by mobile phone towers to mobile users. So, mobile IPs also provide a positive trust score as these are mostly used by humans. In addition, since mobile towers might share and recycle IP addresses, it is much more difficult to rely on IP addresses for bot identification.
Datacenter addresses are assigned to various data centers and server platforms like Amazon's AWS, Google Cloud etc. So, datacenter IPs provide a significant negative trust score as they are likely to be used by bots.
Using IP analysis, PerimeterX can estimate how likely the connecting client is to be a human. To start, most people browse from residential IPs, while mobile IPs are mostly used for mobile traffic.
So, use high-quality residential or mobile proxies.
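As a rough illustration of IP type analysis, the sketch below classifies an address by network range and maps it to a trust weight. Everything here is hypothetical: the ranges are reserved documentation networks standing in for real data, and the weights only mirror the direction described above. How PerimeterX actually scores addresses is not public.

```python
import ipaddress

# Hypothetical ranges (reserved documentation networks), not real provider data.
# A real system would rely on an IP intelligence database instead.
NETWORKS = {
    "datacenter": [ipaddress.ip_network("192.0.2.0/24")],
    "residential": [ipaddress.ip_network("198.51.100.0/24")],
    "mobile": [ipaddress.ip_network("203.0.113.0/24")],
}

# Made-up weights: positive for residential and mobile, negative for datacenter
WEIGHTS = {"residential": 1, "mobile": 1, "datacenter": -2, "unknown": 0}


def classify_ip(ip):
    """Return the address type whose network contains the IP."""
    address = ipaddress.ip_address(ip)
    for kind, networks in NETWORKS.items():
        if any(address in network for network in networks):
            return kind
    return "unknown"


def ip_trust_weight(ip):
    """Map an IP address to its illustrative trust weight."""
    return WEIGHTS[classify_ip(ip)]


print(classify_ip("192.0.2.10"), ip_trust_weight("192.0.2.10"))
```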
The next step is the HTTP connection itself. This includes HTTP connection details like:
Most of the web uses HTTP2, while many web scraping tools still use HTTP1.1, which is a dead giveaway. Many newer HTTP clients like httpx or cURL support HTTP2, though not always by default.
HTTP2 can also be susceptible to fingerprinting, so check ScrapFly's http2 fingerprint test page for more info.
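For instance, httpx only negotiates HTTP2 when it is asked to. The following is a minimal sketch, assuming the third-party httpx package is installed with its HTTP2 extra (`pip install "httpx[http2]"`). The helper names are made up for this example.

```python
def http2_client_options():
    """Keyword arguments for an httpx client that negotiates HTTP2."""
    return {"http2": True, "follow_redirects": True, "timeout": 30.0}


def fetch_http_version(url):
    """Request a URL and report which HTTP version was negotiated."""
    # imported lazily so the sketch loads even where httpx is not installed
    import httpx

    with httpx.Client(**http2_client_options()) as client:
        response = client.get(url)
        # "HTTP/2" when the server agreed to it, otherwise "HTTP/1.1"
        return response.http_version
```

Calling `fetch_http_version()` with the target URL is a quick way to confirm that the scraper is not falling back to HTTP1.1.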
Pay attention to request header values: X- prefixed headers and the usual suspects like User-Agent, Origin and Referer can all be used to identify web scrapers.
Web browsers have a specific way of ordering request headers. So, if the headers are not ordered in the same way as a web browser, it can be a critical giveaway. In addition, some HTTP libraries (like requests in Python) do not respect the header order and can be easily identified.
So, make sure the headers in web scraper requests match a real web browser, including the ordering.
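Header ordering can be sketched with a small helper that sorts headers to follow a reference order. The reference list below is illustrative and should be replaced with an order captured from a real browser. Whether the order survives on the wire depends on the HTTP client being used.

```python
# Illustrative reference order; capture the real one from your own browser
REFERENCE_ORDER = ["user-agent", "accept", "referer", "accept-encoding", "accept-language"]


def order_headers(headers, reference_order=REFERENCE_ORDER):
    """Return headers as (name, value) pairs sorted to match a reference order."""
    rank = {name.lower(): index for index, name in enumerate(reference_order)}
    # sorted() is stable, so headers missing from the reference
    # keep their relative order at the end
    return sorted(headers.items(), key=lambda item: rank.get(item[0].lower(), len(rank)))


scraper_headers = {
    "Accept-Language": "en-US,en;q=0.9",
    "X-Custom": "1",
    "User-Agent": "Mozilla/5.0",
    "Accept": "text/html",
}
for name, value in order_headers(scraper_headers):
    print(f"{name}: {value}")
```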
Hardware details and capabilities
Operating system details
Web browser details
That's loads of data that can be used in calculating the trust score.
So, introducing browser automation to your scraping pipeline can drastically raise the trust score.
Tip: many advanced scraping tools can even combine browser and HTTP scraping capabilities for optimal performance: resource-heavy browsers are used to establish a trust score, and scraping then continues with fast HTTP clients like httpx in Python (this feature is also available using Scrapfly sessions).
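One way to picture this handover is to carry the browser's cookies over to the HTTP client. The sketch below assumes cookies in the shape returned by Playwright's `context.cookies()` (a list of dicts with `name`, `value` and `domain` keys). The cookie names and values are made up for illustration.

```python
def cookies_to_header(browser_cookies, host):
    """Build a Cookie header value from browser cookies that apply to a host."""
    pairs = []
    for cookie in browser_cookies:
        domain = cookie["domain"].lstrip(".")
        # match the exact host or any parent domain the cookie was set for
        if host == domain or host.endswith("." + domain):
            pairs.append(f"{cookie['name']}={cookie['value']}")
    return "; ".join(pairs)


# Made-up cookies in the shape a browser automation tool would export
exported = [
    {"name": "session", "value": "abc123", "domain": ".example.com", "path": "/"},
    {"name": "tracker", "value": "zzz", "domain": "other.com", "path": "/"},
]
print(cookies_to_header(exported, "www.example.com"))
```

The resulting string can be sent as the Cookie header by the HTTP client, ideally together with the same User-Agent and proxy the browser used.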
Even when scrapers' initial connection is indistinguishable from a real web browser, PerimeterX can still detect them through behavior analysis.
This is done by monitoring the connection and analyzing the behavior of the client. This includes:
Pages that are being visited. People browse in more chaotic patterns.
Connection speed and rate. People are slower and more random than bots.
Loading of resources like images, scripts, stylesheets etc.
The trust score is not a constant number and will be constantly adjusted.
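Since humans are slower and more random than bots, a scraper can at least avoid a perfectly regular request rhythm. Here is a minimal sketch of randomized pacing. The timing values are arbitrary assumptions and not thresholds known to be used by PerimeterX.

```python
import random


def human_delays(count, base=2.0, jitter=1.5, pause_chance=0.1, rng=None):
    """Generate irregular delays (in seconds) to wait between requests."""
    rng = rng or random.Random()
    delays = []
    for _ in range(count):
        delay = base + rng.uniform(0, jitter)
        if rng.random() < pause_chance:
            # occasional longer pause, as if the user stopped to read the page
            delay += rng.uniform(5, 15)
        delays.append(round(delay, 2))
    return delays


print(human_delays(5))
```

Each value can be passed to `time.sleep()` before the next request.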
So, it's important to distribute web scraper traffic through multiple agents using proxies and different fingerprint configurations to prevent behavior analysis.
For example, if browser automation tools are used different browser configurations should be used for each agent like screen size, operating system, web browser version, IP address etc.
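A simple way to keep agents distinct is to derive a stable profile for each one. In this sketch the viewport sizes, user agent strings and proxy URLs are all illustrative placeholders. The `viewport`, `user_agent` and `locale` keys follow the option names of Playwright's `browser.new_context()`, while `proxy` is kept as a plain URL for the agent to use.

```python
import random

VIEWPORTS = [(1920, 1080), (1536, 864), (1366, 768), (1440, 900)]
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
]
LOCALES = ["en-US", "en-GB", "de-DE"]
# Hypothetical proxy endpoints
PROXIES = ["http://proxy-1.example.com:8000", "http://proxy-2.example.com:8000"]


def make_profile(agent_id, proxies=PROXIES):
    """Create a reproducible fingerprint profile for one scraping agent."""
    rng = random.Random(agent_id)  # same agent id always gets the same profile
    width, height = rng.choice(VIEWPORTS)
    return {
        "viewport": {"width": width, "height": height},
        "user_agent": rng.choice(USER_AGENTS),
        "locale": rng.choice(LOCALES),
        "proxy": proxies[agent_id % len(proxies)],
    }


for agent_id in range(3):
    print(agent_id, make_profile(agent_id))
```

Seeding by agent id keeps each agent consistent between runs, which matters because a client whose fingerprint changes on every request is suspicious in itself.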
How to Bypass PerimeterX (aka Human) Bot Protection?
Now that we're familiar with all of the ways PerimeterX can detect web scrapers, let's see how to bypass it.
In reality, we have two very different options:
We could reverse engineer and fortify against all of these techniques, but PerimeterX is constantly updating its detection methods, so it's a never-ending game of cat and mouse.
Alternatively, we can use real web browsers for scraping. This is the most practical and effective approach as it's much easier to ensure that the headless browser looks like a real one than to re-invent it.
However, many browser automation tools like Selenium, Playwright and Puppeteer leave traces of their existence, which need to be patched to achieve high trust scores. For that, see projects like Puppeteer stealth plugin and other similar stealth extensions that patch known leaks.
For sustained web scraping with PerimeterX bypass in 2023, these browsers should always be rotated through different fingerprint profiles: screen resolution, operating system and browser type all play an important role in PerimeterX's bot score.
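To illustrate what such a patch looks like, the sketch below hides one well-known leak, the `navigator.webdriver` flag, using Playwright's init scripts. It assumes the third-party playwright package and its browsers are installed. The helper name is made up, and a single patch like this is far from a complete stealth setup.

```python
# JavaScript executed before any page script runs; hides one automation marker
HIDE_WEBDRIVER_JS = "Object.defineProperty(navigator, 'webdriver', {get: () => undefined});"


def open_patched_page(url):
    """Load a page with the patch applied and return what navigator.webdriver reports."""
    # imported lazily so the sketch loads even where playwright is not installed
    from playwright.sync_api import sync_playwright

    with sync_playwright() as pw:
        browser = pw.chromium.launch(headless=True)
        context = browser.new_context()
        context.add_init_script(HIDE_WEBDRIVER_JS)
        page = context.new_page()
        page.goto(url)
        flag = page.evaluate("navigator.webdriver")
        browser.close()
        return flag
```

Stealth plugins bundle many patches of this kind and keep them up to date as new leaks are found.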
Bypass with ScrapFly
While bypassing PerimeterX is possible, maintaining bypass strategies can be very time-consuming. This is where services like ScrapFly come in!
Using ScrapFly web scraping API we can hand over all of the web scraping complexity and bypass logic to an API!
Scrapfly is not only a PerimeterX bypasser but also offers many other web scraping features:
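from scrapfly import ScrapflyClient, ScrapeConfig
scrapfly = ScrapflyClient(key="YOUR API KEY")
result = scrapfly.scrape(ScrapeConfig(
    url="https://www.example.com/",  # placeholder: the page to scrape
    asp=True,  # enable anti scraping protection bypass
    # and set proxies by country like Japan
    country="JP",
    # and proxy type like residential:
    proxy_pool="public_residential_pool",
))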
To wrap this article let's take a look at some frequently asked questions regarding web scraping PerimeterX pages:
Is it legal to scrape PerimeterX protected pages?
Yes. Web scraping publicly available data is generally legal in most jurisdictions as long as the scrapers do not cause damage to the website.
Is it possible to bypass PerimeterX using cache services?
Yes, public page caching services like Google Cache or Archive.org can be used to bypass PerimeterX protected pages as Google and Archive.org tend to be whitelisted. However, since caching takes time, the cached page data is often outdated and not suitable for web scraping. Cached pages can also be missing parts of content that are loaded dynamically.
Is it possible to bypass PerimeterX entirely and scrape the website directly?
No. PerimeterX integrates directly with the server software, so it is very difficult to reach the server without going through it. It is possible that some servers could have PerimeterX misconfigured, but it's very unlikely.
In this article, we took a deep dive into PerimeterX anti-bot systems when web scraping.
Finally, we've taken a look at some frequently asked questions like alternative bypass methods and the legality of it all.
For an easier way to handle web scraper blocking and power up your web scrapers check out ScrapFly for free!