How to Bypass PerimeterX when Web Scraping in 2024

PerimeterX is one of the most popular anti-bot protection services on the market, offering a wide range of protection against bots and scrapers. PerimeterX offers several products, such as Bot Defender, Page Defender, and API Defender, all of which are used as anti-bot systems to block web scrapers.

In this article, we'll explain how to bypass PerimeterX bot protection. We'll look at how PerimeterX detects scrapers and how to modify the scraper code to prevent being detected by PerimeterX.

We'll also cover common PerimeterX errors and signs that indicate PerimeterX detection and blocking. Let's dive in!

What is PerimeterX?

PerimeterX (now rebranded as HUMAN) is a web service that protects websites and apps from automated scripts, such as web scrapers. It uses a combination of web technologies and behavior analysis to determine whether the request sender is a human or a bot.

It is used on popular websites like Zillow.com, Fiverr, and many others. So, by understanding how to bypass the PerimeterX bot detection system, we can open the door for scraping many popular target websites.

Next, let's take a look at some popular PerimeterX errors.

Most PerimeterX bot blocks result in HTTP status codes in the 400-500 range, with 403 being the most common one. The response body usually contains a request to "enable JavaScript" or a "press and hold" button:

screenshot of perimeterX block page on fiverr.com
PerimeterX block page on fiverr.com

This error is mostly encountered on the first request to the target website. However, PerimeterX uses behavioral analysis techniques, allowing it to block requests at any point during the web scraping process.

Let's take a look at how exactly PerimeterX is detecting bots and web scrapers, which results in the "press and hold" CAPTCHA challenge.

How Does PerimeterX Detect Web Scrapers?

To detect bots and web scraping attempts, PerimeterX uses several different technologies to estimate whether the traffic comes from bots or genuine user interactions.

fingerprint technologies used by PerimeterX

PerimeterX uses a combination of various fingerprinting methods and connection analysis to calculate a trust score for each client. This score determines whether the user can access the website or not.

Based on the final trust score, the user is either allowed to access the website or blocked with a PerimeterX CAPTCHA page and its JavaScript challenges (i.e. the "press and hold" button).

trust score evaluation flow of PerimeterX anti bot service

This process makes it challenging to scrape a PerimeterX-protected website, as many factors play a role in the detection process. However, if we study each factor individually, we can see that bypassing PerimeterX is very much possible!

TLS Fingerprinting

TLS (or SSL) is the first step in establishing an HTTP connection when a request is sent to a web page. It is used to encrypt the data sent between the client and the server over the secure HTTPS channel.

The first step of this process is called the TLS handshake, where the client and the server negotiate which encryption methods to use. This is where TLS fingerprinting comes into play: since different computers, programs and even programming libraries have different TLS capabilities, the handshake produces a unique fingerprint.

So, if a web scraper uses a library with different TLS capabilities compared to a regular web browser, it can be identified quite easily. This is generally referred to as JA3 fingerprinting.

For example, some libraries and tools used in web scraping have unique TLS negotiation patterns that can be recognized instantly. On the other hand, other clients can use the same TLS techniques as a regular web browser, making it difficult to detect them as bots.

To inspect a TLS fingerprint, see ScrapFly's JA3 fingerprint web tool, which extracts the JA3 fingerprint of the connecting client.

So, use web scraping tools and libraries that are resistant to JA3 fingerprinting to bypass PerimeterX with a high success rate.
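For example, here is a minimal sketch using the curl_cffi Python library, which wraps curl-impersonate to replicate a real browser's TLS handshake (the browserleaks endpoint used here is just one public JA3 echo service):

# pip install curl_cffi
from curl_cffi import requests

# impersonate Chrome's TLS fingerprint during the handshake;
# older curl_cffi versions require a pinned target like "chrome110"
response = requests.get(
    "https://tls.browserleaks.com/json",  # echoes the JA3 hash the server saw
    impersonate="chrome",
)
print(response.json())

Because the handshake itself is impersonated, the resulting JA3 hash matches Chrome's rather than the one produced by Python's default TLS stack.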

For further details, refer to our dedicated guide on TLS fingerprinting.

How TLS Fingerprint is Used to Block Web Scrapers?

Learn how TLS can leak the fact that the connecting client is a web scraper and how it can be used to establish a fingerprint that tracks the client across the web.

IP Address Fingerprinting

The next step of the PerimeterX anti-bot identification process is the IP address analysis. Since IP addresses come in different types, they can reveal a lot of information about the client. PerimeterX uses the IP address details to determine whether traffic is coming from bots or human users.

There are different types of IP addresses:

  • Residential, home addresses assigned by internet providers to retail individuals. So, residential IP addresses provide a positive trust score as these are mostly used by humans and are expensive to acquire.
  • Mobile, addresses assigned by mobile phone towers to mobile users. Mobile IPs also provide a positive trust score as these are mostly used by humans. In addition, since mobile towers might share and recycle IP addresses, it makes it more difficult for anti-bots to rely on IP addresses for bot identification.
  • Datacenter, addresses assigned to various data centers and server platforms like Amazon's AWS, Google Cloud etc. Datacenter IPs provide a significant negative trust score, as they are likely to be used by bots and scripts.

Using IP monitoring, the PerimeterX anti-bot system can estimate how likely the connecting client is to be a human, as most humans browse from residential or mobile IP addresses while bots typically operate from datacenters.

Furthermore, PerimeterX can detect a high amount of traffic coming from the same IP address, leading to IP throttling or even blocking. Therefore, hiding the IP address and distributing the traffic across multiple IP addresses can prevent PerimeterX from detecting the IP origin.
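As a rough illustration, here's how requests could be distributed over a proxy pool with Python's httpx (the proxy URLs are placeholders, not real endpoints):

import random

import httpx

# hypothetical residential proxy endpoints - replace with a real provider's
PROXY_POOL = [
    "http://user:pass@res-proxy-1.example.com:8000",
    "http://user:pass@res-proxy-2.example.com:8000",
    "http://user:pass@res-proxy-3.example.com:8000",
]

def fetch(url: str) -> httpx.Response:
    # route each request through a randomly selected proxy to spread the traffic
    proxy = random.choice(PROXY_POOL)
    with httpx.Client(proxy=proxy) as client:  # httpx>=0.26; older versions use proxies=
        return client.get(url)

print(fetch("https://httpbin.dev/ip").text)  # should report a different IP per call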

For a more in-depth look, refer to our guide on IP address blocking.

How to Avoid Web Scraper IP Blocking?

Learn what Internet Protocol addresses are and how IP tracking technologies are used to block web scrapers.

HTTP Details

The next phase of the PerimeterX fingerprinting process is the HTTP connection itself. This includes HTTP connection details like:

  • Protocol Version
    Most of the web is running on HTTP2, while many web scraping tools still use HTTP1.1, which is a dead giveaway that the client is a bot. Many newer HTTP client libraries like httpx or cURL support HTTP2, but it's often not enabled by default.
    HTTP2 can also be susceptible to fingerprinting. So, check ScrapFly's http2 fingerprint test page for more info.
  • Headers
    Pay attention to X- prefixed headers and the usual suspects like the User-Agent, Origin and Referer as they can be used to identify web scrapers.
  • Header Order
    Web browsers have a specific way of ordering request headers. So, if the headers are not ordered in the same way as a web browser, the request can be identified as coming from a bot. Moreover, some HTTP libraries, such as requests in Python, do not respect the header order and can be easily identified.

So, make sure the headers used in web scraper requests match those of a real web browser, including their order, to bypass PerimeterX-protected websites while scraping.
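To illustrate, here's a minimal httpx sketch that sends a Chrome-like header set over HTTP2 (the header values are approximations and should match the exact browser version being mimicked):

# pip install "httpx[http2]"
import httpx

# an approximate Chrome header set; values should track the impersonated browser version
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://www.google.com/",
}

# http2=True enables HTTP2, which is off by default in httpx;
# httpx also sends the custom headers in the order defined above
with httpx.Client(http2=True, headers=headers) as client:
    response = client.get("https://httpbin.dev/headers")
    print(response.text)  # echoes back the headers the server received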

For more details, refer to our guide on the role of HTTP headers in web scraping.

How Headers Are Used to Block Web Scrapers and How to Fix It

See our full introduction article to request headers in the context of web scraper blocking. What do common headers mean, how are they used in web scraper identification and how can we fix this.

JavaScript Fingerprinting

Finally, the most powerful tool in PerimeterX's arsenal is JavaScript fingerprinting.

Since the server can execute arbitrary JavaScript code on the client's side, it can extract various details about the connecting client, such as:

  • JavaScript runtime details.
  • Hardware details and capabilities.
  • Operating system details.
  • Web browser details.

That's loads of data that can be used while calculating the trust score.

Fortunately, JavaScript takes time to execute and is prone to false positives, which limits the practical application of JavaScript fingerprinting. In other words, not many users can wait 3 seconds for a page to load or tolerate false positives.

For an in-depth look, refer to our article on JavaScript use in web scraping detection.

Bypassing the JavaScript fingerprinting is the most difficult task here. In theory, it's possible to reverse engineer and simulate all of the JavaScript tasks PerimeterX is performing and feed it fake results, but it's not practical.

A more practical approach is to use a real web browser for web scraping.
This can be done using browser automation libraries like Selenium, Puppeteer or Playwright that can start a real headless browser and navigate it for web scraping.

So, introducing browser automation to your scraping pipeline can drastically raise the trust score for bypassing PerimeterX.
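As a minimal sketch, here's how Playwright's sync API can load a PerimeterX-protected page in a real headless browser (Fiverr is used here only as an example target):

# pip install playwright && playwright install chromium
from playwright.sync_api import sync_playwright

with sync_playwright() as pw:
    browser = pw.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://www.fiverr.com/")
    # wait for network activity to settle so PerimeterX's JavaScript can run
    page.wait_for_load_state("networkidle")
    print(page.content()[:1000])
    browser.close()

Note that a bare headless browser can still be detected through automation leaks, which we'll address in the bypass section below.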

Tip: many advanced scraping tools can even combine browser and HTTP scraping capabilities for optimal performance: using resource-heavy browsers to establish a trust score, then continuing to scrape with fast HTTP clients like httpx in Python (this feature is also available using Scrapfly sessions).
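A rough sketch of such a hand-off, assuming the established trust is carried in cookies set during the browser session, could look like this:

import httpx
from playwright.sync_api import sync_playwright

# 1. use a real browser to pass the JavaScript challenges
with sync_playwright() as pw:
    browser = pw.chromium.launch(headless=True)
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://www.fiverr.com/")
    page.wait_for_load_state("networkidle")
    cookies = {c["name"]: c["value"] for c in context.cookies()}
    browser.close()

# 2. reuse the established session with a fast HTTP client
with httpx.Client(cookies=cookies, http2=True) as client:
    response = client.get("https://www.fiverr.com/")
    print(response.status_code)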

Behavior Analysis

Even when scrapers' initial connection is the same as a real web browser, PerimeterX can still detect them through behavior analysis using machine learning algorithms.

This is done by monitoring the connection and analyzing the behavior of the client, including:

  • Pages that are being visited: humans browse in more chaotic patterns.
  • Connection speed and rate: humans are slower and more random than bots.
  • Loading of resources like images, scripts, stylesheets etc.

This means that the trust score is not a constant number and will be constantly adjusted based on the requests' behavior.

So, it's important to distribute web scraper traffic through multiple agents using proxies and different fingerprint configurations to prevent behavior analysis. For example, if browser automation tools are used, a different browser configuration should be used for each agent: screen size, operating system, web browser version, IP address, etc.
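For example, with Playwright each scraping agent can run in its own browser context with a distinct fingerprint profile (the profiles below are hypothetical; real ones should stay internally consistent):

from playwright.sync_api import sync_playwright

# hypothetical agent profiles; match operating system, browser version,
# screen size and locale so each profile looks coherent
PROFILES = [
    {"viewport": {"width": 1920, "height": 1080}, "locale": "en-US"},
    {"viewport": {"width": 1366, "height": 768}, "locale": "en-GB"},
    {"viewport": {"width": 1440, "height": 900}, "locale": "de-DE"},
]

with sync_playwright() as pw:
    browser = pw.chromium.launch(headless=True)
    for profile in PROFILES:
        # each context is an isolated agent; a per-agent proxy can be added
        # with new_context(proxy={"server": "http://..."})
        context = browser.new_context(**profile)
        page = context.new_page()
        page.goto("https://www.fiverr.com/")
        context.close()
    browser.close()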

How to Bypass PerimeterX (aka Human) Bot Protection?

Now that we're familiar with all of the ways PerimeterX can detect web scrapers, let's see how to bypass it.

In reality, we have two very different options:

We could reverse engineer and fortify against all of these techniques, but PerimeterX is constantly updating its detection methods, making it a never-ending game of cat and mouse.

Alternatively, we can use real web browsers for scraping. This is the most practical and effective approach as it's much easier to ensure that the headless browser looks like a real one than to re-invent it.

However, many browser automation tools like Selenium, Playwright and Puppeteer leak data about their existence, which needs to be patched to achieve high trust scores. For that, see projects like the Puppeteer stealth plugin and other similar stealth extensions that patch known leaks.
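To give an idea of what such patches do, here's a single hand-rolled example in Playwright that hides the well-known navigator.webdriver leak before any page script runs (stealth plugins apply dozens of patches like this one):

from playwright.sync_api import sync_playwright

with sync_playwright() as pw:
    browser = pw.chromium.launch(headless=True)
    context = browser.new_context()
    # injected before every page's own scripts; hides one automation giveaway
    context.add_init_script(
        "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
    )
    page = context.new_page()
    page.goto("https://www.fiverr.com/")
    print(page.evaluate("navigator.webdriver"))  # None instead of True
    browser.close()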

For sustained web scraping with a PerimeterX bypass in 2024, these browsers should always be remixed with different fingerprint profiles: screen resolution, operating system and browser type all play an important role in PerimeterX's trust score.

Bypass with ScrapFly

While bypassing PerimeterX is possible, maintaining bypass strategies can be very time-consuming. This is where services like ScrapFly come in!

illustration of scrapfly's middleware

Using the ScrapFly web scraping API, we can hand over all of the web scraping complexity and bypass logic to an API!
Scrapfly is not only a PerimeterX bypasser but also offers many other web scraping features.

For example, to scrape pages protected by PerimeterX or any other anti-scraping service, when using the ScrapFly SDK all we need to do is enable the Anti Scraping Protection bypass feature:

from scrapfly import ScrapflyClient, ScrapeConfig

scrapfly = ScrapflyClient(key="YOUR API KEY")
result = scrapfly.scrape(ScrapeConfig(
    url="https://fiverr.com/",
    asp=True,
    # we can also enable headless browsers to render web apps and javascript powered pages
    render_js=True,
    # and set proxies by country like Japan
    country="JP",
    # and proxy type like residential:
    proxy_pool=ScrapeConfig.PUBLIC_RESIDENTIAL_POOL,
))
print(result.scrape_result)

FAQ

To wrap this article let's take a look at some frequently asked questions regarding web scraping PerimeterX pages:

Is it legal to scrape PerimeterX protected pages?

Yes. Web scraping publicly available data is perfectly legal around the world as long as the scrapers do not cause damage to the website.

Is it possible to bypass PerimeterX using cache services?

Yes, public page caching services like Google Cache or Archive.org can be used to access PerimeterX-protected pages, as Google and Archive.org tend to be whitelisted. However, since caching takes time, the cached page data is often outdated and not suitable for web scraping. Cached pages can also be missing parts of content that are loaded dynamically.
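For instance, the Wayback Machine exposes a public availability API that returns the latest snapshot of a page, which can then be fetched instead of the live site (a sketch, assuming a snapshot exists):

import httpx

# ask the Wayback Machine for the most recent snapshot of the target page
response = httpx.get(
    "https://archive.org/wayback/available",
    params={"url": "fiverr.com"},
)
snapshot = response.json().get("archived_snapshots", {}).get("closest")
if snapshot:
    print(snapshot["url"])  # URL of the cached copy, served without PerimeterX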

Is it possible to bypass PerimeterX entirely and scrape the website directly?

No. PerimeterX integrates directly with the server software, making it very difficult to reach the server without going through it. It is possible that some servers have PerimeterX misconfigured, but it's very unlikely.

What are some other anti-bot services?

There are many other anti-bot WAF services like Cloudflare, Akamai, Datadome, Imperva Incapsula and Kasada. They function very similarly to PerimeterX, so all the technical aspects covered in this tutorial can be applied to them as well.

Summary

In this article, we took a deep dive into how to bypass PerimeterX anti-bot system when web scraping.

To start, we've taken a look at how PerimeterX identifies web scrapers through TLS, IP and JavaScript client fingerprint analysis. We saw that using residential proxies and fingerprint-resistant libraries is a good start. Further, using real web browsers and remixing their browser fingerprinting data can make web scrapers much more difficult to detect.

Finally, we've taken a look at some frequently asked questions like alternative bypass methods and the legality of it all.

For an easier way to handle web scraper blocking and power up your web scrapers check out ScrapFly for free!

Related Posts

How to Know What Anti-Bot Service a Website is Using?

In this article we'll take a look at two popular tools: WhatWaf and Wafw00f which can identify what WAF service is used.

Use Curl Impersonate to scrape as Chrome or Firefox

Learn how to prevent TLS fingerprinting by impersonating normal web browser configurations. We'll start by explaining what Curl Impersonate is, how it works, and how to install and use it. Finally, we'll explore using it with Python to avoid web scraping blocking.

FlareSolverr Guide: Bypass Cloudflare While Scraping

In this article, we'll explore the FlareSolverr tool and how to use it to get around Cloudflare while scraping. We'll start by explaining what FlareSolverr is, how it works, how to install and use it. Let's get started!