How to Bypass PerimeterX when Web Scraping in 2024

PerimeterX is one of the most popular anti-bot services on the market, offering a wide range of protection against bots and scrapers. PerimeterX has different products, such as Bot Defender, Page Defender, and API Defender, which are all used to block web scrapers.

In this article, we'll explain how to bypass PerimeterX bot protection. We'll look at how PerimeterX detects scrapers and how to modify the scraper code to prevent being detected by PerimeterX.

We'll also cover common PerimeterX errors and signs that indicate PerimeterX detection and blocking. Let's dive in!

What is PerimeterX?

PerimeterX (now known as HUMAN) is a web service that protects websites, apps and APIs from automated scripts, such as web scrapers. It uses a combination of web technologies and behavior analysis to determine whether the request sender is a human or a bot.

It is used by popular websites like Zillow.com, Fiverr.com, and many others. So, by understanding how to bypass PerimeterX, we can open the door to scraping many popular websites.

Next, let's take a look at some popular PerimeterX errors.

Most PerimeterX bot blocks result in HTTP status codes in the 400-500 range, with 403 being the most common one. The response body usually contains a request to "enable JavaScript" or a "press and hold" button challenge.

PerimeterX block page on fiverr.com

This error is mostly encountered on the first request to the website. However, PerimeterX uses behavior analysis techniques, which allow it to block requests at any point during the web scraping process.
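As a quick sketch, scrapers can detect these blocks in code. This heuristic assumes the markers commonly seen on PerimeterX block pages (a px-captcha element and "Press & Hold" text); exact markers vary per website:

def is_perimeterx_blocked(response) -> bool:
    """Heuristic check for a PerimeterX block response."""
    # PerimeterX blocks usually return 403 and the block page
    # mentions "Press & Hold" or contains a px-captcha element
    if response.status_code in (403, 429):
        body = response.text.lower()
        return "px-captcha" in body or "press & hold" in body
    return False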

Let's take a look at how exactly PerimeterX detects bots and web scrapers, resulting in the "press and hold" CAPTCHA challenge.

How Does PerimeterX Detect Web Scrapers?

To detect web scraping attempts, PerimeterX uses many different technologies to estimate whether the traffic comes from a human user or a bot.

fingerprint technologies used by PerimeterX

PerimeterX uses a combination of fingerprinting and connection analysis to calculate a trust score for each client. This score determines whether the user can access the website or not.

Based on the final trust score, the user is either allowed to access the website or blocked with a PerimeterX block page, which can further be bypassed by solving JavaScript challenges (i.e. the "press and hold" button).

trust score evaluation flow of PerimeterX anti bot service

This process makes it challenging to scrape PerimeterX-protected websites, as many factors play a role in the detection process. However, if we study each factor individually, we can see that bypassing PerimeterX is very much possible!

TLS Fingerprinting

TLS (or SSL) is the first step in establishing an HTTP connection. It is used to encrypt the data sent between the client and the server during secure HTTPS connections.

The first step of this process is called the TLS handshake, where the client and the server negotiate how encryption will be done. This is where TLS fingerprinting comes into play: since different computers, programs and even programming libraries have different TLS capabilities, each client produces a unique fingerprint.

So, if a scraper uses a library with different TLS capabilities than a regular web browser, it can be identified quite easily. This is generally referred to as the JA3 fingerprint.

For example, some libraries and tools used in web scraping have unique TLS negotiation patterns that can be recognized instantly. On the other hand, clients that use the same TLS techniques as a regular web browser are much harder to identify as bots.

To inspect a client's TLS fingerprint, see ScrapFly's JA3 fingerprint web tool, which extracts the JA3 fingerprint of the connecting client.

So, use web scraping libraries and tools that are resistant to JA3 fingerprinting.
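For example, here's a minimal sketch using the curl_cffi library, one of several clients that can impersonate a web browser's TLS handshake (available impersonation targets vary by version):

# pip install curl_cffi
from curl_cffi import requests

# impersonate="chrome" makes curl_cffi negotiate TLS like a recent
# Chrome build, producing a browser-like JA3 fingerprint
response = requests.get("https://www.fiverr.com/", impersonate="chrome")
print(response.status_code)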

For more, see our full introduction to TLS fingerprinting, which covers the topic in greater detail.

IP Address Fingerprinting

The next step is IP address analysis. Since IP addresses come in many different types, they can reveal a lot of information about the client, which can be used to determine whether the client is a human or a bot.

There are different types of IP addresses:

  • Residential, home addresses assigned by internet providers to retail individuals. So, residential IP addresses provide a positive trust score as these are mostly used by humans and are expensive to acquire.
  • Mobile, addresses assigned by mobile phone towers to mobile users. Mobile IPs also provide a positive trust score as these are mostly used by humans. In addition, since mobile towers might share and recycle IP addresses, it makes it more difficult for anti-bots to rely on IP addresses for bot identification.
  • Datacenter, addresses assigned to various data centers and server platforms like Amazon's AWS, Google Cloud etc. Datacenter IPs provide a significant negative trust score, as they are likely to be used by bots and scripts.

Using IP analysis, PerimeterX can estimate how likely it is that the connecting client is a human, as most humans browse from residential or mobile IPs while datacenter addresses are mostly used by automated scripts.

So, use high-quality residential or mobile proxies to increase the chances of bypassing PerimeterX.
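For example, here's a sketch of routing scraper traffic through a residential proxy with httpx. The proxy URL is a hypothetical placeholder for your provider's credentials (recent httpx versions take a proxy argument; older ones use proxies):

# pip install httpx
import httpx

# hypothetical residential proxy URL - replace with real provider details
proxy = "http://username:password@residential.example-proxy.com:8080"

with httpx.Client(proxy=proxy) as client:
    response = client.get("https://www.fiverr.com/")
    print(response.status_code)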

For a more in-depth look, see our full introduction to IP blocking.

HTTP Details

The next step is the HTTP connection itself. This includes HTTP connection details like:

  • Protocol Version
    Most of the web is running on HTTP2, yet many web scraping tools still use HTTP1.1, which is a dead giveaway of a bot. Many newer HTTP client libraries like httpx or cURL support HTTP2, but it's often not enabled by default.
    HTTP2 can also be susceptible to fingerprinting, so check ScrapFly's http2 fingerprint test page for more info.
  • Headers
    Pay attention to X- prefixed headers and the usual suspects like User-Agent, Origin and Referer as they can be used to identify web scrapers.
  • Header Order
    Web browsers have a specific way of ordering request headers. So, if the headers are not ordered in the same way as a web browser, the request can be identified as coming from a bot. Moreover, some HTTP libraries, such as requests in Python, do not respect the header order and can be easily identified.

So, to bypass PerimeterX while scraping, make sure the headers in web scraper requests match those of a real web browser, including their ordering.
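For example, here's a sketch of an httpx client with HTTP/2 enabled and browser-like headers. The header set and ordering approximate Chrome and should be adjusted to the browser version being mimicked:

# pip install "httpx[http2]"
import httpx

# approximate Chrome headers, in the order a real browser sends them
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
}

# http2=True is required as HTTP/2 is not enabled by default in httpx
with httpx.Client(http2=True, headers=headers) as client:
    response = client.get("https://www.fiverr.com/")
    print(response.http_version)  # e.g. "HTTP/2"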

For more, see our full introduction to the role of request headers in blocking.

Javascript Fingerprinting

Finally, the most powerful tool in PerimeterX's arsenal is JavaScript fingerprinting.

Since websites can execute arbitrary JavaScript code in the client's browser, they can extract a lot of information about the connecting user, such as:

  • JavaScript runtime details
  • Hardware details and capabilities
  • Operating system details
  • Web browser details

That's loads of data that can be used while calculating the trust score.

Fortunately, JavaScript takes time to execute and is prone to false positives, which limits how much JavaScript fingerprinting can be applied in practice. In other words, few users will wait 3 seconds for a page to load or tolerate frequent false positives.

For an in-depth look, refer to our article on JavaScript use in web scraping detection.

Bypassing JavaScript fingerprinting is the most difficult task here. In theory, it's possible to reverse engineer and simulate all of the JavaScript tasks PerimeterX performs and feed it fake results, but it's not practical.

A more practical approach is to use a real web browser for web scraping.
This can be done using browser automation libraries like Selenium, Puppeteer or Playwright that can start a real headless browser and navigate it for web scraping.

So, introducing browser automation to your scraping pipeline can drastically raise the trust score for bypassing PerimeterX.
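For example, here's a minimal Playwright sketch that loads a page in a real Chromium browser, letting PerimeterX's JavaScript run just as it would for a normal visitor:

# pip install playwright && playwright install chromium
from playwright.sync_api import sync_playwright

with sync_playwright() as pw:
    # a real browser executes PerimeterX's JavaScript challenges natively
    browser = pw.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://www.fiverr.com/")
    print(page.title())
    browser.close()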

Tip: many advanced scraping tools can even combine browser and HTTP scraping capabilities for optimal performance: use a resource-heavy browser to establish a trust score, then continue scraping with a fast HTTP client like httpx in Python (this feature is also available through Scrapfly sessions).

Behavior Analysis

Even when a scraper's initial connection looks identical to a real web browser's, PerimeterX can still detect it through behavior analysis.

This is done by monitoring the connection and analyzing the behavior of the client, including:

  • Pages that are being visited: humans browse in more chaotic patterns than bots.
  • Connection speed and rate: humans are slower and less regular than bots.
  • Loading of resources like images, scripts and stylesheets: real browsers fetch them, while bare HTTP scrapers often skip them.

This means that the trust score is not a constant number and will be constantly adjusted based on the requests' behavior.

So, to evade behavior analysis, it's important to distribute web scraper traffic through multiple agents using proxies and different fingerprint configurations.
For example, if browser automation tools are used, each agent should get its own browser configuration: screen size, operating system, web browser version, IP address etc.
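Here's a sketch of rotating fingerprint profiles across Playwright browser contexts. The profile values are illustrative placeholders and should be extended with realistic, internally consistent combinations:

import random
from playwright.sync_api import sync_playwright

# hypothetical fingerprint profiles - extend with realistic values
profiles = [
    {"viewport": {"width": 1920, "height": 1080}, "locale": "en-US"},
    {"viewport": {"width": 1366, "height": 768}, "locale": "en-GB"},
    {"viewport": {"width": 1440, "height": 900}, "locale": "de-DE"},
]

with sync_playwright() as pw:
    browser = pw.chromium.launch(headless=True)
    # each context acts as a separate agent with its own fingerprint
    context = browser.new_context(**random.choice(profiles))
    page = context.new_page()
    page.goto("https://www.fiverr.com/")
    browser.close()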

How to Bypass PerimeterX (aka HUMAN) Bot Protection?

Now that we're familiar with all of the ways PerimeterX can detect web scrapers, let's see how to bypass it.

In reality, we have two very different options:

We could reverse engineer and fortify against all of these techniques, but PerimeterX is constantly updating its detection methods, making this a never-ending game of cat and mouse.

Alternatively, we can use real web browsers for scraping. This is the most practical and effective approach as it's much easier to ensure that the headless browser looks like a real one than to re-invent it.

However, many browser automation tools like Selenium, Playwright and Puppeteer leak details about their existence, which need to be patched to achieve high trust scores. For that, see projects like the Puppeteer stealth plugin and other similar stealth extensions that patch known leaks.
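For example, here's a sketch using the community playwright-stealth package, a Python port of the Puppeteer stealth patches (its API may differ between versions):

# pip install playwright playwright-stealth
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync

with sync_playwright() as pw:
    browser = pw.chromium.launch(headless=True)
    page = browser.new_page()
    # patch known headless leaks (navigator.webdriver and friends)
    stealth_sync(page)
    page.goto("https://www.fiverr.com/")
    browser.close()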

For sustained web scraping with PerimeterX bypass in 2024, these browsers should always be remixed with different fingerprint profiles: screen resolution, operating system, and browser type all play an important role in PerimeterX's trust score.

Bypass with ScrapFly

While bypassing PerimeterX is possible, maintaining bypass strategies can be very time-consuming. This is where services like ScrapFly come in!

illustration of scrapfly's middleware

Using the ScrapFly web scraping API, we can hand over all of the web scraping complexity and bypass logic to the API!
Scrapfly is not only a PerimeterX bypasser but also offers many other web scraping features.

For example, to scrape pages protected by PerimeterX or any other anti-scraping service using the ScrapFly SDK, all we need to do is enable the Anti Scraping Protection bypass feature:

from scrapfly import ScrapflyClient, ScrapeConfig

scrapfly = ScrapflyClient(key="YOUR API KEY")
result = scrapfly.scrape(ScrapeConfig(
    url="https://fiverr.com/",
    asp=True,
    # we can also enable headless browsers to render web apps and javascript powered pages
    render_js=True,
    # and set proxies by country like Japan
    country="JP",
    # and proxy type like residential:
    proxy_pool=ScrapeConfig.PUBLIC_RESIDENTIAL_POOL,
))
print(result.scrape_result)

FAQ

To wrap this article let's take a look at some frequently asked questions regarding web scraping PerimeterX pages:

Is it legal to scrape PerimeterX protected pages?

Yes. Web scraping publicly available data is generally legal around the world, as long as the scraper does not cause damage to the website.

Is it possible to bypass PerimeterX using cache services?

Yes, public page caching services like Google Cache or Archive.org can sometimes be used to bypass PerimeterX protected pages, as Google and Archive.org tend to be whitelisted. However, since caching takes time, cached page data is often outdated and not suitable for web scraping. Cached pages can also be missing parts of content that are loaded dynamically.
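For example, here's a sketch that checks Archive.org's Wayback Machine for an existing snapshot through its public availability API:

# pip install httpx
import httpx

# the Wayback Machine availability endpoint returns the closest snapshot
response = httpx.get(
    "https://archive.org/wayback/available",
    params={"url": "https://www.fiverr.com/"},
)
print(response.json().get("archived_snapshots"))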

Is it possible to bypass PerimeterX entirely and scrape the website directly?

No. PerimeterX integrates directly with the website's server software, making it very difficult to reach the server without going through it. It is possible that some servers have PerimeterX misconfigured, but it's very unlikely.

What are some other anti-bot services?

There are many other anti-bot WAF services like Cloudflare, Akamai, Datadome, Imperva Incapsula and Kasada. However, they function very similarly to PerimeterX, so all the technical aspects covered in this tutorial can be applied to them as well.

Summary

In this article, we took a deep dive into bypassing PerimeterX anti-bot systems when web scraping.

To start, we've taken a look at how PerimeterX identifies web scrapers through TLS, IP and JavaScript fingerprint analysis. We saw that using residential proxies and fingerprint-resistant libraries is a good start. Further, using real web browsers and remixing their fingerprint data can make web scrapers much more difficult to detect.

Finally, we've taken a look at some frequently asked questions like alternative bypass methods and the legality of it all.

For an easier way to handle web scraper blocking and power up your web scrapers check out ScrapFly for free!

Related Posts

Use Curl Impersonate to scrape as Chrome or Firefox

Learn how to prevent TLS fingerprinting by impersonating normal web browser configurations. We'll start by explaining what Curl Impersonate is, how it works, and how to install and use it. Finally, we'll explore using it with Python to avoid web scraping blocking.

FlareSolverr Guide: Bypass Cloudflare While Scraping

In this article, we'll explore the FlareSolverr tool and how to use it to get around Cloudflare while scraping. We'll start by explaining what FlareSolverr is, how it works, how to install and use it. Let's get started!

How to Bypass CAPTCHA While Web Scraping in 2024

Captchas can ruin web scrapers but we don't have to teach our robots how to solve them - we can just get around it all!