How to Bypass Datadome Anti Scraping in 2024

Datadome is an anti-bot and anti-scraping service used by websites such as Leboncoin, Vinted, and Deezer to block non-human visitors.

In this article, we'll explain how to bypass Datadome anti-bot protection. We'll start by taking a quick look at what Datadome is, how to identify its blocking, and how it identifies web scrapers. Then, we'll have a look at existing techniques and tools for bypassing Datadome protected websites. Let's dive in!

What is Datadome?

Datadome is a paid WAF service that protects websites from automated requests. In the context of security, it's used to block malicious bots and scripts, which cause critical issues such as DDoS attacks and fraud.

In the context of web scraping, it's used to protect public data on websites, and it is particularly popular with European websites.

Datadome Block Page Examples

Most Datadome bot blocks result in HTTP status codes in the 400-500 range, with 403 being the most common. The error message can appear in different forms, but it usually asks the visitor to enable JavaScript or solve a Datadome CAPTCHA.

Datadome block page on Leboncoin website

These Datadome errors are mostly encountered on the first request to the website. However, Datadome also uses AI-based behavior analysis, so it can block a client even after several successful requests.
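As a rough illustration, a scraper can check each response for signs of a Datadome block before parsing it. The sketch below is a heuristic, not an official detection method: the `datadome` cookie name, the `x-datadome` header prefix, and the `captcha-delivery.com` CAPTCHA host are assumptions based on commonly observed block pages and may change.

```python
def looks_like_datadome_block(status_code, headers, body):
    """Heuristically decide whether a response is a Datadome block page."""
    # HTTP header names are case-insensitive, so normalize them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    # Blocks usually arrive with a 4xx/5xx status, most often 403.
    if status_code < 400:
        return False
    # Assumed markers (not guaranteed): a "datadome" cookie being set,
    # an "x-datadome*" response header, or a CAPTCHA iframe in the body.
    sets_cookie = "datadome=" in lowered.get("set-cookie", "").lower()
    has_header = any(name.startswith("x-datadome") for name in lowered)
    has_captcha = "captcha-delivery.com" in body.lower()
    return sets_cookie or has_header or has_captcha
```

A plain 404 without any of these markers is treated as an ordinary error, so the scraper can tell "page missing" apart from "scraper blocked".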

How does Datadome Detect Web Scrapers?

To identify web scrapers, Datadome employs various techniques to get an estimate on whether the connecting client is a bot or a real user.

fingerprint technologies used by Datadome

Datadome takes into consideration all the connection metrics, like encryption type (TLS), HTTP protocol used, and JavaScript engine to calculate a trust score.

Based on the final trust score, Datadome either lets the user in, blocks them, or requests a CAPTCHA challenge to be solved.

trust score evaluation flow of Datadome anti bot service
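To make the idea of a trust score concrete, here is a toy model of the decision flow. Datadome's real scoring model is not public, so the signal names, weights, and thresholds below are all hypothetical and only illustrate how several signals can combine into one of three outcomes.

```python
# Hypothetical weights: Datadome's actual scoring model is not public.
SIGNAL_WEIGHTS = {
    "tls_matches_browser": 25,
    "residential_or_mobile_ip": 25,
    "http2_with_browser_headers": 20,
    "javascript_fingerprint_ok": 20,
    "human_like_behavior": 10,
}


def trust_score(signals):
    """Sum the weights of every signal that the client passed."""
    return sum(weight for name, weight in SIGNAL_WEIGHTS.items() if signals.get(name))


def decide(score, allow_at=70, captcha_at=40):
    """Map a score to one of the three outcomes: allow, captcha, or block."""
    if score >= allow_at:
        return "allow"
    if score >= captcha_at:
        return "captcha"
    return "block"
```

In this toy model, a client that only passes the TLS and IP checks scores 50 and gets a CAPTCHA, which shows why fixing a single fingerprint is rarely enough.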

This complex process is done in real time, which makes web scraping difficult, as many factors can influence the trust score. However, by understanding each step of this process, we can build a Datadome bypass with a high success rate. Let's take a look at each step in detail.

TLS Fingerprinting

TLS (or SSL) is the first step in the HTTP connection. When using encrypted connections, like HTTPS instead of HTTP, both the server and client have to negotiate the encryption methods. With the availability of various encryption methods and ciphers, the negotiation process can reveal significant information about the client.

This is generally referred to as JA3 fingerprinting. Different operating systems, web browsers, or programming libraries perform the TLS encryption handshake uniquely, which results in different JA3 fingerprints.

Therefore, using a web scraping tool that's resistant to JA3 fingerprinting is essential for avoiding the Datadome CAPTCHA. To validate a request's JA3 fingerprint, you can use one of the available tools, such as ScrapFly's JA3 fingerprint web tool.
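For reference, a JA3 fingerprint is built from five fields of the TLS ClientHello (version, cipher suites, extensions, elliptic curves, and point formats), joined into a string and hashed with MD5. The sketch below shows only that final step; the numeric values in the usage example are illustrative and not taken from any particular browser.

```python
import hashlib


def ja3_string(tls_version, ciphers, extensions, curves, point_formats):
    """Join ClientHello fields as 'version,ciphers,extensions,curves,formats'."""
    fields = [ciphers, extensions, curves, point_formats]
    joined = ["-".join(str(value) for value in field) for field in fields]
    return ",".join([str(tls_version)] + joined)


def ja3_hash(ja3):
    """The JA3 fingerprint is the MD5 hex digest of the JA3 string."""
    return hashlib.md5(ja3.encode("ascii")).hexdigest()


# Illustrative values only: the same ciphers in a different order
# produce a different fingerprint.
browser_like = ja3_string(771, [4865, 4866, 4867], [0, 23, 65281], [29, 23, 24], [0])
library_like = ja3_string(771, [4867, 4866, 4865], [0, 23, 65281], [29, 23, 24], [0])
```

Because the order of ciphers and extensions is part of the string, an HTTP library that offers the same ciphers as a browser in a different order still ends up with a different JA3 hash.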

For further details on TLS fingerprinting, refer to our dedicated guide.

How TLS Fingerprint is Used to Block Web Scrapers?

Learn how TLS can leak the fact that the connecting client is a web scraper and how it can be used to establish a fingerprint that tracks the client across the web.

IP Address Fingerprinting

The next step of Datadome's trust score calculation process is the IP address analysis. Datadome has access to many different IP databases, which are used to look up the connecting client's IP address. This IP address lookup is used to identify the client's location, ISP, reputation, and other related information.

The most critical metric used here is the IP address type, as there are three different types of IP addresses:

  • Residential addresses are assigned by internet providers to home networks. So, residential IP addresses provide a positive trust score as these are mostly used by human users and are expensive to acquire.

  • Mobile addresses are assigned by mobile phone towers to mobile users. So, mobile IPs also provide a positive trust score as these are mostly used by humans. In addition, since mobile towers might share and recycle IP addresses, it is much more difficult to rely on IP addresses for identification.

  • Datacenter addresses are assigned to various data centers and server platforms like Amazon's AWS, Google Cloud, etc. So, datacenter IPs provide a significant negative trust score as they are likely to be used by scripts.

Using IP analysis, Datadome can roughly estimate how likely the connecting client is a human or a bot. For example, very few people browse the web from IPs owned by data centers. Moreover, Datadome can block a client if the requesting rate is extensive in a short time window.
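The IP type lookup can be sketched with Python's standard `ipaddress` module. Real anti-bot services rely on large commercial IP databases; the ranges below are reserved documentation networks used as placeholders, and the score deltas are hypothetical.

```python
import ipaddress

# Placeholder ranges taken from reserved documentation networks.
# A real lookup would use an IP intelligence database instead.
DATACENTER_RANGES = [ipaddress.ip_network("203.0.113.0/24")]
MOBILE_RANGES = [ipaddress.ip_network("198.51.100.0/24")]


def classify_ip(address):
    """Return 'datacenter', 'mobile' or 'residential' for an IP address."""
    ip = ipaddress.ip_address(address)
    if any(ip in network for network in DATACENTER_RANGES):
        return "datacenter"
    if any(ip in network for network in MOBILE_RANGES):
        return "mobile"
    # Simplification: anything not listed is treated as residential.
    return "residential"


def ip_trust_delta(address):
    """Hypothetical score adjustment per IP type."""
    return {"datacenter": -30, "mobile": 15, "residential": 15}[classify_ip(address)]
```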

So, rotate high-quality residential or mobile IP addresses to hide your IP address and bypass Datadome while scraping.
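A minimal sketch of proxy rotation might look like the following. The proxy URLs are hypothetical placeholders, and the `max_uses` limit is an assumed way to cap the request rate per IP address rather than a known Datadome threshold.

```python
import itertools


class ProxyRotator:
    """Hand out proxies round-robin, switching after `max_uses` requests."""

    def __init__(self, proxies, max_uses=5):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._cycle = itertools.cycle(proxies)
        self._max_uses = max_uses
        self._current = next(self._cycle)
        self._uses = 0

    def next_proxy(self):
        # Move to the next proxy once the current one has been used enough.
        if self._uses >= self._max_uses:
            self._current = next(self._cycle)
            self._uses = 0
        self._uses += 1
        return self._current


# Hypothetical residential proxy endpoints.
rotator = ProxyRotator(
    ["http://user:pass@proxy1.example.com:8000", "http://user:pass@proxy2.example.com:8000"],
    max_uses=2,
)
```

Each call to `rotator.next_proxy()` returns the proxy URL to use for the next request, so the same IP never sends more than `max_uses` requests in a row.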

How to Avoid Web Scraper IP Blocking?

Learn what Internet Protocol addresses are and how IP tracking technologies are used to block web scrapers.

HTTP Details

The next area an anti-bot system like Datadome inspects is the HTTP details. The HTTP protocol is becoming increasingly complex, making it easier to identify connections from web scrapers.

Most of the web operates through HTTP2 or HTTP3, while many web scraping libraries still use HTTP1.1. So, if a connecting client uses HTTP1.1, it's likely a bot. That being said, many modern libraries like Python's httpx and cURL support HTTP2, though it is not always enabled by default (httpx, for example, requires an explicit opt-in).

http2 fingerprint test page

HTTP2 is also susceptible to HTTP2 fingerprinting which can be used to identify web scrapers. See our http2 fingerprint test page for more info.

Request headers and header order also play an important role in identifying web scrapers. Since most web browsers have strict header value and order rules, any mismatch, like a missing Origin or User-Agent header, can leak the fact that the request sender is a bot.

Moreover, the default HTTP details of HTTP clients and browser automation tools can leak their usage, such as the default User-Agent of each client. Therefore, overriding these values using web scraping tools that hide their traces, such as Undetected ChromeDriver, Puppeteer-stealth, and Curl Impersonate, can help mimic the HTTP details of human users.

So, make sure to use HTTP2 and match header values and the order of a real web browser to increase the chances of bypassing a Datadome protected website.
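As a minimal sketch, HTTP2 and browser-like headers can be enabled in Python's httpx as shown below. The header values are illustrative and should be copied from a real browser session. Whether the exact header order is preserved on the wire depends on the HTTP library, so treat this as a starting point rather than a guaranteed match.

```python
def browser_like_headers(user_agent):
    """Build headers in an order resembling a browser navigation request."""
    # Python dicts keep insertion order. Accept-Encoding is left to the
    # HTTP client so that it only advertises encodings it can decode.
    return {
        "User-Agent": user_agent,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
    }


def make_http2_client(user_agent):
    """Create an httpx client with HTTP/2 explicitly enabled."""
    # Third-party dependency, imported lazily: pip install "httpx[http2]"
    import httpx

    return httpx.Client(http2=True, headers=browser_like_headers(user_agent))
```

After a request, `response.http_version` reports the negotiated protocol (for example `"HTTP/2"`), which is a quick way to confirm that the upgrade actually happened.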

For further details, refer to our dedicated guide on web scraping headers.

How Headers Are Used to Block Web Scrapers and How to Fix It

See our full introduction to request headers in the context of web scraper blocking: what common headers mean, how they are used in web scraper identification, and how we can fix this.

JavaScript Fingerprinting

Finally, the most complex and challenging step to address is JavaScript fingerprinting. Datadome uses the client's JavaScript engine to fingerprint the client machine for details like:

  • JavaScript runtime information.
  • Hardware and operating system details.
  • Web browser information and capabilities.

This comprehensive set of data feeds into the trust score calculation. Fortunately for us, JavaScript fingerprinting takes time to execute and is prone to false positives. In other words, it carries less weight than the other checks.

There are two ways to bypass DataDome CAPTCHA from JavaScript fingerprinting.

The obvious one is to inspect and reverse engineer all of the JavaScript code Datadome uses to fingerprint the client. This is a very time-consuming process and requires a lot of reverse-engineering knowledge. Moreover, it requires a lot of maintenance as Datadome is constantly updating its fingerprinting logic.

A more practical approach is to use an automated browser for web scraping. Browser automation libraries such as Selenium, Puppeteer, and Playwright run a real browser, which executes Datadome's fingerprinting code natively, making this the most practical way to bypass Datadome JavaScript fingerprinting.

Many advanced scraping tools can even combine browser and HTTP scraping capabilities for optimal performance: a resource-heavy browser is used to establish a trust score, and scraping then continues with a fast HTTP client like httpx in Python. This feature is also available through ScrapFly's sessions.
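One way to sketch this browser-to-HTTP handoff is to copy the cookies collected by the browser into the HTTP client. The helper below assumes cookies in the list-of-dicts shape returned by Playwright's `context.cookies()` (with `name`, `value`, and `domain` keys). The `datadome` cookie name in the usage note is an assumption about how the session is tracked.

```python
def cookies_for_domain(browser_cookies, domain):
    """Pick name/value pairs from browser cookies that apply to `domain`."""
    jar = {}
    for cookie in browser_cookies:
        # Cookie domains may start with a dot, meaning "and all subdomains".
        cookie_domain = cookie["domain"].lstrip(".")
        if domain == cookie_domain or domain.endswith("." + cookie_domain):
            jar[cookie["name"]] = cookie["value"]
    return jar
```

The resulting dictionary can be passed to an HTTP client, for example `httpx.Client(cookies=jar)`. The follow-up requests should probably keep the same IP address and User-Agent as the browser session, since a cookie such as `datadome` is likely tied to both.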

Behavior Analysis

Datadome uses machine learning algorithms to analyze connection patterns and user profiles. So, even with the above steps passed, Datadome can still block the client if it detects suspicious behavior.

This means the trust score is not a static number but is constantly adjusted based on the client's actions. Making the scraper mimic human behavior can therefore lead to a higher trust score.

So, it's important to distribute web scraper traffic through multiple different agents using proxies and different fingerprinting configurations to bypass Datadome. For example, when scraping using browser automation tools, it's important to use different browser profiles like screen size, operating systems, and rendering capabilities.

How to Bypass Datadome Anti Bot?

Now that we're familiar with all methods Datadome is using to detect web scrapers, let's see existing tools we can use to bypass it.

While bypassing Datadome at scale requires a lot of technical reverse-engineering knowledge, we can use existing tools to have a fair amount of success.

Start with Headless Browsers

To start, we know that Datadome is using JavaScript fingerprinting and challenges to detect web scrapers.

Reverse engineering challenges and solving them is really tough and requires a lot of time and knowledge. This is where headless browsers can help us.

Scraping using headless browsers is a common web scraping technique which uses tools like Selenium, Puppeteer or Playwright to automate a real browser without its GUI elements.

Headless browsers execute JavaScript challenges and Datadome's fingerprinting code just like a regular browser does. This can be enough to pass the anti-bot checks, and it saves us a lot of time otherwise spent trying to understand that JavaScript code.
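A minimal headless browser fetch with Playwright for Python could look like the sketch below. It assumes the `playwright` package and its Chromium build are installed. On its own, an unpatched headless browser may still be detected, so this only shows the basic mechanics.

```python
def fetch_rendered_html(url, timeout_ms=30_000):
    """Load a page in headless Chromium and return the rendered HTML."""
    # Third-party dependency, imported lazily:
    #   pip install playwright && playwright install chromium
    from playwright.sync_api import sync_playwright

    with sync_playwright() as pw:
        browser = pw.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # Wait until network activity settles so that JavaScript
            # challenges have a chance to complete.
            page.goto(url, wait_until="networkidle", timeout=timeout_ms)
            return page.content()
        finally:
            browser.close()
```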

Use High Quality Residential Proxies

As Datadome is using IP address analysis to determine the trust score, using high-quality residential or mobile proxies can help bypass the IP address fingerprinting.

Residential proxies are real IP addresses assigned by internet providers to retail individuals, making them look like real users.

Related: Intro to Proxies in Web Scraping

New to proxies? See our introduction to proxies in web scraping and how to evaluate them.

Web scraping APIs like Scrapfly already use high quality proxies by default as that's often the best way to bypass anti-scraping protection at scale.

Try undetected-chromedriver

To bypass HTTP, TLS and JavaScript fingerprinting, we can use a real headless web browser like Chrome or Firefox through the Selenium, Playwright or Puppeteer automation libraries. However, these browsers are easily detected by Datadome, as the libraries leak their usage through default fingerprints.

Headless browsers behave slightly differently from regular ones, and this is where the undetected-chromedriver community patch can be helpful, as it hardens these tools against identification.

undetected-chromedriver patches the Selenium chromedriver with various fixes that prevent headless browser identification by Datadome. This includes fixing the TLS, HTTP and JavaScript fingerprints.
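Basic usage can be sketched as follows, assuming the `undetected-chromedriver` package is installed and a local Chrome is available. The function name is our own, and success against Datadome is not guaranteed.

```python
def fetch_with_undetected_chrome(url):
    """Open a page through undetected-chromedriver and return its HTML."""
    # Third-party dependency, imported lazily:
    #   pip install undetected-chromedriver
    import undetected_chromedriver as uc

    # uc.Chrome() downloads and patches a matching chromedriver on first use.
    driver = uc.Chrome()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()
```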

Try Puppeteer Stealth Plugin

Puppeteer for NodeJS is another popular headless browser automation library, similar to Selenium, that can be used to bypass Datadome.

Just like undetected-chromedriver for Selenium, Puppeteer Stealth Plugin is a community project for Puppeteer that bypasses headless browser detection techniques. It patches the headless browser's features to make it more difficult to differentiate from a real web browser.

Try curl-impersonate

curl-impersonate is an HTTP client tool that extends the popular libcurl HTTP client to mimic the behavior of a real web browser. It patches the TLS and HTTP fingerprints to make outgoing HTTP requests look like they're coming from a real web browser. As a plain HTTP client, it does not execute JavaScript, so it cannot address JavaScript fingerprinting on its own.

However, this only works with curl-powered web scrapers, which can be difficult to use, especially compared to modern HTTP libraries like fetch or requests. For more on curl use in scraping, see our intro on how to scrape with curl.
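In Python, one option is the third-party `curl_cffi` package, a binding to curl-impersonate that exposes a requests-like interface. The sketch below assumes it is installed. The `chrome110` impersonation target is one of the profiles it has shipped, and the available names depend on the installed version.

```python
def fetch_as_browser(url, browser="chrome110"):
    """Fetch a URL while imitating a browser's TLS and HTTP2 fingerprints."""
    # Third-party dependency, imported lazily: pip install curl_cffi
    from curl_cffi import requests

    # `impersonate` selects which browser profile to mimic.
    response = requests.get(url, impersonate=browser)
    return response.status_code, response.text
```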

Try Warming Up Scrapers

To bypass behavior analysis, adjusting scraper behavior to appear more natural can significantly increase the Datadome trust score.

In real life, most human users don't visit product URLs directly. They often explore websites in steps like:

  • Start at homepage
  • Browse product categories or search
  • View individual product page

Prefixing scraping logic with this warmup behavior can make the scraper appear more natural and reduce the chance of being flagged by behavior analysis.
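The warmup steps above can be sketched as a small plan builder that returns the URLs to visit, each paired with a randomized pause. The paths and the 2 to 6 second delay range are hypothetical values chosen for illustration.

```python
import random
from urllib.parse import urljoin


def warmup_plan(base_url, category_path, product_path, rng=random):
    """Return (url, delay_seconds) steps: homepage, category, then product."""
    steps = ["/", category_path, product_path]
    # Randomized pauses avoid the fixed request cadence typical of scripts.
    return [(urljoin(base_url, path), round(rng.uniform(2.0, 6.0), 1)) for path in steps]
```

A scraper would then visit each URL in order, sleeping for the given delay between steps, before it starts collecting data from the product page.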

Rotate Real User Fingerprints

For sustained web scraping and Datadome bypass in 2024, headless browsers should be rotated through different, realistic fingerprint profiles:

  • screen resolution
  • operating system
  • browser type
  • installed browser extensions and plugins

All of these features play an important role in Datadome's trust score calculation.

Each headless browser library can be configured to use different resolution and rendering capabilities. Distributing scraping through multiple real-looking browser configurations can prevent Datadome from detecting the scraper.
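Profile rotation can be sketched as picking one complete, internally consistent profile per browser session. The two profiles below are hypothetical examples; in practice they would be collected from real browsers. The keys are chosen to match the keyword arguments of Playwright's `browser.new_context()`.

```python
import random

# Hypothetical profiles: each one keeps its User-Agent, screen size,
# locale and timezone consistent with each other.
PROFILES = [
    {
        "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        "viewport": {"width": 1920, "height": 1080},
        "locale": "fr-FR",
        "timezone_id": "Europe/Paris",
    },
    {
        "user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        "viewport": {"width": 1440, "height": 900},
        "locale": "en-GB",
        "timezone_id": "Europe/London",
    },
]


def pick_context_options(rng=random):
    """Pick a whole profile at random and return a copy of it."""
    profile = rng.choice(PROFILES)
    # Copy the nested viewport so callers cannot mutate the shared profile.
    return {**profile, "viewport": dict(profile["viewport"])}
```

With Playwright, the result can be applied as `browser.new_context(**pick_context_options())`. Choosing whole profiles, rather than mixing individual values, avoids contradictory combinations such as a macOS User-Agent with a Windows-only screen setup.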

For more, see ScrapFly's browser fingerprint tool to check what your browser looks like to Datadome. This tool can be used to collect different browser fingerprints from real web browsers, which can then be used in scraping.

Keep an Eye on New Tools

Open source web scraping is tough, as each newly discovered technique is quickly patched by anti-bot services like Datadome, resulting in a cat-and-mouse game.

For best results, tracking web scraping news sources and changes in popular GitHub repositories can help you stay ahead of the curve.

If all that seems like too much trouble let Scrapfly handle it for you! 👇

Bypass Datadome with ScrapFly

Bypassing Datadome anti-bot, while possible, is very difficult. Let Scrapfly do it for you!

scrapfly middleware

ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale. Each product is equipped with an automatic bypass for any anti-bot system and we achieve this by:

  • Maintaining a fleet of real, reinforced web browsers with real fingerprint profiles.
  • Millions of self-healing proxies of the highest possible trust score.
  • Constantly evolving and adapting to new anti-bot systems.
  • We've been doing this publicly since 2020 with the best bypass on the market!

It takes Scrapfly several full-time engineers to maintain this system, so you don't have to!

For example, to scrape pages protected by Datadome or any other anti-scraping service, when using ScrapFly SDK all we need to do is enable the Anti Scraping Protection bypass feature:

from scrapfly import ScrapflyClient, ScrapeConfig

scrapfly = ScrapflyClient(key="YOUR API KEY")
result = scrapfly.scrape(ScrapeConfig(
    url="https://www.leboncoin.fr/",
    asp=True,
    # we can also enable headless browsers to render web apps and javascript powered pages
    render_js=True,
    # and set proxies by country like France
    country="FR",
    # and proxy type like residential:
    proxy_pool=ScrapeConfig.PUBLIC_RESIDENTIAL_POOL,
))
print(result.scrape_result)

FAQ

To wrap up this article, let's take a look at some frequently asked questions regarding web scraping Datadome protected pages:

Is it legal to scrape Datadome protected pages?

Yes. Scraping publicly available data is generally legal in most jurisdictions, as long as the scraper does not cause damage to the website.

Is it possible to bypass Datadome using cache services?

Yes, public page caching services like Google Cache or Archive.org can be used to bypass Datadome protected pages, as Google and Archive.org tend to be whitelisted. However, not all pages are cached, and those that are cached are often outdated, making them unsuitable for web scraping. Cached pages can also be missing parts of the content that are loaded dynamically.
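For Archive.org, the snapshot and lookup URLs can be built as in the sketch below. It uses the Wayback Machine's public URL pattern and availability endpoint. The default timestamp is an arbitrary example, and the Wayback Machine redirects to the snapshot closest to it.

```python
from urllib.parse import quote


def wayback_url(page_url, timestamp="2024"):
    """URL of the archived snapshot closest to `timestamp` (YYYY[MMDD...])."""
    return f"https://web.archive.org/web/{timestamp}/{page_url}"


def availability_api_url(page_url):
    """URL of the JSON endpoint that reports whether a snapshot exists."""
    return "https://archive.org/wayback/available?url=" + quote(page_url, safe="")
```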

Is it possible to bypass Datadome entirely and scrape the website directly?

This is more of an internet security problem as that would be possible only by taking advantage of a vulnerability. This can be illegal in some countries and is often very difficult to do either way.

What are some other anti-bot services?

There are many other anti-bot WAF services like Cloudflare, Akamai, Imperva Incapsula, PerimeterX and Kasada. That being said, they function very similarly to Datadome so everything in this tutorial can be applied to them to prevent their bot detection.

Datadome Bypass Summary

In this article, we took a deep dive into Datadome anti-bot protection when web scraping. We went through the details of each technique DataDome uses to detect and block web scrapers and how to avoid them. In a nutshell, these are:

  • Using JA3 fingerprints that are similar to those of normal browsers.
  • Using Residential or Mobile proxies to hide the IP address.
  • Using HTTP details similar to normal browsers by managing headers and enabling HTTP2.
  • Using automated browsers to avoid JavaScript fingerprinting.

Finally, we covered some frequently asked questions, such as alternative bypass methods and the legality of bypassing anti-bot systems. For an easier way to handle web scraper blocking and power up your web scrapers, check out ScrapFly for free!
