Web Scraping Without Blocking With Undetected ChromeDriver

by Mazen Ramadan Aug 22, 2024

#blocking #python #tools

Web Scraping Without Blocking With Undetected ChromeDriver

Web scraping blocking can happen for different reasons, requiring attention to various details. But what about about simple tools that can avoid web scraping blocking?

In this article, we'll explain the Undetected ChromeDriver and how to use it to avoid web scraping blocking. Let's dive in!

What is the Undetected ChromeDriver?

The Undetected ChromeDriver is a modified Web Driver for Selenium. It mimics regular browsers' behavior by various techniques, such as:

Changing Selenium's variable names to appear as normal web browsers.
Randomizing User-Agent strings.
Adding randomized delays between sending requests or executing actions.
Maintaining cookies and sessions correctly while browsing a website.
Simulating mouse clicks and moves, which makes browsing behavior appear natural.
Allowing for adding proxies, which prevents IP blocking and rate limiting.

The Undetected ChromeDriver uses the above techniques to avoid specific anti-scraping challenges, such as Cloudflare, Imperva and Datadome.

5 Tools to Scrape Without Blocking and How it All Works

Tutorial on how to avoid web scraper blocking. What is javascript and TLS (JA3) fingerprinting and what role request headers play in blocking.

Before proceeding to web scraping with Undetected ChromeDriver, let's install it.

Setup

In this Undetected ChromeDriver for web scraping guide, we'll use nowsecure.nl as our target website. Which uses Cloudflare to detect web scrapers. We'll use the Undetected ChromeDriver to bypass this protection. It can be installed using the pip terminal command:

pip install undetected-chromedriver

The above command will also install Selenium as it's used by the undetected-chromedriver under the hood.

How to Avoid Blocking using Undetected ChromeDriver?

Undetected ChromeDriver is a modified version of the web driver used by Selenium, which can avoid web scraping detection - let's take a look at it.

Comparing with Selenium

To confirm its capabilities, let's try our target website with standard Selenium code. For this, we'll use common selenium installation:

$ pip install selenium webdriver-manager

Then, let's scrape a secured web page that uses Cloudflare to detect web scrapers:

# pip install webdriver-manager
from webdriver_manager.chrome import ChromeDriverManager
from selenium import webdriver 
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.chrome.options import Options 
import time

# Add selenium option
options = Options()
options.headless = False

# Configure Selenium options and download the default web driver automatically
driver = webdriver.Chrome(options=options, service=ChromeService(ChromeDriverManager().install()))
# Maximize the browser widnows size
driver.maximize_window()

# Go the target website
driver.get("https://nowsecure.nl/")
# Wait for security check
time.sleep(4)
# Take screenshot
driver.save_screenshot('screenshot.png')
# Close the driver
driver.close()

Here, we intialize a Selenium headless browser with the default web driver that's installed using the web-driver-manager. Then, we send a request to the target website and take a screenshot. Here is what we got:

Selenium headless browser got detected — scraper being detected and blocked by Cloudflare

We can see that the website detected us and requested a Cloudflare challenge to solve before proceeding to the website.

Bypass with Undetected Chromedriver

Now, that we know what failure looks like let's bypass this challenge using the Undetected ChromeDriver:

import undetected_chromedriver as uc
import time

# Add the driver options
options = uc.ChromeOptions() 
options.headless = False

# Configure the undetected_chromedriver options
driver = uc.Chrome(options=options) 

with driver:
    # Go to the target website
    driver.get("https://nowsecure.nl/")
# Wait for security check
time.sleep(4)

# Take a screenshot
driver.save_screenshot('screenshot.png')
# Close the browsers
driver.quit()

We initialize an undetected_chromedriver object, go to the target website and take a screenshot. Here is the screenshot we got:

example undetected chromedriver result for nowsecure.nl tester — Driver passing Cloudflare detection

Shortcomings

We successfully avoided web scraping detection using the Undetected ChromeDriver for Cloudflare. However, it can still struggle against more advanced anti-bot systems.

For example, let's try bypass Opensea.io protection:

import undetected_chromedriver as uc
import time

# Add the driver options
options = uc.ChromeOptions() 
options.headless = False

# Configure that driver options
driver = uc.Chrome(options=options) 

with driver:
    # Go to the target website
    driver.get("https://opensea.io/")
# Wait for security check
time.sleep(4)

# Take a screenshot
driver.save_screenshot('opensea.png')
# Close the browsers
driver.quit()

In this example, OpenSea instantly detects and blocks our scraper:

opensea block page when undetected chromedriver is detected — OpenSea.io blocking undetected chromedriver scraper

There are a few more things we can do to empower our use Undetected ChromeDriver for web scraping. Let's take a look at them next!

How to Add Proxies to Undetected ChromeDriver?

Proxies are essential for avoiding IP blocking while scraping by splitting the traffic between multiple IP addresses. Here is how you can add proxies to the Undetected ChromeDriver:

import undetected_chromedriver as uc

# Add the driver options
options = uc.ChromeOptions() 
options.headless = False
# For proxies without authentication
options.add_argument(f'--proxy-server=https://proxy_ip:port')
# For proxies with authentication
options.add_argument(f'--proxy-server=https://proxy_username:proxy_password@proxy_ip:proxy_port')
# Configure that driver options
driver = uc.Chrome(options=options)

Although you can add proxies to the undetected chrome driver, there is no direct implementation for proxy rotation.

How to Rotate Proxies in Web Scraping

In this article we explore proxy rotation. How does it affect web scraping success and blocking rates and how can we smartly distribute our traffic through a pool of proxies for the best results.

We have seen that the Undetected ChromeDriver can't avoid advanced anti-scraping challenges and can't rotate proxies by itself. Let's look at a better solution!

Powering Up with ScrapFly

Undetected Chromedriver is a powerful tool though it doesn't solve all web scraping issues and this is where Scrapfly can fill in the rest!

ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale. Each product is equipped with an automatic bypass for any anti-bot system and we achieve this by:

Maintaining a fleet of real, reinforced web browsers with real fingerprint profiles.
Millions of self-healing proxies of the highest possible trust score.
Constantly evolving and adapting to new anti-bot systems.
We've been doing this publicly since 2020 with the best bypass on the market!

For example, by using the ScrapFly asp feature, we can easily scrape websites without getting blocked:

from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

scrapfly = ScrapflyClient(key="Your API key")

api_response: ScrapeApiResponse = scrapfly.scrape(
    scrape_config=ScrapeConfig(
        url="https://opensea.io",
        # Activate the JavaScript rendering feature to render images
        render_js=True,
        # Enable the asp to bypass anti scraping protections
        asp=True,
        # Take a screenshot
        screenshots={"opensea": "fullpage"},
    )
)

# Save the screenshot
scrapfly.save_screenshot(api_response=api_response, name="opensea")

# Use the built-in selector to parse the HTML
selector = api_response.selector

# Empty array to save the data into
data = []

# Loop through the trending NFT names
for nft_title in selector.xpath("//span[@data-id='TextBody']/div/div[1]/text()"):
    data.append(nft_title.get())

print(data)
# ['Akumu Dragonz', 'Nakamigos-CLOAKS', 'RTFKT x Nike Dunk Genesis CRYPTOKICKS', 'Skyborne - Genesis Immortals', 'Arbitrum Odyssey NFT', 'Parallel Alpha', 'Skyborne - Nexian Gems', 'Milady Maker', 'Otherside Vessels', 'BlockGames Dice']

Common Errors

Using Chromedriver with undetectable patches can introduce new errors. Here are some common errors and how to fix them.

ModuleNotFoundError: No module named 'undetected_chromedriver'

This error means something went wrong with unedetected_chromedriver installation. This usually happens when pip install installs the package to a different python environment. So, ensure that python and pip point to the same Python environment.

Alternatively, use package manager like poetry:

# create a project
$ mkdir my_project && cd my_project
$ poetry init --Dependency undetected_chromedriver
$ touch myscript.py   # create your python script
$ poetry run python myscript.py

This will ensure that undetected_chromedriver is installed in the same environment as your script.

This version of ChromeDriver only supports Chrome version X

Another common error is related to the browser version availability.

Message: unknown error: cannot connect to chrome at 127.0.0.1:33505
from session not created: This version of ChromeDriver only supports Chrome version 118
Current browser version is 117.0.5938.149

This error means available chrome browser on the current machine is too old for undetectable chrome driver to work. To fix this, simply update your Chrome browser to the newer version. Alternatively, if you have multiple chrome browser version available (such as beta) you can specify one using the browser_executable_path parameter:

import undetected_chromedriver as uc
driver = uc.Chrome(
    browser_executable_path="/usr/bin/google-chrome-beta"
)

FAQ

To wrap up this guide on Undetected ChromeDriver for web scraping, let's look at some frequently asked questions.

Are there alternatives to the Undetected ChromeDriver?

Yes, Puppeteer Stealth is a library that can bypass anti-scraping challenges similar to the Undetected ChromeDriver. However, it's based on Node.js and unavailable in Python yet.

How to avoid web scraping blocking?

There are multiple factors that lead to web scraping blocking including headers, IP addresses, security handshakes and javascript execution. To scrape without getting blocked, you must pay attention to these details. For more information, refer to our dedicated article on scraping without getting blocked.

Summary on Undetected Chromedriver use for Scraping

In this article, we explained how to use the Undetected ChromeDriverr to scrape without getting blocked. Which works by mimicking normal web browsers configurations to appear natural.

We have seen that the Undetected ChromeDriver can avoid specific anti-scraping challenges like Cloudflare. However, it can be detected by advanced anti-bot systems.

Web Scraping Without Blocking With Undetected ChromeDriver

What is the Undetected ChromeDriver?

5 Tools to Scrape Without Blocking and How it All Works

Setup

How to Avoid Blocking using Undetected ChromeDriver?

Comparing with Selenium

Bypass with Undetected Chromedriver

Shortcomings

How to Add Proxies to Undetected ChromeDriver?

How to Rotate Proxies in Web Scraping

Powering Up with ScrapFly

Common Errors

ModuleNotFoundError: No module named 'undetected_chromedriver'

This version of ChromeDriver only supports Chrome version X

FAQ

Are there alternatives to the Undetected ChromeDriver?

How to avoid web scraping blocking?

Summary on Undetected Chromedriver use for Scraping

Explore this Article with AI

Related Knowledgebase

What Python libraries support HTTP2?

What are some PhantomJS alternatives for automating browsers?

Python httpx vs requests vs aiohttp - key differences

How to scrape HTML table to Excel Spreadsheet (.xlsx)?

How to edit Local Storage data using browser Devtools

How to handle popup dialogs in Playwright?

How to use proxies with Python httpx?

How to scrape images from a website?

How to check if element exists in Playwright?

What are some ways to parse JSON datasets in Python?

How to select dictionary key recursively in Python?

How to use cURL in Python?

Related Articles

FlareSolverr Guide: Bypass Cloudflare While Scraping

How to Know What Anti-Bot Service a Website is Using?

Selenium Wire Tutorial: Intercept Background Requests

Intro to Parsing HTML and XML with Python and lxml

Use Curl Impersonate to scrape as Chrome or Firefox