How to Use Headless Browser Chrome Extensions for Web Scraping

Headless browser automation is an essential component of web scraping, allowing for various actions, such as scrolling, clicking buttons, and submitting forms. But what about extending the headless browser's capabilities with Chrome extensions?

In this article, we'll explain how to install Chrome extensions with various headless browser libraries, such as Selenium, Playwright and Puppeteer. We'll also explore different useful Chrome extensions for web scraping. Let's dive in!

Do Headless Browsers Support Chrome Extensions?

Google Chrome extensions are built on top of the Chromium engine, which is also used by other popular web browsers such as Microsoft Edge. Most headless browser automation libraries, such as Selenium, support the Google Chrome WebDriver. Therefore, it's possible to use Chrome extensions with headless browser libraries.

By default, Chrome extensions can't run in headless mode, as they require a few layers found only in the actual browser GUI. However, the experimental option --headless=new (or --headless=chrome) can enable extensions in headless mode. This option is still new and can be unstable.
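As a rough sketch of how these flags fit together, the helper below assembles the Chrome command-line arguments for loading an unpacked extension in the experimental headless mode. The function name build_chrome_args and the ./AdBlock folder are assumptions made for illustration. The resulting flags could then be handed to a browser automation library as launch arguments.

```python
import os


def build_chrome_args(extension_dir, headless=True):
    """Assemble Chrome flags for loading an unpacked extension.

    `extension_dir` is a hypothetical path to an extracted extension folder.
    """
    path = os.path.abspath(extension_dir)
    args = [
        f"--disable-extensions-except={path}",
        f"--load-extension={path}",
    ]
    if headless:
        # experimental headless mode that can keep extension support
        args.insert(0, "--headless=new")
    return args


print(build_chrome_args("./AdBlock"))
```

Since this headless mode is experimental, it's worth confirming on the chrome://extensions/ page that the extension was actually loaded.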

Why Use Browser Extensions While Web Scraping?

There are numerous extensions available for Google Chrome that add features and functionality to the web browser. These extensions can be used at different stages of the web scraping process, from establishing the HTTP connection to parsing the HTML. To make this concrete, let's have a look at a few useful Chrome extensions for web scraping.

Useful Chrome Extensions For Web Scraping

User-Agent Switcher

The User-Agent header is one of the most important HTTP headers to get right while web scraping. It describes the browser sending the request, including its name, version and operating system. However, the default User-Agent of headless browser libraries is widely known, so many websites recognize it and block the request. The User-Agent Switcher extension can help avoid this blocking by rotating the User-Agent header.
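To illustrate what rotating the header means, here is a minimal sketch that picks a random User-Agent and formats it as Chrome's --user-agent launch flag. It is not how the extension works internally. The USER_AGENTS pool and the random_user_agent_flag name are assumptions for illustration, and the example strings should be replaced with ones matching current browser releases.

```python
import random

# example User-Agent strings (assumed values, keep them up to date in practice)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
]


def random_user_agent_flag(pool=USER_AGENTS):
    """Pick a random User-Agent and format it as a Chrome launch flag."""
    return f"--user-agent={random.choice(pool)}"


print(random_user_agent_flag())
```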

Referer Control

The Referer header contains the URL of the previously visited page, which antibot systems analyze to determine whether the request is automated. The browser sets this header automatically during natural navigation. However, it's often left unset when a scraper requests specific pages and URLs directly. The Referer Control extension can be used to define rules that set this header automatically for each website.
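As a small illustration of such a rule, the sketch below derives a plausible Referer value (the site's root URL) from the page being requested. The site_root_referer name and the example.com URL are hypothetical. The value could, for instance, be passed to Playwright's page.goto through its referer parameter.

```python
from urllib.parse import urlsplit


def site_root_referer(url):
    """Return the scheme and host of `url`, usable as a plausible Referer value."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}/"


# hypothetical target page
print(site_root_referer("https://example.com/products/1?page=2"))
```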

Canvas Fingerprint Defender

Websites can use the HTML Canvas to render images on the web page using JavaScript. This image rendering can be used to create a Canvas fingerprint, which reflects various details about the operating system and browser. Since users browse the web through different operating systems and web browsers, this fingerprint tends to be unique. However, the Canvas fingerprint of headless browsers is common, which allows websites to detect the usage of headless browsers. The Canvas Fingerprint Defender extension alters the browser details and feeds a fake fingerprint, which helps prevent the headless browser from being detected.

We have only scratched the surface of Chrome extensions that are useful for web scraping.

Setup

We'll install Chrome extensions with three libraries: Selenium, Playwright and Puppeteer. Selenium and Playwright can be installed in Python using the following pip command:

pip install selenium webdriver-manager playwright

The above command will install Selenium and webdriver-manager, which we'll use later to install the ChromeDriver required by Selenium. As for the browser binaries required by Playwright, they can be installed using the following command:

playwright install

Puppeteer can be installed in Node.js using npm or yarn:

npm install puppeteer
# or
yarn add puppeteer

For this article, we'll use the popular AdBlock extension with each headless browser library. To start, we need the extension package in the crx format. It can be downloaded in different ways, such as the crxextractor tool or the Get CRX Chrome extension. After downloading the crx file, move it to the directory where the code will be executed.

The downloaded crx file only works with Selenium. For Playwright and Puppeteer, we need to extract the file to get its source code. To do that, rename the file so that it ends with .zip and then extract the zip file content.
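The extraction can also be scripted. The sketch below assumes that the crx file is a zip archive with an extra header in front of it, which Python's zipfile module can usually read directly without renaming. The extract_crx name is hypothetical.

```python
import zipfile


def extract_crx(crx_path, output_dir):
    """Extract a packed .crx extension into `output_dir`.

    zipfile locates the archive from the end of the file, so the
    leading crx header is skipped without any manual renaming.
    """
    with zipfile.ZipFile(crx_path) as archive:
        archive.extractall(output_dir)
    return output_dir
```

For the file used in this article, the call would look like extract_crx("./AdBlock.crx", "./AdBlock").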

Now that we have the extension in both formats, the crx file and the extracted folder, let's install it with the headless browser libraries!

Chrome Extensions in Selenium

To load Chrome extensions in Selenium, we can use either the unpacked extension folder or the packed crx file:

import os
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager

# get the absolute path of the extension folder
path_to_extension = os.path.abspath("./AdBlock")

options = webdriver.ChromeOptions()
options.add_extension("./AdBlock.crx") # load the extension using the crx file
options.add_argument(f"--load-extension={path_to_extension}") # load the extension using the unpacked folder
# install the ChromeDriver binary automatically and apply the configured options
driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=options)
# go to the Chrome extensions page
driver.get("chrome://extensions/")
# take a screenshot
driver.save_screenshot('selenium_chrome_extensions.png')
# close the browser
driver.quit()

Here, we load the Chrome extension into the Selenium WebDriver twice, in two different ways. The first one uses the packed extension through the add_extension method. The other one uses the extracted extension through the add_argument method. Next, we use webdriver_manager to install the Chrome web driver and apply the previously created options. Finally, we head over to the Chrome extensions page and take a screenshot. Here is the screenshot we got:

Screenshot: Chrome extensions on Selenium headless browser

Our Selenium Chrome extensions installation is a success. However, the extensions can require additional configurations for optimal usage, especially when web scraping. Let's automate the extension configurations from our code:

import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

def update_extension_options():
    """click option buttons"""
    # CSS selectors of the option checkboxes
    elements = ["#acceptable_ads_privacy", "#enable_twitch_channel_allowlist",
                "#enable_data_collection_v2"]
    for element in elements:
        driver.find_element(By.CSS_SELECTOR, element).click()

options = webdriver.ChromeOptions()
options.add_extension("./AdBlock.crx") # load the extension using the crx file
# install the ChromeDriver binary automatically and apply the configured options
driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=options)
# wait for the extension to finish installing
time.sleep(10)
# go to the extension options
driver.get("chrome-extension://gighmmpiobklfepjocnamgkkbiglidom/options.html#general")
# update the extension options
update_extension_options()
# take a screenshot
driver.save_screenshot('selenium_chrome_extensions.png')
# close the browser
driver.quit()

The above code is almost the same as the previous snippet except for a few changes. We define an update_extension_options function, which iterates over a few option checkboxes and clicks each one using CSS selectors. Then, we wait for the extension to finish installing, head over to the extension options page and call the function to update the options. Here are the updated options:
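The fixed 10-second wait is simple but either wastes time or falls short on slow machines. One possible alternative is a small polling helper that returns as soon as a condition holds. This is a sketch rather than part of any library, and the wait_until name and its parameters are assumptions.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.5):
    """Poll `predicate` until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

With Selenium, the predicate could be something like lambda: driver.find_elements(By.CSS_SELECTOR, "#acceptable_ads_privacy"), which returns an empty list (falsy) until the checkbox is present on the options page.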

Screenshot: Updated extension options

We have successfully installed a Chrome extension with Selenium and updated its options. Although we have covered a simple extension, this process can be applied to other extensions.

For more details on Selenium, refer to our previous article.

Web Scraping with Selenium and Python Tutorial + Example Project

Dive deep into Selenium and its common functions used in web scraping dynamic pages and web applications.

Chrome Extensions in Playwright

Playwright Chrome extensions can only be added using the unpacked extension folder:

import os
from playwright.sync_api import sync_playwright

path_to_extension = os.path.abspath("./AdBlock")
user_data_dir = "./"

with sync_playwright() as playwright:
    # initialize a persistent Chrome instance
    context = playwright.chromium.launch_persistent_context(
        # user data directory argument is required
        user_data_dir,
        # run the browser in headful mode
        headless=False,
        # add the Chrome extension
        args=[
            f"--disable-extensions-except={path_to_extension}",
            f"--load-extension={path_to_extension}",
        ],
    )
    # initialize a new browser page
    page = context.new_page()
    # go to the Chrome extensions page
    page.goto("chrome://extensions/")
    # take a screenshot
    page.screenshot(path="playwright_chrome_extensions.png")
    # close the browser context
    context.close()

Here, we initialize a sync Playwright instance and launch a persistent browser context in headful mode. Next, we add the Chrome extension to the browser using the load-extension argument, while disable-extensions-except keeps all other extensions disabled. Finally, we request the Chrome extensions page and take a screenshot.

The next step is to update the extension options like we did before:

import os
import time
from playwright.sync_api import sync_playwright

path_to_extension = os.path.abspath("./AdBlock")
user_data_dir = "./"

def update_extension_options(page):
    """click option buttons"""
    # CSS selectors of the option checkboxes
    elements = ["#acceptable_ads_privacy", "#enable_twitch_channel_allowlist",
                "#enable_data_collection_v2"]
    for element in elements:
        page.click(element)

with sync_playwright() as playwright:
    # initialize a persistent Chrome instance
    context = playwright.chromium.launch_persistent_context(
        # user data directory argument is required
        user_data_dir,
        # run the browser in headful mode
        headless=False,
        # add the Chrome extension
        args=[
            f"--disable-extensions-except={path_to_extension}",
            f"--load-extension={path_to_extension}",
        ],
    )
    # initialize a new browser page
    page = context.new_page()
    # wait for the extension to finish installing
    time.sleep(10)
    # go to the extension options
    page.goto("chrome-extension://agmafgjokfhoibnnnppdbpkddlibhmme/options.html#general")
    # update the extension options
    update_extension_options(page)
    # take a screenshot
    page.screenshot(path="playwright_chrome_extensions.png")
    # close the browser context
    context.close()

Here, we add additional functionality to our code. We wait for 10 seconds for the extension to finish the installation and then head over to the extension options page. Finally, we update the extension options using selectors and take a screenshot.
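Note that the extension ID in the options URL is hard-coded and differs from the one used in the Selenium example, so it may need adjusting on your machine. As a hedged sketch, the helper below pulls the ID out of any chrome-extension:// URL. In Playwright, such a URL could come from the extension's background page or service worker (context.background_pages or context.service_workers). The extension_id_from_url name is hypothetical.

```python
from urllib.parse import urlsplit


def extension_id_from_url(url):
    """Return the ID part of a chrome-extension:// URL, or None for other URLs."""
    parts = urlsplit(url)
    if parts.scheme != "chrome-extension":
        return None
    return parts.netloc or None


# URL taken from the example above
url = "chrome-extension://agmafgjokfhoibnnnppdbpkddlibhmme/options.html#general"
print(extension_id_from_url(url))
```

The returned ID can then be used to build the options page URL instead of hard-coding it.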

The above script uses the Python version of Playwright. However, the same approach applies to Playwright clients in other languages, as the underlying API is the same.

For more details about Playwright, refer to our previous article.

Web Scraping with Playwright and Python

Learn how Playwright works and compares to its alternatives. How to execute common tasks like browser navigation, button clicking, text input and data parsing

Chrome Extensions in Puppeteer

Similar to Playwright, Puppeteer Chrome extensions can be added using the unpacked extension folder:

const puppeteer = require('puppeteer');
const path = require('path');

// get the absolute extension path
const pathToExtension = path.join(process.cwd(), 'AdBlock');

async function run(){
    // launch a browser instance
    const browser = await puppeteer.launch({
        // launch the browser in the headful mode
        headless: false,
        // add the extension to the browser
        args: [
        `--disable-extensions-except=${pathToExtension}`,
        `--load-extension=${pathToExtension}`,
        ],
    });
    // open a new browser tab
    let page = await browser.newPage();
    // go to the Chrome extensions page
    await page.goto('chrome://extensions/');
    // take a screenshot
    await page.screenshot({ path: `./puppeteer_chrome_extensions.png` });
    // close the tab and browser
    await page.close();
    await browser.close();
};

run();

In the above code, we launch a browser and add the extension using the load-extension argument. Then, we go to the Chrome extensions page and take a screenshot.

The next step is to modify the extension options as we did before:

const puppeteer = require('puppeteer');
const path = require('path');
const { setTimeout } = require('timers/promises');

// get the absolute extension path
const pathToExtension = path.join(process.cwd(), 'AdBlock');

async function updateExtensionOptions(page){
    // CSS selectors of the option checkboxes
    const elements = ["#acceptable_ads_privacy", "#enable_twitch_channel_allowlist",
                "#enable_data_collection_v2"];
    for (const element of elements) {
        await page.click(element);
    }
}

async function run(){
    // launch a browser instance
    const browser = await puppeteer.launch({
        // launch the browser in the headful mode
        headless: false,
        // add the extension to the browser
        args: [
        `--disable-extensions-except=${pathToExtension}`,
        `--load-extension=${pathToExtension}`,
        ],
    });
    // wait for the extension to finish installing
    await setTimeout(10000);
    // open a new browser tab
    let page = await browser.newPage();
    // go to the extension options
    await page.goto("chrome-extension://agmafgjokfhoibnnnppdbpkddlibhmme/options.html#general");
    // update the extension options
    await updateExtensionOptions(page);
    // take a screenshot
    await page.screenshot({ path: `./puppeteer_chrome_extensions.png` });
    // close the tab and browser
    await page.close();
    await browser.close();
};

run();

Here, we define an updateExtensionOptions function, which iterates over option elements to update them. Next, we call this function after requesting the extension options page, take a screenshot and close the browser.

For more details about Puppeteer, refer to our previous article.

Web Scraping With a Headless Browser: Puppeteer

Learn about the Puppeteer API overview and how to use it for web scraping features. You will also learn about the common challenges associated with Puppeteer and how to overcome them.

FAQ

To wrap up this guide on headless browser Chrome extensions, let's have a look at some frequently asked questions.

How to install Chrome extensions in Selenium Python?

To install Chrome extensions with Selenium, you can use either the packed extension file in the crx format or the extracted extension folder. The packed file is loaded into the browser with the add_extension method, while the extracted folder is loaded with the load-extension argument.

Can I run browser extensions in the headless mode?

At the time of writing, Chromium browsers don't support extensions in headless mode by default, but the experimental option --headless=new (or --headless=chrome) can enable extensions in headless mode.

How to get Chrome extension files for headless browsers?

To get the Chrome extension source, you can either use third-party tools, such as crxextractor and the Get CRX extension, or copy the files manually from your operating system's file system.

Headless Browser Extensions Summary

Browser extensions are add-on tools that extend the capabilities of the web browser. We have seen that Chrome extensions can support web scraping in several ways, such as avoiding blocking and assisting in HTML parsing.

We have also gone through a step-by-step guide on installing Chrome extensions with different headless browser libraries, such as Selenium, Playwright and Puppeteer.
