Intro to Web Scraping using Selenium Grid

Scraping with headless browser tools like the Selenium library for Python is becoming an increasingly popular web scraping technique. Unfortunately, scaling up Selenium-powered scrapers can be a difficult challenge. This is where Selenium Grid for web scraping comes in - it can run multiple headless browsers that take turns processing scraping requests.

In this guide, we'll explain how to use Selenium Grid with Python for scraping at scale. We'll cover essential topics like how to install Selenium Grid using Docker, how to manage browser instances and how to scrape concurrently with Selenium Grid. Let's dig in!

What is Selenium Grid?

Selenium Grid is a server that runs multiple headless browser scripts on the same remote machine. It allows automating web browsers across different operating systems and versions in parallel.

The Selenium Grid server consists of two main components:

  • Hub
    A remote server that accepts incoming WebDriver requests as JSON and then routes them to the Nodes for execution.
  • Node
    A virtual device that executes commands based on routing instructions from the Hub. It consists of a specific web browser and version running on a particular operating system.
Illustration of the Selenium Grid architecture: the Hub manages the grid of Node workers

Selenium Grid allows for scaling headless browser execution by queuing sessions or running them in parallel on different Nodes, unlike the regular Selenium WebDriver, which runs headless browsers through a dedicated WebDriver on the local machine.
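
To illustrate the difference, here is a minimal sketch comparing a local WebDriver session with a remote session routed through the Grid Hub (assuming the Grid is running on localhost:4444, which we'll set up next):

from selenium import webdriver

# Regular Selenium: the browser is driven locally through a driver binary on this machine
local_driver = webdriver.Chrome(options=webdriver.ChromeOptions())
local_driver.quit()

# Selenium Grid: the browser runs on a remote Node, routed through the Hub
remote_driver = webdriver.Remote(
    command_executor="http://localhost:4444/wd/hub",  # Hub address from our Docker setup
    options=webdriver.ChromeOptions(),
)
remote_driver.quit()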

Before we jump into web scraping using Selenium Grid details, let's take a look at the installation process.

How to Install Selenium Grid Using Docker

When it comes to Selenium Grid setup, the easiest approach is to use Docker. If you don't have Docker installed, you can install it from the official Docker installation page.

To spin up the Selenium Grid server, we'll use the following docker-compose.yml file:

version: '3.8'

services:
  hub:
    image: selenium/hub:4.13.0
    ports:
      - 4442:4442
      - 4443:4443
      - 4444:4444
    environment:
      GRID_MAX_SESSION: 8     

  chrome_node_1:
    image: selenium/node-chrome:4.13.0
    depends_on:
      - hub
    environment:
      SE_EVENT_BUS_HOST: hub
      SE_EVENT_BUS_PUBLISH_PORT: 4442
      SE_EVENT_BUS_SUBSCRIBE_PORT: 4443
      SE_NODE_STEREOTYPE: "{\"browserName\":\"chrome\",\"browserVersion\":\"117\",\"platformName\": \"Windows 10\"}"

Here, we initialize the Hub and Node services and specify the Node's browser name, version and operating system. If you want to use a specific Hub or Node version, refer to the official Selenium page on Docker.

After adding the above docker-compose file, run it using the following Docker command:

docker-compose up --build

To ensure your installation is correct, go to the Selenium Grid URL at http://localhost:4444, which should display the Selenium Grid dashboard:

Selenium Grid UI in the browser

We only have a single Node with the browser specifications declared in the docker-compose file. We can also see a few additional variables; let's go over them:

  • Sessions: the number of headless browsers running on a specific Node.
  • Max concurrency: the number of sessions that can run in parallel in each Node, which is set to one by default.
  • Queue size: the number of sessions waiting in the queue. Since we don't have any sessions in the queue, it's set to zero.
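
These dashboard values can also be read programmatically through the Grid's /status endpoint, which returns the Hub and Node state as JSON. Here is a minimal sketch using the requests library (assuming the default localhost:4444 address; the exact JSON layout can vary slightly between Grid versions):

import requests

# Query the Grid status endpoint
status = requests.get("http://localhost:4444/status").json()

# True once the Hub and at least one Node are ready to accept sessions
print("Grid ready:", status["value"]["ready"])

# Print the session capacity of each registered Node
for node in status["value"]["nodes"]:
    print("Node max sessions:", node["maxSessions"])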

Now that we have the Selenium Grid Docker setup up and running, let's install a few Python libraries to start scraping through the Selenium Grid server:

  • Selenium: for interacting with the Selenium Grid server.
  • BeautifulSoup: for parsing the HTML and selecting elements.
pip install selenium bs4

Web Scraping With Selenium Grid

In this Selenium Grid web scraping guide, we'll scrape product data from the web-scraping.dev/products page:

The web scraping target website

There are several pages on this website and each page contains multiple products. We'll divide our Selenium Grid scraping code into two parts: product discovery and product scraping.

First, we'll crawl over product pages to scrape each product link. Then, we'll scrape all product links concurrently.

Here is how to use Selenium Grid to scrape product links:

from selenium import webdriver
from bs4 import BeautifulSoup

def get_driver():
    options = webdriver.ChromeOptions()
    # Disable sharing memory across the instances
    options.add_argument('--disable-dev-shm-usage')
    # Initialize a remote WebDriver
    driver = webdriver.Remote(
        command_executor="http://127.0.0.1:4444/wd/hub",
        options=options
    )
    return driver

def scrape_product_links():
    links = []
    driver = get_driver()
    # Iterate over product pages
    for page_number in range(1, 6):
        page_link = f"https://web-scraping.dev/products?page={page_number}"
        # Go to the page link
        driver.get(page_link)
        soup = BeautifulSoup(driver.page_source, "html.parser")
        # Iterate over product boxes
        for product_box in soup.select("div.row.product"):
            # Get the link of each product
            link = product_box.select_one("a").attrs["href"]
            links.append(link)
    # Close the remote browser session
    driver.quit()
    return links

links = scrape_product_links()
print(links)

Using the get_driver function, we send a request to the Selenium Grid Hub to initialize a remote headless browser. Then, we loop through all product pages and extract each product's link using CSS selectors.
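
Note that each webdriver.Remote() call occupies a Grid session until quit() is called, which is why the function above closes the driver before returning. When experimenting, wrapping the scraping logic in try/finally ensures the session is always released back to the Grid. A minimal sketch:

driver = get_driver()
try:
    driver.get("https://web-scraping.dev/products")
    html = driver.page_source
finally:
    # Always release the session back to the Grid, even if scraping fails
    driver.quit()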

Here are the links we got:

Output
[
    "https://web-scraping.dev/product/1",
    "https://web-scraping.dev/product/2",
    "https://web-scraping.dev/product/3",
    "https://web-scraping.dev/product/4",
    "https://web-scraping.dev/product/5",
    "https://web-scraping.dev/product/6",
    "https://web-scraping.dev/product/7",
    "https://web-scraping.dev/product/8",
    "https://web-scraping.dev/product/9",
    "https://web-scraping.dev/product/10",
    "https://web-scraping.dev/product/11",
    "https://web-scraping.dev/product/12",
    "https://web-scraping.dev/product/13",
    "https://web-scraping.dev/product/14",
    "https://web-scraping.dev/product/15",
    "https://web-scraping.dev/product/16",
    "https://web-scraping.dev/product/17",
    "https://web-scraping.dev/product/18",
    "https://web-scraping.dev/product/19",
    "https://web-scraping.dev/product/20",
    "https://web-scraping.dev/product/21",
    "https://web-scraping.dev/product/22",
    "https://web-scraping.dev/product/23",
    "https://web-scraping.dev/product/24",
    "https://web-scraping.dev/product/25",
]

We successfully scraped all product links using the remote headless browser. However, our Selenium Grid server isn't scalable yet, as it only allows running one headless browser at a time. For example, let's try to run two headless browsers in parallel:

from selenium import webdriver

def get_driver():
    options = webdriver.ChromeOptions()
    options.add_argument('--disable-dev-shm-usage')  
    driver = webdriver.Remote(
        command_executor="http://127.0.0.1:4444/wd/hub",
        options=options
    )
    return driver

# Initialize two headless browsers at the same time
driver1 = get_driver()
print("Driver 1 is running")
# The below code won't get executed
driver2 = get_driver()
print("Driver 2 is running")

driver1.quit()
print("Driver 1 is closed")
driver2.quit()
print("driver 2 is closed")

Here, we initialize two headless browsers at the same time. The driver1 will run while driver2 will wait in the queue:

selenium grid queue size display in web dashboard
note the queue size in the bottom left corner

The script waits for driver2 to be initialized before proceeding with the rest of the code. And since the max concurrency of our Node is set to one, driver2 blocks the script flow.

To solve this issue, we must configure Selenium Grid to spin up more Nodes or change the max concurrency in each Node. Let's see how!

Concurrency in Selenium Grid

Selenium Grid allows for customizing the number of Nodes and the maximum concurrency of each Node. To do that, we'll extend the docker-compose.yml file we created earlier with additional services for extra Nodes:

version: '3.8'

services:
  hub:
    image: selenium/hub:4.13.0
    ports:
      - 4442:4442
      - 4443:4443
      - 4444:4444
    environment:
      GRID_MAX_SESSION: 8     

  chrome_node_1:
    image: selenium/node-chrome:4.13.0
    depends_on:
      - hub
    environment:
      SE_EVENT_BUS_HOST: hub
      SE_EVENT_BUS_PUBLISH_PORT: 4442
      SE_EVENT_BUS_SUBSCRIBE_PORT: 4443
      SE_NODE_MAX_SESSIONS: 2
      SE_NODE_STEREOTYPE: "{\"browserName\":\"chrome\",\"browserVersion\":\"117\",\"platformName\": \"Windows 10\"}"

  chrome_node_2:
    image: selenium/node-chrome:4.13.0
    depends_on:
      - hub
    environment:
      SE_EVENT_BUS_HOST: hub
      SE_EVENT_BUS_PUBLISH_PORT: 4442
      SE_EVENT_BUS_SUBSCRIBE_PORT: 4443
      SE_NODE_MAX_SESSIONS: 2
      SE_NODE_STEREOTYPE: "{\"browserName\":\"chrome\",\"browserVersion\":\"117\",\"platformName\": \"macOS\"}"

We add a Chrome Node running on macOS and set the maximum concurrency variable to two concurrent sessions per Node. This means we can now scrape with four headless browsers in parallel!

To apply these changes, stop the docker containers and build them again using the following commands:

docker-compose down
docker-compose up --build

Head over to the Selenium Grid server in the browser to see the new changes:

Selenium Grid server with two Nodes
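
Since each Node advertises a stereotype (browser name, version and platform), we can also ask the Hub to route a session to a particular Node by requesting matching capabilities. Here is a minimal sketch that targets the macOS Node defined above (an assumption here is that the requested values match the Node's stereotype exactly; otherwise the session waits in the queue):

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--disable-dev-shm-usage")
# Request capabilities that match the macOS Node's stereotype
options.set_capability("platformName", "macOS")

driver = webdriver.Remote(
    command_executor="http://127.0.0.1:4444/wd/hub",
    options=options,
)
print(driver.capabilities.get("platformName"))
driver.quit()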

Now, let's try to run two headless browsers at the same time:

from selenium import webdriver

def get_driver():
    options = webdriver.ChromeOptions()
    options.add_argument('--disable-dev-shm-usage')  
    driver = webdriver.Remote(
        command_executor="http://127.0.0.1:4444/wd/hub",
        options=options
    )
    return driver

# Initialize two headless browsers at the same time
driver1 = get_driver()
print("Driver 1 is running")
# The below code will get executed
driver2 = get_driver()
print("Driver 2 is running")

driver1.quit()
print("Driver 1 is closed")
driver2.quit()
print("driver 2 is closed")

Each Node will run a headless browser:

Selenium Grid server with two headless browsers running at the same time

Now that we can run multiple headless browsers in parallel, let's use them for concurrent web scraping with Selenium Grid.

Web Scraping Concurrently with Selenium Grid

Selenium Grid can run multiple Node workers in parallel, meaning we can use multiple Nodes at the same time for concurrent web scraping. This can give our web scrapers a huge boost in speed and performance.

In this section, we'll take a look at how to use Selenium through Python threads to scrape with multiple Selenium Grid workers.

Web Scraping Speed: Processes, Threads and Async

For more on concurrency, parallelism and Python scraper details, see our beginner-friendly introduction to all of these concepts.

Now, let's implement concurrent web scraping with Selenium Grid. We'll split the links we got earlier into two batches and scrape them concurrently:

from concurrent.futures import ThreadPoolExecutor
from selenium import webdriver
from bs4 import BeautifulSoup
import json


def get_driver():
    options = webdriver.ChromeOptions()
    options.add_argument("--disable-dev-shm-usage")
    options.add_argument("--headless")
    driver = webdriver.Remote(
        command_executor="http://127.0.0.1:4444/wd/hub", options=options
    )
    return driver


def scrape_product_links():
    links = []
    driver = get_driver()
    for page_number in range(1, 6):
        page_link = f"https://web-scraping.dev/products?page={page_number}"
        driver.get(page_link)
        soup = BeautifulSoup(driver.page_source, "html.parser")
        for product_box in soup.select("div.row.product"):
            link = product_box.select_one("a").attrs["href"]
            links.append(link)
    # Close the headless browser instance
    driver.quit()
    return links


def scrape_product_data(links_array: list, data_array: list, driver_name: str):
    driver = get_driver()
    for link in links_array:
        # Print the current running driver
        print(driver_name, "is scraping product number", link.split("/")[-1])
        driver.get(link)
        soup = BeautifulSoup(driver.page_source, "html.parser")
        # Get the product data
        data_array.append(parse_product_html(soup))
    driver.quit()


def parse_product_html(soup):
    product_data = {
        "product": soup.select_one("h3.card-title").text,
        "price": soup.select_one("span.product-price").text,
        # Extract all image links and save it to an array
        "images": [image["src"] for image in soup.select("div.product-images > img")],
    }
    return product_data


if __name__ == "__main__":
    # Get all product links
    links = scrape_product_links()
    # Get the middle index to split the links array in half
    middle_index = len(links) // 2
    # List of jobs to get executed
    executors_list = []
    # An empty array to save the data
    data = []
    # Create a ThreadPoolExecutor with a maximum of 4 worker threads
    with ThreadPoolExecutor(max_workers=4) as executor:
        # Add the two concurrent tasks to scrape product data from different parts of the 'links' list
        executors_list.append(
            executor.submit(scrape_product_data, links[:middle_index], data, "driver1")
        )
        executors_list.append(
            executor.submit(scrape_product_data, links[middle_index:], data, "driver2")
        )

    # The with-block waits for all submitted tasks to finish;
    # calling result() re-raises any exceptions from the worker threads
    for task in executors_list:
        task.result()

    # Print the data in JSON format
    print(json.dumps(data, indent=4))

We use the scrape_product_links function created earlier to get the product links.
Next, we create a ThreadPoolExecutor with a maximum of four worker threads. Then, we submit two scrape_product_data tasks to the executor and append them to the executors list; each task scrapes one batch of links.
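
The same pattern scales beyond two batches. For example, the links could be split into as many chunks as the Grid can run in parallel (four sessions with our current configuration) and one task submitted per chunk. Here is a minimal sketch with a hypothetical chunk helper, reusing the links, data and scrape_product_data names from above:

from concurrent.futures import ThreadPoolExecutor

def chunk(items: list, n: int):
    # Split items into n roughly equal batches
    size = len(items) // n + (1 if len(items) % n else 0)
    return [items[i:i + size] for i in range(0, len(items), size)]

MAX_SESSIONS = 4  # total parallel sessions available on our Grid
with ThreadPoolExecutor(max_workers=MAX_SESSIONS) as executor:
    for i, batch in enumerate(chunk(links, MAX_SESSIONS), start=1):
        executor.submit(scrape_product_data, batch, data, f"driver{i}")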

Running the two-batch code above creates two headless browsers on Selenium Grid and executes the scraping logic concurrently:

driver1 is scraping product number 1
driver2 is scraping product number 13
driver1 is scraping product number 2
driver2 is scraping product number 14
driver1 is scraping product number 3
driver2 is scraping product number 15
driver1 is scraping product number 4

Here are the results we got:

Output
[
    {
        "product": "Box of Chocolate Candy",
        "price": "$9.99 ",
        "images": [
            "https://web-scraping.dev/assets/products/orange-chocolate-box-small-1.png",
            "https://web-scraping.dev/assets/products/orange-chocolate-box-small-2.png",
            "https://web-scraping.dev/assets/products/orange-chocolate-box-small-3.png",
            "https://web-scraping.dev/assets/products/orange-chocolate-box-small-4.png"
        ]
    },
    {
        "product": "Box of Chocolate Candy",
        "price": "$9.99 ",
        "images": [
            "https://web-scraping.dev/assets/products/orange-chocolate-box-small-1.png",
            "https://web-scraping.dev/assets/products/orange-chocolate-box-small-2.png",
            "https://web-scraping.dev/assets/products/orange-chocolate-box-small-3.png",
            "https://web-scraping.dev/assets/products/orange-chocolate-box-small-4.png"
        ]
    },
    {
        "product": "Dark Red Energy Potion",
        "price": "$4.99",
        "images": [
            "https://web-scraping.dev/assets/products/darkred-potion.png"
        ]
    },
    {
        "product": "Dark Red Energy Potion",
        "price": "$4.99",
        "images": [
            "https://web-scraping.dev/assets/products/darkred-potion.png"
        ]
    },
    {
        "product": "Teal Energy Potion",
        "price": "$4.99",
        "images": [
            "https://web-scraping.dev/assets/products/teal-potion.png"
        ]
    },
    {
        "product": "Teal Energy Potion",
        "price": "$4.99",
        "images": [
            "https://web-scraping.dev/assets/products/teal-potion.png"
        ]
    },
    {
        "product": "Red Energy Potion",
        "price": "$4.99",
        "images": [
            "https://web-scraping.dev/assets/products/red-potion.png"
        ]
    },
    {
        "product": "Red Energy Potion",
        "price": "$4.99",
        "images": [
            "https://web-scraping.dev/assets/products/red-potion.png"
        ]
    },
    {
        "product": "Blue Energy Potion",
        "price": "$4.99",
        "images": [
            "https://web-scraping.dev/assets/products/blue-potion.png"
        ]
    },
    {
        "product": "Blue Energy Potion",
        "price": "$4.99",
        "images": [
            "https://web-scraping.dev/assets/products/blue-potion.png"
        ]
    },
    {
        "product": "Dragon Energy Potion",
        "price": "$4.99",
        "images": [
            "https://web-scraping.dev/assets/products/dragon-potion.png"
        ]
    },
    {
        "product": "Dragon Energy Potion",
        "price": "$4.99",
        "images": [
            "https://web-scraping.dev/assets/products/dragon-potion.png"
        ]
    },
    {
        "product": "Hiking Boots for Outdoor Adventures",
        "price": "$89.99 ",
        "images": [
            "https://web-scraping.dev/assets/products/hiking-boots-1.png",
            "https://web-scraping.dev/assets/products/hiking-boots-2.png"
        ]
    },
    {
        "product": "Hiking Boots for Outdoor Adventures",
        "price": "$89.99 ",
        "images": [
            "https://web-scraping.dev/assets/products/hiking-boots-1.png",
            "https://web-scraping.dev/assets/products/hiking-boots-2.png"
        ]
    },
    {
        "product": "Women's High Heel Sandals",
        "price": "$59.99 ",
        "images": [
            "https://web-scraping.dev/assets/products/women-sandals-beige-1.png",
            "https://web-scraping.dev/assets/products/women-sandals-beige-2.png"
        ]
    },
    {
        "product": "Women's High Heel Sandals",
        "price": "$59.99 ",
        "images": [
            "https://web-scraping.dev/assets/products/women-sandals-beige-1.png",
            "https://web-scraping.dev/assets/products/women-sandals-beige-2.png"
        ]
    },
    {
        "product": "Running Shoes for Men",
        "price": "$49.99 ",
        "images": [
            "https://web-scraping.dev/assets/products/men-running-shoes.png"
        ]
    },
    {
        "product": "Running Shoes for Men",
        "price": "$49.99 ",
        "images": [
            "https://web-scraping.dev/assets/products/men-running-shoes.png"
        ]
    },
    {
        "product": "Kids' Light-Up Sneakers",
        "price": "$29.99 ",
        "images": [
            "https://web-scraping.dev/assets/products/kids-light-up-sneakers-red-1.png",
            "https://web-scraping.dev/assets/products/kids-light-up-sneakers-red-2.png",
            "https://web-scraping.dev/assets/products/kids-light-up-sneakers-red-3.png",
            "https://web-scraping.dev/assets/products/kids-light-up-sneakers-red-4.png"
        ]
    },
    {
        "product": "Kids' Light-Up Sneakers",
        "price": "$29.99 ",
        "images": [
            "https://web-scraping.dev/assets/products/kids-light-up-sneakers-red-1.png",
            "https://web-scraping.dev/assets/products/kids-light-up-sneakers-red-2.png",
            "https://web-scraping.dev/assets/products/kids-light-up-sneakers-red-3.png",
            "https://web-scraping.dev/assets/products/kids-light-up-sneakers-red-4.png"
        ]
    },
    {
        "product": "Classic Leather Sneakers",
        "price": "$79.99 ",
        "images": [
            "https://web-scraping.dev/assets/products/classic-leather-sneakers-black.png"
        ]
    },
    {
        "product": "Classic Leather Sneakers",
        "price": "$79.99 ",
        "images": [
            "https://web-scraping.dev/assets/products/classic-leather-sneakers-black.png"
        ]
    },
    {
        "product": "Cat-Ear Beanie",
        "price": "$14.99 ",
        "images": [
            "https://web-scraping.dev/assets/products/cat-ear-beanie-pink.png"
        ]
    },
    {
        "product": "Cat-Ear Beanie",
        "price": "$14.99 ",
        "images": [
            "https://web-scraping.dev/assets/products/cat-ear-beanie-pink.png"
        ]
    },
    {
        "product": "Box of Chocolate Candy",
        "price": "$9.99 ",
        "images": [
            "https://web-scraping.dev/assets/products/orange-chocolate-box-small-1.png",
            "https://web-scraping.dev/assets/products/orange-chocolate-box-small-2.png",
            "https://web-scraping.dev/assets/products/orange-chocolate-box-small-3.png",
            "https://web-scraping.dev/assets/products/orange-chocolate-box-small-4.png"
        ]
    }
]

We successfully scraped all products concurrently using Selenium Grid. However, scraping at scale can be complex and requires a lot of configuration. Let's explore a more efficient solution!

Web Scraping Concurrently With ScrapFly

ScrapFly is a web scraping API that allows for scraping at scale.

By using the ScrapFly concurrency feature, we can easily scale web scrapers by scraping multiple targets together:

from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
import asyncio
from bs4 import BeautifulSoup
import json

scrapfly = ScrapflyClient(key="Your API key")

# Scrape product links sequentially
def scrape_product_links():
    links = []
    for page_number in range(1, 6):
        page_link = f"https://web-scraping.dev/products?page={page_number}"
        api_response: ScrapeApiResponse = scrapfly.scrape(scrape_config=ScrapeConfig(url=page_link))
        soup = BeautifulSoup(api_response.scrape_result["content"], "html.parser")
        for product_box in soup.select("div.row.product"):
            link = product_box.select_one("a").attrs["href"]
            links.append(link)
    return links

# Parse product data from the HTML
def parse_product_html(soup):
    product_data = {
        "product": soup.select_one("h3.card-title").text,
        "price": soup.select_one("span.product-price").text,
        "images": [image['src'] for image in soup.select("div.product-images > img")]
    }
    return product_data

# Scrape all product data concurrently
async def scrape_product_data(links: list, data_array: list):
    # Add all links to the concurrent scraping target
    targets = [ScrapeConfig(url=url) for url in links]
    async for product_response in scrapfly.concurrent_scrape(scrape_configs=targets):
        page_content = product_response.content
        soup = BeautifulSoup(page_content, "html.parser")
        data_array.append(parse_product_html(soup))

# Empty array to store the data
data = []
links = scrape_product_links()
# Run the concurrent scraping function
asyncio.run(scrape_product_data(links, data))

# Print the result in JSON format
print(json.dumps(data, indent=4))

FAQ

To wrap up this guide on Selenium Grid for web scraping, let's take a look at some frequently asked questions.

Is Concurrent Web Scraping using Selenium Grid possible?

Yes. Selenium Grid runs all of its workers in parallel, though the Selenium client doesn't support asynchronous Python. However, using Python threads or subprocesses, we can run multiple scrapers concurrently. For more, see our guide on scraping using multiple processes.

What is the difference between Playwright, Selenium Web Driver and Selenium Grid?

Both Playwright and Selenium WebDriver are automation frameworks that allow for running headless browsers locally. Selenium Grid, on the other hand, is a server that complements Selenium WebDriver by allowing multiple headless browsers to run in parallel on a remote machine.

What are the limitations of Selenium Grid for Web Scraping?

Although Selenium Grid can be configured to scale web scrapers, it can't prevent headless browsers from getting blocked. For more information on this matter, refer to our previous article on scraping without getting blocked.

Selenium Grid vs Selenium WebDriver?

Selenium WebDriver is the tool that automates a single web browser instance, while Selenium Grid is a tool that orchestrates multiple Selenium WebDriver sessions in parallel.

Selenium Grid For Web Scraping Summary

In summary, Selenium Grid is a remote server that executes WebDriver commands, allowing for running various headless browser scripts in parallel.

In this guide, we've taken a look at a major web scraping problem - how to speed up Selenium web scrapers using the Selenium Grid service. We explored the Docker setup, Selenium Grid configuration and how to use it for concurrent web scraping with Python. Overall, Selenium Grid is a powerful tool for scaling Selenium, though using it concurrently from Python can be challenging.

Related Posts

How to Scrape With Headless Firefox

Discover how to use headless Firefox with Selenium, Playwright, and Puppeteer for web scraping, including practical examples for each library.

Selenium Wire Tutorial: Intercept Background Requests

In this guide, we'll explore web scraping with Selenium Wire. We'll define what it is, how to install it, and how to use it to inspect and manipulate background requests.

Web Scraping Dynamic Web Pages With Scrapy Selenium

Learn how to scrape dynamic web pages with Scrapy Selenium. You will also learn how to use Scrapy Selenium for common scraping use cases, such as waiting for elements, clicking buttons and scrolling.