Scraping using headless browser tools like the Selenium library for Python is becoming an increasingly popular web scraping technique. Unfortunately, scaling up Selenium-powered scrapers can be a difficult challenge. This is where Selenium Grid for web scraping comes in - it can run multiple headless browsers that take turns processing scraping requests.
In this guide, we'll explain how to use Selenium Grid with Python for scraping at scale. We'll cover essential topics like installing Selenium Grid using Docker, managing browser instances and concurrent web scraping using Selenium Grid. Let's dig in!
What is Selenium Grid?
Selenium Grid is a server that runs multiple headless browser scripts on the same remote machine. It allows automating web browsers across different operating systems, browsers and versions in parallel.
The Selenium Grid server consists of two main components:
Hub
A remote server that accepts incoming WebDriver requests as JSON and then routes them to the Nodes for execution.
Node
A virtual device that executes commands based on routing instructions from the Hub. It consists of a specific web browser and version running on a particular operating system.
the Hub manages the grid of Node workers
Selenium Grid allows for scaling headless browser execution by queuing sessions or running them in parallel on different Nodes. This is unlike the regular Selenium WebDriver, which runs headless browsers through a dedicated driver on the local machine.
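To illustrate the difference, here is a minimal sketch of both approaches, assuming a Grid Hub listening on the default localhost:4444 address:

from selenium import webdriver

# Regular Selenium WebDriver: a dedicated driver runs the browser on the local machine
local_driver = webdriver.Chrome()
local_driver.quit()

# Selenium Grid: the client sends WebDriver commands to a remote Hub,
# which routes them to one of its Nodes for execution
remote_driver = webdriver.Remote(
    command_executor="http://127.0.0.1:4444/wd/hub",
    options=webdriver.ChromeOptions(),
)
remote_driver.quit()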
Before we jump into the details of web scraping using Selenium Grid, let's take a look at the installation process.
How to Install Selenium Grid Using Docker
When it comes to Selenium Grid setup, the easiest approach is to use Docker. If you don't have Docker installed, you can get it from the official Docker installation page.
To spin up the Selenium Grid server, we'll use the following docker-compose.yml file:
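A minimal sketch of such a file, based on the official Selenium Docker images (the exact image tags are an assumption - pin them to the browser version and platform you need):

version: "3"
services:
  selenium-hub:
    image: selenium/hub:latest
    ports:
      - "4442:4442"
      - "4443:4443"
      - "4444:4444"

  chrome:
    image: selenium/node-chrome:latest
    shm_size: 2gb
    depends_on:
      - selenium-hub
    environment:
      # point the Node at the Hub's event bus
      - SE_EVENT_BUS_HOST=selenium-hub
      - SE_EVENT_BUS_PUBLISH_PORT=4442
      - SE_EVENT_BUS_SUBSCRIBE_PORT=4443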
Here, we initialize the Hub and Node services and specify the Node's browser name, version and operating system. If you want to use a specific Hub or Node version, refer to the official Selenium page on Docker.
After adding the above docker-compose file, run it using the following Docker command:
docker-compose up --build
To ensure your installation is correct, go to the Selenium Grid URL at localhost:4444, which should show the Selenium Grid dashboard:
Selenium Grid UI on the browser
We only have a single Node with the browser specifications declared in the docker-compose file. The dashboard also shows a few additional variables, so let's go over them (we'll also check these values programmatically right after the list):
Sessions: the number of headless browsers running on a specific Node.
Max concurrency: the number of sessions that can run in parallel in each Node, which is set to one by default.
Queue size: the number of sessions waiting in the queue. Since we don't have any sessions in the queue, it's set to zero.
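These dashboard values can also be queried programmatically through the Grid status endpoint. Here is a minimal sketch using Python's standard library, assuming the default Grid address (the exact JSON layout can vary between Grid versions):

import json
import urllib.request

# Query the Selenium Grid 4 status endpoint
with urllib.request.urlopen("http://127.0.0.1:4444/status") as response:
    status = json.loads(response.read())

# The response lists each Node along with its slots and running sessions
for node in status["value"]["nodes"]:
    print(node["id"], "- max sessions:", node["maxSessions"])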
Now that we have Selenium Grid running on Docker, let's install a few Python libraries to start scraping through the Selenium Grid server:
Selenium: for interacting with the Selenium Grid server.
BeautifulSoup: for parsing the HTML and selecting elements.
pip install selenium bs4
Web Scraping With Selenium Grid
In this Selenium Grid web scraping guide, we'll scrape product data from the web-scraping.dev/products page:
Web scraping target website
The website has several pages and each page contains multiple products. We'll divide our Selenium Grid scraping code into two parts: product discovery and product scraping.
First, we'll crawl over product pages to scrape each product link. Then, we'll scrape all product links concurrently.
Here is how to use Selenium Grid to scrape product links:
from selenium import webdriver
from bs4 import BeautifulSoup
def get_driver():
    options = webdriver.ChromeOptions()
    # Disable sharing memory across the instances
    options.add_argument('--disable-dev-shm-usage')
    # Initialize a remote WebDriver against the Grid Hub
    driver = webdriver.Remote(
        command_executor="http://127.0.0.1:4444/wd/hub",
        options=options
    )
    return driver

def scrape_product_links():
    links = []
    driver = get_driver()
    # Iterate over product pages
    for page_number in range(1, 6):
        page_link = f"https://web-scraping.dev/products?page={page_number}"
        # Go to the page link
        driver.get(page_link)
        soup = BeautifulSoup(driver.page_source, "html.parser")
        # Iterate over product boxes
        for product_box in soup.select("div.row.product"):
            # Get the link of each product
            link = product_box.select_one("a").attrs["href"]
            links.append(link)
    return links

links = scrape_product_links()
print(links)
Using the get_driver function, we send a request to the Selenium Grid Hub to initialize a remote headless browser. Then, we loop over all product pages and extract each product's link using a CSS selector.
We successfully scraped all product links using the remote headless browser. However, our Selenium Grid server isn't scalable yet, as it only allows running one headless browser at a time. For example, let's try to run two headless browsers in parallel:
from selenium import webdriver
def get_driver():
    options = webdriver.ChromeOptions()
    options.add_argument('--disable-dev-shm-usage')
    driver = webdriver.Remote(
        command_executor="http://127.0.0.1:4444/wd/hub",
        options=options
    )
    return driver
# Initialize two headless browsers at the same time
driver1 = get_driver()
print("Driver 1 is running")
# The below code won't get executed
driver2 = get_driver()
print("Driver 2 is running")
driver1.quit()
print("Driver 1 is closed")
driver2.quit()
print("driver 2 is closed")
Here, we initialize two headless browsers at the same time. driver1 will run while driver2 waits in the queue:
note the queue size in the bottom left corner
The script will wait for driver2 to get initialized before proceeding with the rest of the code. Since the max concurrency of our Node is set to one, driver2 blocks the script flow until driver1 quits.
To solve this issue, we must configure Selenium Grid to spin up more Nodes or change the max concurrency in each Node. Let's see how!
Concurrency in Selenium Grid
Selenium Grid allows for customizing the number of Nodes and the maximum concurrency of each Node. To do that, we'll extend the docker-compose.yml file we created earlier with additional services for the extra Nodes:
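A minimal sketch of what the extended file might look like, again based on the official Selenium Docker images (the service names and the SE_NODE_MAX_SESSIONS setting are assumptions, and the platform pinning of the second Node is omitted here):

version: "3"
services:
  selenium-hub:
    image: selenium/hub:latest
    ports:
      - "4442:4442"
      - "4443:4443"
      - "4444:4444"

  chrome-node-1:
    image: selenium/node-chrome:latest
    shm_size: 2gb
    depends_on:
      - selenium-hub
    environment:
      - SE_EVENT_BUS_HOST=selenium-hub
      - SE_EVENT_BUS_PUBLISH_PORT=4442
      - SE_EVENT_BUS_SUBSCRIBE_PORT=4443
      # allow two concurrent sessions on this Node
      - SE_NODE_MAX_SESSIONS=2
      - SE_NODE_OVERRIDE_MAX_SESSIONS=true

  chrome-node-2:
    image: selenium/node-chrome:latest
    shm_size: 2gb
    depends_on:
      - selenium-hub
    environment:
      - SE_EVENT_BUS_HOST=selenium-hub
      - SE_EVENT_BUS_PUBLISH_PORT=4442
      - SE_EVENT_BUS_SUBSCRIBE_PORT=4443
      # allow two concurrent sessions on this Node as well
      - SE_NODE_MAX_SESSIONS=2
      - SE_NODE_OVERRIDE_MAX_SESSIONS=true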
We add a second Chrome Node running on macOS and raise the maximum concurrency to two concurrent sessions per Node. This means we can now scrape with four headless browsers in parallel!
To apply these changes, stop the docker containers and build them again using the following commands:
docker-compose down
docker-compose up --build
Head over to the Selenium Grid server in the browser to see the new changes:
Selenium Grid server with two Nodes
Now, let's try to run two headless browsers at the same time:
from selenium import webdriver
def get_driver():
    options = webdriver.ChromeOptions()
    options.add_argument('--disable-dev-shm-usage')
    driver = webdriver.Remote(
        command_executor="http://127.0.0.1:4444/wd/hub",
        options=options
    )
    return driver
# Initialize two headless browsers at the same time
driver1 = get_driver()
print("Driver 1 is running")
# The below code will get executed
driver2 = get_driver()
print("Driver 2 is running")
driver1.quit()
print("Driver 1 is closed")
driver2.quit()
print("driver 2 is closed")
Each Node will run a headless browser:
Now that we can run multiple headless browsers in parallel, let's use them for concurrent web scraping with Selenium Grid.
Web Scraping Concurrently with Selenium Grid
Selenium Grid can run multiple Node workers in parallel, meaning we can use multiple Nodes at the same time for concurrent web scraping. This can give our web scrapers a huge boost in speed and performance.
In this section, we'll take a look at how to use Selenium through Python threads to scrape with multiple Selenium Grid workers.
Now, let's implement concurrent web scraping with Selenium Grid. We'll split the links we got earlier into two batches and scrape them concurrently:
from concurrent.futures import ThreadPoolExecutor
from selenium import webdriver
from bs4 import BeautifulSoup
import json
def get_driver():
    options = webdriver.ChromeOptions()
    options.add_argument("--disable-dev-shm-usage")
    options.add_argument("--headless")
    driver = webdriver.Remote(
        command_executor="http://127.0.0.1:4444/wd/hub", options=options
    )
    return driver

def scrape_product_links():
    links = []
    driver = get_driver()
    for page_number in range(1, 6):
        page_link = f"https://web-scraping.dev/products?page={page_number}"
        driver.get(page_link)
        soup = BeautifulSoup(driver.page_source, "html.parser")
        for product_box in soup.select("div.row.product"):
            link = product_box.select_one("a").attrs["href"]
            links.append(link)
    # Close the headless browser instance
    driver.quit()
    return links

def scrape_product_data(links_array: list, data_array: list, driver_name: str):
    driver = get_driver()
    for link in links_array:
        # Print the currently running driver
        print(driver_name, "is scraping product number", link.split("/")[-1])
        driver.get(link)
        soup = BeautifulSoup(driver.page_source, "html.parser")
        # Get the product data
        data_array.append(parse_product_html(soup))
    driver.quit()

def parse_product_html(soup):
    product_data = {
        "product": soup.select_one("h3.card-title").text,
        "price": soup.select_one("span.product-price").text,
        # Extract all image links and save them to an array
        "images": [image["src"] for image in soup.select("div.product-images > img")],
    }
    return product_data

if __name__ == "__main__":
    # Get all product links
    links = scrape_product_links()
    # Get the middle index to split the links array in half
    middle_index = len(links) // 2
    # List of submitted jobs
    executors_list = []
    # An empty array to save the data
    data = []
    # Create a ThreadPoolExecutor with a maximum of 4 worker threads
    with ThreadPoolExecutor(max_workers=4) as executor:
        # Submit two concurrent tasks to scrape product data from different halves of the 'links' list
        executors_list.append(
            executor.submit(scrape_product_data, links[:middle_index], data, "driver1")
        )
        executors_list.append(
            executor.submit(scrape_product_data, links[middle_index:], data, "driver2")
        )
        # Wait for all tasks to complete and surface any exceptions
        for future in executors_list:
            future.result()
    # Print the data in JSON format
    print(json.dumps(data, indent=4))
We use the scrape_product_links function created earlier to get product links.
Next, we create a ThreadPoolExecutor with a maximum of four worker threads. Then, we submit two scrape_product_data tasks and append them to the executors list; each task scrapes one batch of links.
The above code will create two headless browsers running on Selenium Grid and run the scraping logic concurrently:
driver1 is scraping product number 1
driver2 is scraping product number 13
driver1 is scraping product number 2
driver2 is scraping product number 14
driver1 is scraping product number 3
driver2 is scraping product number 15
driver1 is scraping product number 4
We successfully scraped all products concurrently using Selenium Grid. However, scraping at scale can be complex and requires lots of configuration. Let's explore a more efficient solution!
Web Scraping Concurrently With ScrapFly
ScrapFly is a web scraping API that allows for scraping at scale by providing:
Cloud headless browsers, allowing for scraping JavaScript-loaded content without running headless browsers yourself.
By using the ScrapFly concurrency feature, we can easily scale web scrapers by scraping multiple targets together:
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
import asyncio
from bs4 import BeautifulSoup
import json
scrapfly = ScrapflyClient(key="Your API key")
# Scrape product links sequentially
def scrape_product_links():
    links = []
    for page_number in range(1, 6):
        page_link = f"https://web-scraping.dev/products?page={page_number}"
        api_response: ScrapeApiResponse = scrapfly.scrape(scrape_config=ScrapeConfig(url=page_link))
        soup = BeautifulSoup(api_response.scrape_result["content"], "html.parser")
        for product_box in soup.select("div.row.product"):
            link = product_box.select_one("a").attrs["href"]
            links.append(link)
    return links

# Parse product data from the HTML
def parse_product_html(soup):
    product_data = {
        "product": soup.select_one("h3.card-title").text,
        "price": soup.select_one("span.product-price").text,
        "images": [image["src"] for image in soup.select("div.product-images > img")],
    }
    return product_data

# Scrape all product data concurrently
async def scrape_product_data(links: list, data_array: list):
    # Add all links to the concurrent scraping targets
    targets = [ScrapeConfig(url=url) for url in links]
    async for product_response in scrapfly.concurrent_scrape(scrape_configs=targets):
        page_content = product_response.content
        soup = BeautifulSoup(page_content, "html.parser")
        data_array.append(parse_product_html(soup))

# Empty array to store the data
data = []
links = scrape_product_links()
# Run the concurrent scraping function
asyncio.run(scrape_product_data(links, data))
# Print the result in JSON format
print(json.dumps(data, indent=4))
FAQ
To wrap up this guide on Selenium Grid for web scraping, let's take a look at some frequently asked questions.
Is Concurrent Web Scraping using Selenium Grid possible?
Yes, Selenium Grid runs all of its workers in parallel, though the Selenium client doesn't support asynchronous Python. However, using Python threads or subprocesses, we can run multiple scrapers concurrently. For more, see our guide on scraping using multiple processors.
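For example, here is a minimal sketch using Python processes against the Grid setup from this article (the example URLs and the two-process pool size are illustrative assumptions):

import multiprocessing
from selenium import webdriver

def scrape_title(url):
    # Each process opens its own remote session on the Grid
    driver = webdriver.Remote(
        command_executor="http://127.0.0.1:4444/wd/hub",
        options=webdriver.ChromeOptions(),
    )
    driver.get(url)
    title = driver.title
    driver.quit()
    return title

if __name__ == "__main__":
    urls = [
        "https://web-scraping.dev/product/1",
        "https://web-scraping.dev/product/2",
    ]
    # Two worker processes -> up to two concurrent Grid sessions
    with multiprocessing.Pool(processes=2) as pool:
        print(pool.map(scrape_title, urls))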
What is the difference between Playwright, Selenium Web Driver and Selenium Grid?
Both Playwright and Selenium are browser automation frameworks that allow for running headless browsers locally, while Selenium Grid is a server that complements Selenium WebDriver by allowing multiple headless browsers to run in parallel on a remote machine.
What are the limitations of Selenium Grid for Web Scraping?
Although Selenium Grid can be configured to scale web scrapers, it can't prevent headless browsers from getting blocked. For more information on this matter, refer to our previous article on scraping without getting blocked.
Selenium Grid vs Selenium WebDriver?
Selenium WebDriver is the tool that automates a single web browser instance, while Selenium Grid is a tool that orchestrates multiple Selenium WebDrivers in parallel.
Selenium Grid For Web Scraping Summary
In summary, Selenium Grid is a remote server that executes WebDriver commands, allowing for running various headless browser scripts in parallel.
In this guide, we've taken a look at a major web scraping problem - how to speed up Selenium web scrapers using the Selenium Grid service. We explored the Docker setup, Selenium Grid configuration and how to use it for concurrent web scraping with Python. Overall, Selenium Grid is a powerful tool for scaling Selenium, but it can be challenging to use concurrently in Python.