Web Scraping with Selenium and Python

The modern web is becoming increasingly complex and reliant on JavaScript, which makes web scraping difficult even for small tasks. Usually, web scrapers in Python do not execute JavaScript and the related web browser workflows, which puts some targets out of reach. In other words: you might see the content in your web browser, but your scraper can't.

For this reason, browser automation is frequently used in web scraping to utilize the browser's rendering power and access all of the content. We've already briefly covered the 3 available tools (Playwright, Puppeteer and Selenium) in our overview article Scraping Dynamic Websites Using Browser Automation, and in this one we'll dig a bit deeper into understanding Selenium.

In this tutorial we'll take a look at what Selenium is and its common functions used in web scraping, and finally we'll finish off with an example of scraping the content of https://twitch.tv.

What is Selenium?

Selenium was initially created as a tool to test a website's behavior, but it quickly grew into a general web browser automation tool used in task automation and web scraping.

This tool is quite widespread and is capable of automating multiple browsers like Chrome, Firefox, Opera and even Internet Explorer through a middleware protocol called WebDriver.

WebDriver is the first browser automation protocol designed by the W3C organization, and it's essentially a middleware protocol service that sits between the client and the browser, translating client commands into web browser actions.

Selenium's webdriver translates our Python client's commands into something a web browser can understand

Currently, it's one of two available protocols for web browser automation (the other being Chrome DevTools Protocol), and while it's an older protocol it's still capable and perfectly viable for web scraping - let's take a look at what it can do!
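To make this middleware role concrete, here's a minimal sketch of the raw WebDriver protocol calls a client like Selenium makes on our behalf. This assumes a chromedriver service is already running locally on its default port 9515:

import requests

# assumption: a chromedriver service is already running on its default port
WEBDRIVER_URL = "http://localhost:9515"

# start a browser session by sending the desired capabilities
session = requests.post(
    f"{WEBDRIVER_URL}/session",
    json={"capabilities": {"alwaysMatch": {"browserName": "chrome"}}},
).json()
session_id = session["value"]["sessionId"]

# navigate the browser - this is roughly what driver.get() translates to
requests.post(
    f"{WEBDRIVER_URL}/session/{session_id}/url",
    json={"url": "https://www.twitch.tv/directory/game/Art"},
)

Selenium's Python client wraps HTTP calls like these into convenient methods, which is what we'll be using for the rest of this article.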

For installation instructions, see the official Selenium installation instructions

When it comes to web scraping, we essentially need a few basic functionalities: navigating to web pages, waiting for elements to load, and clicking buttons or scrolling the page. Occasionally, we might need more advanced functionality, such as text input or keyboard presses. Both the basic and the advanced functions are easily accessible in Selenium - let's take a look!

Quick sneak peek of what we'll be doing: when developing with Selenium, we can use an interactive shell like ipython to test our workflow in real time!

Let's start off with the basics: navigation and conditions
For our example project we'll be scraping current streams from the https://twitch.tv art section, where users stream their art creation process. We'll be collecting data like the stream name, viewer count and author.

Our current task is to:

  1. Start a Chrome web browser
  2. Go to https://www.twitch.tv/directory/game/Art
  3. Wait for the page to finish loading
  4. Pick up the HTML content of the current browser instance
  5. Parse data from the HTML content

First, let's dig into starting our browser:

from selenium import webdriver

# start a visible Chrome browser instance (requires a chromedriver executable on PATH)
driver = webdriver.Chrome()
driver.get("https://www.twitch.tv/directory/game/Art")

If we run this script, we'll see a browser window open up and take us to our URL. However, when web scraping we often don't want our screen taken up by all the GUI elements. For this we can use something called headless mode, which strips the browser of all GUI elements and lets it run silently in the background. In Selenium, we can enable it through the options keyword argument:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
...

# configure webdriver
options = Options()
options.headless = True  # hide GUI
options.add_argument("--window-size=1920,1080")  # set window size to native GUI size
options.add_argument("start-maximized")  # ensure window is full-screen

...
driver = webdriver.Chrome(options=options)
#                         ^^^^^^^^^^^^^^^
driver.get("https://www.twitch.tv/directory/game/Art")

Additionally, when web scraping we don't need to render images, which is a slow and resource-intensive process. In Selenium, we can instruct the Chrome browser to skip image rendering through its preferences, set via an experimental option:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# configure webdriver
options = Options()
options.headless = True  # hide GUI
options.add_argument("--window-size=1920,1080")  # set window size to native GUI size
options.add_argument("start-maximized")  # ensure window is full-screen

...
# configure chrome browser to not load images
options.add_experimental_option(
    # this will disable image loading
    "prefs", {"profile.managed_default_content_settings.images": 2}
)
...

driver = webdriver.Chrome(options=options)
driver.get("https://www.twitch.tv/directory/game/Art")

If we were to set our options.headless setting back to False, we'd see that all the pages load without any media images. They are still there, but they're not being downloaded and embedded into our viewport - saving us loads of resources and time!

Finally, we can retrieve a fully rendered page and start parsing for data. Our driver is able to deliver the content of the current browser window through the driver.page_source attribute, but if we call it too early we'll get an almost empty page, as nothing has loaded yet!

Fortunately, Selenium has many ways of checking whether the page has loaded; however, the most reliable one is to check whether an element is present in the page via CSS selectors:

For parsing and CSS selectors, see our in-depth guide Web Scraping With Python 102: Parsing

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# configure webdriver
options = Options()
options.headless = True  # hide GUI
options.add_argument("--window-size=1920,1080")  # set window size to native GUI size
options.add_argument("start-maximized")  # ensure window is full-screen
# configure chrome browser to not load images
options.add_experimental_option(
    "prefs", {"profile.managed_default_content_settings.images": 2}
)

driver = webdriver.Chrome(options=options)
driver.get("https://www.twitch.tv/directory/game/Art")
# wait for page to load
element = WebDriverWait(driver=driver, timeout=5).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'div[data-target=directory-first-item]'))
)
print(driver.page_source)

Here, we are using a special WebDriverWait object, which blocks our program until a specific condition is met. In this case, our condition is the presence of an element, which we select through a CSS selector.
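presence_of_element_located is just one of many built-in expected conditions. As a quick sketch, here are two other commonly used ones wrapped in timeout handling (the button selector below is a hypothetical placeholder):

from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

try:
    # wait until the element is not just present in the DOM but actually visible
    WebDriverWait(driver, timeout=5).until(
        EC.visibility_of_element_located((By.CSS_SELECTOR, 'div[data-target=directory-first-item]'))
    )
    # wait until a button can be interacted with (hypothetical selector)
    WebDriverWait(driver, timeout=5).until(
        EC.element_to_be_clickable((By.CSS_SELECTOR, 'button[data-a-target=browse-sort-menu]'))
    )
except TimeoutException:
    # the condition wasn't met within 5 seconds
    print("page timed out - content did not appear")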

Parsing Data

We've started a browser, told it to go to twitch.tv, waited for the page to load and retrieved the page contents. With these contents at hand, we can finish up our project and parse the related data:

from parsel import Selector

sel = Selector(text=driver.page_source)
parsed = []
for item in sel.xpath("//div[contains(@class,'tw-tower')]/div[@data-target]"):
    parsed.append({
        'title': item.css('h3::text').get(),
        'url': item.css('.tw-link::attr(href)').get(),
        'username': item.css('.tw-link::text').get(),
        'tags': item.css('.tw-tag ::text').getall(),
        'viewers': ''.join(item.css('.tw-media-card-stat::text').re(r'(\d+)')),
    })

While Selenium offers parsing capabilities of its own, they are sub-par to what's available in Python's ecosystem. It's much better to pick up the HTML source of the rendered page and parse it with the parsel or beautifulsoup packages in a more efficient and pythonic fashion. In this example, we've used parsel to extract content using XPath and CSS selectors.

For parsing with BeautifulSoup, see our in-depth introduction Web Scraping with Python and BeautifulSoup
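For comparison, here's a rough sketch of what part of the same extraction would look like through Selenium's own element lookups - workable, but every lookup is a command sent to the live browser:

from selenium.webdriver.common.by import By

parsed = []
# each find_element(s) call is a round-trip to the browser,
# which is why local parsing with parsel tends to be much faster
for item in driver.find_elements(By.CSS_SELECTOR, '.tw-tower > div[data-target]'):
    parsed.append({
        'title': item.find_element(By.CSS_SELECTOR, 'h3').text,
        'url': item.find_element(By.CSS_SELECTOR, '.tw-link').get_attribute('href'),
    })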


In this section, we covered our first basic Selenium-based web scraper. We launched an optimized instance of a browser, told it to go to our web page, waited for the content to load and retrieved a rendered document!

These basic functions will already get you pretty far in web scraping; however, some edge cases might require more advanced automation functionality such as button clicking, text input and custom JavaScript execution - let's take a look at these.

Advanced Selenium Functions

Selenium is a pretty powerful automation library that is capable of much more than what we've discovered through our twitch.tv example.

For starters, sometimes we might need to click buttons and input text into forms to access the content we want to scrape. For this, let's take a look at how we can use Twitch.tv's search bar:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()
driver.get("https://www.twitch.tv/")
# find the search box and type in our query
search_box = driver.find_element(By.CSS_SELECTOR, 'input[aria-label="Search Input"]')
search_box.send_keys('fast painting')
# either press the enter key
search_box.send_keys(Keys.ENTER)
# or click the search button
search_button = driver.find_element(By.CSS_SELECTOR, 'button[icon="NavSearch"]')
search_button.click()

In the example above, we used a CSS selector to find our search box and input some keys. Then, to submit our search, we can either send a literal ENTER key or find the search button and click it.

Finally, the last important feature used in web scraping is JavaScript execution. Selenium essentially provides us with a full, running JavaScript interpreter, which allows us to fully control the page document and a big chunk of the browser itself!
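For instance, execute_script can return values from the browser back to Python and accept arguments through JavaScript's arguments array - a small sketch:

# return a value from the browser's javascript context back to python
title = driver.execute_script("return document.title")

# pass python values into the script through the arguments array
link_count = driver.execute_script(
    "return document.querySelectorAll(arguments[0]).length",
    "a[data-a-target]",  # an illustrative CSS selector passed in as an argument
)
print(title, link_count)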

To illustrate this further, let's take a look at scrolling.
Since Twitch uses so-called "endless pagination", to get results from the 2nd page we must instruct our browser to scroll to the bottom to trigger the loading of the next page:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.twitch.tv/directory/game/Art")
# find last item and scroll to it
driver.execute_script("""
let items=document.querySelectorAll('.tw-tower>div');
items[items.length-1].scrollIntoView();
""")

In this example, we used JavaScript execution to find all items on the page and then scroll the view to the last one, which triggers loading of the second page of results.
There are many ways to scroll content in a Selenium-controlled web browser, but the scrollIntoView() method is one of the most reliable ways to navigate the browser's viewport.
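If we need more than one extra page of results, this scroll can be repeated until the document stops growing - a minimal sketch of that common endless-pagination loop:

import time

last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    # scroll to the very bottom and give the new results a moment to load
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(2)
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # the page stopped growing - no more results to load
    last_height = new_height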


In this section, we've covered the main advanced Selenium functions used in web scraping: keyboard inputs, button clicking and JavaScript execution. With this knowledge, we're ready to scrape complex JavaScript-powered websites such as twitch.tv!

That being said, Selenium is not without its faults, and the biggest issue when it comes to developing web scrapers with it is scaling. Browsers are resource-heavy and slow, and on top of that, Selenium doesn't support asynchronous programming, which can speed things up the way Playwright and Puppeteer do (as we've covered in Scraping Dynamic Websites Using Browser Automation). This is why we at ScrapFly offer a scalable, Selenium-like JavaScript rendering service - let's take a quick look!

ScrapFly's Alternative

ScrapFly's API implements core web browser automation functions: page rendering, session/proxy management, custom JavaScript evaluation and page loading rules - all of which help create a highly scalable and easy-to-manage web scraper.

One important feature of ScrapFly's API is the seamless mixing of browser rendering and traditional HTTP requests, allowing developers to optimize their scrapers to their full potential.
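For example, static pages can be fetched as plain HTTP requests by simply omitting the render_js flag, while dynamic pages go through a browser - a small sketch using the same SDK (the first URL is just an illustrative static page):

from scrapfly import ScrapeConfig, ScrapflyClient

scrapfly = ScrapflyClient(key="YOUR KEY")
# a static page: plain, fast HTTP request - no browser involved
static_result = scrapfly.scrape(ScrapeConfig(url="https://www.twitch.tv/p/en/about/"))
# a dynamic page: the same client, but with browser rendering enabled
dynamic_result = scrapfly.scrape(
    ScrapeConfig(url="https://www.twitch.tv/directory/game/Art", render_js=True)
)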

Let's quickly take a look at how we can replicate our twitch.tv scraper in ScrapFly's SDK:

import json

from scrapfly import ScrapeConfig, ScrapflyClient


scrapfly = ScrapflyClient(key="YOUR KEY")
result = scrapfly.scrape(
    ScrapeConfig(
        url="https://www.twitch.tv/directory/game/Art",
        render_js=True,
        # ^ indicate to use browser rendering for this request
        country="US"
        # ^ applies a US proxy to the request to ensure we get results in English
    ),
)

parsed = []
for item in result.selector.xpath("//div[contains(@class,'tw-tower')]/div[@data-target]"):
#                  ^ ScrapFly offers a shortcut to parsel's Selector object
    parsed.append({
        'title': item.css('h3::text').get(),
        'url': item.css('.tw-link::attr(href)').get(),
        'username': item.css('.tw-link::text').get(),
        'tags': item.css('.tw-tag ::text').getall(),
        'viewers': ''.join(item.css('.tw-media-card-stat::text').re(r'(\d+)')),
    })

print(json.dumps(parsed, indent=2))

ScrapFly's API simplifies the whole process to a few parameter configurations. Not only that, but it automatically configures the backend browser with the best settings and determines when the content has fully loaded for the given scrape target!

For more on ScrapFly's browser rendering, see our official documentation: https://scrapfly.io/docs/scrape-api/javascript-rendering

Summary and Further Reading

In this short Selenium overview tutorial, we took a look at how we can use this web browser automation package for web scraping.
We reviewed most of the common functions used in scraping, such as navigation, button clicking, text input, waiting for content and custom JavaScript execution.
We also reviewed some common performance idioms, such as headless browsing and disabling image loading.

This knowledge should help you get started with Selenium-powered web scraping. Further, we advise taking a look at avoiding bot detection with selenium-stealth and at various scaling techniques like using multiple browser instances/tabs, asynchronous programming or task distribution.
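As a starting point for the scaling side, here's a minimal sketch of running several Selenium browsers in parallel with a thread pool; each thread gets its own driver, since webdriver instances are not thread-safe (the URLs are illustrative):

from concurrent.futures import ThreadPoolExecutor
from selenium import webdriver

urls = [
    "https://www.twitch.tv/directory/game/Art",
    "https://www.twitch.tv/directory/game/Music",
]

def scrape(url):
    driver = webdriver.Chrome()  # one browser per thread
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always release the browser process

with ThreadPoolExecutor(max_workers=2) as pool:
    pages = list(pool.map(scrape, urls))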


Related posts

Web Scraping with Python and BeautifulSoup

Introduction to web scraping with Python and the BeautifulSoup package. Tips, tricks and best practices, as well as a real-life example.

Scraping Dynamic Websites Using Browser Automation

Introduction to using web automation tools such as Puppeteer, Playwright, Selenium and ScrapFly to render dynamic websites for web scraping.

Web Scraping With Python 102: Parsing

Introduction to parsing content from web-scraped HTML documents: which libraries to use and common HTML parsing idioms in Python.