Web Scraping with Selenium and Python


The modern web is becoming increasingly complex and reliant on JavaScript, which makes traditional web scraping difficult. Traditional Python web scrapers cannot execute JavaScript, meaning they struggle with dynamic web pages - this is where Selenium, a browser automation toolkit, comes in handy!

Browser automation is frequently used in web-scraping to utilize browser rendering power to access dynamic content. Additionally, it's often used to avoid web scraper blocking as real browsers tend to blend in with the crowd easier than raw HTTP requests.

We've already briefly covered the three available tools - Playwright, Puppeteer and Selenium - in our overview article Scraping Dynamic Websites Using Browser Automation, and in this one we'll dig a bit deeper into Selenium - the most popular browser automation toolkit out there.

In this Selenium with Python tutorial, we'll take a look at what Selenium is and its common functions used in web scraping dynamic pages and web applications. We'll cover some general tips and tricks and common challenges, and wrap it all up with an example project by scraping twitch.tv.

Web Scraping With Python Tutorial

For a general introduction to web scraping in Python, see our extensive introduction tutorial, which is focused on using HTTP clients rather than web browsers.


What is Selenium?

Selenium was initially created as a tool to test website behavior, but it quickly became a general web browser automation tool used in web scraping and other automation tasks.

This tool is quite widespread and is capable of automating different browsers like Chrome, Firefox, Opera and even Internet Explorer through a middleware protocol called Selenium WebDriver.

WebDriver is the first browser automation protocol designed by the W3C organization, and it's essentially a middleware service that sits between the client and the browser, translating client commands into web browser actions.

illustration of selenium webdriver middleware-like functionality

Selenium WebDriver translates our Python client's commands into something a web browser can understand

Currently, it's one of two available protocols for web browser automation (the other being the Chrome DevTools Protocol), and while it's an older protocol it's still capable and perfectly viable for web scraping - let's take a look at what it can do!

Installing Selenium

Selenium WebDriver for Python can be installed through the pip command:

$ pip install selenium

However, we also need WebDriver-enabled browsers. We recommend the Firefox and Chrome browsers.

For more installation instructions, see the official Selenium installation instructions.
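Once a driver executable such as chromedriver or geckodriver is downloaded, we can either put it on the system PATH or point Selenium to it directly. Here's a quick sketch of both approaches (the file path below is just a placeholder):

from selenium import webdriver

# if the driver executable is on the system PATH, this is all that's needed:
driver = webdriver.Chrome()  # or webdriver.Firefox()

# otherwise, point Selenium to the executable explicitly (placeholder path):
driver = webdriver.Chrome(executable_path=r"/path/to/chromedriver")
driver.quit()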

When it comes to web scraping, we essentially need a few basic functionalities from the Selenium API: navigating to web pages, waiting for elements to load and button clicking/page scrolling. Occasionally, we might need more advanced functionality, such as text input or keyboard presses. Both the basic and advanced functions are easily accessible in Selenium - let's take a look!

A quick sneak peek of what we'll be doing. When developing with Selenium, we can use an interactive shell like ipython to test our workflow in real time!

Let's start off with the basics: navigation and conditions.
For this we'll use our example project: we'll be scraping current streams from the https://www.twitch.tv/ art section, where users stream their art creation process. We'll be collecting dynamic data like the stream name, viewer count and author.

Our current task is to:

  1. Start a Chrome web browser
  2. Go to https://www.twitch.tv/directory/game/Art
  3. Wait for page to finish loading
  4. Pick up HTML content of the current browser instance
  5. Parse data from the HTML content

Before we begin let's install Selenium itself:

$ pip install selenium
$ pip show selenium
Version: 3.141.0

To start with our scraper code, let's create a Selenium webdriver object and launch a Chrome browser:

from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://www.twitch.tv/directory/game/Art")

If we run this script, we'll see a browser window open up and take us to our Twitch URL. However, when web scraping we often don't want our screen to be taken up by all the GUI elements. For this we can use something called headless mode, which strips the browser of all GUI elements and lets it run silently in the background. In Selenium, we can enable it through the options keyword argument:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
...

# configure webdriver
options = Options()
options.headless = True  # hide GUI
options.add_argument("--window-size=1920,1080")  # set window size to native GUI size
options.add_argument("start-maximized")  # ensure window is full-screen

...
driver = webdriver.Chrome(options=options)
#                         ^^^^^^^^^^^^^^^
driver.get("https://www.twitch.tv/directory/game/Art")

Additionally, when web scraping we don't need to render images, which is a slow and resource-intensive process. In Selenium, we can instruct the Chrome browser to skip image rendering by setting a browser preference on our options object with the add_experimental_option() method:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait

# configure webdriver
options = Options()
options.headless = True  # hide GUI
options.add_argument("--window-size=1920,1080")  # set window size to native GUI size
options.add_argument("start-maximized")  # ensure window is full-screen

...
# configure chrome browser to not load images
options.add_experimental_option(
    # this will disable image loading
    "prefs", {"profile.managed_default_content_settings.images": 2}
)
...

driver = webdriver.Chrome(options=options)
driver.get("https://www.twitch.tv/directory/game/Art")
driver.quit()

If we were to set our options.headless setting back to False, we'd see that all the pages load without any media images. They are still there, but they're not being downloaded and embedded into our viewport - saving us loads of resources and time!

Finally, we can retrieve a fully rendered page and start parsing for data. Our driver is able to deliver the content of the current browser window (called the page source) through the driver.page_source attribute, but if we call it too early we'll get an almost empty page, as nothing has loaded yet!

Fortunately, Selenium has many ways of checking whether the page has loaded; however, the most reliable one is to check whether an element is present in the page via CSS selectors:

from parsel import Selector
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# configure webdriver
options = Options()
options.headless = True  # hide GUI
options.add_argument("--window-size=1920,1080")  # set window size to native GUI size
options.add_argument("start-maximized")  # ensure window is full-screen
# configure chrome browser to not load images
options.add_experimental_option(
    "prefs", {"profile.managed_default_content_settings.images": 2}
)

driver = webdriver.Chrome(options=options)
driver.get("https://www.twitch.tv/directory/game/Art")
# wait for page to load
element = WebDriverWait(driver=driver, timeout=5).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'div[data-target=directory-first-item]'))
)
print(driver.page_source)

Here, we are using a special WebDriverWait object which blocks our program until a specific condition is met. In this case, our condition is the presence of an element, which we select through a CSS selector.
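Note that if the condition is not met within the given timeout, WebDriverWait raises a TimeoutException, which we can catch to retry or fail gracefully. A minimal sketch, reusing the driver from above:

from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

try:
    WebDriverWait(driver=driver, timeout=5).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, 'div[data-target=directory-first-item]'))
    )
except TimeoutException:
    # the element never appeared within 5 seconds - retry, log or abort here
    print("page took too long to load")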

Parsing Dynamic Data

We've started a browser, told it to go to twitch.tv, waited for the page to load and retrieved the page contents. With these contents at hand, we can finish up our project and parse the related dynamic data:

from parsel import Selector

sel = Selector(text=driver.page_source)
parsed = []
for item in sel.xpath("//div[contains(@class,'tw-tower')]/div[@data-target]"):
    parsed.append({
        'title': item.css('h3::text').get(),
        'url': item.css('.tw-link::attr(href)').get(),
        'username': item.css('.tw-link::text').get(),
        'tags': item.css('.tw-tag ::text').getall(),
        'viewers': ''.join(item.css('.tw-media-card-stat::text').re(r'(\d+)')),
    })

While Selenium offers parsing capabilities of its own, they are sub-par to what's available in Python's ecosystem. It's much more efficient to pick up the HTML source of the rendered page and use the parsel or beautifulsoup packages to parse this content in a more efficient and pythonic fashion. In this example, we've used parsel to extract content using XPath and CSS selectors.
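For comparison, here's roughly what the same extraction could look like with BeautifulSoup. This is just a sketch: the selectors mirror the class-based ones used above and may need adjusting for the live page:

from bs4 import BeautifulSoup

soup = BeautifulSoup(driver.page_source, "html.parser")
parsed = []
for item in soup.select("div.tw-tower > div[data-target]"):
    title = item.select_one("h3")
    link = item.select_one("a.tw-link")
    parsed.append({
        "title": title.get_text(strip=True) if title else None,
        "url": link.get("href") if link else None,
    })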

Web Scraping with Python and BeautifulSoup

For parsing with BeautifulSoup, see our in-depth article which covers an introduction, tips and tricks, and best practices


In this section, we covered our first basic Selenium-based web scraper. We launched an optimized instance of a browser, told it to go to our web page, waited for the content to load and returned a rendered document!

These basic functions will already get you pretty far in web scraping; however, some edge cases might require more advanced automation functionality such as element clicking, text input and custom JavaScript execution - let's take a look at these.

Advanced Selenium Functions

Selenium is a pretty powerful automation library that is capable of much more than what we've discovered through our twitch.tv example.

For starters, sometimes we might need to click buttons and input text into forms to access the content we want to scrape. For this, let's take a look at how we can use Twitch.tv's search bar. We'll find the HTML elements for the search box and the search button and send our inputs there:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()
driver.get("https://www.twitch.tv/")
search_box = driver.find_element_by_css_selector('input[aria-label="Search Input"]') 
search_box.send_keys(
    'fast painting'
)
# either press the enter key
search_box.send_keys(Keys.ENTER)
# or click search button
search_button = driver.find_element_by_css_selector('button[icon="NavSearch"]')
search_button.click()

In the example above, we used a CSS selector to find our search box and input some keys. Then, to submit our search, we have the option of either sending a literal ENTER key press or finding the search button and clicking it to submit the search form.

Finally, the last important feature used in web scraping is JavaScript execution. Selenium essentially provides us with a full, running JavaScript interpreter which allows us to fully control the page document and a big chunk of the browser itself!

To illustrate this, let's take a look at scrolling.
Since Twitch uses so-called "endless pagination", to get results from the 2nd page we must instruct our browser to scroll to the bottom to trigger the loading of the next page:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.twitch.tv/directory/game/Art")
# find last item and scroll to it
driver.execute_script("""
let items=document.querySelectorAll('.tw-tower>div');
items[items.length-1].scrollIntoView();
""")

In this example, we used JavaScript execution to find all web elements on the page that represent videos and then scrolled the view to the last element, which tells the page to generate the second page of results.
There are many ways to scroll content in a Selenium-controlled web browser, but using the scrollIntoView() method is one of the most reliable ways to navigate the browser's viewport.
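To load several pages of results, we could repeat this scroll in a loop, pausing briefly to let new cards render. A rough sketch, where the item threshold and sleep duration are arbitrary example values:

import time

# keep scrolling until at least 50 stream cards are loaded (arbitrary example threshold)
while True:
    items = driver.find_elements_by_css_selector(".tw-tower > div")
    if not items or len(items) >= 50:
        break
    # scroll the last loaded card into view to trigger loading of the next batch
    driver.execute_script("arguments[0].scrollIntoView();", items[-1])
    time.sleep(1)  # give the page a moment to fetch and render new results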


In this section, we've covered the main advanced Selenium functions used in web scraping: keyboard inputs, button clicking and JavaScript execution. With this knowledge, we're ready to scrape complex JavaScript-powered websites such as twitch.tv!

That being said, Selenium is not without its faults, and the biggest issue when it comes to developing web scrapers with the selenium package is scaling. Browsers are resource-heavy and slow, and to top it off, Selenium doesn't support asynchronous programming, which might speed things up as Playwright and Puppeteer do (as we've covered in Scraping Dynamic Websites Using Browser Automation). That's why we at ScrapFly offer a scalable, Selenium-like javascript rendering service - let's take a quick look!

ScrapFly's Alternative

ScrapFly's API implements core web browser automation functions: page rendering, session/proxy management, custom javascript evaluation and page loading rules - all of which help create a highly scalable and easy to manage web scraper.

One important feature of ScrapFly's API is the seamless mixing of browser rendering and traditional HTTP requests - allowing developers to optimize scrapers to their full potential.

Let's quickly take a look at how we can replicate our twitch.tv scraper with ScrapFly's SDK:

import json

from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse


scrapfly = ScrapflyClient(key="YOUR KEY")
result = scrapfly.scrape(
    ScrapeConfig(
        url="https://www.twitch.tv/directory/game/Art",
        render_js=True,
        # ^ indicate to use browser rendering for this request.
        country="US"
        # ^ applies proxy to request to ensure we are getting results in english language.
        asp=True,
        # ^ bypass common anti web scraping protection services.
    ),
)

parsed = []
for item in result.selector.xpath("//div[contains(@class,'tw-tower')]/div[@data-target]"):
#                  ^ ScrapFly offers a shortcut to parsel's Selector object
    parsed.append({
        'title': item.css('h3::text').get(),
        'url': item.css('.tw-link::attr(href)').get(),
        'username': item.css('.tw-link::text').get(),
        'tags': item.css('.tw-tag ::text').getall(),
        'viewers': ''.join(item.css('.tw-media-card-stat::text').re(r'(\d+)')),
    })

print(json.dumps(parsed, indent=2))

ScrapFly's API simplifies the whole process to a few parameter configurations. Not only that, but it automatically configures the backend browser with the best browser settings and determines when the content has fully loaded for the given scrape target!

For more on ScrapFly's browser rendering and other features, see our official javascript rendering documentation.

FAQ

We've learned a lot in this article; let's digest some of it into a neat list of frequently asked questions:

Error: Geckodriver executable needs to be in PATH

This error usually means that geckodriver - the WebDriver client for Firefox browsers - is not installed on the machine. See the official release page for download instructions.
Alternatively, we can point Selenium to the geckodriver executable explicitly through the executable_path argument in the webdriver initiation: webdriver.Firefox(executable_path=r'your\path\geckodriver.exe')

How to take a screenshot in Selenium?

To take screenshots we can use the webdriver methods driver.save_screenshot() and driver.get_screenshot_as_file(). Screenshots are very useful for debugging headless browser workflows.
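For example (the target URL and file name here are arbitrary):

driver.get("https://www.twitch.tv/directory/game/Art")
driver.save_screenshot("twitch_art.png")  # writes a PNG of the current browser viewport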

How to type specific keyboard keys in Selenium?

To send non-character keyboard keys we can use the constants defined in selenium.webdriver.common.keys.Keys. For example, Keys.ENTER will send the enter key.
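For example, assuming search_box is an element we've already located (as in the search bar example earlier):

from selenium.webdriver.common.keys import Keys

search_box.send_keys("fast painting")    # regular text input
search_box.send_keys(Keys.ENTER)         # special key: Enter
search_box.send_keys(Keys.CONTROL, "a")  # key combination: Ctrl+A (select all)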

How to select a drop down value in Selenium?

To select drop-down values we can take advantage of Selenium's UI utilities. The Select object from selenium.webdriver.support.ui allows us to select values and execute various actions:

from selenium.webdriver.support.ui import Select

select = Select(driver.find_element_by_id('dropdown-box'))
# select by visible text
select.select_by_visible_text('first option')
# or by value
select.select_by_value('1')
How to scroll Selenium browser to a specific object?

The best way to reliably scroll through dynamic pages is to use JavaScript code execution. For example, to scroll to the last product item we'd use the scrollIntoView() JavaScript function:

driver.execute_script("""
let items=document.querySelectorAll('.product-box .product');
items[items.length-1].scrollIntoView();
""")
How to capture http requests in Selenium?

When a web browser connects to a web page it performs many HTTP requests, from the document itself to image and data requests. To capture these, the selenium-wire Python package can be used, which extends Selenium with request/response capturing capabilities:

from seleniumwire import webdriver  # selenium-wire's webdriver replaces the standard selenium import

driver = webdriver.Chrome()
driver.get('https://www.google.com')
for request in driver.requests:
    if request.response:
        print(
            request.url,
            request.response.status_code,
            request.response.headers['Content-Type']
        )
Can Selenium be used with Scrapy?

Scrapy is a popular web scraping framework in Python; however, because of their differing architectures, making scrapy and selenium work together is tough. Check out these open source attempts: scrapy-selenium and scrapy-headless.

Summary and Further Reading

In this short Python with Selenium tutorial, we took a look at how we can use this web browser automation package for web scraping.
We reviewed most of the common functions used in scraping, such as navigation, button clicking, text input, waiting for content and custom JavaScript execution.
We also reviewed some common performance idioms, such as headless browsing and disabling image loading.

This knowledge should help you get started with Selenium web scraping. Further, we advise taking a look at avoiding bot detection; for that, see our complete guide: How to Avoid Web Scraping Blocking: Javascript Guide.

Related Posts

How to Scrape YellowPages.com

Tutorial on how to scrape yellowpages.com business and review data using Python. How to avoid blocking to scrape data at scale and other tips.

How to Scrape Amazon.com

Tutorial on how to scrape Amazon.com's product and review data using Python and how to avoid blocking to scrape this information at scale.

How to Scrape Zillow.com

Tutorial on how to scrape Zillow.com sale and rent property data, using Python and how to avoid blocking to scrape at scale.

How to Scrape TripAdvisor.com

Tutorial on how to scrape TripAdvisor.com hotel, restaurant, tour and review data using Python and how to avoid blocking to scrape at scale.