Scraping Dynamic Websites Using Web Browsers


The web is becoming increasingly complex and dynamic. Many modern websites rely heavily on JavaScript to render interactive data using front-end frameworks such as React, Angular and Vue.js.

This can be a major pain point in web scraper development, as traditional web scrapers do not run a full browser capable of executing the complex JavaScript code needed to render dynamic pages.

In this in-depth tutorial, we'll take a look at how we can use headless browsers to scrape data from dynamic web pages: what tools are available, how to use them, and what common challenges, tips and shortcuts to keep in mind when scraping with web browsers.

What is a dynamic web page?

One of the most commonly encountered web scraping issues is: why can't my scraper see the data I see in the web browser?

comparison what scraper sees vs browser

On the left is what the browser renders; on the right is what our HTTP web scraper receives - where did everything go?

Dynamic pages use complex JavaScript-powered web technologies that offload processing to the client. In other words, the server sends the user the logic pieces, and the user's browser has to put them together to see the whole rendered web page.
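
To see this difference for yourself, compare what a plain HTTP request returns with what the browser renders. A minimal sketch, assuming the requests and beautifulsoup4 packages are installed (the exact response depends on what Airbnb serves to a plain HTTP client):

import requests
from bs4 import BeautifulSoup

# fetch the raw HTML the way a traditional HTTP scraper would
response = requests.get(
    "https://www.airbnb.com/experiences/2496585",
    headers={"User-Agent": "Mozilla/5.0"},
)
soup = BeautifulSoup(response.text, "html.parser")
# dynamic, JavaScript-rendered elements are simply not in the raw HTML
print(soup.select_one("h1"))  # frequently None for dynamic pages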

How to scrape dynamic web pages?

There are a few ways to deal with dynamic javascript-generated content when scraping:

First, we could reverse engineer the website's behavior and replicate it in our scraper program. Unfortunately, as the complexity of the web grows, this approach is becoming increasingly difficult. On top of that, each individual website has to be reverse engineered separately, which quickly becomes impractical when scraping multiple sources.

Alternatively, we can automate a real browser to scrape dynamic web pages for us by integrating it into our web scraper program - that's exactly what we'll be doing today!

Example Scrape Task

For this article, we'll be using a real-world web-scraping example - we'll be scraping online experience data from https://www.airbnb.com/experiences.
We'll keep our demo task short and see how we can fully render a single experience page like https://www.airbnb.com/experiences/2496585.

example target illustration

Example of a page we'll be scraping using browser automation

Airbnb is one of the biggest websites using dynamic pages generated by the React JavaScript front-end framework. Without browser emulation, we'd have to reverse-engineer the website's JavaScript code before we could see and scrape its full HTML content.

However, with browser automation things are much simpler: we can launch a web browser, go to the page, wait for the dynamic content to load and render, and finally scrape the full page source and parse it.

We'll implement a short solution to this challenge with 4 different approaches: Selenium, Puppeteer, Playwright and ScrapFly's API, and see how they match up!

Browser Automation

Modern browsers such as Chrome and Firefox (and their derivatives) come with automation protocols built-in which allow other programs to control these web browsers.

Currently, there are two popular browser automation protocols:

  • The older WebDriver protocol, which is implemented through an extra browser layer called the webdriver. The webdriver intercepts action requests and issues browser control commands.
  • The newer Chrome DevTools Protocol (CDP for short). Unlike WebDriver, the CDP control layer is already built into most modern browsers.

In this article we'll mostly be covering CDP, but the developer experience of these protocols is very similar, often even interchangeable. For more on them, see the official Chrome DevTools Protocol and WebDriver (MDN) documentation pages.
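
As a small taste of CDP, Selenium 4's Chromium drivers expose an execute_cdp_cmd() method for sending raw protocol commands. A quick sketch - the specific commands shown here are just illustrative examples of what CDP offers:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.airbnb.com/experiences/2496585")
# ask the browser (via CDP) for basic version information
info = driver.execute_cdp_cmd("Browser.getVersion", {})
print(info.get("product"))  # e.g. "Chrome/121.0.0.0"
# CDP can also tweak the browser, e.g. emulate a different timezone
driver.execute_cdp_cmd("Emulation.setTimezoneOverride", {"timezoneId": "Europe/Vilnius"})
driver.quit()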

This is great for scraping as our scraper program can launch a headless web browser that runs in the background and executes our scraper's commands. For our Airbnb example, our scrape script would be as simple as:

  1. Launch a headless web browser and connect to it.
  2. Tell it to go to some URL.
  3. Wait for details to load.
  4. Return the loaded HTML page back to our scraper.
  5. Parse and process the data in our scraper script.

Let's take a look at how this can be done in different browser automation clients.

Selenium

Selenium is one of the first major automation clients, originally created for automating website testing. It supports both browser control protocols: WebDriver and CDP (the latter only since Selenium v4+).

Selenium is the oldest tool on our list, meaning it has a huge community, loads of features, support in almost every popular programming language and the ability to drive almost every web browser:

Languages: Java, Python, C#, Ruby, JavaScript, Perl, PHP, R, Objective-C and Haskell
Browsers: Chrome, Firefox, Safari, Edge, Internet Explorer (and their derivatives)
Pros: A big community that has been around for a while, meaning loads of free resources, plus an easy-to-understand synchronous API for common automation tasks.

Web Scraping with Selenium and Python

For more on Selenium, see our extensive introduction tutorial which covers Selenium usage in Python, common tips and tricks, best practices and an example project!


So, how does one go about scraping dynamic web pages with Python and Selenium?
Let's take a look at how we can use the Selenium webdriver to solve our airbnb.com scraping problem:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support.expected_conditions import visibility_of_element_located

# start a Chrome browser and go to our experience page
browser = webdriver.Chrome()
browser.get("https://www.airbnb.com/experiences/272085")
# wait up to 10 seconds for the first <h1> to become visible,
# which indicates the dynamic content has rendered
title = (
    WebDriverWait(driver=browser, timeout=10)
    .until(visibility_of_element_located((By.CSS_SELECTOR, "h1")))
    .text
)
# grab the fully rendered HTML source
content = browser.page_source
print(title)
print(content)
browser.close()

Here, we start by launching a web browser window and navigating to a single Airbnb experience page. Next, we tell our program to wait for the first header to appear, which indicates that the full HTML content has loaded.
With the page fully rendered, we can grab its HTML content and parse it just as we would parse the response of a traditional HTTP client-powered web scraper.
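
For example, the rendered page_source could then be handed to an HTML parser. A minimal sketch using BeautifulSoup (assuming the beautifulsoup4 package is installed; the selectors here are illustrative rather than Airbnb's real markup):

from bs4 import BeautifulSoup

# `content` is the rendered page_source from the Selenium example above
soup = BeautifulSoup(content, "html.parser")
print(soup.select_one("h1").text)
# illustrative only - the real structure on airbnb.com will differ
for heading in soup.select("h2"):
    print(heading.get_text(strip=True))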

However, one downside of Selenium is that it doesn't support asynchronous programming, meaning every time we tell the browser to do something it blocks our program until it's done.
When working with larger-scale web scrapers, non-blocking IO is an important scaling feature, as it lets us scrape multiple targets much faster.
With that in mind, let's take a look at clients that support asynchronous programming out of the box: Puppeteer and Playwright.

Puppeteer

Puppeteer is an asynchronous web browser automation library for JavaScript by Google (also usable from Python through the unofficial Pyppeteer package).

Languages: JavaScript, Python (unofficial)
Browsers: Chrome, Firefox (Experimental)
Pros: First strong implementation of CDP, maintained by Google, intended to be a general browser automation tool.

Compared to Selenium, Puppeteer supports fewer languages, but it fully implements the CDP protocol and has a strong team at Google behind it. Puppeteer also describes itself as a general-purpose browser automation client rather than fitting into the web testing niche - which is good news, as web scraping issues receive official attention.

Let's take a look at how our airbnb.com example would look in Puppeteer and JavaScript:

const puppeteer = require('puppeteer');

(async () => {
  // launch a headless Chromium browser and open a new tab
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://airbnb.com/experiences/272085');
  // wait for the first <h1> - a sign that the dynamic content has rendered
  await page.waitForSelector("h1");
  // grab the fully rendered HTML
  const content = await page.content();
  console.log(content);
  await browser.close();
})();

As you can see, Puppeteer usage looks almost identical to our Selenium example except for the await keyword, which indicates the asynchronous nature of our program (we'll cover the value of async programming further below).
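
Since Puppeteer is also usable from Python through the unofficial Pyppeteer package mentioned earlier, here's roughly how the same flow looks there - a sketch only, as Pyppeteer is community-maintained and mirrors Puppeteer's API rather than matching it exactly:

import asyncio
from pyppeteer import launch

async def run():
    # launch a headless Chromium browser and open a new tab
    browser = await launch()
    page = await browser.newPage()
    await page.goto("https://airbnb.com/experiences/272085")
    await page.waitForSelector("h1")
    content = await page.content()
    await browser.close()
    return content

asyncio.run(run())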

Web Scraping With a Headless Browser: Puppeteer

For more on Puppeteer, see our extensive introduction tutorial which covers Puppeteer usage in NodeJS, common tips and tricks, best practices and an example project!


Puppeteer is great, but the Chrome browser + JavaScript combination might not be the best option when it comes to maintaining complex web scraping systems. So let's continue our browser automation journey and take a look at Playwright, which is available in more languages and browsers, making it more accessible and easier to scale.

Playwright

Playwright is a web browser automation library by Microsoft, available in multiple languages with both synchronous and asynchronous clients.
The main goal of Playwright is reliable end-to-end modern web app testing, though it still implements all of the general-purpose browser automation functions (like Puppeteer and Selenium) and has a growing web-scraping community.

Languages: JavaScript, .NET, Java and Python
Browsers: Chromium (Chrome, Edge), Firefox and WebKit (Safari)
Pros: Feature rich, cross-language, cross-browser and provides both asynchronous and synchronous client implementations. Maintained by Microsoft.

Web Scraping with Playwright and Python

For more on Playwright, see our full hands-on introduction which covers an example project of scraping twitch.tv as well as common challenges, tips and tricks.


Let's continue with our airbnb.com example and see how it would look in Playwright and Python:

# asynchronous example:
import asyncio
from playwright.async_api import async_playwright

async def run():
    async with async_playwright() as pw:
        browser = await pw.chromium.launch()
        page = await browser.new_page()
        await page.goto('https://airbnb.com/experiences/272085')
        await page.wait_for_selector('h1')
        return await page.content()

asyncio.run(run())

# synchronous example:
from playwright.sync_api import sync_playwright

def run():
    with sync_playwright() as pw:
        browser = pw.chromium.launch()
        page = browser.new_page()
        page.goto('https://airbnb.com/experiences/272085')
        page.wait_for_selector('h1')
        return page.content()

As you can see, Playwright's API doesn't differ much from that of Selenium or Puppeteer, and it offers both a synchronous client for the convenience of simple scripts and an asynchronous client for additional performance scaling.

Playwright seems to tick all the boxes for browser automation: it's implemented in many languages, supports most web browsers and offers both async and sync clients.

ScrapFly's API

Web browser automation can be a difficult, time-consuming process: there are a lot of moving parts and variables, and many things can go wrong. That's why at ScrapFly we offer to do the heavy lifting for you!

ScrapFly's API implements core web browser automation functions: page rendering, session/proxy management, custom JavaScript evaluation and page loading rules - all of which help produce highly scalable and easy-to-manage web scrapers.

One important feature of ScrapFly's API is the seamless mixing of browser rendering and traditional HTTP requests, allowing developers to optimize their scrapers to their full potential.

Let's quickly take a look at how we can replicate our airbnb.com scraper in ScrapFly's Python SDK:

import asyncio
from scrapfly import ScrapeConfig, ScrapflyClient


async def run():
    scrapfly = ScrapflyClient(key="YOURKEY", max_concurrency=5)
    to_scrape = [
        ScrapeConfig(
            url="https://www.airbnb.com/experiences/272085",
            # enable headless browser rendering and wait for the first <h1>
            render_js=True,
            wait_for_selector="h1",
        ),
    ]
    results = await scrapfly.concurrent_scrape(to_scrape)
    print(results[0]['content'])

asyncio.run(run())

ScrapFly's API simplifies the whole process to a few parameter configurations. Not only that, but it automatically configures the backend browser with the best configuration for the given scrape target!

For more on ScrapFly's browser rendering and more, see our official documentation: https://scrapfly.io/docs/scrape-api/javascript-rendering

Which One To Choose?

We've covered three major browser automation clients: Selenium, Puppeteer and Playwright - so which one should you stick with?

Well, it entirely depends on the project you're working on, but both Playwright and Puppeteer have a huge advantage over Selenium in providing asynchronous clients, which are much easier to scale.

Further, if your project is in JavaScript, both Puppeteer and Playwright provide equally brilliant clients, though for other languages Playwright seems to be the best all-around solution.

With all that being said, we at ScrapFly offer a generic, scalable and easy-to-manage solution. Our API simplifies the entire process, implements many optimizations and handles access issues such as proxies and anti-bot protection that your scraper might otherwise encounter.


In the next section, let's take a quick look at the common challenges browser-based web scrapers have to deal with, which can also help you decide which of these libraries to choose.

Challenges and Tips

While automating a single instance of a browser appears to be an easy task, when it comes to web-scraping there are a lot of extra challenges that need solving, such as:

  • Avoiding being blocked.
  • Session persistence.
  • Proxy integration.
  • Scaling and resource usage optimization.

Unfortunately, none of the browser automation clients are designed for web-scraping first, thus solutions to these problems have to be implemented by each developer either through community extensions or custom code.

Let's take a look at a few of these common challenges and the existing common wisdom for dealing with them.

Fingerprinting and Blocking

Unfortunately, modern web browsers provide so much information about themselves that they can be easily identified and potentially blocked from accessing a website.

To prevent this, automated browsers need to be fortified against fingerprinting, which can be done by applying various stealth patches that hide these information leaks.

To understand fingerprinting, let's first take a look at what a browser fingerprint actually is. For example, https://abrahamjuliot.github.io/creepjs/ is a public analysis tool that displays fingerprinting information:

illustration of browser fingerprint

Fingerprint analysis of a Chrome browser controlled by Playwright (Python)

In the above screenshot, we can see that the analysis tool is successfully identifying us as a headless browser. Websites could use such analysis to identify us as a bot and block our access.

With this analysis information, we can start plugging these holes and fortifying our web scraper against detection. In other words, we can tell our controlled browser to lie about its configuration - making bot identification more difficult and the scraping process easier!
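
As a tiny illustration of such a patch, here's one of the simplest leaks - the navigator.webdriver flag - being hidden with Playwright's add_init_script in Python. This is a sketch of the idea only; real fortification requires many more patches, which is what the stealth tools listed below bundle together:

from playwright.sync_api import sync_playwright

with sync_playwright() as pw:
    browser = pw.chromium.launch()
    page = browser.new_page()
    # runs before any page script: hide the navigator.webdriver flag
    page.add_init_script(
        "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
    )
    page.goto("https://abrahamjuliot.github.io/creepjs/")
    print(page.evaluate("navigator.webdriver"))  # None instead of True
    browser.close()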

How Javascript is Used to Block Web Scrapers? In-Depth Guide

For more on this, see our in-depth article on how JavaScript fingerprinting is used to identify web scrapers.


Fortunately, the web scraping community has been working on this problem for years and there are existing tools that patch up major fingerprinting holes automatically:

  • Playwright: playwright-stealth
  • Puppeteer: puppeteer-extra-plugin-stealth
  • Selenium: selenium-stealth and undetected-chromedriver

Despite that, there are many domain-specific and secret techniques that still manage to fingerprint a browser even with all of these fortifications applied. It's often referred to as an endless cat-and-mouse game, and it's worth being aware of when scaling web scrapers.

We at ScrapFly invest a lot of time in this cat-and-mouse game and automatically apply the best approaches available for your scrape targets. So if you ever feel overwhelmed by this issue, take advantage of our countless hours of work!

Scaling - Asynchronous Clients

Web browsers are some of the most complex software in the world, thus unsurprisingly they use a lot of resources and are typically quite difficult to work with reliably. When it comes to scaling web scrapers powered by web browsers, there's one easy thing we can do that'll yield an immediate, significant performance boost: use multiple asynchronous clients!

In this article, we've covered Selenium, Playwright and Puppeteer. Unfortunately, Selenium only offers a synchronous implementation, which means our scraper program has to sit idle while the browser does blocking work (such as loading a page).
Alternatively, with the async clients offered by Puppeteer or Playwright, we can optimize our scrapers to avoid unnecessary waiting: we can run multiple browsers in parallel, which is a very efficient way to speed up the whole process:

illustration sync vs async

synchronous scraper compared to an asynchronous one

In this illustration, we see how the synchronous scraper waits for the browser to finish loading a page before it can continue, while the asynchronous scraper on the right, using 4 different browsers, eliminates this wait. In this imaginary scenario, our async scraper performs 4 requests in the time the sync one manages only one, and in real life the difference can be significantly higher!

Let's see how this would look in code. For this example, we'll use Python and Playwright and schedule 3 different URLs to be scraped asynchronously:

import asyncio
from asyncio import gather
from playwright.async_api import async_playwright
from playwright.async_api import Page
from typing import Tuple


async def scrape_3_pages_concurrently():
    async with async_playwright() as pw:
        # launch 3 browsers
        browsers = await gather(*(pw.chromium.launch() for _ in range(3)))
        # start 1 tab each on every browser
        pages = await gather(*(browser.new_page() for browser in browsers))

        async def get_loaded_html(page: Page, url: str) -> Tuple[str, str]:
            """go to url, wait for DOM to load and return url and return page content"""
            await page.goto(url)
            await page.wait_for_load_state("domcontentloaded")
            return url, await page.content()

        # scrape 3 pages asynchronously on 3 different pages
        urls = [
            "http://url1.com",
            "http://url2.com",
            "http://url3.com",
        ]
        htmls = await gather(*(
            get_loaded_html(page, url)
            for page, url in zip(pages, urls)
        ))
        return htmls


if __name__ == "__main__":
    asyncio.run(scrape_3_pages_concurrently())

In this short example, we start 3 web browser instances and then use them asynchronously to retrieve multiple pages.

Despite the integrated asynchronous nature of these clients, the scalability issue remains difficult to solve, mainly because there are so many things that can go wrong with web browsers: what if a browser tab crashes? How do we deal with persistent session information? How do we ensure efficient retry strategies? How do we reduce browser boot-up time?

These are all important questions that will eventually come up when scaling web scrapers, so we'd advise thinking about them early!
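
There's no single right answer to these questions, but as a hedged sketch of one common pattern - retrying a failed page load in a fresh tab - here's roughly how it could look with Playwright's async API (the retry count, timeout and error handling are illustrative choices, not a canonical recipe):

import asyncio
from playwright.async_api import async_playwright, Error as PlaywrightError

async def scrape_with_retries(url: str, retries: int = 3) -> str:
    async with async_playwright() as pw:
        browser = await pw.chromium.launch()
        for attempt in range(1, retries + 1):
            page = await browser.new_page()  # fresh tab per attempt
            try:
                await page.goto(url, timeout=30_000)
                await page.wait_for_load_state("domcontentloaded")
                return await page.content()
            except PlaywrightError as exc:
                print(f"attempt {attempt} failed: {exc}")
            finally:
                await page.close()
        raise RuntimeError(f"gave up on {url} after {retries} attempts")

asyncio.run(scrape_with_retries("https://www.airbnb.com/experiences/272085"))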

Disabling Unnecessary Load

Since we're running a real web browser intended for humans rather than robots, we waste a lot of computing and network resources on things robots don't need, such as rendering images, styling and accessibility features.

To get a noticeable optimization boost, we can tell our browser to block requests for non-critical resources. For example, in Playwright (Python) we can add a simple route rule to block image requests:

# assuming an existing async Playwright `browser` instance
page = await browser.new_page()
# block requests to png, jpg and jpeg files
await page.route("**/*.{png,jpg,jpeg}", lambda route: route.abort())
await page.goto("https://example.com")
await browser.close()

For image-heavy targets such as e-commerce websites, this simple rule could significantly increase loading speeds and reduce the bandwidth usage.
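
The same idea can be taken further by blocking entire resource types rather than file extensions. A sketch, again assuming an existing async Playwright browser instance as in the snippet above (which types to block depends on the target website):

# block images, stylesheets, fonts and media by resource type
BLOCKED = {"image", "stylesheet", "font", "media"}

async def block_heavy_resources(route):
    if route.request.resource_type in BLOCKED:
        await route.abort()
    else:
        await route.continue_()

page = await browser.new_page()
await page.route("**/*", block_heavy_resources)
await page.goto("https://example.com")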

That's just the tip of the iceberg - since browsers are such complex software projects, there's a lot of space to optimize for our unique purpose of web scraping.

At ScrapFly, we provide many optimization techniques automatically for the quickest rendering experience possible: we have a sophisticated custom browser pool on standby that implements smart shared caching and resource managing techniques to trim off every possible millisecond!

FAQ

To wrap up, let's take a look at some frequently asked questions about web scraping with headless browsers that we couldn't quite fit into the main article:

How can I tell whether it's a dynamic website?

The easiest way to determine whether a web page contains dynamic content is to disable JavaScript in your browser and see whether any data goes missing. Sometimes data might not be visible in the browser but is still present in the page source code - we can click "view page source" and look for it there. Often, dynamic data is located in JavaScript variables under <script> HTML tags. For more on that, see How to Scrape Hidden Web Data.
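
For that last check - data hidden in <script> tags - a quick sketch using requests and BeautifulSoup (both assumed to be installed; what the script tags actually contain varies per website):

import json
import requests
from bs4 import BeautifulSoup

html = requests.get(
    "https://www.airbnb.com/experiences/2496585",
    headers={"User-Agent": "Mozilla/5.0"},
).text
soup = BeautifulSoup(html, "html.parser")
# print the top-level keys of every <script> tag that embeds pure JSON
for script in soup.find_all("script"):
    text = (script.string or "").strip()
    if text.startswith("{"):
        try:
            data = json.loads(text)
        except json.JSONDecodeError:
            continue
        print(list(data)[:5])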

Should I parse HTML in the browser or in my scraper code?

While the browser has a very capable JavaScript environment, using HTML parsing libraries (such as BeautifulSoup in Python) will generally result in faster and easier-to-maintain scraper code.
A popular scraping idiom is to wait for the dynamic data to load and then pull the whole rendered page source (HTML code) into scraper code and parse the data there.

Can I scrape web applications or SPAs using browser automation?

Yes, web applications or Single Page Apps (SPAs) function the same as any other dynamic website. Using browser automation toolkits we can click around, scroll and replicate all the interactions a normal user could perform!

What are static page websites?

Static websites are essentially the opposite of dynamic websites - all the content is always present in the page source (HTML source code). However, static page websites can still use JavaScript to unpack or transform this data on page load, so browser automation can still be beneficial.

Can I scrape a JavaScript website with Python without using browser automation?

When it comes to scraping dynamic content with Python, we have two solutions: reverse engineering the website's behavior or using browser automation.

That being said, there's a lot of space in the middle for niche, creative solutions. For example, a common tool used in web scraping is Js2Py, which can execute JavaScript in Python. Using this tool we can quickly replicate some key JavaScript functionality without recreating it in Python.
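
As a heavily hedged sketch of what that looks like (assuming the js2py package; in a real scraper you'd feed it a token-generation or data-decoding function lifted from the target site rather than this toy example):

import js2py

# translate a small JavaScript function into a callable Python object
add = js2py.eval_js("function add(a, b) { return a + b }")
print(add(1, 2))  # 3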

What is a headless browser?

A headless browser is a browser instance without visible GUI elements. This means headless browsers can run on servers that have no displays. Headless Chrome and headless Firefox also run faster than their headful counterparts, making them ideal for web scraping.
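
In most automation clients this is just a launch option - for example in Playwright for Python, where headless mode is already the default:

from playwright.sync_api import sync_playwright

with sync_playwright() as pw:
    headless_browser = pw.chromium.launch(headless=True)   # no visible window, server-friendly
    headful_browser = pw.chromium.launch(headless=False)   # opens a visible window, handy for debugging
    headless_browser.close()
    headful_browser.close()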

Summary

In this overview article, we glanced at the capabilities of the most popular browser automation clients in the context of web scraping. We took a look at the classic Selenium client, Google's newer approach, Puppeteer, and Microsoft's Playwright, which aims to do everything, everywhere.

Finally, we covered a few common challenges when it comes to scaling and managing browser-emulation-based web scrapers, and how ScrapFly's API is designed to solve all of these issues for you!

Browser automation, while resource intensive and difficult to scale, can be a great generic solution for web-scraping dynamically generated websites and web apps - give it a shot!
