Scraping Dynamic Websites Using Browser

The web is becoming increasingly complex and dynamic. Many websites these days rely heavily on JavaScript to render interactive content (using frameworks such as React, Angular or Vue.js), which can be a challenging problem for web scrapers.

Traditional web scrapers use an HTTP client to request specific web resources; however, most of the time web servers expect the client to be a browser with full browser capabilities such as JavaScript execution and styling.

Thus, web scraper developers often run into a very common difficulty: the scraper program sees different data than the browser.

On the left is what the browser sees; on the right is our HTTP web scraper - where did everything go?
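
To see this for ourselves, here is a quick illustrative sketch (using Python's requests package; the exact response will vary and Airbnb may simply block plain HTTP clients):

# a quick illustration: fetch the page with a plain HTTP client and inspect
# what comes back - the javascript-rendered content is typically missing
# (or the request is blocked outright)
import requests

response = requests.get(
    "https://www.airbnb.com/experiences/2496585",
    headers={"User-Agent": "Mozilla/5.0"},  # a minimal browser-like header
)
print(response.status_code)
print(len(response.text), "characters of mostly-empty scaffolding HTML")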

There are a few ways to deal with this dynamic, javascript-generated content in the context of web scraping:

First, we could reverse engineer the website's behavior and replicate it in our scraper program. However, as the complexity of the web grows, this approach is becoming very time-consuming and difficult, and it has to be repeated for every website individually.
Alternatively, we can automate a real browser to do the heavy lifting for us by integrating it into our web scraper program.

In this article, we'll take a look at the latter: what are the most common browser automation approaches for web scraping? What are the common pitfalls and challenges of this approach? And how does ScrapFly's very own approach to this problem simplify it all?

Example Scrape Task

For this article, we'll be using a real world web-scraping example:

We'll be scraping online experience data from https://airbnb.com/experiences. We'll keep our demo task short and see how we can fully render a single experience page: https://www.airbnb.com/experiences/2496585.

example of a page we'll be scraping using browser automation

Airbnb is one of the biggest websites using a dynamic front-end generated by the React JavaScript framework. Without browser emulation, we'd have to reverse engineer how the website functions before we could see and scrape its full HTML content. However, with browser emulation things are much simpler: we can go to the page, wait for the contents to load and render, and finally pull the full page contents for parsing.

We'll implement a short solution to this challenge with 4 different approaches - Selenium, Puppeteer, Playwright and ScrapFly's API - and see how they match up!

Browser Automation

Modern browsers such as Chrome and Firefox (and their derivatives) come with built-in automation protocols. These protocols provide a standard way for programs to control an active browser instance in a headless manner (meaning without a GUI), which is great for web scraping as we can run the entire browser pipeline on a dedicated server rendering our webpages for us.

Currently, there are two popular browser automation protocols:

  • the older WebDriver protocol, which is implemented through an extra browser layer called the webdriver. The webdriver intercepts action requests and issues browser control commands.
  • the newer Chrome DevTools Protocol (CDP for short). Unlike WebDriver, the CDP control layer is already implemented in modern browsers themselves.

In this article we'll be mostly covering CDP, but the developer experience of these protocols is very similar, often even interchangeable.

For more on these protocols, see the official documentation pages: Chrome DevTools Protocol and the WebDriver MDN documentation.
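
To get a feel for what CDP actually is, here is a rough sketch of the raw protocol, assuming Chrome was started with --remote-debugging-port=9222 and the websockets package is installed - the automation clients covered below wrap exactly this kind of message exchange:

# a rough sketch of raw CDP usage - automation libraries hide this plumbing
import asyncio
import json

import requests
import websockets


async def navigate(url: str):
    # Chrome lists its available targets (tabs) on the debugging port
    targets = requests.get("http://localhost:9222/json").json()
    ws_url = targets[0]["webSocketDebuggerUrl"]
    async with websockets.connect(ws_url) as ws:
        # every CDP message is a JSON object with an id, method and params
        await ws.send(json.dumps({
            "id": 1,
            "method": "Page.navigate",
            "params": {"url": url},
        }))
        print(await ws.recv())  # acknowledgement with frame/loader ids


asyncio.run(navigate("https://www.airbnb.com/experiences/2496585"))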

Let's take a look at the most common browser automation clients and how we can solve our real world example in each one of them.

Selenium

Selenium is one of the first big automation clients, created for automating website testing. It supports two browser control protocols: the older WebDriver protocol and, since Selenium v4, the Chrome DevTools Protocol (CDP). It's a very popular package that is implemented in multiple languages and supports all major web browsers, meaning it has a huge community, a big feature set and a robust underlying structure.

Languages: Java, Python, C#, Ruby, JavaScript, Perl, PHP, R, Objective-C and Haskell
Browsers: Chrome, Firefox, Safari, Edge, Internet Explorer (and their derivatives)
Pros: A big community that has been around for a while - meaning loads of free resources - and an easy to understand synchronous API for common automation tasks.

Let's take a look at how we could use Selenium to solve our airbnb.com scraping problem. Our goal is to retrieve the fully rendered HTML of this experience page so we can later parse the available data.

# Python with Selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support.expected_conditions import visibility_of_element_located

browser = webdriver.Chrome()
browser.get("https://www.airbnb.com/experiences/272085")
# wait up to 10 seconds for the first <h1> to become visible -
# a sign that the javascript content has rendered
title = (
    WebDriverWait(driver=browser, timeout=10)
    .until(visibility_of_element_located((By.CSS_SELECTOR, "h1")))
    .text
)
# grab the fully rendered HTML document
content = browser.page_source
print(title)
print(content)
browser.close()

Here, we start by initiating a web browser window and requesting a single Airbnb experience page. Next, we tell our program to wait for the first header to appear, which indicates that the full HTML content has loaded.
With the page fully rendered, we can pull its HTML content and parse it just as we would parse the results of an HTTP client scraper.
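
For example, the rendered page_source can go straight into whatever HTML parser we'd normally use on HTTP responses - BeautifulSoup here, purely as an illustration:

# parsing the rendered HTML just like any other scraped document
from bs4 import BeautifulSoup

soup = BeautifulSoup(content, "html.parser")  # `content` from browser.page_source above
print(soup.select_one("h1").text)  # the experience title
print(len(soup.select("a")), "links found")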

However, one downside of Selenium is that it doesn't support asynchronous programming, meaning every time we tell the browser to do something, our program blocks until it's done.
When working with bigger scale web scrapers, non-blocking IO is an important scaling feature, as it lets us scrape multiple targets much faster. A common workaround is to spread blocking Selenium calls over a thread pool (a minimal sketch follows below), but for clients that support asynchronous programming out of the box, let's take a look at Puppeteer and Playwright.
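
A rough sketch of that thread-pool workaround, assuming one Chrome instance per worker thread (the second URL is purely illustrative):

# a minimal sketch: parallelising blocking Selenium calls with threads
from concurrent.futures import ThreadPoolExecutor

from selenium import webdriver


def scrape(url: str) -> str:
    browser = webdriver.Chrome()  # one browser per worker thread
    try:
        browser.get(url)
        return browser.page_source
    finally:
        browser.quit()


urls = [
    "https://www.airbnb.com/experiences/272085",
    "https://www.airbnb.com/experiences/2496585",
]
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(scrape, urls))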

Puppeteer

Puppeteer is an asynchronous web browser automation library for JavaScript by Google (also available in Python through the unofficial Pyppeteer package).

Languages: Javascript, Python (unofficial)
Browsers: Chrome, Firefox (Experimental)
Pros: First strong implementation of CDP, maintained by Google, intended to be a general browser automation tool.

Compared to Selenium, Puppeteer supports fewer languages but it fully implements the CDP protocol and has a strong team at Google behind it. Puppeteer also describes itself as a general purpose browser automation client rather than fitting itself into the web testing niche - which is good news, as web scraping receives official support.

Let's take a look at how our airbnb.com example would look in Puppeteer and JavaScript:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://airbnb.com/experiences/272085');
  // wait for the first <h1> - a sign the javascript content has rendered
  await page.waitForSelector("h1");
  // grab the fully rendered HTML document
  const content = await page.content();
  console.log(content);
  await browser.close();
})();

As you can see, Puppeteer usage looks almost identical to our Selenium example, except for the await keyword that indicates the async nature of our program (we'll cover the value of async programming further below).

Puppeteer is great, but Chrome plus JavaScript might not be the best option when it comes to maintaining complex web scraping systems. For that, let's continue our browser automation journey and take a look at Playwright, which is implemented in many more languages and browsers, making it more accessible and easier to scale.

Playwright

Playwright is a synchronous and asynchronous web browser automation library available in multiple languages, maintained by Microsoft. The main goal of Playwright is reliable end-to-end testing of modern web apps, but it still implements all the general purpose browser automation functions (like Puppeteer) and has a growing web scraping community.

Languages: Javascript, .Net, Java and Python
Browsers: Chromium (Chrome, Edge, Opera and other derivatives), Firefox, WebKit (Safari)
Pros: Feature rich, supports multiple languages and browsers, and provides both asynchronous and synchronous client implementations. Maintained by Microsoft.

Let's continue with our airbnb.com example and see how it would look in Playwright:

import asyncio
from playwright.async_api import async_playwright
from playwright.sync_api import sync_playwright

url = "https://airbnb.com/experiences/272085"

# asynchronous example
async def run_async():
    async with async_playwright() as pw:
        browser = await pw.chromium.launch()
        page = await browser.new_page()
        await page.goto(url)
        # wait for the first <h1> - a sign the javascript content has rendered
        await page.wait_for_selector('h1')
        return url, await page.content()

# synchronous example
def run_sync():
    with sync_playwright() as pw:
        browser = pw.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        page.wait_for_selector('h1')
        return url, page.content()

if __name__ == "__main__":
    _, content = asyncio.run(run_async())
    print(content)

As you can see, Playwright's API doesn't differ much from that of Selenium or Puppeteer, and it offers both a synchronous client for simple scripting convenience and an asynchronous client for additional performance scaling.

Playwright seems to tick all the boxes for browser automation: it's implemented in many languages, supports all major browsers and offers both async and sync clients.

ScrapFly's API

Web browser automation can be a difficult, time-consuming process: there are a lot of moving parts and variables, and many things can go wrong. That's why at ScrapFly we offer to do the heavy lifting for you!

ScrapFly's API implements the core web browser automation functions: page rendering, session/proxy management, custom javascript evaluation and page loading rules - all of which help produce highly scalable and easy to manage web scrapers.

One important feature of ScrapFly's API is the seamless mixing of browser rendering and traditional HTTP requests, allowing developers to optimize their scrapers to their full potential.

Let's quickly take a look at how we can replicate our airbnb.com scraper with ScrapFly's Python SDK:

import asyncio
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse


async def run():
    scrapfly = ScrapflyClient(key="YOURKEY", max_concurrency=5)
    to_scrape = [
        ScrapeConfig(
            url="https://www.airbnb.com/experiences/272085",
            # render the page in a real browser and wait for the first <h1>
            render_js=True,
            wait_for_selector="h1",
        ),
    ]
    results = await scrapfly.concurrent_scrape(to_scrape)
    print(results[0]['content'])


if __name__ == "__main__":
    asyncio.run(run())

ScrapFly's API simplifies the whole process to a few parameter configurations. Not only that, but it automatically configures the backend browser with the best configuration for the given scrape target!

For more on ScrapFly's browser rendering and more, see our official documentation: https://scrapfly.io/docs/scrape-api/javascript-rendering

Which One To Choose?

We've covered three major browser automation clients: Selenium, Puppeteer and Playwright - so which one should you stick with?

Well, it entirely depends on the project you're working on, but both Playwright and Puppeteer have a huge advantage over Selenium by providing asynchronous clients, which are much easier to scale when it comes to web scraping.
Further, if your project is in JavaScript, both Puppeteer and Playwright provide equally brilliant clients; for other languages, Playwright seems to be the best all-around solution.

With all that being said, we at ScrapFly offer a generic, scalable and easy to manage solution.
Our API simplifies the entire process, implements many optimizations and handles many access issues your scraper might encounter, such as proxies, anti-bot protection and so on.

For more on ScrapFly's capabilities, see our handy use case guide: https://scrapfly.io/use-case

Finally, to wrap this up, let's take a quick look at common challenges browser based web-scrapers have to deal with and our advice on how to approach them.

Challenges and Tips

While automating a single instance of a browser appears to be an easy task, when it comes to web-scraping there are a lot of extra challenges that need solving, such as:

  • Ban prevention.
  • Session persistence.
  • Proxy integration.
  • Scaling and resource usage optimization.

Unfortunately, none of the browser automation clients are designed for web scraping first, so solutions to these problems have to be implemented by each developer, either through community extensions or custom code.

In the following sections, let's take a look at a few common challenges and the existing common wisdom for dealing with them.

Fingerprints

Unfortunately, modern web browsers provide so much information about themselves that they can be easily identified and potentially blocked from accessing a website. To prevent this, automated browsers need to be fortified against fingerprinting, which is generally referred to as "stealthing".

First, to understand stealthing, let's take a look at what a browser fingerprint is. For example, https://abrahamjuliot.github.io/creepjs/ is a public analysis tool that displays fingerprinting information:

Fingerprint analysis of a headless Chrome browser controlled by Playwright (Python)

In the above screenshot, we can see that the analysis tool successfully identifies us as a headless browser. Websites can use this kind of analysis to identify us as a bot and block our access.
However, armed with this analysis information, we can start plugging these holes and fortifying our web scraper to prevent detection. In other words, we can tell our controlled browser to lie about its configuration - making bot identification more difficult and the scraping process easier!
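
As a small illustration (Playwright and Python here; dedicated stealth plugins patch many more leaks than this single one), the classic navigator.webdriver giveaway can be inspected and hidden like so:

# a small illustration of one fingerprinting leak and a naive patch -
# real stealth plugins cover many more signals than this one
import asyncio
from playwright.async_api import async_playwright


async def run():
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(headless=True)

        page = await browser.new_page()
        await page.goto("https://example.com")
        # automated browsers expose themselves through navigator.webdriver
        print(await page.evaluate("navigator.webdriver"))  # True

        patched = await browser.new_page()
        # run before any page script to hide the flag on every navigation
        await patched.add_init_script(
            "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
        )
        await patched.goto("https://example.com")
        print(await patched.evaluate("navigator.webdriver"))  # None (undefined)

        await browser.close()


asyncio.run(run())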

Fortunately, the web scraping community has been working on this problem for years, and there are existing tools that patch up the major fingerprinting holes automatically:

Playwright: playwright-stealth (an unofficial port of the Puppeteer stealth evasions)

Puppeteer: puppeteer-extra together with puppeteer-extra-plugin-stealth

Selenium: selenium-stealth and undetected-chromedriver

Despite that, there are many domain-specific and secret techniques that can still fingerprint a browser even with all of these fortifications applied. This is often referred to as an endless cat-and-mouse game, and it's worth being aware of it when scaling web scrapers.

We at ScrapFly invest a lot of time in this cat-and-mouse game and automatically apply the best approaches available for your scrape targets. So if you ever feel overwhelmed by this issue, take advantage of our countless hours of work!

Scaling - Asynchronous Clients

Web browsers are some of the most complex pieces of software in the world, so unsurprisingly they use a lot of resources and are typically quite difficult to work with reliably. When it comes to scaling web scrapers powered by web browsers, there's one easy thing we can do that yields an immediate, significant performance boost: drive multiple browsers through an asynchronous client!

In this article we've covered Selenium, Playwright and Puppeteer.
Unfortunately, Selenium only offers a synchronous implementation, which means our scraper program has to sit idle while the browser is doing blocking work (such as loading a page).
With the async clients offered by Puppeteer or Playwright, on the other hand, we can optimize our scrapers to avoid unnecessary waiting. We can run multiple browsers in parallel, which is a very efficient way to speed up the whole process:

synchronous scraper compared to an asynchronous one

In this illustration, we see how the synchronous scraper has to wait for the browser to finish loading a page before it can continue, while the asynchronous scraper on the right, driving 4 different browsers, eliminates this wait. In this imaginary scenario, our async scraper performs 4 requests while the sync one only manages one - and in real life this difference can be significantly higher!

Let's see how this would look in code. For this example, we'll use Python and Playwright and schedule 3 different URLs to be scraped asynchronously:

import asyncio
from asyncio import gather
from playwright.async_api import async_playwright
from playwright.async_api import Page
from typing import Tuple


async def scrape_3_pages_concurrently():
    async with async_playwright() as pw:
        # launch 3 browsers
        browsers = await gather(*(pw.chromium.launch() for _ in range(3)))
        # start 1 tab each on every browser
        pages = await gather(*(browser.new_page() for browser in browsers))

        async def get_loaded_html(page: Page, url: str) -> Tuple[str, str]:
            """go to url, wait for DOM to load and return url and return page content"""
            await page.goto(url)
            await page.wait_for_load_state("domcontentloaded")
            return url, await page.content()

        # scrape 3 urls asynchronously, one on each page
        urls = [
            "http://url1.com",
            "http://url2.com",
            "http://url3.com",
        ]
        htmls = await gather(*(
            get_loaded_html(page, url)
            for page, url in zip(pages, urls)
        ))
        return htmls


if __name__ == "__main__":
    asyncio.run(scrape_3_pages_concurrently())

In this short example, we start 3 web browser instances and then use them asynchronously to retrieve multiple pages.
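
Note that launching three full browser processes is fairly heavy. A lighter variation of the same idea (a sketch using the same Playwright API) is to launch a single browser and give each task its own isolated browser context:

# a lighter variation: one browser process, one isolated context per task
import asyncio
from asyncio import gather
from playwright.async_api import async_playwright


async def scrape_with_contexts(urls):
    async with async_playwright() as pw:
        browser = await pw.chromium.launch()
        # contexts are isolated (cookies, cache) but share one browser process
        contexts = await gather(*(browser.new_context() for _ in urls))
        pages = await gather(*(context.new_page() for context in contexts))

        async def fetch(page, url):
            await page.goto(url)
            await page.wait_for_load_state("domcontentloaded")
            return url, await page.content()

        return await gather(*(fetch(page, url) for page, url in zip(pages, urls)))


if __name__ == "__main__":
    asyncio.run(scrape_with_contexts(["http://url1.com", "http://url2.com"]))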

Despite the integrated asynchronous nature of these clients, scalability remains a difficult issue to solve, mainly because there are so many things that can go wrong with web browsers: what if a browser tab crashes? How do we deal with persistent session information? How do we ensure efficient retry strategies? How do we reduce browser boot-up time?

These are all important questions that will eventually come up when scaling web scrapers, so we advise starting to think about them early!
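
For instance, a crude retry strategy (a sketch only, assuming Playwright's async API) could recreate the tab and back off between attempts:

# a crude sketch of a retry strategy: recreate the tab and try again on failure
import asyncio
from playwright.async_api import Browser, Error as PlaywrightError


async def scrape_with_retries(browser: Browser, url: str, retries: int = 3) -> str:
    last_error = None
    for attempt in range(retries):
        page = await browser.new_page()
        try:
            await page.goto(url, timeout=30_000)
            await page.wait_for_load_state("domcontentloaded")
            return await page.content()
        except PlaywrightError as error:  # timeouts, crashed tabs, etc.
            last_error = error
            await asyncio.sleep(2 ** attempt)  # simple exponential backoff
        finally:
            try:
                await page.close()
            except PlaywrightError:
                pass  # the tab might already be gone
    raise last_error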

Disabling Unnecessary Load

Since we'd be running a real web browser that is intended for humans rather than robots, we'd actually be wasting a lot of computing and network resources on retrieving things robots don't need: images, styling, accessibility features and so on.
To get a noticeable optimization boost, we can configure our browser to block requests to non-critical resources. For example, in Playwright (Python) we can implement a simple routing rule:

page = await browser.new_page()
# block requests to png, jpg and jpeg files
await page.route("**/*.{png,jpg,jpeg}", lambda route: route.abort())
await page.goto("https://example.com")
await browser.close()

For image-heavy targets such as e-commerce websites, this simple rule alone can reduce page load times by up to 10 times!
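
The same idea can be taken further by filtering on resource categories instead of file extensions - a sketch extending the snippet above (same page and browser objects), using Playwright's request.resource_type:

# a sketch: block whole resource categories instead of file extensions
BLOCKED_RESOURCE_TYPES = {"image", "media", "font", "stylesheet"}

async def block_heavy_resources(route):
    if route.request.resource_type in BLOCKED_RESOURCE_TYPES:
        await route.abort()
    else:
        await route.continue_()

page = await browser.new_page()
await page.route("**/*", block_heavy_resources)
await page.goto("https://example.com")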

Finally, that's just the tip of the iceberg: since browsers are such complex software projects, there's a lot of room to optimize for our unique purpose of web scraping. At ScrapFly, we apply many optimization techniques automatically for the quickest rendering experience possible: we keep a sophisticated custom browser pool on standby that implements smart shared caching and resource management techniques to trim off every possible millisecond!

Summary

In this overview article, we took a glance at the capabilities of the most popular browser automation clients in the context of web scraping: the classic Selenium client, Google's newer approach - Puppeteer - and Microsoft's Playwright, which aims to do everything, everywhere.

Finally, we covered a few common challenges of scaling and managing browser-emulation-based web scrapers, and how ScrapFly's API is designed to solve all of these issues for you!

Browser automation, while resource intensive and difficult to scale, can be a great generic solution for web-scraping dynamically generated websites and web apps - give it a shot!


Banner image: "Wall-E Working" by Arthur40A is licensed under CC BY-SA 2.0
