Scraping Dynamic Websites Using Browser


The web is becoming increasingly complex and dynamic. Many websites these days rely heavily on JavaScript to render interactive content (using frameworks such as React, Angular or Vue.js), which can be a challenging problem for web scrapers. These are called dynamic websites, as they load and render content on the fly without changing the page URL.

Traditional web scrapers use an HTTP client to request specific web resources; however, most of the time web servers expect the client to be a browser with full browser capabilities such as JavaScript execution and styling.
Thus, web scraper developers often run into a very common problem when scraping dynamic web pages: the scraper program sees different data than the browser does.

comparison what scraper sees vs browser

On the left is what the browser sees; on the right is our HTTP web scraper - where did everything go?

There are a few ways to deal with this dynamic javascript generated content in the context of web-scraping:

First, we could reverse engineer the website's behavior and replicate it in our scraper program. Unfortunately, as the complexity of the web grows, this approach is becoming very time-consuming and difficult, and it has to be done for every website individually.
Alternatively, we can automate a real browser to scrape the dynamic web page for us by integrating it into our web scraper program.

In this article, we'll take a look at the latter - what are the most common browser automation approaches for dynamic web scraping with python? What are common pitfalls and challenges when scraping dynamic web pages? And ScrapFly's very own approach to this challenge, which simplifies it all!

Example Scrape Task

For this article, we'll be using a real world web-scraping example:

We'll be scraping online experience data from Airbnb. We'll keep our demo task short and see how we can fully render a single experience page like:

example target illustration

Example of a page we'll be scraping using browser automation

Airbnb is one of the biggest websites using dynamic pages generated by the React JavaScript front-end framework. Without browser emulation, we'd have to reverse engineer the website's JavaScript code before we could see and scrape its full HTML content.
However, with browser emulation things are much simpler: we can go to the page, wait for the dynamic content to load and render, and finally pull the full page source for parsing.

We'll implement a short solution to this challenge using 4 different approaches - Selenium, Puppeteer, Playwright and ScrapFly's API - and see how they match up!

Browser Automation

Modern browsers such as Chrome and Firefox (and their derivatives) come with built-in automation protocols. These protocols introduce a standard way for programs to control an active browser instance in a headless manner (meaning without GUI elements), which is great for web scraping, as we can execute the entire browser pipeline on a dedicated server that renders our web pages for us.

Currently, there are two popular browser automation protocols:

  • the older webdriver protocol, which is implemented through an extra browser layer called the webdriver. The webdriver intercepts action requests and issues browser control commands.
  • the newer Chrome DevTools Protocol (CDP for short). Unlike webdriver, the CDP control layer comes built into modern browsers.

In this article we'll be mostly covering CDP, but the developer experience of these protocols is very similar, often even interchangeable.

For more on these protocols, see the official documentation pages: Chrome DevTools Protocol and the WebDriver MDN documentation.

Let's take a look at the most common browser automation clients and how we can solve our real-world example in each one of them.


Selenium

Selenium is one of the first big automation clients, created for automating website testing. It supports two browser control protocols: the older webdriver protocol and, since Selenium v4, the Chrome DevTools Protocol (CDP). It's a very popular package implemented in multiple languages and supporting all major web browsers - meaning it has a huge community, a big feature set and a robust underlying structure.

Web Scraping with Selenium and Python

For more on Selenium, see our extensive introduction tutorial which covers Selenium usage in Python, common tips and tricks, best practices and an example project!


Languages: Java, Python, C#, Ruby, JavaScript, Perl, PHP, R, Objective-C and Haskell
Browsers: Chrome, Firefox, Safari, Edge, Internet Explorer (and their derivatives)
Pros: A big community that has been around for a while - meaning loads of free resources. An easy-to-understand synchronous API for common automation tasks.

So, what about scraping dynamic web pages with Python and Selenium?
Let's take a look at how we could use Selenium's webdriver to solve our scraping problem. Our goal is to retrieve the fully rendered HTML of this Airbnb experience page:

# Python with Selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support.expected_conditions import visibility_of_element_located

browser = webdriver.Chrome()
browser.get(url)  # url of the target Airbnb experience page
# wait until the first <h1> is visible - a sign the page has rendered
title = (
    WebDriverWait(driver=browser, timeout=10)
    .until(visibility_of_element_located((By.CSS_SELECTOR, "h1")))
)
content = browser.page_source

Here, we start by initiating a web browser window and requesting a single Airbnb experience page. Next, we tell our program to wait for the first header to appear, which indicates that the full HTML content has loaded.
With the page fully rendered, we can grab its HTML content and parse it just as we would parse the results of an HTTP client scraper.

However, one downside of using Selenium is that it doesn't support asynchronous programming: every time we tell the browser to do something, it blocks our program until it's done.
When working with bigger web scrapers, non-blocking IO is an important scaling feature, as it lets us scrape multiple targets much faster.
With that in mind, let's take a look at other clients that do support asynchronous programming out of the box: Puppeteer and Playwright.


Puppeteer

Puppeteer is an asynchronous web browser automation library for JavaScript by Google (also available in Python through the unofficial Pyppeteer package).

Languages: Javascript, Python (unofficial)
Browsers: Chrome, Firefox (Experimental)
Pros: First strong implementation of CDP, maintained by Google, intended to be a general browser automation tool.

Compared to Selenium, Puppeteer supports fewer languages, but it fully implements the CDP protocol and has a strong team at Google behind it. Puppeteer also describes itself as a general-purpose browser automation client rather than fitting itself into the web-testing niche - which is good news, as web scraping receives official support.

Let's take a look at how our example would look in puppeteer and javascript:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url);  // url of the target Airbnb experience page
  await page.waitForSelector("h1");  // wait for the first header to render
  const content = await page.content();
  await browser.close();
})();

As you can see, Puppeteer usage looks almost identical to our Selenium example, except for the await keyword that indicates the async nature of our program (we'll cover the value of async programming further below).

Web Scraping With a Headless Browser: Puppeteer

For more on Puppeteer, see our extensive introduction tutorial which covers Puppeteer usage in NodeJS, common tips and tricks, best practices and an example project!


Puppeteer is great, but the Chrome browser + JavaScript might not be the best option when it comes to maintaining complex web scraping systems. For that, let's continue our browser automation journey and take a look at Playwright, which is implemented in many more languages and browsers, making it more accessible and easier to scale.


Playwright

Playwright is a synchronous and asynchronous web browser automation library for multiple languages by Microsoft. The main goal of Playwright is reliable end-to-end testing of modern web apps; however, it still implements all general-purpose browser automation functions (like Puppeteer) and has a growing web scraping community.

Languages: Javascript, .Net, Java and Python
Browsers: Chrome, Firefox, Safari, Edge, Opera
Pros: Feature rich with multiple languages, browser support, both asynchronous and synchronous client implementations. Maintained by Microsoft.

Let's continue with our example and see how it would look in Playwright:

import asyncio
from playwright.async_api import async_playwright
from playwright.sync_api import sync_playwright

# asynchronous example
async def run(url):
    async with async_playwright() as pw:
        browser = await pw.chromium.launch()
        page = await browser.new_page()
        await page.goto(url)
        await page.wait_for_selector('h1')  # wait for the first header to render
        return url, await page.content()

# synchronous example
def run(url):
    with sync_playwright() as pw:
        browser = pw.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        page.wait_for_selector('h1')
        return url, page.content()

As you can see, Playwright's API doesn't differ much from Selenium's or Puppeteer's, and it offers both a synchronous client for simple script convenience and an asynchronous client for additional performance scaling.

Playwright seems to tick all the boxes for browser automation: it's implemented in many languages and browsers, and offers both async and sync clients.

ScrapFly's API

Web browser automation can be a difficult, time-consuming process: there are a lot of moving parts and variables - so many things that can go wrong. That's why at ScrapFly we offer to do the heavy lifting for you!

ScrapFly's API implements core web browser automation functions: page rendering, session/proxy management, custom javascript evaluation and page loading rules - all of which help produce highly scalable and easy-to-manage web scrapers.

One important feature of ScrapFly's API is the seamless mixing of browser rendering and traditional http requests - allowing developers to optimize scrapers to their full scraping potential.

Let's quickly take a look at how we can replicate our scraper in ScrapFly's Python SDK:

import asyncio
from scrapfly import ScrapeConfig, ScrapflyClient

async def run(urls):
    scrapfly = ScrapflyClient(key="YOURKEY", max_concurrency=5)
    to_scrape = [
        # render_js enables browser rendering for each page
        ScrapeConfig(url, render_js=True)
        for url in urls
    ]
    results = await scrapfly.concurrent_scrape(to_scrape)
    return results

ScrapFly's API simplifies the whole process to a few parameter configurations. Not only that, but it automatically configures the backend browser with the best settings for the given scrape target!

For more on ScrapFly's browser rendering and more, see our official documentation:

Which One To Choose?

We've covered three major browser automation clients: Selenium, Puppeteer and Playwright - so which one should you stick with?

Well, it entirely depends on the project you're working on, but both Playwright and Puppeteer have a huge advantage over Selenium by providing asynchronous clients, which are much easier to scale when it comes to web scraping.
Further, if your project is in JavaScript, both Puppeteer and Playwright provide equally brilliant clients; for other languages, Playwright seems to be the best all-around solution.

With all that being said, we at ScrapFly offer a generic, scalable and easy-to-manage solution.
Our API simplifies the entire process, implements many optimizations and handles the access issues your scraper might encounter, such as proxies and anti-bot protection.

For more on ScrapFly's capabilities, see our handy use case guide:

Finally, to wrap this up, let's take a quick look at common challenges browser based web-scrapers have to deal with and our advice on how to approach them.

Challenges and Tips

While automating a single browser instance appears to be an easy task, when it comes to web scraping there are a lot of extra challenges that need solving, such as:

  • Ban prevention.
  • Session persistence.
  • Proxy integration.
  • Scaling and resource usage optimization.

Unfortunately, none of the browser automation clients are designed for web scraping first, so solutions to these problems have to be implemented by each developer, either through community extensions or custom code.

In the next section let's take a look at a few common challenges and existing common wisdom for dealing with them.


Fingerprinting and Stealth

Unfortunately, modern web browsers provide so much information about themselves that they can be easily identified and potentially blocked from accessing a website. To prevent this, automated browsers need to be fortified against fingerprinting - generally referred to as "stealthing".

First, to understand stealthing, let's take a look at what a browser fingerprint is. Public analysis tools can display this fingerprinting information:

illustration of browser fingerprint

Fingerprint analysis of a Chrome browser controlled by Playwright (Python)

In the above screenshot, we can see that the analysis tool successfully identifies us as a headless browser. Websites could use such analysis to identify us as a bot and block our access.
However, with this analysis information, we can start plugging these holes and fortifying our web scraper against detection. In other words, we can tell our controlled browser to lie about its configuration - making bot identification more difficult and the scraping process easier!

Fortunately, the web scraping community has been working on this problem for years, and there are existing tools that patch up major fingerprinting holes automatically, such as:

  • puppeteer-extra-plugin-stealth for Puppeteer
  • playwright-stealth for Playwright
  • undetected-chromedriver for Selenium
Despite that, many domain-specific and secret techniques still manage to fingerprint a browser even with all of these fortifications applied. This is often referred to as an endless cat-and-mouse game, and it's worth being aware of it when scaling web scrapers.

We at ScrapFly invest a lot of time in this cat-and-mouse game and automatically apply the best approaches available for your scrape targets. So if you ever feel overwhelmed by this issue, take advantage of our countless hours of work!

Scaling - Asynchronous Clients

Web browsers are some of the most complex software in the world, so unsurprisingly they use a lot of resources and are typically quite difficult to work with reliably. When it comes to scaling browser-powered web scrapers, there's one easy thing we can do that yields an immediate, significant performance boost: use multiple asynchronous clients!

In this article we've covered Selenium, Playwright and Puppeteer.
Unfortunately, Selenium only offers a synchronous implementation, which means our scraper program has to sit idle while the browser is doing blocking work (such as loading a page).
Alternatively, with the async clients offered by Puppeteer or Playwright, we can optimize our scrapers to avoid unnecessary waiting - meaning we can run multiple browsers in parallel, which is a very efficient way to speed up the whole process:

illustration sync vs async

synchronous scraper compared to an asynchronous one

In this illustration, we see how the synchronous scraper has to wait for the browser to finish loading a page before it can continue, while the asynchronous scraper on the right, using 4 different browsers, eliminates this wait. In this imaginary scenario, our async scraper performs 4 requests while the sync one only manages one - and in real life this difference can be significantly higher!

Let's see how this would look in code. For this example, we'll use Python and Playwright to schedule 3 different URLs to be scraped asynchronously:

import asyncio
from asyncio import gather
from playwright.async_api import async_playwright
from playwright.async_api._generated import Page
from typing import List, Tuple

async def scrape_3_pages_concurrently(urls: List[str]):
    async with async_playwright() as pw:
        # launch 3 browsers
        browsers = await gather(*(pw.chromium.launch() for _ in range(3)))
        # start 1 tab each on every browser
        pages = await gather(*(browser.new_page() for browser in browsers))

        async def get_loaded_html(page: Page, url: str) -> Tuple[str, str]:
            """go to url, wait for the DOM to load and return url and page content"""
            await page.goto(url)
            await page.wait_for_load_state("domcontentloaded")
            return url, await page.content()

        # scrape 3 pages asynchronously on 3 different browsers
        htmls = await gather(*(
            get_loaded_html(page, url)
            for page, url in zip(pages, urls)
        ))
        return htmls

if __name__ == "__main__":
    # urls: a list of 3 target page URLs
    asyncio.run(scrape_3_pages_concurrently(urls))
In this short example, we start 3 web browser instances and then use them asynchronously to retrieve multiple pages.

Despite the integrated asynchronous nature of these clients, the scalability issue remains difficult to solve, mainly because there are so many things that can go wrong with web browsers: what if a browser tab crashes? How do we deal with persistent session information? How do we ensure efficient retry strategies? How do we reduce browser boot-up time?

These are all important questions that will eventually come up when scaling web scrapers, so we'd advise thinking about them early!

Disabling Unnecessary Load

Since we'd be running a real web browser intended for humans rather than robots, we'd actually waste a lot of computing and network resources retrieving things robots don't need, such as images, styling and accessibility features.
To get a noticeable optimization boost, we can modify our browser to block requests to non-critical resources. For example, in Playwright (Python) we can implement a simple route rule:

page = await browser.new_page()
# block requests to png, jpg and jpeg files
await page.route("**/*.{png,jpg,jpeg}", lambda route: route.abort())
await page.goto(url)  # url of the target page
await browser.close()

For image-heavy targets such as e-commerce websites, this simple rule can speed up page loads by up to 10 times!
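Beyond file-extension globs, Playwright also reports each request's resource type, which allows broader blocking rules. Here's a small sketch of such a route handler - note that the set of blocked types below is our own choice for illustration, not a Playwright default:

```python
# resource types we consider non-critical for scraping (our own choice)
BLOCKED_RESOURCE_TYPES = {"image", "stylesheet", "font", "media"}

def should_block(resource_type: str) -> bool:
    """decide whether a request of the given resource type should be aborted"""
    return resource_type in BLOCKED_RESOURCE_TYPES

async def handle_route(route):
    # Playwright passes a Route object; route.request.resource_type
    # describes what kind of resource the browser is requesting
    if should_block(route.request.resource_type):
        await route.abort()
    else:
        await route.continue_()

# usage: await page.route("**/*", handle_route)
```

Keeping the blocking decision in a plain function like should_block also makes the rule easy to unit test without launching a browser.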

Finally, that's just the tip of the iceberg: since browsers are such complex software projects, there's a lot of room to optimize for our unique purpose of web scraping. At ScrapFly, we apply many optimization techniques automatically for the quickest rendering experience possible: we have a sophisticated custom browser pool on standby that implements smart shared caching and resource management techniques to trim off every possible millisecond!


FAQ

How can I tell whether it's a dynamic website?

The easiest way to determine whether any dynamic content is present on a web page is to disable JavaScript in your browser and see if data goes missing. Sometimes data might not be visible in the browser but is still present in the page source - we can click "view page source" and look for it there. Often, dynamic data is located in JavaScript variables under <script> HTML tags.
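To illustrate that last point, here's a minimal sketch of pulling such hidden data out of a script variable with Python's standard library (the window.__DATA__ variable name and the sample HTML are made up for this example):

```python
import json
import re

# a simplified page source where dynamic data hides in a <script> variable
html = '<script>window.__DATA__ = {"title": "Surfing Lessons", "price": 50};</script>'

# capture the JSON object assigned to the variable and decode it
match = re.search(r'window\.__DATA__ = (\{.*?\});', html)
data = json.loads(match.group(1))
```

When a page embeds its data this way, this approach can spare us browser automation entirely.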

Should I parse HTML using browser or do it in my scraper code?

While the browser has a very capable JavaScript environment, using HTML parsing libraries (such as beautifulsoup in Python) generally results in faster and easier-to-maintain scraper code. A common idiom is to wait for the dynamic data to load, then pull the whole page source (the HTML) into the scraper code and parse the data there.
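As a quick sketch of that idiom, once the browser hands us the rendered HTML we can parse it with beautifulsoup (the sample HTML below stands in for a real page_source):

```python
from bs4 import BeautifulSoup

# a simplified stand-in for what browser.page_source would return
html = """
<html><body>
  <h1>Surfing Lessons</h1>
  <span class="price">$50</span>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")
title = soup.select_one("h1").text
price = soup.select_one(".price").text
```

The browser does the rendering; the scraper code does the parsing - each tool handles what it's best at.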

Can I scrape web applications or SPAs using browser automation?

Yes, web applications and single page apps (SPAs) function the same as any other dynamic website. Using browser automation toolkits we can click around, scroll and replicate all the user interactions a normal browsing session involves! So scraping these websites needs no special technology.

What are static page websites?

Static websites are essentially the opposite of dynamic websites: all the content is always present in the page source (HTML source code). However, static websites can still use JavaScript to unpack or transform this data on page load, so browser automation can still be beneficial to avoid reverse engineering the JavaScript code.

Can I scrape a javascript website with python without using browser automation?

When it comes to scraping dynamic content with Python, we have two extremes: reverse engineer the website's behavior or plow through it with browser automation. However, there's a lot of space in the middle for niche, creative solutions.

For example, a common tool used in web scraping is Js2Py, which can execute JavaScript in Python. Using it we can quickly replicate some key JavaScript functionality without recreating it in Python.


Summary

In this overview article, we glanced at the capabilities of the most popular browser automation clients in the context of web scraping: the classic Selenium client, Google's newer approach - Puppeteer - and Microsoft's Playwright, which aims to support every language and browser.

Finally, we covered a few common challenges when it comes to scaling and managing browser emulation based web-scrapers and how ScrapFly's API is designed to solve all of these issues for you!

Browser automation, while resource intensive and difficult to scale, can be a great generic solution for web-scraping dynamically generated websites and web apps - give it a shot!
