Web Scraping With a Headless Browser: Puppeteer


When it comes to web scraping in JavaScript, there are generally two popular approaches: using HTTP clients like fetch and axios, or using browser automation tools like Puppeteer to control full web browsers.

Traditional HTTP client-based scraping is very efficient, but scraping dynamic web pages with it can be very difficult. Browser automation, while slower and more resource-intensive, is a much easier and more accessible form of web scraping.

In this tutorial, we'll take a look at Puppeteer - a brilliant open source browser automation library for JavaScript (NodeJS).

We'll start with a general Puppeteer API overview and then focus on web scraping features like how to retrieve pages and wait for them to load all of the content.
Finally, we'll take a look at common issues and challenges and wrap everything up with an example https://www.tiktok.com/ web scraper!

Web Scraping With NodeJS and Javascript

Sometimes Puppeteer might be more than we need for scraping - check out our introduction article to using just NodeJS for web-scraping.


Puppeteer Overview

What is Puppeteer and how does it work?
Modern web browsers contain special interfaces for automation and cross-program communication. In particular, the Chrome DevTools Protocol (aka CDP) is an API that allows programs to control Chrome or Firefox web browser instances through socket connections.

In other words, Puppeteer allows us to create web scrapers in Chrome or Firefox browsers.

CDP protocol usage illustration
CDP translates program commands to web browser commands
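
Under the hood, Puppeteer launches the browser and exchanges CDP messages with it over a websocket connection. As a small sketch (assuming the Chromium build that ships with Puppeteer), we can even peek at that CDP endpoint ourselves:

const puppeteer = require('puppeteer');

async function showEndpoint(){
    const browser = await puppeteer.launch();
    // every command Puppeteer sends travels through this CDP websocket
    console.log(browser.wsEndpoint());
    // prints something like ws://127.0.0.1:XXXXX/devtools/browser/<id>
    await browser.close();
}
showEndpoint();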

As you can imagine, Puppeteer is a brilliant tool for web scraping! Automating a web browser gives our web scraper several advantages:

  • Web Browser based scrapers see what users see. In other words, the browser renders all scripts, images, etc. - making web scraper development much easier.
  • Web Browser based scrapers are harder to detect and block. Since we look like normal website users, we are much harder to identify as robots.

That being said, there are some negatives.
Browsers are complex software projects and are very resource intensive. In turn, more complexity also requires more developer diligence and maintenance.

Tip: Puppeteer in REPL

The easiest way to experiment and get the hang of Puppeteer is to use the NodeJS REPL and try Puppeteer out in real time. See this video for a quick intro:

Puppeteer overview in NodeJS REPL

Now, let's take a look at this in greater detail.

The Basics

The Puppeteer NodeJS library can be installed through the npm package manager with these terminal commands:

$ mkdir myproject && cd myproject
$ npm init
$ npm install puppeteer

The first thing we should note is that Puppeteer is an asynchronous NodeJS library. This means we'll be working in the context of Promises and async/await programming. If you're unfamiliar with async/await syntax in JavaScript, we recommend this quick introduction article by MDN.
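
As a tiny refresher (nothing Puppeteer-specific here), this is the async/await pattern we'll be using throughout:

// async functions return Promises; await pauses until a Promise resolves
async function main(){
    const greeting = await Promise.resolve('hello async world');
    console.log(greeting);
}
main();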

Now with our package ready, let's start with the most basic example. We'll launch a Chrome browser (headless mode means a version of the browser that runs without any GUI elements), tell it to go to a website, wait for the page to load and retrieve the HTML page source:

// import puppeteer library
const puppeteer = require('puppeteer')

async function run(){
    // First, we must launch a browser instance
    const browser = await puppeteer.launch({
        // The headless option allows us to disable the visible GUI, so the browser runs in the "background".
        // For development let's keep this set to false so we can see what's going on,
        // but on a server we should set it to true.
        headless: false,
        // This setting allows us to scrape non-https websites easier
        ignoreHTTPSErrors: true,
    })
    // then we need to start a browser tab
    let page = await browser.newPage();
    // and tell it to go to some URL
    await page.goto('http://httpbin.org/html', {
        waitUntil: 'domcontentloaded',
    });
    // print html content of the website
    console.log(await page.content());
    // close everything
    await page.close();
    await browser.close();
}

run();

In this basic example, we create a visible browser instance, start a new tab, go to the http://httpbin.org/html webpage and print its contents. When scraping with Puppeteer we'll mostly be working with Page objects, which are essentially web browser tabs. In this example, we're using two methods: goto(), which tells the tab where to navigate, and content(), which returns the webpage source code.

With this basic knowledge, we can start to explore common Puppeteer usage patterns. Let's start with waiting for page content to load.

Waiting For Content

In this basic script, we encounter our first problem: How do we know when the page is loaded and ready to be parsed for data?

In this example, we used the waitUntil argument to tell the browser to wait for the domcontentloaded signal, which is fired when the browser has read the HTML content of the page. However, this might not work for every page, as dynamic pages might continue loading content even after the HTML has been read by the browser.

browser page load order
Illustration of how web browsers load web pages
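
For reference, goto() accepts several waitUntil values, each firing at a different point of this load process - here's a quick sketch of the options (the URL is just a placeholder):

// 'domcontentloaded' - the HTML has been parsed; scripts may still be running
await page.goto('http://httpbin.org/html', {waitUntil: 'domcontentloaded'});
// 'load' - the load event fired: images, stylesheets etc. have finished loading
await page.goto('http://httpbin.org/html', {waitUntil: 'load'});
// 'networkidle0' - no network connections for at least 500 ms
await page.goto('http://httpbin.org/html', {waitUntil: 'networkidle0'});
// 'networkidle2' - no more than 2 network connections for at least 500 ms
await page.goto('http://httpbin.org/html', {waitUntil: 'networkidle2'});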

When dealing with modern, dynamic websites that use javascript, it's good practice to wait for content explicitly instead of relying on the load and domcontentloaded signals:

await page.goto('http://httpbin.org/html');
await page.waitForSelector('h1', {timeout: 5_000})

Here, we're telling Puppeteer to wait for the <h1> node to appear in the document body for a maximum of 5 seconds (5000 milliseconds). Since we're scraping HTML content, relying on HTML structure loading is much safer than browser events. Using waitForSelector() is the best way to ensure our content has loaded!
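
Keep in mind that waitForSelector() throws a TimeoutError when the element never appears within the timeout, so for flaky pages it's worth wrapping it in a try/catch - a minimal sketch:

try {
    await page.waitForSelector('h1', {timeout: 5_000});
} catch (e) {
    // checking the error name keeps this working across Puppeteer versions
    if (e.name === 'TimeoutError') {
        console.log('expected content did not appear in time - retry or skip this page');
    } else {
        throw e;
    }
}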

Selecting Content

As Puppeteer runs a full web browser, we have access to both CSS selectors and XPath selectors. These two tools allow us to select specific page parts and extract the displayed data or submit events like clicks and text inputs. Let's look at how to select HTML elements in puppeteer scraping.

Parsing HTML with Xpath

For more on XPATH selectors see our complete introduction tutorial which covers basic usage, tips and tricks and common web scraping idioms


The Page object comes with several methods that allow us to find ElementHandle objects which we can extract or use as a click/input target:

const puppeteer = require('puppeteer');

async function run() {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    // we can use .setContent to set page html to some test value:
    await page.setContent(`
    <div class="links">
    <a href="https://twitter.com/@scrapfly_dev">Twitter</a>
    <a href="https://www.linkedin.com/company/scrapfly/">LinkedIn</a>
    </div>
    `);
    // using .$ we can select the first occurring value and get its inner text or attribute:
    console.log(await (await page.$('.links a')).evaluate(node => node.innerText));
    console.log(await (await page.$('.links a')).evaluate(node => node.getAttribute("href")));

    // using .$$ we can select multiple values:
    let links = await page.$$('.links a');
    // or using xpath selectors instead of css selectors:
    // let links = await page.$x('//*[contains(@class, "links")]//a');
    for (const link of links){
        console.log(await link.evaluate( node => node.innerText));
        console.log(await link.evaluate( node => node.getAttribute("href")));
    }
    await browser.close();
}

run();

As you can see, the Page object gives us access to both CSS and XPATH selectors. We can either extract the first found element using the .$ method or all of the matching elements using the .$$ one.
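
For convenience, Puppeteer also offers the $eval() and $$eval() shortcuts, which select elements and evaluate a function against them in a single call - a brief sketch using the same example document:

// extract the first matching link's text in one call:
let firstText = await page.$eval('.links a', node => node.innerText);
// extract an attribute from every matching link:
let allHrefs = await page.$$eval('.links a', nodes => nodes.map(node => node.getAttribute('href')));
console.log(firstText, allHrefs);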

Parsing HTML with CSS Selectors

For more on CSS selectors see our complete introduction tutorial which covers basic usage, tips and tricks and common web scraping idioms


We can also trigger mouse clicks, button presses and text inputs the same way:

await page.setContent(`
<div class="links">
  <a href="https://twitter.com/@scrapfly_dev">Twitter</a>
  <input></input>
</div>
`);
// enter text into the input
await (await page.$('input')).type('hello scrapfly!', {delay: 100});
// press the Enter key
await (await page.$('input')).press('Enter');
// click on the first link
await (await page.$('.links a')).click();

Puppeteer gives us access to navigation and parsing functionalities of the browser - we can use these functions to crawl our target and even parse the page contents!


Now that we know how to navigate our browser, wait for content to load and parse the HTML DOM we should solidify this knowledge with a real-life example.

Example Project: tiktok.com

In this example project, we'll be scraping https://www.tiktok.com/ public user details and their video metadata. Our scraper goal is to:

  1. Go to https://www.tiktok.com
  2. Search Top videos of tag #cats
  3. Go to each top video creator's page
  4. Collect creator's details: name, follower and likes count
  5. Go to the latest 5 videos and collect their details: description, link and likes count

We'll use functional programming in this example, so let's start from the bottom up:

// scrapes video details
async function scrapeVideo(browser, videoUrl){
    let page = await browser.newPage();
    await page.goto(videoUrl, { waitUntil: 'domcontentloaded'});

    // wait for the page to load
    await page.waitForSelector('strong[data-e2e=like-count]')

    let likes = await(await page.$('strong[data-e2e=like-count]')).evaluate(node => node.innerText); 
    let comments = await(await page.$('strong[data-e2e=comment-count]')).evaluate(node => node.innerText); 
    let desc = await(await page.$('div[data-e2e=video-desc]')).evaluate(node => node.innerText); 
    let music = await(await page.$('h4[data-e2e=video-music] a')).evaluate(node => node.getAttribute('href')); 
    await page.close();
    return {likes, comments, desc, music}
}

Here, we have our first function, which takes in a Browser object and a URL to a TikTok video. We'll design our web scraper to use a single Browser object and pass it around to functions to keep things in the realm of functional programming. This will also allow us to scale up later by using multiple browsers scraping multiple pages.
In this case, we're starting a new tab, navigating to the video URL and waiting for the like count to appear. Once it's there we parse our details, close the tab and return the results.

Now, let's do the same thing for the creator's page:

// scrapes user and their top 5 video details
async function scrapeCreator(browser, username){
    let page = await browser.newPage();
    await page.goto('http://tiktok.com/' + username);
    await page.waitForSelector('div[data-e2e="user-post-item"] a');
    // parse user data
    let followers = await(await page.$('strong[data-e2e=followers-count]')).evaluate(node => node.innerText); 
    let likes = await(await page.$('strong[data-e2e=likes-count]')).evaluate(node => node.innerText); 

    // parse user's video data
    let videoLinks = [];
    let links = await page.$$('div[data-e2e="user-post-item"] a');
    for (const link of links){
        videoLinks.push(await link.evaluate( node => node.getAttribute('href')));
    };
    let videoData = await Promise.all(videoLinks.slice(0, 5).map(
        url => scrapeVideo(browser, url)
    ))
    await page.close()
    return {username, likes, followers, videoData}
}

Here, we're integrating our scrapeVideo() function to not only pick up the creator's details but the details of the most recent 5 videos too. We're using Promise.all to concurrently execute 5 promises, so in the browser, you'd see 5 tabs open up and scrape 5 video details at the same time!
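
Keep in mind that Promise.all fires every promise at once, so for creators with many videos this could open a lot of tabs. If that becomes a problem, a small hypothetical batching helper (not part of the scraper above) can cap how many tabs run simultaneously:

// scrape urls in batches of `size` tabs at a time (hypothetical helper)
async function scrapeInBatches(browser, urls, size){
    let results = [];
    for (let i = 0; i < urls.length; i += size){
        const batch = urls.slice(i, i + size);
        // wait for the current batch to finish before opening the next set of tabs
        results.push(...await Promise.all(batch.map(url => scrapeVideo(browser, url))));
    }
    return results;
}

// e.g. scrape 10 videos but never more than 2 tabs at once:
// let videoData = await scrapeInBatches(browser, videoLinks.slice(0, 10), 2);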

We're left to implement our discovery function. We want to go to tiktok.com, enter some text in the search bar and extract the first few results:

// finds users of top videos of a given query
async function findTopVideoCreators(browser, query){
    let page = await browser.newPage();
    // search for cat videos:
    await page.goto('http://tiktok.com/', { waitUntil: 'domcontentloaded'});
    let searchBox = await page.$('input[type=search]')
    await searchBox.type(query, {delay:111});
    await page.waitForTimeout(500);  // we need to wait a bit before pressing enter
    await searchBox.press('Enter');

    // wait for search results to load
    await page.waitForSelector('a[data-e2e="search-card-user-link"]');

    // find all user links
    let userLinks = [];
    let links = await page.$$('a[data-e2e="search-card-user-link"]');
    for (const link of links){
        userLinks.push(await link.evaluate( node => node.getAttribute('href')));
    };
    await page.close();
    return userLinks;
}

In the example above, we go to the website's homepage, send some text to the search input box with a delay to appear more human, press the Enter key and wait for the results to load. Once everything loads, we pick up the creator links displayed on the first page.

Finally, we should wrap everything with a runner function that joins these individual pieces:

async function run(query){
    const browser = await puppeteer.launch({
          headless: false,
          ignoreHTTPSErrors: true,
          args: [`--window-size=1920,1080`],
          defaultViewport: {
            width:1920,
            height:1080
          }
        });

    let creatorNames = await findTopVideoCreators(browser, query);
    let creators = await Promise.all(creatorNames.slice(0, 3).map(
        url => scrapeCreator(browser, url)
    ))
    console.log(creators);
    await browser.close();
}

// run scraper with cats!
run("#cats");

Here, we create our main function that takes in a query text and scrapes the top 3 creators for that query along with their video details. We should see results similar to:

{
  username: '/@cutecatcats',
  likes: '10.2M',
  followers: '571.7K',
  videoData: [
    {
      likes: '7942',
      comments: '105',
      desc: 'Standing like a human🤣🤣🤣#cutecatcats #catoftiktok #fyp #고양이 #catlover #cat #catbaby',
      music: '/music/original-sound-7055891421471001390'
    },
    ...
  ]
}

We're left with data post-processing tasks (converting string counters like "10.2M" into real numbers, etc.). Furthermore, we skipped a lot of error handling to keep this section brief, but it's always a good idea to implement at least basic retry logic, as web browsers can misbehave and break!
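
For example, a hypothetical retry wrapper and a counter parser for values like "10.2M" might look something like this (both are sketches, not part of the scraper above):

// retry an async scraping function a few times before giving up (hypothetical helper)
async function withRetries(fn, attempts = 3){
    for (let i = 1; i <= attempts; i++){
        try {
            return await fn();
        } catch (e) {
            console.log(`attempt ${i} failed: ${e.message}`);
            if (i === attempts) throw e;
        }
    }
}

// convert strings like "7942", "571.7K" or "10.2M" to plain numbers
function parseCount(text){
    const multipliers = {K: 1e3, M: 1e6, B: 1e9};
    const match = text.trim().match(/^([\d.]+)([KMB])?$/);
    if (!match) return NaN;
    return parseFloat(match[1]) * (multipliers[match[2]] || 1);
}

// usage:
// let creator = await withRetries(() => scrapeCreator(browser, '/@cutecatcats'));
// parseCount('10.2M');  // 10200000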


Now that we solidified our knowledge with the web scraper example let's take a look at where we can move from here. What are common challenges Puppeteer web scrapers face, and how can we solve them?

Common Challenges

Regarding headless browser scraping, there are primarily two kinds of challenges: Scraping speed and Bot Detection.

Let's take a look at common tips and tricks we can apply in our puppeteer library-powered web scraper to solve these two issues.

Scraping Speed/Resource Optimizations

The most effective thing we can do to speed up our Puppeteer scrapers is to disable image and video loading. When web scraping, we don't care whether images are loaded into the webpage as we don't need to see them.

Note: the images and videos are still in the page source, so by disabling loading, we're not going to lose any data.

We can configure Puppeteer's headless browser with request interception rules that block images and analytics traffic:

// we can block by resource type like fonts, images etc.
const blockResourceType = [
  'beacon',
  'csp_report',
  'font',
  'image',
  'imageset',
  'media',
  'object',
  'texttrack',
];
// we can also block by domains, like google-analytics etc.
const blockResourceName = [
  'adition',
  'adzerk',
  'analytics',
  'cdn.api.twitter',
  'clicksor',
  'clicktale',
  'doubleclick',
  'exelator',
  'facebook',
  'fontawesome',
  'google',
  'google-analytics',
  'googletagmanager',
  'mixpanel',
  'optimizely',
  'quantserve',
  'sharethrough',
  'tiqcdn',
  'zedo',
];

const page = await browser.newPage();
// we need to enable interception feature
await page.setRequestInterception(true);
// then we can add a callback which inspects every outgoing
// request the browser makes and decides whether to allow it
page.on('request', request => {
  const requestUrl = request.url().split('?')[0];
  if (
    blockResourceType.includes(request.resourceType()) ||
    blockResourceName.some(resource => requestUrl.includes(resource))
  ) {
    request.abort();
  } else {
    request.continue();
  }
});

In this example, we're adding a request interception handler to our page that prevents the blocked resource types and domains from loading. This can speed up web scraping greatly - on media-heavy websites by up to 10 times! Not only that, but it saves our scraper a lot of bandwidth.

Avoiding Bot Detection

Even though we're using a real browser, figuring out whether we're a human or a bot isn't that difficult for the website we're scraping.
Since the headless browser executes all the javascript and runs on a single IP address, websites can use various techniques like connection analysis and javascript fingerprinting to determine whether the browser is a web scraper.

To improve our chances we can do two things:

  • use proxies
  • apply stealth patches to our browser

Using Proxies

The default way to use proxies with Puppeteer is to apply them to the Browser object:

const browser = await puppeteer.launch({
   args: [ '--proxy-server=http://12.34.56.78:8000' ]
});

However, this approach has a few pitfalls. First, it means that every time we want to switch proxies, we'd need to restart our web browser - what if we're in the middle of something? For small web scrapers a single proxy might be enough, but to scale we need something better.

Unfortunately, Puppeteer is unable to set a proxy per request or even per Page. There are some solutions like puppeteer-page-proxy and puppeteer-proxy, but these extensions hijack the headless browser's requests and make them through a NodeJS HTTP client instead, which increases the likelihood of being detected as a bot.

The best way to use multiple proxies in Puppeteer is to run your own proxy server: the web scraper connects to this one local proxy, which then picks a random proxy from a list for each request. For example, we can achieve this using the proxy-chain NodeJS package:

const puppeteer = require('puppeteer')
const ProxyChain = require('proxy-chain');

const proxies = [
  'http://user:pass@11.11.11.11:8000',
  'http://user:pass@22.22.22.22:8000',
  'http://user:pass@33.33.33.33:8000',
]

const server = new ProxyChain.Server({
  port: 8000,
  prepareRequestFunction: ({request}) => {
    let randomProxy = proxies[proxies.length * Math.random() | 0];
    return {
      upstreamProxyUrl: randomProxy,
    };
  },
});

server.listen(() => console.log('Proxy server started on 127.0.0.1:8000'));

const browser = await puppeteer.launch({
   args: [ '--proxy-server=http://127.0.0.1:8000' ]
});

🧙‍♂️ This approach works best with high-quality residential proxies

With this approach, our browser uses a single proxy-chain server which, in reality, selects a random proxy address for each request!

How to Avoid Web Scraper IP Blocking?

For more on how IP addresses are used to block web scrapers see our full introduction article


Making Puppeteer Stealthy

As we web scrape with a web browser, we give full code execution access to the website. This means that websites can use various javascript scripts to gather information about our browser. This information can identify us as a Puppeteer-controlled web browser or be used to build a javascript fingerprint.

To get around fingerprinting, we can fortify our headless browser to mask its automation features. Javascript fingerprint resistance is a huge topic, though to start there are community-maintained tools like the puppeteer-extra-plugin-stealth plugin.

const puppeteer = require('puppeteer-extra')

// add stealth plugin and use defaults (all evasion techniques)
const StealthPlugin = require('puppeteer-extra-plugin-stealth')
puppeteer.use(StealthPlugin())

// puppeteer usage as normal
puppeteer.launch({ headless: true }).then(async browser => {
  console.log('Running tests..')
  const page = await browser.newPage()
  await page.goto('https://bot.sannysoft.com')
  await page.waitForTimeout(5000)
  await page.screenshot({ path: 'testresult.png', fullPage: true })
  await browser.close()
  console.log(`All done, check the screenshot. ✨`)
})

In this example, we've installed the puppeteer-extra plugin framework and its stealth plugin (npm install puppeteer-extra puppeteer-extra-plugin-stealth) and patched our puppeteer package with stealth patches. While this is not a solve-all solution, it's a good starting point for fortifying Puppeteer browsers for web scraping.
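
The stealth plugin is also modular: in the plugin versions we've used, individual evasion modules can be toggled through its enabledEvasions set, which helps when a particular evasion conflicts with a target website:

const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

// drop a specific evasion module from the default set (assumes the plugin's enabledEvasions API)
const stealth = StealthPlugin();
stealth.enabledEvasions.delete('user-agent-override');
puppeteer.use(stealth);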

How Javascript is Used to Block Web Scrapers? In-Depth Guide

For more on fortifying web browsers for web scraping, see our complete introduction article, which covers how javascript fingerprints are generated and how to avoid them.


ScrapFly - Solve it For Us!

There's quite a bit of work involved in making a Puppeteer-based web scraper undetectable and efficient. The best way to deal with difficult challenges like these is to employ logic separation!

ScrapFly's middleware service can let you focus on web scraping while we deal with ensuring the scraper remains undetected and efficient.

scrapfly middleware
ScrapFly service does the heavy lifting for you

For example, the most popular feature is javascript rendering, which is essentially Puppeteer in the cloud. This feature allows us to use simple javascript HTTP clients to control managed, fortified, Puppeteer-like web browsers:

const axios = require('axios');

function scrapflyRequest(url, waitForSelector){
  var options = {
    'key': 'YOUR SCRAPFLY API KEY',
    'render_js': true,
    'wait_for_selector': waitForSelector,
    'url': url
  };
  return axios.get(
      'https://api.scrapfly.io/scrape',
      {params: options}
  );
}

const response = await scrapflyRequest('http://tiktok.com', 'h1');
console.log(response.data.result.content);

Often we might not even need full javascript rendering, but we might still be blocked by some anti-bot protection. For this, ScrapFly provides an Anti Scraping Protection solution which bypasses various scraping blockers like captchas and popular anti-scraping integrations:

const axios = require('axios');

function scrapflyRequest(url, waitForSelector){
  var options = {
    'key': scrapflyKey,
    'url': url,
    'render_js': true,
    'wait_for_selector': waitForSelector,
    'asp': true
    // ^^^ we can turn on ASP with this url option
  };
  return axios.get(
      'https://api.scrapfly.io/scrape',
      {params: options}
  );
}
const response = await scrapflyRequest('http://tiktok.com', 'h1');
console.log(response.data.result.content);

Finally, ScrapFly provides an automated smart proxy system that applies proxies to every request automatically! We can even choose the proxy type and country of origin:

function scrapflyRequest(url, waitForSelector){
  var options = {
    'key': scrapflyKey,
    'url': url,
    'render_js': true,
    'wait_for_selector': waitForSelector,
    'country': 'US',
    // ^^^ We can change the IP address to any country
    'proxy_pool': 'public_residential_pool',
    // ^^^ we can also use high quality residential proxies
  };
  return axios.get(
      'https://api.scrapfly.io/scrape',
      {params: options}
  );
}
const response = await scrapflyRequest('http://tiktok.com', 'h1');
console.log(response.data.result.content);

ScrapFly's feature set doesn't end here - for the full feature set, see our full documentation

FAQ

To wrap this puppeteer tutorial up, let's take a look at frequently asked questions about web scraping with javascript and puppeteer:

Why does a deployed Puppeteer scraper behave differently?

Puppeteer automates a real browser, so its behaviour naturally depends on the host machine. In other words, the headless Chrome browser controlled by Puppeteer inherits operating system packages such as fonts and codecs. So, if we develop our code on macOS and run it in production on Linux, the scraper will behave slightly differently.
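
To reduce such surprises, it also helps to pass a few server-friendly launch flags when deploying to Linux containers - a hedged sketch, as the exact flags depend on your environment:

const browser = await puppeteer.launch({
    headless: true,
    args: [
        // commonly needed when running inside Docker/CI containers
        '--no-sandbox',
        '--disable-setuid-sandbox',
        // avoids /dev/shm size issues in some container setups
        '--disable-dev-shm-usage',
    ],
});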

How can I scrape faster with Puppeteer?

Puppeteer provides a high-level API for controlling browsers, but it's not a dedicated web scraping framework. So, there are many ways to speed up web scraping.
The easiest one is to take advantage of the asynchronous nature of this library. We can launch multiple browsers and use them in a single scraper application using Promise.all or Promise.allSettled concurrency functions.
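
For example, here's a brief sketch that scrapes several pages concurrently from a single browser instance (the same pattern applies across multiple browser instances; the URLs are just placeholders):

const puppeteer = require('puppeteer');

async function scrapeTitle(browser, url){
    const page = await browser.newPage();
    await page.goto(url, {waitUntil: 'domcontentloaded'});
    const title = await page.title();
    await page.close();
    return {url, title};
}

async function run(){
    const browser = await puppeteer.launch();
    // all three pages are scraped at the same time in separate tabs
    const results = await Promise.all([
        'http://httpbin.org/html',
        'http://httpbin.org/links/5',
        'http://httpbin.org/forms/post',
    ].map(url => scrapeTitle(browser, url)));
    console.log(results);
    await browser.close();
}
run();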

How to capture background requests and responses using Puppeteer?

Often dynamic websites use background requests (XHR) to generate some data after the page loads. We can capture these requests and responses using the page.on() event handlers. For example, we can capture all XHR-type requests and either drop them or read/modify their data:

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // capture background requests:
  await page.setRequestInterception(true);
  page.on('request', request => {
    if (request.resourceType() === 'xhr') {
      console.log(request);
      // we could block these requests with request.abort() instead:
      request.continue();
    } else {
      request.continue();
    }
  });
  // capture background responses:
  page.on('response', response => {
    // responses expose their originating request, which carries the resource type
    if (response.request().resourceType() === 'xhr') {
      console.log(response);
    }
  })
  await browser.close();
})();

How do you spell Puppeteer?

This project has a notoriously difficult name and it's often misspelled in a thousand different ways: pupeteer, puppeter, puperter, puppetier etc.
The easiest way to remember is to follow this simple formula: pup + pet + eer.
Keep the tricky name in mind when looking up resources - sometimes, misspelling it on purpose can help you find the solution to your error!

Summary

In this introductory article, we looked at the Puppeteer web browser automation package for NodeJS and how we can use it for web scraping. We covered some common use case scenarios and explored challenges like how to use proxies and avoid being detected. We also wrote a small example scraper which collects creator data from tiktok.com.

Finally, we wrapped everything up by looking at ScrapFly's solution, how it compares to Puppeteer, and how it can be used in place of Puppeteer to provide even better results in NodeJS!

How to Scrape Dynamic Websites Using Headless Web Browsers

For more on different browser automation solutions, see a related article we wrote about Selenium, Puppeteer and Playwright and how they compare in the context of web scraping!


Related Posts

How to Use Chrome Extensions with Playwright, Puppeteer and Selenium

In this article, we'll explore different useful Chrome extensions for web scraping. We'll also explain how to install Chrome extensions with various headless browser libraries, such as Selenium, Playwright and Puppeteer.

How to Scrape Dynamic Websites Using Headless Web Browsers

Introduction to using web automation tools such as Puppeteer, Playwright, Selenium and ScrapFly to render dynamic websites for web scraping