Web Scraping With Puppeteer - 2024 Puppeteer Tutorial


When it comes to web scraping in JavaScript, a popular approach is using HTTP clients, such as Fetch and Axios. However, HTTP clients can't execute JavaScript, which makes them a poor fit for scraping dynamic web pages. This is where the Puppeteer headless browser automation tool comes into play!

In this guide, we'll explore web scraping with Puppeteer. We'll start with a general overview of the Puppeteer API and its utilization while scraping. Then, we'll create multiple Puppeteer web scrapers for common use cases, such as crawling, infinite scrolling, clicking buttons, and filling out forms. Let's get started!

Web Scraping With NodeJS

Sometimes, headless browsers can be overkill. Learn how to web scrape with NodeJS using HTTP requests through our comprehensive guide.


What is Puppeteer and How Does it Work?

Modern web browsers contain special access tools for automation and cross-program communication. In particular, the Chrome DevTools Protocol (CDP) is an API protocol that allows programs to control Chrome or Firefox web browser instances through socket connections. Puppeteer is a NodeJS library that sends commands over this protocol.

In other words, the Puppeteer web scraping tool lets us write automation scripts for Chrome or Firefox.

CDP translates program commands to web browser commands

As you can imagine, using Puppeteer for web scraping comes with several great advantages:

  • Puppeteer scrapers enable dynamic data extraction, as the headless browser renders JavaScript, images, etc, just like the regular web browsers.
  • Puppeteer web scraping scripts are harder to detect and block. Since the browser's connection configuration resembles that of a regular user, it's harder to identify the scraper as automated.

That being said, headless browser automation tools are resource-intensive. They also make scrapers more complex and require continuous maintenance.

Tip: Puppeteer in REPL

The easiest way to experiment with Puppeteer for scraping data is the NodeJS REPL mode. See the following video for a quick introduction:

NodeJS REPL overview

Now, let's jump into the details of our Puppeteer tutorial!

How to Install Puppeteer?

The Puppeteer NodeJS library can be installed using the NodeJS package manager (npm) with the following terminal commands:

$ mkdir myproject && cd myproject
$ npm init
$ npm install puppeteer

Note that we'll run our Puppeteer web scraping code asynchronously, using promises and async/await. If you are unfamiliar with these JavaScript concepts, we recommend the MDN quick introduction.
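As a quick refresher, here is a minimal sketch of promises with async/await. The sleep, loadPage and demo names are hypothetical stand-ins for slow browser operations and aren't part of Puppeteer:

```javascript
// sleep() is a hypothetical helper: it returns a promise that resolves after `ms` milliseconds
function sleep(ms) {
    return new Promise(resolve => setTimeout(resolve, ms));
}

// loadPage() stands in for any slow browser operation (illustrative only)
async function loadPage(url) {
    await sleep(10); // execution pauses here without blocking NodeJS
    return `loaded ${url}`;
}

async function demo() {
    // sequential: the await finishes before the next line runs
    const first = await loadPage('page-1');
    // concurrent: both promises are in flight at the same time
    const rest = await Promise.all([loadPage('page-2'), loadPage('page-3')]);
    return [first, ...rest];
}
```

Every Puppeteer call in this guide returns a promise, so it needs an await in front of it, just like loadPage does here.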

Puppeteer Basics

Let's start with a basic Puppeteer code that does the following:

  • Start a Chrome headless browser (browser without graphical user interface).
  • Launch a new page and go to a target website.
  • Wait for the page to load and retrieve the HTML.

const puppeteer = require('puppeteer')

async function run(){
    // First, we must launch a browser instance
    const browser = await puppeteer.launch({
        // The headless option disables the visible GUI, so the browser runs in the "background".
        // For development, let's keep this set to false so we can see what's going on,
        // but on a server we must set it to true
        headless: false,
        // Ignore HTTPS certificate errors (e.g. self-signed or expired certificates)
        ignoreHTTPSErrors: true,
    })
    // then we need to start a browser tab
    let page = await browser.newPage();
    // and tell it to go to some URL
    await page.goto('http://httpbin.org/html', {
        waitUntil: 'domcontentloaded',
    });
    // print html content of the website
    console.log(await page.content());
    // close everything
    await page.close();
    await browser.close();
}

run();

In the above example, we create a visible browser instance, start a new tab, go to the httpbin.org/html webpage, and print its HTML content. When web scraping with Puppeteer, we'll mostly work with Page objects. A Page represents a single browser tab, and so far we've used two of its methods:

  • goto(): Request the URL within the browser tab.
  • content(): Return the web page HTML code.

The above code snippet is the building block of most Puppeteer scrapers used to extract data. Next, let's add waiting logic!
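One caveat of this building block is that an error in goto() would leave the browser process running. Here is a minimal sketch of a wrapper that always closes the browser. The withPage name is hypothetical, and the puppeteer module is passed in as an argument to keep the helper self-contained:

```javascript
// withPage() is a hypothetical helper, not part of Puppeteer.
// `puppeteer` is passed in, e.g. withPage(require('puppeteer'), url, callback)
async function withPage(puppeteer, url, callback) {
    const browser = await puppeteer.launch();
    try {
        const page = await browser.newPage();
        await page.goto(url, { waitUntil: 'domcontentloaded' });
        // run the caller's scraping logic and pass its result through
        return await callback(page);
    } finally {
        // always runs, even when goto() or the callback throws
        await browser.close();
    }
}
```

It could be used as withPage(require('puppeteer'), 'http://httpbin.org/html', page => page.content()).then(console.log).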

Waiting For Content

In the previous code, we encounter a common Puppeteer scraping question: How do we know if the page is loaded and ready to be parsed for data?

In the previous example, we used the waitUntil argument. It directs the browser to wait for the domcontentloaded signal, which is triggered once the browser reads the HTML content of the page. However, this approach isn't suitable for all the dynamic content scraping use cases, as some parts of the page might continue loading even after the browser reads the HTML.

Illustration of how web browsers load web pages

When dealing with modern, dynamic websites that use JavaScript, it's a good practice to wait for explicit content instead of relying on rendering signals:

await page.goto('http://httpbin.org/html');
await page.waitForSelector('h1', {timeout: 5_000})

In the above basic example, we wait for the <h1> node to appear in the document body for a maximum of 5 seconds (5000 milliseconds).
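Sometimes a page can end up in one of several states, such as a results list or an error message. Here is a rough sketch of waiting for whichever selector appears first. The waitForAny name is hypothetical, and it relies on Promise.any, which needs NodeJS 15 or newer:

```javascript
// waitForAny() is a hypothetical helper: it returns the first selector that shows up
async function waitForAny(page, selectors, timeout = 5_000) {
    const waits = selectors.map(selector =>
        page.waitForSelector(selector, { timeout }).then(() => selector)
    );
    try {
        // Promise.any resolves with the first wait that succeeds
        return await Promise.any(waits);
    } catch (error) {
        // every wait failed (timed out), so none of the selectors appeared
        return null;
    }
}
```

For example, const found = await waitForAny(page, ['h1', '.error']) returns the matched selector, or null if nothing appeared within the timeout.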

Now that we can effectively wait for elements, let's jump to the next part of our Puppeteer tutorial: selecting content!

HTML Parsing

Since Puppeteer returns the data of the whole page, we can use both XPath and CSS selectors to parse the HTML. These two selector types allow us to select specific page parts and extract the displayed data or submit events like clicks and text inputs. Let's have a look at parsing HTML elements during Puppeteer scraping.

The Page object comes with several selection methods, such as .$ and .$$. These return ElementHandle objects, which we can read data from or use as click and input targets:

const puppeteer = require('puppeteer');

async function run() {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    // we can use .setContent to set page html to some test value:
    await page.setContent(`
    <div class="links">
    <a href="https://twitter.com/@scrapfly_dev">Twitter</a>
    <a href="https://www.linkedin.com/company/scrapfly/">LinkedIn</a>
    </div>
    `);
    // using .$ we can select the first matching element and get its inner text or attribute:
    console.log(await (await page.$('.links a')).evaluate( node => node.innerText));
    console.log(await (await page.$('.links a')).evaluate( node => node.getAttribute("href")));

    // using .$$ we can select multiple values:
    let links = await page.$$('.links a');
    // or using xpath selectors instead of css selectors
    // (page.$x was removed in Puppeteer v22, newer versions take an "xpath/" prefix in .$$ instead):
    // let links = await page.$x('//*[contains(@class, "links")]//a');
    for (const link of links){
        console.log(await link.evaluate( node => node.innerText));
        console.log(await link.evaluate( node => node.getAttribute("href")));
    }
    await browser.close();
}

run();

In the above Puppeteer code, we use the CSS selector with the .$ method to find one matching element and the .$$ method to find multiple ones.
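When we only need the data and not the element handles, page.$$eval can do the selection and extraction in a single browser round trip. A minimal sketch, where extractLinks is a hypothetical helper name:

```javascript
// extractLinks() is a hypothetical helper built on page.$$eval
async function extractLinks(page, selector) {
    // the callback runs inside the browser and receives all matching elements
    return page.$$eval(selector, nodes =>
        nodes.map(node => ({
            text: node.innerText,
            href: node.getAttribute('href'),
        }))
    );
}
```

Calling await extractLinks(page, '.links a') on the test HTML above would return one object per link.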

We can also trigger mouse clicks, button presses, and add text inputs in the same way:

await page.setContent(`
<div class="links">
  <a href="https://twitter.com/@scrapfly_dev">Twitter</a>
  <input></input>
</div>
`);
// enter text to the input
await (await page.$('input')).type('hello scrapfly!', {delay: 100});
// press enter button
await (await page.$('input')).press('Enter');
// click on the first link
await (await page.$('.links a')).click();

With the above methods, we have full control of the headless browser. We can use their functionalities to automate actions required to reach a specific page and then parse it to extract data!

Puppeteer Web Scraping Examples

We have explored the core concepts of using Puppeteer for web scraping through browser navigation, waiting for content, and parsing the HTML. Let's solidify these concepts through real-life scraping use cases!

Infinite Scrolling

Modern websites often use scrolling to load more data. This is a common web scraping challenge, as we have to provide the Puppeteer automation script with the scroll instructions and tell it when to stop. For example, review data on web-scraping.dev/testimonials is loaded dynamically upon scroll:

Reviews on web-scraping.dev

Let's implement a logic for infinite scrolling with Puppeteer:

const puppeteer = require("puppeteer");

async function scrollDown(page) {
    let prevHeight = -1;
    let maxScrolls = 100;
    let scrollCount = 0;

    while (scrollCount < maxScrolls) {
        // Scroll to the bottom of the page
        await page.evaluate('window.scrollTo(0, document.body.scrollHeight)');
        // Wait for page load
        await new Promise(resolve => setTimeout(resolve, 1000));
        // Calculate new scroll height and compare
        let newHeight = await page.evaluate('document.body.scrollHeight');
        if (newHeight == prevHeight) {
            break;
        }
        prevHeight = newHeight;
        scrollCount += 1;
    }
};

async function parseReviews(page) {
    let elements = await page.$$('.testimonial');
    let results = [];
    // Iterate over all the review elements
    for (let element of elements) {
        let rate = await element.$$eval('span.rating > svg', elements => elements.map(el => el.innerHTML))   
        results.push({
            "text": await element.$eval('.text', node => node.innerHTML),
            "rate" : rate.length
        });
    }
    return results;
};

async function run(){
    const browser = await puppeteer.launch({
          headless: false,
          ignoreHTTPSErrors: true,
        });
    const page = await browser.newPage();
    await page.goto('https://web-scraping.dev/testimonials/'); // Go to the target page
    await scrollDown(page); // Scroll to the end
    const reviews = await parseReviews(page); // Parse the reviews
    await browser.close();
    console.log(reviews);
};

run();

Here, we use Puppeteer to scrape the review data on the page. Let's break down the used functions:

  • scrollDown: For scrolling down the page using JavaScript. It stops once the page height no longer changes after a scroll, or after 100 scrolls.
  • parseReviews: For parsing the review data by iterating over all the review boxes on the HTML and then extracting the text and rate from each review box.
  • run: To launch a Puppeteer headless browser, request the target web page and then use the previously defined helper functions.

The above code will return all the data on the page, which can also be saved to a JSON file:

Sample output
[
    {
        "text": "We've been using this utility for years - awesome service!",
        "rate": 5
    },
    {
        "text": "This Python app simplified my workflow significantly. Highly recommended.",
        "rate": 5
    },
    {
        "text": "Had a few issues at first, but their support team is top-notch!",
        "rate": 4
    },
    {
        "text": "A fantastic tool - it has everything you need and more.",
        "rate": 5
    },
    {
        "text": "The interface could be a little more user-friendly.",
        "rate": 5
    },
    ....
]
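Saving the results to a JSON file only takes the built-in fs module. A minimal sketch, where the saveResults name and the default file name are arbitrary choices:

```javascript
const fs = require('fs');

// saveResults() is a hypothetical helper; the default file name is an arbitrary choice
function saveResults(results, path = 'reviews.json') {
    // indent with 2 spaces so the file stays readable
    fs.writeFileSync(path, JSON.stringify(results, null, 2), 'utf-8');
    return path;
}
```

In the scraper above, it could be called as saveResults(reviews) right after parsing.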

Here, the reviews were loaded by scrolling the page. However, infinite scroll pages usually fetch their data from hidden APIs, so the same data can often be extracted by requesting those API endpoints directly.
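To find such hidden API endpoints, we can record the background requests the page makes while it scrolls. Here is a rough sketch, where trackApiCalls is a hypothetical helper name:

```javascript
// trackApiCalls() is a hypothetical helper: it records background request URLs
function trackApiCalls(page) {
    const urls = [];
    page.on('response', response => {
        const type = response.request().resourceType();
        // XHR and fetch calls are the usual carriers of "hidden API" data
        if (type === 'xhr' || type === 'fetch') {
            urls.push(response.url());
        }
    });
    return urls; // the array fills up as the page loads and scrolls
}
```

The helper should be called before page.goto(), and the returned array can be printed once scrolling has finished.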

Crawling

Crawling is a common web scraping use case. In this technique, the scraper navigates to different pages, usually through pagination or href links. Let's implement crawling with Puppeteer. For this, we'll use the web-scraping.dev/products endpoint:

Product page on web-scraping.dev

For this Puppeteer scraper, we'll crawl over the pagination pages and extract the product data on each page:

const puppeteer = require("puppeteer");

async function parseProducts(page) {
    let boxes = await page.$$('div.row.product');
    let results = [];
    
    for(let box of boxes) {
        results.push({
            "title": await box.$eval('a', node => node.innerHTML),
            "link": await box.$eval('a', node => node.getAttribute('href')),
            "price": await box.$eval('div.price', node => node.innerHTML)
        })
    }
    return results;
}

async function run(){
    const browser = await puppeteer.launch({
        headless: false,
        ignoreHTTPSErrors: true,
    });
    const page = await browser.newPage();
    const data = [];
    for (let i=1; i < 6; i++) {
        await page.goto(`https://web-scraping.dev/products?page=${i}`)
        const products = await parseProducts(page)
        data.push(...products);
    }
    console.log(data);
    await browser.close();
}

run();

The above code is fairly straightforward. We only use two functions:

  • parseProducts: For parsing all the product data on each pagination page using CSS selectors.
  • run: For launching a browser and requesting the first five product pages while using the defined parseProducts function.

The script will return the product data found on five pages:

Sample output
[
    {
        "title": "Box of Chocolate Candy",
        "link": "https://web-scraping.dev/product/1",
        "price": 24.99
    },
    {
        "title": "Dark Red Energy Potion",
        "link": "https://web-scraping.dev/product/2",
        "price": 4.99
    },
    {
        "title": "Teal Energy Potion",
        "link": "https://web-scraping.dev/product/3",
        "price": 4.99
    },
    {
        "title": "Red Energy Potion",
        "link": "https://web-scraping.dev/product/4",
        "price": 4.99
    },
    {
        "title": "Blue Energy Potion",
        "link": "https://web-scraping.dev/product/5",
        "price": 4.99
    },
    ....
]

In this example, we have crawled over pagination pages. However, the same approach can be used to crawl other pages, such as the product pages themselves. For further details, refer to our dedicated guide.
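As a rough sketch of that idea, the product links collected from the pagination pages can be visited one by one. The crawlLinks name is hypothetical, and parsePage is whatever parsing function fits the target page:

```javascript
// crawlLinks() is a hypothetical helper; parsePage is supplied by the caller
async function crawlLinks(page, links, parsePage) {
    const results = [];
    const seen = new Set();
    for (const link of links) {
        if (seen.has(link)) continue; // skip duplicate URLs
        seen.add(link);
        await page.goto(link, { waitUntil: 'domcontentloaded' });
        results.push(await parsePage(page));
    }
    return results;
}
```

For instance, crawlLinks(page, data.map(product => product.link), page => page.title()) would collect the title of every product page.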

Clicking Buttons and Filling Forms

In this section, we'll explore clicking buttons and filling forms with the Puppeteer headless browser, which can be challenging to approach with regular HTTP clients. For this example, we'll log in to web-scraping.dev/login.

We'll use Puppeteer to scrape the page behind the login process by accepting the cookie policy, entering the login credentials, and then clicking the login button:

const puppeteer = require('puppeteer');

async function run(){
    const browser = await puppeteer.launch({
        headless: false,
        ignoreHTTPSErrors: true,
    });
    const page = await browser.newPage();
    await page.goto(
        'https://web-scraping.dev/login?cookies=',
        { waitUntil: 'domcontentloaded'}
    );
    // Wait for 500 milliseconds
    await new Promise(resolve => setTimeout(resolve, 500));
    // Accept the cookie policy
    await page.click('button#cookie-ok')
    // Wait for navigation
    await new Promise(resolve => setTimeout(resolve, 500));
    // fill in the login credentials
    await page.$eval('input[name="username"]', (el, value) => el.value = value, 'user123');
    await page.$eval('input[name="password"]', (el, value) => el.value = value, 'password');
    // click the login button
    await page.click('button[type="submit"]');
    await page.waitForSelector('div#secret-message');
    // Parse the secret message
    const secretMessage = await page.$eval('div#secret-message', node => node.innerHTML)
    console.log(`The secret message is ${secretMessage}`);
    await browser.close();
}

run();

Here, we use CSS selectors to pick the elements to click or fill, and we wait for fixed timeouts or specific elements to load between the actions.
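Fixed timeouts can be too short on slow connections and wasteful on fast ones. Here is a sketch of the same flow using explicit waits and page.type instead. The login helper name is hypothetical, and the sketch assumes the inputs start empty and that the cookie banner disappears as soon as it's accepted:

```javascript
// login() is a hypothetical helper; the selectors are the ones from the example above
async function login(page, username, password) {
    // wait for the cookie button instead of sleeping for a fixed time
    await page.waitForSelector('button#cookie-ok');
    await page.click('button#cookie-ok');
    // page.type() sends real key events, so input listeners on the page fire too
    // (assumption: the inputs start empty, since typing appends to existing text)
    await page.type('input[name="username"]', username);
    await page.type('input[name="password"]', password);
    await page.click('button[type="submit"]');
    // the secret message only exists after a successful login
    await page.waitForSelector('div#secret-message');
    return page.$eval('div#secret-message', node => node.innerHTML);
}
```

It would be called as const secretMessage = await login(page, 'user123', 'password') after page.goto().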


Now that we have solidified our knowledge of web scraping with Puppeteer, let's look at the common challenges faced and how we can solve them.

Common Challenges

When it comes to headless browser scraping, there are primarily two kinds of challenges: scraping speed and bot detection. Let's explore common tips and tricks we can apply to web scrapers powered by the Puppeteer node library.

Scraping Speed and Resource Optimizations

The most effective tip to speed up our Puppeteer scrapers is to disable image and video loading, as when scraping, we don't care about image rendering.

Note: the images and videos are still in the page source, so by turning off loading, we're not going to lose any data.

We can configure Puppeteer headless browsers with rules that block images and analytics traffic:

// we can block by resource type like fonts, images etc.
const blockResourceType = [
  'beacon',
  'csp_report',
  'font',
  'image',
  'imageset',
  'media',
  'object',
  'texttrack',
];
// we can also block by domains, like google-analytics etc.
const blockResourceName = [
  'adition',
  'adzerk',
  'analytics',
  'cdn.api.twitter',
  'clicksor',
  'clicktale',
  'doubleclick',
  'exelator',
  'facebook',
  'fontawesome',
  'google',
  'google-analytics',
  'googletagmanager',
  'mixpanel',
  'optimizely',
  'quantserve',
  'sharethrough',
  'tiqcdn',
  'zedo',
];

const page = await browser.newPage();
// we need to enable interception feature
await page.setRequestInterception(true);
// then we can add a call back which inspects every
// outgoing request browser makes and decides whether to allow it
page.on('request', request => {
  // strip the query string so only the host and path are matched
  const requestUrl = request.url().split('?')[0];
  if (
    blockResourceType.includes(request.resourceType()) ||
    blockResourceName.some(resource => requestUrl.includes(resource))
  ) {
    request.abort();
  } else {
    request.continue();
  }
});

In the above code, we add a request interception handler to our page that blocks requests by resource type and by URL keyword. This can notably increase our scraping speed on media-heavy websites, up to 10 times! Additionally, it can save lots of bandwidth, especially when using proxies!
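Keeping the blocking decision in a plain function makes the rules easy to check without launching a browser. A minimal sketch with shortened rule lists, where shouldBlock and blockResources are hypothetical helper names:

```javascript
// shortened rule lists for illustration
const blockedTypes = new Set(['font', 'image', 'media']);
const blockedNames = ['google-analytics', 'googletagmanager', 'doubleclick'];

// shouldBlock() is a hypothetical helper that holds the blocking decision
function shouldBlock(resourceType, url) {
    const cleanUrl = url.split('?')[0]; // ignore query parameters
    return blockedTypes.has(resourceType) ||
        blockedNames.some(name => cleanUrl.includes(name));
}

// blockResources() wires the decision into a page created by browser.newPage()
async function blockResources(page) {
    await page.setRequestInterception(true);
    page.on('request', request => {
        if (shouldBlock(request.resourceType(), request.url())) {
            request.abort();
        } else {
            request.continue();
        }
    });
}
```

With this split, await blockResources(page) is all the scraper needs to call before page.goto().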

Avoiding Puppeteer Scraping Blocking

Although we scrape using a real browser, websites can still detect us. Since headless browsers support executing JavaScript and use a single IP address, websites can use various techniques like connection analysis and JavaScript fingerprinting to detect any signs of automation.

To increase the chances of avoiding Puppeteer scraping blocking, we can use two tricks:

  • Using proxies
  • Applying stealth patches to Puppeteer

Using Proxies

The default method for using Puppeteer proxies is adding them to the Browser object:

const browser = await puppeteer.launch({
   args: [ '--proxy-server=http://12.34.56.78:8000' ]
});

However, this approach has a downside: the proxy is fixed at launch, so switching proxies requires restarting the browser, which disrupts the automation process.

Unfortunately, adding proxies to Puppeteer on the request level is unavailable. There are community solutions for getting around this, such as puppeteer-page-proxy and puppeteer-proxy. However, these Puppeteer extensions only redirect the requests to NodeJS HTTP clients, increasing the chances of getting detected.

A better way to use multiple proxies in Puppeteer is by creating a proxy server. This way, the browser connects to a single local address while the proxy server forwards each request through a different upstream proxy. Here is a quick example of creating a proxy server for Puppeteer using the proxy-chain NodeJS package:

const puppeteer = require('puppeteer')
const ProxyChain = require('proxy-chain');

const proxies = [
  'http://user:pass@11.11.11.11:8000',
  'http://user:pass@22.22.22.22:8000',
  'http://user:pass@33.33.33.33:8000',
]

const server = new ProxyChain.Server({
  port: 8000,
  prepareRequestFunction: ({request}) => {
    let randomProxy = proxies[proxies.length * Math.random() | 0];
    return {
      upstreamProxyUrl: randomProxy,
    };
  },
});

server.listen(() => console.log('Proxy server started on 127.0.0.1:8000'));

const browser = await puppeteer.launch({
   args: [ '--proxy-server=http://127.0.0.1:8000' ]
});

🧙‍ This approach works best with high-quality residential proxies

In the above example, we select a random proxy IP address each time a request is sent. Additionally, we can power it with a proxy rotation system to select proxies smartly!
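As an illustration of what such a rotation system could look like, here is a small sketch that cycles through the proxies and skips the ones that keep failing. The ProxyRotator class and its failure threshold are hypothetical and aren't part of proxy-chain or Puppeteer:

```javascript
// ProxyRotator is a hypothetical class, not part of proxy-chain or Puppeteer
class ProxyRotator {
    constructor(proxies) {
        this.proxies = proxies;
        this.failures = new Map(proxies.map(proxy => [proxy, 0]));
        this.index = 0;
    }

    // round-robin over the proxies, skipping ones that failed too often
    next(maxFailures = 3) {
        for (let i = 0; i < this.proxies.length; i++) {
            const proxy = this.proxies[this.index];
            this.index = (this.index + 1) % this.proxies.length;
            if (this.failures.get(proxy) < maxFailures) return proxy;
        }
        throw new Error('no healthy proxies left');
    }

    // call this when a request through `proxy` gets blocked or times out
    reportFailure(proxy) {
        this.failures.set(proxy, this.failures.get(proxy) + 1);
    }
}
```

Inside prepareRequestFunction, the random pick could then be swapped for upstreamProxyUrl: rotator.next(), with rotator created once from the proxies list.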

How to Avoid Web Scraper IP Blocking?

For more on how IP addresses are used to block web scrapers, see our full introduction article.


Making Puppeteer Stealthy

Since we scrape with a web browser, the websites can use various JavaScript scripts to gather information about the client. This can leak the fact that the requests are automated!

To avoid this JavaScript fingerprinting, we can modify the Puppeteer configuration and mock its features. Fingerprint resistance is quite a complicated topic. However, there are community-maintained web scraping tools that manage it for us! One of these tools is the stealth plugin for puppeteer-extra (puppeteer-extra-plugin-stealth):

const puppeteer = require('puppeteer-extra')

// add stealth plugin and use defaults (all evasion techniques)
const StealthPlugin = require('puppeteer-extra-plugin-stealth')
puppeteer.use(StealthPlugin())

// puppeteer usage as normal
puppeteer.launch({ headless: true }).then(async browser => {
  console.log('Running tests..')
  const page = await browser.newPage()
  await page.goto('https://bot.sannysoft.com')
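  // page.waitForTimeout was removed in newer Puppeteer versions, so we sleep with a plain promise
  await new Promise(resolve => setTimeout(resolve, 5000))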
  await page.screenshot({ path: 'testresult.png', fullPage: true })
  await browser.close()
  console.log(`All done, check the screenshot. ✨`)
})

The above example assumes both packages are installed with npm install puppeteer-extra puppeteer-extra-plugin-stealth. The stealth plugin fortifies the Puppeteer browser and hides its traces. For further details on JavaScript fingerprinting, refer to our dedicated guide.
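One of the simplest signals websites check is the navigator.webdriver flag, which automated browsers report as true unless it has been patched. Here is a minimal sketch for inspecting it, where checkWebdriverFlag is a hypothetical helper name:

```javascript
// checkWebdriverFlag() is a hypothetical helper
async function checkWebdriverFlag(page) {
    // the arrow function runs inside the browser, where `navigator` is defined
    const flag = await page.evaluate(() => navigator.webdriver);
    // true means the browser openly reports itself as automated
    return flag === true;
}
```

Running it on a page from plain Puppeteer and on one from puppeteer-extra with the stealth plugin is a quick way to compare the two setups. This is only one of many fingerprinting signals, so a false result doesn't mean the scraper is undetectable.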

How Javascript is Used to Block Web Scrapers? In-Depth Guide

For more on fortifying web browsers for web scraping, see our complete introduction article, which covers how a JS fingerprint is generated and how to avoid it.


ScrapFly - A Better Alternative!

There's quite a bit of work involved in making Puppeteer web scrapers undetectable and efficient. The best way to deal with the difficult challenges is to defer it!

ScrapFly is a web scraping API that allows for scraping at scale by providing:

ScrapFly service does the heavy lifting for you

Here is how to use ScrapFly to scrape dynamic pages without getting blocked. All we have to do is enable the asp and render_js parameters:

const axios = require('axios');

function scrapflyRequest(url, waitForSelector){
  var options = {
    'key': 'Your ScrapFly API key',
    'asp': true, // Bypass scraping blocking
    'render_js': true, // Enable JS rendering
    'country': 'US', // Proxy country location
    'wait_for_selector': waitForSelector, // waiting for selector to load
    'url': url  // The url to scrape
  };
  return axios.get(
    'https://api.scrapfly.io/scrape',
    {params: options}
  );
}

const response = await scrapflyRequest('https://web-scraping.dev/products', 'div.container');
console.log(response.data.result.content); // Get the HTML

ScrapFly's feature set doesn't end here - for the full feature set, see our full documentation.

FAQ

To wrap this Puppeteer tutorial up, let's take a look at frequently asked questions about web scraping with JavaScript and Puppeteer:

Why does a deployed Puppeteer scraper behave differently?

Puppeteer automates a real browser, so its behavior depends on the host machine. In other words, the headless Chrome browser controlled by Puppeteer inherits operating system packages. So, if we develop our code on macOS and run it in production on Linux, the scraper may behave slightly differently.

How can I scrape faster with Puppeteer?

Puppeteer provides a high-level API for controlling browsers, but it's not a dedicated web scraping framework, so speed optimizations are left to us.
The easiest one is to take advantage of the asynchronous nature of this library. We can launch multiple browsers and use them in a single scraper application using the Promise.all or Promise.allSettled concurrency functions.
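Launching everything at once with Promise.all can exhaust memory, so it helps to cap how many tasks run at the same time. Here is a small sketch of such a limiter, where mapConcurrent is a hypothetical helper name:

```javascript
// mapConcurrent() is a hypothetical helper: it runs `worker` over `items`
// with at most `limit` tasks in flight at once
async function mapConcurrent(items, limit, worker) {
    const results = new Array(items.length);
    let next = 0;
    async function runner() {
        while (next < items.length) {
            const index = next++; // claim the next unprocessed item
            results[index] = await worker(items[index]);
        }
    }
    // start up to `limit` runners that pull from the shared queue
    const runners = Array.from({ length: Math.min(limit, items.length) }, runner);
    await Promise.all(runners);
    return results;
}
```

A worker would typically open a page with browser.newPage(), scrape one URL and close the page again, for example mapConcurrent(urls, 3, scrapeUrl), where scrapeUrl is our own scraping function.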

How to capture background requests and responses using Puppeteer?

Dynamic websites often use background requests (XHR) to load data after the page loads. We can capture these requests and responses using the page.on event listener. For example, we can capture all XHR-type requests and either drop them or read their data:

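const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // capture background requests:
  await page.setRequestInterception(true);
  page.on('request', request => {
    // background requests are reported as either "xhr" or "fetch"
    if (['xhr', 'fetch'].includes(request.resourceType())) {
      console.log(request.url());
      // we can block these requests with request.abort() instead
    }
    // with interception enabled, every request must be continued or aborted
    request.continue();
  });
  // capture background responses:
  page.on('response', response => {
    if (['xhr', 'fetch'].includes(response.request().resourceType())) {
      console.log(response.url(), response.status());
    }
  });
  // load a page so the listeners have traffic to capture
  await page.goto('https://web-scraping.dev/testimonials');
  await browser.close();
})();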

Summary

In this tutorial, we explained web scraping with Puppeteer. We started by defining and installing it. Then, we explained how to use it for common scraping cases: scrolling, crawling, clicking buttons, and filling out forms.

We also covered the common Puppeteer challenges and how to solve them. You now know how to:

  • Make it faster and save bandwidth.
  • Add proxies and rotate them.
  • Avoid scraping blocking.

How to Scrape Dynamic Websites Using Headless Web Browsers

For more on different browser automation solutions, see a related article we wrote about Selenium, Puppeteer and Playwright and how they compare in the context of web scraping!
