Web Scraping With a Headless Browser: Puppeteer
An introduction to using Puppeteer in Node.js for scraping dynamic web pages and web apps, with tips and tricks, best practices, and an example project.
When scraping with Puppeteer, we might encounter modal popups: JavaScript-driven overlays that hide the page content on load and show some sort of message.
The most common example of a modal popup is the cookie consent popup, and there are two common ways to handle it in Puppeteer: clicking the button that dismisses the popup, or removing the popup's elements from the DOM.
For example, let's take a look at the web-scraping.dev/login page, which shows a cookie popup on page load:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto('https://web-scraping.dev/login');

  // Option #1 - use page.click() to click the button that dismisses the popup
  try {
    // wait up to 2 seconds for the button; waitForSelector throws on timeout
    await page.waitForSelector('#cookie-ok', { timeout: 2000 });
    await page.click('#cookie-ok');
  } catch (error) {
    console.log('no cookie popup');
  }

  // Option #2 - delete the popup HTML
  // remove the popup itself
  const cookieModal = await page.$('#cookieModal');
  if (cookieModal) {
    await page.evaluate((el) => el.remove(), cookieModal);
  }
  // remove the grey backdrop which covers the screen
  const modalBackdrop = await page.$('.modal-backdrop');
  if (modalBackdrop) {
    await page.evaluate((el) => el.remove(), modalBackdrop);
  }

  await browser.close();
})();
Above, we explore two ways to handle modal popups: clicking a button that dismisses them and removing them from the DOM outright.
Generally, the first approach is more reliable, as a real button click can have functionality attached to it, like setting a cookie so the popup doesn't appear again.
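One way to check whether a click really has such a side effect is to compare the page's cookies before and after it. The helper below is a minimal sketch, not part of Puppeteer: `cookiesSetBy` is a hypothetical name, and it assumes `page.cookies()` is available in your Puppeteer version (newer releases mark it as deprecated in favor of browser-level cookie methods). Whether a given site actually stores consent in a cookie is also an assumption; some use local storage instead, or set the cookie a moment after the click.

```javascript
// Hypothetical helper (not a Puppeteer API): run `action` and return the names
// of cookies that exist afterwards but did not exist before.
// Assumes `page.cookies()` resolves to an array of objects with a `name` field.
async function cookiesSetBy(page, action) {
  const before = new Set((await page.cookies()).map((cookie) => cookie.name));
  await action();
  const after = await page.cookies();
  return after
    .filter((cookie) => !before.has(cookie.name))
    .map((cookie) => cookie.name);
}

// Usage, inside an async function with a Puppeteer `page`:
// const added = await cookiesSetBy(page, () => page.click('#cookie-ok'));
// console.log('cookies set by the click:', added);
```

An empty result does not prove the click did nothing, only that no new cookie name appeared by the time the check ran.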
When the popup is a login requirement or an advertisement, the second approach is better suited, since there is usually no button that simply dismisses it.
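The two approaches can also be combined: try the click first and fall back to removal only if the button never shows up. The following is a rough sketch under stated assumptions rather than a definitive implementation. `dismissPopup` and its option names are hypothetical; only `page.waitForSelector()`, `page.click()`, `page.$()` and `page.evaluate()` are Puppeteer calls, used the same way as in the example above.

```javascript
// Hypothetical helper (not a Puppeteer API): prefer a real click, fall back to
// deleting the popup elements. Resolves to 'clicked', 'removed' or 'none'.
async function dismissPopup(page, { buttonSelector, removeSelectors = [], timeout = 2000 }) {
  try {
    // waitForSelector rejects if the button is not visible within `timeout` ms
    await page.waitForSelector(buttonSelector, { visible: true, timeout });
    await page.click(buttonSelector);
    return 'clicked';
  } catch (error) {
    // Any failure here (timeout or a failed click) falls through to removal
  }

  let removed = 0;
  for (const selector of removeSelectors) {
    const element = await page.$(selector);
    if (element) {
      await page.evaluate((el) => el.remove(), element);
      removed += 1;
    }
  }
  return removed > 0 ? 'removed' : 'none';
}

// Usage, inside an async function with a Puppeteer `page`:
// const outcome = await dismissPopup(page, {
//   buttonSelector: '#cookie-ok',
//   removeSelectors: ['#cookieModal', '.modal-backdrop'],
// });
```

Returning the outcome lets the calling scraper log which path was taken, which helps when a site changes its popup markup and the click selector silently stops matching.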