Web Scraping With a Headless Browser: Puppeteer
Introduction to using Puppeteer in Node.js for web scraping dynamic web pages and web apps: tips and tricks, best practices, and an example project.
When web scraping, we might want to collect page screenshots or peek into what our headless browsers are seeing for debugging. In Puppeteer, a screenshot can be taken using the screenshot() method of page or element objects:
const puppeteer = require('puppeteer');

async function run() {
  // usual browser startup:
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("http://httpbin.dev/html");
  // wait for the selector to appear on the page
  await page.waitForSelector("p");
  // capture the whole page:
  await page.screenshot({
    "type": "png", // can also be "jpeg" or "webp" (recommended)
    "path": "screenshot.png", // where to save it
    "fullPage": true, // will scroll down to capture everything if true
  });
  // alternatively we can capture just a specific element:
  const element = await page.$("p");
  await element.screenshot({"path": "just-the-paragraph.png", "type": "png"});
  // close() is asynchronous, so wait for the browser to shut down
  await browser.close();
}

run();
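For debugging, it can be handy to keep the screenshot in memory instead of writing it to a file, for example to embed it in a log or an HTML report. The sketch below assumes Puppeteer's `encoding: "base64"` screenshot option; `screenshotAsDataUrl` is a hypothetical helper name used for illustration, not part of Puppeteer:

```javascript
// Hypothetical helper (not part of Puppeteer): returns a screenshot of
// `target` (a page or an element handle) as a data URL string.
async function screenshotAsDataUrl(target, type = "png") {
  // with encoding "base64", screenshot() resolves to a base64 string
  // instead of binary image data; no "path" is given, so nothing is saved
  const base64 = await target.screenshot({ type, encoding: "base64" });
  return `data:image/${type};base64,${base64}`;
}
```

Inside `run()` this could be called as `const url = await screenshotAsDataUrl(page);` and the resulting string pasted into a browser address bar to view the image.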
⚠ Note that when scraping dynamic web pages, a screenshot can be captured before the page has fully loaded. For more, see "How to wait for a page to load in Puppeteer?"
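One way to reduce the risk of capturing a half-loaded page is to wait for an element that signals the content is ready before taking the screenshot. This is a minimal sketch, assuming such an element can be identified with a CSS selector; `screenshotWhenReady` and its 10-second default timeout are illustrative choices, not part of Puppeteer:

```javascript
// Hypothetical helper (not part of Puppeteer): waits for `selector`
// to appear on the page, then takes the screenshot.
async function screenshotWhenReady(page, selector, options = {}, timeoutMs = 10000) {
  // waitForSelector rejects if the element does not appear within the timeout
  await page.waitForSelector(selector, { timeout: timeoutMs });
  return page.screenshot(options);
}
```

For example: `await screenshotWhenReady(page, "p", { path: "screenshot.png", fullPage: true });`. Waiting for a selector only confirms that one element exists, so pages that keep loading images or data afterwards may need a stricter wait condition.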