Web Scraping With a Headless Browser: Puppeteer
Introduction to using Puppeteer in Nodejs for web scraping dynamic web pages and web apps. Tips and tricks, best practices and example project.
Puppeteer stealth is a popular extension for the Puppeteer browser automation framework. This plugin patches Puppeteer runtime to be less likely to be detected by anti-scraping detection techniques.
Using puppeteer-stealth scrapers have better chance at bypassing Cloudflare, Datadome and other popular anti scraping services.
puppeteer-stealth can be installed using NPM:
$ npm install puppeteer puppeteer-extra puppeteer-extra-plugin-stealth
# or
$ yarn add puppeteer puppeteer-extra puppeteer-extra-plugin-stealth
Then the StealthPlugin
object needs to be attached to enable the extension:
// Note: import puppeteer-extra rather than puppeteer
const puppeteer = require('puppeteer-extra')
// add stealth plugin and use defaults (all evasion techniques)
const StealthPlugin = require('puppeteer-extra-plugin-stealth')
puppeteer.use(StealthPlugin())
// test run - check scrapfly.io browser fingerprint page
puppeteer.launch({ headless: true }).then(async browser => {
console.log('Running tests..')
const page = await browser.newPage()
await page.goto('https://scrapfly.io/web-scraping-tools/browser-fingerprint')
await page.waitForTimeout(5000)
await page.screenshot({ path: 'testresult.png', fullPage: true })
await browser.close()
console.log(`All done, check the screenshot. ✨`)
})
Note that puppeteer-stealth
features many patches for different detection techniques that can be customized and extended.
Alternatively, Scrapfly API automatically bypasses anti scraping protections using anti scraping protection bypass feature