Scraping Dynamic Websites Using Browser
Introduction to using web automation tools such as Puppeteer, Playwright, Selenium and ScrapFly to render dynamic websites for web scraping
When it comes to web scraping there are generally two approaches to collecting data: making HTTP requests and using browser automation tools to control web browsers.
While the HTTP approach is a very efficient form of data collection, we're often required to invest extra development time to reverse engineer and understand our target. Browser automation, while consuming more resources and being slower, spares us a lot of that effort and is a more accessible form of automation!
In this article we'll take a look at Puppeteer, a brilliant browser automation library for JavaScript (NodeJS), and how we can use it to scrape dynamic websites.
We'll start off with a general overview and then dig deep into the parts that are used for web scraping. We'll also cover most common idioms, tips and tricks and wrap everything up with an example https://www.tiktok.com/ web scraper!
Sometimes Puppeteer might be more than we need for scraping - check out our introduction article to using just NodeJS for web-scraping.
So what is Puppeteer and how does it work?
These days modern web browsers contain special access tools designed for automation and cross-program communication. In particular, the Chrome DevTools Protocol (aka CDP) is a high-level API protocol that allows programs to control Chrome or Firefox browser instances through websocket connections. In other words, we can write a program that connects to a Chrome/Firefox browser instance and tells it to do something.
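To make this more concrete, here is a minimal sketch of what talking to a browser over CDP looks like with Puppeteer (which we cover in detail below). It assumes a Chrome instance was started with remote debugging enabled, and the websocket endpoint shown is just a placeholder:
// hypothetical sketch: attach to an already running Chrome started with
// chrome --remote-debugging-port=9222
const puppeteer = require('puppeteer');
async function connectOverCDP(){
    // Chrome prints its devtools websocket endpoint on startup - this one is a placeholder
    const browser = await puppeteer.connect({
        browserWSEndpoint: 'ws://127.0.0.1:9222/devtools/browser/<id>',
    });
    console.log(await browser.version());
    await browser.disconnect();
}
connectOverCDP();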
As you can imagine, this is a brilliant tool for web scraping! Automating a web browser gives our web scraper several advantages: the browser renders JavaScript for us so there's less need to reverse engineer the target website, and since our traffic comes from a real browser it's much harder to distinguish from that of a real user.
That being said, there are some negatives. Browsers are really complex software projects meaning they consume a lot of resources. In turn, more complexity can also mean higher maintenance overhead for our web scrapers.
The Puppeteer NodeJS library can be installed through the NodeJS package manager npm with these terminal commands:
$ mkdir myproject && cd myproject
$ npm install puppeteer
The first thing we should note is that Puppeteer is an asynchronous node library. This means we'll be working in the context of Promises and async/await programming. If you're unfamiliar with async/await syntax in JavaScript, we recommend this quick introduction article by MDN.
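For example, every Puppeteer call that talks to the browser returns a Promise which we have to await inside an async function. Here's a tiny illustration of the pattern itself (not Puppeteer specific):
// minimal async/await illustration: await pauses the function
// until the underlying Promise resolves
async function example(){
    const value = await Promise.resolve('hello');
    console.log(value); // prints "hello"
}
example();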
Now with our package ready let's start with the most basic example. We'll start a headless Chrome web browser (headless mode meaning a special version of the browser that has no GUI elements), tell it to go to a website, wait for it to load and retrieve the HTML page source:
// import puppeteer library
const puppeteer = require('puppeteer')
async function run(){
// First, we must launch a browser instance
const browser = await puppeteer.launch({
// Headless option allows us to disable the visible GUI, so the browser runs in the "background"
// for development let's keep this false so we can see what's going on, but
// on a server we should set this to true
headless: false,
// This setting allows us to scrape non-https websites easier
ignoreHTTPSErrors: true,
})
// then we need to start a browser tab
let page = await browser.newPage();
// and tell it to go to some URL
await page.goto('http://httpbin.org/html', {
waitUntil: 'domcontentloaded',
});
// print html content of the website
console.log(await page.content());
// close everything
await page.close();
await browser.close();
}
run();
In this basic example, we create a visible browser instance, start a new tab, go to the http://httpbin.org/html webpage and print its contents. When scraping with Puppeteer we'll mostly be working with Page objects, which are essentially web browser tabs. In this example, we're using two of their methods: goto(), which tells the tab where to navigate, and content(), which returns the webpage's source code.
With this basic knowledge, we can start exploring common Puppeteer usage patterns. Let's start with basic parsing.
In this basic script, we encounter our first problem: How do we know when the page is loaded and ready to be parsed for data?
In this example, we used the waitUntil argument to tell the browser to wait for the domcontentloaded signal, which fires when the browser has read the HTML content of the page. However, this might not work for every page, as dynamic pages can continue loading content even after the HTML has been read by the browser.
Simplified illustration of how web browsers load new pages
When dealing with modern, dynamic websites that use JavaScript it's a good practice to wait for content explicitly instead of relying on the load and domcontentloaded signals:
await page.goto('http://httpbin.org/html');
await page.waitForSelector('h1', {timeout: 5_000})
Here, we're telling Puppeteer to wait for the <h1> node to appear in the document body for a maximum of 5 seconds (5000 milliseconds). Since we're scraping HTML content it's much safer to rely on HTML structure rather than browser events, so using waitForSelector() is the best way to ensure our content has loaded!
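Note that waitForSelector() throws a timeout error if the element never shows up, so in real scrapers it's worth catching that case explicitly. A minimal sketch:
// waitForSelector throws if the element doesn't appear in time,
// so we can catch the error and decide whether to retry or skip the page
try {
    await page.waitForSelector('h1', {timeout: 5_000});
} catch (error) {
    console.log('expected content did not appear:', error.message);
}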
Since Puppeteer runs a full browser we have access to both CSS and XPath selectors, which allow us to select specific parts of the page and either extract the displayed data or trigger events like clicks and text inputs. Let's take a look at how this is used in web scraping.
For more on XPATH selectors see our complete introduction tutorial which covers basic usage, tips and tricks and common web scraping idioms
The Page object comes with several methods that allow us to find ElementHandle objects which we can extract or use as a click/input target:
// we can use .setContent to set page html to some test value:
await page.setContent(`
<div class="links">
<a href="https://twitter.com/@scrapfly_dev">Twitter</a>
<a href="https://www.linkedin.com/company/scrapfly/">LinkedIn</a>
</div>
`);
// using .$ we can select the first matching element and get its inner text or an attribute:
await (await page.$('.links a')).evaluate( node => node.innerText);
await (await page.$('.links a')).evaluate( node => node.getAttribute("href"));
// using .$$ we can select multiple values:
let links = await page.$$('.links a');
// or using xpath selectors instead of css selectors:
// let links = await page.$x('//*[contains(@class, "links")]//a');
for (const link of links){
console.log(await link.evaluate( node => node.innerText));
console.log(await link.evaluate( node => node.getAttribute("href")));
}
As you can see, the Page object gives us access to both CSS and XPath selectors. We can extract the first matching element using the .$ method or all of the matching elements using the .$$ method.
For more on CSS selectors see our complete introduction tutorial which covers basic usage, tips and tricks and common web scraping idioms
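As a convenient shorthand, Puppeteer also provides the $eval() and $$eval() methods, which combine selection and evaluation into a single call - for example, to grab the same links as above:
// $eval runs the callback against the first matching element
let firstHref = await page.$eval('.links a', node => node.getAttribute('href'));
// $$eval runs the callback against an array of all matching elements
let allHrefs = await page.$$eval('.links a', nodes => nodes.map(node => node.getAttribute('href')));
console.log(firstHref, allHrefs);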
We can also trigger mouse clicks, button presses and text inputs the same way:
await page.setContent(`
<div class="links">
<a href="https://twitter.com/@scrapfly_dev">Twitter</a>
<input></input>
</div>
`);
// enter text into the input
await (await page.$('input')).type('hello scrapfly!', {delay: 100});
// press the Enter key
await (await page.$('input')).press('Enter');
// click on the first link
await (await page.$('.links a')).click();
Puppeteer gives us access to navigation and parsing functionalities of the browser - we can use these functions to crawl our target and even parse the page contents!
Now that we know how to navigate our browser, wait for content to load and parse the HTML DOM we should solidify this knowledge with a real-life example.
In this example project, we'll be scraping https://www.tiktok.com/ for public user details and their video metadata. Our scraper's goal is to find the top creators for a given search query, collect each creator's profile details and scrape the metadata of their most recent videos.
We'll use functional programming in this example, so let's start from the bottom up:
// scrapes video details
async function scrapeVideo(browser, videoUrl){
let page = await browser.newPage();
await page.goto(videoUrl, { waitUntil: 'domcontentloaded'});
// wait for the page to load
await page.waitForSelector('strong[data-e2e=like-count]')
let likes = await(await page.$('strong[data-e2e=like-count]')).evaluate(node => node.innerText);
let comments = await(await page.$('strong[data-e2e=comment-count]')).evaluate(node => node.innerText);
let desc = await(await page.$('div[data-e2e=video-desc]')).evaluate(node => node.innerText);
let music = await(await page.$('h4[data-e2e=video-music] a')).evaluate(node => node.getAttribute('href'));
await page.close();
return {likes, comments, desc, music}
}
Here, we have our first function. It takes in a Browser object and a URL to a TikTok video. We'll design our web scraper to use a single Browser object and pass it around to functions, which keeps things in the realm of functional programming. This will also allow us to scale up later by using multiple browsers to scrape multiple pages at once.
In this case, we're starting a new tab, navigating to the video URL and waiting for the like count to appear. Once it's there, we parse the details, close the tab and return the results.
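To try this function on its own we only need a running browser instance. A quick usage sketch (the video URL below is just a made-up placeholder):
// usage sketch: launch a browser and scrape a single video
// note: the video URL is a placeholder, not a real video
async function testScrapeVideo(){
    const browser = await puppeteer.launch({headless: false, ignoreHTTPSErrors: true});
    const details = await scrapeVideo(browser, 'https://www.tiktok.com/@someuser/video/1234567890');
    console.log(details);
    await browser.close();
}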
Now, let's do the same thing for the creator's page:
// scrapes user and their top 5 video details
async function scrapeCreator(browser, username){
let page = await browser.newPage();
await page.goto('http://tiktok.com/' + username);
await page.waitForSelector('div[data-e2e="user-post-item"] a');
// parse user data
let followers = await(await page.$('strong[data-e2e=followers-count]')).evaluate(node => node.innerText);
let likes = await(await page.$('strong[data-e2e=likes-count]')).evaluate(node => node.innerText);
// parse user's video data
let videoLinks = [];
let links = await page.$$('div[data-e2e="user-post-item"] a');
for (const link of links){
videoLinks.push(await link.evaluate( node => node.getAttribute('href')));
};
let videoData = await Promise.all(videoLinks.slice(0, 5).map(
url => scrapeVideo(browser, url)
))
await page.close()
return {username, likes, followers, videoData}
}
Here, we're integrating our scrapeVideo() function to pick up not only the creator's details but the details of their 5 most recent videos too. We're using Promise.all to concurrently execute 5 promises, so in the browser you'd see 5 tabs open up and scrape 5 video details at the same time!
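One thing to keep in mind is that Promise.all opens all of the tabs at once. If we ever scrape more than a handful of videos per creator, it's worth processing them in batches instead - a rough sketch of how that could look:
// rough sketch: scrape video URLs in chunks of `size` so we never
// open too many browser tabs at the same time
async function scrapeVideosInBatches(browser, urls, size = 5){
    let results = [];
    for (let i = 0; i < urls.length; i += size){
        const batch = urls.slice(i, i + size);
        results = results.concat(await Promise.all(batch.map(url => scrapeVideo(browser, url))));
    }
    return results;
}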
We're left to implement our discovery function. We want to go to tiktok.com, enter some text in the search bar and extract the first few results:
// finds users of top videos of a given query
async function findTopVideoCreators(browser, query){
let page = await browser.newPage();
// search for cat videos:
await page.goto('http://tiktok.com/', { waitUntil: 'domcontentloaded'});
let searchBox = await page.$('input[type=search]')
await searchBox.type(query, {delay:111});
await page.waitForTimeout(500); // we need to wait a bit before pressing enter
await searchBox.press('Enter');
// wait for search results to load
await page.waitForSelector('a[data-e2e="search-card-user-link"]');
// find all user links
let userLinks = [];
let links = await page.$$('a[data-e2e="search-card-user-link"]');
for (const link of links){
userLinks.push(await link.evaluate( node => node.getAttribute('href')));
};
await page.close();
return userLinks;
}
In the example above, we go to the website's homepage, send some text to the search input box with a delay to appear more human, press the Enter key and wait for the results to load. Once everything loads, we pick up the usernames displayed on the first page.
Finally, we should wrap everything with a runner function that joins these individual pieces:
async function run(query){
const browser = await puppeteer.launch({
headless: false,
ignoreHTTPSErrors: true,
args: [`--window-size=1920,1080`],
defaultViewport: {
width:1920,
height:1080
}
});
let creatorNames = await findTopVideoCreators(browser, query);
let creators = await Promise.all(creatorNames.slice(0, 3).map(
username => scrapeCreator(browser, username)
))
console.log(creators);
await browser.close();
}
// run scraper with cats!
run("#cats");
Here, we create our main function which takes in a query text and scrapes the top 3 creators for that query along with their video details. We should see results that look something like this:
{
username: '/@cutecatcats',
likes: '10.2M',
followers: '571.7K',
videoData: [
{
likes: '7942',
comments: '105',
desc: 'Standing like a human🤣🤣🤣#cutecatcats #catoftiktok #fyp #고양이 #catlover #cat #catbaby',
music: '/music/original-sound-7055891421471001390'
},
...
]
}
We're left with data post-processing tasks (converting string numbers like "10.2M" into real numbers, etc.). We also skipped a lot of error handling to keep this section brief, but it's always a good idea to implement at least basic retry logic, as web browsers can misbehave and break!
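To illustrate, a helper that converts TikTok's shorthand counts like "10.2M" or "571.7K" into plain numbers and a simple retry wrapper could look something like this (a sketch, not production-ready code):
// convert shorthand counts like "10.2M" or "571.7K" into plain numbers
function parseCount(text){
    const multipliers = {'K': 1_000, 'M': 1_000_000, 'B': 1_000_000_000};
    const suffix = text.slice(-1).toUpperCase();
    if (suffix in multipliers){
        return Math.round(parseFloat(text) * multipliers[suffix]);
    }
    return parseInt(text.replace(/,/g, ''), 10);
}
// retry an async scraping function a few times before giving up
async function withRetries(scrapeFn, retries = 3){
    for (let attempt = 1; attempt <= retries; attempt++){
        try {
            return await scrapeFn();
        } catch (error) {
            console.log(`attempt ${attempt} failed: ${error.message}`);
            if (attempt === retries) throw error;
        }
    }
}
// e.g. withRetries(() => scrapeCreator(browser, '/@cutecatcats'))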
Now that we solidified our knowledge with the web scraper example let's take a look at where we can move from here. What are common challenges Puppeteer web scrapers face, and how can we solve them?
When it comes to headless browser automation scraping there are primarily two kinds of challenges: Scraping speed and Bot Detection.
Let's take a look at common tips and tricks we can apply in our Puppeteer-powered web scraper to solve these two issues.
The most effective thing we can do is disable the loading of embedded images/videos. When web scraping we don't really care whether images are loaded into the webpage - we just want to collect the data.
We can configure Puppeteer headless browsers with rules that will block images and analytic traffic:
// we can block by resource type like fonts, images etc.
const blockResourceType = [
'beacon',
'csp_report',
'font',
'image',
'imageset',
'media',
'object',
'texttrack',
];
// we can also block by domains, like google-analytics etc.
const blockResourceName = [
'adition',
'adzerk',
'analytics',
'cdn.api.twitter',
'clicksor',
'clicktale',
'doubleclick',
'exelator',
'facebook',
'fontawesome',
'google',
'google-analytics',
'googletagmanager',
'mixpanel',
'optimizely',
'quantserve',
'sharethrough',
'tiqcdn',
'zedo',
];
const page = await browser.newPage();
// we need to enable interception feature
await page.setRequestInterception(true);
// then we can add a call back which inspects every
// outgoing request browser makes and decides whether to allow it
page.on('request', request => {
const requestUrl = request.url().split('?')[0];
if (
blockResourceType.includes(request.resourceType()) ||
blockResourceName.some(resource => requestUrl.includes(resource))
) {
request.abort();
} else {
request.continue();
}
});
In this example, we're adding a request interception callback to our page that prevents the browser from loading the blocked resource types and domains. This speeds up web scraping greatly - on media-heavy websites by up to 10 times! Not only that, but it also saves our scraper a lot of bandwidth.
Even though we're using a real browser, figuring out whether we're a human or a bot isn't that difficult for the website we're scraping. Since our headless browser executes all of the JavaScript and runs on a single IP address, websites can use various techniques like connection analysis and browser fingerprinting to determine whether we're a bot. To improve our chances we can do two things: use proxies and apply stealth patches to our browser.
First, let's take a look at proxies.
The default way to use proxies with Puppeteer is to apply them to the Browser object:
const browser = await puppeteer.launch({
args: [ '--proxy-server=http://12.34.56.78:8000' ]
});
However, this approach has a few pitfalls. First, this would mean that every time we want to switch proxy we'd need to restart our web browser - what if we're in the middle of something? For small web scrapers, a single proxy might be enough but once we scale we need something better.
Unfortunately, Puppeteer is unable to set a proxy per request or even per Page. There are some solutions like puppeteer-page-proxy and puppeteer-proxy, but these extensions hijack the headless browser's requests and replay them through a NodeJS HTTP client, which increases the likelihood of being detected as a bot.
The best way to use multiple proxies in Puppeteer is to run your own proxy server. This way the web scraper connects to a single local proxy, and that proxy server picks a random upstream proxy from a list for every connection. For example, we can achieve this using the proxy-chain NodeJS package:
const puppeteer = require('puppeteer')
const ProxyChain = require('proxy-chain');
const proxies = [
'http://user:pass@11.11.11.11:8000',
'http://user:pass@22.22.22.22:8000',
'http://user:pass@33.33.33.33:8000',
]
const server = new ProxyChain.Server({
port: 8000,
prepareRequestFunction: ({request}) => {
let randomProxy = proxies[proxies.length * Math.random() | 0];
return {
upstreamProxyUrl: randomProxy,
};
},
});
server.listen(() => console.log('Proxy server started on 127.0.0.1:8000'));
const browser = await puppeteer.launch({
args: [ '--proxy-server=http://127.0.0.1:8000' ]
});
With this approach, our Browser uses our proxy-chain server as its proxy, which in turn selects a random upstream proxy for every request! However, this approach is not bullet-proof and might produce errors when used with low-quality proxies, so if you do go with proxy chains make sure to spend extra effort on monitoring.
When we web scrape with a real web browser, we give the website full code execution access. This means websites can run various JavaScript scripts to gather information about our browser. This is referred to as fingerprinting, and once a website successfully identifies us as a non-human user it can block us.
To get around fingerprinting, we can fortify our headless browser to essentially lie about what it is. Doing this thoroughly is a really time-consuming process that is beyond the scope of a single developer, but there are community-maintained tools like the puppeteer-stealth plugin:
const puppeteer = require('puppeteer-extra')
// add stealth plugin and use defaults (all evasion techniques)
const StealthPlugin = require('puppeteer-extra-plugin-stealth')
puppeteer.use(StealthPlugin())
// puppeteer usage as normal
puppeteer.launch({ headless: true }).then(async browser => {
console.log('Running tests..')
const page = await browser.newPage()
await page.goto('https://bot.sannysoft.com')
await page.waitForTimeout(5000)
await page.screenshot({ path: 'testresult.png', fullPage: true })
await browser.close()
console.log(`All done, check the screenshot. ✨`)
})
In this example, we've installed the puppeteer-extra plugin pack (npm install puppeteer-extra puppeteer-extra-plugin-stealth) and patched our Puppeteer package with stealth patches. While this is not a solve-all solution, it does patch up some common holes that might reveal our web scraper!
As you can see there's quite a bit of work involved in making a Puppeteer-based web scraper undetectable and efficient. The best way to deal with difficult challenges like these is to employ logic separation!
ScrapFly's middleware service can let you focus on web scraping while we deal with ensuring the scraper remains undetected and efficient.
ScrapFly service does the heavy lifting for you
For example, the most popular feature is JavaScript rendering, which is essentially Puppeteer in the cloud! It allows us to use simple NodeJS HTTP requests instead of starting up a whole web browser locally. This not only saves us resources but also provides better results, as ScrapFly's automated web browsers are specially designed to be efficient for web scraping.
const axios = require('axios');
const scrapflyKey = 'YOUR SCRAPFLY API KEY'
function scrapflyRequest(url, waitForSelector){
var options = {
'key': scrapflyKey,
'render_js': true,
'url': url,
'wait_for_selector': waitForSelector
};
return axios.get(
'https://api.scrapfly.io/scrape',
{params: options}
);
}
async function run(){
let response = await scrapflyRequest('http://tiktok.com', 'h1');
console.log(response);
}
run();
Here, we're using ScrapFly's middleware to render the requested URL with a fortified cloud web browser. We can even tell it what content to wait for, the same way we do in Puppeteer!
Often we might not even need full JavaScript rendering, but we can still be blocked by some anti-bot protection. For this, ScrapFly provides its Anti Scraping Protection solution, which solves various scraping blockers like captchas and popular anti-scraping integrations:
function scrapflyRequest(url, useASP){
var options = {
'key': scrapflyKey,
'url': url,
'asp': useASP
// ^^^ we can turn on ASP with this url option
};
return axios.get(
'https://api.scrapfly.io/scrape',
{params: options}
);
}
In the example above, we're using ScrapFly's middleware with the anti web scraping protection solution enabled via the asp argument.
Finally, ScrapFly provides an automated smart proxy system that applies proxies to every request automatically! We can even choose the proxy type and country of origin:
function scrapflyRequest(url){
var options = {
'key': scrapflyKey,
'url': url,
'country': 'ca',
// ^^^ use proxies from Canada
'proxy_pool': 'public_residential_pool',
// ^^^ use residential proxies which are much harder to detect
};
return axios.get(
'https://api.scrapfly.io/scrape',
{params: options}
);
}
Here, we configure our requests to use residential proxies from Canada, which allows us to access geo-locked Canadian websites, and since residential proxies are much harder to identify, we're also much more difficult to block!
ScrapFly's feature set doesn't end here though! For the full feature set see our full documentation
To wrap up this tutorial let's take a look at frequently asked questions:
How can we speed up Puppeteer scraping?
Puppeteer provides a high-level API for controlling browsers, but it's not a full web scraping framework. However, since it's an asynchronous node library, the easiest speedup optimization is to take advantage of that fact: we can launch multiple browsers and use them in a single scraper application with the Promise.all or Promise.allSettled concurrency functions.
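For example, a rough sketch of splitting a list of URLs across several browser instances could look like this (reusing the scrapeVideo() function from the example above):
// rough sketch: split URLs across several browsers and scrape concurrently
async function scrapeWithBrowsers(urls, browserCount = 3){
    const browsers = await Promise.all(
        Array.from({length: browserCount}, () => puppeteer.launch({headless: true}))
    );
    // assign each URL to a browser in round-robin fashion
    const results = await Promise.allSettled(
        urls.map((url, i) => scrapeVideo(browsers[i % browserCount], url))
    );
    await Promise.all(browsers.map(browser => browser.close()));
    return results;
}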
Why does Puppeteer behave differently on different machines?
Since Puppeteer is automating a real browser, a lot of its behavior depends on the machine itself. In other words, the headless Chrome browser controlled by Puppeteer relies on operating system packages, so if we develop our code on MacOS and run it in production on a Linux machine, it can behave slightly differently.
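A quick way to spot such differences is to log the exact browser build and user agent that Puppeteer is driving on each machine and compare them:
// print the browser build and user agent Puppeteer is actually using -
// comparing this output between machines helps explain behavior differences
async function printEnvironment(){
    const browser = await puppeteer.launch();
    console.log(await browser.version());   // e.g. "HeadlessChrome/108.0.0.0"
    console.log(await browser.userAgent());
    await browser.close();
}
printEnvironment();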
How can we capture background requests and responses in Puppeteer?
Often dynamic websites use background requests (XHR) to generate some data after the page loads. We can capture these requests and responses using page.on event listeners. For example, we can capture all XHR-type requests and either drop them or read/modify their data:
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
// capture background requests:
await page.setRequestInterception(true);
page.on('request', request => {
if (request.resourceType() === 'xhr') {
console.log(request);
// we could also block these requests with:
// request.abort();
request.continue();
} else {
request.continue();
}
});
// capture background responses:
page.on('response', response => {
if (response.request().resourceType() === 'xhr') {
console.log(response);
}
})
// navigate somewhere to trigger background requests (example URL)
await page.goto('https://www.tiktok.com/');
await browser.close();
})();
In this extensive introduction article we took a look at the Puppeteer web browser automation package for NodeJS and how we can use it for web scraping. We covered some common usage scenarios, explored challenges like how to use proxies and avoid being detected, and wrote a small example scraper which collects creator data from tiktok.com.
Finally, we wrapped everything up by taking a look at ScrapFly's solution, how it compares to Puppeteer and how it can be used instead of Puppeteer to provide even better results in NodeJS!
For more on different browser automation solutions see a related article we wrote about Selenium, Puppeteer and Playwright and how they compare in the context of web scraping!