What is a Headless Browser? Top 5 Headless Browser Tools
A quick overview of the emerging technology of browser automation: what exactly are these tools, and how are they used in web scraping?
To speed up Puppeteer web scrapers we can block media and other non-essential requests using the request interception feature:
// we can block by resource type like fonts, images etc.
const blockResourceType = [
  'beacon',
  'csp_report',
  'font',
  'image',
  'imageset',
  'media',
  'object',
  'texttrack',
];
// we can also block by domains, like google-analytics etc.
const blockResourceName = [
  'adition',
  'adzerk',
  'analytics',
  'cdn.api.twitter',
  'clicksor',
  'clicktale',
  'doubleclick',
  'exelator',
  'facebook',
  'fontawesome',
  'google',
  'google-analytics',
  'googletagmanager',
  'mixpanel',
  'optimizely',
  'quantserve',
  'sharethrough',
  'tiqcdn',
  'zedo',
];
const puppeteer = require('puppeteer');

// `await` needs an async context, so the scraper is wrapped in a function
async function run() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // we need to enable the interception feature
  await page.setRequestInterception(true);
  // then we can add a callback which inspects every
  // outgoing request the browser makes and decides whether to allow it
  page.on('request', request => {
    // drop the query string so we only match against the base URL
    const requestUrl = request.url().split('?')[0];
    if (
      // use includes(): the `in` operator checks array indexes, not values
      blockResourceType.includes(request.resourceType()) ||
      blockResourceName.some(resource => requestUrl.includes(resource))
    ) {
      request.abort();
    } else {
      request.continue();
    }
  });
}
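The blocking decision can also be pulled out of the request handler and checked without launching a browser. The following is a minimal sketch, not part of the Puppeteer API: the helper name `shouldBlockRequest` and the shortened `blockedTypes` and `blockedNames` lists are assumptions made for illustration.

```javascript
// Shortened copies of the block lists, for illustration only.
const blockedTypes = ['font', 'image', 'media'];
const blockedNames = ['google-analytics', 'doubleclick'];

// Hypothetical helper: returns true when a request should be aborted.
function shouldBlockRequest(url, resourceType) {
  // ignore the query string so URL parameters don't affect name matching
  const baseUrl = url.split('?')[0];
  return (
    blockedTypes.includes(resourceType) ||
    blockedNames.some(name => baseUrl.includes(name))
  );
}

// blocked by resource type
console.log(shouldBlockRequest('https://example.com/logo.png', 'image'));
// blocked by name found in the URL
console.log(shouldBlockRequest('https://www.google-analytics.com/collect?v=1', 'xhr'));
// allowed: neither the type nor the base URL matches
console.log(shouldBlockRequest('https://example.com/api/items', 'xhr'));
```

Inside the `request` callback, such a helper could be called as `shouldBlockRequest(request.url(), request.resourceType())` to choose between `request.abort()` and `request.continue()`.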
This knowledgebase is provided by Scrapfly data APIs.