Scraping Using Headless Browsers
Headless browsers are special versions of web browsers that contain no UI elements and run in the background. This makes them ideal for web automation and web scraping dynamic pages.
For example, instead of reverse engineering hidden APIs we can run a real browser and perform actions that call these APIs.
Scraping using headless browsers has some clear advantages and disadvantages:
- Easier to develop - we can direct a real browser to do all the things browsers do: click buttons, reload pages and fill in forms.
- Can execute Javascript - a real web browser can execute Javascript and render dynamic content.
- Harder to block - a real web browser is harder to identify as a scraper.
- Extremely resource intensive - browsers use significantly more resources, from CPU and memory to bandwidth.
- Prone to bugs - browsers are very complicated pieces of software, especially when run in headless mode.
Scrapfly supports scraping using headless browsers through the Javascript Rendering feature, which eliminates these disadvantages by running the browsers in a managed cloud!
Introduction
Headless browsers are controlled through a standard protocol called CDP (Chrome DevTools Protocol). For CDP to work, a real web browser needs to be launched with a special flag that enables the CDP server and disables the GUI. Then, communication with the browser is done through a websocket connection.
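To illustrate, here's a minimal sketch of this handshake in Python, assuming a Chrome binary named google-chrome is on the PATH (adjust for your system); no third-party packages are needed to see the protocol mechanics:

```python
import json
import subprocess
import time
import urllib.request

# --headless disables the GUI; --remote-debugging-port enables the CDP server.
# Recent Chrome versions also require a non-default profile directory
# for remote debugging, hence --user-data-dir.
chrome = subprocess.Popen([
    "google-chrome",
    "--headless",
    "--remote-debugging-port=9222",
    "--user-data-dir=/tmp/chrome-cdp",
])
time.sleep(2)  # give the browser a moment to start up

# the browser exposes its websocket endpoint through a small HTTP API
meta = json.load(urllib.request.urlopen("http://localhost:9222/json/version"))
# all further communication (navigation, clicks, etc.) goes over this websocket
print(meta["webSocketDebuggerUrl"])  # e.g. ws://localhost:9222/devtools/browser/<id>

chrome.terminate()
```

In practice you rarely speak raw CDP yourself - the client libraries below handle the websocket communication for you.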
Intro to Scraping Using Browsers
This introduction will cover everything you need to know about headless browser scraping.
Libraries
There are many libraries and clients that implement this communication protocol. If you want to dive right in, these three are the major ones:
Playwright Intro
The newest library with a modern developer API, available in multiple languages including Python and NodeJS.
Selenium Intro
Classic choice with the biggest web scraping community.
Puppeteer Intro
Predecessor to Playwright. Only available in NodeJS, with a sizable scraping community around it.
As for which one to choose - Playwright is the newest addition to this area with a pleasant developer API. However, Selenium and Puppeteer both have larger communities and extension ecosystems. Either way, all of these tools can be leveraged to do almost anything with a real web browser - refer to our FAQ knowledge bases for more:
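To give a taste of what browser scraping looks like in practice, here's a minimal Playwright sketch in Python - the URL and CSS selector are illustrative placeholders:

```python
# A minimal Playwright scraping sketch
# (pip install playwright && playwright install chromium)
from playwright.sync_api import sync_playwright

with sync_playwright() as pw:
    browser = pw.chromium.launch(headless=True)  # launch with no GUI
    page = browser.new_page()
    page.goto("https://web-scraping.dev/products")  # placeholder URL
    # the browser executes javascript for us - wait for dynamic content to render
    page.wait_for_selector(".product")  # placeholder selector
    html = page.content()  # the fully rendered HTML, ready for parsing
    browser.close()
```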
There are many useful optimizations and extensions when it comes to scraping using browsers. Here are a few articles related to the most popular ones:
Enabling Extensions
How to enable browser extensions like adblockers for headless browser scrapers.
Selenium Grid Scaling
Selenium Grid is a server for managing and scaling multiple Selenium browser instances.
Undetected Chromedriver Patch
Selenium web driver replacement with many patches that can reduce scraper blocking.
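As a quick illustration of that last one, the patched driver is designed as a drop-in replacement for Selenium's own - a minimal sketch, assuming the undetected-chromedriver package is installed and using a placeholder URL:

```python
# pip install undetected-chromedriver
import undetected_chromedriver as uc

driver = uc.Chrome(headless=True)  # drop-in replacement for selenium.webdriver.Chrome
driver.get("https://web-scraping.dev/")  # placeholder URL
print(driver.title)  # driven exactly like a regular Selenium driver
driver.quit()
```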
Utilities
Having a browser at your disposal opens up new possibilities for web scraping: not only can we load dynamic pages, but we can also perform browser actions and capture what the browser is doing.
Capturing Background Requests Browsers Make
The most interesting use case is using headless web browsers to discover hidden APIs. By launching the browser in request capture mode we can collect all requests it makes and turn them into efficient HTTP client requests.
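Here's a minimal sketch of request capture using Playwright's request events - the URL is an illustrative placeholder:

```python
from playwright.sync_api import sync_playwright

captured = []

def on_request(request):
    # background XHR/fetch calls are the ones that usually point to hidden APIs
    if request.resource_type in ("xhr", "fetch"):
        captured.append((request.method, request.url))

with sync_playwright() as pw:
    browser = pw.chromium.launch(headless=True)
    page = browser.new_page()
    page.on("request", on_request)  # subscribe to every request the browser makes
    page.goto("https://web-scraping.dev/products")  # placeholder URL
    browser.close()

# these calls can now be replicated with a plain, fast HTTP client
for method, url in captured:
    print(method, url)
```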
Next up - Hidden Data Scraping
Scraping using headless browsers can be expensive and complex, so it's best avoided when possible. We've already taken a look at how to scrape hidden APIs, but there's another dynamic page scraping secret - hidden web data. Let's take a look at it next!