Scraping Using Headless Browsers
Headless browsers are special versions of web browsers that contain no UI elements and run in the background. This makes them ideal for web automation and web scraping dynamic pages.
For example, instead of reverse engineering hidden APIs we can run a real browser and perform actions that call these APIs.
Scraping using headless browsers has some clear advantages and disadvantages:
- Easier to develop - we can direct a real browser to do all the things browsers do: click buttons, reload pages and fill in forms.
- Can execute Javascript - a real web browser can execute Javascript and render dynamic content.
- Harder to block - a real web browser is harder to identify as a scraper.
- Extremely resource intensive - browsers use significantly more resources, from CPU and memory to bandwidth.
- Prone to bugs - browsers are very complicated pieces of software, especially when run in headless mode.
Scrapfly supports scraping using headless browsers through the Javascript Rendering feature, which eliminates these disadvantages by running the browsers in a managed cloud!
Introduction
Headless browsers are controlled through a standard protocol called CDP (Chrome DevTools Protocol). For CDP to work, a real web browser needs to be launched with a special flag that enables the CDP server and disables the GUI. Then, communication with the browser is done through a websocket connection.
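To illustrate, here's a minimal sketch of this handshake in Python, assuming a Chrome binary named google-chrome is on the PATH (adjust for your system); no third-party packages are needed to see the protocol mechanics:

```python
import json
import subprocess
import time
import urllib.request

# --headless disables the GUI; --remote-debugging-port enables the CDP server.
# Recent Chrome versions also require a non-default profile directory
# for remote debugging, hence --user-data-dir.
chrome = subprocess.Popen([
    "google-chrome",
    "--headless",
    "--remote-debugging-port=9222",
    "--user-data-dir=/tmp/chrome-cdp",
])
time.sleep(2)  # give the browser a moment to start up

# the browser exposes its websocket endpoint through a small HTTP API
meta = json.load(urllib.request.urlopen("http://localhost:9222/json/version"))
# all further communication (navigation, clicks, etc.) goes over this websocket
print(meta["webSocketDebuggerUrl"])  # e.g. ws://localhost:9222/devtools/browser/<id>

chrome.terminate()
```

In practice you rarely speak raw CDP yourself - the client libraries below handle the websocket communication for you.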
Intro to Scraping Using Browsers
This introduction will cover everything you need to know about headless browser scraping.
Libraries
There are many libraries and clients that implement this communication protocol. If you want to dive right in, these three are the major ones:
Playwright Intro
The newest library with a modern developer API, available in multiple languages including Python and NodeJS.
Selenium Intro
Classic choice with the biggest web scraping community.
Puppeteer Intro
Predecessor to Playwright. Only available in NodeJS, with a sizable scraping community around it.
As for which one to choose - Playwright is the newest addition to this area with a pleasant developer API. However, Selenium and Puppeteer both have larger communities and extension ecosystems. Either way, all of these tools can be leveraged to do almost anything with a real web browser - refer to our FAQ knowledge bases for more:
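To give a taste of what browser scraping looks like in practice, here's a minimal Playwright sketch in Python - the URL and CSS selector are illustrative placeholders:

```python
# A minimal Playwright scraping sketch
# (pip install playwright && playwright install chromium)
from playwright.sync_api import sync_playwright

with sync_playwright() as pw:
    browser = pw.chromium.launch(headless=True)  # launch with no GUI
    page = browser.new_page()
    page.goto("https://web-scraping.dev/products")  # placeholder URL
    # the browser executes javascript for us - wait for dynamic content to render
    page.wait_for_selector(".product")  # placeholder selector
    html = page.content()  # the fully rendered HTML, ready for parsing
    browser.close()
```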
There are many useful optimizations and extensions when it comes to scraping using browsers. Here are a few articles related to the most popular ones:
Enabling Extensions
How to enable browser extensions like adblockers for headless browser scrapers.
Selenium Grid Scaling
Selenium Grid is a server for managing and scaling multiple Selenium browser instances.
Undetected Chromedriver Patch
Selenium web driver replacement with many patches that can reduce scraper blocking.
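As a quick illustration of that last one, the patched driver is designed as a drop-in replacement for Selenium's own - a minimal sketch, assuming the undetected-chromedriver package is installed and using a placeholder URL:

```python
# pip install undetected-chromedriver
import undetected_chromedriver as uc

driver = uc.Chrome(headless=True)  # drop-in replacement for selenium.webdriver.Chrome
driver.get("https://web-scraping.dev/")  # placeholder URL
print(driver.title)  # driven exactly like a regular Selenium driver
driver.quit()
```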
Utilities
Having a browser at your disposal opens up new possibilities for web scraping: not only can we load dynamic pages, but we can also perform browser actions and capture what the browser is doing.
Capturing Background Requests Browsers Make
The most interesting use case is using headless web browsers to discover hidden APIs. By launching the browser in request capture mode we can collect all requests it makes and turn them into efficient HTTP client requests.
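Here's a minimal sketch of request capture using Playwright's request events - the URL is an illustrative placeholder:

```python
from playwright.sync_api import sync_playwright

captured = []

def on_request(request):
    # background XHR/fetch calls are the ones that usually point to hidden APIs
    if request.resource_type in ("xhr", "fetch"):
        captured.append((request.method, request.url))

with sync_playwright() as pw:
    browser = pw.chromium.launch(headless=True)
    page = browser.new_page()
    page.on("request", on_request)  # subscribe to every request the browser makes
    page.goto("https://web-scraping.dev/products")  # placeholder URL
    browser.close()

# these calls can now be replicated with a plain, fast HTTP client
for method, url in captured:
    print(method, url)
```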
Next up - Hidden Data Scraping
Scraping using headless browsers can be expensive and complex, so it's best avoided when possible. We've already taken a look at how to scrape hidden APIs, but there's another dynamic page scraping secret - hidden web data. Let's take a look at it next!