As a software developer, you have probably come across the word "headless" in many contexts: headless CMS, headless browser, headless Linux, and more. In the world of software, "headless" usually implies the absence of a presentation layer (GUI) from the application at hand.
Accordingly, a headless browser is a browser instance without any visible GUI (graphical user interface) elements.
Headless browsers have become increasingly important as web applications have grown more complex. This growing complexity drove the need for new tools to facilitate tasks like web automation, testing, web scraping and other server-side browser interactions.
In this article, we will dive deeper into what a headless browser is, what its use cases are, and what common tools utilize the power of headless browsers.
What Is a Headless Browser?
A headless browser is a web browser stripped of all GUI elements, leaving the core browsing functionality: rendering web content, navigating webpages, managing cookies, network communication, and much more. Headless browsers can be controlled from the command line or programmatically through a wide range of programming languages, making them very versatile and useful.
This makes headless browsers a perfect match for servers, which typically don't support rendering visual output and are themselves administered via the command line.
Headless browsers are also significantly faster and more efficient than the traditional browser you are reading this article through. By avoiding the resource-intensive process of drawing output on a screen, a headless browser can dedicate all its resources to network communication and rendering content.
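To make the command-line angle concrete, here is a small shell sketch. The flags are real Chrome switches, but the binary name varies by platform, so the snippet guards for a missing browser and tolerates network failures:

```shell
# Locate a Chrome/Chromium binary if one is installed (name varies by platform)
BROWSER="$(command -v google-chrome || command -v chromium || echo "")"
if [ -n "$BROWSER" ]; then
  # Print the fully rendered DOM of a page to stdout; no window ever opens
  "$BROWSER" --headless=new --dump-dom https://example.com > page.html || true
  # Capture a screenshot of the same page
  "$BROWSER" --headless=new --screenshot=page.png https://example.com || true
else
  echo "no Chrome/Chromium binary found; skipping demo"
fi
```

The same binary that renders your everyday tabs does all the work here, just without a window.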
Headless Browser Use Cases
The performance, efficiency, and programmatic control capabilities of headless browsers make them a great tool to perform repetitive tasks like end-to-end testing, automation, and web scraping.
E2E Testing
E2E testing is a crucial step in an application's lifecycle. It involves simulating user interactions with an application's frontend programmatically, ensuring all user flows within a complex app behave correctly. These simulated interactions don't require visual output and typically run in CI/CD pipelines, which makes them an obvious candidate for headless browsers.
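As a sketch of what such a test can look like, here is a headless login check using Playwright's sync Python API. This assumes Playwright is installed (`pip install playwright` plus `playwright install chromium`); the URL, credentials, and `#secret-message` selector come from the web-scraping.dev demo site used later in this article, and the import is guarded in case Playwright is absent:

```python
# Guard the import: playwright may not be installed in every environment
try:
    from playwright.sync_api import sync_playwright
except ImportError:
    sync_playwright = None

def run_login_test() -> bool:
    """Simulate a user logging in and verify the post-login page."""
    with sync_playwright() as pw:
        browser = pw.chromium.launch(headless=True)  # no window is opened
        page = browser.new_page()
        page.goto("https://web-scraping.dev/login")
        page.fill("input[name=username]", "user123")
        page.fill("input[name=password]", "password")
        page.click("button[type='submit']")
        page.wait_for_load_state()
        # the demo site reveals this element only after a successful login
        ok = page.locator("#secret-message").count() > 0
        browser.close()
        return ok

# usage (requires a network connection and an installed chromium):
# assert run_login_test()
```

In a CI pipeline this would typically be wrapped in a test runner such as pytest, with one test per user flow.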
Automation
Extending the E2E testing use case, simulating interactions programmatically is also very useful for automating repetitive tasks that might otherwise take weeks or months and tens or hundreds of employees to perform manually. For example, submitting hundreds of thousands of records to a web form, when you don't have direct access to the endpoint it submits its data to, would be a tedious task to do by hand; with a headless browser and some computing resources, it could be an hour's worth of work.
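The shape of such an automation job is simple. In this sketch, `submit_record` is a hypothetical stand-in for the headless-browser code that would actually fill and submit the form for each record:

```python
import csv
import io

def submit_record(record: dict) -> bool:
    """Hypothetical placeholder: a real version would drive a headless
    browser to fill the web form with this record and submit it."""
    return all(record.values())  # pretend success when all fields are present

# in practice this would be a file with hundreds of thousands of rows
records_csv = "name,email\nAlice,alice@example.com\nBob,bob@example.com\n"
submitted = sum(submit_record(row) for row in csv.DictReader(io.StringIO(records_csv)))
print(f"submitted {submitted} records")  # submitted 2 records
```

The loop parallelizes naturally: spin up several headless browser instances and shard the records between them.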
Web Scraping
Now for the star of the show and the most popular use case for headless browsers: web scraping. The popularity of dynamic webpages (also referred to as single-page applications or client-side rendered applications) has made web scraping with a headless browser a necessity.
As these applications rely on JavaScript to render any HTML on the page, simply sending a GET request to the page's URL and parsing the HTML received no longer works. Using a headless browser for scraping allows the page to load and execute any JS code that ships with the initial HTML document, producing rendered HTML that can then be parsed programmatically to extract structured data.
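A tiny standard-library demonstration of the problem: the initial HTML of a hypothetical SPA contains only a mount point for JavaScript, so naive parsing finds no text content at all:

```python
from html.parser import HTMLParser

# hypothetical initial document returned by a plain GET request to an SPA
spa_html = (
    '<html><body>'
    '<div id="root"></div>'  # JS renders all content into this empty node
    '<script src="/app.js"></script>'
    '</body></html>'
)

class TextCollector(HTMLParser):
    """Collect the text nodes a naive scraper would see."""
    def __init__(self):
        super().__init__()
        self.text = []

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())

parser = TextCollector()
parser.feed(spa_html)
print(parser.text)  # [] - no product names or prices to extract
```

A headless browser closes this gap by executing `/app.js` first and handing you the DOM as a user would see it.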
Common Headless Browser Tools
It is no surprise that the most common tools leveraging the capabilities of headless browsers are browser automation tools. Browser automation tools are mostly marketed as testing automation tools; however, they are perfect for regular automation tasks and, most importantly, web scraping.
The most popular and widely-used browser automation tools are:
Playwright
Puppeteer
Selenium
Selenium-Wire
Undetected Chromedriver
Playwright
Playwright is a relatively new (initial release 2020) open-source browser automation library created by Microsoft. It has gained a lot of popularity lately due to its intuitive API design, which overcomes many limitations of other popular libraries like Selenium and Cypress. It supports all modern browser engines and runs in both headed and headless modes.
Playwright was, as per its documentation, "created specifically to accommodate the needs of end-to-end testing", yet it is widely used by the web scraping community to build headless browser bots that extract data from dynamic websites.
Pros:
Multi-Browser Support: Supports Chromium, Firefox, and WebKit, which covers most browser engines.
Cross-Platform: Works on Windows, macOS, and Linux, and supports multiple languages like JavaScript, Python, and C#.
Advanced Features: Provides powerful features like intercepting network requests, taking screenshots, generating PDFs, and simulating geolocation.
DevTools Protocol: Uses the DevTools protocol, enabling direct control over browser features.
Cons:
Larger API Surface: The API is more extensive compared to some other tools, leading to a steeper learning curve.
Relatively New: Less community support and third-party integrations compared to more established tools like Selenium.
Here is an example that uses Playwright and Parsel in Python to scrape some products from web-scraping.dev:
from playwright.sync_api import sync_playwright
from parsel import Selector

with sync_playwright() as pw:
    browser = pw.chromium.launch(headless=True)
    context = browser.new_context(viewport={"width": 1920, "height": 1080})
    page = context.new_page()
    page.goto("https://web-scraping.dev/products")  # go to url
    page_html = page.content()
    browser.close()

sel = Selector(text=page_html)
parsed = []
for product in sel.css(".product"):
    parsed.append({
        "name": product.css("h3::text").get(),
        "short_description": product.css(".short-description::text").get(),
        "url": product.css("h3 a::attr(href)").get(),
        "price": product.css(".price::text").get(),
        "image_url": product.css("img::attr(src)").get(),
    })
print(parsed)
Puppeteer
Puppeteer is another open-source browser automation library, built by Google. It is widely used in the JS community for automation and web scraping tasks. It is not as popular as other tools since it is only officially available in JS and supports only Chrome and Firefox browsers.
Supported languages: JS, Python (unofficial and unmaintained)
Pros:
Native Chromium Support: Puppeteer is developed by Google and provides native control over Chromium-based browsers, which leads to better stability.
Simpler API: Easy to learn and use due to its straightforward API.
Network Control: Supports features like request interception, network throttling, and custom headers.
Good Documentation: Well-documented and actively maintained by the developers.
Cons:
WebKit Support: Lacks support for WebKit-based browsers like Safari.
Limited Language Support: Primarily supports JavaScript and TypeScript, with no official bindings for other languages.
Here is an example that uses Puppeteer and Cheerio in Node.js to scrape the same products we scraped with Playwright:
const puppeteer = require("puppeteer");
const cheerio = require("cheerio");

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.setViewport({ width: 1920, height: 1080 });
  await page.goto("https://web-scraping.dev/products");
  const html = await page.content();
  await browser.close();

  // parse the rendered html with cheerio, mirroring the Playwright example
  const $ = cheerio.load(html);
  const parsed = [];
  $(".product").each((_, product) => {
    parsed.push({
      name: $(product).find("h3").text(),
      short_description: $(product).find(".short-description").text(),
      url: $(product).find("h3 a").attr("href"),
      price: $(product).find(".price").text(),
      image_url: $(product).find("img").attr("src"),
    });
  });
  console.log(parsed);
})();
Selenium
Selenium is one of the most popular and widely used browser automation tools. Dating back to 2004, Selenium has had the opportunity to gather a huge ecosystem of plugins and unmatched community support. It can be used with almost any programming language and all web browsers, including good old Internet Explorer, giving it an edge over other browser automation libraries.
Selenium's long-running development and support for multiple languages and browsers bring significant benefits, but they also come with limitations: its API is dated, the developer experience is not very intuitive, and it lags behind on modern language features and browser APIs. That's why tools like Playwright and Puppeteer can be better alternatives.
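For completeness, here is a minimal headless Selenium sketch. It assumes Selenium 4+, which resolves the matching chromedriver automatically, plus a local Chrome install; the product selector mirrors the web-scraping.dev examples above, and the import is guarded in case Selenium is absent:

```python
# Guard the import: selenium may not be installed in every environment
try:
    from selenium import webdriver
    from selenium.webdriver.common.by import By
except ImportError:
    webdriver = None

def scrape_product_names(url: str = "https://web-scraping.dev/products") -> list:
    """Load a page in headless Chrome and return the product titles."""
    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # modern Chrome headless flag
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return [el.text for el in driver.find_elements(By.CSS_SELECTOR, ".product h3")]
    finally:
        driver.quit()  # always release the browser process

# usage (requires chrome and a network connection):
# print(scrape_product_names())
```

Compare the verbosity of the options/driver setup with the Playwright example above; this is part of what people mean by Selenium's dated developer experience.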
Selenium-Wire and Undetected Chromedriver
Selenium's popularity undoubtedly extends to the web scraping community, as it also supports running in headless mode. This popularity has set the ground for developers to build extensions that solve problems not directly addressed by Selenium's core automation library; such extensions include Selenium-Wire and Undetected Chromedriver.
1. Selenium-wire
Selenium-Wire is a Python library that extends Selenium's headless browser library. It adds the ability to inspect and manipulate the background HTTP requests sent by the browser. This adds another dimension to web scraping with headless browsers, as we can now intercept HTTP requests triggered by user interactions and extract valuable data from them.
Here is an example of intercepting testimonial requests on the web-scraping.dev testimonials page:
from seleniumwire import webdriver
from parsel import Selector
import time

options = webdriver.ChromeOptions()
options.add_argument("log-level=3")  # disable logs
driver = webdriver.Chrome(options=options)
driver.get("https://web-scraping.dev/testimonials")

def scroll(driver: webdriver):
    for i in range(0, 5):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(1)

scroll(driver)

def parse_xhr_html(html: str):
    """parse review data from each xhr response body"""
    data = []
    selector = Selector(html)
    for review in selector.css("div.testimonial"):
        data.append({
            "rate": len(review.css("span.rating > svg").getall()),
            "text": review.css("p.text::text").get()
        })
    return data

# iterate over all the recorded XHR requests and parse each response body
data = []
for request in driver.requests:
    if "/testimonials" in request.url:
        xhr_data = parse_xhr_html(request.response.body.decode("utf-8"))
        data.extend(xhr_data)
driver.quit()
You can learn more about Selenium-Wire in our dedicated Selenium-Wire article.
2. Undetected Chromedriver
Many websites have implemented measures that detect headless browser bots and scrapers and block them from accessing the website's pages. Enter Undetected Chromedriver. Undetected Chromedriver is a modified Selenium webdriver with built-in measures to combat websites that block automated headless browsers from scraping their pages.
Here is an example of using Undetected Chromedriver to bypass Cloudflare's crawler block:
import undetected_chromedriver as uc
import time

# Add the driver options
options = uc.ChromeOptions()
options.headless = False

# Configure the undetected_chromedriver options
driver = uc.Chrome(options=options)

with driver:
    # Go to the target website
    driver.get("https://nowsecure.nl/")
    # Wait for the security check to complete
    time.sleep(4)
    # Take a screenshot
    driver.save_screenshot("screenshot.png")
# The context manager closes the browser on exit
Learn how to use Selenium's Undetected Chromedriver through our dedicated blog post.
Cloud Browsers with ScrapFly API
Choosing the right browser automation tool for your specific use case is a hard decision. That's why we at Scrapfly offer to do the heavy lifting for you by providing a generic, scalable, and fully managed cloud-based browser automation API.
Here's how easy it is to start and control a cloud browser with Scrapfly and Python:
from scrapfly import ScrapflyClient, ScrapeConfig

client = ScrapflyClient(key="Your Scrapfly Key")

# define your scenario, see docs for available options https://scrapfly.io/docs/scrape-api/javascript-scenario
api_result = client.scrape(ScrapeConfig(
    url="https://web-scraping.dev/login",
    # enable js
    render_js=True,
    # optionally - send browser actions to reach the pages you want scraped
    js_scenario=[
        {"fill": {"selector": "input[name=username]", "value": "user123"}},
        {"fill": {"selector": "input[name=password]", "value": "password"}},
        {"click": {"selector": "button[type='submit']"}},
        {"wait_for_navigation": {"timeout": 5000}},
    ],
))
print(api_result.selector.css("#secret-message ::text").get())
# prints: 🤫
Not only can you render JavaScript, you can fully control the browser with instructions like button clicking, scrolling, and text input!
FAQ
What is a headless browser?
A headless browser is a web browser without a graphical user interface, designed to run in environments without displays. It performs all the functions of a regular browser—rendering web pages, executing JavaScript, and handling network requests—but operates invisibly in the background.
What is the best headless browser automation tool?
Every tool has its pros and cons. However, Playwright has been gaining a lot of interest lately due to its feature-rich, cross-language, cross-browser API, which provides both asynchronous and synchronous client implementations.
What is a dynamic web application?
A dynamic web application, also referred to as a single-page application or client-side rendered application, is a type of web application that depends on JavaScript libraries and frameworks to render all or part of its content. This means that the HTML that loads when the website is first requested contains little or none of the content, which makes scraping with traditional methods ineffective.
Summary
In this article, we have covered the ins and outs of headless browsers and their use cases, as well as the popular tools that utilize headless browsers for testing, automation, and web scraping.
The concept of a headless browser might seem simple, but the ecosystem around it is definitely much more complex, especially when it comes to scaling to high volumes. This is why at Scrapfly, we take on the responsibility of building the data collection infrastructure that scales to your needs.