What is a Headless Browser? Top 5 Headless Browser Tools

As a software developer, you have probably come across the word "headless" in many contexts: headless CMS, headless browser, headless Linux, and more. In software, the word "headless" usually implies that the application at hand has no presentation layer (GUI).

Accordingly, a headless browser is a browser instance without any visible GUI (graphical user interface) elements.

Headless browsers have become increasingly important as web applications have grown more complex. This growing complexity drove the need for new tools to facilitate tasks like web automation, testing, web scraping and other server-side browser interactions.

In this article, we will dive deeper into what a headless browser is, what its use cases are, and which common tools utilize the power of headless browsers.

What Is a Headless Browser?

A headless browser is a web browser stripped of all GUI elements, leaving the core browsing functionality: rendering web content, navigating webpages, managing cookies, network communication, and much more. Headless browsers can be controlled from the command line or programmatically from a wide range of programming languages, which makes them versatile and useful.

This makes headless browsers a perfect match for servers, which typically have no display to render visual output on and are themselves controlled via the command line.
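As a minimal sketch of command-line control, the snippet below builds and runs a headless Chrome invocation from Python. The `--headless` and `--dump-dom` flags are standard Chrome switches; the binary name `google-chrome` is an assumption and differs between systems (for example `chromium` or `chrome.exe`), and both function names are made up for this illustration.

```python
import shutil
import subprocess


def build_dump_dom_command(url: str, binary: str = "google-chrome") -> list[str]:
    """Build a command line that loads `url` headlessly and prints the DOM."""
    # `binary` is an assumed executable name; adjust it for your system.
    return [binary, "--headless", "--dump-dom", url]


def dump_dom(url: str, binary: str = "google-chrome") -> str:
    """Run the browser without a window and return the serialized HTML."""
    if shutil.which(binary) is None:
        raise FileNotFoundError(f"{binary} was not found on PATH")
    result = subprocess.run(
        build_dump_dom_command(url, binary),
        capture_output=True,
        text=True,
        check=True,
        timeout=60,
    )
    return result.stdout


# Building the command needs no browser, so it is safe to show here.
print(build_dump_dom_command("https://web-scraping.dev/products"))
```

On a machine with Chrome installed, calling `dump_dom("https://web-scraping.dev/products")` should return the page's HTML without ever opening a window.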

Headless browsers are also typically faster and more resource-efficient than the traditional browser you are reading this article in. By skipping the resource-intensive work of drawing output on a screen, a headless browser can dedicate its resources to network communication and to loading and processing page content.

Headless Browser Use Cases

The performance, efficiency, and programmatic control capabilities of headless browsers make them a great tool to perform repetitive tasks like end-to-end testing, automation, and web scraping.

E2E Testing

E2E testing is a crucial step in an application's lifecycle. It involves programmatically simulating user interactions with the application's frontend to ensure that all user flows within a complex app behave correctly. These simulated interactions rarely require visual output and typically run in CI/CD pipelines, which makes them an obvious candidate for headless browsers.

Automation

Extending the E2E testing use case, programmatically simulating interactions with an application is also useful for automating repetitive tasks that could otherwise take weeks or months and tens or hundreds of employees to perform manually. For example, submitting hundreds of thousands of records through a web form whose backing endpoint you cannot access directly is a tedious task to do by hand, but a headless browser and some computing resources can reduce it to an hour's worth of work.
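The form-submission scenario could be sketched roughly as follows, using Playwright's synchronous Python API. The form URL, the `load_records` and `submit_records` helpers, and the assumption that every CSV column maps to an `<input name="...">` field are all hypothetical; a real form will need its own selectors.

```python
import csv
import io

FORM_URL = "https://example.com/contact"  # hypothetical form page


def load_records(csv_text: str) -> list[dict[str, str]]:
    """Parse CSV text into one dict per row, keyed by column name."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def submit_records(records: list[dict[str, str]], form_url: str = FORM_URL) -> None:
    """Fill in and submit the form once per record in a headless browser."""
    # Imported lazily so load_records() works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as pw:
        browser = pw.chromium.launch(headless=True)
        page = browser.new_page()
        for record in records:
            page.goto(form_url)
            for field, value in record.items():
                # Assumes each CSV column matches an <input name="..."> field.
                page.fill(f"input[name='{field}']", value)
            page.click("button[type='submit']")
            page.wait_for_load_state("networkidle")
        browser.close()


records = load_records("name,email\nAda,ada@example.com\nLinus,linus@example.com\n")
print(len(records))
```

Only the CSV parsing runs here; `submit_records(records)` would drive the browser once a real form URL and matching field names are supplied.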

Web Scraping

Now for the star of the show and the most popular use case for headless browsers: web scraping. The popularity of dynamic webpages (often referred to as single-page applications or client-side rendered applications) has made web scraping with a headless browser a necessity.

As those applications rely on JavaScript to render HTML onto the page, simply sending a GET request to the page's URL and parsing the HTML received no longer works. Using a headless browser for scraping allows the page to load and execute the JS code that ships with the initial HTML document. The resulting HTML can then be parsed and accessed programmatically to extract structured data.
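To illustrate the problem, here is a small sketch that uses only Python's standard library. Both HTML strings are made-up examples: `RAW_HTML` stands in for what a plain GET request might return from a client-side rendered page (the `/api/products` endpoint is hypothetical), and `RENDERED_HTML` stands in for the DOM after a browser has executed the script.

```python
from html.parser import HTMLParser

# Hypothetical response to a plain GET request: the product list is empty
# because JavaScript is expected to fill it in later.
RAW_HTML = """
<html><body>
  <div id="products"></div>
  <script>
    fetch("/api/products").then(r => r.json()).then(items => {
      for (const item of items) {
        const el = document.createElement("div");
        el.className = "product";
        el.textContent = item.name;
        document.getElementById("products").appendChild(el);
      }
    });
  </script>
</body></html>
"""

# Hypothetical DOM after a browser has run the script above.
RENDERED_HTML = """
<html><body>
  <div id="products"><div class="product">Box of Chocolate</div></div>
</body></html>
"""


class ProductCounter(HTMLParser):
    """Counts elements whose class attribute contains "product"."""

    def __init__(self) -> None:
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "product" in classes:
            self.count += 1


def count_products(html: str) -> int:
    parser = ProductCounter()
    parser.feed(html)
    return parser.count


print(count_products(RAW_HTML))       # 0: nothing to scrape in the raw HTML
print(count_products(RENDERED_HTML))  # 1: the product exists after rendering
```

The parser never executes the script, so the raw document contains no product elements; only a browser, headless or not, produces the second document.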

How to Scrape Dynamic Websites Using Headless Web Browsers

For a more in-depth look at headless browser web scraping technology, see this overview article we wrote a while back on what you should really be paying attention to when using these tools.

Common Headless Browser Tools

It is no surprise that the most common tools that leverage the capabilities of headless browsers are browser automation tools. These are mostly marketed as test automation tools, but they are just as well suited to general automation tasks and, most importantly, web scraping.

The most popular and widely-used browser automation tools are:

  1. Playwright
  2. Puppeteer
  3. Selenium
  4. Selenium-Wire
  5. Undetected Chromedriver

Playwright

Playwright is a relatively new (initial release 2020) open-source browser automation library created by Microsoft. It has gained a lot of popularity lately due to its intuitive API design, which overcomes many limitations of other popular libraries like Selenium and Cypress. It supports all modern browser engines and runs in both headed and headless modes.

Playwright was, as per its documentation, "created specifically to accommodate the needs of end-to-end testing", yet it is widely used by the web scraping community to build headless browser bots that extract data from dynamic websites.

Supported languages: Python, JS, Java, .NET

Pros:

  • Multi-Browser Support: Supports Chromium, Firefox, and WebKit, which covers most browser engines.
  • Cross-Platform: Works on Windows, macOS, and Linux, and supports multiple languages like JavaScript, Python, and C#.
  • Advanced Features: Provides powerful features like intercepting network requests, taking screenshots, generating PDFs, and simulating geolocation.
  • DevTools Protocol: Talks to Chromium over the Chrome DevTools Protocol (and uses comparable low-level protocols for Firefox and WebKit), enabling direct control over browser features.

Cons:

  • Larger API Surface: The API is more extensive compared to some other tools, leading to a steeper learning curve.
  • Relatively New: Less community support and third-party integrations compared to more established tools like Selenium.
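The request interception and screenshot features listed among the pros could look roughly like the sketch below, which uses Playwright's synchronous Python API (`page.route`, `route.abort`, `route.continue_`, `page.screenshot`). The set of blocked resource types and the function names are illustrative choices, not a recommendation from the Playwright docs.

```python
# Resource types to skip; an assumed set, tune it for your own scraper.
BLOCKED_RESOURCE_TYPES = {"image", "media", "font"}


def should_block(resource_type: str) -> bool:
    """Decide whether a request of this resource type should be aborted."""
    return resource_type in BLOCKED_RESOURCE_TYPES


def screenshot_without_heavy_assets(url: str, path: str = "page.png") -> None:
    """Load `url` while aborting heavy requests, then save a screenshot."""
    # Imported lazily so should_block() works without Playwright installed.
    from playwright.sync_api import sync_playwright

    def handle(route):
        # Every outgoing request passes through this handler.
        if should_block(route.request.resource_type):
            route.abort()
        else:
            route.continue_()

    with sync_playwright() as pw:
        browser = pw.chromium.launch(headless=True)
        page = browser.new_page()
        page.route("**/*", handle)  # intercept all requests
        page.goto(url)
        page.screenshot(path=path, full_page=True)
        browser.close()


print(should_block("image"), should_block("document"))
```

Calling `screenshot_without_heavy_assets("https://web-scraping.dev/products")` would save a full-page screenshot of the page loaded without images, media, or fonts, which is a common way to cut bandwidth when scraping.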

Here is an example that uses Playwright and Parsel in Python to scrape some products from web-scraping.dev:

from playwright.sync_api import sync_playwright
from parsel import Selector

with sync_playwright() as pw:
    browser = pw.chromium.launch(headless=True)
    context = browser.new_context(viewport={"width": 1920, "height": 1080})
    page = context.new_page()
    page.goto("https://web-scraping.dev/products")  # go to url
    page_html = page.content()  # HTML after JavaScript has run
    browser.close()

# parse the rendered HTML with Parsel
sel = Selector(text=page_html)
parsed = []
for product in sel.css(".product"):
    parsed.append({
        'name': product.css('h3::text').get(),
        'short_description': product.css('.short-description::text').get(),
        'url': product.css('h3 a::attr(href)').get(),
        'price': product.css('.price::text').get(),
        "image_url": product.css('img::attr(src)').get()
    })
print(parsed)

Web Scraping with Playwright and Python

For more see our in-depth introduction to scraping with Python and Playwright and an example project.

Puppeteer

Puppeteer is another open-source browser automation library, built by Google. It is widely used in the JS community for automation and web scraping tasks. It is not as popular as other tools because it is only available in JS and only supports Chrome and Firefox.

Supported languages: JS, Python (unofficial and unmaintained)

Pros:

  • Native Chromium Support: Puppeteer is developed by Google and provides native control over Chromium-based browsers, which leads to better stability.
  • Simpler API: Easy to learn and use due to its straightforward API.
  • Network Control: Supports features like request interception, network throttling, and custom headers.
  • Good Documentation: Well-documented and actively maintained by the developers.

Cons:

  • No WebKit Support: Lacks support for WebKit browsers like Safari.
  • Limited Language Support: Primarily supports JavaScript and TypeScript, with no official bindings for other languages.

Here is an example that uses Puppeteer and Cheerio in Node.js to scrape the same products we scraped using Playwright:

const puppeteer = require("puppeteer");
const cheerio = require("cheerio");

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.setViewport({ width: 1920, height: 1080 });

  await page.goto("https://web-scraping.dev/products");
  const pageHtml = await page.content();

  const $ = cheerio.load(pageHtml);
  const parsed = [];

  $(".product").each((index, element) => {
    parsed.push({
      name: $(element).find("h3").text().trim(),
      short_description: $(element).find(".short-description").text().trim(),
      url: $(element).find("h3 a").attr("href"),
      price: $(element).find(".price").text().trim(),
      image_url: $(element).find("img").attr("src"),
    });
  });

  console.log(parsed);

  await browser.close();
})();

How to Web Scrape with Puppeteer and NodeJS in 2024

For more see our in-depth introduction to scraping with Node.js and Puppeteer and an example project.

Selenium

Selenium is one of the most popular and widely used browser automation tools. Dating back to 2004, Selenium has had time to gather a huge ecosystem of plugins and extensive community support. It can be used with almost any programming language and all major web browsers, including good old Internet Explorer, which gives it an edge over other browser automation libraries.

Supported languages: Python, JavaScript, Java, C#, PHP, Go, Dart, R, Ruby, Perl, Haskell, Objective-C and more.

Pros:

  • Cross-Browser and Cross-Platform: Supports all major browsers (Chrome, Firefox, Safari, Edge, Opera) and platforms (Windows, Linux, macOS).
  • Large Community and Ecosystem: Extensive community support, libraries, and tools built around Selenium.
  • Extensive Documentation: Comprehensive documentation and a large number of learning resources.

Cons:

  • Slower Execution: Generally slower than Playwright or Puppeteer due to its architecture and reliance on WebDriver protocol.
  • Manual Waiting: Often requires manual handling of waits for elements to load, which can lead to flakiness in tests.
  • Limited Features for Modern Web Apps: Lacks some advanced features such as built-in support for intercepting network requests or mocking responses.
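The manual waiting mentioned above boils down to polling a condition until it holds or a timeout expires. The sketch below shows that idea in plain Python so it runs without a browser; `wait_until` and the fake `element_ready` condition are hypothetical helpers written for this illustration, not part of Selenium, whose built-in `WebDriverWait(driver, timeout).until(...)` plays the same role.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_until(
    condition: Callable[[], Optional[T]],
    timeout: float = 10.0,
    poll: float = 0.5,
) -> T:
    """Call `condition` repeatedly until it returns a truthy value.

    Raises TimeoutError if nothing truthy is returned within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)


# Stand-in for an element that only appears on the third check.
attempts = {"count": 0}


def element_ready() -> Optional[str]:
    attempts["count"] += 1
    return "element" if attempts["count"] >= 3 else None


found = wait_until(element_ready, timeout=2.0, poll=0.01)
print(found, attempts["count"])
```

With a real driver, the condition could be something like `lambda: driver.find_elements(By.CSS_SELECTOR, ".product")`, which returns an empty (falsy) list until matching elements exist.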

Here is an example that uses Selenium and Parsel in Python to scrape the same data as the examples above:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from parsel import Selector

options = Options()
options.add_argument("--headless")  # run Chrome without a visible window
options.add_argument("--window-size=1920,1080")
driver = webdriver.Chrome(options=options)

driver.get("https://web-scraping.dev/products")  # go to url
page_html = driver.page_source  # HTML after JavaScript has run
driver.quit()

# parse the rendered HTML with Parsel
sel = Selector(text=page_html)
parsed = []
for product in sel.css(".product"):
    parsed.append({
        'name': product.css('h3::text').get(),
        'short_description': product.css('.short-description::text').get(),
        'url': product.css('h3 a::attr(href)').get(),
        'price': product.css('.price::text').get(),
        "image_url": product.css('img::attr(src)').get()
    })
print(parsed)

Selenium's long-running development and its support for many languages and browsers bring significant benefits, but also limitations. Its API is dated, the developer experience is not very intuitive, and it lags behind on modern language features and browser APIs. That's why tools like Playwright and Puppeteer might be better alternatives.

Web Scraping with Selenium and Python Tutorial + Example Project

For more see our in-depth introduction to scraping with Python and Selenium and an example project.

Selenium-Wire and Undetected Chromedriver

Selenium's popularity undoubtedly extends to the web scraping community, as it also supports running in headless mode. This popularity has set the ground for developers to build extensions that solve problems not directly addressed by Selenium's core automation library. Such extensions include Selenium-Wire and Undetected Chromedriver.

1. Selenium-Wire

Selenium-Wire is a Python library that extends Selenium's browser automation library. It adds the ability to inspect and manipulate the background HTTP requests sent by the browser. This adds another dimension to web scraping with headless browsers, as we can now intercept the HTTP requests triggered by user interactions and extract valuable data from them.

Here is an example that intercepts testimonial requests on the web-scraping.dev testimonials page:

from seleniumwire import webdriver
from seleniumwire.utils import decode
from parsel import Selector
import time

options = webdriver.ChromeOptions()
options.add_argument("log-level=3")  # disable logs
driver = webdriver.Chrome(options=options)

driver.get("https://web-scraping.dev/testimonials")

def scroll(driver: webdriver.Chrome):
    """scroll to the bottom a few times to trigger the background requests"""
    for i in range(0, 5):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(1)

scroll(driver)

def parse_xhr_html(html: str):
    """parse review data from each xhr response body"""
    data = []
    selector = Selector(html)
    for review in selector.css("div.testimonial"):
        data.append({
            "rate": len(review.css("span.rating > svg").getall()),
            "text": review.css("p.text::text").get()
        })
    return data

# iterate over all the recorded XHR requests and parse each response body
data = []
for request in driver.requests:
    # skip requests that never received a response
    if "/testimonials" in request.url and request.response:
        response = request.response
        # undo any compression (gzip, brotli, ...) before decoding the text
        body = decode(response.body, response.headers.get("Content-Encoding", "identity"))
        data.extend(parse_xhr_html(body.decode("utf-8")))

driver.quit()
print(data)

You can learn more about Selenium-Wire in our dedicated article:

Selenium Wire Tutorial: Intercept Background Requests

For more see our full guide on using Selenium-Wire to power up Selenium-based web scrapers with request interception.

2. Undetected Chromedriver

Many websites have implemented measures that detect headless browser bots and scrapers and block them from accessing the website's pages. Enter Undetected Chromedriver: a modified Selenium webdriver with built-in measures to combat websites that block automated headless browsers from scraping their pages.

Here is an example that uses Undetected Chromedriver to bypass Cloudflare's crawler block:
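To give a feel for what such detection can key on, here is a deliberately simplified sketch of two well-known signals: the `HeadlessChrome` token that headless Chrome has traditionally put in its default user agent, and the `navigator.webdriver` flag that automated browsers expose to page scripts. The `looks_headless` function and the sample user-agent strings are illustrative only; real anti-bot systems combine many more signals.

```python
HEADLESS_UA_MARKERS = ("HeadlessChrome",)


def looks_headless(user_agent: str, webdriver_flag: bool = False) -> bool:
    """Naive check a website might run against an incoming visitor.

    `webdriver_flag` stands for the value of `navigator.webdriver`
    as reported by a script running in the visitor's page.
    """
    has_marker = any(marker in user_agent for marker in HEADLESS_UA_MARKERS)
    return webdriver_flag or has_marker


# Illustrative user-agent strings, not captured from real traffic.
HEADLESS_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) HeadlessChrome/120.0.0.0 Safari/537.36"
)
REGULAR_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)

print(looks_headless(HEADLESS_UA))                       # True
print(looks_headless(REGULAR_UA))                        # False
print(looks_headless(REGULAR_UA, webdriver_flag=True))   # True
```

Tools that aim to avoid blocking work by removing or masking tell-tale signals of this kind.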

import undetected_chromedriver as uc
import time

# Add the driver options
options = uc.ChromeOptions() 
options.headless = False

# Configure the undetected_chromedriver options
driver = uc.Chrome(options=options) 

with driver:
    # Go to the target website
    driver.get("https://nowsecure.nl/")
# Wait for security check
time.sleep(4)

# Take a screenshot
driver.save_screenshot('screenshot.png')
# Close the browsers
driver.quit()

Learn how to use Selenium's Undetected Chromedriver in our dedicated blog post:

Web Scraping Without Blocking With Undetected ChromeDriver

For more see this intro to Undetected ChromeDriver, a Selenium extension for bypassing many scraper blocking techniques. Hands-on tutorial and real-life example.

Cloud Browsers with ScrapFly API

Choosing the right browser automation tool for your specific use case is a hard decision. That's why we at Scrapfly are offering to do the heavy lifting for you by providing a generic, scalable and fully managed cloud-based browser automation API.

ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale.

Here's how easy it is to start and control a cloud browser with Scrapfly and Python:

from scrapfly import ScrapflyClient, ScrapeConfig

client = ScrapflyClient(key='Your Scrapfly Key')
api_result = client.scrape(ScrapeConfig(
    url="https://web-scraping.dev/login",
    # enable js
    render_js=True,
    # optionally - send browser actions to reach the pages you want scraped
    # define your scenario, see docs for available options https://scrapfly.io/docs/scrape-api/javascript-scenario
    js_scenario=[
        {"fill": {"selector": "input[name=username]", "value": "user123"}},
        {"fill": {"selector": "input[name=password]", "value": "password"}},
        {"click": {"selector": "button[type='submit']"}},
        {"wait_for_navigation": {"timeout": 5000}}
    ],
))
print(api_result.selector.css("#secret-message ::text").get())
# prints:
# "🤫"

Not only can you render JavaScript, you can also fully control the browser with instructions like button clicking, scrolling and text input!

FAQ

What is a headless browser?

A headless browser is a web browser without a graphical user interface, designed to run in environments without displays. It performs all the functions of a regular browser (rendering web pages, executing JavaScript, and handling network requests) but operates invisibly in the background.

What is the best headless browser automation tool?

Every tool has its pros and cons. However, Playwright has been gaining a lot of interest lately due to its feature-rich, cross-language, cross-browser API, which provides both asynchronous and synchronous client implementations.

What is a dynamic web application?

A dynamic web application, also referred to as a single-page application or client-side rendered application, is a type of web application that depends on JavaScript libraries and frameworks to render all or part of its content. This means that the HTML returned when the website is requested is missing some or all of that content, so traditional scraping methods that only fetch and parse the raw HTML cannot extract it.

Summary

In this article, we covered the ins and outs of headless browsers and their use cases, as well as the popular tools that utilize headless browsers for testing, automation, and web scraping.

The concept of a headless browser might seem simple, but the ecosystem around it is definitely much more complex, especially when it comes to scaling it to high volumes. This is why at Scrapfly, we take on the responsibility of building data collection infrastructure that scales to your needs.
