Background requests power much of the modern web, and capturing them can be a powerful web scraping technique.
In this tutorial, we'll take a look at how to scrape XHR. For this, we'll use a headless browser scraping technique: launch a real browser and collect the requests it makes in the background to extract the data.
For this tutorial, we'll take a look at Playwright - the most commonly used browser automation library for Python - and Scrapfly's SDK for our Scrapfly users. Both tools can be installed using the pip terminal command:
$ pip install playwright "scrapfly-sdk[all]"
Benefits of Scraping Background Requests
Modern web-scraping generally falls into one of two categories:
Low-level scraping using HTTP clients to request website data or their backend APIs.
High-level scraping using real web browsers through automation tools like Playwright, Selenium or Puppeteer.
Scraping XHR requests falls somewhere in between. While we'll be using a real web browser, we'll only be capturing its background requests and parsing them for data.
This is an accessible and maintainable approach to web scraping: even if the website changes its page structure, the background data requests often stay the same, so our scraper keeps working.
How to Capture Background Requests
Almost every major browser automation library offers background request and response capture features, and we cover all of these methods in our web scraping knowledge base.
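As a minimal sketch of the pattern using Playwright's sync API, the whole technique boils down to attaching a response listener before navigating:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    page = p.chromium.launch().new_page()
    # every response the browser receives is passed through this callback
    page.on("response", lambda response: print(response.url, response.status))
    page.goto("https://web-scraping.dev/product/1")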
Let's take a look at scraping background requests through an example project using web-scraping.dev - a scraper testing website developed by Scrapfly.
There are a few XHR-powered pages on web-scraping.dev that will be perfect for this tutorial.
Scraping Button Click Requests
To start, let's take a look at XHR generated by button clicks.
For this, we'll use product pages on web-scraping.dev like web-scraping.dev/product/1.
In this scenario, more product reviews are loaded as JSON when the user clicks on the "Load More" button. The click triggers a background request that loads more reviews.
So, to scrape this, we can apply the background request capture approach:
Initiate the browser and enable background request capture.
Load the page.
Wait for the initial page to load.
Find the "Load More" button and click it.
Parse captured background requests for review data.
Here's how to scrape button clicks using Python:
Playwright
Scrapfly
from playwright.sync_api import sync_playwright
import json

reviews = []

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()

    # enable our response interceptor
    def intercept_response(response):
        # capture all review requests and save the data
        if "/api/reviews" in response.request.url:
            reviews.extend(json.loads(response.text())["results"])
        return response

    page.on("response", intercept_response)

    # go to the page and click the load more reviews button
    page.goto("https://web-scraping.dev/product/1")
    page.wait_for_selector("#load-more-reviews")
    page.click("#load-more-reviews")
    page.wait_for_timeout(1000)

# finally, we can check the collected results:
for review in reviews:
    print(review["text"])
    print(review["rating"])
    print(review["date"])
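Note that the Playwright example above relies on a blunt fixed wait (wait_for_timeout). A more robust alternative is Playwright's expect_response context manager, which blocks until a matching response actually arrives - a sketch of how the click-and-wait step could look with it:

# instead of a fixed 1 second wait, block until the review XHR arrives
with page.expect_response(lambda r: "/api/reviews" in r.url) as response_info:
    page.click("#load-more-reviews")
reviews.extend(json.loads(response_info.value.text())["results"])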
import json
from scrapfly import ScrapflyClient, ScrapeConfig

scrapfly = ScrapflyClient(key="YOUR SCRAPFLY KEY")
result = scrapfly.scrape(
    ScrapeConfig(
        "https://web-scraping.dev/product/1",
        render_js=True,
        # we can supply an automation scenario: click a button and wait a bit
        js_scenario=[
            {"click": {"selector": "#load-more-reviews"}},
            {"wait": 1000},
        ],
    )
)
# retrieve background requests scrapfly captured:
xhr_calls = result.scrape_result["browser_data"]["xhr_call"]
# extract reviews:
reviews = []
for xhr in xhr_calls:
    if "/api/reviews" not in xhr["url"]:
        continue
    reviews.extend(json.loads(xhr["response"]["body"])["results"])
# output:
for review in reviews:
    print(review["text"])
    print(review["rating"])
    print(review["date"])
Scraping Endless Paging
Another popular use case is when XHRs are generated by scrolling.
For this example, the web-scraping.dev/testimonials endpoint uses endless pagination to load testimonial data. It works by using Javascript to detect scrolling changes and loading new testimonial HTML through XHR requests.
To scrape this page, we'll follow this simple algorithm:
Initiate the browser and enable background request capture.
Load the page.
Wait for the initial page to load.
Scroll to the bottom of the page until no new elements are loaded.
Parse captured background requests for testimonial data.
When it comes to Python, this can be achieved in almost every major browser automation library:
Playwright
Scrapfly
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

testimonials = ""

def intercept_response(response):
    # we can extract details from background requests
    global testimonials
    if "/api/testimonials" in response.request.url:
        testimonials += response.text() + "\n"
    return response

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    # enable our response interceptor
    page.on("response", intercept_response)
    page.goto("https://web-scraping.dev/testimonials")

    # scroll to the bottom:
    _prev_height = -1
    _max_scrolls = 100
    _scroll_count = 0
    while _scroll_count < _max_scrolls:
        # execute Javascript to scroll to the bottom of the page
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        # wait for new content to load (change this value as needed)
        page.wait_for_timeout(1000)
        # if the scroll height stopped changing, there are no more pages
        new_height = page.evaluate("document.body.scrollHeight")
        if new_height == _prev_height:
            break
        _prev_height = new_height
        _scroll_count += 1

# we collected all results; we can parse them with tools like BeautifulSoup
soup = BeautifulSoup(testimonials, "html.parser")
for testimonial in soup.select(".testimonial .text"):
    print(testimonial.text)
from scrapfly import ScrapflyClient, ScrapeConfig
from bs4 import BeautifulSoup

scrapfly = ScrapflyClient(key="YOUR SCRAPFLY KEY")
result = scrapfly.scrape(
    ScrapeConfig(
        "https://web-scraping.dev/testimonials/",
        render_js=True,
        auto_scroll=True,
        rendering_wait=2000,
    )
)
xhr_calls = result.scrape_result["browser_data"]["xhr_call"]
testimonials = ""
for xhr in xhr_calls:
    if "/api/testimonials" not in xhr["url"]:
        continue
    testimonials += xhr["response"]["body"] + "\n"
# we collected all results; we can parse them with tools like BeautifulSoup
soup = BeautifulSoup(testimonials, "html.parser")
for testimonial in soup.select(".testimonial .text"):
    print(testimonial.text)
By collecting background requests here, we didn't have to reverse-engineer the backend API, which can be obfuscated and require a lot of complex engineering to scrape.
Common Challenges
When scraping XHR responses there are a few common challenges that you might encounter.
Waiting for Page Load
To start, the headless browser needs to ensure the action that triggers a background request is executed.
If the trigger is the page load itself, then the browser needs to wait for the page to fully load. See our Scrapfly knowledgebase entries on how to ensure waiting in each headless browser tool.
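For example, in Playwright the wait can be made explicit rather than time-based - a brief sketch (page is a Playwright Page object as in the examples above):

# wait until the network goes idle (no requests for 500ms)
page.goto("https://web-scraping.dev/product/1")
page.wait_for_load_state("networkidle")
# or wait for a specific element that signals the page is ready
page.wait_for_selector("#load-more-reviews")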
Other times, the XHR trigger is a user action like a page scroll or a button click. All of these actions can be achieved in most browser automation libraries, though they must be executed the same way they would be in a real browser to ensure the XHR is triggered.
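For instance, in Playwright a scroll can be dispatched through the virtual mouse wheel rather than a direct Javascript call, which is more likely to fire the page's scroll listeners - a short sketch reusing the page object from earlier:

# scroll down 1000 pixels the way a real mouse wheel would
page.mouse.wheel(0, 1000)
page.wait_for_timeout(500)
# click the element itself so the page's click handlers fire
page.click("#load-more-reviews")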
FAQ
What is XHR?
XHR stands for XMLHttpRequest, however in modern web development it's used to refer to any background data request triggered by Javascript. Note that non-data requests like image loading, font loading and other assets are not considered XHR-type requests.
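In Playwright this distinction is exposed through the request's resource_type field, so a capture handler can keep only the data requests - a small sketch (page is a Playwright Page object as in the examples above):

def intercept_response(response):
    # "xhr" covers XMLHttpRequest calls and "fetch" covers the Fetch API;
    # images, fonts and stylesheets report other resource types
    if response.request.resource_type in ("xhr", "fetch"):
        print("data request:", response.url)

page.on("response", intercept_response)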
Can I block XHR requests?
Yes. Blocking XHR requests can be an important step in web scraping to reduce the amount of data that is loaded. This can save a lot of bandwidth and significantly speed up scraping. For more, see how to block resources in Playwright, Selenium or Puppeteer.
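As a quick Playwright sketch, route interception can drop requests by resource type before they are ever sent (page is a Playwright Page object as in the examples above):

def block_heavy_resources(route):
    # abort image, media and font downloads; let everything else through
    if route.request.resource_type in ("image", "media", "font"):
        route.abort()
    else:
        route.continue_()

page.route("**/*", block_heavy_resources)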
What is the difference between XHR and AJAX?
Both are used to refer to the same thing. XHR is the technical term for the background request while AJAX is the term for the technique of using XHR requests to load data in the background. That being said, the AJAX term is not used as much anymore in favor of XHR or simply data request.
What is the difference between XHR and Fetch?
Fetch is a modern Javascript API for making background requests and the successor to XMLHttpRequest. It comes included with all web browsers and is used by most websites for background requests.
What is the difference between XHR and API?
Every dynamic website has a backend API powering its data. This API is called by background requests (XHR) to load data from the server. So, XHR is the technique used to load data from the API.
What is the difference between XHR and GraphQL?
GraphQL is a query language for APIs. It's used by some websites to power their backend API. Many XHR requests are made to GraphQL APIs to load data. So, XHR is the technique used to load data from the GraphQL API.
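Capturing GraphQL-backed pages works just like the examples above - filter for the GraphQL endpoint instead. A hypothetical sketch (the /graphql path is an assumption; inspect the target site's network tab for the real endpoint):

captured = []

def intercept_response(response):
    # many sites route all queries through a single endpoint such as /graphql;
    # this path is an assumption and varies per website
    if "/graphql" in response.request.url:
        captured.append(response.json())

page.on("response", intercept_response)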
Summary
In this quick tutorial, we've taken a look at a powerful web scraping technique - background request capture.
This technique uses real web browsers to browse the web as a real user would and capture data requests that happen in the background. To execute it, we've taken a look at an example project scraping web-scraping.dev for background data coming from scrolling or button clicks.