Background requests power much of the modern web, and they can present a powerful web scraping opportunity.
In this tutorial, we'll take a look at how to scrape XHR. For this, we'll use a headless browser scraping technique: launch a real browser and collect the requests it makes in the background to extract the data.
For this tutorial, we'll take a look at Playwright - the most commonly used browser automation library for Python - and Scrapfly's SDK for our Scrapfly users. Both of these tools can be installed using the pip terminal command:
$ pip install playwright "scrapfly-sdk[all]"
Benefits of Scraping Background Requests
Modern web scraping generally falls into one of two categories:
Low-level scraping using HTTP clients to request website data or their backend APIs.
High-level scraping using real web browsers through automation tools like Playwright, Selenium or Puppeteer.
Scraping XHR requests falls somewhere in between. While we'll be using a real web browser, we'll only be capturing its background requests and parsing them for data.
This is a very accessible and maintainable approach to web scraping, as even if the website changes its internal behavior, we can often still capture the data we need.
How to Capture Background Requests
Almost every major browser automation library offers background request and response capture features. We cover all of these methods in our web scraping knowledge base.
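To illustrate the core idea before we dive in, here's a minimal sketch using Playwright's response event (covered in detail below). We attach a listener, and every response the browser receives - including background XHR - passes through it:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    # every response the browser receives passes through this callback,
    # including background XHR/fetch data requests
    page.on("response", lambda response: print(response.url, response.status))
    page.goto("https://web-scraping.dev/product/1")
    browser.close()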
Let's take a look at scraping background requests through an example project using web-scraping.dev, a scraper testing website developed by Scrapfly.
There are a few XHR-powered pages on web-scraping.dev that we can take a look at that will be perfect for this tutorial.
Scraping Button Click Requests
To start, let's take a look at XHR generated by button clicks.
For this, we'll use product pages on web-scraping.dev like web-scraping.dev/product/1.
In this scenario, more product reviews are loaded as JSON when the user clicks the "Load More" button. The click triggers a background request that loads more reviews.
So, to scrape this, we can apply the following approach:
Initiate the browser and enable background request capture.
Load the page.
Wait for the initial page to load.
Find the "Load More" button and click it.
Parse captured background requests for review data.
Here's how to scrape button clicks using Python:
Playwright
ScrapFly
from playwright.sync_api import sync_playwright
import json

reviews = []

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()

    # enable our response interceptor
    def intercept_response(response):
        global reviews
        # capture all review requests and save the data
        if "/api/reviews" in response.request.url:
            reviews.extend(json.loads(response.text())["results"])
        return response

    page.on("response", intercept_response)

    # go to the page and click the "load more reviews" button
    page.goto("https://web-scraping.dev/product/1")
    page.wait_for_selector("#load-more-reviews")
    page.click("#load-more-reviews")
    page.wait_for_timeout(1000)

# finally, we can check the collected results:
for review in reviews:
    print(review["text"])
    print(review["rating"])
    print(review["date"])
import json
from scrapfly import ScrapflyClient, ScrapeConfig

scrapfly = ScrapflyClient(key="YOUR SCRAPFLY KEY")
result = scrapfly.scrape(
    ScrapeConfig(
        "https://web-scraping.dev/product/1",
        render_js=True,
        # give scrapfly an automation scenario: click a button and wait a bit
        js_scenario=[
            {"click": {"selector": "#load-more-reviews"}},
            {"wait": 1000},
        ],
    )
)
# retrieve background requests scrapfly captured:
xhr_calls = result.scrape_result["browser_data"]["xhr_call"]
# extract reviews:
reviews = []
for xhr in xhr_calls:
    if "/api/reviews" not in xhr["url"]:
        continue
    reviews.extend(json.loads(xhr["response"]["body"])["results"])
# output:
for review in reviews:
    print(review["text"])
    print(review["rating"])
    print(review["date"])
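One fragile spot in the Playwright example above is the fixed page.wait_for_timeout(1000) wait. As a sketch of a more deterministic alternative, Playwright's expect_response() context manager can wait for the review request itself; this snippet would replace the click-and-wait lines in the script above:

# instead of clicking and sleeping, wait for the review XHR explicitly:
with page.expect_response("**/api/reviews*") as response_info:
    page.click("#load-more-reviews")
response = response_info.value
reviews.extend(json.loads(response.text())["results"])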
Scraping Endless Paging
Another popular use case is when XHRs are generated by scrolling.
For this example, the web-scraping.dev/testimonials endpoint uses endless pagination to load testimonial data. It works by using JavaScript to detect scroll changes and loading new testimonial HTML through XHR requests.
To scrape this page, we'll follow this simple algorithm:
Initiate the browser and enable background request capture.
Load the page.
Wait for the initial page to load.
Scroll to the bottom of the page until no new elements are loaded.
Parse captured background requests for testimonial data.
When it comes to Python, this can be achieved in almost every major browser automation library:
Playwright
ScrapFly
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

testimonials = ""

def intercept_response(response):
    # we can extract details from background requests
    global testimonials
    if "/api/testimonials" in response.request.url:
        testimonials += response.text() + "\n"
    return response

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    # enable our response interceptor
    page.on("response", intercept_response)
    page.goto("https://web-scraping.dev/testimonials")
    # scroll to the bottom:
    _prev_height = -1
    _max_scrolls = 100
    _scroll_count = 0
    while _scroll_count < _max_scrolls:
        # execute JavaScript to scroll to the bottom of the page
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        # wait for new content to load (change this value as needed)
        page.wait_for_timeout(1000)
        # check whether the scroll height changed - if not, we've reached the end
        new_height = page.evaluate("document.body.scrollHeight")
        if new_height == _prev_height:
            break
        _prev_height = new_height
        _scroll_count += 1

# we collected all results - parse them with tools like BeautifulSoup:
soup = BeautifulSoup(testimonials, "html.parser")
for testimonial in soup.select(".testimonial .text"):
    print(testimonial.text)
from bs4 import BeautifulSoup
from scrapfly import ScrapflyClient, ScrapeConfig

scrapfly = ScrapflyClient(key="YOUR SCRAPFLY KEY")
result = scrapfly.scrape(
    ScrapeConfig(
        "https://web-scraping.dev/testimonials/",
        render_js=True,
        auto_scroll=True,
        rendering_wait=2000,
    )
)
xhr_calls = result.scrape_result["browser_data"]["xhr_call"]
testimonials = ""
for xhr in xhr_calls:
    if "/api/testimonials" not in xhr["url"]:
        continue
    testimonials += xhr["response"]["body"] + "\n"
# we collected all results - parse them with tools like BeautifulSoup:
soup = BeautifulSoup(testimonials, "html.parser")
for testimonial in soup.select(".testimonial .text"):
    print(testimonial.text)
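A side note on the Playwright example: some infinite-scroll implementations listen for real scroll events rather than polling scroll position, so window.scrollTo() may not trigger them. A small sketch of an alternative using Playwright's mouse wheel emulation (the pixel values and iteration count here are arbitrary):

# emulate real mouse wheel scrolling instead of window.scrollTo():
for _ in range(10):
    page.mouse.wheel(0, 2000)  # scroll down 2000 pixels
    page.wait_for_timeout(500)  # give new content time to load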
By collecting background requests here, we didn't have to reverse-engineer the backend API, which can be obfuscated and require a lot of complex engineering to scrape.
Common Challenges
When scraping XHR responses, there are a few common challenges you might encounter.
Waiting for Page Load
To start, the headless browser needs to ensure that the action that triggers a background request is actually executed.
If the trigger is the page load itself, the browser needs to wait for the page to fully load. See these Scrapfly knowledgebase entries for how to ensure waiting in each headless browser tool.
Other times, the XHR trigger is a user action like a page scroll or a button click. All of these actions can be achieved in most browser automation libraries, though they must be executed the same way they would be in a real browser to ensure the XHR is triggered.
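For example, in Playwright both concerns can be handled with explicit waits; a minimal sketch reusing the product page and button selector from the earlier example:

# wait for the network to quiet down before doing anything:
page.goto("https://web-scraping.dev/product/1", wait_until="networkidle")
# or wait for a specific element that signals the page is ready:
page.wait_for_selector("#load-more-reviews")
# then trigger the XHR the same way a real user would:
page.click("#load-more-reviews")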
FAQ
What is XHR?
XHR stands for XMLHttpRequest; however, in modern web development the term is used to refer to any background data request triggered by JavaScript. Note that non-data requests like image loading, font loading, and other asset requests are not considered XHR-type requests.
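In Playwright, for instance, captured traffic can be narrowed down to data requests through the request's resource type; a short sketch:

def intercept_response(response):
    # only handle XHR/fetch data requests; skip images, fonts and other assets
    if response.request.resource_type in ("xhr", "fetch"):
        print(response.url)
    return response

page.on("response", intercept_response)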
Can I block XHR requests?
Yes. Blocking XHR requests can be an important step in web scraping to reduce the amount of data that is loaded. This can save a lot of bandwidth and significantly speed up scraping. For more, see how to block resources in Playwright, Selenium or Puppeteer.
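As a quick Playwright sketch, blocking is done through request routing; here we abort common asset types while letting data requests through:

# abort asset requests to save bandwidth, but let XHR/fetch data requests through
def block_assets(route):
    if route.request.resource_type in ("image", "font", "media", "stylesheet"):
        route.abort()
    else:
        route.continue_()

page.route("**/*", block_assets)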
What is the difference between XHR and AJAX?
Both terms refer to the same thing. XHR is the technical term for the background request, while AJAX is the term for the technique of using XHR requests to load data in the background. That said, the term AJAX is not used as much anymore in favor of XHR or simply "data request".
What is the difference between XHR and Fetch?
Fetch is a popular JavaScript API for making background (XHR-style) requests. It comes included with all web browsers and is used by most modern websites for background requests.
What is the difference between XHR and API?
Every dynamic website has a backend API powering its data. Background requests (XHR) call this API to load data from the server. So, XHR is the technique used to load data from the API.
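For example, once capture reveals the /api/reviews endpoint used above, the backend API could in principle be called directly with a plain HTTP client. A hedged sketch - the query parameters here are an assumption and would need to be copied from a real captured request:

import httpx

# hypothetical direct call to the backend API discovered through XHR capture;
# the parameters are assumed and should be taken from a captured request
response = httpx.get("https://web-scraping.dev/api/reviews", params={"product_id": "1"})
print(response.json())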
What is the difference between XHR and GraphQL?
GraphQL is a query language for APIs. It's used by some websites to power their backend API. Many XHR requests are made to GraphQL APIs to load data. So, XHR is the technique used to load data from the GraphQL API.
Summary
In this quick tutorial, we've taken a look at a powerful web scraping technique - background request capture.
This technique uses real web browsers to browse the web as a real user would and capture data requests that happen in the background. To execute it, we've taken a look at an example project scraping web-scraping.dev for background data coming from scrolling or button clicks.