Web Scraping Endless Paging
Endless paging, also known as dynamic or interactive paging, is a pagination technique where each new page of data is loaded dynamically on a user action such as scrolling or pressing a button.
This paging method can be difficult to scrape as it requires reverse engineering the requests the browser performs in the background to load the next page of data. Let's take a look at a real-life example to learn more.
Real-Life Example
For this example, let's take a look at the testimonial paging on web-scraping.dev/testimonials
We can see that testimonial results keep appearing as we scroll down in the browser window. We can confirm this further by disabling JavaScript, which leaves only the first set of testimonials visible.
To scrape this we have two options:
- Reverse engineer the way paging works and replicate it in our scraper
- Use browser automation tools like Playwright, Puppeteer or Selenium to emulate paging load actions
Both of these methods are valid, and while the first one is a bit more difficult, it's much more efficient. Let's take a look at it first.
Reverse Engineering Endless Paging
Using the browser developer tools (F12 key) Network tab, we can observe XHR-type requests being made when we trigger page load actions such as scrolling:
In our web-scraping.dev example, an XHR request is made every time we scroll to the end of the page, and this backend request returns the data for the next page. Let's see how to replicate this in our scraper, first with the Scrapfly SDK and then with plain HTTP clients (httpx for Python and axios for JavaScript):
from scrapfly import ScrapflyClient, ScrapeConfig, UpstreamHttpClientError

client = ScrapflyClient(key="YOUR SCRAPFLY KEY")
# the /api/testimonials endpoint requires a Referer header
HEADERS = {"Referer": "https://web-scraping.dev/testimonials"}

page_results = []
page = 1
while True:
    try:
        result = client.scrape(
            ScrapeConfig(
                f"https://web-scraping.dev/api/testimonials?page={page}",
                headers=HEADERS,
            )
        )
    except UpstreamHttpClientError:
        # the API responds with an error status once there are no more pages
        break
    page_results.append(result.content)
    page += 1
print(f"scraped {len(page_results)} pages")
# scraped 6 pages
import { ScrapflyClient, ScrapeConfig } from 'scrapfly-sdk';

const client = new ScrapflyClient({ key: "YOUR SCRAPFLY KEY" });
const HEADERS = { "Referer": "https://web-scraping.dev/testimonials" };

let pageResults = [];
let page = 1;
while (true) {
    try {
        const result = await client.scrape(
            new ScrapeConfig({
                url: `https://web-scraping.dev/api/testimonials?page=${page}`,
                headers: HEADERS,
            }),
        );
        pageResults.push(result.result.content);
        page += 1;
    } catch {
        break;
    }
}
console.log(`Scraped ${pageResults.length} pages`);
// scraped 6 pages
import httpx  # or requests

# note: the /api/testimonials endpoint requires a Referer header:
HEADERS = {"Referer": "https://web-scraping.dev/testimonials"}

page_results = []
page = 1
while True:
    response = httpx.get(f"https://web-scraping.dev/api/testimonials?page={page}", headers=HEADERS)
    if response.status_code != 200:
        break
    page_results.append(response.text)
    page += 1
print(f"scraped {len(page_results)} pages")
# scraped 6 pages
import axios from 'axios'; // axios as the HTTP client

const HEADERS = { "Referer": "https://web-scraping.dev/testimonials" };

let pageResults = [];
let page = 1;
while (true) {
    try {
        const response = await axios.get(`https://web-scraping.dev/api/testimonials?page=${page}`, { headers: HEADERS });
        if (response.status !== 200) {
            break;
        }
        pageResults.push(response.data);
        page += 1;
    } catch (error) {
        // axios throws on non-2xx responses - no more pages
        break;
    }
}
console.log(`Scraped ${pageResults.length} pages`);
// scraped 6 pages
Above, we replicate the endless scroll pagination by calling the backend API directly in a scraping loop, incrementing the page parameter until no more results are returned.
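The collected pages can then be parsed like any other HTML. Here's a minimal parsing sketch using parsel (BeautifulSoup would work just as well), assuming the API returns HTML fragments that use the same .testimonial and .text markup as the page itself:
from parsel import Selector  # pip install parsel

testimonials = []
for html in page_results:
    # assumption: each collected page is an HTML fragment containing .testimonial elements
    selector = Selector(text=html)
    testimonials.extend(selector.css(".testimonial .text::text").getall())
print(f"parsed {len(testimonials)} testimonials")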
Endless Paging with Headless Browsers
Alternatively, we can run an entire headless browser and tell it to perform the dynamic paging actions to load all of the paging data, then retrieve it.
There are a few ways to approach this, so let's take a look at our web-scraping.dev example again.
The first approach is to load the /testimonials page, scroll to the bottom of the page and then collect all of the results:
from scrapfly import ScrapflyClient, ScrapeConfig

client = ScrapflyClient(key="YOUR SCRAPFLY KEY")
result = client.scrape(
    ScrapeConfig(
        "https://web-scraping.dev/testimonials",
        render_js=True,  # enable javascript rendering through headless browsers
        auto_scroll=True,  # scroll to the bottom of the page
    )
)
testimonials = result.selector.css(".testimonial .text::text").getall()
print(f"scraped {len(testimonials)} testimonials")
# scraped 40 testimonials
import { ScrapflyClient, ScrapeConfig } from 'scrapfly-sdk';
import cheerio from 'cheerio';

const client = new ScrapflyClient({ key: "YOUR SCRAPFLY KEY" });
const result = await client.scrape(
    new ScrapeConfig({
        url: "https://web-scraping.dev/testimonials",
        render_js: true, // enable javascript rendering through headless browsers
        auto_scroll: true, // scroll to the bottom of the page
    }),
);
let selector = cheerio.load(result.result.content);
let testimonials = selector(".testimonial .text").map((i, el) => selector(el).text()).get();
console.log(`Scraped ${testimonials.length} testimonials`);
// scraped 40 testimonials
# This example is using Playwright but it's also possible to use Selenium with a similar approach
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://web-scraping.dev/testimonials/")

    # scroll to the bottom:
    _prev_height = -1
    _max_scrolls = 100
    _scroll_count = 0
    while _scroll_count < _max_scrolls:
        # Execute JavaScript to scroll to the bottom of the page
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        # Wait for new content to load (change this value as needed)
        page.wait_for_timeout(1000)
        # Check whether the scroll height changed - if not, there are no more pages
        new_height = page.evaluate("document.body.scrollHeight")
        if new_height == _prev_height:
            break
        _prev_height = new_height
        _scroll_count += 1

    # now we can collect all loaded data:
    results = []
    for element in page.locator(".testimonial").element_handles():
        text = element.query_selector(".text").inner_html()
        results.append(text)
    print(f"scraped {len(results)} results")
    # scraped 40 results
// This example is using Puppeteer but it's also possible to use Playwright with an almost identical API
const puppeteer = require('puppeteer');

async function scrapeTestimonials() {
    const browser = await puppeteer.launch({ headless: false });
    const page = await browser.newPage();
    await page.goto('https://web-scraping.dev/testimonials/');

    let prevHeight = -1;
    let maxScrolls = 100;
    let scrollCount = 0;
    while (scrollCount < maxScrolls) {
        // Scroll to the bottom of the page
        await page.evaluate('window.scrollTo(0, document.body.scrollHeight)');
        // Wait for new content to load
        // (note: waitForTimeout was removed in newer Puppeteer versions; a setTimeout-based delay can be used instead)
        await page.waitForTimeout(1000);
        // Check whether the scroll height changed - if not, there are no more pages
        let newHeight = await page.evaluate('document.body.scrollHeight');
        if (newHeight == prevHeight) {
            break;
        }
        prevHeight = newHeight;
        scrollCount += 1;
    }

    // Collect all loaded data
    let elements = await page.$$('.testimonial');
    let results = [];
    for (let element of elements) {
        let text = await element.$eval('.text', node => node.innerHTML);
        results.push(text);
    }
    console.log(`Scraped ${results.length} results`);
    // Scraped 40 results
    await browser.close();
}
scrapeTestimonials();
Using headless browsers is a much simpler and more elegant solution; however, since we're running a real browser, it uses significantly more resources.
So, whether to reverse engineer the backend API or use a headless browser depends on the available resources and the complexity of the targeted system.