How to Scrape Hidden APIs

Modern websites use hidden APIs to render dynamic data on the fly through background requests. This is often used for product pagination, search functionality and other dynamic page parts, so scraping these backend APIs is often an unavoidable challenge.

In this web scraping tutorial, we'll take a look at how to find hidden APIs, how to scrape them, and what common challenges people face when developing web scrapers for hidden APIs.

We'll use the browser's devtools to find and reverse engineer backend API behavior. For this, we'll be scraping real APIs hosted on ScrapFly's web-scraping.dev playground for testing web scrapers.

Alternatively - Scrape using Real Web Browsers

An alternative to using hidden APIs is to scrape using real headless web browsers. For that, see our introduction to scraping using real web browsers.

Setup

For the scraper code in this article, we'll be using Python with a few community packages:

  • httpx - a feature-rich HTTP client that we'll use to interact with the hidden APIs we find.
  • parsel - an HTML parsing library for extracting data from HTML documents.

We'll also provide a Scrapfly version of the code snippets for our Scrapfly users, so we'll be using the scrapfly-sdk package as well.
All of these packages can be installed using the pip terminal command:

$ pip install httpx parsel scrapfly-sdk

Intro to Devtools

To find hidden APIs, we'll be using browser developer tools that come included in every web browser such as Chrome or Firefox.

The devtools suite features a variety of developer tools which can help us to discover and understand hidden APIs the websites are using.

To start, the devtools can be accessed using the F12 key or right-click -> inspect function:

Devtools can be opened by right click -> inspect or the F12 key

Note: we'll be referencing the Chrome browser's devtools suite in this tutorial, though the suite is similar in every web browser.

The devtools suite reveals the state of the web browser and the current page. It can tell us everything about what's going on in the background and what's available on the page.

The suite is made up of several tabs, each with a specific purpose: the current document structure, styling details, the JavaScript context, and what's going on in the background.
For scraping hidden web APIs we're mostly interested in the Network tab, so let's take a look at it!

Tracking Network Activity

The Network tab of devtools shows all the connections our browser is making when loading a web page:

screengrab of rows of captured requests by network inspector

Each row represents a single request and response pair. Clicking on it will show us more details which we can inspect to see the exact request and response details.

As this tool is quite big, let's take a look at the main features used for understanding hidden APIs:

Flags to Preserve Log and Disable Cache

flags for devtools network inspector

Preserve log prevents the request log from being cleared on page reload, and Disable cache prevents the browser from using cached data. With these flags we'll prevent accidental data loss and ensure we see all requests.

Filter by XHR and Doc

Requests can be filtered by resource type. For hidden APIs we're mostly interested in the XHR and Doc types. XHR stands for XMLHttpRequest, which basically means a background data request (recent Chrome versions label this filter Fetch/XHR).

Filter by text

We can filter for specific requests by their URL or other properties. This is most useful when we know what we're looking for, though filtering by /api is often a good start.

Text Search (ctrl+f)

Using search we can find the exact request responsible for selected content

We can search for specific text in any network activity. This tool is brilliant for finding requests that generate specific data points - just find a unique identifier and look up what generated it!

Request Headers

Shows the request's headers. This is vital for understanding hidden APIs as many require certain header values to be accessed. See the Headers in Hidden APIs section for more.

Request Payload

Shows the request's payload. Replicating the payload in the exact format seen in devtools is vital for successful requests, though usually it's standard JSON or Formdata encoding.

Right-click -> Copy -> Copy as cURL

This shortcut copies the request as a cURL command, which we can use to replicate the request in code. When combined with tools like curlconverter, we can generate Python code straight from devtools requests! Though it's worth noting that generated requests often need some manual adjustments.
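
To make the copied cURL command less of a black box, here is a minimal sketch of how such a command maps to request parts (method, URL, headers, body). The parse_curl helper is hypothetical and only understands a small subset of cURL flags, and the copied command below is an illustrative reconstruction rather than exact devtools output. For real use, a dedicated converter is likely the better choice.

```python
import shlex


def parse_curl(command: str) -> dict:
    """Parse a small subset of a 'Copy as cURL' command into request parts."""
    # join lines that were continued with a trailing backslash
    tokens = shlex.split(command.replace("\\\n", " "))
    request = {"method": None, "url": None, "headers": {}, "body": None}
    i = 1  # skip the leading "curl"
    while i < len(tokens):
        token = tokens[i]
        if token in ("-H", "--header"):
            name, _, value = tokens[i + 1].partition(":")
            request["headers"][name.strip()] = value.strip()
            i += 2
        elif token in ("-X", "--request"):
            request["method"] = tokens[i + 1]
            i += 2
        elif token in ("-d", "--data", "--data-raw", "--data-binary"):
            request["body"] = tokens[i + 1]
            i += 2
        elif token.startswith("-"):
            # simplification: other flags are assumed to take no value (e.g. --compressed)
            i += 1
        else:
            request["url"] = token
            i += 1
    if request["method"] is None:
        # like cURL itself: a request with a body defaults to POST
        request["method"] = "POST" if request["body"] is not None else "GET"
    return request


# illustrative command, similar in shape to what devtools copies
copied = """curl 'https://web-scraping.dev/api/testimonials?page=2' \\
  -H 'Referer: https://web-scraping.dev/testimonials' \\
  -H 'X-Secret-Token: secret123' \\
  --compressed"""

parsed = parse_curl(copied)
print(parsed["method"], parsed["url"])
print(parsed["headers"])
```

The resulting parts map directly onto the arguments of an HTTP client call: the method and URL, a headers dictionary and an optional body.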


Using the Network Inspector, we can find hidden APIs and figure out how to replicate them in our web scraper. Let's look at each part, common problems, and real-life examples.

Request Types in Hidden APIs

Before we begin scraping web APIs, it's important to understand the two most common request types used in API requests and their differences.

Generally, hidden APIs communicate through either GET or POST requests.

The main difference from the web scraper's perspective is how we pass API arguments. We use URL parameters for GET requests, and request body for POST requests.

illustration of GET vs POST request usage
The difference between GET and POST requests is the way parameters are transferred
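
As a rough illustration of that difference, the sketch below encodes the same arguments three ways without sending anything: as a GET query string, as a Formdata body and as a JSON body. The argument names (page, query) and the httpbin.dev/get URL are made up for the example.

```python
import json
from urllib.parse import urlencode

# hypothetical API arguments
arguments = {"page": 2, "query": "web scraping"}

# GET: the arguments travel in the URL as a query string
get_url = "https://httpbin.dev/get?" + urlencode(arguments)

# POST Formdata: the same encoding, but placed in the request body
form_body = urlencode(arguments)
form_headers = {"Content-Type": "application/x-www-form-urlencoded"}

# POST JSON: the arguments are serialized as a JSON document in the body
json_body = json.dumps(arguments)
json_headers = {"Content-Type": "application/json"}

print(get_url)    # https://httpbin.dev/get?page=2&query=web+scraping
print(form_body)  # page=2&query=web+scraping
print(json_body)  # {"page": 2, "query": "web scraping"}
```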

For POST requests, special attention needs to be paid to the request headers, particularly the Content-Type header, which indicates the format of the request body. In most cases it's either Formdata (application/x-www-form-urlencoded) or JSON (application/json).

Python
import httpx

# 1. example of how to POST JSON 
response = httpx.post(
   "https://httpbin.dev/post", 
   json={
    "key1": "value1",
    "key2": "value2"
  }
)
# we can see that this request was sent with content type and payload as:
response.json()['headers']['Content-Type']
# ['application/json']
response.json()['data']
# '{"key1": "value1", "key2": "value2"}'

# 2. example how to POST Formdata
response = httpx.post(
   "https://httpbin.dev/post", 
   data={
    "key1": "value1",
    "key2": "value2"
  }
)
# we can see that this request was sent with content type and payload as:
response.json()['headers']['Content-Type']
# ['application/x-www-form-urlencoded']
response.json()['data']
# 'key1=value1&key2=value2'

ScrapFly
from scrapfly import ScrapflyClient, ScrapeConfig, ScrapeApiResponse

scrapfly = ScrapflyClient(key="YOUR SCRAPFLY KEY")
# POST JSON
result: ScrapeApiResponse = scrapfly.scrape(ScrapeConfig(
    headers={"content-type": "application/json"},
    url="https://httpbin.dev/post",
    body="{\"example\":\"value\"}",
    method="POST",
))
# POST Formdata - the body is URL-encoded rather than JSON
result: ScrapeApiResponse = scrapfly.scrape(ScrapeConfig(
    headers={"content-type": "application/x-www-form-urlencoded"},
    url="https://httpbin.dev/post",
    body="example=value",
    method="POST",
))

Generally speaking, POST requests are used for bigger, more complex actions where GET is insufficient.
For example, a search API may use a POST request to allow for more complex search parameters and filters, while a pagination API may use a GET request for simplicity.

Headers in Hidden APIs

Another key component to hidden API scraping is the request headers.

When we look around our Network inspector, we can see that the requests our browser sends contain all sorts of different headers. Where are they coming from?

We can separate these headers into three categories:

Default Browser Headers
Provided by the browser like User-Agent, Accept-Language, etc. The browser generates these for each request it makes. It's important to replicate them in scraping.

Contextual Headers
Generated by the browser based on user action. For example, clicking a link will add a Referer header. It's important to maintain this in scraper logic.

Custom Headers
Generated by custom JavaScript code present on the website. These are often prefixed by X-, like X-CSRF-Token, and are often required. Missing these headers will usually cause the API to respond with 4xx or 5xx status codes.

Most HTTP clients configure some headers by default though we can override and add headers ourselves:

Python
import httpx
httpx.get(
  "http://httpbin.dev/headers",
  headers={
    "Origin": "http://google.com",
    "Referer": "http://google.com/search",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.54 Safari/537.36",
    "X-CSRF-Token": "123",
  }
)

ScrapFly
from scrapfly import ScrapflyClient, ScrapeConfig, ScrapeApiResponse

scrapfly = ScrapflyClient(key="YOUR SCRAPFLY KEY")
result: ScrapeApiResponse = scrapfly.scrape(ScrapeConfig(
    headers={
      "Origin": "http://google.com",
      "Referer": "http://google.com/search",
      "X-CSRF-Token": "123",
    },
    url="https://httpbin.dev/headers",
))
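
One way to keep the three header categories apart in scraper code is sketched below. The build_headers helper is hypothetical and the Accept-Language value is an assumption; the other values are the ones used in this article.

```python
from typing import Dict, Optional

# 1. default browser headers: the same for every request in a scraping session
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.54 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",  # assumed value
}


def build_headers(referer: str, custom: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Combine default, contextual and custom headers for a single API call."""
    headers = dict(BROWSER_HEADERS)  # copy so the defaults are never mutated
    headers["Referer"] = referer     # 2. contextual: the page that triggers the API call
    headers.update(custom or {})     # 3. custom: values set by the website's JavaScript
    return headers


headers = build_headers(
    referer="https://web-scraping.dev/testimonials",
    custom={"X-Secret-Token": "secret123"},
)
for name, value in headers.items():
    print(f"{name}: {value}")
```

The resulting dictionary can be passed as the headers argument of httpx.get or httpx.post.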

To summarize, when scraping backend APIs it's vital to replicate the data the browser is sending for requests to succeed.

Now with all the details out of the way, let's take a look at some real hidden API scenarios!

Real Life Walkthrough

Let's go through a real-life scenario and see how we can scrape a hidden API using Python and browser developer tools.

We'll be taking a look at a scenario hosted on the web-scraping.dev playground.
Using devtools, we'll investigate how this web page functions and connects to the hidden API, and replicate all that in a Python scraper.

The web-scraping.dev/testimonials page features endless pagination of testimonials. New testimonials are being loaded from a hidden HTML API when the user scrolls the page.

With our devtools open, we should focus on the XHR filter of the Network tab and take a look at what's happening when we scroll the page:

we can see pages of testimonials load in the background

We can see that the browser is making a GET request to https://web-scraping.dev/api/testimonials?page=2 when we scroll the page. This is the hidden API we're looking for!

Now, to scrape this we should take a look at the request details so we can replicate the request in our Python code:

screencapture of testimonials background request capture by chrome devtools

Here's what we can make out:

  • The URL format for the api is GET https://web-scraping.dev/api/testimonials?page={page_number}
  • Custom headers are being sent. In particular X-Secret-Token - this custom header is essentially a password for the API.
  • Standard browser headers like Referer and User-Agent are being sent, which are required by many API endpoints.

This request then returns an HTML document with a page of testimonials. Let's put everything together and see how this looks in a Python scraper:

Python
import httpx
from parsel import Selector

api_url = "https://web-scraping.dev/api/testimonials"
# we'll use a simple loop to keep scraping the next page until there are no more results
page_number = 1
while True:
    print(f"scraping page {page_number}")
    response = httpx.get(
        api_url,
        params={"page": page_number},
        headers={
            # this API requires these headers:
            "Referer": "https://web-scraping.dev/testimonials",
            "X-Secret-Token": "secret123",
        }

    )

    # check if the scrape was successful:
    if response.status_code != 200:
        # last page reached
        if response.json()['detail']['error'] == 'invalid page':
            break
        # something else went wrong like missing header or wrong url?
        raise ValueError("API returned an error - something is missing?", response.json())

    # parse the HTML
    selector = Selector(response.text)
    for testimonial in selector.css('.testimonial'):
        text = testimonial.css('.text::text').get()
        rating = len(testimonial.css('.rating>svg').getall())
        print(text)
        print(f'rating: {rating}/5 stars')
        print('-------------')

    # next page!
    page_number += 1

ScrapFly
import json
from urllib.parse import urlencode
from scrapfly import ScrapflyClient, ScrapeConfig, UpstreamHttpClientError

client = ScrapflyClient(key="YOUR SCRAPFLY KEY")

api_url = "https://web-scraping.dev/api/testimonials"
page_number = 1
while True:
    print(f"scraping page {page_number}")
    try:
        result = client.scrape(ScrapeConfig(
            api_url + "?" + urlencode({"page": page_number}),
            headers={
                # this API requires these headers:
                "Referer": "https://web-scraping.dev/testimonials",
                "X-Secret-Token": "secret123",
            }
        ))
    except UpstreamHttpClientError as error:
        # check if the scrape was successful:
        if error.http_status_code != 200:
            # last page reached
            data = json.loads(error.api_response.content)
            if data['detail']['error'] == 'invalid page':
                break
        # something else went wrong like missing header or wrong url?
        # note: `result` is not assigned when the scrape fails, so we report the error response instead
        raise ValueError("API returned an error - something is missing?", error.api_response.content)

    # parse the HTML
    for testimonial in result.selector.css('.testimonial'):
        text = testimonial.css('.text::text').get()
        rating = len(testimonial.css('.rating>svg').getall())
        print(text)
        print(f'rating: {rating}/5 stars')
        print('-------------')

    # next page!
    page_number += 1

Above, we're reimplementing what we've observed in our developer tools. We're making requests to the /api/testimonials endpoint with the page URL parameter for the page number and the secret authentication header we found in the inspector. Then, we parse the HTML results using the usual HTML parsing tools.
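
The stop condition of this pagination loop can also be separated from the HTTP call, which makes it testable offline and lets us cap the number of pages in case the API never reports the last one. The sketch below is one possible arrangement: paginate, is_last_page and fake_fetch are hypothetical names, fake_fetch stands in for the real request, and the 400 status code it returns is an assumption about how the API signals the last page.

```python
import json
from typing import Callable, Iterator, Tuple


def is_last_page(body: str) -> bool:
    """True when the error body carries the 'invalid page' error."""
    try:
        return json.loads(body)["detail"]["error"] == "invalid page"
    except (ValueError, KeyError, TypeError):
        return False  # not JSON, or not the expected error shape


def paginate(fetch_page: Callable[[int], Tuple[int, str]], max_pages: int = 100) -> Iterator[str]:
    """Yield page bodies until the last page or the max_pages cap is reached."""
    for page_number in range(1, max_pages + 1):
        status_code, body = fetch_page(page_number)
        if status_code == 200:
            yield body
        elif is_last_page(body):
            return
        else:
            raise ValueError(f"unexpected API response on page {page_number}: {status_code}")


def fake_fetch(page_number: int) -> Tuple[int, str]:
    """Stand-in for the real HTTP call: three pages, then an 'invalid page' error."""
    if page_number > 3:
        return 400, '{"detail": {"error": "invalid page"}}'
    return 200, f"<div class='testimonial'>page {page_number}</div>"


pages = list(paginate(fake_fetch))
print(len(pages))  # 3
```

In a real scraper, fetch_page would wrap the httpx.get call and return response.status_code and response.text.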

This example illustrates how a backend API is reverse engineered using the Chrome devtools Network Inspector, which is a very common web scraping pattern.

Finding Keys and Tokens

When it comes to hidden API scraping, replicating the headers and values we see in devtools is very important for accessing hidden APIs successfully.

In our scraper example we just hard-coded the required X-Secret-Token: secret123 header, but what if this value was dynamic and changed often? Our scrapers need to know how to retrieve or generate this token to continue scraping.

Finding Values in HTML

By far the most common scenario is that the values used by the backend API are stored in the corresponding HTML body. In our /testimonials example, we can open up the page source or the Elements tab of devtools and ctrl+f for the known value - secret123:

capture of token lookup in elements devtools
We can see that the secret token is hidden in a script element in HTML

This means that we can easily pick this value up by scraping the HTML and extracting it for our backend API scraper.

Alternative Locations

Here are a few possibilities where these values could be located:

  • JavaScript files can contain secret values or functions that generate them.
  • Random-looking values could be just random. uuid4 is the most commonly used way to generate random values in these cases.
  • The value could be stored in a cookie. In this case it would be seen in the Cookie header in the devtools request details.
  • IndexedDB and Local Storage browser databases. These can be observed in the devtools->Application->IndexedDB or Local Storage sections.

Generally, if the value is stored somewhere, a content search in the Network Inspector tab should reveal its location. Note that some values are perfectly safe to hard-code as they can be version numbers or other static details that are required for functionality but rarely change.
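
Two of these cases are simple enough to sketch with the standard library alone: generating a random uuid4 value ourselves, and reading a value out of a Cookie header copied from devtools. The cookie contents and the X-Request-Id and X-CSRF-Token header names below are hypothetical placeholders.

```python
import uuid
from http.cookies import SimpleCookie

# 1. random-looking identifier: if the website just generates a uuid4, so can we
request_id = str(uuid.uuid4())

# 2. value stored in a cookie: parse the Cookie header seen in devtools
cookie_header = "session=abc123; csrf_token=token-from-cookie"  # hypothetical header value
cookies = SimpleCookie()
cookies.load(cookie_header)
csrf_token = cookies["csrf_token"].value

# hypothetical header names - use the ones observed in the Network tab
headers = {
    "X-Request-Id": request_id,
    "X-CSRF-Token": csrf_token,
}
print(headers)
```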

Bypass Blocking with Scrapfly

While hidden APIs are generally easy to scrape, they are often protected by the same anti-scraping systems as the rest of the website, which can result in scraper blocking.

ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale.

With Scrapfly we can scrape hidden APIs at scale without worrying about blocking. Here's an example of using Python SDK to scrape hidden APIs:

import os
from scrapfly import ScrapflyClient, ScrapeConfig

# the API key is read from the SCRAPFLY_KEY environment variable
client = ScrapflyClient(key=os.environ['SCRAPFLY_KEY'])

result = client.scrape(ScrapeConfig(
    "https://web-scraping.dev/api/testimonials?page=1",
    headers={
        # this API requires these headers:
        "Referer": "https://web-scraping.dev/testimonials",
        "X-Secret-Token": "secret123",
    }
))
for testimonial in result.selector.css('.testimonial'):
    text = testimonial.css('.text::text').get()
    rating = len(testimonial.css('.rating>svg').getall())
    print(text)
    print(f'rating: {rating}/5 stars')

FAQ

To wrap up our hidden api scraping intro let's take a look at some frequently asked questions:

Is it legal to scrape hidden APIs?

Yes, it's perfectly legal to scrape hidden APIs as long as they are part of a publicly available website. Note that this area is still not well understood legally around the world, so for high scale scraping you should consult a lawyer in your country. Generally, websites prefer to be scraped through hidden APIs rather than public pages, as this uses significantly fewer resources and doesn't interfere with web analytics like headless browsers could.

What are the dangers of scraping hidden APIs?

Since hidden APIs are by design not public, they can change at any point without warning, breaking web scrapers. When scraping hidden APIs it's a good idea to keep the scraper as simple as possible to prevent crashing on minor API updates.
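
As a rough sketch of what keeping it simple can mean in practice, the parser below reads only the fields it needs and skips items with an unexpected shape instead of crashing. The JSON structure (results, text, rating) is hypothetical and not taken from any API in this article.

```python
from typing import Any, Dict, List


def parse_items(payload: Any) -> List[Dict[str, Any]]:
    """Extract only the fields we need, skipping anything malformed."""
    items = (payload.get("results") or []) if isinstance(payload, dict) else []
    parsed = []
    for item in items:
        if not isinstance(item, dict) or "text" not in item:
            continue  # unexpected shape: skip instead of crashing
        parsed.append({"text": item["text"], "rating": item.get("rating")})
    return parsed


# the same parser keeps working after the (hypothetical) API adds new fields
old_format = {"results": [{"text": "great", "rating": 5}]}
new_format = {
    "results": [{"text": "great", "rating": 5, "author": {"id": 1}}, {"id": 2}],
    "meta": {"version": 2},
}
print(parse_items(old_format))
print(parse_items(new_format))
```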

Can I scrape hidden APIs using a headless browser?

Yes, a common way to discover hidden APIs is to use a headless browser and capture all background requests. These captured requests can be analyzed and replicated in non-browser web scrapers.

How to prevent scraper blocking when scraping hidden APIs?

The key to hidden API scraping is to handle request details like headers, cookies and payload correctly. Additionally, some pages might be protected by anti-bot systems which require additional bypass measures. For that, see our introduction to scraper blocking bypass.

Summary

In this tutorial, we've taken a look at how to scrape hidden web APIs, which is one of the most efficient and powerful web scraping techniques. To find hidden APIs we used the network inspector tool that comes with every web browser. Then, we replicated hidden API requests in our Python scraper. We also looked into common challenges like required headers through a real-life example.
