Hidden Web Data

Dynamic web pages use JavaScript to render page structure and content, and that data has to come from somewhere. We've taken a look at hidden APIs that deliver data through background requests, but sometimes the data is already in the HTML, just not visible.

Introduction

Modern front-end frameworks such as Next.js, React and Vue.js often store data in HTML <script> elements as JavaScript variables or JSON. Then, on page load or user interaction, they turn this data into HTML elements.
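As a minimal sketch of the JSON case, the snippet below pulls the contents of a JSON script element out of a page using only the Python standard library. The page source, the "reviews-data" id and the review fields are all made up for illustration, and a real page may store its data differently.

```python
import json
from html.parser import HTMLParser

# Hypothetical page source: the markup, the "reviews-data" id and the
# review fields are assumptions made up for this example.
HTML = """
<html><body>
  <div id="reviews"></div>
  <script id="reviews-data" type="application/json">
    [{"id": "r1", "rating": 5, "text": "Great product"},
     {"id": "r2", "rating": 4, "text": "Good value"}]
  </script>
</body></html>
"""


class ScriptJSONParser(HTMLParser):
    """Collects the decoded body of every <script type="application/json">."""

    def __init__(self):
        super().__init__()
        self._in_json_script = False
        self._buffer = []
        self.payloads = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/json":
            self._in_json_script = True
            self._buffer = []

    def handle_data(self, data):
        # Script bodies can arrive in several chunks, so buffer them.
        if self._in_json_script:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_script:
            self._in_json_script = False
            self.payloads.append(json.loads("".join(self._buffer)))


def extract_hidden_json(html):
    """Return a list with one decoded JSON value per matching script element."""
    parser = ScriptJSONParser()
    parser.feed(html)
    parser.close()
    return parser.payloads


reviews = extract_hidden_json(HTML)[0]
print(len(reviews), reviews[0]["text"])
```

The parser filters on the type attribute rather than on the id, so it picks up every JSON script on the page. Narrowing it down to a single element is a matter of checking another attribute in handle_starttag.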

For example, we can take a look at the product page on web-scraping.dev/product/1. With JavaScript disabled, the product appears to have no reviews, but if we re-enable it we can see them pop in on page load.

The page doesn't make any background requests to load the reviews either, so they must come from somewhere.

The easiest way to figure this out is to press CTRL+F and search the page source for a unique detail of the element, such as a fragment of the review text. This can be done in the devtools Elements search.
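The same search can be done in code. The sketch below scans every script element in a page and reports which ones contain a known piece of text. The page source and the review text are hypothetical, and the regular expression is a simplification that is good enough for locating data but is not a full HTML parser.

```python
import re

# Hypothetical page source: the markup and the review text are
# assumptions made up for this example.
HTML = """
<html><head><script>console.log("analytics loaded");</script></head>
<body>
  <div id="reviews"></div>
  <script id="reviews-data" type="application/json">
    [{"id": "r1", "text": "Absolutely delicious"}]
  </script>
</body></html>
"""

# Captures the attribute string and the body of each script element.
SCRIPT_RE = re.compile(
    r"<script\b([^>]*)>(.*?)</script>", re.DOTALL | re.IGNORECASE
)


def find_scripts_containing(html, needle):
    """Return (position, attributes) for each script whose body contains needle."""
    hits = []
    for position, match in enumerate(SCRIPT_RE.finditer(html)):
        attributes, body = match.group(1).strip(), match.group(2)
        if needle in body:
            hits.append((position, attributes))
    return hits


hits = find_scripts_containing(HTML, "Absolutely delicious")
print(hits)
```

Once the matching script is known, its attributes (here an id and a type) usually make a stable selector for extracting the data.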

Hidden web data is great for web scraping as it allows capturing entire datasets with very little code or resource use.
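To illustrate how little code this can take when the data is stored as a JavaScript variable rather than a JSON script element, here is a rough sketch using the standard library's JSON decoder. The variable name window.__PRODUCT__ and its fields are made up for illustration. The approach assumes the assigned value is valid JSON (double-quoted keys and strings), which is common but not guaranteed.

```python
import json

# Hypothetical page source: the variable name "window.__PRODUCT__" and its
# fields are assumptions made up for this example.
HTML = """
<script>
  window.__PRODUCT__ = {"id": 1, "name": "Example Product", "variants": ["small", "large"]};
  window.initApp();
</script>
"""


def extract_js_object(html, marker):
    """Decode the first JSON object or array that follows marker in html."""
    start = html.index(marker) + len(marker)
    # Skip ahead to the opening brace or bracket of the value.
    while html[start] not in "{[":
        start += 1
    # raw_decode parses one JSON value and ignores whatever follows it,
    # so the trailing semicolon and remaining script are not a problem.
    value, _end = json.JSONDecoder().raw_decode(html, start)
    return value


product = extract_js_object(HTML, "window.__PRODUCT__ =")
print(product["name"], product["variants"])
```

If the value is a JavaScript object literal that is not valid JSON (unquoted keys, single quotes, trailing commas), json will raise an error and a more tolerant parser is needed.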

Intro to Scraping Hidden Web Data

For more, see our full hidden web data scraping walkthrough on the Scrapfly blog, which covers a real-life example, best practices and automatic ways of finding JSON in HTML bodies.

Next - JSON Parsing

Scraped hidden web datasets are often huge, complex JSON blobs, so in the next section we'll take a look at how to parse JSON, best practices and how to handle large datasets.


Summary