How to Scrape Hidden Web Data

Modern websites store data not only in the visible HTML page but in the embedded javascript code as well. This is especially common in dynamic website elements that are rendered by javascript on page load or triggered by user interactions.

The most common way to scrape dynamic data is to use a headless browser to force the hidden data to render as HTML. In this article, however, we'll take a look at how we can extract this data directly, without a web browser, which can be an orders-of-magnitude faster and more efficient approach.

We'll take a look at what hidden data is, some common examples of it, and how we can scrape it using regular expressions and other clever parsing algorithms.

Scraping Dynamic Websites Using Web Browsers

If you'd like to learn about the alternative approach of using headless browsers for this challenge, see our complete introduction article.

What is Hidden Web Data?

Dynamic web front-ends often store data in javascript variables and then render it as HTML on demand (like page load or user action). This means the data is not visible on the page directly though it's still there!

For example, a website could do this:

<html>
  <head>
  </head>
  <body>
    <div id="product">
      <!-- There's no product data in the html -->
    </div>
    <script>
        // but we can see data here
        var data = {"product": {"name": "some product", "price": 44.33}};
        // and it's being  put into the HTML on page load:
        productName = document.createElement("div");
        productName.setAttribute("id", "product-name");
        productName.innerText = data['product']['name'];
        product = document.getElementById("product");
        product.appendChild(productName);
    </script>
  </body>
</html>

We see that the initial HTML just has an empty product <div> node and the data itself resides in a javascript variable called data. Then, on page load, javascript is used to turn that data into visible HTML nodes. If we inspect the rendered page in a javascript-enabled browser (for example, through the developer tools' element inspector) we would see:

<div id="product">
  <div id="product-name">some product</div>
</div>

Modern web developers love this technique as they can ship all of the data with the page and update the front-end to represent that data any way they like.
Unfortunately, web scrapers that do not execute javascript (that is, anything that doesn't run a browser) never see this data rendered to HTML, so they have to find and parse those javascript variables themselves.

How to Find Hidden Web Data

We can approach hidden web data in two ways:

Tools like Playwright, Puppeteer and Selenium can be used to control a real, headless web browser to render the pages and return final rendered HTML. Though this is expensive and slow - we need to run a whole web browser and wait for everything to load!

Alternatively, we can parse the HTML for these hidden state/cache variables using HTML parsing tools, regular expressions and common parsing algorithms. We have to get our hands dirty but our process will be significantly faster and we'll have access to the whole dataset which might contain more details than we can see in the visible HTML.

Hidden web data also often contains various tokens used by the website's hidden APIs, as well as details used to obfuscate data or to block web scrapers.

Let's take a look at some common ways hidden data is stored and how we can find it.

Finding Hidden JSON Data

To confirm whether the website contains hidden web data we can employ a simple test:

  1. Load the page in our web browser and find a unique data identifier (such as product name, id or part of the description).
  2. Disable javascript in our browser and reload the page.
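  3. Check the page source (right click on the page) and look for our unique identifier (e.g. ctrl+f). If it's there even though it's not displayed on the page, the data is stored as hidden web data.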
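
The same check can be roughly automated. The sketch below uses only Python's standard library and assumes the HTML string is the raw response body, as a non-browser HTTP client would receive it. The names VisibleText and locate_identifier are hypothetical helpers made up for this illustration:

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects text that sits outside of <script> and <style> nodes."""

    def __init__(self):
        super().__init__()
        self._skip = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def locate_identifier(html, identifier):
    """Return 'visible', 'hidden' or 'missing' for identifier in raw HTML."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    if identifier in "".join(parser.chunks):
        return "visible"  # already present in the regular HTML text
    if identifier in html:
        return "hidden"  # only present in scripts, attributes or comments
    return "missing"  # most likely loaded later through a separate request


page_source = """
<html><body>
  <div id="product"></div>
  <script>var data = {"product": {"name": "some product"}};</script>
</body></html>
"""
print(locate_identifier(page_source, "some product"))
# hidden
```

A "missing" result usually means the data arrives through a background request instead, which is outside the scope of this check.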

Almost all forms of hidden data are stored in HTML nodes such as <script>, either as a standalone JSON object or as a javascript variable. So, the first thing we can do is capture the script text containing this data.

We can do this using common HTML parsing packages like parsel or beautifulsoup:

import json
html = """
<html>
  <head>
  </head>
  <body>
    <script id="__NEXT_DATA__" type="application/json">
      {"product": {"id": 1, "name": "first product"}}
    </script>
  </body>
</html>
"""

# using parsel
from parsel import Selector
selector = Selector(html)
data = selector.css("#__NEXT_DATA__::text").get()
data = json.loads(data)
print(data['product'])
# {'id': 1, 'name': 'first product'}

# using beautifulsoup
from bs4 import BeautifulSoup
# name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = BeautifulSoup(html, "html.parser")
data = soup.select_one("#__NEXT_DATA__").text
data = json.loads(data)
print(data['product'])
# {'id': 1, 'name': 'first product'}

In both cases above we load the HTML and find the text of the <script> node with the specific id attribute. Then we load the found JSON data as a Python dictionary, which we can parse as we wish!

This can often be enough to retrieve hidden data if it's stored as type="application/json" as it is in our example. However, that's not always the case and the data in the script can be assigned to a javascript variable instead.
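
When the id of the node isn't known up front, one option is to collect every JSON-typed script on the page and inspect the results. Here is a minimal standard-library sketch of that idea. JsonScriptCollector and find_json_scripts are hypothetical names, and including application/ld+json in the accepted types is an assumption made for this illustration:

```python
import json
from html.parser import HTMLParser

# script types treated as JSON documents (an assumption of this sketch)
JSON_TYPES = {"application/json", "application/ld+json"}


class JsonScriptCollector(HTMLParser):
    """Collects the decoded contents of every JSON-typed <script> node."""

    def __init__(self):
        super().__init__()
        self._in_json_script = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") in JSON_TYPES:
            self._in_json_script = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_script:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_script:
            self._in_json_script = False
            try:
                self.documents.append(json.loads("".join(self._buffer)))
            except ValueError:
                pass  # skip script bodies that are not valid JSON


def find_json_scripts(html):
    collector = JsonScriptCollector()
    collector.feed(html)
    collector.close()
    return collector.documents


html = """
<html>
  <body>
    <script id="__NEXT_DATA__" type="application/json">
      {"product": {"id": 1, "name": "first product"}}
    </script>
    <script>console.log("not json");</script>
  </body>
</html>
"""
print(find_json_scripts(html))
# [{'product': {'id': 1, 'name': 'first product'}}]
```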

Using Regex

Regular Expressions are perfect for finding structured text data such as JSON. For example, if our hidden data appears like this in the source code:

<script id="__NEXT_DATA__">
  // javascript data:
  var product = {"product": {"id": "1", "name": "first product"}};
  var _meta = ...
</script>

Python's json module cannot load this script text directly, as it is javascript code rather than a pure JSON document. Instead, we can assist it with regular expressions:

html = """
<html>
  <head>
  </head>
  <body>
    <script id="__NEXT_DATA__">
      // javascript data:
      var product = {"product": {"id": "1", "name": "first product"}};
      var _meta = ...
    </script>
  </body>
</html>
"""

# find script text using parsel:
from parsel import Selector

selector = Selector(html)
script_text = selector.css("#__NEXT_DATA__::text").get()

# find json using regular expressions:
import re
import json

data = re.findall(r"product = ({.*?});", script_text)
data = json.loads(data[0])
print(data["product"])

In the example above we used a regular expression pattern to select the text between product = and }; tokens which is the hidden JSON web data.
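
One caveat worth keeping in mind is that the lazy pattern stops at the first }; it meets, even when those characters sit inside a string value. The product name in the sketch below is made up purely to trigger that case:

```python
import json
import re

# same pattern as above, but the (made-up) product name contains "};"
script_text = 'var product = {"product": {"id": "1", "name": "size };-)"}};'

captured = re.findall(r"product = ({.*?});", script_text)[0]
print(captured)
# {"product": {"id": "1", "name": "size }

try:
    json.loads(captured)
    decoded = True
except json.JSONDecodeError:
    # the capture was cut off in the middle of a string
    decoded = False
print(decoded)
# False
```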

Regular expressions work great but can get quite complicated and break easily. Another approach to extract this data is to use common data parsing algorithms - let's take a look at that next.

Using JSON Finding Algorithms

Python comes with a great JSON data decoder that can be used to find JSON documents in any text!

For example, here's a popular function that can find all valid JSON objects in a text string:

import json


def find_json_objects(text: str, decoder=json.JSONDecoder()):
    """Find JSON objects in text, and generate decoded JSON data"""
    pos = 0
    while True:
        match = text.find("{", pos)
        if match == -1:
            break
        try:
            result, index = decoder.raw_decode(text[match:])
            yield result
            pos = match + index
        except ValueError:
            pos = match + 1


text = """
This text contains some {"json": "objects"} and some json products like
product = {"product": {"id": 1, "name": "first product"}};
console.log("more javascript");
"""
found = list(find_json_objects(text))
print(found)
# [{'json': 'objects'}, {'product': {'id': 1, 'name': 'first product'}}]

The function finds all JSON objects in any text string which is much more convenient than our regex example. Also, since we know how our product data object looks (e.g. it contains a product key) we can select it exclusively without much extra effort:

product = next(data for data in found if data.get('product'))
print(product)
# {'product': {'id': 1, 'name': 'first product'}}
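
When we already know which javascript variable holds the data, the two ideas can be combined: a small regular expression locates the assignment and the JSON decoder does the actual parsing from that position. This is only a sketch, and extract_js_variable is a hypothetical helper name. It relies on JSONDecoder.raw_decode() accepting a start index as its second argument:

```python
import json
import re


def extract_js_variable(script_text, name):
    """Decode the JSON value assigned to the variable `name`, or return None."""
    # find `name =` and consume the whitespace so we land exactly on the value
    match = re.search(rf"\b{re.escape(name)}\s*=\s*", script_text)
    if match is None:
        return None
    try:
        # raw_decode() parses one JSON value starting at the given index
        # and ignores whatever javascript follows it
        value, _end = json.JSONDecoder().raw_decode(script_text, match.end())
    except ValueError:
        return None  # the assigned value is not valid JSON
    return value


script_text = """
var product = {"product": {"id": "1", "name": "first product"}};
var _meta = ...
"""
print(extract_js_variable(script_text, "product"))
# {'product': {'id': '1', 'name': 'first product'}}
print(extract_js_variable(script_text, "_meta"))
# None
```

Because the decoder decides where the object ends, characters such as }; inside string values do not cut the match short.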

Finding Javascript Data

Data objects in javascript are native object literals, meaning they can contain javascript code itself, and that's where things get complicated: a valid javascript object is not necessarily a valid JSON object. Let's take a look at this example:

text = """
var product = {
    // some comment:
    "element": document.createElement("div"),
    "url": "http://foo.com", // some trailing comment
    "price": 44.23,"discount": 22.11,
    "features": ["warm", "cold"],
    "product": {"id": 1, "name": "first product"}
}
"""
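print(list(find_json_objects(text)))
# [{'id': 1, 'name': 'first product'}]
# only the nested, JSON-valid object is found - the outer object is skipped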

Both our regex-based and JSON-finder-based solutions would fail to parse the whole object successfully. That's because this is a valid javascript object and not a valid JSON data object. It contains comments and code blocks that our scraper cannot understand without a web browser.

There are a few ways we could approach this:

  • Remove comments and anything that is not a base data type (string, number, boolean etc.) and then use our JSON finder.
  • Parse javascript code using javascript language parsers and then extract that data.

Depending on your project size and complexity, either one of these approaches could be the better fit. For example, for small projects we can hack our JSON finder to remove the garbage data, though for bigger projects we'd probably need to invest more time into a more resilient language-parsing-based approach.
Let's take a look at both!

Removing Javascript from JSON

To convert javascript objects to JSON objects all we have to do is remove any values that are not primitive values like strings, booleans or numbers and remove comments.

To clear the objects we can use regular expressions and for comments, we can take advantage of existing packages like pyparsing:

import re
import pyparsing
import json

comment_remover = pyparsing.cpp_style_comment.suppress()
comment_remover.ignore(pyparsing.QuotedString('"') | pyparsing.QuotedString("'"))


def remove_objects(text):
    """
    replaces all `"key": object` occurrences in text
    with `"key": {}`
    """
    # strip javascript comments first (quoted strings are left untouched)
    text = comment_remover.transform_string(text)

    def _rm(match: re.Match):
        key, _value, trail = match.groups()
        return key + "{}" + trail

    # match values that do not start like a JSON value:
    # a string, array, object, number or true/false/null
    return re.sub(
        r'("[^"]+?"\s*:\s*)((?!true|false|null)[^"\s\[{\d-].+?)(,|$|})', _rm, text
    )


# let's try it with our text:
text = """
var product = {
    // some comment:
    "element": document.createElement("div"),
    "url": "http://foo.com", // some trailing comment
    "price": 44.23,"discount": 22.11,
    "features": ["warm", "cold"],
    "product": {"id": 1, "name": "first product"}
}
"""
# remove_objects() already strips the comments for us
clean_text = remove_objects(text)
print(list(find_json_objects(clean_text)))
# will print:
[
    {
        "element": {},
        "url": "http://foo.com",
        "price": 44.23,
        "discount": 22.11,
        "features": ["warm", "cold"],
        "product": {"id": 1, "name": "first product"},
    }
]

With this quick hack, we can easily scrape more complex embedded JSON structures. However, we are losing all of that javascript data - what if there's something valuable there? Additionally, regular expression patterns, although fast, are complicated and can break easily when the website changes.
Let's take a look at another approach - parsing javascript code itself.

Parsing Javascript with js2xml

Just like javascript interpreters need to parse the code to understand it we can also parse it for variable data.

Using js2xml we can convert javascript code (including JSON) to an XML document, which we can parse using CSS or XPath selectors. Let's take a look at our example again:

import js2xml
from js2xml.utils.vars import get_vars, make_obj

text = """
var product = {
    // some comment:
    "element": document.createElement("div"),
    "url": "http://fo,o.com", // some trailing comment
    "price": 44.23,"discount": document.deleteElement(foo),
    "features": ["warm", "cold"],
    "product": {"id": 1, "name": "first, product"}
}
"""
# first convert javascript code to XML tree (return lxml.Element)
parsed_tree = js2xml.parse(text)

# we can see generated XML tree:
print(js2xml.pretty_print(parsed_tree))
"""
<program>
  <var name="product">
    <object>
      <property name="element">
        <functioncall>
          <function>
            <dotaccessor>
              <object>
                <identifier name="document"/>
              </object>
              <property>
                <identifier name="createElement"/>
              </property>
            </dotaccessor>
          </function>
          <arguments>
            <string>div</string>
          </arguments>
        </functioncall>
      </property>
      <property name="url">
        <string>http://fo,o.com</string>
      </property>
      <property name="price">
        <number value="44.23"/>
      </property>
      <property name="discount">
        <functioncall>
          <function>
            <dotaccessor>
              <object>
                <identifier name="document"/>
              </object>
              <property>
                <identifier name="deleteElement"/>
              </property>
            </dotaccessor>
          </function>
          <arguments>
            <identifier name="foo"/>
          </arguments>
        </functioncall>
      </property>
      <property name="features">
        <array>
          <string>warm</string>
          <string>cold</string>
        </array>
      </property>
      <property name="product">
        <object>
          <property name="id">
            <number value="1"/>
          </property>
          <property name="name">
            <string>first, product</string>
          </property>
        </object>
      </property>
    </object>
  </var>
</program>
"""

# we can also extract this tree as json
print(get_vars(parsed_tree))
{
    "product": {
        "element": None,
        "url": "http://fo,o.com",
        "price": 44.23,
        "discount": None,
        "features": ["warm", "cold"],
        "product": {"id": 1, "name": "first, product"},
    }
}

# or if the json is deep in the code we can find it with xpath and then convert it
print(make_obj(parsed_tree.xpath('//property[@name="product"]/object')[0]))
{"id": 1, "name": "first, product"}

In the example above we used js2xml to convert javascript code to XML, which we can then either parse with CSS/XPath selectors or convert to Python dictionaries.

Some Real Examples

We often encounter hidden web data in our scrapeguide blog series, which covers tutorials on how to scrape popular web scraping targets.

For example, we use simple regex patterns when scraping Glassdoor (https://www.glassdoor.com/index.htm) in our How to Scrape Glassdoor article:

import re
import httpx
import json

def extract_apollo_state(html):
    """Extract apollo graphql state data from HTML source"""
    # here we use regex pattern to find first json object after apolloState keyword:
    data = re.findall(r'apolloState":\s*({.+})};', html)[0]
    return json.loads(data)

def scrape_overview(company_id: int):
    short_url = f"https://www.glassdoor.com/Overview/-IE_EI{company_id}.htm"
    response = httpx.get(short_url)
    apollo_state = extract_apollo_state(response.text)
    # the company overview is stored under the key that starts with "Employer:"
    return next(v for k, v in apollo_state.items() if k.startswith("Employer:"))

# eBay's glassdoor profile page (company ID 7853):
print(json.dumps(scrape_overview(7853), indent=2))
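
State objects like this can be large and deeply nested, so a small recursive search can help to locate a field before writing exact lookups. The sketch below is a generic helper, not part of the Glassdoor scraper. The name find_key is hypothetical and the sample state is made up for illustration rather than real Glassdoor data:

```python
def find_key(data, key):
    """Yield every value stored under `key` anywhere in nested dicts and lists."""
    if isinstance(data, dict):
        for k, v in data.items():
            if k == key:
                yield v
            yield from find_key(v, key)
    elif isinstance(data, list):
        for item in data:
            yield from find_key(item, key)


# made-up state, shaped loosely like a front-end cache
state = {
    "ROOT_QUERY": {"employer": {"id": 1}},
    "Employer:1": {"shortName": "Example Co", "links": [{"website": "example.com"}]},
}
print(list(find_key(state, "website")))
# ['example.com']
print(list(find_key(state, "id")))
# [1]
```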

Using Browsers

We covered how scraping hidden web data can be an alternative to using headless browsers to fully render dynamic data. Headless browsers can also help with hidden web data itself: we can use them to retrieve the javascript variables present in the page after the browser has evaluated them, which gives us fully resolved hidden web datasets.

For example, let's say we have this hidden web data piece:

html = """
<html>
  <head>
  </head>
  <body>
    <script id="__NEXT_DATA__">
      var product = {
        "product": {
          "id": "1",
          "name": "first product",
          "secret": create_secret()
        }};
      var _meta = ...
    </script>
  </body>
</html>
"""

Here, we can see that the secret field is dynamically generated by a javascript function. If we scrape this as is, we'd just get the function call in our data instead of the value it produces.

Instead, we can fire up a real, headless web browser through Playwright, Puppeteer or Selenium and evaluate custom javascript to capture this data.

As a real-life example, let's go back to Glassdoor and see how we could do this in Playwright and Python:

from playwright.sync_api import sync_playwright

with sync_playwright() as pw:
    browser = pw.chromium.launch(headless=True)
    context = browser.new_context(viewport={"width": 1920, "height": 1080})
    page = context.new_page()

    # go to glassdoor url
    page.goto("https://www.glassdoor.com/Overview/Working-at-eBay-EI_IE7853.11,15.htm")
    # extract apolloState data, Employer:7853 contains company overview data
    # of ebay which is ID 7853:
    data = page.evaluate("window.appCache.apolloState['Employer:7853']")
    print(data)
# will print
{
    '__typename': 'Employer', 
    'id': 7853, 
    'shortName': 'eBay', 
    'website': 'www.ebayinc.com', 
    'type': 'Company - Public', 
    'revenue': '$10+ billion (USD)', 
    'headquarters': 'San Jose, CA', 
    'size': '10000+ Employees', 
    'stock': 'EBAY', 
    ...
}

In the example above, we fire up a headless instance of a Chromium browser, tell it to go to eBay's profile page on glassdoor.com and extract hidden web data through the javascript evaluation function.

Scraping Hidden Data with ScrapFly

The ScrapFly web scraping API is a great tool for collecting hidden web data as it can bypass anti-scraping protection services and, in general, prevent scrapers from being blocked through the use of millions of residential proxies.

ScrapFly's headless browser feature also gives us extremely easy access to javascript evaluation without having to run a headless browser ourselves. We can replicate our Glassdoor example from above through the ScrapFly SDK:

from scrapfly import ScrapeConfig, ScrapflyClient

client = ScrapflyClient(key="YOUR SCRAPFLY KEY")
result = client.scrape(ScrapeConfig(
    url="https://www.glassdoor.com/Overview/Working-at-eBay-EI_IE7853.11,15.htm",
    # enable headless browser use and evaluate javascript script
    render_js=True,
    js="return window.appCache.apolloState['Employer:7853']",
    # we can tell the headless browser to wait 2 seconds for the content to load:
    rendering_wait=2_000,
    # we can set specific proxy country:
    country="CA",
    # we can also take screenshots to see what our browser is doing:
    screenshots={"fullpage": "fullpage"}

))
# get the javascript result:
print(result.scrape_result['browser_data']['javascript_evaluation_result'])

Or try it through the interactive web player directly.

FAQ

To wrap this article up let's take a look at some frequently asked questions regarding the scraping of hidden web data:

Yes, hidden web data is the same public data as the visible HTML. Note that due to GDPR in the European Union, hidden web data should be cleared of user-identifying information.

Summary

Hidden web data is becoming increasingly popular as websites rely more and more on javascript to generate web content dynamically. So, in this extensive tutorial, we've taken a look at how to scrape this data, how to parse it and what the common challenges are in these areas.

We explored common regular expression patterns, JSON parsing algorithms and tools like js2xml and pyparsing for lexical data parsing - all of which are great tools to find public hidden datasets on the web.
