Modern websites store data not only in the visible HTML but in embedded javascript code as well. This is especially common for dynamic website elements that are rendered by javascript on page load or triggered by user interactions.
The most common way to scrape dynamic data is to use a headless browser to force the hidden data to render in the HTML. In this article, however, we'll take a look at how we can extract this data directly, without using a web browser, which can be a thousand times faster and more efficient.
We'll take a look at what hidden data is, some common examples and how we can scrape it using regular expressions and other clever parsing algorithms.
What is Hidden Web Data?
Dynamic web front-ends often store data in javascript variables and then render it as HTML on demand (e.g. on page load or user action). This means the data is not directly visible on the page, though it's still there!
For example, a website could do this:
<html>
<head>
</head>
<body>
<div id="product">
<!-- There's no product data in the html -->
</div>
<script>
// but we can see data here
var data = {"product": {"name": "some product", "price": 44.33}};
// and it's being put into the HTML on page load:
productName = document.createElement("div");
productName.setAttribute("id", "product-name");
productName.innerText = data['product']['name'];
product = document.getElementById("product");
product.appendChild(productName);
</script>
</body>
</html>
We see that the initial HTML just has an empty product <div> node, while the data itself resides in a javascript variable called data. Then, on page load, javascript turns that data into visible HTML nodes. If we look at the rendered page in a javascript-enabled browser, the product div would be filled in - something like:
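<div id="product">
    <div id="product-name">some product</div>
</div>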
Modern web developers love this technique as they can keep all of the data in the page and have the front-end present it any way they like.
Unfortunately, web scrapers that don't execute javascript (anything that doesn't run a browser) don't see this data rendered into HTML - meaning they have to find and parse those javascript variables themselves.
How to Find Hidden Web Data
We can approach hidden web data in two ways:
Tools like Playwright, Puppeteer and Selenium can be used to control a real, headless web browser to render the pages and return final rendered HTML. Though this is expensive and slow - we need to run a whole web browser and wait for everything to load!
Alternatively, we can parse the HTML for these hidden state/cache variables using HTML parsing tools, regular expressions and common parsing algorithms. We have to get our hands dirty but our process will be significantly faster and we'll have access to the whole dataset which might contain more details than we can see in the visible HTML.
Hidden web data also often contains tokens used by the website's hidden APIs, as well as details used for data obfuscation or web scraper blocking.
Let's take a look at some common ways hidden data is stored and how we can find it.
Finding Hidden JSON Data
To confirm whether the website contains hidden web data we can employ a simple test:
Load the page in our web browser and find a unique data identifier (such as product name, id or part of the description).
Disable javascript in our browser and reload the page.
Check page source (right click on the page) and look for our unique identifier (e.g. ctrl+f)
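We can even script this check. Here's a minimal sketch, assuming a hypothetical product URL and identifier - it fetches the raw HTML without executing any javascript and searches it for the identifier:
import httpx

# hypothetical values - replace with a real product page and a unique identifier
url = "https://example.com/product/1"
identifier = "some product"

# fetch the raw page source - no javascript is executed here
response = httpx.get(url, follow_redirects=True)
if identifier in response.text:
    print("found in raw HTML - the data is likely embedded as hidden web data")
else:
    print("not found - the data is likely loaded through a separate API request")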
Almost all forms of hidden data are stored in HTML nodes such as <script>, either as a JSON object or a javascript variable. So, the first thing we can do is capture the script text containing this data.
We can do this using common HTML parsing packages like parsel or beautifulsoup:
import json
html = """
<html>
<head>
</head>
<body>
<script id="__NEXT_DATA__" type="application/json">
{"product": {"id": 1, "name": "first product"}}
</script>
</body>
</html>
"""
# using parsel
from parsel import Selector
selector = Selector(html)
data = selector.css("#__NEXT_DATA__::text").get()
data = json.loads(data)
print(data['product'])
# {"id": 1, "name": "first product"}
# using beautifulsoup
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
data = soup.select_one("#__NEXT_DATA__").text
data = json.loads(data)
print(data['product'])
# {'id': 1, 'name': 'first product'}
In both cases above we load the HTML and find the text of the <script> node with the specific id attribute. Then we load the found JSON data as a Python dictionary and can parse it however we wish!
This can often be enough to retrieve hidden data when it's stored as type="application/json", as it is in our example. However, that's not always the case and the data in the script can be hidden under a javascript variable.
Using Regex
Regular Expressions are perfect for finding structured text data such as JSON. For example, if our hidden data appears like this in the source code:
<script id="__NEXT_DATA__">
// javascript data:
var product = {"product": {"id": "1", "name": "first product"}};
var _meta = ...
</script>
Python's JSON module is not smart enough to extract this. Instead, we can assist it with regular expressions:
html = """
<html>
<head>
</head>
<body>
<script id="__NEXT_DATA__">
// javascript data:
var product = {"product": {"id": "1", "name": "first product"}};
var _meta = ...
</script>
</body>
</html>
"""
# find script text using parsel:
from parsel import Selector
selector = Selector(html)
script_text = selector.css("#__NEXT_DATA__::text").get()
# find json using regular expressions:
import re
import json
data = re.findall(r"product = ({.*?});", script_text)
data = json.loads(data[0])
print(data["product"])
In the example above, we used a regular expression pattern to select the text between the product = and }; tokens, which is the hidden JSON web data.
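Note that the . wildcard doesn't match newlines by default, so if the hidden object is formatted across multiple lines, a minimal tweak is to enable the re.DOTALL flag:
# re.DOTALL lets .*? match across newlines for multi-line JSON objects
data = re.findall(r"product = ({.*?});", script_text, re.DOTALL)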
Regular expressions work great but can get quite complicated and break easily. Another approach to extract this data is to use common data parsing algorithms - let's take a look at that next.
Using JSON Finding Algorithms
Python comes with a great JSON data decoder that can be used to find JSON documents in any text!
For example, here's a popular function that can find all valid JSON objects in a text string:
import json

def find_json_objects(text: str, decoder=json.JSONDecoder()):
    """Find JSON objects in text, and generate decoded JSON data"""
    pos = 0
    while True:
        match = text.find("{", pos)
        if match == -1:
            break
        try:
            result, index = decoder.raw_decode(text[match:])
            yield result
            pos = match + index
        except ValueError:
            pos = match + 1
text = """
This text contains some {"json": "objects"} and some json products like
product = {"product": {"id": 1, "name": "first product"}};
console.log("more javascript");
"""
found = list(find_json_objects(text))
print(found)
# [{'json': 'objects'}, {'product': {'id': 1, 'name': 'first product'}}]
The function finds all JSON objects in any text string, which is much more convenient than our regex example. Also, since we know what our product data object looks like (e.g. it contains a product key), we can select it exclusively without much extra effort:
product = next(data for data in found if data.get('product'))
print(product)
# {'product': {'id': 1, 'name': 'first product'}}
Finding Javascript Data
JSON is native to javascript, meaning javascript objects can contain code themselves, and that's where things get complicated: a valid javascript object is not necessarily a valid JSON data object. Let's take a look at this example:
text = """
var product = {
// some comment:
"element": document.createElement("div"),
"url": "http://foo.com", // some trailing comment
"price": 44.23,"discount": 22.11,
"features": ["warm", "cold"],
"product": {"id": 1, "name": "first product"}
}
"""
print(list(find_json_objects(text)))
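# will only recover the inner JSON fragment:
# [{'id': 1, 'name': 'first product'}]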
Both our regex and JSON-finder based solutions would fail to parse this whole object successfully. That's because it's a valid javascript object but not a valid JSON data object - it contains comments and code expressions that our scraper cannot understand without a web browser.
There are a few ways we could approach this:
Remove comments and anything that is not a base data type (string, number, boolean etc.) and then use our JSON finder.
Parse javascript code using javascript language parsers and then extract that data.
Depending on your project's size and complexity, either one of these approaches could be more fitting. For small projects we can hack our JSON finder to remove the garbage data, though for bigger projects we'd probably need to invest more time into a more resilient language-parsing-based approach.
Let's take a look at both!
Removing Javascript from JSON
To convert javascript objects to JSON objects, all we have to do is remove any values that are not primitive (strings, booleans, numbers etc.) and strip the comments.
To clear the objects we can use regular expressions, and for the comments we can take advantage of existing packages like pyparsing:
import re
import json
import pyparsing

# pyparsing can strip // and /* */ style comments while leaving quoted strings alone
comment_remover = pyparsing.cpp_style_comment.suppress()
comment_remover.ignore(pyparsing.QuotedString('"') | pyparsing.QuotedString("'"))

def remove_objects(text):
    """
    replaces all `"key": object` occurrences in text
    with `"key": {}`
    """
    text = comment_remover.transform_string(text)

    def _rm(match: re.Match):
        key, value, trail = match.groups()
        return key + "{}" + trail

    return re.sub(r'("[^"]+?"\s*:\s*)([^"\s[{\d(?:true|false)].+?)(,|$|})', _rm, text)
# let's try it with our text:
text = """
var product = {
    // some comment:
    "element": document.createElement("div"),
    "url": "http://foo.com", // some trailing comment
    "price": 44.23,"discount": 22.11,
    "features": ["warm", "cold"],
    "product": {"id": 1, "name": "first product"}
}
"""
# note: remove_objects() already strips the comments for us
clean_text = remove_objects(text)
print(list(find_json_objects(clean_text)))
# will print:
[
    {
        "element": {},
        "url": "http://foo.com",
        "price": 44.23,
        "discount": 22.11,
        "features": ["warm", "cold"],
        "product": {"id": 1, "name": "first product"},
    }
]
With this quick hack, we can easily scrape more complex embedded JSON structures. Though, we are losing all of that javascript data - what if there's something valuable there? Additionally, regular expression patterns, although fast, are complicated and can break easily upon website changes.
Let's take a look at another approach - parsing javascript code itself.
Parsing Javascript with js2xml
Just like javascript interpreters need to parse the code to understand it, we can also parse it for variable data.
Using js2xml we can convert javascript code (including JSON) to an XML document which we can parse using CSS or XPath selectors. Let's take a look at our example again:
import js2xml
from js2xml.utils.vars import get_vars, make_obj

text = """
var product = {
    // some comment:
    "element": document.createElement("div"),
    "url": "http://fo,o.com", // some trailing comment
    "price": 44.23,"discount": document.deleteElement(foo),
    "features": ["warm", "cold"],
    "product": {"id": 1, "name": "first, product"}
}
"""
# first, convert the javascript code to an XML tree (returns an lxml Element):
parsed_tree = js2xml.parse(text)
# we can see the generated XML tree:
print(js2xml.pretty_print(parsed_tree))
"""
<program>
<var name="product">
<object>
<property name="element">
<functioncall>
<function>
<dotaccessor>
<object>
<identifier name="document"/>
</object>
<property>
<identifier name="createElement"/>
</property>
</dotaccessor>
</function>
<arguments>
<string>div</string>
</arguments>
</functioncall>
</property>
<property name="url">
<string>http://fo,o.com</string>
</property>
<property name="price">
<number value="44.23"/>
</property>
<property name="discount">
<functioncall>
<function>
<dotaccessor>
<object>
<identifier name="document"/>
</object>
<property>
<identifier name="deleteElement"/>
</property>
</dotaccessor>
</function>
<arguments>
<identifier name="foo"/>
</arguments>
</functioncall>
</property>
<property name="features">
<array>
<string>warm</string>
<string>cold</string>
</array>
</property>
<property name="product">
<object>
<property name="id">
<number value="1"/>
</property>
<property name="name">
<string>first, product</string>
</property>
</object>
</property>
</object>
</var>
</program>
"""
# we can also extract this tree as json
print(get_vars(parsed_tree))
{
    "product": {
        "element": None,
        "url": "http://fo,o.com",
        "price": 44.23,
        "discount": None,
        "features": ["warm", "cold"],
        "product": {"id": 1, "name": "first, product"},
    }
}
# or if the json is deep in the code we can find it with xpath and then convert it
print(make_obj(parsed_tree.xpath('//property[@name="product"]/object')[0]))
{"id": 1, "name": "first, product"}
In the example above, we used js2xml to convert javascript code to XML, which we can then either parse with CSS/XPath selectors or convert into Python dictionaries.
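For example, reading a single value straight from the generated XML tree with an XPath selector could look something like this (using the parsed_tree from above):
# select the "url" property's string value from the XML tree generated above
url = parsed_tree.xpath('//property[@name="url"]/string/text()')[0]
print(url)
# http://fo,o.com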
Some Real Examples
We often encounter hidden web data in our scrapeguide blog series, which covers tutorials on how to scrape popular web scraping targets.
For example, we use simple regex patterns when scraping https://www.glassdoor.com/index.htm in our How to Scrape Glassdoor (2024 update) article:
import re
import json
import httpx

def extract_apollo_state(html):
    """Extract apollo graphql state data from HTML source"""
    # here we use a regex pattern to find the first json object after the apolloState keyword:
    data = re.findall(r'apolloState":\s*({.+})};', html)[0]
    return json.loads(data)

def scrape_overview(company_id: int):
    short_url = f"https://www.glassdoor.com/Overview/-EI_IE{company_id}.htm"
    response = httpx.get(short_url, follow_redirects=True)
    apollo_state = extract_apollo_state(response.text)
    return next(v for k, v in apollo_state.items() if k.startswith("Employer:"))

# eBay's glassdoor profile page:
print(json.dumps(scrape_overview(7853), indent=2))
We've covered several other hidden web data examples on this blog as well.
Scraping Hidden Data with Headless Browsers
We covered how scraping hidden web data can be an alternative to using headless browsers to fully render dynamic data. We can also do the reverse: use a headless browser to retrieve the javascript variables present in the page, which gives us the fully rendered hidden web dataset.
For example, let's say we have this hidden web data piece:
html = """
<html>
<head>
</head>
<body>
<script id="__NEXT_DATA__">
var product = {
"product": {
"id": "1",
"name": "first product",
"secret": create_secret()
}};
var _meta = ...
</script>
</body>
</html
Here, we can see that the secret field is dynamically generated by a javascript function. If we scrape this as is, we'd just get the function name in our data.
Instead, we can fire up a real, headless web browser through Playwright, Puppeteer or Selenium and evaluate custom javascript to capture this data.
As a real-life example, let's go back to Glassdoor and see how we could do this in Playwright and Python:
from playwright.sync_api import sync_playwright

with sync_playwright() as pw:
    browser = pw.chromium.launch(headless=True)
    context = browser.new_context(viewport={"width": 1920, "height": 1080})
    page = context.new_page()
    # go to glassdoor url
    page.goto("https://www.glassdoor.com/Overview/Working-at-eBay-EI_IE7853.11,15.htm")
    # extract apolloState data; Employer:7853 contains the company overview data
    # of eBay which is ID 7853:
    data = page.evaluate("window.appCache.apolloState['Employer:7853']")
    print(data)
# will print:
{
    '__typename': 'Employer',
    'id': 7853,
    'shortName': 'eBay',
    'website': 'www.ebayinc.com',
    'type': 'Company - Public',
    'revenue': '$10+ billion (USD)',
    'headquarters': 'San Jose, CA',
    'size': '10000+ Employees',
    'stock': 'EBAY',
    ...
}
In the example above, we fire up a headless instance of a Chrome browser, tell it to go to eBay's profile page on glassdoor.com and extract the hidden web data through javascript evaluation.
Scraping Hidden Data with ScrapFly
Hidden data is not overly complex when it comes to scraping, but it can quickly become a tough issue once we start scaling our scrapers. For this, we made Scrapfly!
For example, we can replicate our Glassdoor example using ScrapFly SDK:
from scrapfly import ScrapeConfig, ScrapflyClient

client = ScrapflyClient(key="YOUR SCRAPFLY KEY")
result = client.scrape(ScrapeConfig(
    url="https://www.glassdoor.com/Overview/Working-at-eBay-EI_IE7853.11,15.htm",
    # enable headless browser use and evaluate javascript script
    render_js=True,
    js="return window.appCache.apolloState['Employer:7853']",
    # we can tell the headless browser to wait 2 seconds for the content to load:
    rendering_wait=2_000,
    # we can set specific proxy country:
    country="CA",
    # we can also take screenshots to see what our browser is doing:
    screenshots={"fullpage": "fullpage"},
))
# get the javascript result:
print(result.scrape_result['browser_data']['javascript_evaluation_result'])
Or try it through the interactive web player directly.
FAQ
To wrap this article up let's take a look at some frequently asked questions regarding the scraping of hidden web data:
Is it legal to scrape hidden web data?
Yes, hidden web data is the same public data as the visible HTML. Note that due to GDPR in the European Union region, hidden web data should be cleared of user-identifying information.
Hidden Data Scraping Summary
Hidden web data is becoming increasingly popular as websites rely more and more on javascript to generate web content dynamically. So, in this extensive tutorial, we've taken a look at how to scrape this data, how to parse it and what the common challenges in these areas are.
We explored common regular expression patterns, JSON parsing algorithms and tools like js2xml and pyparsing for lexical data parsing - all of which are great tools to find public hidden datasets on the web.