Quick Intro to Parsing JSON with JSONPath in Python

JSONPath is a path expression language for JSON. It is used to query data from JSON datasets and is similar to the XPath query language for XML documents.

Parsing HTML with Xpath

For more on XPath and HTML parsing, see our full interactive introduction tutorial.

As the name implies, JSONPath is heavily inspired by XPath and offers similar syntax and querying capabilities:

  • Recursive data lookup, e.g. find all price nodes in the dataset
  • Wildcard matching
  • Array slicing
  • Filtering
  • Function calls and custom function extensions

In this JSONPath tutorial, we'll take a look at how to use this path language in the context of web scraping. JSONPath is implemented in many different languages, but in this tutorial, we'll cover the most popular Python implementations.

JSONPath Setup

JSONPath is a JSON query specification with no centralized body, so it is implemented in many different languages by many different projects:

Language      Implementation
Python        jsonpath-ng, jsonpath2
JavaScript    jsonpath-plus
Ruby          jsonpath
R             rjsonpath
Go            ojg, implementation by Kubernetes

Introduction to JSONPath

To start, all JSONPath query expressions are simple strings made up of JSON keys and operators. Let's take a look at this example:

import jsonpath_ng.ext as jp

data = {
    "products": [
        {"name": "Apple", "price": 12.88, "tags": ["fruit", "red"]},
        {"name": "Peach", "price": 27.25, "tags": ["fruit", "yellow"]},
        {"name": "Cake", "tags": ["pastry", "sweet"]},
    ]
}

# find all product names:
query = jp.parse("products[*].name")
for match in query.find(data):
    print(match.value)

# find all products with price > 20
query = jp.parse("products[?price>20].name")
for match in query.find(data):
    print(match.value)

Here, to extract all product names, we use the products[*].name query, which iterates through all elements of the products array (the [*] operator) and returns the name property of each element. JSONPath also supports filtering expressions like [?price>20], which returns all elements where the price property is greater than 20.

Let's take a look at all of the available operators and some examples:

operator                            function
$                                   object root selector
@ or this                           current object selector
..                                  recursive descendant selector
*                                   wildcard, selects any key of an object or index of an array
[]                                  subscript operator
[start:end:step]                    array slice operator
[?<predicate>] or (?<predicate>)    filter operator, where the predicate is some evaluation rule like [?price>20]

More filter examples:

[?price > 10 & price < 20]          multiple conditions
[?address.city = "Boston"]          exact matches
[?description.text =~ "house"]      containing values

These basic operators open up a lot of powerful querying options. Let's take a look at some examples in the context of web scraping.

Web Scraper Example

Let's try JSONPath in a real scraper example to see how it would be used in web scraping with Python.

We'll be scraping real estate property data from realtor.com which is a popular portal for renting and selling real estate properties.

This website, like many modern websites, uses JavaScript to render its pages, which means we can't just scrape the HTML code. Instead, we'll find the JSON variable data that the frontend uses to render the page. This is called hidden web data.

Hidden web data can often be found in the HTML code and extracted using an HTML parser. However, this data is often filled with keywords, IDs and other non-data fields, which is why we'll be using JSONPath to extract only the useful data.
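
To illustrate the idea of hidden web data in isolation, here is a minimal sketch that pulls a JSON payload out of a script element. The HTML snippet and its JSON structure are made up for this example, and to stay dependency-free it uses the standard library's html.parser rather than the parsel package used in the scraper itself. Only the __NEXT_DATA__ script id is taken from the real page.

```python
import json
from html.parser import HTMLParser

# Made-up page standing in for a real response body (assumption for illustration).
SAMPLE_HTML = """
<html><body>
<div id="app"></div>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"buildId": "abc123", "pageProps": {"property": {"price": 2995000, "beds": 4}}}}
</script>
</body></html>
"""


class ScriptDataParser(HTMLParser):
    """Collects the text content of the <script> element with the given id."""

    def __init__(self, script_id):
        super().__init__()
        self.script_id = script_id
        self._inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == self.script_id:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


parser = ScriptDataParser("__NEXT_DATA__")
parser.feed(SAMPLE_HTML)

# The script body is plain JSON, so it loads straight into a dictionary.
hidden_data = json.loads("".join(parser.chunks))
property_data = hidden_data["props"]["pageProps"]["property"]
print(property_data)
```

Even in this tiny payload the useful fields sit next to frontend bookkeeping such as buildId, which is the kind of noise a JSONPath query lets us step over.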

Let's take a look at any random example property like this one

If we take a look at the page source we can see the JSON data set hidden in a <script> tag:

We can see the entire property dataset hidden in a script element

To scrape this, we'll be using a few Python packages:

  • httpx - HTTP client library to retrieve the page.
  • parsel - HTML parsing library to extract <script> element data.
  • jsonpath-ng - To parse the JSON data for property data fields.

All of these can be installed using the pip install command:

$ pip install jsonpath-ng httpx parsel

So, we'll retrieve the HTML page, find the <script> element that contains the hidden web data and then use JSONPath to extract the most important property data fields:

import json
import httpx
from parsel import Selector
import jsonpath_ng as jp

# establish an HTTP client; to avoid being instantly blocked, let's set some browser-like headers
session = httpx.Client(
    headers={
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
    },
)

# 1. Scrape the page and parse hidden web data
response = session.get(
    "https://www.realtor.com/realestateandhomes-detail/335-30th-Ave_San-Francisco_CA_94121_M17833-49194"
)
assert response.status_code == 200, "response is banned - try ScrapFly? 😉"
selector = Selector(text=response.text)
# find <script id="__NEXT_DATA__"> node and select its text:
data = selector.css("script#__NEXT_DATA__::text").get()
# load the hidden JSON as python dictionary:
data = json.loads(data)

# here we define our JSONPath helpers: one to select the first match and one to select all matches:
def jp_first(query, data):
    # raises IndexError if the query matches nothing
    return jp.parse(query).find(data)[0].value

def jp_all(query, data):
    return [match.value for match in jp.parse(query).find(data)]

prop_data = jp_first("$..propertyDetails", data)
result = {
    # for some fields we don't need complex queries:
    "id": prop_data["listing_id"],
    "url": prop_data["href"],
    "status": prop_data["status"],
    "price": prop_data["list_price"],
    "price_per_sqft": prop_data["price_per_sqft"],
    "date": prop_data["list_date"],
    "details": prop_data["description"],
    # to reduce complex datafields we can use jsonpath again:
    # e.g. we can select by key anywhere in the data structure:
    "estimate_high": jp_first("$..estimate_high", prop_data),
    "estimate_low": jp_first("$..estimate_low", prop_data),
    "post_code": jp_first("$..postal_code", prop_data),
    # or iterate through arrays:
    "features": jp_all("$..details[*].text[0]", prop_data),
    "photos": jp_all("$..photos[*].href", prop_data),
    "buyer_emails": jp_all("$..buyers[*].email", prop_data),
    "buyer_phones": jp_all("$..buyers[*].phones[*].number", prop_data),
}
print(result)
Example Output
{
  "id": "2950457253",
  "url": "https://www.realtor.com/realestateandhomes-detail/335-30th-Ave_San-Francisco_CA_94121_M17833-49194",
  "status": "sold",
  "price": 2995000,
  "price_per_sqft": 982,
  "date": "2022-12-04T23:43:42Z",
  "details": {
    "baths": 4,
    "baths_3qtr": null,
    "baths_full": 3,
    "baths_full_calc": 3,
    "baths_half": 1,
    "baths_max": null,
    "baths_min": null,
    "baths_partial_calc": 1,
    "baths_total": null,
    "beds": 4,
    "beds_max": null,
    "beds_min": null,
    "construction": null,
    "cooling": null,
    "exterior": null,
    "fireplace": null,
    "garage": null,
    "garage_max": null,
    "garage_min": null,
    "garage_type": null,
    "heating": null,
    "logo": null,
    "lot_sqft": 3000,
    "name": null,
    "pool": null,
    "roofing": null,
    "rooms": null,
    "sqft": 3066,
    "sqft_max": null,
    "sqft_min": null,
    "stories": null,
    "styles": [
      "craftsman_bungalow"
    ],
    "sub_type": null,
    "text": "With four bedrooms, three and one-half baths, and over 3, 000 square feet of living space, 335 30th avenue offers a fantastic modern floor plan with classic finishes in the best family-friendly neighborhood in San Francisco. Originally constructed in 1908, the house underwent a total gut renovation and expansion in 2014, with an upgraded foundation, all new plumbing and electrical, double-pane windows and all new energy efficient appliances. Interior walls were removed on the main level to create a large flowing space. The home is detached on three sides (East, South, and West) and enjoys an abundance of natural light. The top floor includes the primary bedroom with two gorgeous skylights and an en-suite bath; two kids bedrooms and a shared hall bath. The main floor offers soaring ten foot ceilings and a modern, open floor plan perfect for entertaining. The combined family room - kitchen space is the heart of the home and keeps everyone together in one space. Just outside the breakfast den, the back deck overlooks the spacious yard and offers indoor/outdoor living. The ground floor encompasses the garage, a laundry room, and a suite of rooms that could serve as work-from-home space, AirBnB, or in-law unit.",
    "type": "single_family",
    "units": null,
    "year_built": 1908,
    "year_renovated": null,
    "zoning": null,
    "__typename": "HomeDescription"
  },
  "estimate_high": 3253200,
  "estimate_low": 2824400,
  "post_code": "94111",
  "features": [
    "Bedrooms: 4",
    "Total Rooms: 11",
    "Total Bathrooms: 4",
    "Built-In Gas Oven",
    "Breakfast Area",
    "Fireplace Features: Brick, Family Room, Wood Burning",
    "Interior Amenities: Dining Room, Family Room, Guest Quarters, Kitchen, Laundry, Living Room, Primary Bathroom, Primary Bedroom, Office, Workshop",
    "Balcony",
    "Lot Description: Adjacent to Golf Course, Landscape Back, Landscape Front, Low Maintenance, Manual Sprinkler Rear, Zero Lot Line",
    "Driveway: Gated, Paved Sidewalk, Sidewalk/Curb/Gutter",
    "View: Bay, Bridges, City, San Francisco",
    "Association: No",
    "Source Listing Status: Closed",
    "Total Square Feet Living: 3066",
    "Sewer: Public Sewer, Septic Connected"
  ],
  "photos": [
    "http://ap.rdcpix.com/f707c59fa49468fde4999bbd9e2d433bl-m872089375s.jpg",
    "http://ap.rdcpix.com/f707c59fa49468fde4999bbd9e2d433bl-m872089375s.jpg",
    "http://ap.rdcpix.com/f707c59fa49468fde4999bbd9e2d433bl-m872089375s.jpg"
  ],
  "buyer_emails": [
    "REDACTED_FOR_BLOG@REDACTED_FOR_BLOG.com"
  ],
  "buyer_phones": [
    "415296XXXX",
    "415901XXXX",
    "415901XXXX"
  ]
}

Just as we'd use XPath to parse HTML datasets, we can use JSONPath to parse JSON datasets. JSONPath is a powerful yet simple language that works especially well with hidden web data.

How to Scrape Realtor.com - Real Estate Property Data

For a full tutorial on scraping Realtor.com, see our complete scrape guide, which covers scraping, parsing and how to avoid blocking.

FAQ

To wrap up this JSONPath tutorial, let's take a look at some frequently asked questions:

What's the difference between JMESPath and JSONPath?

JMESPath is another popular JSON query language that is available in more programming languages. The main difference is that JSONPath follows XPath syntax, allowing recursive selectors and easy extensibility, while JMESPath allows easier dataset mutation and filtering. We recommend JSONPath for extracting nested data, while JMESPath is better for processing more complex but predictable datasets.

Is JSONPath Slow?

Since JSON data is translated to native objects, JSONPath can be very fast, depending on the implementation and the algorithms used. Since JSONPath is just a query specification rather than a single project, speed varies by implementation, but generally it should be as fast as XPath for XML or even faster.

JSONPath in Web Scraping Summary

In this introduction tutorial, we've taken a look at the JSONPath query language for JSON in Python. This path language is heavily inspired by XPath and allows us to extract nested data from JSON datasets, which means it fits the web scraping stack well, as we can use two similar technologies for extracting data from HTML and JSON.

Quick Intro to Parsing JSON with JMESPath in Python

For an alternative to JSONPath, see our intro to JMESPath, which is another popular format for parsing JSON datasets in Python.

Finally, we've taken a look at a real-life example of using the jsonpath-ng library for parsing hidden web data from realtor.com, where we extracted the main property listing data fields in just a few lines of code.
