Introduction to Parsing JSON with Python JSONPath

by Bernardas Ališauskas Jul 24, 2024

#data-parsing #python

JSONPath is a path expression language to parse JSON data. It's used to query data from JSON objects using a similar syntax to the XPath query language used to parse XML documents.

As the name implies, JSONPath syntax is heavily inspired by XPath and offers similar atomic expressions and querying capabilities:

Array slicing
Filtering and wildcard matching
Function calls and custom function extensions
Recursive data lookup, such as finding specific data points across the entire dataset nodes

In this tutorial, we'll explore how to use the JSONPath expressions for web scraping. For this, we'll be using the Python client, but the same concepts can be applied to other JSONPath implementations. Let's get started!

JSONPath Setup

JSONPath's query specification lacks a centralized standard, resulting in slight differences for each language implementation across different projects. Below are common clients for different languages:

Language	Implementation
Python	jsonpath-ng jsonpath2 jsonpath-rw
JavaScript	jsonpath-plus
Ruby	jsonpath
R	rjsonpath
Go	ojg jsonpath

For this guide, we'll be using a JSONPath Python implementation, specifically jsonpath-ng. Install it using the following pip command:

pip install jsonpath-ng

Introduction to Python JSONPath

Let's start with a simple Python JSONPath example. We'll query the JSON object below using simple string expressions:

import jsonpath_ng.ext as jp

data = {
    "products": [
        {"name": "Apple", "price": 12.88, "tags": ["fruit", "red"]},
        {"name": "Peach", "price": 27.25, "tags": ["fruit", "yellow"]},
        {"name": "Cake", "tags": ["pastry", "sweet"]},
    ]
}

# find all product names:
query = jp.parse("products[*].name")
for match in query.find(data):
    print(match.value)

# find all products with price > 20
query = jp.parse("products[?price>20].name")
for match in query.find(data):
    print(match.value)

The above JSONPath finder uses the products[*].name expression to select the products key of the root object and iterate over every element of its array with the [*] wildcard.

Then, we return the name property of each element. Since JSONPath supports filtering expressions, we use the [?price>20] filter to return only the elements with a price greater than 20. Elements without a price key, such as Cake, are skipped.

JSONPath Expressions

The table below lists some of the common operators used in JSONPath expressions:

operator	function
`$`	object root selector
`@` or `this`	current object selector
`..`	recursive descendant selector
`*`	wildcard, selects any key of an object or index of an array
`[]`	subscript operator
`[start:end:step]`	array slice operator
`[?<predicate>]` or `[?(<predicate>)]`	filter operator where the predicate is an evaluation rule such as `[?price>20]`; more examples:
	`[?price > 10 & price < 20]` for multiple conditions
	`[?address.city = "Boston"]` for exact matches
	`[?description.text =~ "house"]` for regular expression matches, such as values containing a word

The above expression examples open the door for powerful JSON parsing options. Let's explore using them in the context of web scraping!

Web Scraper Example

Let's use JSONPath in a real example scraper to see how it fits into web scraping with Python.

We'll be scraping real estate property data from realtor.com which is a popular portal for renting and selling real estate properties.

This website, like many modern websites, uses JavaScript to render its pages, which means we can't just scrape the HTML code. Instead, we'll find the JSON variable data that is used by the frontend to render the page. This is called hidden web data.

Hidden web data can often be found in the HTML code and extracted using an HTML parser. However, this data is often filled with keywords, IDs and other non-data fields, which is why we'll be using JSONPath to extract only the useful data.

Let's take a look at a random example property page.

If we take a look at the page source we can see the JSON data set hidden in a <script> tag:

Hidden web data illustration in the realtor.com page source: we can see the entire property dataset hidden in a script element.
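To make the idea concrete without any network access, here is a minimal standard-library sketch that pulls a JSON payload out of a script element. The HTML snippet, its key names and the extract_hidden_json helper are made up for illustration, and a regular expression is a fragile substitute for a real HTML parser:

```python
import json
import re

# Hypothetical, heavily shortened page source; real pages are far larger
html = """
<html><body>
<div id="app"></div>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"pageProps": {"propertyDetails": {"listing_id": "123", "list_price": 500000}}}}
</script>
</body></html>
"""


def extract_hidden_json(page_source, script_id="__NEXT_DATA__"):
    """Return the parsed JSON body of <script id=script_id>, or None if absent."""
    pattern = re.compile(
        r'<script[^>]*\bid="' + re.escape(script_id) + r'"[^>]*>(.*?)</script>',
        re.DOTALL,
    )
    match = pattern.search(page_source)
    if match is None:
        return None
    # json.loads tolerates the surrounding whitespace and newlines
    return json.loads(match.group(1))


hidden = extract_hidden_json(html)
print(hidden["props"]["pageProps"]["propertyDetails"]["list_price"])
```

Once loaded, the payload is an ordinary Python dictionary, which is exactly the kind of input JSONPath queries operate on.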

To scrape this, we'll be using a few Python packages:

httpx - HTTP client library to retrieve the page.
parsel - HTML parsing library to extract <script> element data.
jsonpath-ng - To parse the JSON data for property data fields.

All of these can be installed using the pip install command:

$ pip install jsonpath-ng httpx parsel

So, we'll retrieve the HTML page, find the <script> element that contains the hidden web data and then use JSONPath to extract the most important property data fields:

import json
import httpx
from parsel import Selector
import jsonpath_ng as jp

# establish an HTTP client and, to prevent being instantly banned, set some browser-like headers
session = httpx.Client(
    headers={
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
    },
)

# 1. Scrape the page and parse hidden web data
response = session.get(
    "https://www.realtor.com/realestateandhomes-detail/335-30th-Ave_San-Francisco_CA_94121_M17833-49194"
)
assert response.status_code == 200, "response is banned - try ScrapFly? 😉"
selector = Selector(text=response.text)
# find the <script id="__NEXT_DATA__"> node and select its text:
data = selector.css("script#__NEXT_DATA__::text").get()
# load the hidden JSON as python dictionary:
data = json.loads(data)

# here we define our JSONPath helpers: one to select first match and one to select all matches:
jp_first = lambda query, data: jp.parse(query).find(data)[0].value
jp_all = lambda query, data: [match.value for match in jp.parse(query).find(data)]

prop_data = jp_first("$..propertyDetails", data)
result = {
    # for some fields we don't need complex queries:
    "id": prop_data["listing_id"],
    "url": prop_data["href"],
    "status": prop_data["status"],
    "price": prop_data["list_price"],
    "price_per_sqft": prop_data["price_per_sqft"],
    "date": prop_data["list_date"],
    "details": prop_data["description"],
    # to reduce complex datafields we can use jsonpath again:
    # e.g. we can select by key anywhere in the data structure:
    "estimate_high": jp_first("$..estimate_high", prop_data),
    "estimate_low": jp_first("$..estimate_low", prop_data),
    "post_code": jp_first("$..postal_code", prop_data),
    # or iterate through arrays:
    "features": jp_all("$..details[*].text[0]", prop_data),
    "photos": jp_all("$..photos[*].href", prop_data),
    "buyer_emails": jp_all("$..buyers[*].email", prop_data),
    "buyer_phones": jp_all("$..buyers[*].phones[*].number", prop_data),
}
print(result)

Example Output

{
  "id": "2950457253",
  "url": "https://www.realtor.com/realestateandhomes-detail/335-30th-Ave_San-Francisco_CA_94121_M17833-49194",
  "status": "sold",
  "price": 2995000,
  "price_per_sqft": 982,
  "date": "2022-12-04T23:43:42Z",
  "details": {
    "baths": 4,
    "baths_3qtr": null,
    "baths_full": 3,
    "baths_full_calc": 3,
    "baths_half": 1,
    "baths_max": null,
    "baths_min": null,
    "baths_partial_calc": 1,
    "baths_total": null,
    "beds": 4,
    "beds_max": null,
    "beds_min": null,
    "construction": null,
    "cooling": null,
    "exterior": null,
    "fireplace": null,
    "garage": null,
    "garage_max": null,
    "garage_min": null,
    "garage_type": null,
    "heating": null,
    "logo": null,
    "lot_sqft": 3000,
    "name": null,
    "pool": null,
    "roofing": null,
    "rooms": null,
    "sqft": 3066,
    "sqft_max": null,
    "sqft_min": null,
    "stories": null,
    "styles": [
      "craftsman_bungalow"
    ],
    "sub_type": null,
    "text": "With four bedrooms, three and one-half baths, and over 3, 000 square feet of living space, 335 30th avenue offers a fantastic modern floor plan with classic finishes in the best family-friendly neighborhood in San Francisco. Originally constructed in 1908, the house underwent a total gut renovation and expansion in 2014, with an upgraded foundation, all new plumbing and electrical, double-pane windows and all new energy efficient appliances. Interior walls were removed on the main level to create a large flowing space. The home is detached on three sides (East, South, and West) and enjoys an abundance of natural light. The top floor includes the primary bedroom with two gorgeous skylights and an en-suite bath; two kids bedrooms and a shared hall bath. The main floor offers soaring ten foot ceilings and a modern, open floor plan perfect for entertaining. The combined family room - kitchen space is the heart of the home and keeps everyone together in one space. Just outside the breakfast den, the back deck overlooks the spacious yard and offers indoor/outdoor living. The ground floor encompasses the garage, a laundry room, and a suite of rooms that could serve as work-from-home space, AirBnB, or in-law unit.",
    "type": "single_family",
    "units": null,
    "year_built": 1908,
    "year_renovated": null,
    "zoning": null,
    "__typename": "HomeDescription"
  },
  "estimate_high": 3253200,
  "estimate_low": 2824400,
  "post_code": "94111",
  "features": [
    "Bedrooms: 4",
    "Total Rooms: 11",
    "Total Bathrooms: 4",
    "Built-In Gas Oven",
    "Breakfast Area",
    "Fireplace Features: Brick, Family Room, Wood Burning",
    "Interior Amenities: Dining Room, Family Room, Guest Quarters, Kitchen, Laundry, Living Room, Primary Bathroom, Primary Bedroom, Office, Workshop",
    "Balcony",
    "Lot Description: Adjacent to Golf Course, Landscape Back, Landscape Front, Low Maintenance, Manual Sprinkler Rear, Zero Lot Line",
    "Driveway: Gated, Paved Sidewalk, Sidewalk/Curb/Gutter",
    "View: Bay, Bridges, City, San Francisco",
    "Association: No",
    "Source Listing Status: Closed",
    "Total Square Feet Living: 3066",
    "Sewer: Public Sewer, Septic Connected"
  ],
  "photos": [
    "http://ap.rdcpix.com/f707c59fa49468fde4999bbd9e2d433bl-m872089375s.jpg",
    "http://ap.rdcpix.com/f707c59fa49468fde4999bbd9e2d433bl-m872089375s.jpg",
    "http://ap.rdcpix.com/f707c59fa49468fde4999bbd9e2d433bl-m872089375s.jpg"
  ],
  "buyer_emails": [
    "REDACTED_FOR_BLOG@REDACTED_FOR_BLOG.com"
  ],
  "buyer_phones": [
    "415296XXXX",
    "415901XXXX",
    "415901XXXX"
  ]
}

Just as we'd use XPath to parse HTML datasets we can use JSONPath to parse JSON datasets. JSONPath is a powerful yet simple language that works especially well when working with hidden web data.

FAQ

To wrap up this JSONPath tutorial, let's take a look at some frequently asked questions:

What's the difference between JMESPath and JSONPath?

JMESPath is another popular JSON query language that is available in more programming languages. The main difference is that JSONPath follows XPath syntax, allowing recursive selectors and easy extensibility, while JMESPath allows easier dataset mutation and filtering. We recommend JSONPath for extracting nested data, while JMESPath is better for processing more complex but predictable datasets.

Is JSONPath Slow?

Since JSON data is translated to native objects, JSONPath can be very fast, depending on the implementation and the algorithms used. Since JSONPath is just a query specification and not an individual project, speed varies by implementation, but generally it should be as fast as XPath for XML or even faster.

JSONPath in Web Scraping Summary

In this introductory tutorial, we've taken a look at the JSONPath query language for JSON in Python. This path language is heavily inspired by XPath and allows us to extract nested data from JSON datasets, which means it fits the web scraping stack well, as we can use two similar technologies for extracting data from HTML and JSON.

Finally, we've taken a look at a real-life example of using the jsonpath-ng library for parsing hidden web data from realtor.com, where we extracted the main property listing data fields in just a few lines of code.
