JSONPath is a path expression language for querying JSON data. It selects values from JSON objects using a syntax similar to XPath, the query language used to parse XML documents.
As the name implies, JSONPath syntax is heavily inspired by XPath and offers similar atomic expressions and querying capabilities:
Array slicing
Filtering and wildcard matching
Function calls and custom function extensions
Recursive data lookup, such as finding a specific key anywhere in the dataset
In this tutorial, we'll explore how to use JSONPath expressions for web scraping. We'll be using a Python client, but the same concepts apply to other JSONPath implementations. Let's get started!
JSONPath Setup
JSONPath long lacked a centralized standard (RFC 9535 was only published in 2024), so implementations differ slightly between languages and projects. Below are common clients for different languages:
For this guide, we'll be using a JSONPath Python implementation, specifically jsonpath-ng. Install it using the following pip command:
pip install jsonpath-ng
Introduction to Python JSONPath
Let's start with a simple Python JSONPath parser. We'll query the JSON object below using simple path expressions:
import jsonpath_ng.ext as jp
data = {
    "products": [
        {"name": "Apple", "price": 12.88, "tags": ["fruit", "red"]},
        {"name": "Peach", "price": 27.25, "tags": ["fruit", "yellow"]},
        {"name": "Cake", "tags": ["pastry", "sweet"]},
    ]
}
# find all product names:
query = jp.parse("products[*].name")
for match in query.find(data):
    print(match.value)
# find all products with price > 20
query = jp.parse("products[?price>20].name")
for match in query.find(data):
    print(match.value)
The first query, products[*].name, starts at the products key, uses the [*] wildcard to iterate over every element of the array, and returns the name property of each element.
The second query adds a filter. Since JSONPath supports filtering expressions, we use the [?price>20] filter to keep only the elements with a price greater than 20 before selecting their name.
JSONPath Expressions
The table below lists some of the common operators used in JSONPath expressions:

| operator | function |
| --- | --- |
| $ | object root selector |
| @ or this | current object selector |
| .. | recursive descendant selector |
| * | wildcard, selects any key of an object or index of an array |
| [] | subscript operator |
| [start:end:step] | array slice operator |
| [?<predicate>] or [?(<predicate>)] | filter operator, where predicate is some evaluation rule like [?price>20] |

More filter examples:
[?price > 10 & price < 20] for multiple conditions
[?address.city = "Boston"] for exact matches
[?description.text =~ "house"] for regular expression matches
The above expression examples open the door for powerful JSON parsing options. Let's explore using them in the context of web scraping!
Web Scraper Example
Let's put JSONPath to work in a real example by taking a look at how it would be used in web scraping with Python.
We'll be scraping real estate property data from realtor.com, which is a popular portal for renting and selling real estate properties.
This website, like many modern websites, uses JavaScript to render its pages, which means we can't just scrape the visible HTML. Instead, we'll find the JSON data that the frontend uses to render the page. This is called hidden web data.
Hidden web data can often be found in the HTML code and extracted using an HTML parser. However, this data is often cluttered with keywords, ids and other non-data fields, which is why we'll be using JSONPath to extract only the useful data.
Let's take a look at any random example property like this one
If we take a look at the page source we can see the JSON data set hidden in a <script> tag:
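As a rough illustration of what hidden web data looks like, here is a sketch that pulls JSON out of a script element. The HTML is a made-up miniature (the props/pageProps nesting and the values are assumptions, not realtor.com's actual page), and it uses the standard library's html.parser rather than parsel so that it runs without extra installs; NextDataParser is a hypothetical helper class:

```python
import json
from html.parser import HTMLParser

# Made-up miniature page; real pages embed a much larger JSON document
SAMPLE_HTML = """
<html><body>
<div id="app"></div>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"pageProps": {"propertyDetails": {"listing_id": "123", "list_price": 500000}}}}
</script>
</body></html>
"""


class NextDataParser(HTMLParser):
    """Collects the text inside <script id="__NEXT_DATA__">."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == "__NEXT_DATA__":
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


parser = NextDataParser()
parser.feed(SAMPLE_HTML)
# The script body is plain JSON, so json.loads turns it into a dictionary
hidden = json.loads("".join(parser.chunks))
print(hidden["props"]["pageProps"]["propertyDetails"]["list_price"])
```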
To scrape this, we'll be using a few Python packages:
httpx - HTTP client to retrieve the page.
parsel - HTML parsing library to extract <script> element data.
jsonpath-ng - To parse the JSON data for property data fields.
All of these can be installed using the pip install command:
$ pip install jsonpath-ng httpx parsel
So, we'll retrieve the HTML page, find the <script> element that contains the hidden web data and then use JSONPath to extract the most important property data fields:
import json
import httpx
from parsel import Selector
import jsonpath_ng as jp
# establish an HTTP client; to avoid being instantly banned, let's set some browser-like headers
session = httpx.Client(
    headers={
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
    },
)
# 1. Scrape the page and parse hidden web data
response = session.get(
    "https://www.realtor.com/realestateandhomes-detail/335-30th-Ave_San-Francisco_CA_94121_M17833-49194"
)
assert response.status_code == 200, "response is banned - try ScrapFly? 😉"
selector = Selector(text=response.text)
# find <script id="__NEXT_DATA__"> node and select its text:
data = selector.css("script#__NEXT_DATA__::text").get()
# load the hidden JSON as python dictionary:
data = json.loads(data)
# here we define our JSONPath helpers: one to select first match and one to select all matches:
jp_first = lambda query, data: jp.parse(query).find(data)[0].value
jp_all = lambda query, data: [match.value for match in jp.parse(query).find(data)]
prop_data = jp_first("$..propertyDetails", data)
result = {
    # for some fields we don't need complex queries:
    "id": prop_data["listing_id"],
    "url": prop_data["href"],
    "status": prop_data["status"],
    "price": prop_data["list_price"],
    "price_per_sqft": prop_data["price_per_sqft"],
    "date": prop_data["list_date"],
    "details": prop_data["description"],
    # to reduce complex datafields we can use jsonpath again:
    # e.g. we can select by key anywhere in the data structure:
    "estimate_high": jp_first("$..estimate_high", prop_data),
    "estimate_low": jp_first("$..estimate_low", prop_data),
    "post_code": jp_first("$..postal_code", prop_data),
    # or iterate through arrays:
    "features": jp_all("$..details[*].text[0]", prop_data),
    "photos": jp_all("$..photos[*].href", prop_data),
    "buyer_emails": jp_all("$..buyers[*].email", prop_data),
    "buyer_phones": jp_all("$..buyers[*].phones[*].number", prop_data),
}
print(result)
Example Output
{
  "id": "2950457253",
  "url": "https://www.realtor.com/realestateandhomes-detail/335-30th-Ave_San-Francisco_CA_94121_M17833-49194",
  "status": "sold",
  "price": 2995000,
  "price_per_sqft": 982,
  "date": "2022-12-04T23:43:42Z",
  "details": {
    "baths": 4,
    "baths_3qtr": null,
    "baths_full": 3,
    "baths_full_calc": 3,
    "baths_half": 1,
    "baths_max": null,
    "baths_min": null,
    "baths_partial_calc": 1,
    "baths_total": null,
    "beds": 4,
    "beds_max": null,
    "beds_min": null,
    "construction": null,
    "cooling": null,
    "exterior": null,
    "fireplace": null,
    "garage": null,
    "garage_max": null,
    "garage_min": null,
    "garage_type": null,
    "heating": null,
    "logo": null,
    "lot_sqft": 3000,
    "name": null,
    "pool": null,
    "roofing": null,
    "rooms": null,
    "sqft": 3066,
    "sqft_max": null,
    "sqft_min": null,
    "stories": null,
    "styles": [
      "craftsman_bungalow"
    ],
    "sub_type": null,
    "text": "With four bedrooms, three and one-half baths, and over 3, 000 square feet of living space, 335 30th avenue offers a fantastic modern floor plan with classic finishes in the best family-friendly neighborhood in San Francisco. Originally constructed in 1908, the house underwent a total gut renovation and expansion in 2014, with an upgraded foundation, all new plumbing and electrical, double-pane windows and all new energy efficient appliances. Interior walls were removed on the main level to create a large flowing space. The home is detached on three sides (East, South, and West) and enjoys an abundance of natural light. The top floor includes the primary bedroom with two gorgeous skylights and an en-suite bath; two kids bedrooms and a shared hall bath. The main floor offers soaring ten foot ceilings and a modern, open floor plan perfect for entertaining. The combined family room - kitchen space is the heart of the home and keeps everyone together in one space. Just outside the breakfast den, the back deck overlooks the spacious yard and offers indoor/outdoor living. The ground floor encompasses the garage, a laundry room, and a suite of rooms that could serve as work-from-home space, AirBnB, or in-law unit.",
    "type": "single_family",
    "units": null,
    "year_built": 1908,
    "year_renovated": null,
    "zoning": null,
    "__typename": "HomeDescription"
  },
  "estimate_high": 3253200,
  "estimate_low": 2824400,
  "post_code": "94111",
  "features": [
    "Bedrooms: 4",
    "Total Rooms: 11",
    "Total Bathrooms: 4",
    "Built-In Gas Oven",
    "Breakfast Area",
    "Fireplace Features: Brick, Family Room, Wood Burning",
    "Interior Amenities: Dining Room, Family Room, Guest Quarters, Kitchen, Laundry, Living Room, Primary Bathroom, Primary Bedroom, Office, Workshop",
    "Balcony",
    "Lot Description: Adjacent to Golf Course, Landscape Back, Landscape Front, Low Maintenance, Manual Sprinkler Rear, Zero Lot Line",
    "Driveway: Gated, Paved Sidewalk, Sidewalk/Curb/Gutter",
    "View: Bay, Bridges, City, San Francisco",
    "Association: No",
    "Source Listing Status: Closed",
    "Total Square Feet Living: 3066",
    "Sewer: Public Sewer, Septic Connected"
  ],
  "photos": [
    "http://ap.rdcpix.com/f707c59fa49468fde4999bbd9e2d433bl-m872089375s.jpg",
    "http://ap.rdcpix.com/f707c59fa49468fde4999bbd9e2d433bl-m872089375s.jpg",
    "http://ap.rdcpix.com/f707c59fa49468fde4999bbd9e2d433bl-m872089375s.jpg"
  ],
  "buyer_emails": [
    "REDACTED_FOR_BLOG@REDACTED_FOR_BLOG.com"
  ],
  "buyer_phones": [
    "415296XXXX",
    "415901XXXX",
    "415901XXXX"
  ]
}
Just as we'd use XPath to parse HTML datasets, we can use JSONPath to parse JSON datasets. JSONPath is a powerful yet simple language that works especially well with hidden web data.
FAQ
To wrap up this JSONPath tutorial, let's take a look at some frequently asked questions:
What's the difference between JMESPath and JSONPath?
JMESPath is another popular JSON query language that is available in more programming languages. The main difference is that JSONPath follows XPath syntax, allowing recursive selectors and easy extensibility, while JMESPath allows easier dataset mutation and filtering. We recommend JSONPath for extracting nested data, while JMESPath is better for processing more complex but predictable datasets.
Is JSONPath Slow?
Since JSON data is translated to native objects, JSONPath can be very fast depending on the implementation and the algorithms used. Since JSONPath is a query specification rather than a single project, speed varies by implementation, but generally it should be as fast as XPath for XML or even faster.
JSONPath in Web Scraping Summary
In this introductory tutorial, we've taken a look at the JSONPath query language for JSON in Python. This path language is heavily inspired by XPath and allows us to extract nested data from JSON datasets, which means it fits the web scraping stack perfectly, as we can use two similar technologies for extracting data from HTML and JSON.
Finally, we've taken a look at a real-life example of using the jsonpath-ng library for parsing hidden web data from realtor.com, where we extracted the main property listing data fields in just a few lines of code.