JMESPath is a popular query language for parsing JSON datasets. It has gained traction in web scraping as JSON has become the dominant data format on the web.
Many popular web scraping targets contain hidden JSON data that can be extracted directly. Unfortunately, these datasets are often huge and full of fields we don't need. This makes JSON parsing an important part of the modern web scraping process.
In this JMESPath tutorial, we'll take a quick look at this path language in the context of web scraping with Python. We'll cover setup, the most commonly used features and finish with a quick real-life example project scraping Realtor.com.
What is JMESPath?
JMESPath is a path language for parsing JSON datasets. In short, it allows writing path rules for selecting specific data fields in JSON.
When web scraping, JMESPath is similar to XPath or CSS selectors we use to parse HTML - but for JSON. This makes JMESPath a brilliant addition to our web scraping toolset as HTML and JSON are the most common data formats in this niche.
JMESPath Setup
JMESPath is implemented in many different programming languages.
In this tutorial, we'll be using the Python implementation, though usage in other languages should be very similar.
To install jmespath in Python, we can use the pip install terminal command:
$ pip install jmespath
JMESPath Usage Tutorial
You're probably familiar with dictionary/hashtable dot-based path selectors like data.address.zipcode - this dot notation is the foundation of JMESPath, but it can do much more!
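For example, here's a minimal sketch of a dot-notation lookup (the dataset below is made up purely for illustration):
import jmespath

data = {"address": {"zipcode": "94121", "city": "San Francisco"}}
# select a nested field using dot notation:
jmespath.search("address.zipcode", data)
'94121'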
Just like with Python's lists, we can also slice and index JMESPath arrays:
import jmespath
data = {
"people": [
{"address": ["123 Main St", "California" ,"US"]},
]
}
jmespath.search("people[0].address[:2]", data)
['123 Main St', 'California']
Further, we can apply projections, which apply a rule to each element of a list. This is done through the [] syntax:
data = {
"people": [
{"address": ["123 Main St", "California" ,"US"]},
{"address": ["345 Alt St", "Alaska" ,"US"]},
]
}
jmespath.search("people[].address[:2]", data)
[
['123 Main St', 'California'],
['345 Alt St', 'Alaska']
]
Just like with lists, we can also apply similar projections to objects (dictionaries) - for this, the * wildcard is used.
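For example, here's a quick sketch (with a made-up dataset) that selects a field from every value of an object:
data = {
    "people": {
        "john": {"age": 33, "country": "US"},
        "jane": {"age": 32, "country": "UK"},
    }
}
# the * projection iterates over every value of the "people" object:
jmespath.search("people.*.age", data)
[33, 32]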
The most interesting feature of JMESPath for web scraping has to be data reshaping. Using the .[] and .{} multiselect syntax we can completely reshape lists and objects.
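For example, here's a small sketch (again with a made-up dataset) that reshapes a list of objects using the {} multiselect hash and the [] multiselect list:
data = {
    "people": [
        {"name": "John", "age": 33, "address": ["123 Main St", "California", "US"]},
        {"name": "Jane", "age": 32, "address": ["345 Alt St", "Alaska", "US"]},
    ]
}
# reshape each element into a new object with the {} multiselect hash:
jmespath.search("people[].{name: name, state: address[1]}", data)
[{'name': 'John', 'state': 'California'}, {'name': 'Jane', 'state': 'Alaska'}]
# or into a list of selected values with the [] multiselect list:
jmespath.search("people[].[name, age]", data)
[['John', 33], ['Jane', 32]]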
As you can see, using JMESPath we can easily parse complex datasets into something more digestible which is especially useful when web scraping JSON datasets.
Web Scraper Example
Let's explore a real-life JMESPath Python example by taking a look at how it would be used in web scraping.
In this example project, we'll be scraping real estate property data on realtor.com which is a popular US portal for renting and selling properties.
We'll be using a few Python packages:
httpx - HTTP client library which will let us communicate with Realtor.com's servers
parsel - HTML parsing library which will help us to parse our web scraped HTML files.
And, of course, jmespath for parsing JSON. All of these can be installed using the pip install command:
$ pip install jmespath httpx parsel
Realtor.com uses hidden web data to render its property pages, which means that instead of parsing HTML we can find the whole JSON dataset hidden in the page's HTML code.
Let's take a look at an example property page (the same one requested in the code below).
If we take a look at the page source, we can see the JSON dataset hidden in a <script id="__NEXT_DATA__"> tag.
We can extract it using an HTML parser, though the dataset is huge and full of internal fields we don't need. So, let's parse out the useful bits using JMESPath:
import json
import httpx
import jmespath
from parsel import Selector
# establish an HTTP client and set some browser-like headers to prevent being instantly blocked
session = httpx.Client(
headers={
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
},
)
# 1. Scrape the page and parse hidden web data
response = session.get(
"https://www.realtor.com/realestateandhomes-detail/335-30th-Ave_San-Francisco_CA_94121_M17833-49194"
)
assert response.status_code == 200, "response is banned - try ScrapFly? 😉"
selector = Selector(text=response.text)
# find the <script id="__NEXT_DATA__"> node and select its text:
data = selector.css("script#__NEXT_DATA__::text").get()
# load JSON as python dictionary and select property value:
data = json.loads(data)["props"]["pageProps"]["initialProps"]["propertyData"]
# 2. Reduce web dataset to important data fields:
result = jmespath.search(
"""{
id: listing_id,
url: href,
status: status,
price: list_price,
price_per_sqft: price_per_sqft,
date: list_date,
details: description,
features: details[].text[],
photos: photos[].{url: href, tags: tags[].label}
}
""", data)
print(result)
Example Output
{
"id": "2950457253",
"url": "https://www.realtor.com/realestateandhomes-detail/335-30th-Ave_San-Francisco_CA_94121_M17833-49194",
"status": "for_sale",
"price": 2995000,
"price_per_sqft": 977,
"date": "2022-12-04T23:43:42Z",
"details": {
"baths": 4,
"baths_consolidated": "3.5",
"baths_full": 3,
"baths_3qtr": null,
"baths_half": 1,
"baths_1qtr": null,
"baths_min": null,
"baths_max": null,
"beds_min": null,
"beds_max": null,
"beds": 4,
"garage": null,
"pool": null,
"sqft": 3066,
"sqft_min": null,
"sqft_max": null,
"lot_sqft": 3000,
"rooms": null,
"stories": null,
"sub_type": null,
"text": "With four bedrooms, three and one-half baths, and over 3, 000 square feet of living space, 335 30th avenue offers a fantastic modern floor plan with classic finishes in the best family-friendly neighborhood in San Francisco. Originally constructed in 1908, the house underwent a total gut renovation and expansion in 2014, with an upgraded foundation, all new plumbing and electrical, double-pane windows and all new energy efficient appliances. Interior walls were removed on the main level to create a large flowing space. The home is detached on three sides (East, South, and West) and enjoys an abundance of natural light. The top floor includes the primary bedroom with two gorgeous skylights and an en-suite bath; two kids bedrooms and a shared hall bath. The main floor offers soaring ten foot ceilings and a modern, open floor plan perfect for entertaining. The combined family room - kitchen space is the heart of the home and keeps everyone together in one space. Just outside the breakfast den, the back deck overlooks the spacious yard and offers indoor/outdoor living. The ground floor encompasses the garage, a laundry room, and a suite of rooms that could serve as work-from-home space, AirBnB, or in-law unit.",
"type": "single_family",
"units": null,
"unit_type": null,
"year_built": 1908,
"name": null
},
"features": [
"Bedrooms: 4",
"...etc, reduced for blog"
],
"photos": [
{
"url": "https://ap.rdcpix.com/f707c59fa49468fde4999bbd9e2d433bl-m872089375s.jpg",
"tags": [
"garage",
"house_view",
"porch",
"complete",
"front"
]
},
"...etc, reduced for blog"
]
}
Using JMESPath, we managed to reduce thousands of lines of JSON to the essential data fields in just a few lines of Python code and a single JMESPath query - pretty awesome!
FAQ
To wrap this up JMESPath tutorial let's take a look at some frequently asked questions:
What's the difference between [] and [*] in JMESPath?
The [] syntax flattens all results, while [*] keeps the structure as it is in the original dataset. See this example in Python:
data = {
"employees": [
{
"people": [
{"address": ["123 Main St", "California", "US"]},
{"address": ["456 Sec St", "Nevada", "US"]},
],
},
{
"people": [
{"address": ["789 Main St", "Washington", "US"]},
{"address": ["a12 Sec St", "Alaska", "US"]},
],
},
]
}
jmespath.search("employees[*].people[*].address", data)
[
# from the first group:
[['123 Main St', 'California', 'US'], ['456 Sec St', 'Nevada', 'US']],
# from the second group:
[['789 Main St', 'Washington', 'US'], ['a12 Sec St', 'Alaska', 'US']]
]
jmespath.search("employees[].people[].address", data)
[
# all groups merged:
['123 Main St', 'California', 'US'],
['456 Sec St', 'Nevada', 'US'],
['789 Main St', 'Washington', 'US'],
['a12 Sec St', 'Alaska', 'US']
]