Quick Intro to Parsing JSON with JMESPath in Python

by Bernardas Ališauskas May 10, 2024

#data-parsing #python

JMESPath is a popular JSON query language used for parsing JSON datasets. It has gained popularity in web scraping as JSON is becoming the most popular data structure in this medium.

Many popular web scraping targets contain hidden JSON data that can be extracted directly. Unfortunately, these datasets are huge and contain loads of useless data. This makes JSON parsing an important part of the modern web scraping process.

In this JMESPath tutorial, we'll take a quick look at this path language in the context of web scraping and Python. We'll cover setup and the most used features, then work through a quick real-life example project by scraping Realtor.com.

What is JMESPath?

JMESPath is a path language for parsing JSON datasets. In short, it allows writing path rules for selecting specific data fields in JSON.

When web scraping, JMESPath is similar to XPath or CSS selectors we use to parse HTML - but for JSON. This makes JMESPath a brilliant addition to our web scraping toolset as HTML and JSON are the most common data formats in this niche.

JMESPath Setup

JMESPath is implemented in many different languages:

Language	Implementation
Python	jmespath.py
PHP	jmespath.php
JavaScript	jmespath.js
Ruby	jmespath.rb
Lua	jmespath.lua
Go	go-jmespath
Java	jmespath-java
Rust	jmespath.rs
.NET	jmespath.net

In this tutorial, we'll be using Python, though usage in other languages should be very similar.

To install jmespath in Python, we can use the pip install terminal command:

$ pip install jmespath

JMESPath Usage Tutorial

You're probably familiar with dictionary/hashtable dot-based path selectors like data.address.zipcode - this dot notation is the foundation of JMESPath but it can do much more!

Just like with Python's lists, we can also slice and index JMESPath arrays:

import jmespath
data = {
    "people": [
        {"address": ["123 Main St", "California" ,"US"]},
    ]
}
jmespath.search("people[0].address[:2]", data)
['123 Main St', 'California']

Further, we can use projections, which apply the rest of the path to each list element. This is done through the [] syntax:

data = {
  "people": [
    {"address": ["123 Main St", "California" ,"US"]},
    {"address": ["345 Alt St", "Alaska" ,"US"]},
  ]
}
jmespath.search("people[].address[:2]", data)
[
  ['123 Main St', 'California'], 
  ['345 Alt St', 'Alaska']
]

Similar projections can be applied to objects (dictionaries) using the * syntax, which selects every value of the object:

data = {
  "people": {
    "foo": {"email": "foo@email.com"},
    "bar": {"email": "bar@email.com"},
  }
}
jmespath.search("people.*.email", data)
[
  "foo@email.com",
  "bar@email.com",
]

The most interesting feature of JMESPath for web scraping has to be data reshaping. Using the .[] and .{} syntax we can completely reshape lists and objects:

data = {
  "people": [
    {
      "name": "foo", 
      "age": 33, 
      "addresses": [
        "123 Main St", "California", "US"
      ],
      "primary_email": "foo@email.com",
      "secondary_email": "bar@email.com",
    }
  ]
}
jmespath.search("""
  people[].{
    first_name: name, 
    age_in_years: age,
    address: addresses[0],
    state: addresses[1],
    country: addresses[2],
    emails: [primary_email, secondary_email]
  }
""", data)
[
  {
    'address': '123 Main St',
    'age_in_years': 33,
    'country': 'US',
    'emails': ['foo@email.com', 'bar@email.com'],
    'first_name': 'foo',
    'state': 'California'
  }
]

As you can see, using JMESPath we can easily parse complex datasets into something more digestible which is especially useful when web scraping JSON datasets.

Web Scraper Example

Let's explore a real-life JMESPath Python example by taking a look at how it would be used in web scraping.

In this example project, we'll be scraping real estate property data on realtor.com which is a popular US portal for renting and selling properties.

We'll be using a few Python packages:

httpx - HTTP client library which will let us communicate with Realtor.com's servers
parsel - HTML parsing library which will help us to parse our web scraped HTML files.

And, of course, jmespath for parsing JSON. All of these can be installed using the pip install command:

$ pip install jmespath httpx parsel

Realtor.com is using hidden web data to render its property pages which means instead of parsing HTML we can find the whole JSON dataset hidden in the HTML code.

Let's take a look at an example property page (the same URL is used in the scraper code below).

If we take a look at the page source, we can see the JSON dataset hidden in a <script> tag:

[Illustration: realtor.com page source, where the entire property dataset is hidden in a script element]

We can extract it using an HTML parser, though the dataset is huge and mostly contains data that is only useful to the website itself. So, we can parse out the useful bits using JMESPath:

import json
import httpx
import jmespath
from parsel import Selector

# establish an HTTP client with browser-like headers to avoid being instantly blocked
session = httpx.Client(
    headers={
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
    },
)

# 1. Scrape the page and parse hidden web data
response = session.get(
    "https://www.realtor.com/realestateandhomes-detail/335-30th-Ave_San-Francisco_CA_94121_M17833-49194"
)
assert response.status_code == 200, "response is banned - try ScrapFly? 😉"
selector = Selector(text=response.text)
# find the <script id="__NEXT_DATA__"> node and select its text:
data = selector.css("script#__NEXT_DATA__::text").get()
# load JSON as python dictionary and select property value:
data = json.loads(data)["props"]["pageProps"]["initialProps"]["propertyData"]

# 2. Reduce web dataset to important data fields:
result = jmespath.search(
    """{
    id: listing_id,
    url: href,
    status: status,
    price: list_price,
    price_per_sqft: price_per_sqft,
    date: list_date,
    details: description,
    features: details[].text[],
    photos: photos[].{url: href, tags: tags[].label}
}
""", data)
print(result)

Example Output

{
  "id": "2950457253",
  "url": "https://www.realtor.com/realestateandhomes-detail/335-30th-Ave_San-Francisco_CA_94121_M17833-49194",
  "status": "for_sale",
  "price": 2995000,
  "price_per_sqft": 977,
  "date": "2022-12-04T23:43:42Z",
  "details": {
    "baths": 4,
    "baths_consolidated": "3.5",
    "baths_full": 3,
    "baths_3qtr": null,
    "baths_half": 1,
    "baths_1qtr": null,
    "baths_min": null,
    "baths_max": null,
    "beds_min": null,
    "beds_max": null,
    "beds": 4,
    "garage": null,
    "pool": null,
    "sqft": 3066,
    "sqft_min": null,
    "sqft_max": null,
    "lot_sqft": 3000,
    "rooms": null,
    "stories": null,
    "sub_type": null,
    "text": "With four bedrooms, three and one-half baths, and over 3, 000 square feet of living space, 335 30th avenue offers a fantastic modern floor plan with classic finishes in the best family-friendly neighborhood in San Francisco. Originally constructed in 1908, the house underwent a total gut renovation and expansion in 2014, with an upgraded foundation, all new plumbing and electrical, double-pane windows and all new energy efficient appliances. Interior walls were removed on the main level to create a large flowing space. The home is detached on three sides (East, South, and West) and enjoys an abundance of natural light. The top floor includes the primary bedroom with two gorgeous skylights and an en-suite bath; two kids bedrooms and a shared hall bath. The main floor offers soaring ten foot ceilings and a modern, open floor plan perfect for entertaining. The combined family room - kitchen space is the heart of the home and keeps everyone together in one space. Just outside the breakfast den, the back deck overlooks the spacious yard and offers indoor/outdoor living. The ground floor encompasses the garage, a laundry room, and a suite of rooms that could serve as work-from-home space, AirBnB, or in-law unit.",
    "type": "single_family",
    "units": null,
    "unit_type": null,
    "year_built": 1908,
    "name": null
  },
  "features": [
    "Bedrooms: 4",
    "...etc, reduced for blog"
  ],
  "photos": [
    {
      "url": "https://ap.rdcpix.com/f707c59fa49468fde4999bbd9e2d433bl-m872089375s.jpg",
      "tags": [
        "garage",
        "house_view",
        "porch",
        "complete",
        "front"
      ]
    },
    "...etc, reduced for blog"
  ]
}

Using JMESPath we managed to reduce thousands of lines of JSON to essential data fields in just a few lines of Python code and a single JMESPath query - pretty awesome!

FAQ

To wrap up this JMESPath tutorial, let's take a look at some frequently asked questions:

What's the difference between [] and [*] in JMESPath?

The [] operator flattens the results, merging nested lists by one level, while [*] keeps the structure as it is in the original dataset. See this example in Python:

data = {
  "employees": [
    {
      "people": [
        {"address": ["123 Main St", "California", "US"]},
        {"address": ["456 Sec St", "Nevada", "US"]},
      ],
    },
    {
      "people": [
        {"address": ["789 Main St", "Washington", "US"]},
        {"address": ["a12 Sec St", "Alaska", "US"]},
      ],
    },
  ]
}

jmespath.search("employees[*].people[*].address", data)
[
  # from the first group:
  [['123 Main St', 'California', 'US'], ['456 Sec St', 'Nevada', 'US']],
  # from the second group:
  [['789 Main St', 'Washington', 'US'], ['a12 Sec St', 'Alaska', 'US']]
]

jmespath.search("employees[].people[].address", data)
[
  # all groups merged:
  ['123 Main St', 'California', 'US'],
  ['456 Sec St', 'Nevada', 'US'],
  ['789 Main St', 'Washington', 'US'],
  ['a12 Sec St', 'Alaska', 'US']
]

Can JMESPath be used on HTML?

No. For HTML, refer to the very similar path languages designed for it, like CSS selectors and XPath selectors.

What are some alternatives to JMESPath?

Other popular query languages for JSON include JSONPath and jq.

JMESPath in Web Scraping Summary

In this tutorial on JMESPath we did a quick overview of what this path language is capable of when it comes to parsing JSON.

We've covered JMESPath's most used selectors (dot paths, indexing and slicing, projections and data reshaping) through Python examples using the jmespath Python library.

Finally, to wrap everything up we've taken a look at how JMESPath is used in web scraping through a real-life scraper.
