JSON Parsing Made Easy with ChatGPT in Web Scraping

Web scrapers can extract large amounts of JSON data from websites, but the data often needs parsing and cleanup. While data parsing is an important step in the web scraping process, it can be challenging when dealing with large JSON datasets.

JSON parsing languages like JMESPath use path syntax to filter and reshape JSON datasets, but writing the queries by hand involves a lot of manual labor. In this article, we'll take a look at using ChatGPT to generate JMESPath queries for us.

By using ChatGPT, we can easily get JMESPath selectors and refine JSON data. Let’s get into the details!

What is JMESPath?

JMESPath is a query language used for searching through JSON documents. It lets us create queries that filter and extract specific data fields from JSON datasets. In a way, JMESPath is very similar to XPath or CSS selectors, but for JSON data.

Quick Intro to Parsing JSON with JMESPath in Python

See our complete JMESPath introduction, which covers everything you need to know about this small path language and walks through a web scraping example.

In this article, we don’t have to understand or implement JMESPath selectors ourselves. Instead, we'll parse JSON with ChatGPT to get JMESPath queries that we can use in Python or any other JMESPath-supported language.

This approach is similar to finding XPath and CSS selectors with ChatGPT, which we've covered previously. This time, though, we'll use ChatGPT to interpret JSON and produce JMESPath queries.

Parse JSON with ChatGPT

When using web scraping tools that don't execute JavaScript (like BeautifulSoup), we might not find the data in the page HTML because it's rendered by JavaScript. This type of data is known as hidden web data, and it is often messy, containing meta fields and other useless data keys. This is where we parse JSON with ChatGPT to get JMESPath queries that clean up and refine JSON datasets for us.

We can approach ChatGPT JSON parsing in two ways:

  1. Crafting a ChatGPT scraper using the code interpreter feature.
    This works by uploading an HTML file through the code interpreter feature. ChatGPT will then parse the JSON data in the file and return clean JSON data.

  2. Passing a JSON document as text into the chat prompt.
    This works by passing an HTML snippet to the chat prompt. Unfortunately, we often can't pass the whole HTML page due to the character limit, though we can pass the HTML script tags that contain the JSON data. ChatGPT will then return the corresponding JMESPath queries that point to specific data fields.

In this tutorial, we'll cover both ChatGPT JSON parsing methods. But before that, let's take a look at our target website.

Setup

In this example, we’ll be using Oppenheimer’s page on IMDb.

To parse JSON on this page using ChatGPT, we need to install JMESPath alongside other libraries. Install them using the following command:

pip install jmespath httpx parsel

We will use httpx for sending requests, parsel for selecting HTML script tags, and jmespath for selecting elements in the JSON document.

Extracting HTML Samples

Since we'll parse JSON with chatgpt in two different ways, we need to save the HTML page and extract the JSON data.

Let’s start by extracting JSON data from the script tag in the HTML page. Open the browser’s developer tools by pressing the (F12) key. Then copy the script tag with the type application/ld+json:

screen capture of oppenheimer imdb page source JSON data

🙋 When extracting sample data, be aware of ChatGPT's character limit.

For parsing JSON using the code interpreter feature, we can save the whole HTML page. To do that, simply hit (CTRL + s) in your browser.
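If the saved page is too large for the chat prompt, one option is to pull out only the application/ld+json script tags and minify them before pasting. The sketch below does this with Python's standard library only. The LdJsonExtractor class, the compact_ld_json helper, the sample HTML, and the MAX_PROMPT_CHARS budget are all hypothetical names and values chosen for illustration, not real ChatGPT limits:

```python
import json
from html.parser import HTMLParser


class LdJsonExtractor(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._inside = True
            self.blocks.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        # Script content may arrive in several chunks, so append to the last block
        if self._inside:
            self.blocks[-1] += data


def compact_ld_json(html_text):
    """Return each ld+json block re-serialized without extra whitespace."""
    parser = LdJsonExtractor()
    parser.feed(html_text)
    parser.close()
    return [
        json.dumps(json.loads(block), separators=(",", ":"))
        for block in parser.blocks
    ]


# Hypothetical stand-in for the contents of the saved HTML file
sample_html = """
<html><head>
<script>var unrelated = 1;</script>
<script type="application/ld+json">
{
  "@type": "Movie",
  "name": "Example Movie"
}
</script>
</head><body></body></html>
"""

MAX_PROMPT_CHARS = 4000  # assumed budget, adjust to the model you use

samples = compact_ld_json(sample_html)
for sample in samples:
    status = "fits" if len(sample) <= MAX_PROMPT_CHARS else "too long"
    print(len(sample), "characters:", status)
```

For a real page, read the saved file into a string and pass it to compact_ld_json in place of sample_html.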

Parsing JSON with ChatGPT Chat Prompt

Let’s start by parsing JSON with the ChatGPT chat prompt. We’ll pass the script tag we copied earlier and ask ChatGPT to parse the JSON. ChatGPT will then return JMESPath queries that point to data in the JSON document:

Can you clean up this JSON dataset and restructure it to a flat structure using JMESPath and Python?
<add JSON dataset here>

ChatGPT should respond with a full Python script that defines the JMESPath query and the code to execute it:

screengrab of gpt response

ChatGPT returned JMESPath queries for the data in the JSON document. It also created basic Python code that extracts this data directly from the JSON text. Let’s edit the code to make it scrape data from the target website.

First, we need to send a request to the target website and capture the page HTML:

import json
import httpx
import jmespath
from parsel import Selector

# establish an HTTP client with browser-like headers to reduce the chance of being blocked right away
session = httpx.Client(
    headers={
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
    },
)

# send a request to the target website
request = session.get(
    "https://www.imdb.com/title/tt15398776/"
)

Now that we have the response (stored in the request variable), we'll wrap its HTML in a Parsel selector. This lets us capture the JSON data from the HTML using CSS selectors. Here is how:

selector = Selector(text=request.text)

# get the json object from the script tag
json_data = selector.css("script[type='application/ld+json']::text").get()
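Note that .get() returns None when no element matches the selector, which can happen if the request was blocked or the page layout changed. A small guard makes that failure easier to diagnose. This is only a sketch: load_ld_json is a hypothetical helper name, and the list handling assumes that a page may wrap several ld+json objects in an array:

```python
import json


def load_ld_json(raw_text):
    """Parse ld+json script text, failing loudly if the tag was not found."""
    if raw_text is None:
        raise ValueError(
            "No application/ld+json script tag found; "
            "the request may have been blocked or the page layout changed"
        )
    data = json.loads(raw_text)
    # Assumption: some pages wrap several ld+json objects in a list,
    # in which case we take the first object
    if isinstance(data, list):
        objects = [item for item in data if isinstance(item, dict)]
        if not objects:
            raise ValueError("ld+json list contains no objects")
        data = objects[0]
    return data


print(load_ld_json('{"name": "Example Movie"}'))
```

With this helper, the json.loads(json_data) call in the next step could be swapped for load_ld_json(json_data).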

Next, let’s add the ChatGPT code we got earlier:

# Load the JSON data
data = json.loads(json_data)

# JMESPath expressions
expression = jmespath.compile(
    """
    {
      "name": name,
      "url": url,
      "description": description,
      "review_author": review.author.name,
      "review_date": review.dateCreated,
      "review_text": review.reviewBody,
      "review_rating": review.reviewRating.ratingValue,
      "aggregate_rating": aggregateRating.ratingValue,
      "content_rating": contentRating,
      "genre": genre,
      "date_published": datePublished,
      "actors": actor[].name,
      "director": director[].name,
      "duration": duration
    }
    """
)

# Apply JMESPath expression
flat_data = expression.search(data)

# Print the restructured data
print(json.dumps(flat_data, indent=2))

Here, we convert the JSON data we got into a Python object and compile the JMESPath expression. We then use this expression to search through the data object. Here is the result we got after running the code:

{
  "name": "Oppenheimer",
  "url": "https://www.imdb.com/title/tt15398776/",
  "description": "The story of American scientist, J. Robert Oppenheimer, and his role in the development of the atomic bomb.",
  "review_author": "Dvir971",
  "review_date": "2023-07-19",
  "review_text": "Oppenheimer might be the best film I watched in a long, long time.\n\nVery different than Nolan&apos;s recent films, especially the Sci-Fi ones, but shows that Nolan can master the Biopic/Drama genre just as well as he can any other genre he tried to tackle yet.\n\nThe film is 3-hours long but goes through very quickly and enjoyably. Without spoiling anything, the film presents important and very relevant subjects, and doing so while being non-stop entertainment and a comprehensive character study and a study of our society on a very high pace.\n\nWithout mentioning anything specific, there was one scene that caused almost every single person in the theatre to move nervously in the seats, non-stop for a long period of time, being one of the most intense scenes I ever watched in a movie and reminding me of the true power of the cinematic experience like no other movie did in recent years.\n\nThe year is only half-way through but right now this is my top pick for the upcoming awards season. Picture, Writing, Directing, Acting, Score-- Oppenheimer is a winner on all fronts. A rare feat for filmmaking and a salient reminder that cinema is not dead.\n\nI highly recommend this film to everyone. Watched it once already, and going back to the theatre for at least a few more times soon.",
  "review_rating": 10,
  "aggregate_rating": 8.6,
  "content_rating": "R",
  "genre": [
    "Biography",
    "Drama",
    "History"
  ],
  "date_published": "2023-07-21",
  "actors": [
    "Cillian Murphy",
    "Emily Blunt",
    "Matt Damon"
  ],
  "director": [
    "Christopher Nolan"
  ],
  "duration": "PT3H"
}
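The flattened result still carries a few raw values, such as the HTML entity &apos; in the review text and the ISO 8601 style duration PT3H. A minimal post-processing sketch using only the standard library could look like the following. The clean_record and duration_to_seconds helpers are hypothetical, and the duration pattern only covers the simple hours/minutes/seconds form seen in this output, not the full ISO 8601 grammar:

```python
import html
import re

# Covers only durations like PT3H, PT2H30M or PT31S (an assumed subset)
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def duration_to_seconds(value):
    """Convert a simple PT#H#M#S duration string to seconds."""
    match = _DURATION.match(value)
    if match is None:
        raise ValueError(f"Unsupported duration format: {value!r}")
    hours, minutes, seconds = (int(group or 0) for group in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def clean_record(record):
    """Return a copy with entities unescaped and the duration in seconds."""
    cleaned = dict(record)
    if isinstance(cleaned.get("review_text"), str):
        cleaned["review_text"] = html.unescape(cleaned["review_text"])
    if isinstance(cleaned.get("duration"), str):
        cleaned["duration_seconds"] = duration_to_seconds(cleaned["duration"])
    return cleaned


example = clean_record(
    {
        "review_text": "Very different than Nolan&apos;s recent films",
        "duration": "PT3H",
    }
)
print(example)
```

In the scraper above, this would be applied as clean_record(flat_data) before saving or printing the result.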

In summary, ChatGPT parsed JSON data in the script tag and returned JMESPath queries that find useful data in the JSON document. We then created a web scraper that uses these queries to scrape this data from the web page.

Parsing JSON with ChatGPT Code Interpreter Feature

We have seen that ChatGPT JSON parsing with the chat prompt can save a lot of time. It can parse a large JSON dataset and return corresponding JMESPath expressions. However, we have to spend some time locating JSON data in the HTML before passing it to the prompt. Let’s parse JSON directly using the ChatGPT code interpreter feature.

🙋‍ Note that the code interpreter feature is only available for GPT-4 users. You can activate it by clicking on the GPT-4 model and selecting the code interpreter feature.

All we have to do is upload the HTML file we saved before and ask ChatGPT to parse the JSON data in the file:

screengrab of gpt response using code interpreter

First, ChatGPT parsed the HTML page to find the script tag that has the JSON data. Then, it parsed the HTML script tag and returned the JSON data:

Output data
{"@context": "https://schema.org",
 "@type": "Movie",
 "url": "https://www.imdb.com/title/tt15398776/",
 "name": "Oppenheimer",
 "image": "https://m.media-amazon.com/images/M/MV5BMDBmYTZjNjUtN2M1MS00MTQ2LTk2ODgtNzc2M2QyZGE5NTVjXkEyXkFqcGdeQXVyNzAwMjU2MTY@._V1_.jpg",
 "description": "The story of American scientist, J. Robert Oppenheimer, and his role in the development of the atomic bomb.",
 "review": {"@type": "Review",
  "itemReviewed": {"@type": "Movie",
   "url": "https://www.imdb.com/title/tt15398776/"},
  "author": {"@type": "Person", "name": "Zay-Fee"},
  "dateCreated": "2023-07-20",
  "inLanguage": "English",
  "name": "Exceptional storytelling and Genius Cinametography",
  "reviewBody": "Just came out of the theater and watching Oppenheimer was such a great experience. I know many people will criticize the movie for some historical accuracy absence but I think Christopher Nolan has made this complicated man&apos;s story compelling, engaging, and simple to understand. The actors are phenomenal. Apart from the main leads, Robert Downey has probably done one of his finest work. His expressions, timing, delivery... Everything was on par. The cinematography has been crafted beautifully. I adored and enjoyed the whole three hours with ease and delight. This is the first attempt of Christopher Nolan at biographies and I think we should expect more of his work from this genre since it&apos;s not only entertaining but also sparks an interest to know history more. I have read the book earlier so I went to watch it with a little bit of knowledge and still enjoyed the film. I wish I could tell Cillian Murphy in person how stunning his screen presence has been throughout. Hopefully, this movie wins the awards like it deserves.",
  "reviewRating": {"@type": "Rating",
   "worstRating": 1,
   "bestRating": 10,
   "ratingValue": 9}},
 "aggregateRating": {"@type": "AggregateRating",
  "ratingCount": 303192,
  "bestRating": 10,
  "worstRating": 1,
  "ratingValue": 8.7},
 "contentRating": "R",
 "genre": ["Biography", "Drama", "History"],
 "datePublished": "2023-07-21",
 "keywords": "american politics,manhattan project,nuclear physicist,nuclear,year 1945",
 "trailer": {"@type": "VideoObject",
  "name": "Get Tickets",
  "embedUrl": "https://www.imdb.com/video/imdb/vi3600860953",
  "thumbnail": {"@type": "ImageObject",
   "contentUrl": "https://m.media-amazon.com/images/M/MV5BMTRjYzEzOGMtNDZjYi00MzY0LThkYzEtYjE1Njg0YzU3Y2UzXkEyXkFqcGdeQXZ3ZXNsZXk@._V1_.jpg"},
  "thumbnailUrl": "https://m.media-amazon.com/images/M/MV5BMTRjYzEzOGMtNDZjYi00MzY0LThkYzEtYjE1Njg0YzU3Y2UzXkEyXkFqcGdeQXZ3ZXNsZXk@._V1_.jpg",
  "url": "https://www.imdb.com/video/vi3600860953/",
  "description": "The story of American scientist J. Robert Oppenheimer and his role in the development of the atomic bomb.",
  "duration": "PT31S",
  "uploadDate": "2023-06-28T21:18:17.102Z"},
 "actor": [{"@type": "Person",
   "url": "https://www.imdb.com/name/nm0614165/",
   "name": "Cillian Murphy"},
  {"@type": "Person",
   "url": "https://www.imdb.com/name/nm1289434/",
   "name": "Emily Blunt"},
  {"@type": "Person",
   "url": "https://www.imdb.com/name/nm0000354/",
   "name": "Matt Damon"}],
 "director": [{"@type": "Person",
   "url": "https://www.imdb.com/name/nm0634240/",
   "name": "Christopher Nolan"}],
 "creator": [{"@type": "Organization",
   "url": "https://www.imdb.com/company/co0005073/"},
  {"@type": "Organization", "url": "https://www.imdb.com/company/co0028338/"},
  {"@type": "Organization", "url": "https://www.imdb.com/company/co1007122/"},
  {"@type": "Person",
   "url": "https://www.imdb.com/name/nm0634240/",
   "name": "Christopher Nolan"},
  {"@type": "Person",
   "url": "https://www.imdb.com/name/nm3284831/",
   "name": "Kai Bird"},
  {"@type": "Person",
   "url": "https://www.imdb.com/name/nm2452558/",
   "name": "Martin Sherwin"}],
 "duration": "PT3H"}

We can see that parsing JSON with the ChatGPT code interpreter feature returned the complete JSON dataset without us having to locate the script tag first. This is because the code interpreter runs on the GPT-4 model and works with the uploaded file directly.

FAQ

To wrap up this article, let's take a look at some frequently asked questions about ChatGPT JSON parsing.

What is JMESPath?

JMESPath is a path language for querying JSON documents. It's implemented in many popular programming languages, making it easily accessible in most web scrapers. It's often used in web scraping because of its approachable dataset reshaping features.

What is the difference between ChatGPT chat prompt and the code interpreter feature?

Both methods can greatly assist with web scraper development, particularly in HTML and JSON parsing. With the code interpreter feature, it's possible to have the AI parse whole HTML documents for real data fields. Without the code interpreter, it's only possible to parse smaller HTML documents, and GPT's output is less accurate.

Can ChatGPT scrape hidden JSON data?

Yes, when you pass HTML code to the chat prompt, ChatGPT can parse the HTML and find hidden JSON data in script tags or JavaScript variables.
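For the JavaScript variable case, the extraction step itself can be sketched with the standard library. The extract_js_object helper, the window.__DATA__ variable name, and the sample HTML are hypothetical, and the approach assumes the assigned value is strict JSON rather than an arbitrary JavaScript object literal:

```python
import json
import re


def extract_js_object(html_text, variable_name):
    """Find `variable_name = {...}` in the HTML and decode the JSON value."""
    match = re.search(re.escape(variable_name) + r"\s*=\s*", html_text)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index
    # and ignores whatever follows it (such as the trailing semicolon)
    value, _end = json.JSONDecoder().raw_decode(html_text, match.end())
    return value


# Hypothetical page snippet with data assigned to a JavaScript variable
sample = '<script>window.__DATA__ = {"product": {"id": 7, "tags": ["a", "b"]}};</script>'

state = extract_js_object(sample, "window.__DATA__")
print(state)
```

The decoded object can then be passed to ChatGPT as a sample, or queried with JMESPath like any other JSON dataset.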

ChatGPT JSON Parsing Summary

In this article, we have explained how to parse JSON with ChatGPT. This method works by supplying a page sample to the AI and asking it to come up with parsing instructions. In our example, we asked it to parse JSON data from an IMDb page using JMESPath.

AI website scrapers are becoming an increasingly promising option in web scraping. While current technology is still too resource-intensive for scraping at scale, it's a great educational tool, as demonstrated in this article!
