Parsing HTML
When it comes to parsing web-scraped HTML content, there are multiple techniques to select the data we want. For simple text parsing, techniques like regular expressions can be used. However, HTML is designed to be a machine-readable text structure, so we can take advantage of this fact and use special path languages like CSS selectors to extract data in a much more efficient and reliable way!
In this article, we'll take a deep look at this path language and how we can use it to extract the details we need from modern, complex HTML documents.
If you've ever done some web development, you are probably familiar with CSS selectors for applying styles to HTML websites. We can use the same tool for parsing HTML data!
The Cascading Style Sheets (CSS) language offers a path language for selecting the HTML nodes to apply styles to; these paths are called CSS selectors. While this path language is designed to find nodes for styling, we can also use it to find nodes for parsing and navigation in our web scrapers.
CSS selectors tend to be brief but powerful enough for most web-scraping related parsing. Let's take a quick look at common pros and cons, especially compared to XPath selectors:
::text or simple client methods like .attribute() are being added as extra functionality.
HTML (HyperText Markup Language) is designed to be easily machine-readable and parsable. In other words, HTML follows a tree-like structure of nodes and their attributes, which we can easily navigate programmatically.
Let's start off with a small example page and illustrate its structure:
<head>
  <title>Document Title</title>
</head>
<body>
  <h1>Introduction</h1>
  <div>
    <p>some description text: </p>
    <a class="link" href="https://example.com">example link</a>
  </div>
</body>
In this basic example of a simple web page, we can see that the document already resembles a data tree. Let's go a bit further and illustrate this:
Here we can wrap our heads around it a bit more easily: it's a tree of nodes, and each node can also have properties attached to it, such as keyword attributes (like href) and natural attributes such as text.
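To make the tree structure concrete, here is a minimal sketch that walks the example page with Python's built-in html.parser module and records one indented line per node. The TreePrinter class is a hypothetical helper written for this illustration, not part of any scraping library.

```python
from html.parser import HTMLParser

HTML = """
<head>
  <title>Document Title</title>
</head>
<body>
  <h1>Introduction</h1>
  <div>
    <p>some description text: </p>
    <a class="link" href="https://example.com">example link</a>
  </div>
</body>
"""


class TreePrinter(HTMLParser):
    """Hypothetical helper: collects one indented line per node."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs: the keyword attributes
        attr_text = "".join(f' {name}="{value}"' for name, value in attrs)
        self.lines.append("  " * self.depth + tag + attr_text)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

    def handle_data(self, data):
        # text is the "natural" attribute of the enclosing node
        text = data.strip()
        if text:
            self.lines.append("  " * self.depth + f'text: "{text}"')


printer = TreePrinter()
printer.feed(HTML)
printer.close()
tree_lines = printer.lines
print("\n".join(tree_lines))
```

Each level of indentation in the printed output corresponds to one level of nesting in the HTML tree.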
Now that we're familiar with HTML, let's familiarize ourselves with CSS selector syntax and use it to parse some data!
CSS selectors are often referred to simply as selectors, and a single selector indicates a path to a particular HTML node.
The average CSS selector in web scraping often looks something like this:
div.content a#title>b
In this example, our selector will select all <b> nodes that are direct children of an <a> node with an ID of "title", which is under a <div> node with a class of "content".
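As a rough illustration, the selector described above can be run with beautifulsoup4 (assuming the package is installed). The HTML snippet is made up for this sketch.

```python
from bs4 import BeautifulSoup

# Made-up sample document for illustration
html = """
<div class="content">
  <a id="title" href="/post"><b>Bold title</b> <i>subtitle</i></a>
  <a id="other"><b>not selected</b></a>
</div>
<a id="title2"><b>outside</b></a>
"""

soup = BeautifulSoup(html, "html.parser")
# <b> nodes directly under <a id="title">, anywhere under <div class="content">
matches = soup.select("div.content a#title>b")
texts = [node.get_text() for node in matches]
print(texts)
```

Only the first <b> node satisfies all three constraints: the right node name, the right parent ID, and the right ancestor class.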
As you can see, a CSS selector is just a chain of various expressions joined either by a space or a ">" character. Let's see the most commonly used expressions in web scraping:
|(space)|selects any descendant (child, grandchild etc.) that matches node name|
|>|selects a direct child that matches node name|
|~|selects any following sibling that matches node name|
|+|selects only the adjacent (immediately following) sibling that matches node name|
|.class|class constraint - select only nodes that contain this class|
|#id|ID constraint - select only nodes that contain this ID|
|[attribute=value]|attribute match constraint|
|[attribute*=value]|attribute contains constraint|
|[attribute=value i]|case-insensitivity indicator for an attribute constraint|
|,|allows grouping of multiple selectors, joining all results|
This is the core syntax available in most CSS selector clients, which should provide us with enough flexibility to parse most of the HTML trees we might encounter on the modern web. Let's take a look at some examples!
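The core expressions can be tried out side by side. The following is a small sketch using beautifulsoup4 (assumed to be installed); the sample HTML and the texts() helper are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up sample document for illustration
html = """
<div id="main" class="box">
  <h2>Title</h2>
  <p class="intro">first</p>
  <p>second</p>
  <ul><li><a href="https://Example.com/A">link</a></li></ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")


def texts(selector):
    """Hypothetical helper: text of every node matched by the selector."""
    return [node.get_text() for node in soup.select(selector)]


descendant = texts("div a")             # space: any depth below <div>
direct_child = texts("div > a")         # >: <a> is not a direct child of <div>
siblings = texts("h2 ~ p")              # ~: all following <p> siblings
adjacent = texts("h2 + p")              # +: only the next sibling
by_class = texts("p.intro")             # .class constraint
by_id = texts("#main > h2")             # #id constraint
insensitive = texts('a[href*="example" i]')  # contains, ignoring case
grouped = texts("h2, p.intro")          # ,: join results of two selectors

print(descendant, direct_child, siblings, adjacent)
print(by_class, by_id, insensitive, grouped)
```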
The most important feature of CSS selectors is node selection by name and descendant chaining. For example, using the > character we can chain multiple node selectors:
This allows us to define strict selection paths. However, using the direct child selector (>) can make our selectors too strict for the highly dynamic HTML found on modern websites. In other words, what if the <a> node gets wrapped in some other styling node? That would break our selector.
Instead, we should use a mixture of space and > selectors to find the sweet spot of stability and accuracy:
Here, instead of defining a direct path, we root our selector to the <div> node that contains the class socials (the .socials class constraint), and from there we can assume that any link in an unordered list is a social link.
Relying less on HTML structure and more on context allows us to create selectors that break less often when the HTML structure changes.
Ideally, when designing our selectors we want to find the sweet spot between structure and context, which results in no false positives and a selector that doesn't break on small HTML tree changes:
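The difference between a strict path and a context-rooted one can be sketched as follows, using beautifulsoup4 (assumed to be installed). The markup and both selectors are made up for illustration; they are not the article's original examples.

```python
from bs4 import BeautifulSoup

# Made-up markup: the first link is wrapped in an extra styling <span>
html = """
<div class="socials">
  <ul>
    <li><span class="icon"><a href="https://linkedin.com/company/example">LinkedIn</a></span></li>
    <li><a href="https://twitter.com/example">Twitter</a></li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Strict path: every step must be a direct child, so the wrapped link is missed
strict = [a.get_text() for a in soup.select("div.socials > ul > li > a")]

# Rooted at the .socials context: any link inside the list matches
relaxed = [a.get_text() for a in soup.select("div.socials ul a")]

print(strict)
print(relaxed)
```

The strict selector silently loses a result as soon as a wrapper node appears, while the relaxed one keeps working.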
Here, we're using the attribute contains constraint to restrict extraction to links that contain "linkedin" in their URLs.
In this section, we've discovered what makes a good CSS selector in web scrapers: we want something robust that doesn't select false positives, but not so strict that it might miss some results.
Next, let's take a look at some more complex parsing scenarios and how we can solve them using CSS selectors alone.
Unfortunately, modern websites can have very complex and dynamic HTML trees that are difficult to navigate reliably. Let's take a look at a few common examples of complex structures and how we can solve them with CSS selectors.
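As a sketch of combining context with the attribute contains constraint, the snippet below uses beautifulsoup4 (assumed to be installed). The markup, URLs and selectors are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up markup with a LinkedIn link outside of the socials block
html = """
<div class="socials">
  <ul>
    <li><a href="https://linkedin.com/company/example">LinkedIn</a></li>
    <li><a href="https://twitter.com/example">Twitter</a></li>
  </ul>
</div>
<div class="footer"><a href="https://linkedin.com/jobs">Jobs</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Attribute constraint alone: also picks up the unrelated footer link
too_loose = [a["href"] for a in soup.select('a[href*="linkedin"]')]

# Context (.socials) plus attribute constraint: only the social link
balanced = [a["href"] for a in soup.select('div.socials a[href*="linkedin"]')]

print(too_loose)
print(balanced)
```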
Firstly, it's nice to remember that we don't have to cram everything into a single CSS selector: we can safely join multiple selectors using the , (comma) character:
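In this example, we use two selectors for two different spellings of the word "favorite". We're also using a special pseudo-class, :contains, which allows us to check whether the text value of the node contains some string. Note that :contains is not part of the standard CSS specification; it is an extension offered by many scraping clients.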
Another cool feature of selector joining is that all results come ordered by their appearance in the document, which means we can safely join selectors and retain the content structure:
In this example, we extract recipe text while avoiding promos and other non-recipe texts. We're also using the :not pseudo-class, which allows us to negate our selector constraints; this is very useful for filtering out unwanted nodes.
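Selector joining and :not filtering can be sketched together with beautifulsoup4 (assumed to be installed). The recipe markup and the promo class name are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up recipe page with a promotional paragraph in the middle
html = """
<article>
  <h2>Ingredients</h2>
  <p>2 eggs</p>
  <p class="promo">Buy our cookbook!</p>
  <h2>Steps</h2>
  <p>Whisk the eggs.</p>
</article>
"""
soup = BeautifulSoup(html, "html.parser")

# Join two selectors with a comma and drop promo paragraphs with :not
nodes = soup.select("article h2, article p:not(.promo)")
recipe = [node.get_text() for node in nodes]
print(recipe)
```

The headings and paragraphs come back interleaved in document order, so the recipe can be read top to bottom without re-sorting.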
Finally, the last important CSS selector feature for web scraping is result slicing. Often we want to select only the matching nodes at specific indices:
In this example, we're using :nth-last-of-type to implement basic result slicing, which allows us to filter out the first and last nodes from our selection!
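One possible way to express this kind of slicing is to combine :nth-of-type with :nth-last-of-type. The sketch below uses beautifulsoup4 (assumed to be installed) with made-up markup; the exact selector is an illustration and not necessarily the one the article originally used.

```python
from bs4 import BeautifulSoup

# Made-up list for illustration
html = "<ul><li>a</li><li>b</li><li>c</li><li>d</li></ul>"
soup = BeautifulSoup(html, "html.parser")

# n+2 counted from the start skips the first <li>;
# n+2 counted from the end skips the last <li>
middle_nodes = soup.select("li:nth-of-type(n+2):nth-last-of-type(n+2)")
middle = [node.get_text() for node in middle_nodes]
print(middle)
```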
While CSS selectors might appear a bit clunky compared to XPath selectors (or other options), there's a surprising amount of power there that, with some clever engineering, can help us reliably extract data from HTML documents.
CSS selectors are primarily used in front-end web development; however, there are a few backend implementations that are used as clients for HTML parsing. Let's take a look at the most popular libraries that implement CSS selectors.
To parse HTML in Python we have several options. However, many of them, instead of executing CSS selectors natively, convert them to XPath using cssselect and run the resulting XPath selectors through the lxml XPath client. An example of such a client is parsel.
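To give a feel for what such a conversion produces, here is a simplified sketch that uses only Python's standard library. The XPath string is translated by hand for this illustration; it is not the output of cssselect, which generates more thorough expressions (for example, real class matching handles nodes with several classes).

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample document
doc = ET.fromstring(
    '<html><body><div class="content">'
    '<a id="title"><b>Bold title</b></a>'
    "</div></body></html>"
)

# CSS:   div.content a#title>b
# Hand-written XPath equivalent (simplified: exact match on the class attribute)
xpath = ".//div[@class='content']//a[@id='title']/b"

xpath_texts = [node.text for node in doc.findall(xpath)]
print(xpath_texts)
```

The space in the CSS selector becomes // (any descendant) and the > becomes / (direct child), which is why the two languages map onto each other so easily.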
Other packages implement CSS selectors in varying capacity:
For Python, the most popular choices for parsing HTML are beautifulsoup4 and parsel.
PHP, like Python, also prefers XPath selectors, so most CSS selector clients use the css-selector component to convert CSS selectors to XPath selectors and execute them through either the built-in DOMXPath or the community favorite DOMCrawler.
Alternatively, when using browser automation clients like Selenium, PHP also gets access to the browser's CSS selector capabilities:
// we can use findElements method of Selenium web driver to find elements by CSS selectors
$webDriver->findElements(WebDriverBy::cssSelector("div.content a#title>b"));
Ruby has several CSS-capable clients; however, the most popular package is nokogiri, which offers both CSS and XPath selectors along with loads of parsing utility functions and extensions:
require 'nokogiri'
html_doc = Nokogiri::HTML('<html><body><div class="socials"><a href="https://scrapfly.io/blog">Our blog</a></div></body></html>')
# select the link under div.socials and read its href attribute
puts html_doc.css("div.socials>a").first["href"]
Often XPath selectors are favored over CSS selectors, which makes CSS selector support less common outside the few languages mentioned. That being said, since CSS selectors are very similar to XPath selectors, there's typically at least a community-maintained translation layer available!
Both path languages have their pros and cons. Generally, CSS selectors are briefer but less powerful than XPath. When web scraping it's best to mix both!
CSS selectors do not support selection of node parents. Instead, the XPath .. selector can be used, e.g. root/child/.. will select the root node.
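The parent step can be tried out with Python's standard library, whose ElementTree module supports a limited XPath subset that includes the .. step. The document below is made up for this sketch.

```python
import xml.etree.ElementTree as ET

# Made-up document: <root> is nested so that its parent step stays inside the tree
doc = ET.fromstring("<doc><root><child>text</child></root></doc>")

# Step down to <child>, then back up to its parent with ".."
parent = doc.find("root/child/..")
parent_tag = parent.tag
print(parent_tag)
```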
No, typically CSS selectors do not support native extensions. However, many web scraping libraries can be patched with extra features, as CSS selectors are typically converted to XPath selectors before runtime. That being said, it's better to fall back to XPath if CSS capabilities are inadequate.
In this introductory article, we covered the syntax of CSS selectors, explored basic navigation to solidify our knowledge, and finished off by taking a look at more advanced usage to fully grasp what this little path language is capable of!
While CSS selectors are great, ideally when web scraping it's best to take advantage of both CSS and XPath selectors. A common idiom is to use CSS selectors for simple paths, as they are short and easy to follow, and to fall back to XPath, which is more verbose but more powerful, for more complex selections.
For further CSS selector help, we advise checking out the #CSS-selectors tag on Stack Overflow, which is very active and full of dedicated teachers!