Parsing HTML with Xpath

When it comes to parsing web-scraped HTML content, there are multiple techniques to select the data we want. For simple text parsing, regular expressions can be used, but HTML is designed to be a machine-readable text structure, so we can take advantage of this fact and use special path languages like CSS selectors and xpath to extract data in a much more efficient and reliable way!

You are probably familiar with CSS selectors from style sheet documents (.css); however, xpath goes beyond that and implements full document navigation in its own unique syntax.

For parsing using CSS selectors, see the CSS version of this article: Parsing HTML with CSS Selectors

In this article, we'll be taking a deep look at this unique path language and how it can be used to extract the needed details from modern, complex HTML documents.

What is Xpath?

XPath stands for "XML Path Language", which essentially means it's a query language that describes a path from point A to point B for XML/HTML type documents.
Other path languages you might know of are CSS selectors, which usually describe paths to apply styles to, or tool-specific languages like jq, which describes paths for JSON documents.

For HTML parsing, Xpath has some advantages over CSS selectors:

  • Xpath can traverse the HTML tree in every direction and is location aware.
  • Xpath can transform results before returning them.
  • Xpath is easily extendable with additional functionality.

Before we dig into Xpath, let's have a quick overview of HTML itself and how it enables the xpath language to find anything given the right instructions.

HTML Overview

HTML (HyperText Markup Language) is designed to be easily machine-readable and parsable. In other words, HTML follows a tree-like structure of nodes and their attributes, which we can easily navigate programmatically.

Let's start off with a small example page and illustrate its structure:

<head>
  <title>
  Document Title
  </title>
</head>
<body>
  <h1>Introduction</h1>
  <div>
    <p>some description text: </p>
    <a class="link" href="https://example.com">example link</a>
  </div>
</body>

In this basic example of a simple web page, we can see that the document already resembles a data tree. Let's go a bit further and illustrate this:

html tree illustration

HTML tree is made of nodes which can contain attributes such as classes, ids and text itself.

Here we can wrap our heads around it a bit more easily: it's a tree of nodes, and each node can have properties attached to it, like keyword attributes (such as class and href) and natural attributes such as the text itself.
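
To make this concrete, here's a quick sketch using Python's lxml library (covered later in this article) that parses the example page above and walks its node tree, printing each node's name, keyword attributes and text:

from lxml import etree

html = """
<head><title>Document Title</title></head>
<body>
  <h1>Introduction</h1>
  <div>
    <p>some description text: </p>
    <a class="link" href="https://example.com">example link</a>
  </div>
</body>
"""

# parse the document into a node tree and visit every node in it
tree = etree.HTML(html)
for node in tree.iter():
    print(node.tag, dict(node.attrib), (node.text or "").strip())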

Now that we're familiar with HTML let's familiarize ourselves with Xpath itself!

Xpath Syntax Overview

Xpath selectors are usually referred to as "xpaths" and a single xpath indicates a destination from the root to the desired endpoint. It's a rather unique path language, so let's start off with a quick glance over basic syntax.

To test our xpaths, we'll be using ScrapFly's css/xpath selector playground, so feel free to edit the examples below and experiment yourself in real time!

An average xpath selector in web scraping often looks something like this:

xpath protocol illustration

illustration of a usual xpath selector's structure

In this example, the xpath would select the href attribute of an <a> node that has the class "button" and is directly under a <div> node:

<div>
  <a class="button" href="http://scrapfly.io">ScrapFly</a>
</div>
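
For example, a selector matching that description could look like the following (a sketch using Python's parsel package, which is covered later in this article; the exact expression in the illustration may differ):

from parsel import Selector

html = '<div> <a class="button" href="http://scrapfly.io">ScrapFly</a> </div>'
sel = Selector(html)
# href attribute of an <a> with class "button", directly under a <div>
print(sel.xpath('//div/a[@class="button"]/@href').get())
# "http://scrapfly.io"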

Xpath selectors are made up of multiple expressions joined together into a single string. Let's see the most commonly used expressions:

  • /node - selects a direct child that matches the node name.
  • //node - selects any descendant: child, grandchild, great-grandchild etc.
  • * - wildcard that can be used instead of a node name.
  • @ - selects an attribute of a node, e.g. a/@href will select the value of the href attribute of an a node.
  • text() - selects the text value of a node.
  • [] - selector constraint that can be used to filter out nodes that do not match some condition.
  • parent or .. - selects the current node's parent, e.g. /div/a/.. will select the div element.
  • self or . - selects the current node (this is useful as an argument in xpath functions, which we'll cover later).
  • following-sibling::node - selects all following siblings of that type, e.g. following-sibling::div will select all div siblings below the current node.
  • preceding-sibling::node - selects all preceding siblings of that type, e.g. preceding-sibling::div will select all div siblings above the current node.
  • function() - calls a registered xpath function, e.g. /a/text() will return the text value of an a node.

This is a lot to take in, so let's solidify this knowledge with some real life examples!

Basic Navigation

When writing xpaths the first thing we should be aware of is that all xpaths have to have a root (aka point A) and a final target (aka point B). Knowing this and the xpath axis syntax, we can start describing our selector path:

<div>
  <p class="socials">
    Follow us on <a href="https://twitter.com/@scrapfly_dev">Twitter!</a>
  </p>
</div>

Here, our simple xpath describes a path from the root down to the <a> node using nothing but the / direct child syntax. However, with big documents such direct xpaths are often unreliable, as any change to the tree structure or order will break our path.
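
Such a direct path might be sketched like this (note that parsel wraps HTML fragments in <html><body>, hence the extra leading steps):

from parsel import Selector

html = '<div><p class="socials">Follow us on <a href="https://twitter.com/@scrapfly_dev">Twitter!</a></p></div>'
sel = Selector(html)
# only direct / child steps - fragile if the structure ever changes
print(sel.xpath('/html/body/div/p/a').get())
# '<a href="https://twitter.com/@scrapfly_dev">Twitter!</a>'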

It's better to design our xpaths around some smart context. For example, here we can see that this <a> node is under the <p class="socials"> node - we can be fairly confident that these two will most likely go together:

<div>
  <p class="socials">
    Follow us on <a href="https://twitter.com/@scrapfly_dev">Twitter!</a>
  </p>
</div>

With this xpath, we get rid of a lot of structure dependency in favor of context. Generally, modern websites have much more stable contexts than structures, and finding the right balance between context and structure is what creates reliable xpaths!
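
A context-based selector matching this description could be sketched as:

from parsel import Selector

html = '<div><p class="socials">Follow us on <a href="https://twitter.com/@scrapfly_dev">Twitter!</a></p></div>'
sel = Selector(html)
# anchor the path on the "socials" context instead of the full structure
print(sel.xpath('//p[@class="socials"]/a/@href').get())
# "https://twitter.com/@scrapfly_dev"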

Further, we can combine the constraint syntax ([]) with value-testing functions such as contains() to make our xpath even more reliable:

<div>
  <p class="socials">
    Follow us on <a href="https://twitter.com/@scrapfly_dev">Twitter!</a>
  </p>
</div>

Using the contains() function, we can filter out any results that don't contain a match.
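
For example (a sketch with an assumed extra class value), contains() keeps matching even when more classes are added to the node, where an exact @class check would fail:

from parsel import Selector

# the class attribute gained an extra "large" value - an exact @class="socials" test would now fail
html = '<div><p class="socials large">Follow us on <a href="https://twitter.com/@scrapfly_dev">Twitter!</a></p></div>'
sel = Selector(html)
print(sel.xpath('//p[contains(@class, "socials")]/a/@href').get())
# "https://twitter.com/@scrapfly_dev"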

Xpath functions are very powerful: not only can they check for truthfulness, they can also modify content at runtime:

<div>
  <p class="socials">
    Follow us on <a href="https://twitter.com/@scrapfly_dev">Twitter!</a>
    or connect with us on <a href="https://www.linkedin.com/company/scrapfly/">Linkedin</a>
  </p>
</div>

Here, we've added the concat() function, which joins all provided arguments into a single value before performing our match check.
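
One way such a concat() check could look (a sketch; the original example's exact expression may differ) is joining the href attribute and the link text before testing the combined value:

from parsel import Selector

html = ('<div><p class="socials">Follow us on '
        '<a href="https://twitter.com/@scrapfly_dev">Twitter!</a> or connect with us on '
        '<a href="https://www.linkedin.com/company/scrapfly/">Linkedin</a></p></div>')
sel = Selector(html)
# join href and link text into one value, then run the contains() check on it
print(sel.xpath('//p[@class="socials"]/a[contains(concat(@href, " ", text()), "linkedin.com")]/@href').get())
# "https://www.linkedin.com/company/scrapfly/"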

Sometimes tree complexity outgrows context-based selectors and we have to implement some complex structure checks. For that, xpath has powerful tree navigation features that allow us to select ancestors and siblings of any level:

<div>
  <span>For price contact </span>
  <a>Sales department </a>
  <div>
    <span>total: </span>
  </div>
  <span>166.00$</span>
  <span>*taxes apply</span>
</div>

In this example, we find the text containing "total", navigate up to its parent and get the first following (i.e. lower) sibling.
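
Sketched with parsel, that navigation could look like this:

from parsel import Selector

html = ('<div><span>For price contact </span><a>Sales department</a>'
        '<div><span>total: </span></div>'
        '<span>166.00$</span><span>*taxes apply</span></div>')
sel = Selector(html)
# find the "total" label, step up to its parent <div>, then take the first sibling below it
print(sel.xpath('//span[contains(text(), "total")]/../following-sibling::span[1]/text()').get())
# "166.00$"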

Other times, we need to use position-based predicates and even combine multiple xpaths to reliably parse html data:

<div>
  <span>items: </span>
  <span>(taxes not included)</span>
  <span>166.00$</span>
  <span>25.00$</span>
  <span>*taxes apply</span>
  <div>
    <span>addons:</span>
    <span>0.5$</span>
  </div>
</div>

In this example, we used the position() function to select only siblings that fall within a specific range. We also combined two xpaths using the | union operator (the or operator can also be used inside a condition) to fully retrieve all of the pricing info.
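
A sketch of such a combined selector (positions taken from the markup above):

from parsel import Selector

html = ('<div><span>items: </span><span>(taxes not included)</span>'
        '<span>166.00$</span><span>25.00$</span><span>*taxes apply</span>'
        '<div><span>addons:</span><span>0.5$</span></div></div>')
sel = Selector(html)
# 3rd and 4th spans of the outer div, unioned (|) with the last span of the inner div
prices = sel.xpath('//div/span[position()>2 and position()<5]/text() | //div/div/span[last()]/text()')
print(prices.getall())
# ["166.00$", "25.00$", "0.5$"]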

As you can see, xpaths can be very powerful and parse almost any html structure if we get creative with path logic!

Extending Functions

Xpath in most clients can be extended with additional functions, and some clients even come with pre-registered non-standard functions.

For example, in Python's lxml (and packages based on it, like parsel) we can easily register new functions like this:

from lxml import etree

# the first argument is always the xpath evaluation context, followed by
# any arguments passed in the xpath expression itself
def myfunc(context, *args):
    return True

# register the function under the default (unprefixed) function namespace
xpath_namespace = etree.FunctionNamespace(None)
xpath_namespace['myfunc'] = myfunc

# the function can now be used in any xpath expression:
doc = etree.HTML("<div><a>link</a></div>")
print(doc.xpath("//a[myfunc()]"))  # always True, so the <a> node is selected

Other language clients follow a similar process.

Xpath Clients

Almost every programming language contains some sort of xpath client for XML file parsing. Since HTML follows the same tree structure as XML, we can safely use xpath in almost every modern language!

Xpath in Python

In Python there are multiple packages that implement xpath functionality, however most of them are based on the lxml package, which is a pythonic binding for the libxml2 and libxslt C libraries. This means xpath selectors in Python are blazing fast, as they use powerful C components under the hood.

While lxml is a great wrapper, it lacks a lot of modern API usability features used in web scraping. For this, the lxml-based packages parsel (used by scrapy) and pyquery provide a richer feature set.

Example usage with parsel:

from parsel import Selector

html = """
<html>
  <head>
    <title>Page Title</title>
  </head>
  <body>
    <h1 class="title">Page header by <a href="#">company</a></h1>
    <p>This is the first paragraph</p>
    <p>This is the second paragraph</p>
  </body>
</html>
"""
sel = Selector(html)
sel.xpath("//p").getall()
# [
# "<p>This is the first paragraph</p>",
# "<p>This is the second paragraph</p>",
# ]

Other tool recommendations:

  • cssselect - converts css selectors to xpath selectors
  • parsel-cli - real time REPL for testing css/xpath selectors

Xpath in PHP

In PHP, the most popular xpath processor is Symfony's DomCrawler:

use Symfony\Component\DomCrawler\Crawler;

$html = <<<'HTML'
<html>
  <head>
    <title>Page Title</title>
  </head>
  <body>
    <h1 class="title">Page header by <a href="#">company</a></h1>
    <p>This is the first paragraph</p>
    <p>This is the second paragraph</p>
  </body>
</html>
HTML;

$crawler = new Crawler($html);
$crawler->filterXPath('//p');

Xpath in Javascript

Javascript supports xpath natively; you can read more about it in MDN's Introduction to Using XPath in JavaScript.

Other tool recommendations:

  • jQuery - extra syntax sugar and helpers for xpath querying.

Xpath in Other Languages

Most other languages have some sort of xpath client, as XML parsing is an important data exchange feature. This means we can parse web-scraped content in the language of our choice!

For example, C# supports xpath natively as well (you can read more about it in the official documentation), and so do Objective-C and other lower-level languages.

While some languages might not have first-party xpath clients, it's very likely there's a community package. For example, the Go language has community packages for xpath over XML, HTML and even JSON.

FAQ

Is XPATH faster than CSS selectors?

Many CSS selector libraries actually convert CSS selectors to xpath because it's faster and more powerful. That being said, it depends on each individual library and the complexity of the selector itself. Some xpath selectors which use broad selection paths like // can be computationally very expensive.

My xpath selects more data than it should

Xpath broad selector paths like // are global rather than relative. To make them relative, we must add the relativity marker ., i.e. // becomes .//
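
For example (a parsel sketch with assumed ids), note how a nested // selector behaves compared to its .// relative form:

from parsel import Selector

html = '<div id="menu"><a>one</a></div><div id="content"><a>two</a></div>'
sel = Selector(html)
menu = sel.xpath('//div[@id="menu"]')[0]
print(menu.xpath('//a/text()').getall())   # ["one", "two"] - // searches the whole document
print(menu.xpath('.//a/text()').getall())  # ["one"] - .// is relative to the selected <div>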

How to match nodes by multiple names?

To match nodes by multiple names, we can use the wildcard selector together with a name check condition: //*[contains("p h1 head", name())] will select h1, p and head nodes.
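
A quick sketch of this trick in parsel (example markup assumed):

from parsel import Selector

html = '<head><title>t</title></head><body><h1>header</h1><p>text</p><span>other</span></body>'
sel = Selector(html)
# wildcard node test plus a name() check - matches head, h1 and p but not title or span
print([node.root.tag for node in sel.xpath('//*[contains("p h1 head", name())]')])
# ["head", "h1", "p"]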

How to select elements between two nodes?

If we know two anchor nodes, like text headers, we can select the content between them with clever use of the preceding-sibling axis:
//h2[@id="faq"]//following-sibling::p[(preceding-sibling::h2[1])[@id="faq"]] - this xpath selects all paragraph nodes that follow the h2 heading with id "faq" but not paragraphs under any other h2 heading.

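Here is that selector in action against a small sketch page (heading ids assumed):

from parsel import Selector

html = """
<h2 id="intro">Intro</h2>
<p>intro paragraph</p>
<h2 id="faq">FAQ</h2>
<p>first FAQ paragraph</p>
<p>second FAQ paragraph</p>
<h2 id="contact">Contact</h2>
<p>contact paragraph</p>
"""
sel = Selector(html)
# only <p> nodes whose closest preceding <h2> is the one with id="faq"
print(sel.xpath('//h2[@id="faq"]//following-sibling::p[(preceding-sibling::h2[1])[@id="faq"]]/text()').getall())
# ["first FAQ paragraph", "second FAQ paragraph"]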

Summary

In this introduction article, we've familiarized ourselves with the xpath query language. We've discovered that HTML documents are actually data trees with nodes and attributes which can be machine-parsed very efficiently. Finally, we glanced over the most commonly used xpath syntax and explored common html parsing scenarios and best practices.

Xpath is a very powerful and flexible path language that is supported in many low-level and high-level languages: Python, PHP, Javascript etc. So whatever stack you're using for web scraping, xpath can be your go-to tool for html parsing!

Finally, for more xpath help we recommend visiting Stackoverflow's #xpath tag.
