XPath is the most powerful query language for HTML parsing in web scraping. It's available in Scrapy and popular HTML parsing libraries like parsel.
XPath comes in versions 1.0, 2.0 and 3.0. XPath 1.0 and 2.0 are the versions most commonly supported by web scraping toolsets, and most functionality can be found in 1.0. In this cheatsheet we'll take a look at XPath 1.0 and 2.0 features that are relevant to HTML parsing.
This XPath cheatsheet contains all XPath features used in HTML parsing.
Clicking on the explanation text will take you to a real-life interactive example with more details. Note that XPath selectors can differ in different implementations, so unique non-standard features are marked as such.
XPath supports element indexing. a[1] will select the first a element, a[2] will select the second, and so on. Note that XPath indexing starts at 1, not 0.
The / selector selects direct children of the current node. a will select all a elements, while div/a will select only a elements that are direct children of a div node.
The // selects any descendant under current context. This is one of the most useful selectors in XPath, as it allows you to select any element recursively.
Explicit Relativity
./ and .// are explicit versions of / and //. They are useful when you want to be explicit about the relativity of the selector. This is mostly relevant for engines that allow chaining multiple selectors, like parsel:
from parsel import Selector
sel = Selector(text=html)  # html: the page source being parsed
product = sel.xpath('//div[@class="product"]')
product_features = product.xpath('.//div[@class="features"]')
# ^^^
# without . it would select all features in the document
The ancestor:: axis is similar to .. but selects ancestors at any depth level. Note that axis selectors (::) can be combined with the * wildcard, e.g. ancestor::* to select any ancestor element.
The preceding-sibling selector selects all nodes that appear before the current node and share the same parent. This is useful for selecting all sibling nodes above the current node.
The following-sibling selector selects all nodes that appear after the current node and share the same parent. This is useful for selecting all sibling nodes below the current node.
XPath supports combining multiple match rules into one predicate using the or keyword.
And Logic
<a href="https://twitter.com/@scrapfly_dev">Twitter</a>
<a href="https://twitter.com/">Powered by Twitter</a>
XPath supports combining multiple match rules into one predicate using the and keyword. Alternatively, multiple predicates can be chained in [] as [contains(@href, "twitter")][contains(@href, "@")].
The element's text value can be selected using text() function.
Most commonly //text() is used to select all text values under a node, and this is the recommended selector in web scraping as it's more robust to HTML changes.
The name() function retrieves the element's node name so it can be used in predicate matchers. For example, //*[starts-with(name(), "h")] selects all nodes whose name starts with h, such as heading elements.
The matches() function is a powerful regular expression function that can check whether a string matches a pattern. It takes 3 arguments: the string, the pattern, and optional flags where flags can be:
i: Case-insensitive matching
s: Dot matches all (affects the dot . metacharacter to match newlines as well)
m: Multi-line matching (affects how ^ and $ behave)
x: Extended (ignores whitespace within the pattern)
Note that while tools like Scrapy, parsel and lxml don't fully support XPath 2.0, they implement this functionality through the EXSLT extension as re:test.
Tokenize Function
<!-- select paragraphs with more than 3 words -->
<p>one two</p>
<p>one two three four five six seven</p>
<p>one</p>
The tokenize() function can split a string by a given pattern. It's useful for counting words or splitting complex string values. Note that tokenize() is only available in XPath 2.0.
The lower-case(A) function can turn any string value to lowercase. It's useful for case-insensitive matching in combination with contains.
Note lower-case is only available in XPath 2.0
Starts-With Function
<div>
<p>
Follow us on:
<a href="https://twitter.com/@scrapfly_dev">social: Twitter</a>
<a href="https://x.com/@scrapfly_dev">social: X.com</a>
<a href="https://scrapfly.io/blog">newsletter: scrapfly.io/blog</a>
</p>
</div>
starts-with(a, b) checks whether string a starts with string b. Like contains(), it's a string matching function, but it's stricter, matching only at the beginning of the string.
Ends-With Function
<div>
<p>
Follow us on:
<a href="https://twitter.com/@scrapfly_dev">News on Twitter</a>
<a href="https://twitter.com/@scrapfly_dev">Support on Twitter</a>
<a href="https://twitter.com/@scrapfly_dev">News on X.com</a>
</p>
</div>
ends-with(a, b) checks whether string a ends with string b. Like contains(), it's a string matching function, but it matches only at the end of the string. Note that ends-with is only available in XPath 2.0.
The concat(a,b,c...) is a utility function that can join multiple elements into a single string.
Substring Function
<div>
<a>+99 12345678 Call Us</a>
<a>+87 87654321 Text Us</a>
</div>
The substring() function can slice a string by given indexes. It takes 3 arguments: the string, slice start and slice length.
Substring-Before Function
<div>
<a href="https://twitter.com/@scrapfly_dev">News on Twitter</a>
<a href="https://twitter.com/@scrapfly_dev">Support on Twitter</a>
<a href="https://x.com/@scrapfly_dev">Industry Insights on X.com</a>
</div>
The substring-before() function returns the part of a string before the first occurrence of a separator. It takes 2 arguments: the string and the separator.
Substring-After Function
<div>
<a href="https://twitter.com/@scrapfly_dev">News on Twitter</a>
<a href="https://twitter.com/@scrapfly_dev">Support on Linkedin</a>
<a href="https://x.com/@scrapfly_dev">Industry Insights on X.com</a>
</div>
The substring-after() function returns the part of a string after the first occurrence of a separator. It takes 2 arguments: the string and the separator.
Normalize-Space
<div>
<a class=" foo bar ">select</a>
</div>
The normalize-space() function strips leading and trailing whitespace and collapses repeated space characters into one. It's vital when matching values with spaces, as HTML found online can be inconsistent in its use of whitespace.
The position() function returns the node's position among its surrounding siblings. Note that [position()=1] is equivalent to [1], however position() enables more complex matching within predicates.
The preceding and following selectors can be used to select nodes between two known elements. Alternatively, preceding-sibling and following-sibling can be used as well as a more strict selector.
There are many other creative ways to select elements between two known elements.
For example, by checking the first preceding-sibling explicitly:
To extract all links on the page, the recursive //a selector can be used with the @href attribute selector.
Note that extracted links are often relative (like /product/1) and need to be resolved to absolute URLs
using URL joining tools like Python's urllib.parse.urljoin() function.
To extract all images on the page, the recursive //img selector can be used with the @src attribute selector.
Note that extracted image links are often relative (like /product/image-1.webp) and need to be resolved to absolute URLs
using URL joining tools like Python's urllib.parse.urljoin() function.
All Text
<article>
<h2>Should you buy Product</h2>
<p>
This is a paragraph about <a>product 1</a>
</p>
<ul>
<li>feature 1</li>
<li>feature 2</li>
<li><b>bonus</b> feature 3</li>
</ul>
</article>
The recursive //text() method will select all text values anywhere under the current node.
Note that this often returns empty or whitespace-only values as well, so the output needs to be cleaned up manually.
Selecting only the text of direct children can be done by chaining the child selector in combination with * wildcard for node name.
Unlike /text(), this will not select the current node's own text but only the text of its direct children.