XPath is the most powerful query language for parsing HTML in web scraping. It's available in Scrapy and in popular HTML parsing libraries like parsel.
XPath comes in versions 1.0, 2.0 and 3.0. Web scraping toolsets most commonly support XPath 1.0 (sometimes 2.0), and most of the functionality you need can be found in 1.0. In this cheatsheet we'll take a look at the XPath 1.0 and 2.0 features that are relevant to HTML parsing.
Parsing HTML with XPath
An introduction to XPath in the context of web scraping: how to extract data from HTML documents using XPath, best practices and available tools.
This XPath cheatsheet contains all XPath features used in HTML parsing.
Note that XPath selectors can differ between implementations, so unique non-standard features are marked as such.
Cheatsheet
by Element Name
The simplest selector is the element name: it selects all nodes with the given name.
Element Name Wildcard
When the element name is *, it matches any node. This can be further refined using self() and attribute matching.
Element Self
The self axis or the . shorthand can be used to access the current context. This is useful for advanced element matching in [] predicates.
Nth Element
XPath supports element indexing: a[1] will select the first a element, a[2] will select the second, and so on. Note that XPath indexing starts at 1, not 0.
Direct Child Element
The / selector selects direct children of the current node. a will select all a elements, while div/a will select only a elements that are direct children of a div node.
Any Descendant Element
The // selector selects any descendant under the current context. This is one of the most useful selectors in XPath, as it allows you to select matching elements recursively.
Explicit Relativity
./ and .// are explicit versions of / and //. They are useful when you want to be explicit about the relativity of the selector. This is mostly relevant for engines that allow working with multiple selectors, like parsel:
from parsel import Selector
sel = Selector(text=html)  # html being the page source string
product = sel.xpath('//div[@class="product"]')
product_features = product.xpath('.//div[@class="features"]')
#                                 ^^^
# without the leading . it would select all features in the document
Element Parent
The .. shorthand or the parent:: axis can be used to navigate up the HTML node tree or to match by parent values.
Element Ancestors
The ancestor:: selector is similar to .. but selects ancestors at any depth. Note that the axis separator :: can be combined with the * wildcard to select any ancestor element.
Preceding Nodes
The preceding selector selects all nodes that appear before the current node. This is useful for selecting all nodes above the current node.
Following Nodes
The following selector selects all nodes that appear after the current node. This is useful for selecting all nodes below the current node.
Preceding Siblings
The preceding-sibling selector selects all nodes that appear before the current node and share the same parent. This is useful for selecting sibling nodes above the current node.
Following Siblings
The following-sibling selector selects all nodes that appear after the current node and share the same parent. This is useful for selecting sibling nodes below the current node.
Union Logic
XPath supports combining multiple selectors in a single query: the | operator joins multiple selectors and returns the union of their results.
Or Logic
XPath supports logic over multiple matchers using the or keyword, which can be used to join multiple match rules.
And Logic
XPath supports logic over multiple matchers using the and keyword, which can be used to join multiple match rules. Alternatively, multiple predicates can be chained in [] as [contains(@href, "twitter")][contains(@href, "@")].
Attribute by Name
Using the @attribute syntax, any HTML attribute can be selected.
Element Text
The element's text value can be selected using the text() function. Most commonly //text() is used to select all text values under a node; this is the recommended selector in web scraping as it's more robust to HTML changes:
Element Predicate
All node selectors can take match predicates in [] that add selection rules. Predicates can be nested and chained to narrow down the selection. Element predicates support basic arithmetic and comparison operators like > and < for greater/less than, = for exact match, and != for not equal:
Get Element Name
The name() function retrieves the element's node name so it can be used in predicate matchers. For example, //*[starts-with(name(), "h")] selects all nodes whose name starts with h.
Not Function
The not() function inverts any value as a boolean. It's useful for excluding certain elements from the selection.
Number Function
The number() function casts numeric strings to numbers. It's mostly used in combination with string processing functions like substring().
Contains Function
The contains(A, B) function checks whether B is a substring of A. Note that contains() is case sensitive.
Matches Function
The matches() function is a powerful regular expression function that checks whether a string matches a pattern. It takes 3 arguments: the string, the pattern, and optional flags, where the flags can be:
- i: Case-insensitive matching
- s: Dot matches all (affects the dot . metacharacter to match newlines as well)
- m: Multi-line matching (affects how ^ and $ behave)
- x: Extended (ignores whitespace within the pattern)
Note that matches() is an XPath 2.0 feature; tools like scrapy, parsel and lxml don't fully support XPath 2.0 but implement equivalent functionality through the EXSLT extension function re:test.
Tokenize Function
The tokenize() function splits a string by a given pattern. It's useful for counting words or splitting complex string values. Note that tokenize() is only available in XPath 2.0.
Lower-Case Function
The lower-case(A) function turns any string value to lowercase. It's useful for case-insensitive matching in combination with contains(). Note that lower-case() is only available in XPath 2.0; in XPath 1.0 a similar effect can be achieved with the translate() function.
Starts-With Function
starts-with(a, b) checks whether a starts with b. Like contains(), it's a string matching function, but it matches only at the beginning of the string.
Ends-With Function
ends-with(a, b) checks whether a ends with b. Like contains(), it's a string matching function, but it matches only at the end of the string. Note that ends-with() is only available in XPath 2.0.
Concat Function
concat(a, b, c...) is a utility function that joins multiple values into a single string.
Substring Function
The substring() function slices a string by the given indexes. It takes 3 arguments: the string, the slice start (1-based, like all XPath indexing) and the slice length.
Substring-Before Function
The substring-before() function takes 2 arguments, the string and a separator, and returns the part of the string before the first occurrence of the separator.
Substring-After Function
The substring-after() function takes 2 arguments, the string and a separator, and returns the part of the string after the first occurrence of the separator.
Normalize-Space
The normalize-space() function trims leading and trailing whitespace and collapses runs of whitespace into single spaces. It's vital when matching values that contain spaces, as HTML found online is often inconsistent with whitespace use.
Count Function
The count() function returns the count of matched elements. It's useful for checking whether a selector matches any elements.
Position Function
The position() function returns the node's position within its surrounding siblings. Note that [position()=1] is equivalent to [1]; however, position() can be used in more complex matching within predicates.
Last Function
The last() function returns the context size. [last()] will select the last node.
String-Length Function
string-length(A) returns the length of a string. It's useful for checking whether a string is empty or not.
Space Separated Class Check
Using contains(@class, "foo") is not an ideal way to check for class presence, as it will also match any class value that merely contains foo, like xfooy. Instead, contains(concat(" ", normalize-space(@class), " "), " foo ") is equivalent to the CSS .class syntax: it checks whether the whole space-separated class value is present.
Node Between Two Nodes
The preceding and following selectors can be used to select nodes between two known elements. Alternatively, preceding-sibling and following-sibling can be used as stricter selectors. There are many other creative ways to select elements between two known elements: for example, checking the first preceding sibling explicitly, or using count() to count preceding siblings when the text value is not reliable.
All Page Links
To extract all links on the page, use the recursive //a selector with the @href attribute selector. Note that extracted links are often relative (like /product/1) and need to be resolved to absolute URLs using URL joining tools like Python's urllib.parse.urljoin() function.
All Image Links
To extract all images on the page, use the recursive //img selector with the @src attribute selector. Note that extracted links are often relative (like /product/image-1.webp) and need to be resolved to absolute URLs using URL joining tools like Python's urllib.parse.urljoin() function.
All Text
The recursive //text() selector will select all text values anywhere under the current node. Note that this often returns whitespace-only values as well, so the output needs to be cleaned up manually.
All Direct Children Text
Selecting only the text of direct children can be done by chaining the child selector with the * wildcard for the node name, as in */text(). Unlike text(), this will not select the current node's own text but only the text of its children.