XPath is the most powerful query language for HTML parsing in web scraping. It's available in Scrapy and popular HTML parsing libraries like parsel.
XPath comes in versions 1.0, 2.0 and 3.0. XPath 1.0 and 2.0 are the versions most commonly supported by web scraping toolsets, and most functionality can be found in 1.0. In this cheatsheet we'll take a look at XPath 1.0 and 2.0 features that are relevant to HTML parsing.
This XPath cheatsheet contains all XPath features used in HTML parsing.
Clicking on the explanation text will take you to a real-life interactive example with more details. Note that XPath selectors can differ in different implementations, so unique non-standard features are marked as such.
XPath supports element indexing. a[1] will select the first a element, a[2] will select the second, and so on. Note that XPath indexing starts at 1, not 0.
The / selector selects direct children of the current node. a will select all a elements, while div/a will select only a elements that are direct children of a div node.
The // selects any descendant under current context. This is one of the most useful selectors in XPath, as it allows you to select any element recursively.
Explicit Relativity
./ and .// are explicit versions of / and //. They are useful when you want to be explicit about the relativity of the selector. This is mostly relevant for engines that allow chaining multiple selectors, like parsel:
from parsel import Selector
sel = Selector(text=html)  # html: the page source being parsed
product = sel.xpath('//div[@class="product"]')
product_features = product.xpath('.//div[@class="features"]')
# ^^^
# without . it would select all features in the document
The ancestor:: axis is similar to .. but selects ancestors at any depth level. Note that axis selectors (::) can be combined with the * wildcard, e.g. ancestor::* to select any ancestor element.
The preceding-sibling selector selects all nodes that appear before the current node and share the same parent. This is useful for selecting all sibling nodes above the current node.
The following-sibling selector selects all nodes that appear after the current node and share the same parent. This is useful for selecting all sibling nodes below the current node.
XPath supports combining multiple match rules into one predicate using the or keyword.
And Logic
<a href="https://twitter.com/@scrapfly_dev">Twitter</a>
<a href="https://twitter.com/">Powered by Twitter</a>
XPath supports combining multiple match rules into one predicate using the and keyword. Alternatively, multiple predicates can be chained in [] as [contains(@href, "twitter")][contains(@href, "@")].
The element's text value can be selected using text() function.
Most commonly //text() is used to select all text values under a node, and this is the recommended selector in web scraping as it's more robust to HTML changes.
The name() function retrieves the element's node name so it can be used in predicate matchers. For example, //*[starts-with(name(), "h")] selects all nodes whose name starts with h, such as heading elements.
The matches() function is a powerful regular expression function that can check whether a string matches a pattern. It takes 3 arguments: the string, the pattern, and optional flags where flags can be:
i: Case-insensitive matching
s: Dot matches all (affects the dot . metacharacter to match newlines as well)
m: Multi-line matching (affects how ^ and $ behave)
x: Extended (ignores whitespace within the pattern)
Note that while tools like Scrapy, parsel and lxml don't fully support XPath 2.0, they implement this functionality through the EXSLT extension as re:test.
Tokenize Function
<!-- select paragraphs with more than 3 words -->
<p>one two</p>
<p>one two three four five six seven</p>
<p>one</p>
The tokenize() function can split a string by a given pattern. It's useful for counting words or splitting complex string values. Note that tokenize() is only available in XPath 2.0.
The lower-case(A) function can turn any string value to lowercase. It's useful for case-insensitive matching in combination with contains.
Note lower-case is only available in XPath 2.0
Starts-With Function
<div>
<p>
Follow us on:
<a href="https://twitter.com/@scrapfly_dev">social: Twitter</a>
<a href="https://x.com/@scrapfly_dev">social: X.com</a>
<a href="https://scrapfly.io/blog">newsletter: scrapfly.io/blog</a>
</p>
</div>
starts-with(a, b) checks whether string a starts with string b. Like contains(), it's a string matching function, but it's stricter, matching only at the beginning of the string.
Ends-With Function
<div>
<p>
Follow us on:
<a href="https://twitter.com/@scrapfly_dev">News on Twitter</a>
<a href="https://twitter.com/@scrapfly_dev">Support on Twitter</a>
<a href="https://twitter.com/@scrapfly_dev">News on X.com</a>
</p>
</div>
ends-with(a, b) checks whether string a ends with string b. Like contains(), it's a string matching function, but it matches only at the end of the string. Note that ends-with is only available in XPath 2.0.
The concat(a,b,c...) is a utility function that can join multiple elements into a single string.
Substring Function
<div>
<a>+99 12345678 Call Us</a>
<a>+87 87654321 Text Us</a>
</div>
The substring() function can slice a string by given indexes. It takes 3 arguments: the string, slice start and slice length.
Substring-Before Function
<div>
<a href="https://twitter.com/@scrapfly_dev">News on Twitter</a>
<a href="https://twitter.com/@scrapfly_dev">Support on Twitter</a>
<a href="https://x.com/@scrapfly_dev">Industry Insights on X.com</a>
</div>
The substring-before() function returns the part of a string before the first occurrence of a separator. It takes 2 arguments: the string and the separator.
Substring-After Function
<div>
<a href="https://twitter.com/@scrapfly_dev">News on Twitter</a>
<a href="https://twitter.com/@scrapfly_dev">Support on Linkedin</a>
<a href="https://x.com/@scrapfly_dev">Industry Insights on X.com</a>
</div>
The substring-after() function returns the part of a string after the first occurrence of a separator. It takes 2 arguments: the string and the separator.
Normalize-Space
<div>
<a class=" foo bar ">select</a>
</div>
The normalize-space() function strips leading and trailing whitespace and collapses repeated space characters into one. It's vital when matching values with spaces, as HTML found online can be inconsistent in its use of whitespace.
The position() function returns the node's position among its surrounding siblings. Note that [position()=1] is equivalent to [1], however position() enables more complex matching within predicates.
The preceding and following selectors can be used to select nodes between two known elements. Alternatively, preceding-sibling and following-sibling can be used as well as a more strict selector.
There are many other creative ways to select elements between two known elements.
For example, by checking the first preceding-sibling explicitly:
To extract all links on the page, the recursive //a selector can be used with the @href attribute selector.
Note that extracted links are often relative (like /product/1) and need to be resolved to absolute URLs
using URL joining tools like Python's urllib.parse.urljoin() function.
To extract all images on the page, the recursive //img selector can be used with the @src attribute selector.
Note that extracted image links are often relative (like /product/image-1.webp) and need to be resolved to absolute URLs
using URL joining tools like Python's urllib.parse.urljoin() function.
All Text
<article>
<h2>Should you buy Product</h2>
<p>
This is a paragraph about <a>product 1</a>
</p>
<ul>
<li>feature 1</li>
<li>feature 2</li>
<li><b>bonus</b> feature 3</li>
</ul>
</article>
The recursive //text() method will select all text values anywhere under the current node.
Note that this often returns empty or whitespace-only values as well, so the output needs to be cleaned up manually.
Selecting only the text of direct children can be done by chaining the child selector in combination with * wildcard for node name.
Unlike /text(), this will not select the current node's own text but only the text of its direct children.