Ultimate CSS Selector Cheatsheet for HTML Parsing

by Bernardas Ališauskas May 10, 2024

#css-selectors #data-parsing

CSS selectors are a powerful HTML query language that browsers use to determine which HTML elements to style.
It's also incredibly useful in HTML parsing when web scraping or processing HTML data, as the same queries can be used to select values as well.

In web scraping, CSS selectors are an easy and powerful way to parse HTML data and are used in many web scraping libraries. This article is a carefully curated CSS Selector cheatsheet for web scraping, though it can apply to any HTML parsing tasks.

This CSS selector cheatsheet contains all selector features used in HTML parsing.
Each selector is explained in more detail, with an example, in the sections that follow. Note that CSS selector support differs between implementations, so non-standard features are marked as such.

Cheatsheet

Selector	Explanation
	Navigation
>	selects direct child
(space)	selects any descendant
~	selects following sibling
+	selects direct following sibling
,	separator for joined selectors
	Attribute Matching
.	selects by class
#	selects by id
[]	attribute selector
[attr]	select elements that have attribute present (even if it's empty)
[attr=value]	match exact attribute value
[attr=value i]	`i` suffix turns any attribute match case insensitive*
[attr*=value]	match containing attribute value
[attr|=value]	match exact value, or value followed by a "-suffix"
[attr^=value]	match attributes that start with value
[attr$=value]	match attributes that end with value
[attr~=value]	match attributes that contain a word
	Element Matching
:not()	reverses selection
:has()	select if element has a matching descendant
:is()	select if element matches any of the supplied selectors
:first-child	select if it's the first element in the group
:last-child	select if it's the last element in the group
:nth-child()	select if it's the Nth element, supports `even`, `odd`
:nth-last-child()	like `nth-child` but reversed
:first-of-type	select if it's the first element of that type in the group
:last-of-type	select if it's the last element of that type in the group
:nth-of-type()	select if it's the Nth element of that type in the group
:only-of-type	select if it's the only element of that type in the group
	Non-standard Functions
::attr(name)	select attribute value. Available in scrapy, parsel, Scrapfly SDK
::text	select text value. Available in scrapy, parsel, Scrapfly SDK

* limited availability

What CSS selectors cannot do:

Select preceding siblings.
Select parent or ancestor elements.
Select array slices.
Select by text value.
Select by element count.
Select by element depth.

These features are, however, available in XPath selectors.
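
As a rough illustration, Python's standard-library xml.etree.ElementTree supports a limited XPath subset that already covers two of the items above: stepping up to a parent and matching by text value. The snippet, class names and variable names below are made up for this sketch, and full XPath engines support considerably more.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet used only for this sketch.
html = """
<div>
  <p class="intro">Follow us on <a href="https://x.com/@scrapfly_dev">X!</a></p>
  <p class="ad">ignore</p>
</div>
"""
root = ET.fromstring(html)

# Select the parent of a matched element: ".." steps up one level.
parents = root.findall(".//a/..")
parent_classes = [el.get("class") for el in parents]

# Select by text value: [.='text'] compares the element's complete text.
by_text = root.findall(".//p[.='ignore']")
text_classes = [el.get("class") for el in by_text]

print(parent_classes)  # ['intro']
print(text_classes)    # ['ad']
```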

Direct Child

<div> <p> Follow us on <a href="https://x.com/@scrapfly_dev">X!</a> <skip>ignore</skip> </p> </div>

The > direct child selector selects only direct children of the parent element. Here, the a element is selected as it's a direct child of p, which in turn is a direct child of div. Note that this selector can be fragile: HTML tree depth changes easily, and any change breaks the selector. For example, if the a element gets wrapped in a span, the selector no longer matches.

Any Descendant

<div> <p> Follow us on <a href="https://x.com/@scrapfly_dev">X!</a> <skip>ignore</skip> </p> </div>

Space selects any descendant, no matter how many layers deep. Here, the a element is selected as it's a descendant of div.
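
The difference between the direct child and descendant combinators can be sketched with the standard library alone. ElementTree does not understand CSS, so the path expressions below only mirror what `div > p > a` and `div a` would do. The extra span wrapper is a hypothetical page change, added to show how the direct-child version stops matching.

```python
import xml.etree.ElementTree as ET

original = ET.fromstring(
    '<div><p>Follow us on <a href="https://x.com/@scrapfly_dev">X!</a></p></div>'
)
# Hypothetical redesign: the link is now wrapped in an extra <span>.
wrapped = ET.fromstring(
    '<div><p>Follow us on <span><a href="https://x.com/@scrapfly_dev">X!</a></span></p></div>'
)

def direct_child_links(div):
    # Mirrors "div > p > a": every step must be a direct child.
    return [a.text for a in div.findall("./p/a")]

def descendant_links(div):
    # Mirrors "div a": the link may sit at any depth below div.
    return [a.text for a in div.iter("a")]

print(direct_child_links(original), descendant_links(original))  # ['X!'] ['X!']
print(direct_child_links(wrapped), descendant_links(wrapped))    # [] ['X!']
```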

Any Following Sibling

<article> <p>ignore</p> <p class="ad">ignore</p> <p>select</p> <p>select</p> </article>

The ~ selects all following siblings, no matter how many other siblings sit in between. Here, the last two p elements are selected as they are following siblings of .ad.

Direct Following Sibling

<article> <p>ignore</p> <p class="ad">ignore</p> <p>select</p> <p>ignore</p> </article>

The + selects only the adjacent following sibling (i.e. the element has to come immediately after). Here, only the first p element after .ad is selected as it's the direct following sibling of .ad.
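
The two sibling combinators can be compared side by side with a small standard-library sketch. ElementTree has no CSS support, so the list slicing below only mirrors what `.ad ~ p` and `.ad + p` would return; the paragraph texts are placeholders chosen for this example.

```python
import xml.etree.ElementTree as ET

article = ET.fromstring(
    "<article>"
    "<p>first</p>"
    '<p class="ad">ad</p>'
    "<p>second</p>"
    "<p>third</p>"
    "</article>"
)

children = list(article)  # element children in document order
ad_index = next(i for i, el in enumerate(children) if el.get("class") == "ad")

# Mirrors ".ad ~ p": every later <p> sibling of the ad.
general = [el.text for el in children[ad_index + 1:] if el.tag == "p"]

# Mirrors ".ad + p": only the sibling immediately after the ad, if it is a <p>.
adjacent = [el.text for el in children[ad_index + 1:ad_index + 2] if el.tag == "p"]

print(general)   # ['second', 'third']
print(adjacent)  # ['second']
```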

Joining Selectors

<div> <article> <p>select paragraph</p> <div> <div>ignore</div> <p>select nested paragraph</p> </div> <span>select span</span> <a>select link</a> <div>ignore</div> </article> </div>

Selectors can be joined with , to select multiple elements. Here, the p, span and a elements are selected. Note that the result order usually follows the structure of the HTML tree.

by Class

<div> <div class="product">select</div> <div class="sold product">select</div> <div class="sold product new">select</div> <div class="product-2">ignore</div> </div>

The . selector restricts the selection to elements that have the given value as one of the words in their class attribute. Here, the div elements with product in the class attribute are selected, while product-2 is a different class and is ignored.

by ID

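<div> <div id="product">select</div> <div id="sold product">ignore</div> <div id="sold product new">ignore</div> <div id="product-2">ignore</div> </div>

The # selector matches elements whose id attribute equals the supplied value exactly. Unlike class, the id attribute holds a single value rather than a list of words, so here only the first div is selected.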

by Attribute

<div> <a href="#">enabled link</a> <a>disabled link</a> <a href="">enabled link</a> </div>

Square brackets ([]) can be used to match elements by attribute values. For example, [href] matches any element that has href attribute (even if it's empty).

by Attribute Value

<div> <span data-item="product">select</span> <div data-item="product">select</div> <span data-item="product-new">ignore</span> </div>

Attributes can be matched exactly using the [attr=value] syntax. Note that this match is case-sensitive.

by Case Insensitive Attribute Value

<div> <span data-item="PRODUCT">select</span> <div data-item="Product">select</div> <div data-item="product">select</div> <span data-item="product-new">ignore</span> </div>

Any attribute matcher can be made case-insensitive by adding i suffix. Here, the span and div elements are selected as they match the data-item attribute value case-insensitively.

by Partial Attribute Value

<div> <a href="social-link.com">select</a> <a href="social-link2.com">select</a> <a href="ignore">ignore</a> </div>

The *= will match when attribute contains the supplied value anywhere in the value string.

by Attribute Value Ignoring Minus Suffix

<div> <a class="important-link">select</a> <a class="important-url">select</a> <a class="important">select</a> <a class="foo important-item">doesn't begin exactly</a> <a class="important item">contains more than just match</a> <a class="importantitem">doesn't match</a> </div>

The |= selector matches only when the attribute value equals the supplied value exactly or starts with it followed by a hyphen (a trailing -suffix).

by Attribute Value Starting With

<div> <a class="dataname">select</a> <a class="data-age">select</a> <a class="data extra">select</a> <a class="foo data">ignore</a> </div>

The ^= selector matches when attribute value starts with the supplied value exactly.

by Attribute Value Ending With

<div> <a class="name-data">select</a> <a class="age data">select</a> <a class="data">select</a> <a class="data foo">ignore</a> </div>

The $= selector matches when attribute value ends with the supplied value exactly.

by Attribute Containing Word

<div> <a class="data">select</a> <a class="foo data">select</a> <a class="foo data bar">select</a> <a class="datafoo">ignore</a> <a class="data-bar">ignore</a> </div>

The ~= selector matches when attribute value contains the supplied value as a word. A word is defined as a string of characters delimited by spaces.
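
The attribute operators from the last few sections boil down to plain string checks. The function below is a sketch of their matching rules, not the code of any particular selector engine: attr_matches and its parameters are hypothetical names, and lower() only approximates the ASCII case-insensitivity of the i suffix.

```python
def attr_matches(actual, op, expected, ignore_case=False):
    """Return True if attribute value `actual` satisfies [attr<op>expected]."""
    if ignore_case:  # mirrors the `i` suffix (approximation, see lead-in)
        actual, expected = actual.lower(), expected.lower()
    if op == "=":
        return actual == expected
    if op == "|=":
        # Exact match, or the value followed by a "-suffix".
        return actual == expected or actual.startswith(expected + "-")
    if op == "~=":
        # The value must equal one whole whitespace-separated word.
        return bool(expected) and expected in actual.split()
    if not expected:
        # The substring operators never match an empty value.
        return False
    if op == "*=":
        return expected in actual
    if op == "^=":
        return actual.startswith(expected)
    if op == "$=":
        return actual.endswith(expected)
    raise ValueError(f"unknown operator: {op}")


classes = ["important-link", "important", "foo important-item", "important item", "importantitem"]
print([c for c in classes if attr_matches(c, "|=", "important")])
# ['important-link', 'important']
```

Seen this way, the class selector .data behaves like [class~=data], which is why datafoo and data-bar are ignored in the example above.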

Reversing Matchers Using Not

<div> <a class="foo">select</a> <a class="ignore">ignore</a> <a class="bar">select</a> <a class="data">select</a> <a class="ignore">ignore</a> </div>

The :not() pseudo selector is appended to a node selector and reverses any matcher passed to it, such as .class, #id or attribute matchers like [attribute=ignore].

First Child

<div> <div class="products"> <a>select</a> <a>ignore</a> </div> <div class="products"> <a>select</a> <a>ignore</a> </div> <a>ignore</a> </div>

The :first-child pseudo selector will select only the elements that are first children in their group of all siblings. In other words, first element in the group.

Last Child

<div> <div class="products"> <a>ignore</a> <a>select</a> </div> <div class="products"> <a>ignore</a> <a>select</a> </div> <a>ignore</a> </div>

The :last-child pseudo selector will select only the elements that are last children in their group of all siblings. In other words, last element in the group.

Nth Child

<div> <div class="products"> <a>ignore</a> <a>select</a> <a>ignore</a> </div> <div class="products"> <div>ignore</div> <a>select</a> <a>ignore</a> </div> <a>ignore</a> </div>

The :nth-child pseudo selector will select only the elements that are Nth children in their group of all siblings. In other words, the Nth element in the group. It also supports the special values even and odd.

Nth Last Child

<div> <div class="products"> <a>ignore</a> <a>ignore</a> <a>select</a> <a>ignore</a> </div> <div class="products"> <div>ignore</div> <a>ignore</a> <a>select</a> <a>ignore</a> </div> <a>ignore</a> </div>

The :nth-last-child pseudo selector works like :nth-child but counts from the end of the group. In the example above we're selecting the 2nd to last element in the group.

First Of Type

<div> <div class="products"> <a>select</a> <a>ignore</a> </div> <div class="products"> <div>ignore</div> <a>select</a> <a>ignore</a> </div> <a>ignore</a> </div>

The :first-of-type pseudo selector will select the first element of given type. It's similar to :first-child but instead of considering all siblings, it considers only siblings of the same node type.

Last Of Type

<div> <div class="products"> <a>ignore</a> <a>select</a> </div> <div class="products"> <div>ignore</div> <a>ignore</a> <a>select</a> </div> <a>ignore</a> </div>

The :last-of-type pseudo selector will select the last element of given type. It's similar to :last-child but instead of considering all siblings, it considers only siblings of the same node type.

Nth Of Type

<div> <div class="products"> <a>ignore</a> <a>select</a> <a>ignore</a> </div> <div class="products"> <div>ignore</div> <a>ignore</a> <a>select</a> <a>ignore</a> </div> <a>ignore</a> </div>

The :nth-of-type pseudo selector will select elements of given type that are the Nth element of that type in their group. It's similar to :first-of-type and :last-of-type, just more flexible as the index can be specified. It also supports the special values even and odd.
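
The difference between counting among all siblings and counting among siblings of the same type is easiest to see in code. The sketch below uses the standard library to mirror `a:nth-child(n)` and `a:nth-of-type(n)`. The helper names and link texts are hypothetical, and ElementTree itself is not a CSS engine.

```python
import xml.etree.ElementTree as ET

group = ET.fromstring(
    '<div class="products">'
    "<div>banner</div>"
    "<a>first link</a>"
    "<a>second link</a>"
    "<a>third link</a>"
    "</div>"
)

def nth_child(parent, tag, n):
    # Mirrors "tag:nth-child(n)": position is counted among all siblings.
    children = list(parent)
    return [el.text for i, el in enumerate(children, start=1) if i == n and el.tag == tag]

def nth_of_type(parent, tag, n):
    # Mirrors "tag:nth-of-type(n)": position is counted among same-tag siblings only.
    same_type = [el for el in parent if el.tag == tag]
    return [el.text for i, el in enumerate(same_type, start=1) if i == n]

print(nth_child(group, "a", 2))    # ['first link']
print(nth_of_type(group, "a", 2))  # ['second link']
print(nth_child(group, "a", 1))    # [] because the first child is a <div>
```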

Only of Type

<div> <div class="products"> <a>ignore</a> <a>ignore</a> <a>ignore</a> </div> <div class="products"> <span>ignore</span> <a>select</a> <span>ignore</span> </div> <a>ignore</a> </div>

The :only-of-type pseudo selector will select elements of given type that are the only element of said type in their group.

Has Descendant

<article> <div> <a class="product">select</a> <a>select</a> </div> <div> <div class="wrapper"> <a class="product">select</a> <a>select</a> </div> </div> <div> <a class="advertisement">ignore</a> <div>ignore</div> </div> </article>

The :has() pseudo selector is a way of selecting a parent element based on the existence of a certain child. Here, the div elements that have a child with product class are selected. Note that using the any descendant selector (space) inside :has() can cause a lot of duplicate results, because every ancestor of the matching element is selected as well, so using the direct child selector (>) is recommended.
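
The duplicate-results problem can be reproduced with a short standard-library sketch. The two list comprehensions below only mirror what `div:has(> a.product)` and `div:has(a.product)` would return. The id values and the is_product helper are hypothetical, added so the matched elements can be told apart.

```python
import xml.etree.ElementTree as ET

article = ET.fromstring(
    "<article>"
    '<div id="plain"><a class="product">item</a></div>'
    '<div id="outer"><div id="wrapper"><a class="product">item</a></div></div>'
    '<div id="ads"><a class="advertisement">ad</a></div>'
    "</article>"
)

def is_product(el):
    # True for <a> elements that carry the "product" class.
    return el.tag == "a" and "product" in el.get("class", "").split()

# Mirrors "div:has(> a.product)": a product link must be a direct child.
direct = [d.get("id") for d in article.iter("div") if any(is_product(c) for c in d)]

# Mirrors "div:has(a.product)": a product link anywhere below the div counts,
# so every ancestor div of the link is returned as well.
nested = [d.get("id") for d in article.iter("div") if any(is_product(c) for c in d.iter("a"))]

print(direct)  # ['plain', 'wrapper']
print(nested)  # ['plain', 'outer', 'wrapper']
```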

Is Matcher

<article> <div class="product">select</div> <span class="product foo">select</span> <p class="product">ignore</p> </article>

The :is() pseudo selector is a way of selecting elements that match any of the supplied selectors. Here, the div and span elements are selected as they match the :is() selector. This pseudo selector can be very powerful when combined with :not(); for example, adding :not(.foo) would exclude the span from the selection.

Getting Attribute Value

<article> <a href="some url1">select</a> <a href="some url2">select</a> <span href="some url2">ignore</span> </article>

The ::attr() is a non-standard pseudo selector used in tools like scrapy, parsel, and Scrapfly SDK to select the value of an element attribute instead of the element itself.

Getting Element Text

<article> <a href="some url1">select<div>select-nested</div></a> <a href="some url2">select</a> <span href="some url2">ignore</span> </article>

The ::text is a non-standard pseudo selector used in tools like scrapy, parsel and Scrapfly SDK to select element text directly.
