CSS selectors are a powerful HTML query language used by browsers to determine which HTML elements to style.
They are also very useful for HTML parsing when web scraping or processing HTML data, as the same queries can be used to select values.
In web scraping, CSS selectors are an easy and powerful way to parse HTML data and are used in many web scraping libraries. This article is a curated CSS selector cheatsheet for web scraping, though it applies to any HTML parsing task.
This CSS selector cheatsheet contains all selector features used in HTML parsing.
Note that CSS selector support differs between implementations, so non-standard features are marked as such.
Cheatsheet
* limited availability
What CSS selectors cannot do:
- Select preceding siblings.
- Select parent or ancestor elements (except indirectly through `:has()`, where it is supported).
- Select array slices.
- Select by text value.
- Select by element count.
- Select by element depth.
These features are, however, available in XPath selectors.
Direct Child
The `>` direct child selector selects only direct children of the parent element. Here, the `a` element is selected as it's a direct child of `p` and `div`. Note that this selector can be fragile, as HTML tree depth can change easily and break the selector. For example, if the `a` element is wrapped in a `span`, the selector will no longer match.
Any Descendant
A space selects any descendant, no matter how many layers deep. Here, the `a` element is selected as it's a descendant of `div`.
Any Following Sibling
The `~` selector selects any following sibling, no matter how many other siblings sit in between. Here, the `p` elements are selected as they are following siblings of `.ad`.
Direct Following Sibling
The `+` selector selects the one adjacent following sibling (i.e. it has to come immediately after the element). Here, the first `p` element is selected as it's the direct following sibling of `.ad`.
Joining Selectors
Selectors can be joined with `,` to select multiple elements. Here, the `p`, `span` and `a` elements are selected. Note that the result order usually follows the structure of the HTML tree.
by Class
The `.` selector can be used to restrict the selection to elements that contain the class value in the `class` attribute. Here, the `div` elements with `product` in the `class` attribute are selected.
by ID
The `#` selector matches the element whose `id` attribute equals the given value, e.g. `#main`. IDs are meant to be unique within a page.
by Attribute
Square brackets (`[]`) can be used to match elements by attribute. For example, `[href]` matches any element that has an `href` attribute (even if it's empty).
by Attribute Value
Attributes can be matched exactly using the `[attrib=value]` syntax. Note that this is case-sensitive.
by Case Insensitive Attribute Value
Any attribute matcher can be made case-insensitive by adding an `i` flag before the closing bracket (e.g. `[attrib=value i]`). Here, the `span` and `div` elements are selected as they match the `data-item` attribute value case-insensitively.
by Partial Attribute Value
The `*=` matcher matches when the attribute contains the supplied value anywhere in the value string.
by Attribute Value Ignoring Minus Suffix
The `|=` matcher is unique: it matches only when the attribute value equals the supplied value exactly or starts with it followed by a hyphen (i.e. has a trailing `-suffix`).
by Attribute Value Starting With
The `^=` matcher matches when the attribute value starts with the supplied value exactly.
by Attribute Value Ending With
The `$=` matcher matches when the attribute value ends with the supplied value exactly.
by Attribute Containing Word
The `~=` matcher matches when the attribute value contains the supplied value as a whole word. A word is defined as a string of characters delimited by whitespace.
Reversing Matchers Using Not
The `:not()` pseudo selector follows a node selector and reverses any matcher placed inside it, such as `.class`, `#id` or attribute matchers like `[attribute=ignore]`.
First Child
The `:first-child` pseudo selector selects only the elements that are the first child among all their siblings. In other words, the first element in the group.
Last Child
The `:last-child` pseudo selector selects only the elements that are the last child among all their siblings. In other words, the last element in the group.
Nth Child
The `:nth-child` pseudo selector selects only the elements that are the Nth child among all their siblings. In other words, the Nth element in the group, counting from 1. It also supports special values like `even` and `odd`.
Nth Last Child
The `:nth-last-child` pseudo selector is just the `:nth-child` selector counted in reverse, from the end of the group. For example, `:nth-last-child(2)` selects the 2nd to last element in the group.
First Of Type
The `:first-of-type` pseudo selector selects the first element of the given type. It's similar to `:first-child` but instead of considering all siblings, it considers only siblings of the same element type.
Last Of Type
The `:last-of-type` pseudo selector selects the last element of the given type. It's similar to `:last-child` but instead of considering all siblings, it considers only siblings of the same element type.
Nth Of Type
The `:nth-of-type` pseudo selector selects elements of the given type that are the Nth element of that type in their group. It's similar to `:first-of-type` and `:last-of-type`, just more flexible as the index can be specified. It also supports special values like `even` and `odd`.
Only of Type
The `:only-of-type` pseudo selector will select elements of given type that are the only element of said type in their group.
Has Descendant
The `:has()` pseudo selector is a way of selecting a parent element based on the existence of a certain child. Here, the `div` elements that have a child with the `product` class are selected. Note that using the any descendant selector (space) can cause a lot of duplicate results, as every ancestor of the matching element also matches, so using the direct child selector (`>`) is recommended.
Is Matcher
The `:is()` pseudo selector is a way of selecting elements that match any of the supplied selectors. Here, the `div` and `span` elements are selected as they match the `:is()` selector. This pseudo selector can be very powerful when combined with `:not()`, for example to exclude `.foo` from the selection.
Getting Attribute Value
The `::attr()` is a non-standard pseudo selector used in tools like scrapy, parsel and the Scrapfly SDK to select only the value of an element attribute.
Getting Element Text
The `::text` is a non-standard pseudo selector used in tools like scrapy, parsel and Scrapfly SDK to select element text directly.