CSS selectors are a powerful HTML query language that browsers use to determine which HTML elements to style.
They are also incredibly useful for HTML parsing when web scraping or processing HTML data, as the same queries can be used to select values.
In web scraping, CSS selectors are an easy and powerful way to parse HTML data, and many web scraping libraries support them. This article is a carefully curated CSS selector cheatsheet for web scraping, though it applies to any HTML parsing task.
This CSS selector cheatsheet contains all selector features used in HTML parsing.
Note that CSS selector support differs between implementations, so non-standard features are marked as such.
These features are, however, available in XPath selector engines.
Direct Child
<div>
<p >
Follow us on
<a href="https://x.com/@scrapfly_dev">X!</a>
<skip>ignore</skip>
</p>
</div>
The > direct child selector selects only direct children of the parent element. Here, with a selector such as div > p > a, the a element is selected because it is a direct child of p, which is in turn a direct child of div. Note that this selector can be fragile: the depth of an HTML tree changes easily, and any change breaks the selector. For example, if the a element is wrapped in a span, the selector no longer matches.
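The fragility described above can be reproduced with a minimal sketch. Python's standard library has no CSS selector engine, so ElementTree's path syntax stands in here: `./p/a` moves exactly one level per step, like `div > p > a`. The `wrapped` markup is an assumed variant for illustration, not part of the original example.

```python
import xml.etree.ElementTree as ET

# The snippet from above, parsed as XML (it happens to be well-formed).
html = """
<div>
  <p>
    Follow us on
    <a href="https://x.com/@scrapfly_dev">X!</a>
    <skip>ignore</skip>
  </p>
</div>
""".strip()
div = ET.fromstring(html)

# "./p/a" steps one level at a time, like the CSS "div > p > a".
direct = div.findall("./p/a")
print([a.text for a in direct])  # ['X!']

# Hypothetical variant: the same link wrapped in a <span>.
wrapped = ET.fromstring(
    '<div><p>Follow us on '
    '<span><a href="https://x.com/@scrapfly_dev">X!</a></span>'
    '</p></div>'
)
# The extra level means the direct-child path no longer matches.
broken = wrapped.findall("./p/a")
print(broken)  # []
```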
Any Descendant
<div>
<p >
Follow us on
<a href="https://x.com/@scrapfly_dev">X!</a>
<skip>ignore</skip>
</p>
</div>
A space selects any descendant, no matter how many layers deep. Here, the a element is selected as it's a descendant of div.
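A short sketch of the same idea, again using ElementTree paths as a stand-in because the Python standard library ships no CSS engine: `.//a` searches the whole subtree, like `div a`. The extra `span` and `b` wrappers are assumed markup, added only to show that depth does not matter.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the link sits two extra levels below <p>.
div = ET.fromstring(
    '<div><p>Follow us on '
    '<span><b><a href="https://x.com/@scrapfly_dev">X!</a></b></span>'
    '</p></div>'
)

# ".//a" matches at any depth, like the CSS "div a".
descendants = div.findall(".//a")
print([a.get("href") for a in descendants])  # ['https://x.com/@scrapfly_dev']
```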
The + selector selects the one sibling that immediately follows an element (i.e. it has to come right after it). Here, the first p element is selected as it directly follows .ad.
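The "right after it" rule can be emulated in a few lines. This is a sketch only: the markup and the `adjacent_siblings` helper are hypothetical, and the loop stands in for a selector like `.ad + p`, which a CSS-capable library would evaluate directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for illustration.
root = ET.fromstring(
    '<div>'
    '<div class="ad">advert</div>'
    '<p>first paragraph</p>'
    '<p>second paragraph</p>'
    '</div>'
)

def has_ad_class(el):
    """True when "ad" is one of the element's class tokens."""
    return "ad" in el.get("class", "").split()

def adjacent_siblings(parent, is_anchor, tag):
    """Emulate "anchor + tag": yield children with `tag` that directly follow an anchor."""
    children = list(parent)
    for prev, cur in zip(children, children[1:]):
        if is_anchor(prev) and cur.tag == tag:
            yield cur

matched = list(adjacent_siblings(root, has_ad_class, "p"))
print([p.text for p in matched])  # ['first paragraph']
```

Only the first paragraph matches; the second one follows a p, not the .ad element.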
Selectors can be joined with , to select multiple elements. Here, the p, span and a elements are selected. Note that the result order usually follows the structure of the HTML tree.
The . selector can be used to restrict the selection to elements that contain the class value in the class attribute. Here, the div elements with product in the class attribute are selected.
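One detail worth spelling out is that class matching works on whole whitespace-separated tokens, not substrings. The sketch below emulates `div.product` with plain Python, since the standard library has no CSS engine; the markup and the `has_class` helper are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for illustration.
root = ET.fromstring(
    '<body>'
    '<div class="product">A</div>'
    '<div class="product featured">B</div>'
    '<div class="products">C</div>'
    '</body>'
)

def has_class(el, name):
    """True when `name` is one of the whitespace-separated class tokens."""
    return name in el.get("class", "").split()

# Emulates "div.product": "products" is a different token, so C is skipped.
products = [el.text for el in root.iter("div") if has_class(el, "product")]
print(products)  # ['A', 'B']
```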
Square brackets ([]) can be used to match elements by attribute values. For example, [href] matches any element that has an href attribute (even if it's empty).
Any attribute matcher can be made case-insensitive by adding the i suffix. Here, the span and div elements are selected as they match the data-item attribute value case-insensitively.
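Both attribute behaviours can be sketched with the standard library. ElementTree's `[@href]` predicate mirrors `[href]`, while the case-insensitive match is emulated by lowercasing, standing in for a selector like `[data-item="apple" i]`. The markup and the attribute values are assumed for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for illustration.
root = ET.fromstring(
    '<body>'
    '<a href="">empty link</a>'
    '<a>no href</a>'
    '<span data-item="Apple">span</span>'
    '<div data-item="APPLE">div</div>'
    '<p data-item="pear">p</p>'
    '</body>'
)

# ".//*[@href]" mirrors the CSS "[href]": presence only, empty values count.
with_href = [el.text for el in root.findall(".//*[@href]")]
print(with_href)  # ['empty link']

# Emulates '[data-item="apple" i]' by comparing lowercased values.
apple_like = [
    el.tag for el in root.iter()
    if el.get("data-item", "").lower() == "apple"
]
print(apple_like)  # ['span', 'div']
```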
The :first-child pseudo selector will select only the elements that are first children in their group of all siblings. In other words, first element in the group.
The :last-child pseudo selector will select only the elements that are last children in their group of all siblings. In other words, last element in the group.
The :nth-child pseudo selector will select only the elements that are Nth children in their group of all siblings. In other words, Nth element in the group. It also supports special values like even and odd - try them!
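The position that :nth-child tests is the element's 1-based index among all of its siblings. A minimal sketch of that counting, with a hypothetical list and a hypothetical `nth_child` helper standing in for `li:nth-child(odd)`, `li:nth-child(even)` and `li:nth-child(2)`:

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for illustration.
ul = ET.fromstring("<ul><li>one</li><li>two</li><li>three</li><li>four</li></ul>")

def nth_child(parent, predicate):
    """Return children whose 1-based position among ALL siblings satisfies `predicate`."""
    return [child for pos, child in enumerate(parent, start=1) if predicate(pos)]

odd = [li.text for li in nth_child(ul, lambda n: n % 2 == 1)]     # li:nth-child(odd)
even = [li.text for li in nth_child(ul, lambda n: n % 2 == 0)]    # li:nth-child(even)
second = [li.text for li in nth_child(ul, lambda n: n == 2)]      # li:nth-child(2)

print(odd)     # ['one', 'three']
print(even)    # ['two', 'four']
print(second)  # ['two']
```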
The :first-of-type pseudo selector will select the first element of given type. It's similar to :first-child but instead of considering all siblings, it considers only siblings of the same node type (tag name).
The :last-of-type pseudo selector will select the last element of given type. It's similar to :last-child but instead of considering all siblings, it considers only siblings of the same node type.
The :nth-of-type pseudo selector will select elements of given type that are Nth element in their group. It's similar to :first-of-type and :last-of-type just more flexible as index can be specified. It also supports special values like even and odd - try them!
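The difference between the -of-type and -child families is easiest to see side by side. In this sketch, ElementTree's positional predicates (`p[1]`, `p[2]`, `p[last()]`) count only siblings with the same tag, so they behave like `p:first-of-type`, `p:nth-of-type(2)` and `p:last-of-type`. The -child variants are emulated by indexing into all children. The markup is assumed for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: paragraphs mixed with other sibling tags.
div = ET.fromstring(
    "<div><h2>Title</h2><p>first</p><span>note</span><p>second</p></div>"
)

# Positions counted among <p> siblings only (the -of-type family).
first_of_type = [p.text for p in div.findall("./p[1]")]        # p:first-of-type
second_of_type = [p.text for p in div.findall("./p[2]")]       # p:nth-of-type(2)
last_of_type = [p.text for p in div.findall("./p[last()]")]    # p:last-of-type

# Positions counted among ALL siblings (the -child family).
children = list(div)
first_child = [el.text for el in children[:1] if el.tag == "p"]    # p:first-child
second_child = [el.text for el in children[1:2] if el.tag == "p"]  # p:nth-child(2)

print(first_of_type, second_of_type, last_of_type)  # ['first'] ['second'] ['second']
print(first_child, second_child)                    # [] ['first']
```

Here `p:first-child` matches nothing because the first child is the h2, while `p:first-of-type` still finds the first paragraph.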
The :has() pseudo selector is a way of selecting a parent element based on the existence of a certain child. Here, the div elements that have a child with the product class are selected. Note that using the any-descendant selector (space) inside :has() can cause a lot of duplicate results, since every ancestor div that contains the child also matches, so using the direct child selector (>) is recommended. Try removing the > to see the difference.
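The effect of dropping the > can be sketched with plain Python, as the standard library has no CSS engine. The two helpers below are hypothetical stand-ins for `div:has(> .product)` and `div:has(.product)`, and the nested markup is assumed for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: a product nested inside two divs.
root = ET.fromstring(
    '<body>'
    '<div id="outer"><div id="inner"><p class="product">Widget</p></div></div>'
    '<div id="empty"><p>nothing</p></div>'
    '</body>'
)

def has_class(el, name):
    """True when `name` is one of the whitespace-separated class tokens."""
    return name in el.get("class", "").split()

def has_product_child(el):
    """Emulates :has(> .product): only direct children are checked."""
    return any(has_class(child, "product") for child in el)

def has_product_descendant(el):
    """Emulates :has(.product): any depth below the element is checked."""
    return any(has_class(d, "product") for d in el.iter() if d is not el)

direct = [d.get("id") for d in root.iter("div") if has_product_child(d)]
anywhere = [d.get("id") for d in root.iter("div") if has_product_descendant(d)]

print(direct)    # ['inner']
print(anywhere)  # ['outer', 'inner']
```

With the descendant form, the outer div matches as well, so the same product is reached through two different parents.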
The :is() pseudo selector is a way of selecting elements that match any of the supplied selectors. Here, the div and span elements are selected as they match the :is() selector. This pseudo selector can be very powerful when combined with :not - try to exclude .foo from the selection.
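As a rough sketch of how the two combine, the snippet below emulates `:is(div, span)` and `:is(div, span):not(.foo)` with plain Python filters, since the standard library has no CSS engine. The markup is assumed for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for illustration.
root = ET.fromstring(
    '<body>'
    '<div>a</div>'
    '<span class="foo">b</span>'
    '<span>c</span>'
    '<p>d</p>'
    '</body>'
)

def has_class(el, name):
    """True when `name` is one of the whitespace-separated class tokens."""
    return name in el.get("class", "").split()

# Emulates ":is(div, span)": match any of the listed selectors.
is_match = [el for el in root.iter() if el.tag in {"div", "span"}]
print([el.text for el in is_match])  # ['a', 'b', 'c']

# Emulates ":is(div, span):not(.foo)": same set minus elements with class foo.
without_foo = [el for el in is_match if not has_class(el, "foo")]
print([el.text for el in without_foo])  # ['a', 'c']
```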