Parsing HTML with CSS Selectors

by Bernardas Ališauskas Apr 11, 2024

#data-parsing #css-selectors

When it comes to parsing web scraped HTML content, there are multiple techniques to select the data we want. For simple text parsing - various text parsing techniques like regular expression can be used. However, HTML is designed to be a machine-readable text structure - we can take advantage of this fact and use special path languages like CSS selectors to extract data in a much more efficient and reliable way!

In this article, we'll be taking a deep look at this unique path language and how can we use it to extract needed details from modern, complex HTML documents.

What Are CSS Selectors?

If ever done some web development, you are probably familiar with CSS selectors for applying styles to HTML websites - we can use the same tool for parsing HTML data!
Cascading Style Sheets protocol offers a unique path language for selecting HTML nodes to apply style to - these are called CSS Selectors. While this path language is designed to find nodes for styling, we can also use it to find nodes for parsing and navigation in our web scrapers.

CSS selectors tend to be brief but powerful enough for most web-scraping related parsing. Let's take a quick look at common pros and cons, especially compared to Xpath selectors:

Parsing HTML with Xpath

Introduction to xpath in the context of web-scraping. How to extract data from HTML documents using xpath, best practices and available tools.

Pros:

Brief and simple to read - which in turn means easier to maintain.
Made for the web. Since CSS selectors are used to apply styles for content, that means our web scraper should be able to select elements just as our browser can. This, also means this path language has a huge community
Part of web standard - meaning it's built-in web browsers and many other web tools which makes CSS selectors very accessible!

Cons:

Only capable of selecting nodes, where's xpath can select node attributes too. However, many CSS selector clients expand functionality with their own extra syntax. For example, for selecting node attributes or inner text extra pseudo classes like ::attr, ::text or simple client methods like .text() or .attribute() are being added as extra functionality.
Not very extendible. Unlike XPATH, CSS selector clients tend to offer less opportunity for extensions.
Not as powerful as XPATH. We'll explore this more later but generally, CSS selectors might not be able to traverse HTML trees as freely as XPATH selectors.

HTML Overview

HTML (HyperText Markup Language) is designed to be easily machine-readable and parsable. In other words, HTML follows a tree-like structure of nodes and their attributes, which we can easily navigate programmatically.

Let's start off with a small example page and illustrate its structure:

<head>
  <title>
  Document Title
  </title>
</head>
<body>
  <h1>Introduction</h1>
  <div>
    <p>some description text: </p>
    <a class="link" href="https://example.com">example link</a>
  </div>
</body>

In this basic example of a simple web page, we can see that the document already resembles a data tree. Let's go a bit further and illustrate this:

html tree illustration — HTML tree is made of nodes which can contain attributes such as classes, IDs and text itself.

Here we can wrap our heads around it a bit more easily: it's a tree of nodes, and each node can also have properties attached to them like keyword attributes (like class and href) and natural attributes such as text.

Now that we're familiar with HTML let's familiarize ourselves with CSS selector syntax and let's use it to parse some data!

HTML Syntax Overview

Css selectors are often referred to as selectors and a single selector indicates a path to a particular HTML node.

To test our CSS selectors, we'll be using an embedded selector playground

The average CSS selector in web scraping often looks something like this:

css protocol illustration — The most common CSS selector features are class and descendant selectors

In this example, our selector will select all <b> nodes that are children of an <a> node with an ID of "title" which is under a <div> node with a CLASS of "content":

<div class="content"> <a id="title" href="https://twitter.com/@scrapfly_dev">follow us on twitter <b>@scrapfly_dev</b></a> </div>

As you can see, CSS selectors is just a chain of various expressions joined either by a space or ">" character. Let's see the most commonly used expressions in web-scraping:

expression	description
`node`	selects any descendant (child, grandchild etc.) that matches node name
`>node`	selects a direct child that matches node name
`~node`	selects sibling that matches node name
`+node`	selects only adjacent siblings that matches node name
`.class`	class constraint - select only nodes that contain this class
`#id`	ID constraint - select only nodes that contain this ID
`[attribute=]`	attribute match constraint, e.g. `span[data=foo]` will select all span node with data="foo" attribute
`[attribute*=]`	attribute contains constraint, e.g. `span[data*=foo]` will select all span node with "foo" value in the `data` attribute, like: `<span data="foobar gaz">`
`[... i]`	attribute constrain match case insensitivity indicator, e.g. `span[data=foo i]` will match both `<span data="Foo">` and `<span data="FOO">` etc.
`,`	allows grouping of multiple selectors joining all results, e.g. `h1, h2` will select both h1 and h2 nodes

This is the core syntax available in most CSS selector clients, which should provide us with enough flexibility to parse most of HTML trees we might encounter on the modern web. Let's take a look at some examples!

Ultimate CSS Selector Cheatsheet for HTML Parsing

Ultimate companion for HTML parsing using CSS selectors. This cheatsheet contains all syntax explanations with interactive examples.

The most important feature of CSS selectors is node selection by name and descendant chaining. For example, using > character we can chain multiple node selectors:

<div> <p class="socials"> Follow us on <a href="https://twitter.com/@scrapfly_dev">Twitter!</a> </p> </div>

This allows us to define strict selection paths. However, using direct child selector (>) can make our selectors too strict for highly dynamic HTML files found in modern websites. In other words, what if the <a> node gets wrapped in some other styling node? That would break our selector.

Instead, we should use a mixture of space and > selectors to find the sweet spot of stability and accuracy:

<div> <p class="primary socials content"> Follow us on <ul> <li> <a href="https://twitter.com/@scrapfly_dev">Twitter!</a> </li> </ul> <a href="#">advertisement</a> </p> </div>

Here instead of defining a direct path we root our selector to <div> node that contains class socials(using .socials class constraint) and from there we can assume that any link in an unordered list is a social link.
Relying less on HTML structure and more on the context allows creating selectors that break less often on HTML structure changes.

Ideally, when designing our selectors we want to find the sweet spot between structure and context which will result in no false positives and something that doesn't break on small HTML tree changes:

<html> <p class="primary socials content"> Follow us on <ul> <li> <a href="https://twitter.com/@scrapfly_dev">Twitter!</a> </li> <li> <a href="https://linkedin.com/@scrapfly_dev">Linkedin!</a> </li> </ul> <a href="#">advertisement</a> </p> </html>

Here, we're using attribute contains constraint to restrict extraction only of links that contain "linkedin" in their urls.

In this section, we've discovered what makes a good CSS selector in web scrapers: we want something robust that doesn't select false positives and something not too strict that might miss some results.
Further, let's take a look at some more complex parsing scenarios and how can we solved them using CSS selectors alone

Navigating Complex Structures

Unfortunately, modern websites can have very complex and dynamic HTML trees that difficult to navigate reliably. Let's take a look at few common examples of complex structures and how can we solve them with CSS selectors.

Firstly, it's nice to remember that we don't have to cram everything into a single CSS selector, and we can safely join multiple selectors using the , syntax:

<div> // American English website might contain: <p>Our favorite product is <b>product 1</b> </p> // while British website might contain: <p>Our favourite product is <b>product 2</b> </p> </div>

In this example, we use two selectors for 2 different spellings of the word "favorite". We're also using a special pseudo class :contains which allows us to check whether text value of the node contains some string.

Note on Pseudo-classes and Pseudo-elements

The availability of pseudo-classes and pseudo-elements varies depending on the client. For instance, the :contains pseudo-class is supported in many web-scraping-focused clients such as parsel and JavaScript-native libraries like jQuery or Sizzle.

Another cool feature of selector joining is that all results come in ordered by their appearance, which means we can safely join selectors and retain the content structure:

<div> <p>For this recipe you'll need:</p> <a href="https://patreon.com">Support us on patreon</a> <b>600g of butter</b> <i>*margarine also works</i> <p>First, preheat the oven to 200C...</p> <p class="promo">For more recipes click the subscribe button</p> </div>

In this example, we extract recipe text while avoiding promos and other non-recipe related texts. We're also using :not pseudo class which allows us to reverse our selector constraints, which is very useful for filtering out unwanted nodes.

Finally, last one of important CSS selector features for web scraping is result slicing. Often we want to select only matching nodes of specific indices:

<div> <p>advertisement paragraph</p> <p>first paragraph</p> <p>second paragraph</p> <p>advertisement paragraph</p> </div>

In this example, we're using :nth-of-type and :nth-last-of-type to implement basic result slicing which allows us to filter out first and last nodes from our selection!

While Css selectors might appear a bit clunky compared to Xpath selectors (or other options) there's a surprising amount of power there that with some clever engineering can help us reliably extract data from HTML documents.

CSS Selector Clients

CSS selectors are primarily used in front-end web development, however there are few backend implementations that are used as clients for HTML parsing. Let's take a look at the most popular libraries that implement CSS selectors.

Python

To parse HTML in Python we have several options. However, many of them instead of executing natural CSS selectors convert them to XPath by using cssselect and run the XPath selectors through lxml XPath client. An example of such client would be parsel.

Other packages implement CSS selectors in varying capacity:

selectolax - is a new modern, blazing fast CSS selector client.
beautifulsoup - classic python client that supports CSS selectors as well as xpath and python object based navigation.

For Python, to parse HTML the most popular choices are beautifulsoup4 or parsel.

Web Scraping with Python

Introduction tutorial to web scraping with Python. How to collect and parse public data. Challenges, best practices and an example project.

PHP

PHP like python also prefers Xpath selectors thus most CSS selector clients use css-selector component to convert CSS selectors to xpath selectors and execute them through either built-in DOMXPath or community favorite DOMCrawler

Web Scraping With PHP 101

Introduction to web scraping with PHP. How to handle http connections, parse html files for data, best practices, tips and an example project.

Alternatively, when using browser emulation through browser emulation clients like Selenium php also gets access to browser's CSS selector capabilities:

// we can use findElements method of Selenium web driver to find elements by CSS selectors
$webDriver->findElements(WebDriverBy::cssSelector("div.content a#title>b"));

Ruby

Ruby has several CSS capable clients, however most popular package is nokogiri which offers both CSS and xpath selectors and loads of parsing utility functions and extensions:

html_doc = Nokogiri::HTML('<html><body><div class="socials"><a href="https://scrapfly.io/blog">Our blog</a></div></body></html>')
@doc.css("div.socials>a").attributes["href"]

Other Languages

Often xpath selectors are being favored over CSS selectors which makes CSS less accessible outside the mentioned few. That being said, since CSS selectors are very similar to XPATH selectors there's typically at least a community maintained translation layer available!

FAQ

Are CSS selectors better than XPATH?

Both path languages have their pros and cons. Generally, CSS selectors are briefer but less powerful than xpath. When web scraping it's best to mix both!

How to select element parent using CSS selectors?

CSS selectors do not support selection of node parents. Instead, XPath .. selector can be used, e.g. root/child/.. will select root

Can CSS selectors be extended like XPATH?

No, typically CSS selectors do not support native extensions. However many web scraping libraries can be patched with extra features as CSS selectors are typically converted to XPath selectors before runtime. That being said, it's better to fallback to XPath if CSS capabilities are inadequate.

CSS Selector Summary

In this introduction article we covered the syntax of CSS selectors, explored basic navigation to solidify our knowledge and finally finished off by taking a look at more advanced usages to fully grasp what this little path language is capable off!

While CSS selectors are great ideally when web scraping it's best to take advantage of both CSS and XPATH selectors. Common idiom is to use CSS selectors for simple paths as they are short and easy to follow and for more complex selections fall back to xpath which is more verbose and powerful.

Parsing HTML with Xpath

Introduction to xpath in the context of web-scraping. How to extract data from HTML documents using xpath, best practices and available tools.

For further CSS selector help we advise checking out #CSS-selectors tag on Stackoverflow which is very active and full of dedicated teachers!

Parsing HTML with CSS Selectors

What Are CSS Selectors?

Parsing HTML with Xpath

HTML Overview

HTML Syntax Overview

Ultimate CSS Selector Cheatsheet for HTML Parsing

Basic CSS Navigation

Navigating Complex Structures

Note on Pseudo-classes and Pseudo-elements

CSS Selector Clients

Python

Web Scraping with Python

PHP

Web Scraping With PHP 101

Ruby

Other Languages

FAQ

Are CSS selectors better than XPATH?

How to select element parent using CSS selectors?

Can CSS selectors be extended like XPATH?

CSS Selector Summary

Parsing HTML with Xpath

Related Knowledgebase

What are devtools and how they're used in web scraping?

How to select HTML elements by text using CSS Selectors?

How to use CSS selectors in NodeJS when web scraping?

How to find elements by CSS selector in Selenium

How to find HTML elements by multiple tags with BeautifulSoup?

How to find sibling HTML nodes using BeautifulSoup and Python?

How to find HTML element by class with BeautifulSoup?

What are some BeautifulSoup alternatives in Python?

How to use CSS Selectors in Nim ?

How to scrape HTML table to Excel Spreadsheet (.xlsx)?

How to select elements by attribute using CSS selectors?

How to select elements by class using CSS selectors?

Related Articles

How to Parse XML

Ultimate CSS Selector Cheatsheet for HTML Parsing

Web Scraping With Ruby

Web Scraping With NodeJS and Javascript

Ultimate Guide to JSON Parsing in Python