HTML Parsing
Almost every web scraper has to parse data out of HTML for one reason or another, be it extracting data from the page or selecting elements to interact with.
Fortunately, HTML is designed to be machine parsable. HTML documents are essentially tree structures of <tag> elements.
For example, the following HTML document can be interpreted as an element tree:
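Here's a minimal illustrative document (a simplified stand-in, not a real page):

```html
<html>
  <head>
    <title>My Page</title>
  </head>
  <body>
    <h1>Products</h1>
    <a href="/product/1">First product</a>
  </body>
</html>
```

Here html is the root node, head and body are its children, and the a node carries an href attribute pointing to a URL.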
From this view, the tree structure is pretty clear. Note that each node can have any number of attributes attached to it, such as href for URL links, or text content for text data.
There are three standard ways to parse HTML trees:

- CSS selectors: the most common way to select HTML nodes. Simple but robust.
- XPath: a powerful XML query language with its own special syntax and customizable functions.
- Object based: HTML nodes can be interpreted as programming objects and compiled into object trees. The most popular example of this is Python's BeautifulSoup package.
Which parsing option is best depends on the available tools and personal preference, though at Scrapfly we follow this handy evaluation rule:
- CSS selectors where possible — they're simple and reliable.
- XPath as a fallback for more complex selections.
- BeautifulSoup, lxml (or any other object-based HTML parser) for complex algorithmic parsing.
Scrapfly's Python and TypeScript SDKs come with an HTML parser included for both CSS and XPath selectors through the result.selector shortcut!
Let's continue with our static HTML page example and apply some data parsing.
Example Scraper
For this example scraper we'll be using Python with parsel as our HTML parser, which can be installed using pip install parsel.
We'll be scraping all product URLs from the web-scraping.dev/products static HTML page we scraped in the previous section.
Let's select all product links using XPath or CSS selectors. For that, we need to take a look at the HTML page structure to figure out how to build our selectors:
Using browser devtools we can see that:

- All products are contained under a `div class="products"` node
- Each product is under a `div class="product"` node
- Each product URL is under an `h3` node
With this, we can easily build out our CSS and XPath selectors:
Above, we retrieve the HTML data as covered in the previous section, then load it as a Selector object, which builds a tree object we can select data from with XPath or CSS selectors.
We're ready for the full static paging exercise now — see the full exercise below 👇
Getting Good at HTML Parsing
HTML can get very confusing and complex quickly. This complexity, combined with the fact that pages tend to change often, means that HTML parser code has to be robust and flexible.
To start, see our interactive cheatsheet pages for quickly navigating CSS Selector and XPath syntax.
CSS Selector Cheatsheet
Interactive reference for all CSS selector features used in web scraping.
XPath Cheatsheet
Interactive reference for all XPath features used in web scraping.
Many shortcuts and tools can help with figuring out the right XPath or CSS selectors when developing web scrapers, but to be honest, they're not good enough. It's best to take the time to learn the syntax and the right tools for the job, though AI assistants can be good tutors:
Using AI To Parse HTML
ChatGPT and other large language model assistants can be pretty good teachers when it comes to HTML parsing.
Using AI to Create XPath and CSS Selectors
It can also be used to generate XPath and CSS selectors from provided HTML samples.
Parsing XML - Feeds and Sitemaps
XML is rarely encountered in web scraping, and since HTML and XML share the same tree structure, the same parsing techniques and tools can be used for XML documents with a few minor differences.
Most commonly XML is encountered in scraping additional website structures like feeds and sitemaps.
Intro to parsing XML
How to parse XML in web scraping and different programming languages and what common pitfalls to avoid.
Intro to scraping sitemaps
How to discover web scraping targets using official sitemap XML feed structures.
Next up - Dynamic Page Scraping
Scraping static pages and parsing HTML isn't too difficult, but modern web design is moving away from static pages. Next, let's take a look at how dynamic web pages and web apps, like SPAs and websites powered by Next.js, React, etc., are scraped.