HTML Parsing

Almost every web scraper has to parse data out of HTML for one reason or another, be it extracting data from the page or selecting elements to interact with.

Fortunately, HTML is designed to be machine parsable. HTML documents are essentially tree structures of <tag> elements, so any HTML document can be interpreted as an element tree.

Viewed as a tree, the structure is quite clear. Note that each node can have any number of attributes attached to it, such as href for URL links, or text for text data.

There are three standard ways to parse HTML trees: CSS selectors, XPath selectors, and object-based parsing (traversing the tree programmatically).

Which parsing option is best depends on the available tools and personal preference, though at Scrapfly we follow this handy evaluation rule:

  1. CSS selectors where possible — they're simple and reliable.
  2. XPath as a fallback for more complex selections.
  3. BeautifulSoup, lxml (or any other object-based HTML parser) for complex algorithmic parsing.
Scrapfly's Python and TypeScript SDKs come with an HTML parser included, supporting both CSS and XPath selectors through the result.selector shortcut!

Let's continue with our static HTML page example and apply some data parsing.

Example Scraper

For this example scraper we'll be using Python with parsel as our HTML parser. It can be installed using pip install parsel.

We'll be scraping all product URLs from the web-scraping.dev/products static HTML page we scraped in the previous section.

Let's select all product links using XPath or CSS selectors. For that, we need to take a look at the HTML page structure to figure out how to build our selectors:

Using browser devtools we can see that:

  • All products are contained under <div class="products">
  • Each product is under <div class="product">
  • Each product's URL is in a link under an <h3> heading

With this, we can easily build out our CSS and XPath selectors:

Here, the HTML data retrieved as covered in the previous section is loaded into a Selector object, which builds a tree we can select data from with XPath or CSS selectors.

We're ready for the full static paging exercise now — see the full exercise below 👇

Scrapeground Exercise: Static Paging Scraping

See this in-depth tutorial on Scrapfly Scrapeground for a more detailed example of this scraper, including examples in different languages and libraries as well as more scenarios.

Getting Good at HTML Parsing

HTML can get very confusing and complex quickly. This complexity, combined with the fact that pages tend to change often, means that HTML parsing code has to be robust and flexible.

To start, see our interactive cheatsheet pages for quickly navigating CSS Selector and XPath syntax.

Many shortcuts and tools can help with figuring out the right XPath or CSS selectors when developing web scrapers, but to be honest, they're not good enough. It's best to take the time to learn the syntax and how to use the right tools for the job, though AI assistants can be good tutors.

Parsing XML - Feeds and Sitemaps

XML is rarely encountered in web scraping, but since XML and HTML share the same tree-based document structure, the same parsing techniques and tools can be used for XML documents with a few minor differences.

Most commonly, XML is encountered when scraping additional website structures like feeds and sitemaps.

Next up - Dynamic Page Scraping

Scraping static pages and parsing HTML isn't too difficult, but modern web design is moving away from static pages. Next, let's take a look at how dynamic web pages and web apps like SPAs and websites powered by Next.js, React, etc. are scraped.


Summary