HTML Parsing
Almost every web scraper has to parse data out of HTML for one reason or another, be it extracting data from the page or selecting elements to interact with.
Fortunately, HTML is designed to be machine parsable. HTML documents are essentially tree structures of <tag> elements.
For example, the following HTML document can be interpreted as an element tree:
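Here's a minimal illustrative document (a simplified stand-in, not a real page):

```html
<html>
  <head>
    <title>My Page</title>
  </head>
  <body>
    <h1>Products</h1>
    <a href="/product/1">First product</a>
  </body>
</html>
```

Here html is the root node, head and body are its children, and the a node carries an href attribute pointing to a URL.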
From this view, the tree structure is pretty clear. Note that each node can have any number of attributes attached to it, such as href for URL links, or text content for text data.
There are three standard ways to parse HTML trees:

- CSS selectors: the most common way to select HTML nodes. Simple but robust.
- XPath: a powerful XML query language with its own special syntax and customizable functions.
- Object based: HTML nodes can be interpreted as programming objects and compiled into object trees. The most popular example of this is Python's BeautifulSoup package.
Which parsing option is best depends on the available tools and personal preference, though at Scrapfly we follow this handy evaluation rule:
- CSS selectors where possible — they're simple and reliable.
- XPath as a fallback for more complex selections.
- BeautifulSoup, lxml (or any other object-based HTML parser) for complex algorithmic parsing.
Scrapfly's Python and TypeScript SDKs come with an HTML parser included for both CSS and XPath selectors through the result.selector shortcut!
Let's continue with our static HTML page example and apply some data parsing.
Example Scraper
For this example scraper we'll be using Python with parsel as our HTML parser, which can be installed using pip install parsel.
We'll be scraping all product URLs from the web-scraping.dev/products static HTML page we scraped in the previous section.
Let's select all product links using XPath or CSS selectors. For that, we need to take a look at the HTML page structure to figure out how to build our selectors:
Using browser devtools we can see that:

- All products are contained under a `div class="products"` node
- Each product is under a `div class="product"` node
- Each product URL is under an `h3` node
With this, we can easily build out our CSS and XPath selectors:
Above, we retrieve the HTML data as covered in the previous section, then load it as a Selector object, which builds a tree object we can select data from with XPath or CSS selectors.
We're ready for the full static paging exercise now — see the full exercise below 👇
Getting Good at HTML Parsing
HTML can get very confusing and complex quickly. This complexity, combined with the fact that pages tend to change often, means that HTML parser code has to be robust and flexible.
To start, see our interactive cheatsheet pages for quickly navigating CSS Selector and XPath syntax.
CSS Selector Cheatsheet
Interactive reference for all CSS selector features used in web scraping.
XPath Cheatsheet
Interactive reference for all XPath features used in web scraping.
Many shortcuts and tools can help with figuring out the right XPath or CSS selectors when developing web scrapers, but to be honest, they're not good enough. It's best to take the time to learn the syntax and the right tools for the job, though AI assistants can be good tutors:
Using AI To Parse HTML
ChatGPT and other large language model assistants can be pretty good teachers when it comes to HTML parsing.
Using AI to Create XPath and CSS Selectors
It can also be used to generate XPath and CSS selectors from provided HTML samples.
Parsing XML - Feeds and Sitemaps
XML is rarely encountered in web scraping, and since HTML and XML share the same tree structure, the same parsing techniques and tools can be used for XML documents with a few minor differences.
Most commonly XML is encountered in scraping additional website structures like feeds and sitemaps.
Intro to parsing XML
How to parse XML in web scraping and different programming languages and what common pitfalls to avoid.
Intro to scraping sitemaps
How to discover web scraping targets using official sitemap XML feed structures.
Next up - Dynamic Page Scraping
Scraping static pages and parsing HTML isn't too difficult, but modern web design is moving away from static pages. Next, let's take a look at how dynamic web pages and web apps, like SPAs and websites powered by Next.js, React, etc., are scraped.