CSS selectors can select elements by any attribute value, such as class, id or href. This means we can extract any HTML element based on its attribute values using CSS selectors.
While class and id have special shortcut notations (. and # respectively), any attribute can be selected using the attribute selector ([attribute]), which supports multiple operators:
[attr=match] checks for exact equality.
[attr~=match] treats the attribute as a list of whitespace-separated values and checks whether it contains the match (this is how .class matches a single class name).
[attr|=match] checks whether the value is exactly the match or starts with the match followed by a hyphen, e.g. [lang|=en] matches both en and en-US.
[attr^=match] checks whether the attribute starts with the match.
[attr$=match] checks whether the attribute ends with the match.
[attr*=match] checks whether the attribute contains the match.
Finally, to make any of these matches case-insensitive, add i before the closing bracket, e.g. [attr=match i].
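To make the operator rules above concrete, here is a minimal Python sketch that models them using only the standard library. It is not a CSS engine and not how any particular scraping library works internally. The names attr_matches, AttrCollector, select_by_attr and hrefs, as well as the sample HTML, are hypothetical and exist only for illustration. In a real scraper you would pass the selector string to a CSS-capable parser instead.

```python
from html.parser import HTMLParser


def attr_matches(value, op, match, ignore_case=False):
    """Model of how [attr<op>match] compares an attribute value (illustrative)."""
    if ignore_case:
        # The `i` flag, e.g. [attr=match i]. Simplification: CSS folds ASCII
        # case only, while str.lower() also folds non-ASCII letters.
        value, match = value.lower(), match.lower()
    if op == "=":
        return value == match
    if op == "~=":
        # Whitespace-separated token list; an empty match or one containing
        # whitespace can never equal a single token, so it never matches.
        return match in value.split()
    if op == "|=":
        # Exactly the match, or the match immediately followed by a hyphen.
        return value == match or value.startswith(match + "-")
    if not match:
        return False  # ^=, $= and *= match nothing when the match is empty
    if op == "^=":
        return value.startswith(match)
    if op == "$=":
        return value.endswith(match)
    if op == "*=":
        return match in value
    raise ValueError(f"unknown operator: {op}")


class AttrCollector(HTMLParser):
    """Collects (tag, attributes) pairs for every start tag in a document."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        self.elements.append((tag, dict(attrs)))


def select_by_attr(html, attr, op, match, ignore_case=False):
    """Return the elements whose `attr` satisfies the given operator."""
    parser = AttrCollector()
    parser.feed(html)
    return [
        (tag, attrs)
        for tag, attrs in parser.elements
        if attrs.get(attr) is not None
        and attr_matches(attrs[attr], op, match, ignore_case)
    ]


# Hypothetical sample document
HTML = """
<a class="btn btn-primary" href="https://example.com/docs.PDF" lang="en-US">Docs</a>
<a class="btn-link" href="/about" lang="fr">About</a>
"""


def hrefs(attr, op, match, ignore_case=False):
    """Convenience wrapper: the href of each matching element."""
    return [a["href"] for _, a in select_by_attr(HTML, attr, op, match, ignore_case)]


print(hrefs("class", "~=", "btn"))       # [class~=btn]: first link only
print(hrefs("class", "^=", "btn"))       # [class^=btn]: both links
print(hrefs("class", "|=", "btn"))       # [class|=btn]: second link only
print(hrefs("href", "$=", ".pdf"))       # [href$=.pdf]: nothing, case differs
print(hrefs("href", "$=", ".pdf", True)) # [href$=.pdf i]: first link
```

Note how the same class attribute gives three different results: ~= compares whole tokens, ^= compares the raw string prefix, and |= requires either the full value or a hyphen directly after the match.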