How to Parse XML
In this article, we'll explain XML parsing. We'll start by defining XML files and their format, then show how to navigate them to extract data.
The most popular package that implements XPath selectors in Python is lxml. We can use the xpath()
method to find all matching values:
from lxml import etree
# parse the XML string into an element tree
tree = etree.fromstring("""
<div>
    <a>link 1</a>
    <a>link 2</a>
</div>
""")
# find every <a> element anywhere in the document
for result in tree.xpath("//a"):
    print(result.text)
# link 1
# link 2
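XPath expressions in lxml can also return attribute values and text nodes directly, not only elements. The following is a minimal sketch under stated assumptions: the `catalog` document and its `book`, `id` and `title` names are hypothetical and used only for illustration.

```python
from lxml import etree

# Hypothetical sample document, used only for illustration.
# Bytes input is used because lxml rejects str input that carries
# an XML encoding declaration.
xml = b"""<?xml version="1.0" encoding="UTF-8"?>
<catalog>
  <book id="b1"><title>First</title></book>
  <book id="b2"><title>Second</title></book>
</catalog>"""

root = etree.fromstring(xml)

# "@id" selects attribute values; the results are plain strings
ids = root.xpath("//book/@id")

# "text()" selects text nodes instead of elements
titles = root.xpath("//book/title/text()")

# Matched elements can be queried further, one element at a time
pairs = [(book.get("id"), book.findtext("title")) for book in root.xpath("//book")]
print(pairs)
```

Selecting `@id` or `text()` in the expression avoids a second step of reading `.text` or `.get()` on each element, while looping over the matched elements keeps related values together.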
However, in web scraping the recommended way is to use the parsel package. It's based on lxml
and provides more consistent behavior when working with HTML content:
from parsel import Selector
# build a selector from the document text
selector = Selector("""
<div>
    <a>link 1</a>
    <a>link 2</a>
</div>
""")
# getall() returns every match serialized as a string
selector.xpath("//a").getall()
# ['<a>link 1</a>', '<a>link 2</a>']
This knowledgebase is provided by Scrapfly, a web scraping API that allows you to scrape any website without getting blocked and implements dozens of other web scraping conveniences.