How to Parse XML
In this article, we'll explain XML parsing. We'll start by defining XML files and their format, then show how to navigate them for data extraction.
To parse web-scraped content in Nim using CSS selectors, we recommend the CSS3Selectors library, which is designed with web scraping in mind:
import std/streams
import pkg/chame/minidom
import css3selectors
let html = """
<!DOCTYPE html>
<html>
<head><title>Example</title></head>
<body>
<p>1</p>
<p>2</p>
<p>3</p>
<p>4</p>
</body>
</html>
"""
# Parse the full document and query it with a CSS selector
let document = Node(parseHtml(newStringStream(html)))
let elements = document.querySelectorAll("p:nth-child(odd)")
echo elements # @[<p>1</p>, <p>3</p>]
# HTML fragments can be parsed and queried the same way
let htmlFragment = parseHTMLFragment("<h1 id='test'>Hello World</h1><h2>Test Test</h2>", Element())
let element = htmlFragment.querySelector("#test")
echo element # <h1 id="test">Hello World</h1>
CSS3Selectors was created to supersede nimquery, which still works well for parsing HTML content in Nim when CSS3Selectors is not available:
from xmltree import `$`
from htmlparser import parseHtml
import nimquery
let html = """
<!DOCTYPE html>
<html>
<head><title>Example</title></head>
<body>
<p>1</p>
<p>2</p>
<p>3</p>
<p>4</p>
</body>
</html>
"""
# Parse with the standard library's htmlparser, then query the resulting XmlNode tree
let xml = parseHtml(html)
let elements = xml.querySelectorAll("p:nth-child(odd)")
echo elements
# => @[<p>1</p>, <p>3</p>]
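In scraping, the matched nodes are usually only a starting point: the goal is the text and attribute values inside them. The following is a minimal sketch of that step, assuming nimquery is installed and using the standard library's xmltree helpers `innerText` and `attr`. The sample markup, the `product` class name, and the `links` variable are hypothetical and serve only as illustration:

```nim
import htmlparser, xmltree
import nimquery

# Hypothetical sample markup, used only for illustration
let productHtml = """
<html>
<body>
<div class="product"><a href="/item/1">First</a></div>
<div class="product"><a href="/item/2">Second</a></div>
</body>
</html>
"""

let root = parseHtml(productHtml)

# Collect (text, href) pairs for every link inside a .product element
var links: seq[(string, string)]
for node in root.querySelectorAll("div.product a"):
  links.add((node.innerText, node.attr("href")))

echo links
```

Because `querySelectorAll` returns ordinary `XmlNode` values here, any other xmltree procedure can be applied to the results in the same way.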
This knowledge base is provided by Scrapfly, a web scraping API that allows you to scrape any website without getting blocked and implements dozens of other web scraping conveniences.