CSS selectors cannot select preceding siblings: the + and ~ combinators only match siblings that follow an element.
However, depending on your scraping stack, there are several ways to achieve this:
Use BeautifulSoup and Python to select the preceding siblings:
from bs4 import BeautifulSoup, Tag
html = """
<div>
<h2>Heading 1</h2>
<p>Paragraph 1</p>
<p>Paragraph 2</p>
<h2>Heading 2</h2>
<p>Paragraph 3</p>
<p>Paragraph 4</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
# Find the second <h2> element:
second_h2_element = soup.find_all("h2")[1]
# Select the preceding siblings using the .previous_siblings property.
# It yields the nearest sibling first and also includes whitespace text nodes:
preceding_siblings = second_h2_element.previous_siblings
for sibling in preceding_siblings:
    # Skip the text nodes and print only the element siblings
    if isinstance(sibling, Tag):
        print(sibling.text)
Use Cheerio and Node.js to select the preceding siblings:
const cheerio = require("cheerio");
const html = `
<div>
<h2>Heading 1</h2>
<p>Paragraph 1</p>
<p>Paragraph 2</p>
<h2>Heading 2</h2>
<p>Paragraph 3</p>
<p>Paragraph 4</p>
</div>
`;
const $ = cheerio.load(html);
// Get the second h2 element
const second_h2_element = $("h2").eq(1);
// Select the preceding siblings of the h2 element
const preceding_siblings = second_h2_element.prevAll();
// Loop over the preceding siblings and print their text content
preceding_siblings.each(function () {
  console.log($(this).text());
});
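If neither library is available, the same idea can be sketched with Python's standard library alone. The example below is a minimal illustration, not a drop-in replacement: it assumes the markup is well-formed XML (as the sample snippet is), and preceding_siblings is a hypothetical helper introduced here for illustration. Since xml.etree.ElementTree elements hold no reference to their parent, the helper takes the parent explicitly and returns the children that come before the target, nearest first, mirroring the order used in the examples above.

```python
import xml.etree.ElementTree as ET

html = """
<div>
<h2>Heading 1</h2>
<p>Paragraph 1</p>
<p>Paragraph 2</p>
<h2>Heading 2</h2>
<p>Paragraph 3</p>
<p>Paragraph 4</p>
</div>
"""


def preceding_siblings(parent, target):
    """Hypothetical helper: children of `parent` located before `target`, nearest first."""
    children = list(parent)
    index = children.index(target)
    return children[:index][::-1]


# The <div> is the root element and the direct parent of every heading and paragraph
root = ET.fromstring(html.strip())
# Get the second <h2> among the root's direct children
second_h2 = root.findall("h2")[1]
siblings = preceding_siblings(root, second_h2)
for sibling in siblings:
    print(sibling.tag, sibling.text)
```

This only works when the parent element is already known and the document parses as strict XML. For real-world HTML, a forgiving parser such as the ones shown above is usually the safer choice.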