To start - what are sitemaps? Sitemaps are files that list the locations of a website's pages (product pages, articles and so on) for web crawlers. They're mostly used to tell search engines what to index, though in web scraping we can use them to discover targets to scrape.
In this tutorial, we'll take a look at how to discover sitemap files and scrape them using Python and Javascript. We'll cover:
Finding sitemap locations.
How to understand and navigate sitemaps.
How to parse sitemap files using XML parsing tools in Python and Javascript.
For this, we'll be using Python or Javascript with a few community packages and some real-life examples. Let's dive in!
Why Scrape Sitemaps?
Scraping sitemaps is an efficient way to discover targets listed on a website, be it product pages, blog posts or any other web object. Sitemaps are usually XML documents that list URLs by category, with up to 50,000 URLs per file. Sitemaps are also often gzip-compressed, making them a low-bandwidth way to discover page URLs.
So, instead of scraping a website's search or directory pages, which can mean thousands of HTML documents, we can scrape sitemap files and get the same results in a fraction of the time and bandwidth costs. Additionally, since sitemaps are created specifically for web crawlers, the likelihood of getting blocked is much lower.
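To illustrate the gzip point: an HTTP client like httpx transparently decodes responses served with Content-Encoding: gzip, but sitemaps distributed as .xml.gz files arrive as compressed bytes and need explicit decompression. A minimal sketch (the sitemap bytes here are simulated locally rather than fetched over HTTP):

```python
import gzip

# Simulate the compressed bytes of a hypothetical sitemap.xml.gz file;
# in practice these would come from an HTTP response body
compressed = gzip.compress(
    b"<urlset><url><loc>https://example.com/</loc></url></urlset>"
)

# Decompress back into the XML text we can parse
xml = gzip.decompress(compressed).decode("utf-8")
print(xml)
```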
Where to find Sitemaps?
Sitemaps are usually located at the root of the website as a /sitemap.xml file. For example, scrapfly.io/sitemap.xml for this blog.
However, there's no clear standard location so another way to find the sitemap location is to explore the standard /robots.txt file which provides instructions for web crawlers.
For example, Scrapfly's /robots.txt file contains the following line:
Sitemap: https://scrapfly.io/sitemap.xml
Many programming languages have robots.txt parsers built in; alternatively, it's as easy as retrieving the file contents and finding the lines with the Sitemap: prefix:
Python
Javascript
# built-in robots.txt parser:
from urllib.robotparser import RobotFileParser

rp = RobotFileParser('https://scrapfly.io/robots.txt')
rp.read()
print(rp.site_maps())
# ['https://scrapfly.io/blog/sitemap-posts.xml', 'https://scrapfly.io/sitemap.xml']

# or with httpx:
import httpx

response = httpx.get('https://scrapfly.io/robots.txt')
for line in response.text.splitlines():
    if line.strip().startswith('Sitemap:'):
        print(line.split('Sitemap:')[1].strip())
const axios = require('axios');

const response = await axios.get('https://scrapfly.io/robots.txt');
const robotsTxt = response.data;
const sitemaps = [];
for (const line of robotsTxt.split('\n')) {
    const trimmedLine = line.trim();
    if (trimmedLine.startsWith('Sitemap:')) {
        sitemaps.push(trimmedLine.split('Sitemap:')[1].trim());
    }
}
console.log(sitemaps);
Scraping Sitemaps
Sitemaps are XML documents with a specific, simple structure.
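A minimal sitemap file looks something like this (a simplified example with illustrative values; the loc field is required while lastmod, changefreq and priority are optional):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/products/some-product</loc>
    <lastmod>2023-06-15</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```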
For bigger websites, the 50,000-URL limit of a single sitemap is often not enough, so multiple sitemap files are listed in a sitemap index, also known as a sitemap hub. To scrape this we need a bit of recursion.
For example, let's take a look at StockX's sitemap hub. To start, we can see that the robots.txt file contains the following line:
Sitemap: https://stockx.com/sitemap/sitemap-index.xml
Sometimes, sitemap hubs split sitemaps by category or location, which can help us target specific types of pages. To scrape hubs, all we have to do is parse the index file and then scrape each sitemap file it lists:
Python
Javascript
from parsel import Selector
import httpx

response = httpx.get("https://stockx.com/sitemap/sitemap-index.xml")
selector = Selector(response.text)
for sitemap_url in selector.xpath("//sitemap/loc/text()").getall():
    response = httpx.get(sitemap_url)
    selector = Selector(response.text)
    for url in selector.xpath('//url'):
        location = url.xpath('loc/text()').get()
        change_frequency = url.xpath('changefreq/text()').get()  # an alternative field to the modification date
        print(location, change_frequency)
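The lastmod field is also handy for incremental scraping: we can skip pages that haven't changed since our last run. A minimal sketch of that filtering step, using the standard-library ElementTree on a hypothetical inline sitemap snippet to keep the example self-contained (real files would be fetched over HTTP and include a namespace declaration, omitted here for brevity):

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical sitemap snippet; in practice this is an HTTP response body
sitemap_xml = """<urlset>
  <url><loc>https://example.com/old-post</loc><lastmod>2020-01-01</lastmod></url>
  <url><loc>https://example.com/new-post</loc><lastmod>2023-06-15</lastmod></url>
</urlset>"""

# Keep only URLs modified on or after the cutoff date
cutoff = datetime(2023, 1, 1)
recent = []
for url in ET.fromstring(sitemap_xml).iter("url"):
    lastmod = url.findtext("lastmod")
    if lastmod and datetime.fromisoformat(lastmod) >= cutoff:
        recent.append(url.findtext("loc"))
print(recent)
```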
In this quick introduction, we've taken a look at the most popular way to discover scraping targets - the sitemap system. To scrape it, we used an HTTP client (httpx in Python or axios in Javascript) and an XML parser (parsel in Python or cheerio in Javascript).
Sitemaps are a great way to discover new scraping targets and get a quick overview of the website structure. Though, not every website supports them and it's worth noting that the data on sitemaps can be more dated than on the website itself.