How to use XPath selectors in NodeJS when web scraping?

CSS selectors are far more common in the NodeJS and JavaScript ecosystems, but for web scraping we often need the more powerful features of XPath selectors.
A few options are available for XPath in NodeJS. One popular choice in web scraping is the osmosis library:

const osmosis = require("osmosis");

const html = `
<a href="http://scrapfly.io/">link 1</a>
<a href="http://scrapfly.blog/">link 2</a>
`;

osmosis
    .parse(html)
    // select the href attribute of every <a> element
    .find('//a/@href')
    // log each matched value
    .log(console.log);

Another alternative is the xpath library, paired with @xmldom/xmldom for building the DOM tree:

import xpath from 'xpath';
import { DOMParser } from '@xmldom/xmldom';

// parse the HTML fragment as XML; wrapping it in a single root
// element keeps the strict xmldom parser happy
const tree = new DOMParser().parseFromString(`
<div>
    <h1>Page title</h1>
    <p>some paragraph</p>
    <a href="http://scrapfly.io/blog">some link</a>
</div>
`, 'text/xml');

console.log({
    // we can extract the text of a node, which returns a `Text` object:
    title: xpath.select('//h1/text()', tree)[0].data,
    // or a specific attribute value, which returns an `Attr` object:
    url: xpath.select('//a/@href', tree)[0].value,
});
Question tagged: NodeJS, XPath, Data Parsing
