Web Scraping With NodeJS


Web scraping is mostly connection and data programming, so using a web language for scraping seems like a natural fit! In this article we'll take a look at web scraping with NodeJS - a popular backend JavaScript runtime environment. We'll cover the best packages available for connecting and parsing HTML, as well as tips, tricks and best practices.

Finally, we'll finish everything off with an example web scraping project - an https://www.etsy.com/ product scraper that illustrates common challenges encountered when web scraping in NodeJS, like cookie tracking and dealing with CSRF tokens.

Web Scraping Scene in NodeJS

In web scraping, NodeJS is mostly known for the Puppeteer browser automation toolkit. Using web browser automation for web scraping has a lot of benefits, but it's also a complex and resource-heavy approach. With a little reverse engineering and some clever NodeJS packages we can achieve similar results without the overhead of a whole web browser.

Web Scraping With a Headless Browser: Puppeteer

For more on using browser automation with Puppeteer we have an entire introduction article that covers basic usage, best practices, tips and tricks and an example project!


In this article we'll focus on a few tools in particular: for connections we'll use the axios HTTP client, and for parsing we'll use the cheerio HTML tree parser. Let's install them with these command line instructions:

$ mkdir scrapfly-etsy-scraper
$ cd scrapfly-etsy-scraper
$ npm install cheerio axios

Making Connections

A vital part of web scraping is establishing a connection with our web targets, and for that we'll need an HTTP client. NodeJS has many HTTP clients, but by far the most popular one is axios, so in this section we'll stick with it as it provides most of the functionality needed for web scraping: cookie tracking and easy form/JSON requests.

HTTP Protocol Fundamentals

To collect data from a public resource, we need to establish a connection with it first. Most of the web is served over the HTTP protocol, which is rather simple: we (the client) send a request for a specific document to the website (the server), and once the server processes our request it replies with the requested document - a very straightforward exchange!

illustration of a standard http exchange


As you can see in this illustration: we send a request object which consists of a method (aka type), a location and headers; in turn we receive a response object which consists of a status code, headers and the document content itself. Let's take a quick look at each of these components, what they mean and how they are relevant in web scraping.

Understanding Requests and Responses

When it comes to web scraping we don't need to know every little detail about HTTP requests and responses; however, it's good to have a general overview and to know which parts of this protocol are especially useful in web scraping. Let's take a look at exactly that!

Request Method

HTTP requests are conveniently divided into a few types that perform distinct functions:

  • GET requests are intended to request a document.
  • POST requests are intended to request a document by sending a document.
  • HEAD requests are intended to request documents meta information.
  • PATCH requests are intended to update a document.
  • PUT requests are intended to either create a new document or update it.
  • DELETE requests are intended to delete a document.

When it comes to web scraping, we are mostly interested in collecting documents, so we'll mostly be working with GET and POST type requests. Additionally, HEAD requests can be useful for optimizing bandwidth - sometimes, before downloading a document, we might want to check its metadata to see whether it's worth the effort.
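
For example, here's a minimal sketch of using a HEAD request to inspect a document's metadata before committing to a full download (httpbin.org is used here only as a stand-in target):

import axios from 'axios';

// a HEAD request returns only the response headers - no body is downloaded
axios.head('https://httpbin.org/html').then((response) => {
    console.log(response.status);                     // e.g. 200
    console.log(response.headers['content-type']);    // e.g. text/html; charset=utf-8
    console.log(response.headers['content-length']);  // document size in bytes, if the server provides it
});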

Request Location

To understand what a resource location is, first we should take a quick look at the structure of a URL itself:


Example of a URL structure

Here, we can visualize each part of a URL: we have the protocol, which for the web is either http or https; then the host, which is essentially the address of the server; and finally the location of the resource along with some custom parameters.
If you're ever unsure of a URL's structure, you can always fire up Node's interactive shell (node in the terminal) and let it figure things out for you:

$ node
> new URL("http://www.domain.com/path/to/resource?arg1=true&arg2=false")
URL {
  href: 'http://www.domain.com/path/to/resource?arg1=true&arg2=false',
  origin: 'http://www.domain.com',
  protocol: 'http:',
  username: '',
  password: '',
  host: 'www.domain.com',
  hostname: 'www.domain.com',
  port: '',
  pathname: '/path/to/resource',
  search: '?arg1=true&arg2=false',
  searchParams: URLSearchParams { 'arg1' => 'true', 'arg2' => 'false' },
  hash: ''
}

Request Headers

While request headers might appear to be just minor metadata details, in web scraping they are extremely important. Headers contain essential details about the request: who's requesting the data? What type of data are they expecting? Getting these wrong can result in the web scraper being denied access.

Let's take a look at some of the most important headers and what they mean:

User-Agent is an identity header that tells the server who's requesting the document.

# example user agent for Chrome browser on Windows operating system:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36

Whenever you visit a web page, your web browser identifies itself with a User-Agent string that looks something like "Browser Name, Operating System, Some version numbers". This helps the server decide whether to serve or deny the client. In web scraping we don't want to be denied content, so we have to blend in by faking our user agent to look like that of a browser.

There are many online databases that contain the latest user-agent strings of various platforms, like this Chrome user agent list by whatismybrowser.com

Cookie is used to store persistent data. This is a vital feature for websites to keep track of user state: user logins, configuration preferences etc. Cookies are a bit out of scope of this section, but we'll touch on cookie tracking later in this article.

Accept headers (also Accept-Encoding, Accept-Language etc.) contain information about what sort of content we're expecting. Generally when web scraping we want to mimic the values of one of the popular web browsers; for example, the Chrome browser uses:

text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8

X- prefixed headers are special custom headers. These are important to keep an eye on when web scraping, as they might configure important functionality of the scraped website/webapp.

These are a few of the most important headers; for a more extensive reference, see the full documentation page: MDN HTTP Headers

Response Status Code

Conveniently, all HTTP responses come with a status code that indicates whether the request was a success, a failure, or whether some alternative action is requested (like a request to authenticate). Let's take a quick look at the status codes that are most relevant to web scraping:

  • 200 range codes generally mean success!
  • 300 range codes tend to mean redirection - in other words, if we request content at /product1.html it might have been moved to a new location like /products/1.html, and the server informs us about that.
  • 400 range codes mean the request is malformed or denied. Our web scraper could be missing some headers, cookies or authentication details.
  • 500 range codes typically mean server issues. The website might be unavailable right now or might be purposefully disabling access to our web scraper.

For more on HTTP status codes, see the documentation: HTTP Status definitions by MDN
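
To give a feel for how these surface in practice, here's a minimal sketch of handling status codes with axios, which rejects the promise for non-2xx responses by default (httpbin.org/status lets us request an arbitrary status code):

import axios from 'axios';

axios.get('https://httpbin.org/status/404')
    .then((response) => console.log('success:', response.status))
    .catch((error) => {
        if (error.response) {
            // the server replied, but with a non-2xx status code
            console.log('denied or failed with status:', error.response.status);
        } else {
            // no response at all - network error, timeout etc.
            console.log('connection error:', error.message);
        }
    });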

Response Headers

When it comes to web scraping, response headers provide some important information for connection functionality and efficiency. For example, the Set-Cookie header requests that our client save some cookies for future requests, which might be vital for website functionality. Other headers such as Etag and Last-Modified are intended to help the client with caching to optimize resource usage.

For the entire list of all HTTP headers, see MDN HTTP Headers

Finally, just like with request headers, headers prefixed with X- are custom web functionality headers.
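
For a quick look at what this means in practice, here's a small sketch of inspecting response headers with axios (note that axios exposes header names in lowercase):

import axios from 'axios';

axios.get('https://httpbin.org/get').then((response) => {
    // axios exposes response headers as a plain object with lowercase keys
    console.log(response.headers['content-type']);
    // headers like set-cookie, etag or last-modified appear here too, when the server sends them
    console.log(response.headers['etag']);
});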


We've taken a brief overview of the core HTTP components, and now it's time to give it a go and see how HTTP works in practice with Node!

Making GET Requests

Now that we're familiar with the HTTP protocol and how it's used in web scraping, let's take a look at how we can use it through Node's axios package.

Let's start off with a basic GET request:

import axios from 'axios';

axios.get('https://httpbin.org/get').then(
    (response) => console.log(response.data)
)

Here we're using the http://httpbin.org HTTP testing service to retrieve a simple response. When run, this script should print basic details about the request we made:

{
  args: {},
  headers: {
    Accept: 'application/json, text/plain, */*',
    Host: 'httpbin.org',
    'User-Agent': 'axios/0.25.0',
  },
  origin: '180.111.222.223',
  url: 'https://httpbin.org/get'
}

Making POST requests

Sometimes our web scraper might need to submit some sort of form to retrieve HTML results. For example, search queries often use POST requests with query details as JSON values:

import axios from 'axios';

axios.post('https://httpbin.org/post', {'query': 'cats', 'page': 1}).then(
    (response) => console.log(response.data)
)

As for form data type requests, we need to do a bit more work and use the form-data package:

import axios from 'axios';
import FormData from 'form-data';

function makeForm(data){
    var bodyFormData = new FormData();
    for (let key in data){
        bodyFormData.append(key, data[key]);
    }
    return bodyFormData;
}

axios.post('https://httpbin.org/post', makeForm({'query': 'cats', 'page': 1})).then(
    (response) => console.log(response.data)
)

Axios is smart enough to fill in the required header details (like Content-Type and Content-Length) based on the data argument: if we're sending a plain object it'll set the Content-Type header to application/json, and for form data it'll use the appropriate form content type!
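
We can double-check this by looking at what httpbin echoes back for different body types - a quick sketch (using the built-in URLSearchParams as one way to send urlencoded form data):

import axios from 'axios';

// a plain object is serialized as JSON:
axios.post('https://httpbin.org/post', {query: 'cats'}).then(
    // httpbin echoes back the headers it received
    (response) => console.log(response.data.headers['Content-Type'])  // typically application/json
);

// a URLSearchParams object is sent urlencoded:
axios.post('https://httpbin.org/post', new URLSearchParams({query: 'cats', page: '1'})).then(
    (response) => console.log(response.data.headers['Content-Type'])  // typically application/x-www-form-urlencoded
);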

Ensuring Headers

As we've covered before, our requests must provide metadata about themselves which helps the server determine what content to return. Often this metadata can be used to identify web scrapers and block them. Modern web browsers automatically include specific metadata details with every request, so if we don't want to stand out as a web scraper we should replicate this behavior.

Primarily, the User-Agent and Accept headers are often dead giveaways, so we can set them to the values a normal Chrome browser would use:

import axios from 'axios';

axios.get(
    'https://httpbin.org/get', 
    {headers: {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    }}
).then(
    (response) => console.log(response.data)
)

This ensures that the requests we make include these browser-like headers instead of the default axios ones.

Note that this is just the tip of the iceberg when it comes to bot blocking and request headers, however just setting User-Agent and Accept headers should make us much harder to detect!

Using Default Settings

When web scraping we typically want to apply the same configuration to multiple requests. For this, axios instances can be used to configure default arguments:

import axios from 'axios';

const session = axios.create({
    headers: {'User-Agent': 'tutorial program'},
    timeout: 5000,
    proxy: {
            host: 'proxy-url',
            port: 80,
            auth: {username: 'my-user', password: 'my-password'}
        }
    }
)

session.get('http://httpbin.org/get').then(response => console.log(response));
session.get('http://httpbin.org/get').then(response => console.log(response));

Here we created an axios instance that will apply our custom headers, timeout and proxy settings to every request it makes!

Tracking Cookies

Sometimes when web scraping we care about persistent connection state. For websites where we need to log in or configure the website (like changing the currency), cookies are a vital part of the web scraping process.
Unfortunately, axios doesn't support cookie tracking by default; however, it can be enabled via the axios-cookiejar-support extension package:

import axios from 'axios';
import { CookieJar } from 'tough-cookie';
import { wrapper } from 'axios-cookiejar-support';

const jar = new CookieJar();
const session = wrapper(axios.create({ jar }));

async function setLocale(){
    // set cookies:
    let respSetCookies = await session.get('http://httpbin.org/cookies/set/locale/usa');
    // retrieve existing cookies:
    let respGetCookies = await session.get('http://httpbin.org/cookies');
    console.log(respGetCookies.data);
}

setLocale();

In the example above, we're configuring an axios instance with a cookie jar object, which gives us persistent cookies in our web scraping session. If we run this script we should see the response:

{ cookies: { locale: 'usa' } }

Now that we're familiar with HTTP connections and how to use them through the axios HTTP client package, let's take a look at the other half of the web scraping process: parsing HTML data!

Parsing HTML

HTML (HyperText Markup Language) is the text data structure that powers the web. The great thing about it is that it's intended to be machine-readable text content, which is great news for web scraping as we can easily parse the relevant data with JavaScript code!

HTML is a tree-type structure that lends itself easily to parsing. For example, let's take this simple HTML content:

<head>
    <title>My Website</title>
</head>
<body>
    <h1>Welcome to my website!</h1>
    <div class="content">
        <p>This is my website</p>
        <p>Isn't it great?</p>
    </div>
</body>

Here we see an extremely basic HTML document that a simple website might serve. You can already see the tree-like structure just from the indentation of the text, but we can go even further and illustrate it:


example of a HTML node tree. Note that branches are ordered (left-to-right)

This tree structure is brilliant for web scraping as we can easily navigate the whole document.
For example, to find the title of the website, we can see that it's under the <head> and <title> nodes. In other words, if we wanted to extract 1000 titles for 1000 different pages, we would write a rule to find head->title->text for every one of them.

When it comes to HTML parsing, there are two standard ways to write these rules: CSS selectors and XPath selectors - let's dive further and see how we can use them to parse web-scraped data!

Using Cheerio with CSS Selectors

Cheerio is the most popular HTML parsing package in NodeJS which allows us to use CSS selectors to select specific nodes of an HTML tree.

Parsing HTML with CSS Selectors

For more on CSS selectors see our complete introduction tutorial which covers basic usage, tips and tricks and common web scraping idioms


To use Cheerio we have to create a tree parser object from an HTML string, and then we can use a combination of CSS selectors and element functions to extract the relevant data we're looking for:

import cheerio from 'cheerio';

const tree = cheerio.load(`
    <head>
        <title>My Website</title>
    </head>
    <body>
        <div class="content">
            <h1>First blog post</h1>
            <p>Just started this blog!</p>
            <a href="http://scrapfly.io/blog">Checkout My Blog</a>
        </div>
    </body>
`);

console.log({
    // we can extract text of the node:
    title: tree('.content h1').text(),
    // or a specific attribute value:
    url: tree('.content a').attr('href')
});

In the example above, we're loading Cheerio with an example HTML document and highlighting two ways of selecting relevant data: with the text() method we can select the inner text of an HTML node, and with the attr() method we can select the value of an element's attribute.
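
Selectors often match more than one node; in that case Cheerio's map() and toArray() methods let us collect every result - a small sketch:

import cheerio from 'cheerio';

const tree = cheerio.load(`
    <div class="reviews">
        <p>Great product!</p>
        <p>Would buy again.</p>
    </div>
`);

// map() visits every matched node; toArray() turns the result into a plain array
const reviews = tree('.reviews p').map(
    (i, node) => tree(node).text()
).toArray();
console.log(reviews);  // [ 'Great product!', 'Would buy again.' ]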

Using Xpath

While CSS selectors are short, robust and easy to read, sometimes when dealing with complex HTML trees we might need something more powerful - for that, NodeJS also has XPath support via the xpath and @xmldom/xmldom packages:

import xpath from 'xpath';
import { DOMParser } from '@xmldom/xmldom'

const tree = new DOMParser().parseFromString(`
    <head>
        <title>My Website</title>
    </head>
    <body>
        <div class="content">
            <h1>First blog post</h1>
            <p>Just started this blog!</p>
            <a href="http://scrapfly.io/blog">Checkout My Blog</a>
        </div>
    </body>
`);

console.log({
    // we can extract text of the node, which returns `Text` object:
    title: xpath.select('//div[@class="content"]/h1/text()', tree)[0].data,
    // or a specific attribute value, which return `Attr` object:
    url: xpath.select('//div[@class="content"]/a/@href', tree)[0].value,
});

Here, we're replicating our Cheerio example in an xmldom + xpath setup, selecting the title text and the URL's href attribute.

Parsing HTML with Xpath

For more on XPATH selectors see our complete introduction tutorial which covers basic usage, tips and tricks and common web scraping idioms


We looked into two methods of parsing HTML content with NodeJS: using CSS selectors with Cheerio and using XPath selectors with xmldom + xpath. Generally it's best to stick with Cheerio, as it complies better with the HTML standard and CSS selectors are easier to work with.
Next, let's put all of this together in an example project!

Example Project: etsy.com

We've learned about HTTP connections using axios and HTML parsing using cheerio, and now it's time to put everything together and solidify our knowledge!

In this section we'll write an example scraper for https://www.etsy.com/, a crowd-sourced e-commerce website. This example will cover popular web scraping idioms like dealing with CSRF tokens and session cookies.

We'll write a scraper that scrapes the newest products appearing in the vintage product category:

  1. We'll go to https://www.etsy.com/ and change our currency/region to USD/US.
  2. Then we'll go to the product directory and find the most recent product URLs.
  3. For each of those URLs we'll scrape the product name, price and other details.

In this example we'll be using the async/await asynchronous programming paradigm. You can read more about it in MDN's official async/await introduction.

Let's start off by establishing connection with etsy.com and setting our preferred currency/region:

import cheerio from 'cheerio'
import axios from 'axios';
import { wrapper } from 'axios-cookiejar-support';
import { CookieJar } from 'tough-cookie';

const jar = new CookieJar();
const session = wrapper(
    axios.create({
        jar: jar,
        // default headers must be nested under the `headers` key:
        headers: {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        },
    })
);


async function setLocale(currency, region){
    let _prewalk = await session.get('https://www.etsy.com/');
    let tree = cheerio.load(_prewalk.data);
    let csrf = tree('meta[name=csrf_nonce]').attr('content');
    try{
        let resp = await session.post(
            'https://www.etsy.com/api/v3/ajax/member/locale-preferences',
            {currency:currency, language:"en-US", region: region}, 
            {headers: {'x-csrf-token': csrf}},
        );
    }catch (error){
        console.log(error);
    }
}

await setLocale('USD', 'US');

Here, we are creating an axios instance with cookie tracking support. Then we connect to Etsy's homepage and look for the CSRF token which allows us to interact with Etsy's backend API. Finally, we send a preference request to this API, which returns some tracking cookies that our cookie jar saves for us automatically.

A CSRF token is a special security token used across the modern web. It essentially tells the web server that we are continuing an existing communication and not just randomly popping in somewhere in the middle. In the Etsy example we started the communication by requesting the homepage, where we found a token that lets us continue our session. For more on CSRF tokens we recommend this StackOverflow thread.

From here, every request we make with our axios instance will include these preference cookies - meaning all of our scraped content will be priced in USD.
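
If we want to sanity-check what the cookie jar has actually stored, tough-cookie lets us inspect it directly - a small sketch, assuming the jar object from the snippet above and tough-cookie's promise API (v4+):

// print the cookies that will be attached to etsy.com requests
let storedCookies = await jar.getCookieString('https://www.etsy.com/');
console.log(storedCookies);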

With site preferences sorted, we can continue with our next step - collecting the newest product URLs in the /vintage/ category:

async function findProducts(category){
    let resp = await session.get(
        `https://www.etsy.com/c/${category}?explicit=1&category_landing_page=1&order=date_desc`
    );
    let tree = cheerio.load(resp.data);
    return tree('a.listing-link').map(
        (i, node) => tree(node).attr('href')
    ).toArray();
}

console.log(await findProducts('vintage'));

Here, we defined a function which, given a category name, returns the product URLs from the first page of results. Notice that we've added order=date_desc to sort results by date in descending order so we pick up only the latest products.

We're left with implementing product scraping itself:

async function scrapeProduct(url){
    let resp = await session.get(url);
    let tree = cheerio.load(resp.data);
    return {
        url: url,
        title: tree('h1').text().trim(),
        description: tree('p[data-product-details-description-text-content]').text().trim(),
        price: tree('div[data-buy-box-region=price] p[class*=title]').text().trim(),
        store_url: tree('a[aria-label*="more products from store"]').attr('href').split('?')[0],
        images: tree('div[data-component=listing-page-image-carousel] img').map(
            (i, node) => tree(node).attr('data-src')
        ).toArray()
    };
}

Similarly to before, all we're doing in this function is retrieving the HTML of the product page and extracting the product details from the HTML content.

Finally, it's time to put everything together as a runner function:

async function scrapeVintage(){
    await setLocale('USD', 'US');
    let productUrls = await findProducts('vintage');
    return Promise.all(productUrls.map(
        (url) => scrapeProduct(url)
    ))
}

console.log(await scrapeVintage());

Here we're combining all of our defined functions into one scraping task which should produce results like:

[
  {
    url: 'https://www.etsy.com/listing/688372741/96x125-turkish-oushak-area-rug-vintage?click_key=467d607c570b0d7760a78a00c820a1da4d1e4d0d%3A688372741&click_sum=5f5c2ff9&ga_order=date_desc&ga_search_type=vintage&ga_view_type=gallery&ga_search_query=&ref=sc_gallery-1-1&frs=1&cns=1&sts=1',
    title: '9.6x12.5 Turkish Oushak Area Rug, Vintage Wool Rug, Faded Orange Handmade Home Décor, Distressed Blush Beige, Floral Bordered Oriental Rugs',
    description: '★ This special rug <...>',
    price: '$2,950.00',
    store_url: 'https://www.etsy.com/shop/SuffeArt',
    images: [
      'https://i.etsystatic.com/18572096/r/il/7480a4/3657436348/il_794xN.3657436348_oxay.jpg',
      'https://i.etsystatic.com/18572096/r/il/afa2b7/3705052531/il_794xN.3705052531_9xsa.jpg',
      'https://i.etsystatic.com/18572096/r/il/dbde4f/3657436290/il_794xN.3657436290_a64r.jpg',
      'https://i.etsystatic.com/18572096/r/il/b2002d/3705052595/il_794xN.3705052595_4c7m.jpg',
      'https://i.etsystatic.com/18572096/r/il/6ad90d/3705052613/il_794xN.3705052613_kzey.jpg',
      'https://i.etsystatic.com/18572096/r/il/ccec83/3705052663/il_794xN.3705052663_1472.jpg',
      'https://i.etsystatic.com/18572096/r/il/8be8c9/3657436390/il_794xN.3657436390_5su0.jpg',
      'https://i.etsystatic.com/18572096/r/il/c4f65e/3705052709/il_794xN.3705052709_4u9r.jpg',
      'https://i.etsystatic.com/18572096/r/il/806141/3705052585/il_794xN.3705052585_fn8p.jpg'
    ]
  },
  ...
]

In this example project we've learned to scrape modern e-commerce websites. We configured currency preferences, learned how to deal with CSRF tokens and, finally, how to scrape and parse product information!

There are many more challenges in web-scraping, so before we wrap this tutorial up let's take a look at some of them.

Solving Challenges with ScrapFly

By far the two biggest web scraping challenges are scaling and avoiding being blocked. To scrape fast, our scrapers need to use proxies to avoid the various rate limits imposed by websites. Further, websites can block us at any time once they discover that we are a robot and not a real user.

These two subjects are heavily related and well out of scope of this tutorial.
To put it shortly, however, we need to figure out how to blend in so we appear as a regular website user. This is often done by using many proxies to appear as many website visitors rather than one that looks at thousands of different pages, by using web browser automation to comply with the full JavaScript logic of the website, and so on.


ScrapFly's middleware service can let you focus on web scraping while we deal with ensuring the scraper remains undetected and efficient.


ScrapFly service does the heavy lifting for you

For example, the most popular feature is javascript rendering, which is essentially Puppeteer in the cloud! This allows us to use simple NodeJS HTTP requests instead of starting up a whole web browser locally. This not only saves us resources but also provides better results, as ScrapFly's automated web browsers are specially designed to be efficient for web scraping.

const axios = require('axios');
const scrapflyKey = 'YOUR SCRAPFLY API KEY'
function scrapflyRequest(url, waitForSelector){
  var options = {
    'key': scrapflyKey,
    'render_js': true,
    'url': url,
    'wait_for_selector': waitForSelector
  };
  return axios.get(
      'https://api.scrapfly.io/scrape',
      {params: options}
  );
}
async function run(){
    let response = await scrapflyRequest('http://etsy.com', 'h1');
    console.log(response);
}
run();

Here, we are using ScrapFly's middleware to render a request URL with a fortified cloud web browser. We can even tell it when to stop waiting, the same way we would in Puppeteer!

Often we might not even need full javascript rendering, but we might still be blocked by some anti-bot protection. For this, ScrapFly provides the Anti Scraping Protection solution which solves various scraping blockers like captchas and popular anti-scraping integrations:

function scrapflyRequest(url, useASP){
  var options = {
    'key': scrapflyKey,
    'url': url,
    'asp': useASP
    // ^^^ we can turn on ASP with this url option
  };
  return axios.get(
      'https://api.scrapfly.io/scrape',
      {params: options}
  );
}

In the example above, we are using ScrapFly's middleware with the anti web scraper protection solution enabled via the asp argument.

Finally, ScrapFly provides an automated smart proxy system which applies proxies to every request automatically! We can even choose the proxy type and country of origin:

function scrapflyRequest(url){
  var options = {
    'key': scrapflyKey,
    'url': url,
    'country': 'ca',
    //         ^^^ use proxies from Canada
    'proxy_pool': 'public_residential_pool',
    //             ^^ use residential proxies which are much harder to detect
  };
  return axios.get(
      'https://api.scrapfly.io/scrape',
      {params: options}
  );
}

Here, we configure our requests to use residential proxies from Canada, which allows us to access geo-locked Canadian websites, and by using residential proxies we become much harder to block!

ScrapFly's feature set doesn't end here though! For the complete feature set, see our full documentation.

FAQ

To wrap up this tutorial let's take a look at frequently asked questions:

What's the difference between nodejs and puppeteer in web scraping?

Puppeteer is a popular browser automation library for NodeJS that is frequently used for web scraping. However, we don't always need a web browser to scrape the web. In this article, we've learned how we can use NodeJS with a simple HTTP client to scrape web pages. Browsers are complicated and expensive to run and maintain, so HTTP client based web scrapers are much faster and cheaper.

How to scrape concurrently in NodeJS?

Since NodeJS javascript code is naturally asynchronous, we can perform concurrent requests to scrape multiple pages by wrapping a list of scrape promises in the Promise.all or Promise.allSettled functions. These functions take a list of promise objects and resolve them concurrently, which can speed up the web scraping process significantly:

urls = [...]
async function scrape(url){
    ...
};
let scrape_promises = urls.map((url) => scrape(url));
await Promise.all(scrape_promises);
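
That said, firing thousands of requests at once is a quick way to get blocked or rate limited, so it's common to scrape in smaller batches - a minimal sketch reusing the scrape() function and urls list from above:

async function scrapeInBatches(urls, batchSize){
    let results = [];
    for (let i = 0; i < urls.length; i += batchSize){
        // scrape one batch of pages concurrently, then move on to the next batch
        let batch = urls.slice(i, i + batchSize);
        results.push(...await Promise.all(batch.map((url) => scrape(url))));
    }
    return results;
}

let results = await scrapeInBatches(urls, 10);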

How to use proxy in NodeJS?

When scraping at scale we might need to use proxies to prevent blocking. Most NodeJS HTTP client libraries implement proxy support through simple arguments. For example, in axios we can set a proxy on an instance:

const session = axios.create({
    proxy: {
            protocol: 'http',  // proxy protocol
            host: '111.22.33.44',  // proxy ip address
            port: 80,  // proxy port
            auth: {username: 'proxy-auth-username', password: 'proxy-auth-password'}  // proxy auth if needed
        }
    }
)

How to click on buttons and submit forms in NodeJS?

Often when web scraping we need to replicate a button click or a form submission to get to the data we need. Since our NodeJS code is not a fully-fledged browser, we cannot literally click buttons or submit forms. Instead, we need to take a look at how the web page functions in the browser's developer tools (the F12 key in major browsers) and replicate that functionality in our web scraper code.
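
For example, if the Network tab shows that clicking a search button fires a POST request with a JSON body, we can reproduce that request directly with axios - a hypothetical sketch (the endpoint and payload below are illustrative, not a real website's API):

import axios from 'axios';

// replicate the request observed in the browser's Network tab
let response = await axios.post(
    'https://www.example.com/api/search',  // hypothetical endpoint seen in devtools
    {query: 'vintage rug', page: 1},       // payload copied from the observed request
    {headers: {'Content-Type': 'application/json'}}
);
console.log(response.data);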

Summary

In this extensive introduction article we've introduced ourselves to the NodeJS web scraping ecosystem. We looked into using axios as our HTTP client to collect multiple pages and cheerio / @xmldom/xmldom to parse information from them using CSS/XPath selectors.

Finally, we wrapped everything up with an example web scraper project which scrapes vintage product information from https://www.etsy.com/, and looked into ScrapFly's middleware solution which takes care of difficult web scraping challenges such as scaling and blocking!

Related Posts

Web Scraping With Node-Unblocker

Tutorial on using Node-Unblocker - a nodejs library - to avoid blocking while web scraping and using it to optimize web scraping stacks.

Web Scraping With a Headless Browser: Puppeteer

Introduction to using Puppeteer in Nodejs for web scraping dynamic web pages and web apps. Tips and tricks, best practices and example project.