Web Scraping With Node-Unblocker

In this web scraping tutorial, we'll take a look at Node-Unblocker - a hackable proxy tool written in NodeJS that can be used to avoid web scraping blocking.

Additionally, node-unblocker can act as a convenient request/response middleware service that can modify outgoing requests and incoming responses in our web scraper stack.

In this article, we'll take a look at node-unblocker setup, its usage in web scraping, and some general tips and tricks - so let's dive in!

Web Scraping With NodeJS and Javascript

For more on web scraping with NodeJS see our full introduction tutorial, which covers everything you need to know to start scraping with JavaScript/NodeJS.

What is Node-Unblocker?

Node-Unblocker is a NodeJS proxy server library known for being an easy way to get a custom proxy up and running quickly. It provides a native NodeJS API, which means it integrates easily with NodeJS-based web scrapers.

Initially, node-unblocker was created for evading internet censorship and accessing geo-restricted content, though it is a general proxy server that can be used for many proxy-related tasks like web scraping.

The key difference between node-unblocker and a classic HTTP/SOCKS5 proxy is that we can use it as a REST API:

$ curl http://localhost:8080/proxy/https://www.httpbin.org/ip

This makes proxies approachable in any web scraping environment.

Proxies are vital in web scraping for avoiding blocking and geographical restrictions, making node-unblocker a very useful addition to NodeJS-based web scraping stacks.

Introduction To Proxies in Web Scraping

For more on proxies in web scraping see our full introduction tutorial, which covers proxy types and common usage.

Node-Unblocker also offers some advanced features like request and response middlewares that allow us to modify outgoing requests and incoming responses. We'll take a closer look at this feature later in the article.

Setup

To set up a node-unblocker proxy server we need to combine it with an API server such as Express. We can install both packages via npm:

$ npm install unblocker express

Then we can create our server file app.js:

"use strict";

const express = require("express");
const Unblocker = require("unblocker");
const app = express();
const unblocker = new Unblocker({
  // config options here, e.g. prefix: "/proxy/" (the default)
});

app.use(unblocker);

// We can apply custom proxy rules:
app.get("/", (req, res) =>
  res.redirect("/proxy/https://en.wikipedia.org/wiki/Main_Page")
);

// start the server and allow unblocker to proxy websockets:
const port = process.env.PORT || 8080;
app.listen(port).on("upgrade", unblocker.onUpgrade);

console.log(`unblocker app live at http://localhost:${port}/`);

Now, we can proxy any URL through this server:

# run the server
$ node app.js
unblocker app live at http://localhost:8080/
# in other terminal window or browser we can test it:
$ curl http://localhost:8080/proxy/https://www.httpbin.org/ip

Node Unblocker in Web Scraping

Using node-unblocker we can create our own proxy pool and use it to avoid web scraping blocks or geographical restrictions.

For example, by deploying node-unblocker on a US-based server we can use its proxy to access websites restricted to the US region:

let USAonlyUrl = "https://www.example.com";
fetch(`http://localhost:8080/proxy/${USAonlyUrl}`);

We can also deploy several node-unblocker servers and implement rotating proxy logic in our web scraper to distribute scraping connections through several IP addresses:

let proxyPool = [
    "https://111.222.22.33:8080",
    "https://111.222.22.34:8080",
    "https://111.222.22.35:8080",
];
// pick a random proxy from the pool for each request
let proxy = proxyPool[Math.floor(Math.random() * proxyPool.length)];
let url = "https://www.example.com";
fetch(`${proxy}/proxy/${url}`);
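Self-hosted proxy instances can and do go down, so in practice it's worth adding failover to this rotation. Below is a minimal sketch of such a helper - fetchViaProxyPool is a hypothetical name, and it reuses the proxyPool defined above:

// try proxies in random order until one succeeds,
// reusing the proxyPool defined above
async function fetchViaProxyPool(targetUrl, pool = proxyPool) {
    // crude shuffle of a copy of the pool so retries hit different instances
    let candidates = [...pool].sort(() => Math.random() - 0.5);
    for (let candidate of candidates) {
        try {
            let response = await fetch(`${candidate}/proxy/${targetUrl}`);
            if (response.ok) {
                return response;
            }
        } catch (err) {
            console.warn(`proxy ${candidate} failed, trying another one`);
        }
    }
    throw new Error(`all proxies in the pool failed for ${targetUrl}`);
}

// usage:
// let response = await fetchViaProxyPool("https://www.example.com");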

Another, more advanced use of such a proxy is to inject additional logic through request and response middlewares, so let's take a look at that.

Using Middlewares

One of the most interesting features of the node-unblocker proxy is request and response middlewares, which allow modifying outgoing requests and incoming responses.

Illustration: Node-Unblocker middlewares sit between the scraper and the target website, modifying transferred data.

When web scraping, we can use custom middlewares as an abstraction layer. For example, we can automatically apply authentication headers to all outgoing requests:

// app.js
function attachAuth(data) {
    if (data.url.match(/^https?:\/\/instagram\.com\//)) {
        data.headers["x-instagram-token"] = "123";
    }
}
var config = {
    requestMiddleware: [
        attachAuth
    ]
}

Now, the proxy server will attach the x-instagram-token header to every matching outgoing request. A common use of this idiom in web scraping is to abstract connection details into the proxy server. In other words, if we have 5 web scrapers scraping the same target, we only need token resolution logic in our proxy instead of in all 5 scrapers.
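For illustration, here's a hypothetical sketch of what the scraper side could look like once the token logic lives in the proxy - assuming the proxy from the setup section runs on localhost:8080:

// scraper.js - no token logic needed here; the proxy's
// requestMiddleware attaches x-instagram-token for us
const PROXY = "http://localhost:8080";

async function scrape(url) {
    const response = await fetch(`${PROXY}/proxy/${url}`);
    return response.text();
}

// every scraper can share this helper without knowing the token:
// scrape("https://instagram.com/some-page");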

The same goes for response middlewares - for example, we can automatically drop unwanted cookies before they ever reach our web scrapers:

// app.js
var setCookie = require("set-cookie-parser");

function dropCookies(data) {
    if (data.url.match(/^https?:\/\/instagram\.com\//)) {
        // parse the raw set-cookie headers into cookie objects
        var cookies = setCookie.parse(data, { decodeValues: false });
        if (cookies.length) {
            console.log("filtering set-cookie headers");
            // drop unwanted cookies and re-serialize the rest
            // (note: this keeps only name=value and drops cookie attributes)
            data.headers["set-cookie"] = cookies
                .filter(function (cookie) {
                    return !cookie.name.includes("bad_cookie");
                })
                .map(function (cookie) {
                    return cookie.name + "=" + cookie.value;
                });
        }
    }
}
var config = {
    responseMiddleware: [
        dropCookies
    ]
}
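Note that these config objects only take effect once passed to the Unblocker constructor. Here's a minimal sketch wiring both middleware examples into one proxy server:

// app.js - combining both middlewares into a single proxy config
// (attachAuth and dropCookies are the functions defined above)
const express = require("express");
const Unblocker = require("unblocker");

const app = express();
const unblocker = new Unblocker({
    prefix: "/proxy/",  // the default URL prefix
    requestMiddleware: [attachAuth],    // modify outgoing requests
    responseMiddleware: [dropCookies],  // modify incoming responses
});

app.use(unblocker);
app.listen(8080).on("upgrade", unblocker.onUpgrade);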

Using node-unblocker middlewares we can easily share scraping logic across many scrapers, making node-unblocker a great web scraping scaling tool.

For more uses, see node-unblocker's official examples directory.

Deploying With Docker

Since we're using node-unblocker together with the Express web server framework, we can easily deploy it with Docker. Using this Dockerfile we can build a Docker image and deploy it to any Docker-based hosting provider:

FROM node:16

# Create app directory
WORKDIR /usr/src/app

# Install app dependencies
# A wildcard is used to ensure both package.json AND package-lock.json are copied
# where available (npm@5+)
COPY package*.json ./

RUN npm install
# If you are building your code for production
# RUN npm ci --only=production

# Bundle app source
COPY . .

EXPOSE 8080
CMD [ "node", "app.js" ]
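With the Dockerfile in place, building and running the image locally could look like this (the image name here is arbitrary):

$ docker build -t node-unblocker .
$ docker run -p 8080:8080 node-unblocker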

ScrapFly - Node Unblocker and Much More!

Node-unblocker is an interesting project and it can work well for small web scrapers. However, it does have limitations: it doesn't work with many complex websites such as Instagram, YouTube, Google etc.
In addition, node-unblocker gives us access to only a very limited proxy pool, whereas web scraping often requires hundreds of proxies.


ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale. Each product is equipped with an automatic bypass for any anti-bot system and we achieve this by:

  • Maintaining a fleet of real, reinforced web browsers with real fingerprint profiles.
  • Millions of self-healing proxies of the highest possible trust score.
  • Constantly evolving and adapting to new anti-bot systems.
  • We've been doing this publicly since 2020 with the best bypass on the market!

Using ScrapFly we can access any website without being blocked or throttled. We can access ScrapFly in NodeJS via any HTTP client library:

const axios = require('axios');
const scrapflyKey = 'YOUR SCRAPFLY API KEY'

function scrapflyRequest(url){
  var options = {
    'key': scrapflyKey,
    'url': url,
    // optional options:
    'render_js': true,  // whether to enable javascript rendering
    'asp': true,  // whether to enable Anti Scraping Protection bypass
    'country': 'US',  // use proxies based in the United States
  };
  return axios.get(
      'https://api.scrapfly.io/scrape',
      {params: options}
  );
}
async function run(){
    let response = await scrapflyRequest('http://instagram.com/');
    console.log(response);
}
run();
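The response body is JSON, and the scraped page HTML is returned under result.content (see the ScrapFly API docs for the full response schema). A variant that prints just the document might look like this:

async function printContent() {
    let response = await scrapflyRequest('http://instagram.com/');
    // the page HTML lives under result.content in the JSON body
    console.log(response.data.result.content);
}
printContent();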

The ScrapFly API is a powerful and easy-to-use tool that not only provides proxies for our web scrapers but also extra utility functionality! For more, see our full docs on using ScrapFly with NodeJS.

Related Posts

Axios vs Fetch: Which HTTP Client to Choose in JS?

Explore the differences between Fetch and Axios - two essential HTTP clients in JavaScript - and discover which is best suited for your project.

Concurrency vs Parallelism

Learn the key differences between Concurrency and Parallelism and how to leverage them in Python and JavaScript to optimize performance in various computational tasks.

How to Scrape Forms

Learn how to scrape forms through a step-by-step guide using HTTP clients and headless browsers.