In this web scraping tutorial, we'll take a look at NodeJS' Node-Unblocker - a hackable proxy tool written in NodeJS which can be used to avoid web scraping blocking.
Additionally, node-unblocker can act as a convenient request/response middleware service that can modify outgoing requests and incoming responses in our web scraper stack.
In this article, we'll take a look at node-unblocker setup, usage in web scraping and some general tips and tricks so let's dive in!
Node-Unblocker is a NodeJS proxy server library known for being an easy way to get a custom proxy up and running quickly. It provides a native NodeJS API, which means it integrates easily with NodeJS-based web scrapers.
Node-unblocker was initially created for evading internet censorship and accessing geo-restricted content, though it is a general proxy server that can be used for many proxy-related tasks such as web scraping.
The key difference between node-unblocker and a classic HTTP/SOCKS5 proxy is that we can use it as a REST API:
$ curl http://localhost:8080/proxy/https://www.httpbin.org/ip
This makes proxies approachable in any web scraping environment.
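As a minimal sketch of this REST-style addressing, the helper below builds the proxied URL for a target page. The function name `toProxyUrl` is hypothetical, and the `/proxy/` prefix is assumed from the curl example above; adjust both to your own deployment.

```javascript
// Hypothetical helper: build a node-unblocker style URL for a target page.
// Assumes the proxy is mounted under the "/proxy/" prefix, as in the curl example.
function toProxyUrl(proxyBase, targetUrl, prefix = "/proxy/") {
  // Validate the target early so malformed URLs fail in the scraper, not in the proxy.
  const target = new URL(targetUrl);
  // Strip trailing slashes from the proxy base to avoid producing "//proxy/".
  const base = proxyBase.replace(/\/+$/, "");
  return `${base}${prefix}${target.href}`;
}

console.log(toProxyUrl("http://localhost:8080", "https://www.httpbin.org/ip"));
// prints "http://localhost:8080/proxy/https://www.httpbin.org/ip"
```

The resulting string can be passed to any HTTP client, which is what makes this kind of proxy usable without client-side proxy configuration.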
Proxies are vital in web scraping for avoiding blocking and geographical restrictions, making node-unblocker a very useful addition to NodeJS-based web scraping stacks.
Node-Unblocker also offers some advanced features such as request and response middlewares, which allow us to modify outgoing requests and incoming responses. We'll take a closer look at this feature later in the article.
To set up a node-unblocker proxy server, we need to combine it with an API server such as Express. We can install both with the npm command:
$ npm install unblocker express
Then we can create our server file app.js:
"use strict";
const express = require("express");
const Unblocker = require("unblocker");
const app = express();
const unblocker = Unblocker({
// config options here...
});
app.use(unblocker);
// We can apply custom proxy rules:
app.get("/", (req, res) =>
res.redirect("/proxy/https://en.wikipedia.org/wiki/Main_Page")
);
// start the server and allow unblocker to proxy websockets:
const port = process.env.PORT || 8080;
app.listen(port).on("upgrade", unblocker.onUpgrade);
console.log(`unblocker app live at http://localhost:${port}/`);
Now we can proxy any URL through this server:
# run the server
$ node app.js
unblocker app live at http://localhost:8080/
# in another terminal window or browser we can test it:
$ curl http://localhost:8080/proxy/https://www.httpbin.org/ip
Using node-unblocker we can create our own proxy pool which we can use to avoid web scraping blocking or geographical restrictions.
For example, by deploying node-unblocker on a US-based server we can use its proxy to access websites restricted to the US region:
let USAonlyUrl = "https://www.example.com";
// app.js serves plain HTTP, so the proxy is addressed with http://
fetch(`http://localhost:8080/proxy/${USAonlyUrl}`);
We can also deploy several node-unblocker servers and implement rotating proxy logic in our web scraper to distribute our connections across several IP addresses:
// placeholder addresses; each server runs the plain-HTTP app.js from above
let proxyPool = [
  "http://111.222.22.33:8080",
  "http://111.222.22.34:8080",
  "http://111.222.22.35:8080",
];
// pick a random proxy for each request
let proxy = proxyPool[Math.floor(Math.random() * proxyPool.length)];
let url = "https://www.example.com";
fetch(`${proxy}/proxy/${url}`);
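Random selection can pick the same proxy several times in a row. As an alternative sketch, a round-robin rotator cycles through the pool evenly. The `createRotator` name and the addresses are hypothetical placeholders, and the `/proxy/` prefix is assumed from the earlier examples.

```javascript
// Hypothetical round-robin rotator over several node-unblocker servers.
// The addresses used below are placeholders, not real proxies.
function createRotator(proxies) {
  if (proxies.length === 0) {
    throw new Error("proxy pool is empty");
  }
  let index = 0;
  // Each call returns the proxied URL for the next server in the pool.
  return function nextProxiedUrl(targetUrl) {
    const proxy = proxies[index];
    index = (index + 1) % proxies.length; // wrap around at the end of the pool
    return `${proxy}/proxy/${targetUrl}`;
  };
}

const nextProxiedUrl = createRotator([
  "http://111.222.22.33:8080",
  "http://111.222.22.34:8080",
]);
console.log(nextProxiedUrl("https://www.example.com")); // first server
console.log(nextProxiedUrl("https://www.example.com")); // second server
console.log(nextProxiedUrl("https://www.example.com")); // back to the first server
```

Round-robin spreads requests evenly across the pool, while random choice is simpler and needs no shared state; which one fits better depends on how the scraper is parallelized.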
Another, more advanced use of such a proxy is to inject additional logic through request and response middlewares, so let's take a look at that.
One of the most interesting features of the node-unblocker proxy is its request and response middlewares, which allow us to modify outgoing requests and incoming responses.
When web scraping, we can use custom middlewares as an abstraction layer. For example, we can automatically apply authentication headers to all outgoing requests:
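// app.js
function attachAuth(data) {
  // only touch requests going to instagram.com
  if (data.url.match(/^https?:\/\/instagram.com\//)) {
    data.headers["x-instagram-token"] = "123";
  }
}
var config = {
  requestMiddleware: [
    attachAuth
  ]
}
// pass the config when creating the proxy: Unblocker(config)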
Now the proxy server will attach the x-instagram-token header to every outgoing request to instagram.com. A common use of this idiom in web scraping is to abstract connection details into the proxy server. In other words, if we have 5 web scrapers scraping the same target, we only need token resolution logic in our proxy instead of in all 5 scrapers.
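Because a request middleware is just a function that mutates the `data` object, it can be exercised without starting the proxy. The sketch below assumes only the `url` and `headers` fields used in the example above; the `makeAuthMiddleware` factory and the per-host header table are hypothetical names for illustration.

```javascript
// Hypothetical factory: returns a request middleware that attaches
// per-host headers. Only data.url and data.headers are assumed to exist.
function makeAuthMiddleware(headersByHost) {
  return function attachAuth(data) {
    const host = new URL(data.url).hostname;
    const extraHeaders = headersByHost[host];
    if (extraHeaders) {
      // copy the configured headers onto the outgoing request
      Object.assign(data.headers, extraHeaders);
    }
  };
}

const attachAuth = makeAuthMiddleware({
  "instagram.com": { "x-instagram-token": "123" },
});

// Simulate two outgoing requests with plain objects.
const matched = { url: "https://instagram.com/some-page", headers: {} };
const other = { url: "https://www.example.com/", headers: {} };
attachAuth(matched);
attachAuth(other);
console.log(matched.headers); // token header attached
console.log(other.headers); // left untouched
```

Keeping the host-to-header table in one place means new targets can be added without writing another middleware function.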
The same applies to response middlewares. For example, we can automatically prevent unwanted cookies from ever reaching our web scrapers:
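// app.js
// requires: npm install set-cookie-parser
const setCookie = require("set-cookie-parser");
function dropCookies(data) {
  if (data.url.match(/^https?:\/\/instagram.com\//)) {
    // normalize to an array of raw Set-Cookie header strings
    var rawCookies = [].concat(data.headers["set-cookie"] || []);
    if (rawCookies.length) {
      console.log("filtering set-cookie headers");
      // keep the raw header strings whose cookie name is allowed
      data.headers["set-cookie"] = rawCookies.filter(function (raw) {
        var cookie = setCookie.parseString(raw, { decodeValues: false });
        return !cookie.name.includes("bad_cookie");
      });
    }
  }
}
var config = {
  responseMiddleware: [
    dropCookies
  ]
}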
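A response middleware can be checked offline against a fake response object. The sketch below is a dependency-free variant that filters raw `Set-Cookie` strings by name using only string operations. The `makeCookieFilter` name is hypothetical, and it assumes `data.headers["set-cookie"]` is an array of raw header strings.

```javascript
// Hypothetical dependency-free cookie filter for use as a response middleware.
// Assumes data.headers["set-cookie"] is an array of raw Set-Cookie strings.
function makeCookieFilter(blockedSubstring) {
  return function dropCookies(data) {
    const rawCookies = data.headers["set-cookie"];
    if (!Array.isArray(rawCookies)) {
      return; // this response sets no cookies
    }
    data.headers["set-cookie"] = rawCookies.filter((line) => {
      // the cookie name is everything before the first "="
      const name = line.split("=")[0].trim();
      return !name.includes(blockedSubstring);
    });
  };
}

const dropCookies = makeCookieFilter("bad_cookie");
const fakeResponse = {
  url: "https://instagram.com/",
  headers: {
    "set-cookie": ["session=abc; Path=/", "bad_cookie_1=xyz; Path=/; HttpOnly"],
  },
};
dropCookies(fakeResponse);
console.log(fakeResponse.headers["set-cookie"]); // only the "session" cookie remains
```

Splitting on the first `=` is enough to read the cookie name, so this variant avoids a parser dependency at the cost of not exposing attributes such as `Path` or `Expires`.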
Using node-unblocker middlewares, we can easily share scraping logic across many scrapers, making it a great tool for scaling web scraping.
For more uses, see the official node-unblocker examples directory.
Since we're using node-unblocker together with the Express web server framework, we can easily deploy it with Docker. Using this Dockerfile, we can build a Docker image and deploy it to any Docker-based hosting provider:
FROM node:16
# Create app directory
WORKDIR /usr/src/app
# Install app dependencies
# A wildcard is used to ensure both package.json AND package-lock.json are copied
# where available (npm@5+)
COPY package*.json ./
RUN npm install
# If you are building your code for production
# RUN npm ci --only=production
# Bundle app source
COPY . .
EXPOSE 8080
CMD [ "node", "app.js" ]
Node-unblocker is an interesting project that can work well for small web scrapers. However, it has limitations: it doesn't work with many complex websites such as Instagram, YouTube, Google, etc.
In addition, with node-unblocker we can only take advantage of a very limited proxy pool, whereas web scraping often requires hundreds of proxies.
ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale. Each product is equipped with an automatic bypass for any anti-bot system.
Using ScrapFly, we can access any website without being blocked or throttled. We can access ScrapFly in NodeJS via any HTTP client library:
const axios = require('axios');
const scrapflyKey = 'YOUR SCRAPFLY API KEY';
function scrapflyRequest(url) {
  var options = {
    'key': scrapflyKey,
    'url': url,
    // optional options:
    'render_js': true, // whether to enable javascript rendering
    'asp': true, // whether to enable Anti Scraping Protection Bypass
    'country': 'US', // use proxies based in United States
  };
  return axios.get(
    'https://api.scrapfly.io/scrape',
    {params: options}
  );
}
async function run() {
  let response = await scrapflyRequest('http://instagram.com/');
  console.log(response);
}
run();
The ScrapFly API is a powerful and easy-to-use utility that provides not only proxies for our web scrapers but also additional functionality. For more, see our full docs on using ScrapFly with NodeJS.