How to Avoid Web Scraping Blocking: IP Address Guide


The Internet Protocol (IP) address is the most common way of identifying web scrapers. IP is at the core of every internet exchange, and tracking and analyzing it can tell a lot about the connecting client.

In web scraping, IP tracking and analysis (aka fingerprinting) is often used to throttle and block web scrapers or other undesired visitors. In this article, we'll take a look at what Internet Protocol addresses are and how IP tracking technologies are used to block web scrapers.

How to Scrape Without Getting Blocked Tutorial

For more on avoiding web scraping blocking, see our full introduction article, which covers request headers, TLS handshakes and JavaScript fingerprinting.

IP Address Details

An Internet Protocol address is a simple number-based address that identifies a connection's origin - it's the backbone of every internet connection. If you're at home, your IP is provided by your internet service provider, but there's much more to it!

Versions

There are two versions of IP addresses: IPv4 and IPv6. The key difference is that the IPv4 pool is limited to a few billion addresses, which might sound like a lot, but we're almost out of them. IPv6 has significantly more available addresses but still lacks adoption.

illustration of ipv4 versus ipv6 internet protocols

Since most of the web still functions over IPv4 and the number of these addresses is limited, they have essentially become a commodity. This also makes IPv4 perform much better when it comes to fingerprinting, simply because each address costs more to obtain. We'll stick with IPv4 addresses here, as they are much more common in web scraping.

Structure

So let's take a look at the IPv4 address, particularly in the context of identification and tracking.
IPv4 addresses are made up of 4 parts:

illustration of IPv4 structural parts

The first two parts are network addresses, which are distributed randomly to IP holders (like ISPs), so there's very little valuable information we can extract from those. The last two numbers are what really matters when it comes to IP fingerprinting.
The 3rd number is called the sub-network (subnet) address, and it's essentially an identifier for a group of 254 usable host addresses. In the real world, subnets often identify a geographical region - you and your neighbors are most likely sharing the same subnet address provided by your ISP, with each of you having an individual host address - the last number of the address.
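
To make this structure concrete, here's a minimal sketch using Python's standard ipaddress module (the example address is arbitrary):

# The network part is 209.127, the subnet is 191 and the host is 180
import ipaddress

ip = ipaddress.ip_address("209.127.191.180")

# The /24 network groups this host with its subnet "neighbors"
subnet = ipaddress.ip_network("209.127.191.0/24")
print(ip in subnet)              # True
print(subnet.num_addresses)      # 256 addresses in total
print(list(subnet.hosts())[:3])  # first few usable host addresses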

Metadata

An IP address by itself provides very little information about the identity of its owner, so IP metadata databases are used to provide more context about connecting clients. These databases collect information from public data points (like WHOIS, ARIN and RIPE) and often contain lots of metadata, such as:

  • ISP metadata like name, legal details and AS number
  • IP address geographical location
  • Connection type
  • Origin estimates: is it a proxy IP, a VPN or something else?

We can easily query the WHOIS database for raw metadata using their online lookup page at https://www.whois.com/whois/ (or terminal tools like whois):

# Example query for 209.127.191.180 - free proxy IP
NetRange:       209.127.160.0 - 209.127.192.255
CIDR:           209.127.192.0/24, 209.127.160.0/19
NetName:        B2NETSOLUTIONS
NetHandle:      NET-209-127-160-0-1
Parent:         NET209 (NET-209-0-0-0-0)
NetType:        Direct Allocation
OriginAS:       
Organization:   B2 Net Solutions Inc. (BNS-34)
RegDate:        2018-01-12
Updated:        2022-02-09
Ref:            https://rdap.arin.net/registry/ip/209.127.160.0


OrgName:        B2 Net Solutions Inc.
OrgId:          BNS-34
Address:        205-1040 South Service Road
City:           Stoney Creek
StateProv:      ON
PostalCode:     L8E 6G3
Country:        CA
RegDate:        2011-10-24
Updated:        2021-09-16
Comment:        https://servermania.com
Ref:            https://rdap.arin.net/registry/entity/BNS-34
...

We can extract a lot of info about this connection from the metadata alone. For example, we can see the owner is an organization (residential IPs would have a "Person" keyword instead). From the registered name and domain, it appears to be a server hosting company. Using this information, we can guesstimate that the connecting client might be a robot.
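
To automate this kind of check, a rough sketch is to shell out to the whois terminal tool mentioned above and scan the raw output for telltale keywords. Note that the keyword list here is our own heuristic, not an official classification, and the whois command must be installed:

import subprocess

def looks_like_datacenter(ip: str) -> bool:
    """Heuristic guess: scan raw whois output for hosting-related keywords."""
    output = subprocess.run(
        ["whois", ip], capture_output=True, text=True
    ).stdout.lower()
    # Rough, hand-picked keywords - tune these for your own use case
    keywords = ["hosting", "server", "datacenter", "data center", "cloud"]
    return any(word in output for word in keywords)

# True here - the record's servermania.com comment matches "server"
print(looks_like_datacenter("209.127.191.180"))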

The WHOIS database offers raw data which is difficult to follow and parse. For this, we recommend taking a look at IP database aggregators like ipleak.com, which distill this information into a few important values.

When web scraping, we want to avoid IPs with metadata that might indicate non-human connections (like IPs owned by a datacenter). Instead, we should aim for residential or mobile IPs, which make the connection appear much more human.

Best Proxy Providers for Web Scraping

For more on proxy quality, see our best proxy provider article, which compares the pricing, quality and accessibility of the top services in the industry!

How Are IPs Being Tracked?

Anti-scraping services use these two IP details - the address and the metadata - to generate an initial connection trust score for every client, which is used to determine whether the client is desirable or not.

For example, if you're connecting from your clean home network, the service might start you off at a score of 1 (trustworthy) and let you through effortlessly without requesting a captcha to be solved.
On the other hand, if you're connecting from a busy public Wi-Fi, the score will be a bit lower (e.g. 0.5), which might prompt a small captcha challenge every once in a while.
In the worst-case scenario, if you connect from a busy, shared datacenter IP, you'd get a really low score, which can result in multiple captcha challenges or even a complete block.
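
To illustrate the idea in code - the exact scoring logic of any real anti-bot service is proprietary, so the numbers below are made up purely for demonstration:

# Purely illustrative scoring - real services use many more signals
BASE_SCORES = {"residential": 1.0, "mobile": 1.0, "datacenter": 0.3}

def trust_score(ip_type: str, shared_users: int) -> float:
    score = BASE_SCORES.get(ip_type, 0.5)
    if shared_users > 10:  # busy, shared addresses lose trust
        score *= 0.5
    return score

print(trust_score("residential", 1))   # 1.0 - no captcha
print(trust_score("mobile", 30))       # 0.5 - occasional captcha
print(trust_score("datacenter", 50))   # 0.15 - captchas or a complete block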

So, which IP data points influence this score the most?

IP fingerprint illustration

First, it's the address itself. All tracking services keep a database of IP connection data, e.g. IP X connected N times in the past day and so on. The important thing to note here is that this data has a vast relationship network, so one IP address's score can be affected by its neighbors and relatives.

A prime example of this is the fact that IPs are not sold one by one but in blocks, meaning one bad apple often spoils the bunch. IP addresses are usually sold in /24 blocks, which means 256 addresses, or in other words, 1 subnet (the 3rd IPv4 number). So if we see multiple unusual connections from addresses like 1.1.1.2, 1.1.1.43 and 1.1.1.15, we can guesstimate that the whole 1.1.1.X block is owned by a single entity. This often results in the whole subnet being blocked or having its trust score reduced.
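
Here's a hedged sketch of how such grouping could work, again using Python's ipaddress module and a made-up connection log:

import ipaddress
from collections import Counter

# Made-up log of recently connecting addresses
seen_ips = ["1.1.1.2", "1.1.1.43", "1.1.1.15", "8.8.8.8"]

# Group each address into its /24 block - the level IPs are usually sold at
subnet_hits = Counter(
    ipaddress.ip_network(f"{ip}/24", strict=False) for ip in seen_ips
)
print(subnet_hits.most_common(1))
# [(IPv4Network('1.1.1.0/24'), 3)] - a suspicious concentration in one block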

We can expand the same block ownership idea even further by taking a look at the IP address metadata.

The most common data point for this is the Autonomous System Number (ASN), which is a number assigned to every registered IP owner. So a few bad apples under one specific ASN can lower the connection score for all of the IPs under that same ASN.

There are various online databases, like bgpview.io, that allow you to inspect ASNs and the IP ranges assigned to them.
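
As a sketch, assuming bgpview.io's public JSON API at https://api.bgpview.io/ip/<address> (the endpoint and response fields below follow their documentation at the time of writing - verify before relying on them), an ASN lookup could look like this:

import requests

ip = "209.127.191.180"
data = requests.get(f"https://api.bgpview.io/ip/{ip}", timeout=10).json()
# Each announced prefix carries details of the ASN announcing it
for prefix in data["data"]["prefixes"]:
    asn = prefix["asn"]
    print(asn["asn"], asn["name"], asn["description"])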

Another metadata point that is commonly used in calculating trust scores is the IP type itself. While the metadata doesn't explicitly say whether the address is residential, mobile or datacenter, this can be inferred from the ownership details.
So a datacenter IP would have a lower score just because it's very likely to be a robot, whereas mobile and residential IPs would be treated much more fairly.

IPs in Web Scraping

We've learned a lot about IP fingerprinting - so how do we apply this information in web scraping?

To avoid web scraper blocking, we want to use IPs with a high trust score. So we should avoid IP addresses with weak metadata data points - anything that would indicate a datacenter or an untrustworthy owner.

When scraping at scale, we want to diversify our connections by using a proxy pool of high-trust-score IP addresses. Diversity is key here, as even high-trust addresses can lose their potency during periods of heavy use.

To put it shortly: to get around web scraper blocking, we want a diverse pool of residential or mobile proxies with lots of different subnets, geographical locations and AS numbers.
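
As a minimal sketch (the proxy URLs are placeholders - substitute your provider's credentials), rotating such a pool with Python's requests library could look like this:

import random
import requests

# Placeholder pool - in practice use diverse residential/mobile proxies
# spanning different subnets, regions and AS numbers
PROXY_POOL = [
    "http://user:pass@203.0.113.10:8000",
    "http://user:pass@198.51.100.22:8000",
    "http://user:pass@192.0.2.31:8000",
]

def get(url: str) -> requests.Response:
    proxy = random.choice(PROXY_POOL)  # random exit IP for every request
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)

print(get("https://httpbin.org/ip").json())  # shows the proxy's IP, not yours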

Introduction To Proxies in Web Scraping

For more on proxies, see our introduction article, which goes into greater detail on what makes a good proxy and how to correctly work with proxies when web scraping.

To make things easy, ScrapFly's API offers a smart proxy system which intelligently selects an IP from a massive 190M+ IP pool for every individual request!

scrapfly middleware

ScrapFly service does the heavy lifting for you!

That's just the tip of the iceberg of the web scraping solutions ScrapFly offers, like bypassing anti-web-scraping protection systems and JavaScript rendering - give it a spin for free!
