How to Avoid Web Scraper IP Blocking?


The Internet Protocol (IP) address is the most common way of identifying web scrapers. IP is at the core of every internet exchange, and tracking and analyzing it can tell a lot about the connecting client.

In web scraping, IP tracking and analysis (aka IP fingerprinting) is often used to throttle and block web scrapers and other undesired visitors. In this article, we'll take a look at what Internet Protocol addresses are and how IP tracking technologies are used to block web scrapers.

How to Scrape Without Getting Blocked? In-Depth Tutorial

For more on avoiding web scraping blocking, see our full introduction article which covers request headers, TLS handshakes and JavaScript fingerprinting.


IP Address Details

The Internet Protocol address is a simple number-based address that identifies the origin of a connection - it's the backbone of all internet connections. If you're at home, your IP is provided by your internet service provider. However, there's much more to it!

Versions

There are two versions of IP addresses: IPv4 and IPv6.
The key difference is that the IPv4 pool is limited to a few billion addresses. This might sound like a lot, but we're almost out of them!
IPv6, on the other hand, has vastly more available addresses but still lacks real-world adoption.

illustration of ipv4 versus ipv6 internet protocols

Since most of the web still runs on IPv4 and the number of these addresses is limited, they are essentially a commodity. This is why IPv4 performs much better when it comes to fingerprinting: it simply costs more to obtain.
In other words, if a website sees a client connect from an IPv6 address, it automatically lowers the client's trust score, because these addresses are much more plentiful.

In this article we'll stick with IPv4 addresses, as scraping over IPv6 isn't practical yet.
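When filtering IPv6 addresses out of a proxy list, Python's standard ipaddress module can tell the two versions apart. A minimal sketch (the addresses shown are just examples):

```python
import ipaddress

def is_ipv4(address: str) -> bool:
    """Return True if the given string is a valid IPv4 address."""
    try:
        return ipaddress.ip_address(address).version == 4
    except ValueError:
        return False  # not a valid IP address at all

print(is_ipv4("209.127.191.180"))       # IPv4 -> True
print(is_ipv4("2606:4700::6810:84e5"))  # IPv6 -> False
```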

Structure

So let's take a look at the IPv4 address structure in the context of identification and tracking.
IPv4 addresses are made up of 4 numbers (octets):

illustration of IPv4 structural parts

The first two numbers form the network address, which is distributed to IP holders (like ISPs) essentially at random, so there's very little valuable information we can extract from them.

The last two numbers are what matter when it comes to IP fingerprinting.
The third number is the sub-network (subnet) address - essentially an identifier for a group of up to 254 usable host addresses. In the real world, subnets often map to a geographical region: you and your neighbors are most likely sharing the same subnet address provided by your ISP, with each of you having an individual host address - the last number of the address.
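The same ipaddress module can split an address into its octets and compute the /24 subnet it belongs to (the address used here is just an example):

```python
import ipaddress

ip = ipaddress.ip_address("209.127.191.180")
octets = str(ip).split(".")
print(octets)  # ['209', '127', '191', '180']

# the first three octets identify the /24 subnet, the last one the host
network = ipaddress.ip_network("209.127.191.180/24", strict=False)
print(network)                # 209.127.191.0/24
print(network.num_addresses)  # 256 total (254 usable host addresses)
```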

Metadata

The IP address itself provides very little information about the identity of its owner. So, IP meta-information databases are used to provide more context about connecting clients. These databases collect information from public data points (like WHOIS, ARIN and RIPE) and contain loads of metadata such as:

  • ISP metadata like name, legal details and AS number
  • IP address geographical location
  • Connection Type
  • Origin estimates: is it a Proxy IP, VPN or something else?

We can easily query the WHOIS database for raw metadata using its online lookup page: https://www.whois.com/whois/ (or terminal tools like whois).

For example, let's take a look at this random free proxy IP address:

# Example query for 209.127.191.180 - free proxy IP
NetRange:       209.127.160.0 - 209.127.192.255
CIDR:           209.127.192.0/24, 209.127.160.0/19
NetName:        B2NETSOLUTIONS
NetHandle:      NET-209-127-160-0-1
Parent:         NET209 (NET-209-0-0-0-0)
NetType:        Direct Allocation
OriginAS:       
Organization:   B2 Net Solutions Inc. (BNS-34)
RegDate:        2018-01-12
Updated:        2022-02-09
Ref:            https://rdap.arin.net/registry/ip/209.127.160.0


OrgName:        B2 Net Solutions Inc.
OrgId:          BNS-34
Address:        205-1040 South Service Road
City:           Stoney Creek
StateProv:      ON
PostalCode:     L8E 6G3
Country:        CA
RegDate:        2011-10-24
Updated:        2021-09-16
Comment:        https://servermania.com
Ref:            https://rdap.arin.net/registry/entity/BNS-34
...

We can see how much metadata we got from this public IP database. All of these details could be used to determine the likelihood of this IP being used by a real person rather than a program.
For example, we can see the owner is an organization (residential IPs would have a "Person" record instead). From the registered name and the domain, it appears to be a server hosting company.
So, this IP address is owned by a server hosting company located in Ontario, Canada - how likely is it that this connection is coming from a human user?

The WHOIS database offers raw data which is difficult to follow and parse. For this we recommend taking a look at IP database aggregators like ipleak.com, which distill this information down to a few important values.
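That said, for quick scripts the raw "Key: value" lines can be flattened into a dictionary with a few lines of Python. A minimal sketch (real responses vary between registries, and keys repeat across registration blocks, so values are collected into lists):

```python
def parse_whois(raw: str) -> dict:
    """Parse raw "Key: value" WHOIS output into a dict of value lists."""
    records: dict = {}
    for line in raw.splitlines():
        if ":" not in line or line.startswith(("%", "#")):
            continue  # skip comments and free-form lines
        key, _, value = line.partition(":")
        records.setdefault(key.strip(), []).append(value.strip())
    return records

raw = """\
NetName:        B2NETSOLUTIONS
NetType:        Direct Allocation
OrgName:        B2 Net Solutions Inc.
Country:        CA"""
info = parse_whois(raw)
print(info["Country"])  # ['CA']
```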

When web scraping, we want to avoid IPs with metadata that might indicate non-human connections (like IPs owned by a datacenter). Instead, we should aim for residential or mobile IPs, which make connections appear much more human.

Best Proxy Providers for Web Scraping

For more on the quality of proxies, see our best proxy provider article, which compares pricing, quality and accessibility of the top services in the industry!


How Are IPs Tracked?

Anti web scraping services use these two IP details - the address and the metadata - to generate an initial connection trust score for every client, which is used to determine whether the client is desirable or not.

For example, if you're connecting from your clean home network, the service might start you off at a score of 1 (trustworthy) and let you through effortlessly without requesting a captcha to be solved.
On the other hand, if you're connecting from a busy public wifi network, the score will be a bit lower (e.g. 0.5), which might prompt a small captcha challenge every once in a while.
In the worst case, if you connect from a busy, shared datacenter IP, you'd get a really low score, which can result in multiple captcha challenges or even a complete block.
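The real scoring formulas are proprietary, but the logic above can be illustrated with a toy heuristic. The function and its weights below are entirely hypothetical - this is not any real vendor's algorithm:

```python
def trust_score(ip_type: str, shared: bool, subnet_flagged: bool) -> float:
    """Toy trust-score heuristic: start from the IP type, then apply penalties."""
    score = {"residential": 1.0, "mobile": 1.0, "datacenter": 0.2}.get(ip_type, 0.5)
    if shared:
        score *= 0.5  # busy public wifi / shared exit point
    if subnet_flagged:
        score *= 0.5  # neighbors in the same /24 block misbehaved
    return round(score, 2)

print(trust_score("residential", shared=False, subnet_flagged=False))  # 1.0 - no captcha
print(trust_score("residential", shared=True, subnet_flagged=False))   # 0.5 - occasional captcha
print(trust_score("datacenter", shared=True, subnet_flagged=True))     # 0.05 - likely blocked
```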

So, which IP data points influence this score the most?

IP fingerprint illustration

First, it's the address itself. All tracking services keep a database of IP connection data, e.g. IP X connected N times in the past day, and so on. The important thing to note here is that this data forms a vast relationship network, so one IP address's score can be affected by its neighbors and relatives.

A prime example of this is the fact that IPs are not sold one by one but in blocks, meaning one bad apple often spoils the bunch. IP addresses are usually sold in /24 blocks of 256 addresses - in other words, 1 subnet (the 3rd IPv4 number). So, if we see multiple unusual connections from addresses like 1.1.1.2, 1.1.1.43 and 1.1.1.15, we can guesstimate that the whole 1.1.1.X block is owned by a single identity. This often results in the whole subnet being blocked or having its trust score reduced.
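This block-level grouping is easy to reproduce: collapse each observed address to its /24 network and count hits per block. A sketch using Python's stdlib (the addresses and threshold are made up for illustration):

```python
import ipaddress
from collections import Counter

# Group observed client IPs by their /24 block - many hits from one block
# suggest the whole subnet belongs to a single operator.
seen_ips = ["1.1.1.2", "1.1.1.43", "1.1.1.15", "8.8.8.8"]
blocks = Counter(
    str(ipaddress.ip_network(f"{ip}/24", strict=False)) for ip in seen_ips
)
print(blocks)  # Counter({'1.1.1.0/24': 3, '8.8.8.0/24': 1})

suspicious = [block for block, hits in blocks.items() if hits >= 3]
print(suspicious)  # ['1.1.1.0/24'] - candidate for a block-wide ban
```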

We can expand the same block ownership idea even further by taking a look at the IP address metadata.

The most common data point for this is the Autonomous System Number (ASN), which is assigned to every registered IP owner. A few bad apples under one specific ASN can lower the connection score for all of the IPs under the same ASN.

There are various online databases that allow you to inspect ASNs and the IP ranges assigned to them, like bgpview.io.

Another metadata point that is commonly used in calculating trust scores is the IP type itself. While the metadata doesn't explicitly say whether the address is residential, mobile or datacenter, this can be inferred from the ownership details.
So, a datacenter IP would have a lower score just because it's very likely to be a robot, whereas mobile and residential IPs would be treated much more fairly.

IP Address Use in Web Scraping

We've learned a lot about IP fingerprinting - so how do we apply this information in web scraping?

To avoid web scraper blocking, we want to use IPs with a high trust score. In other words, we should avoid IP addresses with weak metadata - anything that would indicate a datacenter origin or an untrustworthy owner.

Introduction To Proxies in Web Scraping

For more on proxies, see our introduction article, which goes into greater detail on what makes a good proxy and how to correctly work with proxies when web scraping.


When scraping at scale, we want to diversify our connections by using a proxy pool of high-trust-score IP addresses. Diversity is key here, as even high-trust addresses can lose their potency during periods of heavy use.

To put it shortly: to get around web scraper blocking, we want a diverse pool of residential or mobile proxies with lots of different subnets, geographic locations and AS numbers.
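As a rough illustration of such diversity in practice, a pool can be grouped by subnet so consecutive requests don't reuse the same /24 block. The proxy addresses below are hypothetical placeholders:

```python
import ipaddress
import random
from collections import defaultdict

# Hypothetical proxy pool - group addresses by /24 subnet, then pick from a
# different subnet each time so consecutive requests don't share a block.
proxies = ["1.2.3.4:8080", "1.2.3.7:8080", "5.6.7.8:8080", "9.9.9.1:8080"]

by_subnet = defaultdict(list)
for proxy in proxies:
    host = proxy.split(":")[0]
    subnet = str(ipaddress.ip_network(f"{host}/24", strict=False))
    by_subnet[subnet].append(proxy)

def rotate(last_subnet=None):
    """Pick a random proxy, avoiding the subnet used for the previous request."""
    choices = [s for s in by_subnet if s != last_subnet] or list(by_subnet)
    subnet = random.choice(choices)
    return subnet, random.choice(by_subnet[subnet])

subnet, proxy = rotate()
print(proxy)
```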

How to Rotate Proxies in Web Scraping

For an efficient way to manage a proxy pool see our article about proxy rotation which includes example code.


IP Rotation with ScrapFly

To make things easy, ScrapFly's API offers a smart proxy system which intelligently selects an IP from a massive pool of over 190 million addresses for every individual request!

scrapfly middleware

ScrapFly service does the heavy lifting for you!

That's just the tip of the iceberg of the web scraping solutions ScrapFly offers, like bypassing anti-scraping protection systems and JavaScript rendering - give it a spin for free!

FAQ

To wrap this article up, let's take a look at some frequently asked questions about the IP address's role in web scraper blocking:

What Proxy Type is Best for Web Scraping?

Residential proxies are the best for web scraping. They are registered under trustworthy ASNs (for example, public ISPs), so connections made from these IP addresses appear more trustworthy.

Which Geographic Locations are Best for Web Scraping?

The same origin as the scraped target. For example, if we're scraping a website hosted in the US, we should use US-based IP addresses. That's not a hard rule though, and US and EU IP addresses tend to have higher trust in general.

What Makes a Good Proxy Pool for Web Scraping?

Diversity! As we've covered in this article, a pool with diverse Autonomous System Numbers (ASNs) and subnets will result in the best web scraping performance when it comes to avoiding blocks.

Summary

To summarize, web scrapers can be identified through IP address analysis - by inspecting IP metadata such as the address type (datacenter or residential), ASN and other unique details. So, to avoid being blocked, web scrapers should use a pool of diverse, quality proxy IP addresses.

Related Posts

How Headers Are Used to Block Web Scrapers and How to Fix It

Introduction to web scraping headers - what do they mean, how to configure them in web scrapers and how to avoid being blocked.

Web Scraping Graphql with Python

Introduction to web scraping graphql powered websites. How to create graphql queries in python and what are some common challenges.

Hands on Python Web Scraping Tutorial and Example Project

Introduction tutorial to web scraping with Python. How to collect and parse public data. Challenges, best practices and an example project.