The Internet Protocol (IP) address is the most common way of identifying web scrapers. IP is at the core of every internet exchange, and tracking and analyzing it can tell a lot about the connecting client.
In web scraping, IP tracking and analysis (aka IP fingerprinting) is often used to throttle and block web scrapers and other undesired visitors. In this article, we'll take a look at what Internet Protocol addresses are and how IP tracking technologies are used to block web scrapers.
An Internet Protocol address is a simple number-based address that identifies the origin of a connection - it's the backbone of every internet exchange. If you're at home, your IP is assigned to you by your internet service provider, but there's much more to it!
There are two versions of IP addresses: IPv4 and IPv6. The key difference is that the IPv4 pool is limited to about 4.3 billion addresses, which might sound like a lot, but we're almost out of them. IPv6 has vastly more available addresses but still lacks adoption.
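To put the difference in scale into perspective, here's a quick sketch using Python's standard `ipaddress` module:

```python
import ipaddress

# Total address pool sizes for each IP version
ipv4_pool = 2 ** 32   # ~4.3 billion addresses
ipv6_pool = 2 ** 128  # ~3.4 * 10^38 addresses

print(f"IPv4 pool: {ipv4_pool:,}")
print(f"IPv6 pool: {ipv6_pool:,}")

# The ipaddress module can tell the two versions apart
print(ipaddress.ip_address("1.1.1.1").version)          # 4
print(ipaddress.ip_address("2606:4700::1111").version)  # 6
```

The 2^32 ceiling is exactly why IPv4 addresses became a scarce, tradeable commodity.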
Since most of the web still runs on IPv4 and the number of these addresses is limited, they have essentially become a commodity. As a result, IPv4 performs much better when it comes to fingerprinting, simply because it costs more to obtain. We'll stick with IPv4 addresses here, as they are far more common in web scraping.
So let's take a look at IPv4 address, particularly in the context of identification and tracking.
IPv4 addresses are made up of 4 parts (octets) separated by dots, each a number from 0 to 255 - for example, 192.168.1.42.
The first two parts are network addresses, which are distributed more or less randomly to IP holders (like ISPs), so there's very little valuable information we can extract from them. The last two numbers are what really matter when it comes to IP fingerprinting.
The 3rd number is called the sub-network (subnet) address - essentially an identifier for a group of 256 addresses (254 usable hosts). In the real world, subnets often map to a geographical region: you and your neighbors are most likely sharing the same subnet address provided by your ISP, with each of you having an individual host address - the last number of the address.
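The subnet/host split above can be verified with the standard `ipaddress` module (the 203.0.113.0/24 range used here is a reserved documentation network, chosen just for illustration):

```python
import ipaddress

# A /24 network covers one subnet: 256 addresses total,
# of which 254 are usable for individual hosts
network = ipaddress.ip_network("203.0.113.0/24")
print(network.num_addresses)       # 256
print(len(list(network.hosts())))  # 254

# A host in that range shares the first three numbers with its "neighbors"
ip = ipaddress.ip_address("203.0.113.42")
print(ip in network)  # True
```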
An IP address by itself provides very little information about the identity of its owner, so IP metadata databases are used to provide more context about connecting clients. These databases collect information from public data points (like WHOIS, ARIN and RIPE) and often contain lots of metadata such as:
We can easily query the WHOIS database for raw metadata using an online lookup page like https://www.whois.com/whois/ (or terminal tools like whois):
```
# Example query for 184.108.40.206 - free proxy IP
NetRange:       220.127.116.11 - 18.104.22.168
CIDR:           22.214.171.124/24, 126.96.36.199/19
NetName:        B2NETSOLUTIONS
NetHandle:      NET-209-127-160-0-1
Parent:         NET209 (NET-209-0-0-0-0)
NetType:        Direct Allocation
OriginAS:
Organization:   B2 Net Solutions Inc. (BNS-34)
RegDate:        2018-01-12
Updated:        2022-02-09
Ref:            https://rdap.arin.net/registry/ip/188.8.131.52

OrgName:        B2 Net Solutions Inc.
OrgId:          BNS-34
Address:        205-1040 South Service Road
City:           Stoney Creek
StateProv:      ON
PostalCode:     L8E 6G3
Country:        CA
RegDate:        2011-10-24
Updated:        2021-09-16
Comment:        https://servermania.com
Ref:            https://rdap.arin.net/registry/entity/BNS-34
...
```
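Since raw WHOIS responses follow a simple "Key: value" line format, a few lines of Python are enough to turn one into a dictionary for further analysis. The snippet below parses a truncated sample of the response shown above:

```python
import re

# Truncated sample of a raw WHOIS response (see full output above)
WHOIS_TEXT = """\
NetName:        B2NETSOLUTIONS
NetType:        Direct Allocation
OrgName:        B2 Net Solutions Inc.
Country:        CA
"""

def parse_whois(text: str) -> dict:
    """Parse "Key: value" lines of a raw WHOIS response into a dict."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"^([A-Za-z]+):\s*(.+)$", line)
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields

metadata = parse_whois(WHOIS_TEXT)
print(metadata["OrgName"])  # B2 Net Solutions Inc.
print(metadata["Country"])  # CA
```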
We can extract a lot of information about this connection from the metadata alone. For example, we can see that the owner is an organization (residential IPs would have a "Person" keyword instead). From the registered name and domain, it appears to be a server hosting company. Using this information, we can guesstimate that the connecting client might be a robot.
The WHOIS database offers raw data that is difficult to follow and parse. For this, we recommend taking a look at IP database aggregators like ipleak.com, which distill this information down to a few important values.
When web scraping, we want to avoid IPs with metadata that might indicate a non-human connection (like IPs owned by a datacenter). Instead, we should aim for residential or mobile IPs, which make the connection appear much more human.
Anti-web-scraping services use these two IP details - the address and its metadata - to generate an initial connection trust score for every client, which is used to determine whether the client is desirable or not.
For example, if you're connecting from your clean home network, the service might start you off at a score of 1 (trustworthy) and let you through effortlessly without requesting a captcha.
On the other hand, if you're connecting from a busy public wifi the score will be a bit lower (e.g. 0.5), which might prompt a small captcha challenge every once in a while.
In the worst case, if you connect from a busy, shared datacenter IP, you'd get a really low score, which can result in repeated captcha challenges or even a complete block.
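The scoring described above can be sketched as a tiny function. To be clear, this is a purely hypothetical illustration - the function name, categories and weights are our own assumptions, not any real vendor's algorithm:

```python
# Hypothetical sketch of how an anti-bot service might derive an
# initial trust score from IP metadata. All weights are illustrative.
def initial_trust_score(ip_type: str, shared: bool = False) -> float:
    base = {
        "residential": 1.0,  # clean home connection - trustworthy
        "mobile": 0.9,       # mobile carrier IPs look very human
        "datacenter": 0.2,   # very likely a robot
    }.get(ip_type, 0.5)      # unknown types get a neutral score
    if shared:
        base *= 0.5          # busy, shared IPs look less trustworthy
    return round(base, 2)

print(initial_trust_score("residential"))        # 1.0 -> let through
print(initial_trust_score("residential", True))  # 0.5 -> occasional captcha
print(initial_trust_score("datacenter", True))   # 0.1 -> captchas or block
```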
So, which IP data points influence this score the most?
First, it's the address itself. All tracking services keep a database of IP connection data, e.g. how many times IP X connected in the past day and so on. The important thing to note here is that this data forms a vast relationship network, so one IP address' score can be affected by its neighbors and relatives.
A prime example of this is the fact that IPs are not sold one by one but in blocks - meaning one bad apple often spoils the bunch. IP addresses are usually sold in /24 blocks of 256 addresses, in other words one subnet (the 3rd IPv4 number). So, if we see multiple unusual connections from addresses like 1.1.1.2, 1.1.1.3 and 1.1.1.4, we can guesstimate that the whole 1.1.1.X block is owned by a single identity. This often results in the whole subnet being blocked or having its trust score reduced.
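The "neighborhood" grouping above is straightforward to reproduce: bucket observed client IPs by their first three numbers (the /24 subnet) and flag crowded buckets. The log values here are made-up illustrations:

```python
from collections import Counter

# Illustrative connection log of observed client IPs
observed_ips = [
    "1.1.1.2", "1.1.1.3", "1.1.1.4", "1.1.1.9",
    "8.8.8.8", "203.0.113.7",
]

# Group by /24 subnet: the first three octets of each address
subnet_counts = Counter(ip.rsplit(".", 1)[0] for ip in observed_ips)
print(subnet_counts.most_common(1))  # [('1.1.1', 4)]

# Flag subnets with suspiciously many distinct clients
suspicious = [net for net, count in subnet_counts.items() if count >= 3]
print(suspicious)  # ['1.1.1']
```

A tracking service applying this logic would lower the trust score for every address in the flagged 1.1.1.X block, not just the ones it observed.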
We can expand the same block ownership idea even further by taking a look at the IP address metadata.
The most common data point for this is the Autonomous System Number (ASN), which is assigned to every registered IP owner. So a few bad apples under one specific ASN can lower the connection score for all of the IPs under that same ASN.
There are various online databases, like bgpview.io, that let you inspect ASNs and the IP ranges assigned to them.
Another metadata point commonly used in calculating trust scores is the IP type itself. While the metadata doesn't explicitly say whether an address is residential, mobile or datacenter, this can be inferred from the ownership details.
So, a datacenter IP gets a lower score just because it's very likely to be a robot, whereas mobile and residential IPs are treated much more fairly.
We've learned a lot about IP fingerprinting - so how do we apply this information in web scraping?
To avoid web scraper blocking, we want to use IPs with a high trust score. So we should avoid IP addresses with weak metadata data points - anything that would indicate a datacenter or an untrustworthy owner.
When scraping at scale, we want to diversify our connections by using a proxy pool of high-trust-score IP addresses. Diversity is key here, as even high-trust addresses can lose their potency during periods of heavy use.
To put it shortly: to get around web scraper blocking, we want a diverse pool of residential or mobile proxies with lots of different subnets, geographical locations and AS numbers.
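A minimal sketch of such rotation is shown below. The proxy addresses are placeholders - a real pool would mix residential and mobile IPs across many subnets, regions and ASNs:

```python
import itertools
import random

# Hypothetical proxy pool - these addresses are placeholders
PROXY_POOL = [
    "http://user:pass@10.0.0.1:8080",
    "http://user:pass@10.0.1.1:8080",
    "http://user:pass@10.0.2.1:8080",
]

def proxy_rotator(pool):
    """Yield proxies in a shuffled round-robin so no single address is hammered."""
    shuffled = random.sample(pool, len(pool))
    return itertools.cycle(shuffled)

rotator = proxy_rotator(PROXY_POOL)
for _ in range(4):
    proxy = next(rotator)
    # e.g. requests.get(url, proxies={"http": proxy, "https": proxy})
    print(proxy)
```

Round-robin over a shuffled pool spreads requests evenly, so no single proxy accumulates enough traffic to drag down its own trust score.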
To make things easy, ScrapFly's API offers a smart proxy system which intelligently selects an IP from a massive 190M+ IP pool for every individual request!