The Internet Protocol (IP) address is the most common way of identifying web scrapers. IP is at the core of every internet exchange, and tracking and analyzing it can reveal a lot about the connecting client.
In web scraping, IP tracking and analysis (aka IP fingerprinting) is often used to throttle and block web scrapers and other undesired visitors. In this article, we'll take a look at what Internet Protocol addresses are and how IP tracking technologies are used to block web scrapers.
IP Address Details
An Internet Protocol address is a simple number-based address that identifies the origin of a connection - it's the backbone of all internet connections. If you're at home, your IP is assigned by your internet service provider - but there's much more to it!
There are two versions of these IP addresses: IPv4 and IPv6.
The key difference is that the IPv4 pool is limited to roughly four billion addresses. This might sound like a lot, but we're almost out of them!
On the other hand, IPv6 has vastly more available addresses but still lacks real-world adoption.
Since most of the web still runs on IPv4 and the number of these addresses is limited, they have essentially become a commodity. This is why IPv4 performs much better when it comes to fingerprinting - simply because an IPv4 address costs more to obtain.
In other words, if a website sees a client connect from an IPv6 address, it automatically lowers the client's trust score because these addresses are much more plentiful.
In this article we'll stick with IPv4 addresses, as scraping over IPv6 isn't really viable yet.
So let's take a look at IPv4 address structure in the context of identification and tracking.
IPv4 addresses are made up of 4 numbers separated by dots:
The first two numbers form the network address, which is distributed to IP holders (like ISPs) more or less arbitrarily, so there's very little valuable information we can extract from them.
The last two numbers are what matter when it comes to IP fingerprinting.
The third number is called the sub-network (subnet) address - essentially an identifier for a group of up to 254 host addresses. In the real world, subnets often map to a geographical region: you and your neighbors most likely share the same subnet address provided by your ISP, with each of you having an individual host address - the last number of the address.
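This structure is easy to explore with Python's standard `ipaddress` module. A minimal sketch, using an address from the documentation-reserved 203.0.113.0/24 range purely for illustration:

```python
from ipaddress import ip_address, ip_network

# A made-up address for illustration; 203.0.113.0/24 is reserved
# for documentation (RFC 5737), so it never routes anywhere real.
addr = "203.0.113.42"

# Split into the 4 numbers (octets)
octets = addr.split(".")
print(octets)  # ['203', '0', '113', '42']

# The third octet identifies the /24 sub-network, the fourth the host
subnet = ip_network(f"{octets[0]}.{octets[1]}.{octets[2]}.0/24")
print(subnet)                # 203.0.113.0/24
print(subnet.num_addresses)  # 256 (254 of which are usable hosts)
print(ip_address(addr) in subnet)  # True
```
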
An IP address by itself provides very little information about the identity of its owner. So, IP meta-information databases are used to provide more context about connecting clients. These databases collect information from public data points (like WHOIS, ARIN and RIPE) and contain loads of metadata such as:
The ISP's metadata, like its name, legal details and AS Number
IP address geographical location
Origin estimates: is it a Proxy IP, VPN or something else?
We can easily query the WHOIS database for raw metadata using their online lookup page: https://www.whois.com/whois/ (or terminal tools like whois)
For example, let's take a look at this random free proxy IP address:
# Example query for 126.96.36.199 - free proxy IP
NetRange: 188.8.131.52 - 184.108.40.206
CIDR: 220.127.116.11/24, 18.104.22.168/19
Parent: NET209 (NET-209-0-0-0-0)
NetType: Direct Allocation
Organization: B2 Net Solutions Inc. (BNS-34)
OrgName: B2 Net Solutions Inc.
Address: 205-1040 South Service Road
City: Stoney Creek
PostalCode: L8E 6G3
We can see how much metadata we got from this public IP database. All of these details could be used to determine the likelihood of this IP being used by a real person versus a program.
For example, we can see the owner is an organization (residential IPs would have a "Person" keyword instead). From the registered name, we can see that it appears to be a server hosting company.
So, this IP address is owned by a server hosting company located in Stoney Creek, Canada - how likely is it that this connection is coming from a human user?
The WHOIS database offers raw data that is difficult to read and parse. For this we recommend taking a look at IP database aggregators like https://ipleak.com, which distill this information down to a few important values.
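If you do want to work with the raw output programmatically, the "Key: Value" line format is simple to parse. A minimal sketch, using a sample that mirrors the fields shown above (real responses vary a lot between registries):

```python
# Sample raw WHOIS output; real responses differ per registry.
raw_whois = """\
NetType:        Direct Allocation
OrgName:        B2 Net Solutions Inc.
City:           Stoney Creek
PostalCode:     L8E 6G3
"""

def parse_whois(text: str) -> dict:
    """Collect 'Key: Value' lines, keeping the first value seen per key."""
    record = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip comments and blank lines
        key, _, value = line.partition(":")
        record.setdefault(key.strip(), value.strip())
    return record

info = parse_whois(raw_whois)
print(info["OrgName"])  # B2 Net Solutions Inc.
```
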
When web scraping, we want to avoid IPs with metadata that might indicate non-human connections (like IPs owned by a datacenter). Instead, we should aim for residential or mobile IPs, which make connections appear much more human.
How Are IPs Tracked?
Anti-scraping services use these two IP details - the address and its metadata - to generate an initial connection trust score for every client, which is used to determine whether the client is desirable or not.
For example, if you're connecting from your clean home network the service might start you off at a score of 1 (trustworthy) and let you through effortlessly without requesting a captcha to be solved.
On the other hand, if you're connecting from busy public wifi the score will be a bit lower (e.g. 0.5), which might prompt a small captcha challenge every once in a while.
In the worst case, if you connect from a busy, shared datacenter IP, you'd get a really low score, which can result in multiple captcha challenges or even a complete block.
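The tiers above can be sketched as a toy scoring function. This is purely illustrative - real anti-bot services combine far more signals, and the score values and thresholds here are assumptions:

```python
# Illustrative only: toy trust scores matching the tiers described above.
# Real services use many more signals; these values are assumptions.
BASE_SCORES = {
    "residential": 1.0,   # clean home connection
    "public-wifi": 0.5,   # busy shared network
    "datacenter": 0.1,    # busy, shared datacenter IP
}

def challenge_for(ip_type: str) -> str:
    """Decide which challenge a client gets based on its trust score."""
    score = BASE_SCORES.get(ip_type, 0.1)  # unknown types treated as worst case
    if score >= 0.9:
        return "allow"
    if score >= 0.4:
        return "captcha"
    return "block"

print(challenge_for("residential"))  # allow
print(challenge_for("public-wifi"))  # captcha
print(challenge_for("datacenter"))   # block
```
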
So, which IP data points influence this score the most?
First, it's the address itself. All tracking services keep a database of IP connection data, e.g. IP X connected N times in the past day and so on. The important thing to note here is that this data forms a vast relationship network, so one IP address' score can be affected by its neighbors and relatives.
A prime example of this is the fact that IPs are not sold one by one but in blocks - meaning one bad apple often spoils the bunch. IP addresses are usually sold in /24 blocks of 256 addresses, in other words 1 subnet (the 3rd IPv4 number). So, if we see multiple unusual connections from addresses like 1.1.1.1, 1.1.1.2 and 1.1.1.3, we can guesstimate that the whole 1.1.1.X block is owned by a single identity. This often results in the whole subnet being blocked or having its trust score reduced.
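This kind of subnet grouping is easy to reproduce with the standard `ipaddress` module. A minimal sketch that counts observed client IPs per /24 block to spot many connections coming from one subnet (the sample addresses are arbitrary):

```python
from collections import Counter
from ipaddress import ip_interface

# Hypothetical observed client IPs - three from the same /24 block
seen_ips = ["1.1.1.1", "1.1.1.2", "1.1.1.3", "8.8.8.8"]

# Map each address to its enclosing /24 network and tally connections
blocks = Counter(str(ip_interface(f"{ip}/24").network) for ip in seen_ips)
print(blocks.most_common(1))  # [('1.1.1.0/24', 3)]
```

A tracker seeing a tally like this could lower the trust score of every address in the 1.1.1.0/24 block at once.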
We can expand the same block ownership idea even further by taking a look at the IP address metadata.
The most common data point for this is the Autonomous System Number (ASN), which is a number assigned to every registered IP owner. So, a few bad apples under one specific ASN can lower the connection score for all of the IPs under the same ASN.
There are various online databases, like bgpview.io, that let you inspect an ASN and the IP ranges assigned to it.
Another metadata point commonly used in calculating trust scores is the IP type itself. While the metadata doesn't explicitly say whether the address is residential, mobile or datacenter, this can be inferred from the ownership details.
So, a datacenter IP would get a lower score just because it's very likely to be a robot, whereas mobile and residential IPs would be treated much more fairly.
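As a rough illustration of such inference, here's a keyword heuristic over the WHOIS organization name. The keyword lists are assumptions for the sketch, not an official classification scheme:

```python
# Assumed keyword lists - a real classifier would use curated ASN data.
DATACENTER_HINTS = ("hosting", "cloud", "server", "datacenter", "solutions")
MOBILE_HINTS = ("mobile", "wireless", "cellular")

def infer_ip_type(org_name: str) -> str:
    """Guess the IP type from the WHOIS organization name."""
    name = org_name.lower()
    if any(word in name for word in DATACENTER_HINTS):
        return "datacenter"
    if any(word in name for word in MOBILE_HINTS):
        return "mobile"
    return "residential"  # default guess, e.g. a consumer ISP

print(infer_ip_type("B2 Net Solutions Inc."))  # datacenter
```
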
IP Address Use in Web Scraping
We've learned a lot about IP fingerprinting - so how do we apply this information in web scraping?
To avoid web scraper blocking, we want to use IPs with high trust scores. In other words, we should avoid IP addresses with weak metadata - anything that would indicate a datacenter origin or an untrustworthy owner.
When scraping at scale, we want to diversify our connections by using a proxy pool of high-trust-score IP addresses. Diversity is key here, as even high trust score addresses can lose their potency when used too heavily over a short period.
To put it shortly: to get around web scraper blocking, we want a diverse pool of residential or mobile proxies with lots of different subnets, geographical locations and AS numbers.
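A minimal rotation sketch for such a pool might look like the following. Every proxy URL here is a placeholder, not a real endpoint, and a production pool would track per-proxy health rather than blindly cycling:

```python
import itertools
import random

# Placeholder proxy endpoints - example.com hosts nothing real.
# A good pool mixes subnets, countries and ASNs.
proxy_pool = [
    "http://us-resi-1.example.com:8000",
    "http://us-resi-2.example.com:8000",
    "http://de-mobile-1.example.com:8000",
]

# Shuffle once, then cycle forever so requests spread evenly
rotation = itertools.cycle(random.sample(proxy_pool, len(proxy_pool)))

def next_proxy() -> dict:
    """Return a proxies mapping in the shape e.g. requests expects."""
    proxy = next(rotation)
    return {"http": proxy, "https": proxy}
```

Each scrape request would then grab `next_proxy()` so consecutive requests leave from different subnets.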
IP Rotation with ScrapFly
To make things easy, ScrapFly's API offers a smart proxy system which intelligently selects an IP from a massive 190M+ IP pool for every individual request!
To wrap this article up, let's take a look at some frequently asked questions about the role IP addresses play in web scraper blocking:
What Proxy Type is Best for Web Scraping?
Residential proxies are the best for web scraping. Residential IPs are owned by trustworthy ASNs (for example, public ISPs), so connections made from these addresses appear much more trustworthy.
Which Geographic Locations are Best for Web Scraping?
The same origin as the scraped target. For example, if we're scraping a website hosted in the US, we should use US-based IP addresses. That said, US and EU IP addresses tend to have higher trust scores in general.
What Makes a Good Proxy Pool for Web Scraping?
Diversity! As we've covered in this article, having a diverse pool of Autonomous System Numbers (ASNs) and subnets will result in the best web scraping performance when it comes to avoiding blocking.
To summarize, web scrapers can be identified through IP address analysis - by inspecting IP metadata such as address type (datacenter or residential), ASN and other unique details. So, to avoid being blocked, web scrapers should use a pool of diverse, high-quality proxy IP addresses.