One of the most difficult problems in web scraping is scaling, and the most important tools in scaling web scrapers are proxies!
Having a set of quality proxies can prevent our web scraper from being blocked or throttled, meaning we can scrape faster and spend less time maintaining our scrapers. So what makes a quality proxy for web scraping and what type of proxies are there?
In this introduction article, we'll take a look at what exactly a proxy is, what types of proxies there are, how they compare against each other, what common challenges proxy usage poses, and what the best practices in web scraping are.
What's a Proxy?
A proxy is essentially a middleman server that sits between the client and the server. Proxies have many uses, like optimizing connection routes, but in web scraping they are most commonly used to disguise the client's IP address (identity).
This disguise can be used to access geographically locked content (e.g. websites only available in a specific country) or to distribute traffic through multiple identities.
In web scraping we often use proxies to avoid being blocked as numerous connections from a single identity can be easily identified as non-human connections.
To further understand this, let's learn a bit about IP addresses and proxy types.
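For example, here's a minimal sketch of routing a request through a proxy in Python (the proxy address below is a placeholder - swap in a real proxy endpoint):

```python
import requests

# a placeholder proxy address - replace with a real proxy endpoint
proxies = {
    "http": "http://203.0.113.10:8080",
    "https": "http://203.0.113.10:8080",
}

# the request travels through the proxy, so the target server
# sees the proxy's IP address rather than ours
response = requests.get("https://httpbin.dev/ip", proxies=proxies)
print(response.text)  # the IP address the target server saw
```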
IP Protocol Versions
Currently, the internet runs on two types of IP addresses: IPv4 and IPv6.
The key differences between these two protocols are:
Address quantity: the IPv4 address pool is limited to around 4 billion addresses, which might seem like a lot, but the internet is a big place and we have technically run out of free addresses already! (see IPv4 address exhaustion)
Adoption: Most websites still only support IPv4 connections, meaning we can't use IPv6 proxies unless we explicitly know our target website supports it.
What does this mean for web scraping?
Since IPv6 is supported by very few websites, we are still limited to using IPv4 proxies, which are more expensive (3-10 times on average) because of the limited address pool.
That being said, some major websites do support IPv6 (which can be checked on various IPv6 accessibility test tools) which can greatly reduce your proxy budget!
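A quick way to check is to look up whether the target publishes IPv6 (AAAA) DNS records - a rough signal only, since a published record doesn't guarantee the site actually serves traffic over IPv6:

```python
import socket

def has_ipv6_records(hostname: str) -> bool:
    """Check whether a hostname resolves to any IPv6 address."""
    try:
        socket.getaddrinfo(hostname, 443, socket.AF_INET6)
        return True
    except socket.gaierror:
        return False

print(has_ipv6_records("google.com"))  # True - many major sites support IPv6
print(has_ipv6_records("github.com"))  # False at the time of writing
```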
Proxy Protocols
There are two major proxy protocols in use these days: HTTP and SOCKS (the latest version being SOCKS5).
In web scraping, there isn't much practical difference between these two protocols. The SOCKS protocol tends to be a bit faster, more stable and more secure, while HTTP proxies are more widely adopted by proxy providers and by the HTTP client libraries used for web scraping.
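In practice, switching between the two is just a matter of the proxy URL scheme. A quick sketch with Python's requests (addresses are placeholders; SOCKS support needs the extra `requests[socks]` dependency):

```python
import requests  # SOCKS support requires: pip install requests[socks]

http_proxy = {
    "http": "http://203.0.113.10:8080",
    "https": "http://203.0.113.10:8080",
}
socks_proxy = {
    "http": "socks5h://203.0.113.10:1080",
    "https": "socks5h://203.0.113.10:1080",
}

# both are used identically - only the URL scheme differs;
# the "h" in socks5h means DNS is also resolved on the proxy side
requests.get("https://httpbin.dev/ip", proxies=http_proxy)
requests.get("https://httpbin.dev/ip", proxies=socks_proxy)
```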
Proxy Types
There are 4 types of proxy IPs that are used in web scraping:
Datacenter
Residential
Static Residential (aka ISP)
Mobile
The key differences between these 4 types are price, reliability (connection speed, IP rotation etc.) and stealth score (the likelihood of being blocked).
Let's take a deeper look into each type, and its value in web scraping.
Datacenter Proxies
Datacenter IPs are commercially assigned to servers and are not affiliated with internet service providers (ISPs), so they are often flagged as high-risk, likely-bot connections. Typically, these IPs are also shared between many users, further increasing the flagging risk.
On the bright side, datacenter proxies are widely accessible, reliable and cheap! We recommend datacenter proxies for teams with stronger engineering resources, as engineering time is needed to reverse-engineer scraping targets and to design smart proxy rotation algorithms.
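To illustrate what the simplest form of rotation looks like, here's a minimal round-robin sketch assuming a hypothetical pool of datacenter proxy addresses (real rotation algorithms also weigh in proxy health, subnets and per-target block history):

```python
from itertools import cycle

import requests

# a hypothetical pool of datacenter proxies
proxy_pool = cycle([
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
])

def get(url: str) -> requests.Response:
    """Route each request through the next proxy in the pool."""
    proxy = next(proxy_pool)
    return requests.get(url, proxies={"http": proxy, "https": proxy})

# consecutive requests exit through different IP addresses
for _ in range(3):
    print(get("https://httpbin.dev/ip").text)
```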
Residential Proxies
Residential IPs are assigned by ISPs, meaning they are at lower risk of being flagged: they are attached to a real address and are wrapped in a stricter legal framework. This makes them great for web scraping, as they're the same IPs real humans use!
Unfortunately, residential IPs are much pricier than the datacenter ones.
Additionally, these proxies can struggle to keep the same IP for long periods, which is why they're often referred to as "Rotating Residential Proxies". This can be problematic for targets that require the same IP throughout a connection session. For example, if we're scraping a long booking process of an airline, the proxy might expire by the time we reach the last step, fumbling the whole scrape.
Constant session loss forces the web scraper to re-authenticate repeatedly, causing friction in the scraping process, so it's best to look for residential proxies that can sustain long sessions.
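Many residential providers pin an IP to a "sticky session" through a session id embedded in the proxy credentials. The exact syntax varies per provider, so the format below is purely hypothetical - check your provider's documentation:

```python
import uuid

import requests

# hypothetical sticky-session format - real syntax differs per provider
session_id = uuid.uuid4().hex[:8]
proxy = f"http://username-session-{session_id}:password@proxy.example.com:8000"
proxies = {"http": proxy, "https": proxy}

with requests.Session() as session:
    # every step of a multi-page flow reuses the same session id,
    # so the provider keeps routing us through the same residential IP
    session.get("https://example.com/booking/step-1", proxies=proxies)
    session.get("https://example.com/booking/step-2", proxies=proxies)
```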
Residential proxies are great for teams that have limited engineering resources as they have high stealth scores and are relatively affordable.
Static Residential / ISP Proxies
Residential IPs have a great stealth score but are unreliable as they aren't powered by a strong datacenter infrastructure. What if we combine the best of both worlds: the reliability of the datacenter proxies and the stealth of the residential proxies?
ISP proxies (aka Static Residential proxies) are datacenter proxies registered under an ISP, meaning they get most of the stealth benefits of a residential proxy and the persistence/network quality of a datacenter proxy!
We recommend ISP proxies for web scrapers that need to maintain an IP-based session for long periods while avoiding captchas and anti-bot systems.
Mobile Proxies
Mobile IPs are assigned by mobile service providers (think 4G etc.), and since they are assigned dynamically to whoever is around the cell tower, they are not tied to a single individual. This means they are at really low risk of being blocked or forced through a captcha.
Mobile proxies are just more extreme versions of residential proxies: maintaining the same IP might be harder, and they are even more expensive. These proxies also tend to be somewhat slower and less reliable though modern providers have been making great improvements as of late.
Mobile proxies are amazing for teams with low engineering resources, as their origin alone solves most connection blocking!
As you can see, a clear pattern emerges: the more complex and rare the IP is, the harder it is to identify, but the more it costs. That complexity also decreases a proxy's reliability.
So which one to choose?
In short - it all depends on your target and your project's resources.
Datacenter proxies are great for getting around simple rate limiting and as a general safety net.
Residential proxies greatly reduce the chance of captchas and of being caught by anti-web-scraping protection services, and mobile proxies take this even further.
We usually recommend starting with a sizable pool of datacenter proxies, as they are significantly cheaper and more reliable, and evaluating from there as the project grows. Keep in mind, however, that datacenter proxies are easily caught by anti-scraping protection systems, as they are really easy to identify.
Alternative Proxies
We've covered the 4 main types of proxies, but the internet is a clever place and there are other, lesser-known ways to mask your IP address.
Probably the most popular alternative is using Virtual Private Network (VPN) services as proxies. VPNs are essentially proxies with a more complex/stronger tunneling protocol.
Since a single VPN exit node is shared by many users (like mobile proxies), this can be advantageous: other users can raise the IP's stealth score by solving captchas and browsing around like human beings. On the other hand, the opposite can happen, and the exit IP might be completely polluted by other power users.
To summarize: the VPN approach is very unstable, and accessibility varies heavily by VPN provider. Not many providers offer HTTP/SOCKS5 proxy access to their VPN servers; however, with a bit of technical know-how, VPN servers can also be used as proxies for casual web scraping projects.
Another alternative proxy type is The Onion Router (TOR) network. TOR is a privacy layer protocol where many servers bounce traffic around to mask the client's origin.
The main downside of using the TOR network is that it's a volunteer-driven network with limited, publicly known exit nodes, meaning it has a very low stealth score. Additionally, because of the protocol's complexity and the limited number of volunteer exit nodes, TOR connections are very slow and often unreliable.
TOR can be used for web scraping with varying results; however, we would not recommend it for anything other than educational purposes.
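For the curious, here's a minimal sketch assuming a TOR client is running locally and exposing its default SOCKS port (9050):

```python
import requests  # SOCKS support requires: pip install requests[socks]

# a local TOR client exposes a SOCKS5 proxy on port 9050 by default;
# socks5h ensures DNS queries also travel through the TOR network
tor_proxy = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}

response = requests.get("https://httpbin.dev/ip", proxies=tor_proxy)
print(response.text)  # prints the TOR exit node's IP, not ours
```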
Bandwidth Budget
When shopping around for scraping proxies the first thing we'll notice is that most proxies are priced by proxy count and bandwidth. Bandwidth can quickly become a huge budget sink for some web scraping scenarios, so it's important to evaluate bandwidth consumption before choosing a scraping proxy provider.
It's easy to overlook bandwidth usage and end up with a huge proxy bill, so let's take a look at some examples:
| target | avg document page size | pages per 1GB (document) | avg browser page size | pages per 1GB (browser) |
| --- | --- | --- | --- | --- |
| Walmart.com | 16 kB | 1k - 60k | 1 - 4 MB | 200 - 2,000 |
| Indeed.com | 20 kB | 1k - 50k | 0.5 - 1 MB | 1,000 - 2,000 |
| LinkedIn.com | 35 kB | 300 - 30k | 1 - 2 MB | 500 - 1,000 |
| Airbnb.com | 35 kB | 30k | 0.5 - 4 MB | 250 - 2,000 |
| Target.com | 50 kB | 20k | 0.5 - 1 MB | 1,000 - 2,000 |
| Crunchbase.com | 50 kB | 20k | 0.5 - 1 MB | 1,000 - 2,000 |
| G2.com | 100 kB | 10k | 1 - 2 MB | 500 - 2,000 |
| Amazon.com | 200 kB | 5k | 2 - 4 MB | 250 - 500 |
In the table above we see the average bandwidth usage of various targets. If we look closely, some patterns emerge: big, heavy HTML websites (like amazon.com) use a lot of bandwidth compared to dynamic websites that populate their pages via background resource requests (like walmart.com).
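Before committing to a provider, it's worth measuring your target's page weight yourself. A rough sketch (it only counts the HTML document body, ignoring headers and background resources):

```python
import requests

response = requests.get("https://www.walmart.com/")
page_size = len(response.content)  # HTML document size in bytes

pages_per_gb = 1_000_000_000 // page_size
print(f"~{page_size / 1024:.0f} kB per page -> ~{pages_per_gb:,} pages per 1GB")
```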
Another example of a bandwidth sink is using browser automation tools like Puppeteer, Selenium or Playwright. Since web browsers are less precise in their connections, they often download a lot of unnecessary data like images, fonts and so on. Because of this, it's essential to configure browser automation setups with resource blocking and proper caching rules to prevent bandwidth overhead - but generally, expect browser traffic to be much more expensive bandwidth-wise.
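For example, here's one way resource blocking might look in Playwright, intercepting requests and aborting resource types we don't need (the blocked set is just an illustrative starting point):

```python
# pip install playwright && playwright install chromium
from playwright.sync_api import sync_playwright

BLOCKED_RESOURCE_TYPES = {"image", "media", "font", "stylesheet"}

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    # intercept every request and abort the ones we don't need
    page.route(
        "**/*",
        lambda route: route.abort()
        if route.request.resource_type in BLOCKED_RESOURCE_TYPES
        else route.continue_(),
    )
    page.goto("https://www.walmart.com/")
    print(len(page.content()))  # the HTML still loads, minus the heavy extras
    browser.close()
```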
Common Proxy Challenges
Having a middleman between the client and the server can introduce a lot of issues.
Probably the biggest one is support for HTTP2/3 traffic. The newer HTTP protocols are typically preferred in web scraping to avoid blocking, but lots of HTTP proxies struggle with this sort of traffic, so when choosing a proxy provider for web scraping we advise testing HTTP2 support first!
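One quick way to test this is to make an HTTP2-enabled request through the proxy and inspect the negotiated protocol version, e.g. with Python's httpx (the proxy address is a placeholder; the `proxy=` argument requires httpx 0.26+):

```python
# pip install httpx[http2]
import httpx

# placeholder proxy address - swap in your provider's endpoint
with httpx.Client(http2=True, proxy="http://203.0.113.10:8080") as client:
    response = client.get("https://httpbin.dev/ip")
    # "HTTP/2" means the proxy passed HTTP2 through intact;
    # "HTTP/1.1" means it silently downgraded the connection
    print(response.http_version)
```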
Another common proxy provider issue is connection concurrency. Proxy services typically limit concurrent proxy connections, and that limit might be too small for powerful web scrapers. Because of this, we advise researching the concurrent connection limit and throttling scrapers a bit below it to prevent proxy-related connection crashes.
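A simple way to enforce such a ceiling is a semaphore around the request call - a sketch assuming, say, a 50-connection proxy plan throttled down to 40 (the proxy address is a placeholder):

```python
import asyncio

import httpx

# assume the proxy plan allows 50 concurrent connections -
# we throttle a bit below that to stay on the safe side
semaphore = asyncio.Semaphore(40)

async def fetch(client: httpx.AsyncClient, url: str) -> httpx.Response:
    async with semaphore:  # at most 40 requests in flight at once
        return await client.get(url)

async def scrape(urls: list[str]) -> list[httpx.Response]:
    async with httpx.AsyncClient(proxy="http://203.0.113.10:8080") as client:
        return await asyncio.gather(*(fetch(client, url) for url in urls))

# asyncio.run(scrape(["https://httpbin.dev/ip"] * 100))
```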
Finally, proxies introduce a lot of additional complexity to a web scraping project, so when proxies are used we recommend investing additional engineering effort in retry/error-handling logic.
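As a starting point, a retry wrapper that switches to a different proxy on every failed attempt might look like this (proxy addresses are placeholders):

```python
import requests

# a hypothetical proxy pool
PROXIES = ["http://203.0.113.10:8080", "http://203.0.113.11:8080"]

def get_with_retries(url: str, retries: int = 3) -> requests.Response:
    """Retry failed requests, switching proxies on every attempt."""
    last_error: Exception | None = None
    for attempt in range(retries):
        proxy = PROXIES[attempt % len(PROXIES)]
        try:
            response = requests.get(
                url, proxies={"http": proxy, "https": proxy}, timeout=10
            )
            if response.ok:
                return response
            last_error = RuntimeError(f"got status {response.status_code}")
        except requests.RequestException as error:
            last_error = error  # proxy might be dead or throttled - try the next
    raise RuntimeError(f"all {retries} attempts failed for {url}") from last_error
```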
FAQ
To wrap this introduction up let's take a look at some frequently asked questions about proxies in web scraping:
Can free proxies be used in web scraping?
Yes, but with few benefits. Free scraping proxies are easy to identify and perform very poorly, so we would only recommend free proxy lists for low-demand web scraping by teams with a lot of engineering resources to keep track of free proxy availability.
Are scraping proxies banned forever?
Usually, banned proxies recover within minutes, hours or days. Permanent bans for web scraping are very unlikely, though some proxy providers are banned wholesale by various anti-scraping protection services.
Why use proxies in web scraping at all?
Proxies in web scraping are used to avoid scraper blocking or to access geographically restricted content. For more on how proxies are used in web scraper blocking see How to Avoid Web Scraper IP Blocking?
Proxies at ScrapFly
At ScrapFly we realize how complicated proxies are in web scraping, so we made it our goal to simplify the process while also keeping the service accessible.
ScrapFly feels like a proxy but does much more!
ScrapFly offers a request middleware service, which ensures that outgoing requests result in successful responses. This is done by a combination of unique ScrapFly features such as a smart proxy selection algorithm, anti-web scraping protection solver and browser-based rendering.
ScrapFly uses a credit-based pricing model, which is much easier to predict and scale than bandwidth or proxy-count based pricing. This allows flexible pricing based on used features rather than arbitrary measurements such as bandwidth, meaning our users aren't locked into a single solution and can adjust their scrapers on the fly!
For example, the most popular $100/Mo tier can yield up to 1,000,000 target responses based on enabled features:
All ScrapFly HTTP1 requests are automatically converted to HTTP2 requests, which are significantly less likely to be blocked.
ScrapFly offers a smart Anti Scraping Protection (ASP) solution, which solves various captchas and scraping protection blockers if they appear during the scraping process. What's great about the ASP service is that the user is only charged 5 credits for successful solutions, meaning it can be applied to every request worry-free!
ScrapFly offers browser-based rendering, which further reduces the chance of being blocked, as real web browsers are much less likely to be blocked than HTTP clients. Browser-based rendering also greatly simplifies the web scraping process, as it reduces the engineering effort needed to understand the scraped website - your requests will return the same data users see in their web browsers!
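For a taste of how this looks from code, here's a small sketch using the ScrapFly Python SDK as of the time of writing (see the official docs for the current API):

```python
# pip install scrapfly-sdk
from scrapfly import ScrapflyClient, ScrapeConfig

client = ScrapflyClient(key="YOUR_SCRAPFLY_API_KEY")
result = client.scrape(ScrapeConfig(
    url="https://example.com/",
    asp=True,        # enable the Anti Scraping Protection solver
    render_js=True,  # enable browser-based rendering
))
print(result.content)  # rendered page HTML, as a browser user would see it
```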
Summary
In this introduction article, we've learned a lot about proxies. We compared the IPv4 and IPv6 internet protocols and the HTTP and SOCKS proxy protocols. Then we took a deep look into proxy types - datacenter, residential, static residential (ISP) and mobile - and how they differ in web scraper blocking. Finally, we wrapped everything up by looking at common proxy challenges like bandwidth-based pricing, HTTP2 support and proxy stability issues.
Next up in this series, we'll explore proxy rotation: how it affects web scraping success and blocking rates, and how we can smartly distribute our traffic through a pool of proxies for the best results.