How to Rotate Proxies in Web Scraping
In this article we explore proxy rotation: how it affects web scraping success and blocking rates, and how we can smartly distribute our traffic through a pool of proxies for the best results.
One of the key challenges when it comes to web scraping in 2023 is scraper blocking, and the most common way to approach this is to use IP proxies.
In web scraping, proxy services can be used to disguise web scraper origin to avoid IP-based blocking or access websites only available in specific countries.
In this article, we'll overview and compare several popular proxy providers from the point of view of web scrapers. We'll also cover how to pick the right provider for your web scraper and what common challenges and issues to expect.
Before we start evaluating popular proxy services, let's do a quick overview of the proxy types used in web scraping:
Datacenter proxies are the simplest form of proxy and are usually hosted on big data center servers. Unfortunately, this also means that it's easy to tell whether the client is a scraper or a real user, as real people rarely browse the web from data centers.
Residential proxies use IP addresses that are assigned to real households, often sourced by renting them from real people. It's much easier to blend in using rotating residential proxies than datacenter proxies, though it's harder to maintain the same IP address for long web scraping sessions.
ISP proxies combine data center stability with residential proxy quality: they are residential IP addresses issued to small data centers.
Mobile proxies use addresses that are issued to mobile cell towers and to each connecting 3G/4G/5G phone. Just like residential proxies, these are great for avoiding blocking but are even less stable.
To determine the best proxy service for scraping, let's establish our evaluation methodology.
Not all proxies for scraping are equal. Even proxies with the same specifications like proxy type (be it datacenter, residential or mobile) can perform very differently in real-life web scraping.
Besides raw test results, there are a few key points worth keeping an eye on when evaluating proxy quality for web scraping. Let's take a brief look at each.
Proxy User Pool Sharing.
Private proxies will yield much better results than shared proxy pools, which often have several users using the same IPs for the same targets. If you think your target is a popular web scraping target, then shared pools should be avoided.
Geographic Location of proxies.
US-based proxies tend to have the best quality rating when it comes to web scraper blocking. So, while some services claim to have thousands of addresses in their pool, most of them might be from low-quality regions that have poor success rates.
For peer-to-peer rotating residential and mobile proxies, a common issue is that the proxies received are not always actually residential or mobile. In our experience, the share of such mismatched proxies can vary from 1% to 40%, so it's important to confirm the IP type (for example, see "Connection type" in ipleak.com results) before using a proxy in your web scraper for optimal results.
Concurrency limit (aka thread limit)
In web scraping, this limit is a frequent source of stability issues. Fast web scrapers can reach it pretty quickly, as it's often lower than advertised and really hard to measure. It's something worth keeping an eye on.
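One way to stay under a provider's concurrency limit is to cap in-flight requests on the scraper side. The sketch below illustrates the idea with `asyncio.Semaphore`; `run_with_limit`, `ConcurrencyProbe` and the example.com URLs are hypothetical names used only for this illustration, and `fake_request` stands in for a real HTTP call made through a proxy.

```python
import asyncio


async def run_with_limit(jobs, limit):
    """Run zero-argument coroutine factories with at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(job):
        # Each job waits here until one of the `limit` slots is free.
        async with semaphore:
            return await job()

    # gather() returns results in the same order as `jobs`.
    return await asyncio.gather(*(guarded(job) for job in jobs))


class ConcurrencyProbe:
    """Hypothetical stand-in for a proxied HTTP client that records peak concurrency."""

    def __init__(self):
        self.active = 0
        self.peak = 0

    async def fake_request(self, url):
        self.active += 1
        self.peak = max(self.peak, self.active)
        await asyncio.sleep(0.01)  # simulates network latency
        self.active -= 1
        return url


def demo(limit=5, total=20):
    probe = ConcurrencyProbe()
    urls = [f"https://example.com/page/{i}" for i in range(total)]
    # Bind each url as a default argument so every job keeps its own value.
    jobs = [lambda url=url: probe.fake_request(url) for url in urls]
    results = asyncio.run(run_with_limit(jobs, limit))
    return results, probe.peak


if __name__ == "__main__":
    results, peak = demo()
    print(f"{len(results)} requests finished, peak concurrency: {peak}")
```

Setting `limit` somewhat below the advertised thread limit leaves headroom, since the real limit is often lower than advertised.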
Finally, since proxy providers usually offer proxies through a single backconnect proxy (a server that distributes proxies to clients), quality, speed and stability can vary greatly with each implementation. This can also make implementing custom, smarter proxy rotation logic more difficult for web scraper developers, which can further reduce the chances of successful connections.
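When a provider exposes individual proxy addresses rather than only a backconnect endpoint, rotation logic can live on the scraper side. The following is a minimal sketch of one possible approach, weighted random selection, and not a description of how any provider or ScrapFly implements rotation. `ProxyRotator`, its method names, the halving and doubling rule, the weight floor and the example addresses are all assumptions made for illustration.

```python
import random


class ProxyRotator:
    """Pick proxies at random, favouring those that succeeded recently."""

    MIN_WEIGHT = 0.05  # assumed floor so failing proxies still get retried occasionally
    MAX_WEIGHT = 1.0

    def __init__(self, proxies, rng=None):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self.weights = {proxy: self.MAX_WEIGHT for proxy in proxies}
        self.rng = rng or random.Random()

    def pick(self):
        """Return one proxy, chosen with probability proportional to its weight."""
        proxies = list(self.weights)
        weights = [self.weights[proxy] for proxy in proxies]
        return self.rng.choices(proxies, weights=weights, k=1)[0]

    def report(self, proxy, ok):
        """Halve the weight after a failure, double it after a success."""
        if ok:
            self.weights[proxy] = min(self.MAX_WEIGHT, self.weights[proxy] * 2)
        else:
            self.weights[proxy] = max(self.MIN_WEIGHT, self.weights[proxy] / 2)


if __name__ == "__main__":
    # Placeholder addresses from the documentation-only 203.0.113.0/24 range.
    rotator = ProxyRotator([
        "http://203.0.113.1:8000",
        "http://203.0.113.2:8000",
        "http://203.0.113.3:8000",
    ])
    proxy = rotator.pick()
    # ... send the request through `proxy`, then report how it went:
    rotator.report(proxy, ok=False)
    print(rotator.weights)
```

A blocked or failing proxy is never removed outright here; it is simply picked less often, so it can recover if the target stops blocking it.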
Proxy services offer very different pricing options - some charge by proxy count, some by bandwidth usage, and some by combining both.
For web scraping, bandwidth proxies can grow the bill really quickly and should be avoided if possible. Let's take a look at some usage scenarios and how bandwidth proxies would scale:
|target||avg HTML document size||documents per 1 GB||avg browser page size||browser pages per 1 GB|
|Walmart.com||16 KB||1k - 60k||1 - 4 MB||200 - 2,000|
|Indeed.com||20 KB||1k - 50k||0.5 - 1 MB||1,000 - 2,000|
|LinkedIn.com||35 KB||300 - 30k||1 - 2 MB||500 - 1,000|
|Airbnb.com||35 KB||30k||0.5 - 4 MB||250 - 2,000|
|Target.com||50 KB||20k||0.5 - 1 MB||1,000 - 2,000|
|Crunchbase.com||50 KB||20k||0.5 - 1 MB||1,000 - 2,000|
|G2.com||100 KB||10k||1 - 2 MB||500 - 2,000|
|Amazon.com||200 KB||5k||2 - 4 MB||250 - 500|
In the table above, we see example bandwidth usage estimations for several popular web scraping targets.
Note that bandwidth used by web scrapers varies wildly based on scraped target and web scraping technique.
For example, reverse engineering a website's behavior and grabbing only the data documents will use significantly less bandwidth than using automated browser solutions like Puppeteer, Selenium or Playwright. So, for browser-based scraping, bandwidth-based pricing can be very expensive.
Finally, all estimations should be at least doubled to account for retry logic and other usage overhead (like session warm-up and request headers).
For example, let's say we have a $400/Mo plan that gives us 20GB of data. That would only net us ~50k Amazon product scrapes at best, and only a few thousand if we use a web browser with no special caching or optimization techniques.
Conversely, bandwidth proxies can work well with web scrapers that take advantage of AJAX/XHR requests.
For example, the same $400/Mo plan of 20GB data would yield us ~600k Walmart.com product scrapes if we can reverse engineer Walmart's web page behavior, which is a much more reasonable proposition!
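The arithmetic behind these estimates can be sketched in a few lines. The helper names below are hypothetical, the page sizes come from the table above, and the sketch assumes decimal units (1 GB = 1,000,000 KB) and the 2x overhead factor suggested earlier.

```python
def pages_per_plan(plan_gb, avg_page_kb, overhead_factor=2.0):
    """Estimate how many pages a bandwidth allowance covers.

    overhead_factor accounts for retries, session warm-up and headers;
    2.0 means every page is assumed to cost twice its raw size.
    """
    if avg_page_kb <= 0 or overhead_factor <= 0:
        raise ValueError("avg_page_kb and overhead_factor must be positive")
    total_kb = plan_gb * 1_000_000  # decimal units: 1 GB = 1,000,000 KB
    return int(total_kb / (avg_page_kb * overhead_factor))


def cost_per_1000_pages(plan_price, plan_gb, avg_page_kb, overhead_factor=2.0):
    """Plan price spread over the estimated page count, per 1,000 pages."""
    pages = pages_per_plan(plan_gb, avg_page_kb, overhead_factor)
    return plan_price * 1000 / pages


if __name__ == "__main__":
    # $400/Mo plan with 20GB, page sizes taken from the table above.
    for target, size_kb in [("Amazon.com", 200), ("Walmart.com", 16)]:
        pages = pages_per_plan(20, size_kb)
        price = cost_per_1000_pages(400, 20, size_kb)
        print(f"{target}: ~{pages:,} pages, ${price:.2f} per 1,000 pages")
```

With these inputs the Amazon estimate comes out at 50,000 pages and the Walmart estimate at 625,000, in line with the ~50k and ~600k figures above.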
Bandwidth-based proxies usually give access to big proxy pools, but it's very rare for web scrapers to need more than 100-1000 proxies per project. For example, if each proxy handles 30 requests/minute and we want to scrape a website at 5000 requests/minute, we only require 167 rotating proxies!
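The pool size estimate is simply the target request rate divided by the per-proxy rate, rounded up. A minimal sketch follows; `proxies_needed` is a hypothetical helper name, and the 30 requests per minute per proxy figure is the example rate from the text, not a universal safe limit.

```python
import math


def proxies_needed(target_rpm, per_proxy_rpm):
    """Smallest pool size that sustains target_rpm without any single
    proxy exceeding per_proxy_rpm requests per minute."""
    if target_rpm < 0 or per_proxy_rpm <= 0:
        raise ValueError("target_rpm must be >= 0 and per_proxy_rpm must be > 0")
    return math.ceil(target_rpm / per_proxy_rpm)


if __name__ == "__main__":
    print(proxies_needed(5000, 30))  # 5000 / 30 = 166.67, rounded up to 167
```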
Proxy count based pricing is often a much safer and easier pricing model to work with. Buying a starter pool of private proxies (only accessible to a single client or very small pool of clients) is an easier and safer commitment for web scraping projects.
In this article, we'll be evaluating proxy providers from the point of view of ScrapFly's very own web scraping proxy-like service. We'll cover the most important features used in web scraping, so our full evaluation table will look like this:
|Anti Bot Bypass||✅|
|Price per GB||$1-25|
|50GB Project Estimated cost||$350/Mo|
Here we're evaluating proxy types (datacenter, residential and mobile), proxy features such as geo targeting and anti-bot bypass, and some analytical examples like price per gigabyte of bandwidth and the estimated cost of an average 50GB web scraper.
Since we're evaluating from the point of view of a ScrapFly user, let's take a look at what makes ScrapFly so special!
At ScrapFly we realize how complicated proxies are in web scraping, so we made it our goal to simplify the process while also keeping the service accessible.
ScrapFly offers a request middleware service, which ensures that outgoing requests result in successful responses. This is done by a combination of unique ScrapFly features such as a smart proxy selection algorithm, anti web scraping protection solver and browser based rendering.
ScrapFly uses a credit-based pricing model, which is much easier to predict and scale than bandwidth or proxy count based pricing. This allows flexible pricing based on the features used rather than arbitrary measurements such as bandwidth, meaning our users aren't locked in to a single solution and can adjust their scrapers on the fly!
For example, the most popular $100/Mo tier can yield up to 1,000,000 target responses based on enabled features:
To explore these and other offered features see our full documentation!
Let's see how ScrapFly would look on our evaluation table:
|Geo Targeting||54 countries|
|Anti Bot Bypass||✅|
|Price per GB||per request|
|50GB Project Estimated cost||$100/Mo|
Webshare.io is one of the biggest general proxy providers. They offer a variety of services:
It is primarily known for offering unlimited bandwidth datacenter proxies. These proxies can be great for bandwidth-intensive web scrapers that use headless browsers or download heavy files.
However, datacenter proxies will not help with avoiding web scraper blocking, and Webshare's residential proxy plans are very much in line with the industry average, starting at $18.75/Mo per GB.
This puts our 50GB project scraper estimation at $480/Mo. However, because of the unlimited bandwidth datacenter proxies, Webshare is still a very attractive option for some web scraping niches.
Let's see how this would look on our evaluation table:
|Geo Targeting||1-25 countries|
|Anti Bot Bypass||❌|
|Price per GB||$1-25|
|50GB Project Estimated cost||$480|
Netnut.io is another big proxy provider, which offers a variety of services:
It's primarily known for its mobile proxies, which are the best proxy type when it comes to avoiding blocking and grabbing internet deals (like sneaker sales). This option is quite expensive, starting at $30/Mo per GB, and is not recommended for most web scraping projects.
Another popular feature is vast geo targeting, as Netnut offers residential proxies from over 150 countries. This is great for broad web crawling projects that need to reach niche areas of the world.
However, Netnut's residential proxy offer is a bit more expensive than the industry average, starting at $20/Mo. This puts our 50GB project at $600/Mo.
Let's see how Netnut looks on our evaluation table:
|Datacenter Proxies (50k)||✅|
|Residential Proxies (10-20M)||✅|
|Geo Targeting||150 countries|
|Anti Bot Bypass||❌|
|Price per GB||$1-30|
|50GB Project Estimated cost||$600|
Soax.com is another big name in the proxy world. Soax offers a very streamlined variety of services:
Soax is primarily known for its competitive prices. Its residential proxies are quite cheap, starting at $12/Mo per GB. This puts our 50GB project at $500/Mo with 5GB to spare.
Soax's mobile proxies are in line with the industry average starting at $30/Mo per GB.
|Residential Proxies (5M)||✅|
|Geo Targeting||100+ countries|
|Anti Bot Bypass||❌|
|Price per GB||$12 - 30|
|50GB Project Estimated cost||$500|
Geosurf.com is another bandwidth tier-based residential proxy provider that has been in the proxy industry for over 10 years. It doesn't offer any particular breakthroughs: it's a very similar offering to that of Soax.com, though it seems to be aimed more at enterprise-level users, with a higher minimum commitment but slightly better value. Let's see how it looks on our evaluation table:
|Residential Proxies (2.5M)||✅|
|Geo Targeting||135 countries + 1700 cities|
|Anti Bot Bypass||❌|
|Price per GB||$8 - 15|
|50GB Project Estimated cost||$544|
Unfortunately, Geosurf suffers from issues similar to those of Soax.com, making it a difficult choice for low and mid tier projects. However, Geosurf does offer unlimited* concurrency and proxy selection by city, which can come in handy for some niche web scrapers.
|Feature||ScrapFly||Webshare||Netnut||Soax||Geosurf|
|Datacenter Proxies||3.4M||on demand||50k shared||❌||❌|
|Residential Proxies||190M||on demand||10-20M||5M||2.5M|
|Geo Targeting (Countries)||54||1-25||150||100||135|
|Anti Bot Bypass||✅||❌||❌||❌||❌|
|Price per GB||per request||$1-25||$1-17.5||$12 - 33||$8 - 15|
|Minimum Commitment (Monthly)||$15||$15||$20||$99||$300|
|50GB Project Estimated cost||$100||$480||$600||$500||$544|
When it comes to web scraping, a classic proxy service is a tough sell. Even with the recent advances in proxy quality, these services still fall short compared to dedicated web scraping APIs, which can apply additional, smart connection strategies to prevent captchas, blocking or throttling.