One of the most common challenges in web scraping is scaling, and proxies are crucial for it! A pool of quality proxies can prevent web scraping blocking. But what makes a quality proxy for web scraping, and what different proxy types are there?
In this guide, we'll take an extensive look at using proxies for web scraping. We'll explain the different types of proxies, how they compare, the challenges they introduce, and the best practices for using them in web scraping. Let's get started!
What Is a Proxy?
A proxy server is an intermediary that sits between a client and a host. Proxies have many uses, such as connection optimization, but the most common use of web scraping proxies is masking or hiding the client's IP address.
This IP masking is beneficial for two main purposes:
Accessing geographically blocked websites by changing the IP location.
Splitting the requests' traffic across multiple IP addresses.
In the web scraping context, proxy servers are used to prevent IP address blocking, as a high number of requests sent from the same IP address can cause the connection to be identified as non-human.
To further explore the usage of proxies for web scraping, let's have a look at the IP address types.
IP Protocol Versions
Currently, the internet runs on two types of IP addresses: IPv4 and IPv6. The key differences between these two protocols are the following:
Address quantity
The IPv4 address pool is limited to around 4 billion addresses. This might seem like a lot, but the internet is a big place, and technically, we ran out of free addresses already! (see IPv4 address exhaustion)
Adoption
Most websites still only accept IPv4 connections, meaning we can't use IPv6 proxies unless we explicitly know the target website supports them.
How does the IP address type affect web scraping?
Since IPv6 is supported by very few target websites, we are mostly limited to IPv4 proxy servers, which are more expensive (3-10 times on average) because of the limited address supply. That being said, some major websites do support IPv6, which can be checked with various IPv6 accessibility test tools. So, if your target website supports IPv6, the web scraping proxy pool budget can be significantly reduced!
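For a quick first-pass check, we can look up a domain's AAAA (IPv6) DNS records with Python's standard library. Note that an AAAA record only indicates the host advertises IPv6; the website may still handle IPv6 traffic poorly, so a full accessibility test is still recommended:

```python
# Check whether a target domain publishes IPv6 (AAAA) DNS records.
import socket

def supports_ipv6(host: str) -> bool:
    try:
        # getaddrinfo returns a non-empty list if the host resolves over IPv6
        return bool(socket.getaddrinfo(host, 443, socket.AF_INET6))
    except socket.gaierror:
        # no AAAA records (or DNS failure) - treat as no IPv6 support
        return False

print(supports_ipv6("google.com"))  # Google is known to serve IPv6
```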
Proxy Protocols
There are two major proxy protocols in use today: HTTP and SOCKS (the latest being SOCKS5). In the context of web scraping proxies, there isn't much practical difference between the two. SOCKS proxies tend to be a bit faster, more stable, and more secure, while HTTP proxies are more widely supported by both web scraping proxy providers and HTTP client libraries.
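To illustrate, here's a minimal sketch of using both protocols with Python's httpx (assuming a recent httpx version with the `proxy` argument, plus the `httpx[socks]` extra for SOCKS support). The proxy URLs are placeholders:

```python
# Fetching through an HTTP proxy vs a SOCKS5 proxy with httpx.
import httpx

# placeholder proxy addresses - substitute your provider's endpoints
http_proxy = "http://user:pass@proxy.example.com:8080"
socks_proxy = "socks5://user:pass@proxy.example.com:1080"

for proxy_url in (http_proxy, socks_proxy):
    with httpx.Client(proxy=proxy_url) as client:
        response = client.get("https://httpbin.dev/ip")
        print(proxy_url.split("://")[0], response.json())  # exit IP per protocol
```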
Proxy Types
The proxy type is the most important aspect when choosing a web scraping proxy provider or creating a proxy pool. There are four types of proxy IP addresses:
Datacenter
Residential
Static Residential (aka ISP)
Mobile
The key differences between the above proxy server types are the following:
Price
Reliability, such as connection speed and automatic proxy rotation
Stealth, i.e. the likelihood of getting blocked, which is lowest for highly anonymous proxies
Let's take a deeper look into the value and details of each web scraping proxy type.
Datacenter Proxies
Datacenter IPs are commercially assigned to proxy services through cloud servers and aren't affiliated with internet service providers (ISPs). This web scraping proxy type is often flagged as high-risk (with a high chance of being automated). Datacenter proxies can be dedicated or shared between multiple users; shared proxies carry a higher flagging risk.
On the bright side, datacenter proxies are widely accessible, reliable, and cheap! A proxy pool of this type is recommended for teams with the engineering resources to reverse engineer their target websites, for example by building a proxy manager that rotates proxies based on the blocking rate.
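For illustration, here's a minimal sketch of random proxy rotation in Python with httpx. The proxy addresses are placeholders, and a production proxy manager would also track per-proxy block rates:

```python
# A minimal rotating datacenter proxy pool.
import random
import httpx

PROXY_POOL = [
    "http://user:pass@dc-proxy-1.example.com:8000",
    "http://user:pass@dc-proxy-2.example.com:8000",
    "http://user:pass@dc-proxy-3.example.com:8000",
]

def fetch(url: str) -> httpx.Response:
    # pick a random proxy per request to spread traffic across IPs
    proxy_url = random.choice(PROXY_POOL)
    with httpx.Client(proxy=proxy_url) as client:
        return client.get(url)

print(fetch("https://httpbin.dev/ip").json())  # shows the exit IP used
```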
Residential Proxies
Residential IPs are assigned by ISPs to home networks, giving them a lower risk of being flagged. Residential IPs make reliable web scraping proxies as they are used by real humans!
That being said, proxy services with residential IP addresses are much pricier than datacenter ones. Additionally, this proxy type can have session persistence issues, as maintaining the same IP address for long periods is difficult. Hence, they are often referred to as "rotating residential proxies".
This makes residential proxies problematic for target websites that require the same IP address to be maintained for the whole connection session. For example, if we are scraping data that appears at the end of a long multi-step process, the proxy manager might change the IP address before we reach the end.
A proxy service with residential IPs requires minimal engineering effort, as the IPs have a high trust score while remaining relatively affordable.
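Many residential providers work around the rotation issue by offering "sticky" sessions pinned through the proxy credentials. The username format below is purely hypothetical; check your provider's documentation for the actual convention:

```python
# Pinning a sticky residential session via the proxy username (hypothetical format).
import uuid
import httpx

session_id = uuid.uuid4().hex[:8]
# hypothetical provider convention: append a session ID to the username
proxy_url = f"http://user-session-{session_id}:pass@residential.example.com:8000"

with httpx.Client(proxy=proxy_url) as client:
    # both requests should exit through the same residential IP
    print(client.get("https://httpbin.dev/ip").json())
    print(client.get("https://httpbin.dev/ip").json())
```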
Static Residential / ISP Proxies
Residential IPs have a great trust score but are less reliable, as they aren't backed by datacenter infrastructure. What if we could combine the best of both worlds: the reliability of datacenter proxies and the stealth of residential proxies?
ISP proxies, also known as "static residential" proxies, are a hybrid of residential and datacenter proxies. They combine the high trust score of residential IPs with the network quality of datacenter infrastructure!
Static residential proxies are well suited for web scraping, as they benefit from both a high trust score and persistent connection sessions.
Mobile Proxies
Mobile IPs are assigned by mobile network towers and have dynamic IP addresses that get rotated automatically. This gives them a high trust score, making them unlikely to get blocked or challenged with CAPTCHAs.
Mobile proxies are an extreme version of residential proxies: maintaining the same IP is even more challenging, and they are even more expensive. This proxy type also tends to be slower and less reliable, though web scraping proxy providers have been improving it lately.
Mobile proxies don't require many engineering resources, as their automatic rotation resolves most connection blocking by itself!
Other Proxy Types
We've covered the four main proxy types. However, masking the IP address isn't only possible through regular proxy providers, so let's quickly explore the other options.
Virtual Private Network (VPN)
VPNs are proxies with a more complex tunneling protocol. VPN IPs are shared across many users, so they have low trust scores and are likely to get blocked or challenged with CAPTCHAs. Additionally, most VPNs don't expose their HTTP or SOCKS5 servers directly, though they can still be used for web scraping with a bit of technical knowledge.
The Onion Router (TOR)
Tor is open-source software that provides anonymous proxies through volunteer-run network relays. Tor exit IPs are publicly listed and have a very low success rate, and Tor connections are slow and unreliable, making them ineffective for web scraping.
Which Web Scraping Proxy Provider to Choose?
In a nutshell, the rarer and more complex the IP type, the harder it is to identify and block, but the more it costs. At the same time, more complex proxies tend to be less reliable.
Therefore, choosing a reliable proxy provider depends on your scraping target and project resources.
Datacenter proxies are great for getting around simple rate limiting and as a general safety net.
Residential proxies significantly reduce the chance of CAPTCHAs and of getting caught by anti-scraping protection services, but they require more engineering effort.
Mobile proxies are suitable for websites with more aggressive blocking.
We recommend starting with a sizable pool of datacenter proxies, as they are significantly cheaper and more reliable despite being easier to identify, and evaluating from there as the project grows.
Bandwidth Budget
When shopping around for the best web scraping proxies, we'll first notice that most proxies are priced by proxy count and bandwidth. Bandwidth can quickly become a huge budget sink for some web scraping scenarios, so evaluating bandwidth consumption is important before choosing dedicated proxies or a web scraping API.
It's easy to overlook bandwidth usage and end up with a huge proxy bill, so let's take a look at some examples:
| target | avg document page size | pages per 1 GB | avg browser page size | pages per 1 GB |
|---|---|---|---|---|
| Walmart.com | 16 KB | 1k - 60k | 1 - 4 MB | 200 - 2,000 |
| Indeed.com | 20 KB | 1k - 50k | 0.5 - 1 MB | 1,000 - 2,000 |
| LinkedIn.com | 35 KB | 300 - 30k | 1 - 2 MB | 500 - 1,000 |
| Airbnb.com | 35 KB | 30k | 0.5 - 4 MB | 250 - 2,000 |
| Target.com | 50 KB | 20k | 0.5 - 1 MB | 1,000 - 2,000 |
| Crunchbase.com | 50 KB | 20k | 0.5 - 1 MB | 1,000 - 2,000 |
| G2.com | 100 KB | 10k | 1 - 2 MB | 500 - 2,000 |
| Amazon.com | 200 KB | 5k | 2 - 4 MB | 250 - 500 |
The table above shows average bandwidth usage across various targets. Looking closely, some patterns emerge: big, HTML-heavy websites (like Amazon) use far more bandwidth than dynamic websites that populate their pages with background requests (like Walmart).
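As a quick sanity check on these numbers, dividing 1 GB by the average page size gives the page count per gigabyte:

```python
# Rough pages-per-gigabyte estimate from average page size.
GB_IN_KB = 1024 * 1024  # kilobytes in one gigabyte

for target, page_size_kb in {"Walmart.com": 16, "Amazon.com": 200}.items():
    print(f"{target}: ~{GB_IN_KB // page_size_kb:,} document pages per GB")
# Walmart.com: ~65,536 document pages per GB
# Amazon.com: ~5,242 document pages per GB
```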
Another bandwidth sink is browser automation tools like Puppeteer, Selenium, or Playwright. Since web browsers are less precise in their connections, they often download a lot of unnecessary data such as images, fonts, and so on.
Therefore, it's essential to configure browser automation setups with resource blocking and caching rules to prevent bandwidth overhead, though browser traffic should generally be expected to be much more expensive bandwidth-wise.
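As an example, here's a minimal resource-blocking sketch using Playwright's Python sync API, which aborts requests for heavy resource types before they consume bandwidth (note that blocking stylesheets can break some pages, so tune the list per target):

```python
# Block heavy, non-essential resources in Playwright to save proxy bandwidth.
from playwright.sync_api import sync_playwright

BLOCKED_RESOURCE_TYPES = {"image", "font", "media", "stylesheet"}

def block_heavy_resources(route):
    # abort requests for images, fonts, etc.; let documents and scripts through
    if route.request.resource_type in BLOCKED_RESOURCE_TYPES:
        route.abort()
    else:
        route.continue_()

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.route("**/*", block_heavy_resources)  # intercept every request
    page.goto("https://httpbin.dev/html")
    print(page.content()[:200])
    browser.close()
```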
Common Proxy Issues
Scraping through a proxy means placing a middleman between your client and the server, which can introduce many issues.
Probably the biggest one is support for HTTP/2 and HTTP/3 traffic. The newer HTTP protocols are typically preferred in web scraping to avoid blocking. Unfortunately, many HTTP proxies struggle with this sort of traffic, so when choosing a web scraping proxy provider, we advise testing HTTP2 quality first!
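A quick way to test this is to request a page through the proxy with an HTTP/2-capable client and inspect the negotiated protocol version. The sketch below assumes a recent httpx installed with the `http2` extra; the proxy URL is a placeholder:

```python
# Verify that a proxy passes HTTP/2 traffic.
import httpx

proxy_url = "http://user:pass@proxy.example.com:8080"  # placeholder
with httpx.Client(http2=True, proxy=proxy_url) as client:
    response = client.get("https://httpbin.dev/anything")
    print(response.http_version)  # "HTTP/2" if negotiation succeeded
```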
Another common proxy provider issue is connection concurrency. Proxy services typically limit concurrent proxy connections, and that limit might be too small for powerful web scrapers. Hence, it's worth researching the concurrent connection limit and throttling the scraper slightly below it to prevent proxy-related connection crashes.
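An asyncio semaphore is a simple way to enforce such a throttle. The sketch below assumes a hypothetical provider cap of 50 concurrent connections and stays a bit below it:

```python
# Throttle concurrent scraping requests below the proxy provider's limit.
import asyncio
import httpx

CONCURRENCY_LIMIT = 40  # stay safely below a hypothetical 50-connection cap

async def fetch(client, semaphore, url):
    async with semaphore:  # at most CONCURRENCY_LIMIT requests in flight
        response = await client.get(url)
        return response.status_code

async def main():
    semaphore = asyncio.Semaphore(CONCURRENCY_LIMIT)
    urls = [f"https://httpbin.dev/anything/{i}" for i in range(100)]
    async with httpx.AsyncClient() as client:
        statuses = await asyncio.gather(*(fetch(client, semaphore, u) for u in urls))
    print(statuses[:10])

asyncio.run(main())
```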
Finally, proxies add a lot of complexity to a web scraping project. So, when using a proxy server for scraping, we recommend investing additional engineering effort in retry and error-handling logic.
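A bare-bones version of such logic might retry failed requests with exponential backoff while rotating to a different proxy on each attempt; the proxy URLs below are placeholders:

```python
# Retry failed requests with exponential backoff and proxy rotation.
import random
import time
import httpx

PROXIES = [
    "http://user:pass@proxy-1.example.com:8080",
    "http://user:pass@proxy-2.example.com:8080",
]

def get_with_retries(url: str, attempts: int = 3) -> httpx.Response:
    for attempt in range(attempts):
        proxy_url = random.choice(PROXIES)  # rotate proxies between attempts
        try:
            with httpx.Client(proxy=proxy_url, timeout=10) as client:
                response = client.get(url)
                response.raise_for_status()
                return response
        except httpx.HTTPError:
            time.sleep(2 ** attempt)  # back off before the next attempt
    raise RuntimeError(f"all {attempts} attempts failed for {url}")
```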
Proxies at ScrapFly
Proxies can be a very powerful tool in web scraping, but they're still not enough for scaling up some web scraping projects, and this is where ScrapFly can assist!
ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale. Each product is equipped with an automatic bypass for any anti-bot system and we achieve this by:
Maintaining a fleet of real, reinforced web browsers with real fingerprint profiles.
Millions of self-healing proxies of the highest possible trust score.
Constantly evolving and adapting to new anti-bot systems.
We've been doing this publicly since 2020 with the best bypass on the market!
ScrapFly is a web scraping API that offers a request middleware service, ensuring outgoing requests result in successful responses. This is achieved through a combination of unique ScrapFly features, such as a smart proxy selection algorithm, an anti-scraping protection solver, and browser-based rendering.
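For instance, here's a minimal sketch using the ScrapFly Python SDK (`pip install scrapfly-sdk`); the API key is a placeholder and the feature flags shown are illustrative:

```python
# Scrape a page through ScrapFly with the anti-bot bypass enabled.
from scrapfly import ScrapflyClient, ScrapeConfig

client = ScrapflyClient(key="YOUR-SCRAPFLY-KEY")  # placeholder key
result = client.scrape(ScrapeConfig(
    url="https://httpbin.dev/html",
    asp=True,       # anti-scraping protection bypass
    country="US",   # route through a US proxy
))
print(result.content[:200])  # the scraped HTML
```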
ScrapFly uses a credit-based pricing model, which is much easier to predict and scale than bandwidth-based pricing. Pricing is based on the features used rather than arbitrary measurements such as bandwidth, meaning our users aren't locked into a single solution and can adjust their scrapers on the fly!
For example, the most popular $100/month tier can yield up to 1,000,000 target responses, depending on the enabled features.
To wrap up this guide on using proxies for web scraping, let's take a look at some frequently asked questions.
Can free proxies be used in web scraping?
Yes, but with few benefits. Free scraping proxies are easy to identify and perform very poorly, so we only recommend free proxy lists for low-demand web scraping by teams with enough engineering resources to keep track of free proxy availability.
Are scraping proxies banned forever?
Usually, banned proxies recover within minutes, hours, or days. Permanent bans for web scraping are very unlikely, though some proxy providers are banned by various anti-scraping protection services.
Why use proxies in web scraping at all?
Proxies in web scraping are used to avoid scraper blocking or to access geographically restricted content. For more on how proxies are used in web scraper blocking, refer to our guide on IP address blocking.
Using Proxies For Web Scraping Summary
In this guide, we've learned a lot about proxies. We compared IPv4 vs IPv6 internet protocols and HTTP vs SOCKS proxy protocols. Then, we explored the different proxy types and how they differ in web scraping blocking. Finally, we wrapped everything up by looking at common proxy challenges like bandwidth-based pricing, HTTP2 support, and proxy stability issues.
Proxies are complicated and can be hard to work with, so try out our flat-priced ScrapFly solution for free!