Introduction To Proxies in Web Scraping


One of the most difficult problems in web scraping is scaling, and the most important tools in scaling web scrapers are proxies!
Having a set of quality proxies can prevent our web scraper from being blocked or throttled, meaning we can scrape faster and spend less time maintaining our scrapers. So what makes a quality proxy for web scraping and what type of proxies are there?

In this introduction article, we'll take a look at what exactly a proxy is, what types of proxies exist, how they compare against each other, what common challenges proxy usage poses and what the best practices are in web scraping.

What's a Proxy?

A proxy is essentially a middleman server that sits between the client and the server. Proxies have many uses, such as optimizing connection routes, but in web scraping they are most commonly used to disguise the client's IP address (identity).

This disguise can be used to access geographically locked content (e.g. websites only available in a specific country) or to distribute traffic through multiple identities.

In web scraping we often use proxies to avoid being blocked as numerous connections from a single identity can be easily identified as non-human connections.
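To make this concrete, here is a minimal sketch of routing a request through a proxy in Python; the proxy address is a hypothetical placeholder, and the `requests` library is assumed to be installed for the actual fetch:

```python
# hypothetical proxy URL - substitute your provider's address
PROXY = "http://user:pass@203.0.113.7:8080"

def make_proxy_config(proxy_url: str) -> dict:
    """Build the proxies mapping that most Python HTTP clients expect:
    the same proxy handles both plain and TLS traffic."""
    return {"http": proxy_url, "https": proxy_url}

def check_ip(proxy_url: str) -> str:
    """Fetch our visible IP through the proxy (requires `requests`)."""
    import requests  # imported lazily so make_proxy_config stays dependency-free

    resp = requests.get(
        "https://httpbin.org/ip",
        proxies=make_proxy_config(proxy_url),
        timeout=10,
    )
    # the target sees the proxy's IP, not ours
    return resp.json()["origin"]
```

With a working proxy, `check_ip(PROXY)` returns the proxy's address rather than your own, which is exactly the identity disguise described above.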

To further understand this, let's learn a bit about IP addresses and proxy types.

IP Protocol Versions

illustration of ipv4 versus ipv6 internet protocols

Currently, the internet runs on two types of IP addresses: IPv4 and IPv6.
The key differences between these two protocols are:

  • Address quantity: the IPv4 address pool is limited to around 4 billion addresses, which might seem like a lot, but the internet is a big place and we have technically already run out of free addresses! (see IPv4 address exhaustion)
  • Adoption: Most websites still only support IPv4 connections, meaning we can't use IPv6 proxies unless we explicitly know our target website supports it.

What does this mean for web scraping?
Since IPv6 is supported by very few websites, we are still limited to using IPv4 proxies, which are more expensive (3-10 times on average) because of the limited address pool.
That being said, some major websites do support IPv6 (which can be checked on various IPv6 accessibility test tools), which can greatly reduce your proxy budget!

Proxy Protocols

There are two major proxy protocols in use these days: HTTP and SOCKS (the latest version being SOCKS5).
In web scraping, there isn't much practical difference between the two. The SOCKS protocol tends to be a bit faster, more stable and more secure, while HTTP proxies are more widely supported by both proxy providers and the HTTP client libraries used for web scraping.
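In practice, switching between the two protocols in a Python client is usually just a different URL scheme; the addresses below are hypothetical placeholders, and SOCKS support in `requests` needs the extra `requests[socks]` dependency:

```python
# plain HTTP proxy configuration (hypothetical address)
HTTP_PROXIES = {
    "http": "http://203.0.113.7:8080",
    "https": "http://203.0.113.7:8080",
}
# SOCKS5 configuration - requires: pip install requests[socks]
SOCKS_PROXIES = {
    "http": "socks5://203.0.113.7:1080",
    "https": "socks5://203.0.113.7:1080",
}

def uses_socks(proxies: dict) -> bool:
    """Tell whether a proxies mapping routes traffic through SOCKS."""
    return any(url.startswith("socks") for url in proxies.values())
```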

Proxy Types

There are 4 types of proxy IPs that are used in web scraping:

  • Datacenter
  • Residential
  • Static Residential (aka ISP)
  • Mobile

The key differences between these 4 types are price, reliability (connection speed, IP rotation etc.) and stealth score (the likelihood of being blocked).

Let's take a deeper look into each type, and its value in web scraping.

Datacenter Proxies

Datacenter IPs are commercially assigned to servers and are not affiliated with internet service providers (ISPs), so they are often flagged as high-risk and likely to belong to bots. Typically, these IPs are also shared between many users, further increasing the risk of being flagged.


On the bright side, datacenter proxies are widely accessible, reliable and cheap! We recommend using datacenter proxies for teams with stronger engineering resources as engineering time is needed to reverse-engineer scraping targets and to design smart proxy rotation algorithms.
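The core idea of weighted rotation can be sketched in a few lines of Python: healthy proxies carry higher selection weights, recently blocked ones lower (the addresses are hypothetical placeholders):

```python
import random

def pick_proxy(pool: list, weights: list) -> str:
    """Pick a proxy with probability proportional to its health weight."""
    return random.choices(pool, weights=weights, k=1)[0]

def penalize(weights: list, index: int, factor: float = 0.5) -> None:
    """Lower a proxy's weight after it gets blocked; it is still picked
    occasionally, so its weight can be restored once it recovers."""
    weights[index] *= factor

# usage sketch with hypothetical addresses
pool = ["203.0.113.1:8080", "203.0.113.2:8080", "203.0.113.3:8080"]
weights = [1.0] * len(pool)
penalize(weights, 0)               # proxy 0 got blocked - halve its weight
proxy = pick_proxy(pool, weights)  # proxy 0 is now picked half as often
```

A real rotator would also restore weights over time and track per-target bans, but the weighted-random core stays the same.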

How to Rotate Proxies in Web Scraping

For an efficient proxy rotation example, see our introduction article on how to rotate proxies effectively using weighted randomization.


Residential Proxies

Residential IPs are assigned by ISPs, meaning they are at lower risk of being flagged: they are attached to a real address and wrapped in a stricter legal framework. This makes them great for web scraping, as they're the same IPs real humans use!


Unfortunately, residential IPs are much pricier than the datacenter ones.
Additionally, these proxies can struggle to maintain the same IP for long periods, which is why they're often referred to as "Rotating Residential Proxies". This can be problematic for targets that require the same IP throughout a connection session. For example, if we're scraping a long airline booking process, the proxy might expire before we reach the last step, fumbling the whole scrape.

Constant session loss forces the web scraper to re-authenticate repeatedly, causing friction in the scraping process, so it's best to look for residential proxies that can sustain long sessions.

Residential proxies are great for teams that have limited engineering resources as they have high stealth scores and are relatively affordable.

Static Residential / ISP Proxies

Residential IPs have a great stealth score but are unreliable, as they aren't backed by strong datacenter infrastructure. What if we combined the best of both worlds: the reliability of datacenter proxies and the stealth of residential proxies?


ISP proxies (aka Static Residential proxies) are datacenter proxies registered as ISP IPs, meaning they get most of the stealth benefits of a residential proxy and the persistence/network quality of a datacenter proxy!

We recommend ISP proxies for web scrapers that need to maintain an IP-based session for long periods while avoiding captchas and anti-bot systems.

Mobile Proxies

Mobile IPs are assigned by mobile service providers (think 4G etc.), and since they are assigned dynamically to whoever is around the cell tower, they are not tied to a single individual. This means they are at really low risk of being blocked or forced through a captcha.


Mobile proxies are an even more extreme version of residential proxies: maintaining the same IP is harder still, and they cost even more. These proxies also tend to be somewhat slower and less reliable, though modern providers have been making great improvements as of late.

Mobile proxies are amazing for teams with low engineering resources, as they solve most connection blocking by virtue of their origin alone!


As you can see, a clear pattern emerges: the more complex and rare the IP is, the harder it is to identify, but the more it costs. A proxy's complexity also decreases its reliability.

So which one to choose?


To put it shortly - it all depends on your target and project resources.
Datacenter proxies are great for getting around simple rate limiting and as a general safety net.
Residential proxies greatly reduce the chance of captchas and of being caught by anti-web-scraping protection services, and mobile proxies take this even further.

We usually recommend starting with a sizable pool of datacenter proxies, as they are significantly cheaper and more reliable, and re-evaluating from there as the project grows.

Be aware, however, that datacenter proxies are easily caught by anti-scraping-protection systems, as they are very easy to identify.

How to Avoid Web Scraper IP Blocking?

For more on how proxies are identified, tracked and blocked, see our introduction article, which explains what IP metadata is and how it's used to fingerprint connections.


Other Types of Proxies?

We've covered 4 types of proxies, but the internet is a clever place and there are other lesser-known ways to mask your IP address.

Probably the most popular alternative is using Virtual Private Network (VPN) services as proxies. VPNs are essentially proxies with a more complex/stronger tunneling protocol.
Since a single VPN exit node is shared by many users (like mobile proxies), this can be advantageous: other users can raise the IP's stealth score by solving captchas and browsing around like human beings. On the other hand, the exit IP might also be completely polluted by other power users.

So to summarize: the VPN approach is very unstable, and accessibility varies heavily by VPN provider. Few providers offer HTTP/SOCKS5 proxy access to their VPN servers; however, with a bit of technical know-how, VPN servers can be used as proxies for casual web scraping projects.

Another alternative proxy type is The Onion Network (TOR), a privacy layer protocol where many servers bounce traffic around to mask the client's origin.
The main downside of the TOR network is that it's volunteer-driven with a limited set of publicly known exit nodes, giving it a very low stealth score. Additionally, because of the protocol's complexity and the limited number of volunteer exit nodes, TOR connections are very slow and often unreliable.

TOR can be used for web scraping with varying results; however, we would not recommend it for anything other than educational purposes.

Bandwidth

When shopping around for scraping proxies the first thing we'll notice is that most proxies are priced by proxy count and bandwidth. Bandwidth can quickly become a huge budget sink for some web scraping scenarios, so it's important to evaluate bandwidth consumption before choosing a scraping proxy provider.

It's easy to overlook bandwidth usage and end up with a huge proxy bill, so let's take a look at some examples:

target         | avg document page size | document pages per 1GB | avg browser page size | browser pages per 1GB
Walmart.com    | 16kb  | 1k - 60k  | 1 - 4 MB   | 200 - 2,000
Indeed.com     | 20kb  | 1k - 50k  | 0.5 - 1 MB | 1,000 - 2,000
LinkedIn.com   | 35kb  | 300 - 30k | 1 - 2 MB   | 500 - 1,000
Airbnb.com     | 35kb  | 30k       | 0.5 - 4 MB | 250 - 2,000
Target.com     | 50kb  | 20k       | 0.5 - 1 MB | 1,000 - 2,000
Crunchbase.com | 50kb  | 20k       | 0.5 - 1 MB | 1,000 - 2,000
G2.com         | 100kb | 10k       | 1 - 2 MB   | 500 - 2,000
Amazon.com     | 200kb | 5k        | 2 - 4 MB   | 250 - 500

In the table above we see average bandwidth usage for various targets. Looking closely, a pattern emerges: heavy static HTML websites (like amazon.com) use a lot of bandwidth compared to dynamic websites that populate their pages through background resource requests (like walmart.com).
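The "pages per 1GB" figures follow from simple division of the bandwidth budget by average page size, which is easy to verify:

```python
def pages_per_gb(avg_page_size_kb: float) -> int:
    """Roughly how many pages of a given average size fit into 1 GB of
    proxy bandwidth (1 GB taken as 1,000,000 kB for a rough estimate)."""
    return int(1_000_000 / avg_page_size_kb)
```

For example, `pages_per_gb(200)` gives 5,000, matching the amazon.com document row, while a 16 kB page yields around 60,000 pages per gigabyte.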

Another example of a bandwidth sink is using browser automation tools like Puppeteer, Selenium or Playwright. Since web browsers are less precise in their connections, they often download a lot of unnecessary data like images, fonts and so on. Because of this, it's essential to configure browser automation setups with resource blocking and proper caching rules to prevent bandwidth overhead, but generally expect browser traffic to be much more expensive bandwidth-wise.
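As an illustration, a minimal resource-blocking setup using Playwright's request interception might look like this; the blocked resource-type set is just a starting point, and the `playwright` package with installed browsers is assumed for the scraping function:

```python
# resource types browsers fetch but scrapers rarely need
BLOCKED_RESOURCE_TYPES = {"image", "font", "media", "stylesheet"}

def should_block(resource_type: str) -> bool:
    """Decide whether a browser request is worth aborting to save bandwidth."""
    return resource_type in BLOCKED_RESOURCE_TYPES

def scrape_page(url: str) -> str:
    """Load a page with resource blocking enabled (requires `playwright`)."""
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # abort requests for blocked resource types, let the rest through
        page.route(
            "**/*",
            lambda route: route.abort()
            if should_block(route.request.resource_type)
            else route.continue_(),
        )
        page.goto(url)
        html = page.content()
        browser.close()
        return html
```

Blocking images and fonts alone often cuts browser bandwidth by a large margin, since they dominate most pages' payloads.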

Scraping Dynamic Websites Using Web Browsers

For more on how to optimize web browser automation tools, see our extensive article that covers three major packages: Puppeteer, Playwright and Selenium.


Common Proxy Issues

Having a middleman between your client and the server can introduce a lot of issues.

Probably the biggest issue is support for HTTP2/3 traffic. The newer HTTP protocols are typically preferred in web scraping to avoid blocking. Unfortunately, many HTTP proxies struggle with this sort of traffic, so when choosing a proxy provider for web scraping we advise testing HTTP2 support quality first!
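One way to run such a test is to probe whether requests routed through a proxy actually negotiate HTTP/2; a sketch using the httpx client (assuming a recent httpx version with the `proxy=` argument and the `httpx[http2]` extra installed; the proxy URL is a hypothetical placeholder):

```python
def is_http2(version: str) -> bool:
    """Check a reported protocol version string like 'HTTP/2' or 'HTTP/1.1'."""
    return version.upper().startswith("HTTP/2")

def proxy_supports_http2(proxy_url: str, test_url: str = "https://example.com") -> bool:
    """Probe whether traffic through a proxy negotiates HTTP/2
    (requires `httpx[http2]`; pass e.g. 'http://203.0.113.7:8080')."""
    import httpx  # imported lazily so is_http2 stays dependency-free

    with httpx.Client(http2=True, proxy=proxy_url, timeout=10) as client:
        return is_http2(client.get(test_url).http_version)
```

If the proxy silently downgrades connections, the reported version will be HTTP/1.1 even though the client requested HTTP/2.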

Another common proxy provider issue is connection concurrency. Proxy services typically limit concurrent proxy connections, and that limit might be too small for powerful web scrapers. Because of this, we advise researching the concurrent connection limit and throttling scrapers a bit below it to prevent proxy-related connection crashes.
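A simple way to enforce such a cap in an async scraper is a semaphore; a minimal sketch, assuming a hypothetical provider limit of 100 concurrent connections:

```python
import asyncio

async def fetch_limited(url, fetcher, semaphore):
    """Run fetcher(url) while holding a semaphore slot, so total
    concurrency never exceeds the provider's connection cap."""
    async with semaphore:
        return await fetcher(url)

async def crawl(urls, fetcher, limit: int = 90):
    """Scrape all urls with bounded concurrency.
    The default of 90 stays a bit below a hypothetical cap of 100,
    leaving headroom for retries and stray connections."""
    semaphore = asyncio.Semaphore(limit)
    return await asyncio.gather(
        *(fetch_limited(url, fetcher, semaphore) for url in urls)
    )
```

Here `fetcher` stands in for whatever proxied request coroutine your scraper uses; results come back in the same order as the input URLs.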

Finally, proxies do introduce a lot of additional complexity to a web scraping project, so when proxies are used we recommend investing additional engineering effort in retry/error handling logic.
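A minimal shape for such retry logic, rotating to a random proxy on every attempt with exponential backoff (the `fetch` callable and proxy list are placeholders for your own request function and pool):

```python
import random
import time

def fetch_with_retries(url, fetch, proxies, max_attempts=3, backoff=1.0):
    """Retry a proxied request, switching to a random proxy each attempt
    and backing off exponentially between failures."""
    last_error = None
    for attempt in range(max_attempts):
        proxy = random.choice(proxies)
        try:
            return fetch(url, proxy)
        except Exception as exc:  # real code should catch specific connection errors
            last_error = exc
            time.sleep(backoff * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise last_error
```

Switching proxies between attempts matters: a retry through the same blocked IP usually fails the same way, while a fresh identity often succeeds immediately.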

FAQ

To wrap this introduction up let's take a look at some frequently asked questions about proxies in web scraping:

Can free proxies be used in web scraping?

Yes, but with few benefits. Free scraping proxies are easy to identify and perform very poorly, so we would only recommend free proxy lists for low-demand web scraping and for teams with enough engineering resources to keep track of free proxy availability.

Are scraping proxies banned forever?

Usually, banned proxies recover within minutes, hours or days. Permanent bans for web scraping are very unlikely, though some proxy providers' IP pools are blocked wholesale by various anti-scraping-protection services.

Why use proxies in web scraping at all?

Proxies in web scraping are used to avoid scraper blocking or to access geographically restricted content. For more on how proxies are used in web scraper blocking see How to Avoid Web Scraper IP Blocking?

Proxies at ScrapFly

At ScrapFly we realize how complicated proxies are in web scraping, so we made it our goal to simplify the process while also keeping the service accessible.


ScrapFly feels like a proxy but does much more!

ScrapFly offers a request middleware service, which ensures that outgoing requests result in successful responses. This is done by a combination of unique ScrapFly features such as a smart proxy selection algorithm, anti-web scraping protection solver and browser-based rendering.

ScrapFly uses a credit-based pricing model, which is much easier to predict and scale than bandwidth- or proxy-count-based pricing. This allows flexible pricing based on the features used rather than arbitrary measurements such as bandwidth, meaning our users aren't locked into a single solution and can adjust their scrapers on the fly!

image of scrapfly's pricing tiers

For example, the most popular $100/Mo tier can yield up to 1,000,000 target responses based on enabled features:

  • ScrapFly provides a choice of either datacenter or residential proxies and geolocation (over 50+ locations) for each request.
  • All ScrapFly HTTP1 requests are automatically converted to HTTP2 requests, which are significantly less likely to be blocked.
  • ScrapFly offers a smart Anti Scraping Protection solution, which solves various captchas and scraping protection blockers if they appear during the scraping process. What's great about the ASP service is that the user is only charged 5 credits for successful solutions, meaning it can be applied to every request worry-free!
  • ScrapFly offers browser-based rendering, which further reduces the chances of being blocked, as real web browsers are much less likely to be blocked than HTTP clients. Browser-based rendering also greatly simplifies the web scraping process, as it reduces the engineering effort needed to understand the scraped website - your requests will return the same data users see in their web browsers!

To explore these and other offered features see our full documentation!

Summary

In this introduction article, we've learned a lot about proxies. We compared the IPv4 and IPv6 internet protocols and the HTTP and SOCKS proxy protocols. Then we took a deep look into proxy types - datacenter, residential, ISP and mobile - and how they differ in web scraper blocking. Finally, we wrapped everything up by taking a look at common proxy challenges like bandwidth-based pricing, HTTP2 support and proxy stability issues.

Best Proxy Providers for Web Scraping

Now that you're familiar with proxies, and their challenges - see our comparison write-up on best proxies for web scraping.


Proxies are complicated and can be hard to work with so try out our flat-priced ScrapFly solution for free!

Related Posts

How to Rotate Proxies in Web Scraping

In this article we explore proxy rotation. How does it affect web scraping success and blocking rates and how can we smartly distribute our traffic through a pool of proxies for the best results.

Web Scraping With Node-Unblocker

Tutorial on using Node-Unblocker - a nodejs library - to avoid blocking while web scraping and using it to optimize web scraping stacks.

How to Avoid Web Scraper IP Blocking?

How IP addresses are used in web scraping blocking. Understanding IP metadata and fingerprinting techniques to avoid web scraper blocks.