Introduction To Proxies in Web Scraping

One of the hardest problems in web scraping is scaling, and proxies are the most important tool for scaling web scrapers! A set of quality proxies can keep our web scraper from being blocked or throttled, which means we can scrape faster and spend less time maintaining our scrapers. So what makes a quality proxy, and what types of proxies are there?

In this introductory article, we'll take a look at what exactly a proxy is, what types of proxies there are, how they compare against each other, the common challenges posed by proxy usage and the best practices for using proxies in web scraping!

What's a Proxy?

A proxy is essentially a middleman server that sits between the client and the server. Proxies have many uses, such as optimizing connection routes, but most commonly they are used to disguise the client's IP address (identity). This disguise can be used to access geographically locked content (e.g. websites only available in a specific country) or to distribute traffic across multiple identities.

In web scraping, we often use proxies to avoid being blocked, as numerous connections from a single identity are easily identified as non-human.

To understand this further, let's learn a bit about IP addresses and proxy types.

IP Protocol Versions

Currently, the internet runs on two types of IP addresses: IPv4 and IPv6.
The key differences between these two protocols are:

  • Address quantity: the IPv4 address pool is limited to around 4 billion addresses (2^32). That might seem like a lot, but the internet is a big place, and technically we have already run out of free addresses! (see IPv4 address exhaustion)
  • Adoption: most websites still only support IPv4 connections, meaning we can't use IPv6 proxies unless we explicitly know that our target website supports them.

What does this mean for web scraping?
Since relatively few websites support IPv6, we are mostly limited to IPv4 proxies, which are more expensive (3-10 times on average) because of the limited address pool.
That being said, some major websites do support IPv6 (this can be checked with various IPv6 accessibility test tools), and scraping them through IPv6 proxies can greatly reduce your proxy budget!
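One rough way to check whether a target supports IPv6 is to look for IPv6 (AAAA) DNS records. Below is a minimal sketch using only Python's standard library. The function names are ours, chosen for illustration, and a published IPv6 address doesn't guarantee that the website serves its content properly over IPv6, so treat the result as a first hint only:

```python
import ipaddress
import socket


def ip_version(address: str) -> int:
    """Return 4 or 6 for a literal IP address string."""
    return ipaddress.ip_address(address).version


def resolves_over_ipv6(hostname: str) -> bool:
    """Return True if DNS returns at least one IPv6 address for the hostname."""
    try:
        results = socket.getaddrinfo(hostname, 443, family=socket.AF_INET6)
    except socket.gaierror:
        # no IPv6 records found (or the DNS lookup itself failed)
        return False
    return len(results) > 0


# IPv4 uses 32-bit addresses, IPv6 uses 128-bit addresses
IPV4_POOL_SIZE = 2 ** 32    # 4,294,967,296 addresses
IPV6_POOL_SIZE = 2 ** 128
```

Calling `resolves_over_ipv6()` with the hostname of your target needs network access, and the answer also depends on the DNS resolver of the machine running the check.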

Proxy Protocols

There are two major proxy protocols in use these days: HTTP and SOCKS (the latest version being SOCKS5).
In the context of web scraping there isn't much practical difference between the two: the SOCKS protocol tends to be a bit faster, more stable and more secure; however, HTTP proxies are more widely supported by proxy providers and by the HTTP client libraries used for web scraping.
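In client code the two protocols usually differ only in the proxy URL scheme. As a minimal sketch, here is a helper that builds a proxy mapping in the format used by Python's `requests` library. The helper name, host, ports and credentials are hypothetical placeholders, not details of any real provider:

```python
from urllib.parse import quote


def build_proxies(host, port, scheme="http", username=None, password=None):
    """Build a `proxies` mapping in the format used by the `requests` library."""
    auth = ""
    if username and password:
        # credentials are URL-encoded as they can contain special characters
        auth = f"{quote(username, safe='')}:{quote(password, safe='')}@"
    proxy_url = f"{scheme}://{auth}{host}:{port}"
    # the same proxy is used for both plain HTTP and HTTPS targets
    return {"http": proxy_url, "https": proxy_url}


# hypothetical proxy addresses, replace with your provider's details
http_proxy = build_proxies("proxy.example.com", 8080, username="user", password="p@ss")
socks_proxy = build_proxies("proxy.example.com", 1080, scheme="socks5")

# usage with requests (SOCKS support needs the `requests[socks]` extra):
# import requests
# response = requests.get("https://httpbin.org/ip", proxies=http_proxy)
```

With `requests`, the `socks5h://` scheme can be used instead of `socks5://` to have the proxy resolve DNS names rather than the client.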

Proxy Types

There are 4 types of proxy IPs used in web scraping: Datacenter, Residential, Static Residential (aka ISP) and Mobile. The key differences between these 4 types are price, reliability (connection speed, IP rotation etc.) and stealth score (likelihood of being blocked).

Let's take a deeper look at each type and its value in web scraping.

Datacenter Proxies

Datacenter IPs are commercially assigned to servers and are not affiliated with internet service providers (ISPs). As a result, they are often flagged as high-risk for bot traffic. Typically, these IPs are also shared between many users, which further increases the risk of being flagged.

On the bright side, datacenter proxies are widely accessible, reliable and cheap! We recommend datacenter proxies for teams with stronger engineering resources, as they often require dedicated engineering time to reverse-engineer scraping targets and to design smart proxy rotation algorithms.
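To give a rough idea of what a proxy rotation algorithm can look like, here is a minimal sketch of a rotator that picks proxies at random, favours the ones with fewer recent failures and retires the ones that fail too often. The class name, method names and thresholds are our own assumptions for illustration; real rotation logic usually also accounts for things like subnets, cooldown periods and per-target history:

```python
import random


class ProxyRotator:
    """Pick proxies at random, avoiding ones that failed too many times."""

    def __init__(self, proxies, max_failures=3, seed=None):
        self.failures = {proxy: 0 for proxy in proxies}
        self.max_failures = max_failures
        self.random = random.Random(seed)

    def alive(self):
        """List proxies that are still below the failure threshold."""
        return [p for p, count in self.failures.items() if count < self.max_failures]

    def get(self):
        candidates = self.alive()
        if not candidates:
            raise RuntimeError("no healthy proxies left in the pool")
        # proxies with fewer failures get a higher weight and are picked more often
        weights = [self.max_failures - self.failures[p] for p in candidates]
        return self.random.choices(candidates, weights=weights, k=1)[0]

    def report_failure(self, proxy):
        self.failures[proxy] += 1

    def report_success(self, proxy):
        # a successful response resets the failure counter
        self.failures[proxy] = 0
```

The scraper would call `get()` before each request and then `report_success()` or `report_failure()` depending on whether the response looks blocked.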

Residential Proxies

Residential IPs are assigned by ISPs, so they are at lower risk of being flagged: they are attached to a real address and are wrapped in a stricter legal framework. This makes them great for web scraping, as they're the same IPs real humans use!

Unfortunately, residential IPs are much pricier than datacenter ones. Additionally, these proxies can have trouble maintaining the same IP for long periods of time, which is why they are often referred to as "Rotating Residential Proxies". This can be problematic for targets that require the same IP to maintain a connection session, as the web scraper may have to re-authenticate its session repeatedly, causing friction in the scraping process. These proxies are great for teams with limited engineering resources, as they have a high stealth score and are relatively affordable.

Static Residential / ISP Proxies

Residential IPs have a great stealth score but are unreliable, as they aren't backed by strong datacenter infrastructure. What if we could combine the best of both worlds: the reliability of datacenter proxies and the stealth of residential proxies?

ISP proxies (aka Static Residential proxies) are datacenter proxies registered as ISP IPs, meaning they get most of the stealth benefits of a residential proxy and the persistence and network quality of a datacenter proxy! We recommend ISP proxies for web scrapers that need to maintain an IP-based session for long periods of time while avoiding captchas and anti-bot systems.
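When a scraper runs many sessions over a pool of static IPs, each session needs to keep using the same proxy. One simple way to sketch this is to hash the session identifier into the pool. The function name and the example addresses below are hypothetical, and note that this mapping changes whenever the pool itself changes:

```python
import hashlib


def proxy_for_session(session_id: str, proxies: list) -> str:
    """Deterministically map a session to one proxy from the pool."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    # use the first 8 bytes of the hash as a stable index into the pool
    index = int.from_bytes(digest[:8], "big") % len(proxies)
    return proxies[index]


# hypothetical pool of static (ISP) proxies
pool = [
    "http://10.0.0.1:8000",
    "http://10.0.0.2:8000",
    "http://10.0.0.3:8000",
]
# the same session always gets the same proxy
print(proxy_for_session("account-42", pool) == proxy_for_session("account-42", pool))
```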

Mobile Proxies

Mobile IPs are assigned by a mobile service provider (think 4G etc.), and since they are assigned dynamically to whoever is around the cell tower, they are not tied to a single individual. This means they are at very low risk of being blocked or forced through a captcha.

Mobile proxies are essentially a more extreme version of residential proxies: maintaining the same IP might be harder, and they are even more expensive. That being said, they are amazing for teams with low engineering resources, as they solve most connection blocking by virtue of their origin alone!


As you can see, a clear pattern emerges: the more complex and rare the IP is, the harder it is to identify, but the more it costs. The complexity of a proxy also decreases its reliability.

So which one to choose?

In short, it all depends on your target and project resources.
Datacenter proxies are great for getting around simple rate limiting and as a general safety net. Residential proxies greatly reduce the chance of captchas and of being caught by anti-web-scraping protection services, and mobile proxies take this even further.

We usually recommend starting with a sizable pool of datacenter proxies, as they are significantly cheaper and more reliable, and evaluating from there as the project grows.

Other Types of Proxies?

We've covered 4 types of proxies, but the internet is a clever place and there are other, lesser-known ways to mask your IP.

Probably the most popular alternative is using Virtual Private Network (VPN) services as proxies. VPNs are essentially proxies with a more complex and stronger tunneling protocol.
Since a single VPN exit is shared by many users (like mobile proxies), this can be advantageous: other users can raise the IP's stealth score by solving captchas and browsing around like human beings. On the other hand, the opposite can happen and the exit IP might be completely polluted by other power users.

To summarize: the VPN approach is very unstable and accessibility varies heavily by VPN provider. Not many providers offer HTTP/SOCKS5 proxy access to their VPN servers; however, with a bit of technical know-how, VPN servers can also be used as proxies for casual web scraping projects.

Another alternative is The Onion Router (Tor). Tor is a privacy protocol in which traffic is bounced through several volunteer-run servers to mask the client's origin. The main downside of the Tor network is that it's volunteer-driven, with a limited number of publicly known exit nodes, which gives it a very low stealth score. Additionally, because of the protocol's complexity and the limited number of exit nodes, Tor connections are very slow and often unreliable.

Tor can be used for web scraping with varying results; however, we would not recommend it for anything other than educational purposes.

Bandwidth

When shopping around for proxies, the first thing we'll notice is that most proxies are priced by proxy count and bandwidth. Bandwidth can quickly become a huge budget sink in some web scraping scenarios, so it's important to evaluate bandwidth consumption before choosing a proxy provider.

It's easy to overlook bandwidth usage and end up with a huge proxy bill, so let's take a look at some examples:

Target         | Avg. document page size | Pages per 1 GB | Avg. browser page size | Pages per 1 GB (browser)
Walmart.com    | 16 KB                   | 1k - 60k       | 1 - 4 MB               | 200 - 2,000
Indeed.com     | 20 KB                   | 1k - 50k       | 0.5 - 1 MB             | 1,000 - 2,000
LinkedIn.com   | 35 KB                   | 300 - 30k      | 1 - 2 MB               | 500 - 1,000
Airbnb.com     | 35 KB                   | 30k            | 0.5 - 4 MB             | 250 - 2,000
Target.com     | 50 KB                   | 20k            | 0.5 - 1 MB             | 1,000 - 2,000
Crunchbase.com | 50 KB                   | 20k            | 0.5 - 1 MB             | 1,000 - 2,000
G2.com         | 100 KB                  | 10k            | 1 - 2 MB               | 500 - 2,000
Amazon.com     | 200 KB                  | 5k             | 2 - 4 MB               | 250 - 500

In the table above we see the average bandwidth usage of various targets. If we look closely, some patterns emerge: websites with big, heavy HTML documents (like amazon.com) use a lot of bandwidth compared to dynamic websites that populate their pages through background resource requests (like walmart.com).
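The "pages per 1 GB" figures in the table above are simple arithmetic, which is handy for estimating a proxy bill up front. Here is a small sketch, assuming decimal units (1 GB = 1,000,000 KB, which matches the table's rounding) and a purely hypothetical price of $10 per GB:

```python
KB_PER_GB = 1_000_000  # assumption: decimal units, 1 GB = 1,000,000 KB


def pages_per_gb(page_size_kb: float) -> int:
    """How many pages of the given average size fit into 1 GB of traffic."""
    return int(KB_PER_GB // page_size_kb)


def bandwidth_cost(pages: int, page_size_kb: float, price_per_gb: float) -> float:
    """Estimated bandwidth bill for scraping a number of pages."""
    gigabytes = pages * page_size_kb / KB_PER_GB
    return round(gigabytes * price_per_gb, 2)


print(pages_per_gb(200))                     # 5000, the Amazon.com row
print(bandwidth_cost(1_000_000, 200, 10.0))  # 2000.0 at a hypothetical $10/GB
```

These estimates only count the response bodies at their average size. Request headers, retries and failed responses add to the real bill.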

Another example of a bandwidth sink is browser automation tools like Puppeteer, Selenium or Playwright. Since web browsers are less precise in their connections, they often download a lot of unnecessary data like images, fonts and so on. Because of this, it's essential to configure browser automation setups with resource blocking and proper caching rules to prevent bandwidth overhead. Even so, expect browser traffic to be much more expensive bandwidth-wise.
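As an illustration, here is a minimal sketch of resource blocking using Playwright's Python API, which lets us intercept every request and abort the ones we don't need. The set of blocked resource types and the function names are our own assumptions. Which resources are safe to block depends on the target, as some pages break without their stylesheets or scripts:

```python
# resource types that rarely matter for data extraction (an assumption, tune per target)
BLOCKED_RESOURCE_TYPES = {"image", "media", "font"}


def should_block(resource_type: str) -> bool:
    """Decide whether a browser request of the given type should be aborted."""
    return resource_type in BLOCKED_RESOURCE_TYPES


def scrape(url: str) -> str:
    """Load a page in a headless browser while skipping heavy resources."""
    # imported lazily so the helper above works without Playwright installed
    from playwright.sync_api import sync_playwright

    def handle_route(route):
        if should_block(route.request.resource_type):
            route.abort()
        else:
            route.continue_()

    with sync_playwright() as pw:
        browser = pw.chromium.launch(headless=True)
        page = browser.new_page()
        # intercept every request made by the page
        page.route("**/*", handle_route)
        page.goto(url)
        html = page.content()
        browser.close()
    return html
```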

Scraping Dynamic Websites Using Browser

For more on how to optimize web browser automation tools, see our extensive article that covers three major packages: Puppeteer, Playwright and Selenium.

Common Proxy Issues

Having a middleman between your client and the server can introduce a lot of issues.

Probably the biggest issue is support for HTTP2/3 traffic. The newer HTTP protocols are typically preferred in web scraping to avoid blocking. Unfortunately, lots of HTTP proxies struggle with this sort of traffic, so when choosing a proxy provider for web scraping we advise testing HTTP2 quality first!

Another common proxy provider issue is connection concurrency. Typically, proxy services limit the number of concurrent proxy connections, and this limit might be too low for powerful web scrapers. Because of this, we advise researching the concurrent connection limit and throttling scrapers a bit below it to prevent proxy-related connection crashes.
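In asynchronous Python scrapers, this kind of throttling can be sketched with a semaphore that caps how many requests run at once. The function names and the limit of 5 below are hypothetical, and `fake_request` merely stands in for a real proxied HTTP request:

```python
import asyncio


async def run_throttled(tasks, limit):
    """Run coroutine functions with at most `limit` of them running at once."""
    semaphore = asyncio.Semaphore(limit)

    async def worker(task):
        async with semaphore:
            return await task()

    return await asyncio.gather(*(worker(task) for task in tasks))


async def demo():
    active = 0
    peak = 0

    async def fake_request():
        nonlocal active, peak
        active += 1
        peak = max(peak, active)   # track the highest concurrency reached
        await asyncio.sleep(0.01)  # stands in for a proxied HTTP request
        active -= 1
        return "ok"

    results = await run_throttled([fake_request] * 20, limit=5)
    return results, peak


if __name__ == "__main__":
    results, peak = asyncio.run(demo())
    print(len(results), "requests done, peak concurrency:", peak)
```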

Finally, proxies introduce a lot of additional complexity to a web scraping project, so when proxies are used we recommend investing additional engineering effort into retry and error handling logic.
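A common starting point for such logic is to retry a failed request through a different proxy with an increasing delay. Here is a minimal sketch, where `ProxyError`, `fetch_with_retries` and the `fetch` callable are hypothetical names. In a real scraper, `fetch` would wrap your HTTP client and raise when the response looks blocked or the connection drops:

```python
import time


class ProxyError(Exception):
    """Hypothetical error raised when a proxied request fails."""


def fetch_with_retries(fetch, proxies, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(proxy), switching proxy and backing off after each failure."""
    last_error = None
    for attempt in range(max_attempts):
        # cycle through the pool so every retry uses a different proxy
        proxy = proxies[attempt % len(proxies)]
        try:
            return fetch(proxy)
        except ProxyError as error:
            last_error = error
            if attempt < max_attempts - 1:
                # exponential backoff: 1s, 2s, 4s, ...
                sleep(base_delay * 2 ** attempt)
    raise last_error
```

The `sleep` parameter is only there to make the delay easy to replace in tests. By default the function waits using `time.sleep`.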

Proxies at ScrapFly

At ScrapFly we realize how complicated proxies are in web scraping, so we made it our goal to simplify the process while also keeping the service accessible.

ScrapFly feels like a proxy but does much more!

ScrapFly offers a request middleware service, which ensures that outgoing requests result in successful responses. This is done through a combination of unique ScrapFly features such as a smart proxy selection algorithm, an anti web scraping protection solver and browser-based rendering.

ScrapFly uses a credit-based pricing model, which is much easier to predict and scale than pricing based on bandwidth or proxy count. This allows flexible pricing based on the features used rather than arbitrary measurements such as bandwidth, meaning our users aren't locked into a single solution and can adjust their scrapers on the fly!

For example, the most popular $100/Mo tier can yield up to 1,000,000 target responses based on enabled features:

  • ScrapFly provides a choice of either datacenter or residential proxies and geolocation (50+ locations) for each request.
  • All ScrapFly HTTP1 requests are automatically converted to HTTP2 requests, which are significantly less likely to be blocked.
  • ScrapFly offers a smart Anti Scraping Protection (ASP) solution, which solves various captchas and scraping protection blockers if they appear during the scraping process. What's great about the ASP service is that the user is only charged 5 credits for successful solutions, meaning it can be applied to every request worry-free!
  • ScrapFly offers browser-based rendering, which further reduces the chances of being blocked, as real web browsers are much less likely to be blocked than HTTP clients. Using browser-based rendering also greatly simplifies the web scraping process, as it reduces the engineering effort needed to understand the scraped website - your requests will return the same data users see in their web browsers!

To explore these and other offered features see our full documentation!

Summary

In this introductory article we've learned a lot about proxies. We compared the IPv4 and IPv6 internet protocols and the HTTP and SOCKS proxy protocols. Then we took a deep look at the proxy types (datacenter, residential, ISP and mobile) and how they differ when it comes to web scraper blocking. Finally, we wrapped everything up by taking a look at common proxy challenges like bandwidth-based pricing, HTTP2 support and proxy stability issues.

Best Proxy Providers for Web Scraping

Now that you're familiar with proxies and their challenges, see our comparison write-up on top proxy providers!

Proxies are complicated and can be hard to work with, so try out our flat-priced ScrapFly solution for free!
