The Complete Guide To Using Proxies For Web Scraping

One of the most common challenges encountered when web scraping is scaling. For this, using proxies is crucial! Having a set of quality proxies can prevent web scraping blocking. But what makes a quality proxy for web scraping, and what different types of proxies are there?

In this guide, we'll take an extensive look at using proxies for web scraping. We'll explain the different types of proxies, how they compare, their challenges, and best practices for using them in web scraping. Let's get started!

How to Scrape Without Getting Blocked? In-Depth Tutorial

In this article, we'll take a look at web scraping without getting blocked by exploring four core concepts where web scrapers fail to cover their tracks and how analysis of these details can lead to blocking.

What Is a Proxy?

A proxy server is a middleman that sits between a client and a host. Proxies have several uses, such as connection optimization, but the most common use of web scraping proxies is masking or hiding the client's IP address.

This IP masking is beneficial for two main purposes:

  • Accessing geographically blocked websites by changing the IP location.
  • Splitting the requests' traffic across multiple IP addresses.

In the web scraping context, proxy servers are used to prevent IP address blocking, as a high number of requests sent from the same IP address can cause the connection to be identified as non-human.
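To illustrate the idea, here's a minimal sketch of routing traffic through a proxy with Python's popular requests library. The endpoint and credentials (`proxy.example.com`) are hypothetical placeholders for your provider's values:

```python
import requests

def make_proxies(proxy_url: str) -> dict:
    """Build a requests-style mapping that routes both plain HTTP
    and HTTPS traffic through the same proxy server."""
    return {"http": proxy_url, "https": proxy_url}

def fetch_origin_ip(proxies: dict) -> dict:
    """Ask httpbin.org which IP the server sees; with a working proxy
    configured, this reports the proxy's IP instead of yours."""
    return requests.get("https://httpbin.org/ip", proxies=proxies, timeout=10).json()

# Hypothetical endpoint - substitute your provider's host and credentials:
proxies = make_proxies("http://user:pass@proxy.example.com:8080")
# fetch_origin_ip(proxies) would now show the proxy's address as "origin".
```

With this mapping in place, every request made with `proxies=proxies` appears to the target website to come from the proxy's IP rather than the client's.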

To further explore the usage of proxies for web scraping, let's have a look at the IP address types.

IP Protocol Versions

illustration of ipv4 versus ipv6 internet protocols

Currently, the internet runs on two types of IP addresses: IPv4 and IPv6. The key differences between these two protocols are the following:

  • Address quantity
    The IPv4 address pool is limited to around 4 billion addresses. This might seem like a lot, but the internet is a big place, and technically, we ran out of free addresses already! (see IPv4 address exhaustion)
  • Adoption
    Most websites still only support IPv4 connections, meaning we can't use IPv6 proxies unless we explicitly know the target website supports IPv6.

How does the IP address type affect web scraping?
Since very few target websites support IPv6, we are mostly limited to IPv4 proxy servers, which are more expensive (3-10 times on average) because of the limited address pool. That being said, some major websites do support IPv6, which can be checked with various IPv6 accessibility test tools. So, if your target website supports IPv6, the web scraping proxy pool budget can be significantly reduced!
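As a quick sketch, IPv6 support can be checked programmatically by looking up a hostname's IPv6 (AAAA) records with Python's standard library:

```python
import socket

def supports_ipv6(hostname: str) -> bool:
    """Check whether a hostname publishes IPv6 (AAAA) DNS records.
    A published record is a strong hint - not a guarantee - that the
    site actually serves traffic over IPv6."""
    try:
        return len(socket.getaddrinfo(hostname, 443, socket.AF_INET6)) > 0
    except socket.gaierror:
        return False

# e.g. supports_ipv6("google.com") - True for IPv6-enabled sites
```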

Proxy Protocols

There are two major proxy protocols used these days: HTTP and SOCKS (latest SOCKS5). In the context of web scraping proxies, there isn't much practical difference between these two protocols. Proxy servers with the SOCKS protocol tend to be a bit faster, more stable, and more secure. On the other hand, HTTP proxies are more widely adopted by web scraping proxy providers and the HTTP client libraries.
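For illustration, here's how both protocols are typically configured with the requests library. The endpoints are hypothetical placeholders, and SOCKS support requires the optional `requests[socks]` dependency:

```python
import requests

# Both endpoints below are hypothetical placeholders.
# HTTP proxy - the most widely supported option:
http_proxies = {
    "http": "http://proxy.example.com:8080",
    "https": "http://proxy.example.com:8080",
}

# SOCKS5 proxy - requires: pip install "requests[socks]".
# The "socks5h" scheme also resolves DNS on the proxy side,
# so DNS lookups don't leak from your own machine.
socks_proxies = {
    "http": "socks5h://proxy.example.com:1080",
    "https": "socks5h://proxy.example.com:1080",
}

def fetch(url: str, proxies: dict) -> str:
    """Fetch a page through whichever proxy mapping is passed in."""
    return requests.get(url, proxies=proxies, timeout=10).text
```

Note that from the scraper code's perspective, switching between the two protocols is just a matter of swapping the proxy URL scheme.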

Proxy Types

The proxy type is the most important aspect when choosing a web scraping proxy provider or creating a proxy pool. There are four types of proxy IP addresses:

  • Datacenter
  • Residential
  • Static Residential (aka ISP)
  • Mobile

The key differences between the above proxy server types are the following:

  • Price
  • Reliability, such as connection speed and automatic proxy rotation
  • Stealth score: the likelihood of getting blocked, which is low for anonymous proxies

Let's have a deeper look into the value and details of each web scraping proxy type.

Datacenter Proxies

Datacenter IPs are commercially assigned to proxy services through cloud servers and aren't affiliated with internet service providers (ISPs). This web scraping proxy type is often flagged as high-risk (i.e., likely to be automated). Datacenter proxies can be provided as dedicated proxies or shared between multiple users; shared proxies carry a higher flagging risk.

On the bright side, datacenter proxies are widely accessible, reliable, and cheap! A proxy pool of this type is recommended for teams with solid engineering resources to reverse engineer the target websites. Those resources can be used to build a proxy manager that rotates proxies based on the blocking rate.

How to Rotate Proxies in Web Scraping

For an efficient proxy rotation example, see our introduction guide on rotating proxies effectively using weighted randomization.
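As a minimal illustration of that idea, here's a weighted-randomization sketch in Python. The proxy addresses are hypothetical, and each proxy carries a simple health score that adapts as blocks occur:

```python
import random

# Hypothetical pool: each proxy starts with the same health weight.
pool = {
    "http://proxy1.example.com:8080": 10,
    "http://proxy2.example.com:8080": 10,
    "http://proxy3.example.com:8080": 10,
}

def pick_proxy() -> str:
    """Weighted random pick: healthier proxies are chosen more often."""
    proxies, weights = zip(*pool.items())
    return random.choices(proxies, weights=weights, k=1)[0]

def report(proxy: str, blocked: bool) -> None:
    """Adjust a proxy's weight based on the last response outcome."""
    if blocked:
        pool[proxy] = max(1, pool[proxy] - 3)  # penalize but keep in rotation
    else:
        pool[proxy] = min(10, pool[proxy] + 1)  # slowly restore trust
```

Blocked proxies are picked less often but stay in rotation, so they naturally return to service once the target stops blocking them.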

Residential Proxies

Residential IPs are assigned by ISPs and have a lower risk of being flagged, as they are assigned to home networks. Residential IPs make a reliable web scraping proxy as they are used by real humans!

That being said, proxy services with residential IP addresses are much pricier than datacenter ones. Additionally, this proxy type can struggle to maintain the same IP address for long periods, which is why they are often sold as "Rotating Residential Proxies".

Therefore, residential proxies can be problematic with target websites that require the same IP address to be maintained for the whole connection session. For example, if we are scraping data that appears at the end of a long multi-step process, the proxy pool could rotate the IP address before we reach the end.

A proxy service with residential IPs requires minimal engineering effort, as they have a high trust score and are relatively affordable.
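Many residential providers work around the rotation problem by pinning a session to one exit IP through a session identifier embedded in the proxy username. The exact format varies by provider, so treat this Python sketch as a hypothetical example of the pattern:

```python
import uuid

def sticky_proxy_url(user, password, host, port, session_id=None):
    """Embed a session id in the proxy username so the provider keeps
    routing us through the same exit IP. The "-session-" convention is
    a hypothetical example; check your provider's documentation."""
    session_id = session_id or uuid.uuid4().hex[:8]
    return f"http://{user}-session-{session_id}:{password}@{host}:{port}"

# Reusing the same session id keeps the same residential IP;
# generating a new one rotates to a fresh IP.
```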

Static Residential / ISP Proxies

Residential IPs have a great trust score but can be unreliable, as they aren't powered by datacenter infrastructure. What if we could combine the best of both worlds: the reliability of datacenter proxies and the stealth of residential proxies?

ISP proxies, also known as "Static Residential" proxies, are a hybrid of residential and datacenter proxies. They combine the high trust score of residential IPs with the network quality of datacenter infrastructure!

The static residential proxies are best suited for web scrapers, as they can benefit from the high trust score and the persistent connection sessions.

Mobile Proxies

Mobile IPs are assigned by mobile network towers, and their dynamic IP addresses get rotated automatically. This gives them a high trust score, making them unlikely to get blocked or challenged with CAPTCHAs.

Mobile proxies are an extreme version of residential proxies: maintaining the same IP is even more challenging, and they are even more expensive. This proxy type also tends to be slower and less reliable, although proxy providers have been improving it lately.

Mobile proxies don't require many engineering resources, as they solve most connection blocking by themselves!

Other Proxy Types

We've covered four proxy types. However, masking the IP address isn't only possible through regular proxy providers. Let's quickly explore the other options.

Virtual Private Network (VPN)

VPNs are proxies with a more complex tunneling protocol. The IPs of a VPN are shared across many users, which means VPN IPs have low trust scores and are likely to get blocked or challenged with CAPTCHAs. Additionally, most VPNs don't provide access to their HTTP or SOCKS5 servers. However, they can still be used for web scraping with a bit of technical knowledge.

The Onion Router (TOR)

Tor is open-source software that provides anonymous proxies through volunteer-run network layers. Tor exit IPs have a very low success rate, and Tor connections are slow and unreliable, making them ineffective for web scraping.

Which Web Scraping Proxy Provider to Choose?

In a nutshell, the more complex and rare the IP is, the harder it is to get identified and blocked, but it also costs more. However, more complex proxies have lower reliability.

Therefore, choosing a reliable proxy provider depends on your scraping target and project resources.

  • Datacenter proxies are great for getting around simple rate limiting and as a general safety net.
  • Residential proxies significantly reduce the chance of CAPTCHAs and of getting caught by anti-scraping protection services, but they require more engineering effort.
  • Mobile proxies are suitable for websites with higher blocking rates.

We recommend starting with a sizable pool of datacenter proxies, as they are significantly cheaper and more reliable, and evaluating from there as the project grows. Keep in mind, though, that they are the easiest type to identify.

How to Avoid Web Scraper IP Blocking?

For more on how proxies are identified, tracked, and blocked, see our introduction article, which explains what IP metadata is and how it's used to fingerprint connections.

Bandwidth Budget

When shopping around for the best web scraping proxies, we'll quickly notice that most proxies are priced by proxy count and bandwidth. Bandwidth can become a huge budget sink in some web scraping scenarios, so it's important to evaluate bandwidth consumption before choosing between dedicated proxies or a web scraping API.

It's easy to overlook bandwidth usage and end up with a huge proxy bill, so let's take a look at some examples:

target           avg document page size   pages per 1GB   avg browser page size   pages per 1GB
Walmart.com      16 kB                    1k - 60k        1 - 4 MB                200 - 2,000
Indeed.com       20 kB                    1k - 50k        0.5 - 1 MB              1,000 - 2,000
LinkedIn.com     35 kB                    300 - 30k       1 - 2 MB                500 - 1,000
Airbnb.com       35 kB                    30k             0.5 - 4 MB              250 - 2,000
Target.com       50 kB                    20k             0.5 - 1 MB              1,000 - 2,000
Crunchbase.com   50 kB                    20k             0.5 - 1 MB              1,000 - 2,000
G2.com           100 kB                   10k             1 - 2 MB                500 - 2,000
Amazon.com       200 kB                   5k              2 - 4 MB                250 - 500

In the table above, we can see the average bandwidth usage of various targets. Looking closely, some patterns emerge: big, static HTML websites (like Amazon) use a lot of bandwidth per document, compared to dynamic websites that populate their pages with background requests (like Walmart).
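The back-of-the-envelope math behind these numbers is simple: dividing 1 GB by the average page size gives the page count per gigabyte of proxy bandwidth.

```python
def pages_per_gb(avg_page_bytes: int) -> int:
    """How many page fetches fit into 1 GB of proxy bandwidth."""
    return 1_000_000_000 // avg_page_bytes

# A 200 kB HTML document (Amazon-sized) vs a 2 MB browser-rendered page:
print(pages_per_gb(200_000))    # 5000 document fetches per GB
print(pages_per_gb(2_000_000))  # 500 browser page loads per GB
```

A tenfold difference in page size translates directly into a tenfold difference in the proxy bandwidth bill.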

Another bandwidth sink is browser automation tools like Puppeteer, Selenium, or Playwright. Since web browsers are less precise in their connections, they often download a lot of unnecessary data such as images, fonts, and so on.

Therefore, it's essential to configure browser automation setups with resource blocking rules and proper caching rules to prevent bandwidth overhead, but generally expect browser traffic to be much more expensive bandwidth-wise.
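As a sketch of such resource blocking (assuming Playwright's Python API and an installed Chromium build), heavy resource types can be aborted before they are ever downloaded:

```python
BLOCKED_RESOURCE_TYPES = {"image", "font", "media", "stylesheet"}

def should_abort(resource_type: str) -> bool:
    """Resource types that rarely matter for data extraction
    but account for much of a page's bandwidth."""
    return resource_type in BLOCKED_RESOURCE_TYPES

def scrape_lean(url: str) -> str:
    """Load a page with heavy resources blocked.
    Requires: pip install playwright && playwright install chromium"""
    from playwright.sync_api import sync_playwright

    with sync_playwright() as pw:
        browser = pw.chromium.launch()
        page = browser.new_page()
        # Intercept every request and drop the heavy, unneeded ones.
        page.route(
            "**/*",
            lambda route: route.abort()
            if should_abort(route.request.resource_type)
            else route.continue_(),
        )
        page.goto(url)
        html = page.content()
        browser.close()
        return html
```

Which resource types are safe to block depends on the target; blocking stylesheets, for example, can break sites that gate content behind CSS-driven rendering checks.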

How to Scrape Dynamic Websites Using Headless Web Browsers

For more on optimizing web browser automation tools, see our extensive article covering three major packages: Puppeteer, Playwright, and Selenium.

Common Proxy Issues

Using a scraping proxy means putting a middleman between your client and the server, which can introduce many issues.

Probably the biggest issue is support for HTTP/2 and HTTP/3 traffic. The newer HTTP protocols are typically preferred in web scraping to avoid blocking. Unfortunately, many HTTP proxies struggle with this kind of traffic, so when choosing a web scraping proxy provider, we advise testing HTTP/2 quality first!

Another common proxy provider issue is connection concurrency. Proxy services typically limit concurrent proxy connections, and that limit might be too small for powerful web scrapers. Hence, it's worth researching the concurrent connection limit and throttling scrapers slightly below it to prevent proxy-related connection crashes.
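A common way to stay below a provider's limit is a semaphore. Here's a minimal asyncio sketch, with the concurrency limit as a hypothetical example value and a sleep standing in for the real proxied request:

```python
import asyncio

# Hypothetical provider limit: say, 50 concurrent proxy connections.
PROVIDER_CONCURRENCY_LIMIT = 50

async def fetch(url: str, semaphore: asyncio.Semaphore) -> str:
    async with semaphore:  # caps how many requests are in flight at once
        await asyncio.sleep(0.01)  # stand-in for the real proxied request
        return f"scraped {url}"

async def main() -> list:
    # Throttle a bit below the provider limit to leave headroom for retries.
    semaphore = asyncio.Semaphore(PROVIDER_CONCURRENCY_LIMIT - 5)
    urls = [f"https://example.com/page/{i}" for i in range(100)]
    return list(await asyncio.gather(*(fetch(u, semaphore) for u in urls)))
```

All 100 tasks are scheduled at once, but the semaphore ensures no more than 45 requests ever run concurrently.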

Finally, proxies introduce a lot of additional complexity to a web scraping project. So, when using a proxy server for scraping, we recommend investing additional engineering effort in retry and error-handling logic.
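A simple retry helper with exponential backoff and jitter covers the most common proxy failures. This is a generic sketch, not a prescription for any particular HTTP client:

```python
import random
import time

def fetch_with_retries(fetch, url, max_retries=3, base_delay=1.0):
    """Retry a flaky proxied request with exponential backoff and jitter.
    `fetch` is any callable that raises on proxy or connection errors."""
    for attempt in range(max_retries + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_retries:
                raise  # out of retries - surface the original error
            # back off 1s, 2s, 4s... plus jitter so retries don't synchronize
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
```

In a real scraper, this is usually combined with proxy rotation, so each retry also goes out through a different IP address.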

Proxies at ScrapFly

Proxies can be a very powerful tool in web scraping but still not enough for scaling up some web scraping projects and this is where Scrapfly can assist!

scrapfly middleware

ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale. Each product is equipped with an automatic bypass for any anti-bot system, which we achieve by:

  • Maintaining a fleet of real, reinforced web browsers with real fingerprint profiles.
  • Running millions of self-healing proxies with the highest possible trust score.
  • Constantly evolving and adapting to new anti-bot systems.

We've been doing this publicly since 2020 with the best bypass on the market!

ScrapFly is a web scraping API that offers a request middleware service, which ensures that outgoing requests result in successful responses. This is done by a combination of unique ScrapFly features such as a smart proxy selection algorithm, anti-web scraping protection solver, and browser-based rendering.

ScrapFly uses a credit-based pricing model, which is much easier to predict and scale than bandwidth-based pricing. Pricing is based on the features used rather than arbitrary measurements such as bandwidth, meaning our users aren't locked into a single solution and can adjust their scrapers on the fly!

image of scrapfly's pricing tiers

For example, the most popular $100/month tier can yield up to 1,000,000 target responses, depending on the enabled features.

To explore these and other offered features see our full documentation!

FAQ

To wrap up this guide on using proxies for web scraping, let's take a look at some frequently asked questions.

Can free proxies be used in web scraping?

Yes, though with limited benefit. Free scraping proxies are easy to identify and perform very poorly, so we only recommend free proxy lists for low-demand web scraping and for teams with enough engineering resources to keep track of free proxy availability.

Are scraping proxies banned forever?

Usually, banned proxies recover within minutes, hours, or days. Permanent bans for web scraping are very unlikely, though some proxy providers' IP ranges are banned outright by various anti-scraping protection services.

Why use proxies in web scraping at all?

Proxies in web scraping are used to avoid scraper blocking or to access geographically restricted content. For more on how proxies are used in web scraper blocking, refer to our guide on IP address blocking.

Using Proxies For Web Scraping Summary

In this guide, we've learned a lot about proxies. We compared IPv4 vs IPv6 internet protocols and HTTP vs SOCKS proxy protocols. Then, we explored the different proxy types and how they differ in web scraping blocking. Finally, we wrapped everything up by looking at common proxy challenges like bandwidth-based pricing, HTTP2 support, and proxy stability issues.

Best Web Scraping Proxy Providers (2024 Update)

Now that you're familiar with proxies and their challenges, see our comparison write-up on the best proxies for web scraping.

Proxies are complicated and can be hard to work with, so try out our flat-priced ScrapFly solution for free!

Related Posts

What is Charles Proxy and How to Use it?

Learn about one of the most popular web debugging proxies, Charles Proxy, and what it's capable of.

How to Use Tor For Web Scraping

In this article, we'll explain web scraping using Tor. We'll use Tor as a proxy server to change the IP address randomly over either HTTP or SOCKS, as well as use it as a rotating proxy server.

Web Scraping with CloudProxy

One of the most common challenges encountered while web scraping is IP throttling and blocking. Learn about the CloudProxy tool, how to install it and how to use it for cloud-based web scraping.