Web scraping at scale often requires a big pool of proxies to avoid rate limiting, throttling or outright blocking. Since proxies are an expensive resource, distributing them correctly can make a huge difference in overall resource usage.
In this article, we'll take a look at what proxy rotation is and how it helps to avoid blocking, as well as some common rotation patterns, tips and tricks.
Why Rotate Proxies?
When web scraping with a pool of proxies, our connection success rate is heavily influenced by connection patterns. In other words, if proxy A connects to 50 pages in 5 seconds, it's likely to get throttled or blocked. However, if we have proxies A, B and C take turns, we can avoid pattern-based rate limiting and blocking.
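For instance, the simplest way to have proxies take turns is plain round-robin rotation. Here's a minimal sketch (the proxy addresses are just placeholders) using itertools.cycle:
from itertools import cycle

# placeholder proxy addresses - swap in your own pool
proxies = ["xx.xx.123.1", "xx.xx.123.2", "xx.xx.124.1"]
proxy_pool = cycle(proxies)

def get_next_proxy():
    # return the next proxy, wrapping back to the start of the pool
    return next(proxy_pool)

for _ in range(5):
    print(get_next_proxy())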
So, what are the best ways to rotate our proxies?
Proxy Distribution and Order
The way we distribute our proxies can make a huge difference. For example, a common approach is to simply pluck a random proxy from the pool for each request; however, that's not a very smart or efficient approach.
The first thing we should do is rotate proxies by subnet or proxy origin. For example, if we have a proxy pool of:
import random

proxies = [
    "xx.xx.123.1",
    "xx.xx.123.2",
    "xx.xx.123.3",
    "xx.xx.124.1",
    "xx.xx.124.2",
    "xx.xx.125.1",
    "xx.xx.125.2",
]

def get_proxy():
    return random.choice(proxies)

print("got 10 proxies:")
for i in range(10):
    print(get_proxy())
If we were to pluck a random proxy for each web scraping request, there's a good chance that proxies from the same subnet would end up being used back to back.
Since many throttling/blocking services take the subnet into account in their logic, our proxy rotator would be at a major disadvantage.
Instead, it's best to ensure that we randomize by subnet:
import random

proxies = [
    "xx.xx.123.1",
    "xx.xx.123.2",
    "xx.xx.123.3",
    "xx.xx.124.1",
    "xx.xx.124.2",
    "xx.xx.125.1",
    "xx.xx.125.2",
]

last_subnet = ""

def get_proxy():
    global last_subnet
    while True:
        ip = random.choice(proxies)
        ip_subnet = ip.split('.')[2]
        # only accept a proxy from a different subnet than the last one
        if ip_subnet != last_subnet:
            last_subnet = ip_subnet
            return ip

print("got 10 proxies:")
for i in range(10):
    print(get_proxy())
Now, our proxy selector will never return two proxies from the same subnet in a row.
The subnet is not the only axis that should weigh on our randomization.
Proxies also come with a lot of metadata such as the IP address's ASN (Autonomous System Number, which is essentially a "proxy IP owner ID"), location and so on. Depending on the web scraping target, it's a good idea to rotate proxies based on all of these features.
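As a rough sketch of this idea (the ASN and country values below are made up purely for illustration), we can attach such metadata to each proxy and avoid reusing the same ASN twice in a row, just like we did with subnets:
import random

# hypothetical metadata - the ASN and country values are made up for illustration
proxies = [
    {"ip": "xx.xx.123.1", "asn": "AS1000", "country": "US"},
    {"ip": "xx.xx.123.2", "asn": "AS1000", "country": "US"},
    {"ip": "xx.xx.124.1", "asn": "AS2000", "country": "DE"},
    {"ip": "xx.xx.125.1", "asn": "AS3000", "country": "GB"},
]

last_asn = ""

def get_proxy():
    global last_asn
    while True:
        proxy = random.choice(proxies)
        # never pick the same ASN (IP owner) twice in a row
        if proxy["asn"] != last_asn:
            last_asn = proxy["asn"]
            return proxy["ip"]

for _ in range(5):
    print(get_proxy())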
Tracking Proxy Performance
Once we have our proxies rolling, they will eventually start getting blocked. Some will be more successful than others, and some blocked ones will recover at some point.
We can further extend our rotator to track proxy performance. For example, we can mark dead proxies so they aren't picked again for a while, giving them some time to recover:
import random
from datetime import datetime, timedelta

import requests

proxies = [
    "xx.xx.123.1",
    "xx.xx.123.2",
    "xx.xx.123.3",
    "xx.xx.124.1",
    "xx.xx.124.2",
    "xx.xx.125.1",
    "xx.xx.125.2",
]

last_subnet = ""
dead_proxies = {}

class RetriesExceeded(Exception):
    pass

def get_proxy():
    global last_subnet
    while True:
        ip = random.choice(proxies)
        ip_subnet = ip.split('.')[2]
        if ip_subnet == last_subnet:
            continue
        if ip in dead_proxies:
            if datetime.utcnow() - dead_proxies[ip] < timedelta(seconds=30):
                # proxy has not recovered yet - skip
                continue
            else:
                # proxy has recovered - set it free!
                del dead_proxies[ip]
        last_subnet = ip_subnet
        return ip

def scrape(url, retries=0):
    proxy = get_proxy()
    response = requests.get(url, proxies={"http": proxy, "https": proxy})
    if response.status_code == 200:
        return response
    else:
        # mark dead proxies and retry with a different one
        dead_proxies[proxy] = datetime.utcnow()
        retries += 1
        if retries > 3:
            raise RetriesExceeded()
        return scrape(url, retries=retries)
Above, our simple proxy rotator distributes proxies randomly by subnet and keeps track of dead proxies.
Example: Weighted Random Proxy Rotator
Let's put together what we've learned and build a weighted random rotator. For this we'll be using Python's weighted random choice function random.choices.
random.choices selects a random element, and its best feature is that we can assign custom weights to the available choices. This allows us to prioritize some proxies over others. For example:
from collections import Counter
import random

proxies = [
    ("xx.xx.123.1", 10),
    ("xx.xx.123.2", 1),
    ("xx.xx.123.3", 1),
    ("xx.xx.124.1", 10),
    ("xx.xx.124.2", 1),
    ("xx.xx.125.1", 10),
    ("xx.xx.125.2", 1),
]

counter = Counter()
for i in range(1000):
    choice = random.choices([p[0] for p in proxies], [p[1] for p in proxies], k=1)
    counter[choice[0]] += 1

for proxy, used_count in counter.most_common():
    print(f"{proxy} was used {used_count} times")
In the above example, we gave all proxies ending in .1 ten times the weight of the others, which means they get picked roughly 10 times more often:
xx.xx.125.1 was used 298 times
xx.xx.124.1 was used 292 times
xx.xx.123.1 was used 283 times
xx.xx.125.2 was used 38 times
xx.xx.124.2 was used 34 times
xx.xx.123.3 was used 30 times
xx.xx.123.2 was used 25 times
Weighted randomization gives us immense creativity in designing our proxy rotator!
For example, we can:
penalize recently used proxies
penalize failing proxies
promote healthy or fast proxies
promote one proxy type over the other (e.g. residential proxies over datacenter)
This approach allows us to create proxy rotators that are unpredictable but smart. Let's take a look at an example implementing this logic:
import random
from time import time
from typing import List, Literal

class Proxy:
    """container for a proxy"""

    def __init__(self, ip, type_="datacenter") -> None:
        self.ip: str = ip
        self.type: Literal["datacenter", "residential"] = type_
        _, _, self.subnet, self.host = ip.split(":")[0].split(".")
        self.status: Literal["alive", "unchecked", "dead"] = "unchecked"
        self.last_used: int = None

    def __repr__(self) -> str:
        return self.ip

    def __str__(self) -> str:
        return self.ip

class Rotator:
    """weighted random proxy rotator"""

    def __init__(self, proxies: List[Proxy]):
        self.proxies = proxies
        self._last_subnet = None

    def weigh_proxy(self, proxy: Proxy):
        weight = 1_000
        if proxy.subnet == self._last_subnet:
            weight -= 500
        if proxy.status == "dead":
            weight -= 500
        if proxy.status == "unchecked":
            weight += 250
        if proxy.type == "residential":
            weight += 250
        if proxy.last_used:
            _seconds_since_last_use = time() - proxy.last_used
            weight += _seconds_since_last_use
        return weight

    def get(self):
        proxy_weights = [self.weigh_proxy(p) for p in self.proxies]
        proxy = random.choices(
            self.proxies,
            weights=proxy_weights,
            k=1,
        )[0]
        proxy.last_used = time()
        self._last_subnet = proxy.subnet
        return proxy
Example Run Code & Output
We can mock-run our Rotator to see proxy distribution:
from collections import Counter

if __name__ == "__main__":
    proxies = [
        # these will be used more often
        Proxy("xx.xx.121.1", "residential"),
        Proxy("xx.xx.121.2", "residential"),
        Proxy("xx.xx.121.3", "residential"),
        # these will be used less often
        Proxy("xx.xx.122.1"),
        Proxy("xx.xx.122.2"),
        Proxy("xx.xx.123.1"),
        Proxy("xx.xx.123.2"),
    ]
    rotator = Rotator(proxies)

    # let's mock some runs:
    _used = Counter()
    _failed = Counter()

    def mock_scrape():
        proxy = rotator.get()
        _used[proxy.ip] += 1
        if proxy.host == "1":  # simulate proxies with .1 being significantly worse
            _fail_rate = 60
        else:
            _fail_rate = 20
        if random.randint(0, 100) < _fail_rate:  # simulate some failure
            _failed[proxy.ip] += 1
            proxy.status = "dead"
            mock_scrape()
        else:
            proxy.status = "alive"
        return

    for i in range(10_000):
        mock_scrape()

    for proxy, count in _used.most_common():
        print(f"{proxy} was used {count:>5} times")
        print(f" failed {_failed[proxy]:>5} times")
Now, when we run our script we can see that our rotator picks residential proxies more often and proxies ending with ".1" less often just by using weighted randomization:
xx.xx.121.2 was used 2629 times
failed 522 times
xx.xx.121.3 was used 2603 times
failed 508 times
xx.xx.123.2 was used 2321 times
failed 471 times
xx.xx.122.2 was used 2302 times
failed 433 times
xx.xx.121.1 was used 1941 times
failed 1187 times
xx.xx.122.1 was used 1629 times
failed 937 times
xx.xx.123.1 was used 1572 times
failed 939 times
In this example, we can see how we can use simple probability adjustment to intelligently rotate proxies with very little oversight or tracking code.
Proxies with ScrapFly
ScrapFly's web scraping API has over 190M proxies from 92+ countries at its disposal and rotates IPs smartly for us.
from scrapfly import ScrapflyClient, ScrapeConfig

client = ScrapflyClient(key="YOUR SCRAPFLY KEY")
client.scrape(ScrapeConfig(
    url="https://httpbin.org/get",
    # we can enable Anti Scraping Protection bypass:
    asp=True,
))
The API will smartly select a proxy and a connection fingerprint that will be let through by the anti-scraping service.
Proxy Rotation Summary
In this article, we've taken a look at common proxy rotation strategies and how they can help us avoid blocking when web scraping.
We covered a few popular rotation patterns like score tracking based on proxy details and performance. Finally, we made an example proxy rotator that uses weighted random proxy selection to smartly pick proxies for our web scraping connections.