Web Scraping with CloudProxy

by Mazen Ramadan Aug 22, 2024

#tools #proxies

One of the most common challenges in web scraping is IP throttling and blocking. Avoiding it requires changing the IP address that requests are sent from.

In this article, we'll explore CloudProxy, a tool for changing the IP address of requests while scraping. We'll start by explaining what the tool is and how to install it. Then, we'll show how to use it for cloud-based web scraping. Let's get started!

What is CloudProxy?

Remote machines on cloud providers, such as DigitalOcean droplets and AWS EC2 instances, come pre-configured with a sophisticated network infrastructure. CloudProxy uses these remote servers' networks to create a set of IPs that can serve as a proxy pool.

In other words, CloudProxy acts as middleware for cloud-based web scrapers. It uses the cloud providers' APIs to create an IP address pool, and the scraping requests are then sent from these IPs. CloudProxy works with several cloud providers: AWS, Google Cloud, DigitalOcean, and Hetzner.

Why Use CloudProxy For Web Scraping?

Websites and antibot systems can see the scraper's IP address. So, if many requests arrive from the same IP within a short time window, the website can flag them as automated. This detection can lead the website to throttle the IP, denying its requests for a period of time, or to block the IP address entirely.

Using CloudProxy for web scraping, we can distribute the scraper's load across different IPs. This makes it harder for antibot systems to trace the requests back to a single origin, which helps avoid IP address blocking.
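To illustrate what distributing the load means, here is a minimal sketch that pairs a list of URLs with proxies in round-robin order and counts how many requests each proxy would carry. The proxy addresses and page URLs are hypothetical placeholders, and no requests are actually sent.

```python
from collections import Counter
from itertools import cycle

# Hypothetical proxy addresses, used here for illustration only
proxies = [
    "http://10.0.0.1:8899",
    "http://10.0.0.2:8899",
    "http://10.0.0.3:8899",
]


def assign_proxies(urls, proxy_pool):
    """Pair each URL with the next proxy in the pool, round-robin."""
    rotation = cycle(proxy_pool)
    return [(url, next(rotation)) for url in urls]


# Hypothetical list of pages to scrape
urls = [f"https://web-scraping.dev/products?page={n}" for n in range(1, 7)]
assignments = assign_proxies(urls, proxies)

# Count how many requests each proxy would carry
load = Counter(proxy for _, proxy in assignments)
print(load)
```

With six URLs and three proxies, each proxy ends up with two requests instead of one IP carrying all six.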

For more details, refer to our extensive article on using proxies for web scraping.

The Complete Guide To Using Proxies For Web Scraping

Introduction to proxy usage in web scraping. What types of proxies are there? How to evaluate proxy providers and avoid common issues.

How to Install CloudProxy?

CloudProxy can be installed using Docker. If you don't have Docker installed, refer to the Docker installation page for detailed instructions. After installing Docker, you can run CloudProxy from its Docker image. Create a docker-compose.yml file with the following configuration:

version: '3'

services:
  cloudproxy:
    image: laffin/cloudproxy:latest
    ports:
      - "8000:8000"
    environment:
      - USERNAME=CHANGE_THIS_USERNAME
      - PASSWORD=CHANGE_THIS_PASSWORD
      - ONLY_HOST_IP=True
      # keep only the settings for the cloud provider(s) you use
      # DigitalOcean
      - DIGITALOCEAN_ENABLED=True
      - DIGITALOCEAN_ACCESS_TOKEN=YOUR SECRET ACCESS KEY
      # AWS
      - AWS_ENABLED=True
      - AWS_ACCESS_KEY_ID=YOUR AWS ACCESS ID
      - AWS_SECRET_ACCESS_KEY=YOUR SECRET ACCESS KEY
      # Google Cloud Platform
      - GCP_ENABLED=True
      - GCP_PROJECT=YOUR GCP PROJECT ID
      - GCP_SERVICE_ACCOUNT_KEY=YOUR GCP SERVICE KEY
      # Hetzner
      - HETZNER_ENABLED=True
      - HETZNER_ACCESS_TOKEN=YOUR SECRET ACCESS KEY

Let's break down the above configuration:

The ONLY_HOST_IP parameter restricts access to CloudProxy to the hosting server only.
The USERNAME and PASSWORD parameters are the credentials the web scraper uses to authenticate with the proxies.

The remaining parameters enable each cloud provider and supply its authorization keys.

The last step is to start the Docker image and we'll have a running CloudProxy instance:

docker-compose up --build

The above command will run the CloudProxy instance at localhost:8000. It will also run a frontend app for scaling the proxies up or down, which can be found at the URL localhost:8000/ui.

Now that we have CloudProxy up and running, let's install a few Python libraries to use it for web scraping:

httpx for sending HTTP requests and getting the response as HTML.
parsel for parsing the HTML using query languages, such as XPath and CSS selectors.

The above libraries can be installed using the pip command:

pip install httpx parsel

How to Use CloudProxy For Web Scraping?

To retrieve all the proxy IPs on CloudProxy, we'll have to send a simple GET request to the http://localhost:8000/ endpoint:

curl -X 'GET' 'http://localhost:8000/' -H 'accept: application/json'

The response will look like this:

{"ips":["http://username:password@192.168.0.1:8899", "http://username:password@192.168.0.2:8899"]}
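Each entry in the list is a full proxy URL with the credentials embedded. As a small sketch, assuming the entries follow the standard scheme://user:password@host:port form, Python's built-in urllib.parse can split one into its parts. The URL below is a placeholder, not a real proxy.

```python
from urllib.parse import urlsplit

# Placeholder proxy URL in the standard user:password@host:port form
proxy_url = "http://username:password@192.168.0.1:8899"

parts = urlsplit(proxy_url)
print(parts.scheme)    # http
print(parts.username)  # username
print(parts.password)  # password
print(parts.hostname)  # 192.168.0.1
print(parts.port)      # 8899
```

This can be handy for logging which proxy host served a request without printing the credentials.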

To rotate the proxies within our web scraper, we'll split our scraping logic into three parts:

Get a random proxy IP from CloudProxy.
Use the proxy we obtained with our HTTP client to scrape the target website's HTML.
Parse the response HTML for page data.

Let's use CloudProxy to create a cloud-based web scraper for extracting product data on web-scraping.dev:

import json
import httpx
import random
from parsel import Selector

def parse_products(response):
    """parse products from HTML"""
    selector = Selector(response.text)
    data = []
    for product in selector.xpath("//div[@class='row product']"):
        # quote 'description' so XPath treats it as a string, not an element name
        name = product.xpath(".//div[contains(@class, 'description')]/h3/a/text()").get()
        link = product.xpath(".//div[contains(@class, 'description')]/h3/a/@href").get()
        product_id = link.split("/product/")[-1]
        price = float(product.xpath(".//div[@class='price']/text()").get())
        image = product.xpath(".//img/@src").get()
        data.append({
            "product_id": int(product_id),
            "name": name,
            "link": link,
            "price": price,
            "image": image
        })
    return data


def random_proxy():
    """get a random proxy from CloudProxy"""
    ips = httpx.get("http://localhost:8000").json()
    return random.choice(ips['ips'])


def scrape_products(url):
    """scrape product data"""
    # get a random proxy
    proxy = random_proxy()
    # request the target website with a random proxy
    # httpx 0.26+ uses `proxy=`; older versions use `proxies=` instead
    response = httpx.get(url, proxy=proxy)
    # parse the HTML
    data = parse_products(response)
    return data


data = scrape_products(url="https://web-scraping.dev/products")
print(json.dumps(data, indent=2))

In the above code, we define three functions. Let's break them down:

parse_products for parsing the HTML using XPath selectors.
random_proxy for getting a random proxy IP from the CloudProxy server.
scrape_products for requesting the target website URL with the proxy we got and parsing the HTML to extract the product's data.

Here is the result we got:

[
  {
    "product_id": 1,
    "name": "Box of Chocolate Candy",
    "link": "https://web-scraping.dev/product/1",
    "price": 24.99,
    "image": "https://web-scraping.dev/assets/products/orange-chocolate-box-medium-1.png"
  },
  {
    "product_id": 2,
    "name": "Dark Red Energy Potion",
    "link": "https://web-scraping.dev/product/2",
    "price": 4.99,
    "image": "https://web-scraping.dev/assets/products/darkred-potion.png"
  },
  {
    "product_id": 3,
    "name": "Teal Energy Potion",
    "link": "https://web-scraping.dev/product/3",
    "price": 4.99,
    "image": "https://web-scraping.dev/assets/products/teal-potion.png"
  },
  {
    "product_id": 4,
    "name": "Red Energy Potion",
    "link": "https://web-scraping.dev/product/4",
    "price": 4.99,
    "image": "https://web-scraping.dev/assets/products/red-potion.png"
  },
  {
    "product_id": 5,
    "name": "Blue Energy Potion",
    "link": "https://web-scraping.dev/product/5",
    "price": 4.99,
    "image": "https://web-scraping.dev/assets/products/blue-potion.png"
  }
]
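The scraper above picks a single proxy per call, so one unreachable proxy fails the whole request. One possible extension, sketched below, is to retry with a different proxy from the pool. The fetch_with_rotation helper, the fetch callable, and the fake proxy addresses are all hypothetical names introduced for illustration. The demonstration runs offline with a fake fetch function.

```python
import random


def fetch_with_rotation(url, proxies, fetch, max_attempts=3):
    """Try up to max_attempts distinct proxies until fetch succeeds.

    `fetch(url, proxy)` is a hypothetical callable supplied by the caller;
    it should return the response or raise an exception on failure.
    """
    # sample without replacement so no proxy is tried twice
    candidates = random.sample(proxies, k=min(max_attempts, len(proxies)))
    last_error = None
    for proxy in candidates:
        try:
            return fetch(url, proxy)
        except Exception as error:  # real code should catch a narrower error type
            last_error = error
    raise RuntimeError(f"all {len(candidates)} proxies failed") from last_error


# Offline demonstration: a fake fetch function where one proxy is "bad"
def fake_fetch(url, proxy):
    if proxy == "http://10.0.0.1:8899":
        raise ConnectionError("proxy unreachable")
    return f"fetched {url} via {proxy}"


pool = ["http://10.0.0.1:8899", "http://10.0.0.2:8899"]
print(fetch_with_rotation("https://web-scraping.dev/products", pool, fake_fetch))
```

In a real scraper, the fetch argument could wrap the HTTP client call, and the pool could come from the CloudProxy endpoint.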

With CloudProxy, we were able to create a proxy pool and get a random proxy IP to avoid IP address blocking. However, our cloud-based web scraper has a few limitations. Let's take a closer look.

CloudProxy Limitations

A significant downside of using CloudProxy for scraping is that requests can be blocked because of the IP address type. Since CloudProxy uses cloud providers to create proxy servers, the proxy IP addresses are classified as datacenter IPs. This IP address type has a low trust score with websites and antibot systems, as regular users don't browse the internet from datacenter IPs.

CloudProxy provides an IP address pool to avoid IP restrictions, such as throttling and blocking. However, highly protected websites often challenge requests with CAPTCHAs when the traffic looks suspicious. For example, let's attempt to scrape leboncoin.fr using CloudProxy:

import httpx
import random

def random_proxy():
    """get a random proxy from CloudProxy"""
    ips = httpx.get("http://localhost:8000").json()
    return random.choice(ips['ips'])

proxy = random_proxy()

# httpx 0.26+ uses `proxy=`; older versions use `proxies=` instead
response = httpx.get("https://www.leboncoin.fr/", proxy=proxy)
print(response)
# <Response [403 Forbidden]>

As we can see from the response status code, the request is blocked. Leboncoin flagged our datacenter IP address and required us to solve a DataDome challenge:

Figure: CAPTCHA challenge page on leboncoin.fr (request through CloudProxy blocked)
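A scraper can notice this kind of response by checking the status code before parsing. The sketch below assumes 403 and 429 are the codes of interest, which is a common convention rather than a guarantee, since some sites return 200 with a challenge page instead. The looks_blocked helper is a hypothetical name.

```python
# Status codes that commonly signal blocking or throttling (an assumption;
# sites differ, and some return 200 with a challenge page instead)
BLOCK_STATUS_CODES = {403, 429}


def looks_blocked(status_code):
    """Return True when the status code suggests the request was blocked."""
    return status_code in BLOCK_STATUS_CODES


print(looks_blocked(403))  # True
print(looks_blocked(200))  # False
```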

Let's have a look at a more efficient proxy solution to avoid blocking!

Proxies With ScrapFly

ScrapFly is a web scraping API with residential proxies in more than 50 countries, which helps avoid IP throttling and blocking while also allowing scraping from almost any geographical location.

ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale.

Anti-bot protection bypass - scrape web pages without blocking!
Rotating residential proxies - prevent IP address and geographic blocks.
JavaScript rendering - scrape dynamic web pages through cloud browsers.
Full browser automation - control browsers to scroll, input and click on objects.
Format conversion - scrape as HTML, JSON, Text, or Markdown.
Python and Typescript SDKs, as well as Scrapy and no-code tool integrations.

Here is how we can use ScrapFly proxies to avoid web scraping blocking. All we have to do is select a proxy pool and enable the asp argument:

from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

scrapfly = ScrapflyClient(key="Your ScrapFly API key")

api_response: ScrapeApiResponse = scrapfly.scrape(
    ScrapeConfig(
        # target website URL
        url="https://www.leboncoin.fr",
        # select a proxy pool (residential or datacenter)
        proxy_pool="public_residential_pool",
        # Set the proxy location to a specific country
        country="FR",        
        # JavaScript rendering, similar to headless browsers
        render_js=True,
        # Bypass anti scraping protection
        asp=True
    )
)
# Print the website's status code
print(api_response.upstream_status_code)
# 200


FAQ

Let's wrap up this quick introduction to CloudProxy with some frequently asked questions about scraping with it.

How much does CloudProxy cost?

CloudProxy itself is free. However, you need access to at least one cloud provider to use it. Cloud providers have different pricing models, but prices start at $5/month.

How many IPs can I access with CloudProxy?

The number of IPs you can access depends on the number of cloud providers you add. However, each IP is rotated, which significantly increases the effective size of the IP pool in practice.

What cloud providers does CloudProxy support?

CloudProxy currently supports DigitalOcean, AWS, Google Cloud, and Hetzner. Support for Azure, Scaleway, and Vultr is planned for future releases.