Web Scraping with CloudProxy


One of the most common challenges encountered while web scraping is IP throttling and blocking, which requires changing the IP address to prevent IP detection.

In this article, we'll explore the CloudProxy tool, which can be used to change the IP address of requests while scraping. We'll start by explaining what this tool is and how to install it. Then, we'll explain how to use it for cloud-based web scraping. Let's get started!

What is CloudProxy?

Remote machines on cloud providers, such as DigitalOcean droplets and AWS EC2s, are pre-configured with a sophisticated network infrastructure. CloudProxy utilizes the remote servers' networks to create a set of IPs, which can be used as a proxy pool.

In other words, CloudProxy acts as a cloud-based web scraper middleware. It uses the cloud providers' APIs to create an IP address pool. Then, the web scraping requests are sent from these IPs. CloudProxy can be used with several cloud providers: AWS, Google Cloud, DigitalOcean, and Hetzner.

Why Use CloudProxy For Web Scraping?

Websites and antibot systems can see the scraper's IP address. So, if many scrape requests are sent within a short time window, the website can flag them as automated. This IP detection can lead the website to apply throttling rules that deny requests from that IP for a set period, or even block the IP address entirely.

Using CloudProxy for web scraping, we can distribute the scraper's load across different IPs. This makes it harder for antibot systems to trace the requests back to a single IP origin, which helps avoid IP address blocking.

For more details, refer to our extensive article on using proxies for web scraping.

Introduction To Proxies in Web Scraping

Learn what types of proxies are there, how they compare against each other, what common challenges are posed by proxy usage, and their best practices in web scraping.


How to Install CloudProxy?

CloudProxy can be installed using Docker. If you don't have Docker installed, refer to the Docker installation page for detailed instructions. After installing Docker, you can run CloudProxy from its Docker image. Create a `docker-compose.yml` file with the following configuration:

version: '3'

services:
  cloudproxy:
    image: laffin/cloudproxy:latest
    ports:
      - "8000:8000"
    environment:
      - USERNAME=CHANGE_THIS_USERNAME
      - PASSWORD=CHANGE_THIS_PASSWORD
      - ONLY_HOST_IP=True
      # keep the settings of the cloud provider(s) you use and remove the rest
      # DigitalOcean
      - DIGITALOCEAN_ENABLED=True
      - DIGITALOCEAN_ACCESS_TOKEN=YOUR SECRET ACCESS KEY
      # AWS
      - AWS_ENABLED=True
      - AWS_ACCESS_KEY_ID=YOUR AWS ACCESS ID
      - AWS_SECRET_ACCESS_KEY=YOUR SECRET ACCESS KEY
      # Google Cloud Platform
      - GCP_ENABLED=True
      - GCP_PROJECT=YOUR GCP PROJECT ID
      - GCP_SERVICE_ACCOUNT_KEY=YOUR GCP SERVICE KEY
      # Hetzner
      - HETZNER_ENABLED=True
      - HETZNER_ACCESS_TOKEN=YOUR SECRET ACCESS KEY

Let's break down the above configuration:

  • The ONLY_HOST_IP parameter restricts access to the proxies to the IP address of the server hosting CloudProxy.
  • The USERNAME and PASSWORD parameters are the credentials the web scraper uses to authenticate with the proxies.

The rest of the parameters are authorization keys related to each cloud provider.

The last step is to start the Docker container, which gives us a running CloudProxy instance:

docker-compose up --build

The above command will run the CloudProxy instance at localhost:8000. It will also run a frontend app for scaling the proxies up or down, which can be found at the URL localhost:8000/ui.
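The proxy servers are created on the cloud provider, so they can take some time to show up in the pool after CloudProxy starts. As a rough sketch, we could poll for the proxy list until enough proxies are available. The `wait_for_proxies` helper and its `fetch_ips` callback are hypothetical names used for illustration, not part of CloudProxy:

```python
import time
from typing import Callable, List


def wait_for_proxies(
    fetch_ips: Callable[[], List[str]],
    minimum: int = 1,
    timeout: float = 300.0,
    interval: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Poll fetch_ips() until it returns at least `minimum` proxies.

    fetch_ips is a hypothetical callback that returns the current proxy list,
    e.g. lambda: httpx.get("http://localhost:8000").json()["ips"]
    """
    deadline = time.monotonic() + timeout
    while True:
        ips = fetch_ips()
        if len(ips) >= minimum:
            return ips
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"only {len(ips)} of {minimum} proxies ready after {timeout}s"
            )
        # wait before asking again so we don't hammer the API
        sleep(interval)
```

Passing the fetch logic as a callback keeps the helper independent of any HTTP client. The default timeout and interval are arbitrary and should be tuned to how quickly your provider creates instances.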

Now that we have CloudProxy up and running, let's install a few Python libraries to use it for web scraping:

  • httpx for sending HTTP requests and getting the response as HTML.
  • parsel for parsing the HTML using query languages, such as XPath and CSS selectors.

The above libraries can be installed using the pip command:

pip install httpx parsel

How to Use CloudProxy For Web Scraping?

To retrieve all the proxy IPs on CloudProxy, we'll have to send a simple GET request to the http://localhost:8000/ endpoint:

curl -X 'GET' 'http://localhost:8000/' -H 'accept: application/json'

The response will look like this:

{"ips":["http://username:password@192.168.0.1:8899", "http://username:password@192.168.0.2:8899"]}
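Each entry in the list is a full proxy URL that includes the credentials. When logging which proxy served a request, it's safer to leave the password out. Here is a small sketch using only the standard library, assuming the proxy URLs follow the usual scheme://username:password@host:port form. The `describe_proxy` and `mask_credentials` helpers are hypothetical names:

```python
from urllib.parse import urlsplit


def describe_proxy(proxy_url: str) -> dict:
    """Split a proxy URL into its parts without exposing the password."""
    parts = urlsplit(proxy_url)
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,
        "username": parts.username,
        "has_password": parts.password is not None,
    }


def mask_credentials(proxy_url: str) -> str:
    """Return the proxy URL without the username and password, for logging."""
    parts = urlsplit(proxy_url)
    host = parts.hostname or ""
    if parts.port:
        return f"{parts.scheme}://{host}:{parts.port}"
    return f"{parts.scheme}://{host}"


print(mask_credentials("http://username:password@192.168.0.1:8899"))
```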

To rotate the proxies within our web scraper, we'll split our scraping logic into three parts:

  1. Get a random proxy IP from CloudProxy.
  2. Use the proxy we obtained with our HTTP client to scrape the target website's HTML.
  3. Parse the response HTML for page data.

Let's use CloudProxy to create a cloud-based web scraper for extracting product data on web-scraping.dev:

import json
import httpx
import random
from parsel import Selector

def parse_products(response):
    """parse products from HTML"""
    selector = Selector(response.text)
    data = []
    for product in selector.xpath("//div[@class='row product']"):
        # note the quotes around 'description': without them XPath looks for a child element
        name = product.xpath(".//div[contains(@class, 'description')]/h3/a/text()").get()
        link = product.xpath(".//div[contains(@class, 'description')]/h3/a/@href").get()
        product_id = link.split("/product/")[-1]
        price = float(product.xpath(".//div[@class='price']/text()").get())
        image = product.xpath(".//img/@src").get()
        data.append({
            "product_id": int(product_id),
            "name": name,
            "link": link,
            "price": price,
            "image": image
        })
    return data


def random_proxy():
    """get a random proxy from CloudProxy"""
    ips = httpx.get("http://localhost:8000").json()
    return random.choice(ips['ips'])


def scrape_products(url):
    """scrape product data"""
    # get a random proxy
    proxy = random_proxy()
    # request the target website with a random proxy
    # (httpx 0.26+ takes `proxy=`; older versions used `proxies=`)
    response = httpx.get(url, proxy=proxy)
    # parse the HTML
    data = parse_products(response)
    return data


data = scrape_products(url="https://web-scraping.dev/products")
print(json.dumps(data, indent=2))

In the above code, we define three functions. Let's break them down:

  • parse_products for parsing the HTML using XPath selectors.
  • random_proxy for getting a random proxy IP from the CloudProxy server.
  • scrape_products for requesting the target website URL with the proxy we got and parsing the HTML to extract the product's data.

Here is the result we got:

[
  {
    "product_id": 1,
    "name": "Box of Chocolate Candy",
    "link": "https://web-scraping.dev/product/1",
    "price": 24.99,
    "image": "https://web-scraping.dev/assets/products/orange-chocolate-box-medium-1.png"
  },
  {
    "product_id": 2,
    "name": "Dark Red Energy Potion",
    "link": "https://web-scraping.dev/product/2",
    "price": 4.99,
    "image": "https://web-scraping.dev/assets/products/darkred-potion.png"
  },
  {
    "product_id": 3,
    "name": "Teal Energy Potion",
    "link": "https://web-scraping.dev/product/3",
    "price": 4.99,
    "image": "https://web-scraping.dev/assets/products/teal-potion.png"
  },
  {
    "product_id": 4,
    "name": "Red Energy Potion",
    "link": "https://web-scraping.dev/product/4",
    "price": 4.99,
    "image": "https://web-scraping.dev/assets/products/red-potion.png"
  },
  {
    "product_id": 5,
    "name": "Blue Energy Potion",
    "link": "https://web-scraping.dev/product/5",
    "price": 4.99,
    "image": "https://web-scraping.dev/assets/products/blue-potion.png"
  }
]
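The scraper above picks a single random proxy and stops if that request fails. As a minimal sketch of a more forgiving approach, we could retry the request with a different proxy from the pool. The `fetch_with_rotation` function and its `fetch` callback are hypothetical names, and the list of status codes treated as blocks is an assumption to adjust per target website:

```python
import random
from typing import Callable, Iterable, List, Optional


def fetch_with_rotation(
    url: str,
    proxies: List[str],
    fetch: Callable,
    max_attempts: int = 3,
    retry_statuses: Iterable[int] = (403, 429),
    rng: Optional[random.Random] = None,
):
    """Try the request through up to `max_attempts` different proxies.

    fetch(url, proxy) is a hypothetical callback, e.g. a thin wrapper
    around httpx.get, returning an object with a `status_code` attribute.
    Returns a (proxy, response) tuple for the first successful attempt.
    """
    if not proxies:
        raise ValueError("proxy pool is empty")
    rng = rng or random.Random()
    # pick distinct proxies in random order so a failing one is not reused
    candidates = rng.sample(list(proxies), k=min(max_attempts, len(proxies)))
    last_error = None
    for proxy in candidates:
        try:
            response = fetch(url, proxy)
        except Exception as exc:  # connection errors, timeouts, etc.
            last_error = exc
            continue
        if response.status_code in retry_statuses:
            last_error = RuntimeError(f"status {response.status_code} via proxy")
            continue
        return proxy, response
    raise RuntimeError(f"all {len(candidates)} attempts failed") from last_error
```

Taking the HTTP call as a callback keeps the retry logic separate from the client, so the same function works with httpx or any other library.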

With CloudProxy, we were able to create a proxy pool and get a random proxy IP to avoid IP address blocking. However, our cloud-based web scraper has a few limitations. Let's take a closer look.

CloudProxy Limitations

A significant downside of using CloudProxy for scraping is that requests can be blocked because of the IP address type. Since CloudProxy uses cloud providers to create proxy servers, the proxy IP addresses are classified as data center IPs. This IP address type has a low trust score with websites and antibot systems, as regular users don't browse the internet from data center IPs.

CloudProxy provides an IP address pool to avoid IP restrictions, such as throttling and blocking. However, highly protected websites often challenge requests with CAPTCHAs if the traffic looks suspicious. For example, let's attempt to scrape leboncoin.fr using CloudProxy:

import httpx
import random

def random_proxy():
    """get a random proxy from CloudProxy"""
    ips = httpx.get("http://localhost:8000").json()
    return random.choice(ips['ips'])

proxy = random_proxy()

# httpx 0.26+ takes `proxy=`; older versions used `proxies=`
response = httpx.get("https://www.leboncoin.fr/", proxy=proxy)
print(response)
# <Response [403 Forbidden]>

As we can see from the response status code, the request is blocked. Leboncoin flagged our data center IP address and required us to solve a DataDome challenge.

Figure: CAPTCHA challenge page shown by leboncoin.fr when scraping through CloudProxy
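To notice such blocks in a scraper instead of parsing a challenge page by mistake, we could add a simple check on each response. The following is only a rough heuristic: the `looks_blocked` function is a hypothetical name, and the status codes and text markers are assumptions that differ from one website to another:

```python
from typing import Iterable

# assumption: status codes commonly returned for blocked or throttled requests
BLOCK_STATUS_CODES = frozenset({403, 429})


def looks_blocked(
    status_code: int,
    body: str = "",
    markers: Iterable[str] = ("captcha",),
) -> bool:
    """Guess whether a response is a block or challenge page.

    `markers` are case-insensitive substrings to look for in the body.
    The default is an assumption and may cause false positives.
    """
    if status_code in BLOCK_STATUS_CODES:
        return True
    lowered = body.lower()
    return any(marker.lower() in lowered for marker in markers)
```

With httpx, this could be called as looks_blocked(response.status_code, response.text) before passing the response to the parsing logic.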

Let's have a look at a more efficient proxy solution to avoid blocking!

Proxies With ScrapFly

ScrapFly is a web scraping API with residential proxies from over 50 countries. It helps avoid IP throttling and blocking while also allowing scraping from almost any geographical location.


Here is how we can use ScrapFly proxies to avoid web scraping blocking. All we have to do is select a proxy pool and enable the asp argument:

from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

scrapfly = ScrapflyClient(key="Your ScrapFly API key")

api_response: ScrapeApiResponse = scrapfly.scrape(
    ScrapeConfig(
        # target website URL
        url="https://www.leboncoin.fr",
        # select a proxy pool (residential or datacenter)
        proxy_pool="public_residential_pool",
        # Set the proxy location to a specific country
        country="FR",        
        # JavaScript rendering, similar to headless browsers
        render_js=True,
        # Bypass anti scraping protection
        asp=True
    )
)
# Print the website's status code
print(api_response.upstream_status_code)
# 200

FAQ

Let's wrap up this quick introduction to CloudProxy with some frequently asked questions about scraping with it.

How much does CloudProxy cost?

CloudProxy itself is free; however, you need access to at least one cloud provider to use it. Cloud providers have different pricing models, but pricing starts at $5/mo.

How many IPs can I access with CloudProxy?

The number of IPs you can access depends on the number of cloud providers you add. However, each IP is rotated, which significantly increases the IP pool in practice.

What cloud providers does CloudProxy support?

CloudProxy currently supports DigitalOcean, AWS, Google Cloud, and Hetzner, with Azure, Scaleway, and Vultr support planned for future releases.

Web Scraping With CloudProxy Summary

In this article, we have explained what the CloudProxy tool is. It authenticates with cloud providers to create proxy servers on their remote machines, which can be used to build a cloud-based web scraper.

We went through a step-by-step guide on how to install CloudProxy using Docker and how to use it with different cloud providers. We have also gone through a practical example of web scraping using CloudProxy.
