How to Use CloudProxy for Cloud-Based Web Scraping
Learn how to build a proxy pool from cloud provider IPs with CloudProxy and use it to avoid IP blocking while web scraping.
One of the most common challenges encountered while web scraping is IP throttling and blocking, which requires rotating the IP address to avoid detection.
In this article, we'll explore CloudProxy, a tool that can be used to change the requests' IP address while scraping. We'll start by explaining what the tool is and how to install it, then show how to use it for cloud-based web scraping. Let's get started!
Remote machines on cloud providers, such as DigitalOcean Droplets and AWS EC2 instances, are pre-configured with network infrastructure and their own public IP addresses. CloudProxy utilizes these remote servers to create a set of IPs that can be used as a proxy pool.
In other words, CloudProxy acts as cloud-based web scraping middleware: it uses the cloud providers' APIs to spin up proxy servers and build an IP address pool, and the web scraping requests are then sent from these IPs. CloudProxy can be used with several cloud providers: AWS, Google Cloud, DigitalOcean, and Hetzner.
Websites and antibot systems can see the scraper's IP address. So, if many scrape requests are sent from the same IP over a short time window, the website can flag them as automated. This detection can lead the website to apply throttling rules that deny requests beyond a specific rate limit, or even block the IP address entirely.
By using CloudProxy for web scraping, we can distribute the scraper's load across different IPs. This makes it harder for antibots to track down the origin of the requests, helping avoid IP address blocking.
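To illustrate the rotation idea before setting up CloudProxy, here is a minimal sketch that cycles requests over a hypothetical list of proxy URLs so consecutive requests originate from different IPs. The proxy addresses are placeholders, and it uses the same httpx proxies argument as the rest of this article (newer httpx versions call it proxy):
import itertools
import httpx

# hypothetical proxy URLs - with CloudProxy, these come from its API (shown later)
proxies = [
    "http://user:pass@203.0.113.10:8899",
    "http://user:pass@203.0.113.11:8899",
]
rotation = itertools.cycle(proxies)

for url in ["https://web-scraping.dev/products"] * 4:
    proxy = next(rotation)  # each request goes out through the next IP in the pool
    response = httpx.get(url, proxies=proxy)
    print(response.status_code, "via", proxy)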
For more details, refer to our extensive article on using proxies for web scraping.
CloudProxy can be installed using Docker. If you don't have Docker installed, refer to the Docker installation page for detailed instructions. After installing Docker, you can run CloudProxy using its Docker image. Create a docker-compose.yml file with the following configuration:
version: '3'
services:
  cloudproxy:
    image: laffin/cloudproxy:latest
    ports:
      - "8000:8000"
    environment:
      - USERNAME=CHANGE_THIS_USERNAME
      - PASSWORD=CHANGE_THIS_PASSWORD
      - ONLY_HOST_IP=True
      # select the settings of the cloud providers you want to use
      # DigitalOcean
      - DIGITALOCEAN_ENABLED=True
      - DIGITALOCEAN_ACCESS_TOKEN=YOUR SECRET ACCESS KEY
      # AWS
      - AWS_ENABLED=True
      - AWS_ACCESS_KEY_ID=YOUR AWS ACCESS ID
      - AWS_SECRET_ACCESS_KEY=YOUR SECRET ACCESS KEY
      # Google Cloud Platform
      - GCP_ENABLED=True
      - GCP_PROJECT=YOUR GCP PROJECT ID
      - GCP_SERVICE_ACCOUNT_KEY=YOUR GCP SERVICE KEY
      # Hetzner
      - HETZNER_ENABLED=True
      - HETZNER_ACCESS_TOKEN=YOUR SECRET ACCESS KEY
Let's break down the above configuration:
- only_host_ip restricts access to the CloudProxy proxies to the hosting server only.
- username and password are used to authenticate the proxies from the web scraper side.
- The rest of the parameters are authorization keys for each cloud provider.
The last step is to start the Docker container, and we'll have a running CloudProxy instance:
docker-compose up --build
The above command will run the CloudProxy instance at localhost:8000. It will also run a frontend app for scaling the proxies up or down, which can be found at localhost:8000/ui.
Now that we have CloudProxy up and running, let's install a few Python libraries to use it for web scraping:
- httpx for sending HTTP requests.
- parsel for parsing HTML with XPath and CSS selectors.
The above libraries can be installed using the pip command:
pip install httpx parsel
To retrieve all the proxy IPs on CloudProxy, we'll have to send a simple GET request to the http://localhost:8000/ endpoint:
curl -X 'GET' 'http://localhost:8000/' -H 'accept: application/json'
The response will look like this:
{"ips":["http://username:password:192.168.0.1:8899", "http://username:password:192.168.0.2:8899"]}
To rotate the proxies within our web scraper, we'll split our scraping logic into three parts: getting a random proxy IP from CloudProxy, requesting the target website through that proxy, and parsing the product data from the HTML.
Let's use CloudProxy to create a cloud-based web scraper for extracting product data from web-scraping.dev:
import json
import httpx
import random
from parsel import Selector
def parse_products(response):
"""parse products from HTML"""
selector = Selector(response.text)
data = []
for product in selector.xpath("//div[@class='row product']"):
        name = product.xpath(".//div[contains(@class, 'description')]/h3/a/text()").get()
        link = product.xpath(".//div[contains(@class, 'description')]/h3/a/@href").get()
product_id = link.split("/product/")[-1]
price = float(product.xpath(".//div[@class='price']/text()").get())
image = product.xpath(".//img/@src").get()
data.append({
"product_id": int(product_id),
"name": name,
"link": link,
"price": price,
"image": image
})
return data
def random_proxy():
"""get a random proxy from CloudProxy"""
ips = httpx.get("http://localhost:8000").json()
return random.choice(ips['ips'])
def scrape_products(url):
"""scrape product data"""
# get a random proxy
proxy = random_proxy()
# request the target website with a random proxy
response = httpx.get(url, proxies=proxy)
# parse the HTML
data = parse_products(response)
return data
data = scrape_products(url="https://web-scraping.dev/products")
print(json.dumps(data, indent=2))
In the above code, we define three functions. Let's break them down:
- parse_products for parsing the HTML using XPath selectors.
- random_proxy for getting a random proxy IP from the CloudProxy server.
- scrape_products for requesting the target website URL with the proxy we got and parsing the HTML to extract the product data.
Here is the result we got:
[
{
"product_id": 1,
"name": "Box of Chocolate Candy",
"link": "https://web-scraping.dev/product/1",
"price": 24.99,
"image": "https://web-scraping.dev/assets/products/orange-chocolate-box-medium-1.png"
},
{
"product_id": 2,
"name": "Dark Red Energy Potion",
"link": "https://web-scraping.dev/product/2",
"price": 4.99,
"image": "https://web-scraping.dev/assets/products/darkred-potion.png"
},
{
"product_id": 3,
"name": "Teal Energy Potion",
"link": "https://web-scraping.dev/product/3",
"price": 4.99,
"image": "https://web-scraping.dev/assets/products/teal-potion.png"
},
{
"product_id": 4,
"name": "Red Energy Potion",
"link": "https://web-scraping.dev/product/4",
"price": 4.99,
"image": "https://web-scraping.dev/assets/products/red-potion.png"
},
{
"product_id": 5,
"name": "Blue Energy Potion",
"link": "https://web-scraping.dev/product/5",
"price": 4.99,
"image": "https://web-scraping.dev/assets/products/blue-potion.png"
}
]
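Since the whole pool is available from CloudProxy's API, a natural extension is retrying blocked or throttled requests with a fresh proxy. Here is a hedged sketch that builds on the random_proxy and parse_products functions defined above; the status codes and retry count are illustrative assumptions rather than part of the original example:
def scrape_with_retries(url, max_retries=3):
    """retry the request with a new random proxy if it gets throttled or blocked"""
    for _ in range(max_retries):
        proxy = random_proxy()
        response = httpx.get(url, proxies=proxy)
        # 403 and 429 responses commonly indicate blocking or rate limiting
        if response.status_code not in (403, 429):
            return parse_products(response)
    raise RuntimeError(f"all {max_retries} attempts were blocked for {url}")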
With CloudProxy, we were able to create a proxy pool and get a random proxy IP to avoid IP address blocking. However, our cloud-based web scraper has a few limitations. Let's take a closer look.
A significant downside of using CloudProxy for scraping is the type of IP address it provides. Since CloudProxy uses cloud providers to create proxy servers, the proxy IPs are classified as datacenter addresses. This IP type has a low trust score with websites and antibots, as regular users don't browse the internet from datacenter IPs.
CloudProxy provides an IP address pool to avoid IP restrictions, such as throttling and blocking. However, highly protected websites often challenge requests with CAPTCHAs if the traffic looks suspicious. For example, let's attempt to scrape leboncoin.fr using CloudProxy:
import httpx
import random
def random_proxy():
"""get a random proxy from CloudProxy"""
ips = httpx.get("http://localhost:8000").json()
return random.choice(ips['ips'])
proxy = random_proxy()
response = httpx.get("https://www.leboncoin.fr/", proxies=proxy)
print(response)
"<Response [403 Forbidden]>"
As we can see from the response status code, the request is blocked. Leboncoin flagged our datacenter IP address and required us to solve a DataDome challenge.
Let's have a look at a more efficient proxy solution to avoid blocking!
ScrapFly is a web scraping API with residential proxies from over 50 countries, which helps avoid IP throttling and blocking while also allowing scraping from almost any geographical location.
ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale.
Here is how we can use ScrapFly proxies to avoid web scraping blocking. All we have to do is select a proxy pool and enable the asp argument:
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
scrapfly = ScrapflyClient(key="Your ScrapFly API key")
api_response: ScrapeApiResponse = scrapfly.scrape(
ScrapeConfig(
# target website URL
url="https://www.leboncoin.fr",
# select a proxy pool (residential or datacenter)
proxy_pool="public_residential_pool",
# Set the proxy location to a specific country
country="FR",
# JavaScript rendering, similar to headless browsers
render_js=True,
# Bypass anti scraping protection
asp=True
)
)
# Print the website's status code
print(api_response.upstream_status_code)
"200"
Let's wrap up this quick introduction to CloudProxy with some frequently asked questions about scraping with it.
Is CloudProxy free to use?
CloudProxy itself is free; however, you need access to at least one cloud provider to use it. Cloud providers have different pricing models, but they start at $5/mo.
How many proxy IPs can CloudProxy provide?
The number of IPs you can access depends on the number of added cloud providers; however, each IP is rotated, which significantly increases the effective IP pool in practice.
Which cloud providers does CloudProxy support?
CloudProxy currently supports DigitalOcean, AWS, Google Cloud, and Hetzner, with Azure, Scaleway, and Vultr support planned for future releases.
In this article, we explained the CloudProxy tool. It authenticates with cloud providers to create proxy servers on their remote machines, which can be used to build a cloud-based web scraper.
We went through a step-by-step guide on installing CloudProxy with Docker and using it with different cloud providers. We also walked through a practical example of web scraping with CloudProxy.