IP address blocking and throttling are common challenges in web scraping, and the most common solution is to use proxies. However, proxies can be very expensive, so are there any free alternatives?
In this article, we'll explore web scraping using the Tor proxy network, which is free and accessible to anyone. For this, we'll set up our scrapers to use Tor as a proxy server. This routes scraping requests through Tor's relays and hides the scraper's real IP address. Let's get started!
Tor (The Onion Router) is a free and open-source tool and network that's most commonly used through the Tor Browser, which is based on Firefox. It allows for anonymous internet browsing by:
Tor is also the only way to access .onion domains. These domains offer increased anonymity, and their visitors can hide their identities almost completely.
Although Tor is most commonly used through its web browser, other clients, such as web scrapers, can use its network through a proxy.
Tor data is transmitted over a network called the Onion Router. This network consists of thousands of volunteer-based servers known as Tor relays.
To remain anonymous, requests sent over the Tor network go through three nodes: entry, middle, and exit. All requests sent over this network are encrypted, and they get decrypted through the following process:
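The layering idea can be illustrated with a toy sketch. This is not Tor's actual cryptography: the XOR "cipher", the relay keys, and the function names below are all assumptions made for illustration. The client wraps the request once per relay, and each relay removes only its own layer, so the plain request is visible only after the exit node.

```python
from itertools import cycle

def xor_layer(data: bytes, key: bytes) -> bytes:
    """Toy symmetric 'cipher': XOR with a repeating key (applying it twice undoes it)."""
    return bytes(b ^ k for b, k in zip(data, cycle(key)))

# Hypothetical per-relay keys, used here only for illustration.
RELAY_KEYS = {"entry": b"entry-key", "middle": b"middle-key", "exit": b"exit-key"}

def wrap(request: bytes) -> bytes:
    """Client side: add the exit layer first and the entry layer last."""
    for relay in ("exit", "middle", "entry"):
        request = xor_layer(request, RELAY_KEYS[relay])
    return request

def route(onion: bytes) -> bytes:
    """Each relay on the path removes only its own layer."""
    for relay in ("entry", "middle", "exit"):
        onion = xor_layer(onion, RELAY_KEYS[relay])
    return onion

onion = wrap(b"GET /ip")
plain = route(onion)
print(plain)  # b'GET /ip'
```

The sketch only shows the wrapping and peeling pattern; it offers no real security.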
Installing Tor is very straightforward. Simply go to the Tor download page and select the executable for your operating system:
After installing the downloaded file, run the Tor browser and connect to the network. You will get the following tab page:
Note that the Tor browser has to be open for web scrapers to connect to its network. Next, let's explore using Tor for web scraping.
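Since the scraper depends on the Tor browser running, it can help to check that its proxy port is reachable before sending requests. Here is a minimal sketch using only the standard library; the function name `is_tor_proxy_listening` is a hypothetical helper, and the default port assumes the Tor browser's usual 9150 SOCKS port.

```python
import socket

def is_tor_proxy_listening(host: str = "127.0.0.1", port: int = 9150, timeout: float = 2.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused or timed out: the Tor browser is probably not running.
        return False
```

Note that this only confirms the port is open; it doesn't verify that the listener is actually Tor.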
Since Tor routes the requests to different nodes on remote servers, the request's IP address gets rotated. Hence, we can use Tor as a proxy server for web scraping.
Web scraping using Tor distributes the requesting traffic across multiple IP addresses. This makes it harder for websites and antibots to detect the requests' origin, preventing IP address blocking.
By default, Tor can be used as a proxy server by connecting to port 9150 on localhost using the SOCKS5 protocol. Go to the http://127.0.0.1:9150/ localhost URL, and you will get the following page:
The above image warns about connecting to Tor as an HTTP proxy instead of SOCKS5, as it was requested using the browser.
Let's attempt to use Tor as a proxy server for simple HTTP requests:
import httpx

# SOCKS proxies need the optional extra: pip install "httpx[socks]"
client = httpx.Client(proxy="socks5://127.0.0.1:9150/")
resp = client.get("https://httpbin.dev/ip")
print(resp.text)
"{'origin': '185.243.218.61'}"
Here, we use the URL socks5://127.0.0.1:9150/ as a proxy server for httpx using Tor. The same URL can also be used with other web scraping clients to connect to Tor:
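For example, here is a rough sketch for the `requests` library, assuming it is installed with SOCKS support (`pip install "requests[socks]"`). The helper names `tor_socks_proxies` and `fetch_ip_with_requests` are hypothetical. It uses the `socks5h` scheme, which asks the proxy to resolve hostnames so that DNS lookups go through Tor as well.

```python
def tor_socks_proxies(host: str = "127.0.0.1", port: int = 9150) -> dict:
    """Build a requests-style proxies mapping that points at Tor's SOCKS5 port."""
    # "socks5h" makes the proxy resolve hostnames instead of the local machine.
    proxy_url = f"socks5h://{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}

def fetch_ip_with_requests() -> str:
    """Fetch our apparent IP address through Tor (needs the Tor browser running)."""
    # Third-party dependency, imported here so the helper above stays dependency-free.
    import requests

    response = requests.get("https://httpbin.dev/ip", proxies=tor_socks_proxies(), timeout=30)
    response.raise_for_status()
    return response.text
```

With Tor running, calling `fetch_ip_with_requests()` should return a JSON body containing a Tor exit node's IP rather than our own.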
We have successfully used Tor as a SOCKS5 proxy server. We can also enable HTTP proxying with Tor using a simple configuration option!
To use Tor as an HTTP proxy, we'll use HTTP tunneling. First, open the torrc file, which can be found at one of the following paths:
~/Browser/Browser/TorBrowser/Data/Tor/torrc
~/Library/Application Support/TorBrowser-Data/torrc
Then, add the line HTTPTunnelPort 9080 to the torrc file and save it. If we go to port 9080 on localhost, we'll find that Tor can now be used as an HTTP proxy.
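The torrc edit can also be scripted. The following is a minimal sketch, assuming we pass in the path to our own torrc file; `enable_http_tunnel` is a hypothetical helper name. It appends the option only when no HTTPTunnelPort line is present, so running it twice doesn't duplicate the setting.

```python
from pathlib import Path

def enable_http_tunnel(torrc_path: str, port: int = 9080) -> bool:
    """Append 'HTTPTunnelPort <port>' to torrc unless such a line already exists.

    Returns True if the file was modified.
    """
    path = Path(torrc_path)
    text = path.read_text() if path.exists() else ""
    for line in text.splitlines():
        if line.strip().lower().startswith("httptunnelport"):
            return False  # already configured; leave the file untouched
    # Make sure the new option starts on its own line.
    separator = "" if text == "" or text.endswith("\n") else "\n"
    path.write_text(f"{text}{separator}HTTPTunnelPort {port}\n")
    return True
```

After changing the file, Tor usually has to be restarted for the new port to become available.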
Just like we did earlier with SOCKS5, we can use the Tor HTTP proxy for web scraping:
import httpx
client = httpx.Client(proxy="http://127.0.0.1:9080/")
resp = client.get("https://httpbin.dev/ip")
print(resp.text)
"{'origin': '104.244.72.115'}"
Here, we use http://127.0.0.1:9080/ as our Tor HTTP proxy URL. It can also be used with other web scraping clients, including headless browsers such as Selenium, Playwright, and Puppeteer.
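As another illustration, the same HTTP proxy URL can be used with Python's built-in `urllib`, with no third-party packages. This is a sketch under the assumption that HTTPTunnelPort 9080 is enabled as described earlier; `build_tor_opener` and `fetch_ip` are hypothetical helper names. Tor's tunnel port handles HTTP CONNECT requests, which is what `urllib` sends for https:// URLs.

```python
import urllib.request

TOR_HTTP_PROXY = "http://127.0.0.1:9080"  # assumes HTTPTunnelPort 9080 in torrc

def build_tor_opener(proxy_url: str = TOR_HTTP_PROXY) -> urllib.request.OpenerDirector:
    """Create a urllib opener that sends requests through Tor's HTTP tunnel."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)

def fetch_ip(opener: urllib.request.OpenerDirector) -> str:
    """Fetch our apparent IP address (needs Tor running with the tunnel port enabled)."""
    with opener.open("https://httpbin.dev/ip", timeout=30) as response:
        return response.read().decode("utf-8")
```

With Tor running, `fetch_ip(build_tor_opener())` should return a JSON body similar to the httpx output above.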
When using Tor for web scraping as a proxy, we are limited to one IP address for each connection bridge. To get a new IP address, we have to restart Tor to re-initiate the connection bridge, which isn't practical to do while scraping.
There are community solutions that allow using Tor as a rotating proxy server by running multiple Tor connections at once. Let's explore a few of these solutions.
For further details on proxy rotation, refer to our dedicated guide.
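Conceptually, rotating over several Tor connections comes down to cycling through a list of proxy endpoints. The sketch below shows that idea only. The ports are hypothetical and assume three separate Tor instances are already exposing HTTP tunnels on them; the helper names are also made up for illustration.

```python
from itertools import cycle
from typing import Iterator, List

def tor_proxy_urls(ports: List[int], host: str = "127.0.0.1") -> List[str]:
    """Build one HTTP proxy URL per Tor instance."""
    return [f"http://{host}:{port}" for port in ports]

def round_robin(proxies: List[str]) -> Iterator[str]:
    """Yield proxy URLs in an endless round-robin order."""
    return cycle(proxies)

# Hypothetical ports: assumes three Tor instances expose HTTP tunnels here.
rotation = round_robin(tor_proxy_urls([9080, 9081, 9082]))
for page_number in range(1, 6):
    proxy = next(rotation)
    # A real scraper would create its HTTP client with this proxy URL.
    print(f"page {page_number} -> {proxy}")
```

Each request then leaves through a different Tor connection, and the sequence wraps around once every proxy has been used.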
Tor can be used as a proxy server to change the IP address. However, using Tor for web scraping can have some limitations, including:
For example, let's attempt to scrape data with Tor and observe its execution time:
import time
import httpx

# client routed through the Tor HTTP proxy (note: httpx timeouts are in seconds)
client = httpx.Client(proxy="http://127.0.0.1:9080/", timeout=5000)
start_time = time.time()
for page_number in range(1, 6):
    client.get(f"https://web-scraping.dev/products?page={page_number}")
total_execution_time = time.time() - start_time
print(f"Requests with Tor proxy execution time: {total_execution_time:.2f} seconds")

client = httpx.Client() # regular client without Tor
start_time = time.time()
for page_number in range(1, 6):
    client.get(f"https://web-scraping.dev/products?page={page_number}")
total_execution_time = time.time() - start_time
print(f"Regular requests execution time: {total_execution_time:.2f} seconds")
"""
Requests with Tor proxy execution time: 19.93 seconds
Regular requests execution time: 10.36 seconds
"""
From the above results, we can see that the Tor web crawler took almost double the time of normal requests. In a real-life scraping scenario, it can be even slower.
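To see where the time goes, it can be useful to time each request individually instead of the whole loop. Here is a small sketch of such a helper; `time_requests` is a hypothetical name, and it accepts any callable that takes a URL, such as the `client.get` method used above.

```python
import time
from typing import Callable, Iterable, List

def time_requests(fetch: Callable[[str], object], urls: Iterable[str]) -> List[float]:
    """Call fetch(url) for each URL and return the duration of each call in seconds."""
    durations = []
    for url in urls:
        start = time.perf_counter()
        fetch(url)
        durations.append(time.perf_counter() - start)
    return durations
```

For instance, passing `client.get` and the five product page URLs returns five durations, which makes it easy to spot individual slow requests.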
We can conclude that web scraping using Tor is suitable for small and medium-sized projects. However, it's not reliable enough for efficient and scalable web scrapers. To change the IP address effectively while web scraping, it's advised to use high-quality residential and mobile proxies. Let's have a look!
Proxies and the Tor network can be powerful, but they are often still not enough to scale up web scraping operations, and this is where ScrapFly can lend a hand!
ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale.
Here's how we can web scrape with proxies using the ScrapFly Python SDK. All we have to do is select a proxy pool and country while enabling the asp parameter to bypass scraping blocking:
from scrapfly import ScrapeConfig, ScrapflyClient
scrapfly = ScrapflyClient(key="Your ScrapFly API key")
response = scrapfly.scrape(ScrapeConfig(
    url="website URL",
    asp=True, # enable the anti scraping protection to bypass blocking
    proxy_pool="public_residential_pool", # select the residential proxy pool
    country="US", # set the proxy location to a specific country
    render_js=True # enable rendering JavaScript (like headless browsers) to scrape dynamic content if needed
))
# use the built in Parsel selector
selector = response.selector
# access the HTML content
html = response.scrape_result['content']
To wrap up this guide on web scraping using Tor, let's have a look at some frequently asked questions.
Yes. The Tor browser accepts a SOCKS5 proxy connection at the URL 127.0.0.1:9150 by default, which can be used with HTTP clients or headless browsers.
To accept HTTP proxy connections with Tor, open the torrc file in the Tor browser directory. Then, add the line HTTPTunnelPort 9080 and save the file. Tor will then operate as an HTTP proxy at the URL 127.0.0.1:9080.
Yes. rotating-tor-http-proxy and rotating-proxy are open-source tools that start multiple Tor connections using Docker. Together, these connections can be used as a rotating proxy server with dynamic proxy IPs.
In this guide, we explained how to use Tor for web scraping. We started by defining what Tor is, how it works, and how to install it.
We went through a step-by-step guide on using Tor as a SOCKS and HTTP proxy server and explored a few open-source tools that allow using Tor as a rotating proxy server. Finally, we explored the limitations of web scraping using Tor: