Cloudflare is a popular antibot shield that blocks automated requests such as web scrapers. It's used across various global websites like Glassdoor, Indeed and G2. So, bypassing Cloudflare opens the door for a wide set of web scraping opportunities.
In this article, we'll explore the FlareSolverr tool and how to use it to get around Cloudflare while scraping. We'll start by explaining what FlareSolverr is, how it works and how to install and use it. Let's get started!
Legal Disclaimer and Precautions
This tutorial covers popular web scraping techniques for education. Interacting with public servers requires diligence and respect and here's a good summary of what not to do:
Do not scrape at rates that could damage the website.
Do not scrape data that's not available publicly.
Do not store PII of EU citizens who are protected by GDPR.
Do not repurpose entire public datasets, which can be illegal in some countries.
Scrapfly does not offer legal advice, but these are good general rules to follow in web scraping, and for more you should consult a lawyer.
What is FlareSolverr?
FlareSolverr is an open-source proxy server for solving Cloudflare anti-bot challenges.
It bypasses Cloudflare and creates a session with headers and cookies that are reused to authorize future requests against the Cloudflare challenge.
FlareSolverr can be used with both GET and POST requests. It also supports integration with Prometheus for generating metrics and statistics about the bypass performance. FlareSolverr doesn't bypass Cloudflare by actively solving its challenges. Instead, it mimics a normal browser's configuration so that the challenge passes on its own. Let's have a closer look!
How Does FlareSolverr Work?
FlareSolverr is built on top of Selenium and Undetected ChromeDriver, which applies different techniques to bypass Cloudflare, such as changing Selenium variable names and adding randomized delays and mouse movements.
When a request is sent to the FlareSolverr server, it spins up a headless Selenium browser with the Undetected ChromeDriver configuration and requests the page URL. Then, it waits for the Cloudflare challenge to be solved automatically or to time out. Finally, it preserves the session values of the successful requests and reuses them for future requests.
That being said, FlareSolverr can't bypass Cloudflare challenges with explicit CAPTCHAs that require clicks.
How to Install FlareSolverr?
FlareSolverr can be installed using source code or executable binaries. However, the most stable method is using Docker. If you don't have Docker installed, you can follow the official Docker installation page.
Create a docker-compose.yml file and add the following code:
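The original configuration listing is not reproduced here. As a stand-in, the following is a minimal sketch modeled on the compose file published in the FlareSolverr repository. The image name, the `LOG_LEVEL` variable and port 8191 come from that project; treat the remaining values as assumptions and check them against the official README before relying on them.

```yaml
# docker-compose.yml (illustrative sketch, not the article's original file)
services:
  flaresolverr:
    image: ghcr.io/flaresolverr/flaresolverr:latest
    container_name: flaresolverr
    environment:
      - LOG_LEVEL=info
    ports:
      # expose the FlareSolverr API on the port used throughout this article
      - "8191:8191"
    restart: unless-stopped
```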
Now that our configuration file is ready, let's spin up the FlareSolverr server:
docker-compose up --build
To verify your installation, head over to the FlareSolverr port at 127.0.0.1:8191 and you will get a response similar to this:
{"msg": "FlareSolverr is ready!", "version": "3.3.13", "userAgent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"}
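The same check can be scripted. Below is a minimal sketch that uses only the Python standard library. The function names `is_ready` and `fetch_index` are hypothetical helpers, and the check assumes the server keeps returning the `msg` value shown above.

```python
import json
from urllib.request import urlopen


def is_ready(info: dict) -> bool:
    """Return True if the index response reports a ready server."""
    return info.get("msg") == "FlareSolverr is ready!"


def fetch_index(base_url: str = "http://127.0.0.1:8191") -> dict:
    """Fetch and decode the JSON document served at the FlareSolverr root URL."""
    with urlopen(base_url, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the container running, `is_ready(fetch_index())` should evaluate to `True`. If the server is not reachable, `fetch_index()` raises an error instead.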
To web scrape using FlareSolverr, we need to interact with its server using HTTP requests. For this, we'll be using httpx. It can be installed using pip:
pip install httpx
How to Use FlareSolverr?
The main functionality of FlareSolverr is fairly straightforward. We route the requests to the FlareSolverr server, which executes them using Selenium and the Undetected ChromeDriver. However, it also allows for sending POST requests, managing sessions and adding proxies. Let's start with simple GET requests.
Sending GET Requests
To send requests using FlareSolverr, we need to send a POST request to the FlareSolverr URL and pass the request instructions through the request body:
The above request body is the minimal payload that FlareSolverr accepts. We specify the URL, request timeout and the request type, GET or POST.
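As an illustration, the minimal payload can be written as a plain Python dictionary and serialized to JSON. This is only a sketch: `build_payload` is a hypothetical helper, while the `cmd`, `url` and `maxTimeout` fields are the ones described in this section. Note that a `request.post` command additionally needs a `postData` field, covered later.

```python
import json

# the two request commands discussed in this article
ALLOWED_COMMANDS = {"request.get", "request.post"}


def build_payload(url: str, cmd: str = "request.get", max_timeout_ms: int = 60000) -> dict:
    """Build the minimal FlareSolverr request body."""
    if cmd not in ALLOWED_COMMANDS:
        raise ValueError(f"unsupported command: {cmd}")
    # maxTimeout is expressed in milliseconds
    return {"cmd": cmd, "url": url, "maxTimeout": max_timeout_ms}


body = json.dumps(build_payload("https://google.com"))
print(body)
```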
Let's replicate the above request using httpx and observe the result:
import httpx

def send_get_request(url: str):
    """send a GET request with FlareSolverr"""
    flaresolverr_url = "http://localhost:8191/v1"
    # JSON content type header
    r_headers = {"Content-Type": "application/json"}
    # request payload
    payload = {
        "cmd": "request.get",
        "url": url,
        "maxTimeout": 60000  # milliseconds
    }
    # send the POST request using httpx
    # httpx timeouts are in seconds, so keep this slightly above maxTimeout
    response = httpx.post(url=flaresolverr_url, headers=r_headers, json=payload, timeout=70)
    return response

response = send_get_request(url="https://google.com")
print(response.text)
We define a send_get_request() function. It requests the FlareSolverr URL with the target website URL and the request payload. Here is the result we got:
From the response, we can see all the cookie values assigned to the request. The assigned headers were also saved, but they are empty since no headers were set. Finally, we got the page HTML.
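To work with these parts programmatically, the JSON response can be unpacked into its pieces. The sketch below assumes the response layout documented by FlareSolverr, where a `solution` object holds `cookies`, `headers`, `userAgent` and the page HTML under `response`. The `summarize_solution` helper and the sample values are made up for illustration.

```python
def summarize_solution(data: dict) -> dict:
    """Pull the commonly used fields out of a FlareSolverr response."""
    solution = data.get("solution", {})
    return {
        "status": data.get("status"),
        "http_status": solution.get("status"),
        # cookies arrive as a list of objects; flatten them to name -> value
        "cookies": {c["name"]: c["value"] for c in solution.get("cookies", [])},
        "headers": solution.get("headers", {}),
        "user_agent": solution.get("userAgent"),
        "html": solution.get("response", ""),
    }


# illustrative response with made-up values, not real FlareSolverr output
sample = {
    "status": "ok",
    "solution": {
        "url": "https://example.com/",
        "status": 200,
        "cookies": [{"name": "example_cookie", "value": "abc123"}],
        "headers": {},
        "userAgent": "Mozilla/5.0 (illustrative)",
        "response": "<html><body>Hello</body></html>",
    },
}

summary = summarize_solution(sample)
print(summary["cookies"])
```

With a live server, `summarize_solution(response.json())` would be applied to the httpx response returned by `send_get_request()`.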
We have requested a Google page without a Cloudflare challenge. Let's put FlareSolverr into action by requesting nowsecure.nl, a simple page that sits behind the Cloudflare shield.
Let's attempt to bypass this page's Cloudflare challenge using FlareSolverr:
import httpx

def send_get_request(url: str):
    """send a GET request with FlareSolverr"""
    flaresolverr_url = "http://localhost:8191/v1"
    # JSON content type header
    r_headers = {"Content-Type": "application/json"}
    # request payload
    payload = {
        "cmd": "request.get",
        "url": url,
        "maxTimeout": 60000  # milliseconds
    }
    # send the POST request using httpx
    # httpx timeouts are in seconds, so keep this slightly above maxTimeout
    response = httpx.post(url=flaresolverr_url, headers=r_headers, json=payload, timeout=70)
    return response

response = send_get_request(url="https://nowsecure.nl/")
print(response.text)
If we take a look at the response, we'll find the Cloudflare challenge bypassed!
FlareSolverr has successfully bypassed the Cloudflare challenge and returned the session values. Let's have a look at how we can reuse this session.
Managing Sessions
We can grab the session values from FlareSolverr's responses and re-apply them with any HTTP client. However, FlareSolverr provides built-in methods for managing and reusing session values.
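For the first option, re-applying the values with your own HTTP client, a minimal sketch could look like the following. It assumes the `solution` object exposes `cookies` and `userAgent` as described in the FlareSolverr documentation. `client_kwargs_from_solution` is a hypothetical helper and the sample values are made up.

```python
def client_kwargs_from_solution(solution: dict) -> dict:
    """Convert a FlareSolverr solution into cookies and headers for an HTTP client."""
    cookies = {c["name"]: c["value"] for c in solution.get("cookies", [])}
    # reuse the browser's user agent together with the cookies it obtained
    headers = {"User-Agent": solution.get("userAgent", "")}
    return {"cookies": cookies, "headers": headers}


# made-up solution values for illustration only
solution = {
    "cookies": [{"name": "example_cookie", "value": "abc123"}],
    "userAgent": "Mozilla/5.0 (illustrative)",
}
kwargs = client_kwargs_from_solution(solution)
print(kwargs)
```

The result can then be passed along as `httpx.get(url, **kwargs)`. Whether the target accepts the cookies depends on the site, and the request generally needs to come from the same IP address that FlareSolverr used.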
First, we have to store the requests' sessions. This can be achieved using the sessions.create command:
import httpx

def send_get_request(url: str):
    """send a GET request with FlareSolverr and save the session"""
    flaresolverr_url = "http://localhost:8191/v1"
    # JSON content type header
    r_headers = {"Content-Type": "application/json"}
    # request payload
    # note: per the FlareSolverr docs, sessions.create takes optional
    # "session" and "proxy" fields; "url" appears to be unused by this command
    payload = {
        "cmd": "sessions.create",
        "url": url,
        "maxTimeout": 60000  # milliseconds
    }
    # send the POST request using httpx (timeout in seconds)
    response = httpx.post(url=flaresolverr_url, headers=r_headers, json=payload, timeout=70)
    return response

for i in range(3):
    # create 3 session values
    response = send_get_request(url="https://nowsecure.nl/")
    print(response.text)
In the above code, we change the FlareSolverr command to sessions.create to save and return the request's session. Then, we call the function three times to create different sessions. The response would look like this:
Now let's reuse one of these sessions by declaring the session ID in the request payload:
import httpx

def retrieve_session():
    """retrieve FlareSolverr session"""
    flaresolverr_url = "http://localhost:8191/v1"
    # JSON content type header
    r_headers = {"Content-Type": "application/json"}
    # request payload
    payload = {
        "cmd": "sessions.list"
    }
    # send the POST request using httpx
    response = httpx.post(url=flaresolverr_url, headers=r_headers, json=payload)
    # take the first stored session ID
    session = response.json()["sessions"][0]
    return session

def request_with_session(url: str):
    """send a GET request with a FlareSolverr session"""
    session_id = retrieve_session()
    flaresolverr_url = "http://localhost:8191/v1"
    # JSON content type header
    r_headers = {"Content-Type": "application/json"}
    # request payload
    payload = {
        "cmd": "request.get",
        "session": session_id,
        "url": url,
        "maxTimeout": 60000  # milliseconds
    }
    # send the POST request using httpx (timeout in seconds)
    response = httpx.post(url=flaresolverr_url, headers=r_headers, json=payload, timeout=70)
    return response

response = request_with_session(url="https://nowsecure.nl/")
print(response.text)
Here, we define two functions. Let's break them down:
retrieve_session for retrieving a session ID from FlareSolverr stored sessions.
request_with_session for requesting a page URL while reusing a session ID. It's the same as the previous code except for declaring the session ID in the session parameter of the request payload.
Reusing sessions comes in handy while scaling web scrapers. We can send a request to bypass Cloudflare and save the session, then reuse the session ID for future requests. This noticeably shortens request execution time, as we don't have to bypass the challenge with each request.
The last feature we can use to manipulate FlareSolverr's sessions is deleting sessions using the sessions.destroy command:
import httpx

def delete_session(session_id: str):
    """destroy a FlareSolverr session"""
    flaresolverr_url = "http://localhost:8191/v1"
    # JSON content type header
    r_headers = {"Content-Type": "application/json"}
    # request payload
    payload = {
        "cmd": "sessions.destroy",
        "session": session_id
    }
    # send the POST request using httpx
    response = httpx.post(url=flaresolverr_url, headers=r_headers, json=payload)
    return response

response = delete_session(session_id="e3004b0a-bad5-11ee-aec0-0242ac150002")
print(response.text)
'{"status": "ok", "message": "The session has been removed.", "startTimestamp": 1706118541437, "endTimestamp": 1706118541558, "version": "3.3.13"}'
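The create, reuse and destroy steps can be tied together so that a session is always cleaned up. The following is a sketch under stated assumptions: `flaresolverr_session` is a hypothetical helper, the `post` callable stands in for whatever function sends a payload to the FlareSolverr endpoint and returns the decoded JSON, and the session ID is assumed to be returned in the `session` field of the `sessions.create` response. A fake `post` is used here so the example runs without a server.

```python
from contextlib import contextmanager
from typing import Callable, Iterator

PostFn = Callable[[dict], dict]


@contextmanager
def flaresolverr_session(post: PostFn) -> Iterator[str]:
    """Create a session, hand out its ID, and destroy it afterwards."""
    created = post({"cmd": "sessions.create"})
    session_id = created["session"]
    try:
        yield session_id
    finally:
        # runs even if a request inside the with-block raises
        post({"cmd": "sessions.destroy", "session": session_id})


# fake transport that records commands instead of calling a real server
calls = []

def fake_post(payload: dict) -> dict:
    calls.append(payload["cmd"])
    if payload["cmd"] == "sessions.create":
        return {"status": "ok", "session": "demo-session"}
    return {"status": "ok"}


with flaresolverr_session(fake_post) as sid:
    fake_post({"cmd": "request.get", "session": sid,
               "url": "https://nowsecure.nl/", "maxTimeout": 60000})

print(calls)  # ['sessions.create', 'request.get', 'sessions.destroy']
```

Against a real server, `post` could be something like `lambda p: httpx.post("http://localhost:8191/v1", json=p, timeout=70).json()`.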
Sending POST Requests
Sending POST requests with FlareSolverr is the same as sending GET requests. All we have to do is change the command to request.post and pass the request payload to the postData parameter as a URL-encoded string:
import httpx

def send_post_request(url: str, request_payload: str):
    """send a POST request using FlareSolverr"""
    flaresolverr_url = "http://localhost:8191/v1"
    # JSON content type header
    r_headers = {"Content-Type": "application/json"}
    # request payload
    payload = {
        "url": url,
        "maxTimeout": 60000,  # milliseconds
        "cmd": "request.post",
        "postData": request_payload
    }
    # send the POST request using httpx
    # httpx defaults to a 5 second timeout, so raise it above maxTimeout
    response = httpx.post(url=flaresolverr_url, headers=r_headers, json=payload, timeout=70)
    return response

response = send_post_request(url="https://httpbin.dev/anything", request_payload="key1=value1&key2=value2")
print(response.text)
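Rather than writing the encoded string by hand, it can be produced from a dictionary with the standard library. This is a small sketch; `encode_post_data` is a hypothetical helper name.

```python
from urllib.parse import urlencode


def encode_post_data(fields: dict) -> str:
    """Encode form fields as an application/x-www-form-urlencoded string."""
    return urlencode(fields)


post_data = encode_post_data({"key1": "value1", "key2": "value2"})
print(post_data)  # key1=value1&key2=value2
```

The resulting string can be passed as the `request_payload` argument of `send_post_request()`. Special characters are escaped automatically, which avoids malformed payloads when values contain spaces or ampersands.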
Adding Proxies
Proxies in FlareSolverr can be added for all commands through the proxy parameter:
import httpx

def send_get_request(url: str):
    """send a GET request with FlareSolverr"""
    flaresolverr_url = "http://localhost:8191/v1"
    # JSON content type header
    r_headers = {"Content-Type": "application/json"}
    # request payload
    payload = {
        "cmd": "request.get",
        "url": url,
        "maxTimeout": 60000,  # milliseconds
        "proxy": {"url": "proxy_url", "username": "proxy_username", "password": "proxy_password"}
    }
    # send the POST request using httpx
    # httpx timeouts are in seconds, so keep this slightly above maxTimeout
    response = httpx.post(url=flaresolverr_url, headers=r_headers, json=payload, timeout=70)
    return response

response = send_get_request(url="https://nowsecure.nl")
print(response.text)
Using proxies in FlareSolverr for web scraping allows us to distribute our requests' traffic across different IP addresses. This will make it harder for Cloudflare to track the IP address origin, preventing IP address throttling and blocking. Refer to our previous article on IP address blocking for more details.
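To spread requests over several IP addresses, the proxy can be rotated per request. The sketch below builds one payload per URL and cycles through a proxy pool in round-robin order. The helper name and the proxy URLs are placeholders, not real endpoints.

```python
from itertools import cycle
from typing import Iterable, List


def payloads_with_rotating_proxies(
    urls: Iterable[str], proxy_urls: Iterable[str], max_timeout_ms: int = 60000
) -> List[dict]:
    """Build request.get payloads, assigning proxies in round-robin order."""
    pool = list(proxy_urls)
    if not pool:
        raise ValueError("at least one proxy URL is required")
    proxies = cycle(pool)
    return [
        {
            "cmd": "request.get",
            "url": url,
            "maxTimeout": max_timeout_ms,
            "proxy": {"url": next(proxies)},
        }
        for url in urls
    ]


# placeholder proxy addresses for illustration only
demo_payloads = payloads_with_rotating_proxies(
    ["https://nowsecure.nl/1", "https://nowsecure.nl/2", "https://nowsecure.nl/3"],
    ["http://proxy-a.example:8080", "http://proxy-b.example:8080"],
)
for p in demo_payloads:
    print(p["url"], "->", p["proxy"]["url"])
```

Each payload can then be posted to the FlareSolverr endpoint as in the previous examples. Keep in mind that cookies obtained through one proxy may not stay valid when a later request leaves from a different IP address.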
FlareSolverr Limitations
We have successfully bypassed Cloudflare with FlareSolverr. However, it can fail to bypass Cloudflare with highly protected websites. For example, let's try to scrape a page on Zoominfo:
import httpx

def send_get_request(url: str):
    """send a GET request with FlareSolverr"""
    flaresolverr_url = "http://localhost:8191/v1"
    # JSON content type header
    r_headers = {"Content-Type": "application/json"}
    # request payload
    payload = {
        "url": url,
        "maxTimeout": 60000,  # milliseconds
        "cmd": "request.get"
    }
    # send the POST request using httpx
    # httpx timeouts are in seconds, so keep this slightly above maxTimeout
    response = httpx.post(url=flaresolverr_url, headers=r_headers, json=payload, timeout=70)
    return response

response = send_get_request(url="https://www.zoominfo.com/c/tesla-inc/104333869")
print(response.text)
Unfortunately, FlareSolverr couldn't bypass the Cloudflare challenge and timed out:
{
    "status": "error",
    "message": "Error: Error solving the challenge. Timeout after 60.0 seconds.",
    "startTimestamp": 1706175412072,
    "endTimestamp": 1706175472889,
    "version": "3.3.13"
}
Let's take a look at a better alternative for getting around Cloudflare!
ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale. Each product is equipped with an automatic bypass for any anti-bot system and we achieve this by:
Maintaining a fleet of real, reinforced web browsers with real fingerprint profiles.
Millions of self-healing proxies of the highest possible trust score.
Constantly evolving and adapting to new anti-bot systems.
We've been doing this publicly since 2020 with the best bypass on the market!
Here is how we can bypass Cloudflare protection on the previously failed example. All we have to do is replace our HTTP client with the ScrapFly client and enable the asp argument:
# standard web scraping code
import httpx
from parsel import Selector

response = httpx.get("some target website URL")
selector = Selector(response.text)

# in ScrapFly becomes this 👇
from scrapfly import ScrapeConfig, ScrapflyClient

# replaces your HTTP client (httpx in this case)
scrapfly = ScrapflyClient(key="Your ScrapFly API key")

response = scrapfly.scrape(ScrapeConfig(
    url="https://www.zoominfo.com/c/tesla-inc/104333869",
    asp=True,  # enable the anti scraping protection to bypass Cloudflare
    country="US",  # set the proxy location to a specific country
    render_js=True  # enable rendering JavaScript (like headless browsers) to scrape dynamic content if needed
))
print(response.status_code)
# 200

# use the built in Parsel selector
selector = response.selector
# access the HTML content
html = response.scrape_result['content']
To wrap up this guide, let's have a look at frequently asked questions and common errors about bypassing Cloudflare with FlareSolverr.
Can FlareSolverr bypass Cloudflare?
Yes, you can use FlareSolverr to get around Cloudflare protection. However, FlareSolverr is limited to specific Cloudflare versions, and it can fail with highly protected websites.
The cookies provided by FlareSolverr are not valid.
This is a common issue encountered when the request or consumer doesn't use the same IP address as the one used by FlareSolverr. To resolve this issue, you can disable IPv6 for the FlareSolverr and consumer Docker containers. You can also disable the VPNs or proxies if used, as they can cause networking conflicts.
Error solving the challenge. Timeout after X seconds.
This error suggests a failure in bypassing Cloudflare. This might be due to an unsolvable challenge or a short timeout window in the requests. To resolve this error, you can try increasing the FlareSolverr timeout through the maxTimeout parameter.
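One way to apply this advice is to retry a failed request with a larger `maxTimeout`. The sketch below is hypothetical: `request_with_growing_timeout` is not part of FlareSolverr, the `post` callable stands in for the function that sends a payload and returns the decoded JSON, and the timeout values are arbitrary. A fake `post` simulates a challenge that needs more than 60 seconds.

```python
from typing import Callable, Sequence


def request_with_growing_timeout(
    post: Callable[[dict], dict],
    url: str,
    timeouts_ms: Sequence[int] = (60000, 120000),
) -> dict:
    """Try the request with each timeout in turn until one succeeds."""
    last: dict = {}
    for max_timeout in timeouts_ms:
        last = post({"cmd": "request.get", "url": url, "maxTimeout": max_timeout})
        if last.get("status") == "ok":
            break
    # either the first successful response or the last error
    return last


# fake transport: only succeeds when given at least 120 seconds
attempts = []

def fake_post(payload: dict) -> dict:
    attempts.append(payload["maxTimeout"])
    if payload["maxTimeout"] >= 120000:
        return {"status": "ok"}
    return {"status": "error", "message": "Error: Error solving the challenge."}


result = request_with_growing_timeout(fake_post, "https://nowsecure.nl/")
print(result["status"], attempts)
```

A longer timeout only helps when the challenge is slow to clear. If the challenge is unsolvable, every attempt fails and the last error response is returned. When using httpx as the transport, its own timeout (in seconds) has to grow along with `maxTimeout`.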
Summary
In this article, we explored the FlareSolverr tool. It bypasses Cloudflare by requesting web pages through a Selenium-driven browser with the Undetected ChromeDriver configuration.
We went through a step-by-step guide on installing FlareSolverr using Docker. We also explained how to web scrape using FlareSolverr by managing sessions, adding proxies and sending GET and POST requests.