FlareSolverr Guide: Bypass Cloudflare While Scraping

Apr 11, 2024


Cloudflare is a popular anti-bot shield that blocks automated requests, such as those sent by web scrapers. It's used by many high-traffic websites, such as Glassdoor, Indeed and G2, so bypassing it opens the door to a wide range of web scraping opportunities.

In this article, we'll explore the FlareSolverr tool and how to use it to get around Cloudflare while scraping. We'll start by explaining what FlareSolverr is, how it works and how to install and use it. Let's get started!

Legal Disclaimer and Precautions

This tutorial covers popular web scraping techniques for education. Interacting with public servers requires diligence and respect and here's a good summary of what not to do:

Do not scrape at rates that could damage the website.
Do not scrape data that's not available publicly.
Do not store PII of EU citizens who are protected by GDPR.
Do not repurpose entire public datasets, which can be illegal in some countries.

Scrapfly does not offer legal advice, but these are good general rules to follow in web scraping. For more, you should consult a lawyer.

What is FlareSolverr?

FlareSolverr is an open-source proxy server for getting past Cloudflare anti-bot challenges. It bypasses Cloudflare and creates a session with headers and cookies that are reused to authorize future requests against the Cloudflare challenge.

FlareSolverr can be used with both GET and POST requests. It also supports integration with Prometheus for generating metrics and statistics about the bypass performance. FlareSolverr doesn't bypass Cloudflare by actively solving its challenges. Instead, it mimics a normal browser's configuration and lets the challenge resolve on its own. Let's have a closer look!

How FlareSolverr Works?

FlareSolverr is built on top of Selenium and Undetected ChromeDriver, which applies different techniques to bypass Cloudflare, such as changing Selenium variable names and adding randomized delays and mouse movements.

When a request is sent to the FlareSolverr server, it spins up a headless Selenium browser with the Undetected ChromeDriver configuration and requests the page URL. Then, it waits for the Cloudflare challenge to be solved automatically or to time out. Finally, it preserves the session values of successful requests and reuses them for future requests.

That being said, FlareSolverr can't bypass Cloudflare challenges with explicit CAPTCHAs that require clicks.

How to bypass Cloudflare when web scraping in 2024

Learn how to detect Cloudflare blocking, how it identifies web scrapers and tips for bypassing it while scraping.

How to Install FlareSolverr?

FlareSolverr can be installed using source code or executable binaries. However, the most stable method is using Docker. If you don't have Docker installed, you can follow the official Docker installation page.

Create a docker-compose.yml file and add the following code:

version: "2.1"
services:
  flaresolverr:
    # DockerHub mirror flaresolverr/flaresolverr:latest
    image: ghcr.io/flaresolverr/flaresolverr:latest
    container_name: flaresolverr
    environment:
      - LOG_LEVEL=${LOG_LEVEL:-info}
      - LOG_HTML=${LOG_HTML:-false}
      - TZ=Europe/London
    ports:
      - "${PORT:-8191}:8191"
    restart: unless-stopped

Here, we add the basic FlareSolverr configuration found on the official GitHub repository. Let's break down the parameters used:

Parameter	Description
`LOG_LEVEL`	The logging verbosity, setting it to `debug` will include more logging details.
`LOG_HTML`	Logs the HTML response of each request in the console.
`TZ`	Configures the headless browser timezone.
`PORT`	The host port where the server will be exposed. The default is 8191.

FlareSolverr also includes additional configuration parameters.

Parameter	Description
`LANG`	Changes the web browser language. It comes in handy when changing the web scraping language.
`HEADLESS`	Controls whether the browser runs in headless mode (without a GUI) or in headful mode.
`BROWSER_TIMEOUT`	Configures the browser timeout. The default is 40 seconds, but it can be increased for slow internet connections.

For more details on FlareSolverr's parameters, refer to the official configuration docs.

Now that our configuration file is ready, let's spin up the FlareSolverr server:

docker-compose up --build

To verify your installation, head over to the FlareSolverr port at 127.0.0.1:8191 and you will get a response similar to this:

{"msg": "FlareSolverr is ready!", "version": "3.3.13", "userAgent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"}
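The same check can be scripted. Below is a minimal sketch that uses only the Python standard library. It assumes the server is reachable at the address shown above, and the helper names `parse_index_response` and `check_server` are our own, not part of FlareSolverr:

```python
import json
import urllib.request


def parse_index_response(body: str) -> dict:
    """Parse the JSON returned by the FlareSolverr index page."""
    data = json.loads(body)
    return {
        # the message observed in the response shown above
        "ready": data.get("msg") == "FlareSolverr is ready!",
        "version": data.get("version"),
    }


def check_server(base_url: str = "http://127.0.0.1:8191/") -> dict:
    """Fetch the index page and report whether the server is ready."""
    with urllib.request.urlopen(base_url, timeout=10) as resp:
        return parse_index_response(resp.read().decode("utf-8"))
```

Calling `check_server()` while the container is running should return a dictionary with `ready` set to `True`. If the container is still starting, the call raises a connection error instead.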

To web scrape using FlareSolverr, we need to interact with its server using HTTP requests. For this, we'll be using httpx. It can be installed using pip:

pip install httpx

How to Use FlareSolverr?

The main functionality of FlareSolverr is fairly straightforward. We route the requests to the FlareSolverr server, which executes them using Selenium and Undetected ChromeDriver. It also allows for sending POST requests, managing sessions and adding proxies. Let's start with simple GET requests.

Sending GET Requests

To send requests using FlareSolverr, we need to send a POST request to the FlareSolverr URL and pass the request instructions through the request body:

curl -L -X POST 'http://localhost:8191/v1' \
-H 'Content-Type: application/json' \
--data-raw '{
  "cmd": "request.get",
  "url":"http://www.google.com/",
  "maxTimeout": 60000
}'

The above request body is the minimal payload that FlareSolverr accepts. We specify the URL, request timeout and the request type, GET or POST.

Let's replicate the above request using httpx and observe the result:

import httpx

def send_get_request(url: str):
    """send a GET request with FlareSolverr"""
    flaresolverr_url = "http://localhost:8191/v1"
    # declare the JSON content type header
    r_headers = {"Content-Type": "application/json"}
    # request payload (maxTimeout is in milliseconds)
    payload = {
        "cmd": "request.get",
        "url": url,
        "maxTimeout": 60000
    }
    # send the POST request using httpx; its timeout is in seconds, so keep it just above maxTimeout
    response = httpx.post(url=flaresolverr_url, headers=r_headers, json=payload, timeout=70)
    return response

response = send_get_request(url="https://google.com")
print(response.text)

We define a send_get_request() function. It requests the FlareSolverr URL with the target website URL and the request payload. Here is the result we got:

{
    "status": "ok",
    "message": "Challenge not detected!",
    "solution": {
        "url": "https://www.google.com/",
        "status": 200,
        "cookies": [
            {
                "domain": ".google.com",
                "expiry": 1721920212,
                "httpOnly": true,
                "name": "NID",
                "path": "/",
                "sameSite": "None",
                "secure": true,
                "value": "511=k58ibgnzvwsZ5YdKvKFbBipVVUYc0XLGbFrNiu_nNTk3dsUR24-xZ6H3XmGiP-1dWXH15MyynGY-z1CIt3HddjzwC5YD5ZQb8g5eU9CwQp993tUypxby2VSGPTEXjG-fnlpvi199oEu0AH7kR_FbWRHECbzEdT6xc8Zu4gHiinjb6zNWkQ31vnU"
            },
            ....
        ],
        "userAgent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        "headers": {},
        "response": "<html> .... </html>"
    },
    "startTimestamp": 1706109011335,
    "endTimestamp": 1706109014793,
    "version": "3.3.13"
}

From the response, we can see all the cookie values assigned to the request. The headers field is also returned, but it's empty since no headers were set. Finally, we got the page HTML in the response field.
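Since the payload is plain JSON, picking out these fields takes only a few lines. The following is a minimal sketch based on the response structure shown above. The `FlareSolverrError` class and the `extract_solution` helper are hypothetical names introduced for illustration:

```python
import json


class FlareSolverrError(RuntimeError):
    """Raised when FlareSolverr reports a non-ok status (hypothetical helper)."""


def extract_solution(body: str) -> dict:
    """Pull the most useful fields out of a FlareSolverr response body."""
    data = json.loads(body)
    if data.get("status") != "ok":
        # surface the server-side message instead of failing later on a missing key
        raise FlareSolverrError(data.get("message", "unknown error"))
    solution = data["solution"]
    return {
        "message": data.get("message"),
        "final_url": solution["url"],
        "http_status": solution["status"],
        "cookie_names": [cookie["name"] for cookie in solution["cookies"]],
        "html": solution["response"],
    }
```

With the earlier example, this would be called as `extract_solution(response.text)`.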

We have requested a Google page, which has no Cloudflare challenge. Let's put FlareSolverr into action by requesting nowsecure.nl, a simple page protected by the Cloudflare shield:

Cloudflare challenge on nowsecure.nl

Let's attempt to bypass this page's Cloudflare challenge using FlareSolverr:

import httpx

def send_get_request(url: str):
    """send a GET request with FlareSolverr"""
    flaresolverr_url = "http://localhost:8191/v1"
    # declare the JSON content type header
    r_headers = {"Content-Type": "application/json"}
    # request payload (maxTimeout is in milliseconds)
    payload = {
        "cmd": "request.get",
        "url": url,
        "maxTimeout": 60000
    }
    # send the POST request using httpx; its timeout is in seconds, so keep it just above maxTimeout
    response = httpx.post(url=flaresolverr_url, headers=r_headers, json=payload, timeout=70)
    return response

response = send_get_request(url="https://nowsecure.nl/")
print(response.text)

If we take a look at the response, we'll find the Cloudflare challenge bypassed!

{
    "status": "ok",
    "message": "Challenge solved!",
    "solution": {
        "url": "https://nowsecure.nl/",
        "status": 200,
        "cookies": [
            {
                "domain": ".nowsecure.nl",
                "expiry": 1737648300,
                "httpOnly": true,
                "name": "cf_clearance",
                "path": "/",
                "sameSite": "None",
                "secure": true,
                "value": "iDVfZ0_So4n_2d7w9q8RRBl8tUktOzdT9g9NL7JrUiM-1706112302-1-ASeuvc/28aIp0ZLlSbMmwDBW9A0rGbi/APO9w90KWdh1OI0QfsjmSr/gSVVjHb8NPL8VQKsxiO5xhLxb8o206Yw="
            }
        ],
        "userAgent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        "headers": {},
        "response": "<html> .... </html>"
    },
    "startTimestamp": 1706112295097,
    "endTimestamp": 1706112301786,
    "version": "3.3.13"
}

FlareSolverr has successfully bypassed the Cloudflare challenge and returned the session values. Let's have a look at how we can reuse this session.
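One option is to copy the returned values into a regular HTTP client. The sketch below assumes the response structure shown above. The function names are hypothetical, and sending the returned userAgent together with the cookies is our assumption about what the target site expects, not something the response itself guarantees:

```python
def session_values(result: dict) -> dict:
    """Turn a parsed FlareSolverr result into cookies and headers for an HTTP client."""
    solution = result["solution"]
    # flatten the cookie objects into simple name/value pairs
    cookies = {cookie["name"]: cookie["value"] for cookie in solution["cookies"]}
    # reuse the browser's user agent so follow-up requests look consistent
    headers = {"User-Agent": solution["userAgent"]}
    return {"cookies": cookies, "headers": headers}


def build_client(result: dict):
    """Create an httpx client that carries the solved session values."""
    import httpx  # imported lazily so session_values() works without httpx installed

    values = session_values(result)
    return httpx.Client(cookies=values["cookies"], headers=values["headers"])
```

A client built with `build_client(response.json())` can then request pages on the same domain directly. Such requests should leave from the same IP address that FlareSolverr used, otherwise the cookies may be rejected.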

Managing Sessions

We can grab the session values from FlareSolverr's responses and re-apply them with any HTTP client. However, FlareSolverr provides built-in commands for managing and reusing session values.

First, we have to create a session to store these values. This can be achieved using the sessions.create command:

import httpx

def send_get_request(url: str):
    """send a GET request with FlareSolverr and save the session"""
    flaresolverr_url = "http://localhost:8191/v1"
    # declare the JSON content type header
    r_headers = {"Content-Type": "application/json"}
    # request payload
    payload = {
        "cmd": "sessions.create",
        "url": url,
        "maxTimeout": 60000
    }
    # send the POST request using httpx; its timeout is in seconds
    response = httpx.post(url=flaresolverr_url, headers=r_headers, json=payload, timeout=70)
    return response

for i in range(3):
    # create 3 session values
    response = send_get_request(url="https://nowsecure.nl/")
    print(response.text)

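In the above code, we change the FlareSolverr command to sessions.create, which starts a persistent browser instance and returns its session ID. The url value in the payload isn't needed for this command and can be omitted. Then, we call the function three times to create three different sessions. The response looks like this: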

{
    "status": "ok",
    "message": "Session created successfully.",
    "session": "e403c98c-bad5-11ee-8830-0242ac150002",
    "startTimestamp": 1706113831129,
    "endTimestamp": 1706113831774,
    "version": "3.3.13"
}

The next step is retrieving the stored sessions. This can be done using the sessions.list command:

import httpx

def retrieve_sessions():
    """retrieve FlareSolverr sessions"""
    flaresolverr_url = "http://localhost:8191/v1"
    # declare the JSON content type header
    r_headers = {"Content-Type": "application/json"}
    # request payload
    payload = {
        "cmd": "sessions.list"
    }
    # send the POST request using httpx; its timeout is in seconds
    response = httpx.post(url=flaresolverr_url, headers=r_headers, json=payload, timeout=70)
    return response

response = retrieve_sessions()
print(response.text)

The response contains a list of the saved session IDs:

{
    "status": "ok",
    "message": "",
    "sessions": [
        "e3004b0a-bad5-11ee-aec0-0242ac150002",
        "e37000e4-bad5-11ee-b73b-0242ac150002",
        "e403c98c-bad5-11ee-8830-0242ac150002"
    ],
    "startTimestamp": 1706114460036,
    "endTimestamp": 1706114460036,
    "version": "3.3.13"
}

Now let's reuse one of these sessions by declaring the session ID in the request payload:

import httpx
import json

def retrieve_session():
    """retrieve FlareSolverr session"""
    flaresolverr_url = "http://localhost:8191/v1"
    # declare the JSON content type header
    r_headers = {"Content-Type": "application/json"}
    # request payload
    payload = {
        "cmd": "sessions.list"
    }
    # send the POST request using httpx
    response = httpx.post(url=flaresolverr_url, headers=r_headers, json=payload)
    session = json.loads(response.text)["sessions"][0]
    return session


def request_with_session(url: str):
    """send a GET request with a FlareSolverr session"""
    session_id = retrieve_session()
    flaresolverr_url = "http://localhost:8191/v1"
    # declare the JSON content type header
    r_headers = {"Content-Type": "application/json"}
    # request payload (maxTimeout is in milliseconds)
    payload = {
        "cmd": "request.get",
        "session": session_id,
        "url": url,
        "maxTimeout": 60000
    }
    # send the POST request using httpx; its timeout is in seconds, so keep it just above maxTimeout
    response = httpx.post(url=flaresolverr_url, headers=r_headers, json=payload, timeout=70)
    return response

response = request_with_session(url="https://nowsecure.nl/")
print(response.text)

Here, we define two functions. Let's break them down:

retrieve_session retrieves the first session ID from FlareSolverr's stored sessions.
request_with_session requests a page URL while reusing that session ID. It's the same as the previous request code, except that it declares the session ID in the session parameter of the request payload.

Reusing sessions comes in handy when scaling web scrapers. We can send one request to bypass Cloudflare and save the session, then reuse the session ID for future requests. This makes the requests notably faster, as we don't have to bypass the challenge with each request.

The last feature we can use to manipulate FlareSolverr's sessions is deleting sessions using the sessions.destroy command:

import httpx

def delete_session(session_id: str):
    """destroy a FlareSolverr session"""
    flaresolverr_url = "http://localhost:8191/v1"
    # declare the JSON content type header
    r_headers = {"Content-Type": "application/json"}
    # request payload
    payload = {
        "cmd": "sessions.destroy",
        "session": session_id
    }
    # send the POST request using httpx
    response = httpx.post(url=flaresolverr_url, headers=r_headers, json=payload)
    return response

response = delete_session(session_id="e3004b0a-bad5-11ee-aec0-0242ac150002")
print(response.text)
'{"status": "ok", "message": "The session has been removed.", "startTimestamp": 1706118541437, "endTimestamp": 1706118541558, "version": "3.3.13"}'
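Putting the session commands together, a typical lifecycle is to create one session, reuse it for several pages and then destroy it. The sketch below is one way to structure that, not an official pattern. The `send` argument is a hypothetical callable that posts a payload to FlareSolverr and returns the parsed JSON, which keeps the flow easy to test without a running server:

```python
from typing import Callable, Iterable

FLARESOLVERR_URL = "http://localhost:8191/v1"


def scrape_with_session(send: Callable[[dict], dict], urls: Iterable[str]) -> list:
    """Create one session, reuse it for every URL, then destroy it."""
    session_id = send({"cmd": "sessions.create"})["session"]
    results = []
    try:
        for url in urls:
            results.append(send({
                "cmd": "request.get",
                "session": session_id,
                "url": url,
                "maxTimeout": 60000,
            }))
    finally:
        # always release the browser instance, even if a request fails
        send({"cmd": "sessions.destroy", "session": session_id})
    return results


def httpx_send(payload: dict) -> dict:
    """Example transport: POST the payload to FlareSolverr with httpx."""
    import httpx  # imported lazily so the flow above can be tested without it

    response = httpx.post(FLARESOLVERR_URL, json=payload, timeout=70)
    return response.json()
```

For example, `scrape_with_session(httpx_send, ["https://nowsecure.nl/"])` would run the whole flow against the local server. The `finally` block matters because each session keeps a browser instance alive until it's destroyed.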

Sending POST Requests

Sending POST requests with FlareSolverr is similar to sending GET requests. All we have to do is change the command to request.post and pass the request body in the postData parameter as a URL-encoded string:

import httpx

def send_post_request(url: str, request_payload: str):
    """send a POST request using FlareSolverr"""
    flaresolverr_url = "http://localhost:8191/v1"
    # declare the JSON content type header
    r_headers = {"Content-Type": "application/json"}
    # request payload (postData is a URL-encoded string)
    payload = {
        "url": url,
        "maxTimeout": 60000,
        "cmd": "request.post",
        "postData": request_payload
    }
    # send the POST request using httpx; its timeout is in seconds, so keep it just above maxTimeout
    response = httpx.post(url=flaresolverr_url, headers=r_headers, json=payload, timeout=70)
    return response

response = send_post_request(url="https://httpbin.dev/anything", request_payload="key1=value1&key2=value2")
print(response.text)
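Writing the encoded string by hand gets error-prone once values contain spaces or special characters. As a small sketch, the standard library's `urlencode` can build the postData string from a dictionary. The helper name `build_post_payload` and the example field values are ours:

```python
from urllib.parse import urlencode


def build_post_payload(url: str, fields: dict, max_timeout_ms: int = 60000) -> dict:
    """Build a request.post payload with URL-encoded form data."""
    return {
        "cmd": "request.post",
        "url": url,
        "maxTimeout": max_timeout_ms,
        # urlencode escapes spaces and reserved characters in each value
        "postData": urlencode(fields),
    }


payload = build_post_payload(
    "https://httpbin.dev/anything",
    {"key1": "value1", "q": "a b&c"},
)
print(payload["postData"])
# key1=value1&q=a+b%26c
```

The resulting dictionary can be sent to FlareSolverr the same way as in the function above.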

Adding Proxies

Proxies in FlareSolverr can be added to requests through the proxy parameter:

import httpx

def send_get_request(url: str):
    """send a GET request with FlareSolverr"""
    flaresolverr_url = "http://localhost:8191/v1"
    # declare the JSON content type header
    r_headers = {"Content-Type": "application/json"}
    # request payload (maxTimeout is in milliseconds)
    payload = {
        "cmd": "request.get",
        "url": url,
        "maxTimeout": 60000,
        "proxy": {"url": "proxy_url", "username": "proxy_username", "password": "proxy_password"}
    }
    # send the POST request using httpx; its timeout is in seconds, so keep it just above maxTimeout
    response = httpx.post(url=flaresolverr_url, headers=r_headers, json=payload, timeout=70)
    return response

response = send_get_request(url="https://nowsecure.nl")
print(response.text)

Using proxies in FlareSolverr for web scraping allows us to distribute our requests' traffic across different IP addresses. This will make it harder for Cloudflare to track the IP address origin, preventing IP address throttling and blocking. Refer to our previous article on IP address blocking for more details.
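To illustrate the idea of spreading traffic, here is a minimal round-robin sketch that assigns a different proxy to each request payload. The proxy addresses are placeholders and the function name is hypothetical. Real pools usually also need health checks and retries, which are left out here:

```python
from itertools import cycle

# placeholder proxy entries, replace them with real proxy URLs
PROXIES = [
    {"url": "http://proxy1.example.com:8000"},
    {"url": "http://proxy2.example.com:8000"},
]


def payloads_with_rotating_proxies(urls, proxies=PROXIES):
    """Yield one request.get payload per URL, cycling through the proxies."""
    proxy_pool = cycle(proxies)
    for url in urls:
        yield {
            "cmd": "request.get",
            "url": url,
            "maxTimeout": 60000,
            "proxy": next(proxy_pool),
        }
```

Each yielded payload can be posted to the FlareSolverr URL exactly like the single-proxy payload above.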

How to Avoid Web Scraper IP Blocking?

Learn what Internet Protocol addresses are and how IP tracking technologies are used to block web scrapers.

FlareSolverr Limitations

We have successfully bypassed Cloudflare with FlareSolverr. However, it can fail to bypass Cloudflare with highly protected websites. For example, let's try to scrape a page on Zoominfo:

import httpx

def send_get_request(url: str):
    """send a GET request with FlareSolverr"""
    flaresolverr_url = "http://localhost:8191/v1"
    # declare the JSON content type header
    r_headers = {"Content-Type": "application/json"}
    # request payload (maxTimeout is in milliseconds)
    payload = {
        "url": url,
        "maxTimeout": 60000,
        "cmd": "request.get"
    }
    # send the POST request using httpx; its timeout is in seconds, so keep it just above maxTimeout
    response = httpx.post(url=flaresolverr_url, headers=r_headers, json=payload, timeout=70)
    return response

response = send_get_request(url="https://www.zoominfo.com/c/tesla-inc/104333869")
print(response.text)

Unfortunately, FlareSolverr couldn't bypass the Cloudflare challenge and timed out:

{
    "status": "error",
    "message": "Error: Error solving the challenge. Timeout after 60.0 seconds.",
    "startTimestamp": 1706175412072,
    "endTimestamp": 1706175472889,
    "version": "3.3.13"
}

Let's take a look at a better alternative for getting around Cloudflare!

ScrapFly: FlareSolverr Alternative

ScrapFly is a web scraping API that provides an anti-scraping protection bypass to avoid any website blocking. It also allows for scraping at scale by providing:

Residential proxies in over 50 countries - For scraping from any geographical location while also avoiding IP address throttling and blocking.
JavaScript rendering - For scraping JavaScript-loaded content using cloud headless browsers.
Straightforward Python and Typescript SDKs.
And much more!

ScrapFly service does the heavy lifting for you

Here is how we can bypass Cloudflare protection on the previously failed example. All we have to do is replace our HTTP client with the ScrapFly client and enable the asp argument:

# standard web scraping code
import httpx
from parsel import Selector

response = httpx.get("some target website URL")
selector = Selector(response.text)

# in ScrapFly becomes this 👇
from scrapfly import ScrapeConfig, ScrapflyClient

# replaces your HTTP client (httpx in this case)
scrapfly = ScrapflyClient(key="Your ScrapFly API key")

response = scrapfly.scrape(ScrapeConfig(
    url="https://www.zoominfo.com/c/tesla-inc/104333869",
    asp=True, # enable the anti scraping protection to bypass Cloudflare
    country="US", # set the proxy location to a specific country
    render_js=True # enable rendering JavaScript (like headless browsers) to scrape dynamic content if needed
))

print(response.status_code)
"200"
# use the built in Parsel selector
selector = response.selector
# access the HTML content
html = response.scrape_result['content']


FAQ

To wrap up this guide, let's have a look at frequently asked questions and common errors about bypassing Cloudflare with FlareSolverr.

Can FlareSolverr bypass Cloudflare?

Yes, you can use FlareSolverr to get around Cloudflare protection. However, FlareSolverr is limited to specific Cloudflare versions, and it can fail with highly protected websites.

The cookies provided by FlareSolverr are not valid.

This is a common issue encountered when the request or consumer doesn't use the same IP address as the one used by FlareSolverr. To resolve this issue, you can disable IPv6 for the FlareSolverr and consumer Docker containers. You can also disable the VPNs or proxies if used, as they can cause networking conflicts.

Error solving the challenge. Timeout after X seconds.

This error suggests a failure in bypassing Cloudflare. It might be due to an unsolvable challenge or a timeout window that is too short. To resolve this error, you can try increasing the maxTimeout value of the request.
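When raising the timeout, keep in mind that maxTimeout is expressed in milliseconds, while the httpx timeout is expressed in seconds and should stay slightly above it so the client doesn't give up first. Here is a small sketch of keeping the two in sync. The helper name and the 10-second margin are arbitrary choices of ours:

```python
def build_request(url: str, max_timeout_s: float = 120, margin_s: float = 10) -> dict:
    """Return a FlareSolverr payload plus a matching client-side timeout."""
    payload = {
        "cmd": "request.get",
        "url": url,
        # FlareSolverr expects milliseconds
        "maxTimeout": int(max_timeout_s * 1000),
    }
    # the HTTP client timeout is in seconds and must outlast maxTimeout
    return {"payload": payload, "client_timeout": max_timeout_s + margin_s}


request = build_request("https://nowsecure.nl/", max_timeout_s=120)
# usage with httpx:
# httpx.post("http://localhost:8191/v1", json=request["payload"], timeout=request["client_timeout"])
```

If the challenge still times out with a generous window, it is likely one that FlareSolverr can't pass, and a longer timeout won't help.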

Summary

In this article, we explored the FlareSolverr tool. It bypasses Cloudflare by requesting web pages through a Selenium-driven browser with the Undetected ChromeDriver configuration.

We went through a step-by-step guide on installing FlareSolverr using Docker. We also explained how to web scrape using FlareSolverr by managing sessions, adding proxies and sending GET and POST requests.
