FlareSolverr Guide: Bypass Cloudflare While Scraping

Cloudflare is a popular antibot shield that blocks automated requests such as web scrapers. It's used across various global websites like Glassdoor, Indeed and G2. So, bypassing Cloudflare opens the door for a wide set of web scraping opportunities.

In this article, we'll explore the FlareSolverr tool and how to use it to get around Cloudflare while scraping. We'll start by explaining what FlareSolverr is, how it works and how to install and use it. Let's get started!

What is FlareSolverr?

FlareSolverr is an open-source proxy server for solving Cloudflare anti-bot challenges. It bypasses Cloudflare and creates a session with headers and cookies that are reused to authorize future requests against the Cloudflare challenge.

FlareSolverr can be used with both GET and POST requests. It also supports integration with Prometheus for generating metrics and statistics about the bypass performance. FlareSolverr doesn't bypass Cloudflare by actively solving its challenges. Instead, it mimics the configuration of a normal browser. Let's have a closer look!

How Does FlareSolverr Work?

FlareSolverr is built on top of Selenium and Undetected ChromeDriver, which applies different techniques to bypass Cloudflare, such as changing Selenium variable names and adding randomized delays and mouse movements.

When a request is sent to the FlareSolverr server, it spins up a headless Selenium browser with the Undetected ChromeDriver configuration and requests the page URL. Then, it waits for the Cloudflare challenge to be solved automatically or to time out. Finally, it preserves the session values of the successful requests and reuses them for future requests.

That being said, FlareSolverr can't bypass Cloudflare challenges with explicit CAPTCHAs that require clicks.

How to Install FlareSolverr?

FlareSolverr can be installed using source code or executable binaries. However, the most stable method is using Docker. If you don't have Docker installed, you can follow the official Docker installation page.

Create a docker-compose.yml file and add the following code:

version: "2.1"
services:
  flaresolverr:
    # DockerHub mirror flaresolverr/flaresolverr:latest
    image: ghcr.io/flaresolverr/flaresolverr:latest
    container_name: flaresolverr
    environment:
      - LOG_LEVEL=${LOG_LEVEL:-info}
      - LOG_HTML=${LOG_HTML:-false}
      - TZ=Europe/London
    ports:
      - "${PORT:-8191}:8191"
    restart: unless-stopped

Here, we add the basic FlareSolverr configuration found on the official GitHub repository. Let's break down the parameters used:

Parameter   Description
LOG_LEVEL   The logging verbosity. Setting it to debug includes more logging details.
LOG_HTML    Logs the HTML response of each request to the console.
TZ          Configures the headless browser's timezone.
PORT        The host port on which the server is exposed (8191 by default).

FlareSolverr also supports additional configuration parameters:

Parameter         Description
LANG              Changes the web browser language, which is useful when scraping localized content.
HEADLESS          Controls whether the browser runs in headless mode (without a GUI) or headful mode.
BROWSER_TIMEOUT   Configures the browser timeout. The default is 40 seconds, but it can be increased for slow internet connections.

For more details on FlareSolverr's parameters, refer to the official configuration docs.

Now that our configuration file is ready, let's spin up the FlareSolverr server:

docker-compose up --build

To verify your installation, open the FlareSolverr address at http://127.0.0.1:8191 and you will get a response similar to this:

{"msg": "FlareSolverr is ready!", "version": "3.3.13", "userAgent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"}
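This check can also be scripted. Below is a minimal sketch that uses only the Python standard library, so it works before any HTTP client is installed. The function names are our own, and the exact ready message is assumed to match the sample response above, which may differ between FlareSolverr versions:

```python
import json
import urllib.request


def parse_health(body: str) -> tuple[bool, str]:
    """Return (is_ready, version) from the JSON served at the FlareSolverr root URL."""
    data = json.loads(body)
    # assumption: the ready message matches the sample response shown above
    is_ready = data.get("msg") == "FlareSolverr is ready!"
    return is_ready, data.get("version", "unknown")


def check_flaresolverr(base_url: str = "http://127.0.0.1:8191") -> tuple[bool, str]:
    """Fetch the root URL and report whether the server is ready (network call)."""
    with urllib.request.urlopen(base_url, timeout=10) as resp:
        return parse_health(resp.read().decode("utf-8"))
```

Calling check_flaresolverr() while the container is running should return a tuple with the ready flag and the server version.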

To web scrape using FlareSolverr, we need to interact with its server using HTTP requests. For this, we'll be using httpx. It can be installed using pip:

pip install httpx

How to Use FlareSolverr?

The main functionality of FlareSolverr is fairly straightforward. We route the requests to the FlareSolverr server, which executes them using Selenium and the Undetected ChromeDriver. It also allows for sending POST requests, managing sessions and adding proxies. Let's start with simple GET requests.

Sending GET Requests

To send requests using FlareSolverr, we need to send a POST request to the FlareSolverr URL and pass the request instructions through the request body:

curl -L -X POST 'http://localhost:8191/v1' \
-H 'Content-Type: application/json' \
--data-raw '{
  "cmd": "request.get",
  "url":"http://www.google.com/",
  "maxTimeout": 60000
}'

The above request body is the minimal payload that FlareSolverr accepts. We specify the target URL, the request timeout in milliseconds (maxTimeout) and the request type, GET or POST (cmd).

Let's replicate the above request using httpx and observe the result:

import httpx

def send_get_request(url: str):
    """send a GET request with FlareSolverr"""
    flaresolverr_url = "http://localhost:8191/v1"
    # basic header content type header
    r_headers = {"Content-Type": "application/json"}
    # request payload
    payload = {
        "cmd": "request.get",
        "url": url,
        "maxTimeout": 60000
    }
    # send the POST request using httpx (httpx timeouts are in seconds, so allow a bit more than maxTimeout)
    response = httpx.post(url=flaresolverr_url, headers=r_headers, json=payload, timeout=70)
    return response

response = send_get_request(url="https://google.com")
print(response.text)

We define a send_get_request() function. It sends the request payload, which includes the target website URL, to the FlareSolverr URL. Here is the result we got:

{
    "status": "ok",
    "message": "Challenge not detected!",
    "solution": {
        "url": "https://www.google.com/",
        "status": 200,
        "cookies": [
            {
                "domain": ".google.com",
                "expiry": 1721920212,
                "httpOnly": true,
                "name": "NID",
                "path": "/",
                "sameSite": "None",
                "secure": true,
                "value": "511=k58ibgnzvwsZ5YdKvKFbBipVVUYc0XLGbFrNiu_nNTk3dsUR24-xZ6H3XmGiP-1dWXH15MyynGY-z1CIt3HddjzwC5YD5ZQb8g5eU9CwQp993tUypxby2VSGPTEXjG-fnlpvi199oEu0AH7kR_FbWRHECbzEdT6xc8Zu4gHiinjb6zNWkQ31vnU"
            },
            ....
        ],
        "userAgent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        "headers": {},
        "response": "<html> .... </html>"
    },
    "startTimestamp": 1706109011335,
    "endTimestamp": 1706109014793,
    "version": "3.3.13"
}

From the response, we can see all the cookie values assigned to the request. The headers field is also returned, but it's empty since no headers were set. Finally, we got the page HTML.
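To work with this response in code, we can decode the JSON and pull out the fields we need. Here is a minimal sketch based on the response structure shown above. The parse_response name is our own, and it raises an error whenever the status field isn't "ok":

```python
def parse_response(data: dict) -> tuple[int, str]:
    """Return (HTTP status, page HTML) from a decoded FlareSolverr response."""
    if data.get("status") != "ok":
        # failures are reported through the "status" and "message" fields
        raise RuntimeError(f"FlareSolverr error: {data.get('message')}")
    solution = data["solution"]
    return solution["status"], solution["response"]
```

With the earlier function, this could be used as parse_response(response.json()).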

We have requested a Google page without a Cloudflare challenge. Let's put FlareSolverr into action by requesting nowsecure.nl. It is a simple page that is protected by the Cloudflare shield:

Figure: Cloudflare challenge on nowsecure.nl

Let's attempt to bypass this page's Cloudflare challenge using FlareSolverr:

import httpx

def send_get_request(url: str):
    """send a GET request with FlareSolverr"""
    flaresolverr_url = "http://localhost:8191/v1"
    # basic header content type header
    r_headers = {"Content-Type": "application/json"}
    # request payload
    payload = {
        "cmd": "request.get",
        "url": url,
        "maxTimeout": 60000
    }
    # send the POST request using httpx (httpx timeouts are in seconds, so allow a bit more than maxTimeout)
    response = httpx.post(url=flaresolverr_url, headers=r_headers, json=payload, timeout=70)
    return response

response = send_get_request(url="https://nowsecure.nl/")
print(response.text)

If we take a look at the response, we'll find the Cloudflare challenge bypassed!

{
    "status": "ok",
    "message": "Challenge solved!",
    "solution": {
        "url": "https://nowsecure.nl/",
        "status": 200,
        "cookies": [
            {
                "domain": ".nowsecure.nl",
                "expiry": 1737648300,
                "httpOnly": true,
                "name": "cf_clearance",
                "path": "/",
                "sameSite": "None",
                "secure": true,
                "value": "iDVfZ0_So4n_2d7w9q8RRBl8tUktOzdT9g9NL7JrUiM-1706112302-1-ASeuvc/28aIp0ZLlSbMmwDBW9A0rGbi/APO9w90KWdh1OI0QfsjmSr/gSVVjHb8NPL8VQKsxiO5xhLxb8o206Yw="
            }
        ],
        "userAgent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        "headers": {},
        "response": "<html> .... </html>"
    },
    "startTimestamp": 1706112295097,
    "endTimestamp": 1706112301786,
    "version": "3.3.13"
}

FlareSolverr has successfully bypassed the Cloudflare challenge and returned the session values. Let's have a look at how we can reuse this session.
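The returned cookies and user agent can also be carried over to a regular HTTP client. The sketch below is one way to do that, with a function name of our own. It assumes the follow-up requests come from the same IP address and send the same User-Agent as the FlareSolverr browser, since the clearance cookie is generally expected to be tied to both:

```python
def client_settings(solution: dict) -> dict:
    """Build keyword arguments for an HTTP client from a FlareSolverr solution."""
    # flatten the cookie objects into a simple name -> value mapping
    cookies = {c["name"]: c["value"] for c in solution["cookies"]}
    # reuse the browser's user agent so the cookies match the client identity
    headers = {"User-Agent": solution["userAgent"]}
    return {"cookies": cookies, "headers": headers}
```

With httpx, these settings could be applied as httpx.Client(**client_settings(data["solution"])), where data is the decoded FlareSolverr response.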

Managing Sessions

We can grab the session values from FlareSolverr's responses and re-apply them with any HTTP client. However, FlareSolverr provides built-in methods for managing and reusing session values.

First, we have to create a session. This can be achieved using the sessions.create command:

import httpx

def send_get_request(url: str):
    """send a GET request with FlareSolverr and save the session"""
    flaresolverr_url = "http://localhost:8191/v1"
    # basic header content type header
    r_headers = {"Content-Type": "application/json"}
    # request payload
    payload = {
        "cmd": "sessions.create",
        "url": url,
        "maxTimeout": 60000
    }
    # send the POST request using httpx (httpx timeouts are in seconds, so allow a bit more than maxTimeout)
    response = httpx.post(url=flaresolverr_url, headers=r_headers, json=payload, timeout=70)
    return response

for i in range(3):
    # create 3 session values
    response = send_get_request(url="https://nowsecure.nl/")
    print(response.text)

In the above code, we change the FlareSolverr command to sessions.create. This launches a new browser instance that keeps its cookies until the session is destroyed, and returns the session ID. Then, we call the function three times to create different sessions. The response would look like this:

{
    "status": "ok",
    "message": "Session created successfully.",
    "session": "e403c98c-bad5-11ee-8830-0242ac150002",
    "startTimestamp": 1706113831129,
    "endTimestamp": 1706113831774,
    "version": "3.3.13"
}

The next step is retrieving the stored sessions. This can be done using the sessions.list command:

import httpx

def retrieve_sessions():
    """retrieve FlareSolverr sessions"""
    flaresolverr_url = "http://localhost:8191/v1"
    # basic header content type header
    r_headers = {"Content-Type": "application/json"}
    # request payload
    payload = {
        "cmd": "sessions.list"
    }
    # send the POST request using httpx (the timeout is in seconds)
    response = httpx.post(url=flaresolverr_url, headers=r_headers, json=payload, timeout=70)
    return response

response = retrieve_sessions()
print(response.text)

The response contains a list of the saved session IDs:

{
    "status": "ok",
    "message": "",
    "sessions": [
        "e3004b0a-bad5-11ee-aec0-0242ac150002",
        "e37000e4-bad5-11ee-b73b-0242ac150002",
        "e403c98c-bad5-11ee-8830-0242ac150002"
    ],
    "startTimestamp": 1706114460036,
    "endTimestamp": 1706114460036,
    "version": "3.3.13"
}

Now let's reuse one of these sessions by declaring the session ID in the request payload:

import httpx
import json

def retrieve_session():
    """retrieve FlareSolverr session"""
    flaresolverr_url = "http://localhost:8191/v1"
    # basic header content type header
    r_headers = {"Content-Type": "application/json"}
    # request payload
    payload = {
        "cmd": "sessions.list"
    }
    # send the POST request using httpx
    response = httpx.post(url=flaresolverr_url, headers=r_headers, json=payload)
    session = json.loads(response.text)["sessions"][0]
    return session


def request_with_session(url: str):
    """send a GET request with a FlareSolverr session"""
    session_id = retrieve_session()
    flaresolverr_url = "http://localhost:8191/v1"
    # basic header content type header
    r_headers = {"Content-Type": "application/json"}
    # request payload
    payload = {
        "cmd": "request.get",
        "session": session_id,
        "url": url,
        "maxTimeout": 60000
    }
    # send the POST request using httpx (httpx timeouts are in seconds, so allow a bit more than maxTimeout)
    response = httpx.post(url=flaresolverr_url, headers=r_headers, json=payload, timeout=70)
    return response

response = request_with_session(url="https://nowsecure.nl/")
print(response.text)

Here, we define two functions, let's break them down:

  • retrieve_session for retrieving a session ID from FlareSolverr stored sessions.
  • request_with_session for requesting a page URL while reusing a session ID. It's the same as the previous code except for declaring the session ID in the session parameter of the request payload.

Reusing sessions comes in handy while scaling web scrapers. We can send a request to bypass Cloudflare and save the session, then reuse the session ID for future requests. This notably speeds up the requests, as we don't have to bypass the challenge with each one.

The last feature we can use to manipulate FlareSolverr's sessions is deleting sessions using the sessions.destroy command:

import httpx

def delete_session(session_id: str):
    """destroy a FlareSolverr session"""
    flaresolverr_url = "http://localhost:8191/v1"
    # basic header content type header
    r_headers = {"Content-Type": "application/json"}
    # request payload
    payload = {
        "cmd": "sessions.destroy",
        "session": session_id
    }
    # send the POST request using httpx
    response = httpx.post(url=flaresolverr_url, headers=r_headers, json=payload)
    return response

response = delete_session(session_id="e3004b0a-bad5-11ee-aec0-0242ac150002")
print(response.text)
'{"status": "ok", "message": "The session has been removed.", "startTimestamp": 1706118541437, "endTimestamp": 1706118541558, "version": "3.3.13"}'
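The three session commands can be combined into a single workflow. The following is a minimal sketch under a few assumptions: post is a hypothetical helper that sends a payload to the FlareSolverr URL and returns the decoded JSON response, and scrape_with_session is our own name. The session is destroyed in a finally block so the browser instance is released even when a request fails:

```python
from typing import Callable


def scrape_with_session(post: Callable[[dict], dict], urls: list[str]) -> list[str]:
    """Create one session, fetch every URL through it, and always destroy it.

    `post` is a hypothetical helper that sends a payload to FlareSolverr
    and returns the decoded JSON response.
    """
    session_id = post({"cmd": "sessions.create"})["session"]
    pages = []
    try:
        for url in urls:
            result = post({
                "cmd": "request.get",
                "session": session_id,
                "url": url,
                "maxTimeout": 60000,
            })
            pages.append(result["solution"]["response"])
    finally:
        # release the browser instance even if a request fails
        post({"cmd": "sessions.destroy", "session": session_id})
    return pages
```

With httpx, the post helper could be as simple as lambda payload: httpx.post("http://localhost:8191/v1", json=payload, timeout=70).json().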

Sending POST Requests

Sending POST requests with FlareSolverr is the same as sending GET requests. All we have to do is change the command to request.post and pass the request body to the postData parameter as a URL-encoded string (application/x-www-form-urlencoded):

import httpx

def send_post_request(url: str, request_payload: str):
    """send a POST request using FlareSolverr"""
    flaresolverr_url = "http://localhost:8191/v1"
    # basic header content type header
    r_headers = {"Content-Type": "application/json"}
    # request payload
    payload = {
        "url": url,
        "maxTimeout": 60000,
        "cmd": "request.post",
        "postData": request_payload
    }
    # send the POST request using httpx (httpx timeouts are in seconds, so allow a bit more than maxTimeout)
    response = httpx.post(url=flaresolverr_url, headers=r_headers, json=payload, timeout=70)
    return response

response = send_post_request(url="https://httpbin.dev/anything", request_payload="key1=value1&key2=value2")
print(response.text)
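Instead of writing the postData string by hand, we can build it from a dictionary. Here is a small sketch using urlencode from the standard library, which also escapes special characters. The build_post_payload name and its parameters are our own:

```python
from urllib.parse import urlencode


def build_post_payload(url: str, form: dict, max_timeout_ms: int = 60000) -> dict:
    """Build a request.post payload with the form data URL-encoded as a string."""
    return {
        "cmd": "request.post",
        "url": url,
        "maxTimeout": max_timeout_ms,
        # postData must be a string such as "key1=value1&key2=value2"
        "postData": urlencode(form),
    }
```

The resulting dictionary can be sent to the FlareSolverr URL as the JSON body, the same way as in the function above.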

Adding Proxies

Proxies in FlareSolverr can be added for all commands through the proxy parameter:

import httpx

def send_get_request(url: str):
    """send a GET request with FlareSolverr"""
    flaresolverr_url = "http://localhost:8191/v1"
    # basic header content type header
    r_headers = {"Content-Type": "application/json"}
    # request payload
    payload = {
        "cmd": "request.get",
        "url": url,
        "maxTimeout": 60000,
        "proxy": {"url": "proxy_url", "username": "proxy_username", "password": "proxy_password"}
    }
    # send the POST request using httpx (httpx timeouts are in seconds, so allow a bit more than maxTimeout)
    response = httpx.post(url=flaresolverr_url, headers=r_headers, json=payload, timeout=70)
    return response

response = send_get_request(url="https://nowsecure.nl")
print(response.text)

Using proxies in FlareSolverr for web scraping allows us to distribute our requests' traffic across different IP addresses. This will make it harder for Cloudflare to track the IP address origin, preventing IP address throttling and blocking. Refer to our previous article on IP address blocking for more details.
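To spread requests across several proxies, we can rotate the proxy value between payloads. Below is a minimal round-robin sketch. The proxy URLs are placeholders and the rotating_payloads name is our own:

```python
from itertools import cycle
from typing import Iterator

# hypothetical proxy endpoints; replace them with real ones
PROXIES = [
    {"url": "http://proxy1.example.com:8080"},
    {"url": "http://proxy2.example.com:8080"},
]


def rotating_payloads(urls: list[str], proxies: list[dict]) -> Iterator[dict]:
    """Yield one request.get payload per URL, cycling through the proxies."""
    if not proxies:
        raise ValueError("at least one proxy is required")
    proxy_pool = cycle(proxies)
    for url in urls:
        yield {
            "cmd": "request.get",
            "url": url,
            "maxTimeout": 60000,
            "proxy": next(proxy_pool),
        }
```

Each generated payload can then be posted to the FlareSolverr URL as in the previous examples.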

FlareSolverr Limitations

We have successfully bypassed Cloudflare with FlareSolverr. However, it can fail to bypass Cloudflare with highly protected websites. For example, let's try to scrape a page on Zoominfo:

import httpx

def send_get_request(url: str):
    """send a GET request with FlareSolverr"""
    flaresolverr_url = "http://localhost:8191/v1"
    # basic header content type header
    r_headers = {"Content-Type": "application/json"}
    # request payload
    payload = {
        "url": url,
        "maxTimeout": 60000,
        "cmd": "request.get"
    }
    # send the POST request using httpx (httpx timeouts are in seconds, so allow a bit more than maxTimeout)
    response = httpx.post(url=flaresolverr_url, headers=r_headers, json=payload, timeout=70)
    return response

response = send_get_request(url="https://www.zoominfo.com/c/tesla-inc/104333869")
print(response.text)

Unfortunately, FlareSolverr couldn't bypass the Cloudflare challenge and timed out:

{
    "status": "error",
    "message": "Error: Error solving the challenge. Timeout after 60.0 seconds.",
    "startTimestamp": 1706175412072,
    "endTimestamp": 1706175472889,
    "version": "3.3.13"
}

Let's take a look at a better alternative for getting around Cloudflare!

ScrapFly: FlareSolverr Alternative

ScrapFly is a web scraping API that provides an anti-scraping protection bypass to avoid any website blocking.

ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale. Each product is equipped with an automatic bypass for any anti-bot system, and we achieve this by:

  • Maintaining a fleet of real, reinforced web browsers with real fingerprint profiles.
  • Using millions of self-healing proxies with the highest possible trust score.
  • Constantly evolving and adapting to new anti-bot systems.

We've been doing this publicly since 2020 with the best bypass on the market!

Here is how we can bypass Cloudflare protection on the previously failed example. All we have to do is replace our HTTP client with the ScrapFly client and enable the asp argument:

# standard web scraping code
import httpx
from parsel import Selector

response = httpx.get("some target website URL")
selector = Selector(response.text)

# in ScrapFly becomes this 👇
from scrapfly import ScrapeConfig, ScrapflyClient

# replaces your HTTP client (httpx in this case)
scrapfly = ScrapflyClient(key="Your ScrapFly API key")

response = scrapfly.scrape(ScrapeConfig(
    url="https://www.zoominfo.com/c/tesla-inc/104333869",
    asp=True, # enable the anti scraping protection to bypass Cloudflare
    country="US", # set the proxy location to a specific country
    render_js=True # enable rendering JavaScript (like headless browsers) to scrape dynamic content if needed
))

print(response.status_code)
"200"
# use the built in Parsel selector
selector = response.selector
# access the HTML content
html = response.scrape_result['content']

FAQ

To wrap up this guide, let's have a look at frequently asked questions and common errors about bypassing Cloudflare with FlareSolverr.

Can FlareSolverr bypass Cloudflare?

Yes, you can use FlareSolverr to get around Cloudflare protection. However, FlareSolverr is limited to specific Cloudflare versions, and it can fail with highly protected websites.

The cookies provided by FlareSolverr are not valid.

This is a common issue encountered when the request or consumer doesn't use the same IP address as the one used by FlareSolverr. To resolve this issue, you can disable IPv6 for the FlareSolverr and consumer Docker containers. You can also disable the VPNs or proxies if used, as they can cause networking conflicts.

Error solving the challenge. Timeout after X seconds.

This error suggests a failure in bypassing Cloudflare. This might be due to an unsolvable challenge or a short timeout window in the requests. To resolve this error, you can try increasing the FlareSolverr timeout.
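One way to apply this is to retry the request with a larger maxTimeout after each failure. The following is a minimal sketch: post is a hypothetical helper that sends the payload to FlareSolverr and returns the decoded JSON response, and the timeout values are arbitrary examples:

```python
from typing import Callable


def solve_with_retry(post: Callable[[dict], dict], url: str,
                     timeouts_ms: tuple[int, ...] = (60000, 120000)) -> dict:
    """Retry a request.get command with a larger maxTimeout after each failure.

    `post` is a hypothetical helper that sends the payload to FlareSolverr
    and returns the decoded JSON response.
    """
    last_message = "no attempts made"
    for max_timeout in timeouts_ms:
        result = post({"cmd": "request.get", "url": url, "maxTimeout": max_timeout})
        if result.get("status") == "ok":
            return result
        last_message = result.get("message", "unknown error")
    raise RuntimeError(f"FlareSolverr failed for {url}: {last_message}")
```

Note that the HTTP client's own timeout has to be longer than the largest maxTimeout value, otherwise the client gives up before FlareSolverr does.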

Summary

In this article, we explored the FlareSolverr tool. It bypasses Cloudflare by requesting web pages using a Selenium web browser with the Undetected ChromeDriver configuration.

We went through a step-by-step guide on installing FlareSolverr using Docker. We also explained how to web scrape using FlareSolverr by managing sessions, adding proxies and sending GET and POST requests.
