cURL is one of the oldest tools used for sending HTTP requests. Yet, it's still a great asset for the web scraping toolbox.
In this article, we'll go over a step-by-step guide on sending and configuring HTTP requests with cURL. We'll also explore advanced usages of cURL for web scraping, such as scraping dynamic pages and avoiding getting blocked. Let's get started!
What is cURL and Why Use It?
cURL, standing for "client for URL", is an open-source command-line tool used for transferring data with URLs. It's built on top of the libcurl C library. It supports all the common HTTP methods (GET, POST, PUT, etc.) over multiple protocols, including HTTP and HTTPS.
cURL isn't just fast and straightforward; it also provides comprehensive request configuration, including:
Adding custom headers and cookies.
Enabling or disabling request redirects.
Downloading binary files.
This makes cURL a viable web scraping tool for debugging and developing scraping scripts, or even for extracting small portions of data.
How To Install cURL?
Before we start web scraping with cURL, we must install it. cURL comes pre-installed on almost all operating systems. However, if it isn't found, run the commands below to install or upgrade it.
Linux
$ apt-get install curl
Mac
$ brew install curl
Windows
$ choco install curl
To verify your installation, simply run the following command. You should receive the cURL version details:
$ curl --version
How To Use cURL?
In this section, we'll explore the basics of cURL and how to use it to send different types of requests. Let's start with the most basic cURL usage: sending GET requests.
Sending GET Requests
cURL follows the below syntax for all the request types:
curl [OPTIONS] URL
OPTIONS
Represents the request options: configurations that can be passed to the request to specify headers, cookies, proxies, the request type, and so on. To list the commonly used options, use the curl -h command. To view all the available ones, use the curl -h all command.
URL
The actual URL to request.
To send a GET request with cURL, all we have to do is specify the URL to request, as it uses the GET method by default:
curl https://httpbin.dev/get
The above command will request the httpbin.dev/get endpoint, which echoes back the request details, including the headers used.
In the above cURL request, we override the default cURL User-Agent and Accept headers with custom ones. The response will then include the newly configured header values.
Sending POST Requests
So far, we have sent GET requests with cURL. In this section, we'll send POST requests. To send POST requests with cURL, we can use the -X option, which sets the request's HTTP method:
curl -X POST https://httpbin.dev/post
The above cURL command will send a POST request and return the request details.
Note that on Windows, you need to escape the body with backslashes:
curl -X POST -d "{\"key1\": \"value1\", \"key2\": \"value2\"}" https://httpbin.dev/post
Web Scraping With cURL
The standard web scraping process requires HTML parsing, crawling, processing, and saving the extracted data. cURL itself isn't suitable for such extensive scraping tasks. However, it can be a great asset for debugging and development purposes. Accordingly, we'll explore common web scraping tips and tricks using cURL.
Scraping Dynamic Pages With cURL
Data on dynamic websites is usually loaded through background XHR calls. These API calls can be captured in the browser developer tools and exported as cURL requests for web scraping.
For example, the review data on web-scraping.dev is loaded through background API requests.
First, let's capture the API calls on the above web page using the following steps:
Open the browser developer tools by pressing the F12 key.
Select the network tab and filter by Fetch/XHR calls.
Scroll down the page to load more review data.
After following the above steps, you will find the outgoing API calls recorded in the browser.
Next, copy the cURL representation of the request: right-click the request, hover over the Copy menu, and select Copy as cURL (bash) on Mac or Linux, or Copy as cURL (cmd) on Windows.
We can see the headers, cookies, and parameters used with the cURL request. Executing it will return the same HTML data seen in the browser:
<div class="testimonial">
<identicon-svg username="testimonial-11"></identicon-svg>
<div>
<span class="rating"></span>
<p class="text">The features are great but it took me a while to understand how to use them.</p>
</div>
</div>
<div class="testimonial">
<identicon-svg username="testimonial-12"></identicon-svg>
<div>
<span class="rating"></span>
<p class="text">Love the simplicity and effectiveness of this app.</p>
</div>
</div>
Now that we can execute a successful cURL request, we can import it to an HTTP client such as Postman. This allows us to convert the cURL command into a programming language script like Python requests to continue the scraping process from there.
Moreover, this approach allows our web scraping requests to be identical to those of normal users, reducing our chances of getting blocked!
Avoid cURL Scraping Blocking
cURL can be a viable tool for requesting and transferring data across web pages. However, websites use protection shields, such as Cloudflare, to prevent automated requests like those of cURL from accessing the website.
For example, let's attempt to request G2 with cURL. It's a popular website with Cloudflare protection:
curl https://www.g2.com/
The website greeted us with a Cloudflare challenge to solve:
<!DOCTYPE html><html lang="en-US"><head><title>Just a moment...</title>
To prevent cURL web scraping from getting blocked, we can use Curl Impersonate, a modified version of cURL that simulates the TLS fingerprint of normal web browsers. It also overrides the default cURL headers, such as the User-Agent, with regular browser values. This makes Curl Impersonate requests look like they were sent from a browser, preventing firewalls from detecting the usage of an HTTP client.
If we request G2 again with Curl Impersonate, we'll get the actual page HTML:
<h1 class="hero-unit__title" id="main">Where you go for software.</h1>
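Curl Impersonate ships with browser-specific wrapper scripts that apply the matching TLS fingerprint and headers. A request along these lines sends an impersonated request (the exact script name, such as curl_chrome116, depends on the version you have installed):

```shell
# curl_chrome116 is one of the wrapper scripts installed by Curl Impersonate;
# it sends the request with a Chrome TLS fingerprint and browser-like headers
curl_chrome116 https://www.g2.com/
```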
For more details on Curl Impersonate, including installation and usage, refer to our dedicated guide.
Adding Proxies to cURL
In the previous section, we explored preventing the detection of cURL for web scraping by modifying the requests' configuration. However, websites use another signal to block requests: the IP address.
Using proxies with cURL distributes the traffic load across multiple IP addresses. This makes it harder for websites and firewalls to trace the requests back to a single origin, improving the chances of avoiding blocking.
To add proxies for cURL, we can use the -x or --proxy option followed by the proxy URL:
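For example, a request routed through a proxy looks like this (the proxy address below is a placeholder; httpbin.dev/ip echoes back the IP address the request arrived from, which is handy for verifying the proxy works):

```shell
# -x (or --proxy) routes the request through the given proxy server
curl -x http://proxy.example.com:8080 https://httpbin.dev/ip
```

Authenticated proxies can include credentials in the proxy URL, e.g. http://user:password@proxy.example.com:8080.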
ScrapFly provides an API player that allows for converting cURL commands into ScrapFly-powered web scraping requests.
ScrapFly also provides a cURL to Python tool that converts cURL commands into different Python HTTP clients, such as requests, aiohttp, httpx, and curl_cffi.
Here is an example output of importing a cURL request from the browser into the ScrapFly API player to automatically add the request configuration. We'll also enable the asp parameter to bypass scraping blocking, select a proxy country and use the render_js feature to enable JavaScript:
from scrapfly import ScrapflyClient, ScrapeConfig, ScrapeApiResponse

scrapfly = ScrapflyClient(key="Your ScrapFly API key")

response: ScrapeApiResponse = scrapfly.scrape(ScrapeConfig(
    url="https://web-scraping.dev/api/testimonials?page=2",
    # enable anti scraping protection
    asp=True,
    # select a proxy country
    country="us",
    # enable JavaScript rendering, similar to headless browsers
    render_js=True,
    # headers assigned to the cURL request from the browser
    headers={
        "sec-ch-ua": "\"Chromium\";v=\"122\", \"Not(A:Brand\";v=\"24\", \"Google Chrome\";v=\"122\"",
        "x-secret-token": "secret123",
        "HX-Current-URL": "https://web-scraping.dev/testimonials",
        "sec-ch-ua-mobile": "?0",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
        "Referer": "https://web-scraping.dev/testimonials",
        "HX-Request": "true",
        "sec-ch-ua-platform": "\"Windows\""
    },
))

# get the HTML from the response
html = response.scrape_result['content']
# use the built-in Parsel selector
selector = response.selector
FAQ
To wrap up this guide on web scraping with cURL, let's have a look at some frequently asked questions.
Can I use cURL for web scraping?
Yes, but not in the traditional sense. cURL is an HTTP client that doesn't provide additional utilities for parsing or data processing. Therefore, web scraping with cURL is best suited for debugging and development purposes, or for extracting small amounts of data.
Are there alternatives for cURL?
Yes, curlie is a command-line HTTP client that combines cURL's features with the HTTPie interface. Another alternative to using cURL for web scraping is the Postman HTTP client. We have covered using Postman in a previous article.
Summary
In this guide, we explained how to web scrape with cURL. We started by exploring different cURL commands for various actions, including:
Sending GET requests.
Managing and manipulating HTTP headers and cookies.
Sending POST requests.
We have also explained common tips and tricks for web scraping with cURL, such as:
Scraping dynamic web pages by replicating background XHR calls.
Avoiding cURL scraping blocking using Curl Impersonate.
Preventing IP address blocking with cURL by adding proxies.