How to Use cURL For Web Scraping

cURL is one of the oldest tools used for sending HTTP requests. Yet, it's still a great asset for the web scraping toolbox.

In this article, we'll go over a step-by-step guide on sending and configuring HTTP requests with cURL. We'll also explore advanced usages of cURL for web scraping, such as scraping dynamic pages and avoiding getting blocked. Let's get started!

What is cURL and Why Use It?

cURL, short for "client for URL", is an open-source command-line tool for transferring data with URLs. It's built on top of the libcurl C library and supports the common HTTP methods (GET, POST, PUT, etc.) over several protocols, including HTTP and HTTPS.

cURL is not only fast and straightforward, but it also offers comprehensive request configuration, including:

  • Adding custom headers and cookies.
  • Enabling or disabling request redirects.
  • Downloading binary files.

This makes cURL a viable tool for debugging and developing scraping scripts, or even for extracting small amounts of data.

How To Install cURL?

Before we start web scraping with cURL, we must install it. cURL comes pre-installed on most operating systems. If it isn't found, or you want to upgrade it, run the command below for your platform.

Linux (Debian/Ubuntu)

$ sudo apt-get install curl

Mac

$ brew install curl

Windows

$ choco install curl

To verify your installation, simply run the following command. You should receive the cURL version details:

$ curl --version
# curl 8.4.0 (Windows) libcurl/8.4.0 Schannel WinIDN
# Release-Date: 2023-10-11

How To Use cURL?

In this section, we'll explore the basics of cURL and how to use it to send different request types. Let's start with the most basic cURL usage: sending GET requests.

Sending GET Requests

cURL uses the syntax below for all request types:

curl [OPTIONS] URL

  • OPTIONS
    The request options: configurations passed to the request to specify headers, cookies, proxies, the request method and so on. To list the commonly used options, use the curl -h command. To view all the available ones, use the curl -h all command.

  • URL
    The actual URL to request.

To send a GET request with cURL, all we have to do is specify the URL to request, as it uses the GET method by default:

curl https://httpbin.dev/get

The above command will request the httpbin.dev/get endpoint and return the request details:

{
  "args": {},
  "headers": {
    "Accept": [
      "*/*"
    ],
    "Accept-Encoding": [
      "gzip"
    ],
    "Host": [
      "httpbin.dev"
    ],
    "User-Agent": [
      "curl/8.4.0"
    ]
  },
  "url": "https://httpbin.dev/get"
}
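
Because httpbin.dev echoes the request back as JSON, the output is easy to inspect programmatically. The following is a minimal Python sketch, assuming the response body shown above has been captured as a string (for instance, by piping the cURL output into a script). The names sample_response and header_value are illustrative only and are not part of cURL or httpbin.dev.

```python
import json

# Response body from the httpbin.dev/get example above (lists compacted).
sample_response = """
{
  "args": {},
  "headers": {
    "Accept": ["*/*"],
    "Accept-Encoding": ["gzip"],
    "Host": ["httpbin.dev"],
    "User-Agent": ["curl/8.4.0"]
  },
  "url": "https://httpbin.dev/get"
}
"""


def header_value(body: str, name: str) -> str:
    """Return the first value of a header echoed back by httpbin.dev."""
    headers = json.loads(body)["headers"]
    # httpbin.dev reports each header as a list of values.
    return headers[name][0]


print(header_value(sample_response, "User-Agent"))  # curl/8.4.0
```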

We can see that the request has been sent successfully with the default cURL header configurations. Let's have a look at modifying them.

Adding Headers

To add headers with cURL, we can use the -H option once for each header. For example, here is how we can send a cURL request with User-Agent and Accept headers:

curl -H "User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/113.0" -H "Accept: application/json" https://httpbin.dev/headers

In the above cURL request, we override the cURL User-Agent and Accept headers with custom ones. The response will include the newly configured headers:

{
  "headers": {
    "Accept": [
      "application/json"
    ],
    "Accept-Encoding": [
      "gzip"
    ],
    "Host": [
      "httpbin.dev"
    ],
    "User-Agent": [
      "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/113.0"
    ]
  }
}
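
When the header set grows, typing one -H option per header by hand gets error-prone. As a rough sketch, the command can be generated from a dictionary in Python. The helper name curl_command is hypothetical, and the quoting produced by shlex.join targets POSIX shells such as bash, not Windows cmd.

```python
import shlex


def curl_command(url: str, headers: dict[str, str]) -> str:
    """Build a shell-safe curl command with one -H option per header."""
    args = ["curl"]
    for name, value in headers.items():
        args += ["-H", f"{name}: {value}"]
    args.append(url)
    # shlex.join quotes each argument for POSIX shells (bash, zsh).
    return shlex.join(args)


command = curl_command(
    "https://httpbin.dev/headers",
    {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/113.0",
        "Accept": "application/json",
    },
)
print(command)
```

The printed command can be pasted into a terminal as-is.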

Alternatively, we can change the cURL User-Agent header through the -A option:

curl -A "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/113.0" https://httpbin.dev/headers

How Headers Are Used to Block Web Scrapers and How to Fix It

Take an extensive look at request headers in web scraping and how to prevent web scrapers from being identified and blocked by configuring requests to appear as if they come from normal users.

Adding Cookies

Next, let's set cookies with cURL. For this, we can use the cURL -b option:

curl -b "cookie1=value1; cookie2=value2" https://httpbin.dev/cookies

The above command sends two cookies with the request, and the endpoint echoes them back:

{
  "cookie1": "value1",
  "cookie2": "value2"
}
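
The value passed to -b is simply a list of name=value pairs separated by semicolons. The sketch below, using only the Python standard library, shows one way to build that string from a dictionary and to split it back apart. The function names are hypothetical, and the joining helper assumes cookie values that need no special escaping.

```python
from http.cookies import SimpleCookie


def cookie_header(cookies: dict[str, str]) -> str:
    """Join name/value pairs into the string expected by curl -b."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


def parse_cookie_header(header: str) -> dict[str, str]:
    """Split a cookie string back into a dictionary."""
    jar = SimpleCookie()
    jar.load(header)
    return {name: morsel.value for name, morsel in jar.items()}


header = cookie_header({"cookie1": "value1", "cookie2": "value2"})
print(header)  # cookie1=value1; cookie2=value2
print(parse_cookie_header(header))
```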

Alternatively, we can treat the cookies as regular cURL headers and pass them through the cookie header:

curl -H "cookie: cookie1=value1; cookie2=value2" https://httpbin.dev/cookies

How to Handle Cookies in Web Scraping

Learn how to handle cookies in web scraping: what they are and how they work. We'll also go over a practical example of using cookies to scrape pages that require login. Let's dive in!

Sending POST Requests

In the previous sections, we have sent GET requests with cURL. In this one, we'll explain sending POST requests. To send POST requests with cURL, we can utilize the -X option, which determines the request HTTP method:

curl -X POST https://httpbin.dev/post

The above cURL command will send a POST request and return the request details:

{
  "args": {},
  "headers": {
    "Accept": [
      "*/*"
    ],
    "Accept-Encoding": [
      "gzip"
    ],
    "Content-Length": [
      "0"
    ],
    "Host": [
      "httpbin.dev"
    ],
    "User-Agent": [
      "curl/8.4.0"
    ]
  },
  "url": "https://httpbin.dev/post",
  "data": "",
  "files": null,
  "form": null,
  "json": null
}

In most cases, POST requests require a body. So, let's take a look at adding a request body with cURL requests.

Adding Request Body

To add a request body with cURL, we can use the -d option and pass the body as a string, such as a JSON document:

curl -X POST -d '{"key1": "value1", "key2": "value2"}' https://httpbin.dev/post

If we observe the response, we'll find the body passed to the request present:

{
  ....
  "data": "{\"key1\": \"value1\", \"key2\": \"value2\"}",
}

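Note that on Windows (cmd), single quotes aren't treated as quoting characters, so you need to wrap the body in double quotes and escape the inner double quotes with backslashes: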

curl -X POST -d "{\"key1\": \"value1\", \"key2\": \"value2\"}" https://httpbin.dev/post
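
Escaping JSON by hand is easy to get wrong, so the body can also be generated. The following Python sketch serializes a dictionary with json.dumps and lets shlex.join handle the quoting for POSIX shells (it does not produce Windows cmd quoting). The helper name curl_post_command is hypothetical, and the Content-Type header is an assumption that the target API expects JSON: without it, curl -d labels the body as application/x-www-form-urlencoded.

```python
import json
import shlex


def curl_post_command(url: str, payload: dict) -> str:
    """Build a curl POST command whose body is the JSON-encoded payload."""
    body = json.dumps(payload)
    args = [
        "curl", "-X", "POST",
        # Assumption: the endpoint expects a JSON body.
        "-H", "Content-Type: application/json",
        "-d", body,
        url,
    ]
    # Quote every argument so the JSON survives a POSIX shell untouched.
    return shlex.join(args)


post_command = curl_post_command(
    "https://httpbin.dev/post", {"key1": "value1", "key2": "value2"}
)
print(post_command)
```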

Web Scraping With cURL

The standard web scraping process requires HTML parsing, crawling, processing and saving the extracted data. Therefore, cURL by itself isn't suitable for extensive scraping tasks. However, it can be a great asset for debugging and development. Accordingly, we'll explore common tips and tricks for using cURL in web scraping.

Scraping Dynamic Pages With cURL

Data on dynamic websites is usually loaded through background XHR calls. These API calls can be captured in the browser developer tools and exported as cURL requests for web scraping.

For example, the review data on web-scraping.dev is loaded through background API requests:

[Figure: Reviews on web-scraping.dev]

First, let's capture the API calls on the above web page using the following steps:

  • Open the browser developer tools by pressing the F12 key.
  • Select the network tab and filter by Fetch/XHR calls.
  • Scroll down the page to load more review data.

After following the above steps, you will find the outgoing API calls recorded in the Network tab:

[Figure: Background API calls on web-scraping.dev]

Next, copy the cURL representation of the request. Right-click the request, hover over the Copy menu and select Copy as cURL (bash) if you are on Mac or Linux, or Copy as cURL (cmd) if you are on Windows.

[Figure: Copy the request as cURL]

The copied cURL command should look like this:

curl 'https://web-scraping.dev/api/testimonials?page=2' \
  -H 'authority: web-scraping.dev' \
  -H 'accept: */*' \
  -H 'accept-language: en-US,en;q=0.9' \
  -H 'cookie: cookiesAccepted=true' \
  -H 'hx-current-url: https://web-scraping.dev/testimonials' \
  -H 'hx-request: true' \
  -H 'referer: https://web-scraping.dev/testimonials' \
  -H 'sec-ch-ua: "Chromium";v="122", "Not(A:Brand";v="24", "Microsoft Edge";v="122"' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'sec-ch-ua-platform: "Windows"' \
  -H 'sec-fetch-dest: empty' \
  -H 'sec-fetch-mode: cors' \
  -H 'sec-fetch-site: same-origin' \
  -H 'user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0' \
  -H 'x-secret-token: secret123'

We can see the headers, cookies and parameters used with the cURL request. Executing it will return the same HTML data displayed in the browser:

<div class="testimonial">
    <identicon-svg username="testimonial-11"></identicon-svg>
    <div>
        <span class="rating"></span>
        <p class="text">The features are great but it took me a while to understand how to use them.</p>
    </div>
</div>


<div class="testimonial">
    <identicon-svg username="testimonial-12"></identicon-svg>
    <div>
        <span class="rating"></span>
        <p class="text">Love the simplicity and effectiveness of this app.</p>
    </div>
</div>

Now that we can execute a successful cURL request, we can import it to an HTTP client such as Postman. This allows us to convert the cURL command into a programming language script like Python requests to continue the scraping process from there.

Moreover, this approach allows our web scraping requests to be identical to those of normal users, reducing our chances of getting blocked!
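
To illustrate what such a conversion involves, here is a minimal Python sketch that turns a copied command into a urllib.request.Request object. It is not how Postman or any other tool works internally. The parse_curl helper is hypothetical and only understands the URL and -H/--header options, and the sample command is a shortened version of the one above.

```python
import shlex
import urllib.request


def parse_curl(command: str) -> urllib.request.Request:
    """Convert a simple `curl URL -H ...` command into a urllib Request.

    Only the URL and -H/--header options are handled; this is a sketch,
    not a full cURL parser.
    """
    # Join backslash-continued lines before tokenizing.
    tokens = shlex.split(command.replace("\\\n", " "))
    url, headers = None, {}
    args = iter(tokens[1:])  # skip the leading "curl"
    for token in args:
        if token in ("-H", "--header"):
            name, _, value = next(args).partition(":")
            headers[name.strip()] = value.strip()
        elif not token.startswith("-"):
            url = token
    if url is None:
        raise ValueError("no URL found in command")
    return urllib.request.Request(url, headers=headers)


copied = r"""curl 'https://web-scraping.dev/api/testimonials?page=2' \
  -H 'accept: */*' \
  -H 'hx-request: true' \
  -H 'referer: https://web-scraping.dev/testimonials' \
  -H 'x-secret-token: secret123'
"""

request = parse_curl(copied)
print(request.full_url)
# urllib normalizes header names, e.g. "x-secret-token" -> "X-secret-token".
print(request.header_items())
```

Passing the resulting object to urllib.request.urlopen would send the request; the sketch stops short of that so it can be run offline.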

How to Scrape Hidden APIs

In this web scraping tutorial, we'll explain how to find hidden APIs, how to scrape them, and the common challenges faced when developing web scrapers for hidden APIs.

Avoid cURL Scraping Blocking

cURL can be a viable tool for requesting and transferring data across web pages. However, websites use protection shields, such as Cloudflare, to prevent automated requests like those of cURL from accessing the website.

How to Scrape Without Getting Blocked? In-Depth Tutorial

In this article, we'll take a look at web scraping without getting blocked by exploring four core concepts where web scrapers fail to cover their tracks and how analysis of these details can lead to blocking.

For example, let's attempt to request G2 with cURL. It's a popular website with Cloudflare protection:

curl https://www.g2.com/

The website greeted us with a Cloudflare challenge to solve:

<!DOCTYPE html><html lang="en-US"><head><title>Just a moment...</title>
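
In a script, a blocked response like this can be flagged before any parsing is attempted. The following Python sketch is a simple heuristic that assumes the challenge page keeps the "Just a moment..." title shown above, which may change at any time. The function name looks_like_challenge is hypothetical.

```python
import re


def looks_like_challenge(html: str) -> bool:
    """Heuristic: True if the page title matches the challenge page above."""
    match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return bool(match) and match.group(1).strip() == "Just a moment..."


blocked = '<!DOCTYPE html><html lang="en-US"><head><title>Just a moment...</title>'
print(looks_like_challenge(blocked))  # True
```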

To avoid getting blocked when scraping with cURL, we can use Curl Impersonate, a modified version of cURL that simulates the TLS fingerprint of normal web browsers. It also overrides the default cURL headers, such as the User-Agent, with regular browser header values. This makes Curl Impersonate requests look like those sent from real browsers, making it harder for firewalls to detect the use of an HTTP client.

If we request G2 again with Curl Impersonate, we'll get the actual page HTML:

<h1 class="hero-unit__title" id="main">Where you go for software.</h1>

For more details on Curl Impersonate, including installation and usage, refer to our dedicated guide.

Use Curl Impersonate to scrape as Chrome or Firefox

In this article, we'll take a deep dive into the Curl Impersonate tool, which prevents TLS and HTTP2 fingerprinting by impersonating normal web browser configurations.

Adding Proxies to cURL

In the previous section, we explored how to keep cURL from being detected by modifying the request configuration. However, websites use another signal to block requests: the IP address.

How to Avoid Web Scraper IP Blocking?

In web scraping, IP tracking and analysis (aka fingerprinting) is often used to throttle and block web scrapers or other undesired visitors. In this article, we'll look at Internet Protocol addresses and how IP tracking technologies are used to block web scrapers.

Using proxies with cURL allows for distributing the traffic load across multiple IP addresses. This makes it harder for websites and firewalls to attribute the traffic to a single client, leading to better chances of avoiding blocking.

To add proxies for cURL, we can use the -x or --proxy option followed by the proxy URL:

curl -x <protocol>://<proxy_host>:<proxy_port> <url>

The above syntax is the unified syntax used to add proxies to cURL requests. In practice, this syntax can be used like this for different proxy types:

# HTTP
curl -x http://proxy_domain.com:8080 https://httpbin.dev/ip
# HTTPS
curl -x https://proxy_domain.com:8080 https://httpbin.dev/ip
# SOCKS5
curl -x socks5://proxy_domain.com:8080 https://httpbin.dev/ip
# Proxies with credentials
curl -x https://username:password@proxy.proxy_domain.com:8080 https://httpbin.dev/ip
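
Distributing requests across several proxies can be sketched in Python as follows. The proxy hosts are placeholders (example.com), and the proxy_url and curl_args helpers are hypothetical. Credentials are percent-encoded because cURL URL-decodes the user and password in the proxy string, so a character such as @ in a password has to be written as %40.

```python
from itertools import cycle
from typing import Iterator, Optional
from urllib.parse import quote


def proxy_url(scheme: str, host: str, port: int,
              username: Optional[str] = None,
              password: Optional[str] = None) -> str:
    """Build the proxy URL passed to curl -x, percent-encoding credentials."""
    auth = ""
    if username is not None:
        auth = quote(username, safe="")
        if password is not None:
            auth += ":" + quote(password, safe="")
        auth += "@"
    return f"{scheme}://{auth}{host}:{port}"


def curl_args(url: str, pool: Iterator[str]) -> list[str]:
    """Return curl arguments that route the request via the next proxy."""
    return ["curl", "-x", next(pool), url]


# Placeholder proxy pool; replace with real proxy hosts.
proxies = [
    proxy_url("http", "proxy1.example.com", 8080),
    proxy_url("http", "proxy2.example.com", 8080, "user", "p@ss"),
]
print(proxies)
```

Wrapping the list in itertools.cycle and passing it as pool makes each call to curl_args pick the next proxy in round-robin order.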

For more details on using proxies for web scraping, refer to our dedicated guide.

Introduction To Proxies in Web Scraping

In this introductory article, we will explore various types of proxies, compare their differences, identify common challenges in proxy usage and discuss the best practices for web scraping.

Powering Up with ScrapFly

cURL is a powerful tool for web scraping. However, scaling up web scraping operations may require a bit more, and this is where ScrapFly can lend a hand!

ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale.

ScrapFly provides an API player that allows for converting cURL commands into ScrapFly-powered web scraping requests:

[Figure: Importing a cURL command into ScrapFly's API player]

ScrapFly also provides a cURL to Python tool for converting cURL commands into code for different Python HTTP clients, such as requests, aiohttp, httpx, and curl_cffi.

Here is an example output of importing a cURL request from the browser into the ScrapFly API player, which automatically adds the request configuration. We also enable the asp parameter to bypass scraping blocking, select a proxy country and use the render_js feature to enable JavaScript rendering:

from scrapfly import ScrapflyClient, ScrapeConfig, ScrapeApiResponse

scrapfly = ScrapflyClient(key="Your ScrapFly API key")

response: ScrapeApiResponse = scrapfly.scrape(ScrapeConfig(
    url="https://web-scraping.dev/api/testimonials?page=2",
    # enable anti scraping protection
    asp=True,
    # select a proxy country
    country="us",
    # enable JavaScript rendering, similar to headless browsers
    render_js=True,
    # headers assigned to the cURL request from the browser
    headers={ 
        "sec-ch-ua": "\"Chromium\";v=\"122\", \"Not(A:Brand\";v=\"24\", \"Google Chrome\";v=\"122\"",
        "x-secret-token": "secret123",
        "HX-Current-URL": "https://web-scraping.dev/testimonials",
        "sec-ch-ua-mobile": "?0",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
        "Referer": "https://web-scraping.dev/testimonials",
        "HX-Request": "true",
        "sec-ch-ua-platform": "\"Windows\""
    },
))

# get the HTML from the response
html = response.scrape_result['content']

# use the built-in Parsel selector
selector = response.selector

FAQ

To wrap up this guide on web scraping with cURL, let's have a look at some frequently asked questions.

Can I use cURL for web scraping?

Yes, but not in the traditional sense. cURL is an HTTP client that doesn't provide additional utilities for parsing or data processing. Therefore, web scraping with cURL is best suited for debugging and development purposes or for extracting small amounts of data.

Are there alternatives for cURL?

Yes, curlie is a command-line HTTP client that uses the same cURL features with the HTTPie interface. Another alternative to using cURL for web scraping is the Postman HTTP client. We have covered using Postman in a previous article.

Summary

In this guide, we explained how to web scrape with cURL. We started by exploring different cURL commands for various actions, including:

  • Sending GET requests.
  • Managing and manipulating HTTP headers and cookies.
  • Sending POST requests.

We have also explained common tips and tricks for web scraping with cURL, such as:

  • Scraping dynamic web pages by replicating background XHR calls.
  • Avoiding cURL scraping blocking using Curl Impersonate.
  • Preventing IP address blocking with cURL by adding proxies.
