How to Handle Cookies in Web Scraping


Websites seem to seamlessly remember user preferences, but how can this behavior be replicated in web scraping? Preferences are usually controlled by cookies, and we can handle them in web scraping as well!

In this guide, we'll take a look at cookies in web scraping - what they are and how they work. We'll also go over a practical example of using cookies to scrape pages behind a login. Let's dive in!

What are HTTP Cookies?

HTTP cookies are small pieces of key-value data stored by an HTTP client or web browser. Most commonly, they are used to store user preferences and session information, such as login state, shopping cart items, and website settings.

Cookies are created by the web server when a user visits a website. They are saved on the user's browser (or HTTP client) through an HTTP response header called Set-Cookie. Then, with each following request, the browser sends the stored cookies back to the web server using the Cookie header:

Cookie: key1=value1; key2=value2

To demonstrate this, let's take a look at a Python example using the httpbin.dev/cookies scraper testing website:

import httpx

# request httpbin.dev to set cookies for us:
response = httpx.get("https://httpbin.dev/cookies/set?name=scrapfly&password=123")
# the response carries Set-Cookie headers which a browser (or HTTP client) would store in memory
print(response.headers.get_list("Set-Cookie"))
# e.g. ['name=scrapfly; Path=/', 'password=123; Path=/']

# Then, to use the cookies in the next request we add them to the Cookie header:
response2 = httpx.get(
    "https://httpbin.dev/cookies",
    headers={"Cookie": "name=scrapfly; password=123"},
)
print(response2.json())  # the endpoint echoes back the cookies it received

So, cookies are essentially just HTTP headers that are managed by a web browser or an HTTP client as a memory store.
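To illustrate that, httpx's Client object can act as this memory store, remembering cookies between requests much like a browser would (a minimal sketch using the same httpbin.dev test endpoints):

import httpx

with httpx.Client(follow_redirects=True) as client:
    # the Set-Cookie response header is read and stored in the client's cookie jar
    client.get("https://httpbin.dev/cookies/set?name=scrapfly")
    print(dict(client.cookies))  # e.g. {'name': 'scrapfly'}
    # the stored cookie is now attached automatically as a Cookie header
    client.get("https://httpbin.dev/cookies")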


Generally, we can categorize cookies into three different types:

Persistent.
Cookies with an expiration date that stay saved on the user's device even after the browser is closed. They are often used for long-term data storage, like saving the user's settings, bookmarks, and authentication state.

Session.
Temporary cookies that are deleted when the user closes the browser. They are typically used for short-term data storage, like saving shopping cart data during a single browsing session.

Third-party.
Generated by linked advertisements and analytics tools, not by the target website itself. These cookies are often used to track users' browsing habits and history, which enables advertisers to create custom ads based on the users' interests.
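From the protocol's point of view, what separates a persistent cookie from a session cookie is simply whether the server attaches an Expires or Max-Age attribute to the Set-Cookie header. For illustration (the names and values here are made up), a persistent cookie looks like:

Set-Cookie: currency=USD; Max-Age=31536000; Path=/

while a session cookie omits the expiry attribute and is dropped when the browser closes:

Set-Cookie: cart=abc123; Path=/

Third-party cookies use the same mechanism; they simply arrive from requests to advertising or analytics domains embedded in the page.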

That being said, from a technical perspective all of these cookies are handled identically in web scraping. Though when developing scrapers, we're mostly interested in persistent and session cookies for replicating website functionality.

Now that we have an overview of cookies and how they work, let's explore why we need cookies in web scraping!

Why Are Cookies Important in Web Scraping?

Cookies save user preferences like currency, language, and region selection, which are often expressed through cookie values like lang=en and currency=USD. By controlling these cookie values, it's possible to scrape a website in another language or currency.
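For example, here's a minimal sketch with httpx. The lang and currency cookie names are illustrative and only have an effect if the target website actually reads them; httpbin.dev/cookies simply echoes back whatever it receives:

import httpx

# hypothetical preference cookies - the target website must actually use these names
preferences = {"lang": "en", "currency": "USD"}

# httpx turns the dictionary into a Cookie header on the request
response = httpx.get("https://httpbin.dev/cookies", cookies=preferences)
print(response.json())  # the endpoint echoes back the cookies it received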

Cookies can also save browsing information like shopping cart data, recommended products, and payment options in e-commerce apps. Adding cookies to web scrapers can help scrape this type of data.

How to Scrape Cart Data From Local Storage

Learn about local storage and how it can be scraped using browser automation tools like Puppeteer, Playwright, Selenium or Scrapfly SDK.


Websites can also analyze web scraper connections, including their headers and cookies, to determine whether the request sender is an organic browser or a scraper bot. Therefore, carefully adding cookies can help avoid web scraping detection.
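For instance, a minimal sketch with httpx: reusing whatever cookies the website sets on the first visit makes follow-up requests look like a continuing browser session rather than isolated, cookie-less hits (web-scraping.dev is used here only as a safe practice target):

import httpx

with httpx.Client(follow_redirects=True) as client:
    # the first visit may receive Set-Cookie headers which the client stores
    client.get("https://web-scraping.dev/")
    print(dict(client.cookies))
    # follow-up requests resend those cookies automatically, like a real browser would
    client.get("https://web-scraping.dev/products")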

Another valuable use case of cookies in web scraping is scraping pages that require a login, so let's take a look at an example of that. Before proceeding, though, let's look at the tools we'll be using.

Setup

In this guide, we'll use cookies to scrape pages with login using different web scraping libraries:

  • httpx: An HTTP client for sending requests and receiving HTML or JSON responses.
  • Playwright: A library for running and automating headless browsers.
  • ScrapFly: A web scraping API that allows for scraping at scale.
  • BeautifulSoup: A parsing library for extracting data from HTML.

These libraries can be installed using the following pip command:

pip install httpx playwright scrapfly-sdk bs4

Note that Playwright also requires downloading its browser binaries with the playwright install command.

How to Use Cookies in Web Scraping

In this guide, we'll use cookies to scrape an authenticated page from web-scraping.dev:

Cookies practice example on web-scraping.dev

The above page's data is private and only accessible to logged-in users. This website controls login through an authentication cookie, so to scrape this page we must replicate the authentication state by adding the cookie value to our requests.

To start, we need to obtain the authentication cookie from the browser. To do that, open the browser developer tools (the F12 key in most browsers), head to the Network tab, and reload the page. There we can see the cookies our browser is sending:

Background requests as seen in Chrome developer tools

On the left side, we can see all the requests sent from the browser to the web server, though we are only interested in the login request. The Headers tab shows all the headers and cookies sent along with it.

The cookie auth represents the authentication token. We'll add this cookie name and value to the Cookie header in our web scraper:

Httpx:

from httpx import Client
from bs4 import BeautifulSoup

# Add the cookie header to the headers object
headers = {"Cookie": "auth=user123-secret-token"}

with Client(headers=headers) as client:
    response = client.get("https://web-scraping.dev/login")
    soup = BeautifulSoup(response.text, "html.parser")
    for div in soup.select("div.form-text.mb-2"):
        print(div.text)
        """Logged in as User123
        The secret message is:🤫"""

Playwright:

from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

with sync_playwright() as playwright:
    # Launch a headless Chromium browser
    browser = playwright.chromium.launch(headless=True)
    context = browser.new_context()
    context.set_extra_http_headers(
        {
            # Add the cookie header to the headers object
            "Cookie": "auth=user123-secret-token"
        }
    )
    page = context.new_page()
    page.goto("https://web-scraping.dev/login")
    soup = BeautifulSoup(page.content(), "html.parser")
    for div in soup.select("div.form-text.mb-2"):
        print(div.text)
        """Logged in as User123
        The secret message is:🤫"""

ScrapFly:

from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
from bs4 import BeautifulSoup

scrapfly = ScrapflyClient(key="Your ScrapFly API key")

api_response: ScrapeApiResponse = scrapfly.scrape(
    scrape_config=ScrapeConfig(
        url="https://web-scraping.dev/login",
        # Add the cookie header to the headers object
        headers={
            "Cookie": "auth=user123-secret-token",
        },
        # Enable headless browsers (like playwright)
        render_js=True,
        # Enable the anti scraping protection to bypass blocking
        asp=True,
    )
)

soup = BeautifulSoup(api_response.scrape_result["content"], "html.parser")
for div in soup.select("div.form-text.mb-2"):
    print(div.text)
    """Logged in as User123
    The secret message is:🤫"""

Here, we add the Cookie header to our web scraping client and wrap the HTML response in a BeautifulSoup object. Then, we search the HTML for the target text.

Works like a charm! By using cookies for web scraping, the website recognizes us as an authenticated user and gives us access to the login-protected data.

FAQ

To wrap up this guide, let's look at some frequently asked questions about cookies in web scraping.

How to add cookies to web scrapers?

Cookies can be added through HTTP headers by attaching the cookie values to the Cookie header. Alternatively, you can add specific cookie values to the HTTP client itself; for a real-life example, see the web scraping localization using cookies example page.
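For example, with httpx both approaches take a single line; httpbin.dev/cookies is used here only to echo back the cookies it receives:

import httpx

# option 1: attach cookies manually as a raw Cookie header
httpx.get("https://httpbin.dev/cookies", headers={"Cookie": "lang=en; currency=USD"})

# option 2: pass individual cookie values and let the client build the Cookie header
httpx.get("https://httpbin.dev/cookies", cookies={"lang": "en", "currency": "USD"})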

How to scrape a website that requires login with Python?

To scrape pages behind a login, you can reverse engineer the requests the browser sends to find the cookies responsible for the authentication state, just like we did above. Alternatively, you can simulate a login session by submitting the login credentials with your HTTP client and reusing the cookies it receives, as sketched below.
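Here's a minimal sketch of the second approach; the /login endpoint URL and the username/password form field names are hypothetical and will differ per website:

import httpx

# a hypothetical login flow - the endpoint and field names are assumptions
with httpx.Client(follow_redirects=True) as client:
    client.post(
        "https://example.com/login",
        data={"username": "user123", "password": "secret"},
    )
    # any Set-Cookie values from the login response are stored in client.cookies
    # and sent automatically with the following requests:
    response = client.get("https://example.com/account")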

Summary

In summary, cookies are small pieces of data that the server stores on the client's side. They keep track of various information such as the user's preferences, authentication state, and browsing data.

We went through a step-by-step example of using cookies in web scraping to access pages that require a login, which works by attaching the Cookie header to the web scraping requests.

Related Posts

Sending HTTP Requests With Curlie: A better cURL

In this guide, we'll explore Curlie, a better cURL version. We'll start by defining what Curlie is and how it compares to cURL. We'll also go over a step-by-step guide on using and configuring Curlie to send HTTP requests.

How to Use cURL For Web Scraping

In this article, we'll go over a step-by-step guide on sending and configuring HTTP requests with cURL. We'll also explore advanced usages of cURL for web scraping, such as scraping dynamic pages and avoiding getting blocked.

Use Curl Impersonate to scrape as Chrome or Firefox

Learn how to prevent TLS fingerprinting by impersonating normal web browser configurations. We'll start by explaining what the Curl Impersonate is, how it works, how to install and use it. Finally, we'll explore using it with Python to avoid web scraping blocking.