Cookies in Web Scraping

Cookies are small data values that the server stores on the client's system. They are used to track client state, such as login authentication.

illustration of a general Cookies header flow

Each request made by the client sends the entire cookie set to the server, and each response can add to or modify the client's cookies.
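
To make this concrete, here's a minimal sketch using Python's httpx that inspects both sides of this exchange against web-scraping.dev, the demo site used later in this article (the exact cookie attributes returned may vary):

```python
import httpx

# the server sets a cookie through the Set-Cookie response header
response = httpx.post(
    "https://web-scraping.dev/api/login",
    data={"username": "user123", "password": "password"},
)
print(response.headers.get("set-cookie"))
# e.g. auth=user123-secret-token; Path=/; ...

# the client echoes it back in the Cookie header of later requests
follow_up = httpx.get(
    "https://web-scraping.dev/login",
    cookies=dict(response.cookies),
)
print(follow_up.request.headers.get("cookie"))
# e.g. auth=user123-secret-token
```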

Cookies play a role in many different web functionalities encountered in web scraping.

Most importantly, cookies are just request headers and are treated the same way as any other header, except that cookie values can expire or become invalidated.
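
For instance, expiry and scope travel as attributes of the Set-Cookie header itself. Here's a quick sketch using Python's standard http.cookies module to parse them; the header value below is hypothetical:

```python
from http.cookies import SimpleCookie

# parse a raw Set-Cookie header value (hypothetical example)
cookie = SimpleCookie()
cookie.load("auth=user123-secret-token; Max-Age=3600; Path=/; HttpOnly")

morsel = cookie["auth"]
print(morsel.value)        # user123-secret-token
print(morsel["max-age"])   # 3600 - seconds until the cookie expires
print(morsel["httponly"])  # True - not readable by in-page JavaScript
```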

Probably the most commonly encountered cookie use in web scraping is authentication, so let's walk through a real-life example.

Real Life Example

For this example, let's take a look at the web-scraping.dev/login page: how it uses cookies for login authentication and how to scrape it.

screencapture of web-scraping.dev login page

If we press the "Submit" button to log in while watching the Network tab of the browser Developer Tools (F12 key), we can see that the returned response comes with a `Set-Cookie` header:

screencapture of web-scraping.dev login request in Chrome devtools

This header contains a secret authentication token that allows the client to access areas of the site that require login. From now on, every request to web-scraping.dev will include a Cookie header with this secret value:

screencapture of web-scraping.dev sending Cookie header to all other endpoints

To replicate this behavior in our web scraper we need to:

  1. Capture the Set-Cookie response header returned after login and save its value
  2. Include the Cookie request header with this secret value in every following request

In this web-scraping.dev example, we obtain the cookie by submitting a valid login form. Let's take a look at how it all works together:

Scrapfly SDK (Python)

```python
from scrapfly import ScrapflyClient, ScrapeConfig, UpstreamHttpClientError
import uuid

client = ScrapflyClient(key="YOUR SCRAPFLY KEY")
# create a unique session id so that cookies persist between requests
session = str(uuid.uuid4()).replace("-", "")
# 1. submit the login form; returned cookies are stored in the session
login = client.scrape(
    ScrapeConfig(
        url="https://web-scraping.dev/api/login",
        session=session,
        debug=True,
        method="POST",
        data={
            "username": "user123",
            "password": "password",
        },
    )
)
# 2. requests using the same session reuse the stored cookies automatically
secret = client.scrape(ScrapeConfig("https://web-scraping.dev/login", session=session))
secret_value = secret.selector.css("#secret-message::text").get()
print(secret_value)
```

Scrapfly SDK (JavaScript)

```javascript
import { ScrapflyClient, ScrapeConfig } from 'scrapfly-sdk';
import { v4 as uuidv4 } from 'uuid';
import cheerio from 'cheerio';

const client = new ScrapflyClient({ key: "YOUR SCRAPFLY KEY" });
// create a session 
const session = uuidv4();
// log in the current session:
await client.scrape(
    new ScrapeConfig({ 
        url: 'https://web-scraping.dev/api/login',
        session: session,
        method: "POST",
        data: {
            username: "user123",
            password: "password",
        }
    }),
);
// cookies will automatically be used for requests with the same session:
const secretResult = await client.scrape(
    new ScrapeConfig({ 
        url: 'https://web-scraping.dev/login',
        session: session,
    }),
);

let $ = cheerio.load(secretResult.result.content);
let secret = $('#secret-message').text();
console.log(secret);
```

Python (httpx)

```python
import httpx  # or requests

# 1. Perform login POST request to get authentication cookies:
login_response = httpx.post(
    "https://web-scraping.dev/api/login",
    data={
        "username": "user123",
        "password": "password",
    },
)
print(dict(login_response.cookies))
# {'auth': 'user123-secret-token'}

# 2. use cookies in following requests:
secret_response = httpx.get("https://web-scraping.dev/login", cookies=dict(login_response.cookies))
# find secret message only available to logged in users:
from parsel import Selector

secret = Selector(secret_response.text).css("#secret-message::text").get()
print(secret)
# 🤫


# ALTERNATIVE:
# httpx can track cookies automatically when httpx.Client() or httpx.AsyncClient() is used.

with httpx.Client() as client:
    login_response = client.post(  # note: client.post, not httpx.post, so cookies are tracked
        "https://web-scraping.dev/api/login",
        data={
            "username": "user123",
            "password": "password",
        },
    )
    # the client's cookie jar now holds the auth cookie and sends it automatically:
    secret_response = client.get("https://web-scraping.dev/login")
    # find secret message only available to logged in users:
    secret = Selector(secret_response.text).css("#secret-message::text").get()
    print(secret)
```

JavaScript (axios)

```javascript
import axios from 'axios';
import cheerio from 'cheerio';


// 1. Perform login POST request to get authentication cookies
let data = new URLSearchParams();
data.append('username', 'user123');
data.append('password', 'password');
let loginResponse = await axios.post('https://web-scraping.dev/api/login', data.toString(), {
    headers: {
        'Content-Type': 'application/x-www-form-urlencoded',
    },
    maxRedirects: 0, // Prevent axios from following redirects
    validateStatus: false, 
});
const cookies = loginResponse.headers['set-cookie'].map((c) => c.split(';')[0]);  // keep only name=value pairs
console.log(cookies); // cookie name=value pairs from the 'set-cookie' headers

// 2. Use cookies in following requests
let secretResponse = await axios.get('https://web-scraping.dev/login', {
    headers: {
        'Cookie': cookies.join('; '),  // Send cookies in 'Cookie' header
    },
});

// Find secret message only available to logged in users
let $ = cheerio.load(secretResponse.data);
let secret = $('#secret-message').text();
console.log(secret);
// 🤫
```

Above, we're treating cookies as normal header values: the initial login request receives a `Set-Cookie` header containing the user's authentication token, which we then attach to scrape all endpoints that require authentication.
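
As a final illustration of this, here's a short sketch that passes the cookie as a plain header (reusing the auth token value shown earlier; real token values will differ):

```python
import httpx

# the auth cookie value captured from the login response earlier
# can be attached like any other request header:
secret_response = httpx.get(
    "https://web-scraping.dev/login",
    headers={"Cookie": "auth=user123-secret-token"},
)
```

That said, dedicated cookie parameters and cookie jars are usually preferable, as they handle expiry, scoping, and encoding for you.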