How to scrape Local Storage using Headless Browsers

Local Storage is a modern browser feature that allows websites to store data in the browser. It is similar to cookies but unlike cookies, it is not sent to the server with every request. This makes Local Storage a great place to store data like user preferences (like language selection) or even user browsing state (like product cart).

In this tutorial, we'll take a look at how Local Storage can be scraped using browser automation tools like Puppeteer, Playwright, Selenium or Scrapfly SDK.

Setup

In this quick overview of Local Storage scraping, we'll use the most popular headless browser tools, which can be installed with pip for Python:

$ pip install playwright selenium "scrapfly[all]"

or using npm for Node.js:

$ npm install puppeteer playwright

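Note that while we'll be covering all of these tools, you only need to choose one of them to follow along. The Python version of Playwright also needs browser binaries, which can be downloaded with the playwright install command. See our web scraping using browsers intro for more information on how to choose the right tool for your project.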

Where can Local Storage be encountered?

Local Storage is becoming an increasingly popular way to store page data as it can hold 5-10MB of data, compared to about 4KB for cookies. This makes it a great place to store user preferences, browsing state or even user-generated content. Most commonly, Local Storage is used to store:

  • User's website preferences like theme
  • User's browsing preferences like language
  • Website state like product cart

Local Storage is especially popular with single-page applications (SPAs), which render every page with JavaScript and need a client-side place to keep state between views.

How to access Local Storage?

Local Storage is a key-value store that can be accessed using JavaScript. It is accessible through the window.localStorage object. The localStorage object has methods to set, get and remove items from the store:

window.localStorage.setItem('language', 'en'); // set item
window.localStorage.getItem('language'); // get item
// 'en'
window.localStorage.removeItem('language'); // remove item

Unfortunately, this also means that Local Storage is only available in a browser environment. So, to scrape Local Storage we need a headless browser that can interpret these local storage commands.

Real World Example

Let's take a look at how to read and set Local Storage data in Playwright, Selenium, Puppeteer and Scrapfly SDK. For this, we'll be using web-scraping.dev as an example, which uses Local Storage to manage the product cart:

Local Storage use example on web-scraping.dev product cart as seen in Chrome devtools inspector

In the demo above, we can see that when we add an item to the cart, the page writes the cart data to Local Storage.

To replicate this behavior when web scraping, we can execute Local Storage JavaScript commands in our headless browsers. The examples below cover Playwright, Selenium, Puppeteer and Scrapfly SDK, in that order:

Playwright

import json
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()

    # If we add an item to cart by clicking on the "Add to cart" button:
    page.goto("https://web-scraping.dev/product/1")
    page.click(".add-to-cart")
    # we can see it sets Local Storage
    local_storage_data = page.evaluate("() => localStorage")
    print(local_storage_data)
    # {'cart': '{"1_orange-small":1}'}  # that's 1 product in the cart

    # we can then modify cart by modifying Local Storage "cart" value:
    # e.g. add one more item to the cart
    cart = json.loads(local_storage_data["cart"])
    cart['1_cherry-small'] = 1
    # then call javascript to set the new value:
    page.evaluate("(cart) => localStorage.setItem('cart', JSON.stringify(cart))", cart)

    # check that the cart has 2 items now:
    page.goto("https://web-scraping.dev/cart")
    page.wait_for_selector(".cart-full .cart-item")
    print(f'items in cart: {len(page.query_selector_all(".cart-item .cart-title"))}')
    # items in cart: 2

Selenium

import json
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()

driver.get("https://web-scraping.dev/product/1")
# wait until the button can actually be clicked, not just until it exists in the DOM
add_to_cart_button = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, ".add-to-cart")))
add_to_cart_button.click()
time.sleep(1)  # wait for the Local Storage to get updated

local_storage_data_str = driver.execute_script("return JSON.stringify(localStorage);")
local_storage_data = json.loads(local_storage_data_str)

print(local_storage_data)
# {'cart': '{"1_orange-small":1}'}

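# add one more item to the cart
cart = json.loads(local_storage_data["cart"])
cart['1_cherry-small'] = 1

# pass the cart as a script argument instead of interpolating it into the JavaScript source
driver.execute_script("localStorage.setItem('cart', JSON.stringify(arguments[0]));", cart)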

driver.get("https://web-scraping.dev/cart")
cart_items = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".cart-item .cart-title")))
print(f'items in cart: {len(cart_items)}')
# items in cart: 2

driver.quit()

Puppeteer

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({ headless: false });
    const page = await browser.newPage();

    // If we add an item to cart by clicking on the "Add to cart" button:
    await page.goto('https://web-scraping.dev/product/1');
    await page.click('.add-to-cart');
    // we can see it sets Local Storage
    const localStorageData = await page.evaluate(() => JSON.parse(JSON.stringify(localStorage)));
    console.log(localStorageData);
    // {'cart': '{"1_orange-small":1}'}  // that's 1 product in the cart

    // we can then modify cart by modifying Local Storage "cart" value:
    // e.g. add one more item to the cart
    const cart = JSON.parse(localStorageData.cart);
    cart['1_cherry-small'] = 1;
    // then call JavaScript to set the new value:
    await page.evaluate((cart) => {
        localStorage.setItem('cart', JSON.stringify(cart));
    }, cart);

    // check that the cart has 2 items now:
    await page.goto('https://web-scraping.dev/cart');
    await page.waitForSelector('.cart-full .cart-item');
    const cartItems = await page.$$eval('.cart-item .cart-title', items => items.length);
    console.log(`items in cart: ${cartItems}`);
    // items in cart: 2

    await browser.close();
})();

Scrapfly SDK

import json
from scrapfly import ScrapflyClient, ScrapeConfig

scrapfly = ScrapflyClient("YOUR SCRAPFLY KEY")

result = scrapfly.scrape(ScrapeConfig(
    "https://web-scraping.dev/product/1",
    render_js=True,
    js_scenario=[
        {"click": {"selector": ".add-to-cart"}},
        # retrieve current local storage
        {"execute": {"script": "return localStorage.getItem('cart')", "timeout": "1000"}},
        # we can also change cart by setting local storage:
        {"execute": {"script": f"localStorage.setItem('cart', '{json.dumps({'1_cherry-small': 1})}')", "timeout": "1000"}},
        {"execute": {"script": "return localStorage.getItem('cart')", "timeout": "1000"}},
    ],
))
# final local storage can always be retrieved
data = result.scrape_result['browser_data']['local_storage_data']

# alternatively we can get local storage from each step:
# retrieve cart from results:
cart = result.scrape_result['browser_data']['js_scenario']['steps'][1]['result']
# {"1_orange-small":1}
cart_updated = result.scrape_result['browser_data']['js_scenario']['steps'][3]['result']
# {"1_cherry-small": 1}

In the example above, we explore two ways to interact with Local Storage on web-scraping.dev: retrieving existing data using localStorage.getItem() and setting new data using localStorage.setItem().
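
Because Local Storage values are always strings, structured entries such as the cart arrive as JSON text and need one more decoding step before they can be used. Below is a minimal sketch of that step, assuming the dump is a plain dictionary of strings like the one printed in the Playwright and Selenium examples. The decode_local_storage name is a hypothetical helper, not part of any library:

```python
import json


def decode_local_storage(raw: dict) -> dict:
    """Decode JSON-encoded values of a Local Storage dump, keeping plain strings as-is."""
    decoded = {}
    for key, value in raw.items():
        try:
            decoded[key] = json.loads(value)
        except (TypeError, ValueError):
            # not valid JSON (e.g. a bare string like "en"), so keep the raw value
            decoded[key] = value
    return decoded


# the cart entry is copied from the example output above,
# the language entry mirrors the setItem example from earlier in the article
dump = {"cart": '{"1_orange-small":1}', "language": "en"}
print(decode_local_storage(dump))
# {'cart': {'1_orange-small': 1}, 'language': 'en'}
```

Note that this sketch also turns numeric-looking strings such as "1" into numbers, which may or may not be desirable depending on the website.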

FAQ

To wrap up this guide on how to scrape Local Storage, let's take a look at some related frequently asked questions:

What's the difference between Local Storage and Session Storage?

Session Storage works like Local Storage but is scoped to a single tab and is cleared when that tab or window is closed. It is accessed the same way, using sessionStorage.setItem(key, value) to create an entry and sessionStorage.getItem(key) to retrieve it. In web scraping, Session Storage is not encountered as often, but since the API is the same it can be scraped as described in this guide.
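
Since both stores share one API, a single helper can generate the JavaScript for either of them. The sketch below is one possible way to do it, assuming values are stored as JSON like the cart in this guide. The storage_set_script name is hypothetical, and the helper only builds a string, so it still has to be run in a browser, for example through Selenium's driver.execute_script():

```python
import json


def storage_set_script(storage: str, key: str, value) -> str:
    """Return a JavaScript snippet that stores `value` as JSON under `key`."""
    if storage not in ("localStorage", "sessionStorage"):
        raise ValueError(f"unknown storage: {storage}")
    # json.dumps produces valid JavaScript string literals, so quotes in the
    # key or value cannot break out of the generated script
    encoded_value = json.dumps(json.dumps(value))
    return f"{storage}.setItem({json.dumps(key)}, {encoded_value});"


print(storage_set_script("sessionStorage", "cart", {"1_orange-small": 1}))
# sessionStorage.setItem("cart", "{\"1_orange-small\": 1}");
```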

What's the difference between IndexedDB and Local Storage?

IndexedDB is a separate, asynchronous browser database that can store larger amounts of data and more complex data structures than Local Storage. It is accessed by opening a database with indexedDB.open() and then reading or writing through transactions created with the transaction() method of the returned database object. In web scraping, IndexedDB is not encountered as often as Local Storage, though it can still be scraped with the same approach as described in this guide: executing JavaScript in a headless browser.

Can local storage be scraped without headless browsers?

Sort of. While it's impossible to read Local Storage data without a browser, it is possible to reverse engineer how the Local Storage data is generated and rebuild it manually in Python. For example, if the cart entry is generated from the product page HTML, we can replicate this behavior by extracting the product id from the HTML itself.
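
As a rough illustration of that idea, the sketch below rebuilds a cart value from a product page fragment. The HTML, the data-product-id and data-variant-id attributes and the build_cart_value helper are all hypothetical and only stand in for whatever markup the real website uses. The "<product id>_<variant id>" key format mirrors the cart output seen earlier in this guide:

```python
import json
import re

# hypothetical product page fragment; the real markup on any given site will differ
SAMPLE_HTML = """
<div class="product" data-product-id="1">
  <a class="variant selected" data-variant-id="orange-small">Orange, small</a>
</div>
"""


def build_cart_value(html: str, quantity: int = 1) -> str:
    """Build the JSON string a site might store under the "cart" key."""
    # assumed attribute names, see the note above
    product_id = re.search(r'data-product-id="([^"]+)"', html).group(1)
    variant_id = re.search(r'data-variant-id="([^"]+)"', html).group(1)
    # key format "<product id>_<variant id>" mirrors the cart output seen earlier
    cart = {f"{product_id}_{variant_id}": quantity}
    # compact separators match the JSON.stringify output format
    return json.dumps(cart, separators=(",", ":"))


print(build_cart_value(SAMPLE_HTML))
# {"1_orange-small":1}
```

The resulting string only becomes useful if the website also accepts the cart from somewhere other than Local Storage, so this approach has to be verified for each scraping target.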

Summary

Local Storage is an increasingly popular web technique that is often encountered in web scraping.
To scrape Local Storage, the best approach is to employ a real web browser, and in this guide we've taken a look at how to do that in all major browser automation frameworks.
