Web scraping mostly comes down to sending HTTP requests to website pages to retrieve data. This process can be slow and resource-intensive, and this is where caching comes in to save the day! Even basic caching can deliver faster response times and a significant reduction in resource use.
In this article, we'll explore caching for web scraping. We'll start by explaining what caching is, cover its common strategies, and discuss why it's useful in web scraping. Finally, we'll go through a practical example of building a web scraping cache system using Python and Redis. Let's get started!
What is Caching?
The goal of caching is to reduce redundancy by storing and reusing results. So, where a single function could take several seconds to run, a bit of caching can significantly speed up the process by reusing earlier results.
Cached results are usually kept on hardware with fast read and write speeds, such as RAM, but they can also be stored on modern hard drives or even in cloud databases. Most commonly, cache is stored as key-value objects, or in databases for more complex data structures.
To illustrate this, here are some popular caching services commonly encountered in web development and web scraping:
Redis - an in-memory data structure store, used as a database, cache, and message broker. This is the one we'll focus on in this tutorial.
Memcached - a general-purpose distributed memory caching system.
Varnish - a very popular web application accelerator also known as a caching HTTP reverse proxy.
Data stored in a cache service has an expiration time, meaning it's only kept for a specific period. When this expiration time is reached, the data is removed from the store, ensuring the cache remains up-to-date.
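For example, with the official redis Python client, an expiration time can be attached directly when storing a value. Here's a minimal sketch, assuming a Redis server is running locally on the default port:

import redis

client = redis.Redis(host="localhost", port=6379)

# store a value that Redis will automatically remove after 60 seconds
client.set("page:example", "<html>...</html>", ex=60)

# ttl() reports how many seconds remain before the key expires
print(client.ttl("page:example"))  # e.g. 60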
Why Use Cache For Web Scraping?
Caching increases performance and decreases computing costs by preventing redundancy. Here are a few reasons why you should use cache in web scraping:
Reduce Response Time
Sending HTTP requests while scraping can be time-consuming, especially with rich and complex web pages. On the other hand, cached data is kept in fast memory storage and can be retrieved in a fraction of a second. This boost in response time speeds up the web scraping process and accelerates the development and debugging cycles.
Reduce Server Load
Instead of making repeated requests to the server for the same data, it can be retrieved from the cache. This reduces the number of requests sent, which prevents overloading the websites' servers and results in more ethical web scraping practices.
Reduce Consumed Bandwidth
Using cached data can help minimize bandwidth usage by cutting out repeated requests. This is particularly valuable when using residential proxies, which charge by bandwidth and can be very expensive. For example, re-downloading a 500 KB page 1,000 times costs roughly 500 MB of metered proxy traffic.
Note that caching can make debugging harder and cause stale data issues. Therefore, it's crucial to use caching only when needed and to set an appropriate expiration time.
Common Caching Techniques
There are several different ways to cache data, defined by the caching pattern used:
Cache Aside
Manually storing and retrieving data from external cache storage. This is what we'll focus on in this tutorial, and it's the most common caching pattern used with services like Redis and Memcached.
Read or Write Through Cache
An automated middleware service that sits between the scraper and the server and manages the cache transparently. This is encountered in frameworks like Scrapy, where the cache is handled automatically by the framework.
Write Back Cache
Similar to read/write-through caching but with a delayed write, which can increase performance in high-bandwidth operations but is not very applicable in web scraping.
Generally, cache aside is the most common caching technique in Python web scraping. It's simple to implement and can be used with any web scraping framework or library.
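To make the cache-aside flow concrete before we build it on Redis, here's a deliberately minimal in-memory sketch using a plain Python dict:

import httpx

cache = {}  # in-memory cache mapping URL -> HTML

def get_page(url: str) -> str:
    if url in cache:            # 1. check the cache first
        return cache[url]
    html = httpx.get(url).text  # 2. on a miss, fetch from the source
    cache[url] = html           # 3. store the result for next time
    return html

A real implementation would also handle expiration and persistence, which is exactly what we'll get from Redis below.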
Common Cache Storage
Cache can be stored in any database, in memory, or even in a simple file. However, Redis and Memcached are two of the most popular cache storage services. Both are in-memory data stores that can act as a database, cache, and message broker. Both are open-source, have official Python clients, and are very accessible for web scraping.
Implement a Cache System With Redis
In this section, we'll create a simple Python cache for web scraping. We'll be using httpx to send HTTP requests and the Redis Python client to interact with the Redis cache database. These libraries can be installed using the following pip command:
pip install httpx redis
We'll also need to install the actual Redis database binaries. Refer to the official Redis installation page for detailed instructions.
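Alternatively, if Docker is available, a disposable local Redis instance can be started with a single command:

docker run -d -p 6379:6379 redis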
Let's start with our Python caching logic by defining a simple caching class:
import redis

class RedisCache:
    """Simple Redis caching class"""

    def __init__(self, host='localhost', port=6379, expire_time=300):
        """Connect to the Redis database and set an expiration time for cached entries in seconds"""
        self.redis_client = redis.StrictRedis(host=host, port=port)
        self.expire_time = expire_time

    def get(self, key):
        """Get a specific cached value by its key, or None if missing or expired"""
        cached_data = self.redis_client.get(key)
        if cached_data:
            return cached_data.decode('utf-8')
        return None

    def set(self, key, data):
        """Set a specific value to a key with the configured expiration time"""
        self.redis_client.setex(key, self.expire_time, data)
Here, we define a simple RedisCache class that connects to the Redis database and sets an expiration time for cached data in seconds. We also define getter and setter methods: set stores a new key-value cache pair, while get retrieves a cached value by its key.
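Before wiring this into the scraper, we can sanity-check the class with a quick set/get round trip (assuming a local Redis server is running on the default port):

cache = RedisCache(expire_time=60)
cache.set("greeting", "hello")
print(cache.get("greeting"))  # hello
print(cache.get("missing"))   # None - not cached (or expired)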
Next, we'll utilize this logic with httpx to cache HTML pages:
import httpx

def request_with_cache(url: str, cache: RedisCache):
    # check if the response is already cached
    cached_data = cache.get(url)
    if cached_data:
        print(f"Using cached data for {url}")
        return cached_data
    # if not cached, send an HTTP request
    print(f"Cache not found for {url}")
    response = httpx.get(url)
    if response.status_code == 200:
        # cache the response for future use
        cache.set(url, response.text)
    return response.text
We define a request_with_cache function that uses the URL as the cache key and the page HTML as the value. It first looks for the URL in the cache; if found, it retrieves the HTML directly from there. If not, it requests the URL using httpx and caches the HTML for future use.
Let's test out our web scraping cache. We'll send a few requests to the same page and measure the response time. These requests are split into two groups: one group will utilize the cache and the other won't:
import time

# previous code ...

cache = RedisCache()
# webpage that mimics a one-second delay
url = "https://httpbin.dev/delay/1"

# cache the page first
html = request_with_cache(url, cache)

# first group: requests with cache
start_time = time.time()
for i in range(6):
    html = request_with_cache(url, cache)
end_time = time.time()
print(f"Cache execution time: {end_time - start_time:.6f} seconds")

# second group: requests without cache
start_time = time.time()
for i in range(6):
    response = httpx.get(url)
end_time = time.time()
print(f"Normal requests execution time: {end_time - start_time:.6f} seconds")
We can see a huge performance boost from caching: the cached requests took almost no time to complete!
Existing Caching Libraries
Fortunately, many web scraping libraries and frameworks already implement caching through plugins or extensions. Here are some of the most popular ones:
requests-cache - extension for the popular Python requests package with multiple caching backends: redis, sqlite, mongodb, gridfs, dynamodb, filesystem and memory (not persistent).
CacheControl - another extension for the Python requests package with support for redis and multiple filesystem backends.
httpx-caching - extension for the popular Python httpx HTTP client package with redis and filesystem backends.
axios-cache-adapter - extension for the popular JavaScript HTTP client axios with redis and memory backends.
node-fetch-cache - extension for the Node.js fetch HTTP client with filesystem and memory backends.
These tools are almost drop-in solutions for web scraper cache support and are a great way to get started with caching in scraping.
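For example, with requests-cache, enabling caching is a one-liner that transparently patches the requests package. Here's a minimal sketch; see the library's documentation for backend-specific options:

import requests
import requests_cache

# patch requests globally with a SQLite-backed cache expiring after 5 minutes
requests_cache.install_cache("scrape_cache", backend="sqlite", expire_after=300)

response = requests.get("https://httpbin.dev/delay/1")  # hits the network
response = requests.get("https://httpbin.dev/delay/1")  # served from the cache
print(response.from_cache)  # True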
That being said, in web scraping it's often safest to manage the cache manually to avoid unexpected behavior, as we did in our Redis example above.
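For instance, we could extend the RedisCache class from earlier with an explicit invalidation method (a hypothetical addition, not part of the code above) to drop entries we know are stale:

    # hypothetical addition to the RedisCache class defined earlier
    def delete(self, key):
        """Manually invalidate a cached entry, e.g. when the page is known to have changed"""
        self.redis_client.delete(key)

With this, the scraper can evict a specific URL from the cache (e.g. cache.delete(url)) instead of waiting for the expiration time to pass.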
Web Scraping Cache in ScrapFly
ScrapFly is a web scraping API with a built-in caching mechanism, and it can help with scaling all web scraping operations!
To cache web scraping responses using ScrapFly, all we have to do is enable the cache and add an optional expiration timeout parameter:
from scrapfly import ScrapflyClient, ScrapeConfig, ScrapeApiResponse

scrapfly = ScrapflyClient(key="Your ScrapFly API key")

response: ScrapeApiResponse = scrapfly.scrape(ScrapeConfig(
    cache=True,  # enable the cache feature
    cache_ttl=300,  # set an expiration timeout in seconds
    country="US",  # select a specific proxy country location
    asp=True,  # bypass any website scraping blocking
    render_js=True,  # enable JS rendering, similar to headless browsers
    url="https://httpbin.dev/anything",  # add the target website URL
))

# access the HTML from the response
html = response.scrape_result['content']

# use the built-in Parsel selector for HTML parsing
selector = response.selector
To wrap up this guide, let's have a look at some frequently asked questions about using cache for web scraping.
What is caching in web scraping?
Caching in the context of web scraping is the process of storing web page responses for future reuse. This improves performance, saves bandwidth, and speeds up response times.
How to implement caching in Scrapy?
Scrapy already includes a powerful cache middleware, which can be enabled in the project's settings.py configuration.
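A minimal settings.py sketch enabling Scrapy's built-in HTTP cache might look like this (the values shown are illustrative):

# settings.py
HTTPCACHE_ENABLED = True  # turn on Scrapy's HttpCacheMiddleware
HTTPCACHE_EXPIRATION_SECS = 300  # expire cached responses after 5 minutes
HTTPCACHE_DIR = "httpcache"  # cache location (under the project's .scrapy dir)
HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"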
How to implement caching in Selenium, Puppeteer or Playwright?
Headless web browsers already cache web page resources by default. However, it's important to reuse browser profiles across scraping runs for persistent caching, as each new profile starts with an empty cache.
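For example, in Playwright (Python) a persistent browser profile can be reused across runs. A minimal sketch, where the profile directory path is an arbitrary choice:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # reusing the same profile directory keeps the browser cache between runs
    context = p.chromium.launch_persistent_context("./browser-profile", headless=True)
    page = context.new_page()
    page.goto("https://httpbin.dev/html")
    context.close()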
Summary
In this article, we've taken a look at caching in web scraping and what a powerful tool it can be to speed up the scraping process. Caching can store scrape response data and reuse it in subsequent requests. This reduces response time, server load and bandwidth usage.
We've also covered common caching patterns and existing caching libraries, and gone through a step-by-step guide on creating a Python caching layer for web scraping using Redis.