Cache

The cache feature stores scraped content on Scrapfly's cache servers for up to 1 week. Any subsequent request to the same URL is served from the cache, which is much faster than scraping the content again.

When the cache feature is enabled, cache status and usage are also shown in the cache tab of the log page in the monitoring dashboard:

cache details on the monitoring logs page

The cache feature is primarily used in scraper development and testing, but it can be used in production as well. The following illustration shows how Scrapfly determines cache use:

A cache HIT is determined by the scrape configuration fingerprint: request method, URL, headers, and body (if present).
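
To make the fingerprint idea concrete, here is a minimal sketch of how the four listed components could be combined into one stable key. Scrapfly's actual algorithm is not described here, so the helper name `config_fingerprint`, the use of SHA-1, the JSON serialization, and the header and method normalization are all assumptions made for illustration only.

```ruby
require "digest/sha1"
require "json"

# Hypothetical helper (not Scrapfly's implementation): hashes the request
# method, URL, headers and body into a single hex digest.
def config_fingerprint(method:, url:, headers: {}, body: nil)
  # Assumption: header names are case-insensitive and their order is irrelevant.
  normalized_headers = headers.map { |name, value| [name.to_s.downcase, value.to_s] }.sort
  payload = JSON.generate([method.to_s.upcase, url, normalized_headers, body])
  Digest::SHA1.hexdigest(payload)
end

a = config_fingerprint(method: "GET", url: "https://web-scraping.dev/product/1")
b = config_fingerprint(method: "get", url: "https://web-scraping.dev/product/1")
c = config_fingerprint(method: "POST", url: "https://web-scraping.dev/product/1", body: "page=2")

puts a == b # true: same configuration, so the second request could be a HIT
puts a == c # false: different method and body, so a separate cache entry
```

Under this model, changing any of the four components produces a new fingerprint and therefore a cache MISS.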

The cache expiration time can be configured in seconds with the cache_ttl (Cache Time To Live) parameter. The cache can also be cleared explicitly using the cache_clear parameter.
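
As a rough sketch, the two parameters can be added to the request URL like any other query parameter. The helper name `scrape_url` is hypothetical, and passing `cache_clear=true` is an assumption modeled on the `cache=true` form used in the usage example.

```ruby
require "uri"

# Hypothetical helper: builds a scrape URL with the cache enabled and,
# optionally, a custom TTL and an explicit cache clear.
def scrape_url(api_key:, target:, cache_ttl: nil, cache_clear: false)
  params = { "key" => api_key, "url" => target, "cache" => "true" }
  params["cache_ttl"] = cache_ttl.to_s if cache_ttl   # expiration time in seconds
  params["cache_clear"] = "true" if cache_clear       # assumed value format
  uri = URI("https://api.scrapfly.io/scrape")
  uri.query = URI.encode_www_form(params)             # percent-encodes the target URL
  uri.to_s
end

puts scrape_url(api_key: "__API_KEY__", target: "https://web-scraping.dev/product/1", cache_ttl: 3600)
```

Building the query with `URI.encode_www_form` encodes the target URL exactly once, which avoids double-encoding mistakes such as `%253A`.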

Sharing Policy

The cache is isolated by project and environment. In other words, the cache from project A is not available in project B.

TTL & Eviction Policy

By default, cache_ttl is set to one day (86400 seconds), so cached content expires after 24 hours. This can be extended up to 1 week (604800 seconds) using the cache_ttl parameter, which is specified in seconds.
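
The TTL values are plain second counts, so the expiry of an entry is its creation time plus the TTL. The constant and helper names below are hypothetical and only illustrate the arithmetic.

```ruby
DEFAULT_CACHE_TTL = 24 * 60 * 60      # one day in seconds
MAX_CACHE_TTL     = 7 * 24 * 60 * 60  # one week in seconds

# Hypothetical helper: the moment an entry created at `created_at` expires.
def cache_expiry(created_at, ttl = DEFAULT_CACHE_TTL)
  created_at + ttl # Time + Integer adds seconds
end

created = Time.utc(2023, 7, 13, 19, 10, 24)
puts DEFAULT_CACHE_TTL                     # 86400
puts MAX_CACHE_TTL                         # 604800
puts cache_expiry(created)                 # 2023-07-14 19:10:24 UTC
puts cache_expiry(created, MAX_CACHE_TTL)  # 2023-07-20 19:10:24 UTC
```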

The cache can be force-cleared using the cache_clear parameter or from each of your log pages in the web interface.

Specification

Scrapfly's cache system is not related to HTTP cache or any existing caching mechanism. All cache policies on Scrapfly are adapted for web scraping first.

This allows unique caching features, like the ability to cache non-safe HTTP methods such as POST, PUT, and PATCH.

Note that cache use can slightly reduce web scraping speed on a cache MISS, as Scrapfly has to store the result on its cache servers.

The cache stores the original response body. This means that when cache is combined with content transformation features such as format=markdown, the cached content can be replayed with a different format such as format=json.
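
A small sketch of what replaying looks like from the client side: two requests for the same target that differ only in the `format` value. The helper name `cached_scrape_url` is hypothetical; the parameter values come from the text above.

```ruby
require "uri"

# Hypothetical helper: same target and cache setting, different `format` value.
def cached_scrape_url(target, format)
  params = { "key" => "__API_KEY__", "url" => target, "cache" => "true", "format" => format }
  "https://api.scrapfly.io/scrape?" + URI.encode_www_form(params)
end

target = "https://web-scraping.dev/product/1"
puts cached_scrape_url(target, "markdown") # first request stores the original body on a MISS
puts cached_scrape_url(target, "json")     # later request can replay the cached body as JSON
```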

Limitation

  • The cache feature cannot be used together with Session
  • The maximum allowed TTL is 604800 seconds (7 days)
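
These two limitations can be checked on the client before a request is sent. The sketch below is one possible pre-flight check; the helper name `cache_param_errors` and the `session` parameter name are assumptions, and the API's own validation behavior is not described here.

```ruby
CACHE_TTL_LIMIT = 604_800 # 7 days in seconds, as stated in the list above

# Hypothetical pre-flight check: returns a list of problems with the
# cache-related query parameters, or an empty list when they look valid.
def cache_param_errors(params)
  errors = []
  return errors unless params["cache"].to_s == "true" # nothing to check without cache

  # Assumption: the Session feature is enabled through a `session` parameter.
  errors << "cache cannot be combined with session" if params.key?("session")

  ttl = params["cache_ttl"]
  errors << "cache_ttl exceeds #{CACHE_TTL_LIMIT} seconds" if ttl && ttl.to_i > CACHE_TTL_LIMIT
  errors
end

puts cache_param_errors({ "cache" => "true", "cache_ttl" => "3600" }).inspect
puts cache_param_errors({ "cache" => "true", "session" => "checkout", "cache_ttl" => "700000" }).inspect
```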

When cache is used with JavaScript Rendering, HIT results do not include screenshots or custom JavaScript features, and the browser is neither invoked nor billed.

Usage

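require "uri"
require "net/http"

# Scrape the target URL with the cache enabled (cache=true).
# Replace __API_KEY__ with your Scrapfly API key.
url = URI("https://api.scrapfly.io/scrape?cache=true&key=__API_KEY__&url=https%3A%2F%2Fweb-scraping.dev%2Fproduct%2F1")

https = Net::HTTP.new(url.host, url.port)
https.use_ssl = true

request = Net::HTTP::Get.new(url)

# Send the request and print the JSON response body.
response = https.request(request)
puts response.read_body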

Response example

...
"context": {
    ...
    "cache": {
        "state": "HIT",
        "entry": {
            "user_uuid": "e2f896cc-224c-484f-b633-0f5263623f79",
            "created_at": "2023-07-13 19:10:24 UTC",
            "last_used_at": "2023-07-14 00:08:02 UTC",
            "fingerprint": "315eed1c0703aa6b1761b7753e2ee5c795f97cda",
            "size": 733,
            "env": "LIVE",
            "ttl": 50000,
            "expires_at": "2020-07-14 19:10:24 UTC",
            "response_headers": {
                "date": "Mon, 13 Jul 2020 19:10:24 GMT",
                "content-type": "application/json",
                "content-length": "733",
                "server": "gunicorn/19.9.0",
                "access-control-allow-origin": "*",
                "access-control-allow-credentials": "true",
                "x-cache": "MISS from springgreen-alastor-atlas-blue",
                "x-cache-lookup": "MISS from springgreen-alastor-atlas-blue:8888",
                "connection": "keep-alive"
            },
            "response_status": 200,
            "url": "https://web-scraping.dev/product/1"
        }
    },
    ...
},
...
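
A short sketch of reading the cache state from such a response. The sample below is a trimmed stand-in shaped like the example above, not a complete API response, and the helper name `cache_state` is hypothetical.

```ruby
require "json"

# Trimmed sample that keeps only the fields used below.
sample = <<~JSON
  {
    "context": {
      "cache": {
        "state": "HIT",
        "entry": { "ttl": 50000, "fingerprint": "315eed1c0703aa6b1761b7753e2ee5c795f97cda" }
      }
    }
  }
JSON

# Hypothetical helper: returns the cache state string, or nil when the
# response carries no cache information.
def cache_state(api_response)
  api_response.dig("context", "cache", "state")
end

data = JSON.parse(sample)
puts cache_state(data)        # HIT
puts cache_state({}).inspect  # nil
```

Using `dig` keeps the lookup safe when the `context` or `cache` keys are absent.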


Pricing

There is no additional fee for cache usage. Cache size is not metered, but fair use applies.
