Cache

The cache feature stores a scrape's content in the cache, with options such as TTL and cache clearance, described below. With it, you can call the API multiple times for the same page without hitting the upstream website each time.

The cache has many uses. It is mainly used to reduce latency when your Spider accesses the same web page multiple times, and it can bring large performance gains to live applications. It is also useful for iterating quickly on a prototype during development while avoiding hitting the upstream website's rate limit.

The cache also reduces the pressure your Spider puts on the upstream website. For that purpose, however, we recommend the throttling feature, which is designed for it.

Sharing Policy

Cache sharing follows several rules to stay consistent and avoid surprises.

The cache is not shared across projects or environments; they are isolated from each other.

Content rendered with JavaScript and non-interpreted content are not shared with each other, so the two never conflict. Note that scrapes that have JavaScript rendering enabled do share the same content with one another.

We serve cached entries based on a fingerprint computed from your scrape config. The fingerprint is based on the method, URL, headers, and body, if present.
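To make the idea of a fingerprint concrete, here is a minimal sketch in Python. It is only an illustration: the hash function (SHA-1), the JSON canonicalization, and the header normalization are assumptions, and the function name `cache_fingerprint` is hypothetical. The service's actual algorithm is not described on this page.

```python
import hashlib
import json


def cache_fingerprint(method, url, headers=None, body=None):
    """Illustrative fingerprint over method, URL, headers, and optional body.

    Assumption: SHA-1 over a canonical JSON document. The real service's
    algorithm and normalization rules are not documented here.
    """
    canonical = {
        "method": method.upper(),
        "url": url,
        # Lowercase header names and sort them so ordering does not matter.
        "headers": sorted((k.lower(), v) for k, v in (headers or {}).items()),
        "body": body,  # None when the request has no body
    }
    payload = json.dumps(canonical, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()
```

In this sketch, two requests that differ only in header order or header-name case map to the same fingerprint, while changing the method, URL, or body produces a different one.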

TTL & Eviction Policy

By default, the cache TTL is one day (86400 seconds). You can configure it via the cache_ttl parameter, expressed in seconds. When you set a TTL longer than one month, the system automatically evicts the entry if it has not been used after one month.

You can force the cache to clear by passing cache_clear=true to the API. You can also purge the cache from the log inspection in the web interface.
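As a rough sketch of how these parameters fit into a request, the Python helper below builds a scrape URL with caching enabled, an optional TTL, and an optional forced clear. The endpoint and the parameter names (key, url, cache, cache_ttl, cache_clear) come from this page; the helper name `build_scrape_url` is hypothetical.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://api.scrapfly.io/scrape"


def build_scrape_url(api_key, target_url, cache_ttl=None, cache_clear=False):
    """Build a scrape URL with the cache enabled."""
    params = {"key": api_key, "url": target_url, "cache": "true"}
    if cache_ttl is not None:
        params["cache_ttl"] = str(cache_ttl)  # TTL in seconds
    if cache_clear:
        params["cache_clear"] = "true"  # force the cached entry to be cleared
    return f"{API_ENDPOINT}?{urlencode(params)}"


# Cache the page for one hour.
print(build_scrape_url("__API_KEY__", "https://amazon.fr", cache_ttl=3600))
```

Passing `cache_clear=True` to the same helper adds `cache_clear=true` to the query string.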

Specification

Our cache system is independent of HTTP caching mechanisms. All of its policies are adapted to web scraping requirements.

You can cache non-safe HTTP methods such as POST, PUT, and PATCH.

When the cache state is MISS, the scrape incurs some overhead compared to a non-cached scrape because of the way the cache is synced. In return, by the time you receive the scrape result, the cache entry is fully available, which avoids many side effects.

Limitation

  • The cache feature cannot be used together with the Session feature.
  • The maximum allowed TTL is 604800 seconds (7 days).
  • If browser features are used (rendering time, JavaScript execution, screenshots), they have no effect when the cache is HIT, because the browser is not invoked (and not billed).
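The first two limitations can be checked on the client before sending a request. The sketch below is one possible way to do that in Python. The `session` parameter name and the `check_cache_params` helper are assumptions made for illustration; only cache, cache_ttl, and the 604800-second limit come from this page.

```python
MAX_CACHE_TTL = 604800  # 7 days in seconds, per the limitation above


def check_cache_params(params):
    """Return a list of problems found in a dict of scrape parameters."""
    problems = []
    if str(params.get("cache", "false")).lower() != "true":
        return problems  # cache disabled: nothing to check
    if params.get("session"):  # hypothetical parameter name
        problems.append("cache cannot be combined with a session")
    ttl = params.get("cache_ttl")
    if ttl is not None and int(ttl) > MAX_CACHE_TTL:
        problems.append(f"cache_ttl exceeds {MAX_CACHE_TTL} seconds")
    return problems
```

Returning a list of problems rather than raising on the first one lets the caller report every issue at once.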

Usage

    curl -G \
    --request "GET" \
    --url "https://api.scrapfly.io/scrape" \
    --data-urlencode "key=__API_KEY__" \
    --data-urlencode "url=https://amazon.fr" \
    --data-urlencode "cache=true" \
    --data-urlencode "cache_ttl=500" \
    --data-urlencode "country=de"

    Equivalent URL:

    https://api.scrapfly.io/scrape?key=__API_KEY__&url=https%3A%2F%2Famazon.fr&cache=true&cache_ttl=500&country=de
    
    

Response example

    ...
    "context": {
        ...
        "cache": {
            "state": "HIT",
            "entry": {
                "user_uuid": "e2f896cc-224c-484f-b633-0f5263623f79",
                "created_at": "2020-07-13 19:10:24 UTC",
                "last_used_at": "2020-07-14 00:08:02 UTC",
                "fingerprint": "315eed1c0703aa6b1761b7753e2ee5c795f97cda",
                "size": 733,
                "env": "LIVE",
                "ttl": 50000,
                "expires_at": "2020-07-14 19:10:24 UTC",
                "response_headers": {
                    "date": "Mon, 13 Jul 2020 19:10:24 GMT",
                    "content-type": "application/json",
                    "content-length": "733",
                    "server": "gunicorn/19.9.0",
                    "access-control-allow-origin": "*",
                    "access-control-allow-credentials": "true",
                    "x-cache": "MISS from springgreen-alastor-atlas-blue",
                    "x-cache-lookup": "MISS from springgreen-alastor-atlas-blue:8888",
                    "connection": "keep-alive"
                },
                "response_status": 200,
                "url": "https://httpbin.dev/anything"
            }
        },
        ...
    },
    ...
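
A response shaped like the example above can be inspected as follows. This is a minimal Python sketch: it assumes the decoded JSON exposes `context` at the level passed to the function, and the helper name `cache_summary` is hypothetical. The field names (state, entry, fingerprint, expires_at) and the timestamp format are taken from the example.

```python
from datetime import datetime, timezone


def cache_summary(result):
    """Extract the cache state and expiry from a decoded scrape response."""
    cache = result.get("context", {}).get("cache") or {}
    entry = cache.get("entry") or {}
    expires_at = None
    if entry.get("expires_at"):
        # Timestamps in the example look like "2020-07-14 19:10:24 UTC".
        expires_at = datetime.strptime(
            entry["expires_at"], "%Y-%m-%d %H:%M:%S UTC"
        ).replace(tzinfo=timezone.utc)
    return {
        "hit": cache.get("state") == "HIT",
        "fingerprint": entry.get("fingerprint"),
        "expires_at": expires_at,
    }
```

A missing or empty cache block is reported as a non-hit with no fingerprint and no expiry, so the helper can be called on any response.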
    

Integration

Pricing

There is no additional fee for using the cache, and cache size is not metered; fair use applies.