Cache

[Figure: overview page of the web interface]
[Figure: Cache tab of the log inspection view]

Introduction

The cache feature lets you store the content of a scrape, with the options described below such as TTL and cache clearing. You can then call the API multiple times for the same page without hitting the upstream website.

There are many use cases. The main one is reducing latency when your spider accesses the same web page multiple times. In live applications, the cache can give you a large performance gain. During development, it helps you iterate quickly on your prototype and avoid exhausting the rate limit of the upstream website.

The cache also reduces the pressure your spider applies on the upstream website. For that purpose, however, we recommend the throttling feature, which is designed for it.

The cache size is not metered. If abusive usage is spotted, an account manager will contact you to understand your usage. This does not mean we will suspend your account; the goal is only to understand the situation.
The cache feature cannot be used while sessions are enabled. Doing so results in an API error that describes the problem.
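
The restriction above can also be checked on the client side before a request is sent. The following is a minimal sketch, not part of the API: the helper name `check_cache_params` is hypothetical, and it assumes that sessions are enabled through a `session` request parameter, which this page does not confirm.

```python
def check_cache_params(params):
    """Reject parameter sets that enable both cache and session.

    Assumption: a session is enabled through a "session" parameter.
    """
    # Accept both the string "true" and the boolean True for the cache flag.
    cache_on = str(params.get("cache", "false")).lower() == "true"
    session_on = bool(params.get("session"))
    if cache_on and session_on:
        raise ValueError("cache cannot be used while a session is enabled")
    return params
```

Calling the helper before each request lets you fail early in your own code instead of waiting for the API error.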

Sharing Policy

Cache sharing follows several rules in order to stay consistent and avoid misconceptions.

The cache is not shared across projects or environments; they are isolated from each other.

Content rendered with JavaScript and non-rendered content are not shared. This avoids conflicts between entries, since a scrape with JavaScript rendering enabled does not return the same content as one without it.

Cached responses are served based on a fingerprint of your scrape config. The fingerprint is computed from the method, URL, headers, and body (if present).
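
To illustrate the idea of a fingerprint, here is a minimal sketch that hashes the four inputs listed above. It is an illustration only: the actual algorithm used by the service is not described on this page, and the function name, the SHA-1 choice, and the normalization steps are all assumptions.

```python
import hashlib
import json


def sketch_fingerprint(method, url, headers=None, body=None):
    """Hypothetical fingerprint over method, URL, headers and body."""
    payload = {
        "method": method.upper(),
        "url": url,
        # HTTP header names are case-insensitive, so normalize and sort them.
        "headers": sorted((k.lower(), v) for k, v in (headers or {}).items()),
        "body": body,
    }
    # A stable serialization gives the same digest for the same request.
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha1(encoded).hexdigest()
```

Two requests that differ in any of the four inputs produce different fingerprints and therefore do not share a cache entry.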

TTL & Eviction Policy

By default, the cache TTL is one day. You can configure it with the cache_ttl setting, expressed in seconds. When you set a TTL longer than one month, a cache entry that has not been used after one month is automatically evicted by the system.

You can force the cache to be cleared by passing the cache_clear=true argument to the API. You can also purge the cache from the log inspection view in the web interface.
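
As a minimal sketch, the cache options can be assembled into a request URL like this. The endpoint and the parameter names `key`, `url`, `cache`, `cache_ttl` and `cache_clear` come from this page; the helper name `build_scrape_url` is hypothetical, and sending `cache=true` together with `cache_clear=true` is an assumption.

```python
from urllib.parse import urlencode

SCRAPE_ENDPOINT = "https://api.scrapfly.io/scrape"


def build_scrape_url(api_key, url, cache_ttl=None, cache_clear=False):
    """Build a scrape URL with caching enabled (illustrative helper)."""
    params = {"key": api_key, "url": url, "cache": "true"}
    if cache_ttl is not None:
        if cache_ttl <= 0:
            raise ValueError("cache_ttl must be a positive number of seconds")
        params["cache_ttl"] = str(int(cache_ttl))
    if cache_clear:
        # Ask the API to drop the existing entry for this request.
        params["cache_clear"] = "true"
    # urlencode percent-encodes the target URL, as the API expects.
    return f"{SCRAPE_ENDPOINT}?{urlencode(params)}"
```

For example, `build_scrape_url("__API_KEY__", "https://amazon.fr", cache_ttl=500)` requests a cached scrape with a TTL of 500 seconds.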

Specification

Our cache system is unrelated to HTTP caching or any other caching standard. All policies are adapted to web scraping requirements.

You can cache non-safe HTTP methods such as POST, PUT, and PATCH.

On a cache MISS, an overhead is observed compared to a non-cached scrape, because the cache is written synchronously. This means that when you receive the result of the scrape, the cache entry is fully available, which avoids many side effects.

Example

API Cache Example
    curl -X GET "https://api.scrapfly.io/scrape?key=__API_KEY__&url=https%3A%2F%2Famazon.fr&country=de&cache=true&cache_ttl=500"

HTTP Call Pretty Print
https://api.scrapfly.io/scrape?key=&url=https%3A%2F%2Famazon.fr&country=de&cache=true&cache_ttl=500

key =
url = https%3A%2F%2Famazon.fr
country = de
cache = true
cache_ttl = 500

Cache Result Example (JSON)
...
"context": {
    ...
    "cache": {
        "state": "HIT",
        "entry": {
            "user_uuid": "e2f896cc-224c-484f-b633-0f5263623f79",
            "created_at": "2020-07-13 19:10:24 UTC",
            "last_used_at": "2020-07-14 00:08:02 UTC",
            "fingerprint": "315eed1c0703aa6b1761b7753e2ee5c795f97cda",
            "size": 733,
            "env": "LIVE",
            "ttl": 50000,
            "expires_at": "2020-07-14 19:10:24 UTC",
            "response_headers": {
                "date": "Mon, 13 Jul 2020 19:10:24 GMT",
                "content-type": "application/json",
                "content-length": "733",
                "server": "gunicorn/19.9.0",
                "access-control-allow-origin": "*",
                "access-control-allow-credentials": "true",
                "x-cache": "MISS from springgreen-alastor-atlas-blue",
                "x-cache-lookup": "MISS from springgreen-alastor-atlas-blue:8888",
                "connection": "keep-alive"
            },
            "response_status": 200,
            "url": "http://httpbin.org/anything"
        }
    },
    ...
},
...
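
Once the response is parsed, the cache state can be read from the `context.cache.state` field shown in the example above. This is a minimal sketch: the helper name `cache_state` is hypothetical, and it assumes that `context` sits at the top level of the parsed JSON.

```python
def cache_state(result):
    """Return the cache state of a parsed API result, or None if absent."""
    # Tolerate results without cache information, e.g. when cache is disabled.
    cache = (result.get("context") or {}).get("cache") or {}
    return cache.get("state")


sample = {
    "context": {
        "cache": {
            "state": "HIT",
            "entry": {"fingerprint": "315eed1c0703aa6b1761b7753e2ee5c795f97cda"},
        }
    }
}
print(cache_state(sample))  # prints "HIT"
```

Checking this value is a simple way to confirm during development that repeated calls are served from the cache.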

        

Integration