Cache
The cache feature stores scraped content on Scrapfly's cache servers for up to 1 week. Any subsequent request to the same URL is served from the cache, which is much faster than scraping the content again.
When the cache feature is enabled, cache status and usage are also shown in the cache tab of the monitoring dashboard log page.
Primarily, the cache feature is used in scraper development and testing, but it can be used in production as well.
The following illustration shows how Scrapfly determines cache use:
A cache HIT is determined by the scrape configuration fingerprint: request method, URL, headers, and body (if present).
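To make the fingerprint idea concrete, here is a minimal Ruby sketch of keying a cache on method, URL, headers, and body. It is an illustration only: the function name `scrape_fingerprint`, the header normalization, and the use of SHA-1 are assumptions, not Scrapfly's actual algorithm.

```ruby
require "digest/sha1"
require "json"

# Hypothetical fingerprint: NOT Scrapfly's actual algorithm, only an
# illustration of keying a cache on method, URL, headers and body.
def scrape_fingerprint(method:, url:, headers: {}, body: nil)
  # Lowercase header names and sort them so ordering does not change the key.
  normalized_headers = headers.map { |k, v| [k.to_s.downcase, v.to_s] }.sort
  payload = JSON.generate([method.to_s.upcase, url, normalized_headers, body])
  Digest::SHA1.hexdigest(payload)
end

a = scrape_fingerprint(method: "get", url: "https://web-scraping.dev/product/1")
b = scrape_fingerprint(method: "GET", url: "https://web-scraping.dev/product/1")
c = scrape_fingerprint(method: "POST", url: "https://web-scraping.dev/product/1", body: "q=1")

puts a == b # true: identical configurations map to the same key
puts a == c # false: a different method or body maps to a different key
```

Under this sketch, two requests share a cache entry only when every fingerprinted component matches, which is why changing a header or the request body would lead to a MISS.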
The cache expiration time can be configured in seconds with the cache_ttl (Cache Time To Live) parameter. The cache can also be cleared explicitly using the cache_clear parameter.
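As a rough sketch of how these parameters might be combined on a request, the Ruby helper below builds the query string without sending anything. The helper name `cached_scrape_url` is hypothetical, and passing `cache_clear=true` in the same boolean form as `cache=true` is an assumption.

```ruby
require "uri"

# Hypothetical helper: builds a Scrape API URL with caching options.
# It only constructs the URL and does not send the request.
def cached_scrape_url(api_key:, target:, ttl: nil, clear: false)
  params = { "key" => api_key, "url" => target, "cache" => "true" }
  params["cache_ttl"] = ttl.to_s if ttl     # expiration time in seconds
  params["cache_clear"] = "true" if clear   # assumed boolean form, like cache=true
  uri = URI("https://api.scrapfly.io/scrape")
  uri.query = URI.encode_www_form(params)   # percent-encodes the target URL once
  uri
end

puts cached_scrape_url(
  api_key: "__API_KEY__",
  target: "https://web-scraping.dev/product/1",
  ttl: 3600
)
```

Building the query with `URI.encode_www_form` keeps the target URL encoded exactly once, which avoids accidentally double-encoding it.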
Sharing Policy
The cache is isolated by project and environment. In other words, the cache from project A is not available in project B.
TTL & Eviction Policy
By default, cache_ttl is set to one day (86400 seconds), so cached content expires after 24 hours. This can be extended up to 1 week (604800 seconds) using the cache_ttl setting, which is specified in seconds.
The cache can be force-cleared using the cache_clear parameter or from each of your log pages in the web interface.
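The TTL arithmetic can be sketched in a few lines of Ruby. The constant names and the `cache_expiry` helper are hypothetical; the sketch only shows how a TTL in seconds translates into an expiry time and is not a description of Scrapfly's internals.

```ruby
DEFAULT_TTL = 24 * 60 * 60      # 86400 seconds (one day, the default)
MAX_TTL     = 7 * 24 * 60 * 60  # 604800 seconds (one week, the maximum)

# Hypothetical helper: returns when an entry cached at `created_at`
# would expire for a given TTL in seconds.
def cache_expiry(created_at, ttl = DEFAULT_TTL)
  created_at + ttl # Time + Integer adds seconds
end

created = Time.utc(2023, 7, 13, 19, 10, 24)
puts cache_expiry(created)          # 2023-07-14 19:10:24 UTC
puts cache_expiry(created, MAX_TTL) # 2023-07-20 19:10:24 UTC
```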
Specification
Scrapfly's cache system is not related to HTTP cache or any existing caching mechanism. All cache policies on Scrapfly are adapted for web scraping first.
This allows unique caching features like the ability to cache non-safe HTTP methods such as POST, PUT, and PATCH.
Note that cache use can slightly reduce web scraping speed on a cache MISS, as Scrapfly has to store the result on its cache servers.
The cache stores the original response body. This means that when using cache together with content transformation features such as format=markdown, you can replay the cached content with a different format like format=json.
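A minimal sketch of such a replay, assuming `format` is passed as a query parameter next to `cache=true` (the helper name `formatted_scrape_url` is hypothetical, and no request is sent):

```ruby
require "uri"

# Hypothetical helper: returns a Scrape API URL with cache enabled
# and the given output format.
def formatted_scrape_url(format)
  query = URI.encode_www_form({
    "key" => "__API_KEY__",
    "url" => "https://web-scraping.dev/product/1",
    "cache" => "true",
    "format" => format
  })
  URI("https://api.scrapfly.io/scrape?#{query}")
end

markdown_url = formatted_scrape_url("markdown") # a first call would populate the cache
json_url     = formatted_scrape_url("json")     # a later call could reuse the cached body
puts markdown_url, json_url
```

Because the original response body is what gets stored, the second request would only need to re-run the transformation, not the scrape itself.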
Limitation
- The cache feature cannot be used together with the Session feature
- Maximum TTL allowed: 604800 seconds, equivalent to 7 days
When cache is used with Javascript Rendering, HIT results will not include screenshots or custom javascript features, and the browser is not invoked or billed for.
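These limitations could be checked on the client before a request is sent. The Ruby sketch below is one way to do that; the helper `cache_option_errors` is hypothetical, and treating `session` as the parameter name for the Session feature is an assumption.

```ruby
MAX_CACHE_TTL = 604_800 # seconds, equivalent to 7 days

# Hypothetical client-side check: returns a list of problems found
# in a hash of scrape parameters (empty when nothing is wrong).
def cache_option_errors(params)
  errors = []
  return errors unless params["cache"] # nothing to check when cache is off
  # "session" is an assumed parameter name for the Session feature.
  errors << "cache cannot be combined with session" if params["session"]
  ttl = params["cache_ttl"]
  errors << "cache_ttl exceeds #{MAX_CACHE_TTL} seconds" if ttl && ttl > MAX_CACHE_TTL
  errors
end

p cache_option_errors({ "cache" => true, "cache_ttl" => 3600 })
p cache_option_errors({ "cache" => true, "session" => "my-session", "cache_ttl" => 700_000 })
```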
Usage
require "uri"
require "net/http"

# Scrape API endpoint with the cache feature enabled (cache=true).
url = URI("https://api.scrapfly.io/scrape?cache=true&key=__API_KEY__&url=https%3A%2F%2Fweb-scraping.dev%2Fproduct%2F1")

https = Net::HTTP.new(url.host, url.port)
https.use_ssl = true

request = Net::HTTP::Get.new(url)
response = https.request(request)
puts response.read_body
https://api.scrapfly.io/scrape?cache=true&key=__API_KEY__&url=https%3A%2F%2Fweb-scraping.dev%2Fproduct%2F1
Response example
...
"context": {
  ...
  "cache": {
    "state": "HIT",
    "entry": {
      "user_uuid": "e2f896cc-224c-484f-b633-0f5263623f79",
      "created_at": "2023-07-13 19:10:24 UTC",
      "last_used_at": "2023-07-14 00:08:02 UTC",
      "fingerprint": "315eed1c0703aa6b1761b7753e2ee5c795f97cda",
      "size": 733,
      "env": "LIVE",
      "ttl": 50000,
      "expires_at": "2020-07-14 19:10:24 UTC",
      "response_headers": {
        "date": "Mon, 13 Jul 2020 19:10:24 GMT",
        "content-type": "application/json",
        "content-length": "733",
        "server": "gunicorn/19.9.0",
        "access-control-allow-origin": "*",
        "access-control-allow-credentials": "true",
        "x-cache": "MISS from springgreen-alastor-atlas-blue",
        "x-cache-lookup": "MISS from springgreen-alastor-atlas-blue:8888",
        "connection": "keep-alive"
      },
      "response_status": 200,
      "url": "https://web-scraping.dev/product/1"
    }
  },
  ...
},
...
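To check whether a scrape was served from cache, the state can be read from the response's context. The Ruby sketch below parses a trimmed sample shaped like the response example above; the helper name `cache_state` is hypothetical, and fields other than those shown in the example are not assumed.

```ruby
require "json"

# Trimmed sample shaped like the response example above.
sample = <<~BODY
  {"context": {"cache": {"state": "HIT", "entry": {"ttl": 50000, "fingerprint": "315eed1c0703aa6b1761b7753e2ee5c795f97cda"}}}}
BODY

# Hypothetical helper: returns the cache state, or nil when the
# response has no cache context.
def cache_state(response_body)
  JSON.parse(response_body).dig("context", "cache", "state")
end

puts cache_state(sample)       # HIT
puts cache_state("{}").inspect # nil
```

Using `dig` keeps the lookup safe when the cache context is absent, for example when the request was made without cache enabled.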
Pricing
Cache usage carries no additional fee, and cache size is not metered. Fair use applies.