Handling large payloads: text responses larger than 5MB are called "CLOB" (Character Large Object),
binary ones are called "BLOB" (Binary Large Object), and both can be downloaded separately with streaming support.
Quality of Life
All scrape requests and metadata are automatically tracked on a
Web Dashboard
Billing is reported in every scrape response and in the monitoring dashboard,
and can be controlled through Scrapfly budget settings. For more, see
Web Scraper Billing.
Handle Large Object
Large objects (CLOB for text, BLOB for binary) are offloaded from the API response to prevent CPU/RAM issues with your JSON/MSGPACK decoder and to increase the efficiency of your scrapers.
Instead of the actual content in response.result.content, you get a URL to download the large object. The URL is valid until the log expires.
response.result.format indicates whether the response is a large object: its value is blob or clob in that case
response.result.content contains the URL to download the content. This URL must be authenticated with your API key (the API key that belongs to the project/env)
Unlike the regular binary format, BLOB content is not base64 encoded: you retrieve the binary data directly, and the Content-Type header announces the actual type
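For illustration, here is a minimal Python sketch of that flow using the requests library; it assumes the download URL accepts the same key query parameter used by the scrape API, which you should verify for your project:

```python
import requests

API_KEY = "YOUR_API_KEY"  # must belong to the same project/env as the scrape


def fetch_content(result: dict) -> bytes:
    """Return the scraped content, downloading it separately when it is a CLOB/BLOB."""
    if result["format"] not in ("clob", "blob"):
        # Regular response: the content is inlined in result["content"].
        return result["content"].encode()

    # Large object: result["content"] holds a download URL instead of the data.
    # Assumption: the URL is authenticated with the same `key` query parameter.
    with requests.get(result["content"], params={"key": API_KEY}, stream=True) as dl:
        dl.raise_for_status()
        # BLOBs are not base64 encoded; Content-Type announces the actual type.
        print("Content-Type:", dl.headers.get("Content-Type"))
        return b"".join(dl.iter_content(chunk_size=64 * 1024))
```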
Errors
Scrapfly uses conventional HTTP response codes to indicate the success or failure of an API request.
Codes in the 2xx range indicate success.
Codes in the 4xx range indicate an error caused by the information provided (e.g., a required parameter was omitted, the action was not permitted, max concurrency was reached, etc.).
Codes in the 5xx range indicate an error with Scrapfly's servers.
HTTP 422 - Request Failed
HTTP 422 responses provide extra headers to help as much as possible:
X-Scrapfly-Reject-Code:
Error Code
X-Scrapfly-Reject-Description:
URL to the related documentation
X-Scrapfly-Reject-Retryable:
Indicates whether the scrape is retryable
It is important to properly handle HTTP client errors in order to access the error headers and body.
These details contain valuable information for troubleshooting, resolving the issue, or contacting
support.
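A minimal Python sketch of such handling, assuming the https://api.scrapfly.io/scrape endpoint with key query-parameter authentication and the requests library (the value format of the retryable header is an assumption):

```python
import requests

API_KEY = "YOUR_API_KEY"

response = requests.get(
    "https://api.scrapfly.io/scrape",
    params={"key": API_KEY, "url": "https://example.com"},
)

if response.status_code == 422:
    # Keep the headers and body: they carry the troubleshooting details.
    reject_code = response.headers.get("X-Scrapfly-Reject-Code")
    doc_url = response.headers.get("X-Scrapfly-Reject-Description")
    # The exact value format of the retryable flag is an assumption.
    retryable = response.headers.get("X-Scrapfly-Reject-Retryable", "").lower() in ("yes", "true", "1")
    print(f"Rejected: {reject_code} (see {doc_url}), retryable={retryable}")
    print(response.text)  # useful when contacting support
elif response.ok:
    data = response.json()
    print(data["result"]["content"][:200])
```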
HTTP Status Code Summary
200 - OK
Everything worked as expected.
400 - Bad Request
The request was unacceptable, often due to a missing required parameter, a bad value, or a bad format.
401 - Unauthorized
No valid API key provided.
402 - Payment Required
A payment issue occurred and needs to be resolved.
403 - Forbidden
The API key doesn't have permissions to perform the request.
422 - Request Failed
The parameters were valid but the request failed.
429 - Too Many Requests
All free quota used, max allowed concurrency reached, or the domain is throttled.
By default, the API has a read timeout of 155 seconds.
To avoid read timeout errors, you must configure your HTTP client to set the read timeout to 155 seconds.
If you need a different timeout value, please refer to the documentation for information on how to control the timeout.
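For example, with the Python requests library the read timeout can be set explicitly; the connect timeout value below is illustrative:

```python
import requests

API_KEY = "YOUR_API_KEY"

response = requests.get(
    "https://api.scrapfly.io/scrape",
    params={"key": API_KEY, "url": "https://example.com"},
    # (connect timeout, read timeout) in seconds: keep the read timeout at 155 s
    # so slow scrapes are not cut off on the client side before the API answers.
    timeout=(10, 155),
)
```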
Try out the API directly in your terminal using
curl:
The default response format is JSON, and the scraped content is available in
result.content. Your scrape configuration is present in
config, and information about other activated features is available in
context.
To get the HTML page directly, refer to the
proxified_response
parameter.
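For illustration, a short Python sketch reading those keys; the https://api.scrapfly.io/scrape endpoint and the key query parameter are assumptions about the default setup:

```python
import requests

API_KEY = "YOUR_API_KEY"

response = requests.get(
    "https://api.scrapfly.io/scrape",
    params={"key": API_KEY, "url": "https://example.com"},
)
data = response.json()

content = data["result"]["content"]  # the scraped content (HTML, JSON, ...)
config = data["config"]              # the scrape configuration you sent
context = data["context"]            # information about activated features
```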
Proxy country location - if not set, a random available location is chosen.
Countries must be referenced by their ISO 3166 alpha-2 code (2 letters).
The available countries are defined by the proxy pool you use.
Select page language. By default it uses the language of the selected proxy location.
Behind the scenes, it configures the Accept-Language HTTP header.
If the website supports the language, the content will be in that language.
Note: you cannot set the Accept-Language header manually.
Timeout is expressed in milliseconds. It represents the maximum time allowed for
Scrapfly to perform the scrape. Since the timeout is not trivial to
understand, see our extended documentation on timeouts.
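A compact illustration of these three settings as query parameters; country, lang, and timeout are taken from the descriptions above and assumed to be the actual parameter names:

```python
# Illustrative values only.
params = {
    "key": "YOUR_API_KEY",
    "url": "https://example.com",
    "country": "us",    # ISO 3166 alpha-2 proxy location; random if omitted
    "lang": "en",       # sets the Accept-Language header behind the scenes
    "timeout": 30000,   # maximum time allowed for the scrape, in milliseconds
}
```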
By default the format is the raw scraped content,
e.g. if you scrape a JSON document you get the JSON content,
and if you scrape an HTML page you get the HTML page; the content is left untouched.
Clean HTML format is the HTML content cleaned from all the scripts, styles, and other non-essential tags.
Relative URLs are converted to absolute URLs, and iframes are reintegrated into the content (listed in the API response under
response.result.iframe)
JSON format is split into 2 keys: metadata and content,
where metadata contains all the embedded metadata from HTML markup
(microdata, json-ld, microformats, etc).
The content is parsed and cleaned, then converted to JSON.
Markdown format supports the following options:
no_links, substitute links by their anchor
no_images, substitute images by their alt text
only_content, ignore all navigation, footers, headers, menus, etc. and return only the content
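A hedged sketch of requesting one of these formats; the format parameter name and the colon/comma option syntax shown here are assumptions to verify against the format documentation:

```python
import requests

params = {
    "key": "YOUR_API_KEY",
    "url": "https://example.com",
    # Assumed syntax: format name, then options after a colon, comma-separated.
    "format": "markdown:no_links,only_content",
}
markdown_text = requests.get("https://api.scrapfly.io/scrape", params=params).json()["result"]["content"]
```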
By default the API responds with a JSON response and
the content is located in the result.content key.
With proxified_response enabled, the content of the page
is returned directly as the body, and the status code and headers are replaced
with the values from the scraped response.
If you use a custom format like json or markdown,
the content type will be altered to match the selected format.
When using data extraction features,
the extracted data and the related content-type are located in result.extracted_data.
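The two response modes therefore need different client handling; a rough Python sketch (the boolean value representation for proxified_response is assumed):

```python
import requests

params = {
    "key": "YOUR_API_KEY",
    "url": "https://example.com",
    "proxified_response": "true",  # boolean representation assumed
}
response = requests.get("https://api.scrapfly.io/scrape", params=params)

# With proxified_response enabled, the body *is* the page and the status code
# and headers mirror the scraped response; there is no JSON envelope to parse.
page = response.text

# Without it, you would read the JSON envelope instead:
#   data = response.json()
#   page = data["result"]["content"]
#   extracted = data["result"].get("extracted_data")  # when extraction is used
```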
Store the API result and take a screenshot if JavaScript rendering is enabled.
A sharable link with the saved response is available.
Enable this when communicating with our support.
Pull remote SSL certificate and collect other TLS information.
Only available for https://.
Note that this parameter is not required to scrape https:// targets - it's for collecting SSL data.
Queue your scrape request and redirect the API response to a provided webhook endpoint.
You can create a webhook endpoint from your
dashboard,
and the parameter takes the name of the webhook. Webhooks are scoped to the given project/env.
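A sketch of queuing a scrape to a webhook; the webhook_name parameter below is a hypothetical name (the docs only state that the parameter takes the webhook's name), and the webhook must exist in the same project/env:

```python
# Hypothetical parameter name: the API expects the name of a webhook created
# in the dashboard for this project/env; the response is delivered there.
params = {
    "key": "YOUR_API_KEY",
    "url": "https://example.com",
    "webhook_name": "my-webhook",
}
```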
ASP dynamically retries and upgrades some parameters (such as proxy_pool, browser)
in order to pass, which dynamically changes the cost of the call. To make it more predictable,
you can define a budget to respect. Make sure to set the minimum required to pass your target,
or the call will be rejected without even trying.
cost_budget=25
cost_budget=55
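For instance, a budget can be attached to an ASP scrape as a query parameter; the asp parameter name below is an assumption, cost_budget is as shown above:

```python
params = {
    "key": "YOUR_API_KEY",
    "url": "https://example.com",
    "asp": "true",        # anti-scraping protection; parameter name assumed
    "cost_budget": 25,    # reject the call upfront if this budget cannot cover it
}
```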
Headless Browser / Javascript Rendering
All related parameters require render_js to be enabled
Stage to wait for when rendering the page. You can choose between
complete, which is the default, or domcontentloaded
if you want a fast render without waiting for the full rendering (faster scrape).
Enable the cache layer. On a cache MISS the scrape is performed; otherwise the cached content is returned.
If the TTL has expired, the cache is refreshed by scraping the target again.
Session name to be reused - Automatically store and restore cookies, fingerprint and
proxy across many scrapes. Must be alphanumeric, max length is 255 characters.
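As a closing sketch combining the cache and session features described above; the parameter names cache, cache_ttl, and session, as well as the TTL unit, are assumptions to check against the respective documentation:

```python
import requests

BASE = "https://api.scrapfly.io/scrape"  # assumed endpoint

common = {
    "key": "YOUR_API_KEY",
    "cache": "true",         # serve from cache on HIT, scrape on MISS
    "cache_ttl": 3600,       # refresh by re-scraping once the TTL expires
    "session": "checkout1",  # alphanumeric, max 255 chars
}

# The first call creates the session; cookies, fingerprint and proxy are stored.
first = requests.get(BASE, params={**common, "url": "https://example.com/login"})

# Later calls with the same session name restore that state automatically.
second = requests.get(BASE, params={**common, "url": "https://example.com/account"})
```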