Billing is reported in every scrape response and in the monitoring dashboard,
and can be controlled through Scrapfly budget settings. For more, see
Web Scraper Billing.
Errors
Scrapfly uses conventional HTTP response codes to indicate the success or failure of an API request.
Codes in the 2xx range indicate success.
Codes in the 4xx range indicate an error caused by the information provided (e.g., a required parameter was omitted or not permitted, max concurrency was reached, etc.).
Codes in the 5xx range indicate an error with Scrapfly's servers.
HTTP 422 - Request Failed
These responses provide extra headers in order to help as much as possible:
X-Scrapfly-Reject-Code: the error code
X-Scrapfly-Reject-Description: URL to the related documentation
X-Scrapfly-Reject-Retryable: indicates whether the scrape is retryable
It is important to properly handle HTTP client errors in order to access the error headers and body.
These details contain valuable information for troubleshooting, resolving the issue, or reaching out to support.
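For example, a failed call can be inspected directly with curl by printing the response headers. This is a minimal sketch: the https://api.scrapfly.io/scrape endpoint and the YOUR_API_KEY placeholder are assumptions, not values shown on this page.

  # -s silences progress output, -i prints response headers along with the body
  curl -siG 'https://api.scrapfly.io/scrape' \
    --data-urlencode 'key=YOUR_API_KEY' \
    --data-urlencode 'url=https://example.com' \
    | grep -i 'x-scrapfly-reject'

On a successful (2xx) response the grep simply matches nothing; on a 422 it surfaces the reject code, documentation URL, and retryable flag described above.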
HTTP Status Code Summary
200 - OK
Everything worked as expected.
400 - Bad Request
The request was unacceptable, often due to a missing required parameter, a bad value, or a bad format.
401 - Unauthorized
No valid API key provided.
402 - Payment Required
A payment issue occurred and needs to be resolved.
403 - Forbidden
The API key doesn't have permissions to perform the request.
422 - Request Failed
The parameters were valid but the request failed.
429 - Too Many Requests
All free quota has been used, max allowed concurrency was reached, or the domain is throttled.
Scrapfly has loads of features and the best way to discover them is through the specification docs below.
Alternatively, see the OpenAPI specification
if you're familiar with OpenAPI structures.
By default, the API has a read timeout of 155 seconds.
To avoid read timeout errors, configure your HTTP client with a read timeout of at least 155 seconds.
If you need a different timeout value, refer to the documentation for information on
how to control the timeout.
Try out the API directly in your terminal using curl:
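A minimal example, assuming the standard https://api.scrapfly.io/scrape endpoint and a placeholder API key (neither is shown on this page):

  # -G appends the url-encoded pairs to the URL as a GET query string
  curl -G 'https://api.scrapfly.io/scrape' \
    --data-urlencode 'key=YOUR_API_KEY' \
    --data-urlencode 'url=https://example.com'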
The default response format is JSON, and the scraped content is available in
result.content. Your scrape configuration is present in
config, and information about other activated features is available in
context.
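For instance, the scraped HTML can be pulled out of the JSON envelope with jq (a sketch assuming jq is installed; the field path comes from the paragraph above):

  curl -sG 'https://api.scrapfly.io/scrape' \
    --data-urlencode 'key=YOUR_API_KEY' \
    --data-urlencode 'url=https://example.com' \
    | jq -r '.result.content'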
To get the HTML page directly, refer to the
proxified_response parameter.
Proxy country location - If not set, a random available location is chosen.
A country must be referenced by its ISO 3166-1 alpha-2 code (2 letters).
The available countries are defined by the proxy pool you use.
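For example, to force United States exit nodes (a sketch; the country parameter follows this section, endpoint and key are placeholders as above):

  curl -G 'https://api.scrapfly.io/scrape' \
    --data-urlencode 'key=YOUR_API_KEY' \
    --data-urlencode 'url=https://example.com' \
    --data-urlencode 'country=us'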
Select the desired language - By default it uses the language of the proxy location. Behind the scenes it configures the
Accept-Language
HTTP header.
If the website supports the language, the content will be served in that language.
You cannot set both the lang parameter and the
Accept-Language header.
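For example, requesting French content (a sketch; the lang value is assumed to follow the usual Accept-Language language codes):

  curl -G 'https://api.scrapfly.io/scrape' \
    --data-urlencode 'key=YOUR_API_KEY' \
    --data-urlencode 'url=https://example.com' \
    --data-urlencode 'lang=fr'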
Timeout is expressed in milliseconds. It represents the maximum time Scrapfly is allowed to spend performing the scrape. Since
timeout is not trivial to understand in relation to other settings,
full documentation is available.
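As an illustration, capping the scrape at 30 seconds, i.e. 30000 milliseconds (parameter name and unit taken from this section):

  curl -G 'https://api.scrapfly.io/scrape' \
    --data-urlencode 'key=YOUR_API_KEY' \
    --data-urlencode 'url=https://example.com' \
    --data-urlencode 'timeout=30000'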
By default the API responds with JSON and the content is located in
result.content. With
proxified_response,
the content of the page is returned directly as the body, and the status code and headers are replaced by those of the
target response.
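For example, saving the raw HTML straight to a file (a sketch assuming proxified_response is a boolean flag, as its usage here suggests):

  curl -sG 'https://api.scrapfly.io/scrape' \
    --data-urlencode 'key=YOUR_API_KEY' \
    --data-urlencode 'url=https://example.com' \
    --data-urlencode 'proxified_response=true' \
    -o page.html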
Store the API result and take a screenshot if JS rendering is enabled. A shareable link to the saved response is available.
It's useful to enable when communicating an issue to our support.
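A sketch, assuming the option behind this section is a boolean debug parameter (the name is not shown on this page):

  curl -G 'https://api.scrapfly.io/scrape' \
    --data-urlencode 'key=YOUR_API_KEY' \
    --data-urlencode 'url=https://example.com' \
    --data-urlencode 'debug=true'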
Pull the remote SSL certificate and return other TLS information. Only available for
https:// targets.
You do not need to enable it to scrape
https://
targets - scraping works by default; this just adds more
information.
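A sketch, assuming the option behind this section is a boolean ssl parameter (the name is not shown on this page):

  curl -G 'https://api.scrapfly.io/scrape' \
    --data-urlencode 'key=YOUR_API_KEY' \
    --data-urlencode 'url=https://example.com' \
    --data-urlencode 'ssl=true'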
Queue your scrape request and receive the API response on your webhook endpoint. You can create a webhook endpoint from your
dashboard; the parameter takes the name of the webhook. Webhooks are scoped to the given project/env.
webhook_name=my-webhook-name
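For example, using the parameter exactly as shown above (endpoint and key placeholders as before); the scrape is queued and the response is later delivered to the named webhook:

  curl -G 'https://api.scrapfly.io/scrape' \
    --data-urlencode 'key=YOUR_API_KEY' \
    --data-urlencode 'url=https://example.com' \
    --data-urlencode 'webhook_name=my-webhook-name'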
Anti Scraping Protection
All related parameters require asp to be enabled.
ASP dynamically retries and upgrades some parameters (such as proxy_pool or browser) to pass, which dynamically changes the cost of the call.
To make the cost more predictable, you can define a budget to respect. Make sure to set the minimum required to pass your target, or the call will be rejected without even being tried.
cost_budget=25
cost_budget=55
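Put together, a budgeted ASP call might look like this (a sketch combining the asp and cost_budget parameters named above):

  curl -G 'https://api.scrapfly.io/scrape' \
    --data-urlencode 'key=YOUR_API_KEY' \
    --data-urlencode 'url=https://example.com' \
    --data-urlencode 'asp=true' \
    --data-urlencode 'cost_budget=25'

If 25 credits is below the minimum required for the target, the call is rejected without being tried, per the note above.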
Headless Browser / Javascript Rendering
All related parameters require render_js to be enabled.
Take screenshots - You can take multiple screenshots in different formats: the full page, or a specific element selected via a CSS selector or XPath.
The key argument is the name of the screenshot and the value is the selector.
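A sketch of the key/value form described above, assuming a bracketed screenshots[name]=selector query syntax; 'main' is an arbitrary screenshot name:

  curl -G 'https://api.scrapfly.io/scrape' \
    --data-urlencode 'key=YOUR_API_KEY' \
    --data-urlencode 'url=https://example.com' \
    --data-urlencode 'render_js=true' \
    --data-urlencode 'screenshots[main]=fullpage'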
Stage to wait for when rendering the page. You can choose between complete, which is the default, or domcontentloaded if you want a fast render
without waiting for the full rendering (faster scrape).
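For instance, trading completeness for speed (a sketch; the rendering_stage parameter name is an assumption, only the complete and domcontentloaded values come from this page):

  curl -G 'https://api.scrapfly.io/scrape' \
    --data-urlencode 'key=YOUR_API_KEY' \
    --data-urlencode 'url=https://example.com' \
    --data-urlencode 'render_js=true' \
    --data-urlencode 'rendering_stage=domcontentloaded'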
Enable the cache layer. If the cache is a MISS, the scrape is performed; otherwise the cached content is returned. If the TTL has expired, the
cache is refreshed by scraping the target again.
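A sketch, assuming cache is a boolean flag and the TTL is set by a companion cache_ttl parameter in seconds (both names are assumptions; only the cache layer and TTL behavior are described on this page):

  curl -G 'https://api.scrapfly.io/scrape' \
    --data-urlencode 'key=YOUR_API_KEY' \
    --data-urlencode 'url=https://example.com' \
    --data-urlencode 'cache=true' \
    --data-urlencode 'cache_ttl=3600'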
Session name to be reused - Automatically stores and restores cookies, fingerprint, and proxy across many scrapes. Must be alphanumeric, max length 255 chars.
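For example, two calls sharing the same session name reuse the same cookies, fingerprint, and proxy (a sketch assuming the parameter is named session, per this section's description):

  curl -G 'https://api.scrapfly.io/scrape' \
    --data-urlencode 'key=YOUR_API_KEY' \
    --data-urlencode 'url=https://example.com/page1' \
    --data-urlencode 'session=mysession1'

  curl -G 'https://api.scrapfly.io/scrape' \
    --data-urlencode 'key=YOUR_API_KEY' \
    --data-urlencode 'url=https://example.com/page2' \
    --data-urlencode 'session=mysession1'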