API Specification
Getting Started
Discover how to use the API, available parameters/features, error handling and other information related API use.
# On steroid's
- By default, the API responds in JSON. Though a more efficient msgpack format is also available. These can be set using accept: application/json and accept: application/msgpack headers respectively.
- Gzip compression is available when header content-encoding: gzip is set
- Text content is returned as utf-8 while binary is encoded in base64
- By default everything is made to correctly scrape without being blocked . Pre-configured user-agent and any other default headers
# Quality of life
- Success rate and monitoring automatically tracked in your dashboard. Discover Monitoring
- Multi project/scraper management out of the box - simply use the API key from the project. Discover Project
- Replay scrape from dashboard log page.
- Experiment with our Visual API playground.
- Status page with notification subscription.
-
Our API responses include useful meta headers:
- X-Scrapfly-Api-Cost API Cost billed
- X-Scrapfly-Remaining-Api-Credit Remaining Api Credit, if 0, billed in extra credit
- X-Scrapfly-Account-Concurrent-Usage You current concurrency usage of your account
- X-Scrapfly-Account-Remaining-Concurrent-Usage Maximum concurrency allowed by the account
- X-Scrapfly-Project-Concurrent-Usage Concurrency usage of the project
- X-Scrapfly-Project-Remaining-Concurrent-Usage If concurrency limit is set on the project otherwise equal to the account concurrency
# Billing
Each API request returns billing information about the used API credits. The X-Scrapfly-Api-Cost
header contains the total amount of API credits used for this request.
The complete API use breakdown is available in the context.cost
field of the JSON response.
Note that binary and text responses are billed differently. The result.format
field indicates the response type where html, json, xml, txt and similar
are billed as TEXT
and image, archive, pdf and similar are billed as BINARY
.
Scenario | API Credits Cost |
---|---|
Datacenter Proxies | 1 |
Datacenter Proxies + Browser | 1 + 5 = 6 |
Residential Proxies | 25 |
Residential Proxies + Browser | 25 + 5 = 30 |
Some specific domain have extra fees, if any fee applied, it's displayed in the log in the cost details tab |
Data responses (.json
, .csv
, .xml
, .txt
etc.) that exceed 1Mb are considered high bandwidth
and bandwidth use after the initial 1Mb is billed as BINARY
bandwidth.
Request body (POST
, PUT
, PATCH
) exceeding 100Kb
are billed as BINARY
bandwidth.
For more on billing, each scrape request features a billing section on the monitoring dashboard with a detailed breakdown of the API credits used.
Downloads are billed in slices of 100kb
. The billed size is available in the cost details of the response context.cost
field
Network Type | API Credits Cost |
---|---|
Datacenter Proxies | 3 per 100kb |
Residential Proxies | 10 per 100kb |
Manage Spending (Limit, Budget, Predicatable spend)
We offer a variety of tools to help you manage your spending and stay within your budget. Here are some of the ways you can do that:
-
Project
can be used to define a global limit
Each scrapfly project can be restricted with a specific credit budget and concurrency limits.
-
Throttlers
can be used to define limits per scraped website and timeframe
Using Throttler's Spending Limit feature, each scrape target can be restricted with a specific credit budget for a given period. For example, you can set a budget of 10_000 credits per day for website A and 100_000 credits per month for website B.
-
API
calls can be defined with a per call cost budget
You can use the
cost_budget
parameter to set a maximum budget for your web scraping requests.- It's important to set the correct minimum budget for your target to ensure that you can pass through any blocks and pay for any blocked results
- Budget only apply for deterministic configuration, cost related to bandwidth usage could not be known before.
- Regardless of the status code, if the scrape is interrupted because the cost budget has been reached and a scrape attempt has been made, the call is billed based on the scrape attempt settings.
By using these features, you can better manage your spending and ensure that you stay within your budget when using our web scraping API.
Scrape Failed Protection and Fairness Policy
Scrapfly's Scrape Failed Protection and Fairness Policy is in place to ensure that failed scrapes are not billed to our customers. To prevent any abuse of our system, we also have a fairness policy in place.
Under this policy, if more than 30% of the failed traffic with eligible status codes (status codes greater than or equal to 400 and not excluded, see below) are detected within a minimum one-hour period, the fairness policy will be disabled and usage will be billed. Additionally, if an account deliberately scrapes a protected website without success and without using our Anti Scraping Protection (ASP), the account may be suspended at the discretion of our account managers.
The following status codes are eligible for our Scrape Failed Protection and Fairness Policy: status codes greater than or equal to 400 and not excluded (see below).
Excluded status codes include: 400, 401, 404, 405, 406, 407, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 422, 424, 426, 428, and 456.# Errors
Scrapfly uses conventional HTTP response codes to indicate the success or failure of an API request.
Codes in the 2xx range indicate success.
Codes in the 4xx range indicate an error that failed given the information provided (e.g., a required parameter was omitted, not permitted, max concurrency reached, etc.).
Codes in the 5xx range indicate an error with Scrapfly's servers.
HTTP 422 - Request Failed provide extra headers in order to help as much as possible:
- X-Scrapfly-Reject-Code: Error Code
- X-Scrapfly-Reject-Description: URL to the related documentation
- X-Scrapfly-Reject-Retryable: Indicate if the scrape is retryable
It is important to properly handle HTTP client errors in order to access the error headers and body. These details contain valuable information for troubleshooting, resolving the issue or reaching the support.
HTTP Status Code Summary
200 - OK | Everything worked as expected. |
---|---|
400 - Bad Request | The request was unacceptable, often due to missing a required parameter or a bad value or a bad format. |
401 - Unauthorized | No valid API key provided. |
402 - Payment Required | A payment issue occur and need to be resolved |
403 - Forbidden | The API key doesn't have permissions to perform the request. |
422 - Request Failed | The parameters were valid but the request failed. |
429 - Too Many Requests | All free quota used or max allowed concurrency or domain throttled |
500, 502, 503 - Server Errors | Something went wrong on Scrapfly's end. |
504 - Timeout | The scrape have timeout |
You can check out the full error list to learn more. |
Specification
Discover and learn the full potential of our API to scrape the desired targets.
If you have any questions you can check out the
Frequently asked question section
and ultimately ask on our chat.
By default, the API has a read timeout of 155 seconds. To avoid read timeout errors, you must configure your HTTP client to set the read timeout to 155 seconds. If you need a different timeout value, please refer to the documentation for information on how to control the timeout.
Test the API straight into your terminal
curl -X GET https://api.scrapfly.io/scrape?url=https://httpbin.dev/anything?q=I%20want%20to%20Scrape%20this&country=us&render_js=true
curl -X POST https://api.scrapfly.io/scrape?url=https://httpbin.dev/anything?q=I%20want%20to%20Scrape%20this&country=us&render_js=true -H content-type: text/json --data-raw "{\"test\": \"example\"}"
curl -X PUT https://api.scrapfly.io/scrape?url=https://httpbin.dev/anything?q=I%20want%20to%20Scrape%20this&country=us&render_js=true -H content-type: text/json --data-raw "{\"test\": \"example\"}"
curl -X PATCH https://api.scrapfly.io/scrape?url=https://httpbin.dev/anything?q=I%20want%20to%20Scrape%20this&country=us&render_js=true -H content-type: text/json --data-raw "{\"test\": \"example\"}"
curl -I https://api.scrapfly.io/scrape?url=https://httpbin.dev/anything?q=I%20want%20to%20Scrape%20this&country=us&render_js=true
Want to try out the API without coding?
Check out our visual API player and test/generate code to use our API.
Checkout API Player
The default response format is JSON, and the scraped content is available in response.result.content
.
Your scrape configuration is present in response.config
, and other activated feature information is available in response.context
.
To get the HTML page directly, refer to the
proxified_response
parameter.
url=https://httpbin.dev/anything?q=I%20want%20to%20Scrape%20this
proxy_pool=public_datacenter_pool
proxy_pool=public_residential_pool
headers[content-type]=application%2Fjson
headers[Cookie]=test%3D1%3Bauth%3D1
country=us
-
country=us,ca,mx
-
country=us:1,ca:5,mx:3,-gb
-
country=-gb
Accept-Language
HTTP header.
If the website support the language, the content will be in that lang. You can't set lang parameter and Accept-Language
header
lang=en
lang=ch-FR,fr-FR,en
lang=en-IN,en-US
User-Agent
header
os=win11
os=mac
os=linux
os=chromeos
timeout
is
not trivial to understand regarding other settings -
a full documentation is available
timeout=30000
timeout=120000
retry=true
retry=false
result.content
. With proxified_response the content of the page is directly
returned as body and status code / headers are replaced by the target response.
proxified_response=true
proxified_response=false
debug=true
debug=false
correlation_id=e3ba784cde0d
tags[]=jewelery
tags[]=price
https://
.
You do not need to enable it for scraping https://
target - it works by default, it just add more
information.
ssl=true
ssl=false
webhook=my-webhook-name
asp=true
asp=false
cost_budget=25
cost_budget=55
render_js=true
render_js=false
rendering_wait=5000
wait_for_selector=body
wait_for_selector=input[type="submit"]
wait_for_selector=//button[contains(text(),"Go")]
js=Y29uc29sZS5sb2coJ3Rlc3QnKQ
-
screenshots[page]=fullpage
screenshots[price]=#price
-
js_scenario=eydjbGljayc6IHsnc2VsZWN0b3InOiAnI3N1Ym1pdCd9fQ
latitude,longitude
-
geolocation=48.856614,2.3522219
-
geolocation=-74.005941,40.712784
-
auto_scroll=true
-
auto_scroll=false
cache=true
cache=false
cache_ttl=60
cache_ttl=3600
cache_clear=true
cache_clear=false
session=17013313
session_sticky_proxy=true
session_sticky_proxy=false