API Specification
Getting Started
Discover how to use the API, its available parameters and features, error handling, and other information related to its usage.
# On steroids
- By default the API responds in JSON; msgpack is also available. Both accept: application/json and accept: application/msgpack are supported (see the example below)
- Gzip compression is available when the accept-encoding: gzip header is set
- Text content is converted to utf-8, binary content is converted to base64
- By default everything is configured to scrape correctly without being blocked: a pre-configured user-agent and other default headers
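For instance, a minimal sketch of requesting the msgpack format with gzip transport compression (assuming the API key is passed via a key query parameter; __API_KEY__ and the target URL are placeholders):

```bash
# Ask for msgpack output; --compressed sends accept-encoding: gzip
# and transparently decompresses the response.
curl -s "https://api.scrapfly.io/scrape?key=__API_KEY__&url=https%3A%2F%2Fhttpbin.dev%2Fanything" \
  -H "accept: application/msgpack" \
  --compressed \
  -o response.msgpack
```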
# Quality of life
- Success rate and monitoring are automatically tracked in your dashboard. Discover Monitoring
- Multi-project/scraper management out of the box - simply use the API key of the project. Discover Project
- Replay scrape from log
- Experiment with our visual API playground
- Status page with notification subscription
Our API responses include the following useful headers (an example of inspecting them follows the list):
- X-Scrapfly-Api-Cost: API credits billed
- X-Scrapfly-Remaining-Api-Credit: remaining API credits; if 0, usage is billed as extra credit
- X-Scrapfly-Account-Concurrent-Usage: your account's current concurrency usage
- X-Scrapfly-Account-Remaining-Concurrent-Usage: remaining concurrency allowed by the account
- X-Scrapfly-Project-Concurrent-Usage: concurrency usage of the project
- X-Scrapfly-Project-Remaining-Concurrent-Usage: remaining project concurrency if a concurrency limit is set on the project; otherwise equal to the account's remaining concurrency
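A quick way to inspect those accounting headers (a sketch; the key value is a placeholder):

```bash
# Print only the response headers, then keep the X-Scrapfly-* ones.
curl -s -D - -o /dev/null \
  "https://api.scrapfly.io/scrape?key=__API_KEY__&url=https%3A%2F%2Fhttpbin.dev%2Fanything" \
  | grep -i '^x-scrapfly-'
```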
# Billing
If you want the total API credits billed for a request, check the X-Scrapfly-Api-Cost header. If you want the details, the information is available in our JSON response under context.cost, where you can find the breakdown and the total. You can check the format of the response in result.format, which can be TEXT (html, json, xml, txt, etc.) or BINARY (image, archive, pdf, etc.).
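For example, a sketch of pulling the cost breakdown out of the response (requires jq; the key value is a placeholder, and the exact shape of context.cost is not reproduced here):

```bash
# Print the detailed billing breakdown reported by the API.
curl -s "https://api.scrapfly.io/scrape?key=__API_KEY__&url=https%3A%2F%2Fhttpbin.dev%2Fanything" \
  | jq '.context.cost'
```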
| Scenario | API Credits Cost |
|---|---|
| Datacenter Proxies | 1 |
| Datacenter Proxies + Browser | 1 + 5 = 6 |
| Residential Proxies | 25 |
| Residential Proxies + Browser | 25 + 5 = 30 |
Downloads of data files (.json, .csv, .xml and other kinds) exceeding 1MB are billed for bandwidth: the first MB is included, and all bandwidth above 1MB is billed following the binary format grid below. Downloads are billed per slice of 100KB. The billed size is available in the cost details of the response, under context.cost.
| Network Type | API Credits Cost |
|---|---|
| Datacenter Proxies | 3 per 100KB |
| Residential Proxies | 10 per 100KB |
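As a worked example under this grid: downloading a 1.6MB .csv through datacenter proxies leaves 0.6MB above the included first MB, billed as 6 slices of 100KB, i.e. 6 × 3 = 18 API credits on top of the base scrape cost.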
# Manage Spending (Limit, Budget)
- Project: allow or disallow extra quota, set an extra-usage spending limit and a concurrency limit
- Throttler: define per-target speed limits (request rate, concurrency) and an API credit budget per period (hour, day, month)
- API: use the cost_budget parameter to define the maximum budget the ASP should respect (see the sketch below). If a scrape is performed and the budget interrupts further configuration mutation, the scrape already performed is billed regardless of the status code. Make sure to define the correct minimum budget for your target: if the budget is too low, you will never be able to pass the protection, yet you will pay for blocked results.
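A sketch of setting a budget (the key and target values are placeholders):

```bash
# Cap the credits ASP may spend on this scrape at 30; a scrape that is
# interrupted by the budget is still billed.
curl -s "https://api.scrapfly.io/scrape?key=__API_KEY__&url=https%3A%2F%2Fexample.com&asp=true&cost_budget=30"
```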
# Scrape Failed Protection and Fairness Policy
It prevents failed scrapes from being billed. To prevent abuse, a fairness policy also applies: if more than 30% of your traffic fails with an eligible status code (see below) over a period of at least 1 hour, the fairness policy is disabled and the usage is billed. If an account deliberately scrapes a protected website without success and without ASP, the account can be suspended by account manager decision.
- Status codes >= 400 that are not excluded (see below) are eligible
- Excluded status codes: 400, 401, 404, 405, 406, 407, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 422, 424, 426, 428, 456
# Errors
Scrapfly uses conventional HTTP response codes to indicate the success or failure of an API request.
Codes in the 2xx range indicate success.
Codes in the 4xx range indicate an error given the information provided (e.g., a required parameter was omitted, the operation was not permitted, max concurrency was reached, etc.).
Codes in the 5xx range indicate an error with Scrapfly's servers.
HTTP 422 - Request Failed provides extra headers in order to help as much as possible (an example of reading them follows the list):
- X-Scrapfly-Reject-Code: Error Code
- X-Scrapfly-Reject-Description: URL to the related documentation
- X-Scrapfly-Reject-Retryable: Indicates whether the scrape is retryable
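A sketch of surfacing those reject headers on a failed request (the key and target values are placeholders):

```bash
# Save headers and body separately, then print the reject details.
curl -s -D headers.txt -o body.json \
  "https://api.scrapfly.io/scrape?key=__API_KEY__&url=https%3A%2F%2Fexample.com"
grep -i '^x-scrapfly-reject-' headers.txt
```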
HTTP Status Code Summary
| Code | Description |
|---|---|
| 200 - OK | Everything worked as expected. |
| 400 - Bad Request | The request was unacceptable, often due to a missing required parameter, a bad value, or a bad format. |
| 401 - Unauthorized | No valid API key provided. |
| 402 - Payment Required | A payment issue occurred and needs to be resolved. |
| 403 - Forbidden | The API key doesn't have permissions to perform the request. |
| 422 - Request Failed | The parameters were valid but the request failed. |
| 429 - Too Many Requests | All free quota used, max allowed concurrency reached, or domain throttled. |
| 500, 502, 503 - Server Errors | Something went wrong on Scrapfly's end. |
| 504 - Timeout | The scrape timed out. |

You can check out the full error list to learn more.
# Specification
Discover and learn the full potential of our API to scrape your desired targets. If you have any questions, check out the Frequently Asked Questions section and, ultimately, ask on our chat.
The API read timeout is 155s by default. You must configure your HTTP client to set its read timeout to 155s. If you don't want this value and want to avoid a Read timeout error, check the documentation to control the timeout.
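For example with curl (a sketch; curl has no separate read timeout, so --max-time bounds the whole request):

```bash
# Allow the API up to 155s to answer before the client gives up.
curl --max-time 155 -s \
  "https://api.scrapfly.io/scrape?key=__API_KEY__&url=https%3A%2F%2Fexample.com"
```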
```bash
curl -X GET "https://api.scrapfly.io/scrape?url=http://httpbin.dev/anything?q=I%20want%20to%20Scrape%20this&country=us&render_js=true"
curl -X POST "https://api.scrapfly.io/scrape?url=http://httpbin.dev/anything?q=I%20want%20to%20Scrape%20this&country=us&render_js=true"
curl -X PUT "https://api.scrapfly.io/scrape?url=http://httpbin.dev/anything?q=I%20want%20to%20Scrape%20this&country=us&render_js=true"
curl -X PATCH "https://api.scrapfly.io/scrape?url=http://httpbin.dev/anything?q=I%20want%20to%20Scrape%20this&country=us&render_js=true"
curl -I "https://api.scrapfly.io/scrape?url=http://httpbin.dev/anything?q=I%20want%20to%20Scrape%20this&country=us&render_js=true"
```
You will retrieve a JSON response by default; the scraped content is located in response.result.content. You will find your scrape configuration in response.config, and various other information regarding the activated features in response.context. If you want to retrieve the HTML page directly instead of the JSON response, check the proxified_response parameter.
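A sketch of pulling the scraped page out of the JSON envelope (requires jq; the key value is a placeholder):

```bash
# Extract only the scraped content from the JSON response.
curl -s "https://api.scrapfly.io/scrape?key=__API_KEY__&url=https%3A%2F%2Fhttpbin.dev%2Fhtml" \
  | jq -r '.result.content'
```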
Example parameter values:

- url=https://httpbin.dev/anything?q=I%20want%20to%20Scrape%20this
- proxy_pool=public_datacenter_pool
- proxy_pool=public_residential_pool
- headers[content-type]=application%2Fjson
- headers[Cookie]=test%3D1%3Bauth%3D1
- country=us
- country=us,ca,mx
- country=us:1,ca:5,mx:3,-gb
- country=-gb
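For instance, passing custom headers and a country through the query string (a sketch; the key value is a placeholder, and the brackets are percent-encoded so the shell and curl don't mangle them):

```bash
# headers[content-type] and headers[Cookie] are URL-encoded as %5B...%5D.
curl -s "https://api.scrapfly.io/scrape?key=__API_KEY__&url=https%3A%2F%2Fhttpbin.dev%2Fanything&country=us&headers%5Bcontent-type%5D=application%2Fjson&headers%5BCookie%5D=test%3D1%3Bauth%3D1"
```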
The lang parameter sets the Accept-Language HTTP header. If the website supports the language, the content will be in that language. You can't set both the lang parameter and the Accept-Language header.

- lang=en
- lang=ch-FR,fr-FR,en
- lang=en-IN,en-US
The os parameter selects the operating system advertised by the User-Agent header.

- os=win11
- os=mac
- os=linux
- os=chromeos
The timeout parameter is not trivial to understand in relation to other settings - a full documentation is available.

- timeout=30000
- timeout=120000
- retry=true
- retry=false
By default the scraped content is returned in result.content of the JSON response. With proxified_response, the content of the page is directly returned as the body, and the status code and headers are replaced by those of the target response (see the sketch below).

- proxified_response=true
- proxified_response=false
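A sketch of saving the raw page directly (the key value is a placeholder):

```bash
# The body is the page itself, not the JSON envelope.
curl -s "https://api.scrapfly.io/scrape?key=__API_KEY__&url=https%3A%2F%2Fhttpbin.dev%2Fhtml&proxified_response=true" \
  -o page.html
```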
- debug=true
- debug=false
- correlation_id=e3ba784cde0d
- tags[]=jewelery
- tags[]=price
The ssl parameter applies to https:// targets. You do not need to enable it to scrape an https:// target - that works by default; it just adds more information.

- ssl=true
- ssl=false
- webhook=my-webhook-name
- asp=true
- asp=false
- cost_budget=25
- cost_budget=55
- render_js=true
- render_js=false
- rendering_wait=5000
- wait_for_selector=body
- wait_for_selector=input[type="submit"]
- wait_for_selector=//button[contains(text(),"Go")]
- js=Y29uc29sZS5sb2coJ3Rlc3QnKQ
- screenshots[page]=fullpage
- screenshots[price]=#price
- js_scenario=eydjbGljayc6IHsnc2VsZWN0b3InOiAnI3N1Ym1pdCd9fQ
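For instance, combining the browser-rendering parameters above (a sketch; the key and target values are placeholders, and the bracketed name is percent-encoded):

```bash
# Render JavaScript, wait 5s, wait for <body>, and take a full-page screenshot.
curl -s "https://api.scrapfly.io/scrape?key=__API_KEY__&url=https%3A%2F%2Fexample.com&render_js=true&rendering_wait=5000&wait_for_selector=body&screenshots%5Bpage%5D=fullpage"
```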
The geolocation parameter takes latitude,longitude coordinates.

- geolocation=48.856614,2.3522219
- geolocation=40.712784,-74.005941
- auto_scroll=true
- auto_scroll=false
- cache=true
- cache=false
- cache_ttl=60
- cache_ttl=3600
- cache_clear=true
- cache_clear=false
- session=17013313
- session_sticky_proxy=true
- session_sticky_proxy=false
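A closing sketch tying the session and cache parameters together (assuming cache_ttl is in seconds; the key value is a placeholder):

```bash
# Reuse one session with a sticky proxy, and serve repeats from cache for 1h.
curl -s "https://api.scrapfly.io/scrape?key=__API_KEY__&url=https%3A%2F%2Fhttpbin.dev%2Fanything&session=17013313&session_sticky_proxy=true&cache=true&cache_ttl=3600"
```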