API Specification

Getting Started

Discover how to use the API, its available parameters and features, error handling, and other information related to its usage.

# On steroids

  • By default the API responds in JSON; msgpack is also available. Both accept: application/json and accept: application/msgpack are supported (see the sketch after this list)
  • Gzip compression is available when the header accept-encoding: gzip is set
  • Text content is converted to utf-8, binary content is converted to base64
  • By default, everything is configured to scrape correctly without being blocked: a pre-configured user-agent and other sensible default headers are used
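
A minimal sketch of the content negotiation above, using Python's requests library (YOUR_API_KEY is a placeholder):

```python
import requests

# Ask for msgpack instead of the default JSON; a gzip-compressed
# response is requested via the accept-encoding header.
response = requests.get(
    "https://api.scrapfly.io/scrape",
    params={"key": "YOUR_API_KEY", "url": "https://httpbin.dev/anything"},
    headers={"accept": "application/msgpack", "accept-encoding": "gzip"},
)
print(response.headers.get("content-type"))
```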

# Quality of life

  • Success rate and monitoring are automatically tracked in your dashboard. Discover Monitoring
  • Multi project/scraper management out of the box - simply use the API key from the project. Discover Project
  • Replay scrapes from the log
  • Experiment with our Visual API playground
  • Status page with notification subscription
  • Our API responses include the following useful headers (a reading sketch follows this list):
    • X-Scrapfly-Api-Cost API credits billed
    • X-Scrapfly-Remaining-Api-Credit Remaining API credit; if 0, usage is billed as extra credit
    • X-Scrapfly-Account-Concurrent-Usage Current concurrency usage of your account
    • X-Scrapfly-Account-Remaining-Concurrent-Usage Remaining concurrency available to the account
    • X-Scrapfly-Project-Concurrent-Usage Concurrency usage of the project
    • X-Scrapfly-Project-Remaining-Concurrent-Usage Remaining project concurrency if a concurrency limit is set on the project; otherwise equal to the account concurrency
    Concurrency is defined by your subscription
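
For illustration, a sketch of reading these headers with Python's requests library (YOUR_API_KEY is a placeholder):

```python
import requests

response = requests.get(
    "https://api.scrapfly.io/scrape",
    params={"key": "YOUR_API_KEY", "url": "https://httpbin.dev/anything"},
)

# Billing and concurrency telemetry returned with every API response.
print(response.headers.get("X-Scrapfly-Api-Cost"))              # credits billed for this call
print(response.headers.get("X-Scrapfly-Remaining-Api-Credit"))  # remaining credit; 0 means extra credit billing
print(response.headers.get("X-Scrapfly-Account-Concurrent-Usage"))
print(response.headers.get("X-Scrapfly-Account-Remaining-Concurrent-Usage"))
print(response.headers.get("X-Scrapfly-Project-Concurrent-Usage"))
print(response.headers.get("X-Scrapfly-Project-Remaining-Concurrent-Usage"))
```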

# Billing

If you just want the total API credits billed, check the X-Scrapfly-Api-Cost header. For the details, the information is in our JSON response under context.cost, where you can find both the breakdown and the total. You can check the format of the response in result.format, which can be TEXT (html, json, xml, txt, etc.) or BINARY (image, archive, pdf, etc.).
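
As a sketch of reading this from the JSON response (the exact key names under context.cost, such as total and details, are illustrative assumptions):

```python
import requests

response = requests.get(
    "https://api.scrapfly.io/scrape",
    params={"key": "YOUR_API_KEY", "url": "https://httpbin.dev/anything"},
)
data = response.json()

# Total billed, also exposed via the X-Scrapfly-Api-Cost header.
# Key names under context.cost are assumptions for illustration.
print(data["context"]["cost"]["total"])
print(data["context"]["cost"]["details"])

# TEXT (html, json, xml, txt, ...) or BINARY (image, archive, pdf, ...).
print(data["result"]["format"])
```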

| Scenario                      | API Credits Cost |
|-------------------------------|------------------|
| Datacenter Proxies            | 1                |
| Datacenter Proxies + Browser  | 1 + 5 = 6        |
| Residential Proxies           | 25               |
| Residential Proxies + Browser | 25 + 5 = 30      |

Some specific domains have extra fees; if any fee applies, it is displayed in the log's cost details tab.

For downloads of data files (.json, .csv, .xml, .txt, and other kinds) exceeding 1MB, the first MB is included; all bandwidth above 1MB is billed following the binary format grid.

Request bodies (POST, PUT, PATCH) exceeding 100KB are billed following the binary format grid.

For transparency about API credit usage, the cost detail is displayed in each log's cost tab and is also available through the API response, where each cost is explained line by line.

Downloads are billed per slice of 100KB. The billed size is available in the cost details of the response (context.cost).

| Network Type        | API Credits Cost |
|---------------------|------------------|
| Datacenter Proxies  | 3 per 100KB      |
| Residential Proxies | 10 per 100KB     |
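
As a worked example of the slice-based billing above (a sketch assuming a started slice is billed as a full slice):

```python
import math

def download_cost(size_kb: float, credits_per_slice: int) -> int:
    # Downloads are billed per slice of 100KB; assume a started
    # slice counts as a full slice.
    return math.ceil(size_kb / 100) * credits_per_slice

print(download_cost(450, 3))   # datacenter:  5 slices x 3  = 15 credits
print(download_cost(450, 10))  # residential: 5 slices x 10 = 50 credits
```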

Manage Spending (Limits, Budgets, Predictable Spend)

We offer a variety of tools to help you manage your spending and stay within your budget. Here are some of the ways you can do that:

  1. Project: useful to define limits globally. You can set extra quota limits, extra usage spending limits, and concurrency limits for each of your projects. This allows you to control how much you spend on each project and how many requests you can make at once.
  2. Throttler: useful to define API credit budgets per target domain with a granular time window. Our Throttler feature allows you to define speed limits for each target, such as request rate and concurrency. You can also set a budget for your API usage over a specific period of time, such as an hour, day, or month.
  3. API: useful to define an API credit budget per API call. You can use the cost_budget parameter to set a maximum budget for your web scraping requests (see the sketch after this list). If your web scrape exceeds the budget limit you set, it will be billed regardless of the status code. It's important to set the correct minimum budget for your target so that you can pass through any blocks and pay for any blocked results. The budget only applies to deterministic configuration; costs related to bandwidth usage cannot be known in advance.
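
A minimal sketch of option 3, setting cost_budget on a single call (Python requests; YOUR_API_KEY and the target URL are placeholders):

```python
import requests

response = requests.get(
    "https://api.scrapfly.io/scrape",
    params={
        "key": "YOUR_API_KEY",
        "url": "https://httpbin.dev/anything",
        "asp": "true",      # cost_budget is most useful with ASP's dynamic retries
        "cost_budget": 30,  # too low a budget rejects the call without even trying
    },
)
print(response.headers.get("X-Scrapfly-Api-Cost"))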

By using these features, you can better manage your spending and ensure that you stay within your budget when using our web scraping API.

Scrape Failed Protection and Fairness Policy

Scrapfly's Scrape Failed Protection ensures that failed scrapes are not billed to our customers. To prevent any abuse of this protection, a fairness policy is also in place.

Under this policy, if more than 30% of your traffic fails with eligible status codes (status codes greater than or equal to 400 and not excluded; see below) within a minimum one-hour period, the fairness policy is disabled and usage is billed. Additionally, if an account deliberately scrapes a protected website without success and without using our Anti Scraping Protection (ASP), the account may be suspended at the discretion of our account managers.

The following status codes are eligible for our Scrape Failed Protection and Fairness Policy: status codes greater than or equal to 400 and not excluded (see below).

Excluded status codes include: 400, 401, 404, 405, 406, 407, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 422, 424, 426, 428, and 456.
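
A small sketch encoding the eligibility rule above:

```python
# Status codes excluded from the Scrape Failed Protection policy.
EXCLUDED_STATUS_CODES = {
    400, 401, 404, 405, 406, 407, 409, 410, 411, 412, 413,
    414, 415, 416, 417, 418, 422, 424, 426, 428, 456,
}

def is_eligible(status_code: int) -> bool:
    # Eligible: >= 400 and not in the excluded list.
    return status_code >= 400 and status_code not in EXCLUDED_STATUS_CODES

print(is_eligible(403))  # True: covered by failed-scrape protection
print(is_eligible(404))  # False: excluded
print(is_eligible(200))  # False: not a failure
```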

# Errors

Scrapfly uses conventional HTTP response codes to indicate the success or failure of an API request.

Codes in the 2xx range indicate success.

Codes in the 4xx range indicate a request that failed given the information provided (e.g., a required parameter was omitted, the action is not permitted, max concurrency was reached, etc.).

Codes in the 5xx range indicate an error with Scrapfly's servers.


HTTP 422 - Request Failed provides extra headers to help as much as possible:

  • X-Scrapfly-Reject-Code: Error code
  • X-Scrapfly-Reject-Description: URL to the related documentation
  • X-Scrapfly-Reject-Retryable: Indicates whether the scrape is retryable

It is important to properly handle HTTP client errors in order to access the error headers and body. These details contain valuable information for troubleshooting, resolving the issue, or reaching support.
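
For example, a sketch of handling a 422 with Python's requests library:

```python
import requests

response = requests.get(
    "https://api.scrapfly.io/scrape",
    params={"key": "YOUR_API_KEY", "url": "https://httpbin.dev/anything"},
)

if response.status_code == 422:
    # The reject headers explain the failure and whether a retry makes sense;
    # the response body carries the full error details.
    print(response.headers.get("X-Scrapfly-Reject-Code"))
    print(response.headers.get("X-Scrapfly-Reject-Description"))
    print(response.headers.get("X-Scrapfly-Reject-Retryable"))
    print(response.text)
```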

HTTP Status Code Summary

| Code                          | Summary                                                                                   |
|-------------------------------|-------------------------------------------------------------------------------------------|
| 200 - OK                      | Everything worked as expected.                                                            |
| 400 - Bad Request             | The request was unacceptable, often due to a missing required parameter, a bad value, or a bad format. |
| 401 - Unauthorized            | No valid API key provided.                                                                |
| 402 - Payment Required        | A payment issue occurred and needs to be resolved.                                        |
| 403 - Forbidden               | The API key doesn't have permission to perform the request.                               |
| 422 - Request Failed          | The parameters were valid but the request failed.                                         |
| 429 - Too Many Requests       | All free quota used, max allowed concurrency reached, or domain throttled.                |
| 500, 502, 503 - Server Errors | Something went wrong on Scrapfly's end.                                                   |
| 504 - Timeout                 | The scrape timed out.                                                                     |
You can check out the full error list to learn more.

Specification

Discover and learn the full potential of our API to scrape the desired targets.
If you have any questions, you can check out the Frequently Asked Questions section and ultimately ask on our chat.

By default, the API has a read timeout of 155 seconds. To avoid read timeout errors, configure your HTTP client with a read timeout of at least 155 seconds. If you need a different timeout value, please refer to the documentation on how to control the timeout.
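
For instance, with Python's requests library the read timeout can be matched to the API's 155-second read timeout (the connect timeout shown is an arbitrary choice):

```python
import requests

# timeout=(connect timeout, read timeout): keep the read timeout at
# 155 seconds so long-running scrapes are not cut off client-side.
response = requests.get(
    "https://api.scrapfly.io/scrape",
    params={"key": "YOUR_API_KEY", "url": "https://httpbin.dev/anything", "render_js": "true"},
    timeout=(10, 155),
)
```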

Test the API straight from your terminal

curl -X GET "https://api.scrapfly.io/scrape?key=YOUR_API_KEY&url=https%3A%2F%2Fhttpbin.dev%2Fanything%3Fq%3DI%20want%20to%20Scrape%20this&country=us&render_js=true"
curl -X POST "https://api.scrapfly.io/scrape?key=YOUR_API_KEY&url=https%3A%2F%2Fhttpbin.dev%2Fanything%3Fq%3DI%20want%20to%20Scrape%20this&country=us&render_js=true" -H "content-type: application/json" --data-raw "{\"test\": \"example\"}"
curl -X PUT "https://api.scrapfly.io/scrape?key=YOUR_API_KEY&url=https%3A%2F%2Fhttpbin.dev%2Fanything%3Fq%3DI%20want%20to%20Scrape%20this&country=us&render_js=true" -H "content-type: application/json" --data-raw "{\"test\": \"example\"}"
curl -X PATCH "https://api.scrapfly.io/scrape?key=YOUR_API_KEY&url=https%3A%2F%2Fhttpbin.dev%2Fanything%3Fq%3DI%20want%20to%20Scrape%20this&country=us&render_js=true" -H "content-type: application/json" --data-raw "{\"test\": \"example\"}"

Want to try out the API without coding?

Check out our visual API playground to test the API and generate code.
Check out the API Playground

The default response format is JSON, and the scraped content is available in response.result.content. Your scrape configuration is present in response.config, and other activated feature information is available in response.context. To get the HTML page directly, refer to the proxified_response parameter.
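
As an illustrative sketch of navigating that JSON envelope (Python requests; YOUR_API_KEY is a placeholder):

```python
import requests

response = requests.get(
    "https://api.scrapfly.io/scrape",
    params={"key": "YOUR_API_KEY", "url": "https://httpbin.dev/anything"},
)
data = response.json()

print(data["result"]["content"])  # the scraped content
print(data["config"])             # the scrape configuration you sent
print(data["context"])            # activated features, cost details, etc.
```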

| HTTP Parameter | Description | Example |
|----------------|-------------|---------|
| url (required) | Target URL to scrape. Must be URL-encoded. | url=https://httpbin.dev/anything?q=I%20want%20to%20Scrape%20this |
| key (required) | API key to authenticate the call. | key=16eae084cff64841be193a95fc8fa67dso |
| proxy_pool (popular, default: public_datacenter_pool) | Select the proxy pool. A proxy pool is a network of proxies grouped by quality range and network type. The price varies based on the pool used. | proxy_pool=public_datacenter_pool, proxy_pool=public_residential_pool |
| headers (popular, default: []) | Pass custom headers to the request. To learn about request customization, you can check out the dedicated documentation. Must be URL-encoded. | headers[content-type]=application%2Fjson, headers[Cookie]=test%3D1%3Bauth%3D1 |
| debug (default: false) | Store the API result and take a screenshot if JavaScript rendering is enabled. A shareable link with the saved response is available. It's useful to enable when you have an issue to communicate to our support. | debug=true, debug=false |
| correlation_id (default: null) | Helps correlate a group of scrapes issued by the same worker or machine. You can use it as a filter in our monitoring dashboard. | correlation_id=e3ba784cde0d |
| tags (default: []) | Add tags to your calls to easily group or filter them with many values. | tags[]=jewelery, tags[]=price |
| dns (default: false) | Query and retrieve target DNS information. | dns=true, dns=false |
| ssl (default: false) | Pull the remote SSL certificate and return other TLS information. Only available for https://. You do not need to enable it to scrape https:// targets - that works by default; it just adds more information. | ssl=true, ssl=false |
| webhook (default: null) | Queue your scrape request and receive the API response on your webhook endpoint. You can create a webhook endpoint from your dashboard; the parameter takes the name of the webhook. Webhooks are scoped to the given project/env. | webhook=my-webhook-name |

Anti Scraping Protection - All related parameters require asp enabled

| HTTP Parameter | Description | Example |
|----------------|-------------|---------|
| asp (popular, default: false) | Anti Scraping Protection - unblock protected websites and bypass protection. | asp=true, asp=false |
| cost_budget (BETA, default: null) | ASP dynamically retries and upgrades some parameters (such as proxy_pool and browser) to pass, which dynamically changes the cost of the call. To make it more predictable, you can define a budget to respect. Make sure to set the minimum required to pass your target, or the call will be rejected without even trying. | cost_budget=25, cost_budget=55 |

Headless Browser / Javascript Rendering - All related parameters require render_js enabled

| HTTP Parameter | Description | Example |
|----------------|-------------|---------|
| auto_scroll (BETA, default: false) | Auto scroll to the bottom of the page. It allows triggering JavaScript rendering based on the user viewport intersection. | auto_scroll=true, auto_scroll=false |

Cache - All related parameters require cache enabled

Session - All related parameters require session enabled
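
Putting several parameters from the table together, a sketch of one call (Python requests handles the URL encoding that the url and headers parameters require; YOUR_API_KEY is a placeholder):

```python
import requests

response = requests.get(
    "https://api.scrapfly.io/scrape",
    params={
        "key": "YOUR_API_KEY",
        "url": "https://httpbin.dev/anything?q=I want to Scrape this",
        "render_js": "true",
        "asp": "true",
        "headers[content-type]": "application/json",
        "tags[]": ["jewelery", "price"],  # sent as tags[]=jewelery&tags[]=price
        "correlation_id": "e3ba784cde0d",
    },
)
print(response.status_code, response.headers.get("X-Scrapfly-Api-Cost"))
```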