API Specification

Getting Started

Discover how to use the API: available parameters and features, error handling, and other information related to API use.

# On steroids

  • By default, the API responds in JSON; a more efficient msgpack format is also available. Select the format with the accept: application/json or accept: application/msgpack header respectively.
  • Gzip compression is available when the accept-encoding: gzip header is set.
  • Text content is returned as UTF-8, while binary content is base64-encoded.
  • By default, every request is configured to scrape correctly without being blocked, with a pre-configured user-agent and other sensible default headers.
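
For example, here is a minimal Python sketch (assuming the requests library, a placeholder API key, and an illustrative target) that asks for the msgpack format and a gzip-compressed body:

```python
import requests

response = requests.get(
    "https://api.scrapfly.io/scrape",
    params={
        "key": "YOUR_API_KEY",  # placeholder - use your project API key
        "url": "https://httpbin.dev/anything",
    },
    headers={
        "accept": "application/msgpack",   # msgpack instead of the default JSON
        "accept-encoding": "gzip",         # ask for a compressed response body
    },
)
print(response.headers.get("content-type"))
```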

# Quality of life

  • Success rate and monitoring are automatically tracked in your dashboard. Discover Monitoring.
  • Multi-project/scraper management out of the box - simply use the API key of the relevant project. Discover Project.
  • Replay scrapes from the dashboard log page.
  • Experiment with our Visual API playground.
  • Status page with notification subscription.
  • Our API responses include useful meta headers:
    • X-Scrapfly-Api-Cost: API credits billed for the request
    • X-Scrapfly-Remaining-Api-Credit: remaining API credit; if 0, usage is billed as extra credit
    • X-Scrapfly-Account-Concurrent-Usage: current concurrency usage of your account
    • X-Scrapfly-Account-Remaining-Concurrent-Usage: remaining concurrency allowed for your account
    • X-Scrapfly-Project-Concurrent-Usage: concurrency usage of the project
    • X-Scrapfly-Project-Remaining-Concurrent-Usage: remaining project concurrency if a limit is set on the project; otherwise equal to the account value
    Concurrency limits are defined by your subscription.
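
As an illustration, here is a minimal Python sketch (assuming the requests library and a placeholder API key) that reads these meta headers from a response:

```python
import requests

response = requests.get(
    "https://api.scrapfly.io/scrape",
    params={"key": "YOUR_API_KEY", "url": "https://httpbin.dev/anything"},
)

# Billing and concurrency telemetry returned with every API response.
print("credits billed:", response.headers.get("X-Scrapfly-Api-Cost"))
print("credits left:", response.headers.get("X-Scrapfly-Remaining-Api-Credit"))
print("account concurrency:", response.headers.get("X-Scrapfly-Account-Concurrent-Usage"))
```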

# Billing

Each API request returns billing information about the API credits used. The X-Scrapfly-Api-Cost header contains the total amount of API credits used for this request. The complete API use breakdown is available in the context.cost field of the JSON response.

Note that binary and text responses are billed differently. The result.format field indicates the response type where html, json, xml, txt and similar are billed as TEXT and image, archive, pdf and similar are billed as BINARY.
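
A minimal Python sketch (assuming the requests library and a placeholder key; the exact breakdown structure inside context.cost may differ) of reading these billing fields:

```python
import requests

response = requests.get(
    "https://api.scrapfly.io/scrape",
    params={"key": "YOUR_API_KEY", "url": "https://httpbin.dev/anything"},
)
data = response.json()

print(data["result"]["format"])  # drives TEXT vs BINARY billing
print(data["context"]["cost"])   # full API credit breakdown for this request
```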

| Scenario | API Credits Cost |
| --- | --- |
| Datacenter Proxies | 1 |
| Datacenter Proxies + Browser | 1 + 5 = 6 |
| Residential Proxies | 25 |
| Residential Proxies + Browser | 25 + 5 = 30 |
Some specific domains have extra fees; if any fee applies, it is displayed in the log's cost details tab.

Data responses (.json, .csv, .xml, .txt, etc.) that exceed 1MB are considered high bandwidth, and bandwidth use beyond the initial 1MB is billed as BINARY bandwidth.

Request bodies (POST, PUT, PATCH) exceeding 100KB are billed as BINARY bandwidth.

For more on billing, each scrape request features a billing section on the monitoring dashboard with a detailed breakdown of the API credits used.

Downloads are billed in slices of 100KB. The billed size is available in the cost details of the response's context.cost field.

| Network Type | API Credits Cost |
| --- | --- |
| Datacenter Proxies | 3 per 100KB |
| Residential Proxies | 10 per 100KB |
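
As a worked example, here is a short Python sketch of the slice arithmetic; rounding partial slices up is an assumption, as the rounding rule is not spelled out above:

```python
import math

def download_credits(size_bytes: int, credits_per_slice: int) -> int:
    # Downloads are billed in 100KB slices; assume partial slices round up.
    slices = math.ceil(size_bytes / (100 * 1024))
    return slices * credits_per_slice

# A 250KB download over residential proxies (10 credits per 100KB slice):
print(download_credits(250 * 1024, credits_per_slice=10))  # 3 slices -> 30 credits
```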

Manage Spending (Limits, Budgets, Predictable Spend)

We offer a variety of tools to help you manage your spending and stay within your budget. Here are some of the ways you can do that:

  1. Projects can be used to define a global limit

    Each Scrapfly project can be restricted with a specific credit budget and concurrency limits.

  2. Throttlers can be used to define limits per scraped website and timeframe

    Using Throttler's Spending Limit feature, each scrape target can be restricted with a specific credit budget for a given period. For example, you can set a budget of 10,000 credits per day for website A and 100,000 credits per month for website B.

  3. API calls can be defined with a per call cost budget

    You can use the cost_budget parameter to set a maximum budget for your web scraping requests, as sketched after this list.

    • It's important to set the correct minimum budget for your target so that you can pass through any blocks and pay for any blocked results.
    • Budgets only apply to the deterministic configuration; costs related to bandwidth usage cannot be known in advance.
    • Regardless of the status code, if the scrape is interrupted because the cost budget has been reached and a scrape attempt has been made, the call is billed based on the scrape attempt settings.
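
Here is a minimal sketch of a cost-budgeted call in Python (assuming the requests library, a placeholder key, and an illustrative target):

```python
import requests

response = requests.get(
    "https://api.scrapfly.io/scrape",
    params={
        "key": "YOUR_API_KEY",  # placeholder
        "url": "https://example.com/",
        "asp": "true",
        "cost_budget": "30",  # reject upfront if the deterministic cost exceeds 30 credits
    },
)
```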

By using these features, you can better manage your spending and ensure that you stay within your budget when using our web scraping API.

Scrape Failed Protection and Fairness Policy

Scrapfly's Scrape Failed Protection and Fairness Policy is in place to ensure that failed scrapes are not billed to our customers. To prevent any abuse of our system, we also have a fairness policy in place.

Under this policy, if more than 30% of failed traffic with eligible status codes (status codes greater than or equal to 400 and not excluded; see below) is detected within a minimum one-hour period, the fairness policy is disabled and the usage is billed. Additionally, if an account deliberately scrapes a protected website without success and without using our Anti Scraping Protection (ASP), the account may be suspended at the discretion of our account managers.

The following status codes are eligible for our Scrape Failed Protection and Fairness Policy: status codes greater than or equal to 400 and not excluded (see below).

Excluded status codes include: 400, 401, 404, 405, 406, 407, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 422, 424, 426, 428, and 456.
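
For illustration, the eligibility rule can be expressed as a short Python check (the excluded codes are copied verbatim from the list above):

```python
# Status codes excluded from Scrape Failed Protection, as listed above.
EXCLUDED = {
    400, 401, 404, 405, 406, 407, 409, 410, 411, 412, 413,
    414, 415, 416, 417, 418, 422, 424, 426, 428, 456,
}

def eligible_for_failed_protection(status_code: int) -> bool:
    """True when a failed scrape counts toward Scrape Failed Protection."""
    return status_code >= 400 and status_code not in EXCLUDED

assert eligible_for_failed_protection(403)      # blocked -> eligible
assert not eligible_for_failed_protection(404)  # excluded code
```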

# Errors

Scrapfly uses conventional HTTP response codes to indicate the success or failure of an API request.

Codes in the 2xx range indicate success.

Codes in the 4xx range indicate a request that failed given the information provided (e.g., a required parameter was omitted, the operation was not permitted, max concurrency was reached, etc.).

Codes in the 5xx range indicate an error with Scrapfly's servers.


HTTP 422 - Request Failed responses provide extra headers to help as much as possible:

  • X-Scrapfly-Reject-Code: Error Code
  • X-Scrapfly-Reject-Description: URL to the related documentation
  • X-Scrapfly-Reject-Retryable: Indicates whether the scrape is retryable
It is important to properly handle HTTP client errors in order to access the error headers and body. These details contain valuable information for troubleshooting, resolving the issue, or reaching out to support.
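
A minimal Python sketch of such handling (assuming the requests library and a placeholder key; the exact value format of the retryable header is not specified here):

```python
import requests

response = requests.get(
    "https://api.scrapfly.io/scrape",
    params={"key": "YOUR_API_KEY", "url": "https://example.com/"},  # placeholders
)

if response.status_code == 422:
    # The reject headers explain what failed and whether a retry makes sense.
    print("code:", response.headers.get("X-Scrapfly-Reject-Code"))
    print("docs:", response.headers.get("X-Scrapfly-Reject-Description"))
    print("retryable:", response.headers.get("X-Scrapfly-Reject-Retryable"))
    print(response.text)  # the body carries further troubleshooting details
```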

HTTP Status Code Summary

| Code | Meaning |
| --- | --- |
| 200 - OK | Everything worked as expected. |
| 400 - Bad Request | The request was unacceptable, often due to a missing required parameter, a bad value, or a bad format. |
| 401 - Unauthorized | No valid API key provided. |
| 402 - Payment Required | A payment issue occurred and needs to be resolved. |
| 403 - Forbidden | The API key doesn't have permission to perform the request. |
| 422 - Request Failed | The parameters were valid but the request failed. |
| 429 - Too Many Requests | All free quota used, max allowed concurrency reached, or domain throttled. |
| 500, 502, 503 - Server Errors | Something went wrong on Scrapfly's end. |
| 504 - Timeout | The scrape timed out. |
You can check out the full error list to learn more.

Specification

Discover and learn the full potential of our API to scrape the desired targets.
If you have any questions, check out the Frequently Asked Questions section, or ask on our chat.

By default, the API has a read timeout of 155 seconds. To avoid read timeout errors, configure your HTTP client with a read timeout of at least 155 seconds. If you need a different timeout value, refer to the documentation on controlling the timeout.
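
For example, with Python's requests library the read timeout is set per call (placeholder key and target):

```python
import requests

response = requests.get(
    "https://api.scrapfly.io/scrape",
    params={"key": "YOUR_API_KEY", "url": "https://example.com/"},
    # (connect timeout, read timeout): the read timeout covers the API's
    # 155-second default so long scrapes are not cut off client-side.
    timeout=(10, 155),
)
```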

Test the API straight from your terminal

```bash
# YOUR_API_KEY is a placeholder - use your project API key.
curl -X GET "https://api.scrapfly.io/scrape?key=YOUR_API_KEY&url=https://httpbin.dev/anything?q=I%20want%20to%20Scrape%20this&country=us&render_js=true"
curl -X POST "https://api.scrapfly.io/scrape?key=YOUR_API_KEY&url=https://httpbin.dev/anything?q=I%20want%20to%20Scrape%20this&country=us&render_js=true" -H "content-type: text/json" --data-raw "{\"test\": \"example\"}"
curl -X PUT "https://api.scrapfly.io/scrape?key=YOUR_API_KEY&url=https://httpbin.dev/anything?q=I%20want%20to%20Scrape%20this&country=us&render_js=true" -H "content-type: text/json" --data-raw "{\"test\": \"example\"}"
curl -X PATCH "https://api.scrapfly.io/scrape?key=YOUR_API_KEY&url=https://httpbin.dev/anything?q=I%20want%20to%20Scrape%20this&country=us&render_js=true" -H "content-type: text/json" --data-raw "{\"test\": \"example\"}"
```

Want to try out the API without coding?

Check out our visual API player and test/generate code to use our API.

The default response format is JSON, and the scraped content is available in response.result.content. Your scrape configuration is echoed in response.config, and information about other activated features is available in response.context. To get the HTML page directly, refer to the proxified_response parameter.
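
A minimal sketch of reading these fields in Python (assuming the requests library and a placeholder key):

```python
import requests

response = requests.get(
    "https://api.scrapfly.io/scrape",
    params={"key": "YOUR_API_KEY", "url": "https://httpbin.dev/anything"},
)
data = response.json()

scraped_content = data["result"]["content"]  # the scraped page content
scrape_config = data["config"]               # echo of your scrape configuration
feature_context = data["context"]            # info about activated features
```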

HTTP Parameters

| Parameter | Description | Example |
| --- | --- | --- |
| url (required) | Target URL to scrape. Must be URL-encoded. | url=https://httpbin.dev/anything?q=I%20want%20to%20Scrape%20this |
| key (required) | API key to authenticate the call. | key=16eae084cff64841be193a95fc8fa67dso |
| proxy_pool (popular, default: public_datacenter_pool) | Select the proxy pool. A proxy pool is a network of proxies grouped by quality range and network type. The price varies based on the pool used. | proxy_pool=public_datacenter_pool, proxy_pool=public_residential_pool |
| headers (popular, default: []) | Pass custom headers to the request. To learn about request customization, you can check out the dedicated documentation. Must be URL-encoded. | headers[content-type]=application%2Fjson, headers[Cookie]=test%3D1%3Bauth%3D1 |
| debug (default: false) | Store the API result and take a screenshot if JavaScript rendering is enabled. A shareable link to the saved response is available. Useful to enable when you need to communicate an issue to our support. | debug=true, debug=false |
| correlation_id (default: null) | Helps correlate a group of scrapes issued by the same worker or machine. You can use it as a filter in our monitoring dashboard. | correlation_id=e3ba784cde0d |
| tags (default: []) | Add tags to your calls to easily group or filter them with many values. | tags[]=jewelery, tags[]=price |
| dns (default: false) | Query and retrieve target DNS information. | dns=true, dns=false |
| ssl (default: false) | Pull the remote SSL certificate and return other TLS information. Only available for https:// targets. You do not need to enable it to scrape https:// targets - that works by default; this simply adds more information. | ssl=true, ssl=false |
| webhook (default: null) | Queue your scrape request and receive the API response on your webhook endpoint. You can create a webhook endpoint from your dashboard; the parameter takes the name of the webhook. Webhooks are scoped to the given project/env. | webhook=my-webhook-name |

Anti Scraping Protection - all related parameters require asp enabled.

| Parameter | Description | Example |
| --- | --- | --- |
| asp (popular, default: false) | Anti Scraping Protection - unblock protected websites and bypass protection. | asp=true, asp=false |
| cost_budget (BETA, default: null) | ASP dynamically retries and upgrades some parameters (such as proxy_pool and browser rendering) to pass, which dynamically changes the cost of the call. To make the cost more predictable, you can define a budget to respect. Make sure to set the minimum required to pass your target, or the call will be rejected without even trying. | cost_budget=25, cost_budget=55 |

Headless Browser / JavaScript Rendering - all related parameters require render_js enabled.

| Parameter | Description | Example |
| --- | --- | --- |
| auto_scroll (BETA, default: false) | Auto-scroll to the bottom of the page, triggering JavaScript rendering based on user viewport intersection. | auto_scroll=true, auto_scroll=false |

Cache - all related parameters require cache enabled.

Session - all related parameters require session enabled.
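
To tie the parameters together, here is a minimal Python sketch combining several of them in one call (placeholder key; requests URL-encodes parameter values automatically, satisfying the "must be URL-encoded" requirements):

```python
import requests

response = requests.get(
    "https://api.scrapfly.io/scrape",
    params={
        "key": "YOUR_API_KEY",  # placeholder - use your project API key
        # requests URL-encodes the target for us.
        "url": "https://httpbin.dev/anything?q=I want to Scrape this",
        "headers[Cookie]": "test=1;auth=1",
        "tags[]": "price",
        "asp": "true",
        "render_js": "true",
    },
    timeout=(10, 155),  # keep the read timeout at the API's 155s default
)
print(response.json()["result"]["content"][:200])
```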