API Specification

Getting Started

Discover how to use the API, the available parameters and features, error handling, and other information related to its usage.

# On steroids

  • By default the API responds in JSON; msgpack is also available. Both accept: application/json and accept: application/msgpack are supported (see the sketch after this list)
  • Gzip compression is available when the header content-encoding: gzip is set
  • Text content is converted to utf-8, binary content is converted to base64
  • By default everything is configured to scrape correctly without being blocked: a pre-configured user-agent and other default headers are applied
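
As a minimal sketch of the content negotiation above (YOUR_API_KEY and the target url are placeholders):

# Ask for a msgpack response instead of the default JSON
curl -H "accept: application/msgpack" \
  "https://api.scrapfly.io/scrape?key=YOUR_API_KEY&url=https%3A%2F%2Fhttpbin.dev%2Fhtml"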

# Quality of life

  • Success rate and monitoring are automatically tracked in your dashboard. Discover Monitoring
  • Multi-project/scraper management out of the box: simply use the API key from the project. Discover Project
  • Replay scrapes from the log
  • Experiment with our visual API playground
  • Status page with notification subscription
  • Our API responses include the following useful headers (a sketch for inspecting them follows this list):
    • X-Scrapfly-Api-Cost API cost billed
    • X-Scrapfly-Remaining-Api-Credit Remaining API credit; if 0, usage is billed as extra credit
    • X-Scrapfly-Account-Concurrent-Usage Current concurrency usage of your account
    • X-Scrapfly-Account-Remaining-Concurrent-Usage Maximum concurrency allowed by the account
    • X-Scrapfly-Project-Concurrent-Usage Concurrency usage of the project
    • X-Scrapfly-Project-Remaining-Concurrent-Usage Remaining concurrency of the project if a concurrency limit is set on the project; otherwise equal to the account concurrency
    Concurrency is defined by your subscription
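
To inspect these headers, you can dump them with curl; a minimal sketch (YOUR_API_KEY and the target url are placeholders):

# Print only the X-Scrapfly-* response headers, discard the body
curl -s -D - -o /dev/null \
  "https://api.scrapfly.io/scrape?key=YOUR_API_KEY&url=https%3A%2F%2Fhttpbin.dev%2Fhtml" \
  | grep -i '^x-scrapfly-'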

# Billing

If you want the total API credits billed directly, check the X-Scrapfly-Api-Cost header. If you want the details, the information is in our JSON response under context.cost, where you can find the breakdown and the total. You can check the format of the response in result.format, which can be TEXT (html, json, xml, txt, etc.) or BINARY (image, archive, pdf, etc.).
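
For example, with jq (the exact shape of context.cost is whatever the API returns; this only shows where to look):

# Print the cost breakdown from the JSON body
curl -s "https://api.scrapfly.io/scrape?key=YOUR_API_KEY&url=https%3A%2F%2Fhttpbin.dev%2Fhtml" \
  | jq '.context.cost'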

Scenario                          API Credits Cost
Datacenter Proxies                1
Datacenter Proxies + Browser      1 + 5 = 6
Residential Proxies               25
Residential Proxies + Browser     25 + 5 = 30

For downloads of data files (.json, .csv, .xml, and other kinds) exceeding 1Mb, the first 1Mb is included; all bandwidth above 1Mb is billed following the binary format grid below.

Downloads are billed per slice of 100kb. The billed size is available in the cost details of the response, context.cost.

Network Type                      API Credits Cost
Datacenter Proxies                3 per 100kb
Residential Proxies               10 per 100kb
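
As an illustrative example: downloading a 1.8Mb .csv file through datacenter proxies leaves 0.8Mb of billable bandwidth after the included first 1Mb, i.e. 8 slices of 100kb, so 8 × 3 = 24 API credits on top of the base scrape cost.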

Manage Spending (Limit, Budget)

  • Project: allow or disallow extra quota, set an extra-usage spending limit and a concurrency limit
  • Throttler: define per-target speed limits (request rate, concurrency) and an API credit budget per period (hour, day, month)
  • API: use the cost_budget parameter to define the maximum budget the ASP should respect (see the sketch after this list). If the budget interrupts the configuration mutation during a web scrape, the scrape performed is billed regardless of the status code. Make sure to define the correct minimum budget for your target: if the budget is too low, you will never be able to pass and will still pay for blocked results.
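
A minimal sketch of a budget-capped call (the budget value is illustrative; YOUR_API_KEY is a placeholder):

curl -G "https://api.scrapfly.io/scrape" \
  --data-urlencode "key=YOUR_API_KEY" \
  --data-urlencode "url=https://httpbin.dev/anything" \
  --data-urlencode "asp=true" \
  --data-urlencode "cost_budget=30"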

Scrape Failed Protection and Fairness Policy

It prevents failed scrapes from being billed. To prevent abuse, there is also a fairness policy: if more than 30% of traffic fails with an eligible status code (see below) over a minimum 1-hour period, the fairness policy is disabled and the usage is billed. Also, if the account deliberately scrapes a protected website without success and without ASP, your account can be suspended by account manager decision.

  • Status codes >= 400 that are not excluded (see below) are eligible
  • Excluded status codes: 400, 401, 404, 405, 406, 407, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 422, 424, 426, 428, 456 (a sketch of this check follows)
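
As a sketch, the eligibility rule could be expressed like this (is_fairness_eligible is a hypothetical helper, not part of the API):

# Hypothetical helper: does a failed status code count as eligible?
is_fairness_eligible() {
  local code=$1
  local excluded="400 401 404 405 406 407 409 410 411 412 413 414 415 416 417 418 422 424 426 428 456"
  [ "$code" -ge 400 ] && ! printf '%s\n' $excluded | grep -qx "$code"
}
is_fairness_eligible 403 && echo "403 is eligible"    # prints: 403 is eligible
is_fairness_eligible 404 || echo "404 is excluded"    # prints: 404 is excluded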

# Errors

Scrapfly uses conventional HTTP response codes to indicate the success or failure of an API request.

Codes in the 2xx range indicate success.

Codes in the 4xx range indicate a request that failed given the information provided (e.g., a required parameter was omitted, the operation was not permitted, max concurrency was reached, etc.).

Codes in the 5xx range indicate an error with Scrapfly's servers.


HTTP 422 - Request Failed provides extra headers to help as much as possible:

  • X-Scrapfly-Reject-Code: Error Code
  • X-Scrapfly-Reject-Description: URL to the related documentation
  • X-Scrapfly-Reject-Retryable: Indicates whether the scrape is retryable (see the example after this list)
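
To see them, dump the headers of a failed call; a minimal sketch (the key and target are placeholders, and a 422 only occurs when the scrape itself fails):

# Print the reject headers of a failed scrape
curl -s -D - -o /dev/null \
  "https://api.scrapfly.io/scrape?key=YOUR_API_KEY&url=https%3A%2F%2Fhttpbin.dev%2Fstatus%2F403" \
  | grep -i '^x-scrapfly-reject'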

HTTP Status Code Summary

200 - OK Everything worked as expected.
400 - Bad Request The request was unacceptable, often due to a missing required parameter, a bad value, or a bad format.
401 - Unauthorized No valid API key provided.
402 - Payment Required A payment issue occurred and needs to be resolved.
403 - Forbidden The API key doesn't have permissions to perform the request.
422 - Request Failed The parameters were valid but the request failed.
429 - Too Many Requests All free quota used, max allowed concurrency reached, or domain throttled.
500, 502, 503 - Server Errors Something went wrong on Scrapfly's end.
504 - Timeout The scrape has timed out.
You can check out the full error list to learn more.

Specification

Discover and learn the full potential of our API to scrape the desired targets.
If you have any questions you can check out the Frequently Asked Questions section and ultimately ask on our chat.

The API read timeout is 155s by default. You must configure your HTTP client with a read timeout of 155s. If you don't want this value and want to avoid read timeout errors, check the documentation on controlling the timeout; a timeout sketch follows the examples below. Note the URLs are quoted so the shell does not split them on &.
curl -X GET "https://api.scrapfly.io/scrape?url=http://httpbin.dev/anything?q=I%20want%20to%20Scrape%20this&country=us&render_js=true"
curl -X POST "https://api.scrapfly.io/scrape?url=http://httpbin.dev/anything?q=I%20want%20to%20Scrape%20this&country=us&render_js=true"
curl -X PUT "https://api.scrapfly.io/scrape?url=http://httpbin.dev/anything?q=I%20want%20to%20Scrape%20this&country=us&render_js=true"
curl -X PATCH "https://api.scrapfly.io/scrape?url=http://httpbin.dev/anything?q=I%20want%20to%20Scrape%20this&country=us&render_js=true"
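
For the 155s read timeout, a minimal curl sketch; note that curl's --max-time bounds the whole transfer rather than only the read phase, which is a practical stand-in here:

# Allow up to the API's 155s default before giving up
curl --max-time 155 \
  "https://api.scrapfly.io/scrape?url=http://httpbin.dev/anything&country=us&render_js=true"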

You will retrieve a JSON response by default; the scraped content is located in response.result.content. You will find your scrape configuration in response.config and various other information regarding the activated features in response.context. If you want to retrieve the HTML page directly instead of the JSON response, please check the proxified_response parameter.
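
For example, extracting the scraped content from the JSON envelope with jq (placeholders as before):

# Print only the scraped page content
curl -s "https://api.scrapfly.io/scrape?key=YOUR_API_KEY&url=https%3A%2F%2Fhttpbin.dev%2Fhtml" \
  | jq -r '.result.content'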

HTTP Parameters

url (required)
Target url to scrape. Must be url-encoded.
  url=https://httpbin.dev/anything?q=I%20want%20to%20Scrape%20this

key (required)
API key to authenticate the call.
  key=16eae084cff64841be193a95fc8fa67dso

proxy_pool (popular, default: public_datacenter_pool)
Select the proxy pool. A proxy pool is a network of proxies grouped by quality range and network type. The price varies based on the pool used.
  proxy_pool=public_datacenter_pool
  proxy_pool=public_residential_pool

headers (popular, default: [])
Pass custom headers to the request. To learn about request customization, you can check out the dedicated documentation. Must be url-encoded.
  headers[content-type]=application%2Fjson
  headers[Cookie]=test%3D1%3Bauth%3D1

debug (default: false)
Store the API result and take a screenshot if javascript rendering is enabled. A shareable link with the saved response is available. It's useful to enable when you need to communicate an issue to our support.
  debug=true
  debug=false

correlation_id (default: null)
Helps to correlate a group of scrapes issued by the same worker or machine. You can use it as a filter in our monitoring dashboard.
  correlation_id=e3ba784cde0d

tags (default: [])
Add tags to your calls to easily group or filter them with many values.
  tags[]=jewelery
  tags[]=price

dns (default: false)
Query and retrieve target DNS information.
  dns=true
  dns=false

ssl (default: false)
Pull the remote ssl certificate and return other tls information. Only available for https://. You do not need to enable it for scraping https:// targets - that works by default; this just adds more information.
  ssl=true
  ssl=false

webhook (default: null)
Queue your scrape request and receive the API response on your webhook endpoint. You can create a webhook endpoint from your dashboard; the parameter takes the name of the webhook. Webhooks are scoped to the given project/env.
  webhook=my-webhook-name

Anti Scraping Protection (all related parameters require asp enabled)

asp (popular, default: false)
Anti Scraping Protection - unblock protected websites and bypass protection.
  asp=true
  asp=false

cost_budget (BETA, default: null)
ASP dynamically retries and upgrades some parameters (such as proxy_pool, browser) to pass, which dynamically changes the cost of the call. To make the cost more predictable, you can define a budget to respect. Make sure to set the minimum required to pass your target or the call will be rejected without even trying. Please read the documentation here carefully.
  cost_budget=25
  cost_budget=55

Headless Browser / Javascript Rendering (all related parameters require render_js enabled)

auto_scroll (BETA, default: false)
Auto scroll to the bottom of the page. It allows triggering javascript rendering based on user viewport intersection.
  auto_scroll=true
  auto_scroll=false

Cache (all related parameters require cache enabled)

Session (all related parameters require session enabled)
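
Putting several of these parameters together, a minimal sketch (values and the target are illustrative; curl -G with --data-urlencode handles the required url-encoding for you):

curl -G "https://api.scrapfly.io/scrape" \
  --data-urlencode "key=YOUR_API_KEY" \
  --data-urlencode "url=https://httpbin.dev/anything?q=I want to Scrape this" \
  --data-urlencode "tags[]=price" \
  --data-urlencode "asp=true" \
  --data-urlencode "render_js=true" \
  --data-urlencode "auto_scroll=true"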