Getting started
Scrapfly exposes an HTTP API; you use this API to scrape the websites you want. You simply set URL parameters to enable JavaScript rendering, change the proxy location, and so on. The whole complexity is abstracted away; just set the parameters and it works. The API is designed for developers; it is not a "no code" tool.
If you have an account, all examples are interactive and can be loaded in our "API Playground", a visual implementation of our API that lets you test things simply. To try an example in the playground, just click the "Try" button once logged in. You can register now for free with 1,000 API Calls.
Basics
Scrapfly Web Scraping API is available at the following endpoint:
https://api.scrapfly.io
First API Call
curl -G \
--request "GET" \
--url "https://api.scrapfly.io/scrape" \
--data-urlencode "key=__API_KEY__" \
--data-urlencode "url=http://httpbin.org/anything" \
--data-urlencode "country=de"
"https://api.scrapfly.io/scrape?key=&url=http%253A%252F%252Fhttpbin.org%252Fanything&country=de"
key = ""
url = "http://httpbin.org/anything"
country = "de"
You can use the jq JSON utility to pretty-print the response:
- curl ... | jq . prints the entire JSON response
- curl ... | jq .result.config prints the config info
- curl ... | jq .result.result prints the result info (content, headers, etc)
- curl ... | jq .result.context prints the scrape context (cache, session, etc)
- curl ... | jq .result.content prints the scraped content
Well done, you have completed your first scrape! You will see it appear in your monitoring (the ingestion delay is usually ~1min).
To retrieve the log URL and see the scrape in your dashboard:
- From your application: response['result']['log_url']
- From the command line with jq: curl <command> | jq .result.log_url
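As a minimal sketch, here is how you might fetch a scrape and print its log URL from Python. The key, url, and country parameters and the result.log_url field come from the examples above; the use of the requests library rather than the official SDK is just an assumption for illustration.

```python
# Minimal sketch using the `requests` library (an assumption for
# illustration; the official SDKs are the supported integration path).
import requests

params = {
    "key": "__API_KEY__",                 # your Scrapfly API key
    "url": "http://httpbin.org/anything",
    "country": "de",
}
response = requests.get("https://api.scrapfly.io/scrape", params=params)
data = response.json()

# The monitoring log URL is exposed under result.log_url, as shown above.
print(data["result"]["log_url"])
```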
Quick Usage
curl -G \
--request "GET" \
--url "https://api.scrapfly.io/scrape" \
--data-urlencode "key=__API_KEY__" \
--data-urlencode "url=https://httpbin.org/anything" \
--data-urlencode "country=jp"
Learn more about Geo Targeting
curl -G \
--request "GET" \
--url "https://api.scrapfly.io/scrape" \
--data-urlencode "key=__API_KEY__" \
--data-urlencode "url=https://httpbin.org/anything" \
--data-urlencode "render_js=true"
Learn more about Javascript Rendering
curl -G \
--request "GET" \
--url "https://api.scrapfly.io/scrape" \
--data-urlencode "key=__API_KEY__" \
--data-urlencode "url=https://httpbin.org/anything" \
--data-urlencode "asp=true"
Learn more about our Anti Scraping Protection (ASP)
curl \
--head \
--url "https://api.scrapfly.io/scrape?key=__API_KEY__&url=https%253A%252F%252Fhttpbin.org%252Fanything&headers%5Bx-test%5D=test"
Learn more about Request Customization
curl -G \
--request "GET" \
--url "https://api.scrapfly.io/scrape" \
--data-urlencode "key=__API_KEY__" \
--data-urlencode "url=https://httpbin.org/anything" \
--data-urlencode "proxy_pool=public_residential_pool"
Learn more about Proxies
Developer experience
Project / Environment
You can split your workload per project. Each project is isolated from the others; all resources such as throttlers, spiders, webhooks, and logs are scoped per project. Each project has its own LIVE and TEST environment, and each environment has a dedicated API key. They are isolated to make monitoring easier and to avoid needless noise from the dev/staging environment. Simply use the provided API key to target the desired project / environment. A default project is created when you create an account.
Learn more about projects and how to organize your account
API Portability
Our API is OpenAPI 3.0 compliant for better portability, and a Postman collection is available for a better developer experience.
API Integration
Question / Answer
> Can I directly retrieve the upstream response (body, headers) instead of the regular Scrapfly JSON response?
Yes, you can set proxified_response=true
as a URL parameter. The body and headers returned are those of the upstream response.
NOTE: This format only exposes the body and headers. If you need more information, you should not use this format.
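As an illustrative sketch, here is how a call with that parameter might look from Python; the requests library (rather than the official SDK) is assumed purely for illustration.

```python
# Sketch: with proxified_response=true the API returns the upstream body and
# headers directly instead of the Scrapfly JSON envelope.
import requests

response = requests.get(
    "https://api.scrapfly.io/scrape",
    params={
        "key": "__API_KEY__",
        "url": "https://httpbin.org/anything",
        "proxified_response": "true",
    },
)
print(response.text)      # upstream body, not a JSON envelope
print(response.headers)   # upstream headers
```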
> Do I need to manage headers like user agent, proxies?
No. By default, all fingerprinting and browser emulation (even without JavaScript rendering) is managed for you. You can still pass the desired headers to override those values. Proxies rotate automatically; you can stick to the same proxy by enabling a session, and our Anti Scraping Protection (ASP) automatically sticks when needed to reuse cookies, etc.
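As a small sketch of overriding a single header while leaving the rest managed: the headers[...] parameter syntax mirrors the request-customization curl example above, and requests is assumed as the client only for illustration.

```python
# Sketch: override one header; the remaining fingerprinting is still
# handled by Scrapfly.
import requests

response = requests.get(
    "https://api.scrapfly.io/scrape",
    params={
        "key": "__API_KEY__",
        "url": "https://httpbin.org/anything",
        # Only the header you pass is overridden.
        "headers[x-test]": "test",
    },
)
print(response.json()["result"]["log_url"])
```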
> Where is the scraped content and status code?
Everything regarding the result of the scraped website is available through the JSON response, located under the result key:
- response['result']['content']: the scraped content
- response['result']['status_code']: the status code of the scraped page
- To see the full result available, you can play with our API Playground, which is an interactive visual playground for our API.
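A minimal sketch of reading those fields, again assuming the Python requests library rather than the official SDK:

```python
# Sketch: the scraped page lives under the `result` key of the JSON envelope.
import requests

data = requests.get(
    "https://api.scrapfly.io/scrape",
    params={"key": "__API_KEY__", "url": "https://httpbin.org/anything"},
).json()

upstream_status = data["result"]["status_code"]  # status code of the scraped page
html = data["result"]["content"]                 # scraped content
print(upstream_status, len(html))
```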
> Do I need to manage compression/content encoding stuff?
No. Whatever compression (gzip, br, deflate) and charset the upstream uses, we respond with a compression supported by your client and in utf-8.
> Can I download files?
Yes. You can check the format of the content via response['result']['format'], which is either text or binary. If the content is in binary format, it is base64 encoded.
You must not use browser rendering when you want to download media/images: most of the time, depending on the content-type, the browser will wrap the content in a generated HTML document and load it through a media HTML tag (img, video, audio).
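A small sketch of handling a binary result; the format and content fields come from the answer above, and requests is assumed as the client for illustration.

```python
# Sketch: binary results are base64 encoded, so decode them before writing
# to disk.
import base64
import requests

data = requests.get(
    "https://api.scrapfly.io/scrape",
    params={"key": "__API_KEY__", "url": "https://httpbin.org/image/png"},
).json()

result = data["result"]
if result["format"] == "binary":
    # Binary content is base64 encoded in the JSON response.
    payload = base64.b64decode(result["content"])
    with open("downloaded_file", "wb") as fh:
        fh.write(payload)
else:
    print(result["content"])  # text content is returned as-is
```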
> Do I need to retry errors myself?
Our system retries errors (network errors and bad HTTP statuses from 400 to 500) at least 3 times. If no session is specified, proxies are rotated between retries to ensure the best reliability.
> Can I Prevent Extra Usage?
The API response contains the header X-Scrapfly-Remaining-Scrape, which indicates the number of API calls remaining on your account. If the value is 0, you are in extra usage. You can also get account information (quota, concurrency, and so on) via our Account API.
The Python SDK exposes a ready-to-go method.
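For instance, a minimal sketch of checking that header after each call; the header name is the one documented above, and requests stands in for your HTTP client or the SDK.

```python
# Sketch: stop scheduling new scrapes once the remaining quota hits zero,
# to avoid extra usage.
import requests

response = requests.get(
    "https://api.scrapfly.io/scrape",
    params={"key": "__API_KEY__", "url": "https://httpbin.org/anything"},
)

remaining = int(response.headers.get("X-Scrapfly-Remaining-Scrape", 0))
if remaining == 0:
    print("Quota exhausted: further calls are billed as extra usage")
```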
> What is the maximum timeout?
The timeout of our API is 150s; upstream websites have 30s to respond. Why is the API timeout 150s when the upstream read timeout is only 30s? Because we sometimes need to retry or solve anti-bot challenges, which can take time.
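In practice, this means your own HTTP client should allow at least 150s before giving up. A small sketch with requests (the 155s value is just an example margin, not a documented setting):

```python
# Sketch: give the client a read timeout slightly above the API's 150s
# ceiling so Scrapfly, not your client, decides when a scrape times out.
import requests

response = requests.get(
    "https://api.scrapfly.io/scrape",
    params={"key": "__API_KEY__", "url": "https://httpbin.org/anything"},
    timeout=155,  # seconds; a bit above the documented 150s API timeout
)
print(response.status_code)
```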
> I'm getting ERR::SCRAPE::TOO_MANY_CONCURRENT_REQUEST
Depending on your current plan, you are limited in concurrent calls. You can check in real time how many concurrent calls are currently processing on your account via our Account API, through response['usage']['scrape']['concurrent_usage']. As soon as you send a scrape order to our API, your concurrency counter is incremented; as soon as you retrieve the response, the counter is decremented.
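A hedged sketch of checking that counter before dispatching more work: the usage.scrape.concurrent_usage field is the one documented above, but the /account endpoint path and requests client used here are assumptions for illustration.

```python
# Sketch: read the live concurrency counter from the Account API before
# launching more scrapes. The /account path is an assumption.
import requests

account = requests.get(
    "https://api.scrapfly.io/account",  # assumed Account API endpoint
    params={"key": "__API_KEY__"},
).json()

in_flight = account["usage"]["scrape"]["concurrent_usage"]
print(f"{in_flight} scrape(s) currently processing")
```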
> How to understand the error returned by the API?
Global API Error
When the error is a "global" API error, you got a 4xx
or 5xx
response with a simple JSON format where you can get info like
response['code']
and response['message']
to understand the situation. Also, all our errors are explained and
available here.
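A minimal sketch of handling such a global error; the code and message fields are the ones described above, and requests is assumed as the client.

```python
# Sketch: on a 4xx/5xx API response, the JSON body carries `code` and
# `message` describing what went wrong.
import requests

response = requests.get(
    "https://api.scrapfly.io/scrape",
    params={"key": "__API_KEY__", "url": "https://httpbin.org/anything"},
)

if response.status_code >= 400:
    error = response.json()
    print(error["code"], error["message"])
```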
Resources API Error ERR::
Resource API errors can be recognized by their format: ERR::{RESOURCE_NAME}::{ERROR_KIND}. All resource errors are explained and
available here.
As a shortcut, you can also use the following URL and replace the placeholder with your error code.
> https://scrapfly.io/docs/scrape-api/error/REPLACE_ME_BY_ERROR_CODE
> How do I know how much a call costs?
Scrapfly responds with several headers; check x-scrapfly-api-cost, which contains the number of billed API calls. You are only billed for successful responses and "expected" errors such as 400, 404, 410, 401, 405, 406, 407, 409, 411, 412, 413, 414, 415, 416, 417, 418, 422, 424, 426, 428, 456; all other HTTP error codes are not billed.
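For example, a small sketch logging the billed cost of each call; the header name is the one given above, and requests is assumed as the client.

```python
# Sketch: the x-scrapfly-api-cost response header reports how many API
# calls the request was billed as.
import requests

response = requests.get(
    "https://api.scrapfly.io/scrape",
    params={"key": "__API_KEY__", "url": "https://httpbin.org/anything"},
)
print("billed cost:", response.headers.get("x-scrapfly-api-cost"))
```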
For more transparency, the cost details of each call are in the "cost" section of the log, available through your dashboard and on the monitoring page.
If you want to know more about API cost, you can also check:
Scenario | API Call Cost |
---|---|
ASP + Residential Proxies | 25 |
ASP + Residential Proxies + Browser | 25 + 5 = 30 |
ASP + Datacenter Proxies | 1 |
ASP + Datacenter Proxies + Browser | 6 |
Advanced Features
- Requests Customization (Cookies, Headers, Method, Geo Targeting)
- Javascript Rendering
- Residential Proxies
- Anti Scraping Protection (ASP)
- Cache system
- Session
- Async Scrape
Error Handling
Error handling is one of the most important parts of a web scraper. To be reliable, you need clear and precise errors describing what happened, so you can build bullet-proof logic around them, also called "defensive programming". In a scraping system, everything goes bad at some point, so you need to be prepared to handle errors; that is why our error system has been designed this way, to be reliable on your end.
All Scrape-related errors are listed below; a small defensive-handling sketch follows the list. You can see the full description and an example error response in the Errors section, where you will find each error's definition and how to gather information about it.
For easier integration, we provide an exportable definition available here
- 500 - ERR::SCRAPE::SSL_ERROR
- 400 - ERR::SCRAPE::SCENARIO_EXECUTION
- 408 - ERR::SCRAPE::SCENARIO_TIMEOUT
- 403 - ERR::SCRAPE::DOMAIN_NOT_ALLOWED
- 400 - ERR::SCRAPE::SCENARIO_DEADLINE_OVERFLOW
- 500 - ERR::SCRAPE::UPSTREAM_TOO_MANY_REDIRECT
- 400 - ERR::SCRAPE::DOM_SELECTOR_NOT_FOUND
- 400 - ERR::SCRAPE::DOM_SELECTOR_INVISIBLE
- 429 - ERR::SCRAPE::QUOTA_LIMIT_REACHED
- 408 - ERR::SCRAPE::OPERATION_TIMEOUT
- 429 - ERR::SCRAPE::PROJECT_QUOTA_LIMIT_REACHED
- 429 - ERR::SCRAPE::TOO_MANY_CONCURRENT_REQUEST
- 503 - ERR::SCRAPE::NO_BROWSER_AVAILABLE
- 500 - ERR::SCRAPE::UNABLE_TO_TAKE_SCREENSHOT
- 408 - ERR::SCRAPE::UPSTREAM_TIMEOUT
- 500 - ERR::SCRAPE::JAVASCRIPT_EXECUTION
- 200 - ERR::SCRAPE::BAD_UPSTREAM_RESPONSE
- 523 - ERR::SCRAPE::DNS_NAME_NOT_RESOLVED
- 500 - ERR::SCRAPE::BAD_PROTOCOL
- 499 - ERR::SCRAPE::NETWORK_SERVER_DISCONNECTED
- 499 - ERR::SCRAPE::NETWORK_ERROR
- 503 - ERR::SCRAPE::DRIVER_CRASHED
- 408 - ERR::SCRAPE::DRIVER_TIMEOUT
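To close the loop on the defensive-programming advice above, here is a minimal, purely illustrative sketch of dispatching on these error codes; the chosen actions are examples, not recommendations, and the helper name is hypothetical.

```python
# Sketch: map Scrapfly resource error codes (format ERR::{RESOURCE_NAME}::{ERROR_KIND})
# to an action. The policy below is purely illustrative.
def handle_scrape_error(code: str) -> str:
    if code == "ERR::SCRAPE::TOO_MANY_CONCURRENT_REQUEST":
        return "backoff"      # 429: wait until concurrency frees up, then retry
    if code in ("ERR::SCRAPE::OPERATION_TIMEOUT", "ERR::SCRAPE::UPSTREAM_TIMEOUT"):
        return "retry"        # 408: transient, worth retrying
    if code == "ERR::SCRAPE::DNS_NAME_NOT_RESOLVED":
        return "skip"         # 523: permanent for this URL
    return "investigate"      # look the code up in the error documentation

print(handle_scrape_error("ERR::SCRAPE::UPSTREAM_TIMEOUT"))  # retry
```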