Getting started

Scrapfly exposes an HTTP API; you use this API to scrape the websites you want. You simply set URL parameters to enable JavaScript rendering, change the proxy location, and so on. The whole complexity is abstracted away: set the parameters and it works. The API is designed for developers; it is not a "no code" tool.

If you have an account, all examples are interactive and can be loaded in our "API Playground", a visual implementation of our API that lets you test things easily. To try an example in the playground, simply click the "try" button once logged in. You can register now for free with 1,000 API Calls.

Basics

Scrapfly Web Scraping API is available at the following endpoint:

https://api.scrapfly.io

First API Call

curl -G \
--request "GET" \
--url "https://api.scrapfly.io/scrape" \
--data-urlencode "key=__API_KEY__" \
--data-urlencode "url=http://httpbin.org/anything" \
--data-urlencode "country=de"
"https://api.scrapfly.io/scrape?key=&url=http%253A%252F%252Fhttpbin.org%252Fanything&country=de"

key      = "" 
url      = "http://httpbin.org/anything" 
country  = "de" 
You can use the jq JSON utility to pretty-print the response.
  • curl ... | jq . print the entire JSON response
  • curl ... | jq .result.config print the config info
  • curl ... | jq .result.result print the result info (content, headers, etc)
  • curl ... | jq .result.context print the scrape context (cache, session, etc ...)
To print the scraped content: curl ... | jq .result.content

Well played, you have done your first scrape! You will see it appear in your monitoring dashboard (ingestion delay is usually ~1 min).
To retrieve the log URL and see it in your dashboard:

  • From your application: response['result']['log_url']
  • From the command line with jq: curl <command> | jq .result.log_url
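For example, here is a minimal Python sketch of the same first call using the requests library (our choice; any HTTP client works), printing the log URL of the scrape:

# Minimal sketch of the first API call from Python, using the requests library.
# __API_KEY__ is the same placeholder as in the curl example above.
import requests

response = requests.get(
    "https://api.scrapfly.io/scrape",
    params={
        "key": "__API_KEY__",
        "url": "http://httpbin.org/anything",
        "country": "de",
    },
)
data = response.json()
print(data["result"]["log_url"])  # open this URL to see the scrape in your dashboard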

Quick Usage

curl -G \
--request "GET" \
--url "https://api.scrapfly.io/scrape" \
--data-urlencode "key=__API_KEY__" \
--data-urlencode "url=https://httpbin.org/anything" \
--data-urlencode "country=jp"
Learn more about Geo Targeting
curl -G \
--request "GET" \
--url "https://api.scrapfly.io/scrape" \
--data-urlencode "key=__API_KEY__" \
--data-urlencode "url=https://httpbin.org/anything" \
--data-urlencode "render_js=true"
Learn more about Javascript Rendering
curl -G \
--request "GET" \
--url "https://api.scrapfly.io/scrape" \
--data-urlencode "key=__API_KEY__" \
--data-urlencode "url=https://httpbin.org/anything" \
--data-urlencode "asp=true"
Learn more about our Anti Scraping Protection (ASP)
curl -G \
--request "GET" \
--url "https://api.scrapfly.io/scrape" \
--data-urlencode "key=__API_KEY__" \
--data-urlencode "url=https://httpbin.org/anything" \
--data-urlencode "headers[x-test]=test"
Learn more about request customization
curl -G \
--request "GET" \
--url "https://api.scrapfly.io/scrape" \
--data-urlencode "key=__API_KEY__" \
--data-urlencode "url=https://httpbin.org/anything" \
--data-urlencode "proxy_pool=public_residential_pool"
Learn more about proxies

Developer experience

Project / Environment

You can split your workload per project. Each project is isolated from the others; all resources such as throttlers, spiders, webhooks, and logs are scoped per project. Each project has its own LIVE and TEST environment, and each environment has a dedicated API key. They are isolated to make monitoring easier and to avoid useless noise from the dev/staging environment. Simply use the corresponding API key to target the desired project / environment. A default project is created when you create an account.

Learn more about projects and how to organize your account
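As a purely illustrative sketch (the variable and environment names are ours, not part of the API), the usual pattern is to load the key of the targeted project / environment from configuration:

# Illustrative only: keep one API key per project / environment and select it via
# configuration so the same code can target LIVE or TEST without changes.
import os
import requests

api_key = os.environ["SCRAPFLY_API_KEY"]  # e.g. the TEST key on staging, the LIVE key in production

response = requests.get(
    "https://api.scrapfly.io/scrape",
    params={"key": api_key, "url": "https://httpbin.org/anything"},
)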

API Portability

Our API is OpenAPI 3.0 compliant for better portability, and a Postman collection is available for a better developer experience.

API Integration

Question / Answer

> Can I directly retrieve the upstream response (body, headers) instead of regular Scrapfly JSON Response ?

Yes, you can set proxified_response=true as a URL parameter. The body and headers are then those of the upstream response.

NOTE: This format only exposes the body and headers. If you need more information, you should not use this format.
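For example, a sketch in Python (requests library) of retrieving the upstream response directly:

# Sketch: retrieve the upstream body and headers directly with proxified_response=true.
import requests

response = requests.get(
    "https://api.scrapfly.io/scrape",
    params={
        "key": "__API_KEY__",
        "url": "https://httpbin.org/anything",
        "proxified_response": "true",
    },
)
print(dict(response.headers))  # headers from the upstream response
print(response.text)           # upstream body, not the Scrapfly JSON envelope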

> Do I need to manage headers like user agent, proxies?

Nope. By default, all fingerprinting and browser emulation, even without JavaScript rendering, is managed for you. You can still pass the desired headers to override those values. Proxies rotate automatically; you can stick to one by enabling a session, and our Anti Scraping Protection (ASP) automatically sticks when needed to reuse cookies, etc.
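For instance, a sketch overriding a header and pinning a session (the headers[...] syntax is the same as in the curl example above; we assume the session is enabled through a session parameter, and "my-session" is an arbitrary name of our choosing):

# Sketch: override an upstream header and keep the same proxy/cookies across calls
# by naming a session. The session parameter name and value are assumptions here.
import requests

response = requests.get(
    "https://api.scrapfly.io/scrape",
    params={
        "key": "__API_KEY__",
        "url": "https://httpbin.org/anything",
        "headers[x-test]": "test",
        "session": "my-session",
    },
)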

> Where is the scraped content and status code?

Everything regarding the result of the scraped website is available through the JSON response, located under the result key.

  • response['result']['content'] Scraped content
  • response['result']['status_code'] Status code of the scraped page
  • To see the full result available, you can play with our API Playground, an interactive visual playground for our API
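For example, a minimal Python sketch reading both fields:

# Sketch: read the scraped content and the upstream status code from the JSON response.
import requests

data = requests.get(
    "https://api.scrapfly.io/scrape",
    params={"key": "__API_KEY__", "url": "https://httpbin.org/anything"},
).json()

content = data["result"]["content"]          # scraped content
status_code = data["result"]["status_code"]  # status code of the scraped page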

> Do I need to manage compression/content encoding stuff?

Nope. Whatever compression is used (gzip, br, deflate) and whatever the charset, we respond with a compression supported by your client and content encoded in UTF-8.

> Can I download files?

Yes. You can check the format of the content via response['result']['format'], which is either text or binary. If the content is binary, it is base64 encoded.
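A sketch of handling both cases (the output file names are ours):

# Sketch: persist the scraped content, decoding it from base64 when the format is binary.
import base64
import requests

data = requests.get(
    "https://api.scrapfly.io/scrape",
    params={"key": "__API_KEY__", "url": "https://httpbin.org/anything"},
).json()

result = data["result"]
if result["format"] == "binary":
    with open("download.bin", "wb") as f:
        f.write(base64.b64decode(result["content"]))  # binary content is base64 encoded
else:
    with open("download.txt", "w", encoding="utf-8") as f:
        f.write(result["content"])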

> Do I need to retry myself against errors?

Our system retries errors (network errors and bad HTTP statuses from 400 to 500) at least 3 times. If no session is specified, proxies are rotated between retries to ensure the best reliability.

> What is the maximum timeout?

The timeout of our API is 150s. Upstream websites have 30s to respond. Why is the API timeout 150s while the upstream read timeout is 30s? Because we sometimes need to retry or solve anti-bot challenges, which can take time.
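In practice, this means your own HTTP client should use a read timeout above 150s so slow scrapes are not cut off on your side; a sketch with the requests library:

# Sketch: give the client a timeout larger than the 150s API budget, since retries
# and anti-bot challenge solving can take time.
import requests

response = requests.get(
    "https://api.scrapfly.io/scrape",
    params={"key": "__API_KEY__", "url": "https://httpbin.org/anything", "asp": "true"},
    timeout=160,  # seconds, above the 150s API timeout
)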

> I'm getting ERR::SCRAPE::TOO_MANY_CONCURRENT_REQUEST

Depending on your current plan, you are limited in concurrent calls. You can check in real time how many concurrent calls are currently processing on your account through our Account API, via response['usage']['scrape']['concurrent_usage']. As soon as you send a scrape order to our API, your concurrency counter is incremented; as soon as you retrieve the response, the counter is decremented.
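For example, a sketch checking that counter (assuming the Account API is exposed at /account and accepts the same key parameter; check the Account API documentation for the exact usage):

# Sketch: read how many concurrent scrapes are currently processing on the account.
# The /account endpoint and key parameter are assumptions on our side.
import requests

account = requests.get(
    "https://api.scrapfly.io/account",
    params={"key": "__API_KEY__"},
).json()

print(account["usage"]["scrape"]["concurrent_usage"])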

> How to understand the error returned by the API?

Global API Error

When the error is a "global" API error, you get a 4xx or 5xx response with a simple JSON format from which you can read response['code'] and response['message'] to understand the situation. Also, all our errors are explained and available here.

Resources API Error ERR::

Resource API errors can be recognized by their format; they look like ERR::{RESOURCE_NAME}::{ERROR_KIND}. All resource errors are explained and available here.

As a shortcut, you can also use this URL and replace the placeholder with your error code.
> https://scrapfly.io/docs/scrape-api/error/REPLACE_ME_BY_ERROR_CODE
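For example, a sketch reading the error code and message and building that shortcut URL:

# Sketch: on a 4xx/5xx response, read the error code and message and build the
# documentation shortcut URL for ERR::... resource errors.
import requests

response = requests.get(
    "https://api.scrapfly.io/scrape",
    params={"key": "__API_KEY__", "url": "https://httpbin.org/anything"},
)

if response.status_code >= 400:
    error = response.json()
    code, message = error.get("code"), error.get("message")
    print(code, message)
    if code and code.startswith("ERR::"):
        print("https://scrapfly.io/docs/scrape-api/error/" + code)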

> How to know how much a call cost?

Scrapfly responds with several headers; check x-scrapfly-api-cost, which contains the number of billed API calls. You are only billed for successful responses and "expected" errors such as 404, 410, 401, 405, 406, 407, 409, 411, 412, 413, 414, 415, 416, 417, 418, 422, 424, 426, 428, 456; all other HTTP error codes are not billed.

For more transparency, the cost details of each call are in the "cost" section of the log, available through your dashboard and also on the monitoring page.
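For example, a sketch reading the billed cost from the response headers:

# Sketch: read the number of billed API calls for this scrape from the response headers.
import requests

response = requests.get(
    "https://api.scrapfly.io/scrape",
    params={"key": "__API_KEY__", "url": "https://httpbin.org/anything"},
)

print(response.headers.get("x-scrapfly-api-cost"))  # count of billed API calls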

If you want to know more about API cost, you can also check :

  • Regular Scrape
  • Javascript Rendering
  • Anti Scraping Protection (ASP)
    • If ASP does not detect protection, no extra Scrape API calls are counted
    • If ASP needs to solve a challenge (Captcha/JavaScript) and generates a session for future usage, it costs 0 additional Scrape API calls
    • If an ASP session already exists and is still valid, no extra Scrape API calls are counted
    • If ASP needs to switch to the residential network, the proxy network cost of 25 API calls is counted
    • ASP is only billed on successful responses (404 and 410 are not considered failures)
    • To know more: Anti Scraping Protection (ASP)
  • Proxy network
    • Datacenter Proxies do not count additional calls
    • Residential Proxies cost 25 additional API calls
    • To know more: Proxy Network

Advanced Features

Error Handling

Error handling is one of the most important parts of a web scraper. To be reliable, you need to retrieve clear and precise errors describing what happened, so you can build bulletproof logic around them, also called "defensive programming". In a scraping system, everything will go bad at some point, so you need to be prepared to handle errors; that is why our error system has been designed this way, so you can rely on it on your end.

All Scrape-related errors are listed below. You can see the full description and an example error response in the Errors section, where you will find the error definitions and how to gather information about them.

For easier integration, we provide an exportable definition available here