Getting started
Scrapfly exposes an HTTP API; you use this API to scrape the websites you want. You simply set URL parameters to enable JavaScript rendering, change the proxy location, and so on. The whole complexity is abstracted away; just set the parameters and it works. The API is designed for developers; it is not a "no code" tool.
If you have an account, all examples are interactive and can be loaded in our "API Playground", a visual implementation of our API that lets you test things simply. To try an example in the playground, just click the "Try" button once logged in. You can register now for free with 1,000 API Calls.
Basics
Scrapfly Web Scraping API is available at the following endpoint:
https://api.scrapfly.io
First API Call
curl -G \
--request "GET" \
--url "https://api.scrapfly.io/scrape" \
--data-urlencode "key=__API_KEY__" \
--data-urlencode "url=http://httpbin.org/anything" \
--data-urlencode "country=de"
"https://api.scrapfly.io/scrape?key=&url=http%253A%252F%252Fhttpbin.org%252Fanything&country=de"
key = ""
url = "http://httpbin.org/anything"
country = "de"
You can use the jq JSON utility to pretty-print the response:
- curl ... | jq . prints the entire JSON response
- curl ... | jq .result.config prints the config info
- curl ... | jq .result.result prints the result info (content, headers, etc)
- curl ... | jq .result.context prints the scrape context (cache, session, etc)
- curl ... | jq .result.content prints the scraped content
Well done, you have completed your first scrape! You will see it appear in your monitoring (the ingestion delay is usually ~1min).
To retrieve the log URL and see the scrape in your dashboard:
- From your application: response['result']['log_url']
- From the command line with jq: curl <command> | jq .result.log_url
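As a minimal sketch, here is how you might fetch a scrape and print its log URL from Python. The key, url, and country parameters and the result.log_url field come from the examples above; the use of the requests library rather than the official SDK is just an assumption for illustration.

```python
# Minimal sketch using the `requests` library (an assumption for
# illustration; the official SDKs are the supported integration path).
import requests

params = {
    "key": "__API_KEY__",                 # your Scrapfly API key
    "url": "http://httpbin.org/anything",
    "country": "de",
}
response = requests.get("https://api.scrapfly.io/scrape", params=params)
data = response.json()

# The monitoring log URL is exposed under result.log_url, as shown above.
print(data["result"]["log_url"])
```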
Quick Usage
curl -G \
--request "GET" \
--url "https://api.scrapfly.io/scrape" \
--data-urlencode "key=__API_KEY__" \
--data-urlencode "url=https://httpbin.org/anything" \
--data-urlencode "country=jp"
Learn more about Geo Targeting
curl -G \
--request "GET" \
--url "https://api.scrapfly.io/scrape" \
--data-urlencode "key=__API_KEY__" \
--data-urlencode "url=https://httpbin.org/anything" \
--data-urlencode "render_js=true"
Learn more about Javascript Rendering
curl -G \
--request "GET" \
--url "https://api.scrapfly.io/scrape" \
--data-urlencode "key=__API_KEY__" \
--data-urlencode "url=https://httpbin.org/anything" \
--data-urlencode "asp=true"
Learn more about our Anti Scraping Protection (ASP)
curl \
--head \
--url "https://api.scrapfly.io/scrape?key=__API_KEY__&url=https%253A%252F%252Fhttpbin.org%252Fanything&headers%5Bx-test%5D=test"
Learn more about Request Customization
curl -G \
--request "GET" \
--url "https://api.scrapfly.io/scrape" \
--data-urlencode "key=__API_KEY__" \
--data-urlencode "url=https://httpbin.org/anything" \
--data-urlencode "proxy_pool=public_residential_pool"
Learn more about Proxies
Developer experience
Project / Environment
You can split your workload per project. Each project is isolated from the others; all resources such as throttlers, spiders, webhooks, and logs are scoped per project. Each project has its own LIVE and TEST environment, and each environment has a dedicated API key. They are isolated to make monitoring easier and to avoid needless noise from the dev/staging environment. Simply use the provided API key to target the desired project / environment. A default project is created when you create an account.
Learn more about projects and how to organize your account
API Portability
Our API is OpenAPI 3.0 compliant for better portability, and a Postman collection is available for a better developer experience.
API Integration
Question / Answer
> Can I directly retrieve the upstream response (body, headers) instead of the regular Scrapfly JSON response?
Yes, you can set proxified_response=true
as a URL parameter. The body and headers returned are those of the upstream response.
NOTE: This format only exposes the body and headers. If you need more information, you should not use this format.
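As an illustrative sketch, here is how a call with that parameter might look from Python; the requests library (rather than the official SDK) is assumed purely for illustration.

```python
# Sketch: with proxified_response=true the API returns the upstream body and
# headers directly instead of the Scrapfly JSON envelope.
import requests

response = requests.get(
    "https://api.scrapfly.io/scrape",
    params={
        "key": "__API_KEY__",
        "url": "https://httpbin.org/anything",
        "proxified_response": "true",
    },
)
print(response.text)      # upstream body, not a JSON envelope
print(response.headers)   # upstream headers
```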
> Do I need to manage headers like user agent, proxies?
No. By default, all fingerprinting and browser emulation (even without JavaScript rendering) is managed for you. You can still pass the desired headers to override those values. Proxies rotate automatically; you can stick to the same proxy by enabling a session, and our Anti Scraping Protection (ASP) automatically sticks when needed to reuse cookies, etc.
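As a small sketch of overriding a single header while leaving the rest managed: the headers[...] parameter syntax mirrors the request-customization curl example above, and requests is assumed as the client only for illustration.

```python
# Sketch: override one header; the remaining fingerprinting is still
# handled by Scrapfly.
import requests

response = requests.get(
    "https://api.scrapfly.io/scrape",
    params={
        "key": "__API_KEY__",
        "url": "https://httpbin.org/anything",
        # Only the header you pass is overridden.
        "headers[x-test]": "test",
    },
)
print(response.json()["result"]["log_url"])
```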
> Where is the scraped content and status code?
Everything regarding the result of the scraped website is available through the JSON response, located under the result key:
- response['result']['content']: the scraped content
- response['result']['status_code']: the status code of the scraped page
- To see the full result available, you can play with our API Playground, which is an interactive visual playground for our API.
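A minimal sketch of reading those fields, again assuming the Python requests library rather than the official SDK:

```python
# Sketch: the scraped page lives under the `result` key of the JSON envelope.
import requests

data = requests.get(
    "https://api.scrapfly.io/scrape",
    params={"key": "__API_KEY__", "url": "https://httpbin.org/anything"},
).json()

upstream_status = data["result"]["status_code"]  # status code of the scraped page
html = data["result"]["content"]                 # scraped content
print(upstream_status, len(html))
```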
> Do I need to manage compression/content encoding stuff?
No. Whatever compression (gzip, br, deflate) and charset the upstream uses, we respond with a compression supported by your client and in utf-8.
> Can I download files?
Yes. You can check the format of the content via response['result']['format'], which is either text or binary. If the content is in binary format, it is base64 encoded.
You must not use browser rendering when you want to download media/images: most of the time, depending on the content-type, the browser will wrap the content in a generated HTML document and load it through a media HTML tag (img, video, audio).
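A small sketch of handling a binary result; the format and content fields come from the answer above, and requests is assumed as the client for illustration.

```python
# Sketch: binary results are base64 encoded, so decode them before writing
# to disk.
import base64
import requests

data = requests.get(
    "https://api.scrapfly.io/scrape",
    params={"key": "__API_KEY__", "url": "https://httpbin.org/image/png"},
).json()

result = data["result"]
if result["format"] == "binary":
    # Binary content is base64 encoded in the JSON response.
    payload = base64.b64decode(result["content"])
    with open("downloaded_file", "wb") as fh:
        fh.write(payload)
else:
    print(result["content"])  # text content is returned as-is
```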
> Do I need to retry errors myself?
Our system retries errors (network errors and bad HTTP statuses from 400 to 500) at least 3 times. If no session is specified, proxies are rotated between retries to ensure the best reliability.
> Can I Prevent Extra Usage?
The API response contains the header X-Scrapfly-Remaining-Scrape, which indicates the number of API calls remaining on your account. If the value is 0, you are in extra usage. You can also get account information (quota, concurrency, and so on) via our Account API.
The Python SDK exposes a ready-to-go method.
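For instance, a minimal sketch of checking that header after each call; the header name is the one documented above, and requests stands in for your HTTP client or the SDK.

```python
# Sketch: stop scheduling new scrapes once the remaining quota hits zero,
# to avoid extra usage.
import requests

response = requests.get(
    "https://api.scrapfly.io/scrape",
    params={"key": "__API_KEY__", "url": "https://httpbin.org/anything"},
)

remaining = int(response.headers.get("X-Scrapfly-Remaining-Scrape", 0))
if remaining == 0:
    print("Quota exhausted: further calls are billed as extra usage")
```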
> What is the maximum timeout?
The timeout of our API is 150s; upstream websites have 30s to respond. Why is the API timeout 150s when the upstream read timeout is only 30s? Because we sometimes need to retry or solve anti-bot challenges, which can take time.
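In practice, this means your own HTTP client should allow at least 150s before giving up. A small sketch with requests (the 155s value is just an example margin, not a documented setting):

```python
# Sketch: give the client a read timeout slightly above the API's 150s
# ceiling so Scrapfly, not your client, decides when a scrape times out.
import requests

response = requests.get(
    "https://api.scrapfly.io/scrape",
    params={"key": "__API_KEY__", "url": "https://httpbin.org/anything"},
    timeout=155,  # seconds; a bit above the documented 150s API timeout
)
print(response.status_code)
```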
> I'm getting ERR::SCRAPE::TOO_MANY_CONCURRENT_REQUEST
Depending on your current plan, you are limited in concurrent calls. You can check in real time how many concurrent calls are currently processing on your account via our Account API, through response['usage']['scrape']['concurrent_usage']. As soon as you send a scrape order to our API, your concurrency counter is incremented; as soon as you retrieve the response, the counter is decremented.
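A hedged sketch of checking that counter before dispatching more work: the usage.scrape.concurrent_usage field is the one documented above, but the /account endpoint path and requests client used here are assumptions for illustration.

```python
# Sketch: read the live concurrency counter from the Account API before
# launching more scrapes. The /account path is an assumption.
import requests

account = requests.get(
    "https://api.scrapfly.io/account",  # assumed Account API endpoint
    params={"key": "__API_KEY__"},
).json()

in_flight = account["usage"]["scrape"]["concurrent_usage"]
print(f"{in_flight} scrape(s) currently processing")
```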
> How to understand the error returned by the API?
Global API Error
When the error is a "global" API error, you got a 4xx
or 5xx
response with a simple JSON format where you can get info like
response['code']
and response['message']
to understand the situation. Also, all our errors are explained and
available here.
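A minimal sketch of handling such a global error; the code and message fields are the ones described above, and requests is assumed as the client.

```python
# Sketch: on a 4xx/5xx API response, the JSON body carries `code` and
# `message` describing what went wrong.
import requests

response = requests.get(
    "https://api.scrapfly.io/scrape",
    params={"key": "__API_KEY__", "url": "https://httpbin.org/anything"},
)

if response.status_code >= 400:
    error = response.json()
    print(error["code"], error["message"])
```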
Resources API Error ERR::
Resource API errors can be recognized by their format: ERR::{RESOURCE_NAME}::{ERROR_KIND}. All resource errors are explained and
available here.
As a shortcut, you can also use the following URL and replace the placeholder with your error code.
> https://scrapfly.io/docs/scrape-api/error/REPLACE_ME_BY_ERROR_CODE
> How do I know how much a call costs?
Scrapfly responds with several headers; check x-scrapfly-api-cost, which contains the number of billed API calls. You are only billed for successful responses and "expected" errors such as 400, 404, 410, 401, 405, 406, 407, 409, 411, 412, 413, 414, 415, 416, 417, 418, 422, 424, 426, 428, 456; all other HTTP error codes are not billed.
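For example, a small sketch logging the billed cost of each call; the header name is the one given above, and requests is assumed as the client.

```python
# Sketch: the x-scrapfly-api-cost response header reports how many API
# calls the request was billed as.
import requests

response = requests.get(
    "https://api.scrapfly.io/scrape",
    params={"key": "__API_KEY__", "url": "https://httpbin.org/anything"},
)
print("billed cost:", response.headers.get("x-scrapfly-api-cost"))
```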
For more transparency, the cost details of each call are in the "cost" section of the log, available through your dashboard and on the monitoring page.
If you want to know more about API cost, you can also check:
Scenario | API Call Cost |
---|---|
ASP + Residential Proxies | 25 |
ASP + Residential Proxies + Browser | 25 + 5 = 30 |
ASP + Datacenter Proxies | 1 |
ASP + Datacenter Proxies + Browser | 6 |
Advanced Features
- Requests Customization (Cookies, Headers, Method, Geo Targeting)
- Javascript Rendering
- Residential Proxies
- Anti Scraping Protection (ASP)
- Cache system
- Session
- Async Scrape
Error Handling
Error handling is one of the most important parts of a web scraper. To be reliable, you need clear and precise errors describing what happened, so you can build bullet-proof logic around them, also called "defensive programming". In a scraping system, everything goes bad at some point, so you need to be prepared to handle errors; that is why our error system has been designed this way, to be reliable on your end.
All Scrape-related errors are listed below; a small defensive-handling sketch follows the list. You can see the full description and an example error response in the Errors section, where you will find each error's definition and how to gather information about it.
For easier integration, we provide an exportable definition available here
- 500 - ERR::SCRAPE::SSL_ERROR
- 400 - ERR::SCRAPE::SCENARIO_EXECUTION
- 408 - ERR::SCRAPE::SCENARIO_TIMEOUT
- 403 - ERR::SCRAPE::DOMAIN_NOT_ALLOWED
- 400 - ERR::SCRAPE::SCENARIO_DEADLINE_OVERFLOW
- 500 - ERR::SCRAPE::UPSTREAM_TOO_MANY_REDIRECT
- 400 - ERR::SCRAPE::DOM_SELECTOR_NOT_FOUND
- 400 - ERR::SCRAPE::DOM_SELECTOR_INVISIBLE
- 429 - ERR::SCRAPE::QUOTA_LIMIT_REACHED
- 408 - ERR::SCRAPE::OPERATION_TIMEOUT
- 429 - ERR::SCRAPE::PROJECT_QUOTA_LIMIT_REACHED
- 429 - ERR::SCRAPE::TOO_MANY_CONCURRENT_REQUEST
- 503 - ERR::SCRAPE::NO_BROWSER_AVAILABLE
- 500 - ERR::SCRAPE::UNABLE_TO_TAKE_SCREENSHOT
- 408 - ERR::SCRAPE::UPSTREAM_TIMEOUT
- 500 - ERR::SCRAPE::JAVASCRIPT_EXECUTION
- 200 - ERR::SCRAPE::BAD_UPSTREAM_RESPONSE
- 523 - ERR::SCRAPE::DNS_NAME_NOT_RESOLVED
- 500 - ERR::SCRAPE::BAD_PROTOCOL
- 499 - ERR::SCRAPE::NETWORK_SERVER_DISCONNECTED
- 499 - ERR::SCRAPE::NETWORK_ERROR
- 503 - ERR::SCRAPE::DRIVER_CRASHED
- 408 - ERR::SCRAPE::DRIVER_TIMEOUT
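To close the loop on the defensive-programming advice above, here is a minimal, purely illustrative sketch of dispatching on these error codes; the chosen actions are examples, not recommendations, and the helper name is hypothetical.

```python
# Sketch: map Scrapfly resource error codes (format ERR::{RESOURCE_NAME}::{ERROR_KIND})
# to an action. The policy below is purely illustrative.
def handle_scrape_error(code: str) -> str:
    if code == "ERR::SCRAPE::TOO_MANY_CONCURRENT_REQUEST":
        return "backoff"      # 429: wait until concurrency frees up, then retry
    if code in ("ERR::SCRAPE::OPERATION_TIMEOUT", "ERR::SCRAPE::UPSTREAM_TIMEOUT"):
        return "retry"        # 408: transient, worth retrying
    if code == "ERR::SCRAPE::DNS_NAME_NOT_RESOLVED":
        return "skip"         # 523: permanent for this URL
    return "investigate"      # look the code up in the error documentation

print(handle_scrape_error("ERR::SCRAPE::UPSTREAM_TIMEOUT"))  # retry
```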