Getting started

Basics

API Endpoint
    https://api.scrapfly.io

API Keys / Project

Your API keys and project are available from the top menu bar. You can also find this information on the overview page of your dashboard, which displays the details of your project.


Open API

Scrapfly provides an OpenAPI specification to facilitate standards-based integration. Everything is well documented and links to more in-depth documentation.

You benefit from the whole open-source OpenAPI ecosystem of tooling.

OpenAPI Specification URL
    https://scrapfly.io/docs/openapi

You can load the OpenAPI spec into any compatible viewer.
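
For instance, here is a minimal sketch in Python (using the requests library, and assuming the spec URL serves a JSON document) that downloads the specification and lists the operations it declares:

    # Minimal sketch: download the Scrapfly OpenAPI specification and list its operations.
    # Assumes the spec URL serves JSON; adjust the parsing if it serves YAML instead.
    import requests

    SPEC_URL = "https://scrapfly.io/docs/openapi"

    response = requests.get(SPEC_URL, timeout=30)
    response.raise_for_status()
    spec = response.json()

    info = spec.get("info", {})
    print(info.get("title"), info.get("version"))
    for path, methods in spec.get("paths", {}).items():
        for method, operation in methods.items():
            if method in ("get", "post", "put", "patch", "delete", "head", "options"):
                print(f"{method.upper():7} {path} - {operation.get('summary', '')}")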

Postman

Postman is a collaborative platform for API development. Use it to quickly test and experiment with the Scrapfly API.

Scrapfly Postman Definition
    https://scrapfly.io/api/postman

You can import the Scrapfly collection directly into your Postman application.

First API Call

Interactive Example: Retrieve account information from the API

    curl -X GET "https://api.scrapfly.io/scrape?key=__API_KEY__&url=http%3A%2F%2Fhttpbin.org%2Fanything&country=de&proxy_pool=public_pool"

HTTP Call Pretty Print

    https://api.scrapfly.io/scrape?key=__API_KEY__&url=http%3A%2F%2Fhttpbin.org%2Fanything&country=de&proxy_pool=public_pool

        key        = __API_KEY__
        url        = http%3A%2F%2Fhttpbin.org%2Fanything
        country    = de
        proxy_pool = public_pool
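
The same call can be made from any HTTP client. Here is a minimal sketch in Python with the requests library; the parameters are taken from the curl example above, and __API_KEY__ is a placeholder for your own key:

    # Minimal sketch of the first API call with Python's requests library.
    import requests

    response = requests.get(
        "https://api.scrapfly.io/scrape",
        params={
            "key": "__API_KEY__",                  # placeholder: use your project's API key
            "url": "http://httpbin.org/anything",  # requests URL-encodes this for you
            "country": "de",
            "proxy_pool": "public_pool",
        },
        timeout=160,  # keep the read timeout above the 150 s API maximum (see "Good to know")
    )

    print(response.status_code)  # status of the Scrapfly API itself, not of the target site
    body = response.json()
    print(list(body))            # the upstream status_code, headers, cookies and content are
                                 # inside the body - see the API reference for the exact schema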

Good to know

The HTTP status code attached to the response reflects the Scrapfly service, not the upstream website! The status_code in the response body reflects the upstream service.
  • Use the API key that belongs to the project you want the call attributed to
  • Only successful scrapes are billed and counted against your quota - the API response contains the X-Scrapfly-Api-Cost header, which indicates the number of API calls billed
  • The protocol used by the scraper is HTTP/2
  • The scraper retries 5 times on failure, for server errors (HTTP code greater than or equal to 500) and network issues. Automatic retries by the system are neither counted nor billed in usage
  • The upstream connection timeout is 5s
  • The maximum API read timeout is 2.5 minutes (150 seconds) - Anti Scraping Protection (ASP) can take up to 60s to bypass websites - remember to increase the read timeout of your HTTP client, otherwise you will get a timeout error while the scrape is still in progress
  • The scraper follows a maximum of 25 redirections
  • The HTTP code returned by the response reflects the Scrapfly API service, not the upstream response. All data about the upstream is located in the response body (same for headers, cookies, etc.). This means that if the target website responds with 500 and our system is up and running, we respond with 200
  • We strongly recommend enabling the "follow redirection" feature of your HTTP client to prevent URL redirection issues
  • We send back the X-Remaning-Scrape header with your remaining quota. If it is at 0, your plan allows it, and you have money provisioned on your account, you are performing extra requests
  • If the maximum number of concurrent scrapes or the throttling limit is reached, we return a 429 response with a Retry-After header that tells you the optimal time to retry, expressed in seconds (e.g. Retry-After: 30) - see the sketch after this list for one way to handle it
  • The Link header with rel=log gives you the URL to access the log of your scrape
  • Things that are automatically handled for you:
    • All dates returned by or sent to the API are expressed in UTC
    • UTF-8 encoding / decoding
    • Compression and decompression (br, gzip, deflate)
    • Binary / text format (if binary content is returned, you will receive it base64-encoded)
    • Anti Scraping Protection is automatic and adjusts all settings for you (retrieve / activate sessions, cookies, headers, JS rendering, captcha solving)
    • The Accept header is automatically added with "text/html" by default
    • The Upgrade-Insecure-Requests header is automatically added with 1 by default
    • The Cache-Control header is automatically added with no-cache by default
    • The User-Agent header is automatically added
    • The Content-Encoding of the response is automatically managed
    • Cluster of real browsers
    • Proxies
    • ... well, just send the URL to scrape, it works!
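
Taken together, the points above boil down to a few client-side settings: a read timeout above 150 seconds, redirect following, 429 handling driven by Retry-After, and reading cost and log information from the response headers. Here is a minimal sketch in Python with the requests library; the retry count and the exact shape of the response body are assumptions to check against the API reference:

    # Minimal sketch applying the recommendations above (Python, requests library).
    import time
    import requests

    API_URL = "https://api.scrapfly.io/scrape"
    PARAMS = {
        "key": "__API_KEY__",
        "url": "http://httpbin.org/anything",
        "country": "de",
        "proxy_pool": "public_pool",
    }

    def scrape(max_attempts: int = 3) -> requests.Response:
        for _ in range(max_attempts):
            response = requests.get(
                API_URL,
                params=PARAMS,
                timeout=(5, 160),      # connect / read: keep the read timeout above the 150 s maximum
                allow_redirects=True,  # "follow redirection" (requests enables it by default)
            )
            if response.status_code == 429:
                # Concurrency or throttling limit reached: wait the advised number of seconds.
                time.sleep(int(response.headers.get("Retry-After", "30")))
                continue
            return response
        raise RuntimeError("still throttled after several attempts")

    response = scrape()
    print("API status:", response.status_code)                      # Scrapfly service, not upstream
    print("Billed calls:", response.headers.get("X-Scrapfly-Api-Cost"))
    print("Scrape log:", response.links.get("log", {}).get("url"))  # Link header with rel=log
    body = response.json()  # upstream status_code, headers, cookies and content live in here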

Integration