Getting Started with Scrapfly

Discover how to use the Scrapfly API: the basics, available parameters and features, error handling, and other information related to API use.

On Steroids

  • Smart defaults - scrape without being blocked. Scrapfly pre-configures the user-agent and other request headers.
  • Anti Scraping Protection feature bypasses all anti-scraping systems.
  • By default, the API responds in JSON. A more efficient msgpack format is also available by setting the accept: application/msgpack header.
  • Text content is returned as utf-8 while binary content is base64-encoded, so you can scrape any kind of data (pdf, zip, etc.).
  • Gzip compression is available through content-encoding: gzip header.
  • Ability to debug and replay scrape requests from the dashboard log page and API.
  • Large payload handling: large text responses (greater than 5MB) are called "CLOB" (Character Large Object) and large binary responses are called "BLOB" (Binary Large Object). Both can be downloaded separately with streaming support.

Quality of Life

  • All scrape requests and metadata are automatically tracked on a Web Dashboard
  • Multi project/scraper support through Project Management
  • Experiment with the Visual API playground
  • Status page with notification subscription.
  • Full API transparency through useful meta headers:
    • X-Scrapfly-Api-Cost: API cost billed
    • X-Scrapfly-Remaining-Api-Credit: Remaining API credit; if 0, requests are billed in extra credit
    • X-Scrapfly-Account-Concurrent-Usage: Current concurrency usage of your account
    • X-Scrapfly-Account-Remaining-Concurrent-Usage: Maximum concurrency allowed by the account
    • X-Scrapfly-Project-Concurrent-Usage: Concurrency usage of the project
    • X-Scrapfly-Project-Remaining-Concurrent-Usage: The project's concurrency limit if one is set on the project, otherwise equal to the account concurrency
    Concurrency is defined by your subscription.
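
As a rough sketch of how these headers might be read on the client side, the Python snippet below collects them from a response-headers mapping. The helper name read_usage_headers is hypothetical, and the conversion to integers is an assumption: the list above does not state the value format.

```python
# Header names come from the list above; everything else is illustrative.
USAGE_HEADERS = (
    "X-Scrapfly-Api-Cost",
    "X-Scrapfly-Remaining-Api-Credit",
    "X-Scrapfly-Account-Concurrent-Usage",
    "X-Scrapfly-Account-Remaining-Concurrent-Usage",
    "X-Scrapfly-Project-Concurrent-Usage",
    "X-Scrapfly-Project-Remaining-Concurrent-Usage",
)


def read_usage_headers(headers):
    """Return the Scrapfly usage headers present in `headers`.

    Lookup is case-insensitive because HTTP header names are case-insensitive.
    ASSUMPTION: values are integers; anything else is kept as the raw string.
    """
    lowered = {name.lower(): value for name, value in headers.items()}
    usage = {}
    for name in USAGE_HEADERS:
        value = lowered.get(name.lower())
        if value is None:
            continue  # header not sent with this response
        try:
            usage[name] = int(value)
        except ValueError:
            usage[name] = value
    return usage
```

Logging the returned mapping after each request is one simple way to watch credit and concurrency usage over time.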

Billing

Scrapfly uses a credit system to bill scrape API requests where each scrape request has a variable cost based on:

  • Enabled scrape features and options (browser rendering, blocking bypass etc.).
  • Response body type (binary vs text results).
  • The ASP feature can override scrape config details to bypass blocking, which can alter the overall cost.

For more information see scrape API billing breakdown.

Billing is reported in every scrape response and the monitoring dashboard and can be controlled through Scrapfly budget settings. For more see Web Scraper Billing.

Handling Large Objects

Large objects (CLOB for text, BLOB for binary) are offloaded from the API response to prevent CPU/RAM issues with your JSON/msgpack decoder and to increase the efficiency of your scrapers.

Instead of the actual content in response.result.content, you get a URL to download the large object. The URL is valid until the log expires.

  • response.result.format indicates whether the content is a large object: the value is blob or clob.
  • response.result.content contains the URL to download the content. This URL needs to be authenticated with your API key (it must be the API key that belongs to the project/env).
  • A BLOB is not base64-encoded like the binary format: you retrieve the binary data directly, and the Content-Type header announces the actual type.
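
The rules above can be sketched in Python as follows. This is a minimal illustration, not the official client: the function names are hypothetical, the "binary" format value is inferred from the text above, and passing the API key as a key query parameter is an assumption to verify against the official documentation.

```python
import base64
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

LARGE_OBJECT_FORMATS = {"clob", "blob"}


def is_large_object(result):
    """True when result.format marks the content as offloaded (clob or blob)."""
    return str(result.get("format", "")).lower() in LARGE_OBJECT_FORMATS


def inline_content(result):
    """Return inline content: bytes for base64-encoded binary, str otherwise.

    ASSUMPTION: binary results carry the format value "binary".
    """
    if is_large_object(result):
        raise ValueError("content was offloaded; download it from the URL instead")
    if result.get("format") == "binary":
        return base64.b64decode(result["content"])
    return result["content"]


def large_object_url(result, api_key):
    """Return the download URL from result.content with the API key attached.

    ASSUMPTION: the key is sent as a `key` query parameter.
    """
    if not is_large_object(result):
        raise ValueError("result.content holds inline content, not a download URL")
    parts = urlsplit(result["content"])
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("key", api_key))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

Checking is_large_object first keeps the decoder path simple: inline results are decoded in memory, while offloaded ones can be streamed to disk from the returned URL.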

Errors

Scrapfly uses conventional HTTP response codes to indicate the success or failure of an API request.

Codes in the 2xx range indicate success.

Codes in the 4xx range indicate a request that failed given the information provided (e.g., a required parameter was omitted, the request was not permitted, max concurrency was reached, etc.).

Codes in the 5xx range indicate an error with Scrapfly's servers.


HTTP 422 - Request Failed responses provide extra headers to help with troubleshooting:

  • X-Scrapfly-Reject-Code: Error Code
  • X-Scrapfly-Reject-Description: URL to the related documentation
  • X-Scrapfly-Reject-Retryable: Indicates whether the scrape is retryable

It is important to properly handle HTTP client errors in order to access the error headers and body. These details contain valuable information for troubleshooting, resolving the issue, or contacting support.
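
A minimal Python sketch of reading these headers from a failed response is shown below. The helper name parse_reject_headers is hypothetical, and the accepted spellings of the retryable flag ("yes", "true", "1") are an assumption, since the exact header value is not stated here.

```python
def parse_reject_headers(headers):
    """Extract the X-Scrapfly-Reject-* headers from a headers mapping.

    Lookup is case-insensitive because HTTP header names are case-insensitive.
    """
    lowered = {name.lower(): value for name, value in headers.items()}
    retryable_raw = lowered.get("x-scrapfly-reject-retryable", "")
    return {
        "code": lowered.get("x-scrapfly-reject-code"),
        "description": lowered.get("x-scrapfly-reject-description"),
        # ASSUMPTION: a truthy flag is sent as "yes", "true" or "1".
        "retryable": str(retryable_raw).strip().lower() in {"yes", "true", "1"},
    }
```

With Python's urllib, a 4xx or 5xx response raises urllib.error.HTTPError; its headers attribute and read() method give access to the error headers and body, so the exception should be caught rather than allowed to propagate.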

HTTP Status Code Summary

200 - OK: Everything worked as expected.
400 - Bad Request: The request was unacceptable, often due to a missing required parameter, a bad value, or a bad format.
401 - Unauthorized: No valid API key provided.
402 - Payment Required: A payment issue occurred and needs to be resolved.
403 - Forbidden: The API key doesn't have permission to perform the request.
422 - Request Failed: The parameters were valid but the request failed.
429 - Too Many Requests: Free quota exhausted, maximum allowed concurrency reached, or domain throttled.
500, 502, 503 - Server Errors: Something went wrong on Scrapfly's end.
504 - Timeout: The scrape timed out.

You can check out the full error list to learn more.
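
For logging or alerting, the summary above can be turned into a small lookup. The sketch below is illustrative only: the function name describe_status is hypothetical, and the fallback labels for unlisted codes simply follow the 2xx/4xx/5xx ranges described earlier.

```python
# Labels copied from the status code summary above.
STATUS_SUMMARY = {
    200: "OK",
    400: "Bad Request",
    401: "Unauthorized",
    402: "Payment Required",
    403: "Forbidden",
    422: "Request Failed",
    429: "Too Many Requests",
    500: "Server Error",
    502: "Server Error",
    503: "Server Error",
    504: "Timeout",
}


def describe_status(status):
    """Return the label for a status code, falling back to its range."""
    if status in STATUS_SUMMARY:
        return STATUS_SUMMARY[status]
    if 200 <= status < 300:
        return "Success"
    if 400 <= status < 500:
        return "Client Error"
    if 500 <= status < 600:
        return "Server Error"
    return "Unknown"
```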

Specification

Scrapfly has loads of features and the best way to discover them is through the specification docs below.

If you have any questions, check the Frequently Asked Questions section or use the support chat.

By default, the API has a read timeout of 155 seconds. To avoid read timeout errors, you must configure your HTTP client to set the read timeout to 155 seconds. If you need a different timeout value, please refer to the documentation for information on how to control the timeout.
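
A minimal sketch of a client configured with this timeout, using only the Python standard library, might look like the following. The endpoint URL and the key and url query parameter names are assumptions that are not stated in this section, so confirm them against the specification before use.

```python
import urllib.request
from urllib.parse import urlencode

# ASSUMPTION: endpoint and parameter names; verify against the specification.
API_ENDPOINT = "https://api.scrapfly.io/scrape"

# Matches the API's default read timeout described above.
READ_TIMEOUT_SECONDS = 155


def build_scrape_url(api_key, target_url):
    """Build the request URL, percent-encoding the target URL."""
    return API_ENDPOINT + "?" + urlencode({"key": api_key, "url": target_url})


def scrape(api_key, target_url, timeout=READ_TIMEOUT_SECONDS):
    """Send the scrape request and return (status, headers, raw body).

    urllib applies `timeout` to every blocking socket operation, so it
    covers both the connection attempt and each read.
    """
    url = build_scrape_url(api_key, target_url)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status, dict(response.headers.items()), response.read()
```

Keeping the timeout in a named constant makes it easy to raise in step with any custom timeout configured on the API side.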

Try out the API directly in your terminal using curl:

Want to try out the API without coding? Check out our visual API player and test/generate code to use our API.

The default response format is JSON, and the scraped content is available in result.content. Your scrape configuration is present in config, and other activated feature information is available in context. To get the HTML page directly, refer to the proxified_response parameter.
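
To illustrate the layout just described, the Python sketch below pulls the three parts out of a JSON response body. The helper name split_scrape_response is hypothetical, and only the result.content, config, and context fields named above are assumed to exist.

```python
import json


def split_scrape_response(body):
    """Split a JSON API response into scraped content, config and context.

    `body` is the raw JSON text (str or bytes) returned by the API.
    Missing sections fall back to None or an empty dict instead of raising.
    """
    payload = json.loads(body)
    return {
        "content": payload.get("result", {}).get("content"),
        "config": payload.get("config", {}),
        "context": payload.get("context", {}),
    }
```

For example, split_scrape_response(raw_body)["content"] gives the scraped page, while the config and context entries are useful for checking which options and features were actually applied.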

Summary