Scrapfly Extraction API BETA

The Extraction API extracts structured data from any text content (HTML, text, Markdown) using AI, LLMs, and custom parsing instructions.

With the Extraction API, the data parsing possibilities are essentially endless, but here are the most common use cases:

  • Use LLM to ask questions about the content.
  • Use LLM to extract structured data like JSON or CSV and apply data formatting or conversions.
  • Use predefined extraction models for automatic extraction of product, review, real-estate listing, and article data.
  • Use custom extraction templates to parse data exactly as specified.

If you need to combine Web Scraping and Data Extraction, use our Web Scraping API which directly integrates the extraction API.

Quality of Life

  • Multi project/env support through Project Management
  • Server-side cache for repeated extraction requests through the cache parameter.
  • Status page with a notification subscription.
  • Full API transparency through useful meta headers:
    • X-Scrapfly-Api-Cost: API cost billed for the request
    • X-Scrapfly-Remaining-Api-Credit: remaining API credits; if 0, the request is billed as extra credit
    • X-Scrapfly-Account-Concurrent-Usage: current concurrency usage of your account
    • X-Scrapfly-Account-Remaining-Concurrent-Usage: remaining concurrency allowed for your account
    • X-Scrapfly-Project-Concurrent-Usage: concurrency usage of the project
    • X-Scrapfly-Project-Remaining-Concurrent-Usage: remaining project concurrency if a limit is set on the project; otherwise equal to the account's remaining concurrency
    Concurrency limits are based on the subscription tier.
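As a sketch, these meta headers can be read from any HTTP client's response object; the header names below come from the list above, while the response headers themselves are mocked for illustration:

```python
# Sketch: reading Scrapfly meta headers from an API response.
# The headers dict is mocked here; in practice it would come from your
# HTTP client (e.g. requests' response.headers).
headers = {
    "X-Scrapfly-Api-Cost": "5",
    "X-Scrapfly-Remaining-Api-Credit": "99995",
    "X-Scrapfly-Account-Concurrent-Usage": "1",
    "X-Scrapfly-Account-Remaining-Concurrent-Usage": "49",
}

api_cost = int(headers["X-Scrapfly-Api-Cost"])
remaining_credit = int(headers["X-Scrapfly-Remaining-Api-Credit"])

# A remaining credit of 0 means further requests are billed as extra credit.
billed_as_extra = remaining_credit == 0

print(api_cost, remaining_credit, billed_as_extra)
```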

Billing

Scrapfly uses a credit system to bill Extraction API requests.

For the Beta phase, all extraction methods have a fixed cost of 5 API Credits. This cost is temporary and will be adjusted in the future when the API concludes the Beta phase.
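Since every extraction method is billed at the same flat rate during the Beta, estimating a job's cost is simple multiplication, as this small sketch shows:

```python
# During the Beta, every extraction method costs a flat 5 API credits.
BETA_COST_PER_REQUEST = 5

def extraction_cost(n_requests: int) -> int:
    """Total API credits billed for n extraction requests."""
    return n_requests * BETA_COST_PER_REQUEST

# 1000 extraction requests during the Beta:
print(extraction_cost(1000))
```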

See Extraction API Billing for full details.

Errors

Scrapfly uses conventional HTTP response codes to indicate the success or failure of an API request.

Codes in the 2xx range indicate success.

Codes in the 4xx range indicate a request that failed given the information provided (e.g., a required parameter was omitted, the action is not permitted, max concurrency was reached, etc.).

Codes in the 5xx range indicate an error with Scrapfly's servers.


HTTP 422 - Request Failed responses provide extra headers to help you troubleshoot as much as possible:

  • X-Scrapfly-Reject-Code: the error code
  • X-Scrapfly-Reject-Description: URL to the related documentation
  • X-Scrapfly-Reject-Retryable: indicates whether the request is retryable

It is important to handle HTTP client errors properly in order to access the error headers and body. These details contain valuable information for troubleshooting, resolving the issue, or reaching out to support.
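A minimal sketch of such error handling, using a mocked response object (the reject-code value reuses the error code shown later in this document; the documentation URL and the exact format of the retryable flag are assumptions for illustration):

```python
# Sketch of handling an HTTP 422 rejection from the Extraction API.
# MockResponse stands in for a real HTTP client response.
class MockResponse:
    status_code = 422
    headers = {
        "X-Scrapfly-Reject-Code": "ERR::EXTRACTION::CONTENT_TYPE_NOT_SUPPORTED",
        "X-Scrapfly-Reject-Description": "https://scrapfly.io/docs/...",  # placeholder URL
        # Assumed flag format ("yes"/"no"); check the real header value.
        "X-Scrapfly-Reject-Retryable": "no",
    }

def handle_error(response):
    """Return (error_code, retryable) for a 422 rejection, else None."""
    if response.status_code == 422:
        code = response.headers.get("X-Scrapfly-Reject-Code")
        retryable = response.headers.get("X-Scrapfly-Reject-Retryable") == "yes"
        return code, retryable
    return None

print(handle_error(MockResponse()))
```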

HTTP Status Code Summary

200 - OK Everything worked as expected.
400 - Bad Request The request was unacceptable, often due to a missing required parameter, a bad value, or a bad format.
401 - Unauthorized No valid API key provided.
402 - Payment Required A payment issue occurred and needs to be resolved.
403 - Forbidden The API key doesn't have permission to perform the request.
422 - Request Failed The parameters were valid but the request failed.
429 - Too Many Requests All free quota used, max allowed concurrency reached, or the domain throttled.
500, 502, 503 - Server Errors Something went wrong on Scrapfly's end.
504 - Timeout The request timed out.
You can check out the full error list to learn more.

Specification

Scrapfly has loads of features and the best way to discover them is through the specification docs below. For this example, you need the file test.html in your current directory. In our example, test.html contains the content of the page https://web-scraping.dev/product/1.

We will use prompt extraction and ask the LLM to extract the product price in JSON format.
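As a sketch of what such a request could look like (the endpoint path and the parameter names key, content_type, and extraction_prompt are assumptions for illustration; refer to the specification below for the authoritative names), the request URL can be built with the prompt URL-encoded, and test.html sent as the request body:

```python
from urllib.parse import urlencode

# Hypothetical request construction for prompt extraction.
# Endpoint and parameter names are assumptions, not confirmed by this page.
params = {
    "key": "YOUR_API_KEY",
    "content_type": "text/html",
    "extraction_prompt": "extract the product price in JSON format",
}
url = "https://api.scrapfly.io/extraction?" + urlencode(params)

# The document itself would then be sent as the POST body, e.g.:
# requests.post(url, data=open("test.html", "rb"),
#               headers={"Content-Type": "text/html"})
print(url)
```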

The following data formats are supported: html, markdown, text, xml. Any other content type is rejected with ERR::EXTRACTION::CONTENT_TYPE_NOT_SUPPORTED.

Automatic Decompression

You can send compressed documents; gzip, zstd, and deflate are supported. When sending a compressed document, you must announce the compression via the Content-Encoding header. Example: Content-Encoding: gzip
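A minimal sketch of preparing a gzip-compressed document body together with the announcing Content-Encoding header (the upload itself is omitted; only the compression step is shown):

```python
import gzip

# Compress a document before sending it to the API.
html = b"<html><body><p>product page</p></body></html>"
compressed = gzip.compress(html)

# The compression must be announced via the Content-Encoding header:
headers = {"Content-Type": "text/html", "Content-Encoding": "gzip"}

# Round-trip check: the server will decompress back to the original bytes.
assert gzip.decompress(compressed) == html
print(len(html), len(compressed))
```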

Example of usage with curl:
  1. Download a document to parse.
  2. Compress your document; this step creates product.html.gz.
  3. Send the compressed document to the API.

All related errors are listed below. You can see the full description and an example error response in the Errors section.

Summary