Scrapfly Extraction API BETA
The Extraction API allows you to extract structured data from any text content, such as HTML, text, or Markdown, using AI, LLMs, and custom parsing instructions.
With the Extraction API the data parsing possibilities are essentially endless, but here are the most common use cases:
- Use LLM to ask questions about the content.
- Use LLM to extract structured data like JSON or CSV and apply data formatting or conversions.
- Use predefined extraction models for automatic extraction of product, review, real-estate listing and article data.
- Use custom extraction templates to parse data exactly as specified.
If you need to combine Web Scraping and Data Extraction, use our Web Scraping API, which directly integrates with the Extraction API.
On Steroids
- Three different extraction methods available: LLM prompt, predefined AI extraction models, and custom extraction templates.
- Automatically prepare content for extraction.
- Data quality metrics are available (for the predefined AI extraction models).
- Automatic decompression
Quality of Life
- Multi project/env support through Project Management
- Server-side cache for repeated extraction requests through the cache parameter.
- Status page with a notification subscription.
- Full API transparency through useful meta headers (see the sketch after this list):
- X-Scrapfly-Api-Cost: API cost billed
- X-Scrapfly-Remaining-Api-Credit: Remaining API credit; if 0, usage is billed as extra credit
- X-Scrapfly-Account-Concurrent-Usage: Current concurrency usage of your account
- X-Scrapfly-Account-Remaining-Concurrent-Usage: Maximum concurrency allowed by the account
- X-Scrapfly-Project-Concurrent-Usage: Concurrency usage of the project
- X-Scrapfly-Project-Remaining-Concurrent-Usage: Remaining project concurrency if a concurrency limit is set on the project, otherwise equal to the account concurrency
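A minimal sketch of inspecting these meta headers with curl is shown below. The /extraction endpoint and the key, url, and extraction_prompt query parameters are assumptions for illustration; check the specification below for the exact names.

```bash
# A minimal sketch, assuming the /extraction endpoint and the key, url
# and extraction_prompt query parameters. -D - dumps the response
# headers to stdout so the X-Scrapfly-* meta headers can be inspected.
curl -s -D - -o response.json -X POST \
  "https://api.scrapfly.io/extraction?key=__API_KEY__&url=https%3A%2F%2Fweb-scraping.dev%2Fproduct%2F1&extraction_prompt=extract%20the%20product%20price" \
  -H "Content-Type: text/html" \
  --data-binary @test.html \
  | grep -i '^x-scrapfly-'
```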
Billing
Scrapfly uses a credit system to bill Extraction API requests.
For the Beta phase, all extraction methods have a fixed cost of 5 API Credits. This cost is temporary and will be adjusted in the future when the API concludes the Beta phase.
Errors
Scrapfly uses conventional HTTP response codes to indicate the success or failure of an API request.
Codes in the 2xx range indicate success.
Codes in the 4xx range indicate an error caused by the information provided (e.g., a required parameter was omitted, the operation was not permitted, max concurrency was reached, etc.).
Codes in the 5xx range indicate an error with Scrapfly's servers.
HTTP 422 - Request Failed responses provide extra headers to help as much as possible:
- X-Scrapfly-Reject-Code: Error Code
- X-Scrapfly-Reject-Description: URL to the related documentation
- X-Scrapfly-Reject-Retryable: Indicates whether the request is retryable
It is important to properly handle HTTP client errors in order to access the error headers and body. These details contain valuable information for troubleshooting, resolving the issue, or reaching out to support.
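As a sketch of that pattern with curl (assuming the same endpoint and parameters as in the examples on this page), the response headers can be saved alongside the body and inspected when the request fails:

```bash
# A minimal sketch of client-side error handling. The endpoint and
# query parameters are assumptions; -w "%{http_code}" captures the
# status while headers and body are saved for troubleshooting.
status=$(curl -s -D headers.txt -o body.json -w "%{http_code}" -X POST \
  "https://api.scrapfly.io/extraction?key=__API_KEY__&url=https%3A%2F%2Fweb-scraping.dev%2Fproduct%2F1" \
  -H "Content-Type: text/html" \
  --data-binary @test.html)

if [ "$status" -ge 400 ]; then
  echo "Request failed with HTTP $status"
  grep -i '^x-scrapfly-reject-' headers.txt   # error code, documentation URL, retryable flag
  cat body.json                               # error body with troubleshooting details
fi
```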
HTTP Status Code Summary
| Status Code | Description |
|---|---|
| 200 - OK | Everything worked as expected. |
| 400 - Bad Request | The request was unacceptable, often due to a missing required parameter, a bad value, or a bad format. |
| 401 - Unauthorized | No valid API key provided. |
| 402 - Payment Required | A payment issue occurred and needs to be resolved. |
| 403 - Forbidden | The API key doesn't have permission to perform the request. |
| 422 - Request Failed | The parameters were valid but the request failed. |
| 429 - Too Many Requests | All free quota used, max allowed concurrency reached, or domain throttled. |
| 500, 502, 503 - Server Errors | Something went wrong on Scrapfly's end. |
| 504 - Timeout | The request has timed out. |

You can check out the full error list to learn more.
Specification
Scrapfly has loads of features and the best way to discover them is through the specification docs below.
For this example, you need to have the file test.html in your current directory. In our example, test.html contains the content of the page https://web-scraping.dev/product/1. We will use prompt extraction and ask to extract the product price in JSON format, as shown below.
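A minimal sketch of that prompt extraction call with curl follows; the /extraction endpoint and the key, url, and extraction_prompt query parameters are assumptions for illustration, so check the specification for the exact names.

```bash
# A minimal sketch, assuming the /extraction endpoint and the key, url
# and extraction_prompt query parameters. test.html holds the saved
# content of https://web-scraping.dev/product/1.
curl -X POST \
  "https://api.scrapfly.io/extraction?key=__API_KEY__&url=https%3A%2F%2Fweb-scraping.dev%2Fproduct%2F1&extraction_prompt=extract%20the%20product%20price%20in%20JSON%20format" \
  -H "Content-Type: text/html" \
  --data-binary @test.html
```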
The following data formats are supported: html, markdown, text, xml. Sending an unsupported content type results in the ERR::EXTRACTION::CONTENT_TYPE_NOT_SUPPORTED error.
Automatic Decompression
You can send a compressed document; we support gzip, zstd, and deflate.
When sending a compressed document, you must announce it via the Content-Encoding header. Example:
Content-Encoding: gzip
Example of usage with curl
- Download a document to parse
- Compress your document to product.html.gz
- Send the compressed document to the API (see the sketch below)
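Putting the three steps together, a minimal sketch with curl (same assumed endpoint and query parameters as in the previous examples) could look like this:

```bash
# 1. Download a document to parse
curl -s "https://web-scraping.dev/product/1" -o product.html
# 2. Compress it, producing product.html.gz
gzip product.html
# 3. Send the compressed document, announcing the encoding
curl -X POST \
  "https://api.scrapfly.io/extraction?key=__API_KEY__&url=https%3A%2F%2Fweb-scraping.dev%2Fproduct%2F1&extraction_prompt=extract%20the%20product%20price%20in%20JSON%20format" \
  -H "Content-Type: text/html" \
  -H "Content-Encoding: gzip" \
  --data-binary @product.html.gz
```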
Related Errors
All related errors are listed below. You can see the full description and an example error response in the Errors section.
- ERR::EXTRACTION::CONFIG_ERROR - Parameters sent to the API are not valid
- ERR::EXTRACTION::CONTENT_TYPE_NOT_SUPPORTED - The content type of the response is not supported for extraction.
- ERR::EXTRACTION::DATA_ERROR - Extracted data is invalid or has an issue
- ERR::EXTRACTION::INVALID_RULE - The extraction rule is invalid
- ERR::EXTRACTION::INVALID_TEMPLATE - The template used for extraction is invalid
- ERR::EXTRACTION::NO_CONTENT - Target response is empty
- ERR::EXTRACTION::OPERATION_TIMEOUT - Extraction Operation Timeout
- ERR::EXTRACTION::OUT_OF_CAPACITY - Unable to extract more data; the backend is out of capacity, retry later.
- ERR::EXTRACTION::TEMPLATE_NOT_FOUND - The provided template does not exist
- ERR::EXTRACTION::TIMEOUT - The extraction took too long (maximum 25s) or did not have enough time to complete