Scrapfly Extraction API BETA
The Extraction API allows you to extract structured data from any text content, such as HTML, text, or Markdown, using AI, LLMs, and custom parsing instructions.
With the Extraction API the data parsing possibilities are essentially endless, but here are the most common use cases:
- Use LLM to ask questions about the content.
- Use LLM to extract structured data like JSON or CSV and apply data formatting or conversions.
- Use predefined extraction models for automatic extraction of product, review, real-estate listing and article data.
- Use custom extraction templates to parse data exactly as specified.
If you need to combine Web Scraping and Data Extraction, use our Web Scraping API, which directly integrates with the Extraction API.
On Steroids
- Three different extraction methods available: LLM prompt, predefined AI extraction models, and custom extraction templates.
- Automatically prepare content for extraction.
- Data quality metrics are available (for the predefined AI extraction models).
- Automatic decompression
Quality of Life
- Multi project/env support through Project Management
- Server-side cache for repeated extraction requests through the cache parameter.
- Status page with a notification subscription.
- Full API transparency through useful meta headers (see the sketch after this list):
- X-Scrapfly-Api-Cost: API cost billed
- X-Scrapfly-Remaining-Api-Credit: Remaining API credit; if 0, usage is billed as extra credit
- X-Scrapfly-Account-Concurrent-Usage: Current concurrency usage of your account
- X-Scrapfly-Account-Remaining-Concurrent-Usage: Maximum concurrency allowed by the account
- X-Scrapfly-Project-Concurrent-Usage: Concurrency usage of the project
- X-Scrapfly-Project-Remaining-Concurrent-Usage: Remaining project concurrency if a concurrency limit is set on the project, otherwise equal to the account concurrency
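A minimal sketch of inspecting these meta headers with curl is shown below. The /extraction endpoint and the key, url, and extraction_prompt query parameters are assumptions for illustration; check the specification below for the exact names.

```bash
# A minimal sketch, assuming the /extraction endpoint and the key, url
# and extraction_prompt query parameters. -D - dumps the response
# headers to stdout so the X-Scrapfly-* meta headers can be inspected.
curl -s -D - -o response.json -X POST \
  "https://api.scrapfly.io/extraction?key=__API_KEY__&url=https%3A%2F%2Fweb-scraping.dev%2Fproduct%2F1&extraction_prompt=extract%20the%20product%20price" \
  -H "Content-Type: text/html" \
  --data-binary @test.html \
  | grep -i '^x-scrapfly-'
```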
Billing
Scrapfly uses a credit system to bill Extraction API requests.
For the Beta phase, all extraction methods have a fixed cost of 5 API Credits. This cost is temporary and will be adjusted in the future when the API concludes the Beta phase.
Errors
Scrapfly uses conventional HTTP response codes to indicate the success or failure of an API request.
Codes in the 2xx range indicate success.
Codes in the 4xx range indicate an error caused by the information provided (e.g., a required parameter was omitted, the operation was not permitted, max concurrency was reached, etc.).
Codes in the 5xx range indicate an error with Scrapfly's servers.
HTTP 422 - Request Failed responses provide extra headers to help as much as possible:
- X-Scrapfly-Reject-Code: Error Code
- X-Scrapfly-Reject-Description: URL to the related documentation
- X-Scrapfly-Reject-Retryable: Indicates whether the request is retryable
It is important to properly handle HTTP client errors in order to access the error headers and body. These details contain valuable information for troubleshooting, resolving the issue, or reaching out to support.
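As a sketch of that pattern with curl (assuming the same endpoint and parameters as in the examples on this page), the response headers can be saved alongside the body and inspected when the request fails:

```bash
# A minimal sketch of client-side error handling. The endpoint and
# query parameters are assumptions; -w "%{http_code}" captures the
# status while headers and body are saved for troubleshooting.
status=$(curl -s -D headers.txt -o body.json -w "%{http_code}" -X POST \
  "https://api.scrapfly.io/extraction?key=__API_KEY__&url=https%3A%2F%2Fweb-scraping.dev%2Fproduct%2F1" \
  -H "Content-Type: text/html" \
  --data-binary @test.html)

if [ "$status" -ge 400 ]; then
  echo "Request failed with HTTP $status"
  grep -i '^x-scrapfly-reject-' headers.txt   # error code, documentation URL, retryable flag
  cat body.json                               # error body with troubleshooting details
fi
```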
HTTP Status Code Summary
| Status Code | Description |
|---|---|
| 200 - OK | Everything worked as expected. |
| 400 - Bad Request | The request was unacceptable, often due to a missing required parameter, a bad value, or a bad format. |
| 401 - Unauthorized | No valid API key provided. |
| 402 - Payment Required | A payment issue occurred and needs to be resolved. |
| 403 - Forbidden | The API key doesn't have permission to perform the request. |
| 422 - Request Failed | The parameters were valid but the request failed. |
| 429 - Too Many Requests | All free quota used, max allowed concurrency reached, or domain throttled. |
| 500, 502, 503 - Server Errors | Something went wrong on Scrapfly's end. |
| 504 - Timeout | The request has timed out. |

You can check out the full error list to learn more.
Specification
Scrapfly has loads of features and the best way to discover them is through the specification docs below.
For this example, you need to have the file test.html in your current directory. In our example, test.html contains the content of the page https://web-scraping.dev/product/1. We will use prompt extraction and ask to extract the product price in JSON format, as shown below.
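A minimal sketch of that prompt extraction call with curl follows; the /extraction endpoint and the key, url, and extraction_prompt query parameters are assumptions for illustration, so check the specification for the exact names.

```bash
# A minimal sketch, assuming the /extraction endpoint and the key, url
# and extraction_prompt query parameters. test.html holds the saved
# content of https://web-scraping.dev/product/1.
curl -X POST \
  "https://api.scrapfly.io/extraction?key=__API_KEY__&url=https%3A%2F%2Fweb-scraping.dev%2Fproduct%2F1&extraction_prompt=extract%20the%20product%20price%20in%20JSON%20format" \
  -H "Content-Type: text/html" \
  --data-binary @test.html
```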
The following data formats are supported: html, markdown, text, xml. Sending an unsupported content type results in the ERR::EXTRACTION::CONTENT_TYPE_NOT_SUPPORTED error.
Automatic Decompression
You can send a compressed document; we support gzip, zstd, and deflate.
When sending a compressed document, you must announce it via the Content-Encoding header. Example:
Content-Encoding: gzip
Example of usage with curl
- Download a document to parse
- Compress your document to product.html.gz
- Send the compressed document to the API (see the sketch below)
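Putting the three steps together, a minimal sketch with curl (same assumed endpoint and query parameters as in the previous examples) could look like this:

```bash
# 1. Download a document to parse
curl -s "https://web-scraping.dev/product/1" -o product.html
# 2. Compress it, producing product.html.gz
gzip product.html
# 3. Send the compressed document, announcing the encoding
curl -X POST \
  "https://api.scrapfly.io/extraction?key=__API_KEY__&url=https%3A%2F%2Fweb-scraping.dev%2Fproduct%2F1&extraction_prompt=extract%20the%20product%20price%20in%20JSON%20format" \
  -H "Content-Type: text/html" \
  -H "Content-Encoding: gzip" \
  --data-binary @product.html.gz
```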
Related Errors
All related errors are listed below. You can see the full description and an example error response in the Errors section.
- ERR::EXTRACTION::CONFIG_ERROR - Parameters sent to the API are not valid
- ERR::EXTRACTION::CONTENT_TYPE_NOT_SUPPORTED - The content type of the response is not supported for extraction.
- ERR::EXTRACTION::DATA_ERROR - Extracted data is invalid or has an issue
- ERR::EXTRACTION::INVALID_RULE - The extraction rule is invalid
- ERR::EXTRACTION::INVALID_TEMPLATE - The template used for extraction is invalid
- ERR::EXTRACTION::NO_CONTENT - Target response is empty
- ERR::EXTRACTION::OPERATION_TIMEOUT - Extraction Operation Timeout
- ERR::EXTRACTION::OUT_OF_CAPACITY - Unable to extract more data; the backend is out of capacity, retry later.
- ERR::EXTRACTION::TEMPLATE_NOT_FOUND - The provided template does not exist
- ERR::EXTRACTION::TIMEOUT - The extraction took too long (maximum 25s) or did not have enough time to complete