Data Extraction BETA
Scrapfly provides a powerful data extraction API that allows you to extract structured data from any web page.
Combined with the cache feature, the raw data from the website is cached, allowing you to re-extract the data with multiple extraction passes at a much faster speed and lower cost. This applies to the following extraction types:
Learn more about the cache feature
If your scrape is unreliable because heavy protection is in place, or takes so long that the extraction times out, you can scrape without extraction and then submit your content to the dedicated Extraction API.
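The two-step flow above can be sketched as follows: scrape once without extraction, store the content, then submit it to the Extraction API. The endpoint path and parameter names below are assumptions for illustration; check the dedicated Extraction API documentation for the exact interface.

```ruby
require "uri"
require "net/http"
require "cgi"

# Stand-in for content you already scraped in step one.
scraped_html = "<html><body><p class='review'>Great product!</p></body></html>"

# Build the Extraction API request (illustrative endpoint and parameters).
params = {
  "key" => "__API_KEY__",
  "url" => "https://web-scraping.dev/reviews",
  "extraction_prompt" => "What is the general sentiment about the product?"
}
query = params.map { |k, v| "#{k}=#{CGI.escape(v)}" }.join("&")
url = URI("https://api.scrapfly.io/extraction?#{query}")

# The request itself is commented out so the sketch stays offline:
# request = Net::HTTP::Post.new(url)
# request["Content-Type"] = "text/html"
# request.body = scraped_html
# response = Net::HTTP.start(url.host, url.port, use_ssl: true) { |http| http.request(request) }
puts url
```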
Extraction Template
The extraction template allows you to define custom extraction rules for retrieving data from a webpage in JSON format. By using these templates, you can extract structured data from HTML, JSON, or XML content with precision.
Key Features:
- Customizable Rules: Define your own extraction rules to tailor data extraction according to your needs.
- Versatile Data Sources: Extract data from various content types including HTML, JSON, and XML.
- Structured Output: Retrieve well-structured data in JSON format, making it easier to process and analyze.
You will find the templates and extraction rules in the Extraction API section.
To use an extraction template with the Web Scraping API, pass the following parameter: extraction_template=ephemeral:base64(json_template)
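Building that parameter value in Ruby looks like this: serialize the template to JSON, base64-encode it, prefix it with ephemeral:, and URL-encode the result. The template body here is a minimal single-selector example, not the full template from the sample below.

```ruby
require "json"
require "base64"
require "cgi"

# A minimal extraction template: one CSS selector pulling the product description.
template = {
  "source" => "html",
  "selectors" => [
    { "name" => "description", "query" => "p.product-description::text", "type" => "css" }
  ]
}

# base64-encode the JSON template, then URL-encode the whole parameter value.
encoded = Base64.strict_encode64(JSON.generate(template))
param = "extraction_template=" + CGI.escape("ephemeral:#{encoded}")
puts param
```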
Example
Scraper API Usage
require "uri"
require "net/http"
url = URI("https://api.scrapfly.io/scrape?tags=player%2Cproject%3Adefault&extraction_template=ephemeral%3AeyJzZWxlY3RvcnMiOlt7Im5hbWUiOiJkZXNjcmlwdGlvbiIsInF1ZXJ5IjoicC5wcm9kdWN0LWRlc2NyaXB0aW9uOjp0ZXh0IiwidHlwZSI6ImNzcyJ9LHsibmFtZSI6InByaWNlX2Jsb2NrIiwibmVzdGVkIjpbeyJleHRyYWN0b3IiOnsibmFtZSI6InByaWNlIn0sImZvcm1hdHRlcnMiOlt7ImFyZ3MiOnsia2V5IjoiY3VycmVuY3kifSwibmFtZSI6InBpY2sifV0sIm5hbWUiOiJwcmljZV9yZWdleCIsIm9wdGlvbnMiOnsiY29udGVudCI6InRleHQiLCJkb3RhbGwiOnRydWUsImlnbm9yZWNhc2UiOnRydWUsIm11bHRpbGluZSI6ZmFsc2V9LCJxdWVyeSI6IihcXCRcXGR7Mn1cXC5cXGR7Mn0pIiwidHlwZSI6InJlZ2V4In1dLCJxdWVyeSI6Ii5wcm9kdWN0LWRhdGEgZGl2LnByaWNlIiwidHlwZSI6ImNzcyJ9LHsibmFtZSI6InByaWNlX2Zyb21faHRtbCIsIm5lc3RlZCI6W3siZm9ybWF0dGVycyI6W3sibmFtZSI6InVwcGVyY2FzZSJ9LHsibmFtZSI6InJlbW92ZV9odG1sIn1dLCJuYW1lIjoicHJpY2VfaHRtbF9yZWdleCIsIm5lc3RlZCI6W3sibXVsdGlwbGUiOnRydWUsIm5hbWUiOiJwcmljZSByZWdleCIsInF1ZXJ5IjoiLisiLCJ0eXBlIjoicmVnZXgifV0sInF1ZXJ5IjoiLisiLCJ0eXBlIjoicmVnZXgifV0sInF1ZXJ5IjoiLnByb2R1Y3QtZGF0YSBkaXYucHJpY2UiLCJ0eXBlIjoiY3NzIn0seyJleHRyYWN0b3IiOnsibmFtZSI6InByaWNlIn0sIm5hbWUiOiJwcmljZSIsInF1ZXJ5Ijoic3Bhbi5wcm9kdWN0LXByaWNlOjp0ZXh0IiwidHlwZSI6ImNzcyJ9LHsiZm9ybWF0dGVycyI6W3sibmFtZSI6ImFic29sdXRlX3VybCJ9LHsibmFtZSI6InVuaXF1ZSJ9XSwibXVsdGlwbGUiOnRydWUsIm5hbWUiOiJwYWdlX2xpbmtzIiwicXVlcnkiOiJcL1wvYVwvQGhyZWYiLCJ0eXBlIjoieHBhdGgifSx7ImZvcm1hdHRlcnMiOlt7Im5hbWUiOiJhYnNvbHV0ZV91cmwifSx7Im5hbWUiOiJ1bmlxdWUifV0sIm11bHRpcGxlIjp0cnVlLCJuYW1lIjoicGFnZV9pbWFnZXMiLCJxdWVyeSI6IlwvXC9pbWdcL0BzcmMiLCJ0eXBlIjoieHBhdGgifSx7Im5hbWUiOiJyZXZpZXdzIiwibmVzdGVkIjpbeyJjYXN0IjoiZmxvYXQiLCJuYW1lIjoicmF0aW5nIiwicXVlcnkiOiJjb3VudChcL1wvc3ZnKSIsInR5cGUiOiJ4cGF0aCJ9LHsiZm9ybWF0dGVycyI6W3siYXJncyI6eyJmb3JtYXQiOiIlZFwvJW1cLyVZIn0sIm5hbWUiOiJkYXRldGltZSJ9XSwibmFtZSI6ImRhdGUiLCJxdWVyeSI6IlwvXC9zcGFuWzFdXC90ZXh0KCkiLCJ0eXBlIjoieHBhdGgifSx7Im5hbWUiOiJ0ZXh0IiwicXVlcnkiOiJcL1wvcFsxXVwvdGV4dCgpIiwidHlwZSI6InhwYXRoIn1dLCJxdWVyeSI6IiNyZXZpZXdzID4gZGl2LnJldmlldyIsInR5cGUiOiJjc3MifV0sInNvdXJjZSI6Imh0bWwifQ&cache=true&asp=true&render_js=true&key=__API_KEY__&url=https%3A%2F%2Fweb-scraping.dev%2Fproduct%2F1")
https = Net::HTTP.new(url.host, url.port);
https.use_ssl = true
request = Net::HTTP::Get.new(url)
response = https.request(request)
puts response.read_body
LLM Extraction
Large Language Model extraction allows you to extract data from web pages using natural language instructions. You can describe the data you want to extract in plain English, and our models will handle the rest. It has the advantage of being both easy to use and very powerful: you can ask for the sentiment of an article, or extract structured data that fulfills a JSON schema.
Benefits
- Ease of Use: No need to worry about technical details; our models manage everything for you.
- Versatile Content Types: Supports various text content types including text/plain, html, xml, markdown, application/json, and rss. We plan to support additional content types like application/pdf in the future.
How It Works
- Provide Natural Language Instructions: pass the API parameter extraction_prompt=my prompt to describe the data you want to extract.
- Get Structured Data: receive well-organized data based on your specifications in result.extracted_data.data. The content type is also available in result.extracted_data.content_type. For simplicity, JSON is not re-encoded (no JSON within JSON), so you can access your data directly.
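Reading those fields from a parsed response body can be sketched as below. The response JSON here is a locally constructed stand-in shaped after the field names above (result.extracted_data.data and result.extracted_data.content_type), not a captured API response.

```ruby
require "json"

# Stand-in for a Scraper API response body with LLM extraction enabled.
body = <<~JSON
  {
    "result": {
      "extracted_data": {
        "content_type": "application/json",
        "data": { "sentiment": "positive" }
      }
    }
  }
JSON

response = JSON.parse(body)
extracted = response["result"]["extracted_data"]
puts extracted["content_type"]      # content type of the extracted payload
puts extracted["data"]["sentiment"] # JSON is not re-encoded, so access it directly
```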
You can learn more about prompts and extraction rules in the Extraction API - LLM Extraction section.
Example
require "uri"
require "net/http"
url = URI("https://api.scrapfly.io/scrape?tags=player%2Cproject%3Adefault&extraction_prompt=In%20this%20document%2C%20what%20is%20the%20general%20sentiment%20about%20the%20product&asp=true&render_js=true&key=__API_KEY__&url=https%3A%2F%2Fweb-scraping.dev%2Freviews")
https = Net::HTTP.new(url.host, url.port);
https.use_ssl = true
request = Net::HTTP::Get.new(url)
response = https.request(request)
puts response.read_body
AI Automatic Extraction
Choose between multiple AI models to extract structured data from web pages. Our AI models are trained to extract data from various content types, including HTML, JSON, XML, and more.
To see all our available models and how to use them, visit the dedicated page: Extraction API - AI Automatic Extraction
Example
require "uri"
require "net/http"
url = URI("https://api.scrapfly.io/scrape?tags=player%2Cproject%3Adefault&extraction_model=product&asp=true&render_js=true&auto_scroll=true&key=__API_KEY__&url=https%3A%2F%2Fweb-scraping.dev%2Fproduct%2F1")
https = Net::HTTP.new(url.host, url.port);
https.use_ssl = true
request = Net::HTTP::Get.new(url)
response = https.request(request)
puts response.read_body
Related Errors
All related errors are listed below. You can see the full description and an example error response in the Errors section.
- ERR::EXTRACTION::CONFIG_ERROR - Parameters sent to the API are not valid
- ERR::EXTRACTION::CONTENT_TYPE_NOT_SUPPORTED - The content type of the response is not supported for extraction.
- ERR::EXTRACTION::DATA_ERROR - Extracted data is invalid or has an issue
- ERR::EXTRACTION::INVALID_RULE - The extraction rule is invalid
- ERR::EXTRACTION::INVALID_TEMPLATE - The template used for extraction is invalid
- ERR::EXTRACTION::NO_CONTENT - Target response is empty
- ERR::EXTRACTION::OPERATION_TIMEOUT - Extraction Operation Timeout
- ERR::EXTRACTION::OUT_OF_CAPACITY - Not able to extract more data; the backend is out of capacity, retry later.
- ERR::EXTRACTION::TEMPLATE_NOT_FOUND - The provided template does not exist
- ERR::EXTRACTION::TIMEOUT - The extraction took too long (maximum 25s) or did not have enough time to complete
Pricing
During the beta period, the Scrapfly Extraction API is billed as follows:
- Extraction Template: 1 API Credit
- Extraction Prompt: 5 API Credits
- Extraction Model: 5 API Credits
This pricing is temporary and will be adjusted when the API is out of Beta.
Each API request returns billing information about the API credits used. The X-Scrapfly-Api-Cost header contains the total amount of API credits used for the request.
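Reading that header in Ruby can be sketched as below. The demo uses a locally constructed response object so the snippet stays offline; in practice you would pass the result of https.request(request) from the examples above.

```ruby
require "net/http"

# Read the billed credits from the X-Scrapfly-Api-Cost response header.
# Net::HTTP header lookup is case-insensitive; the value arrives as a string.
def api_cost(response)
  Integer(response["X-Scrapfly-Api-Cost"] || 0)
end

# Offline demo: a hand-built response standing in for a real API reply.
demo = Net::HTTPOK.new("1.1", "200", "OK")
demo["X-Scrapfly-Api-Cost"] = "5"
puts api_cost(demo)
```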