Extraction Rules and Template
Extraction templates let you define custom extraction rules for parsing data from HTML, XML and JSON documents and generating structured JSON output. In other words, a JSON definition tells the API where to find the data, how to extract it and how to format it.
This tool suits developers who are familiar with data parsing tools like CSS selectors, XPath or JMESPath: it provides full control over the extraction while still saving a lot of time by handling the most complex parts of data parsing.
Key Features:
- Customizable Rules: Define your own extraction rules to tailor data extraction according to your needs.
- Versatile Data Sources: Extract data from various content types including HTML, JSON, and XML.
- Structured Output: Retrieve well-structured data in JSON format, making it easier to process and analyze.
Template Specification
Structure of the extraction template
Schema object
- source (string): The source of the expected data, either 'html' or 'json'. Default Value: html
- name (string): Name of the template, only used for persistent templates. Default Value: ephemeral
- version (string): The version of the document, only used for persistent templates. Default Value: @latest
- match (object): Match the template for the given URL.
  - domain (string): Match against the current domain.
  - path (string): Match against the current path.
- selectors (array), each item an object:
  - name (string): The name of the data field to extract.
  - type (string): The type of selector to use.
  - options (object): Selector options.
  - cast (string): Cast the value to the given type.
  - multiple (boolean): When multiple is true, capture all matched content.
  - query (string): The query string for the selector.
  - formatters (array), each item an object:
    - name (string): The type of formatter to apply.
    - args (object): Optional arguments for the formatter.
  - extractor (object):
    - name (string): The extractor to apply. Extractors are executed before formatters.
    - args (object): Optional arguments for the extractor.
  - nested (array): Nested selectors for each item in the list. Items #/properties/selectors/items (Recursion)
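As a rough illustration of the specification above, a template can be assembled as a plain Python dictionary and serialized to JSON. The helper names make_selector and make_template are hypothetical conveniences, not part of any SDK, and the key check only mirrors the optional keys listed above; it is not the API's own validation.

```python
import json

# Optional selector keys listed in the specification above.
OPTIONAL_SELECTOR_KEYS = {"options", "cast", "multiple", "formatters", "extractor", "nested"}


def make_selector(name, selector_type, query, **optional):
    """Build one selector item (hypothetical helper, not an official API)."""
    unknown = set(optional) - OPTIONAL_SELECTOR_KEYS
    if unknown:
        raise ValueError(f"unknown selector keys: {sorted(unknown)}")
    return {"name": name, "type": selector_type, "query": query, **optional}


def make_template(selectors, source="html"):
    """Wrap the selectors in the two root keys: source and selectors."""
    return {"source": source, "selectors": list(selectors)}


template = make_template([
    make_selector("description", "css", "p.product-description::text"),
    make_selector(
        "reviews", "css", "#reviews > div.review",
        # Nested selectors reuse the same item shape (the recursion in the schema).
        nested=[make_selector("text", "xpath", "//p[1]/text()")],
    ),
])

print(json.dumps(template, indent=2))
```

Catching a misspelled key locally is only a convenience; the API remains the authority on what a valid template is.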
JSON schema Validation
Template Example and Output
This example template extracts various data from the playground website https://web-scraping.dev/product/1 to illustrate all available features.
Extraction Template
Extracted Data
Usage
- Prepare your content
  For the examples below we will use HTML data from https://web-scraping.dev/product/1. To follow along, save its contents to the current directory as product.html.
- Create your extraction template
  The template consists of 2 primary root keys: source, which indicates the parsed content type (usually html), and selectors, which is an array defining all of the extraction instructions. Here's an example:
  Then we will send this template in base64 format. To send base64 url-encoded data, you can take a look at our base64 tool.
- Call the API
  Call the extraction endpoint with the extraction_template parameter set to the base64 encoded template with the ephemeral: prefix:
  Command Explanation
  This command uses curl to send a POST request to an API endpoint with specified headers and data.
  Components
  - curl -X POST: curl is a command-line tool for transferring data with URLs. -X POST specifies the HTTP method to be used, which is POST in this case.
  - -H "content-type: text/html": -H is used to specify an HTTP header for the request. "content-type: text/html" sets the Content-Type header to text/html, indicating that the data being sent is HTML.
  - URL: The URL of the API endpoint being accessed, including query parameters for authentication and specifying the target URL and extraction template.
    - key: An API key for authentication.
    - url: The URL of the web page to be scraped, URL-encoded.
    - extraction_template: A base64-encoded string representing the extraction template.
  - -d @product.html: -d is used to specify the data to be sent in the POST request body. @product.html indicates that the data should be read from a file named product.html.
- Retrieve the Results
  The API will return the extracted data in JSON format.
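The steps above can be sketched in Python using only the standard library. This is a sketch under stated assumptions: EXTRACTION_ENDPOINT is a placeholder rather than the real endpoint, the inline HTML stands in for the contents of product.html, and keeping the base64 padding is a guess about what the API accepts. The request is built but deliberately not sent.

```python
import base64
import json
from urllib.parse import urlencode
from urllib.request import Request

# Assumptions: placeholder endpoint and key, to be replaced with real values.
EXTRACTION_ENDPOINT = "https://api.example.com/extraction"
API_KEY = "__API_KEY__"

template = {
    "source": "html",
    "selectors": [
        {"name": "description", "type": "css", "query": "p.product-description::text"}
    ],
}


def encode_template(tpl):
    """Serialize to compact JSON, URL-safe base64 encode, add the ephemeral: prefix."""
    raw = json.dumps(tpl, separators=(",", ":")).encode("utf-8")
    return "ephemeral:" + base64.urlsafe_b64encode(raw).decode("ascii")


def build_request_url(tpl, page_url):
    """Assemble the key, url and extraction_template query parameters."""
    params = {
        "key": API_KEY,
        "url": page_url,
        "extraction_template": encode_template(tpl),
    }
    return EXTRACTION_ENDPOINT + "?" + urlencode(params)


request_url = build_request_url(template, "https://web-scraping.dev/product/1")

# Stand-in for the bytes of product.html (the role of -d @product.html in the curl call).
html_body = b"<html><body><p class='product-description'>Example</p></body></html>"
request = Request(
    request_url,
    data=html_body,
    headers={"content-type": "text/html"},
    method="POST",
)

print(request.get_method(), request_url)
# urllib.request.urlopen(request) would send it once a real endpoint and key are set.
```

urlencode percent-encodes the colon after ephemeral and the target URL, so the values need no manual escaping.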
Web Scraping API
Consult the full documentation about extraction with the Web Scraping API
When using extraction through the Web Scraping API, you will scrape the content, and then extract the data from it:
require "uri"
require "net/http"
url = URI("https://api.scrapfly.io/scrape?tags=player%2Cproject%3Adefault&extraction_template=ephemeral%3AeyJzZWxlY3RvcnMiOlt7Im5hbWUiOiJkZXNjcmlwdGlvbiIsInF1ZXJ5IjoicC5wcm9kdWN0LWRlc2NyaXB0aW9uOjp0ZXh0IiwidHlwZSI6ImNzcyJ9LHsibmFtZSI6InByaWNlX2Jsb2NrIiwibmVzdGVkIjpbeyJleHRyYWN0b3IiOnsibmFtZSI6InByaWNlIn0sImZvcm1hdHRlcnMiOlt7ImFyZ3MiOnsia2V5IjoiY3VycmVuY3kifSwibmFtZSI6InBpY2sifV0sIm5hbWUiOiJwcmljZV9yZWdleCIsIm9wdGlvbnMiOnsiY29udGVudCI6InRleHQiLCJkb3RhbGwiOnRydWUsImlnbm9yZWNhc2UiOnRydWUsIm11bHRpbGluZSI6ZmFsc2V9LCJxdWVyeSI6IihcXCRcXGR7Mn1cXC5cXGR7Mn0pIiwidHlwZSI6InJlZ2V4In1dLCJxdWVyeSI6Ii5wcm9kdWN0LWRhdGEgZGl2LnByaWNlIiwidHlwZSI6ImNzcyJ9LHsibmFtZSI6InByaWNlX2Zyb21faHRtbCIsIm5lc3RlZCI6W3siZm9ybWF0dGVycyI6W3sibmFtZSI6InVwcGVyY2FzZSJ9LHsibmFtZSI6InJlbW92ZV9odG1sIn1dLCJuYW1lIjoicHJpY2VfaHRtbF9yZWdleCIsIm5lc3RlZCI6W3sibXVsdGlwbGUiOnRydWUsIm5hbWUiOiJwcmljZSByZWdleCIsInF1ZXJ5IjoiLisiLCJ0eXBlIjoicmVnZXgifV0sInF1ZXJ5IjoiLisiLCJ0eXBlIjoicmVnZXgifV0sInF1ZXJ5IjoiLnByb2R1Y3QtZGF0YSBkaXYucHJpY2UiLCJ0eXBlIjoiY3NzIn0seyJleHRyYWN0b3IiOnsibmFtZSI6InByaWNlIn0sIm5hbWUiOiJwcmljZSIsInF1ZXJ5Ijoic3Bhbi5wcm9kdWN0LXByaWNlOjp0ZXh0IiwidHlwZSI6ImNzcyJ9LHsiZm9ybWF0dGVycyI6W3sibmFtZSI6ImFic29sdXRlX3VybCJ9LHsibmFtZSI6InVuaXF1ZSJ9XSwibXVsdGlwbGUiOnRydWUsIm5hbWUiOiJwYWdlX2xpbmtzIiwicXVlcnkiOiJcL1wvYVwvQGhyZWYiLCJ0eXBlIjoieHBhdGgifSx7ImZvcm1hdHRlcnMiOlt7Im5hbWUiOiJhYnNvbHV0ZV91cmwifSx7Im5hbWUiOiJ1bmlxdWUifV0sIm11bHRpcGxlIjp0cnVlLCJuYW1lIjoicGFnZV9pbWFnZXMiLCJxdWVyeSI6IlwvXC9pbWdcL0BzcmMiLCJ0eXBlIjoieHBhdGgifSx7Im5hbWUiOiJyZXZpZXdzIiwibmVzdGVkIjpbeyJjYXN0IjoiZmxvYXQiLCJuYW1lIjoicmF0aW5nIiwicXVlcnkiOiJjb3VudChcL1wvc3ZnKSIsInR5cGUiOiJ4cGF0aCJ9LHsiZm9ybWF0dGVycyI6W3siYXJncyI6eyJmb3JtYXQiOiIlZFwvJW1cLyVZIn0sIm5hbWUiOiJkYXRldGltZSJ9XSwibmFtZSI6ImRhdGUiLCJxdWVyeSI6IlwvXC9zcGFuWzFdXC90ZXh0KCkiLCJ0eXBlIjoieHBhdGgifSx7Im5hbWUiOiJ0ZXh0IiwicXVlcnkiOiJcL1wvcFsxXVwvdGV4dCgpIiwidHlwZSI6InhwYXRoIn1dLCJxdWVyeSI6IiNyZXZpZXdzID4gZGl2LnJldmlldyIsInR5cGUiOiJjc3MifV0sInNvdXJjZSI6Imh0bWwifQ&cache=true&asp=true&render_js=true&
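key=__API_KEY__&url=https%3A%2F%2Fweb-scraping.dev%2Fproduct%2F1".delete("\n")) # drop the line break inside the URL literal
# Open an HTTPS connection to the API host
https = Net::HTTP.new(url.host, url.port)
https.use_ssl = true
# Send the GET request and print the JSON response body
request = Net::HTTP::Get.new(url)
response = https.request(request)
puts response.read_body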
To extract data directly while using the Web Scraping API, pass the template as extraction_template=ephemeral:base64(template) when you pass the template on the fly, or use a template saved from your dashboard with extraction_template=my-template.
(Saved templates are coming soon)
Combined with the cache feature, we cache the raw data from the website, allowing you to re-extract the data with multiple extraction passes at a much faster speed and lower cost. This applies to the following extraction types:
Learn more about the cache feature
Extraction Rules
Extraction rules tell the API what to retrieve and how. By default a rule returns the first matched element; set multiple: true to retrieve all matched elements.
CSS Selector
Extracts data using CSS selectors with some extra features:
- ::attr(attribute_name) retrieves the attribute value
- ::text retrieves the text node
{
"selectors": [
{
"name": "description",
"query": "p.product-description::text",
"type": "css"
}
]
}
{
"description": "Indulge your sweet tooth with our Box of Chocolate Candy. Each box contains an assortment of rich, flavorful chocolates with a smooth, creamy filling. Choose from a variety of flavors including zesty orange and sweet cherry. Whether you're looking for the perfect gift or just want to treat yourself, our Box of Chocolate Candy is sure to satisfy."
}
XPath Selector
Extracts data using XPath expressions. In the example below, multiple: true captures all matched content; by default only the first element is returned.
{
"selectors": [
{
"name": "page_links",
"query": "//a/@href",
"type": "xpath",
"multiple": true
}
]
}
{
"page_links": [
"https://web-scraping.dev/",
"https://web-scraping.dev",
"https://web-scraping.dev/docs",
"https://web-scraping.dev/api/graphql",
"https://web-scraping.dev/products",
"https://web-scraping.dev/reviews",
"https://web-scraping.dev/testimonials",
"https://web-scraping.dev/login",
"https://web-scraping.dev/cart",
...
]
}
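To make the effect of multiple concrete, here is a minimal local sketch using Python's standard library. It is not how the API is implemented: ElementTree only handles well-formed markup and has no attribute-selecting XPath, so the href values are collected by hand, and SNIPPET and select_hrefs are hypothetical names used for illustration.

```python
import xml.etree.ElementTree as ET

# Small well-formed snippet standing in for a real page.
SNIPPET = """
<div>
  <a href="/docs">Docs</a>
  <a href="/products">Products</a>
  <a href="/docs">Docs again</a>
</div>
"""


def select_hrefs(markup, multiple=False):
    """Collect href attributes of <a> elements, roughly what //a/@href selects."""
    root = ET.fromstring(markup.strip())
    hrefs = [a.get("href") for a in root.iter("a") if a.get("href") is not None]
    if multiple:
        return hrefs
    # Mirror the documented default: only the first match is returned.
    return hrefs[0] if hrefs else None


print(select_hrefs(SNIPPET))                 # first match only
print(select_hrefs(SNIPPET, multiple=True))  # every match, duplicates included
```

Duplicates survive in the multiple case, which is why the full example template pairs this selector with the unique formatter.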
JMESPath Selector
Extracts data from JSON using JMESPath expressions.
{
"selectors": [
{
"name": "price",
"query": "items[?name=='price'].value",
"type": "jmespath"
}
]
}
$9.99
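For readers less familiar with JMESPath filter expressions, the query above can be read as the following plain-Python equivalent. The input document is hypothetical, shaped only to fit the query, and filter_values is an illustrative name; the sketch does not use a JMESPath library.

```python
import json

# Hypothetical JSON input shaped to fit the query; a real document may differ.
document = json.loads("""
{"items": [
  {"name": "title", "value": "Box of Chocolate Candy"},
  {"name": "price", "value": "$9.99"}
]}
""")


def filter_values(doc, wanted_name):
    """Plain-Python reading of items[?name=='<wanted_name>'].value."""
    return [
        item["value"]
        for item in doc.get("items", [])
        # Entries without a value are skipped, as a JMESPath projection drops nulls.
        if item.get("name") == wanted_name and item.get("value") is not None
    ]


matches = filter_values(document, "price")
print(matches)     # the filter itself yields a list of matches
print(matches[0])  # the first match, in line with the default (multiple not set)
```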
Regex Selector
Extracts data using regular expressions.
{
"selectors": [
{
"name": "price",
"query": "(\\$\\d{2}\\.\\d{2})",
"type": "regex"
}
]
}
{
"price regex": "USD"
}
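Because the pattern lives inside a JSON string, every backslash is doubled in the template. The sketch below decodes the selector with Python's json module and tries the resulting pattern with Python's re module; treating the API's regex engine as behaving the same way for this simple pattern is an assumption, and the sample texts are made up.

```python
import json
import re

# The selector as it appears in the JSON template (backslashes doubled for JSON).
selector_json = r'{"name": "price", "query": "(\\$\\d{2}\\.\\d{2})", "type": "regex"}'

# After JSON decoding, each doubled backslash becomes a single one.
pattern = json.loads(selector_json)["query"]
print(pattern)

regex = re.compile(pattern)
for text in ("Now only $19.99!", "Now only $9.99!"):
    match = regex.search(text)
    print(text, "->", match.group(1) if match else None)
```

Note that \d{2} requires exactly two digits before the decimal point, so under Python's re a price like $9.99 does not match this pattern, while $19.99 does.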
Nested Selectors
You can structure your schema by using nested selectors, which also simplifies the queries. All available selectors can be used.
Example:
{
"selectors": [
{
"name": "reviews",
"nested": [
{
"cast": "float",
"name": "rating",
"query": "count(\/\/svg)",
"type": "xpath"
},
{
"formatters": [
{
"args": {
"format": "%d\/%m\/%Y"
},
"name": "datetime"
}
],
"name": "date",
"query": "\/\/span[1]\/text()",
"type": "xpath"
},
{
"name": "text",
"query": "\/\/p[1]\/text()",
"type": "xpath"
}
],
"query": "#reviews > div.review",
"type": "css"
}
]
}
{
"price html block": [
{
"product": [
{
"name": "description",
"query": "p.product-description::text",
"type": "css"
}
]
}
]
}
Extractors
Extractors are applied before formatters and are used to extract specific types of data. They help convert a specific type of data into a structured and normalized format.
Price Extractor
Extracts price information.
{
"extractor": {
"name": "price"
},
"name": "price",
"query": ".price::text",
"type": "css"
}
Images Extractor
Extracts image URLs.
{
"extractor": {
"name": "image"
},
"name": "product_images",
"query": "//img/@src",
"type": "xpath"
}
Links Extractor
Extracts hyperlinks.
{
"extractor": {
"name": "links"
},
"name": "page_links",
"query": "//a/@href",
"type": "xpath"
}
Emails Extractor
Extracts email addresses.
{
"extractor": {
"name": "emails"
},
"name": "contact_emails",
"query": "//body",
"type": "xpath"
}
Formatters
Formatters are applied in the order they are specified and are used to transform the extracted data.
Example of formatter usage
{
"selectors": [
{
"name": "title",
"type": "css",
"query": "title::text",
"formatters": [{"name": "trim"}]
},
{
"name": "links",
"type": "xpath",
"multiple": true,
"query": "//a/@href",
"formatters": [
{"name": "absolute_url"},
{"name": "unique"}
]
}
]
}
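Since formatters run in the order they are listed, the order can change the result. The sketch below uses local stand-ins for absolute_url and unique to show this; the implementations, the FORMATTERS table and the sample links are assumptions made for illustration, and the API's own formatters may behave differently (for instance in how unique orders its output).

```python
from urllib.parse import urljoin

BASE_URL = "https://web-scraping.dev/product/1"  # page the links came from


def absolute_url(values, base=BASE_URL):
    """Resolve each (possibly relative) URL against the base URL."""
    return [urljoin(base, value) for value in values]


def unique(values):
    """Drop duplicates while keeping first-seen order."""
    return list(dict.fromkeys(values))


# Local stand-ins keyed by formatter name.
FORMATTERS = {"absolute_url": absolute_url, "unique": unique}


def apply_formatters(values, specs):
    """Run the formatters in the order they are listed in the template."""
    for spec in specs:
        values = FORMATTERS[spec["name"]](values)
    return values


links = ["/docs", "https://web-scraping.dev/docs", "/cart"]

# absolute_url first: both forms of /docs collapse into one entry.
in_order = apply_formatters(links, [{"name": "absolute_url"}, {"name": "unique"}])
# unique first: the relative and absolute forms still look different, so both survive.
reversed_order = apply_formatters(links, [{"name": "unique"}, {"name": "absolute_url"}])

print(in_order)
print(reversed_order)
```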
trim
- Description: Trims whitespace from the extracted data.
pick
- Description: Picks a specific key from a dictionary.
- Arguments:
  - key: The key to pick from the dictionary. Required and case-sensitive.
unique
- Description: Ensures the extracted data contains unique values.
unquote
- Description: Decodes a URL-encoded string.
lowercase
- Description: Converts the extracted data to lowercase.
uppercase
- Description: Converts the extracted data to uppercase.
datetime
- Description: Formats date strings.
- Arguments:
  - format: The format to convert the date string into. Default is %Y-%m-%d.
titleize
- Description: Converts the extracted data to title case.
capitalize
- Description: Capitalizes the first letter of the extracted data.
slugify
- Description: Converts the extracted data into a URL slug.
- Arguments:
  - separator: The separator to use for the slug. Default is -.
replace
- Description: Replaces occurrences of a string with another string.
- Arguments:
  - search: The string to search for. (Required)
  - replace: The string to replace with. (Required)
split
- Description: Splits the extracted data by a delimiter.
- Arguments:
  - delimiter: The delimiter to split the string by. (Required)
join
- Description: Joins an iterable of strings into a single string with a delimiter.
- Arguments:
  - delimiter: The delimiter to join the strings with. (Required)
json_decode
- Description: Decodes a JSON string into a Python object.
- Arguments:
  - fail_silently: Whether to fail silently if decoding fails. Default is false.
url_encode
- Description: Encodes a string into URL format.
url_decode
- Description: Decodes a URL-encoded string.
base64_encode
- Description: Encodes a string into Base64 format.
base64_decode
- Description: Decodes a Base64-encoded string.
remove_html
- Description: Removes HTML tags from the extracted data.
absolute_url
- Description: Converts a relative URL to an absolute URL based on the base URL.
Error Handling
All related errors are listed below. You can find the full description and an example error response in the Errors section.
- ERR::EXTRACTION::DATA_ERROR - Extracted data is invalid or has an issue
- ERR::EXTRACTION::TIMEOUT - Data extraction timeout
- ERR::EXTRACTION::INVALID_RULE - The extraction rule is invalid
- ERR::EXTRACTION::INVALID_TEMPLATE - The template used for extraction is invalid
- ERR::EXTRACTION::NO_CONTENT - Target response is empty
- ERR::EXTRACTION::OUT_OF_CAPACITY - Not able to extract more data, the backend is out of capacity, retry later.
- ERR::EXTRACTION::TEMPLATE_NOT_FOUND - The provided template does not exist
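One way a client might act on these codes is to retry only the error whose description says to retry later. The sketch below is an assumption-heavy illustration: ExtractionError and call_with_retry are hypothetical names, the backoff values are arbitrary, and how a code is read from a real error response is left out because the response format is described in the Errors section, not here.

```python
import time

# Assumption: only this code is retried, because its description says "retry later".
RETRYABLE_CODES = {"ERR::EXTRACTION::OUT_OF_CAPACITY"}


class ExtractionError(Exception):
    """Hypothetical exception carrying one of the error codes listed above."""

    def __init__(self, code):
        super().__init__(code)
        self.code = code


def call_with_retry(operation, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Retry operation with exponential backoff while the error code is retryable."""
    for attempt in range(attempts):
        try:
            return operation()
        except ExtractionError as error:
            last_attempt = attempt == attempts - 1
            if error.code not in RETRYABLE_CODES or last_attempt:
                raise
            sleep(base_delay * 2 ** attempt)


# Demo with a fake operation that fails twice before succeeding.
calls = {"count": 0}


def flaky_extraction():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ExtractionError("ERR::EXTRACTION::OUT_OF_CAPACITY")
    return {"price": "$9.99"}


delays = []  # record the waits instead of actually sleeping
result = call_with_retry(flaky_extraction, sleep=delays.append)
print(result, delays)
```

Errors such as INVALID_TEMPLATE are raised immediately in this sketch, since resending the same template would fail the same way.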
Pricing
Template extraction is billed at 1 API Credit.
For more information about pricing, see the dedicated section.