Extraction Rules and Template

Extraction templates let you define custom extraction rules for parsing data from HTML, XML, and JSON documents and generating structured JSON output. In other words, a JSON definition tells the API where to find the data, how to extract it, and how to format it.

This tool is aimed at developers familiar with data parsing tools such as CSS selectors, XPath, or JMESPath: it provides full control over the extraction while still saving a lot of time by handling the most complex parts of data parsing.

Key Features:

  • Customizable Rules: Define your own extraction rules to tailor data extraction according to your needs.
  • Versatile Data Sources: Extract data from various content types including HTML, JSON, and XML.
  • Structured Output: Retrieve well-structured data in JSON format, making it easier to process and analyze.

Template Specification

Structure of the extraction template
Schema object
  • source string

    The source of the expected data, either 'html' or 'json'.

    Default Value: html
  • name string

    Name of the template - only used for persistent templates.

    Default Value: ephemeral
  • version string

    The version of the document - only used for persistent templates.

    Default Value: @latest
  • match object

    Match the template for the given URL.

    • domain string

      Match against the current domain

    • path string

      Match against the current path

  • selectors array
    Items object
    • name string

      The name of the data field to extract.

    • type string

      The type of selector to use.

    • options object

      selector options

    • cast string

      Cast the value in the given type.

    • multiple boolean

      When multiple is true, capture all matched content.

    • query string

      The query string for the selector.

    • formatters array
      Items object
      • name string

        The type of formatter to apply.

      • args object

        Optional arguments for the formatter.

    • extractor object
      • name string

The extractor to apply. Extractors are executed before formatters.

      • args object

        Optional arguments for the extractor.

    • nested array

      Nested selectors for each item in the list.

      Items #/properties/selectors/items (Recursion)
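Assembled from the fields above, a minimal template might look like the following sketch (the selector values are illustrative placeholders, not taken from a real template):

```python
import json

# Minimal extraction template covering the main schema fields.
template = {
    "source": "html",  # 'html' or 'json'
    "selectors": [
        {
            "name": "title",              # output field name
            "type": "css",                # selector engine
            "query": "h1::text",          # query for that engine
            "multiple": False,            # first match only
            "formatters": [{"name": "trim"}],
        }
    ],
}

print(json.dumps(template, indent=2))
```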
JSON Schema Validation
The template is validated against a JSON Schema; see the JSON Schema documentation to learn more.

Template Example and Output

This example template extracts various data points to illustrate all available features, using the playground website https://web-scraping.dev/product/1.


Usage

  1. Prepare your content

    For the examples below we will use HTML data from https://web-scraping.dev/product/1. To follow along, save its contents to the current directory as product.html.

  2. Create your extraction template

    The template consists of two primary root keys: source, which indicates the parsed content type (usually html), and selectors, an array defining all of the extraction instructions. Here's an example:

    The template is then sent in base64 format. To produce base64 URL-encoded data, you can use our base64 tool.
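Encoding the template is plain base64 of the JSON text; a sketch using Python's standard library (with a trivial placeholder template, not the full example template):

```python
import base64
import json

# Placeholder template; replace with your own selectors.
template = {
    "source": "html",
    "selectors": [
        {"name": "description", "query": "p.product-description::text", "type": "css"},
    ],
}

encoded = base64.b64encode(json.dumps(template).encode()).decode()
extraction_template = "ephemeral:" + encoded  # the prefix marks an on-the-fly template
print(extraction_template)
```

The resulting value still needs URL-encoding when placed in a query string (the `:` becomes `%3A`).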

  3. Call the API

    Call the extraction endpoint with extraction_template parameter set to the base64 encoded template with ephemeral: prefix:

    Command Explanation

    This command uses curl to send a POST request to an API endpoint with specified headers and data.

    Components
    • curl -X POST:
      • curl is a command-line tool for transferring data with URLs.
      • -X POST specifies the HTTP method to be used, which is POST in this case.
    • -H "content-type: text/html":
      • -H is used to specify an HTTP header for the request.
      • "content-type: text/html" sets the Content-Type header to text/html, indicating that the data being sent is HTML.
    • URL:
      • The URL of the API endpoint being accessed, including query parameters for authentication and specifying the target URL and extraction template.
      • key : An API key for authentication.
      • url : The URL of the web page to be scraped, URL-encoded.
      • extraction_template : A base64-encoded string representing the extraction template.
    • -d @product.html:
      • -d is used to specify the data to be sent in the POST request body.
      • @product.html indicates that the data should be read from a file named product.html.
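The same query string can be assembled programmatically; a sketch with Python's standard library (the `/extraction` endpoint path, the `__API_KEY__` placeholder, and the truncated base64 string are assumptions for illustration):

```python
from urllib.parse import urlencode

# Placeholder values: __API_KEY__ is not a real key and the base64
# template string is truncated for readability.
params = {
    "key": "__API_KEY__",
    "url": "https://web-scraping.dev/product/1",
    "extraction_template": "ephemeral:eyJz...",
}
# urlencode percent-encodes each value, as the API expects.
endpoint = "https://api.scrapfly.io/extraction?" + urlencode(params)
print(endpoint)
```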
  4. Retrieve the Results

    The API will return the extracted data in JSON format. For example:

Web Scraping API

Consult the full documentation about extraction with the Web Scraping API

When using extraction through the Web Scraping API, you will scrape the content, and then extract the data from it:

require "uri"
require "net/http"

url = URI("https://api.scrapfly.io/scrape?tags=player%2Cproject%3Adefault&extraction_template=ephemeral%3AeyJzZWxlY3RvcnMiOlt7Im5hbWUiOiJkZXNjcmlwdGlvbiIsInF1ZXJ5IjoicC5wcm9kdWN0LWRlc2NyaXB0aW9uOjp0ZXh0IiwidHlwZSI6ImNzcyJ9LHsibmFtZSI6InByaWNlX2Jsb2NrIiwibmVzdGVkIjpbeyJleHRyYWN0b3IiOnsibmFtZSI6InByaWNlIn0sImZvcm1hdHRlcnMiOlt7ImFyZ3MiOnsia2V5IjoiY3VycmVuY3kifSwibmFtZSI6InBpY2sifV0sIm5hbWUiOiJwcmljZV9yZWdleCIsIm9wdGlvbnMiOnsiY29udGVudCI6InRleHQiLCJkb3RhbGwiOnRydWUsImlnbm9yZWNhc2UiOnRydWUsIm11bHRpbGluZSI6ZmFsc2V9LCJxdWVyeSI6IihcXCRcXGR7Mn1cXC5cXGR7Mn0pIiwidHlwZSI6InJlZ2V4In1dLCJxdWVyeSI6Ii5wcm9kdWN0LWRhdGEgZGl2LnByaWNlIiwidHlwZSI6ImNzcyJ9LHsibmFtZSI6InByaWNlX2Zyb21faHRtbCIsIm5lc3RlZCI6W3siZm9ybWF0dGVycyI6W3sibmFtZSI6InVwcGVyY2FzZSJ9LHsibmFtZSI6InJlbW92ZV9odG1sIn1dLCJuYW1lIjoicHJpY2VfaHRtbF9yZWdleCIsIm5lc3RlZCI6W3sibXVsdGlwbGUiOnRydWUsIm5hbWUiOiJwcmljZSByZWdleCIsInF1ZXJ5IjoiLisiLCJ0eXBlIjoicmVnZXgifV0sInF1ZXJ5IjoiLisiLCJ0eXBlIjoicmVnZXgifV0sInF1ZXJ5IjoiLnByb2R1Y3QtZGF0YSBkaXYucHJpY2UiLCJ0eXBlIjoiY3NzIn0seyJleHRyYWN0b3IiOnsibmFtZSI6InByaWNlIn0sIm5hbWUiOiJwcmljZSIsInF1ZXJ5Ijoic3Bhbi5wcm9kdWN0LXByaWNlOjp0ZXh0IiwidHlwZSI6ImNzcyJ9LHsiZm9ybWF0dGVycyI6W3sibmFtZSI6ImFic29sdXRlX3VybCJ9LHsibmFtZSI6InVuaXF1ZSJ9XSwibXVsdGlwbGUiOnRydWUsIm5hbWUiOiJwYWdlX2xpbmtzIiwicXVlcnkiOiJcL1wvYVwvQGhyZWYiLCJ0eXBlIjoieHBhdGgifSx7ImZvcm1hdHRlcnMiOlt7Im5hbWUiOiJhYnNvbHV0ZV91cmwifSx7Im5hbWUiOiJ1bmlxdWUifV0sIm11bHRpcGxlIjp0cnVlLCJuYW1lIjoicGFnZV9pbWFnZXMiLCJxdWVyeSI6IlwvXC9pbWdcL0BzcmMiLCJ0eXBlIjoieHBhdGgifSx7Im5hbWUiOiJyZXZpZXdzIiwibmVzdGVkIjpbeyJjYXN0IjoiZmxvYXQiLCJuYW1lIjoicmF0aW5nIiwicXVlcnkiOiJjb3VudChcL1wvc3ZnKSIsInR5cGUiOiJ4cGF0aCJ9LHsiZm9ybWF0dGVycyI6W3siYXJncyI6eyJmb3JtYXQiOiIlZFwvJW1cLyVZIn0sIm5hbWUiOiJkYXRldGltZSJ9XSwibmFtZSI6ImRhdGUiLCJxdWVyeSI6IlwvXC9zcGFuWzFdXC90ZXh0KCkiLCJ0eXBlIjoieHBhdGgifSx7Im5hbWUiOiJ0ZXh0IiwicXVlcnkiOiJcL1wvcFsxXVwvdGV4dCgpIiwidHlwZSI6InhwYXRoIn1dLCJxdWVyeSI6IiNyZXZpZXdzID4gZGl2LnJldmlldyIsInR5cGUiOiJjc3MifV0sInNvdXJjZSI6Imh0bWwifQ&cache=true&asp=true&render_js=true&
key=__API_KEY__&url=https%3A%2F%2Fweb-scraping.dev%2Fproduct%2F1")

https = Net::HTTP.new(url.host, url.port);
https.use_ssl = true

request = Net::HTTP::Get.new(url)

response = https.request(request)
puts response.read_body
To extract data directly while using the Web Scraping API, pass the template as extraction_template=ephemeral:base64(template) for an on-the-fly template, or reference a template saved from your dashboard with extraction_template=my-template.
(Saved templates are coming soon.)
Combined with the cache feature, the raw data from the website is cached, allowing you to re-extract the data with multiple extraction passes at much higher speed and lower cost.
Learn more about the cache feature.

Extraction Rules

Extraction rules instruct the API on what to retrieve and how. By default, only the first matched element is returned; set multiple: true to retrieve all matched elements.

CSS Selector

Extracts data using CSS selectors with some extra features:

  • ::attr(attribute_name) retrieve the attribute values
  • ::text retrieve the text node
{
    "selectors": [
        {
            "name": "description",
            "query": "p.product-description::text",
            "type": "css"
        }
    ]
}
{
    "description": "Indulge your sweet tooth with our Box of Chocolate Candy. Each box contains an assortment of rich, flavorful chocolates with a smooth, creamy filling. Choose from a variety of flavors including zesty orange and sweet cherry. Whether you're looking for the perfect gift or just want to treat yourself, our Box of Chocolate Candy is sure to satisfy."
}

XPath Selector

Extracts data using XPath expressions.

{
    "selectors": [
        {
            "name": "page_links",
            "query": "//a/@href",
            "type": "xpath",
            "multiple": true // Capture all matched content, by default return the first element
        }
    ]
}
{
    "page_links": [
        "https://web-scraping.dev/",
        "https://web-scraping.dev",
        "https://web-scraping.dev/docs",
        "https://web-scraping.dev/api/graphql",
        "https://web-scraping.dev/products",
        "https://web-scraping.dev/reviews",
        "https://web-scraping.dev/testimonials",
        "https://web-scraping.dev/login",
        "https://web-scraping.dev/cart",
        ...
    ]
}
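The multiple: true behaviour can be mirrored with Python's standard xml.etree module (a rough analogue only; the API's XPath engine is far more complete than ElementTree's limited subset):

```python
import xml.etree.ElementTree as ET

# Tiny well-formed document standing in for a scraped page.
html = """<html><body>
<a href="/docs">Docs</a>
<a href="/products">Products</a>
</body></html>"""

root = ET.fromstring(html)
# multiple: true -> collect every match; the default keeps only the first.
links = [a.get("href") for a in root.findall(".//a")]
print(links)  # ['/docs', '/products']
```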

JMESPath Selector

Extracts data from JSON using JMESPath expressions.

{
    "selectors": [
        {
            "name": "price",
            "query": "items[?name=='price'].value",
            "type": "jmespath"
        }
    ]
}
$9.99
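In plain Python terms, items[?name=='price'].value is a filtered projection: keep the items whose name is price, then project their value (sketch with a hypothetical JSON document shaped to match the query; note the projection itself yields a list):

```python
# Hypothetical JSON document shaped like the query expects.
doc = {"items": [
    {"name": "title", "value": "Box of Chocolate Candy"},
    {"name": "price", "value": "$9.99"},
]}

# items[?name=='price'].value: filter the list, then project 'value'.
result = [item["value"] for item in doc["items"] if item["name"] == "price"]
print(result)  # ['$9.99']
```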

Regex Selector

Extracts data using regular expressions.

{
    "selectors": [
        {
            "name": "price",
            "query": "(\\$\\d{2}\\.\\d{2})",
            "type": "regex"
        }
    ]
}
{
    "price": "$12.99"
}
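The pattern can be exercised directly with Python's re module; note that \d{2} requires exactly two digits before the decimal point (the sample text is illustrative):

```python
import re

text = "Was $45.50, now $12.99"
pattern = r"(\$\d{2}\.\d{2})"

first = re.search(pattern, text).group(1)  # default behaviour: first match only
all_matches = re.findall(pattern, text)    # multiple: true analogue
print(first, all_matches)  # $45.50 ['$45.50', '$12.99']
```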

Nested Selectors

You can structure your schema by using nested selectors, which also simplifies the queries. All available selector types can be used.

Example:

{
    "selectors": [
        {
            "name": "reviews",
          	"query": "div.product-reviews",
            "type": "css",
            "nested": [
                {
                    "cast": "float",
                    "name": "rating",
                    "query": "count(\/\/svg)",
                    "type": "xpath"
                },
                {
                    "formatters": [
                        {
                            "args": {
                                "format": "%d\/%m\/%Y"
                            },
                            "name": "datetime"
                        }
                    ],
                    "name": "date",
                    "query": "\/\/span[1]\/text()",
                    "type": "xpath"
                },
                {
                    "name": "text",
                    "query": "\/\/p[1]\/text()",
                    "type": "xpath"
                }
            ],
            "query": "#reviews > div.review",
            "type": "css"
        }
    ]
}
{
    "price html block": [
        {
            "product": [
                {
                    "name": "description",
                    "query": "p.product-description::text",
                    "type": "css"
                }
            ]
        }
    ]
}

Extractors

Extractors are applied before formatters and are used to extract specific types of data. They help convert specific types of data into a structured, normalized format.

Price Extractor

Extracts price information.

{
    "extractor": {
        "name": "price"
    },
    "name": "price",
    "query": ".price::text",
    "type": "css"
}
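The exact fields returned by the price extractor are not specified here; since earlier examples pick a currency key from its result, a rough sketch of that behaviour might look like this (the field names and supported symbols are assumptions, not the API's contract):

```python
import re

def extract_price(text):
    """Sketch of a price extractor: split a price string into amount
    and currency. Field names are assumptions, not the API's."""
    match = re.search(r"([$€£])\s*(\d+(?:\.\d+)?)", text)
    if not match:
        return None
    symbol, amount = match.groups()
    currency = {"$": "USD", "€": "EUR", "£": "GBP"}[symbol]
    return {"amount": float(amount), "currency": currency, "raw": match.group(0)}

print(extract_price("$9.99"))
```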

Images Extractor

Extracts image URLs.

{
    "extractor": {
        "name": "image"
    },
    "name": "product_images",
    "query": "//img/@src",
    "type": "xpath"
}

Links Extractor

Extracts hyperlinks.

{
    "extractor": {
        "name": "links"
    },
    "name": "page_links",
    "query": "//a/@href",
    "type": "xpath"
}

Emails Extractor

Extracts email addresses.

{
    "extractor": {
        "name": "emails"
    },
    "name": "contact_emails",
    "query": "//body",
    "type": "xpath"
}
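A rough stdlib analogue of the emails extractor (the API's actual matching rules are not documented here; real-world email matching is more involved than this pattern):

```python
import re

def extract_emails(text):
    # Simplified pattern; deduplicate and sort for stable output.
    return sorted(set(re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", text)))

body = "Contact support@example.com or sales@example.com for help."
print(extract_emails(body))  # ['sales@example.com', 'support@example.com']
```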

Formatters

Formatters are applied in the order they are specified and are used to transform the extracted data.

Example of formatter usage
{
    "selectors": [
      {
        "name": "title",
        "type": "css",
        "query": "title::text",
        "formatters": [{"name": "trim"}]
      },
      {
        "name": "links",
        "type": "xpath",
        "multiple": true,
        "query": "//a/@href",
        "formatters": [
          {"name": "absolute_url"},
          {"name": "unique"}
        ]
      }
    ]
}
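Order matters: the absolute_url + unique chain above behaves roughly like the following sketch, where the base URL is assumed to be the scraped page's URL:

```python
from urllib.parse import urljoin

base_url = "https://web-scraping.dev/product/1"
raw_links = ["/docs", "https://web-scraping.dev/", "/docs", "cart"]

# absolute_url: resolve each value against the page URL.
absolute = [urljoin(base_url, link) for link in raw_links]
# unique: drop duplicates while keeping order (dicts preserve insertion order).
unique = list(dict.fromkeys(absolute))
print(unique)
```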

trim

  • Description: Trims whitespace from the extracted data.

pick

  • Description: Picks a specific key from a dictionary.
  • Arguments:
    • key: The key to pick from the dictionary. Required and case-sensitive.
If the key does not exist, null is returned.

unique

  • Description: Ensures the extracted data contains unique values.

unquote

  • Description: Decodes a URL-encoded string.

lowercase

  • Description: Converts the extracted data to lowercase.

uppercase

  • Description: Converts the extracted data to uppercase.

datetime

  • Description: Formats date strings.
  • Arguments:
    • format: The format to convert the date string into. Default is %Y-%m-%d.
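The earlier review example used format: "%d/%m/%Y"; assuming the formatter parses the scraped value and re-renders it in the given format, the conversion looks like this with Python's datetime:

```python
from datetime import datetime

# Assumed behaviour: parse the scraped value, re-render with `format`.
scraped = "2023-07-22"                   # default %Y-%m-%d shape
parsed = datetime.strptime(scraped, "%Y-%m-%d")
formatted = parsed.strftime("%d/%m/%Y")  # args: {"format": "%d/%m/%Y"}
print(formatted)  # 22/07/2023
```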

titleize

  • Description: Converts the extracted data to title case.

capitalize

  • Description: Capitalizes the first letter of the extracted data.

slugify

  • Description: Converts the extracted data into a URL slug.
  • Arguments:
    • separator: The separator to use for the slug. Default is -.
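A minimal slugify sketch matching the described arguments (the API's exact normalization rules, e.g. for accented characters, are not documented here):

```python
import re

def slugify(value, separator="-"):
    # Lowercase, collapse runs of non-alphanumerics into the separator.
    value = value.lower()
    return re.sub(r"[^a-z0-9]+", separator, value).strip(separator)

print(slugify("Box of Chocolate Candy"))  # box-of-chocolate-candy
```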

replace

  • Description: Replaces occurrences of a string with another string.
  • Arguments:
    • search: The string to search for. (Required)
    • replace: The string to replace with. (Required)

split

  • Description: Splits the extracted data by a delimiter.
  • Arguments:
    • delimiter: The delimiter to split the string by. (Required)

join

  • Description: Joins an iterable of strings into a single string with a delimiter.
  • Arguments:
    • delimiter: The delimiter to join the strings with. (Required)

json_decode

  • Description: Decodes a JSON string into a Python object.
  • Arguments:
    • fail_silently: Whether to fail silently if decoding fails. Default is false.

url_encode

  • Description: Encodes a string into URL format.

url_decode

  • Description: Decodes a URL-encoded string.

base64_encode

  • Description: Encodes a string into Base64 format.

base64_decode

  • Description: Decodes a Base64-encoded string.

remove_html

  • Description: Removes HTML tags from the extracted data.

absolute_url

  • Description: Converts a relative URL to an absolute URL based on the base URL.

All related errors are listed below. You can see the full description and an example error response in the Errors section.

Pricing

Template extraction is billed 1 API Credit.

For more information about pricing, see the dedicated section.

Summary