Extraction Rules

Automatically extract structured data from crawled pages by mapping URL patterns to extraction methods. Combine the power of recursive crawling with intelligent data extraction for fully automated web scraping pipelines.

How Extraction Rules Work

The extraction_rules parameter maps URL patterns to extraction configurations. As the crawler visits each page, it checks if the URL matches any defined patterns and automatically applies the corresponding extraction method.

[Flow diagram: the crawler visits pages recursively; each URL is checked against the configured patterns (e.g. /products/*); on a match, the corresponding extraction method (prompt, model, or template) is applied; the result is returned as structured JSON data.]

Configuration Syntax

The extraction_rules parameter accepts a JSON object mapping URL patterns to extraction configurations:
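As a sketch, a rules object covering all three methods might look like the following. The "type" and "value" key names are inferred from the method descriptions on this page, and the exact client-side shape is an assumption:

```python
import json

# Illustrative extraction_rules object. The "type"/"value" keys follow the
# method descriptions on this page; the real API's key names may differ.
extraction_rules = {
    "/products/*": {"type": "model", "value": "product"},
    "/blog/*": {
        "type": "prompt",
        "value": "Extract the article title, author, and publish date.",
    },
    # Template values use the ephemeral:<base64> format (placeholder shown).
    "/docs/special-page": {"type": "template", "value": "ephemeral:<base64>"},
}

print(json.dumps(extraction_rules, indent=2))
```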

Pattern Format

  • Exact match: "/products/special-page" matches only that specific URL path
  • Wildcard: "/products/*" matches all pages under /products/
  • Multi-level: "/category/*/products/*" matches nested paths
  • Maximum length: 1000 characters per pattern
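One plausible reading of these pattern rules, where each * matches a single path segment, can be sketched with a small matcher (the single-segment semantics is an assumption; it is not stated explicitly above):

```python
import re

def pattern_to_regex(pattern: str) -> re.Pattern:
    """Convert a URL pattern to a regex.

    Assumption (not confirmed by this page): '*' matches exactly one
    path segment, i.e. any non-empty run of characters except '/'.
    """
    escaped = re.escape(pattern).replace(r"\*", "[^/]+")
    return re.compile("^" + escaped + "$")

def matches(pattern: str, path: str) -> bool:
    return pattern_to_regex(pattern).match(path) is not None

# Exact match: only that specific URL path
assert matches("/products/special-page", "/products/special-page")
# Wildcard: pages under /products/
assert matches("/products/*", "/products/widget-42")
# Multi-level: nested paths
assert matches("/category/*/products/*", "/category/tools/products/hammer")
# Non-matching path
assert not matches("/products/*", "/blog/post-1")
```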

Extraction Methods

Extraction rules support the same three extraction methods available in the Extraction API:

Auto Model

type: "model"

Use pre-trained AI models to extract common data types automatically.

Value: Model name (e.g., "product", "article", "review_list")

Auto Model Documentation

LLM Prompt

type: "prompt"

Provide natural language instructions for what data to extract.

Value: Prompt text (max 10,000 characters)

LLM Prompt Documentation

Template

type: "template"

Define precise extraction rules using CSS, XPath, or regex selectors.

Value: ephemeral:<base64> (a base64-encoded template definition)

Template Documentation
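Producing an ephemeral:<base64> value can be sketched as follows. The template's field schema shown here is hypothetical; the real schema is described in the Template documentation:

```python
import base64
import json

# Hypothetical template definition using CSS and XPath selectors.
# The actual template schema may differ; see the Template documentation.
template = {
    "title": {"selector": "h1.product-title", "type": "css"},
    "price": {"selector": "//span[@class='price']", "type": "xpath"},
}

# Encode the JSON template as base64 and prefix it with "ephemeral:".
encoded = base64.b64encode(json.dumps(template).encode("utf-8")).decode("ascii")
rule = {"type": "template", "value": f"ephemeral:{encoded}"}
```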

Usage Examples

E-commerce Site with Auto Models

Crawl an e-commerce site and extract structured data from different page types using pre-trained AI models:
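A sketch of such a configuration, using the "product" and "product_listing" model names mentioned on this page (the "type"/"value" key names are inferred, not confirmed):

```python
# Illustrative extraction_rules for an e-commerce crawl.
extraction_rules = {
    # Product detail pages: pre-trained "product" model
    "/product/*": {"type": "model", "value": "product"},
    # Paginated catalog: "product_listing" model for the list page
    "/products": {"type": "model", "value": "product_listing"},
}
```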

What This Does

  • Product detail pages (/product/*): Extracts full product data including name, price, variants, description, specifications, reviews, and images
  • Product listing page (/products): Extracts an array of products with name, price, image, and link from the paginated catalog

Why this works: Auto models are pre-trained on thousands of e-commerce sites and automatically detect standard fields such as price, name, and description without any configuration.

Blog with LLM Prompt

Use LLM prompts to extract blog articles with custom metadata and content analysis:
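A sketch of this configuration, combining a custom prompt for articles with the "article" model for the index (the prompt wording and key names are illustrative):

```python
# Illustrative prompt covering standard fields plus derived insights.
article_prompt = (
    "Extract the article title, author, and publish date. "
    "Also estimate reading time in minutes, classify the main topic, "
    "and list any programming languages used in code samples."
)

extraction_rules = {
    # Blog articles: LLM prompt with custom metadata fields
    "/blog/*": {"type": "prompt", "value": article_prompt},
    # Blog index: pre-trained "article" model for fast list extraction
    "/blog/": {"type": "model", "value": "article"},
}
```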

What This Does

  • Blog articles (/blog/*): Uses LLM prompt to extract article metadata plus custom fields like reading time, topic classification, and code language detection
  • Blog index (/blog/): Uses article model for fast extraction of the article list page

Why use prompts: LLM prompts can extract standard fields, derive new insights (topic classification, reading time), and transform data formats (date normalization) in a single extraction pass.

Mixed Extraction Methods

Combine auto models for standard pages and templates for complex nested structures:
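A sketch of a mixed configuration (the template value is a placeholder in the documented ephemeral:<base64> format; key names are inferred):

```python
# Illustrative mixed extraction_rules: template for precision, model for speed.
extraction_rules = {
    # Product pages: template for nested specs and variants arrays
    "/product/*": {"type": "template", "value": "ephemeral:<base64>"},
    # Product listing: pre-trained "product_listing" model for list pages
    "/products": {"type": "model", "value": "product_listing"},
}
```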

What This Does

  • Product pages (/product/*): Uses template to extract product details plus nested specs and variants arrays
  • Product listing (/products): Uses product_listing model for fast extraction of list pages

Why mix methods: Templates provide precision for complex nested structures (specs, variants), while models offer speed for simple list pages, optimizing both accuracy and cost.

Accessing Extracted Data

When using extraction rules, extracted data is included in the crawler results alongside the raw HTML content. The extracted data appears in the extracted_data field for each matched URL.

Query Extracted Content via API

Response example:
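The sketch below shows one possible result shape for a matched page. Only the extracted_data field is documented above; the surrounding field names and values are illustrative assumptions:

```python
# Illustrative crawler result for one page. Only "extracted_data" is
# documented; the other field names here are assumptions.
result = {
    "url": "https://example.com/product/widget-42",
    "status": 200,
    "html": "<html>...</html>",
    "extracted_data": {
        "name": "Widget 42",
        "price": 19.99,
    },
}

# Pages that matched an extraction rule carry structured data alongside
# the raw HTML; non-matched pages would have no extracted_data.
if result.get("extracted_data") is not None:
    product = result["extracted_data"]
    print(product["name"], product["price"])
```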

Download as Artifact

For large crawls, download extracted data as part of the WARC artifact. The extracted data is stored in conversion records with Content-Type: application/json.

See WARC Format documentation for parsing WARC files with extracted data.

Best Practices

Billing & Credits

Extraction rules consume additional API credits on top of the base crawling cost:

  • Auto Model: +5 credits per extracted page
  • LLM Prompt: +10 credits per extracted page
  • Template: +1 credit per extracted page

Only pages matching extraction rules incur extraction costs. Non-matched pages are crawled at standard rates. For detailed pricing, see Crawler API Billing.
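The per-method surcharges above can be combined into a quick estimate of extraction cost (base crawling cost excluded, since it is priced separately):

```python
# Extraction credit surcharge per extracted page, from the list above.
EXTRACTION_CREDITS = {"model": 5, "prompt": 10, "template": 1}

def extraction_cost(pages_by_method: dict) -> int:
    """Total extra credits for extracted pages.

    Non-matched pages add nothing; they are billed at standard crawl rates.
    """
    return sum(EXTRACTION_CREDITS[m] * n for m, n in pages_by_method.items())

# 100 model-extracted, 20 prompt-extracted, 500 template-extracted pages:
cost = extraction_cost({"model": 100, "prompt": 20, "template": 500})
print(cost)  # 100*5 + 20*10 + 500*1 = 1200
```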

Limitations

  • Max patterns per crawler: 50 (maximum number of extraction rules)
  • Pattern max length: 1,000 characters per URL pattern
  • Prompt max length: 10,000 characters per LLM prompt
  • Template max size: 100 KB per encoded template
