Extraction Rules

Automatically extract structured data from crawled pages by mapping URL patterns to extraction methods. Combine the power of recursive crawling with intelligent data extraction for fully automated web scraping pipelines.

How Extraction Rules Work

The extraction_rules parameter maps URL patterns to extraction configurations. As the crawler visits each page, it checks if the URL matches any defined patterns and automatically applies the corresponding extraction method.

[Flow diagram: the crawler visits pages recursively; each URL is checked against the configured patterns (e.g. /products/*); on a match, the corresponding extraction method (prompt, model, or template) is applied; the result is returned as structured JSON data.]

Configuration Syntax

The extraction_rules parameter accepts a JSON object mapping URL patterns to extraction configurations:
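As a sketch, a rules object covering all three methods might look like the following. The "type" and "value" key names are inferred from the method descriptions on this page, and the exact client-side shape is an assumption:

```python
import json

# Illustrative extraction_rules object. The "type"/"value" keys follow the
# method descriptions on this page; the real API's key names may differ.
extraction_rules = {
    "/products/*": {"type": "model", "value": "product"},
    "/blog/*": {
        "type": "prompt",
        "value": "Extract the article title, author, and publish date.",
    },
    # Template values use the ephemeral:<base64> format (placeholder shown).
    "/docs/special-page": {"type": "template", "value": "ephemeral:<base64>"},
}

print(json.dumps(extraction_rules, indent=2))
```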

Pattern Format

  • Exact match: "/products/special-page" matches only that specific URL path
  • Wildcard: "/products/*" matches all pages under /products/
  • Multi-level: "/category/*/products/*" matches nested paths
  • Maximum length: 1000 characters per pattern
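One plausible reading of these pattern rules, where each * matches a single path segment, can be sketched with a small matcher (the single-segment semantics is an assumption; it is not stated explicitly above):

```python
import re

def pattern_to_regex(pattern: str) -> re.Pattern:
    """Convert a URL pattern to a regex.

    Assumption (not confirmed by this page): '*' matches exactly one
    path segment, i.e. any non-empty run of characters except '/'.
    """
    escaped = re.escape(pattern).replace(r"\*", "[^/]+")
    return re.compile("^" + escaped + "$")

def matches(pattern: str, path: str) -> bool:
    return pattern_to_regex(pattern).match(path) is not None

# Exact match: only that specific URL path
assert matches("/products/special-page", "/products/special-page")
# Wildcard: pages under /products/
assert matches("/products/*", "/products/widget-42")
# Multi-level: nested paths
assert matches("/category/*/products/*", "/category/tools/products/hammer")
# Non-matching path
assert not matches("/products/*", "/blog/post-1")
```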

Extraction Methods

Extraction rules support the same three extraction methods available in the Extraction API:

Auto Model

type: "model"

Use pre-trained AI models to extract common data types automatically.

Value: Model name (e.g., "product", "article", "review_list")

Auto Model Documentation

LLM Prompt

type: "prompt"

Provide natural language instructions for what data to extract.

Value: Prompt text (max 10,000 characters)

LLM Prompt Documentation

Template

type: "template"

Define precise extraction rules using CSS, XPath, or regex selectors.

Value: ephemeral:<base64> (a base64-encoded template definition)

Template Documentation
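Producing an ephemeral:<base64> value can be sketched as follows. The template's field schema shown here is hypothetical; the real schema is described in the Template documentation:

```python
import base64
import json

# Hypothetical template definition using CSS and XPath selectors.
# The actual template schema may differ; see the Template documentation.
template = {
    "title": {"selector": "h1.product-title", "type": "css"},
    "price": {"selector": "//span[@class='price']", "type": "xpath"},
}

# Encode the JSON template as base64 and prefix it with "ephemeral:".
encoded = base64.b64encode(json.dumps(template).encode("utf-8")).decode("ascii")
rule = {"type": "template", "value": f"ephemeral:{encoded}"}
```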

Usage Examples

E-commerce Site with Auto Models

Crawl an e-commerce site and extract structured data from different page types using pre-trained AI models:
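A sketch of such a configuration, using the "product" and "product_listing" model names mentioned on this page (the "type"/"value" key names are inferred, not confirmed):

```python
# Illustrative extraction_rules for an e-commerce crawl.
extraction_rules = {
    # Product detail pages: pre-trained "product" model
    "/product/*": {"type": "model", "value": "product"},
    # Paginated catalog: "product_listing" model for the list page
    "/products": {"type": "model", "value": "product_listing"},
}
```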

What This Does

  • Product detail pages (/product/*): Extracts full product data including name, price, variants, description, specifications, reviews, and images
  • Product listing page (/products): Extracts an array of products with name, price, image, and link from the paginated catalog

Why this works: Auto models are pre-trained on thousands of e-commerce sites and automatically detect standard fields such as price, name, and description without any configuration.

Blog with LLM Prompt

Use LLM prompts to extract blog articles with custom metadata and content analysis:
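A sketch of this configuration, combining a custom prompt for articles with the "article" model for the index (the prompt wording and key names are illustrative):

```python
# Illustrative prompt covering standard fields plus derived insights.
article_prompt = (
    "Extract the article title, author, and publish date. "
    "Also estimate reading time in minutes, classify the main topic, "
    "and list any programming languages used in code samples."
)

extraction_rules = {
    # Blog articles: LLM prompt with custom metadata fields
    "/blog/*": {"type": "prompt", "value": article_prompt},
    # Blog index: pre-trained "article" model for fast list extraction
    "/blog/": {"type": "model", "value": "article"},
}
```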

What This Does

  • Blog articles (/blog/*): Uses LLM prompt to extract article metadata plus custom fields like reading time, topic classification, and code language detection
  • Blog index (/blog/): Uses article model for fast extraction of the article list page

Why use prompts: LLM prompts can extract standard fields, derive new insights (topic classification, reading time), and transform data formats (date normalization) in a single extraction pass.

Mixed Extraction Methods

Combine auto models for standard pages and templates for complex nested structures:
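A sketch of a mixed configuration (the template value is a placeholder in the documented ephemeral:<base64> format; key names are inferred):

```python
# Illustrative mixed extraction_rules: template for precision, model for speed.
extraction_rules = {
    # Product pages: template for nested specs and variants arrays
    "/product/*": {"type": "template", "value": "ephemeral:<base64>"},
    # Product listing: pre-trained "product_listing" model for list pages
    "/products": {"type": "model", "value": "product_listing"},
}
```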

What This Does

  • Product pages (/product/*): Uses template to extract product details plus nested specs and variants arrays
  • Product listing (/products): Uses product_listing model for fast extraction of list pages

Why mix methods: Templates provide precision for complex nested structures (specs, variants), while models offer speed for simple list pages, optimizing both accuracy and cost.

Accessing Extracted Data

When using extraction rules, extracted data is included in the crawler results alongside the raw HTML content. The extracted data appears in the extracted_data field for each matched URL.

Query Extracted Content via API

Response example:
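The sketch below shows one possible result shape for a matched page. Only the extracted_data field is documented above; the surrounding field names and values are illustrative assumptions:

```python
# Illustrative crawler result for one page. Only "extracted_data" is
# documented; the other field names here are assumptions.
result = {
    "url": "https://example.com/product/widget-42",
    "status": 200,
    "html": "<html>...</html>",
    "extracted_data": {
        "name": "Widget 42",
        "price": 19.99,
    },
}

# Pages that matched an extraction rule carry structured data alongside
# the raw HTML; non-matched pages would have no extracted_data.
if result.get("extracted_data") is not None:
    product = result["extracted_data"]
    print(product["name"], product["price"])
```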

Download as Artifact

For large crawls, download extracted data as part of the WARC artifact. The extracted data is stored in conversion records with Content-Type: application/json.

See WARC Format documentation for parsing WARC files with extracted data.

Best Practices

Billing & Credits

Extraction rules consume additional API credits on top of the base crawling cost:

  • Auto Model: +5 credits per extracted page
  • LLM Prompt: +10 credits per extracted page
  • Template: +1 credit per extracted page

Only pages matching extraction rules incur extraction costs. Non-matched pages are crawled at standard rates. For detailed pricing, see Crawler API Billing.
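The per-method surcharges above can be combined into a quick estimate of extraction cost (base crawling cost excluded, since it is priced separately):

```python
# Extraction credit surcharge per extracted page, from the list above.
EXTRACTION_CREDITS = {"model": 5, "prompt": 10, "template": 1}

def extraction_cost(pages_by_method: dict) -> int:
    """Total extra credits for extracted pages.

    Non-matched pages add nothing; they are billed at standard crawl rates.
    """
    return sum(EXTRACTION_CREDITS[m] * n for m, n in pages_by_method.items())

# 100 model-extracted, 20 prompt-extracted, 500 template-extracted pages:
cost = extraction_cost({"model": 100, "prompt": 20, "template": 500})
print(cost)  # 100*5 + 20*10 + 500*1 = 1200
```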

Limitations

  • Max patterns per crawler: 50 (maximum number of extraction rules)
  • Pattern max length: 1,000 characters per URL pattern
  • Prompt max length: 10,000 characters per LLM prompt
  • Template max size: 100 KB per encoded template
