Extraction Rules
Automatically extract structured data from crawled pages by mapping URL patterns to extraction methods. Combine the power of recursive crawling with intelligent data extraction for fully automated web scraping pipelines.
Pattern-Based Extraction
Extraction rules allow you to apply different extraction strategies to different page types within the same crawl.
For example, extract product data from /products/* pages and article content from /blog/* pages - all in a single crawler configuration.
How Extraction Rules Work
The extraction_rules parameter maps URL patterns to extraction configurations.
As the crawler visits each page, it checks if the URL matches any defined patterns and automatically applies the corresponding extraction method.
Configuration Syntax
The extraction_rules parameter accepts a JSON object mapping URL patterns to extraction configurations:
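For illustration, a minimal sketch of the mapping shape, reusing the type/value pairs documented under Extraction Methods below (the patterns and prompt text are placeholders; how the parameter is attached to the crawl request depends on your client):

```json
{
  "/products/featured": { "type": "model", "value": "product" },
  "/products/*": { "type": "model", "value": "product_listing" },
  "/blog/*": { "type": "prompt", "value": "Extract the article title, author, and publish date" }
}
```

Note the ordering: the exact "/products/featured" pattern comes before the "/products/*" wildcard so it can match first.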
Pattern Format
- Exact match: "/products/special-page" matches only that specific URL path
- Wildcard: "/products/*" matches all pages under /products/
- Multi-level: "/category/*/products/*" matches nested paths
- Maximum length: 1,000 characters per pattern
Pattern Matching Rules
- Patterns are matched against the URL path only (not domain or query parameters)
- The first matching pattern is used - order matters!
- If no pattern matches, the page is crawled but not extracted
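As an illustration of the first-match rule, this hypothetical ordering makes the exact pattern unreachable, because the wildcard above it already matches every /products/ path:

```json
{
  "/products/*": { "type": "model", "value": "product_listing" },
  "/products/featured": { "type": "model", "value": "product" }
}
```

Listing "/products/featured" first restores the intended behavior.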
Extraction Methods
Extraction rules support the same three extraction methods available in the Extraction API:
Auto Model
type: "model"
Use pre-trained AI models to extract common data types automatically.
Value: Model name (e.g., "product", "article", "review_list")
LLM Prompt
type: "prompt"
Provide natural language instructions for what data to extract.
Value: Prompt text (max 10,000 characters)
Template
type: "template"
Define precise extraction rules using CSS, XPath, or regex selectors.
Value: ephemeral:<base64> (a template definition, base64-encoded and prefixed with ephemeral:)
Usage Examples
E-commerce Site with Auto Models
Crawl an e-commerce site and extract structured data from different page types using pre-trained AI models:
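A sketch of the rule set for this crawl, assuming the product and product_listing auto models named on this page (the URL patterns follow the site structure described below):

```json
{
  "/product/*": { "type": "model", "value": "product" },
  "/products": { "type": "model", "value": "product_listing" }
}
```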
What This Does
- Product detail pages (/product/*): Extracts full product data including name, price, variants, description, specifications, reviews, and images
- Product listing page (/products): Extracts an array of products with name, price, image, and link from the paginated catalog

Example outputs:
- Product page extracts: {"name": "Box of Chocolate Candy", "price": {"amount": "9.99", "currency": "USD"}, "rating": 4.7, ...}
- Listing page extracts: {"products": [{"name": "Box of Chocolate...", "price": "$24.99", ...}, ...]}
Why this works: Auto models are pre-trained on thousands of e-commerce sites, automatically detecting standard fields like price, name, description without configuration.
Blog with LLM Prompt
Use LLM prompts to extract blog articles with custom metadata and content analysis:
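A sketch of the configuration; the prompt wording is illustrative, and the exact /blog/ index pattern is listed before the wildcard so it matches first:

```json
{
  "/blog/": { "type": "model", "value": "article" },
  "/blog/*": {
    "type": "prompt",
    "value": "Extract the article title, author name, publish date, estimated reading time in minutes, main topic, a one-sentence summary, and the primary programming language used in code samples."
  }
}
```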
What This Does
- Blog articles (/blog/*): Uses an LLM prompt to extract article metadata plus custom fields like reading time, topic classification, and code language detection
- Blog index (/blog/): Uses the article model for fast extraction of the article list page

Example article output:
{"title": "How to Scrape Amazon Product Data", "author_name": "Scrapfly Team", "publish_date": "2024-03-15", "reading_time_minutes": 12, "main_topic": "web scraping tutorial", "article_summary": "Learn how to extract Amazon product data using...", "primary_code_language": "Python"}
Why use prompts: LLM prompts can extract standard fields, derive new insights (topic classification, reading time), and transform data formats (date normalization) in a single extraction pass.
Mixed Extraction Methods
Combine auto models for standard pages and templates for complex nested structures:
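A sketch of the mixed configuration; the template value is a placeholder standing in for your own base64-encoded template (see the ephemeral:<base64> format described above):

```json
{
  "/product/*": { "type": "template", "value": "ephemeral:<base64-encoded-template>" },
  "/products": { "type": "model", "value": "product_listing" }
}
```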
What This Does
- Product pages (/product/*): Uses a template to extract product details plus nested specs and variants arrays
- Product listing (/products): Uses the product_listing model for fast extraction of list pages

Example product output:
{"name": "Box of Chocolate Candy", "price": "$9.99", "specifications": [{"key": "Weight", "value": "500g"}, {"key": "Material", "value": "Chocolate"}], "variants": [{"color": "Dark", "size": "Medium", "in_stock": "true"}]}
Why mix methods: Templates provide precision for complex nested structures (specs, variants) while models offer speed for simple list pages - optimizing both accuracy and cost.
Accessing Extracted Data
When using extraction rules, extracted data is included in the crawler results alongside the raw HTML content.
The extracted data appears in the extracted_data field for each matched URL.
Query Extracted Content via API
Response example:
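A hedged sketch of a single result entry: extracted_data is the documented field, while the surrounding envelope and sibling fields are illustrative and may differ from the actual Crawler API response format:

```json
{
  "url": "https://example.com/product/1",
  "extracted_data": {
    "name": "Box of Chocolate Candy",
    "price": { "amount": "9.99", "currency": "USD" },
    "rating": 4.7
  }
}
```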
Download as Artifact
For large crawls, download extracted data as part of the WARC artifact. The extracted data is stored in conversion records with Content-Type: application/json.
See WARC Format documentation for parsing WARC files with extracted data.
Best Practices
Recommended Practices
- Order patterns from specific to general: Place more specific patterns before wildcards (e.g. "/products/featured" before "/products/*")
- Use appropriate extraction methods: Choose auto models for standard data types, prompts for custom fields, and templates for complex structures
- Test extraction on sample URLs first: Use the standalone Extraction API to validate extraction configs before crawling
- Keep prompts focused: Shorter, specific prompts yield better extraction results than lengthy instructions
- Monitor extraction success: Check the extracted_data field in results to ensure extraction worked as expected
Common Pitfalls
- Pattern order matters: The first matching pattern wins - avoid overlapping patterns where order is ambiguous
- URL encoding in patterns: Patterns match decoded URL paths, not encoded ones
- Extraction adds cost: Each extracted page uses additional API credits - see billing documentation
- Template complexity: Very complex templates may slow down extraction - consider breaking into multiple simpler rules
Billing & Credits
Extraction rules consume additional API credits on top of the base crawling cost:
- Auto Model: +5 credits per extracted page
- LLM Prompt: +10 credits per extracted page
- Template: +1 credit per extracted page
Only pages matching extraction rules incur extraction costs; non-matched pages are crawled at standard rates. For example, a crawl in which 1,000 pages match an auto-model rule adds 1,000 × 5 = 5,000 credits on top of the base crawling cost. For detailed pricing, see Crawler API Billing.
Limitations
| Limit | Value | Description |
|---|---|---|
| Max patterns per crawler | 50 | Maximum number of extraction rules |
| Pattern max length | 1,000 chars | Maximum characters per URL pattern |
| Prompt max length | 10,000 chars | Maximum characters per LLM prompt |
| Template max size | 100 KB | Maximum size of encoded template |
Next Steps
- Learn about Auto Model extraction and available models
- Explore LLM Prompt extraction for custom data needs
- Master Template extraction for precise control
- Understand how to retrieve crawler results with extracted data
- Check crawler billing to optimize extraction costs
External Resources
- Guide: Web Scraping with AI and LLMs
- Extraction API Documentation
- Base64 Encoding Tool for template encoding