# Scrapfly Documentation

## Table of Contents

### Dashboard

- [Intro](https://scrapfly.io/docs)
- [Project](https://scrapfly.io/docs/project)
- [Account](https://scrapfly.io/docs/account)
- [Workspace & Team](https://scrapfly.io/docs/workspace-and-team)
- [Billing](https://scrapfly.io/docs/billing)

### Products

#### MCP Server

- [Getting Started](https://scrapfly.io/docs/mcp/getting-started)
- [Tools & API Spec](https://scrapfly.io/docs/mcp/tools)
- [Authentication](https://scrapfly.io/docs/mcp/authentication)
- [Examples & Use Cases](https://scrapfly.io/docs/mcp/examples)
- [FAQ](https://scrapfly.io/docs/mcp/faq)

##### Integrations

- [Overview](https://scrapfly.io/docs/mcp/integrations)
- [Claude Desktop](https://scrapfly.io/docs/mcp/integrations/claude-desktop)
- [Claude Code](https://scrapfly.io/docs/mcp/integrations/claude-code)
- [ChatGPT](https://scrapfly.io/docs/mcp/integrations/chatgpt)
- [Cursor](https://scrapfly.io/docs/mcp/integrations/cursor)
- [Cline](https://scrapfly.io/docs/mcp/integrations/cline)
- [Windsurf](https://scrapfly.io/docs/mcp/integrations/windsurf)
- [Zed](https://scrapfly.io/docs/mcp/integrations/zed)
- [Roo Code](https://scrapfly.io/docs/mcp/integrations/roo-code)
- [VS Code](https://scrapfly.io/docs/mcp/integrations/vscode)
- [LangChain](https://scrapfly.io/docs/mcp/integrations/langchain)
- [LlamaIndex](https://scrapfly.io/docs/mcp/integrations/llamaindex)
- [CrewAI](https://scrapfly.io/docs/mcp/integrations/crewai)
- [OpenAI](https://scrapfly.io/docs/mcp/integrations/openai)
- [n8n](https://scrapfly.io/docs/mcp/integrations/n8n)
- [Make](https://scrapfly.io/docs/mcp/integrations/make)
- [Zapier](https://scrapfly.io/docs/mcp/integrations/zapier)
- [Vapi AI](https://scrapfly.io/docs/mcp/integrations/vapi)
- [Agent Builder](https://scrapfly.io/docs/mcp/integrations/agent-builder)
- [Custom Client](https://scrapfly.io/docs/mcp/integrations/custom-client)


#### Web Scraping API

- [Getting Started](https://scrapfly.io/docs/scrape-api/getting-started)
- [API Specification]()
- [Monitoring](https://scrapfly.io/docs/monitoring)
- [Customize Request](https://scrapfly.io/docs/scrape-api/custom)
- [Debug](https://scrapfly.io/docs/scrape-api/debug)
- [Anti Scraping Protection](https://scrapfly.io/docs/scrape-api/anti-scraping-protection)
- [Proxy](https://scrapfly.io/docs/scrape-api/proxy)
- [Proxy Mode](https://scrapfly.io/docs/scrape-api/proxy-mode)
- [Proxy Mode - Screaming Frog](https://scrapfly.io/docs/scrape-api/proxy-mode/screaming-frog)
- [Proxy Mode - Apify](https://scrapfly.io/docs/scrape-api/proxy-mode/apify)
- [(Auto) Data Extraction](https://scrapfly.io/docs/scrape-api/extraction)
- [Javascript Rendering](https://scrapfly.io/docs/scrape-api/javascript-rendering)
- [Javascript Scenario](https://scrapfly.io/docs/scrape-api/javascript-scenario)
- [SSL](https://scrapfly.io/docs/scrape-api/ssl)
- [DNS](https://scrapfly.io/docs/scrape-api/dns)
- [Cache](https://scrapfly.io/docs/scrape-api/cache)
- [Session](https://scrapfly.io/docs/scrape-api/session)
- [Webhook](https://scrapfly.io/docs/scrape-api/webhook)
- [Screenshot](https://scrapfly.io/docs/scrape-api/screenshot)
- [Errors](https://scrapfly.io/docs/scrape-api/errors)
- [Timeout](https://scrapfly.io/docs/scrape-api/understand-timeout)
- [Throttling](https://scrapfly.io/docs/throttling)
- [Troubleshoot](https://scrapfly.io/docs/scrape-api/troubleshoot)
- [Billing](https://scrapfly.io/docs/scrape-api/billing)
- [FAQ](https://scrapfly.io/docs/scrape-api/faq)

#### Crawler API

- [Getting Started](https://scrapfly.io/docs/crawler-api/getting-started)
- [API Specification]()
- [Retrieving Results](https://scrapfly.io/docs/crawler-api/results)
- [WARC Format](https://scrapfly.io/docs/crawler-api/warc-format)
- [Data Extraction](https://scrapfly.io/docs/crawler-api/extraction-rules)
- [Webhook](https://scrapfly.io/docs/crawler-api/webhook)
- [Billing](https://scrapfly.io/docs/crawler-api/billing)
- [Errors](https://scrapfly.io/docs/crawler-api/errors)
- [Troubleshoot](https://scrapfly.io/docs/crawler-api/troubleshoot)
- [FAQ](https://scrapfly.io/docs/crawler-api/faq)

#### Screenshot API

- [Getting Started](https://scrapfly.io/docs/screenshot-api/getting-started)
- [API Specification]()
- [Accessibility Testing](https://scrapfly.io/docs/screenshot-api/accessibility)
- [Webhook](https://scrapfly.io/docs/screenshot-api/webhook)
- [Billing](https://scrapfly.io/docs/screenshot-api/billing)
- [Errors](https://scrapfly.io/docs/screenshot-api/errors)

#### Extraction API

- [Getting Started](https://scrapfly.io/docs/extraction-api/getting-started)
- [API Specification]()
- [Rules Template](https://scrapfly.io/docs/extraction-api/rules-and-template)
- [LLM Extraction](https://scrapfly.io/docs/extraction-api/llm-prompt)
- [AI Auto Extraction](https://scrapfly.io/docs/extraction-api/automatic-ai)
- [Webhook](https://scrapfly.io/docs/extraction-api/webhook)
- [Billing](https://scrapfly.io/docs/extraction-api/billing)
- [Errors](https://scrapfly.io/docs/extraction-api/errors)
- [FAQ](https://scrapfly.io/docs/extraction-api/faq)

#### Proxy Saver

- [Getting Started](https://scrapfly.io/docs/proxy-saver/getting-started)
- [Fingerprints](https://scrapfly.io/docs/proxy-saver/fingerprints)
- [Optimizations](https://scrapfly.io/docs/proxy-saver/optimizations)
- [SSL Certificates](https://scrapfly.io/docs/proxy-saver/certificates)
- [Protocols](https://scrapfly.io/docs/proxy-saver/protocols)
- [Pacfile](https://scrapfly.io/docs/proxy-saver/pacfile)
- [Secure Credentials](https://scrapfly.io/docs/proxy-saver/security)
- [Billing](https://scrapfly.io/docs/proxy-saver/billing)

#### Cloud Browser API

- [Getting Started](https://scrapfly.io/docs/cloud-browser-api/getting-started)
- [Proxy & Geo-Targeting](https://scrapfly.io/docs/cloud-browser-api/proxy)
- [Unblock API](https://scrapfly.io/docs/cloud-browser-api/unblock)
- [File Downloads](https://scrapfly.io/docs/cloud-browser-api/file-downloads)
- [Session Resume](https://scrapfly.io/docs/cloud-browser-api/session-resume)
- [Human-in-the-Loop](https://scrapfly.io/docs/cloud-browser-api/human-in-the-loop)
- [Debug Mode](https://scrapfly.io/docs/cloud-browser-api/debug-mode)
- [Bring Your Own Proxy](https://scrapfly.io/docs/cloud-browser-api/bring-your-own-proxy)
- [Browser Extensions](https://scrapfly.io/docs/cloud-browser-api/extensions)

##### Integrations

- [Puppeteer](https://scrapfly.io/docs/cloud-browser-api/puppeteer)
- [Playwright](https://scrapfly.io/docs/cloud-browser-api/playwright)
- [Selenium](https://scrapfly.io/docs/cloud-browser-api/selenium)
- [Vercel Agent Browser](https://scrapfly.io/docs/cloud-browser-api/agent-browser)
- [Browser Use](https://scrapfly.io/docs/cloud-browser-api/browser-use)
- [Stagehand](https://scrapfly.io/docs/cloud-browser-api/stagehand)

- [Billing](https://scrapfly.io/docs/cloud-browser-api/billing)
- [Errors](https://scrapfly.io/docs/cloud-browser-api/errors)


### Tools

- [Antibot Detector](https://scrapfly.io/docs/tools/antibot-detector)

### SDK

- [Golang](https://scrapfly.io/docs/sdk/golang)
- [Python](https://scrapfly.io/docs/sdk/python)
- [Rust](https://scrapfly.io/docs/sdk/rust)
- [TypeScript](https://scrapfly.io/docs/sdk/typescript)
- [Scrapy](https://scrapfly.io/docs/sdk/scrapy)

### Integrations

- [Getting Started](https://scrapfly.io/docs/integration/getting-started)
- [LangChain](https://scrapfly.io/docs/integration/langchain)
- [LlamaIndex](https://scrapfly.io/docs/integration/llamaindex)
- [CrewAI](https://scrapfly.io/docs/integration/crewai)
- [Zapier](https://scrapfly.io/docs/integration/zapier)
- [Make](https://scrapfly.io/docs/integration/make)
- [n8n](https://scrapfly.io/docs/integration/n8n)

### Academy

- [Overview](https://scrapfly.io/academy)
- [Web Scraping Overview](https://scrapfly.io/academy/scraping-overview)
- [Tools](https://scrapfly.io/academy/tools-overview)
- [Reverse Engineering](https://scrapfly.io/academy/reverse-engineering)
- [Static Scraping](https://scrapfly.io/academy/static-scraping)
- [HTML Parsing](https://scrapfly.io/academy/html-parsing)
- [Dynamic Scraping](https://scrapfly.io/academy/dynamic-scraping)
- [Hidden API Scraping](https://scrapfly.io/academy/hidden-api-scraping)
- [Headless Browsers](https://scrapfly.io/academy/headless-browsers)
- [Hidden Web Data](https://scrapfly.io/academy/hidden-web-data)
- [JSON Parsing](https://scrapfly.io/academy/json-parsing)
- [Data Processing](https://scrapfly.io/academy/data-processing)
- [Scaling](https://scrapfly.io/academy/scaling)
- [Walkthrough Summary](https://scrapfly.io/academy/walkthrough-summary)
- [Scraper Blocking](https://scrapfly.io/academy/scraper-blocking)
- [Proxies](https://scrapfly.io/academy/proxies)

---

# Extraction Rules


 Automatically extract structured data from crawled pages by mapping URL patterns to extraction methods. Combine the power of recursive crawling with intelligent data extraction for fully automated web scraping pipelines.

> #####  Pattern-Based Extraction
> 
>  Extraction rules allow you to apply different extraction strategies to different page types within the same crawl. For example, extract product data from `/products/*` pages and article content from `/blog/*` pages - all in a single crawler configuration.

## How Extraction Rules Work

 The `extraction_rules` parameter maps URL patterns to extraction configurations. As the crawler visits each page, it checks if the URL matches any defined patterns and automatically applies the corresponding extraction method.


## Configuration Syntax

 The `extraction_rules` parameter accepts a JSON object mapping URL patterns to extraction configurations:

 ```
{
  "extraction_rules": {
    "/products/*": {
      "type": "model",
      "value": "product"
    },
    "/blog/*": {
      "type": "prompt",
      "value": "Extract the article title, author, publish date, and main content"
    },
    "/reviews/*": {
      "type": "template",
      "value": "ephemeral:<base64_encoded_template>"
    }
  }
}
```

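The same payload can be sent from code with any HTTP client. Here is a minimal sketch using Python's `requests` package, mirroring the JSON above and the curl examples later on this page (the API key value is a placeholder):

```
import requests

# Map URL patterns to extraction configurations, mirroring the JSON above.
extraction_rules = {
    "/products/*": {"type": "model", "value": "product"},
    "/blog/*": {
        "type": "prompt",
        "value": "Extract the article title, author, publish date, and main content",
    },
}

# Launch a crawl with the rules attached ("YOUR_API_KEY" is a placeholder).
response = requests.post(
    "https://api.scrapfly.io/crawl",
    params={"key": "YOUR_API_KEY"},
    json={
        "url": "https://web-scraping.dev/products",
        "page_limit": 10,
        "extraction_rules": extraction_rules,
    },
)
response.raise_for_status()
print(response.json())
```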

### Pattern Format

- **Exact match**: `"/products/special-page"` matches only that specific URL path
- **Wildcard**: `"/products/*"` matches all pages under /products/
- **Multi-level**: `"/category/*/products/*"` matches nested paths
- **Maximum length**: 1000 characters per pattern
 
> ######  Pattern Matching Rules
> 
> - Patterns are matched against the URL path only (not domain or query parameters)
> - The **first matching pattern** is used - order matters!
> - If no pattern matches, the page is crawled but not extracted

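Scrapfly's matcher is internal to the crawler, but the documented behavior (path-only matching, first match wins) can be approximated locally with Python's `fnmatch` to sanity-check rule order before launching a crawl. This is only an approximation; `fnmatch`'s `*` also crosses `/`, for example:

```
from fnmatch import fnmatch
from urllib.parse import urlparse


def match_rule(url: str, rules: dict) -> dict | None:
    """Return the first extraction rule whose pattern matches the URL path."""
    path = urlparse(url).path  # domain and query parameters are ignored
    for pattern, rule in rules.items():  # dicts preserve insertion order
        if fnmatch(path, pattern):
            return rule  # first matching pattern wins
    return None  # no match: page is crawled but not extracted


rules = {
    "/products/special-page": {"type": "model", "value": "product"},  # exact
    "/products/*": {"type": "model", "value": "product_listing"},     # wildcard
}
print(match_rule("https://example.com/products/special-page?ref=x", rules))
# -> {'type': 'model', 'value': 'product'}, because the exact pattern is listed first
```
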
## Extraction Methods

Extraction rules support the same three extraction methods available in the [Extraction API](https://scrapfly.io/docs/extraction-api/getting-started):

##### Auto Model

`type: "model"`

Use pre-trained AI models to extract common data types automatically.

**Value**: Model name (e.g., `"product"`, `"article"`, `"review_list"`)

[Auto Model Documentation](https://scrapfly.io/docs/extraction-api/automatic-ai)

##### LLM Prompt

`type: "prompt"`

Provide natural language instructions for what data to extract.

**Value**: Prompt text (max 10,000 characters)

[LLM Prompt Documentation](https://scrapfly.io/docs/extraction-api/llm-prompt)

##### Template

`type: "template"`

Define precise extraction rules using CSS, XPath, or regex selectors.

**Value**: `ephemeral:<base64>`

[Template Documentation](https://scrapfly.io/docs/extraction-api/rules-and-template)
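
When a template is passed inline, the template document is serialized, base64-encoded, and prefixed with `ephemeral:`. A minimal sketch of producing such a value (the selector is illustrative; see the Template documentation for the full schema):

```
import base64
import json

# An illustrative extraction template (see the Template docs for the schema).
template = {
    "source": "html",
    "selectors": [
        {"name": "title", "query": "h1::text", "type": "css"},
    ],
}

# Serialize, base64-encode, and prefix with "ephemeral:".
encoded = base64.b64encode(json.dumps(template).encode("utf-8")).decode("ascii")
rule = {"type": "template", "value": f"ephemeral:{encoded}"}
print(rule["value"][:48] + "...")
```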

## Usage Examples


 ### E-commerce Site with Auto Models

 Crawl an e-commerce site and extract structured data from different page types using pre-trained AI models:

 ```
curl -X POST "https://api.scrapfly.io/crawl?key=" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://web-scraping.dev/products",
    "page_limit": 10,
    "extraction_rules": {
      "/product/*": {
        "type": "model",
        "value": "product"
      },
      "/products": {
        "type": "model",
        "value": "product_listing"
      }
    }
  }'
```


#### What This Does

- **Product detail pages** (`/product/*`): Extracts full product data including name, price, variants, description, specifications, reviews, and images
- **Product listing page** (`/products`): Extracts array of products with name, price, image, and link from the paginated catalog
 
> **Example Output:**
> - Product page extracts: `{"name": "Box of Chocolate Candy", "price": {"amount": "9.99", "currency": "USD"}, "rating": 4.7, ...}`
> - Listing page extracts: `{"products": [{"name": "Box of Chocolate...", "price": "$24.99", ...}, ...]}`

**Why this works:** Auto models are pre-trained on thousands of e-commerce sites and automatically detect standard fields like price, name, and description without any configuration.

 

### Blog with LLM Prompt

 Use LLM prompts to extract blog articles with custom metadata and content analysis:

 ```
curl -X POST "https://api.scrapfly.io/crawl?key=" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://scrapfly.io/blog/",
    "page_limit": 10,
    "extraction_rules": {
      "/blog/*": {
        "type": "prompt",
        "value": "Extract the article data as JSON with: title, author_name, publish_date (YYYY-MM-DD format), reading_time_minutes (as number), main_topic, article_summary (max 200 chars), and primary_code_language (if tutorial includes code examples, otherwise null)"
      },
      "/blog/": {
        "type": "model",
        "value": "article"
      }
    }
  }'
```


#### What This Does

- **Blog index** (`/blog/`): Uses the `article` model for fast extraction of the article list page; the exact `/blog/` pattern is listed before the wildcard so it isn't shadowed
- **Blog articles** (`/blog/*`): Uses an LLM prompt to extract article metadata plus custom fields like reading time, topic classification, and code language detection
 
> **Example Output:** `{"title": "How to Scrape Amazon Product Data", "author_name": "Scrapfly Team", "publish_date": "2024-03-15", "reading_time_minutes": 12, "main_topic": "web scraping tutorial", "article_summary": "Learn how to extract Amazon product data using...", "primary_code_language": "Python"}`

  **Why use prompts:** LLM prompts can extract standard fields, derive new insights (topic classification, reading time), and transform data formats (date normalization) in a single extraction pass.

 

### Mixed Extraction Methods

 Combine auto models for standard pages and templates for complex nested structures:

 ```
curl -X POST "https://api.scrapfly.io/crawl?key=" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://web-scraping.dev/products",
    "page_limit": 10,
    "extraction_rules": {
      "/product/*": {
        "type": "template",
        "value": {
          "source": "html",
          "selectors": [
            {
              "name": "name",
              "query": "h3.product-title::text",
              "type": "css"
            },
            {
              "name": "price",
              "query": ".product-price::text",
              "type": "css"
            },
            {
              "name": "image",
              "query": ".product-img::attr(src)",
              "type": "css"
            },
            {
              "name": "specifications",
              "query": ".product-description dl",
              "type": "css",
              "nested": [
                {
                  "name": "key",
                  "query": "dt::text",
                  "type": "css"
                },
                {
                  "name": "value",
                  "query": "dd::text",
                  "type": "css"
                }
              ]
            },
            {
              "name": "variants",
              "query": ".variant-options .variant",
              "type": "css",
              "multiple": true,
              "nested": [
                {
                  "name": "color",
                  "query": ".color-name::text",
                  "type": "css"
                },
                {
                  "name": "size",
                  "query": ".size-value::text",
                  "type": "css"
                },
                {
                  "name": "in_stock",
                  "query": ".stock-status::attr(data-available)",
                  "type": "css"
                }
              ]
            }
          ]
        }
      },
      "/products": {
        "type": "model",
        "value": "product_listing"
      }
    }
  }'
```


#### What This Does

- **Product pages** (`/product/*`): Uses template to extract product details plus nested specs and variants arrays
- **Product listing** (`/products`): Uses `product_listing` model for fast extraction of list pages
 
> **Example Output:** `{"name": "Box of Chocolate Candy", "price": "$9.99", "specifications": [{"key": "Weight", "value": "500g"}, {"key": "Material", "value": "Chocolate"}], "variants": [{"color": "Dark", "size": "Medium", "in_stock": "true"}]}`

  **Why mix methods:** Templates provide precision for complex nested structures (specs, variants) while models offer speed for simple list pages - optimizing both accuracy and cost.

 

 

## Accessing Extracted Data

 When using extraction rules, extracted data is included in the crawler results alongside the raw HTML content. The extracted data appears in the `extracted_data` field for each matched URL.

### Query Extracted Content via API

 ```
curl "https://api.scrapfly.io/crawl/<span class="snippet-var" data-var="crawler_uuid" title="Click to edit: Crawler UUID">{crawler_uuid}</span>/contents?key=&format=json"
```

Response example:

 ```
{
  "pages": [
    {
      "url": "https://web-scraping.dev/product/1",
      "status_code": 200,
      "content": "...",
      "extracted_data": {
        "name": "Box of Chocolate Candy",
        "price": "$9.99",
        "image": "https://web-scraping.dev/assets/products/orange-chocolate-box-medium.png",
        "specifications": [
          {"key": "Weight", "value": "500g"},
          {"key": "Material", "value": "Chocolate"}
        ],
        "variants": [
          {
            "color": "Dark",
            "size": "Medium",
            "in_stock": "true"
          }
        ]
      }
    }
  ]
}
```

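A minimal Python sketch that queries the contents endpoint above and collects the `extracted_data` of every page where a rule matched (the crawler UUID and API key are placeholders):

```
import requests

CRAWLER_UUID = "YOUR_CRAWLER_UUID"  # placeholder

resp = requests.get(
    f"https://api.scrapfly.io/crawl/{CRAWLER_UUID}/contents",
    params={"key": "YOUR_API_KEY", "format": "json"},
)
resp.raise_for_status()

# Keep only pages where an extraction rule matched and produced data.
extracted = {
    page["url"]: page["extracted_data"]
    for page in resp.json()["pages"]
    if page.get("extracted_data")
}
for url, data in extracted.items():
    print(url, "->", data.get("name"))
```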

### Download as Artifact

 For large crawls, download extracted data as part of the WARC artifact. The extracted data is stored in `conversion` records with `Content-Type: application/json`.

 ```
curl "https://api.scrapfly.io/crawl/<span class="snippet-var" data-var="crawler_uuid" title="Click to edit: Crawler UUID">{crawler_uuid}</span>/artifact?key=&type=warc" -o crawl.warc.gz
```


  See [WARC Format](https://scrapfly.io/docs/crawler-api/warc-format) documentation for parsing WARC files with extracted data.

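The archive can then be processed with any WARC reader. A sketch using the third-party `warcio` package, assuming the layout described above (`conversion` records carrying `application/json` payloads):

```
import json

from warcio.archiveiterator import ArchiveIterator

with open("crawl.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        # Extraction results are stored as conversion records with JSON payloads.
        if record.rec_type != "conversion":
            continue
        if record.rec_headers.get_header("Content-Type") != "application/json":
            continue
        target = record.rec_headers.get_header("WARC-Target-URI")
        data = json.loads(record.content_stream().read())
        print(target, "->", data)
```
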
## Best Practices

> #####  Recommended Practices
> 
> - **Order patterns from specific to general**: Place more specific patterns before wildcards, e.g. `"/products/featured"` before `"/products/*"`
> - **Use appropriate extraction methods**: Choose auto models for standard data types, prompts for custom fields, templates for complex structures
> - **Test extraction on sample URLs first**: Use the [standalone Extraction API](https://scrapfly.io/docs/extraction-api/getting-started) to validate extraction configs before crawling
> - **Keep prompts focused**: Shorter, specific prompts yield better extraction results than lengthy instructions
> - **Monitor extraction success**: Check the `extracted_data` field in results to ensure extraction worked as expected

> #####  Common Pitfalls
> 
> - **Pattern order matters**: The first matching pattern wins - avoid overlapping patterns where order is ambiguous
> - **URL encoding in patterns**: Patterns match decoded URL paths, not encoded ones
> - **Extraction adds cost**: Each extracted page uses additional API credits - see [billing documentation](https://scrapfly.io/docs/crawler-api/billing)
> - **Template complexity**: Very complex templates may slow down extraction - consider breaking into multiple simpler rules

## Billing &amp; Credits

 Extraction rules consume additional API credits on top of the base crawling cost:

- **Auto Model**: +5 credits per extracted page
- **LLM Prompt**: +10 credits per extracted page
- **Template**: +1 credit per extracted page
 
  Only pages matching extraction rules incur extraction costs. Non-matched pages are crawled at standard rates. For detailed pricing, see [Crawler API Billing](https://scrapfly.io/docs/crawler-api/billing).

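These per-page surcharges make it easy to estimate extraction cost up front. A rough sketch (the page counts are hypothetical, and base crawl credits are not included):

```
# Extraction surcharge per page, from the list above.
CREDITS_PER_PAGE = {"model": 5, "prompt": 10, "template": 1}

# Hypothetical estimate of how many crawled pages will match each rule type.
expected_matches = {"model": 40, "prompt": 10, "template": 200}

extraction_cost = sum(
    CREDITS_PER_PAGE[rule_type] * pages
    for rule_type, pages in expected_matches.items()
)
print(f"Estimated extraction surcharge: {extraction_cost} credits")
# 40*5 + 10*10 + 200*1 = 500 credits on top of the base crawl cost
```
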
## Limitations

 | Limit | Value | Description |
|---|---|---|
| Max patterns per crawler | 50 | Maximum number of extraction rules |
| Pattern max length | 1000 chars | Maximum characters per URL pattern |
| Prompt max length | 10,000 chars | Maximum characters per LLM prompt |
| Template max size | 100 KB | Maximum size of encoded template |

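These limits can also be checked client-side before submitting a crawl. A sketch that validates an `extraction_rules` dictionary against the table above (the helper and its messages are illustrative, and templates are assumed to be in the `ephemeral:<base64>` string form):

```
MAX_PATTERNS = 50
MAX_PATTERN_LEN = 1_000
MAX_PROMPT_LEN = 10_000
MAX_TEMPLATE_BYTES = 100 * 1024  # 100 KB encoded template


def validate_extraction_rules(rules: dict) -> list[str]:
    """Return a list of limit violations (an empty list means the rules pass)."""
    errors = []
    if len(rules) > MAX_PATTERNS:
        errors.append(f"too many patterns: {len(rules)} > {MAX_PATTERNS}")
    for pattern, rule in rules.items():
        if len(pattern) > MAX_PATTERN_LEN:
            errors.append(f"pattern too long: {pattern[:40]}...")
        if rule["type"] == "prompt" and len(rule["value"]) > MAX_PROMPT_LEN:
            errors.append(f"prompt too long for {pattern}")
        # Assumes the "ephemeral:<base64>" string form of template values.
        if rule["type"] == "template" and len(rule["value"]) > MAX_TEMPLATE_BYTES:
            errors.append(f"template too large for {pattern}")
    return errors


print(validate_extraction_rules({"/products/*": {"type": "model", "value": "product"}}))
# -> []
```
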
## Next Steps

- Learn about [Auto Model extraction](https://scrapfly.io/docs/extraction-api/automatic-ai) and available models
- Explore [LLM Prompt extraction](https://scrapfly.io/docs/extraction-api/llm-prompt) for custom data needs
- Master [Template extraction](https://scrapfly.io/docs/extraction-api/rules-and-template) for precise control
- Understand [how to retrieve crawler results](https://scrapfly.io/docs/crawler-api/results) with extracted data
- Check [crawler billing](https://scrapfly.io/docs/crawler-api/billing) to optimize extraction costs
 
## External Resources

- [Guide: Using Web Scraping for building LLMs and RAG applications](https://scrapfly.io/blog/posts/how-to-use-web-scaping-for-rag-applications)
- [Extraction API Documentation](https://scrapfly.io/docs/extraction-api/getting-started)
- [Base64 Encoding Tool](https://scrapfly.io/dashboard/tools/base64) for template encoding