# Scrapfly Documentation

## Table of Contents

### Dashboard

- [Intro](https://scrapfly.io/docs)
- [Project](https://scrapfly.io/docs/project)
- [Account](https://scrapfly.io/docs/account)
- [Workspace & Team](https://scrapfly.io/docs/workspace-and-team)
- [Billing](https://scrapfly.io/docs/billing)

### Products

#### MCP Server

- [Getting Started](https://scrapfly.io/docs/mcp/getting-started)
- [Tools & API Spec](https://scrapfly.io/docs/mcp/tools)
- [Authentication](https://scrapfly.io/docs/mcp/authentication)
- [Examples & Use Cases](https://scrapfly.io/docs/mcp/examples)
- [FAQ](https://scrapfly.io/docs/mcp/faq)
##### Integrations

- [Overview](https://scrapfly.io/docs/mcp/integrations)
- [Claude Desktop](https://scrapfly.io/docs/mcp/integrations/claude-desktop)
- [Claude Code](https://scrapfly.io/docs/mcp/integrations/claude-code)
- [ChatGPT](https://scrapfly.io/docs/mcp/integrations/chatgpt)
- [Cursor](https://scrapfly.io/docs/mcp/integrations/cursor)
- [Cline](https://scrapfly.io/docs/mcp/integrations/cline)
- [Windsurf](https://scrapfly.io/docs/mcp/integrations/windsurf)
- [Zed](https://scrapfly.io/docs/mcp/integrations/zed)
- [Roo Code](https://scrapfly.io/docs/mcp/integrations/roo-code)
- [VS Code](https://scrapfly.io/docs/mcp/integrations/vscode)
- [LangChain](https://scrapfly.io/docs/mcp/integrations/langchain)
- [LlamaIndex](https://scrapfly.io/docs/mcp/integrations/llamaindex)
- [CrewAI](https://scrapfly.io/docs/mcp/integrations/crewai)
- [OpenAI](https://scrapfly.io/docs/mcp/integrations/openai)
- [n8n](https://scrapfly.io/docs/mcp/integrations/n8n)
- [Make](https://scrapfly.io/docs/mcp/integrations/make)
- [Zapier](https://scrapfly.io/docs/mcp/integrations/zapier)
- [Vapi AI](https://scrapfly.io/docs/mcp/integrations/vapi)
- [Agent Builder](https://scrapfly.io/docs/mcp/integrations/agent-builder)
- [Custom Client](https://scrapfly.io/docs/mcp/integrations/custom-client)


#### Web Scraping API

- [Getting Started](https://scrapfly.io/docs/scrape-api/getting-started)
- [API Specification]()
- [Monitoring](https://scrapfly.io/docs/monitoring)
- [Customize Request](https://scrapfly.io/docs/scrape-api/custom)
- [Debug](https://scrapfly.io/docs/scrape-api/debug)
- [Anti Scraping Protection](https://scrapfly.io/docs/scrape-api/anti-scraping-protection)
- [Proxy](https://scrapfly.io/docs/scrape-api/proxy)
- [Proxy Mode](https://scrapfly.io/docs/scrape-api/proxy-mode)
- [Proxy Mode - Screaming Frog](https://scrapfly.io/docs/scrape-api/proxy-mode/screaming-frog)
- [Proxy Mode - Apify](https://scrapfly.io/docs/scrape-api/proxy-mode/apify)
- [(Auto) Data Extraction](https://scrapfly.io/docs/scrape-api/extraction)
- [Javascript Rendering](https://scrapfly.io/docs/scrape-api/javascript-rendering)
- [Javascript Scenario](https://scrapfly.io/docs/scrape-api/javascript-scenario)
- [SSL](https://scrapfly.io/docs/scrape-api/ssl)
- [DNS](https://scrapfly.io/docs/scrape-api/dns)
- [Cache](https://scrapfly.io/docs/scrape-api/cache)
- [Session](https://scrapfly.io/docs/scrape-api/session)
- [Webhook](https://scrapfly.io/docs/scrape-api/webhook)
- [Screenshot](https://scrapfly.io/docs/scrape-api/screenshot)
- [Errors](https://scrapfly.io/docs/scrape-api/errors)
- [Timeout](https://scrapfly.io/docs/scrape-api/understand-timeout)
- [Throttling](https://scrapfly.io/docs/throttling)
- [Troubleshoot](https://scrapfly.io/docs/scrape-api/troubleshoot)
- [Billing](https://scrapfly.io/docs/scrape-api/billing)
- [FAQ](https://scrapfly.io/docs/scrape-api/faq)

#### Crawler API

- [Getting Started](https://scrapfly.io/docs/crawler-api/getting-started)
- [API Specification]()
- [Retrieving Results](https://scrapfly.io/docs/crawler-api/results)
- [WARC Format](https://scrapfly.io/docs/crawler-api/warc-format)
- [Data Extraction](https://scrapfly.io/docs/crawler-api/extraction-rules)
- [Webhook](https://scrapfly.io/docs/crawler-api/webhook)
- [Billing](https://scrapfly.io/docs/crawler-api/billing)
- [Errors](https://scrapfly.io/docs/crawler-api/errors)
- [Troubleshoot](https://scrapfly.io/docs/crawler-api/troubleshoot)
- [FAQ](https://scrapfly.io/docs/crawler-api/faq)

#### Screenshot API

- [Getting Started](https://scrapfly.io/docs/screenshot-api/getting-started)
- [API Specification]()
- [Accessibility Testing](https://scrapfly.io/docs/screenshot-api/accessibility)
- [Webhook](https://scrapfly.io/docs/screenshot-api/webhook)
- [Billing](https://scrapfly.io/docs/screenshot-api/billing)
- [Errors](https://scrapfly.io/docs/screenshot-api/errors)

#### Extraction API

- [Getting Started](https://scrapfly.io/docs/extraction-api/getting-started)
- [API Specification]()
- [Rules Template](https://scrapfly.io/docs/extraction-api/rules-and-template)
- [LLM Extraction](https://scrapfly.io/docs/extraction-api/llm-prompt)
- [AI Auto Extraction](https://scrapfly.io/docs/extraction-api/automatic-ai)
- [Webhook](https://scrapfly.io/docs/extraction-api/webhook)
- [Billing](https://scrapfly.io/docs/extraction-api/billing)
- [Errors](https://scrapfly.io/docs/extraction-api/errors)
- [FAQ](https://scrapfly.io/docs/extraction-api/faq)

#### Proxy Saver

- [Getting Started](https://scrapfly.io/docs/proxy-saver/getting-started)
- [Fingerprints](https://scrapfly.io/docs/proxy-saver/fingerprints)
- [Optimizations](https://scrapfly.io/docs/proxy-saver/optimizations)
- [SSL Certificates](https://scrapfly.io/docs/proxy-saver/certificates)
- [Protocols](https://scrapfly.io/docs/proxy-saver/protocols)
- [Pacfile](https://scrapfly.io/docs/proxy-saver/pacfile)
- [Secure Credentials](https://scrapfly.io/docs/proxy-saver/security)
- [Billing](https://scrapfly.io/docs/proxy-saver/billing)

#### Cloud Browser API

- [Getting Started](https://scrapfly.io/docs/cloud-browser-api/getting-started)
- [Proxy & Geo-Targeting](https://scrapfly.io/docs/cloud-browser-api/proxy)
- [Unblock API](https://scrapfly.io/docs/cloud-browser-api/unblock)
- [File Downloads](https://scrapfly.io/docs/cloud-browser-api/file-downloads)
- [Session Resume](https://scrapfly.io/docs/cloud-browser-api/session-resume)
- [Human-in-the-Loop](https://scrapfly.io/docs/cloud-browser-api/human-in-the-loop)
- [Debug Mode](https://scrapfly.io/docs/cloud-browser-api/debug-mode)
- [Bring Your Own Proxy](https://scrapfly.io/docs/cloud-browser-api/bring-your-own-proxy)
- [Browser Extensions](https://scrapfly.io/docs/cloud-browser-api/extensions)
##### Integrations

- [Puppeteer](https://scrapfly.io/docs/cloud-browser-api/puppeteer)
- [Playwright](https://scrapfly.io/docs/cloud-browser-api/playwright)
- [Selenium](https://scrapfly.io/docs/cloud-browser-api/selenium)
- [Vercel Agent Browser](https://scrapfly.io/docs/cloud-browser-api/agent-browser)
- [Browser Use](https://scrapfly.io/docs/cloud-browser-api/browser-use)
- [Stagehand](https://scrapfly.io/docs/cloud-browser-api/stagehand)

- [Billing](https://scrapfly.io/docs/cloud-browser-api/billing)
- [Errors](https://scrapfly.io/docs/cloud-browser-api/errors)


### Tools

- [Antibot Detector](https://scrapfly.io/docs/tools/antibot-detector)

### SDK

- [Golang](https://scrapfly.io/docs/sdk/golang)
- [Python](https://scrapfly.io/docs/sdk/python)
- [Rust](https://scrapfly.io/docs/sdk/rust)
- [TypeScript](https://scrapfly.io/docs/sdk/typescript)
- [Scrapy](https://scrapfly.io/docs/sdk/scrapy)

### Integrations

- [Getting Started](https://scrapfly.io/docs/integration/getting-started)
- [LangChain](https://scrapfly.io/docs/integration/langchain)
- [LlamaIndex](https://scrapfly.io/docs/integration/llamaindex)
- [CrewAI](https://scrapfly.io/docs/integration/crewai)
- [Zapier](https://scrapfly.io/docs/integration/zapier)
- [Make](https://scrapfly.io/docs/integration/make)
- [n8n](https://scrapfly.io/docs/integration/n8n)

### Academy

- [Overview](https://scrapfly.io/academy)
- [Web Scraping Overview](https://scrapfly.io/academy/scraping-overview)
- [Tools](https://scrapfly.io/academy/tools-overview)
- [Reverse Engineering](https://scrapfly.io/academy/reverse-engineering)
- [Static Scraping](https://scrapfly.io/academy/static-scraping)
- [HTML Parsing](https://scrapfly.io/academy/html-parsing)
- [Dynamic Scraping](https://scrapfly.io/academy/dynamic-scraping)
- [Hidden API Scraping](https://scrapfly.io/academy/hidden-api-scraping)
- [Headless Browsers](https://scrapfly.io/academy/headless-browsers)
- [Hidden Web Data](https://scrapfly.io/academy/hidden-web-data)
- [JSON Parsing](https://scrapfly.io/academy/json-parsing)
- [Data Processing](https://scrapfly.io/academy/data-processing)
- [Scaling](https://scrapfly.io/academy/scaling)
- [Walkthrough Summary](https://scrapfly.io/academy/walkthrough-summary)
- [Scraper Blocking](https://scrapfly.io/academy/scraper-blocking)
- [Proxies](https://scrapfly.io/academy/proxies)

---

# Crawler API Troubleshooting

 This guide covers common issues when using the Crawler API and how to resolve them. For API errors and error codes, see the [Errors page](https://scrapfly.io/docs/crawler-api/errors).

> **Pro Tip:** Always check the [monitoring dashboard](https://scrapfly.io/docs/monitoring) to inspect crawler status, failed URLs, and detailed error information.

## Crawler Not Discovering URLs 

 If your crawler isn't discovering the URLs you expect, this is usually a path filtering issue. Here's how to diagnose and fix it:

### Check Path Filters

The most common cause is overly restrictive `include_only_paths` or `exclude_paths` filters.

##### Debugging Steps:

1. **Test without filters first** - Run a small crawl (e.g., `page_limit=10`) without any path filters to verify URL discovery works
2. **Add filters incrementally** - Start with broad patterns and gradually make them more specific
3. **Check pattern syntax** - Ensure patterns use correct wildcards: 
    - `*` matches any characters within a path segment
    - `**` matches across multiple path segments
    - Example: `/products/**` matches all product pages
4. **Review crawled URLs** - Use `/crawl/{crawler_uuid}/urls` endpoint to see which URLs were discovered
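The wildcard semantics above can be sketched in Python. This is a hypothetical helper (not part of any Scrapfly SDK) that translates the documented `*`/`**` pattern style into regexes so you can sanity-check your filters locally before launching a crawl:

```python
import re

def path_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a crawler-style path pattern to a regex.

    Illustrates the documented semantics: `*` matches within one
    path segment, `**` matches across segments.
    """
    out = []
    i = 0
    while i < len(pattern):
        if pattern[i:i + 2] == "**":
            out.append(".*")        # cross segment boundaries
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")     # stay within one segment
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("^" + "".join(out) + "$")

# `/products/**` matches all product pages, at any depth
products = path_pattern_to_regex("/products/**")
print(bool(products.match("/products/shoes/nike-air")))  # True
print(bool(products.match("/blog/post-1")))              # False

# `/products/*` matches only direct children
direct = path_pattern_to_regex("/products/*")
print(bool(direct.match("/products/shoes")))             # True
print(bool(direct.match("/products/shoes/nike-air")))    # False
```

This mirrors the rules described above; the crawler's actual matcher may differ in edge cases, so verify against the `/crawl/{crawler_uuid}/urls` endpoint.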

### Enable Sitemaps

 If your target website has a sitemap, enable `use_sitemaps=true` for better URL discovery. Sitemaps provide a comprehensive list of URLs that might not be linked from the homepage.

### Verify Starting URL

 Ensure your starting URL is accessible and contains the links you expect. Test it manually in a browser to verify.

## Crawler Not Following External Links 

 If you expect the crawler to follow links to external domains but it's not happening, here's what to check:

##### Common Issues:

1. **Missing `follow_external_links=true`** - By default, the crawler stays within the starting domain. You must explicitly enable external link following.
2. **Too restrictive `allowed_external_domains`** - If you specify this parameter, ONLY domains matching the patterns will be followed. Check your fnmatch patterns (e.g., `*.web-scraping.dev`).
3. **External pages not being re-crawled** - This is expected behavior! External pages are scraped (content extracted, credits consumed), but their links are NOT followed. The crawler only goes "one hop" into external domains.
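For instance, a request that enables external links but restricts them to one domain family might look like this (parameter names from this page; URL and pattern values are illustrative):

```
{
  "url": "https://web-scraping.dev",
  "follow_external_links": true,
  "allowed_external_domains": ["*.web-scraping.dev"]
}
```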

### Understanding External Link Behavior

> **Important: External Domain Crawling**
>
> When `follow_external_links=true`:
> 
> - **With no `allowed_external_domains`:** ANY external domain is followed (except social media)
> - **With `allowed_external_domains`:** Only matching domains are followed (supports fnmatch patterns)
>  
>  **Key limitation:** External pages ARE scraped but their outbound links are NOT followed. 
>  *Example:* Crawling `web-scraping.dev` → finds link to `wikipedia.org/page1` → scrapes wikipedia.org/page1 → does NOT follow links from wikipedia.org/page1

## Crawler Following Unwanted Subdomains 

 If the crawler is visiting subdomains you don't want (e.g., `blog.web-scraping.dev` when crawling `web-scraping.dev`), here's how to control it:

##### Solutions:

1. **Set `follow_internal_subdomains=false`** — completely disables subdomain crawling. Only the seed URL's exact hostname is crawled, no subdomains at all.
2. **Use `allowed_internal_subdomains`** — keep `follow_internal_subdomains=true` but restrict to specific subdomains (e.g., `["blog.web-scraping.dev", "*.cdn.web-scraping.dev"]`). Only the seed hostname and listed subdomains are crawled.

### Understanding Subdomain Filtering

> **Behavior Summary**
>
> | `follow_internal_subdomains` | `allowed_internal_subdomains` | Result |
> |---|---|---|
> | `true` (default) | `[]` (empty) | **All subdomains** are crawled |
> | `true` | `["blog.web-scraping.dev"]` | Only **seed hostname** + **listed subdomains** are crawled |
> | `false` | *(ignored)* | Only the **seed hostname** is crawled — no subdomains at all |
> 
>  **Note:** Patterns must be subdomains of the seed URL's domain. Supports fnmatch wildcard patterns (`*`). 
>  *Example:* Crawling `www.web-scraping.dev` with `allowed_internal_subdomains=["blog.web-scraping.dev"]` → crawls `www.web-scraping.dev` + `blog.web-scraping.dev`, skips `shop.web-scraping.dev`
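Since the patterns use fnmatch wildcards, the filtering logic can be sketched with Python's `fnmatch` module (a simplified illustration of the behavior table above, not the crawler's actual implementation):

```python
from fnmatch import fnmatch

def subdomain_allowed(hostname: str, seed_host: str, allowed: list[str]) -> bool:
    """Simplified sketch of the subdomain filter described above."""
    if hostname == seed_host:
        return True   # the seed hostname is always crawled
    if not allowed:
        return True   # empty list: all subdomains are crawled
    return any(fnmatch(hostname, pattern) for pattern in allowed)

allowed = ["blog.web-scraping.dev", "*.cdn.web-scraping.dev"]
print(subdomain_allowed("www.web-scraping.dev", "www.web-scraping.dev", allowed))   # True
print(subdomain_allowed("blog.web-scraping.dev", "www.web-scraping.dev", allowed))  # True
print(subdomain_allowed("shop.web-scraping.dev", "www.web-scraping.dev", allowed))  # False
```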

## High Failure Rate 

 If many pages are failing to crawl, check the error codes to identify the root cause:

```
# Get all failed URLs with error details
curl "https://api.scrapfly.io/crawl/{crawler_uuid}/urls?key=&status=failed"
```

### Common Causes and Solutions

| Error Pattern | Solution |
|---|---|
| [`ERR::ASP::SHIELD_PROTECTION_FAILED`](https://scrapfly.io/docs/crawler-api/error/ERR::ASP::SHIELD_PROTECTION_FAILED) | Enable `asp=true` to activate Anti-Scraping Protection and bypass anti-bot measures |
| [`ERR::THROTTLE::MAX_CONCURRENT_REQUEST_EXCEEDED`](https://scrapfly.io/docs/crawler-api/error/ERR::THROTTLE::MAX_CONCURRENT_REQUEST_EXCEEDED) | Reduce `max_concurrency` to avoid overwhelming the target server; start with 2 or 3 |
| [`ERR::SCRAPE::UPSTREAM_TIMEOUT`](https://scrapfly.io/docs/crawler-api/error/ERR::SCRAPE::UPSTREAM_TIMEOUT) | Increase the `timeout` parameter (default: 30 seconds) or reduce `rendering_wait` |
| [`ERR::SCRAPE::BAD_UPSTREAM_RESPONSE`](https://scrapfly.io/docs/crawler-api/error/ERR::SCRAPE::BAD_UPSTREAM_RESPONSE) | Verify the target website is online, accessible, and that DNS resolves correctly |

 For complete error definitions and solutions, see the [Crawler API Errors page](https://scrapfly.io/docs/crawler-api/errors).
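To triage failures at scale, it helps to group the failed-URL results by error code. Here is a sketch that assumes each result carries a `url` field and an `error.code` field — the exact response shape may differ, so check it against a real response first:

```python
from collections import Counter

def failures_by_error(results: list[dict]) -> Counter:
    """Count failed URLs per error code for quick triage."""
    return Counter(r.get("error", {}).get("code", "UNKNOWN") for r in results)

# Sample data mimicking a failed-URLs response
sample = [
    {"url": "https://web-scraping.dev/a", "error": {"code": "ERR::ASP::SHIELD_PROTECTION_FAILED"}},
    {"url": "https://web-scraping.dev/b", "error": {"code": "ERR::SCRAPE::UPSTREAM_TIMEOUT"}},
    {"url": "https://web-scraping.dev/c", "error": {"code": "ERR::ASP::SHIELD_PROTECTION_FAILED"}},
]
print(failures_by_error(sample).most_common(1))
# [('ERR::ASP::SHIELD_PROTECTION_FAILED', 2)]
```

The most frequent code usually points to the single fix (e.g., enabling `asp`) that resolves the bulk of failures.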

## Crawler Taking Too Long 

 Crawler performance depends on several factors. Here's how to optimize speed:

### Increase Concurrency

 The `max_concurrency` parameter controls how many pages are crawled simultaneously. Higher values = faster crawls, but stay within your account limits.

**Recommended values:**

- Small sites (< 100 pages): `max_concurrency=5`
- Medium sites (100-1000 pages): `max_concurrency=10`
- Large sites (1000+ pages): `max_concurrency=20+` (if your account allows)

### Optimize Feature Usage

 | Feature | Performance Impact | When to Disable |
|---|---|---|
| `asp` | **5× slower** | Disable if the site doesn't have anti-bot protection |
| `rendering_wait` | Adds delay per page | Reduce or remove if pages load quickly |
| `proxy_pool=public_residential_pool` | Slower than datacenter | Use datacenter proxies when residential IPs aren't required |
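Putting these together, a speed-oriented configuration might look like the following (values are illustrative; the datacenter pool name is taken from the Proxy documentation and should be verified for your account):

```
{
  "url": "https://web-scraping.dev",
  "max_concurrency": 10,
  "asp": false,
  "proxy_pool": "public_datacenter_pool"
}
```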

### Set Time Limits

 Use `max_duration` to prevent indefinite crawls. The crawler will stop gracefully when this limit is reached:

 ```
{
  "url": "https://web-scraping.dev",
  "max_duration": 3600,
  "page_limit": 1000
}
```

This crawler will stop after 1 hour or 1000 pages, whichever comes first.

## Budget Control Issues 

 Controlling costs is critical when crawling large websites. Use these strategies to stay within budget:

### Set Credit Limits

 Use `max_api_credit` to automatically stop crawling when your budget is reached:

 ```
{
  "url": "https://web-scraping.dev",
  "max_api_credit": 1000,
  "page_limit": 10000
}
```

This crawler will stop after spending 1000 credits or crawling 10000 pages, whichever comes first.

### Monitor Costs in Real-Time

 Check the crawler status endpoint to see current credit usage:

```
curl "https://api.scrapfly.io/crawl/{crawler_uuid}/status?key="
```

The response includes `state.api_credit_used`, showing total credits consumed so far.

### Reduce Per-Page Costs

- **Disable ASP** if not needed - saves significant credits per page
- **Use datacenter proxies** instead of residential when possible
- **Enable caching** for re-crawls to avoid re-scraping unchanged pages
- **Use stricter path filtering** to crawl only necessary pages
- **Choose efficient formats** - markdown and text are cheaper than full HTML
 
 For detailed pricing information, see [Crawler API Billing](https://scrapfly.io/docs/crawler-api/billing).

## Debugging Tips 

### Check Crawler Status

 The status endpoint provides real-time information about your crawler:

```
curl "https://api.scrapfly.io/crawl/{crawler_uuid}/status?key="
```

**Key fields to monitor** (all nested under `state` in the response):

- `state.status` - `PENDING`, `RUNNING`, `DONE`, or `CANCELLED`
- `state.urls_extracted` - Total URLs discovered by the crawler
- `state.urls_visited` - Total URLs successfully crawled
- `state.urls_failed` - Total URLs that failed to crawl
- `state.api_credit_used` - Credits consumed so far
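A minimal sketch that turns these fields into a one-line progress summary — field paths as listed above; the real response will contain additional fields:

```python
def summarize_status(response: dict) -> str:
    """Build a one-line progress summary from a crawler status response."""
    state = response.get("state", {})
    return (
        f"{state.get('status', '?')}: "
        f"{state.get('urls_visited', 0)}/{state.get('urls_extracted', 0)} crawled, "
        f"{state.get('urls_failed', 0)} failed, "
        f"{state.get('api_credit_used', 0)} credits used"
    )

sample = {"state": {"status": "RUNNING", "urls_extracted": 120,
                    "urls_visited": 85, "urls_failed": 3, "api_credit_used": 412}}
print(summarize_status(sample))
# RUNNING: 85/120 crawled, 3 failed, 412 credits used
```

Polling this in a loop (e.g., every 30 seconds) gives a simple live view of crawl progress and spend.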
 
### Inspect Failed URLs

 Get detailed error information for failed pages:

```
curl "https://api.scrapfly.io/crawl/{crawler_uuid}/urls?key=&status=failed"
```

### Test with Small Crawls First

 Before running a large crawl, test with `page_limit=10` to:

- Verify path filters are working correctly
- Check that target pages are accessible
- Confirm content extraction is working
- Estimate costs for the full crawl
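A test crawl can reuse your production configuration with a small cap, for example (illustrative values):

```
{
  "url": "https://web-scraping.dev",
  "page_limit": 10,
  "include_only_paths": ["/products/**"]
}
```

Once the 10-page run looks correct, raise `page_limit` for the full crawl.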
 
## Getting Help 

 If you're still experiencing issues after trying these solutions:

- Check the [monitoring dashboard](https://scrapfly.io/docs/monitoring) for detailed logs
- Review the [error codes reference](https://scrapfly.io/docs/crawler-api/errors) for specific errors
- Contact [support](https://scrapfly.io/docs/support) with your crawler UUID for personalized assistance
 
## Related Documentation 

- [Crawler API Getting Started](https://scrapfly.io/docs/crawler-api/getting-started)
- [Crawler API Errors](https://scrapfly.io/docs/crawler-api/errors)
- [Crawler API Billing](https://scrapfly.io/docs/crawler-api/billing)
- [Monitoring Dashboard](https://scrapfly.io/docs/monitoring)