Crawler API Troubleshooting

This guide covers common issues when using the Crawler API and how to resolve them. For API errors and error codes, see the Errors page.

Crawler Not Discovering URLs

If your crawler isn't discovering the URLs you expect, this is usually a path filtering issue. Here's how to diagnose and fix it:

Check Path Filters

The most common cause is overly restrictive include_only_paths or exclude_paths filters.

Debugging Steps:
  1. Test without filters first - Run a small crawl (e.g., max_pages=10) without any path filters to verify URL discovery works
  2. Add filters incrementally - Start with broad patterns and gradually make them more specific
  3. Check pattern syntax - Ensure patterns use correct wildcards:
    • * matches any characters within a path segment
    • ** matches across multiple path segments
    • Example: /products/** matches all product pages
  4. Review crawled URLs - Use /crawl/{uuid}/urls endpoint to see which URLs were discovered
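
For step 4, a minimal sketch of listing discovered URLs with Python and requests (the /crawl/{uuid}/urls path comes from this guide; the base URL, auth header, and response shape are assumptions, so adapt them to your setup):

import requests

API_BASE = "https://api.example.com"  # placeholder host - substitute your API's real base URL
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}  # auth scheme is an assumption
crawl_uuid = "YOUR_CRAWL_UUID"

# List the URLs discovered so far (the /crawl/{uuid}/urls path comes from this guide).
resp = requests.get(f"{API_BASE}/crawl/{crawl_uuid}/urls", headers=HEADERS)
resp.raise_for_status()
for entry in resp.json().get("urls", []):  # the "urls" key is an assumed response field
    print(entry)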

Enable Sitemaps

If your target website has a sitemap, enable use_sitemaps=true for better URL discovery. Sitemaps provide a comprehensive list of URLs that might not be linked from the homepage.
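
A minimal sketch of starting a small sitemap-enabled test crawl (use_sitemaps and max_pages appear in this guide; the /crawl endpoint, auth header, and JSON body layout are assumptions):

import requests

API_BASE = "https://api.example.com"  # placeholder host
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}  # auth scheme is an assumption

# Start a small test crawl that also uses the site's sitemap for URL discovery.
payload = {
    "url": "https://target-site.com",  # starting URL
    "use_sitemaps": True,              # pull URLs from the sitemap, not only from page links
    "max_pages": 10,                   # keep the test crawl small while verifying discovery
}
resp = requests.post(f"{API_BASE}/crawl", json=payload, headers=HEADERS)
resp.raise_for_status()
print(resp.json())  # assumed to include the crawl uuid for follow-up status calls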

Verify Starting URL

Ensure your starting URL is accessible and contains the links you expect. Test it manually in a browser to verify.

Understanding External Link Behavior

If you expect the crawler to follow links to external domains but it's not happening, here's what to check:

Common Issues:
  1. Missing follow_external_links=true - By default, the crawler stays within the starting domain. You must explicitly enable external link following.
  2. Too restrictive allowed_external_domains - If you specify this parameter, ONLY domains matching the patterns will be followed. Check your fnmatch patterns (e.g., *.example.com); a quick local check follows this list.
  3. External pages not being re-crawled - This is expected behavior! External pages are scraped (content extracted, credits consumed), but their links are NOT followed. The crawler only goes "one hop" into external domains.
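
A quick local sanity check for those fnmatch patterns, using Python's standard fnmatch module:

from fnmatch import fnmatch

# "*.example.com" matches subdomains but NOT the bare apex domain:
print(fnmatch("shop.example.com", "*.example.com"))  # True
print(fnmatch("example.com", "*.example.com"))       # False
# If the apex domain should also be followed, add it as its own pattern,
# e.g. allowed_external_domains=["*.example.com", "example.com"].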

High Failure Rate

If many pages are failing to crawl, check the error codes to identify the root cause:

Common Causes and Solutions

  • ERR::ASP::SHIELD_PROTECTION_FAILED - Enable asp=true to bypass anti-bot protection. This activates Anti-Scraping Protection.
  • ERR::THROTTLE::MAX_CONCURRENT_REQUEST_EXCEEDED - Reduce max_concurrency to avoid overwhelming the target server. Try starting with max_concurrency=2 or 3.
  • ERR::SCRAPE::UPSTREAM_TIMEOUT - Increase the timeout parameter or reduce rendering_wait. The default timeout is 30 seconds; increase it if needed.
  • ERR::SCRAPE::BAD_UPSTREAM_RESPONSE - Verify the target domain is accessible and DNS resolves correctly. Check whether the website is online.
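
When these errors dominate, here is a minimal sketch of restarting the crawl with the most common mitigations applied (asp, max_concurrency, and timeout are described above; the /crawl endpoint, auth header, JSON layout, and seconds as the timeout unit are assumptions):

import requests

API_BASE = "https://api.example.com"  # placeholder host
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}  # auth scheme is an assumption

payload = {
    "url": "https://target-site.com",
    "asp": True,           # enable Anti-Scraping Protection for shielded sites
    "max_concurrency": 2,  # start low to avoid throttling errors
    "timeout": 60,         # raise from the 30-second default (value assumed to be in seconds)
}
resp = requests.post(f"{API_BASE}/crawl", json=payload, headers=HEADERS)
resp.raise_for_status()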

For complete error definitions and solutions, see the Crawler API Errors page.

Crawler Taking Too Long

Crawler performance depends on several factors. Here's how to optimize speed:

Increase Concurrency

The max_concurrency parameter controls how many pages are crawled simultaneously. Higher values mean faster crawls, but stay within your account limits.

Recommended values:
  • Small sites (< 100 pages): max_concurrency=5
  • Medium sites (100-1000 pages): max_concurrency=10
  • Large sites (1000+ pages): max_concurrency=20+ (if account allows)

Optimize Feature Usage

  • asp - 5× slower. Disable it if the site doesn't have anti-bot protection.
  • rendering_wait - Adds a delay to every page. Reduce or remove it if pages load quickly.
  • proxy_pool=public_residential_pool - Slower than datacenter proxies. Use datacenter proxies when residential IPs aren't required.

Set Time Limits

Use max_duration to prevent indefinite crawls. The crawler will stop gracefully when this limit is reached:
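
A minimal sketch of such a run (max_duration and max_pages appear in this guide; the /crawl endpoint, auth header, JSON layout, and seconds as the max_duration unit are assumptions):

import requests

API_BASE = "https://api.example.com"  # placeholder host
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}  # auth scheme is an assumption

payload = {
    "url": "https://target-site.com",
    "max_duration": 3600,  # stop after 1 hour (value assumed to be in seconds)
    "max_pages": 1000,     # or after 1000 pages, whichever limit is hit first
}
requests.post(f"{API_BASE}/crawl", json=payload, headers=HEADERS).raise_for_status()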

The crawl will stop after 1 hour or 1000 pages, whichever comes first.

Budget Control Issues

Controlling costs is critical when crawling large websites. Use these strategies to stay within budget:

Set Credit Limits

Use max_api_credit to automatically stop crawling when your budget is reached:
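
A minimal sketch of a budget-capped run (max_api_credit and max_pages appear in this guide; the /crawl endpoint, auth header, and JSON layout are assumptions):

import requests

API_BASE = "https://api.example.com"  # placeholder host
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}  # auth scheme is an assumption

payload = {
    "url": "https://target-site.com",
    "max_api_credit": 1000,  # stop once 1000 credits have been spent
    "max_pages": 10000,      # or once 10000 pages have been crawled
}
requests.post(f"{API_BASE}/crawl", json=payload, headers=HEADERS).raise_for_status()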

The crawl will stop after spending 1000 credits or crawling 10000 pages, whichever comes first.

Monitor Costs in Real-Time

Check the crawler status endpoint to see current credit usage:
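
A minimal polling sketch (api_credit_used and the status values appear in this guide; the /crawl/{uuid} status path, auth header, and exact response field names are assumptions):

import time
import requests

API_BASE = "https://api.example.com"  # placeholder host
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}  # auth scheme is an assumption
crawl_uuid = "YOUR_CRAWL_UUID"

# Poll the crawl status (path assumed to be /crawl/{uuid}) until it finishes.
while True:
    status = requests.get(f"{API_BASE}/crawl/{crawl_uuid}", headers=HEADERS).json()
    print(status.get("status"), "- credits used:", status.get("api_credit_used"))
    if status.get("status") in ("COMPLETED", "FAILED", "CANCELLED"):
        break
    time.sleep(30)  # check twice a minute instead of hammering the API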

The response includes api_credit_used showing total credits consumed so far.

Reduce Per-Page Costs

  • Disable ASP if not needed - saves significant credits per page
  • Use datacenter proxies instead of residential when possible
  • Enable caching for re-crawls to avoid re-scraping unchanged pages
  • Use stricter path filtering to crawl only necessary pages
  • Choose efficient formats - markdown and text are cheaper than full HTML

For detailed pricing information, see Crawler API Billing.

Debugging Tips

Check Crawler Status

The status endpoint provides real-time information about your crawler:
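
A minimal sketch of fetching and printing it (the field names are listed below; the /crawl/{uuid} path, auth header, and response shape are assumptions):

import requests

API_BASE = "https://api.example.com"  # placeholder host
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}  # auth scheme is an assumption
crawl_uuid = "YOUR_CRAWL_UUID"

# Fetch the current crawl status (path assumed to be /crawl/{uuid}).
status = requests.get(f"{API_BASE}/crawl/{crawl_uuid}", headers=HEADERS).json()
for field in ("status", "urls_discovered", "urls_crawled", "urls_failed", "api_credit_used"):
    print(field, "=", status.get(field))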

Key fields to monitor:

  • status - RUNNING, COMPLETED, FAILED, CANCELLED
  • urls_discovered - Total URLs found by the crawler
  • urls_crawled - Total URLs successfully crawled
  • urls_failed - Total URLs that failed to crawl
  • api_credit_used - Credits consumed so far

Inspect Failed URLs

Get detailed error information for failed pages:
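
A minimal sketch that reuses the /crawl/{uuid}/urls listing from this guide and keeps only the failures (the auth header and the per-URL field names are assumptions):

import requests

API_BASE = "https://api.example.com"  # placeholder host
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}  # auth scheme is an assumption
crawl_uuid = "YOUR_CRAWL_UUID"

# Reuse the URL listing endpoint and keep only the entries that failed.
entries = requests.get(f"{API_BASE}/crawl/{crawl_uuid}/urls", headers=HEADERS).json().get("urls", [])
for entry in (e for e in entries if e.get("error")):  # "urls" and "error" are assumed field names
    print(entry.get("url"), "->", entry.get("error"))  # e.g. ERR::SCRAPE::UPSTREAM_TIMEOUT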

Test with Small Crawls First

Before running a large crawl, test with max_pages=10 to:

  • Verify path filters are working correctly
  • Check that target pages are accessible
  • Confirm content extraction is working
  • Estimate costs for the full crawl

Getting Help

If you're still experiencing issues after trying these solutions:

Summary