Crawler API Troubleshooting

This guide covers common issues when using the Crawler API and how to resolve them. For API errors and error codes, see the Errors page.

Crawler Not Discovering URLs

If your crawler isn't discovering the URLs you expect, this is usually a path filtering issue. Here's how to diagnose and fix it:

Check Path Filters

The most common cause is overly restrictive include_only_paths or exclude_paths filters.

Debugging Steps:
  1. Test without filters first - Run a small crawl (e.g., max_pages=10) without any path filters to verify URL discovery works
  2. Add filters incrementally - Start with broad patterns and gradually make them more specific
  3. Check pattern syntax - Ensure patterns use correct wildcards:
    • * matches any characters within a path segment
    • ** matches across multiple path segments
    • Example: /products/** matches all product pages
  4. Review crawled URLs - Use /crawl/{uuid}/urls endpoint to see which URLs were discovered
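
For step 4, a minimal sketch of listing discovered URLs with Python and requests (the /crawl/{uuid}/urls path comes from this guide; the base URL, auth header, and response shape are assumptions, so adapt them to your setup):

import requests

API_BASE = "https://api.example.com"  # placeholder host - substitute your API's real base URL
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}  # auth scheme is an assumption
crawl_uuid = "YOUR_CRAWL_UUID"

# List the URLs discovered so far (the /crawl/{uuid}/urls path comes from this guide).
resp = requests.get(f"{API_BASE}/crawl/{crawl_uuid}/urls", headers=HEADERS)
resp.raise_for_status()
for entry in resp.json().get("urls", []):  # the "urls" key is an assumed response field
    print(entry)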

Enable Sitemaps

If your target website has a sitemap, enable use_sitemaps=true for better URL discovery. Sitemaps provide a comprehensive list of URLs that might not be linked from the homepage.
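
A minimal sketch of starting a small sitemap-enabled test crawl (use_sitemaps and max_pages appear in this guide; the /crawl endpoint, auth header, and JSON body layout are assumptions):

import requests

API_BASE = "https://api.example.com"  # placeholder host
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}  # auth scheme is an assumption

# Start a small test crawl that also uses the site's sitemap for URL discovery.
payload = {
    "url": "https://target-site.com",  # starting URL
    "use_sitemaps": True,              # pull URLs from the sitemap, not only from page links
    "max_pages": 10,                   # keep the test crawl small while verifying discovery
}
resp = requests.post(f"{API_BASE}/crawl", json=payload, headers=HEADERS)
resp.raise_for_status()
print(resp.json())  # assumed to include the crawl uuid for follow-up status calls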

Verify Starting URL

Ensure your starting URL is accessible and contains the links you expect. Test it manually in a browser to verify.

Understanding External Link Behavior

If you expect the crawler to follow links to external domains but it's not happening, here's what to check:

Common Issues:
  1. Missing follow_external_links=true - By default, the crawler stays within the starting domain. You must explicitly enable external link following.
  2. Too restrictive allowed_external_domains - If you specify this parameter, ONLY domains matching the patterns will be followed. Check your fnmatch patterns (e.g., *.example.com); a quick local check follows this list.
  3. External pages not being re-crawled - This is expected behavior! External pages are scraped (content extracted, credits consumed), but their links are NOT followed. The crawler only goes "one hop" into external domains.
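
A quick local sanity check for those fnmatch patterns, using Python's standard fnmatch module:

from fnmatch import fnmatch

# "*.example.com" matches subdomains but NOT the bare apex domain:
print(fnmatch("shop.example.com", "*.example.com"))  # True
print(fnmatch("example.com", "*.example.com"))       # False
# If the apex domain should also be followed, add it as its own pattern,
# e.g. allowed_external_domains=["*.example.com", "example.com"].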

High Failure Rate

If many pages are failing to crawl, check the error codes to identify the root cause:

Common Causes and Solutions

  • ERR::ASP::SHIELD_PROTECTION_FAILED - Enable asp=true to bypass anti-bot protection. This activates Anti-Scraping Protection.
  • ERR::THROTTLE::MAX_CONCURRENT_REQUEST_EXCEEDED - Reduce max_concurrency to avoid overwhelming the target server. Try starting with max_concurrency=2 or 3.
  • ERR::SCRAPE::UPSTREAM_TIMEOUT - Increase the timeout parameter or reduce rendering_wait. The default timeout is 30 seconds; increase it if needed.
  • ERR::SCRAPE::BAD_UPSTREAM_RESPONSE - Verify the target domain is accessible and DNS resolves correctly. Check whether the website is online.
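
When these errors dominate, here is a minimal sketch of restarting the crawl with the most common mitigations applied (asp, max_concurrency, and timeout are described above; the /crawl endpoint, auth header, JSON layout, and seconds as the timeout unit are assumptions):

import requests

API_BASE = "https://api.example.com"  # placeholder host
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}  # auth scheme is an assumption

payload = {
    "url": "https://target-site.com",
    "asp": True,           # enable Anti-Scraping Protection for shielded sites
    "max_concurrency": 2,  # start low to avoid throttling errors
    "timeout": 60,         # raise from the 30-second default (value assumed to be in seconds)
}
resp = requests.post(f"{API_BASE}/crawl", json=payload, headers=HEADERS)
resp.raise_for_status()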

For complete error definitions and solutions, see the Crawler API Errors page.

Crawler Taking Too Long

Crawler performance depends on several factors. Here's how to optimize speed:

Increase Concurrency

The max_concurrency parameter controls how many pages are crawled simultaneously. Higher values mean faster crawls, but stay within your account limits.

Recommended values:
  • Small sites (< 100 pages): max_concurrency=5
  • Medium sites (100-1000 pages): max_concurrency=10
  • Large sites (1000+ pages): max_concurrency=20+ (if account allows)

Optimize Feature Usage

  • asp - 5× slower. Disable it if the site doesn't have anti-bot protection.
  • rendering_wait - Adds a delay to every page. Reduce or remove it if pages load quickly.
  • proxy_pool=public_residential_pool - Slower than datacenter proxies. Use datacenter proxies when residential IPs aren't required.

Set Time Limits

Use max_duration to prevent indefinite crawls. The crawler will stop gracefully when this limit is reached:
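
A minimal sketch of such a run (max_duration and max_pages appear in this guide; the /crawl endpoint, auth header, JSON layout, and seconds as the max_duration unit are assumptions):

import requests

API_BASE = "https://api.example.com"  # placeholder host
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}  # auth scheme is an assumption

payload = {
    "url": "https://target-site.com",
    "max_duration": 3600,  # stop after 1 hour (value assumed to be in seconds)
    "max_pages": 1000,     # or after 1000 pages, whichever limit is hit first
}
requests.post(f"{API_BASE}/crawl", json=payload, headers=HEADERS).raise_for_status()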

The crawl will stop after 1 hour or 1000 pages, whichever comes first.

Budget Control Issues

Controlling costs is critical when crawling large websites. Use these strategies to stay within budget:

Set Credit Limits

Use max_api_credit to automatically stop crawling when your budget is reached:
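
A minimal sketch of a budget-capped run (max_api_credit and max_pages appear in this guide; the /crawl endpoint, auth header, and JSON layout are assumptions):

import requests

API_BASE = "https://api.example.com"  # placeholder host
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}  # auth scheme is an assumption

payload = {
    "url": "https://target-site.com",
    "max_api_credit": 1000,  # stop once 1000 credits have been spent
    "max_pages": 10000,      # or once 10000 pages have been crawled
}
requests.post(f"{API_BASE}/crawl", json=payload, headers=HEADERS).raise_for_status()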

The crawl will stop after spending 1000 credits or crawling 10000 pages, whichever comes first.

Monitor Costs in Real-Time

Check the crawler status endpoint to see current credit usage:
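
A minimal polling sketch (api_credit_used and the status values appear in this guide; the /crawl/{uuid} status path, auth header, and exact response field names are assumptions):

import time
import requests

API_BASE = "https://api.example.com"  # placeholder host
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}  # auth scheme is an assumption
crawl_uuid = "YOUR_CRAWL_UUID"

# Poll the crawl status (path assumed to be /crawl/{uuid}) until it finishes.
while True:
    status = requests.get(f"{API_BASE}/crawl/{crawl_uuid}", headers=HEADERS).json()
    print(status.get("status"), "- credits used:", status.get("api_credit_used"))
    if status.get("status") in ("COMPLETED", "FAILED", "CANCELLED"):
        break
    time.sleep(30)  # check twice a minute instead of hammering the API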

The response includes api_credit_used showing total credits consumed so far.

Reduce Per-Page Costs

  • Disable ASP if not needed - saves significant credits per page
  • Use datacenter proxies instead of residential when possible
  • Enable caching for re-crawls to avoid re-scraping unchanged pages
  • Use stricter path filtering to crawl only necessary pages
  • Choose efficient formats - markdown and text are cheaper than full HTML

For detailed pricing information, see Crawler API Billing.

Debugging Tips

Check Crawler Status

The status endpoint provides real-time information about your crawler:
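
A minimal sketch of fetching and printing it (the field names are listed below; the /crawl/{uuid} path, auth header, and response shape are assumptions):

import requests

API_BASE = "https://api.example.com"  # placeholder host
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}  # auth scheme is an assumption
crawl_uuid = "YOUR_CRAWL_UUID"

# Fetch the current crawl status (path assumed to be /crawl/{uuid}).
status = requests.get(f"{API_BASE}/crawl/{crawl_uuid}", headers=HEADERS).json()
for field in ("status", "urls_discovered", "urls_crawled", "urls_failed", "api_credit_used"):
    print(field, "=", status.get(field))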

Key fields to monitor:

  • status - RUNNING, COMPLETED, FAILED, CANCELLED
  • urls_discovered - Total URLs found by the crawler
  • urls_crawled - Total URLs successfully crawled
  • urls_failed - Total URLs that failed to crawl
  • api_credit_used - Credits consumed so far

Inspect Failed URLs

Get detailed error information for failed pages:
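
A minimal sketch that reuses the /crawl/{uuid}/urls listing from this guide and keeps only the failures (the auth header and the per-URL field names are assumptions):

import requests

API_BASE = "https://api.example.com"  # placeholder host
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}  # auth scheme is an assumption
crawl_uuid = "YOUR_CRAWL_UUID"

# Reuse the URL listing endpoint and keep only the entries that failed.
entries = requests.get(f"{API_BASE}/crawl/{crawl_uuid}/urls", headers=HEADERS).json().get("urls", [])
for entry in (e for e in entries if e.get("error")):  # "urls" and "error" are assumed field names
    print(entry.get("url"), "->", entry.get("error"))  # e.g. ERR::SCRAPE::UPSTREAM_TIMEOUT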

Test with Small Crawls First

Before running a large crawl, test with max_pages=10 to:

  • Verify path filters are working correctly
  • Check that target pages are accessible
  • Confirm content extraction is working
  • Estimate costs for the full crawl

Getting Help

If you're still experiencing issues after trying these solutions:

Summary