Crawler API Troubleshooting
This guide covers common issues when using the Crawler API and how to resolve them. For API errors and error codes, see the Errors page.
Crawler Not Discovering URLs
If your crawler isn't discovering the URLs you expect, this is usually a path filtering issue. Here's how to diagnose and fix it:
Check Path Filters
The most common cause is overly restrictive include_only_paths or exclude_paths filters.
Debugging Steps:
- Test without filters first - Run a small crawl (e.g., max_pages=10) without any path filters to verify URL discovery works (see the sketch after this list)
- Add filters incrementally - Start with broad patterns and gradually make them more specific
- Check pattern syntax - Ensure patterns use correct wildcards:
  - `*` matches any characters within a path segment
  - `**` matches across multiple path segments
  - Example: `/products/**` matches all product pages
- Review crawled URLs - Use the /crawl/{uuid}/urls endpoint to see which URLs were discovered
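The sketch below walks those steps with Python's requests library. Only the max_pages and include_only_paths parameters and the /crawl/{uuid}/urls endpoint come from this guide; the base URL, authentication header, POST /crawl request shape, and response field names are placeholder assumptions to adjust against your API reference.

```python
import os
import requests

BASE_URL = "https://api.example.com"  # hypothetical host -- substitute your Crawler API host
HEADERS = {"Authorization": f"Bearer {os.environ['CRAWLER_API_KEY']}"}  # assumed auth scheme

# Step 1: a small crawl with NO path filters to confirm URL discovery works at all.
# The POST /crawl body shape is assumed for illustration.
crawl = requests.post(
    f"{BASE_URL}/crawl",
    headers=HEADERS,
    json={"url": "https://example.com", "max_pages": 10},
).json()
crawl_uuid = crawl["uuid"]  # assumed response field

# Step 4: review which URLs were actually discovered (endpoint named in this guide).
discovered = requests.get(f"{BASE_URL}/crawl/{crawl_uuid}/urls", headers=HEADERS).json()
print(discovered)

# Steps 2-3: once discovery looks right, re-run with incrementally stricter filters,
# e.g. include_only_paths=["/products/**"] to keep only product pages.
```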
Enable Sitemaps
If your target website has a sitemap, enable use_sitemaps=true for better URL discovery.
Sitemaps provide a comprehensive list of URLs that might not be linked from the homepage.
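Under the same assumed POST /crawl body shape as the sketch above, sitemap discovery is a single extra flag; use_sitemaps and max_pages are the only parameter names taken from this guide.

```python
# Assumed POST /crawl body, as in the previous sketch, with sitemap discovery enabled.
crawl_config = {
    "url": "https://example.com",
    "use_sitemaps": True,  # also pull URLs from the site's sitemap(s), not just page links
    "max_pages": 10,       # keep the test small while verifying discovery
}
```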
Verify Starting URL
Ensure your starting URL is accessible and contains the links you expect. Test it manually in a browser to verify.
Crawler Not Following External Links
If you expect the crawler to follow links to external domains but it's not happening, here's what to check:
Common Issues:
- Missing follow_external_links=true - By default, the crawler stays within the starting domain. You must explicitly enable external link following.
- Too restrictive allowed_external_domains - If you specify this parameter, ONLY domains matching the patterns will be followed. Check your fnmatch patterns (e.g., *.example.com).
- External pages not being re-crawled - This is expected behavior! External pages are scraped (content extracted, credits consumed), but their links are NOT followed. The crawler only goes "one hop" into external domains.
Understanding External Link Behavior
When follow_external_links=true:
- With no allowed_external_domains: ANY external domain is followed (except social media)
- With allowed_external_domains: Only matching domains are followed (supports fnmatch patterns)
Key limitation: External pages ARE scraped but their outbound links are NOT followed.
Example: Crawling example.com → finds link to wikipedia.org/page1 → scrapes wikipedia.org/page1 → does NOT follow links from wikipedia.org/page1
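To illustrate, here is a hedged configuration sketch: follow_external_links and allowed_external_domains are the parameters described above, while the request body shape around them is an assumption.

```python
# Assumed POST /crawl body illustrating the external-link settings from this section.
crawl_config = {
    "url": "https://example.com",
    "follow_external_links": True,                     # without this, the crawler stays on example.com
    "allowed_external_domains": ["*.wikipedia.org"],   # fnmatch patterns; omit to allow any external domain
    "max_pages": 100,
}
# Remember the one-hop rule: wikipedia.org pages matched here are scraped,
# but links found on those pages are not followed further.
```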
High Failure Rate
If many pages are failing to crawl, check the error codes to identify the root cause:
Common Causes and Solutions
| Error Pattern | Solution |
|---|---|
| ERR::ASP::SHIELD_PROTECTION_FAILED | Enable asp=true to bypass anti-bot protection. This activates Anti-Scraping Protection. |
| ERR::THROTTLE::MAX_CONCURRENT_REQUEST_EXCEEDED | Reduce max_concurrency to avoid overwhelming the target server. Try starting with max_concurrency=2 or 3. |
| ERR::SCRAPE::UPSTREAM_TIMEOUT | Increase the timeout parameter or reduce rendering_wait. The default timeout is 30 seconds; increase it if needed. |
| ERR::SCRAPE::BAD_UPSTREAM_RESPONSE | Verify the target domain is accessible and DNS is working correctly. Check if the website is online. |
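As a sketch of how those fixes translate into request parameters (the parameter names come from the table above; the surrounding body shape and units are assumptions):

```python
# Assumed POST /crawl body tuned to address the error patterns above.
crawl_config = {
    "url": "https://example.com",
    "asp": True,           # ERR::ASP::SHIELD_PROTECTION_FAILED -> enable Anti-Scraping Protection
    "max_concurrency": 2,  # ERR::THROTTLE::MAX_CONCURRENT_REQUEST_EXCEEDED -> start with 2 or 3
    "timeout": 60,         # ERR::SCRAPE::UPSTREAM_TIMEOUT -> raise above the 30-second default
                           # (check whether your API expects seconds or milliseconds)
}
# For ERR::SCRAPE::BAD_UPSTREAM_RESPONSE there is no parameter fix:
# verify the site is online and its DNS resolves before retrying.
```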
For complete error definitions and solutions, see the Crawler API Errors page.
Crawler Taking Too Long
Crawler performance depends on several factors. Here's how to optimize speed:
Increase Concurrency
The max_concurrency parameter controls how many pages are crawled simultaneously.
Higher values = faster crawls, but stay within your account limits.
- Small sites (< 100 pages): max_concurrency=5
- Medium sites (100-1000 pages): max_concurrency=10
- Large sites (1000+ pages): max_concurrency=20+ (if your account allows)
Optimize Feature Usage
| Feature | Performance Impact | When to Disable |
|---|---|---|
| asp | 5× slower | Disable if the site doesn't have anti-bot protection |
| rendering_wait | Adds delay per page | Reduce or remove if pages load quickly |
| proxy_pool=public_residential_pool | Slower than datacenter | Use datacenter proxies when residential IPs aren't required |
Set Time Limits
Use max_duration to prevent indefinite crawls. The crawler stops gracefully when this limit is reached. For example, a crawl configured with a 1-hour duration limit and a 1,000-page limit stops as soon as either is hit.
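A minimal sketch of that configuration, assuming the same placeholder host, auth header, and POST /crawl shape as the earlier examples; max_duration and max_pages are the documented parameters, and max_duration is assumed to be expressed in seconds.

```python
import os
import requests

BASE_URL = "https://api.example.com"  # hypothetical host
HEADERS = {"Authorization": f"Bearer {os.environ['CRAWLER_API_KEY']}"}  # assumed auth scheme

# Stop after 1 hour or 1000 pages, whichever comes first.
crawl = requests.post(
    f"{BASE_URL}/crawl",
    headers=HEADERS,
    json={
        "url": "https://example.com",
        "max_duration": 3600,  # assumed to be seconds (1 hour)
        "max_pages": 1000,
    },
).json()
```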
Budget Control Issues
Controlling costs is critical when crawling large websites. Use these strategies to stay within budget:
Set Credit Limits
Use max_api_credit to automatically stop crawling when your budget is reached. For example, a crawl configured with a 1,000-credit budget and a 10,000-page limit stops as soon as either is hit.
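For example (same assumed POST /crawl body shape; max_api_credit and max_pages are the documented parameters):

```python
# Assumed POST /crawl body: stop after 1000 credits or 10000 pages, whichever comes first.
crawl_config = {
    "url": "https://example.com",
    "max_api_credit": 1000,
    "max_pages": 10000,
}
```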
Monitor Costs in Real-Time
Check the crawler status endpoint to see current credit usage. The response includes api_credit_used, which shows the total credits consumed so far.
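A polling sketch, assuming a GET /crawl/{uuid} status endpoint that returns JSON with the status and api_credit_used fields described in this guide; the host, auth header, and exact endpoint path are assumptions.

```python
import os
import time
import requests

BASE_URL = "https://api.example.com"  # hypothetical host
HEADERS = {"Authorization": f"Bearer {os.environ['CRAWLER_API_KEY']}"}  # assumed auth scheme
crawl_uuid = "YOUR-CRAWL-UUID"

# Poll the status endpoint and watch credit consumption while the crawl runs.
while True:
    status = requests.get(f"{BASE_URL}/crawl/{crawl_uuid}", headers=HEADERS).json()
    print(f"{status['status']}: {status['api_credit_used']} credits used so far")
    if status["status"] != "RUNNING":
        break
    time.sleep(30)  # poll every 30 seconds
```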
Reduce Per-Page Costs
- Disable ASP if not needed - saves significant credits per page
- Use datacenter proxies instead of residential when possible
- Enable caching for re-crawls to avoid re-scraping unchanged pages
- Use stricter path filtering to crawl only necessary pages
- Choose efficient formats - markdown and text are cheaper than full HTML
For detailed pricing information, see Crawler API Billing.
Debugging Tips
Check Crawler Status
The status endpoint provides real-time information about your crawler.
Key fields to monitor:
- status - RUNNING, COMPLETED, FAILED, or CANCELLED
- urls_discovered - Total URLs found by the crawler
- urls_crawled - Total URLs successfully crawled
- urls_failed - Total URLs that failed to crawl
- api_credit_used - Credits consumed so far
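A one-shot check of those fields, under the same assumptions about the status endpoint, host, and auth as the earlier sketch:

```python
import os
import requests

BASE_URL = "https://api.example.com"  # hypothetical host
HEADERS = {"Authorization": f"Bearer {os.environ['CRAWLER_API_KEY']}"}  # assumed auth scheme

# Snapshot of the key status fields listed above (assumed GET /crawl/{uuid} endpoint).
status = requests.get(f"{BASE_URL}/crawl/YOUR-CRAWL-UUID", headers=HEADERS).json()
for field in ("status", "urls_discovered", "urls_crawled", "urls_failed", "api_credit_used"):
    print(field, "=", status.get(field))
```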
Inspect Failed URLs
Get detailed error information for failed pages:
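One way to pull this, assuming the /crawl/{uuid}/urls endpoint returns a JSON list of per-URL records with status and error fields (the record shape, host, and auth are assumptions; only the endpoint path is named in this guide):

```python
import os
import requests

BASE_URL = "https://api.example.com"  # hypothetical host
HEADERS = {"Authorization": f"Bearer {os.environ['CRAWLER_API_KEY']}"}  # assumed auth scheme

# Fetch all crawled URLs and filter client-side for failures.
records = requests.get(f"{BASE_URL}/crawl/YOUR-CRAWL-UUID/urls", headers=HEADERS).json()
failed = [r for r in records if r.get("status") == "FAILED"]  # assumed per-URL record shape

for record in failed:
    # Print the URL alongside its error code (e.g. ERR::SCRAPE::UPSTREAM_TIMEOUT).
    print(record.get("url"), record.get("error"))
```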
Test with Small Crawls First
Before running a large crawl, run a quick test with max_pages=10 (sketched after this list) to:
- Verify path filters are working correctly
- Check that target pages are accessible
- Confirm content extraction is working
- Estimate costs for the full crawl
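Putting the checklist together, a hedged pre-flight sketch (same placeholder host, auth, and assumed request/response shapes as the earlier examples; max_pages, include_only_paths, and the status fields are documented above):

```python
import os
import time
import requests

BASE_URL = "https://api.example.com"  # hypothetical host
HEADERS = {"Authorization": f"Bearer {os.environ['CRAWLER_API_KEY']}"}  # assumed auth scheme

# Launch a 10-page test crawl with the same filters you plan to use for the full run.
crawl = requests.post(
    f"{BASE_URL}/crawl",
    headers=HEADERS,
    json={"url": "https://example.com", "max_pages": 10, "include_only_paths": ["/products/**"]},
).json()
crawl_uuid = crawl["uuid"]  # assumed response field

# Wait for the test crawl to finish, then review discovery, failures, and cost.
while True:
    status = requests.get(f"{BASE_URL}/crawl/{crawl_uuid}", headers=HEADERS).json()
    if status["status"] != "RUNNING":
        break
    time.sleep(10)

print("discovered:", status["urls_discovered"])
print("failed:", status["urls_failed"])
print("credits for 10 pages:", status["api_credit_used"])  # extrapolate to estimate full-crawl cost
```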
Getting Help
If you're still experiencing issues after trying these solutions:
- Check the monitoring dashboard for detailed logs
- Review the error codes reference for specific errors
- Contact support with your crawler UUID for personalized assistance