Getting Started with Scrapfly Crawler API
The Scrapfly Crawler API enables recursive website crawling at scale. Crawl results are stored in WARC and Parquet formats for large-scale scraping and can be visualized using the HAR artifact. Crawl entire websites with configurable limits, extract content in multiple formats simultaneously, and retrieve results as industry-standard artifacts.
Early Access Feature: The Crawler API is currently in early access. Features and API may evolve based on user feedback.
Quick Start: Choose Your Workflow
The Crawler API supports two integration patterns. Choose the approach that best fits your use case:
Polling Workflow
Schedule a crawl, poll the status endpoint to monitor progress, and retrieve results when complete. Best for batch processing, testing, and simple integrations.
1. Schedule Crawl
Create a crawler with a single API call. The API returns immediately with a crawler UUID:
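A minimal sketch in Python using requests, assuming the crawl is scheduled by POSTing to https://api.scrapfly.io/crawl; the payload field names (url, page_limit) and the uuid field in the response are assumptions inferred from the limits and endpoints described on this page, so check the API reference for the exact schema:

```python
import requests

API_KEY = "YOUR_API_KEY"

# Assumed scheduling endpoint and payload shape -- verify against the API reference.
resp = requests.post(
    "https://api.scrapfly.io/crawl",
    params={"key": API_KEY},
    json={
        "url": "https://web-scraping.dev/",  # seed URL to start crawling from
        "page_limit": 100,                   # stop after 100 pages (see stop reasons below)
    },
)
resp.raise_for_status()
crawler = resp.json()
crawler_uuid = crawler["uuid"]  # field name assumed; the response carries the crawler UUID
print(crawler_uuid, crawler.get("status"))
```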
The response includes the crawler UUID and its initial status.
2. Monitor Progress
Poll the status endpoint to track crawl progress:
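For example, a simple polling loop in Python against the status endpoint documented in the API specification below; the state.* field names come from that specification, and the 10-second interval is just a placeholder:

```python
import time
import requests

API_KEY = "YOUR_API_KEY"
crawler_uuid = "..."  # UUID returned when the crawl was scheduled

while True:
    resp = requests.get(
        f"https://api.scrapfly.io/crawl/{crawler_uuid}/status",
        params={"key": API_KEY},
    )
    resp.raise_for_status()
    status = resp.json()

    state = status.get("state", {})
    print(
        status["status"],
        f"discovered={state.get('urls_discovered')}",
        f"crawled={state.get('urls_crawled')}",
        f"failed={state.get('urls_failed')}",
    )

    if status.get("is_finished"):
        print("stop_reason:", status.get("stop_reason"))
        break
    time.sleep(10)  # poll at a modest interval
```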
The status response reports real-time progress.
Understanding the Status Response
| Field | Values | Description |
|---|---|---|
| status | PENDING, RUNNING, DONE, CANCELLED | Current crawler state - actively running or completed |
| is_finished | true / false | Whether the crawler has stopped (regardless of success/failure) |
| is_success | true - success, false - failed, null - still running | Outcome of the crawl (only set when finished) |
| stop_reason | See table below | Why the crawler stopped (only set when finished) |

Stop Reasons:
| Stop Reason | Description |
|---|---|
| no_more_urls | All discovered URLs have been crawled - normal completion |
| page_limit | Reached the configured page_limit |
| max_duration | Exceeded the max_duration time limit |
| max_api_credit | Reached the max_api_credit limit |
| seed_url_failed | The starting URL failed to crawl - no URLs visited |
| user_cancelled | User manually cancelled the crawl via API |
| crawler_error | Internal crawler error occurred |
| no_api_credit_left | Account ran out of API credits during crawl |
| storage_error | An error occurred while saving the content |
3. Retrieve Results
Once is_finished is true, download artifacts or query content. For comprehensive retrieval options, see Retrieving Crawler Results.
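A short sketch of one retrieval option, pulling markdown content for all crawled pages through the contents endpoint described later on this page; this assumes markdown was among the formats configured for the crawl, and since the exact response encoding is not shown here, the example only previews the raw body:

```python
import requests

API_KEY = "YOUR_API_KEY"
crawler_uuid = "..."  # UUID of a finished crawl

# Retrieve extracted content for every crawled page in one format.
resp = requests.get(
    f"https://api.scrapfly.io/crawl/{crawler_uuid}/contents",
    params={"key": API_KEY, "format": "markdown"},
)
resp.raise_for_status()
print(resp.headers.get("Content-Type"))
print(resp.text[:500])  # preview the returned content
```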
Real-Time Webhook Workflow
Schedule a crawl with webhook configuration, receive instant HTTP callbacks as events occur, and process results in real-time. Best for real-time data ingestion, streaming pipelines, and event-driven architectures.
Webhook Setup Required: Before using webhooks, you must configure a webhook in your dashboard with your endpoint URL and authentication. Then reference it by name in your API call.
1. Schedule Crawl with Webhook
Create a crawler and specify the webhook name configured in your dashboard:
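A sketch of the same scheduling call as in the polling workflow, this time attaching the dashboard-configured webhook; the webhook_name field (like the scheduling endpoint itself) is an assumed name, so confirm it against the API reference:

```python
import requests

API_KEY = "YOUR_API_KEY"

resp = requests.post(
    "https://api.scrapfly.io/crawl",
    params={"key": API_KEY},
    json={
        "url": "https://web-scraping.dev/",    # seed URL
        "page_limit": 100,
        "webhook_name": "my-crawler-webhook",  # name configured in the dashboard (assumed parameter name)
    },
)
resp.raise_for_status()
print(resp.json())
```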
The response includes the crawler UUID.
2. Receive Real-Time Webhooks
Your endpoint receives HTTP POST callbacks as events occur during the crawl.
Webhook Headers:
| Header | Purpose |
|---|---|
| X-Scrapfly-Crawl-Event-Name | Event type (e.g., crawler_url_visited) for fast routing |
| X-Scrapfly-Webhook-Job-Id | Crawler UUID for tracking |
| X-Scrapfly-Webhook-Signature | HMAC-SHA256 signature for verification |
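Before acting on a callback, the signature header can be verified. This sketch assumes the signature is a hex-encoded HMAC-SHA256 of the raw request body computed with your webhook secret; confirm the exact signing scheme in the webhook documentation:

```python
import hashlib
import hmac

WEBHOOK_SECRET = b"your-webhook-secret"  # assumed: the secret configured with the webhook

def is_valid_signature(raw_body: bytes, signature_header: str | None) -> bool:
    """Compare X-Scrapfly-Webhook-Signature against our own HMAC-SHA256 of the body."""
    expected = hmac.new(WEBHOOK_SECRET, raw_body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_header or "")
```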
3. Process Events in Real-Time
Handle webhook callbacks to stream data to your database, trigger pipelines, or process results:
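A minimal receiver sketch using Flask (any HTTP framework will do), routing on the X-Scrapfly-Crawl-Event-Name header and repeating the signature check from the previous sketch; the webhook payload fields are not documented on this page, so the handler simply hands the parsed JSON to your own pipeline:

```python
import hashlib
import hmac

from flask import Flask, abort, request

app = Flask(__name__)
WEBHOOK_SECRET = b"your-webhook-secret"  # assumed: the secret configured with the webhook

def is_valid_signature(raw_body: bytes, signature_header: str | None) -> bool:
    # Assumed scheme: hex-encoded HMAC-SHA256 of the raw body (see previous sketch).
    expected = hmac.new(WEBHOOK_SECRET, raw_body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_header or "")

@app.route("/scrapfly/crawler-webhook", methods=["POST"])
def crawler_webhook():
    raw_body = request.get_data()
    if not is_valid_signature(raw_body, request.headers.get("X-Scrapfly-Webhook-Signature")):
        abort(401)  # reject callbacks we cannot verify

    event = request.headers.get("X-Scrapfly-Crawl-Event-Name")
    crawler_uuid = request.headers.get("X-Scrapfly-Webhook-Job-Id")
    payload = request.get_json(silent=True) or {}

    if event == "crawler_url_visited":
        # e.g. push the visited URL / extracted content into a queue or database
        print(crawler_uuid, payload)
    # ...handle other event types as needed...

    return "", 204  # acknowledge quickly; do heavy processing asynchronously

if __name__ == "__main__":
    app.run(port=8000)
```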
For detailed webhook documentation and all available events, see Crawler Webhook Documentation.
Error Handling
Crawler API uses standard HTTP response codes and provides detailed error information:
| Code | Description |
|---|---|
| 200 - OK | Request successful |
| 201 - Created | Crawler job created successfully |
| 400 - Bad Request | Invalid parameters or configuration |
| 401 - Unauthorized | Invalid or missing API key |
| 404 - Not Found | Crawler job not found |
| 429 - Too Many Requests | Rate limit or concurrency limit exceeded |
| 500 - Server Error | Internal server error |

See the full error list for more details.
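As an illustration, a thin request wrapper that backs off on 429 responses and raises on other errors; the retry policy is an example, not a documented recommendation:

```python
import time
import requests

def get_with_backoff(url: str, params: dict, max_attempts: int = 5) -> requests.Response:
    """GET with a simple exponential backoff on 429 (rate/concurrency limit) responses."""
    for attempt in range(max_attempts):
        resp = requests.get(url, params=params)
        if resp.status_code != 429:
            resp.raise_for_status()  # raises on 4xx/5xx other than 429
            return resp
        time.sleep(2 ** attempt)  # back off before retrying
    raise RuntimeError("Still rate limited after retries")
```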
API Specification
Get Crawler Status
Retrieve the current status and progress of a crawler job. Use this endpoint to poll for updates while the crawler is running.
https://api.scrapfly.io/crawl/{crawler_uuid}/status
Response includes:
- status - Current status (PENDING, RUNNING, COMPLETED, FAILED, CANCELLED)
- state.urls_discovered - Total URLs discovered
- state.urls_crawled - URLs successfully crawled
- state.urls_pending - URLs waiting to be crawled
- state.urls_failed - URLs that failed to crawl
- state.api_credits_used - Total API credits consumed
Get Crawled URLs
Retrieve a list of all URLs discovered and crawled during the job, with metadata about each URL.
https://api.scrapfly.io/crawl/{crawler_uuid}/urls
Query Parameters:
- key - Your API key (required)
- status - Filter by URL status: visited, pending, failed
- page - Page number for pagination (default: 1)
- per_page - Results per page (default: 100, max: 1000)
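For example, fetching the first page of visited URLs with the documented query parameters (the JSON schema of the response is not shown on this page, so the sketch just prints it):

```python
import requests

API_KEY = "YOUR_API_KEY"
crawler_uuid = "..."

# Fetch the first page of visited URLs; increase `page` to walk through the rest.
resp = requests.get(
    f"https://api.scrapfly.io/crawl/{crawler_uuid}/urls",
    params={"key": API_KEY, "status": "visited", "page": 1, "per_page": 100},
)
resp.raise_for_status()
print(resp.json())  # response schema is not documented here; inspect and adapt
```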
Get Content
Retrieve extracted content from crawled pages in the format(s) specified in your crawl configuration.
Single URL or All Pages (GET)
https://api.scrapfly.io/crawl/{crawler_uuid}/contents
Query Parameters:
- key - Your API key (required)
- format - Content format to retrieve (must be one of the formats specified in the crawl config)
- url - Optional: retrieve content for a specific URL only
Batch Content Retrieval (POST)
https://api.scrapfly.io/crawl/{crawler_uuid}/contents/batch
Retrieve content for multiple specific URLs in a single request. More efficient than making individual GET requests for each URL. Maximum 100 URLs per request.
Query Parameters:
- key - Your API key (required)
- formats - Comma-separated list of formats (e.g., markdown,text,html)
Request Body:
- Content-Type: text/plain - Plain text with URLs separated by newlines
- Maximum 100 URLs per request
Response Format:
- Content-Type: multipart/related - Standard HTTP multipart format (RFC 2387)
- X-Scrapfly-Requested-URLs header - Number of URLs in the request
- X-Scrapfly-Found-URLs header - Number of URLs found in the crawl results
- Each part contains Content-Type and Content-Location headers identifying the format and URL
Efficient Streaming Format: The multipart format eliminates JSON escaping overhead, providing ~50% bandwidth savings for text content and constant memory usage during streaming. See the Results documentation for parsing examples in Python, JavaScript, and Go.
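A batch retrieval sketch that posts newline-separated URLs and parses the multipart/related response with Python's standard email parser, which is one straightforward way to handle RFC 2387 bodies; the example URLs and format list are placeholders:

```python
from email import message_from_bytes

import requests

API_KEY = "YOUR_API_KEY"
crawler_uuid = "..."

urls = [
    "https://web-scraping.dev/page-1",
    "https://web-scraping.dev/page-2",
]

resp = requests.post(
    f"https://api.scrapfly.io/crawl/{crawler_uuid}/contents/batch",
    params={"key": API_KEY, "formats": "markdown,text"},
    headers={"Content-Type": "text/plain"},
    data="\n".join(urls),
)
resp.raise_for_status()
print(resp.headers.get("X-Scrapfly-Requested-URLs"), resp.headers.get("X-Scrapfly-Found-URLs"))

# Re-attach the Content-Type header so the email parser sees the multipart boundary.
raw = f"Content-Type: {resp.headers['Content-Type']}\r\n\r\n".encode() + resp.content
message = message_from_bytes(raw)
for part in message.walk():
    if part.is_multipart():
        continue
    print(part.get("Content-Location"), part.get_content_type())
    body = part.get_payload(decode=True)  # bytes of one format for one URL
    print(body[:200] if body else b"")
```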
Download Artifact
Download industry-standard archive files containing all crawled data, including HTTP requests, responses, headers, and extracted content. Perfect for storing bulk crawl results offline or in object storage (S3, Google Cloud Storage).
https://api.scrapfly.io/crawl/{crawler_uuid}/artifact
Query Parameters:
- key - Your API key (required)
- type - Artifact type:
  - warc - Web ARChive format (gzip compressed, industry standard)
  - har - HTTP Archive format (JSON, browser-compatible)
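For example, streaming the WARC artifact to a local file so large crawls never have to fit in memory (the .warc.gz filename simply reflects that the archive is gzip compressed):

```python
import requests

API_KEY = "YOUR_API_KEY"
crawler_uuid = "..."

# Stream the gzip-compressed WARC archive to disk (or push it to S3/GCS instead).
with requests.get(
    f"https://api.scrapfly.io/crawl/{crawler_uuid}/artifact",
    params={"key": API_KEY, "type": "warc"},
    stream=True,
) as resp:
    resp.raise_for_status()
    with open(f"{crawler_uuid}.warc.gz", "wb") as f:
        for chunk in resp.iter_content(chunk_size=1 << 20):
            f.write(chunk)
```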
Billing
Crawler API billing is simple: the cost equals the sum of all Web Scraping API calls made during the crawl. Each page crawled consumes credits based on enabled features (browser rendering, anti-scraping protection, proxy type, etc.).
For detailed billing information, see Crawler API Billing.