Getting Started with Scrapfly Crawler API

The Scrapfly Crawler API enables recursive website crawling at scale. It leverages the WARC and Parquet formats for large-scale scraping, and results can be visualized easily through the HAR artifact. Crawl entire websites with configurable limits, extract content in multiple formats simultaneously, and retrieve results as industry-standard artifacts.

Quick Start: Choose Your Workflow

The Crawler API supports two integration patterns. Choose the approach that best fits your use case:

Polling Workflow

Schedule a crawl, poll the status endpoint to monitor progress, and retrieve results when complete. Best for batch processing, testing, and simple integrations.

SCHEDULE (POST /crawl, returns UUID) → POLL STATUS (GET /status until finished) → RETRIEVE (GET /artifact, download results)
  1. Schedule Crawl

    Create a crawler with a single API call. The API returns immediately with a crawler UUID:

    Response includes crawler UUID and status:
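
    For example, a minimal Python sketch covering both the request and the returned fields. It assumes the crawl is scheduled with POST https://api.scrapfly.io/crawl, that the API key is passed as a key query parameter (as on the retrieval endpoints below), and that url and page_limit are accepted in the JSON body - the payload schema and response keys shown are illustrative, not the exact contract:

        import requests

        API_KEY = "YOUR_API_KEY"

        # Schedule a crawl; the API returns immediately with a crawler UUID.
        # Payload fields are illustrative - see the API reference for the exact schema.
        resp = requests.post(
            "https://api.scrapfly.io/crawl",
            params={"key": API_KEY},
            json={
                "url": "https://web-scraping.dev/",  # example seed URL
                "page_limit": 100,                   # stop after 100 pages
            },
        )
        resp.raise_for_status()
        crawler = resp.json()

        # Illustrative response shape: {"uuid": "550e8400-...", "status": "PENDING", ...}
        crawler_uuid = crawler["uuid"]
        print(crawler_uuid, crawler.get("status"))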

  2. Monitor Progress

    Poll the status endpoint to track crawl progress:

    Status response shows real-time progress:
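
    Continuing the sketch above, a hedged polling loop against the status endpoint (the status, is_finished, stop_reason, and state.* fields are the ones documented below; the key query parameter is assumed to be accepted here as on the other endpoints):

        import time

        import requests

        API_KEY = "YOUR_API_KEY"
        crawler_uuid = "550e8400-e29b-41d4-a716-446655440000"  # returned by the schedule call

        while True:
            status = requests.get(
                f"https://api.scrapfly.io/crawl/{crawler_uuid}/status",
                params={"key": API_KEY},
            ).json()
            state = status.get("state", {})
            print(
                f"{status['status']}: {state.get('urls_crawled', 0)}/{state.get('urls_discovered', 0)} "
                f"URLs crawled, {state.get('urls_pending', 0)} pending"
            )
            if status.get("is_finished"):
                print("stop_reason:", status.get("stop_reason"), "is_success:", status.get("is_success"))
                break
            time.sleep(10)  # poll at a modest interval while the crawl runs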

    Understanding the Status Response

    • status - PENDING, RUNNING, DONE, or CANCELLED. Current crawler state - actively running or completed.
    • is_finished - true / false. Whether the crawler has stopped (regardless of success/failure).
    • is_success - true (success), false (failed), or null (still running). Outcome of the crawl (only set when finished).
    • stop_reason - See Stop Reasons below. Why the crawler stopped (only set when finished).

    Stop Reasons:

    • no_more_urls - All discovered URLs have been crawled (normal completion)
    • page_limit - Reached the configured page_limit
    • max_duration - Exceeded the max_duration time limit
    • max_api_credit - Reached the max_api_credit limit
    • seed_url_failed - The starting URL failed to crawl; no URLs were visited
    • user_cancelled - The crawl was manually cancelled via the API
    • crawler_error - An internal crawler error occurred
    • no_api_credit_left - The account ran out of API credits during the crawl
  3. Retrieve Results

    Once is_finished: true, download artifacts or query content:
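
    A short sketch of both options using the endpoints from the API specification below - downloading the WARC artifact and querying extracted content (the markdown format assumes it was one of the formats requested in the crawl configuration):

        import requests

        API_KEY = "YOUR_API_KEY"
        crawler_uuid = "550e8400-e29b-41d4-a716-446655440000"
        base = f"https://api.scrapfly.io/crawl/{crawler_uuid}"

        # Download the WARC artifact for offline or bulk storage.
        artifact = requests.get(f"{base}/artifact", params={"key": API_KEY, "type": "warc"})
        with open("crawl.warc.gz", "wb") as f:
            f.write(artifact.content)

        # Query extracted content in one of the configured formats.
        contents = requests.get(f"{base}/contents", params={"key": API_KEY, "format": "markdown"})
        print(contents.headers.get("Content-Type"))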

    For comprehensive retrieval options, see Retrieving Crawler Results.

Real-Time Webhook Workflow

Schedule a crawl with webhook configuration, receive instant HTTP callbacks as events occur, and process results in real-time. Best for real-time data ingestion, streaming pipelines, and event-driven architectures.

SCHEDULE + WEBHOOK (POST /crawl with webhook_name set) → RECEIVE WEBHOOKS (HTTP callbacks as events occur) → PROCESS REAL-TIME (stream to DB, process live)
  1. Schedule Crawl with Webhook

    Create a crawler and specify the webhook name configured in your dashboard:

    Response includes crawler UUID:
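
    The scheduling sketch from the polling workflow, with a webhook_name added to the configuration (the payload schema is illustrative; the name must match a webhook configured in your dashboard):

        import requests

        resp = requests.post(
            "https://api.scrapfly.io/crawl",
            params={"key": "YOUR_API_KEY"},
            json={
                "url": "https://web-scraping.dev/",
                "page_limit": 100,
                "webhook_name": "my-crawler-webhook",  # hypothetical name set up in the dashboard
            },
        )
        crawler_uuid = resp.json()["uuid"]  # the response includes the crawler UUID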

  2. Receive Real-Time Webhooks

    Your endpoint receives HTTP POST callbacks as events occur during the crawl:

    Webhook Headers:

    • X-Scrapfly-Crawl-Event-Name - Event type (e.g., crawler_url_visited) for fast routing
    • X-Scrapfly-Webhook-Job-Id - Crawler UUID for tracking
    • X-Scrapfly-Webhook-Signature - HMAC-SHA256 signature for verification
  3. Process Events in Real-Time

    Handle webhook callbacks to stream data to your database, trigger pipelines, or process results:
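
    A minimal Flask sketch of a receiver. It routes on the X-Scrapfly-Crawl-Event-Name header and verifies X-Scrapfly-Webhook-Signature as an HMAC-SHA256 over the raw request body; the exact signing details (what is signed and how the digest is encoded) are an assumption - confirm them in the webhook documentation:

        import hashlib
        import hmac

        from flask import Flask, abort, request

        app = Flask(__name__)
        WEBHOOK_SECRET = b"your-webhook-signing-secret"  # hypothetical secret from your dashboard

        @app.post("/scrapfly/webhook")
        def crawler_webhook():
            # Assumed scheme: hex-encoded HMAC-SHA256 of the raw body.
            expected = hmac.new(WEBHOOK_SECRET, request.get_data(), hashlib.sha256).hexdigest()
            received = request.headers.get("X-Scrapfly-Webhook-Signature", "")
            if not hmac.compare_digest(expected, received):
                abort(401)

            event = request.headers.get("X-Scrapfly-Crawl-Event-Name", "")

            if event == "crawler_url_visited":
                payload = request.get_json(silent=True) or {}
                # Stream the event payload to your database or pipeline here.
                print(payload)

            # Acknowledge quickly so the callback is not retried.
            return {"received": True}, 200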

For detailed webhook documentation and all available events, see Crawler Webhook Documentation.

Error Handling

Crawler API uses standard HTTP response codes and provides detailed error information:

• 200 OK - Request successful
• 201 Created - Crawler job created successfully
• 400 Bad Request - Invalid parameters or configuration
• 401 Unauthorized - Invalid or missing API key
• 404 Not Found - Crawler job not found
• 429 Too Many Requests - Rate limit or concurrency limit exceeded
• 500 Server Error - Internal server error
See the full error list for more details.

API Specification

Get Crawler Status

Retrieve the current status and progress of a crawler job. Use this endpoint to poll for updates while the crawler is running.

GET https://api.scrapfly.io/crawl/{uuid}/status

Response includes:

  • status - Current status (PENDING, RUNNING, COMPLETED, FAILED, CANCELLED)
  • state.urls_discovered - Total URLs discovered
  • state.urls_crawled - URLs successfully crawled
  • state.urls_pending - URLs waiting to be crawled
  • state.urls_failed - URLs that failed to crawl
  • state.api_credits_used - Total API credits consumed
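
For example, a quick progress check (a sketch; field names follow the list above, and the key parameter is assumed to be required here as on the other endpoints):

    import requests

    uuid = "550e8400-e29b-41d4-a716-446655440000"
    r = requests.get(f"https://api.scrapfly.io/crawl/{uuid}/status", params={"key": "YOUR_API_KEY"})
    state = r.json()["state"]
    print(f"{state['urls_crawled']}/{state['urls_discovered']} crawled, "
          f"{state['urls_pending']} pending, {state['urls_failed']} failed, "
          f"{state['api_credits_used']} credits used")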

Get Crawled URLs

Retrieve a list of all URLs discovered and crawled during the job, with metadata about each URL.

GET https://api.scrapfly.io/crawl/{uuid}/urls

Query Parameters:

  • key - Your API key (required)
  • status - Filter by URL status: visited, pending, failed
  • page - Page number for pagination (default: 1)
  • per_page - Results per page (default: 100, max: 1000)
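
A paginated retrieval sketch using the parameters above (the urls key in the response body is an assumption about the payload shape):

    import requests

    uuid = "550e8400-e29b-41d4-a716-446655440000"
    page, per_page = 1, 500
    while True:
        r = requests.get(
            f"https://api.scrapfly.io/crawl/{uuid}/urls",
            params={"key": "YOUR_API_KEY", "status": "visited", "page": page, "per_page": per_page},
        )
        batch = r.json().get("urls", [])  # assumed response key
        for entry in batch:
            print(entry)
        if len(batch) < per_page:         # a short page means we reached the end
            break
        page += 1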

Get Content

Retrieve extracted content from crawled pages in the format(s) specified in your crawl configuration.

Single URL or All Pages (GET)

GET https://api.scrapfly.io/crawl/{uuid}/contents

Query Parameters:

  • key - Your API key (required)
  • format - Content format to retrieve (must be one of the formats specified in crawl config)
  • url - Optional: Retrieve content for a specific URL only
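
For example, fetching markdown for a single crawled page (the format must be one requested in the crawl configuration; the URL is illustrative):

    import requests

    uuid = "550e8400-e29b-41d4-a716-446655440000"
    r = requests.get(
        f"https://api.scrapfly.io/crawl/{uuid}/contents",
        params={
            "key": "YOUR_API_KEY",
            "format": "markdown",
            "url": "https://web-scraping.dev/products",  # optional: restrict to one URL
        },
    )
    print(r.text)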

Batch Content Retrieval (POST)

POST https://api.scrapfly.io/crawl/{uuid}/contents/batch

Retrieve content for multiple specific URLs in a single request. More efficient than making individual GET requests for each URL. Maximum 100 URLs per request.

Query Parameters:

  • key - Your API key (required)
  • formats - Comma-separated list of formats (e.g., markdown,text,html)

Request Body:

  • Content-Type: text/plain - Plain text with URLs separated by newlines
  • Maximum 100 URLs per request

Response Format:

  • Content-Type: multipart/related - Standard HTTP multipart format (RFC 2387)
  • X-Scrapfly-Requested-URLs header - Number of URLs in the request
  • X-Scrapfly-Found-URLs header - Number of URLs found in the crawl results
  • Each part contains Content-Type and Content-Location headers identifying the format and URL
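
A sketch of a batch request and of splitting the multipart/related response with Python's standard email parser (the body is re-wrapped as a MIME document so the stdlib can split the RFC 2387 parts; the URLs are illustrative):

    from email import message_from_bytes
    from email.policy import default

    import requests

    uuid = "550e8400-e29b-41d4-a716-446655440000"
    urls = [
        "https://web-scraping.dev/products",
        "https://web-scraping.dev/reviews",
    ]
    r = requests.post(
        f"https://api.scrapfly.io/crawl/{uuid}/contents/batch",
        params={"key": "YOUR_API_KEY", "formats": "markdown,html"},
        headers={"Content-Type": "text/plain"},
        data="\n".join(urls),  # plain text body, one URL per line (max 100)
    )
    print(r.headers.get("X-Scrapfly-Requested-URLs"), r.headers.get("X-Scrapfly-Found-URLs"))

    # Prepend the Content-Type header so the email parser can split the parts.
    mime = f"Content-Type: {r.headers['Content-Type']}\r\n\r\n".encode() + r.content
    for part in message_from_bytes(mime, policy=default).iter_parts():
        print(part["Content-Location"], part.get_content_type())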

Download Artifact

Download industry-standard archive files containing all crawled data, including HTTP requests, responses, headers, and extracted content. Perfect for storing bulk crawl results offline or in object storage (S3, Google Cloud Storage).

GET https://api.scrapfly.io/crawl/{uuid}/artifact

Query Parameters:

  • key - Your API key (required)
  • type - Artifact type:
    • warc - Web ARChive format (gzip compressed, industry standard)
    • har - HTTP Archive format (JSON, browser-compatible)
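
For example, streaming the WARC artifact to disk before uploading it to object storage (the filename is arbitrary):

    import requests

    uuid = "550e8400-e29b-41d4-a716-446655440000"
    with requests.get(
        f"https://api.scrapfly.io/crawl/{uuid}/artifact",
        params={"key": "YOUR_API_KEY", "type": "warc"},
        stream=True,
    ) as r:
        r.raise_for_status()
        with open(f"crawl-{uuid}.warc.gz", "wb") as f:
            for chunk in r.iter_content(chunk_size=1 << 20):  # 1 MiB chunks
                f.write(chunk)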

Billing

Crawler API billing is simple: the cost equals the sum of all Web Scraping API calls made during the crawl. Each page crawled consumes credits based on enabled features (browser rendering, anti-scraping protection, proxy type, etc.).

For detailed billing information, see Crawler API Billing.

Summary