Webhook

Scrapfly's webhook feature is ideal for managing crawler jobs asynchronously. When a webhook is specified through the webhook_name parameter, Scrapfly notifies your HTTP endpoint about crawl events in real time, eliminating the need for polling.

To start using webhooks, first create one using the webhook web interface.

webhook management page

The webhook will be called for each event you subscribe to during the crawl lifecycle. For reconciliation, you will receive the crawler_uuid and webhook_uuid in the response headers.

webhook status report on monitoring log page

Webhook Queue Size

The webhook queue size is the maximum number of webhook notifications that can be queued at once. Once a crawler event is processed and your application is notified, the corresponding entry is removed from the queue. This lets you schedule more crawler jobs than your subscription's concurrency limit allows; the scheduler handles the overflow and ensures your concurrency limit is respected.

Plan         Price        Webhook Queue Size
FREE         $0.00/mo     500
DISCOVERY    $30.00/mo    500
PRO          $100.00/mo   2,000
STARTUP      $250.00/mo   5,000
ENTERPRISE   $500.00/mo   10,000

You can also see these limits in your dashboard.

Scope

Webhooks are scoped per Scrapfly project and environment. Make sure to create a webhook for each of your projects and environments (test/live).

Usage

Webhooks can be used for multiple purposes. In the context of the Crawler API, to confirm that you received a crawler event, check the X-Scrapfly-Webhook-Resource-Type header and verify that its value is crawler.

To enable webhook callbacks, specify the webhook_name parameter in your crawler requests and optionally provide a list of webhook_events you want to be notified about. Scrapfly will then call your webhook endpoint as crawl events occur.

Note that your webhook endpoint must respond with a 2xx status code for the delivery to be considered successful. 3xx redirect responses are followed, while 4xx and 5xx responses are treated as failures and retried according to the retry policy.

The examples below assume you have a webhook named my-crawler-webhook registered. You can create webhooks via the web dashboard.
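
For illustration, here is a minimal Python sketch of starting a crawl with webhook callbacks enabled. The endpoint path and parameter serialization shown are assumptions for illustration only; consult the Crawler API reference for the authoritative request format.

    import requests  # pip install requests

    # Hypothetical crawler endpoint and parameter names -- verify against the
    # Crawler API reference before use.
    response = requests.post(
        "https://api.scrapfly.io/crawl",  # assumed endpoint, for illustration
        params={
            "key": "YOUR_API_KEY",
            "url": "https://example.com",
            "webhook_name": "my-crawler-webhook",
            # optional: subscribe only to the events you care about
            "webhook_events": "crawler_started,crawler_finished,crawler_stopped",
        },
    )
    response.raise_for_status()

    # The response headers carry the crawler_uuid and webhook_uuid for
    # reconciliation with later webhook deliveries.
    print(dict(response.headers))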

Webhook Events & Payloads

The Crawler API supports multiple webhook events that notify you about different stages of the crawl lifecycle. Each event sends a JSON payload with the crawler state and event-specific data.

Default Subscription

If you don't specify webhook_events, you'll receive: crawler_started, crawler_stopped, crawler_cancelled, and crawler_finished.

HTTP Headers

Every webhook request includes these HTTP headers for easy routing and verification:

Header                            Purpose                                                 Example Value
X-Scrapfly-Crawl-Event-Name       Fast routing - route events without parsing JSON       crawler_started
X-Scrapfly-Webhook-Resource-Type  Resource type (always crawler for crawler webhooks)    crawler
X-Scrapfly-Webhook-Job-Id         Crawler UUID for tracking and reconciliation           550e8400-e29b...
X-Scrapfly-Webhook-Signature      HMAC-SHA256 signature for verification                 a3f2b1c...
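
To make the header-based routing concrete, the following is a minimal receiver sketch using Flask (the framework, route path, and port are illustrative assumptions, not part of the Scrapfly API). It checks the resource type, routes on the event name header, and always answers with a 2xx status:

    from flask import Flask, request  # pip install flask

    app = Flask(__name__)

    @app.route("/scrapfly-webhook", methods=["POST"])
    def scrapfly_webhook():
        # Only handle crawler webhooks; ignore other resource types.
        if request.headers.get("X-Scrapfly-Webhook-Resource-Type") != "crawler":
            return "", 200

        # Route on the event name header without parsing the JSON body,
        # which matters for high-frequency events such as crawler_url_visited.
        event = request.headers.get("X-Scrapfly-Crawl-Event-Name")
        job_id = request.headers.get("X-Scrapfly-Webhook-Job-Id")

        if event == "crawler_url_visited":
            pass  # e.g. enqueue request.get_data() for asynchronous processing
        elif event in ("crawler_finished", "crawler_stopped", "crawler_cancelled"):
            payload = request.get_json()  # parse the body only when needed
            # ... update your tracking system using job_id and payload ...

        # Respond with 2xx so the delivery is considered successful.
        return "", 200

    if __name__ == "__main__":
        app.run(port=8000)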

Event Types & Examples

Each event type is described below with its trigger, typical use cases, frequency, and key payload fields:

crawler_started

When: Crawler execution begins

Use case: Track when crawls start, log crawler UUID, initialize tracking systems

Frequency: Once per crawl

Key Fields: crawler_uuid, seed_url, links.status
crawler_url_visited

When: Each URL is successfully crawled

Use case: Real-time progress tracking, streaming results, monitoring performance

Frequency: High - Fires for every successfully crawled URL (can be thousands per crawl)

Performance Warning: Your endpoint must handle high throughput. Use the X-Scrapfly-Crawl-Event-Name header for fast routing without parsing the JSON body.
crawler_url_failed

When: A URL fails to crawl (network error, timeout, block, etc.)

Use case: Error monitoring, retry logic, debugging failed scrapes

Frequency: Per failed URL

Debugging Features:
  • error - Error code for classification
  • links.log - Direct link to scrape log for debugging
  • scrape_config - Complete configuration to replay the scrape
  • links.scrape - Ready-to-use retry URL with same configuration
crawler_url_skipped

When: URLs are skipped (already visited, filtered, depth limit, etc.)

Use case: Monitor filtering effectiveness, track duplicate discovery

Frequency: Per batch of skipped URLs

Key Fields: urls contains a map of each skipped URL to its skip reason
crawler_url_discovered

When: New URLs are discovered from crawled pages

Use case: Track crawl expansion, monitor discovery patterns, sitemap building

Frequency: High - Fires for each batch of discovered URLs

Key Fields: origin (source URL where links were found), discovered_urls (list of new URLs)
crawler_finished

When: Crawler completes successfully (at least one URL visited)

Use case: Trigger post-processing, download results, send completion notifications

Frequency: Once per successful crawl

Success Indicators: state.urls_visited > 0 confirms at least one URL was crawled. Check state.stop_reason to understand why the crawler completed (e.g., no_more_urls, page_limit).
crawler_stopped

When: Crawler stops due to failure (seed URL failed, errors, no URLs visited)

Use case: Error alerting, failure logging, retry automation

Frequency: Once per failed crawl

Failure Reasons: Check state.stop_reason for the exact cause:
  • seed_url_failed - Initial URL couldn't be crawled
  • crawler_error - Internal crawler error occurred
  • no_api_credit_left - Account ran out of API credits mid-crawl
  • max_api_credit - Configured credit limit reached
crawler_cancelled

When: User manually cancels the crawl via API or dashboard

Use case: Update tracking systems, release resources, log cancellations

Frequency: Once per user cancellation

Cancellation State: state.stop_reason will be user_cancelled. Partial crawl results are available via the status endpoint and can be retrieved normally.
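
Building on the event descriptions above, here is a hedged sketch of how a handler might branch on the terminal events using the documented state.stop_reason and state.urls_visited fields (the surrounding payload shape is assumed for illustration):

    def handle_terminal_event(event: str, payload: dict) -> None:
        """Dispatch crawler_finished / crawler_stopped / crawler_cancelled events."""
        state = payload.get("state", {})  # payload shape assumed for illustration
        stop_reason = state.get("stop_reason")
        urls_visited = state.get("urls_visited", 0)

        if event == "crawler_finished" and urls_visited > 0:
            # e.g. trigger post-processing or fetch the crawl results
            print(f"crawl finished ({urls_visited} URLs), reason: {stop_reason}")
        elif event == "crawler_stopped":
            if stop_reason == "seed_url_failed":
                print("seed URL could not be crawled, check the scrape log")
            elif stop_reason in ("no_api_credit_left", "max_api_credit"):
                print("crawl stopped due to credit limits")
            else:
                print(f"crawl stopped: {stop_reason}")
        elif event == "crawler_cancelled":
            # stop_reason will be user_cancelled; partial results stay retrievable
            print("crawl cancelled by user")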

Development

Tools that expose a local endpoint through a public URL (such as ngrok) or that capture and inspect incoming requests (such as webhook.site) are useful for local webhook development.

Security

Webhooks are signed using HMAC (Hash-based Message Authentication Code) with the SHA-256 algorithm to ensure the integrity of the webhook content and verify its authenticity. This mechanism helps prevent tampering and ensures that webhook payloads are from trusted sources.

HMAC Overview

HMAC is a cryptographic technique that combines a secret key with a hash function (in this case, SHA-256) to produce a fixed-size hash value known as the HMAC digest. This digest is unique to both the original message and the secret key, providing a secure way to verify the integrity and authenticity of the message.

Signature in HTTP Header

When Scrapfly sends a webhook notification, it includes an HMAC signature in the X-Scrapfly-Webhook-Signature HTTP header. This signature is generated by applying the HMAC-SHA256 algorithm to the entire request body using your webhook's secret key (configured in the webhook settings).

Verification Example

To verify the authenticity of a webhook notification, compute the HMAC-SHA256 signature of the request body using your secret key and compare it with the signature provided in the X-Scrapfly-Webhook-Signature header:
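
For example, here is a minimal verification sketch in Python (this assumes the signature is a hex-encoded HMAC-SHA256 digest; confirm the exact encoding and your secret key in your webhook settings):

    import hashlib
    import hmac

    def verify_scrapfly_signature(body: bytes, signature: str, secret: str) -> bool:
        """Return True if the X-Scrapfly-Webhook-Signature header matches the body."""
        expected = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
        # compare_digest performs a constant-time comparison
        return hmac.compare_digest(expected, signature)

    # Example usage inside a webhook handler (names are illustrative):
    # raw_body = request.get_data()
    # signature = request.headers.get("X-Scrapfly-Webhook-Signature", "")
    # if not verify_scrapfly_signature(raw_body, signature, WEBHOOK_SECRET):
    #     return "", 401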

Security Best Practices
  • Always verify the HMAC signature before processing webhook payloads
  • Keep your webhook secret key confidential and rotate it periodically
  • Use HTTPS endpoints for webhook URLs to encrypt data in transit
  • Implement rate limiting on your webhook endpoint to handle high-frequency events
