The Scrapfly Crawler API enables recursive website crawling at scale. It leverages WARC and Parquet formats for large-scale scraping, and results can be easily visualized using HAR artifacts.
Crawl entire websites with configurable limits, extract content in multiple formats simultaneously,
and retrieve results as industry-standard artifacts.
Early Access Feature
The Crawler API is currently in early access. Features and API may evolve based on user feedback.
Quick Start: Choose Your Workflow
The Crawler API supports two integration patterns. Choose the approach that best fits your use case:
Polling Workflow
Schedule a crawl, poll the status endpoint to monitor progress, and retrieve results when complete.
Best for batch processing, testing, and simple integrations.
Schedule Crawl
Create a crawler with a single API call. The API returns immediately with a crawler UUID:
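(Sketch: the POST endpoint path and the "limit" body field below are illustrative assumptions; only the key query parameter and the starting url are documented on this page.)
# Schedule a basic crawl (sketch: POST path and "limit" field name are assumptions)
curl -X POST "https://api.scrapfly.io/crawl?key=YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/blog", "limit": 100}'
The response contains the crawler UUID used by the status, URLs, contents, and artifact endpoints described below.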
Webhook Workflow
Schedule a crawl with webhook configuration, receive instant HTTP callbacks as events occur, and process results in real-time.
Best for real-time data ingestion, streaming pipelines, and event-driven architectures.
Webhook Setup Required
Before using webhooks, you must
configure a webhook
in your dashboard
with your endpoint URL and authentication. Then reference it by name in your API call.
Schedule Crawl with Webhook
Create a crawler and specify the webhook name configured in your dashboard:
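(Sketch: the POST path and the "webhook_name"/"webhook_events" field names are assumptions; the webhook name must match one configured in your dashboard, and the event names are listed later on this page.)
# Schedule a crawl that reports progress to a pre-configured webhook
# (sketch: field names are assumed; event names come from the webhook events list below)
curl -X POST "https://api.scrapfly.io/crawl?key=YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/blog", "webhook_name": "my-crawler-webhook", "webhook_events": ["crawler_started", "crawler_finished"]}'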
Create a new crawler job with custom configuration. The API returns immediately with a crawler UUID
that you can use to monitor progress and retrieve results.
Query Parameters (Authentication)
These parameters must be passed in the
URL query string, not in the request body.
Your Scrapfly API key for authentication. You can find your key on your
dashboard.
Query Parameter Only: Must be passed as a URL query parameter (e.g., ?key=YOUR_KEY), never in the POST request body. This applies to all Crawler API endpoints.
?key=16eae084cff64841be193a95fc8fa67d
Append to endpoint URL
Request Body (Crawler Configuration)
These parameters configure the crawler behavior and must be sent in the
JSON request body.
Starting URL for the crawl. Must be a valid HTTP/HTTPS URL. The crawler will begin discovering
and crawling linked pages from this seed URL.
Must be URL encoded
Maximum number of pages to crawl. Must be non-negative. Set to
0
for unlimited
(subject to subscription limits). Use this to limit crawl scope and control costs.
Maximum link depth from starting URL. Must be non-negative. Depth 0 is the starting URL,
depth 1 is links from the starting page, etc. Set to
0
for unlimited depth.
Use lower values for focused crawls, higher values for comprehensive site crawling.
Only crawl URLs matching these path patterns. Supports wildcards (*).
Maximum 100 paths.
Mutually exclusive with
exclude_paths.
Useful for focusing on specific sections like blogs or product pages.
include_only_paths=["/blog/*"]
include_only_paths=["/blog/*", "/articles/*"]
include_only_paths=["/products/*/reviews"]
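For example, a request restricted to blog and article pages could look like this sketch (the POST path is an assumption; include_only_paths is the parameter documented above):
# Crawl only blog and article sections of the site (sketch; POST path assumed)
curl -X POST "https://api.scrapfly.io/crawl?key=YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "include_only_paths": ["/blog/*", "/articles/*"]}'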
Advanced Crawl Configuration (domain restrictions, delays, headers, sitemaps, ...)
By default, the crawler only follows links within the same base path as the starting URL.
For example, starting from
https://example.com/blog
restricts crawling to
/blog/*. Enable this to allow crawling any path on the same domain.
Whitelist of external domains to crawl when follow_external_links=true.
Maximum 250 domains.
Supports fnmatch-style wildcards (*) for flexible pattern matching.
Pattern Matching Examples:
*.example.com - Matches all subdomains of example.com
specific.org - Exact domain match only
blog.*.com - Matches blog.anything.com
Scraping vs. Crawling External Pages
When a page contains a link to an allowed external domain:
The crawler WILL: Scrape the external page (extract content, consume credits)
The crawler WILL NOT: Follow links found on that external page
Example: Crawling example.com with allowed_external_domains=["*.wikipedia.org"]
will scrape Wikipedia pages linked from example.com, but will NOT crawl additional links discovered on Wikipedia.
allowed_external_domains=["cdn.example.com"]
Only follow links to cdn.example.com
allowed_external_domains=["*.example.com"]
Follow all subdomains of example.com
allowed_external_domains=["blog.example.com", "docs.example.com"]
Follow multiple specific domains
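Combining both options, a crawl that also scrapes pages linked on example.com subdomains could look like the following sketch (the POST path is an assumption; follow_external_links and allowed_external_domains are the documented parameters):
# Scrape linked pages on any example.com subdomain (sketch; POST path assumed)
curl -X POST "https://api.scrapfly.io/crawl?key=YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "follow_external_links": true, "allowed_external_domains": ["*.example.com"]}'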
Wait time in milliseconds after page load before extraction. Set to
0
to disable
browser rendering (HTTP-only mode). Range:
0 or 1-25000ms (max 25 seconds).
Only applies when browser rendering is enabled. Use this for pages that load content dynamically.
Maximum number of concurrent scrape requests. Controls crawl speed and resource usage.
Limited by your account's concurrency limit. Set to
0
to use account/project default.
Add a delay between requests in milliseconds. Range:
0-15000ms (max 15 seconds).
Use this to be polite to target servers and avoid overwhelming them with requests.
Value must be provided as a string.
Custom User-Agent string to use for all requests. If not specified, Scrapfly will use appropriate
User-Agent headers automatically. This is a shorthand for setting the
User-Agent
header.
Important: ASP Compatibility
When
asp=true
(Anti-Scraping Protection is enabled), this parameter is
ignored.
ASP manages User-Agent headers automatically for optimal bypass performance.
Choose one approach:
Use ASP
(asp=true) - Automatic User-Agent management with advanced bypass
Use custom User-Agent
(user_agent=...) - Manual control, ASP disabled
Cache time-to-live in seconds. Range:
0-604800 seconds (max 7 days).
Only applies when
cache=true. Set to
0
to use default TTL.
After this duration, cached pages will be considered stale and re-crawled.
Ignore
rel="nofollow"
attributes on links. By default, links with
nofollow
are not crawled. Enable this to crawl all links regardless of the nofollow attribute.
List of content formats to extract from each crawled page. You can specify multiple formats
to extract different representations simultaneously. Extracted content is available via the
/contents
endpoint or in downloaded artifacts.
Available formats:
html
- Raw HTML content
clean_html
- HTML with boilerplate removed
markdown
- Markdown format (ideal for LLM training)
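For instance, extracting both raw HTML and markdown for every crawled page could look like this sketch (the body field name "formats" mirrors the formats query parameter of the batch contents endpoint and is an assumption here):
# Extract two content representations per page (sketch; "formats" field name assumed)
curl -X POST "https://api.scrapfly.io/crawl?key=YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "formats": ["html", "markdown"]}'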
Maximum crawl duration in seconds. Range:
15-10800 seconds (15s to 3 hours).
The crawler will stop after this time limit is reached, even if there are more pages to crawl.
Use this to prevent long-running crawls.
Maximum API credits to spend on this crawl. Must be non-negative. The crawler will stop when
this credit limit is reached. Set to
0
for no credit limit. Useful for controlling costs on large crawls.
Extraction rules to extract structured data from each page.
Maximum 100 rules.
Each rule maps a URL pattern (max 1000 chars) to an extraction config with type and value.
Supported types:
prompt
- AI extraction prompt (max 10000 chars)
model
- Pre-defined extraction model
template
- Extraction template (name or JSON)
Comprehensive Guide: See the Extraction Rules documentation for detailed examples, pattern matching rules, and best practices.
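For illustration, a rules object mapping URL patterns to extraction configs could look like the sketch below (the field name "extraction_rules" and the exact JSON shape are assumptions; the pattern-to-type/value mapping is described above):
# Apply an AI prompt to product pages and a template to blog pages
# (sketch: "extraction_rules" field name and JSON shape are assumptions)
curl -X POST "https://api.scrapfly.io/crawl?key=YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "extraction_rules": {
      "/products/*": {"type": "prompt", "value": "Extract product name, price, and availability"},
      "/blog/*": {"type": "template", "value": "article"}
    }
  }'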
List of webhook events to subscribe to. If webhook name is provided but events list is empty,
defaults to basic events:
crawler_started,
crawler_stopped,
crawler_cancelled,
crawler_finished.
Select the proxy pool. A proxy pool is a network of proxies grouped by quality range and network type.
The price varies based on the pool used. See
proxy dashboard
for available pools.
Proxy country location in ISO 3166-1 alpha-2 (2 letters) country codes.
The available countries are listed on your
proxy dashboard.
Supports exclusions (minus prefix) and weighted distribution (colon suffix with weight 0-255).
Anti Scraping Protection
- Enable advanced anti-bot bypass features including browser rendering, fingerprinting, and automatic
retry with upgraded configurations. When enabled, the crawler will automatically use headless browsers
and adapt to bypass protections.
Note
When ASP is enabled, any custom
user_agent
parameter is ignored.
ASP manages User-Agent headers automatically for optimal bypass performance.
asp=true
asp=false
Get Crawler Status
Retrieve the current status and progress of a crawler job. Use this endpoint to poll for updates
while the crawler is running.
status
- Current status (PENDING, RUNNING, COMPLETED, FAILED, CANCELLED)
state.urls_discovered
- Total URLs discovered
state.urls_crawled
- URLs successfully crawled
state.urls_pending
- URLs waiting to be crawled
state.urls_failed
- URLs that failed to crawl
state.api_credits_used
- Total API credits consumed
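A simple polling loop could look like the sketch below (the GET status path /crawl/{uuid} is an assumption inferred from the other endpoints on this page, and the response is assumed to expose status at the top level; jq is used for JSON parsing):
# Poll crawler status every 10 seconds until it is no longer PENDING or RUNNING
# (sketch: status endpoint path and top-level "status" field are assumptions)
while true; do
  STATUS=$(curl -s "https://api.scrapfly.io/crawl/{uuid}?key=YOUR_KEY" | jq -r '.status')
  echo "crawler status: $STATUS"
  [ "$STATUS" != "PENDING" ] && [ "$STATUS" != "RUNNING" ] && break
  sleep 10
done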
Get Crawled URLs
Retrieve a list of all URLs discovered and crawled during the job, with metadata about each URL.
GET https://api.scrapfly.io/crawl/{uuid}/urls
# Get all visited URLs
curl "https://api.scrapfly.io/crawl/{uuid}/urls?key=YOUR_KEY&status=visited"
# Get failed URLs with pagination
curl "https://api.scrapfly.io/crawl/{uuid}/urls?key=YOUR_KEY&status=failed&page=1&per_page=100"
Query Parameters:
key
- Your API key (required)
status
- Filter by URL status:
visited,
pending,
failed
page
- Page number for pagination (default: 1)
per_page
- Results per page (default: 100, max: 1000)
Get Content
Retrieve extracted content from crawled pages in the format(s) specified in your crawl configuration.
Single URL or All Pages (GET)
GET https://api.scrapfly.io/crawl/{uuid}/contents
# Get all content in markdown format
curl "https://api.scrapfly.io/crawl/{uuid}/contents?key=YOUR_KEY&format=markdown"
# Get content for a specific URL
curl "https://api.scrapfly.io/crawl/{uuid}/contents?key=YOUR_KEY&format=html&url=https://example.com/page"
Query Parameters:
key
- Your API key (required)
format
- Content format to retrieve (must be one of the formats specified in crawl config)
url
- Optional: Retrieve content for a specific URL only
Retrieve content for multiple specific URLs in a single request. More efficient than
making individual GET requests for each URL. Maximum 100 URLs per request.
# Batch retrieve content for multiple URLs
curl -X POST "https://api.scrapfly.io/crawl/{uuid}/contents/batch?key=YOUR_KEY&formats=markdown,text" \
-H "Content-Type: text/plain" \
-d "https://example.com/page1
https://example.com/page2
https://example.com/page3"
Query Parameters:
key
- Your API key (required)
formats
- Comma-separated list of formats (e.g., markdown,text,html)
Request Body:
Content-Type: text/plain
- Plain text with URLs separated by newlines
Maximum 100 URLs per request
Response Format:
Content-Type: multipart/related
- Standard HTTP multipart format (RFC 2387)
X-Scrapfly-Requested-URLs
header - Number of URLs in the request
X-Scrapfly-Found-URLs
header - Number of URLs found in the crawl results
Each part contains Content-Type and Content-Location headers identifying the format and URL
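To see these headers directly, you can have curl print the response headers while saving the multipart body to a file (the headers are documented above; the output file name is arbitrary):
# Dump response headers (including X-Scrapfly-Requested-URLs and X-Scrapfly-Found-URLs)
# while saving the multipart body for later parsing
curl -s -D - -o contents.multipart \
  -X POST "https://api.scrapfly.io/crawl/{uuid}/contents/batch?key=YOUR_KEY&formats=markdown" \
  -H "Content-Type: text/plain" \
  -d "https://example.com/page1
https://example.com/page2"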
Efficient Streaming Format
The multipart format eliminates JSON escaping overhead, providing ~50% bandwidth savings for text content
and constant memory usage during streaming. See the Results documentation
for parsing examples in Python, JavaScript, and Go.
Download Artifact
Download industry-standard archive files containing all crawled data, including HTTP requests,
responses, headers, and extracted content. Perfect for storing bulk crawl results offline
or in object storage (S3, Google Cloud Storage).
GET https://api.scrapfly.io/crawl/{uuid}/artifact
# Download WARC artifact (gzip compressed, recommended for large crawls)
curl "https://api.scrapfly.io/crawl/{uuid}/artifact?key=YOUR_KEY&type=warc" -o crawl.warc.gz
# Download HAR artifact (JSON format)
curl "https://api.scrapfly.io/crawl/{uuid}/artifact?key=YOUR_KEY&type=har" -o crawl.har
Query Parameters:
key
- Your API key (required)
type
- Artifact type:
warc
- Web ARChive format (gzip compressed, industry standard)
har
- HTTP Archive format (JSON, browser-compatible)
Billing
Crawler API billing is simple:
the cost equals the sum of all Web Scraping API calls
made during the crawl.
Each page crawled consumes credits based on enabled features (browser rendering, anti-scraping protection, proxy type, etc.).
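For a rough illustration: if a crawl visits 500 pages and each page averages 30 credits (a hypothetical figure; the actual per-page cost depends on the features enabled), the crawl consumes roughly 500 × 30 = 15,000 credits. The maximum-credits limit described above can be used to cap this spend.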