Retrieving Crawler Results
Once your crawler has completed, you have multiple options for retrieving the results. Choose the method that best fits your use case: individual URLs, content queries, or complete artifacts.
Results become available in near-realtime as pages are crawled. You can query content immediately while the crawler is `RUNNING`. Artifacts (WARC/HAR) are only finalized when `is_finished: true`. Poll the `/crawl/{uuid}/status` endpoint to monitor progress and check `is_success` to determine the outcome.
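For example, a minimal polling loop in Python (a sketch; the `/crawl/{uuid}/status` path and `key` parameter are from this guide, while the base URL is an assumption to adjust for your environment):

```python
import time
import requests

API_BASE = "https://api.scrapfly.io"  # assumed base URL; adjust for your environment
API_KEY = "YOUR_API_KEY"
CRAWL_UUID = "YOUR_CRAWL_UUID"

def wait_for_crawl(uuid: str, interval: float = 10.0) -> dict:
    """Poll /crawl/{uuid}/status until is_finished is true, then return the status payload."""
    while True:
        resp = requests.get(f"{API_BASE}/crawl/{uuid}/status", params={"key": API_KEY})
        resp.raise_for_status()
        status = resp.json()
        if status.get("is_finished"):
            return status
        time.sleep(interval)

status = wait_for_crawl(CRAWL_UUID)
print("is_success:", status.get("is_success"))
```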
Choosing the Right Method
Select the retrieval method that best matches your use case. Consider your crawl size, processing needs, and infrastructure.
| Method | Best for | Scale |
|---|---|---|
| List URLs | URL discovery & mapping, failed URL analysis, sitemap generation, crawl auditing | Any size |
| Query Specific | Selective retrieval, real-time processing, on-demand access, API integration | Any size (per-page) |
| Get All Content | Small crawls, testing & development, quick prototyping, simple integration | Best for <100 pages |
| Download Artifacts | Large crawls (100s-1000s+), long-term archival, offline processing, data pipelines | Unlimited |
Retrieval Methods
The Crawler API provides four complementary methods for accessing your crawled data. Choose the method that best fits your use case:
List Crawled URLs
Get a comprehensive list of all URLs discovered and crawled during the job, with detailed metadata for each URL including status codes, depth, and timestamps.
You can filter the list by status (for example, only failed URLs); the response includes metadata for each URL such as its status code, crawl depth, and timestamps:
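A sketch of listing crawled URLs in Python; the `/crawl/{uuid}/urls` path, the `status` filter parameter, and the response field names are illustrative assumptions, not confirmed endpoint details:

```python
import requests

API_BASE = "https://api.scrapfly.io"  # assumed base URL
API_KEY = "YOUR_API_KEY"
CRAWL_UUID = "YOUR_CRAWL_UUID"

# Hypothetical URL-listing endpoint and filter parameter; check the API specification
# for the exact path and names.
resp = requests.get(
    f"{API_BASE}/crawl/{CRAWL_UUID}/urls",
    params={"key": API_KEY, "status": "failed"},  # e.g. audit only the failed URLs
)
resp.raise_for_status()

for entry in resp.json().get("urls", []):
    # Field names are illustrative; the response carries per-URL metadata such as
    # status code, crawl depth, and timestamps.
    print(entry.get("url"), entry.get("status_code"), entry.get("depth"))
```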
Use case: Audit which pages were crawled, identify failed URLs, or build a sitemap.
For completed crawlers (`is_finished: true`), all retrieval endpoints return `Cache-Control: public, max-age=3600, immutable` headers. This enables:
- Browser caching: Automatically cache responses for 1 hour
- CDN acceleration: Content can be cached by intermediate proxies
- Reduced API calls: Repeat requests served from cache without counting against limits
- Immutable guarantee: Content won't change, safe to cache aggressively
Query Specific Page Content
Retrieve extracted content for specific URLs from the crawl. Perfect for selective content retrieval without downloading the entire dataset.
Single URL Query
Retrieve content for one specific URL using the `url` query parameter; the response contains the extracted content for the specified URL:
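A sketch in Python, assuming a hypothetical contents endpoint under the crawl UUID (the exact path is not confirmed here; the `url` and `formats` parameters are described below):

```python
import requests

API_BASE = "https://api.scrapfly.io"  # assumed base URL
API_KEY = "YOUR_API_KEY"
CRAWL_UUID = "YOUR_CRAWL_UUID"

# Hypothetical contents endpoint; consult the API specification for the exact path.
resp = requests.get(
    f"{API_BASE}/crawl/{CRAWL_UUID}/contents",
    params={
        "key": API_KEY,
        "url": "https://example.com/page",  # the crawled page to retrieve
        "formats": "markdown",
    },
)
resp.raise_for_status()
print(resp.json())  # JSON wrapper containing the extracted content for that URL
```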
Plain Mode (Efficient)
Return raw content directly without JSON wrapper by adding plain=true. Perfect for shell scripts and direct file piping:
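A minimal sketch, assuming the same hypothetical contents endpoint as the single-URL example above:

```python
import requests

API_BASE = "https://api.scrapfly.io"  # assumed base URL
API_KEY = "YOUR_API_KEY"
CRAWL_UUID = "YOUR_CRAWL_UUID"

resp = requests.get(
    f"{API_BASE}/crawl/{CRAWL_UUID}/contents",  # hypothetical path, as above
    params={
        "key": API_KEY,
        "url": "https://example.com/page",
        "formats": "markdown",  # exactly one format in plain mode
        "plain": "true",        # raw content, no JSON wrapper
    },
)
resp.raise_for_status()
# Content-Type matches the requested format, e.g. text/markdown.
with open("page.md", "wb") as f:
    f.write(resp.content)
```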
Plain mode requirements:
- Must specify `url` parameter (single URL only)
- Must specify exactly one format in `formats` parameter
- Response Content-Type matches the format (e.g., `text/markdown`, `text/html`)
- No JSON parsing needed - raw content in response body
Multipart Response Format
Request a multipart response for single URLs by setting the Accept header; the response then returns multiple formats for the same URL as separate parts, with the same efficiency benefits as batch queries:
- Multiple formats efficiently: Get markdown + text + HTML for the same URL without JSON escaping overhead
- Streaming processing: Process formats as they arrive in the multipart stream
- Bandwidth savings: ~50% smaller than JSON for text content due to no escaping
Batch URL Query (Efficient)
Retrieve content for multiple URLs in a single request. Maximum 100 URLs per request.
Response format: multipart/related (RFC 2387) - Each URL's content is returned as a separate part in the multipart response.
The multipart format provides ~50% bandwidth savings compared to JSON for text content by eliminating JSON escaping overhead. The response streams efficiently with constant memory usage, making it ideal for large content batches.
Parsing Multipart Responses
Use standard HTTP multipart libraries to parse the response:
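A sketch of a batch query parsed with Python's standard library email parser; the endpoint path is the same assumption as in the earlier examples, while the `formats` parameter, newline-separated body, and part headers follow the tables below:

```python
import requests
from email.parser import BytesParser
from email.policy import default

API_BASE = "https://api.scrapfly.io"  # assumed base URL
API_KEY = "YOUR_API_KEY"
CRAWL_UUID = "YOUR_CRAWL_UUID"

urls = [
    "https://example.com/page-1",
    "https://example.com/page-2",
]

# Hypothetical contents endpoint, as in the earlier examples.
resp = requests.post(
    f"{API_BASE}/crawl/{CRAWL_UUID}/contents",
    params={"key": API_KEY, "formats": "markdown,text"},
    data="\n".join(urls),  # plain-text body, one URL per line (max 100)
    headers={"Accept": "multipart/related"},
)
resp.raise_for_status()
print("requested:", resp.headers.get("X-Scrapfly-Requested-URLs"))
print("found:", resp.headers.get("X-Scrapfly-Found-URLs"))

# Rebuild a MIME document so the standard library can split the parts.
raw = b"Content-Type: " + resp.headers["Content-Type"].encode() + b"\r\n\r\n" + resp.content
message = BytesParser(policy=default).parsebytes(raw)

for part in message.iter_parts():
    page_url = part["Content-Location"]    # the URL this content belongs to
    mime = part.get_content_type()         # e.g. text/markdown
    body = part.get_payload(decode=True)   # raw bytes of the extracted content
    print(page_url, mime, len(body))
```

Any standard multipart parser works here; for example, requests-toolbelt's MultipartDecoder can be used instead of the standard library parser.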
Batch Query Parameters
| Parameter | Type | Description |
|---|---|---|
| `key` | Query Param | Your API key (required) |
| `formats` | Query Param | Comma-separated list of formats for batch query (e.g., `markdown,text,html`) |
| Request Body | Plain Text | URLs separated by newlines (for batch query, max 100 URLs) |
Response Headers
| Header | Description |
|---|---|
| `Content-Type` | `multipart/related; boundary=<random>` - Standard HTTP multipart format (RFC 2387) |
| `X-Scrapfly-Requested-URLs` | Number of URLs in your request |
| `X-Scrapfly-Found-URLs` | Number of URLs found in crawl results (may be less if some URLs were not crawled) |
Multipart Part Headers
Each part in the multipart response contains:
| Header | Description |
|---|---|
| `Content-Type` | MIME type of the content (e.g., `text/markdown`, `text/plain`, `text/html`) |
| `Content-Location` | The URL this content belongs to |
Available formats:
- `html` - Raw HTML content
- `clean_html` - HTML with boilerplate removed
- `markdown` - Markdown format (ideal for LLM training data)
- `text` - Plain text only
- `json` - Structured JSON representation
- `extracted_data` - AI-extracted structured data
- `page_metadata` - Page metadata (title, description, etc.)
Use cases:
- Single query: Fetch content for individual pages via API for real-time processing
- Batch query: Efficiently retrieve content for multiple specific URLs (e.g., product pages, article URLs)
Get All Crawled Contents
Retrieve all extracted contents in a single request. The response is a JSON object mapping each URL to its extracted content in your chosen format:
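A sketch in Python; the endpoint path for retrieving all contents is an assumption for illustration:

```python
import requests

API_BASE = "https://api.scrapfly.io"  # assumed base URL
API_KEY = "YOUR_API_KEY"
CRAWL_UUID = "YOUR_CRAWL_UUID"

# Hypothetical "all contents" endpoint; check the API specification for the exact path.
resp = requests.get(
    f"{API_BASE}/crawl/{CRAWL_UUID}/contents/all",
    params={"key": API_KEY, "formats": "markdown"},
)
resp.raise_for_status()

contents = resp.json()  # JSON object mapping URL -> extracted content
for url, content in contents.items():
    print(url, len(content))
```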
Available formats:
- `html` - Raw HTML content
- `clean_html` - HTML with boilerplate removed
- `markdown` - Markdown format (ideal for LLM training data)
- `text` - Plain text only
- `json` - Structured JSON representation
- `extracted_data` - AI-extracted structured data
- `page_metadata` - Page metadata (title, description, etc.)
For crawls with hundreds or thousands of pages, this endpoint may return large responses. Consider using artifacts or querying specific URLs instead.
Use case: Small to medium crawls where you need all content via API, or testing/development.
Download Artifacts (Recommended for Large Crawls)
Download industry-standard archive formats containing all crawled data. This is the most efficient method for large crawls, avoiding multiple API calls and handling huge datasets with ease.
Why Use Artifacts?
- Massive Scale - Handle crawls with thousands or millions of pages efficiently
- Single Download - Get the entire crawl in one compressed file, avoiding pagination and rate limits
- Offline Processing - Query and analyze data locally without additional API calls
- Cost Effective - One-time download instead of per-page API requests
- Flexible Storage - Store artifacts in S3, object storage, or local disk for long-term archival
- Industry Standard - WARC and HAR formats are universally supported by analysis tools
Available Artifact Types
WARC (Web ARChive Format)
Industry-standard format for web archiving. Contains complete HTTP request/response pairs, headers, and extracted content. Compressed with gzip for efficient storage.
Use case: Long-term archival, offline analysis with standard tools, research datasets.
See our complete WARC format guide for custom headers, reading libraries in multiple languages, and code examples.
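As a quick local sketch, you can stream the WARC artifact and iterate its records with the warcio library; the artifact download path and `type` parameter are assumptions, while the warcio calls follow its standard ArchiveIterator API:

```python
import requests
from warcio.archiveiterator import ArchiveIterator  # pip install warcio

API_BASE = "https://api.scrapfly.io"  # assumed base URL
API_KEY = "YOUR_API_KEY"
CRAWL_UUID = "YOUR_CRAWL_UUID"

# Hypothetical artifact endpoint and type parameter; see the API specification.
with requests.get(
    f"{API_BASE}/crawl/{CRAWL_UUID}/artifact",
    params={"key": API_KEY, "type": "warc"},
    stream=True,
) as resp:
    resp.raise_for_status()
    # ArchiveIterator reads the gzipped WARC stream directly.
    for record in ArchiveIterator(resp.raw):
        if record.rec_type == "response":
            url = record.rec_headers.get_header("WARC-Target-URI")
            body = record.content_stream().read()
            print(url, len(body))
```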
HAR (HTTP Archive Format)
JSON-based format with detailed HTTP transaction data. Ideal for performance analysis, debugging, and browser replay tools.
Use case: Performance analysis, browser DevTools import, debugging HTTP transactions.
Complete Retrieval Workflow
Here's a complete example showing how to wait for completion and retrieve results:
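The sketch below combines the earlier pieces in Python: poll `/crawl/{uuid}/status` until `is_finished`, check `is_success`, then download the WARC artifact (artifact path assumed as in the earlier example):

```python
import time
import requests

API_BASE = "https://api.scrapfly.io"  # assumed base URL
API_KEY = "YOUR_API_KEY"
CRAWL_UUID = "YOUR_CRAWL_UUID"

# 1. Wait for the crawler to finish by polling /crawl/{uuid}/status.
while True:
    status = requests.get(
        f"{API_BASE}/crawl/{CRAWL_UUID}/status", params={"key": API_KEY}
    ).json()
    if status.get("is_finished"):
        break
    time.sleep(10)

if not status.get("is_success"):
    raise RuntimeError(f"Crawl failed: {status}")

# 2. Download the WARC artifact (hypothetical endpoint, as noted earlier).
artifact_path = f"crawl-{CRAWL_UUID}.warc.gz"
with requests.get(
    f"{API_BASE}/crawl/{CRAWL_UUID}/artifact",
    params={"key": API_KEY, "type": "warc"},
    stream=True,
) as resp:
    resp.raise_for_status()
    with open(artifact_path, "wb") as f:
        for chunk in resp.iter_content(chunk_size=1 << 20):
            f.write(chunk)

print("Saved artifact to", artifact_path)
```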
Next Steps
- Learn about webhook integration for real-time notifications
- Understand billing and costs
- Review the full API specification