Retrieving Crawler Results

Once your crawl job has completed, you have multiple options for retrieving the results. Choose the method that best fits your use case: individual URLs, content queries, or complete artifacts.

Choosing the Right Method

Select the retrieval method that best matches your use case. Consider your crawl size, processing needs, and infrastructure.

List URLs

Best for:

  • URL discovery & mapping
  • Failed URL analysis
  • Sitemap generation
  • Crawl auditing

Scale: Any size

Query Specific Pages

Best for:

  • Selective retrieval
  • Real-time processing
  • On-demand access
  • API integration

Scale: Any size (per-page)

Get All Content

Best for:

  • Small crawls
  • Testing & development
  • Quick prototyping
  • Simple integration

Scale: Best for <100 pages

Download Artifacts (Recommended)

Best for:

  • Large crawls (100s-1000s+)
  • Long-term archival
  • Offline processing
  • Data pipelines

Scale: Unlimited

Retrieval Methods

The Crawler API provides four complementary methods for accessing your crawled data:

List Crawled URLs

Get a comprehensive list of all URLs discovered and crawled during the job, with detailed metadata for each URL including status codes, depth, and timestamps.

You can filter the list by status (for example, to see only failed URLs). The response includes metadata for each URL, such as its status code, crawl depth, and timestamps.
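As a hedged sketch of what this can look like from Python with the requests library (the /urls path, the status filter name, and the response field names are assumptions, not documented routes; check the API reference for the exact details):

    import requests

    API_KEY = "YOUR_API_KEY"
    # Hypothetical route for listing a crawl job's URLs; check the API reference for the real path.
    urls_endpoint = "https://api.scrapfly.io/crawl/<job_id>/urls"

    # Example: list only failed URLs ("status" as a filter name is an assumption).
    resp = requests.get(urls_endpoint, params={"key": API_KEY, "status": "failed"})
    resp.raise_for_status()

    for entry in resp.json().get("urls", []):  # response field names are illustrative
        # Each entry carries per-URL metadata such as status code, depth, and timestamps.
        print(entry.get("url"), entry.get("status_code"), entry.get("depth"))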

Use case: Audit which pages were crawled, identify failed URLs, or build a sitemap.

Query Specific Page Content

Retrieve extracted content for specific URLs from the crawl. Perfect for selective content retrieval without downloading the entire dataset.

Single URL Query

Retrieve content for one specific URL by passing it in the url query parameter; the response contains the extracted content for that URL.
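For illustration, a minimal Python sketch of a single-URL query; the content endpoint path is a placeholder, while the key, url, and formats parameters follow the descriptions in this section:

    import requests

    API_KEY = "YOUR_API_KEY"
    content_endpoint = "https://api.scrapfly.io/crawl/<job_id>/content"  # placeholder path

    resp = requests.get(content_endpoint, params={
        "key": API_KEY,
        "url": "https://example.com/products/item-1",  # the crawled page to retrieve
        "formats": "markdown",
    })
    resp.raise_for_status()
    print(resp.json())  # JSON wrapper with the extracted content for that URL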

Plain Mode (Efficient)

Return the raw content directly, without a JSON wrapper, by adding plain=true. Perfect for shell scripts and direct file piping:
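For example, a small Python sketch that writes the raw markdown straight to a file (same placeholder endpoint as above):

    import requests

    API_KEY = "YOUR_API_KEY"
    content_endpoint = "https://api.scrapfly.io/crawl/<job_id>/content"  # placeholder path

    resp = requests.get(content_endpoint, params={
        "key": API_KEY,
        "url": "https://example.com/products/item-1",
        "formats": "markdown",
        "plain": "true",  # raw content in the body, no JSON wrapper
    })
    resp.raise_for_status()

    with open("item-1.md", "w", encoding="utf-8") as f:
        f.write(resp.text)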

Multipart Response Format

Request a multipart response for a single URL by setting the Accept header; this gives the same efficiency benefits as batch queries, and the response returns each requested format for that URL as a separate part.
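A sketch of the same idea in Python; the endpoint path remains a placeholder, and the Accept value follows the multipart/related format described below:

    import requests

    API_KEY = "YOUR_API_KEY"
    content_endpoint = "https://api.scrapfly.io/crawl/<job_id>/content"  # placeholder path

    resp = requests.get(
        content_endpoint,
        params={"key": API_KEY, "url": "https://example.com/", "formats": "markdown,html"},
        headers={"Accept": "multipart/related"},  # ask for one part per requested format
    )
    resp.raise_for_status()
    print(resp.headers["Content-Type"])  # multipart/related; boundary=...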

Batch URL Query (Efficient)

Retrieve content for multiple URLs in a single request. Maximum 100 URLs per request.

Response format: multipart/related (RFC 2387) - Each URL's content is returned as a separate part in the multipart response.
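Sketched in Python below; the endpoint path and the use of POST are assumptions (a request body is involved), while key, formats, and the newline-separated plain-text body follow the parameter list later in this section:

    import requests

    API_KEY = "YOUR_API_KEY"
    content_endpoint = "https://api.scrapfly.io/crawl/<job_id>/content"  # placeholder path

    urls = [
        "https://example.com/products/item-1",
        "https://example.com/products/item-2",
    ]  # up to 100 URLs per request

    resp = requests.post(
        content_endpoint,
        params={"key": API_KEY, "formats": "markdown,text"},
        data="\n".join(urls),                       # plain-text body, one URL per line
        headers={"Content-Type": "text/plain"},
    )
    resp.raise_for_status()
    print(resp.headers.get("X-Scrapfly-Found-URLs"))  # how many requested URLs were in the crawl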

Parsing Multipart Responses

Use standard HTTP multipart libraries to parse the response:
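For instance, Python's standard library email parser understands multipart/related bodies; a small helper sketch (the helper itself is just an illustration):

    from email.parser import BytesParser
    from email.policy import default

    def split_multipart(content_type: str, body: bytes) -> list:
        """Return (url, mime_type, content) for each part of a multipart/related body."""
        raw = b"Content-Type: " + content_type.encode() + b"\r\n\r\n" + body
        msg = BytesParser(policy=default).parsebytes(raw)
        parts = []
        for part in msg.iter_parts():
            # Content-Location names the crawled URL, Content-Type the format of this part.
            parts.append((part["Content-Location"], part.get_content_type(), part.get_content()))
        return parts

    # Usage with the batch response above:
    # parts = split_multipart(resp.headers["Content-Type"], resp.content)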

Batch Query Parameters

  • key (query parameter): Your API key (required)
  • formats (query parameter): Comma-separated list of formats for the batch query (e.g., markdown,text,html)
  • Request body (plain text): URLs separated by newlines, one per line (batch query only, max 100 URLs)

Response Headers

  • Content-Type: multipart/related; boundary=<random>, the standard HTTP multipart format (RFC 2387)
  • X-Scrapfly-Requested-URLs: Number of URLs in your request
  • X-Scrapfly-Found-URLs: Number of URLs found in the crawl results (may be lower if some URLs were not crawled)

Multipart Part Headers

Each part in the multipart response contains:

  • Content-Type: MIME type of the content (e.g., text/markdown, text/plain, text/html)
  • Content-Location: The URL this part's content belongs to

Available formats:

  • html - Raw HTML content
  • clean_html - HTML with boilerplate removed
  • markdown - Markdown format (ideal for LLM training data)
  • text - Plain text only
  • json - Structured JSON representation
  • extracted_data - AI-extracted structured data
  • page_metadata - Page metadata (title, description, etc.)

Use cases:

  • Single query: Fetch content for individual pages via API for real-time processing
  • Batch query: Efficiently retrieve content for multiple specific URLs (e.g., product pages, article URLs)

Get All Crawled Contents

Retrieve all extracted content from the crawl in a single call. Returns a JSON object mapping each crawled URL to its extracted content in your chosen format.

Response contains contents mapped by URL:
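A hedged Python sketch (the /contents path and the format parameter name are assumptions; the URL-to-content mapping mirrors the description above):

    import requests

    API_KEY = "YOUR_API_KEY"
    contents_endpoint = "https://api.scrapfly.io/crawl/<job_id>/contents"  # placeholder path

    resp = requests.get(contents_endpoint, params={"key": API_KEY, "format": "markdown"})
    resp.raise_for_status()

    # The response maps each crawled URL to its content in the chosen format.
    for url, content in resp.json().items():
        print(url, len(content), "characters")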

Available formats:

  • html - Raw HTML content
  • clean_html - HTML with boilerplate removed
  • markdown - Markdown format (ideal for LLM training data)
  • text - Plain text only
  • json - Structured JSON representation
  • extracted_data - AI-extracted structured data
  • page_metadata - Page metadata (title, description, etc.)

Use case: Small to medium crawls where you need all content via API, or testing/development.

Download Artifacts (Recommended for Large Crawls)

Download industry-standard archive formats containing all crawled data. This is the most efficient method for large crawls, avoiding multiple API calls and handling huge datasets with ease.

Why Use Artifacts?

  • Massive Scale - Handle crawls with thousands or millions of pages efficiently
  • Single Download - Get the entire crawl in one compressed file, avoiding pagination and rate limits
  • Offline Processing - Query and analyze data locally without additional API calls
  • Cost Effective - One-time download instead of per-page API requests
  • Flexible Storage - Store artifacts in S3, object storage, or local disk for long-term archival
  • Industry Standard - WARC and HAR formats are universally supported by analysis tools

Available Artifact Types

WARC (Web ARChive Format)

Industry-standard format for web archiving. Contains complete HTTP request/response pairs, headers, and extracted content. Compressed with gzip for efficient storage.

Use case: Long-term archival, offline analysis with standard tools, research datasets.
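For example, once a WARC artifact has been downloaded, a library such as warcio can iterate its records locally without any further API calls (the filename is just an example):

    from warcio.archiveiterator import ArchiveIterator

    # Read response records from a downloaded, gzip-compressed WARC artifact.
    with open("crawl-artifact.warc.gz", "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == "response":
                url = record.rec_headers.get_header("WARC-Target-URI")
                body = record.content_stream().read()
                print(url, len(body), "bytes")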

HAR (HTTP Archive Format)

JSON-based format with detailed HTTP transaction data. Ideal for performance analysis, debugging, and browser replay tools.

Use case: Performance analysis, browser DevTools import, debugging HTTP transactions.
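Because HAR is plain JSON, a downloaded archive can be inspected with nothing but the standard library; a small sketch (the filename is an example):

    import json

    with open("crawl-artifact.har", encoding="utf-8") as f:
        har = json.load(f)

    # Each HTTP transaction is an entry under log.entries in the HAR format.
    for entry in har["log"]["entries"]:
        request, response = entry["request"], entry["response"]
        print(request["method"], request["url"], response["status"])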

Complete Retrieval Workflow

Here's a complete example showing how to wait for completion and retrieve results:
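The sketch below is illustrative only: the job status route, the status field, and the artifact download path are assumptions, so consult the API reference for the exact endpoints.

    import time
    import requests

    API_KEY = "YOUR_API_KEY"
    JOB_URL = "https://api.scrapfly.io/crawl/<job_id>"  # placeholder job route

    # 1. Poll the job until it reports a terminal state (field and value names are assumptions).
    while True:
        job = requests.get(JOB_URL, params={"key": API_KEY}).json()
        if job.get("status") in ("completed", "failed"):
            break
        time.sleep(10)

    # 2. Download the WARC artifact for offline processing (path and type parameter are placeholders).
    artifact = requests.get(f"{JOB_URL}/artifact", params={"key": API_KEY, "type": "warc"},
                            stream=True)
    artifact.raise_for_status()
    with open("crawl-artifact.warc.gz", "wb") as f:
        for chunk in artifact.iter_content(chunk_size=1 << 20):
            f.write(chunk)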

Summary

List crawled URLs for auditing and discovery, query specific pages for selective or real-time access, fetch all contents for small crawls and quick prototyping, and download WARC or HAR artifacts for large crawls, long-term archival, and offline processing.