Retrieving Crawler Results

Once your crawl job has completed, you have multiple options for retrieving the results. Choose the method that best fits your use case: individual URLs, content queries, or complete artifacts.

Choosing the Right Method

Select the retrieval method that best matches your use case. Consider your crawl size, processing needs, and infrastructure.

List URLs

Best for:

  • URL discovery & mapping
  • Failed URL analysis
  • Sitemap generation
  • Crawl auditing

Scale: Any size

Query Specific Pages

Best for:

  • Selective retrieval
  • Real-time processing
  • On-demand access
  • API integration

Scale: Any size (per-page)

Get All Content

Best for:

  • Small crawls
  • Testing & development
  • Quick prototyping
  • Simple integration

Scale: Best for <100 pages

Download Artifacts (Recommended)

Best for:

  • Large crawls (100s-1000s+)
  • Long-term archival
  • Offline processing
  • Data pipelines

Scale: Unlimited

Retrieval Methods

The Crawler API provides four complementary methods for accessing your crawled data:

List Crawled URLs

Get a comprehensive list of all URLs discovered and crawled during the job, with detailed metadata for each URL including status codes, depth, and timestamps.

You can filter the list by status (for example, to see only failed URLs). The response includes metadata for each URL, such as its status code, crawl depth, and timestamps.
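As a hedged sketch of what this can look like from Python with the requests library (the /urls path, the status filter name, and the response field names are assumptions, not documented routes; check the API reference for the exact details):

    import requests

    API_KEY = "YOUR_API_KEY"
    # Hypothetical route for listing a crawl job's URLs; check the API reference for the real path.
    urls_endpoint = "https://api.scrapfly.io/crawl/<job_id>/urls"

    # Example: list only failed URLs ("status" as a filter name is an assumption).
    resp = requests.get(urls_endpoint, params={"key": API_KEY, "status": "failed"})
    resp.raise_for_status()

    for entry in resp.json().get("urls", []):  # response field names are illustrative
        # Each entry carries per-URL metadata such as status code, depth, and timestamps.
        print(entry.get("url"), entry.get("status_code"), entry.get("depth"))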

Use case: Audit which pages were crawled, identify failed URLs, or build a sitemap.

Query Specific Page Content

Retrieve extracted content for specific URLs from the crawl. Perfect for selective content retrieval without downloading the entire dataset.

Single URL Query

Retrieve content for one specific URL by passing it in the url query parameter; the response contains the extracted content for that URL.
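For illustration, a minimal Python sketch of a single-URL query; the content endpoint path is a placeholder, while the key, url, and formats parameters follow the descriptions in this section:

    import requests

    API_KEY = "YOUR_API_KEY"
    content_endpoint = "https://api.scrapfly.io/crawl/<job_id>/content"  # placeholder path

    resp = requests.get(content_endpoint, params={
        "key": API_KEY,
        "url": "https://example.com/products/item-1",  # the crawled page to retrieve
        "formats": "markdown",
    })
    resp.raise_for_status()
    print(resp.json())  # JSON wrapper with the extracted content for that URL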

Plain Mode (Efficient)

Return the raw content directly, without a JSON wrapper, by adding plain=true. Perfect for shell scripts and direct file piping:
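For example, a small Python sketch that writes the raw markdown straight to a file (same placeholder endpoint as above):

    import requests

    API_KEY = "YOUR_API_KEY"
    content_endpoint = "https://api.scrapfly.io/crawl/<job_id>/content"  # placeholder path

    resp = requests.get(content_endpoint, params={
        "key": API_KEY,
        "url": "https://example.com/products/item-1",
        "formats": "markdown",
        "plain": "true",  # raw content in the body, no JSON wrapper
    })
    resp.raise_for_status()

    with open("item-1.md", "w", encoding="utf-8") as f:
        f.write(resp.text)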

Multipart Response Format

Request a multipart response for a single URL by setting the Accept header; this gives the same efficiency benefits as batch queries, and the response returns each requested format for that URL as a separate part.
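A sketch of the same idea in Python; the endpoint path remains a placeholder, and the Accept value follows the multipart/related format described below:

    import requests

    API_KEY = "YOUR_API_KEY"
    content_endpoint = "https://api.scrapfly.io/crawl/<job_id>/content"  # placeholder path

    resp = requests.get(
        content_endpoint,
        params={"key": API_KEY, "url": "https://example.com/", "formats": "markdown,html"},
        headers={"Accept": "multipart/related"},  # ask for one part per requested format
    )
    resp.raise_for_status()
    print(resp.headers["Content-Type"])  # multipart/related; boundary=...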

Batch URL Query (Efficient)

Retrieve content for multiple URLs in a single request. Maximum 100 URLs per request.

Response format: multipart/related (RFC 2387) - Each URL's content is returned as a separate part in the multipart response.
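Sketched in Python below; the endpoint path and the use of POST are assumptions (a request body is involved), while key, formats, and the newline-separated plain-text body follow the parameter list later in this section:

    import requests

    API_KEY = "YOUR_API_KEY"
    content_endpoint = "https://api.scrapfly.io/crawl/<job_id>/content"  # placeholder path

    urls = [
        "https://example.com/products/item-1",
        "https://example.com/products/item-2",
    ]  # up to 100 URLs per request

    resp = requests.post(
        content_endpoint,
        params={"key": API_KEY, "formats": "markdown,text"},
        data="\n".join(urls),                       # plain-text body, one URL per line
        headers={"Content-Type": "text/plain"},
    )
    resp.raise_for_status()
    print(resp.headers.get("X-Scrapfly-Found-URLs"))  # how many requested URLs were in the crawl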

Parsing Multipart Responses

Use standard HTTP multipart libraries to parse the response:
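For instance, Python's standard library email parser understands multipart/related bodies; a small helper sketch (the helper itself is just an illustration):

    from email.parser import BytesParser
    from email.policy import default

    def split_multipart(content_type: str, body: bytes) -> list:
        """Return (url, mime_type, content) for each part of a multipart/related body."""
        raw = b"Content-Type: " + content_type.encode() + b"\r\n\r\n" + body
        msg = BytesParser(policy=default).parsebytes(raw)
        parts = []
        for part in msg.iter_parts():
            # Content-Location names the crawled URL, Content-Type the format of this part.
            parts.append((part["Content-Location"], part.get_content_type(), part.get_content()))
        return parts

    # Usage with the batch response above:
    # parts = split_multipart(resp.headers["Content-Type"], resp.content)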

Batch Query Parameters

  • key (query parameter): Your API key (required)
  • formats (query parameter): Comma-separated list of formats for the batch query (e.g., markdown,text,html)
  • Request body (plain text): URLs separated by newlines, one per line (batch query only, max 100 URLs)

Response Headers

  • Content-Type: multipart/related; boundary=<random>, the standard HTTP multipart format (RFC 2387)
  • X-Scrapfly-Requested-URLs: Number of URLs in your request
  • X-Scrapfly-Found-URLs: Number of URLs found in the crawl results (may be lower if some URLs were not crawled)

Multipart Part Headers

Each part in the multipart response contains:

  • Content-Type: MIME type of the content (e.g., text/markdown, text/plain, text/html)
  • Content-Location: The URL this part's content belongs to

Available formats:

  • html - Raw HTML content
  • clean_html - HTML with boilerplate removed
  • markdown - Markdown format (ideal for LLM training data)
  • text - Plain text only
  • json - Structured JSON representation
  • extracted_data - AI-extracted structured data
  • page_metadata - Page metadata (title, description, etc.)

Use cases:

  • Single query: Fetch content for individual pages via API for real-time processing
  • Batch query: Efficiently retrieve content for multiple specific URLs (e.g., product pages, article URLs)

Get All Crawled Contents

Retrieve all extracted content from the crawl in a single call. Returns a JSON object mapping each crawled URL to its extracted content in your chosen format.

Response contains contents mapped by URL:
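A hedged Python sketch (the /contents path and the format parameter name are assumptions; the URL-to-content mapping mirrors the description above):

    import requests

    API_KEY = "YOUR_API_KEY"
    contents_endpoint = "https://api.scrapfly.io/crawl/<job_id>/contents"  # placeholder path

    resp = requests.get(contents_endpoint, params={"key": API_KEY, "format": "markdown"})
    resp.raise_for_status()

    # The response maps each crawled URL to its content in the chosen format.
    for url, content in resp.json().items():
        print(url, len(content), "characters")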

Available formats:

  • html - Raw HTML content
  • clean_html - HTML with boilerplate removed
  • markdown - Markdown format (ideal for LLM training data)
  • text - Plain text only
  • json - Structured JSON representation
  • extracted_data - AI-extracted structured data
  • page_metadata - Page metadata (title, description, etc.)

Use case: Small to medium crawls where you need all content via API, or testing/development.

Download Artifacts (Recommended for Large Crawls)

Download industry-standard archive formats containing all crawled data. This is the most efficient method for large crawls, avoiding multiple API calls and handling huge datasets with ease.

Why Use Artifacts?

  • Massive Scale - Handle crawls with thousands or millions of pages efficiently
  • Single Download - Get the entire crawl in one compressed file, avoiding pagination and rate limits
  • Offline Processing - Query and analyze data locally without additional API calls
  • Cost Effective - One-time download instead of per-page API requests
  • Flexible Storage - Store artifacts in S3, object storage, or local disk for long-term archival
  • Industry Standard - WARC and HAR formats are universally supported by analysis tools

Available Artifact Types

WARC (Web ARChive Format)

Industry-standard format for web archiving. Contains complete HTTP request/response pairs, headers, and extracted content. Compressed with gzip for efficient storage.

Use case: Long-term archival, offline analysis with standard tools, research datasets.
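For example, once a WARC artifact has been downloaded, a library such as warcio can iterate its records locally without any further API calls (the filename is just an example):

    from warcio.archiveiterator import ArchiveIterator

    # Read response records from a downloaded, gzip-compressed WARC artifact.
    with open("crawl-artifact.warc.gz", "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == "response":
                url = record.rec_headers.get_header("WARC-Target-URI")
                body = record.content_stream().read()
                print(url, len(body), "bytes")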

HAR (HTTP Archive Format)

JSON-based format with detailed HTTP transaction data. Ideal for performance analysis, debugging, and browser replay tools.

Use case: Performance analysis, browser DevTools import, debugging HTTP transactions.
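Because HAR is plain JSON, a downloaded archive can be inspected with nothing but the standard library; a small sketch (the filename is an example):

    import json

    with open("crawl-artifact.har", encoding="utf-8") as f:
        har = json.load(f)

    # Each HTTP transaction is an entry under log.entries in the HAR format.
    for entry in har["log"]["entries"]:
        request, response = entry["request"], entry["response"]
        print(request["method"], request["url"], response["status"])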

Complete Retrieval Workflow

Here's a complete example showing how to wait for completion and retrieve results:
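The sketch below is illustrative only: the job status route, the status field, and the artifact download path are assumptions, so consult the API reference for the exact endpoints.

    import time
    import requests

    API_KEY = "YOUR_API_KEY"
    JOB_URL = "https://api.scrapfly.io/crawl/<job_id>"  # placeholder job route

    # 1. Poll the job until it reports a terminal state (field and value names are assumptions).
    while True:
        job = requests.get(JOB_URL, params={"key": API_KEY}).json()
        if job.get("status") in ("completed", "failed"):
            break
        time.sleep(10)

    # 2. Download the WARC artifact for offline processing (path and type parameter are placeholders).
    artifact = requests.get(f"{JOB_URL}/artifact", params={"key": API_KEY, "type": "warc"},
                            stream=True)
    artifact.raise_for_status()
    with open("crawl-artifact.warc.gz", "wb") as f:
        for chunk in artifact.iter_content(chunk_size=1 << 20):
            f.write(chunk)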

Summary

List crawled URLs for auditing and discovery, query specific pages for selective or real-time access, fetch all contents for small crawls and quick prototyping, and download WARC or HAR artifacts for large crawls, long-term archival, and offline processing.