# Scrapfly Documentation

## Table of Contents

### Dashboard

- [Intro](https://scrapfly.io/docs)
- [Project](https://scrapfly.io/docs/project)
- [Account](https://scrapfly.io/docs/account)
- [Workspace & Team](https://scrapfly.io/docs/workspace-and-team)
- [Billing](https://scrapfly.io/docs/billing)

### Products

#### MCP Server

- [Getting Started](https://scrapfly.io/docs/mcp/getting-started)
- [Tools & API Spec](https://scrapfly.io/docs/mcp/tools)
- [Authentication](https://scrapfly.io/docs/mcp/authentication)
- [Examples & Use Cases](https://scrapfly.io/docs/mcp/examples)
- [FAQ](https://scrapfly.io/docs/mcp/faq)

##### Integrations

- [Overview](https://scrapfly.io/docs/mcp/integrations)
- [Claude Desktop](https://scrapfly.io/docs/mcp/integrations/claude-desktop)
- [Claude Code](https://scrapfly.io/docs/mcp/integrations/claude-code)
- [ChatGPT](https://scrapfly.io/docs/mcp/integrations/chatgpt)
- [Cursor](https://scrapfly.io/docs/mcp/integrations/cursor)
- [Cline](https://scrapfly.io/docs/mcp/integrations/cline)
- [Windsurf](https://scrapfly.io/docs/mcp/integrations/windsurf)
- [Zed](https://scrapfly.io/docs/mcp/integrations/zed)
- [Roo Code](https://scrapfly.io/docs/mcp/integrations/roo-code)
- [VS Code](https://scrapfly.io/docs/mcp/integrations/vscode)
- [LangChain](https://scrapfly.io/docs/mcp/integrations/langchain)
- [LlamaIndex](https://scrapfly.io/docs/mcp/integrations/llamaindex)
- [CrewAI](https://scrapfly.io/docs/mcp/integrations/crewai)
- [OpenAI](https://scrapfly.io/docs/mcp/integrations/openai)
- [n8n](https://scrapfly.io/docs/mcp/integrations/n8n)
- [Make](https://scrapfly.io/docs/mcp/integrations/make)
- [Zapier](https://scrapfly.io/docs/mcp/integrations/zapier)
- [Vapi AI](https://scrapfly.io/docs/mcp/integrations/vapi)
- [Agent Builder](https://scrapfly.io/docs/mcp/integrations/agent-builder)
- [Custom Client](https://scrapfly.io/docs/mcp/integrations/custom-client)


#### Web Scraping API

- [Getting Started](https://scrapfly.io/docs/scrape-api/getting-started)
- API Specification
- [Monitoring](https://scrapfly.io/docs/monitoring)
- [Customize Request](https://scrapfly.io/docs/scrape-api/custom)
- [Debug](https://scrapfly.io/docs/scrape-api/debug)
- [Anti Scraping Protection](https://scrapfly.io/docs/scrape-api/anti-scraping-protection)
- [Proxy](https://scrapfly.io/docs/scrape-api/proxy)
- [Proxy Mode](https://scrapfly.io/docs/scrape-api/proxy-mode)
- [Proxy Mode - Screaming Frog](https://scrapfly.io/docs/scrape-api/proxy-mode/screaming-frog)
- [Proxy Mode - Apify](https://scrapfly.io/docs/scrape-api/proxy-mode/apify)
- [(Auto) Data Extraction](https://scrapfly.io/docs/scrape-api/extraction)
- [Javascript Rendering](https://scrapfly.io/docs/scrape-api/javascript-rendering)
- [Javascript Scenario](https://scrapfly.io/docs/scrape-api/javascript-scenario)
- [SSL](https://scrapfly.io/docs/scrape-api/ssl)
- [DNS](https://scrapfly.io/docs/scrape-api/dns)
- [Cache](https://scrapfly.io/docs/scrape-api/cache)
- [Session](https://scrapfly.io/docs/scrape-api/session)
- [Webhook](https://scrapfly.io/docs/scrape-api/webhook)
- [Screenshot](https://scrapfly.io/docs/scrape-api/screenshot)
- [Errors](https://scrapfly.io/docs/scrape-api/errors)
- [Timeout](https://scrapfly.io/docs/scrape-api/understand-timeout)
- [Throttling](https://scrapfly.io/docs/throttling)
- [Troubleshoot](https://scrapfly.io/docs/scrape-api/troubleshoot)
- [Billing](https://scrapfly.io/docs/scrape-api/billing)
- [FAQ](https://scrapfly.io/docs/scrape-api/faq)

#### Crawler API

- [Getting Started](https://scrapfly.io/docs/crawler-api/getting-started)
- API Specification
- [Retrieving Results](https://scrapfly.io/docs/crawler-api/results)
- [WARC Format](https://scrapfly.io/docs/crawler-api/warc-format)
- [Data Extraction](https://scrapfly.io/docs/crawler-api/extraction-rules)
- [Webhook](https://scrapfly.io/docs/crawler-api/webhook)
- [Billing](https://scrapfly.io/docs/crawler-api/billing)
- [Errors](https://scrapfly.io/docs/crawler-api/errors)
- [Troubleshoot](https://scrapfly.io/docs/crawler-api/troubleshoot)
- [FAQ](https://scrapfly.io/docs/crawler-api/faq)

#### Screenshot API

- [Getting Started](https://scrapfly.io/docs/screenshot-api/getting-started)
- API Specification
- [Accessibility Testing](https://scrapfly.io/docs/screenshot-api/accessibility)
- [Webhook](https://scrapfly.io/docs/screenshot-api/webhook)
- [Billing](https://scrapfly.io/docs/screenshot-api/billing)
- [Errors](https://scrapfly.io/docs/screenshot-api/errors)

#### Extraction API

- [Getting Started](https://scrapfly.io/docs/extraction-api/getting-started)
- API Specification
- [Rules Template](https://scrapfly.io/docs/extraction-api/rules-and-template)
- [LLM Extraction](https://scrapfly.io/docs/extraction-api/llm-prompt)
- [AI Auto Extraction](https://scrapfly.io/docs/extraction-api/automatic-ai)
- [Webhook](https://scrapfly.io/docs/extraction-api/webhook)
- [Billing](https://scrapfly.io/docs/extraction-api/billing)
- [Errors](https://scrapfly.io/docs/extraction-api/errors)
- [FAQ](https://scrapfly.io/docs/extraction-api/faq)

#### Proxy Saver

- [Getting Started](https://scrapfly.io/docs/proxy-saver/getting-started)
- [Fingerprints](https://scrapfly.io/docs/proxy-saver/fingerprints)
- [Optimizations](https://scrapfly.io/docs/proxy-saver/optimizations)
- [SSL Certificates](https://scrapfly.io/docs/proxy-saver/certificates)
- [Protocols](https://scrapfly.io/docs/proxy-saver/protocols)
- [Pacfile](https://scrapfly.io/docs/proxy-saver/pacfile)
- [Secure Credentials](https://scrapfly.io/docs/proxy-saver/security)
- [Billing](https://scrapfly.io/docs/proxy-saver/billing)

#### Cloud Browser API

- [Getting Started](https://scrapfly.io/docs/cloud-browser-api/getting-started)
- [Proxy & Geo-Targeting](https://scrapfly.io/docs/cloud-browser-api/proxy)
- [Unblock API](https://scrapfly.io/docs/cloud-browser-api/unblock)
- [File Downloads](https://scrapfly.io/docs/cloud-browser-api/file-downloads)
- [Session Resume](https://scrapfly.io/docs/cloud-browser-api/session-resume)
- [Human-in-the-Loop](https://scrapfly.io/docs/cloud-browser-api/human-in-the-loop)
- [Debug Mode](https://scrapfly.io/docs/cloud-browser-api/debug-mode)
- [Bring Your Own Proxy](https://scrapfly.io/docs/cloud-browser-api/bring-your-own-proxy)
- [Browser Extensions](https://scrapfly.io/docs/cloud-browser-api/extensions)

##### Integrations

- [Puppeteer](https://scrapfly.io/docs/cloud-browser-api/puppeteer)
- [Playwright](https://scrapfly.io/docs/cloud-browser-api/playwright)
- [Selenium](https://scrapfly.io/docs/cloud-browser-api/selenium)
- [Vercel Agent Browser](https://scrapfly.io/docs/cloud-browser-api/agent-browser)
- [Browser Use](https://scrapfly.io/docs/cloud-browser-api/browser-use)
- [Stagehand](https://scrapfly.io/docs/cloud-browser-api/stagehand)

- [Billing](https://scrapfly.io/docs/cloud-browser-api/billing)
- [Errors](https://scrapfly.io/docs/cloud-browser-api/errors)


### Tools

- [Antibot Detector](https://scrapfly.io/docs/tools/antibot-detector)

### SDK

- [Golang](https://scrapfly.io/docs/sdk/golang)
- [Python](https://scrapfly.io/docs/sdk/python)
- [Rust](https://scrapfly.io/docs/sdk/rust)
- [TypeScript](https://scrapfly.io/docs/sdk/typescript)
- [Scrapy](https://scrapfly.io/docs/sdk/scrapy)

### Integrations

- [Getting Started](https://scrapfly.io/docs/integration/getting-started)
- [LangChain](https://scrapfly.io/docs/integration/langchain)
- [LlamaIndex](https://scrapfly.io/docs/integration/llamaindex)
- [CrewAI](https://scrapfly.io/docs/integration/crewai)
- [Zapier](https://scrapfly.io/docs/integration/zapier)
- [Make](https://scrapfly.io/docs/integration/make)
- [n8n](https://scrapfly.io/docs/integration/n8n)

### Academy

- [Overview](https://scrapfly.io/academy)
- [Web Scraping Overview](https://scrapfly.io/academy/scraping-overview)
- [Tools](https://scrapfly.io/academy/tools-overview)
- [Reverse Engineering](https://scrapfly.io/academy/reverse-engineering)
- [Static Scraping](https://scrapfly.io/academy/static-scraping)
- [HTML Parsing](https://scrapfly.io/academy/html-parsing)
- [Dynamic Scraping](https://scrapfly.io/academy/dynamic-scraping)
- [Hidden API Scraping](https://scrapfly.io/academy/hidden-api-scraping)
- [Headless Browsers](https://scrapfly.io/academy/headless-browsers)
- [Hidden Web Data](https://scrapfly.io/academy/hidden-web-data)
- [JSON Parsing](https://scrapfly.io/academy/json-parsing)
- [Data Processing](https://scrapfly.io/academy/data-processing)
- [Scaling](https://scrapfly.io/academy/scaling)
- [Walkthrough Summary](https://scrapfly.io/academy/walkthrough-summary)
- [Scraper Blocking](https://scrapfly.io/academy/scraper-blocking)
- [Proxies](https://scrapfly.io/academy/proxies)

---

# Getting Started with Scrapfly Crawler API


 

 

The **Scrapfly Crawler API** enables recursive website crawling at scale. Crawl entire websites with configurable limits, extract content in multiple formats simultaneously, and retrieve results as industry-standard artifacts: [WARC](https://scrapfly.io/docs/crawler-api/warc-format) and Parquet for large-scale processing, plus HAR for easy visualization.

> **Early Access Feature** The Crawler API is currently in early access. Features and API may evolve based on user feedback.

## Quick Start: Choose Your Workflow  

 The Crawler API supports two integration patterns. Choose the approach that best fits your use case:


### Polling Workflow

 Schedule a crawl, poll the status endpoint to monitor progress, and retrieve results when complete. **Best for batch processing, testing, and simple integrations.**

  

1. **Schedule Crawl.** Create a crawler with a single API call. The API returns immediately with a crawler UUID:
    
    
     
     ```bash
    curl -X POST "https://api.scrapfly.io/crawl?key={{ YOUR_API_KEY }}" \
      -H 'Content-Type: application/json' \
      -d '{
        "url": "https://web-scraping.dev/products",
        "page_limit": 5
      }'
    ```
    
     
    
     ```python
    from scrapfly import ScrapflyClient, CrawlerConfig, Crawl
    
    client = ScrapflyClient(key="{{ YOUR_API_KEY }}")
    crawl = Crawl(client, CrawlerConfig(
        url="https://web-scraping.dev/products",
        page_limit=5,
    ))
    crawl.crawl()
    print("Crawler UUID:", crawl.uuid)
    ```
    
     
    
     ```javascript
    import { ScrapflyClient, CrawlerConfig, Crawl } from 'scrapfly-sdk';
    
    const client = new ScrapflyClient({ key: '{{ YOUR_API_KEY }}' });
    const crawl = new Crawl(client, new CrawlerConfig({
        url: 'https://web-scraping.dev/products',
        page_limit: 5,
    }));
    await crawl.start();
    console.log('Crawler UUID:', crawl.uuid);
    ```
    
     
    
     ```go
    client, err := scrapfly.New("{{ YOUR_API_KEY }}")
    if err != nil {
        log.Fatal(err)
    }
    crawl := scrapfly.NewCrawl(client, &scrapfly.CrawlerConfig{
        URL:       "https://web-scraping.dev/products",
        PageLimit: 5,
    })
    if err := crawl.Start(); err != nil {
        log.Fatal(err)
    }
    fmt.Println("Crawler UUID:", crawl.UUID())
    ```
    
     
    
     
    
    
    
    Response includes crawler UUID and status:
    
     ```
    {"uuid": "550e8400-e29b-41d4-a716-446655440000", "status": "PENDING"}
    ```
2. **Monitor Progress.** Poll the status endpoint to track crawl progress (for a raw-HTTP loop without the SDK, see the polling sketch after this workflow):
    
    
     
     ```bash
    curl "https://api.scrapfly.io/crawl/{crawler_uuid}/status?key={{ YOUR_API_KEY }}"
    ```
    
     
    
     ```python
    # `crawl.status()` is the SDK convenience that polls the same endpoint.
    # `crawl.wait()` blocks until a terminal state is reached.
    crawl.wait()
    status = crawl.status()
    print(f"{status.status}: {status.state.urls_visited}/{status.state.urls_extracted} pages")
    ```
    
     
    
     ```javascript
    // `crawl.status()` is the SDK convenience that polls the same endpoint.
    // `crawl.wait()` blocks until a terminal state is reached.
    await crawl.wait();
    const status = await crawl.status();
    console.log(`${status.status}: ${status.state.urls_visited}/${status.state.urls_extracted} pages`);
    ```
    
     
    
     ```go
    // `crawl.Status(refresh)` is the SDK convenience that polls the same endpoint.
    // `crawl.Wait(nil)` blocks until a terminal state is reached.
    if err := crawl.Wait(nil); err != nil {
        log.Fatal(err)
    }
    status, err := crawl.Status(true)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("%s: %d/%d pages\n", status.Status, status.State.URLsVisited, status.State.URLsExtracted)
    ```
    
     
    
     
    
    
    
    Status response shows real-time progress. Example for a crawl still `RUNNING`:
    
     ```
    {
      "crawler_uuid": "550e8400-e29b-41d4-a716-446655440000",
      "status": "RUNNING",
      "is_finished": false,
      "is_success": null,
      "state": {
        "urls_visited": 847,
        "urls_extracted": 1523,
        "urls_failed": 12,
        "urls_skipped": 34,
        "urls_to_crawl": 676,
        "api_credit_used": 8470,
        "duration": 145,
        "start_time": 1775593563,
        "stop_time": null,
        "stop_reason": null
      }
    }
    ```
    
     
    
       
    
     
    
     
    
     
    
    While a crawl is in `PENDING` state (accepted but no worker has picked it up yet), the time-related state fields are `null` and all counters are zero:
    
     ```
    {
      "crawler_uuid": "550e8400-e29b-41d4-a716-446655440000",
      "status": "PENDING",
      "is_finished": false,
      "is_success": null,
      "state": {
        "urls_visited": 0,
        "urls_extracted": 0,
        "urls_failed": 0,
        "urls_skipped": 0,
        "urls_to_crawl": 0,
        "api_credit_used": 0,
        "duration": 0,
        "start_time": null,
        "stop_time": null,
        "stop_reason": null
      }
    }
    ```
    
     
    
       
    
     
    
     
    
     
    
    #### Understanding the Status Response
    
| Field | Values | Description |
    |---|---|---|
    | `status` | `PENDING` / `RUNNING` / `DONE` / `CANCELLED` | Current crawler state, from queued through completed |
    | `is_finished` | `true` / `false` | Whether the crawler has stopped (regardless of success/failure) |
    | `is_success` | `true` (success) / `false` (failed) / `null` (still running) | Outcome of the crawl (only set when finished) |
    | `state.start_time` | `int` (Unix timestamp) / `null` | `null` while the crawler is in `PENDING` state. Set to the Unix timestamp when the first worker picks up the job. |
    | `state.stop_time` | `int` (Unix timestamp) / `null` | `null` while the crawler is still running. Set to the Unix timestamp when the crawler reaches a terminal state (`DONE` or `CANCELLED`). |
    | `state.stop_reason` | See table below / `null` | `null` while the crawler is still running. Set to one of the stop reasons below once the crawler stops. |
    
     **Stop Reasons:**
    
     | Stop Reason | Description |
    |---|---|
    | `no_more_urls` | All discovered URLs have been crawled - **normal completion** |
    | `page_limit` | Reached the configured `page_limit` |
    | `max_duration` | Exceeded the `max_duration` time limit |
    | `max_api_credit` | Reached the `max_api_credit` limit |
    | `seed_url_failed` | The starting URL failed to crawl - **no URLs visited** |
    | `user_cancelled` | User manually cancelled the crawl via API |
    | `crawler_error` | Internal crawler error occurred |
    | `no_api_credit_left` | Account ran out of API credits during crawl |
    | `storage_error` | An error occurred while saving the content |
3. **Retrieve Results.** Once the crawl stops, or after you cancel it yourself, fetch what was crawled:

    #### Cancel a Running Crawl
    
    You can stop a crawl at any time before it reaches a terminal state. Cancelled crawls move to the `CANCELLED` status with a `stop_reason` of `user_cancelled`; everything that was already crawled remains queryable via the contents and artifact endpoints.
    
    
     
     ```bash
    curl -X POST "https://api.scrapfly.io/crawl/{crawler_uuid}/cancel?key={{ YOUR_API_KEY }}"
    ```
    
     
    
     ```python
    crawl.cancel()
    # Pass allow_cancelled=True so wait() returns normally on the cancellation
    # we just triggered ourselves, instead of raising ScrapflyCrawlerError.
    crawl.wait(allow_cancelled=True)
    print("stop_reason:", crawl.status().state.stop_reason)
    ```
    
     
    
     ```javascript
    await crawl.cancel();
    // Pass allowCancelled so wait() returns normally on the cancellation
    // we just triggered ourselves, instead of throwing ScrapflyCrawlerError.
    await crawl.wait({ allowCancelled: true });
    const status = await crawl.status();
    console.log('stop_reason:', status.state.stop_reason);
    ```
    
     
    
     ```go
    if err := crawl.Cancel(); err != nil {
        log.Fatal(err)
    }
    // Pass AllowCancelled so Wait returns nil on the cancellation we just
    // triggered ourselves, instead of returning ErrCrawlerCancelled.
    if err := crawl.Wait(&scrapfly.WaitOptions{AllowCancelled: true}); err != nil {
        log.Fatal(err)
    }
    status, _ := crawl.Status(true)
    if status.State.StopReason != nil {
        fmt.Println("stop_reason:", *status.State.StopReason)
    }
    ```
    
     
    
     
    
    
    
      Calling cancel on a crawl that has already finished is a no-op and returns success. The endpoint is idempotent.
    
    #### Retrieve Results
    
    Once `is_finished: true`, download artifacts or query content:
    
    
     
     ```bash
    # Download WARC artifact (recommended for large crawls)
    curl "https://api.scrapfly.io/crawl/{crawler_uuid}/artifact?key={{ YOUR_API_KEY }}&type=warc" -o crawl.warc.gz
    
    # Query specific URL content
    curl "https://api.scrapfly.io/crawl/{crawler_uuid}/contents?key={{ YOUR_API_KEY }}&url=https://web-scraping.dev/products&format=markdown"
    
    # Or batch retrieve multiple URLs (max 100 per request)
    curl -X POST "https://api.scrapfly.io/crawl/{crawler_uuid}/contents/batch?key={{ YOUR_API_KEY }}&formats=markdown" \
      -H 'Content-Type: text/plain' \
      -d 'https://web-scraping.dev/products
    https://web-scraping.dev/product/1
    https://web-scraping.dev/product/2'
    ```
    
     
    
     ```python
    # Download WARC artifact and iterate response records
    warc = crawl.warc()
    for record in warc.iter_responses():
        print(record.status_code, record.url, len(record.content), "bytes")
    warc.save("crawl.warc.gz")
    
    # Read a single page's content (plain mode, no JSON envelope)
    content = crawl.read("https://web-scraping.dev/products", format="markdown")
    print(content.content[:200])
    
    # Batch retrieve multiple URLs in one call (max 100)
    batch = crawl.read_batch(
        urls=["https://web-scraping.dev/products", "https://web-scraping.dev/product/1"],
        formats=["markdown"],
    )
    for url, formats in batch.items():
        print(url, "->", len(formats["markdown"]), "chars")
    ```
    
     
    
     ```javascript
    // Download WARC artifact (raw bytes — TS SDK does not bundle a parser;
    // use `warcio` or similar from npm if you need to parse on the client).
    const warc = await crawl.warc();
    console.log(`WARC: ${warc.data.byteLength} bytes`);
    await warc.save('crawl.warc.gz');
    
    // Read a single page's content (plain mode, no JSON envelope)
    const md = await crawl.read('https://web-scraping.dev/products', 'markdown');
    console.log(md?.slice(0, 200));
    
    // Batch retrieve multiple URLs in one call (max 100)
    const batch = await crawl.readBatch(
        ['https://web-scraping.dev/products', 'https://web-scraping.dev/product/1'],
        ['markdown'],
    );
    for (const [url, formats] of Object.entries(batch)) {
        console.log(url, '->', formats.markdown.length, 'chars');
    }
    ```
    
     
    
     ```go
    // Download WARC artifact and parse with the bundled WarcParser
    warc, err := crawl.WARC()
    if err != nil {
        log.Fatal(err)
    }
    parser, _ := scrapfly.ParseWARC(warc.Data)
    parser.IterResponses(func(r *scrapfly.WarcRecord) bool {
        fmt.Printf("%d %s (%d bytes)\n", r.StatusCode, r.URL, len(r.Content))
        return true
    })
    _ = warc.Save("crawl.warc.gz")
    
    // Read a single page's content (plain mode)
    content, _ := crawl.Read("https://web-scraping.dev/products", scrapfly.CrawlerFormatMarkdown)
    if content != nil {
        fmt.Println(content.Content[:200])
    }
    
    // Batch retrieve multiple URLs in one call (max 100)
    batch, _ := crawl.ReadBatch(
        []string{"https://web-scraping.dev/products", "https://web-scraping.dev/product/1"},
        []scrapfly.CrawlerContentFormat{scrapfly.CrawlerFormatMarkdown},
    )
    for url, formats := range batch {
        fmt.Println(url, "->", len(formats[string(scrapfly.CrawlerFormatMarkdown)]), "chars")
    }
    ```
    
     
    
     
    
    
    
      For comprehensive retrieval options, see [Retrieving Crawler Results](https://scrapfly.io/docs/crawler-api/results).
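
If you are not using an SDK, the same loop can be written directly against the status endpoint. Here is a minimal polling sketch, assuming the `requests` package; the fixed 2-second interval is an illustrative choice (production code may prefer backoff and a timeout):

```python
# Minimal raw-HTTP polling sketch against the status endpoint shown above.
# Assumptions: `requests` is installed; the 2-second interval is illustrative.
import time
import requests

def wait_for_crawl(crawler_uuid: str, api_key: str, interval: float = 2.0) -> dict:
    url = f"https://api.scrapfly.io/crawl/{crawler_uuid}/status"
    while True:
        status = requests.get(url, params={"key": api_key}).json()
        if status["is_finished"]:  # true for both DONE and CANCELLED
            return status
        time.sleep(interval)

final = wait_for_crawl("550e8400-e29b-41d4-a716-446655440000", "YOUR_API_KEY")
print(final["status"], final["state"]["stop_reason"])
```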
 
 

### Real-Time Webhook Workflow

 Schedule a crawl with webhook configuration, receive instant HTTP callbacks as events occur, and process results in real-time. **Best for real-time data ingestion, streaming pipelines, and event-driven architectures.**

  

> **Webhook Setup Required** Before using webhooks, you must [configure a webhook](https://scrapfly.io/dashboard/webhook) in your dashboard with your endpoint URL and authentication. Then reference it by name in your API call.

1. **Schedule Crawl with Webhook.** Create a crawler and specify the webhook name configured in your dashboard:
    
    
     
     ```bash
    curl -X POST "https://api.scrapfly.io/crawl?key={{ YOUR_API_KEY }}" \
      -H 'Content-Type: application/json' \
      -d '{
        "url": "https://web-scraping.dev/products",
        "page_limit": 5,
        "webhook_name": "my-crawler-webhook",
        "webhook_events": [
          "crawler_started",
          "crawler_url_visited",
          "crawler_finished"
        ]
      }'
    ```
    
     
    
     ```python
    from scrapfly import ScrapflyClient, CrawlerConfig, Crawl
    
    client = ScrapflyClient(key="{{ YOUR_API_KEY }}")
    crawl = Crawl(client, CrawlerConfig(
        url="https://web-scraping.dev/products",
        page_limit=5,
        # Replace with the name of a webhook you registered in your dashboard
        # at https://scrapfly.io/dashboard/webhook
        webhook_name="my-crawler-webhook",
        webhook_events=[
            "crawler_started",
            "crawler_url_visited",
            "crawler_finished",
        ],
    ))
    crawl.crawl()
    print("Crawler UUID:", crawl.uuid)
    ```
    
     
    
     ```javascript
    import { ScrapflyClient, CrawlerConfig, Crawl } from 'scrapfly-sdk';
    
    const client = new ScrapflyClient({ key: '{{ YOUR_API_KEY }}' });
    const crawl = new Crawl(client, new CrawlerConfig({
        url: 'https://web-scraping.dev/products',
        page_limit: 5,
        // Replace with the name of a webhook you registered in your dashboard
        // at https://scrapfly.io/dashboard/webhook
        webhook_name: 'my-crawler-webhook',
        webhook_events: ['crawler_started', 'crawler_url_visited', 'crawler_finished'],
    }));
    await crawl.start();
    console.log('Crawler UUID:', crawl.uuid);
    ```
    
     
    
     ```go
    client, _ := scrapfly.New("{{ YOUR_API_KEY }}")
    crawl := scrapfly.NewCrawl(client, &scrapfly.CrawlerConfig{
        URL:         "https://web-scraping.dev/products",
        PageLimit:   5,
        WebhookName: "my-crawler-webhook",
        WebhookEvents: []scrapfly.CrawlerWebhookEvent{
            scrapfly.WebhookCrawlerStarted,
            scrapfly.WebhookCrawlerURLVisited,
            scrapfly.WebhookCrawlerFinished,
        },
    })
    if err := crawl.Start(); err != nil {
        log.Fatal(err)
    }
    fmt.Println("Crawler UUID:", crawl.UUID())
    ```
    
     
    
     
    
    
    
    Response includes crawler UUID:
    
     ```
    {"uuid": "550e8400-e29b-41d4-a716-446655440000", "status": "PENDING"}
    ```
2. **Receive Real-Time Webhooks.** Your endpoint receives HTTP POST callbacks as events occur during the crawl:
    
     ```
    {
      "event": "crawler_url_visited",
      "payload": {
        "crawler_uuid": "550e8400-e29b-41d4-a716-446655440000",
        "url": "https://web-scraping.dev/page",
        "status_code": 200,
        "depth": 1,
        "state": {
          "urls_visited": 42,
          "urls_to_crawl": 158,
          "api_credit_used": 420
        }
      }
    }
    ```
    
     
    
       
    
     
    
     
    
     
    
     **Webhook Headers:**
    
     | Header | Purpose |
    |---|---|
    | `X-Scrapfly-Crawl-Event-Name` | Event type (e.g., `crawler_url_visited`) for fast routing |
    | `X-Scrapfly-Webhook-Job-Id` | Crawler UUID for tracking |
    | `X-Scrapfly-Webhook-Signature` | HMAC-SHA256 signature for verification |
3. **Process Events in Real-Time.** Handle webhook callbacks to stream data to your database, trigger pipelines, or process results:
    
    
     
     ```python
    from flask import Flask, request
    from scrapfly import (
        webhook_from_payload,
        CrawlerLifecycleWebhook,
        CrawlerUrlVisitedWebhook,
        CrawlerWebhookEvent,
    )
    
    app = Flask(__name__)
    SIGNING_SECRETS = ("your-hex-secret",)  # rotate via the dashboard
    
    @app.post("/webhooks/crawler")
    def handle_crawler_webhook():
        wh = webhook_from_payload(
            request.json,
            signing_secrets=SIGNING_SECRETS,
            signature=request.headers.get("X-Scrapfly-Webhook-Signature"),
        )
    
        if isinstance(wh, CrawlerUrlVisitedWebhook):
            # Stream scraped content to database
            save_to_database(wh.url, wh.scrape.content)
        elif isinstance(wh, CrawlerLifecycleWebhook) and wh.event == CrawlerWebhookEvent.CRAWLER_FINISHED.value:
            trigger_data_pipeline(wh.crawler_uuid)
    
        return {"status": "ok"}
    ```
    
     
    
     ```javascript
    // The TypeScript SDK exposes `verifySignature()` for HMAC validation.
    // Typed dataclass parsing (like Python's `webhook_from_payload`) is on the
    // roadmap — for now, switch on the raw `event` field directly.
    import express from 'express';
    import { verifySignature } from 'scrapfly-sdk';
    
    const app = express();
    // IMPORTANT: capture raw body bytes for signature verification
    app.use(express.raw({ type: 'application/json' }));
    
    const SIGNING_SECRET = 'your-hex-secret'; // rotate via the dashboard
    
    app.post('/webhooks/crawler', (req, res) => {
        const sig = req.header('X-Scrapfly-Webhook-Signature') ?? '';
        if (!verifySignature(req.body as Buffer, sig, SIGNING_SECRET)) {
            return res.status(401).json({ error: 'invalid signature' });
        }
    
        const { event, payload } = JSON.parse((req.body as Buffer).toString());
    
        if (event === 'crawler_url_visited') {
            saveToDatabase(payload.url, payload.scrape.content);
        } else if (event === 'crawler_finished') {
            triggerDataPipeline(payload.crawler_uuid);
        }
    
        res.json({ status: 'ok' });
    });
    ```
    
     
    
     ```go
    // The Go SDK does not yet ship typed webhook dataclasses or a built-in
    // signature verifier — those are on the roadmap. For now, verify HMAC with
    // crypto/hmac and switch on the raw `event` field directly.
    package main
    
    import (
        "crypto/hmac"
        "crypto/sha256"
        "encoding/hex"
        "encoding/json"
        "io"
        "log"
        "net/http"
        "strings"
    )
    
    var signingSecretHex = "your-hex-secret" // rotate via the dashboard
    
    func verifyHMAC(body []byte, signatureHex string) bool {
        secret, err := hex.DecodeString(signingSecretHex)
        if err != nil {
            return false
        }
        mac := hmac.New(sha256.New, secret)
        mac.Write(body)
        expected := strings.ToUpper(hex.EncodeToString(mac.Sum(nil)))
        return hmac.Equal([]byte(expected), []byte(strings.ToUpper(signatureHex)))
    }
    
    type webhookEnvelope struct {
        Event   string         `json:"event"`
        Payload map[string]any `json:"payload"`
    }
    
    func handleCrawlerWebhook(w http.ResponseWriter, r *http.Request) {
        body, err := io.ReadAll(r.Body)
        if err != nil {
            http.Error(w, err.Error(), http.StatusBadRequest)
            return
        }
    
        if !verifyHMAC(body, r.Header.Get("X-Scrapfly-Webhook-Signature")) {
            http.Error(w, "invalid signature", http.StatusUnauthorized)
            return
        }
    
        var env webhookEnvelope
        if err := json.Unmarshal(body, &env); err != nil {
            http.Error(w, err.Error(), http.StatusBadRequest)
            return
        }
    
        switch env.Event {
        case "crawler_url_visited":
            url, _ := env.Payload["url"].(string)
            scrape, _ := env.Payload["scrape"].(map[string]any)
            content, _ := scrape["content"].(map[string]any)
            saveToDatabase(url, content)
        case "crawler_finished":
            uuid, _ := env.Payload["crawler_uuid"].(string)
            triggerDataPipeline(uuid)
        }
    
        w.WriteHeader(http.StatusOK)
        w.Write([]byte(`{"status":"ok"}`))
    }
    
    func main() {
        http.HandleFunc("/webhooks/crawler", handleCrawlerWebhook)
        log.Fatal(http.ListenAndServe(":8080", nil))
    }
    ```
 
  For detailed webhook documentation and all available events, see [Crawler Webhook Documentation](https://scrapfly.io/docs/crawler-api/webhook).

 

 

## Error Handling  

Crawler API uses standard HTTP response codes and provides detailed error information:

| Code | Description |
|---|---|
| `200` - OK | Request successful |
| `201` - Created | Crawler job created successfully |
| `400` - Bad Request | Invalid parameters or configuration |
| `401` - Unauthorized | Invalid or missing API key |
| `404` - Not Found | Crawler job not found |
| `429` - Too Many Requests | Rate limit or concurrency limit exceeded |
| `500` - Server Error | Internal server error |

See the [full error list](https://scrapfly.io/docs/crawler-api/errors) for more details.
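
As a sketch of how these codes map onto client-side handling when scheduling a crawl; the retry-on-`429` backoff and the `requests` usage are illustrative assumptions, not prescribed client behavior:

```python
# Minimal sketch: schedule a crawl and handle the documented status codes.
# The exponential backoff on 429 and the `requests` usage are illustrative.
import time
import requests

def schedule_crawl(api_key: str, config: dict, retries: int = 3) -> dict:
    for attempt in range(retries):
        resp = requests.post(
            "https://api.scrapfly.io/crawl",
            params={"key": api_key},
            json=config,
        )
        if resp.status_code in (200, 201):
            return resp.json()  # {"uuid": "...", "status": "PENDING"}
        if resp.status_code == 429:  # rate or concurrency limit: back off
            time.sleep(2 ** attempt)
            continue
        resp.raise_for_status()  # 400/401/404/500 are not retried here
    raise RuntimeError("still rate limited after retries")

job = schedule_crawl("YOUR_API_KEY", {"url": "https://web-scraping.dev/products", "page_limit": 5})
print(job["uuid"])
```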

 

 

 

 

---

## API Specification

POST `https://api.scrapfly.io/crawl`

### Authentication (Query Parameter)

**`key`** (required, [docs](https://scrapfly.io/docs/project))

API Key for authentication. Must be passed as a URL query parameter (`?key=...`), not in the request body.

Example: `scp-live-xxx...`

### Crawl Configuration

**`url`** (required)

Starting URL for the crawl. Must be a valid HTTP/HTTPS URL and [URL encoded](https://scrapfly.io/web-scraping-tools/urlencode).

Examples: `https://web-scraping.dev`, `https://example.com/blog`

**`page_limit`** (popular)

Maximum pages to crawl. Set to `0` for unlimited (subject to subscription limits).

Examples: `100`, `1000`, `0`

**`max_depth`** (popular)

Maximum link depth from the starting URL. Depth 0 is the starting URL, depth 1 is links from it.

Examples: `2`, `5`, `0`

**`exclude_paths`** (popular, default: `[]`)

Exclude URLs matching these path patterns. Supports wildcards (`*`). Max 100 paths.

Examples: `["/admin/*"]`, `["*/login"]`

**`include_only_paths`** (popular, default: `[]`)

Only crawl URLs matching these patterns. Supports wildcards (`*`). Max 100 paths. Mutually exclusive with `exclude_paths`.

Examples: `["/blog/*"]`, `["/products/*"]`

### Advanced Configuration

**`ignore_base_path_restriction`** (default: `false`)

By default, the crawler stays within the starting URL's base path. Enable to crawl any path on the same domain.

**`follow_external_links`** (default: `false`)

Allow the crawler to follow links to external domains. External pages are scraped but their links are not followed.

**`allowed_external_domains`** (default: `[]`)

Whitelist of external domains when `follow_external_links=true`. Max 250 domains. Supports wildcards (`*`).

Examples: `["*.web-scraping.dev"]`, `["cdn.site.com"]`

**`follow_internal_subdomains`** (default: `true`)

Allow the crawler to follow links to subdomains of the starting domain. `true` (default): all subdomains are followed; use `allowed_internal_subdomains` to restrict to specific ones. `false`: only the seed URL's exact hostname is crawled, no subdomains at all.

**`allowed_internal_subdomains`** (default: `[]`)

Restrict crawling to specific subdomains when `follow_internal_subdomains=true`. If empty (default), all subdomains are allowed. If set, only the seed hostname and the listed subdomains are crawled. Each pattern must be a subdomain of the seed URL's domain. Supports wildcard patterns (`*`). Max 250 entries. Ignored when `follow_internal_subdomains=false`.

Examples: `["blog.web-scraping.dev"]`, `["*.cdn.web-scraping.dev"]`

**`rendering_delay`**

Wait time in ms after page load before extraction. Set `0` to disable browser rendering. Range: 0-25000 ms.

Examples: `0`, `2000`, `5000`

**`max_concurrency`** (default: account limit)

Maximum concurrent scrape requests, limited by your account's concurrency limit. Set `0` for the account default.

Examples: `5`, `10`, `0`

**`headers`** (default: `{}`)

Custom HTTP headers as a JSON object. [Must be URL encoded](https://scrapfly.io/web-scraping-tools/urlencode).

Example: `{"Authorization": "Bearer token"}`

**`delay`** (default: `"0"`)

Delay between requests in milliseconds. Range: 0-15000 ms. Be polite to target servers.

Examples: `"1000"`, `"5000"`

**`user_agent`** (default: `null`)

Custom User-Agent string. Ignored when `asp=true` (ASP manages the User-Agent automatically).

Example: `MyBot/1.0 (+https://example.com/bot)`

**`use_sitemaps`** (default: `false`)

Use sitemap.xml for URL discovery if available.

**`respect_robots_txt`** (default: `true`)

Respect robots.txt rules and `Disallow` directives.

### Caching Options

**`cache`** (popular, default: `false`, [docs](https://scrapfly.io/docs/scrape-api/cache))

Enable the cache layer for crawled pages. Cached versions are used instead of re-crawling.

**`cache_ttl`** (default: `86400`, [docs](https://scrapfly.io/docs/scrape-api/cache))

Cache time-to-live in seconds. Range: 0-604800 (max 7 days). Only applies when `cache=true`.

Examples: `3600`, `86400`, `604800`

**`cache_clear`** (default: `false`, [docs](https://scrapfly.io/docs/scrape-api/cache))

Force refresh of cached pages. All pages will be re-crawled even if a valid cache exists.

**`ignore_no_follow`** (default: `false`)

Ignore `rel="nofollow"` attributes on links and crawl all links regardless of nofollow.

### Content & Extraction

**`content_formats`** (popular, default: `["html"]`)

Content formats to extract: `html`, `clean_html`, `markdown`, `text`, `json`, `extracted_data`, `page_metadata`.

Examples: `["markdown"]`, `["html", "text"]`

Available formats:

- `html` - Original HTML content
- `clean_html` - Cleaned and sanitized HTML
- `markdown` - LLM-ready markdown format
- `text` - Plain text extraction
- `json` - Parse as JSON (for API responses)
- `extracted_data` - Structured data from extraction rules
- `page_metadata` - Page title, description, etc.

Multiple formats can be extracted simultaneously from each page. Markdown format is ideal for LLM processing.

**`max_duration`** (default: `900`)

Maximum crawl duration in seconds. Range: 15-10800 (15 seconds to 3 hours). The crawler stops when this is reached.

Examples: `900`, `3600`, `10800`

**`max_api_credit`**

Maximum API credits to spend. Set `0` for no limit. Useful for controlling costs.

Examples: `1000`, `5000`, `0`

**`extraction_rules`** (default: `null`, [docs](https://scrapfly.io/docs/crawler-api/extraction-rules))

Extraction rules for structured data. Types: `prompt`, `model`, `template`. Max 100 rules.

Example: `{"/products/*": {"type": "prompt", "value": "..."}}`

Rule types:

- `prompt` - AI prompt instruction (max 10,000 chars)
- `model` - Pre-defined model (product, article, etc.)
- `template` - Custom extraction template

Pattern matching: rules use URL path patterns with wildcards, e.g. `/products/*` matches all product pages and `/blog/*` matches all blog posts. Maximum 100 extraction rules per crawl; rules only apply to pages matching the pattern.

### Webhooks

**`webhook_name`** (popular, default: `null`)

Name of a webhook configured in the [dashboard](https://scrapfly.io/dashboard/webhook). Not a URL; the webhook is referenced by name.

Example: `my-crawler-webhook`

**`webhook_events`** (default: basic events)

Events to subscribe to:

- `crawler_started` - Crawl job has begun
- `crawler_url_visited` - Page successfully scraped
- `crawler_url_skipped` - Page skipped (cached, excluded, etc.)
- `crawler_url_discovered` - New URL found
- `crawler_url_failed` - Page scrape failed
- `crawler_stopped` - Crawl stopped (limits reached)
- `crawler_cancelled` - Crawl manually cancelled
- `crawler_finished` - Crawl completed

If `webhook_name` is set without specifying events, basic events are used by default.

Examples: `["crawler_finished"]`, `["crawler_url_visited"]`

### Proxy & Protection

**`proxy_pool`** (popular, default: `public_datacenter_pool`)

Select the proxy pool. See the [proxy dashboard](https://scrapfly.io/dashboard/proxy) for available pools and pricing.

Examples: `public_datacenter_pool`, `public_residential_pool`

**`country`** (popular, default: random, [docs](https://scrapfly.io/docs/scrape-api/proxy))

Proxy country (ISO 3166-1 alpha-2). Supports exclusions (`-gb`) and weighted distribution (`us:10,gb:5`).

Examples: `us`, `us,ca,mx`, `-gb`

**`asp`** (popular, default: `false`)

[Anti Scraping Protection](https://scrapfly.io/docs/scrape-api/anti-scraping-protection): enable advanced anti-bot bypass with browser rendering and fingerprinting. Ignores custom `user_agent`.
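
To tie the reference together, here is an illustrative request body combining several of the parameters above (values are examples, not recommendations), sent with the `requests` package:

```python
# Illustrative crawl configuration combining parameters documented above.
# Parameter values are examples only; pick what fits your crawl.
import requests

config = {
    "url": "https://web-scraping.dev/products",
    "page_limit": 100,
    "max_depth": 3,
    "exclude_paths": ["/admin/*"],
    "content_formats": ["markdown", "page_metadata"],
    "max_duration": 3600,
    "max_api_credit": 5000,
    "cache": True,
    "cache_ttl": 3600,
    "asp": True,
    "proxy_pool": "public_residential_pool",
    "country": "us,ca",
    "webhook_name": "my-crawler-webhook",
}
resp = requests.post(
    "https://api.scrapfly.io/crawl",
    params={"key": "YOUR_API_KEY"},
    json=config,
)
print(resp.json())  # {"uuid": "...", "status": "PENDING"}
```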

 

 

 

 

## Get Crawler Status  

 Retrieve the current status and progress of a crawler job. Use this endpoint to poll for updates while the crawler is running.

 GET `https://api.scrapfly.io/crawl/{crawler_uuid}/status`

 
 ```bash
curl "https://api.scrapfly.io/crawl/{crawler_uuid}/status?key={{ YOUR_API_KEY }}"
```

 

 ```python
from scrapfly import ScrapflyClient
client = ScrapflyClient(key="{{ YOUR_API_KEY }}")
status = client.get_crawl_status("{crawler_uuid}")
print(status.status, status.state.urls_visited, "/", status.state.urls_extracted)
```

 

 ```javascript
import { ScrapflyClient } from 'scrapfly-sdk';
const client = new ScrapflyClient({ key: '{{ YOUR_API_KEY }}' });
const status = await client.crawlStatus('{crawler_uuid}');
console.log(status.status, status.state.urls_visited, '/', status.state.urls_extracted);
```

 

 ```go
client, _ := scrapfly.New("{{ YOUR_API_KEY }}")
status, err := client.CrawlStatus("{crawler_uuid}")
if err != nil {
    log.Fatal(err)
}
fmt.Println(status.Status, status.State.URLsVisited, "/", status.State.URLsExtracted)
```

 

 



 **Response includes:**

- `status` - Current status: `PENDING`, `RUNNING`, `DONE`, or `CANCELLED`. Failure is signalled by `status=DONE` with `is_success=false`.
- `state.urls_extracted` - Total URLs discovered (seed + links + sitemaps)
- `state.urls_visited` - URLs successfully crawled
- `state.urls_to_crawl` - URLs waiting to be crawled (derived: `urls_extracted - urls_skipped`)
- `state.urls_failed` - URLs that failed to crawl
- `state.urls_skipped` - URLs skipped by filters (exclude rules, robots.txt, etc.)
- `state.api_credit_used` - Total API credits consumed
 
## Get Crawled URLs  

 Retrieve a list of all URLs discovered and crawled during the job, with metadata about each URL.

 GET `https://api.scrapfly.io/crawl/{crawler_uuid}/urls`

 
 ```bash
# Get all visited URLs
curl "https://api.scrapfly.io/crawl/{crawler_uuid}/urls?key={{ YOUR_API_KEY }}&status=visited"

# Get failed URLs with pagination
curl "https://api.scrapfly.io/crawl/{crawler_uuid}/urls?key={{ YOUR_API_KEY }}&status=failed&page=1&per_page=100"
```

 

 ```python
from scrapfly import ScrapflyClient
client = ScrapflyClient(key="{{ YOUR_API_KEY }}")

# Stream visited URLs (`status` defaults to 'visited' on the server)
visited = client.get_crawl_urls("{crawler_uuid}", status="visited", page=1, per_page=100)
for entry in visited:
    print(entry.url)

# Failed URLs include the reason on each entry
for entry in client.get_crawl_urls("{crawler_uuid}", status="failed"):
    print(entry.url, "->", entry.reason)
```

 

 ```javascript
import { ScrapflyClient } from 'scrapfly-sdk';
const client = new ScrapflyClient({ key: '{{ YOUR_API_KEY }}' });

// Stream visited URLs
const visited = await client.crawlUrls('{crawler_uuid}', { status: 'visited', page: 1, per_page: 100 });
for (const entry of visited.urls) {
    console.log(entry.url);
}

// Failed URLs include the reason on each entry
const failed = await client.crawlUrls('{crawler_uuid}', { status: 'failed' });
for (const entry of failed.urls) {
    console.log(entry.url, '->', entry.reason);
}
```

 

 ```go
client, _ := scrapfly.New("{{ YOUR_API_KEY }}")

// Stream visited URLs
visited, err := client.CrawlURLs("{crawler_uuid}", &scrapfly.CrawlURLsOptions{
    Status: "visited", Page: 1, PerPage: 100,
})
if err != nil {
    log.Fatal(err)
}
for _, entry := range visited.URLs {
    fmt.Println(entry.URL)
}

// Failed URLs include the reason on each entry
failed, _ := client.CrawlURLs("{crawler_uuid}", &scrapfly.CrawlURLsOptions{Status: "failed"})
for _, entry := range failed.URLs {
    fmt.Println(entry.URL, "->", entry.Reason)
}
```

 

 



 **Query Parameters:**

- `key` - Your API key (required)
- `status` - Filter by URL status: `visited`, `pending`, `failed`
- `page` - Page number for pagination (default: 1)
- `per_page` - Results per page (default: 100, max: 1000)
 
## Get Content  

 Retrieve extracted content from crawled pages in the format(s) specified in your crawl configuration.

### Single URL or All Pages (GET)

 GET `https://api.scrapfly.io/crawl/{crawler_uuid}/contents`

 
 ```bash
# Get all content in markdown format
curl "https://api.scrapfly.io/crawl/{crawler_uuid}/contents?key={{ YOUR_API_KEY }}&format=markdown"

# Get content for a specific URL (plain mode — returns raw text, no JSON envelope)
curl "https://api.scrapfly.io/crawl/{crawler_uuid}/contents?key={{ YOUR_API_KEY }}&format=html&url=https://web-scraping.dev/products&plain=true"
```

 

 ```python
from scrapfly import ScrapflyClient
client = ScrapflyClient(key="{{ YOUR_API_KEY }}")

# All pages bundled in a single JSON response
contents = client.get_crawl_contents("{crawler_uuid}", format="markdown")
for url, payload in contents.contents.items():
    print(url, "->", len(payload["markdown"]), "chars")

# Single URL in plain mode (returns raw bytes/string, no JSON envelope)
md = client.get_crawl_contents("{crawler_uuid}", format="markdown",
                               url="https://web-scraping.dev/products", plain=True)
print(md[:200])
```

 

 ```javascript
import { ScrapflyClient } from 'scrapfly-sdk';
const client = new ScrapflyClient({ key: '{{ YOUR_API_KEY }}' });

// All pages bundled in a single JSON response
const contents = await client.crawlContents('{crawler_uuid}', { format: 'markdown' });
console.log(contents);

// Single URL in plain mode (returns raw string, no JSON envelope)
const md = await client.crawlContents('{crawler_uuid}', {
    format: 'markdown',
    url: 'https://web-scraping.dev/products',
    plain: true,
});
console.log(typeof md === 'string' ? md.slice(0, 200) : md);
```

 

 ```go
client, _ := scrapfly.New("{{ YOUR_API_KEY }}")

// All pages bundled in a single JSON response
contents, err := client.CrawlContentsJSON("{crawler_uuid}", scrapfly.CrawlerFormatMarkdown, nil)
if err != nil {
    log.Fatal(err)
}
for url, payload := range contents.Contents {
    fmt.Println(url, "->", len(payload[string(scrapfly.CrawlerFormatMarkdown)]), "chars")
}

// Single URL in plain mode (returns raw string, no JSON envelope)
md, _ := client.CrawlContentsPlain("{crawler_uuid}", "https://web-scraping.dev/products", scrapfly.CrawlerFormatMarkdown)
fmt.Println(md[:200])
```

 

 



**Query Parameters:**

- `key` - Your API key (required)
- `format` - Content format to retrieve (must be one of the formats specified in crawl config)
- `url` - Optional: retrieve content for a specific URL only
- `plain` - Optional: with `url`, return raw content directly instead of the JSON envelope (as shown in the examples above)
 
### Batch Content Retrieval (POST)

 POST `https://api.scrapfly.io/crawl/{crawler_uuid}/contents/batch` 

 Retrieve content for multiple specific URLs in a single request. More efficient than making individual GET requests for each URL. **Maximum 100 URLs per request.**

 
 ```bash
# Batch retrieve content for up to 100 URLs in one round-trip
curl -X POST "https://api.scrapfly.io/crawl/{crawler_uuid}/contents/batch?key={{ YOUR_API_KEY }}&formats=markdown,text" \
  -H "Content-Type: text/plain" \
  -d "https://web-scraping.dev/products
https://web-scraping.dev/product/1
https://web-scraping.dev/product/2"
```

 

 ```python
from scrapfly import ScrapflyClient
client = ScrapflyClient(key="{{ YOUR_API_KEY }}")

# Returns {url: {format: content_string}} — max 100 URLs per call
batch = client.get_crawl_contents_batch(
    "{crawler_uuid}",
    urls=[
        "https://web-scraping.dev/products",
        "https://web-scraping.dev/product/1",
        "https://web-scraping.dev/product/2",
    ],
    formats=["markdown", "text"],
)
for url, formats in batch.items():
    print(url, "->", len(formats["markdown"]), "chars")
```

 

 ```javascript
import { ScrapflyClient } from 'scrapfly-sdk';
const client = new ScrapflyClient({ key: '{{ YOUR_API_KEY }}' });

// Returns {url: {format: content_string}} — max 100 URLs per call
const batch = await client.crawlContentsBatch(
    '{crawler_uuid}',
    [
        'https://web-scraping.dev/products',
        'https://web-scraping.dev/product/1',
        'https://web-scraping.dev/product/2',
    ],
    ['markdown', 'text'],
);
for (const [url, formats] of Object.entries(batch)) {
    console.log(url, '->', formats.markdown.length, 'chars');
}
```

 

 ```go
client, _ := scrapfly.New("{{ YOUR_API_KEY }}")

// Returns map[url]map[format]content_string — max 100 URLs per call
batch, err := client.CrawlContentsBatch(
    "{crawler_uuid}",
    []string{
        "https://web-scraping.dev/products",
        "https://web-scraping.dev/product/1",
        "https://web-scraping.dev/product/2",
    },
    []scrapfly.CrawlerContentFormat{scrapfly.CrawlerFormatMarkdown, scrapfly.CrawlerFormatText},
)
if err != nil {
    log.Fatal(err)
}
for url, formats := range batch {
    fmt.Println(url, "->", len(formats[string(scrapfly.CrawlerFormatMarkdown)]), "chars")
}
```

 

 



 **Query Parameters:**

- `key` - Your API key (required)
- `formats` - Comma-separated list of formats (e.g., `markdown,text,html`)
 
 **Request Body:**

- `Content-Type: text/plain` - Plain text with URLs separated by newlines
- **Maximum 100 URLs per request**
 
 **Response Format:**

- `Content-Type: multipart/related` - Standard HTTP multipart format (RFC 2387)
- `X-Scrapfly-Requested-URLs` header - Number of URLs in the request
- `X-Scrapfly-Found-URLs` header - Number of URLs found in the crawl results
- Each part contains `Content-Type` and `Content-Location` headers identifying the format and URL
 
> **Efficient Streaming Format** The multipart format eliminates JSON escaping overhead, providing **~50% bandwidth savings** for text content and constant memory usage during streaming. See the [Results documentation](https://scrapfly.io/docs/crawler-api/results#query-content) for parsing examples in Python, JavaScript, and Go.
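
As a complement to the linked parsing examples, here is a minimal sketch that calls the batch endpoint and splits the multipart/related response using Python's standard-library `email` parser. The parsing approach is an illustrative assumption; the endpoint, query parameters, and response headers are as documented above:

```python
# Minimal sketch: batch-retrieve content and parse the multipart/related
# response with the stdlib email parser (the approach is illustrative).
import requests
from email import message_from_bytes
from email.policy import default

resp = requests.post(
    "https://api.scrapfly.io/crawl/{crawler_uuid}/contents/batch",
    params={"key": "YOUR_API_KEY", "formats": "markdown,text"},
    headers={"Content-Type": "text/plain"},
    data="https://web-scraping.dev/products\nhttps://web-scraping.dev/product/1",
)
print(resp.headers.get("X-Scrapfly-Found-URLs"), "of",
      resp.headers.get("X-Scrapfly-Requested-URLs"), "URLs found")

# Re-attach the Content-Type header (it carries the multipart boundary) so
# the email parser can split the body into its parts.
raw = b"Content-Type: " + resp.headers["Content-Type"].encode() + b"\r\n\r\n" + resp.content
msg = message_from_bytes(raw, policy=default)

for part in msg.iter_parts():
    url = part["Content-Location"]   # which crawled URL this part belongs to
    fmt = part.get_content_type()    # format of this part, e.g. text/markdown
    body = part.get_content()
    print(url, fmt, len(body), "chars")
```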

## Download Artifact  

 Download industry-standard archive files containing all crawled data, including HTTP requests, responses, headers, and extracted content. Perfect for storing bulk crawl results offline or in object storage (S3, Google Cloud Storage).

 GET `https://api.scrapfly.io/crawl/{crawler_uuid}/artifact`

 
 ```bash
# Download WARC artifact (gzip compressed, recommended for large crawls)
curl "https://api.scrapfly.io/crawl/{crawler_uuid}/artifact?key={{ YOUR_API_KEY }}&type=warc" -o crawl.warc.gz

# Download HAR artifact (JSON format)
curl "https://api.scrapfly.io/crawl/{crawler_uuid}/artifact?key={{ YOUR_API_KEY }}&type=har" -o crawl.har
```

 

 ```python
from scrapfly import ScrapflyClient
client = ScrapflyClient(key="{{ YOUR_API_KEY }}")

# WARC archive — built-in WarcParser iterates response records
warc = client.get_crawl_artifact("{crawler_uuid}", artifact_type="warc")
for record in warc.iter_responses():
    print(record.status_code, record.url, len(record.content), "bytes")
warc.save("crawl.warc.gz")

# HAR archive — built-in HarArchive parser with high-level filters
har = client.get_crawl_artifact("{crawler_uuid}", artifact_type="har")
for entry in har.parser.filter_by_status(200):
    print(entry.method, entry.url, entry.content_type)
```

 

 ```javascript
import { ScrapflyClient } from 'scrapfly-sdk';
const client = new ScrapflyClient({ key: '{{ YOUR_API_KEY }}' });

// WARC archive — raw bytes + save() helper. The TypeScript SDK does not
// bundle a WARC parser; compose with `warcio` from npm if you need parsing.
const warc = await client.crawlArtifact('{crawler_uuid}', 'warc');
console.log(`WARC: ${warc.data.byteLength} bytes`);
await warc.save('crawl.warc.gz');

// HAR archive — same shape, just a different artifact type.
const har = await client.crawlArtifact('{crawler_uuid}', 'har');
console.log(`HAR: ${har.data.byteLength} bytes`);
await har.save('crawl.har');
```

 

 ```go
client, _ := scrapfly.New("{{ YOUR_API_KEY }}")

// WARC archive — Go SDK ships first-party WarcParser
warc, err := client.CrawlArtifact("{crawler_uuid}", scrapfly.ArtifactTypeWARC)
if err != nil {
    log.Fatal(err)
}
parser, _ := scrapfly.ParseWARC(warc.Data)
parser.IterResponses(func(r *scrapfly.WarcRecord) bool {
    fmt.Printf("%d %s (%d bytes)\n", r.StatusCode, r.URL, len(r.Content))
    return true
})
_ = warc.Save("crawl.warc.gz")

// HAR archive — Go SDK also ships HarArchive with high-level filters
har, _ := client.CrawlArtifact("{crawler_uuid}", scrapfly.ArtifactTypeHAR)
archive, _ := scrapfly.ParseHAR(har.Data)
for _, entry := range archive.FilterByStatus(200) {
    fmt.Println(entry.Method(), entry.URL(), entry.ContentType())
}
```

 

 



 **Query Parameters:**

- `key` - Your API key (required)
- `type` - Artifact type: 
    - `warc` - Web ARChive format (gzip compressed, industry standard)
    - `har` - HTTP Archive format (JSON, browser-compatible)
 
## Billing  

 Crawler API billing is simple: **the cost equals the sum of all Web Scraping API calls** made during the crawl. Each page crawled consumes credits based on enabled features (browser rendering, anti-scraping protection, proxy type, etc.).

 For detailed billing information, see [Crawler API Billing](https://scrapfly.io/docs/crawler-api/billing).