# Scrapfly Documentation

## Table of Contents

### Dashboard

- [Intro](https://scrapfly.io/docs)
- [Project](https://scrapfly.io/docs/project)
- [Account](https://scrapfly.io/docs/account)
- [Workspace & Team](https://scrapfly.io/docs/workspace-and-team)
- [Billing](https://scrapfly.io/docs/billing)

### Products

#### MCP Server

- [Getting Started](https://scrapfly.io/docs/mcp/getting-started)
- [Tools & API Spec](https://scrapfly.io/docs/mcp/tools)
- [Authentication](https://scrapfly.io/docs/mcp/authentication)
- [Examples & Use Cases](https://scrapfly.io/docs/mcp/examples)
- [FAQ](https://scrapfly.io/docs/mcp/faq)
##### Integrations

- [Overview](https://scrapfly.io/docs/mcp/integrations)
- [Claude Desktop](https://scrapfly.io/docs/mcp/integrations/claude-desktop)
- [Claude Code](https://scrapfly.io/docs/mcp/integrations/claude-code)
- [ChatGPT](https://scrapfly.io/docs/mcp/integrations/chatgpt)
- [Cursor](https://scrapfly.io/docs/mcp/integrations/cursor)
- [Cline](https://scrapfly.io/docs/mcp/integrations/cline)
- [Windsurf](https://scrapfly.io/docs/mcp/integrations/windsurf)
- [Zed](https://scrapfly.io/docs/mcp/integrations/zed)
- [Roo Code](https://scrapfly.io/docs/mcp/integrations/roo-code)
- [VS Code](https://scrapfly.io/docs/mcp/integrations/vscode)
- [LangChain](https://scrapfly.io/docs/mcp/integrations/langchain)
- [LlamaIndex](https://scrapfly.io/docs/mcp/integrations/llamaindex)
- [CrewAI](https://scrapfly.io/docs/mcp/integrations/crewai)
- [OpenAI](https://scrapfly.io/docs/mcp/integrations/openai)
- [n8n](https://scrapfly.io/docs/mcp/integrations/n8n)
- [Make](https://scrapfly.io/docs/mcp/integrations/make)
- [Zapier](https://scrapfly.io/docs/mcp/integrations/zapier)
- [Vapi AI](https://scrapfly.io/docs/mcp/integrations/vapi)
- [Agent Builder](https://scrapfly.io/docs/mcp/integrations/agent-builder)
- [Custom Client](https://scrapfly.io/docs/mcp/integrations/custom-client)


#### Web Scraping API

- [Getting Started](https://scrapfly.io/docs/scrape-api/getting-started)
- [API Specification]()
- [Monitoring](https://scrapfly.io/docs/monitoring)
- [Customize Request](https://scrapfly.io/docs/scrape-api/custom)
- [Debug](https://scrapfly.io/docs/scrape-api/debug)
- [Anti Scraping Protection](https://scrapfly.io/docs/scrape-api/anti-scraping-protection)
- [Proxy](https://scrapfly.io/docs/scrape-api/proxy)
- [Proxy Mode](https://scrapfly.io/docs/scrape-api/proxy-mode)
- [Proxy Mode - Screaming Frog](https://scrapfly.io/docs/scrape-api/proxy-mode/screaming-frog)
- [Proxy Mode - Apify](https://scrapfly.io/docs/scrape-api/proxy-mode/apify)
- [(Auto) Data Extraction](https://scrapfly.io/docs/scrape-api/extraction)
- [Javascript Rendering](https://scrapfly.io/docs/scrape-api/javascript-rendering)
- [Javascript Scenario](https://scrapfly.io/docs/scrape-api/javascript-scenario)
- [SSL](https://scrapfly.io/docs/scrape-api/ssl)
- [DNS](https://scrapfly.io/docs/scrape-api/dns)
- [Cache](https://scrapfly.io/docs/scrape-api/cache)
- [Batch (Multi-URL Scraping)](https://scrapfly.io/docs/scrape-api/batch)
- [Session](https://scrapfly.io/docs/scrape-api/session)
- [Webhook](https://scrapfly.io/docs/scrape-api/webhook)
- [Schedule](https://scrapfly.io/docs/scrape-api/schedule)
- [Screenshot](https://scrapfly.io/docs/scrape-api/screenshot)
- [Errors](https://scrapfly.io/docs/scrape-api/errors)
- [Timeout](https://scrapfly.io/docs/scrape-api/understand-timeout)
- [Throttling](https://scrapfly.io/docs/throttling)
- [Troubleshoot](https://scrapfly.io/docs/scrape-api/troubleshoot)
- [Billing](https://scrapfly.io/docs/scrape-api/billing)
- [FAQ](https://scrapfly.io/docs/scrape-api/faq)

#### Crawler API

- [Getting Started](https://scrapfly.io/docs/crawler-api/getting-started)
- [API Specification]()
- [Retrieving Results](https://scrapfly.io/docs/crawler-api/results)
- [WARC Format](https://scrapfly.io/docs/crawler-api/warc-format)
- [Data Extraction](https://scrapfly.io/docs/crawler-api/extraction-rules)
- [Webhook](https://scrapfly.io/docs/crawler-api/webhook)
- [Schedule](https://scrapfly.io/docs/crawler-api/schedule)
- [Billing](https://scrapfly.io/docs/crawler-api/billing)
- [Errors](https://scrapfly.io/docs/crawler-api/errors)
- [Troubleshoot](https://scrapfly.io/docs/crawler-api/troubleshoot)
- [FAQ](https://scrapfly.io/docs/crawler-api/faq)

#### Screenshot API

- [Getting Started](https://scrapfly.io/docs/screenshot-api/getting-started)
- [API Specification]()
- [Accessibility Testing](https://scrapfly.io/docs/screenshot-api/accessibility)
- [Webhook](https://scrapfly.io/docs/screenshot-api/webhook)
- [Schedule](https://scrapfly.io/docs/screenshot-api/schedule)
- [Billing](https://scrapfly.io/docs/screenshot-api/billing)
- [Errors](https://scrapfly.io/docs/screenshot-api/errors)

#### Extraction API

- [Getting Started](https://scrapfly.io/docs/extraction-api/getting-started)
- [API Specification]()
- [Rules Template](https://scrapfly.io/docs/extraction-api/rules-and-template)
- [LLM Extraction](https://scrapfly.io/docs/extraction-api/llm-prompt)
- [AI Auto Extraction](https://scrapfly.io/docs/extraction-api/automatic-ai)
- [Webhook](https://scrapfly.io/docs/extraction-api/webhook)
- [Billing](https://scrapfly.io/docs/extraction-api/billing)
- [Errors](https://scrapfly.io/docs/extraction-api/errors)
- [FAQ](https://scrapfly.io/docs/extraction-api/faq)

#### Proxy Saver

- [Getting Started](https://scrapfly.io/docs/proxy-saver/getting-started)
- [Fingerprints](https://scrapfly.io/docs/proxy-saver/fingerprints)
- [Optimizations](https://scrapfly.io/docs/proxy-saver/optimizations)
- [SSL Certificates](https://scrapfly.io/docs/proxy-saver/certificates)
- [Protocols](https://scrapfly.io/docs/proxy-saver/protocols)
- [Pacfile](https://scrapfly.io/docs/proxy-saver/pacfile)
- [Secure Credentials](https://scrapfly.io/docs/proxy-saver/security)
- [Billing](https://scrapfly.io/docs/proxy-saver/billing)

#### Cloud Browser API

- [Getting Started](https://scrapfly.io/docs/cloud-browser-api/getting-started)
- [Proxy & Geo-Targeting](https://scrapfly.io/docs/cloud-browser-api/proxy)
- [Unblock API](https://scrapfly.io/docs/cloud-browser-api/unblock)
- [Captcha Solver](https://scrapfly.io/docs/cloud-browser-api/captcha-solver)
- [File Downloads](https://scrapfly.io/docs/cloud-browser-api/file-downloads)
- [Session Resume](https://scrapfly.io/docs/cloud-browser-api/session-resume)
- [Human-in-the-Loop](https://scrapfly.io/docs/cloud-browser-api/human-in-the-loop)
- [Debug Mode](https://scrapfly.io/docs/cloud-browser-api/debug-mode)
- [Bring Your Own Proxy](https://scrapfly.io/docs/cloud-browser-api/bring-your-own-proxy)
- [Browser Extensions](https://scrapfly.io/docs/cloud-browser-api/extensions)
- [Native Browser MCP](https://scrapfly.io/docs/cloud-browser-api/mcp)
- [DevTools Protocol](https://scrapfly.io/docs/cloud-browser-api/cdp-reference)
##### Integrations

- [Puppeteer](https://scrapfly.io/docs/cloud-browser-api/puppeteer)
- [Playwright](https://scrapfly.io/docs/cloud-browser-api/playwright)
- [Selenium](https://scrapfly.io/docs/cloud-browser-api/selenium)
- [Vercel Agent Browser](https://scrapfly.io/docs/cloud-browser-api/agent-browser)
- [Browser Use](https://scrapfly.io/docs/cloud-browser-api/browser-use)
- [Stagehand](https://scrapfly.io/docs/cloud-browser-api/stagehand)

- [Billing](https://scrapfly.io/docs/cloud-browser-api/billing)
- [Errors](https://scrapfly.io/docs/cloud-browser-api/errors)


### Tools

- [Antibot Detector](https://scrapfly.io/docs/tools/antibot-detector)

### SDK

- [Golang](https://scrapfly.io/docs/sdk/golang)
- [Python](https://scrapfly.io/docs/sdk/python)
- [Rust](https://scrapfly.io/docs/sdk/rust)
- [TypeScript](https://scrapfly.io/docs/sdk/typescript)
- [Scrapy](https://scrapfly.io/docs/sdk/scrapy)

### Integrations

- [Getting Started](https://scrapfly.io/docs/integration/getting-started)
- [LangChain](https://scrapfly.io/docs/integration/langchain)
- [LlamaIndex](https://scrapfly.io/docs/integration/llamaindex)
- [CrewAI](https://scrapfly.io/docs/integration/crewai)
- [Zapier](https://scrapfly.io/docs/integration/zapier)
- [Make](https://scrapfly.io/docs/integration/make)
- [n8n](https://scrapfly.io/docs/integration/n8n)

### Academy

- [Overview](https://scrapfly.io/academy)
- [Web Scraping Overview](https://scrapfly.io/academy/scraping-overview)
- [Tools](https://scrapfly.io/academy/tools-overview)
- [Reverse Engineering](https://scrapfly.io/academy/reverse-engineering)
- [Static Scraping](https://scrapfly.io/academy/static-scraping)
- [HTML Parsing](https://scrapfly.io/academy/html-parsing)
- [Dynamic Scraping](https://scrapfly.io/academy/dynamic-scraping)
- [Hidden API Scraping](https://scrapfly.io/academy/hidden-api-scraping)
- [Headless Browsers](https://scrapfly.io/academy/headless-browsers)
- [Hidden Web Data](https://scrapfly.io/academy/hidden-web-data)
- [JSON Parsing](https://scrapfly.io/academy/json-parsing)
- [Data Processing](https://scrapfly.io/academy/data-processing)
- [Scaling](https://scrapfly.io/academy/scaling)
- [Walkthrough Summary](https://scrapfly.io/academy/walkthrough-summary)
- [Scraper Blocking](https://scrapfly.io/academy/scraper-blocking)
- [Proxies](https://scrapfly.io/academy/proxies)

---

# URL sources

 A crawl always starts from one URL source. The Crawler API supports three, and exactly one must be provided in the request body. Sending zero or more than one source returns [`ERR::CRAWLER::CONFIG_ERROR`](https://scrapfly.io/docs/crawler-api/error/ERR::CRAWLER::CONFIG_ERROR).
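
As a quick illustration of that rule (a throwaway helper, not part of any Scrapfly SDK), a request body is accepted only when exactly one of the three source fields is present:

```
def count_sources(body: dict) -> int:
    """Count which of the three mutually exclusive source fields are set."""
    return sum(key in body for key in ("url", "url_list", "remote_url_list"))

count_sources({"url": "https://web-scraping.dev/products"})   # 1 -> accepted
count_sources({})                                              # 0 -> ERR::CRAWLER::CONFIG_ERROR
count_sources({"url": "https://web-scraping.dev",
               "url_list": ["https://web-scraping.dev/products/1"]})  # 2 -> ERR::CRAWLER::CONFIG_ERROR
```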

## How a crawl flows

Three input sources feed the same backlog. The seed URL also enables discovery (sitemaps, robots.txt, link-following). The remote URL list adds a fetch step at crawl start. Results are then surfaced via polling, webhook, or scheduled runs.

## Comparison

 | Source | Field | Discovery | Best for |
|---|---|---|---|
| Seed URL | `url` | On (sitemaps, robots.txt, link-following) | Walk a whole site or a subtree |
| URL list | `url_list` | Off | Re-crawl a known set of URLs (product catalog refresh, link-rot check) |
| Remote URL list | `remote_url_list` | Off | Crawl a list maintained outside Scrapfly (CMS export, gist, S3 manifest) |

## Pick a mode

- Seed URL
- URL List
- Remote URL List
 
### Seed URL

 Provide a single starting URL in `url`. The crawler discovers more URLs from sitemaps, `robots.txt` and page links. This is the default mode and the right choice when you want to walk a whole site or a subtree.

 ```
curl -X POST 'https://api.scrapfly.io/crawl?key=YOUR_API_KEY' \
    -H 'Content-Type: application/json' \
    -d '{
        "url": "https://web-scraping.dev/products"
    }'

```

 

   

 

 Tune discovery with `use_sitemaps`, `respect_robots_txt`, `follow_external_links`, `follow_internal_subdomains`, `allowed_external_domains`, `allowed_internal_subdomains`, `exclude_paths`, `include_only_paths`, `max_depth`. See the [getting started](https://scrapfly.io/docs/crawler-api/getting-started) page for the full parameter list.
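
As a rough sketch of a tuned seed crawl (the parameter values below are illustrative placeholders; the getting started page is the authority on each parameter's exact type and default):

```
import httpx

# Illustrative seed crawl: stay on the /products subtree, keep sitemap and
# robots.txt discovery on, and stop two link hops below the seed.
# Values are placeholders; check the getting started page for exact semantics.
resp = httpx.post(
    "https://api.scrapfly.io/crawl",
    params={"key": "YOUR_API_KEY"},
    json={
        "url": "https://web-scraping.dev/products",
        "use_sitemaps": True,
        "respect_robots_txt": True,
        "max_depth": 2,
        "include_only_paths": ["/products/*"],
    },
    timeout=30,
)
print(resp.json())
```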

 

### URL List

 Send an explicit array of URLs in `url_list`. The Crawler seeds its backlog with exactly those URLs and crawls them. **No discovery happens**. Use this when you already know the URLs you want, for example to refresh a product catalog or to re-fetch a known set of pages.

#### Request shape

 ```
curl -X POST 'https://api.scrapfly.io/crawl?key=YOUR_API_KEY' \
    -H 'Content-Type: application/json' \
    -d '{
        "url_list": [
            "https://web-scraping.dev/products/1",
            "https://web-scraping.dev/products/2",
            "https://web-scraping.dev/products/3"
        ],
        "content_formats": ["html", "markdown"],
        "max_concurrency": 5
    }'

```

 

   

 

 ```
import httpx

resp = httpx.post(
    "https://api.scrapfly.io/crawl",
    params={"key": "YOUR_API_KEY"},
    json={
        "url_list": [
            "https://web-scraping.dev/products/1",
            "https://web-scraping.dev/products/2",
            "https://web-scraping.dev/products/3",
        ],
        "content_formats": ["html", "markdown"],
        "max_concurrency": 5,
    },
    timeout=30,
)
print(resp.json())

```

 

   

 

#### Limits

- Up to **3 MB** of raw JSON-encoded data for the array.
- Up to **100,000** URLs per request.
- Beyond either cap, the API returns [`ERR::CRAWLER::URL_LIST_TOO_LARGE`](https://scrapfly.io/docs/crawler-api/error/ERR::CRAWLER::URL_LIST_TOO_LARGE).
- An empty array returns [`ERR::CRAWLER::URL_LIST_EMPTY`](https://scrapfly.io/docs/crawler-api/error/ERR::CRAWLER::URL_LIST_EMPTY).
 
#### URL validation

Each URL is validated independently. A URL is rejected when:

- The scheme is not `http` or `https`.
- The hostname is missing or malformed.
- A custom port is used. Only `:80` (http) and `:443` (https) are allowed.
- The hostname resolves to a private, loopback or multicast IP range.
 
Invalid URLs are recorded in the crawl report with status `skipped` and reason `input_list:invalid_url`. The rest of the list keeps crawling; one bad URL doesn't fail the run.
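
A rough client-side equivalent of these checks, useful for pre-filtering a list before sending it. This is only an illustration of the documented rules, not the engine's implementation (for one thing, it resolves just the first A record):

```
import ipaddress
import socket
from urllib.parse import urlsplit

def is_acceptable(url: str) -> bool:
    """Mirror the documented rejection rules for url_list entries (illustrative)."""
    try:
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https"):
            return False                                  # scheme must be http or https
        if not parts.hostname:
            return False                                  # hostname missing or malformed
        default_port = {"http": 80, "https": 443}[parts.scheme]
        if parts.port not in (None, default_port):
            return False                                  # only :80 (http) / :443 (https) pass
        ip = ipaddress.ip_address(socket.gethostbyname(parts.hostname))
    except (ValueError, socket.gaierror):
        return False                                      # unparsable port or unresolvable host
    # private, loopback and multicast ranges are rejected
    return not (ip.is_private or ip.is_loopback or ip.is_multicast)
```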

 

### Remote URL List

 Provide a URL in `remote_url_list`. At crawl start the worker fetches that URL, parses the response as a plain-text list, and seeds the backlog with one URL per line. **No discovery happens**. This is the right fit when the URL list lives outside Scrapfly: a CMS export, a gist you keep updated, a manifest in object storage, etc. When the crawl is scheduled, the file is re-fetched on each run.

#### Request shape

 ```
curl -X POST 'https://api.scrapfly.io/crawl?key=YOUR_API_KEY' \
    -H 'Content-Type: application/json' \
    -d '{
        "remote_url_list": "https://example.com/urls.txt",
        "content_formats": ["html"]
    }'

```
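
The same request with Python's `httpx`, mirroring the `url_list` example above:

```
import httpx

resp = httpx.post(
    "https://api.scrapfly.io/crawl",
    params={"key": "YOUR_API_KEY"},
    json={
        "remote_url_list": "https://example.com/urls.txt",
        "content_formats": ["html"],
    },
    timeout=30,
)
print(resp.json())
```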

 

   

 

#### File format

- One URL per line.
- Lines whose **first character is `#`** are treated as comments and skipped. URLs with a `#fragment` are kept since the `#` is not at column zero.
- Blank lines are ignored.
- Lines are trimmed of surrounding whitespace before validation.
 
 ```
# Comments start with # at column zero.

https://web-scraping.dev/products/1
https://web-scraping.dev/products/2
https://web-scraping.dev/products/3

# Blank lines are ignored.
https://web-scraping.dev/api/products

```
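
Reproducing this parsing locally is handy when generating or sanity-checking a list file. A minimal sketch of the rules above (an illustration, not the engine's parser):

```
def parse_url_list(text: str) -> list[str]:
    """Parse a remote_url_list file: one URL per line, '#' comments, blanks ignored."""
    urls = []
    for raw in text.splitlines():
        if raw.startswith("#"):
            continue                       # comment: '#' at column zero
        line = raw.strip()                 # surrounding whitespace is trimmed
        if not line:
            continue                       # blank lines are ignored
        urls.append(line)                  # a '#fragment' later in the line is kept
    return urls
```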

 

   

 

#### Hosting requirements

- The URL must be `http://` or `https://` on the default port (80 or 443).
- The endpoint must respond with `2xx`. Non-2xx responses raise [`ERR::CRAWLER::REMOTE_LIST_FETCH_FAILED`](https://scrapfly.io/docs/crawler-api/error/ERR::CRAWLER::REMOTE_LIST_FETCH_FAILED).
- The body must not exceed **3 MB** or **100,000** URLs. Beyond either cap the engine raises [`ERR::CRAWLER::REMOTE_LIST_TOO_LARGE`](https://scrapfly.io/docs/crawler-api/error/ERR::CRAWLER::REMOTE_LIST_TOO_LARGE).
- The Content-Type should be `text/plain` (or any `text/*`). Other types are accepted but parsed as text.
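
These requirements can be sanity-checked before a crawl or schedule is created. A minimal pre-flight sketch with `httpx` (the byte cap below assumes decimal megabytes, so treat it as approximate):

```
import httpx

LIST_URL = "https://example.com/urls.txt"

resp = httpx.get(LIST_URL, timeout=30)
urls = [line.strip() for line in resp.text.splitlines()
        if line.strip() and not line.startswith("#")]

print(resp.is_success)                        # the endpoint must answer 2xx
print(resp.headers.get("content-type"))       # ideally text/plain or another text/*
print(len(resp.content) <= 3_000_000)         # under the 3 MB body cap (approximate)
print(len(urls) <= 100_000)                   # under the 100,000 URL cap
```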
 
#### URL validation

 Each URL inside the file is validated independently. URLs with a non-http(s) scheme, a custom port, or a private/loopback/multicast IP are recorded with status `skipped` and reason `remote_list:invalid_url`; the rest of the list keeps crawling.

#### Hosting on S3, GCS, or any private object store

Scrapfly fetches `remote_url_list` over plain HTTPS — no AWS or GCP credentials are stored on our side. To host the list in a private bucket, generate a **time-bound signed URL** on your side and pass that URL as `remote_url_list`. The signed URL grants read-only access to that one object until its expiry, after which it stops working — standard cloud-storage practice. Refer to the official cloud-vendor documentation linked under each example for authoritative details on permissions, the signing process, and expiry limits.



##### AWS S3 — presigned URL

 Generate a presigned `GET` URL with the AWS CLI or any AWS SDK. Authoritative AWS references:

- [User Guide — Sharing an object with a presigned URL ](https://docs.aws.amazon.com/AmazonS3/latest/userguide/ShareObjectPreSignedURL.html)
- [AWS CLI — `aws s3 presign` reference ](https://docs.aws.amazon.com/cli/latest/reference/s3/presign.html)
- [S3 API — Signature Version 4 query-string authentication (max **7-day** expiry) ](https://docs.aws.amazon.com/AmazonS3/latest/API/sigv4-query-string-auth.html)
 
 ```
# Sign a URL valid up to 7 days (SigV4 maximum), paste into your scheduled crawl
aws s3 presign s3://my-bucket/urls.txt --expires-in 604800
# → https://my-bucket.s3.amazonaws.com/urls.txt?X-Amz-Algorithm=AWS4-HMAC-SHA256&...

# Use it as remote_url_list
curl -X POST 'https://api.scrapfly.io/crawl?key=YOUR_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "remote_url_list": "https://my-bucket.s3.amazonaws.com/urls.txt?X-Amz-Algorithm=...",
    "page_limit": 1000
  }'
```

 

   

 

Need an expiry longer than 7 days? Use a long-lived IAM user or role whose credentials you keep, and re-sign before each schedule fire by updating the schedule via [`PATCH /crawl/schedules/{id}`](https://scrapfly.io/docs/crawler-api/schedule). AWS recommends rotation over long-lived signatures.
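
A sketch of that rotation job using `boto3` and `httpx`. The presigned-URL call is standard AWS SDK usage; the PATCH payload shape below is an assumption that mirrors the create call, so check the Schedule page for the authoritative contract.

```
import boto3
import httpx

# Re-sign the object for the SigV4 maximum of 7 days...
signed_url = boto3.client("s3").generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-bucket", "Key": "urls.txt"},
    ExpiresIn=7 * 24 * 3600,
)

# ...then push the fresh URL into the existing schedule. SCHEDULE_ID and the
# payload shape are assumptions; see the Schedule page for the exact spec.
httpx.patch(
    "https://api.scrapfly.io/crawl/schedules/SCHEDULE_ID",
    params={"key": "YOUR_API_KEY"},
    json={"crawler_config": {"remote_url_list": signed_url}},
    timeout=30,
)
```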

##### Google Cloud Storage — signed URL

 Generate a V4 signed URL with the Google Cloud SDK or any Google client library. Authoritative GCP references:

- [Cloud Storage docs — Signed URLs (overview & auth options)](https://cloud.google.com/storage/docs/access-control/signed-urls)
- [`gcloud storage sign-url` reference ](https://cloud.google.com/sdk/gcloud/reference/storage/sign-url)
- [Cloud Storage docs — Manual V4 signing-process spec (max **7-day** expiry) ](https://cloud.google.com/storage/docs/access-control/signing-urls-manually)
 
 ```
# Sign a URL valid up to 7 days (V4 maximum) with the gcloud CLI
gcloud storage sign-url gs://my-bucket/urls.txt --duration=7d --private-key-file=sa-key.json
# → https://storage.googleapis.com/my-bucket/urls.txt?X-Goog-Algorithm=GOOG4-RSA-SHA256&...

# Use it as remote_url_list — same shape as the S3 case, just a different host
curl -X POST 'https://api.scrapfly.io/crawl?key=YOUR_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "remote_url_list": "https://storage.googleapis.com/my-bucket/urls.txt?X-Goog-Algorithm=...",
    "page_limit": 1000
  }'
```

 

   

 

 **Tips:**

- Pick the URL expiry to comfortably outlive your schedule. SigV4 (AWS) and V4 (GCS) both cap signed-URL expiry at **7 days** — for longer-running schedules, rotate the URL via [`PATCH /crawl/schedules/{id}`](https://scrapfly.io/docs/crawler-api/schedule) from your own credential-rotation job (every 6 days, etc.).
- When the URL expires, the engine raises [`ERR::CRAWLER::REMOTE_LIST_FETCH_FAILED`](https://scrapfly.io/docs/crawler-api/error/ERR::CRAWLER::REMOTE_LIST_FETCH_FAILED) on the next fire; the schedule's last execution row reports a 4xx so you know to rotate.
- The signed-URL pattern works for any object store that supports it: [Cloudflare R2 (presigned URL) ](https://developers.cloudflare.com/r2/api/s3/presigned-urls/), [Azure Blob Storage (Service SAS) ](https://learn.microsoft.com/azure/storage/blobs/sas-service-create), MinIO, Backblaze B2, etc. The engine treats them all as standard HTTPS.
 
#### Scheduling

Schedule a crawl with `remote_url_list` and Scrapfly will re-fetch the file on every recurrence. This is the simplest way to keep the crawled set in sync with an externally maintained source: drop a new URL into the file, and the next scheduled run picks it up.

 ```
curl -X POST 'https://api.scrapfly.io/crawl/schedules?key=YOUR_API_KEY' \
    -H 'Content-Type: application/json' \
    -d '{
        "crawler_config": {
            "remote_url_list": "https://example.com/urls.txt"
        },
        "webhook_name": "my-webhook",
        "recurrence": { "cron": "0 6 * * *" },
        "notes": "Daily 06:00 UTC refresh"
    }'

```

 

   

 

 See the [Schedule](https://scrapfly.io/docs/crawler-api/schedule) page for full schedule semantics.

 

 

## Discovery is forced off in `url_list` and `remote_url_list` modes

 When you use `url_list` or `remote_url_list`, the API forces these flags to false regardless of what you send:

- `use_sitemaps`
- `respect_robots_txt`
- `follow_external_links`
- `follow_internal_subdomains`
- `ignore_no_follow`
 
 The user-supplied list is the ground truth. If you need broader exploration, use the seed URL mode.

## Concurrency default

 In `url_list` and `remote_url_list` modes, the default `max_concurrency` is 30% of your account quota (clamped to at least 1). The user-supplied list is finite, so smaller batches are friendlier to the source servers than fanning out at full quota. Set `max_concurrency` explicitly to override.