// PRODUCT

Crawler API

One job, entire website. Give the API a seed URL and it recursively discovers, scrapes, and extracts every page, with full anti-bot bypass built in.

Thousands of pages from one API call. Anti-bot bypass on every page.

  • BFS/DFS depth control. Set max_depth, page_limit, and URL patterns. The crawler discovers links and queues them automatically.
  • Every WSA feature, applied per page. Anti-bot, residential proxies, JS rendering, AI extraction - configured once, applied to all crawled pages.
1,000 free credits. No credit card required.

10k+

pages crawled per job in production

8

webhook event types, per-page or on completion

190+

countries for geo-targeted proxies

55k+

developers building on the platform


CAPABILITIES

One Job, Entire Website

Scope rules, depth control, extraction, streaming. All composable, all on one endpoint.

Crawl Pipeline

Provide one seed URL. The crawler discovers every reachable page within your scope, fetches each one through the full Scrapfly stack with anti-bot bypass applied, and streams results as they arrive. Nothing to poll, nothing to stitch together.

  • One seed URL, entire site
  • ASP per page, automatic anti-bot
  • Auto-retry of failed URLs
  • Streaming results as pages are crawled

  • Seed URL: starting point for discovery, one per job
  • Discovery: BFS or DFS link traversal plus sitemap ingestion
  • Scope Filter: include/exclude paths, depth cap, page limit, dedup
  • ASP Fetch: anti-bot bypass and proxy rotation applied per page
  • Parse / Extract: HTML, markdown, JSON, AI model, custom template
  • Stream: webhook events per page, or pull via GET /urls and GET /pages

  • BFS / DFS: traversal mode
  • JSON / JSONL: structured output
  • Markdown: LLM-ready text
  • WARC: full archive

View Crawler API docs →
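The discovery loop above is conceptually simple. Here is a minimal local sketch of BFS traversal honoring max_depth and page_limit, using a toy link map in place of real fetches (the function and the link graph are illustrative, not part of the SDK):

```python
from collections import deque

def crawl_bfs(seed, links, max_depth=3, page_limit=50):
    """Breadth-first discovery: queue the seed, then follow links
    level by level until the depth cap or page limit is hit."""
    seen = {seed}
    queue = deque([(seed, 0)])
    visited = []
    while queue and len(visited) < page_limit:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # don't enqueue links beyond the depth cap
        for link in links.get(url, []):
            if link not in seen:  # dedup before queuing
                seen.add(link)
                queue.append((link, depth + 1))
    return visited

# Toy link graph standing in for real page fetches
site = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/a/2"],
    "/b": ["/b/1"],
}
print(crawl_bfs("/", site, max_depth=1, page_limit=10))
```

Swapping the deque for a stack (pop from the right) turns the same loop into DFS, which is the other traversal mode the API exposes.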

Crawl Scope Rules

Seed a URL and define exactly what to follow. Stay on the same host, narrow to URL patterns, block external links, limit depth, or cap total pages. Sitemap ingestion and robots.txt respect are both toggleable. The crawler queues and deduplicates URLs automatically.

include_only_paths
exclude_paths
follow_external_links
page_limit
max_depth
use_sitemaps
respect_robots_txt
deduplication

View scope docs →
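The scope parameters compose into a single allow/deny decision per discovered URL. A rough local illustration using shell-style patterns via fnmatch (the parameter names mirror the API, but the exact matching semantics here are an assumption):

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

def in_scope(url, seed_host, include_only_paths=None, exclude_paths=None,
             follow_external_links=False):
    """Return True if a discovered URL passes the scope filters."""
    parts = urlparse(url)
    if not follow_external_links and parts.netloc != seed_host:
        return False  # stay on the seed host
    path = parts.path or "/"
    # Exclusions win over inclusions
    if exclude_paths and any(fnmatch(path, p) for p in exclude_paths):
        return False
    if include_only_paths:
        return any(fnmatch(path, p) for p in include_only_paths)
    return True

print(in_scope("https://example.com/blog/post-1", "example.com",
               include_only_paths=["/blog/*"]))            # True
print(in_scope("https://example.com/tags/python", "example.com",
               exclude_paths=["/tags/*"]))                 # False
print(in_scope("https://other.com/blog/x", "example.com"))  # False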

Output Formats

Every crawled page is available in the format your pipeline needs. Pull discovered URLs as streaming plain text via GET /urls (gzip supported). Fetch full page results via GET /pages as JSON or JSONL. Request markdown output for LLM pipelines or WARC for archiving.

  • GET /urls: streaming plain text, gzip supported
  • GET /pages: JSON / JSONL
  • WARC: archive format
  • Content formats: HTML, Markdown

View results docs →
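JSONL responses are easy to consume incrementally: one JSON object per line. A sketch of parsing a GET /pages JSONL body (the field names here are illustrative, not the documented response schema):

```python
import json

def iter_pages(jsonl_text):
    """Yield one page record per non-empty JSONL line."""
    for line in jsonl_text.splitlines():
        line = line.strip()
        if line:
            yield json.loads(line)

# Example body with assumed field names
body = (
    '{"url": "https://example.com/", "status_code": 200}\n'
    '{"url": "https://example.com/a", "status_code": 200}\n'
)
for page in iter_pages(body):
    print(page["url"], page["status_code"])
```

Because each line is independent, the same loop works on a streaed response read line by line, without buffering the whole crawl in memory.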

Anti-Bot Per Page, Automatic

Every URL the crawler discovers runs through the full Scrapfly anti-bot stack. Anti-bot bypass, proxy selection, JS rendering, and custom headers are configured once on the job and applied to every page. Failed pages are retried automatically at no extra credit cost.

See all anti-bot bypasses →

Webhook Streaming

Add a webhook URL and results stream to your endpoint as each page is visited. No polling, no held connections. Eight event types cover the full crawl lifecycle, from first discovery to job completion. The Python SDK ships typed dataclasses for each event.

  • crawler_started: job begins, seed URL queued
  • url_discovered: new URL found and added to the queue
  • url_visited: page fetched, content available in payload
  • url_failed: page failed after retries, error included
  • crawler_finished: job complete, final summary delivered

View webhook docs →
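On the receiving side, a webhook handler is just a dispatch on the event-type field. A minimal sketch; the event names follow the webhook_events values used in the code samples below, and the payload shape is an assumption, so check the webhook docs for the exact schema:

```python
import json

def handle_event(raw_body):
    """Dispatch an incoming webhook payload by its event name."""
    event = json.loads(raw_body)
    kind = event.get("event")
    if kind == "crawler_url_visited":
        return f"visited {event['url']}"
    if kind == "crawler_url_failed":
        return f"failed {event['url']}: {event.get('error')}"
    if kind == "crawler_finished":
        return "crawl complete"
    return f"ignored {kind}"

print(handle_event('{"event": "crawler_url_visited", "url": "https://example.com/a"}'))
```

In production this function would sit behind your HTTP framework's POST route; the dispatch logic stays the same.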

Per-Page AI Extraction

Apply structured extraction to every page in the crawl. Pick a pre-trained model (product, article, review), define a template with CSS/XPath selectors, or send an extraction_prompt for LLM-based free-form extraction. Extracted JSON is included in each webhook payload and in the GET /pages response. Pipes through the same Extraction API used standalone.

  • markdown output format
  • product model
  • article model
  • review model
  • extracted_data JSON field

View extraction docs →

WARC Results

All crawled pages are stored as WARC artifacts. Download the full archive after completion or stream per-page via webhook. WARC format is standard across SEO tools, archiving pipelines, and LLM training datasets.

View WARC docs →

Live Observability

Every crawl job has a live dashboard showing URLs visited, discovered, failed, and skipped in real time. Replay any individual page scrape from the log viewer.

View monitoring docs →

URL Deduplication

URLs are normalized and deduplicated before queuing. Query string ordering, trailing slashes, and fragment identifiers are handled so no page is crawled twice. Combine with page_limit to stay within budget.
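The kind of normalization described can be sketched with urllib; the crawler's exact canonicalization rules are an assumption here, but sorting query parameters, dropping fragments, and trimming trailing slashes covers the cases named above:

```python
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

def normalize(url):
    """Canonicalize a URL: lowercase the host, sort query params,
    drop the fragment, strip a trailing slash from non-root paths."""
    p = urlparse(url)
    query = urlencode(sorted(parse_qsl(p.query)))
    path = p.path.rstrip("/") or "/"
    return urlunparse((p.scheme, p.netloc.lower(), path, "", query, ""))

seen = set()
for u in ("https://Example.com/blog/?b=2&a=1",
          "https://example.com/blog?a=1&b=2#top"):
    seen.add(normalize(u))
print(len(seen))  # both variants collapse to one canonical URL
```

Keeping the dedup set keyed on the canonical form is what lets page_limit act as a true budget rather than counting the same page twice.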

Built for Every Large-Scale Crawl Use Case

The Crawler API composes with the full Scrapfly product line. Use standalone for URL discovery and archiving, or pair with extraction and AI features for richer pipelines.

SEO auditing
RAG dataset prep
LLM training data
Competitive monitoring
Website archiving
Link graph extraction
Price monitoring
Content indexing

CODE

Crawl Any Site in a Few Lines

Start with the quickstart, refine the scope, extract in the same call.

Seed URL, depth limit, hit start.

Python:

from scrapfly import ScrapflyClient, CrawlerConfig, Crawl

client = ScrapflyClient(key="API KEY")

# Start the crawl job: links are discovered and scraped automatically
crawl = Crawl(
    client,
    CrawlerConfig(
        url='https://web-scraping.dev/products',
        page_limit=50,
        max_depth=3,
    )
).crawl().wait()

# Retrieve all crawled pages
pages = crawl.warc().get_pages()
for page in pages:
    print(page['url'], page['status_code'])

TypeScript:

import { ScrapflyClient, CrawlerConfig, Crawl } from 'jsr:@scrapfly/scrapfly-sdk';

const client = new ScrapflyClient({ key: "API KEY" });

const crawl = new Crawl(
    client,
    new CrawlerConfig({
        url: 'https://web-scraping.dev/products',
        page_limit: 50,
        max_depth: 3,
    })
);

await crawl.start();
await crawl.wait();

const status = await crawl.status();
console.log(`visited=${status.state.urls_visited}`);
const md = await crawl.read('https://web-scraping.dev/products', 'markdown');
console.log(md);

cURL:

curl "https://api.scrapfly.io/crawl" \
  -d '{"url":"https://web-scraping.dev/products","page_limit":50,"max_depth":3}' \
  -H "Content-Type: application/json" \
  -H "X-Api-Key: $SCRAPFLY_KEY"

Include/exclude paths, follow-external toggles, max pages, max depth.

Python:

from scrapfly import ScrapflyClient, CrawlerConfig, Crawl
client = ScrapflyClient(key="API KEY")

crawl = Crawl(
    client,
    CrawlerConfig(
        url='https://example.com',
        # Only follow links matching these patterns
        include_only_paths=['/blog/*', '/docs/*'],
        # Skip these paths entirely
        exclude_paths=['/tags/*', '/author/*'],
        # Stay on the same host
        follow_external_links=False,
        page_limit=200,
        max_depth=5,
    )
).crawl().wait()

print(f"Crawled {crawl.status().state.urls_visited} pages")

TypeScript:

import { ScrapflyClient, CrawlerConfig, Crawl } from 'jsr:@scrapfly/scrapfly-sdk';

const client = new ScrapflyClient({ key: "API KEY" });

const crawl = new Crawl(
  client,
  new CrawlerConfig({
    url: 'https://example.com',
    // Only follow links matching these patterns
    include_only_paths: ['/blog/*', '/docs/*'],
    // Skip these paths entirely
    exclude_paths: ['/tags/*', '/author/*'],
    // Stay on the same host
    follow_external_links: false,
    page_limit: 200,
    max_depth: 5,
  })
);

await crawl.start();
await crawl.wait();

const status = await crawl.status();
console.log(`Crawled ${status.state.urls_visited} pages`);

cURL:

curl -X POST "https://api.scrapfly.io/crawl?key=$SCRAPFLY_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "include_only_paths": ["/blog/*", "/docs/*"],
    "exclude_paths": ["/tags/*", "/author/*"],
    "follow_external_links": false,
    "page_limit": 200,
    "max_depth": 5
  }'

Emit LLM-ready content formats for every crawled page, or apply AI models and templates via the same Extraction API.

Python:

from scrapfly import ScrapflyClient, CrawlerConfig, Crawl
client = ScrapflyClient(key="API KEY")

crawl = Crawl(
    client,
    CrawlerConfig(
        url='https://web-scraping.dev/products',
        page_limit=50,
        # Emit markdown + clean_html for every page,
        # ready for LLM / RAG pipelines.
        content_formats=['markdown', 'clean_html'],
    )
).crawl().wait()

# Stream markdown for every visited page via a URL pattern
for page in crawl.read_iter(pattern='*', format='markdown'):
    print(page.url, len(page.content), 'chars')

TypeScript:

import { ScrapflyClient, CrawlerConfig, Crawl } from 'jsr:@scrapfly/scrapfly-sdk';

const client = new ScrapflyClient({ key: "API KEY" });

const crawl = new Crawl(
  client,
  new CrawlerConfig({
    url: 'https://web-scraping.dev/products',
    page_limit: 50,
    // Emit markdown + clean_html for every page,
    // ready for LLM / RAG pipelines.
    content_formats: ['markdown', 'clean_html'],
  })
);

await crawl.start();
await crawl.wait();

// Read markdown for the seed URL
const md = await crawl.read('https://web-scraping.dev/products', 'markdown');
console.log(`markdown length=${md?.length ?? 0} chars`);

cURL:

curl -X POST "https://api.scrapfly.io/crawl?key=$SCRAPFLY_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://web-scraping.dev/products",
    "page_limit": 50,
    "content_formats": ["markdown", "clean_html"]
  }'

Fire-and-forget. Results are pushed to your webhook per page.

Python:

from scrapfly import ScrapflyClient, CrawlerConfig, Crawl
client = ScrapflyClient(key="API KEY")

# Fire-and-forget: register a webhook in the Scrapfly dashboard
# ("my-crawl-endpoint"), then pass its name here.
crawl = Crawl(
    client,
    CrawlerConfig(
        url='https://web-scraping.dev/products',
        page_limit=500,
        webhook_name='my-crawl-endpoint',
        webhook_events=['crawler_url_visited', 'crawler_finished'],
    )
).crawl()  # returns immediately, no .wait() needed

print(f"Crawl started: {crawl.uuid}")

TypeScript:

import { ScrapflyClient, CrawlerConfig, Crawl } from 'jsr:@scrapfly/scrapfly-sdk';

const client = new ScrapflyClient({ key: "API KEY" });

// Fire-and-forget: register a webhook in the Scrapfly dashboard
// ("my-crawl-endpoint"), then pass its name here.
const crawl = new Crawl(
  client,
  new CrawlerConfig({
    url: 'https://web-scraping.dev/products',
    page_limit: 500,
    webhook_name: 'my-crawl-endpoint',
    webhook_events: ['crawler_url_visited', 'crawler_finished'],
  })
);

await crawl.start();  // returns immediately, no .wait() needed
console.log(`Crawl started: ${crawl.uuid}`);

cURL:

curl -X POST "https://api.scrapfly.io/crawl?key=$SCRAPFLY_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://web-scraping.dev/products",
    "page_limit": 500,
    "webhook_name": "my-crawl-endpoint",
    "webhook_events": ["crawler_url_visited", "crawler_finished"]
  }'

LEARN

Docs, Tools, and Ready-Made Crawlers

Everything you need to go from zero to production crawl pipeline.

API Reference

Every parameter, every response field, with runnable cURL examples for the Crawler API.

View Crawler API docs →

Academy

Interactive courses on web scraping, anti-bot bypass, and data extraction at scale.

View Academy →

Open-Source Scrapers

40+ production-ready scrapers on GitHub. Copy, paste, customize for your use case.

View scrapers repo →

Developer Tools

cURL-to-Python, JA3 checker, selector tester, HTTP/2 fingerprint inspector.

View developer tools →

// INTEGRATIONS

Seamlessly integrate with frameworks & platforms

Plug Scrapfly into your favorite tools, or build custom workflows with our first-class SDKs.


FAQ

Frequently Asked Questions

What is the Crawler API and how is it different from the Web Scraping API?

The Web Scraping API retrieves single pages you specify by URL. The Crawler API takes a seed URL and automatically discovers and scrapes every linked page on the domain. Use the scraping API for targeted extraction; use the Crawler API when you need a whole site and don't know all the URLs upfront.

How do I control which pages get crawled?

Use include_only_paths to allow only matching URL patterns (e.g. /blog/*), exclude_paths to block paths, max_depth to limit link depth from the seed, and page_limit to cap total pages. Set follow_external_links=False to stay on the seed host. All options combine, so you can scope a crawl to exactly the pages you need.

Does anti-bot bypass apply to every crawled page?

Yes. Anti-bot, proxy selection, and all WSA parameters configured on the crawl job are applied automatically to every URL the crawler discovers. You don't need to re-enable them per page. Cloudflare, DataDome, Akamai, PerimeterX, Kasada, and others are all handled transparently.

How do I receive results as the crawl runs?

Set a webhook URL on the job. The API sends a POST to your endpoint for each event: url_visited, url_discovered, url_failed, url_skipped, crawler_started, crawler_finished, and more. The Python SDK includes typed dataclasses for each event so you can dispatch without manual JSON parsing.

Can I extract structured data from every crawled page?

Yes. Set extraction_model to a pre-trained model (product, article, review), define an extraction_template with CSS/XPath/JMESPath selectors, or use extraction_prompt for LLM-based free-form extraction. The extracted data is included in the WARC artifact and webhook payloads per page.

Can I crawl JavaScript-heavy single-page applications?

Yes. Set render_js=True on the crawl job and every page is loaded in a real Chrome browser (Scrapium, our stealth Chromium). Dynamic content, AJAX-loaded links, and lazy-loaded data are all captured before the page is stored.

Is web crawling legal?

Crawling publicly accessible data is legal in most jurisdictions. Cases such as Meta v. Bright Data and hiQ v. LinkedIn are commonly cited precedents for scraping public data. You remain responsible for respecting robots.txt, rate limits, and target terms of service. See our legal overview for details.


// PRICING

Transparent, usage-based pricing

One plan covers the full Scrapfly platform. Pick a monthly credit budget; every API shares the same credit pool. No per-product lock-in, no surprise line items.

Free tier

1,000 free credits on signup. No credit card required.

Pay on success

You only pay for successful requests. Failed calls are free.

No lock-in

Upgrade, downgrade, or cancel anytime. No contract.

Need per-page control instead of a full crawl?

The Crawler API handles full-site traversal. Each layer is also available standalone: Web Scraping API for individual page scraping with anti-bot bypass, Extraction API for structured data from any HTML, AI Web Scraping API for scrape + extract in one call, Browser API for hosted headless Chrome, Scrapium for stealth Chromium you drive with Playwright, or Curlium for byte-perfect HTTP without a browser.

Get Free API Key
1,000 free credits. No card.