// PRODUCT

# The Best Web Crawler API

 Crawl an entire website with one API call using Scrapfly's web crawler API. Configure a seed URL and the API recursively discovers, scrapes, and extracts every page, with full anti-bot bypass built in.

## Thousands of pages from one API call. Anti-bot bypass on every page.

- **BFS/DFS depth control.** Set `max_depth`, `page_limit`, and URL patterns. The crawler discovers links and queues them automatically.
- **Every WSA feature, applied per page.** Anti-bot, residential proxies, JS rendering, AI extraction - configured once, applied to all crawled pages.
 
 [ Get Free API Key ](https://scrapfly.io/register) [ Developer Docs ](https://scrapfly.io/docs/crawler-api/getting-started) 

 1,000 free credits. No credit card required. 

 






 

 

---

## 10k+

pages crawled per job in production

 



 

## 8

webhook event types, per-page or on completion

 



 

## 190+

countries for geo-targeted proxies

 



 

## 55k+

developers building on the platform

 



 

 

 

---

// CAPABILITIES

## One Job, Entire Website

Scope rules, depth control, extraction, streaming. All composable, all on one endpoint.

 

 ### Crawl Pipeline

Provide one seed URL. The crawler discovers every reachable page within your scope, fetches each one through the full Scrapfly stack with anti-bot bypass applied, and streams results as they arrive. Nothing to poll, nothing to stitch together.

- **One seed URL** - entire site
- **ASP per page** - auto anti-bot
- **Auto retry** - failed URLs
- **Streaming** - results as crawled

 

1. **Seed URL** - starting point for discovery, one per job
2. **Discovery** - BFS or DFS link traversal + sitemap ingestion
3. **Scope Filter** - include/exclude paths, depth cap, page limit, dedup
4. **ASP Fetch** - anti-bot bypass + proxy rotation applied per page
5. **Parse / Extract** - HTML, markdown, JSON, AI model, custom template
6. **Stream** - webhook events per page or pull via `GET /urls`, `GET /pages`

- **BFS / DFS** - traversal mode
- **JSON / JSONL** - structured output
- **Markdown** - LLM-ready text
- **WARC** - full archive

 

[View Crawler API docs →](https://scrapfly.io/docs/crawler-api/getting-started)

 



 

 

 ### Crawl Scope Rules

Seed a URL and define exactly what to follow. Stay on the same host, narrow to URL patterns, block external links, limit depth, or cap total pages. Sitemap ingestion and robots.txt respect are both toggleable. The crawler queues and deduplicates URLs automatically.

- `include_only_paths`
- `exclude_paths`
- `follow_external_links`
- `page_limit`
- `max_depth`
- `use_sitemaps`
- `respect_robots_txt`
- `deduplication`

 

[View scope docs →](https://scrapfly.io/docs/crawler-api/getting-started)

 



 

 ### Output Formats

Every crawled page is available in the format your pipeline needs. Pull discovered URLs as streaming plain text via `GET /urls` (gzip supported). Fetch full page results via `GET /pages` as JSON or JSONL. Request markdown output for LLM pipelines or WARC for archiving.

- **GET /urls** - streaming text
- **GET /pages** - JSON / JSONL
- **WARC** - archive format

HTML · Markdown · gzip streaming · JSONL
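If you prefer raw HTTP over the SDKs, the same results endpoints can be consumed with any client. A minimal Python sketch: the job-scoped URL shape and the job UUID placeholder below are assumptions, so check the results docs for the canonical routes.

```python
# Sketch only: the /crawl/<uuid>/urls path shape and JOB_UUID placeholder
# are assumptions -- consult the Crawler API results docs for exact routes.
import requests

JOB_UUID = "..."  # returned when the crawl job is created
params = {"key": "YOUR_API_KEY"}

# Discovered URLs stream as plain text; requests handles gzip transparently.
with requests.get(
    f"https://api.scrapfly.io/crawl/{JOB_UUID}/urls",
    params=params,
    stream=True,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        print(line)
```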

 

[View results docs →](https://scrapfly.io/docs/crawler-api/results)

 



 

 

 ### Anti-Bot Per Page, Automatic

Every URL the crawler discovers runs through the full Scrapfly anti-bot stack. Anti-bot bypass, proxy selection, JS rendering, and custom headers are configured once on the job and applied to every page. Failed pages are retried automatically at no extra credit cost.

 [Cloudflare](https://scrapfly.io/bypass/cloudflare) 

 [DataDome](https://scrapfly.io/bypass/datadome) 

 [Akamai](https://scrapfly.io/bypass/akamai) 

 [PerimeterX](https://scrapfly.io/bypass/perimeterx) 

 [Kasada](https://scrapfly.io/bypass/kasada) 

 [Imperva](https://scrapfly.io/bypass/incapsula) 

 

 [See all antibot bypasses →](https://scrapfly.io/bypass) 



 

 ### Webhook Streaming

Add a `webhook` URL and results stream to your endpoint as each page is visited. No polling, no held connections. Eight event types cover the full crawl lifecycle, from first discovery to job completion; the core five are listed below, followed by a minimal receiver sketch. The Python SDK ships typed dataclasses for each event.

- `crawler_started` - job begins, seed URL queued
- `url_discovered` - new URL found and added to the queue
- `url_visited` - page fetched, content available in payload
- `url_failed` - page failed after retries, error included
- `crawler_finished` - job complete, final summary delivered
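As a rough illustration of the push model, here is a stdlib-only receiver sketch. The `event` field name and payload shape are assumptions, not the documented schema; see the webhook docs for the real payload.

```python
# Minimal webhook receiver sketch (stdlib only). The 'event' key is an
# assumed payload field -- verify against the webhook docs before relying on it.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class CrawlWebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        event = payload.get("event")  # assumed field name
        if event == "url_visited":
            pass  # page content available in the payload, per the list above
        elif event == "crawler_finished":
            pass  # final summary delivered
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8000), CrawlWebhookHandler).serve_forever()
```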

 

 

[View webhook docs →](https://scrapfly.io/docs/crawler-api/webhook)

 



 

 

 ### Per-Page AI Extraction

Apply structured extraction to every page in the crawl. Pick a pre-trained model (`product`, `article`, `review`), define a template with CSS/XPath selectors, or send an `extraction_prompt` for LLM-based free-form extraction. Extracted JSON is included in each webhook payload and in the `GET /pages` response. Pipes through the same [Extraction API](https://scrapfly.io/products/extraction-api) used standalone; a hedged SDK sketch follows the parameter list below.

- [`extraction_model`](https://scrapfly.io/docs/extraction-api/automatic-ai)
- [`extraction_template`](https://scrapfly.io/docs/extraction-api/rules-and-template)
- [`extraction_prompt`](https://scrapfly.io/docs/extraction-api/llm-prompt)
- markdown format
- `product` / `article` / `review` models
- `extracted_data` JSON
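A hedged sketch reusing the quickstart pattern from the code section below, assuming `CrawlerConfig` accepts the `extraction_model` parameter named above:

```python
# Sketch: per-page AI extraction on a crawl job. Passing extraction_model
# through CrawlerConfig is an assumption -- verify against the extraction docs.
from scrapfly import ScrapflyClient, CrawlerConfig, Crawl

client = ScrapflyClient(key="API KEY")

crawl = Crawl(
    client,
    CrawlerConfig(
        url="https://web-scraping.dev/products",
        page_limit=50,
        extraction_model="product",  # pre-trained: product / article / review
    ),
).crawl().wait()
```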

 

[View extraction docs →](https://scrapfly.io/docs/crawler-api/extraction-rules)

 



 

 

 ### WARC Results

All crawled pages are stored as WARC artifacts. Download the full archive after completion or stream per-page via webhook. WARC format is standard across SEO tools, archiving pipelines, and LLM training datasets.
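Once the archive is downloaded, any WARC tooling can read it. A sketch using the open-source `warcio` package (`pip install warcio`), which is independent of the Scrapfly SDK; the filename is a placeholder.

```python
# Iterate response records in a downloaded crawl archive with warcio.
# 'crawl.warc.gz' is a placeholder filename.
from warcio.archiveiterator import ArchiveIterator

with open("crawl.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            url = record.rec_headers.get_header("WARC-Target-URI")
            body = record.content_stream().read()
            print(url, len(body), "bytes")
```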

[View WARC docs →](https://scrapfly.io/docs/crawler-api/warc-format)

 



 

 ### Live Observability

Every crawl job has a live dashboard showing URLs visited, discovered, failed, and skipped in real time. Replay any individual page scrape from the log viewer.

[View monitoring docs →](https://scrapfly.io/docs/crawler-api/monitoring)

 



 

 ### URL Deduplication

URLs are normalized and deduplicated before queuing. Query string ordering, trailing slashes, and fragment identifiers are handled so no page is crawled twice. Combine with `page_limit` to stay within budget.
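For intuition, a toy version of the normalization described above (illustrative only, not Scrapfly's internal implementation):

```python
# Toy URL normalizer: sorts query params, trims trailing-slash variants,
# drops fragments, lowercases the host. Illustrative only.
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def normalize(url: str) -> str:
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query)))  # stable param order
    path = parts.path.rstrip("/") or "/"               # trailing-slash variants
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, query, ""))

# Both variants collapse to a single queue entry:
assert normalize("https://x.dev/a/?b=2&a=1#frag") == normalize("https://x.dev/a?a=1&b=2")
```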

 



 

 

 ### Built for Every Large-Scale Crawl Use Case

The Crawler API composes with the full Scrapfly product line. Use standalone for URL discovery and archiving, or pair with extraction and AI features for richer pipelines.

SEO auditing · RAG dataset prep · LLM training data · Competitive monitoring · Website archiving · Link graph extraction · Price monitoring · Content indexing

 

- [**Web Scraping API** - Single-page scraping](https://scrapfly.io/products/web-scraping-api): targeted extraction with full anti-bot bypass. Use when you already know the URLs.
- [**Data Extraction API** - Structured output from crawled pages](https://scrapfly.io/products/extraction-api): submit raw HTML and get structured JSON back. Pairs with Crawler API results.
- [**AI Web Scraping API** - Auto-extraction at crawl scale](https://scrapfly.io/ai-web-scraping-api): scrape and extract in one call. Apply to every crawled page for zero-schema pipelines.

 

 



 

 

 

---

// CODE

## Crawl Any Site in Four Lines

Start with Quickstart, refine with scope, extract in the same call.

 

 [ Quickstart ](#ca-strat-quickstart) [ Scope Rules ](#ca-strat-scope) [ Per-Page Extraction ](#ca-strat-extraction) [ Webhook Streaming ](#ca-strat-webhook) 

Seed URL, depth limit, hit start.


```python
from scrapfly import ScrapflyClient, CrawlerConfig, Crawl
client = ScrapflyClient(key="API KEY")

# Start crawl job — discovers and scrapes links automatically
crawl = Crawl(
    client,
    CrawlerConfig(
        url='https://web-scraping.dev/products',
        page_limit=50,
        max_depth=3,
    )
).crawl().wait()

# Retrieve all crawled pages
pages = crawl.warc().get_pages()
for page in pages:
    print(page['url'], page['status_code'])
```

```typescript
import { ScrapflyClient, CrawlerConfig, Crawl } from 'jsr:@scrapfly/scrapfly-sdk';

const client = new ScrapflyClient({ key: "API KEY" });

const crawl = new Crawl(
    client,
    new CrawlerConfig({
        url: 'https://web-scraping.dev/products',
        page_limit: 50,
        max_depth: 3,
    })
);

await crawl.start();
await crawl.wait();

const status = await crawl.status();
console.log(`visited=${status.state.urls_visited}`);
const md = await crawl.read('https://web-scraping.dev/products', 'markdown');
console.log(md);
```

```bash
curl "https://api.scrapfly.io/crawl" \
  -d '{"url":"https://web-scraping.dev/products","page_limit":50,"max_depth":3}' \
  -H "Content-Type: application/json" \
  -H "X-Api-Key: $SCRAPFLY_KEY"
```

 

 

 [ Python SDK docs → ](https://scrapfly.io/docs/sdk/python) [ TypeScript SDK docs → ](https://scrapfly.io/docs/sdk/typescript) [ HTTP API docs → ](https://scrapfly.io/docs) 

 

Include/exclude paths, follow-external toggles, max pages, max depth.


```python
from scrapfly import ScrapflyClient, CrawlerConfig, Crawl
client = ScrapflyClient(key="API KEY")

crawl = Crawl(
    client,
    CrawlerConfig(
        url='https://example.com',
        # Only follow links matching these patterns
        include_only_paths=['/blog/*', '/docs/*'],
        # Skip these paths entirely
        exclude_paths=['/tags/*', '/author/*'],
        # Stay on the same host
        follow_external_links=False,
        page_limit=200,
        max_depth=5,
    )
).crawl().wait()

print(f"Crawled {crawl.status().state.urls_visited} pages")
```

```typescript
import { ScrapflyClient, CrawlerConfig, Crawl } from 'jsr:@scrapfly/scrapfly-sdk';

const client = new ScrapflyClient({ key: "API KEY" });

const crawl = new Crawl(
  client,
  new CrawlerConfig({
    url: 'https://example.com',
    // Only follow links matching these patterns
    include_only_paths: ['/blog/*', '/docs/*'],
    // Skip these paths entirely
    exclude_paths: ['/tags/*', '/author/*'],
    // Stay on the same host
    follow_external_links: false,
    page_limit: 200,
    max_depth: 5,
  })
);

await crawl.start();
await crawl.wait();

const status = await crawl.status();
console.log(`Crawled ${status.state.urls_visited} pages`);
```

```bash
curl -X POST "https://api.scrapfly.io/crawl?key=$SCRAPFLY_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "include_only_paths": ["/blog/*", "/docs/*"],
    "exclude_paths": ["/tags/*", "/author/*"],
    "follow_external_links": false,
    "page_limit": 200,
    "max_depth": 5
  }'
```

 

 

 [ Python SDK docs → ](https://scrapfly.io/docs/sdk/python) [ TypeScript SDK docs → ](https://scrapfly.io/docs/sdk/typescript) [ HTTP API docs → ](https://scrapfly.io/docs) 

 

Emit LLM-ready content formats for every crawled page, or layer AI models and templates on top. Pipes through the [Extraction API](https://scrapfly.io/products/extraction-api).


```python
from scrapfly import ScrapflyClient, CrawlerConfig, Crawl
client = ScrapflyClient(key="API KEY")

crawl = Crawl(
    client,
    CrawlerConfig(
        url='https://web-scraping.dev/products',
        page_limit=50,
        # Emit markdown + clean_html for every page,
        # ready for LLM / RAG pipelines.
        content_formats=['markdown', 'clean_html'],
    )
).crawl().wait()

# Stream markdown for every visited page via a URL pattern
for page in crawl.read_iter(pattern='*', format='markdown'):
    print(page.url, len(page.content), 'chars')
```

```typescript
import { ScrapflyClient, CrawlerConfig, Crawl } from 'jsr:@scrapfly/scrapfly-sdk';

const client = new ScrapflyClient({ key: "API KEY" });

const crawl = new Crawl(
  client,
  new CrawlerConfig({
    url: 'https://web-scraping.dev/products',
    page_limit: 50,
    // Emit markdown + clean_html for every page,
    // ready for LLM / RAG pipelines.
    content_formats: ['markdown', 'clean_html'],
  })
);

await crawl.start();
await crawl.wait();

// Read markdown for the seed URL
const md = await crawl.read('https://web-scraping.dev/products', 'markdown');
console.log(`markdown length=${md?.length ?? 0} chars`);
```

```bash
curl -X POST "https://api.scrapfly.io/crawl?key=$SCRAPFLY_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://web-scraping.dev/products",
    "page_limit": 50,
    "content_formats": ["markdown", "clean_html"]
  }'
```

 

 

 [ Python SDK docs → ](https://scrapfly.io/docs/sdk/python) [ TypeScript SDK docs → ](https://scrapfly.io/docs/sdk/typescript) [ HTTP API docs → ](https://scrapfly.io/docs) 

 

Fire-and-forget. Results push to your webhook per page.


```python
from scrapfly import ScrapflyClient, CrawlerConfig, Crawl
client = ScrapflyClient(key="API KEY")

# Fire-and-forget: register a webhook in the Scrapfly dashboard
# ("my-crawl-endpoint"), then pass its name here.
crawl = Crawl(
    client,
    CrawlerConfig(
        url='https://web-scraping.dev/products',
        page_limit=500,
        webhook_name='my-crawl-endpoint',
        webhook_events=['crawler_url_visited', 'crawler_finished'],
    )
).crawl()  # returns immediately, no .wait() needed

print(f"Crawl started: {crawl.uuid}")
```

```typescript
import { ScrapflyClient, CrawlerConfig, Crawl } from 'jsr:@scrapfly/scrapfly-sdk';

const client = new ScrapflyClient({ key: "API KEY" });

// Fire-and-forget: register a webhook in the Scrapfly dashboard
// ("my-crawl-endpoint"), then pass its name here.
const crawl = new Crawl(
  client,
  new CrawlerConfig({
    url: 'https://web-scraping.dev/products',
    page_limit: 500,
    webhook_name: 'my-crawl-endpoint',
    webhook_events: ['crawler_url_visited', 'crawler_finished'],
  })
);

await crawl.start();  // returns immediately, no .wait() needed
console.log(`Crawl started: ${crawl.uuid}`);
```

```bash
curl -X POST "https://api.scrapfly.io/crawl?key=$SCRAPFLY_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://web-scraping.dev/products",
    "page_limit": 500,
    "webhook_name": "my-crawl-endpoint",
    "webhook_events": ["crawler_url_visited", "crawler_finished"]
  }'
```

 

 

 [ Python SDK docs → ](https://scrapfly.io/docs/sdk/python) [ TypeScript SDK docs → ](https://scrapfly.io/docs/sdk/typescript) [ HTTP API docs → ](https://scrapfly.io/docs) 

 

 

 

---

// LEARN

## Docs, Tools, and Ready-Made Crawlers

Everything you need to go from zero to production crawl pipeline.

 

 ### API Reference

Every parameter, every response field, with runnable cURL examples for the Crawler API.

 [ View Crawler API docs → ](https://scrapfly.io/docs/crawler-api/getting-started) 



 

 ### Academy

Interactive courses on web scraping, anti-bot bypass, and data extraction at scale.

 [ View Academy → ](https://scrapfly.io/academy) 



 

 ### Open-Source Scrapers

40+ production-ready scrapers on GitHub. Copy, paste, customize for your use case.

 [ View scrapers repo → ](https://github.com/scrapfly/scrapfly-scrapers) 



 

 ### Developer Tools

cURL-to-Python, JA3 checker, selector tester, HTTP/2 fingerprint inspector.

 [ View developer tools → ](https://scrapfly.io/web-scraping-tools) 



 

 

 

---

// INTEGRATIONS

## Seamlessly integrate with frameworks & platforms

Plug Scrapfly into your favorite tools, or build custom workflows with our first-class SDKs.

 ### No-code automation

 [  Zapier ](https://scrapfly.io/integration/zapier) [  Make ](https://scrapfly.io/integration/make) [  n8n ](https://scrapfly.io/integration/n8n) 

 

### LLM & RAG frameworks

 [  LlamaIndex ](https://scrapfly.io/integration/llamaindex) [  LangChain ](https://scrapfly.io/integration/langchain) [  CrewAI ](https://scrapfly.io/integration/crewai) 

 

### First-class SDKs

- [Python](https://scrapfly.io/docs/sdk/python) - `pip install scrapfly-sdk`
- [TypeScript](https://scrapfly.io/docs/sdk/typescript) - Node, Deno, Bun
- [Go](https://scrapfly.io/docs/sdk/golang) - `go get scrapfly-sdk`
- [Rust](https://scrapfly.io/docs/sdk/rust) - `cargo add scrapfly-sdk`
- [Scrapy](https://scrapfly.io/docs/sdk/scrapy) - full-featured extension

 

 

 [ See all integrations  ](https://scrapfly.io/integration) 

 

---

// FAQ

## Frequently Asked Questions

 

  ### What is the Crawler API and how is it different from the Web Scraping API?

 The Web Scraping API retrieves single pages you specify by URL. The Crawler API takes a seed URL and automatically discovers and scrapes every linked page on the domain. Use the scraping API for targeted extraction; use the Crawler API when you need a whole site and don't know all the URLs upfront.

 

   ### How do I control which pages get crawled?

Use `include_only_paths` to whitelist URL patterns (e.g. `/blog/*`), `exclude_paths` to blacklist paths, `max_depth` to limit link depth from the seed, and `page_limit` to cap total pages. Set `follow_external_links=False` to stay on the seed host. All options combine, so you can scope crawls to exactly what you need.

 

   ### Does anti-bot bypass apply to every crawled page?

 Yes. Anti-bot, proxy selection, and all WSA parameters configured on the crawl job are applied automatically to every URL the crawler discovers. You don't need to re-enable them per page. Cloudflare, DataDome, Akamai, PerimeterX, Kasada, and others are all handled transparently.

 

   ### How do I receive results as the crawl runs?

 Set a `webhook` URL on the job. The API sends a POST to your endpoint for each event: `url_visited`, `url_discovered`, `url_failed`, `url_skipped`, `crawler_started`, `crawler_finished`, and more. The Python SDK includes typed dataclasses for each event so you can dispatch without manual JSON parsing.

 

   ### Can I extract structured data from every crawled page?

 Yes. Set `extraction_model` to a pre-trained model (`product`, `article`, `review`), define an `extraction_template` with CSS/XPath/JMESPath selectors, or use `extraction_prompt` for LLM-based free-form extraction. The extracted data is included in the WARC artifact and webhook payloads per page.

 

   ### Can I crawl JavaScript-heavy single-page applications?

 Yes. Set `render_js=True` on the crawl job and every page is loaded in a real Chrome browser (Scrapium, our stealth Chromium). Dynamic content, AJAX-loaded links, and lazy-loaded data are all captured before the page is stored.
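A hedged sketch of what that looks like with the Python SDK, following the quickstart pattern above and assuming `CrawlerConfig` forwards `render_js` as the answer describes:

```python
# Sketch: browser rendering on every crawled page. render_js as a
# CrawlerConfig parameter is taken from the answer above -- verify in the docs.
from scrapfly import ScrapflyClient, CrawlerConfig, Crawl

client = ScrapflyClient(key="API KEY")

crawl = Crawl(
    client,
    CrawlerConfig(
        url="https://web-scraping.dev/products",
        page_limit=50,
        render_js=True,  # each discovered page loads in a real Chrome browser
    ),
).crawl().wait()
```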

 

   ### Is web crawling legal?

Crawling publicly accessible data is legal in most jurisdictions. Meta v. Bright Data and hiQ v. LinkedIn have established strong precedent for scraping public data. You remain responsible for respecting robots.txt, rate limits, and target terms of service. See [our legal overview](https://scrapfly.io/is-web-scraping-legal) for details.

 

  

 

  ---

// PRICING

## Transparent, usage-based pricing

One plan covers the full Scrapfly platform. Pick a monthly credit budget; every API shares the same credit pool. No per-product lock-in, no surprise line items.

 

- **Free tier** - 1,000 free credits on signup. No credit card required.
- **Pay on success** - you only pay for successful requests. Failed calls are free.
- **No lock-in** - upgrade, downgrade, or cancel anytime. No contract.

 

 

 

 [ See pricing  ](https://scrapfly.io/pricing) [ Start free ](https://scrapfly.io/register) 

 

 

### Need per-page control instead of a full crawl?

 The Crawler API handles full-site traversal. Each layer is also available standalone: [Web Scraping API](https://scrapfly.io/products/web-scraping-api) for individual page scraping with [anti-bot bypass](https://scrapfly.io/bypass), [Extraction API](https://scrapfly.io/products/extraction-api) for structured data from any HTML, [AI Web Scraping API](https://scrapfly.io/ai-web-scraping-api) for scrape + extract in one call, [Browser API](https://scrapfly.io/products/cloud-browser-api) for hosted headless Chrome, [Scrapium](https://scrapfly.io/scrapium) for stealth Chromium you drive with Playwright, or [Curlium](https://scrapfly.io/curlium) for byte-perfect HTTP without a browser.

 

[Get Free API Key](https://scrapfly.io/register)

1,000 free credits. No card.