10k+
pages crawled per job in production
8
webhook event types, per-page or on completion
190+
countries for geo-targeted proxies
55k+
developers building on the platform
One Job, Entire Website
Scope rules, depth control, extraction, streaming. All composable, all on one endpoint.
Crawl Pipeline
Provide one seed URL. The crawler discovers every reachable page within your scope, fetches each one through the full Scrapfly stack with anti-bot bypass applied, and streams results as they arrive. Nothing to poll, nothing to stitch together.
Crawl Scope Rules
Seed a URL and define exactly what to follow. Stay on the same host, narrow to URL patterns, block external links, limit depth, or cap total pages. Sitemap ingestion and robots.txt respect are both toggleable. The crawler queues and deduplicates URLs automatically.
Output Formats
Every crawled page is available in the format your pipeline needs. Pull discovered URLs as streaming plain text via GET /urls (gzip supported). Fetch full page results via GET /pages as JSON or JSONL. Request markdown output for LLM pipelines or WARC for archiving.
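As a sketch of consuming the JSONL variant of GET /pages: each line is one JSON document, so results can be processed as a stream without loading the whole response. The field names below (url, status_code) are illustrative assumptions, not the documented schema.

```python
import json

def iter_jsonl_pages(body: str):
    """Yield one dict per non-empty JSONL line, e.g. from a GET /pages response."""
    for line in body.splitlines():
        line = line.strip()
        if line:
            yield json.loads(line)

# Mocked two-page JSONL body for illustration:
sample = (
    '{"url": "https://example.com/", "status_code": 200}\n'
    '{"url": "https://example.com/about", "status_code": 200}\n'
)
pages = list(iter_jsonl_pages(sample))
```

The same iterator works unchanged on a streamed response read line by line.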
Anti-Bot Per Page, Automatic
Every URL the crawler discovers runs through the full Scrapfly anti-bot stack. Anti-bot bypass, proxy selection, JS rendering, and custom headers are configured once on the job and applied to every page. Failed pages are retried automatically at no extra credit cost.
See all anti-bot bypasses →
Webhook Streaming
Add a webhook URL and results stream to your endpoint as each page is visited. No polling, no held connections. Eight event types cover the full crawl lifecycle, from first discovery to job completion. The Python SDK ships typed dataclasses for each event.
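The SDK's typed dataclasses are the supported way to consume events; as a hedged sketch of the manual equivalent for any HTTP framework, a handler can dispatch on the event name in the payload. The payload shape here (a top-level "event" key plus a "url" field) is an assumption for illustration.

```python
import json

def handle_crawler_event(raw: bytes) -> str:
    """Dispatch a webhook POST body by event type (payload shape assumed)."""
    payload = json.loads(raw)
    event = payload.get("event", "")
    if event == "crawler_url_visited":
        return f"visited {payload.get('url')}"
    if event == "crawler_finished":
        return "crawl complete"
    return f"unhandled event: {event}"

msg = handle_crawler_event(
    b'{"event": "crawler_url_visited", "url": "https://example.com/"}'
)
```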
Per-Page AI Extraction
Apply structured extraction to every page in the crawl. Pick a pre-trained model (product, article, review), define a template with CSS/XPath selectors, or send an extraction_prompt for LLM-based free-form extraction. Extracted JSON is included in each webhook payload and in the GET /pages response. Pipes through the same Extraction API used standalone.
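For a sense of what a selector-based template looks like, here is an illustrative shape; the exact schema is defined by the Extraction API documentation, so treat every field name below as an assumption rather than the canonical format.

```python
# Illustrative extraction template: one named selector per field.
# Field names ("selectors", "type", "query", "extractor") are assumptions;
# consult the Extraction API docs for the real schema.
extraction_template = {
    "selectors": [
        {"name": "title", "type": "css", "query": "h1.product-title", "extractor": "text"},
        {"name": "price", "type": "css", "query": "span.price", "extractor": "text"},
        {"name": "sku", "type": "xpath", "query": "//meta[@itemprop='sku']/@content"},
    ]
}
```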
WARC Results
All crawled pages are stored as WARC artifacts. Download the full archive after completion or stream per-page via webhook. WARC format is standard across SEO tools, archiving pipelines, and LLM training datasets.
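To illustrate the WARC record layout, here is a minimal stdlib sketch that parses the header block of a single record; real archives are better read with a dedicated library such as warcio.

```python
def parse_warc_headers(record: str) -> dict:
    """Parse the header block of one WARC record (minimal sketch)."""
    headers = {}
    lines = record.split("\r\n")
    headers["WARC-Version"] = lines[0]  # e.g. "WARC/1.1"
    for line in lines[1:]:
        if not line:
            break  # blank line ends the header block; body follows
        key, _, value = line.partition(": ")
        headers[key] = value
    return headers

sample = (
    "WARC/1.1\r\n"
    "WARC-Type: response\r\n"
    "WARC-Target-URI: https://example.com/\r\n"
    "\r\n"
    "<html>...</html>"
)
hdrs = parse_warc_headers(sample)
```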
Live Observability
Every crawl job has a live dashboard showing URLs visited, discovered, failed, and skipped in real time. Replay any individual page scrape from the log viewer.
URL Deduplication
URLs are normalized and deduplicated before queuing. Query string ordering, trailing slashes, and fragment identifiers are handled so no page is crawled twice. Combine with page_limit to stay within budget.
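The normalization described above can be sketched with the standard library: sort query parameters, drop the fragment, and strip a trailing slash. This is an illustration of the idea, not the crawler's exact canonicalization rules.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize_url(url: str) -> str:
    """Canonicalize a URL so trivially-different variants compare equal."""
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query)))  # stable param order
    path = parts.path.rstrip("/") or "/"               # trailing-slash variants
    return urlunsplit((parts.scheme, parts.netloc, path, query, ""))  # drop fragment

# Both variants normalize to the same key, so the page is queued once:
a = normalize_url("https://example.com/docs/?b=2&a=1#top")
b = normalize_url("https://example.com/docs?a=1&b=2")
```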
Built for Every Large-Scale Crawl Use Case
The Crawler API composes with the full Scrapfly product line. Use standalone for URL discovery and archiving, or pair with extraction and AI features for richer pipelines.
Crawl Any Site in Four Lines
Start with Quickstart, refine with scope, extract in the same call.
Seed URL, depth limit, hit start.
from scrapfly import ScrapflyClient, CrawlerConfig, Crawl
client = ScrapflyClient(key="API KEY")
# Start the crawl job; links are discovered and scraped automatically
crawl = Crawl(
    client,
    CrawlerConfig(
        url='https://web-scraping.dev/products',
        page_limit=50,
        max_depth=3,
    )
).crawl().wait()
# Retrieve all crawled pages
pages = crawl.warc().get_pages()
for page in pages:
    print(page['url'], page['status_code'])
import { ScrapflyClient, CrawlerConfig, Crawl } from 'jsr:@scrapfly/scrapfly-sdk';
const client = new ScrapflyClient({ key: "API KEY" });
const crawl = new Crawl(
  client,
  new CrawlerConfig({
    url: 'https://web-scraping.dev/products',
    page_limit: 50,
    max_depth: 3,
  })
);
await crawl.start();
await crawl.wait();
const status = await crawl.status();
console.log(`visited=${status.state.urls_visited}`);
const md = await crawl.read('https://web-scraping.dev/products', 'markdown');
console.log(md);
curl "https://api.scrapfly.io/crawl" \
  -d '{"url":"https://web-scraping.dev/products","page_limit":50,"max_depth":3}' \
  -H "Content-Type: application/json" \
  -H "X-Api-Key: $SCRAPFLY_KEY"
Include/exclude paths, follow-external toggles, max pages, max depth.
from scrapfly import ScrapflyClient, CrawlerConfig, Crawl
client = ScrapflyClient(key="API KEY")
crawl = Crawl(
    client,
    CrawlerConfig(
        url='https://example.com',
        # Only follow links matching these patterns
        include_only_paths=['/blog/*', '/docs/*'],
        # Skip these paths entirely
        exclude_paths=['/tags/*', '/author/*'],
        # Stay on the same host
        follow_external_links=False,
        page_limit=200,
        max_depth=5,
    )
).crawl().wait()
print(f"Crawled {crawl.status().state.urls_visited} pages")
import { ScrapflyClient, CrawlerConfig, Crawl } from 'jsr:@scrapfly/scrapfly-sdk';
const client = new ScrapflyClient({ key: "API KEY" });
const crawl = new Crawl(
  client,
  new CrawlerConfig({
    url: 'https://example.com',
    // Only follow links matching these patterns
    include_only_paths: ['/blog/*', '/docs/*'],
    // Skip these paths entirely
    exclude_paths: ['/tags/*', '/author/*'],
    // Stay on the same host
    follow_external_links: false,
    page_limit: 200,
    max_depth: 5,
  })
);
await crawl.start();
await crawl.wait();
const status = await crawl.status();
console.log(`Crawled ${status.state.urls_visited} pages`);
curl -X POST "https://api.scrapfly.io/crawl?key=$SCRAPFLY_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "include_only_paths": ["/blog/*", "/docs/*"],
    "exclude_paths": ["/tags/*", "/author/*"],
    "follow_external_links": false,
    "page_limit": 200,
    "max_depth": 5
  }'
Emit LLM-ready formats, or apply AI models and templates to every crawled page. Pipes through the Extraction API.
from scrapfly import ScrapflyClient, CrawlerConfig, Crawl
client = ScrapflyClient(key="API KEY")
crawl = Crawl(
    client,
    CrawlerConfig(
        url='https://web-scraping.dev/products',
        page_limit=50,
        # Emit markdown + clean_html for every page,
        # ready for LLM / RAG pipelines.
        content_formats=['markdown', 'clean_html'],
    )
).crawl().wait()
# Stream markdown for every visited page via a URL pattern
for page in crawl.read_iter(pattern='*', format='markdown'):
    print(page.url, len(page.content), 'chars')
import { ScrapflyClient, CrawlerConfig, Crawl } from 'jsr:@scrapfly/scrapfly-sdk';
const client = new ScrapflyClient({ key: "API KEY" });
const crawl = new Crawl(
  client,
  new CrawlerConfig({
    url: 'https://web-scraping.dev/products',
    page_limit: 50,
    // Emit markdown + clean_html for every page,
    // ready for LLM / RAG pipelines.
    content_formats: ['markdown', 'clean_html'],
  })
);
await crawl.start();
await crawl.wait();
// Read markdown for the seed URL
const md = await crawl.read('https://web-scraping.dev/products', 'markdown');
console.log(`markdown length=${md?.length ?? 0} chars`);
curl -X POST "https://api.scrapfly.io/crawl?key=$SCRAPFLY_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://web-scraping.dev/products",
    "page_limit": 50,
    "content_formats": ["markdown", "clean_html"]
  }'
Fire-and-forget. Results are pushed to your webhook per page.
from scrapfly import ScrapflyClient, CrawlerConfig, Crawl
client = ScrapflyClient(key="API KEY")
# Fire-and-forget: register a webhook in the Scrapfly dashboard
# ("my-crawl-endpoint"), then pass its name here.
crawl = Crawl(
    client,
    CrawlerConfig(
        url='https://web-scraping.dev/products',
        page_limit=500,
        webhook_name='my-crawl-endpoint',
        webhook_events=['crawler_url_visited', 'crawler_finished'],
    )
).crawl()  # returns immediately, no .wait() needed
print(f"Crawl started: {crawl.uuid}")
import { ScrapflyClient, CrawlerConfig, Crawl } from 'jsr:@scrapfly/scrapfly-sdk';
const client = new ScrapflyClient({ key: "API KEY" });
// Fire-and-forget: register a webhook in the Scrapfly dashboard
// ("my-crawl-endpoint"), then pass its name here.
const crawl = new Crawl(
  client,
  new CrawlerConfig({
    url: 'https://web-scraping.dev/products',
    page_limit: 500,
    webhook_name: 'my-crawl-endpoint',
    webhook_events: ['crawler_url_visited', 'crawler_finished'],
  })
);
await crawl.start(); // returns immediately, no .wait() needed
console.log(`Crawl started: ${crawl.uuid}`);
curl -X POST "https://api.scrapfly.io/crawl?key=$SCRAPFLY_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://web-scraping.dev/products",
    "page_limit": 500,
    "webhook_name": "my-crawl-endpoint",
    "webhook_events": ["crawler_url_visited", "crawler_finished"]
  }'
Docs, Tools, And Ready-Made Crawlers
Everything you need to go from zero to production crawl pipeline.
API Reference
Every parameter, every response field, with runnable cURL examples for the Crawler API.
View Crawler API docs →
Academy
Interactive courses on web scraping, anti-bot bypass, and data extraction at scale.
View Academy →
Open-Source Scrapers
40+ production-ready scrapers on GitHub. Copy, paste, customize for your use case.
View scrapers repo →
Developer Tools
cURL-to-Python, JA3 checker, selector tester, HTTP/2 fingerprint inspector.
View developer tools →
Seamlessly integrate with frameworks & platforms
Plug Scrapfly into your favorite tools, or build custom workflows with our first-class SDKs.
LLM & RAG frameworks
Frequently Asked Questions
What is the Crawler API and how is it different from the Web Scraping API?
The Web Scraping API retrieves single pages you specify by URL. The Crawler API takes a seed URL and automatically discovers and scrapes every linked page on the domain. Use the scraping API for targeted extraction; use the Crawler API when you need a whole site and don't know all the URLs upfront.
How do I control which pages get crawled?
Use include_only_paths to allow only matching URL patterns (e.g. /blog/*), exclude_paths to block specific paths, max_depth to limit link depth from the seed, and page_limit to cap total pages. Set follow_external_links=False to stay on the seed host. All options combine, so you can scope crawls to exactly the pages you need.
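The include/exclude logic can be sketched with glob matching against the URL path; the crawler's server-side matching may differ in edge cases, so treat this as an approximation of the behavior, not its implementation.

```python
from fnmatch import fnmatch
from urllib.parse import urlsplit

def in_scope(url, include_only=None, exclude=None):
    """Approximate include_only_paths / exclude_paths scoping with globs."""
    path = urlsplit(url).path
    if exclude and any(fnmatch(path, p) for p in exclude):
        return False  # exclusions win
    if include_only:
        return any(fnmatch(path, p) for p in include_only)
    return True  # no include list: everything on-scope passes

ok = in_scope(
    "https://example.com/blog/post-1",
    include_only=["/blog/*", "/docs/*"],
    exclude=["/tags/*"],
)
skipped = in_scope(
    "https://example.com/tags/python",
    include_only=["/blog/*"],
    exclude=["/tags/*"],
)
```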
Does anti-bot bypass apply to every crawled page?
Yes. Anti-bot bypass, proxy selection, and all Web Scraping API parameters configured on the crawl job are applied automatically to every URL the crawler discovers. You don't need to re-enable them per page. Cloudflare, DataDome, Akamai, PerimeterX, Kasada, and others are all handled transparently.
How do I receive results as the crawl runs?
Set a webhook URL on the job. The API sends a POST to your endpoint for each event: crawler_url_visited, crawler_url_discovered, crawler_url_failed, crawler_url_skipped, crawler_started, crawler_finished, and more. The Python SDK includes typed dataclasses for each event so you can dispatch without manual JSON parsing.
Can I extract structured data from every crawled page?
Yes. Set extraction_model to a pre-trained model (product, article, review), define an extraction_template with CSS/XPath/JMESPath selectors, or use extraction_prompt for LLM-based free-form extraction. The extracted data is included in the WARC artifact and webhook payloads per page.
Can I crawl JavaScript-heavy single-page applications?
Yes. Set render_js=True on the crawl job and every page is loaded in a real Chrome browser (Scrapium, our stealth Chromium). Dynamic content, AJAX-loaded links, and lazy-loaded data are all captured before the page is stored.
Is web crawling legal?
Crawling publicly accessible data is legal in most jurisdictions. Decisions such as Meta v. Bright Data and hiQ v. LinkedIn have set important precedents for scraping public data. You remain responsible for respecting robots.txt, rate limits, and target terms of service. See our legal overview for details.
Transparent, usage-based pricing
One plan covers the full Scrapfly platform. Pick a monthly credit budget; every API shares the same credit pool. No per-product lock-in, no surprise line items.
1,000 free credits on signup. No credit card required.
You only pay for successful requests. Failed calls are free.
Upgrade, downgrade, or cancel anytime. No contract.
Need per-page control instead of a full crawl?
The Crawler API handles full-site traversal. Each layer is also available standalone: Web Scraping API for individual page scraping with anti-bot bypass, Extraction API for structured data from any HTML, AI Web Scraping API for scrape + extract in one call, Browser API for hosted headless Chrome, Scrapium for stealth Chromium you drive with Playwright, or Curlium for byte-perfect HTTP without a browser.