# Crawl4AI Guide: Web Crawling for LLMs, RAG, and AI Agents

by [Ziad Shamndy](https://scrapfly.io/blog/author/ziad), Apr 18, 2026, 21 min read

 

 

         

Web crawling for AI workflows looks nothing like traditional scraping. RAG pipelines, AI agents, and fine-tuning datasets need clean Markdown, structured JSON, and intelligent crawl strategies that know when to stop, not just raw HTML dumps. Crawl4AI is built for exactly that gap.

In this tutorial, we walk through Crawl4AI v0.8.x end to end. We install the framework, run a first crawl with `BrowserConfig` and `CrawlerRunConfig`, extract structured data with both CSS selectors and LLM-based strategies, run deep crawls with BFS and BestFirst, and ship the result through Docker or the `crwl` CLI. Every code example uses the current API and runs against a real target.

## Key Takeaways

- Crawl4AI v0.8.x gives developers a complete toolkit for AI-friendly web crawling, from single-page Markdown extraction to deep crawling and Docker deployment.
- Use Crawl4AI v0.8.x with `BrowserConfig` and `CrawlerRunConfig` for full control over the browser and crawl behavior
- Extract structured data with `JsonCssExtractionStrategy` for predictable HTML or `LLMExtractionStrategy` with `LLMConfig` and Pydantic schemas for complex pages
- Run deep crawls with `BFSDeepCrawlStrategy`, `DFSDeepCrawlStrategy`, or `BestFirstCrawlingStrategy`, controlled by `FilterChain` and scorers
- Use `AdaptiveCrawler.digest()` to stop crawling automatically when enough relevant content has been gathered
- Crawl multiple URLs concurrently with `arun_many()` and stream results as they arrive
- Deploy Crawl4AI as a Docker container with a monitoring dashboard, or run crawls directly with the `crwl` CLI
- Scale beyond local infrastructure with Scrapfly's Crawler API for managed anti-bot bypass and AI-ready output








## What Is Crawl4AI?

Crawl4AI is an open-source Python framework that turns web pages into clean, LLM-ready data, including Markdown, JSON, and structured schemas, for RAG pipelines, AI agents, and data extraction workflows.

Where traditional tools like [Scrapy and BeautifulSoup](https://scrapfly.io/blog/posts/web-scraping-with-python) focus on raw HTML extraction, Crawl4AI is built around the needs of LLM workflows. It ships with first-class Markdown output, a content-pruning step that strips boilerplate before generation, and built-in extraction strategies that call any LLM provider via LiteLLM.

A few capabilities set Crawl4AI apart for AI workloads:

- **LLM-friendly output** with `result.markdown.raw_markdown` and `result.markdown.fit_markdown` (filtered, boilerplate-stripped)
- **Adaptive crawling** through the `AdaptiveCrawler` class that stops once enough information is gathered
- **Deep crawling** with BFS, DFS, and BestFirst strategies, plus URL filters and relevance scorers
- **Anti-bot detection** (added in v0.8.5) with automatic proxy escalation
- **Async architecture** designed around `arun_many()` for high-throughput crawls

For a side-by-side look at how Crawl4AI fits next to other crawlers, the guide below covers the broader landscape.

[Top Web Crawler Tools in 2026Web crawling has evolved dramatically in 2026. What used to require complex infrastructure and constant maintenance can now be accomplished with a few clicks or lines of code. But with so many options available, choosing the right tool can make or break your data collection project.](https://scrapfly.io/blog/posts/top-web-crawler-tools)



## How Do You Install and Set Up Crawl4AI?

Crawl4AI installs in three steps: `pip install` the package, run `crawl4ai-setup` to fetch the Playwright Chromium build, then run `crawl4ai-doctor` to confirm everything works. The full process takes about a minute on a clean environment and requires Python 3.10 or newer.

### How Do You Install Crawl4AI with pip?

A virtual environment keeps Crawl4AI's Playwright build from clashing with other projects. The single `pip install` pulls in the async crawler, the `crwl` CLI, and the Playwright Python bindings.

```bash
python -m venv .venv && source .venv/bin/activate
pip install -U crawl4ai
crawl4ai-setup
```



`crawl4ai-setup` downloads the Chromium binary, initializes the local cache database, and verifies the OS dependencies Playwright needs.

### How Do You Verify Your Crawl4AI Installation?

`crawl4ai-doctor` runs an end-to-end smoke test: Python version check, Playwright availability, and a real headless browser launch against a sample page.

```bash
crawl4ai-doctor
```



If the browser step fails, most often on minimal Linux containers missing system libraries, reinstall Chromium with its OS-level dependencies:

```bash
python -m playwright install --with-deps chromium
```



On Debian and Ubuntu the `--with-deps` flag pulls in `libnss3`, `libatk1.0-0`, and the other shared libraries Chromium expects. On macOS and Windows the base `crawl4ai-setup` step is normally enough.

For a deeper look at the browser layer Crawl4AI sits on top of, the guide below covers Playwright in depth.

[Web Scraping with Playwright and PythonPlaywright is the new, big browser automation toolkit - can it be used for web scraping? In this introduction article, we'll take a look how can we use Playwright and Python to scrape dynamic websites.](https://scrapfly.io/blog/posts/web-scraping-with-playwright-and-python)



## How Do You Run Your First Crawl with Crawl4AI?

The pattern is always the same: build an `AsyncWebCrawler` from a `BrowserConfig`, hand `arun()` a URL plus a `CrawlerRunConfig`, then read the formatted output off the result. Every other feature in Crawl4AI (extraction, deep crawling, anti-bot, adaptive stopping) layers onto these same two config objects, so getting comfortable with them is the highest-leverage step in this tutorial.

### What Does a Minimal Crawl Look Like?

The script below launches a headless browser, fetches a page, and prints the first 300 characters of the auto-generated Markdown. `CacheMode.BYPASS` forces a fresh fetch on every run, which is what you want during development so you are not debugging stale cache hits.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    browser_config = BrowserConfig(headless=True, verbose=True)
    run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://web-scraping.dev/products",
            config=run_config,
        )
        print(f"Success: {result.success}")
        print(f"Markdown length: {len(result.markdown.raw_markdown)} chars")
        print(result.markdown.raw_markdown[:300])

asyncio.run(main())
```



Three things are worth noting in this snippet. First, the `async with` block makes sure the browser shuts down cleanly even if the crawl raises. Second, `verbose=True` on the `BrowserConfig` is what produces the `[FETCH]`, `[SCRAPE]`, and `[COMPLETE]` log lines printed to the console during the crawl; turn it off in production. Third, `result.markdown.raw_markdown` is the unfiltered Markdown conversion of the rendered DOM, which is the right starting point before you decide what filtering to apply.

### How Do BrowserConfig and CrawlerRunConfig Work?

The split between `BrowserConfig` and `CrawlerRunConfig` is the most important design decision in the framework:

- **`BrowserConfig`** controls the browser *process* -- headless mode, user agent, proxy, viewport, persistent user profile. It is set once when the crawler is created.
- **`CrawlerRunConfig`** controls a *single crawl* -- cache mode, extraction strategy, content filter, JavaScript to inject, screenshots. It is passed on every `arun()` call and can vary per URL.

This separation lets you launch one expensive browser instance and reuse it across many different crawl configurations, keeping memory and startup cost predictable on long-running jobs.
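
Here is a quick sketch of that reuse pattern: one browser process, two run configs, both hitting pages used elsewhere in this guide. The `screenshot` flag is one of the per-run options mentioned above.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    # One browser process, created once from BrowserConfig...
    async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
        # ...reused across two differently configured crawls
        plain = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
        with_screenshot = CrawlerRunConfig(cache_mode=CacheMode.BYPASS, screenshot=True)

        listing = await crawler.arun("https://web-scraping.dev/products", config=plain)
        product = await crawler.arun("https://web-scraping.dev/product/1", config=with_screenshot)

        print(f"Listing markdown: {len(listing.markdown.raw_markdown)} chars")
        print(f"Product screenshot captured: {product.screenshot is not None}")

asyncio.run(main())
```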

The example below shows that split in practice: the same browser, but with a `PruningContentFilter` that strips boilerplate (navigation, footers, ads) from the Markdown before it ever reaches an LLM.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    md_generator = DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(threshold=0.48, threshold_type="fixed")
    )
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        markdown_generator=md_generator,
    )

    async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
        result = await crawler.arun(
            url="https://web-scraping.dev/products",
            config=run_config,
        )
        print(f"Raw markdown: {len(result.markdown.raw_markdown)} chars")
        print(f"Fit markdown: {len(result.markdown.fit_markdown)} chars")

asyncio.run(main())
```



The `threshold` value (0.48 here) is a per-block content density score; anything below the threshold is treated as boilerplate and dropped. The output of this script gives you both versions side by side: `raw_markdown` is the full conversion, `fit_markdown` is the pruned version.

### What Output Formats Does Crawl4AI Produce?

A single `CrawlResult` exposes the same page in several formats simultaneously, so you do not have to re-crawl when you change your mind about which representation you need. Pick the field that matches your downstream consumer.

| Output | Field | Best for |
|---|---|---|
| Raw Markdown | `result.markdown.raw_markdown` | RAG ingestion, LLM context |
| Fit Markdown | `result.markdown.fit_markdown` | Token-sensitive LLM input |
| Cleaned HTML | `result.cleaned_html` | Re-parsing with BeautifulSoup |
| Raw HTML | `result.html` | Forensics, full-fidelity storage |
| Structured JSON | `result.extracted_content` | Direct app/database ingestion |
| Metadata | `result.metadata` | Title, description, language |

Use `fit_markdown` for anything that touches a paid LLM API, `raw_markdown` for retrieval indexes where recall matters more than token count, `cleaned_html` when you need to run a follow-up parser, and `extracted_content` when you have already configured a CSS or LLM extraction strategy.
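
As a small illustration, the sketch below reads several of those representations off one `CrawlResult` from the same products page used earlier; `extracted_content` stays `None` here because no extraction strategy is configured.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
    async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
        result = await crawler.arun("https://web-scraping.dev/products", config=run_config)

        # One crawl, several views of the same page
        print("Title:", result.metadata.get("title"))
        print("Raw markdown:", len(result.markdown.raw_markdown), "chars")
        print("Cleaned HTML:", len(result.cleaned_html), "chars")
        print("Structured JSON:", result.extracted_content)  # None without a strategy

asyncio.run(main())
```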



## How Does Crawl4AI Extract Structured Data?

Crawl4AI offers two extraction approaches: CSS or XPath selectors for predictable HTML structures, and LLM-based extraction for irregular or complex pages. Both run inside the same `CrawlerRunConfig`, which means you can swap strategies without rewriting the surrounding crawl code.

### How Does CSS-Based Extraction Work in Crawl4AI?

CSS extraction uses `JsonCssExtractionStrategy` with a schema describing a base selector and the fields to pull from each match. It is fast, free, and deterministic, which makes it the right default for sites with stable HTML.

```python
import asyncio, json
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "Products",
    "baseSelector": "div.product",
    "fields": [
        {"name": "title", "selector": "h3", "type": "text"},
        {"name": "price", "selector": ".price", "type": "text"},
        {"name": "link",  "selector": "a",   "type": "attribute", "attribute": "href"},
    ],
}

async def main():
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=JsonCssExtractionStrategy(schema, verbose=True),
    )
    async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
        result = await crawler.arun("https://web-scraping.dev/products", config=run_config)
        data = json.loads(result.extracted_content)
        print(f"Extracted {len(data)} products")
        print(json.dumps(data[0], indent=2))

asyncio.run(main())
```



### How Does LLM-Based Extraction Work in Crawl4AI?

When the HTML is irregular, or when you want semantic extraction (sentiment, classification, summarization), `LLMExtractionStrategy` with an `LLMConfig` and a Pydantic schema is the better fit. The strategy sends the cleaned page to the model and validates the response against the schema.

```python
import asyncio, json, os
from pydantic import BaseModel, Field
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class Product(BaseModel):
    title: str = Field(..., description="Product name")
    price: str = Field(..., description="Price including currency")
    description: str = Field("", description="Short product description")

async def main():
    llm_strategy = LLMExtractionStrategy(
        llm_config=LLMConfig(
            provider="openai/gpt-4o-mini",
            api_token=os.getenv("OPENAI_API_KEY"),
        ),
        schema=Product.model_json_schema(),
        extraction_type="schema",
        instruction="Extract every product on the page as Product objects.",
    )
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=llm_strategy,
    )

    async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
        result = await crawler.arun("https://web-scraping.dev/products", config=run_config)
        print(json.dumps(json.loads(result.extracted_content)[:2], indent=2))

asyncio.run(main())
```



Swap `provider="openai/gpt-4o-mini"` for `provider="ollama/llama3.3"` and `api_token=None` to run the same extraction against a free local model. Crawl4AI uses LiteLLM under the hood, so any provider LiteLLM supports works here.
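
For example, here is the same extraction pointed at a local Ollama model. This is a minimal sketch that assumes Ollama is running locally with the `llama3.3` model pulled.

```python
import asyncio
from pydantic import BaseModel, Field
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class Product(BaseModel):
    title: str = Field(..., description="Product name")
    price: str = Field(..., description="Price including currency")

async def main():
    # Same strategy as above, but the provider now points at a local model
    llm_strategy = LLMExtractionStrategy(
        llm_config=LLMConfig(provider="ollama/llama3.3", api_token=None),
        schema=Product.model_json_schema(),
        extraction_type="schema",
        instruction="Extract every product on the page as Product objects.",
    )
    run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS, extraction_strategy=llm_strategy)

    async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
        result = await crawler.arun("https://web-scraping.dev/products", config=run_config)
        print(result.extracted_content)

asyncio.run(main())
```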

### How Can You Auto-Generate Extraction Schemas?

Manually writing CSS selectors is tedious. `JsonCssExtractionStrategy.generate_schema()` calls an LLM once on a sample page, returns a reusable CSS schema, and from then on every extraction runs LLM-free at CSS speed.

```python
from crawl4ai import LLMConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
import requests, json

html = requests.get("https://web-scraping.dev/products").text
schema = JsonCssExtractionStrategy.generate_schema(
    html=html,
    target_json_example='{"title": "...", "price": "...", "link": "..."}',
    llm_config=LLMConfig(provider="openai/gpt-4o-mini", api_token="your-openai-key"),
)
print(json.dumps(schema, indent=2))
```



The first call costs a few cents in LLM tokens, every call after that is free and runs in milliseconds.
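
To make that concrete, here is a hedged sketch of the reuse step. It assumes you saved the generated schema to a file named `products_schema.json` (a hypothetical name) and loads it back into a plain CSS extraction with no LLM calls.

```python
import asyncio, json
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def main():
    # Load the schema produced by generate_schema() above
    # (assumes it was saved to this hypothetical file name)
    with open("products_schema.json") as f:
        schema = json.load(f)

    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=JsonCssExtractionStrategy(schema),  # pure CSS, no LLM involved
    )
    async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
        result = await crawler.arun("https://web-scraping.dev/products", config=run_config)
        print(json.loads(result.extracted_content)[:2])

asyncio.run(main())
```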

[Guide to LLM Training, Fine-Tuning, and RAGExplore LLM training, fine-tuning, and RAG. Learn how to leverage pre-trained models for custom tasks and real-time knowledge retrieval.](https://scrapfly.io/blog/posts/guide-to-llm-training-fine-tuning-and-rag)



## How Does Deep Crawling Work in Crawl4AI?

Deep crawling lets Crawl4AI explore an entire website beyond a single page, following links across multiple levels with configurable strategies, filters, and scoring. Three strategies ship out of the box: BFS, DFS, and BestFirst, each plugging into the same `CrawlerRunConfig`.

### What Is BFS Deep Crawling?

BFS (breadth-first search) explores all links at the current depth before going deeper. It is the right default when you want full coverage of a site to a fixed depth, for example mirroring documentation.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

async def main():
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=2, include_external=False),
    )
    async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
        results = await crawler.arun("https://docs.crawl4ai.com/", config=run_config)
        print(f"Pages crawled: {len(results)}")
        for r in results[:5]:
            print(f"- depth {r.metadata.get('depth')}: {r.url}")

asyncio.run(main())
```



### How Do Filters and Scorers Control Deep Crawls?

A `FilterChain` decides which discovered URLs are even worth fetching. Common filters include `URLPatternFilter`, `DomainFilter`, and `ContentTypeFilter`, and you can combine any number of them.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
from crawl4ai.deep_crawling.filters import FilterChain, URLPatternFilter, DomainFilter

async def main():
    filter_chain = FilterChain([
        URLPatternFilter(patterns=["*core*", "*api*"]),
        DomainFilter(allowed_domains=["docs.crawl4ai.com"]),
    ])
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=2,
            filter_chain=filter_chain,
        ),
    )
    async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
        results = await crawler.arun("https://docs.crawl4ai.com/", config=run_config)
        print(f"Filtered pages crawled: {len(results)}")

asyncio.run(main())
```



`KeywordRelevanceScorer` goes one step further. It scores each discovered URL by how relevant it looks before the crawler bothers fetching it, which is what makes BestFirst possible.

### What Is BestFirst Crawling?

`BestFirstCrawlingStrategy` visits the highest-scoring URLs first, focusing crawl budget on the pages most likely to contain what you care about. It pairs naturally with streaming mode (`stream=True`), so results arrive as they finish instead of in a single batch at the end.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.deep_crawling import BestFirstCrawlingStrategy
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer

async def main():
    scorer = KeywordRelevanceScorer(
        keywords=["extraction", "llm", "schema"], weight=0.7
    )
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        stream=True,
        deep_crawl_strategy=BestFirstCrawlingStrategy(
            max_depth=2, url_scorer=scorer, max_pages=15,
        ),
    )
    async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
        async for result in await crawler.arun("https://docs.crawl4ai.com/", config=run_config):
            score = result.metadata.get("score", 0)
            print(f"[score={score:.2f}] {result.url}")

asyncio.run(main())
```



For very long crawls, Crawl4AI supports crash recovery and prefetch mode for pure URL discovery without rendering. `DFSDeepCrawlStrategy` is also available when you need depth-first exploration.
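
For completeness, here is a minimal DFS sketch, assuming `DFSDeepCrawlStrategy` takes the same `max_depth` and `include_external` arguments as its BFS counterpart:

```python
from crawl4ai import CrawlerRunConfig, CacheMode
from crawl4ai.deep_crawling import DFSDeepCrawlStrategy

# Depth-first: follow one branch of links as far as max_depth allows
# before backtracking, instead of sweeping the site level by level
run_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    deep_crawl_strategy=DFSDeepCrawlStrategy(max_depth=2, include_external=False),
)
```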



## What Is Adaptive Crawling in Crawl4AI?

Adaptive crawling uses information foraging algorithms to automatically stop crawling once Crawl4AI has gathered enough relevant content to answer a query. Instead of telling the crawler "go three levels deep," you tell it "find enough to answer this question" and let it decide when to stop.

The `AdaptiveCrawler` class wraps a regular `AsyncWebCrawler` and exposes a `.digest()` coroutine that takes a starting URL and a query.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig
from crawl4ai.adaptive_crawler import AdaptiveCrawler

async def main():
    async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
        adaptive = AdaptiveCrawler(crawler)
        state = await adaptive.digest(
            start_url="https://docs.crawl4ai.com/",
            query="how to configure llm extraction with pydantic schemas",
        )
        adaptive.print_stats()
        print(f"Confidence: {state.confidence:.2%}")
        for page in adaptive.get_relevant_content(top_k=3):
            print(f"- {page['url']} (score={page['score']:.2f})")

asyncio.run(main())
```



Three signals drive the stop decision: coverage measures how many query terms appear across the collected pages, consistency measures how much those pages agree with one another, and saturation watches for diminishing returns. Adaptive crawling fits research, documentation mining, and any topic-targeted collection.



## How Do You Crawl Multiple URLs Concurrently?

Use `crawler.arun_many()` to crawl many URLs in parallel. Crawl4AI manages concurrency automatically through a `MemoryAdaptiveDispatcher` that scales workers up and down based on available system memory, so you do not need to write your own `asyncio.gather` or semaphore code.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    urls = [
        "https://web-scraping.dev/product/1",
        "https://web-scraping.dev/product/2",
        "https://web-scraping.dev/product/3",
        "https://web-scraping.dev/product/4",
        "https://web-scraping.dev/product/5",
    ]
    run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS, stream=True)

    async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
        async for result in await crawler.arun_many(urls=urls, config=run_config):
            if result.success:
                print(f"OK   {result.url} ({len(result.markdown.raw_markdown)} chars)")
            else:
                print(f"FAIL {result.url}: {result.error_message}")

asyncio.run(main())
```



With `stream=True`, each result is yielded the moment it finishes, so downstream processing (writing to disk, indexing in a vector store) can start immediately. Drop the flag and `arun_many()` returns a list once every URL is done.
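
Here is a minimal batch-mode sketch of that second case, using the same product URLs:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    urls = [f"https://web-scraping.dev/product/{i}" for i in range(1, 6)]
    # No stream flag: arun_many() returns a list once every URL is done
    run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

    async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
        results = await crawler.arun_many(urls=urls, config=run_config)
        ok = [r for r in results if r.success]
        print(f"{len(ok)}/{len(results)} pages succeeded")

asyncio.run(main())
```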

[Concurrency vs ParallelismLearn the key differences between Concurrency and Parallelism and how to leverage them in Python and JavaScript to optimize performance in various computational tasks.](https://scrapfly.io/blog/posts/concurrency-vs-parallelism)



## How Do You Handle Dynamic Content and Anti-Bot Detection?

Crawl4AI handles JavaScript-heavy pages through Playwright's browser engine, and v0.8.5 added an automatic anti-bot detection system with proxy escalation. JS interaction and anti-bot are configured on the same `CrawlerRunConfig` and `BrowserConfig` you have already been using.

### How Does Crawl4AI Execute JavaScript?

Pass JavaScript to the `js_code` parameter to click buttons, scroll, or trigger lazy-loading. Pair it with `wait_for`, which accepts a CSS selector or JS expression, to make sure the dynamic content is in the DOM before extraction runs.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    js_code = [
        "window.scrollTo(0, document.body.scrollHeight);",
        "const btn = document.querySelector('button.load-more'); if (btn) btn.click();",
    ]
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        js_code=js_code,
        wait_for="css:div.product:nth-child(10)",
        page_timeout=15000,
    )
    async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
        result = await crawler.arun("https://web-scraping.dev/products", config=run_config)
        print(f"Loaded markdown: {len(result.markdown.raw_markdown)} chars")

asyncio.run(main())
```



### How Does Crawl4AI Detect and Bypass Anti-Bot Systems?

The v0.8.5 detection system runs three tiers in order, known vendor fingerprints (Cloudflare, DataDome, PerimeterX), generic block indicators in the response, and structural integrity checks on the rendered DOM. When detection fires, Crawl4AI can automatically retry through a proxy chain you configured up front.

```python
import asyncio
from crawl4ai import (
    AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, ProxyConfig
)

async def main():
    proxies = [
        ProxyConfig(server="http://proxy-a.example:8000", username="user", password="pass"),
        ProxyConfig(server="http://proxy-b.example:8000", username="user", password="pass"),
    ]
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        proxy_rotation_strategy="round_robin",
        proxy_config=proxies,
        max_retries=3,
        flatten_shadow_dom=True,
    )
    async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
        result = await crawler.arun("https://web-scraping.dev/login", config=run_config)
        print(f"Success: {result.success} | Status: {result.status_code}")

asyncio.run(main())
```



`flatten_shadow_dom=True` extracts content hidden inside Shadow DOM components, which is increasingly common on modern e-commerce and SPA frameworks. Crawl4AI also exposes stealth mode and persistent user profiles through `BrowserConfig` for stickier sessions.
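
As a hedged sketch of the persistent-profile option, here is roughly what that looks like on `BrowserConfig`; the parameter names are assumptions drawn from Crawl4AI's identity-based crawling docs, so confirm them against your installed version.

```python
from crawl4ai import BrowserConfig

# Persist cookies, local storage, and login state across runs
# (parameter names are assumptions; verify against your Crawl4AI version)
browser_config = BrowserConfig(
    headless=True,
    use_persistent_context=True,
    user_data_dir="./crawl4ai_profile",
)
```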



## How Do You Deploy Crawl4AI with Docker and CLI?

Crawl4AI runs as a Docker container with a built-in monitoring dashboard, or from the command line with the `crwl` CLI; both options work for production deployments. Docker is the right pick when you want a long-running crawl service, while the CLI is best for one-off or scripted jobs.

```bash
docker pull unclecode/crawl4ai:latest
docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:latest
```



Once the container is up, the monitoring dashboard is at `http://localhost:11235/dashboard` and an interactive playground at `http://localhost:11235/playground`. The container also exposes a REST API and an MCP endpoint, which lets AI tools like Claude Code drive crawls directly.
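
As a rough sketch of driving the container over HTTP; the endpoint and payload shape below are assumptions, so check the `/playground` UI for the exact request format your version expects.

```python
import requests

# Assumed request shape for the container's REST API; the playground at
# http://localhost:11235/playground shows the format for your version
payload = {"urls": ["https://web-scraping.dev/products"]}
response = requests.post("http://localhost:11235/crawl", json=payload, timeout=120)
response.raise_for_status()
print(response.json())
```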

For one-off jobs, the `crwl` CLI ships with the pip install:

```bash
crwl https://web-scraping.dev/products -o markdown
crwl https://docs.crawl4ai.com/ --deep-crawl bfs --max-pages 20
crwl https://web-scraping.dev/products --extraction-strategy llm --query "extract products as JSON"
```





## How Can You Scale Crawl4AI with Scrapfly?

For projects that need domain-wide crawling at production scale, Scrapfly's Crawler API provides managed crawling with automatic URL discovery, anti-bot bypass, and AI-ready output formats. It is the natural next step when local Crawl4AI hits the wall on reliability, throughput, or anti-bot complexity.

ScrapFly provides [web scraping](https://scrapfly.io/docs/scrape-api/getting-started), [screenshot](https://scrapfly.io/docs/screenshot-api/getting-started), and [extraction](https://scrapfly.io/docs/extraction-api/getting-started) APIs for data collection at scale.

- [Anti-bot protection bypass](https://scrapfly.io/docs/scrape-api/anti-scraping-protection) - scrape web pages without blocking!
- [Rotating residential proxies](https://scrapfly.io/docs/scrape-api/proxy) - prevent IP address and geographic blocks.
- [JavaScript rendering](https://scrapfly.io/docs/scrape-api/javascript-rendering) - scrape dynamic web pages through cloud browsers.
- [Full browser automation](https://scrapfly.io/docs/scrape-api/javascript-scenario) - control browsers to scroll, input and click on objects.
- [Format conversion](https://scrapfly.io/docs/scrape-api/getting-started#api_param_format) - scrape as HTML, JSON, Text, or Markdown.
- [Full screenshot customization](https://scrapfly.io/docs/screenshot-api/getting-started#api_param_capture) - scroll and capture exact areas.
- [Comprehensive options](https://scrapfly.io/docs/screenshot-api/getting-started) - block banners, use dark mode, and more.
- [LLM prompts](https://scrapfly.io/docs/extraction-api/llm-prompt) - extract data or ask questions using LLMs
- [Extraction models](https://scrapfly.io/docs/extraction-api/automatic-ai) - automatically find objects like products, articles, jobs, and more.
- [Extraction templates](https://scrapfly.io/docs/extraction-api/rules-and-template) - extract data using your own specification.
- [Python](https://scrapfly.io/docs/sdk/python) and [Typescript](https://scrapfly.io/docs/sdk/typescript) SDKs, as well as [Scrapy](https://scrapfly.io/docs/sdk/scrapy) and [no-code tool integrations](https://scrapfly.io/docs/integration/getting-started).

Here is what a full domain crawl looks like with Scrapfly:

```python
from scrapfly import ScrapflyClient, CrawlerConfig

scrapfly = ScrapflyClient(key="YOUR_KEY")

# Crawl the product catalog
crawl_result = scrapfly.start_crawl(CrawlerConfig(
    url="https://web-scraping.dev/products",
    page_limit=20,                 # smaller for testing
    max_depth=2,                   # follow pagination links
    content_formats=["markdown", "clean_html"],
    asp=True,
))

crawl_uuid = crawl_result.uuid

# Track progress
status = scrapfly.get_crawl_status(crawl_uuid)
print(f"Status: {status.status}, Pages scraped: {status.state.urls_visited}")
```



The Scrapfly [Crawler API](https://scrapfly.io/crawler-api) handles the parts that get expensive to operate yourself: residential and datacenter proxy rotation, anti-bot bypass for Cloudflare and DataDome, automatic URL discovery, and Markdown or cleaned HTML output ready for LLM ingestion.






## FAQ

**Is Crawl4AI Free and Open Source?**

Yes, Crawl4AI is completely free and open source under the Apache 2.0 license. You can use it for personal or commercial projects with no API key or payment required. The team is also building a hosted Cloud API, currently in closed beta for managed large-scale extraction.

**What Are the Main Limitations of Crawl4AI?**

When run locally, Crawl4AI requires you to manage your own infrastructure, including browsers, proxies, and scaling. It depends on Playwright for rendering, which uses significant memory on large concurrent crawls. For sites with advanced anti-bot protection, you typically need to configure proxy chains or fall back to a managed scraping service.

**Does Crawl4AI Work with Local LLMs Like Ollama?**

Yes. Crawl4AI uses LiteLLM under the hood, so any provider LiteLLM supports works, including local models via Ollama. Set `provider="ollama/llama3.3"` in `LLMConfig` with `api_token=None` for fully local, free LLM-based extraction.

**Can Crawl4AI Handle JavaScript-Heavy Single-Page Applications?**

Yes. Crawl4AI uses Playwright to render JavaScript fully before extracting content. You can inject custom JS through `js_code`, wait for specific selectors with `wait_for`, and interact with dynamic elements like infinite scroll or "Load More" buttons.

**How Does Crawl4AI Compare to Scrapy for Web Crawling?**

Scrapy is a general-purpose crawling framework optimized for high-volume HTTP crawls without a browser. Crawl4AI is purpose-built for AI workflows: it renders JavaScript, outputs LLM-ready Markdown, and ships with LLM-based extraction strategies. Choose Scrapy for high-volume traditional crawling and Crawl4AI for AI and LLM data pipelines.

**Can Crawl4AI Be Used with LangChain or RAG Pipelines?**

Yes. Crawl4AI's Markdown and JSON outputs drop straight into LangChain document loaders or vector stores.









## Conclusion

Crawl4AI is not just another scraping library. It sits in its own category: an open-source crawling framework built specifically for LLM and AI data pipelines. The difference shows in every layer, from Markdown output that strips boilerplate before it reaches your model, to adaptive crawling that knows when to stop collecting.

When you hit the ceiling of local infrastructure, whether that is anti-bot protection, proxy management, or scaling beyond a single machine, Scrapfly's [Crawler API](https://scrapfly.io/crawler-api) picks up where local Crawl4AI leaves off with managed browsers and built-in bypass.



 
