// PRODUCT

Extraction API

Send any HTML document, get back clean structured JSON. AI models, CSS/XPath templates, or LLM prompts - one endpoint handles all three.

Three extraction strategies. One API call.

  • Pre-trained AI models. Pass extraction_model='product' and get price, title, images, reviews - no selectors to write.
  • Zero-config or fully custom. Pick a model for instant results, define CSS/XPath templates for precision, or write a natural-language LLM prompt for anything in between.
1,000 free credits. No credit card required.

  • 3 extraction strategies in one API
  • 20+ pre-trained AI extraction models
  • 100% structured JSON output
  • 55k+ developers using Scrapfly APIs


CAPABILITIES

Raw HTML In, Structured Data Out

AI models, template rules, and LLM prompts. All composable, all on one endpoint.

Extraction Pipeline

One endpoint, three strategies. Pass HTML, Markdown, or any text document. Choose an AI preset, define your own template, or write a plain-English prompt. The same JSON envelope comes back every time.

  • Input: URL, raw HTML, Markdown, XML, JSON, plain text - or fetch + extract in one call
  • Parser: cleans the document, strips scripts and noise, normalizes encoding
  • Extraction Strategy: pre-trained AI model, CSS/XPath template, or LLM prompt - chosen per request
  • Schema Validator: enforces field types, coerces values, reports coverage and confidence per field
  • Structured Output: typed JSON - same shape every time, ready for your database or downstream pipeline
3 extraction strategies
20+ pre-trained AI models
per-field quality reporting
7+ input formats
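The Schema Validator step can be pictured with a toy coercion pass. This is an illustrative sketch only - the schema, field names, and coverage calculation here are invented for the example, not the service's actual code:

```python
# Toy sketch of a schema-validation step: coerce raw values to the
# declared types and report overall field coverage. Illustrative only.
SCHEMA = {"title": str, "price": float, "in_stock": bool}

def validate(raw: dict) -> dict:
    fields, found = {}, 0
    for name, typ in SCHEMA.items():
        value = raw.get(name)
        if value is None:
            fields[name] = None
            continue
        try:
            fields[name] = typ(value)  # coerce "12.99" -> 12.99, etc.
            found += 1
        except (TypeError, ValueError):
            fields[name] = None
    return {"data": fields, "coverage": found / len(SCHEMA)}

result = validate({"title": "Box of Chocolate", "price": "12.99"})
```

A real validator also tracks per-field confidence; this sketch only shows the coerce-and-count idea.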

Pre-Trained AI Models

Set extraction_model and get a typed JSON schema back. No selectors to write, no training required. Works across any domain that matches the schema.

20+ schemas
zero selectors needed
View AI model docs →

Template Extraction

Define selectors in JSON. CSS, XPath, and JMESPath are all supported. Chain type extractors (price, image, link, email) and formatters (lowercase, date, trim) on any field for deterministic, repeatable output - no LLM cost.

CSS selector engine
XPath selector engine
zero LLM cost
live test tool
View template docs →
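The extractor-and-formatter chaining can be pictured locally. A toy sketch - the `trim` and `lowercase` names mirror the formatters mentioned above, but the `price` parser and the chain runner are invented here, not the template engine itself:

```python
# Toy formatter chain: apply named formatters to an extracted value
# in order, the way a template's "formatters" list would. Illustrative only.
FORMATTERS = {
    "trim": str.strip,
    "lowercase": str.lower,
    "price": lambda s: float(s.replace("$", "").replace(",", "")),
}

def apply_chain(value, chain):
    for name in chain:
        value = FORMATTERS[name](value)
    return value

clean = apply_chain("  $1,299.00 ", ["trim", "price"])  # -> 1299.0
```

Chaining keeps each step small and deterministic, which is why template extraction is repeatable with no LLM cost.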

LLM Prompt Extraction

Pass extraction_prompt with a plain-English instruction. Or supply a JSON schema and get that exact shape back, guaranteed. For one-off analysis, rapidly changing layouts, or custom data contracts.

  • Freeform Prompt: ask "what are the product reviews?" - get clean JSON back
  • JSON Schema Input: define the output shape; the LLM fills it with values from the document
  • Typed JSON Output: matches your schema exactly - no post-processing required
freeform prompts
JSON schema binding
any layout
View LLM prompt docs →

Any Input Format

The Extraction API accepts HTML, Markdown, plain text, XML, CSV, RSS, and JSON. Use it as a standalone parser on content you already have - no scraping required. Set content_type to match your document and all three strategies work identically.

7+ input formats
standalone parser mode
HTML
Markdown
RSS / XML
CSV
JSON
Plain text
View input format docs →
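Setting content_type is all that changes between formats. A rough local analogy of the dispatch using only stdlib parsers - this is an illustration of the idea, not the API's pipeline, which does far more cleaning:

```python
import csv
import io
import json

# Toy dispatch on content_type: pick a parser per format and return
# plain Python data. A rough analogy, not the API's actual pipeline.
def parse(body: str, content_type: str):
    if content_type == "application/json":
        return json.loads(body)
    if content_type == "text/csv":
        return list(csv.DictReader(io.StringIO(body)))
    return body  # text/html, text/markdown, text/plain: pass through

rows = parse("name,price\nBox of Chocolate,9.99\n", "text/csv")
```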

HTML to Markdown

Convert messy HTML into clean Markdown or plain text in one call. Feed it directly into your LLM context window without noisy tags, scripts, or inline styles.

markdown output
clean_html output
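The clean-up can be approximated with the stdlib alone. A minimal sketch of tag and script stripping - a local approximation of the idea, not the API's converter:

```python
from html.parser import HTMLParser

# Minimal tag stripper: drop <script>/<style> contents and all tags,
# keep visible text. A local approximation, not the API's converter.
class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts, self.skip = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if not self.skip and data.strip():
            self.parts.append(data.strip())

def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)

text = html_to_text("<h1>Box</h1><script>x()</script><p>$9.99</p>")
```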

Batch Mode

Send multiple documents in a single request. Process product pages, article feeds, or review lists in bulk without managing concurrency yourself.

multi-doc per request
no concurrency mgmt

Async + Webhooks

Fire-and-forget for long extraction jobs. Results push to your webhook endpoint when ready. No polling, no held connections, no timeouts.

webhook push delivery
zero polling needed
View webhook docs →
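Handling the webhook push is a plain HTTP POST on your side. A sketch assuming a JSON payload with a `result` key - the actual payload shape is defined in the webhook docs, so treat these field names as placeholders:

```python
import json

# Toy webhook handler: parse the pushed payload and hand the extraction
# result to your pipeline. The payload shape here is a placeholder.
def handle_webhook(raw_body: bytes) -> dict:
    payload = json.loads(raw_body)
    result = payload.get("result", {})
    # e.g. insert `result` into your database or queue here
    return result

data = handle_webhook(b'{"job_id": "abc", "result": {"price": 9.99}}')
```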

Scrape and Extract in One Call

Pair the Extraction API with the Web Scraping API. Add extraction_model or extraction_prompt to any /scrape request and get anti-bot bypass, JavaScript rendering, proxy rotation, and structured JSON back in a single round-trip.

asp=true anti-bot bypass
js rendering
1 call scrape + extract
View Web Scraping API →

Extract Across Many Pages

Attach extraction rules to a Crawler API job and pull structured data from every page in a crawl automatically. Useful for price monitoring, content aggregation, and building AI training datasets at scale.

rules per crawl job
auto across all pages
WARC or JSON output
View Crawler API →

CODE

Extract Structured Data in Three Lines

Pick a strategy, pick a language. Every example is a real, runnable SDK call.

Pass extraction_model="product" and get a typed product back. Works for article, review, job_posting, and 20+ other pre-trained schemas. Pair with the Web Scraping API for anti-bot bypass + rendering in the same call.

import json
from pathlib import Path
from scrapfly import ExtractionConfig, ScrapflyClient 

client = ScrapflyClient(key="API KEY")

# product from https://web-scraping.dev/product/1
document_text = Path("product.html").read_text()  

api_response = client.extract(
    ExtractionConfig(
        body=document_text,
        content_type="text/html",
        # use one of dozens of defined data models:
        extraction_model="product",
        # optional: provide file's url for converting relative links to absolute
        url="https://web-scraping.dev/product/1",
    )
)
print(json.dumps(api_response.data))
import { 
    ScrapflyClient, ExtractionConfig 
} from 'jsr:@scrapfly/scrapfly-sdk';

const client = new ScrapflyClient({ key: "API KEY" });
// product from https://web-scraping.dev/product/1
const document_text = Deno.readTextFileSync("./product.html");

const api_result = await client.extract(
    new ExtractionConfig({
        body: document_text,
        content_type: "text/html",
        url: "https://web-scraping.dev/product/1",
        extraction_model: "product"
    })
);
console.log(JSON.stringify(api_result));
http POST https://api.scrapfly.io/extraction \
key==$SCRAPFLY_KEY \
content_type==text/html \
extraction_model==product \
url==https://web-scraping.dev/product/1 \
@product.html

Pass extraction_prompt with an instruction like "extract price as JSON" - or supply a full JSON schema and get that exact shape back, guaranteed. Best when you have a fixed data contract. Ideal companion for the AI Web Scraping API pipeline.

from pathlib import Path
from scrapfly import ExtractionConfig, ScrapflyClient 

client = ScrapflyClient(key="API KEY")

# product from https://web-scraping.dev/product/1
document_text = Path("product.html").read_text()  

api_response = client.extract(
    ExtractionConfig(
        body=document_text,
        content_type="text/html",
        url="https://web-scraping.dev/product/1",
        extraction_prompt="extract price as JSON",
    )
)
print(api_response.data)
import { 
    ScrapflyClient, ExtractionConfig 
} from 'jsr:@scrapfly/scrapfly-sdk';

const client = new ScrapflyClient({ key: "API KEY" });
// product from https://web-scraping.dev/product/1
const document_text = Deno.readTextFileSync("./product.html");

const api_result = await client.extract(
    new ExtractionConfig({
        body: document_text,
        url: "https://web-scraping.dev/product/1",
        content_type: "text/html",
        extraction_prompt: "extract price as json",
    })
);
console.log(JSON.stringify(api_result));
http POST https://api.scrapfly.io/extraction \
key==$SCRAPFLY_KEY \
content_type==text/html \
"extraction_prompt==extract price as JSON" \
url==https://web-scraping.dev/product/1 \
@product.html

Natural-language extraction. Ask "what are the product reviews?" and get clean JSON. For agentic workflows, route through AI Browser Agent so the page is fetched stealth-first.

from pathlib import Path
from scrapfly import ExtractionConfig, ScrapflyClient 

client = ScrapflyClient(key="API KEY")

# product from https://web-scraping.dev/product/1
document_text = Path("product.html").read_text()  

api_response = client.extract(
    ExtractionConfig(
        body=document_text,
        content_type="text/html",
        extraction_prompt="summarize the review sentiment"
    )
)
print(api_response.data)
import { 
    ScrapflyClient, ExtractionConfig 
} from 'jsr:@scrapfly/scrapfly-sdk';

const client = new ScrapflyClient({ key: "API KEY" });
// product from https://web-scraping.dev/product/1
const document_text = Deno.readTextFileSync("./product.html");

const api_result = await client.extract(
    new ExtractionConfig({
        body: document_text,
        url: "https://web-scraping.dev/product/1",
        content_type: "text/html",
        extraction_prompt: "summarize the review sentiment",
    })
);
console.log(JSON.stringify(api_result));
http POST https://api.scrapfly.io/extraction \
key==$SCRAPFLY_KEY \
content_type==text/html \
"extraction_prompt==summarize the review sentiment" \
url==https://web-scraping.dev/product/1 \
@product.html

Parse almost any document format by setting content_type to match - HTML, XML, plain text, Markdown, or JSON. Same request, same strategies, same JSON envelope. For raw HTTP with perfect TLS fingerprints use Curlium; for stealth Chromium use Scrapium.

from pathlib import Path
from scrapfly import ExtractionConfig, ScrapflyClient

client = ScrapflyClient(key="API KEY")

# product from https://web-scraping.dev/product/1
document_text = Path("product.html").read_text()

api_response = client.extract(
    ExtractionConfig(
        body=document_text,
        extraction_prompt="find the product price",
        # use almost any file type identified through content_type:
        content_type="text/html",
        # content_type="text/xml",
        # content_type="text/plain",
        # content_type="text/markdown",
        # content_type="application/json",
    )
)
print(api_response.data)
import { 
    ScrapflyClient, ExtractionConfig 
} from 'jsr:@scrapfly/scrapfly-sdk';

const client = new ScrapflyClient({ key: "API KEY" });
// product from https://web-scraping.dev/product/1
const document_text = Deno.readTextFileSync("./product.html");

const api_result = await client.extract(
    new ExtractionConfig({
        body: document_text,
        url: "https://web-scraping.dev/product/1",
        extraction_prompt: "find the product price",
        // use almost any file type identified through content_type
        content_type: "text/html",
        // content_type: "text/xml",
        // content_type: "text/plain",
        // content_type: "text/markdown",
        // content_type: "application/json",
        // content_type: "application/csv",
    })
);
console.log(JSON.stringify(api_result));
http POST https://api.scrapfly.io/extraction \
key==$SCRAPFLY_KEY \
content_type==text/html \
"extraction_prompt==find the price" \
url==https://web-scraping.dev/product/1 \
@product.html

Define your own rules with CSS, XPath, or JMESPath selectors. Deterministic, no LLM cost. Test rules live with the selector tester.

import json
from pathlib import Path
from scrapfly import ExtractionConfig, ScrapflyClient 

client = ScrapflyClient(key="API KEY")

# product from https://web-scraping.dev/reviews
document_text = Path("reviews.html").read_text()  
# define your JSON template
template = {  
  "source": "html",
  "selectors": [
    {
      "name": "date_posted",
      # use css selectors
      "type": "css",
      "query": "[data-testid='review-date']::text",
      "multiple": True,  # one or multiple?
      # post process results with formatters
      "formatters": [ {
        "name": "datetime",
        "args": {"format": "%Y, %b %d — %A"}
      } ]
    }
  ]
}
api_response = client.extract(
    ExtractionConfig(
        body=document_text,
        content_type="text/html",
        extraction_ephemeral_template=template
    )
)
print(json.dumps(api_response.data))
import { 
    ScrapflyClient, ExtractionConfig 
} from 'jsr:@scrapfly/scrapfly-sdk';

const client = new ScrapflyClient({ key: "API KEY" });
// product from https://web-scraping.dev/reviews
const document_text = Deno.readTextFileSync("./reviews.html");

// define your template as JSON 
const template = {  
  "source": "html",
  "selectors": [
    {
      "name": "date_posted",
      // use css selectors
      "type": "css",
      "query": "[data-testid='review-date']::text",
      "multiple": true,  // one or multiple?
      // post process results with formatters
      "formatters": [ {
        "name": "datetime",
        "args": {"format": "%Y, %b %d — %A"}
      } ]
    }
  ]
}
const api_result = await client.extract(
    new ExtractionConfig({
        body: document_text,
        url: "https://web-scraping.dev/reviews",
        content_type: "text/html",
        ephemeral_template: template,
    })
);
console.log(JSON.stringify(api_result));
http POST https://api.scrapfly.io/extraction \
key==$SCRAPFLY_KEY \
content_type==text/html \
extraction_ephemeral_template:='{"source":"html","selectors":[{"name":"date_posted","type":"css","query":"[data-testid=review-date]::text","multiple":true}]}' \
@reviews.html

Same AI models, plus a confidence report per field. Debug extraction quality without a second call. When a site blocks model access, check which anti-bot system it runs first.

import json
from pathlib import Path
from scrapfly import ExtractionConfig, ScrapflyClient 

client = ScrapflyClient(key="API KEY")

# product from https://web-scraping.dev/product/1
document_text = Path("product.html").read_text()  

api_response = client.extract(
    ExtractionConfig(
        body=document_text,
        content_type="text/html",
        # use one of dozens of defined data models:
        extraction_model="product",
        # optional: provide file's url for converting relative links to absolute
        url="https://web-scraping.dev/product/1",
    )
)
# the data_quality field describes how much was found
print(json.dumps(api_response.extraction_result['data_quality']))
import { 
    ScrapflyClient, ExtractionConfig 
} from 'jsr:@scrapfly/scrapfly-sdk';

const client = new ScrapflyClient({ key: "API KEY" });
// product from https://web-scraping.dev/product/1
const document_text = Deno.readTextFileSync("./product.html");

const api_result = await client.extract(
    new ExtractionConfig({
        body: document_text,
        content_type: "text/html",
        url: "https://web-scraping.dev/product/1",
        extraction_model: "product"
    })
);
console.log(JSON.stringify(api_result));
http POST https://api.scrapfly.io/extraction \
key==$SCRAPFLY_KEY \
content_type==text/html \
extraction_model==product \
url==https://web-scraping.dev/product/1 \
@product.html

LEARN

Docs, Tools, And Ready-Made Parsers

Everything you need to go from raw HTML to a production data pipeline.

API Reference

Every parameter, every response field, with runnable examples for all three extraction strategies.

Developer Docs →

Academy

Interactive courses on web scraping, HTML parsing, and structured data extraction.

Start learning →

Open-Source Scrapers

40+ production-ready scrapers on GitHub. Each one uses the Extraction API for the parsing step.

Explore repo →

Developer Tools

CSS selector tester, cURL-to-Python converter, JA3 checker, and more.

Browse tools →

// INTEGRATIONS

Seamlessly integrate with frameworks & platforms

Plug Scrapfly into your favorite tools, or build custom workflows with our first-class SDKs.


FAQ

Frequently Asked Questions

What is the Extraction API?

The Extraction API turns raw HTML (or other text documents) into structured JSON. You can use a pre-trained AI model (product, article, review, job posting, etc.), define your own CSS/XPath template, or write a plain-English LLM prompt. All three strategies share the same endpoint and return the same JSON envelope.

When should I use a model vs. a template vs. a prompt?

Use a pre-trained model (extraction_model) when your document fits a standard schema - product pages, news articles, job listings. Use a template (extraction_template or ephemeral_template) when you need precise, repeatable extraction from a known site structure. Use a prompt (extraction_prompt) for ad-hoc questions, rapidly changing layouts, or anything that doesn't fit a fixed schema.

How do I pair the Extraction API with web scraping?

Use the Web Scraping API to fetch the raw HTML - with anti-bot bypass, JavaScript rendering, and proxy rotation - then pass the HTML body to the Extraction API. Both SDKs share the same client, so you can chain a scrape and an extract in a few lines of Python or TypeScript.

What document formats are supported?

HTML, XML, JSON, CSV, RSS, Markdown, and plain text are supported today. Send the body as a string and set content_type to match. Each format works with all three extraction strategies.

Is my data used for AI training?

No. Scrapfly does not store, share, or use your document content for training AI models. Data sent to the Extraction API is processed in memory and discarded after the response is returned. See our privacy policy for full details.

How does quality reporting work for AI models?

Every AI model extraction response includes a data_quality field that reports coverage (how many expected fields were found) and confidence. Use it to detect low-quality extractions and trigger fallback logic or manual review without inspecting every record.
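In practice that means gating on the report before accepting a record. A sketch with invented thresholds and an assumed per-field shape (`found` and `confidence` keys) - check the response reference for the exact structure:

```python
# Toy quality gate on an extraction response. The per-field shape used
# here ("found"/"confidence" keys) is an assumption for illustration.
def needs_review(data_quality: dict, min_conf: float = 0.8) -> bool:
    for field, report in data_quality.items():
        if not report.get("found") or report.get("confidence", 0) < min_conf:
            return True
    return False

quality = {
    "price": {"found": True, "confidence": 0.97},
    "reviews": {"found": False, "confidence": 0.0},
}
flagged = needs_review(quality)  # True: "reviews" was not found
```

Records that fail the gate can be routed to a fallback strategy (e.g. an LLM prompt) or a manual-review queue instead of your database.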

What counts as a billable extraction?

Each successful API call counts as one extraction credit. Calls using AI models or LLM prompts cost additional credits (published in the pricing table). Failed requests - where the API returns an error - do not consume credits.


// PRICING

Simple pricing for structured extraction

One plan covers the full Scrapfly platform. Pick a monthly credit budget; every API shares the same credit pool. No per-product lock-in, no surprise line items.

Free tier

1,000 free credits on signup. No credit card required.

Pay on success

Failed extractions are free. You only pay for delivered JSON.

No lock-in

Upgrade, downgrade, or cancel anytime. No contract.

Build the full pipeline.

The Extraction API handles parsing. Pair it with the Web Scraping API for anti-bot bypass and JavaScript rendering, Browser API for full Playwright/Puppeteer control, or Crawler API to extract data across many pages in a single crawl job.

Get Free API Key
1,000 free credits. No card.