3
extraction strategies in one API
20+
pre-trained AI extraction models
100%
structured JSON output
55k+
developers using Scrapfly APIs
Raw HTML In, Structured Data Out
AI models, template rules, and LLM prompts. All composable, all on one endpoint.
Extraction Pipeline
One endpoint, three strategies. Pass HTML, Markdown, or any text document. Choose an AI preset, define your own template, or write a plain-English prompt. The same JSON envelope comes back every time.
Pre-Trained AI Models
Set extraction_model and get a typed JSON schema back. No selectors to write, no training required. Works across any domain that matches the schema.
Template Extraction
Define selectors in JSON. CSS, XPath, and JMESPath are all supported. Chain type extractors (price, image, link, email) and formatters (lowercase, date, trim) on any field for deterministic, repeatable output - no LLM cost.
LLM Prompt Extraction
Pass extraction_prompt with a plain-English instruction. Or supply a JSON schema and get that exact shape back, guaranteed. For one-off analysis, rapidly changing layouts, or custom data contracts.
Any Input Format
The Extraction API accepts HTML, Markdown, plain text, XML, CSV, RSS, and JSON. Use it as a standalone parser on content you already have - no scraping required. Set content_type to match your document and all three strategies work identically.
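As a rough sketch, the only thing that changes between formats is the content_type value. A small helper can infer it from a file extension before you build the request. The extension-to-type mapping below is an illustration assembled from the content types shown in the examples on this page, not SDK behavior; RSS is supported too, but its exact content_type string isn't shown here, so it's omitted from the map.

```python
from pathlib import Path

# content_type values as they appear in the examples on this page;
# the extension mapping itself is an assumption for illustration.
CONTENT_TYPES = {
    ".html": "text/html",
    ".md": "text/markdown",
    ".txt": "text/plain",
    ".xml": "text/xml",
    ".csv": "application/csv",
    ".json": "application/json",
}

def infer_content_type(path: str) -> str:
    """Pick a content_type for a document based on its file extension."""
    suffix = Path(path).suffix.lower()
    try:
        return CONTENT_TYPES[suffix]
    except KeyError:
        raise ValueError(f"unsupported document type: {suffix}")
```

Pass the result as the content_type of your extraction request and the same call works across all of these formats.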
HTML to Markdown
Convert messy HTML into clean Markdown or plain text in one call. Feed it directly into your LLM context window without noisy tags, scripts, or inline styles.
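The API does this conversion server-side. For intuition only, here is a rough local sketch of what gets stripped on the way to plain text: tags, scripts, and styles drop out, readable text survives. This is an illustration built on the standard library, not the API's implementation.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Rough local illustration of HTML-to-text cleanup:
    drop tags plus everything inside <script>/<style>."""
    def __init__(self):
        super().__init__()
        self._skip = 0      # depth inside <script>/<style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)
```

The API's Markdown output additionally preserves document structure (headings, lists, links), which a naive stripper like this discards.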
Batch Mode
Send multiple documents in a single request. Process product pages, article feeds, or review lists in bulk without managing concurrency yourself.
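The documented batch wire format isn't reproduced on this page, so treat the following as a hypothetical sketch of the idea: one request carrying many documents, each tagged with its own content type. The per-entry {"body", "content_type"} shape is an assumption for illustration, not the real payload schema.

```python
from pathlib import Path

def build_batch(paths, content_type="text/html"):
    """Assemble one payload entry per document for a single batch
    request. The entry shape here is a hypothetical illustration,
    not the documented wire format."""
    return [
        {"body": Path(p).read_text(), "content_type": content_type}
        for p in paths
    ]
```

However the entries are shaped, the point is the same: you ship the whole list in one call and the service fans out the work for you.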
Async + Webhooks
Fire-and-forget for long extraction jobs. Results push to your webhook endpoint when ready. No polling, no held connections, no timeouts.
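On your side, all a webhook needs is an HTTP endpoint that accepts a POST and acknowledges quickly. A minimal standard-library receiver sketch follows; a JSON body is assumed here, and the actual envelope fields are defined by the API reference, not this snippet.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def parse_result(raw: bytes) -> dict:
    """Decode a pushed extraction result. A JSON body is assumed;
    the real envelope fields are defined by the API, not here."""
    return json.loads(raw.decode("utf-8"))

class WebhookHandler(BaseHTTPRequestHandler):
    """Accepts the POST sent when an async extraction job finishes."""
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        result = parse_result(self.rfile.read(length))
        # hand `result` off to your pipeline, then acknowledge fast
        self.send_response(200)
        self.end_headers()

# To listen locally (blocks forever):
# HTTPServer(("0.0.0.0", 8080), WebhookHandler).serve_forever()
```

Respond 200 before doing heavy processing so the delivery isn't retried while you work.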
Scrape and Extract in One Call
Pair the Extraction API with the Web Scraping API. Add extraction_model or extraction_prompt to any /scrape request and get anti-bot bypass, JavaScript rendering, proxy rotation, and structured JSON back in a single round-trip.
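In raw HTTP terms this is a single GET/POST against /scrape with extraction parameters appended. A standard-library sketch of building that request URL is below; the extraction_model parameter is named on this page, while the asp and render_js parameter names are assumptions based on the Web Scraping API.

```python
from urllib.parse import urlencode

def scrape_and_extract_url(api_key: str, target: str) -> str:
    """Build a /scrape request URL that also asks for structured
    extraction. asp and render_js names are assumptions from the
    Web Scraping API; extraction_model is shown on this page."""
    params = {
        "key": api_key,
        "url": target,
        "asp": "true",        # anti-bot bypass (assumed name)
        "render_js": "true",  # JavaScript rendering (assumed name)
        "extraction_model": "product",
    }
    return "https://api.scrapfly.io/scrape?" + urlencode(params)
```

One round-trip fetches the page through the scraping stack and returns the structured product JSON alongside the raw result.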
Extract Across Many Pages
Attach extraction rules to a Crawler API job and pull structured data from every page in a crawl automatically. Useful for price monitoring, content aggregation, and building AI training datasets at scale.
Extract Structured Data in Three Lines
Pick a strategy, pick a language. Every example is a real, runnable SDK call.
Pass extraction_model="product" and get a typed product back. Works for article, review, job_posting, and 20+ other pre-trained schemas. Pair with the Web Scraping API for anti-bot bypass + rendering in the same call.
import json
from pathlib import Path
from scrapfly import ExtractionConfig, ScrapflyClient
client = ScrapflyClient(key="API KEY")
# product from https://web-scraping.dev/product/1
document_text = Path("product.html").read_text()
api_response = client.extract(
ExtractionConfig(
body=document_text,
content_type="text/html",
# use one of dozens of defined data models:
extraction_model="product",
# optional: provide file's url for converting relative links to absolute
url="https://web-scraping.dev/product/1",
)
)
print(json.dumps(api_response.data))
import {
ScrapflyClient, ExtractionConfig
} from 'jsr:@scrapfly/scrapfly-sdk';
const client = new ScrapflyClient({ key: "API KEY" });
// product from https://web-scraping.dev/product/1
const document_text = Deno.readTextFileSync("./product.html");
const api_result = await client.extract(
new ExtractionConfig({
body: document_text,
content_type: "text/html",
url: "https://web-scraping.dev/product/1",
extraction_model: "product"
})
);
console.log(JSON.stringify(api_result));
http POST https://api.scrapfly.io/extraction \
key==$SCRAPFLY_KEY \
content_type==text/html \
extraction_model==product \
url==https://web-scraping.dev/product/1 \
@product.html
Give the LLM a JSON schema and get that exact shape back, guaranteed. Best when you have a fixed data contract. Ideal companion for the AI Web Scraping API pipeline.
from pathlib import Path
from scrapfly import ExtractionConfig, ScrapflyClient
client = ScrapflyClient(key="API KEY")
# product from https://web-scraping.dev/product/1
document_text = Path("product.html").read_text()
api_response = client.extract(
ExtractionConfig(
body=document_text,
content_type="text/html",
url="https://web-scraping.dev/product/1",
extraction_prompt="extract price as JSON",
)
)
print(api_response.data)
import {
ScrapflyClient, ExtractionConfig
} from 'jsr:@scrapfly/scrapfly-sdk';
const client = new ScrapflyClient({ key: "API KEY" });
// product from https://web-scraping.dev/product/1
const document_text = Deno.readTextFileSync("./product.html");
const api_result = await client.extract(
new ExtractionConfig({
body: document_text,
url: "https://web-scraping.dev/product/1",
content_type: "text/html",
extraction_prompt: "extract price as json",
})
);
console.log(JSON.stringify(api_result));
http POST https://api.scrapfly.io/extraction \
key==$SCRAPFLY_KEY \
content_type==text/html \
"extraction_prompt==extract price as JSON" \
url==https://web-scraping.dev/product/1 \
@product.html
Natural-language extraction. Ask "what are the product reviews?" and get clean JSON. For agentic workflows, route through AI Browser Agent so the page is fetched stealth-first.
from pathlib import Path
from scrapfly import ExtractionConfig, ScrapflyClient
client = ScrapflyClient(key="API KEY")
# product from https://web-scraping.dev/product/1
document_text = Path("product.html").read_text()
api_response = client.extract(
ExtractionConfig(
body=document_text,
content_type="text/html",
extraction_prompt="summarize the review sentiment"
)
)
print(api_response.data)
import {
ScrapflyClient, ExtractionConfig
} from 'jsr:@scrapfly/scrapfly-sdk';
const client = new ScrapflyClient({ key: "API KEY" });
// product from https://web-scraping.dev/product/1
const document_text = Deno.readTextFileSync("./product.html");
const api_result = await client.extract(
new ExtractionConfig({
body: document_text,
url: "https://web-scraping.dev/product/1",
content_type: "text/html",
extraction_prompt: "summarize the review sentiment",
})
);
console.log(JSON.stringify(api_result));
http POST https://api.scrapfly.io/extraction \
key==$SCRAPFLY_KEY \
content_type==text/html \
"extraction_prompt==summarize the review sentiment" \
url==https://web-scraping.dev/product/1 \
@product.html
Parse almost any document format in one call: set content_type and the same request handles HTML, XML, plain text, Markdown, JSON, or CSV. You can also convert HTML to markdown, clean_html, or plain text - no LLM, no schema, ready for RAG. For raw HTTP with perfect TLS fingerprints use Curlium; for stealth Chromium use Scrapium.
from pathlib import Path
from scrapfly import ExtractionConfig, ScrapflyClient
client = ScrapflyClient(key="API KEY")
# product from https://web-scraping.dev/product/1
document_text = Path("product.html").read_text()
api_response = client.extract(
ExtractionConfig(
body=document_text,
extraction_prompt="find the product price",
# use almost any file type identified through content_type:
content_type="text/html",
# content_type="text/xml",
# content_type="text/plain",
# content_type="text/markdown",
# content_type="application/json",
)
)
print(api_response.data)
import {
ScrapflyClient, ExtractionConfig
} from 'jsr:@scrapfly/scrapfly-sdk';
const client = new ScrapflyClient({ key: "API KEY" });
// product from https://web-scraping.dev/product/1
const document_text = Deno.readTextFileSync("./product.html");
const api_result = await client.extract(
new ExtractionConfig({
body: document_text,
url: "https://web-scraping.dev/product/1",
extraction_prompt: "find the product price",
// use almost any file type identified through content_type
content_type: "text/html",
// content_type: "text/xml",
// content_type: "text/plain",
// content_type: "text/markdown",
// content_type: "application/json",
// content_type: "application/csv",
})
);
console.log(JSON.stringify(api_result));
http POST https://api.scrapfly.io/extraction \
key==$SCRAPFLY_KEY \
content_type==text/html \
"extraction_prompt==find the product price" \
url==https://web-scraping.dev/product/1 \
@product.html
Define your own rules with CSS, XPath, or JMESPath selectors. Deterministic, no LLM cost. Test rules live with the selector tester.
import json
from pathlib import Path
from scrapfly import ExtractionConfig, ScrapflyClient
client = ScrapflyClient(key="API KEY")
# reviews from https://web-scraping.dev/reviews
document_text = Path("reviews.html").read_text()
# define your JSON template
template = {
"source": "html",
"selectors": [
{
"name": "date_posted",
# use css selectors
"type": "css",
"query": "[data-testid='review-date']::text",
"multiple": True, # one or multiple?
# post process results with formatters
"formatters": [ {
"name": "datetime",
"args": {"format": "%Y, %b %d — %A"}
} ]
}
]
}
api_response = client.extract(
ExtractionConfig(
body=document_text,
content_type="text/html",
extraction_ephemeral_template=template
)
)
print(json.dumps(api_response.data))
import {
ScrapflyClient, ExtractionConfig
} from 'jsr:@scrapfly/scrapfly-sdk';
const client = new ScrapflyClient({ key: "API KEY" });
// reviews from https://web-scraping.dev/reviews
const document_text = Deno.readTextFileSync("./reviews.html");
// define your template as JSON
const template = {
"source": "html",
"selectors": [
{
"name": "date_posted",
// use css selectors
"type": "css",
"query": "[data-testid='review-date']::text",
"multiple": true, // one or multiple?
// post process results with formatters
"formatters": [ {
"name": "datetime",
"args": {"format": "%Y, %b %d — %A"}
} ]
}
]
}
const api_result = await client.extract(
new ExtractionConfig({
body: document_text,
url: "https://web-scraping.dev/reviews",
content_type: "text/html",
ephemeral_template: template,
})
);
console.log(JSON.stringify(api_result));
http POST https://api.scrapfly.io/extraction \
key==$SCRAPFLY_KEY \
content_type==text/html \
extraction_ephemeral_template=='{"source":"html","selectors":[{"name":"date_posted","type":"css","query":"[data-testid=review-date]::text","multiple":true}]}' \
@reviews.html
Same AI models, plus a confidence report per field. Debug extraction quality without a second call. When a site blocks model access, check which anti-bot system it runs first.
import json
from pathlib import Path
from scrapfly import ExtractionConfig, ScrapflyClient
client = ScrapflyClient(key="API KEY")
# product from https://web-scraping.dev/product/1
document_text = Path("product.html").read_text()
api_response = client.extract(
ExtractionConfig(
body=document_text,
content_type="text/html",
# use one of dozens of defined data models:
extraction_model="product",
# optional: provide file's url for converting relative links to absolute
url="https://web-scraping.dev/product/1",
)
)
# the data_quality field describes how much was found
print(json.dumps(api_response.extraction_result['data_quality']))
import {
ScrapflyClient, ExtractionConfig
} from 'jsr:@scrapfly/scrapfly-sdk';
const client = new ScrapflyClient({ key: "API KEY" });
// product from https://web-scraping.dev/product/1
const document_text = Deno.readTextFileSync("./product.html");
const api_result = await client.extract(
new ExtractionConfig({
body: document_text,
content_type: "text/html",
url: "https://web-scraping.dev/product/1",
extraction_model: "product"
})
);
console.log(JSON.stringify(api_result));
http POST https://api.scrapfly.io/extraction \
key==$SCRAPFLY_KEY \
content_type==text/html \
extraction_model==product \
url==https://web-scraping.dev/product/1 \
@product.html
Docs, Tools, And Ready-Made Parsers
Everything you need to go from raw HTML to a production data pipeline.
API Reference
Every parameter, every response field, with runnable examples for all three extraction strategies.
Developer Docs →
Academy
Interactive courses on web scraping, HTML parsing, and structured data extraction.
Start learning →
Open-Source Scrapers
40+ production-ready scrapers on GitHub. Each one uses the Extraction API for the parsing step.
Explore repo →
Developer Tools
CSS selector tester, cURL-to-Python converter, JA3 checker, and more.
Browse tools →
Seamlessly integrate with frameworks & platforms
Plug Scrapfly into your favorite tools, or build custom workflows with our first-class SDKs.
LLM & RAG frameworks
Frequently Asked Questions
What is the Extraction API?
The Extraction API turns raw HTML (or other text documents) into structured JSON. You can use a pre-trained AI model (product, article, review, job posting, etc.), define your own CSS/XPath template, or write a plain-English LLM prompt. All three strategies share the same endpoint and return the same JSON envelope.
When should I use a model vs. a template vs. a prompt?
Use a pre-trained model (extraction_model) when your document fits a standard schema - product pages, news articles, job listings. Use a template (extraction_template or ephemeral_template) when you need precise, repeatable extraction from a known site structure. Use a prompt (extraction_prompt) for ad-hoc questions, rapidly changing layouts, or anything that doesn't fit a fixed schema.
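The decision above maps one-to-one onto request parameters, so strategy selection can be a tiny dispatcher. The parameter names below are the ones used in this FAQ; the dispatcher itself is just an illustration.

```python
def extraction_kwargs(strategy: str, value) -> dict:
    """Map each strategy to the request parameter it uses,
    as named in this FAQ."""
    params = {
        "model": "extraction_model",
        "template": "extraction_ephemeral_template",
        "prompt": "extraction_prompt",
    }
    if strategy not in params:
        raise ValueError(f"unknown strategy: {strategy}")
    return {params[strategy]: value}
```

Because all three strategies share one endpoint and envelope, switching strategies is a one-parameter change rather than a code rewrite.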
How do I pair the Extraction API with web scraping?
Use the Web Scraping API to fetch the raw HTML - with anti-bot bypass, JavaScript rendering, and proxy rotation - then pass the HTML body to the Extraction API. Both SDKs share the same client, so you can chain a scrape and an extract in a few lines of Python or TypeScript.
What document formats are supported?
HTML, XML, JSON, CSV, RSS, Markdown, and plain text are supported today. Send the body as a string and set content_type to match. Each format works with all three extraction strategies.
Is my data used for AI training?
No. Scrapfly does not store, share, or use your document content for training AI models. Data sent to the Extraction API is processed in memory and discarded after the response is returned. See our privacy policy for full details.
How does quality reporting work for AI models?
Every AI model extraction response includes a data_quality field that reports coverage (how many expected fields were found) and confidence. Use it to detect low-quality extractions and trigger fallback logic or manual review without inspecting every record.
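A hedged sketch of that fallback logic follows. The response's data_quality field is shown elsewhere on this page, but the key names inside it ("coverage", "confidence") and the thresholds are assumptions for illustration.

```python
def needs_review(data_quality: dict,
                 min_coverage: float = 0.8,
                 min_confidence: float = 0.7) -> bool:
    """Flag an extraction for fallback or manual review when its
    quality report is weak. The "coverage" and "confidence" key
    names are assumptions for illustration."""
    coverage = data_quality.get("coverage", 0.0)
    confidence = data_quality.get("confidence", 0.0)
    return coverage < min_coverage or confidence < min_confidence
```

Gate each record through a check like this and only route the failures to a prompt-based retry or a human queue.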
What counts as a billable extraction?
Each successful API call counts as one extraction credit. Calls using AI models or LLM prompts cost additional credits (published in the pricing table). Failed requests - where the API returns an error - do not consume credits.
Simple pricing for structured extraction
One plan covers the full Scrapfly platform. Pick a monthly credit budget; every API shares the same credit pool. No per-product lock-in, no surprise line items.
1,000 free credits on signup. No credit card required.
Failed extractions are free. You only pay for delivered JSON.
Upgrade, downgrade, or cancel anytime. No contract.
Build the full pipeline.
The Extraction API handles parsing. Pair it with the Web Scraping API for anti-bot bypass and JavaScript rendering, Browser API for full Playwright/Puppeteer control, or Crawler API to extract data across many pages in a single crawl job.