PRODUCT

# The Best AI Extraction API

 Turn any HTML document into structured JSON with Scrapfly's extraction API. Three extraction strategies in one endpoint: pre-trained AI models, CSS/XPath templates, or LLM prompts.

## Three extraction strategies. One API call.

- **Pre-trained AI models.** Pass `extraction_model='product'` and get price, title, images, reviews - no selectors to write.
- **Zero-config or fully custom.** Pick a model for instant results, define CSS/XPath templates for precision, or write a natural-language LLM prompt for anything in between.
 
 [ Get Free API Key ](https://scrapfly.io/register) [ Developer Docs ](https://scrapfly.io/docs/extraction-api/getting-started) 

 1,000 free credits. No credit card required. 

 






 

 

---

**3** extraction strategies in one API

**20+** pre-trained AI extraction models

**100%** structured JSON output

**55k+** developers using Scrapfly APIs

 



 

 

 

---

CAPABILITIES

## Raw HTML In, Structured Data Out

AI models, template rules, and LLM prompts. All composable, all on one endpoint.

 

 ### Extraction Pipeline

One endpoint, three strategies. Pass HTML, Markdown, or any text document. Choose an AI preset, define your own template, or write a plain-English prompt. The same JSON envelope comes back every time.

- **Input.** URL, raw HTML, Markdown, XML, JSON, plain text - or fetch + extract in one call.
- **Parser.** Cleans the document, strips scripts and noise, normalizes encoding.
- **Extraction Strategy.** Pre-trained AI model, CSS/XPath template, or LLM prompt - chosen per request.
- **Schema Validator.** Enforces field types, coerces values, reports coverage and confidence per field.
- **Structured Output.** Typed JSON - same shape every time, ready for your database or downstream pipeline.

**3** extraction strategies

**20+** pre-trained AI models

**per-field** quality reporting

**7+** input formats

 

 



 

 

 ### Pre-Trained AI Models

Set `extraction_model` and get a typed JSON schema back. No selectors to write, no training required. Works across any domain that matches the schema.

**20+** schemas

**zero** selectors needed

 

[product](https://scrapfly.io/docs/extraction-api/automatic-ai)

[article](https://scrapfly.io/docs/extraction-api/automatic-ai)

[review](https://scrapfly.io/docs/extraction-api/automatic-ai)

[job\_posting](https://scrapfly.io/docs/extraction-api/automatic-ai)

[real\_estate](https://scrapfly.io/docs/extraction-api/automatic-ai)

[recipe](https://scrapfly.io/docs/extraction-api/automatic-ai)

[forum](https://scrapfly.io/docs/extraction-api/automatic-ai)

[social\_post](https://scrapfly.io/docs/extraction-api/automatic-ai)

[event + more](https://scrapfly.io/docs/extraction-api/automatic-ai)

 

 [View AI model docs →](https://scrapfly.io/docs/extraction-api/automatic-ai) 



 

 ### Template Extraction

Define selectors in JSON. CSS, XPath, and JMESPath are all supported. Chain type extractors (price, image, link, email) and formatters (lowercase, date, trim) on any field for deterministic, repeatable output - no LLM cost.

[CSS selectors](https://scrapfly.io/docs/extraction-api/rules-and-template)

[XPath selectors](https://scrapfly.io/docs/extraction-api/rules-and-template)

[JMESPath](https://scrapfly.io/docs/extraction-api/rules-and-template)

[type extractors](https://scrapfly.io/docs/extraction-api/rules-and-template)

[formatters](https://scrapfly.io/docs/extraction-api/rules-and-template)

[selector tester](https://scrapfly.io/web-scraping-tools/css-xpath-tester)

[ephemeral rules](https://scrapfly.io/docs/extraction-api/rules-and-template)

[saved templates](https://scrapfly.io/docs/extraction-api/rules-and-template)

 

**CSS** selector engine

**XPath** selector engine

**zero** LLM cost

**live** test tool

 

 [View template docs →](https://scrapfly.io/docs/extraction-api/rules-and-template) 



 

 

 ### LLM Prompt Extraction

Pass `extraction_prompt` with a plain-English instruction. Or supply a JSON schema and get that exact shape back, guaranteed. For one-off analysis, rapidly changing layouts, or custom data contracts.

- **Freeform Prompt.** Ask "what are the product reviews?" - get clean JSON back.
- **JSON Schema Input.** Define the output shape; the LLM fills it with values from the document.
- **Typed JSON Output.** Matches your schema exactly - no post-processing required.

 

 

**freeform** prompts

**JSON** schema binding

**any** layout

 

 [View LLM prompt docs →](https://scrapfly.io/docs/extraction-api/llm-prompt) 



 

 ### Any Input Format

The Extraction API accepts HTML, Markdown, plain text, XML, CSV, RSS, and JSON. Use it as a standalone parser on content you already have - no scraping required. Set `content_type` to match your document and all three strategies work identically.

**7+** input formats

**standalone** parser mode

 

HTML • Markdown • RSS / XML • CSV • JSON • Plain text



 

 [View input format docs →](https://scrapfly.io/docs/extraction-api/getting-started) 



 

 

 ### HTML to Markdown

Convert messy HTML into clean Markdown or plain text in one call. Feed it directly into your LLM context window without noisy tags, scripts, or inline styles.

**markdown** output

**clean\_html** output
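A minimal sketch of a conversion call, reusing the SDK pattern from the examples further down. The output-format parameter name (`format`) is an assumption for illustration only - check the Extraction API docs for the exact option:

```python
from pathlib import Path
from scrapfly import ExtractionConfig, ScrapflyClient

client = ScrapflyClient(key="API KEY")
document_text = Path("product.html").read_text()

api_response = client.extract(
    ExtractionConfig(
        body=document_text,
        content_type="text/html",
        # "format" is a hypothetical parameter name used for illustration;
        # the docs define the real conversion option.
        format="markdown",  # or "clean_html" / "text"
    )
)
print(api_response.data)  # clean Markdown, ready for an LLM context window
```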

 

 



 

 ### Batch Mode

Send multiple documents in a single request. Process product pages, article feeds, or review lists in bulk without managing concurrency yourself.

**multi-doc** per request

**no** concurrency mgmt
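For contrast, here is the client-side fan-out that batch mode replaces - a sketch driving the SDK's `client.extract` (shown in the examples below) from a local thread pool:

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from scrapfly import ExtractionConfig, ScrapflyClient

client = ScrapflyClient(key="API KEY")
docs = [Path(p).read_text() for p in ("p1.html", "p2.html", "p3.html")]

def extract_one(html: str):
    return client.extract(
        ExtractionConfig(body=html, content_type="text/html",
                         extraction_model="product")
    ).data

# One thread per document - exactly the bookkeeping that batch mode
# folds into a single request on the server side.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(extract_one, docs))
print(results)
```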

 

 



 

 ### Async + Webhooks

Fire-and-forget for long extraction jobs. Results push to your webhook endpoint when ready. No polling, no held connections, no timeouts.

**webhook** push delivery

**zero** polling needed
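On the receiving end, a webhook consumer is just an HTTP endpoint that accepts the pushed result. A minimal Flask sketch - the payload field names are assumptions for illustration; inspect a real delivery for the authoritative schema:

```python
from flask import Flask, request

app = Flask(__name__)

@app.route("/scrapfly-webhook", methods=["POST"])
def handle_extraction():
    payload = request.get_json(force=True)
    # Field names below are assumptions - check a real webhook delivery.
    job_id = payload.get("job_id")
    data = payload.get("result")
    store(job_id, data)
    return "", 204  # acknowledge fast; do heavy processing elsewhere

def store(job_id, data):
    print(job_id, data)  # swap for your database write

if __name__ == "__main__":
    app.run(port=8000)
```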

 

 [View webhook docs →](https://scrapfly.io/docs/extraction-api/webhook) 



 

 

 ### Scrape and Extract in One Call

Pair the Extraction API with the Web Scraping API. Add `extraction_model` or `extraction_prompt` to any `/scrape` request and get anti-bot bypass, JavaScript rendering, proxy rotation, and structured JSON back in a single round-trip.

**asp=true** anti-bot bypass

**js** rendering

**1 call** scrape + extract
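A sketch of the combined call with the Python SDK. `asp`, `render_js`, and `extraction_model` come straight from the description above; where the structured data lands in the response object is an assumption, so inspect `api_response` on a real call:

```python
from scrapfly import ScrapeConfig, ScrapflyClient

client = ScrapflyClient(key="API KEY")

api_response = client.scrape(
    ScrapeConfig(
        url="https://web-scraping.dev/product/1",
        asp=True,                    # anti-bot bypass
        render_js=True,              # headless browser rendering
        extraction_model="product",  # extract in the same round-trip
    )
)
# The exact response field is an assumption - verify on a real call.
print(api_response.scrape_result["extracted_data"])
```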

 

[Web Scraping API](https://scrapfly.io/products/web-scraping-api)

[combo guide](https://scrapfly.io/docs/extraction-api/getting-started)

 

 [View Web Scraping API →](https://scrapfly.io/products/web-scraping-api) 



 

 ### Extract Across Many Pages

Attach extraction rules to a Crawler API job and pull structured data from every page in a crawl automatically. Useful for price monitoring, content aggregation, and building AI training datasets at scale.

**rules** per crawl job

**auto** across all pages

**WARC** or JSON output

 

[Crawler API](https://scrapfly.io/products/crawler-api)

[extraction rules](https://scrapfly.io/docs/crawler-api/extraction-rules)

 

 [View Crawler API →](https://scrapfly.io/products/crawler-api) 



 

 

 

---

CODE

## Extract Structured Data in Three Lines

Pick a strategy, pick a language. Every example is a real, runnable SDK call.

 

 [ AI Auto Models ](#ea-strat-aiauto) [ LLM + JSON Schema ](#ea-strat-llm-structured) [ LLM Freeform Prompt ](#ea-strat-llm-question) [ Clean Formats ](#ea-strat-llm-formats) [ CSS / XPath Rules ](#ea-strat-templates) [ AI Auto + Report ](#ea-strat-aiauto-report) 

Pass `extraction_model="product"` and get a typed product back. Works for article, review, job\_posting, and 20+ other pre-trained schemas. Pair with the [Web Scraping API](https://scrapfly.io/products/web-scraping-api) for anti-bot bypass + rendering in the same call.


```python
import json
from pathlib import Path
from scrapfly import ExtractionConfig, ScrapflyClient 

client = ScrapflyClient(key="API KEY")

# product from https://web-scraping.dev/product/1
document_text = Path("product.html").read_text()  

api_response = client.extract(
    ExtractionConfig(
        body=document_text,
        content_type="text/html",
        # use one of dozens of defined data models:
        extraction_model="product",
        # optional: provide file's url for converting relative links to absolute
        url="https://web-scraping.dev/product/1",
    )
)
print(json.dumps(api_response.data))
```

```typescript
import { 
    ScrapflyClient, ExtractionConfig 
} from 'jsr:@scrapfly/scrapfly-sdk';

const client = new ScrapflyClient({ key: "API KEY" });
// product from https://web-scraping.dev/product/1
const document_text = Deno.readTextFileSync("./product.html").toString();

const api_result = await client.extract(
    new ExtractionConfig({
        body: document_text,
        content_type: "text/html",
        url: "https://web-scraping.dev/product/1",
        extraction_model: "product"
    })
);
console.log(JSON.stringify(api_result));
```

```bash
http POST https://api.scrapfly.io/extraction \
key==$SCRAPFLY_KEY \
content_type==text/html \
extraction_model==product \
url==https://web-scraping.dev/product/1 \
@product.html
```

 

 

 [ Python SDK docs → ](https://scrapfly.io/docs/sdk/python) [ TypeScript SDK docs → ](https://scrapfly.io/docs/sdk/typescript) [ HTTP API docs → ](https://scrapfly.io/docs) 

 

Give the LLM a JSON schema, get that shape guaranteed back. Best when you have a fixed data contract. Ideal companion for the [AI Web Scraping API](https://scrapfly.io/ai-web-scraping-api) pipeline.


```python
from pathlib import Path
from scrapfly import ExtractionConfig, ScrapflyClient 

client = ScrapflyClient(key="API KEY")

# product from https://web-scraping.dev/product/1
document_text = Path("product.html").read_text()  

api_response = client.extract(
    ExtractionConfig(
        body=document_text,
        content_type="text/html",
        url="https://web-scraping.dev/product/1",
        extraction_prompt="extract price as JSON",
    )
)
print(api_response.data)
```

```typescript
import { 
    ScrapflyClient, ExtractionConfig 
} from 'jsr:@scrapfly/scrapfly-sdk';

const client = new ScrapflyClient({ key: "API KEY" });
// product from https://web-scraping.dev/product/1
const document_text = Deno.readTextFileSync("./product.html").toString();

const api_result = await client.extract(
    new ExtractionConfig({
        body: document_text,
        url: "https://web-scraping.dev/product/1",
        content_type: "text/html",
        extraction_prompt: "extract price as json",
    })
);
console.log(JSON.stringify(api_result));
```

```bash
http POST https://api.scrapfly.io/extraction \
key==$SCRAPFLY_KEY \
content_type==text/html \
"extraction_prompt==extract price as JSON" \
url==https://web-scraping.dev/product/1 \
@product.html
```
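The snippets above use a short freeform prompt. To pin the output to a fixed contract, one approach is embedding the schema in the prompt string itself - a sketch with an illustrative schema; see the LLM prompt docs for the canonical schema-binding mechanism:

```python
from pathlib import Path
from scrapfly import ExtractionConfig, ScrapflyClient

client = ScrapflyClient(key="API KEY")
document_text = Path("product.html").read_text()

# Illustrative JSON Schema - adapt to your own data contract.
schema = """{
  "type": "object",
  "properties": {
    "name": {"type": "string"},
    "price": {"type": "number"},
    "in_stock": {"type": "boolean"}
  }
}"""

api_response = client.extract(
    ExtractionConfig(
        body=document_text,
        content_type="text/html",
        extraction_prompt="extract the product as JSON matching this schema: " + schema,
    )
)
print(api_response.data)
```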

 

 

 [ Python SDK docs → ](https://scrapfly.io/docs/sdk/python) [ TypeScript SDK docs → ](https://scrapfly.io/docs/sdk/typescript) [ HTTP API docs → ](https://scrapfly.io/docs) 

 

Natural-language extraction. Ask "what are the product reviews?" and get clean JSON. For agentic workflows, route through [AI Browser Agent](https://scrapfly.io/products/ai-browser-agent) so the page is fetched stealth-first.


```python
from pathlib import Path
from scrapfly import ExtractionConfig, ScrapflyClient 

client = ScrapflyClient(key="API KEY")

# product from https://web-scraping.dev/product/1
document_text = Path("product.html").read_text()  

api_response = client.extract(
    ExtractionConfig(
        body=document_text,
        content_type="text/html",
        extraction_prompt="summarize the review sentiment"
    )
)
print(api_response.data)
```

```typescript
import { 
    ScrapflyClient, ExtractionConfig 
} from 'jsr:@scrapfly/scrapfly-sdk';

const client = new ScrapflyClient({ key: "API KEY" });
// product from https://web-scraping.dev/product/1
const document_text = Deno.readTextFileSync("./product.html").toString();

const api_result = await client.extract(
    new ExtractionConfig({
        body: document_text,
        url: "https://web-scraping.dev/product/1",
        content_type: "text/html",
        extraction_prompt: "summarize the review sentiment",
    })
);
console.log(JSON.stringify(api_result));
```

```bash
http POST https://api.scrapfly.io/extraction \
key==$SCRAPFLY_KEY \
content_type==text/html \
"extraction_prompt==summarize the review sentiment" \
url==https://web-scraping.dev/product/1 \
@product.html
```

 

 

 [ Python SDK docs → ](https://scrapfly.io/docs/sdk/python) [ TypeScript SDK docs → ](https://scrapfly.io/docs/sdk/typescript) [ HTTP API docs → ](https://scrapfly.io/docs) 

 

Feed the API almost any document format - just set `content_type` to match, as the examples below show. The same endpoint also converts HTML to markdown, clean\_html, or plain text: no LLM, no schema, ready for RAG. For raw HTTP with perfect TLS fingerprints use [Curlium](https://scrapfly.io/curlium); for stealth Chromium use [Scrapium](https://scrapfly.io/scrapium).


```python
from pathlib import Path
from scrapfly import ExtractionConfig, ScrapflyClient

client = ScrapflyClient(key="API KEY")

# product from https://web-scraping.dev/product/1
document_text = Path("product.html").read_text()

api_response = client.extract(
    ExtractionConfig(
        body=document_text,
        extraction_prompt="find the product price",
        # use almost any file type identified through content_type:
        content_type="text/html",
        # content_type="text/xml",
        # content_type="text/plain",
        # content_type="text/markdown",
        # content_type="application/json",
    )
)
print(api_response.data)
```

```typescript
import { 
    ScrapflyClient, ExtractionConfig 
} from 'jsr:@scrapfly/scrapfly-sdk';

const client = new ScrapflyClient({ key: "API KEY" });
// product from https://web-scraping.dev/product/1
const document_text = Deno.readTextFileSync("./product.html").toString();

const api_result = await client.extract(
    new ExtractionConfig({
        body: document_text,
        url: "https://web-scraping.dev/product/1",
        extraction_prompt: "find the product price",
        // use almost any file type identified through content_type
        content_type: "text/html",
        // content_type: "text/xml",
        // content_type: "text/plain",
        // content_type: "text/markdown",
        // content_type: "application/json",
        // content_type: "application/csv",
    })
);
console.log(JSON.stringify(api_result));
```

```bash
http POST https://api.scrapfly.io/extraction \
key==$SCRAPFLY_KEY \
content_type==text/html \
"extraction_prompt==find the price" \
url==https://web-scraping.dev/product/1 \
@product.html
```

 

 

 [ Python SDK docs → ](https://scrapfly.io/docs/sdk/python) [ TypeScript SDK docs → ](https://scrapfly.io/docs/sdk/typescript) [ HTTP API docs → ](https://scrapfly.io/docs) 

 

Define your own rules with CSS, XPath, or JMESPath selectors. Deterministic, no LLM cost. Test rules live with the [selector tester](https://scrapfly.io/web-scraping-tools/css-xpath-tester).


```python
import json
from pathlib import Path
from scrapfly import ExtractionConfig, ScrapflyClient 

client = ScrapflyClient(key="API KEY")

# product from https://web-scraping.dev/reviews
document_text = Path("reviews.html").read_text()  
# define your JSON template
template = {  
  "source": "html",
  "selectors": [
    {
      "name": "date_posted",
      # use css selectors
      "type": "css",
      "query": "[data-testid='review-date']::text",
      "multiple": True,  # one or multiple?
      # post process results with formatters
      "formatters": [ {
        "name": "datetime",
        "args": {"format": "%Y, %b %d — %A"}
      } ]
    }
  ]
}
api_response = client.extract(
    ExtractionConfig(
        body=document_text,
        content_type="text/html",
        extraction_ephemeral_template=template
    )
)
print(json.dumps(api_response.data))
```

```typescript
import { 
    ScrapflyClient, ExtractionConfig 
} from 'jsr:@scrapfly/scrapfly-sdk';

const client = new ScrapflyClient({ key: "API KEY" });
// product from https://web-scraping.dev/reviews
const document_text = Deno.readTextFileSync("./reviews.html").toString();

// define your template as JSON 
const template = {  
  "source": "html",
  "selectors": [
    {
      "name": "date_posted",
      // use css selectors
      "type": "css",
      "query": "[data-testid='review-date']::text",
      "multiple": true,  // one or multiple?
      // post process results with formatters
      "formatters": [ {
        "name": "datetime",
        "args": {"format": "%Y, %b %d — %A"}
      } ]
    }
  ]
}
const api_result = await client.extract(
    new ExtractionConfig({
        body: document_text,
        url: "https://web-scraping.dev/product/1",
        content_type: "text/html",
        ephemeral_template: template,
    })
);
console.log(JSON.stringify(api_result));
```

```bash
http POST https://api.scrapfly.io/extraction \
key==$SCRAPFLY_KEY \
content_type==text/html \
extraction_ephemeral_template:='{"source":"html","selectors":[{"name":"date_posted","type":"css","query":"[data-testid=review-date]::text","multiple":true}]}' \
@reviews.html
```

 

 

 [ Python SDK docs → ](https://scrapfly.io/docs/sdk/python) [ TypeScript SDK docs → ](https://scrapfly.io/docs/sdk/typescript) [ HTTP API docs → ](https://scrapfly.io/docs) 

 

Same AI models, plus a confidence report per field. Debug extraction quality without a second call. When a site blocks model access, check [which antibot it runs](https://scrapfly.io/products/antibot-detector) first.


```python
import json
from pathlib import Path
from scrapfly import ExtractionConfig, ScrapflyClient 

client = ScrapflyClient(key="API KEY")

# product from https://web-scraping.dev/product/1
document_text = Path("product.html").read_text()  

api_response = client.extract(
    ExtractionConfig(
        body=document_text,
        content_type="text/html",
        # use one of dozens of defined data models:
        extraction_model="product",
        # optional: provide file's url for converting relative links to absolute
        url="https://web-scraping.dev/product/1",
    )
)
# the data_quality field describes how much was found
print(json.dumps(api_response.extraction_result['data_quality']))
```

```typescript
import { 
    ScrapflyClient, ExtractionConfig 
} from 'jsr:@scrapfly/scrapfly-sdk';

const client = new ScrapflyClient({ key: "API KEY" });
// product from https://web-scraping.dev/product/1
const document_text = Deno.readTextFileSync("./product.html").toString();

const api_result = await client.extract(
    new ExtractionConfig({
        body: document_text,
        content_type: "text/html",
        url: "https://web-scraping.dev/product/1",
        extraction_model: "product"
    })
);
console.log(JSON.stringify(api_result));
```

```bash
http POST https://api.scrapfly.io/extraction \
key==$SCRAPFLY_KEY \
content_type==text/html \
extraction_model==product \
url==https://web-scraping.dev/product/1 \
@product.html
```

 

 

 [ Python SDK docs → ](https://scrapfly.io/docs/sdk/python) [ TypeScript SDK docs → ](https://scrapfly.io/docs/sdk/typescript) [ HTTP API docs → ](https://scrapfly.io/docs) 

 

 

 

---

LEARN

## Docs, Tools, and Ready-Made Parsers

Everything you need to go from raw HTML to a production data pipeline.

 

 ### API Reference

Every parameter, every response field, with runnable examples for all three extraction strategies.

 [ Developer Docs → ](https://scrapfly.io/docs/extraction-api/getting-started) 



 

 ### Academy

Interactive courses on web scraping, HTML parsing, and structured data extraction.

 [ Start learning → ](https://scrapfly.io/academy) 



 

 ### Open-Source Scrapers

40+ production-ready scrapers on GitHub. Each one uses the Extraction API for the parsing step.

 [ Explore repo → ](https://github.com/scrapfly/scrapfly-scrapers) 



 

 ### Developer Tools

CSS selector tester, cURL-to-Python converter, JA3 checker, and more.

 [ Browse tools → ](https://scrapfly.io/web-scraping-tools) 



 

 

 

---

INTEGRATIONS

## Seamlessly integrate with frameworks & platforms

Plug Scrapfly into your favorite tools, or build custom workflows with our first-class SDKs.

 ### No-code automation

 [  Zapier ](https://scrapfly.io/integration/zapier) [  Make ](https://scrapfly.io/integration/make) [  n8n ](https://scrapfly.io/integration/n8n) 

 

### LLM &amp; RAG frameworks

 [  LlamaIndex ](https://scrapfly.io/integration/llamaindex) [  LangChain ](https://scrapfly.io/integration/langchain) [  CrewAI ](https://scrapfly.io/integration/crewai) 

 

### First-class SDKs

- [Python](https://scrapfly.io/docs/sdk/python) - `pip install scrapfly-sdk`
- [TypeScript](https://scrapfly.io/docs/sdk/typescript) - Node, Deno, Bun
- [Go](https://scrapfly.io/docs/sdk/golang) - `go get scrapfly-sdk`
- [Rust](https://scrapfly.io/docs/sdk/rust) - `cargo add scrapfly-sdk`
- [Scrapy](https://scrapfly.io/docs/sdk/scrapy) - full-feature extension

 

 

 [ See all integrations  ](https://scrapfly.io/integration) 

 

---

FAQ

## Frequently Asked Questions

 

  ### What is the Extraction API?

 The Extraction API turns raw HTML (or other text documents) into structured JSON. You can use a pre-trained AI model (product, article, review, job posting, etc.), define your own CSS/XPath template, or write a plain-English LLM prompt. All three strategies share the same endpoint and return the same JSON envelope.

 

   ### When should I use a model vs. a template vs. a prompt?

 Use a pre-trained model (`extraction_model`) when your document fits a standard schema - product pages, news articles, job listings. Use a template (`extraction_template` or `ephemeral_template`) when you need precise, repeatable extraction from a known site structure. Use a prompt (`extraction_prompt`) for ad-hoc questions, rapidly changing layouts, or anything that doesn't fit a fixed schema.

 

   ### How do I pair the Extraction API with web scraping?

 Use the [Web Scraping API](https://scrapfly.io/"/products/web-scraping-api") to fetch the raw HTML - with anti-bot bypass, JavaScript rendering, and proxy rotation - then pass the HTML body to the Extraction API. Both SDKs share the same client, so you can chain a scrape and an extract in a few lines of Python or TypeScript.
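A minimal sketch of that chain, assuming the scrape response exposes the page HTML as `.content`:

```python
from scrapfly import ExtractionConfig, ScrapeConfig, ScrapflyClient

client = ScrapflyClient(key="API KEY")

# 1. Fetch with anti-bot bypass and JavaScript rendering.
scrape = client.scrape(
    ScrapeConfig(url="https://web-scraping.dev/product/1",
                 asp=True, render_js=True)
)

# 2. Feed the rendered HTML straight into the Extraction API.
extraction = client.extract(
    ExtractionConfig(
        body=scrape.content,  # assumption: .content holds the page HTML
        content_type="text/html",
        url="https://web-scraping.dev/product/1",
        extraction_model="product",
    )
)
print(extraction.data)
```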

 

   ### What document formats are supported?

 HTML, XML, JSON, CSV, RSS, Markdown, and plain text are supported today. Send the body as a string and set `content_type` to match. Each format works with all three extraction strategies.

 

   ### Is my data used for AI training?

 No. Scrapfly does not store, share, or use your document content for training AI models. Data sent to the Extraction API is processed in memory and discarded after the response is returned. See our privacy policy for full details.

 

   ### How does quality reporting work for AI models?

 Every AI model extraction response includes a `data_quality` field that reports coverage (how many expected fields were found) and confidence. Use it to detect low-quality extractions and trigger fallback logic or manual review without inspecting every record.
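A sketch of that fallback pattern. The `data_quality` field itself is documented above; the key names inside it are assumptions for illustration:

```python
from pathlib import Path
from scrapfly import ExtractionConfig, ScrapflyClient

client = ScrapflyClient(key="API KEY")
document_text = Path("product.html").read_text()

api_response = client.extract(
    ExtractionConfig(body=document_text, content_type="text/html",
                     extraction_model="product")
)

quality = api_response.extraction_result["data_quality"]
# "coverage" is an assumed key name - inspect a real response for the
# exact shape of the data_quality report.
if quality.get("coverage", 1.0) < 0.8:
    # Coverage too low: fall back to an LLM prompt and flag for review.
    fallback = client.extract(
        ExtractionConfig(body=document_text, content_type="text/html",
                         extraction_prompt="extract name, price, rating as JSON")
    )
    print(fallback.data)
```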

 

   ### What counts as a billable extraction?

 Each successful API call counts as one extraction credit. Calls using AI models or LLM prompts cost additional credits (published in the pricing table). Failed requests - where the API returns an error - do not consume credits.

 

  

 

  ---

PRICING

## Simple pricing for structured extraction

One plan covers the full Scrapfly platform. Pick a monthly credit budget; every API shares the same credit pool. No per-product lock-in, no surprise line items.

 

- **Free tier.** 1,000 free credits on signup. No credit card required.

 

 

- **Pay on success.** Failed extractions are free. You only pay for delivered JSON.

 

 

- **No lock-in.** Upgrade, downgrade, or cancel anytime. No contract.

 

 

 

 [ See pricing  ](https://scrapfly.io/pricing) [ Start free ](https://scrapfly.io/register) 

 

 

### Build the full pipeline.

 The Extraction API handles parsing. Pair it with the [Web Scraping API](https://scrapfly.io/products/web-scraping-api) for anti-bot bypass and JavaScript rendering, [Browser API](https://scrapfly.io/products/cloud-browser-api) for full Playwright/Puppeteer control, or [Crawler API](https://scrapfly.io/products/crawler-api) to extract data across many pages in a single crawl job.

 

[Get Free API Key](https://scrapfly.io/register) - 1,000 free credits. No card required.