FAQ

Here are some of the most common issues and questions that come up when using the Scrapfly Extraction API.

What is the Extraction API?

The Extraction API is a powerful tool that extracts structured data from any text content including HTML, Markdown, and plain text. It uses AI models, LLMs, and custom parsing instructions to convert unstructured content into structured formats like JSON.

The API supports three extraction methods:

extraction_model - Automatic extraction using predefined AI models for common data types (products, articles, reviews, etc.)
extraction_prompt - LLM-powered extraction using custom prompts
extraction_template - Template-based extraction using JSON schemas and CSS/XPath selectors

How do I use the Extraction API?

Make a POST request to https://api.scrapfly.io/extraction with:

Your API key as the key parameter
The content to extract from in the request body
One extraction method: extraction_model, extraction_prompt, or extraction_template

See the Getting Started guide for detailed examples.
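
The steps above can be sketched with Python's standard library. This is a minimal illustration, not official client code: the `build_extraction_request` helper, the `YOUR_API_KEY` placeholder, and the sample HTML are assumptions made for the example, while the endpoint and parameter names come from this FAQ. The request is only built here; sending it is left commented out.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://api.scrapfly.io/extraction"
METHODS = {"extraction_model", "extraction_prompt", "extraction_template"}


def build_extraction_request(api_key, content, method, value, content_type="text/html"):
    """Build (but do not send) a POST request for the Extraction API.

    Hypothetical helper: `method` must be one of the three extraction
    parameters listed in this FAQ, and `value` is that parameter's value.
    """
    if method not in METHODS:
        raise ValueError(f"unknown extraction method: {method}")
    # The API key and the extraction method travel as query parameters.
    query = urlencode({"key": api_key, method: value})
    return Request(
        f"{API_URL}?{query}",
        data=content.encode("utf-8"),  # the content to extract from is the body
        headers={"Content-Type": content_type},
        method="POST",
    )


request = build_extraction_request(
    "YOUR_API_KEY",
    "<html><body><h1>Example product</h1></body></html>",
    "extraction_model",
    "product",
)
print(request.get_method(), request.full_url)

# To send it (needs a valid key and network access):
# with urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```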

Can I combine web scraping and extraction?

Yes! The Web Scraping API directly integrates extraction capabilities. You can add extraction_model, extraction_prompt, or extraction_template parameters to your scraping requests to get extracted data in a single API call.

This is the recommended approach when you need both scraping and extraction.
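
As a rough sketch of the single-call approach, the snippet below builds a Web Scraping API URL that also carries an extraction parameter. The `https://api.scrapfly.io/scrape` endpoint and the `url` parameter name are assumptions not stated in this FAQ, and `build_scrape_url` is a hypothetical helper; check the Web Scraping API documentation before relying on them.

```python
from urllib.parse import urlencode

# Assumed endpoint for the Web Scraping API (not stated in this FAQ).
SCRAPE_URL = "https://api.scrapfly.io/scrape"


def build_scrape_url(api_key, target_url, extraction_model):
    """Compose one URL that asks for both scraping and extraction."""
    params = {
        "key": api_key,
        "url": target_url,  # assumed parameter name for the page to scrape
        "extraction_model": extraction_model,
    }
    # urlencode percent-encodes the target URL so it survives as a query value.
    return f"{SCRAPE_URL}?{urlencode(params)}"


scrape_url = build_scrape_url("YOUR_API_KEY", "https://web-scraping.dev/product/1", "product")
print(scrape_url)
```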

What extraction models are available?

Scrapfly supports automatic extraction models for common web data types:

E-commerce: product, product_listing
Content: article, review_list, social_media_post
Real Estate: real_estate_property, real_estate_property_listing
Jobs: job_posting, job_listing
Travel: hotel, hotel_listing, event
Other: organization, software, stock, vehicle_ad, food_recipe, search_engine_results

See the complete model list for details on each model's schema.

How accurate are the automatic extraction models?

Automatic extraction models use advanced AI to identify and extract data patterns. Accuracy depends on:

How well the page structure matches common patterns for that data type
The quality and structure of the source HTML
Whether the page uses semantic HTML markup (schema.org, microdata, etc.)

For best results on non-standard pages, consider using extraction_prompt or extraction_template for more control.

How does LLM prompt extraction work?

LLM prompt extraction uses AI language models to understand your natural language instructions and extract data accordingly. Simply provide a prompt describing what data you want to extract, and the AI will interpret the content and return structured results.

Example: extraction_prompt=Extract all product names, prices, and ratings in JSON format

See the LLM Prompt documentation for more examples and best practices.

What are best practices for writing extraction prompts?

For effective LLM prompt extraction:

Be specific about what data fields you want
Specify the output format (JSON, CSV, etc.)
Include examples of the expected structure when possible
Mention data types and formatting requirements
Keep prompts concise but descriptive

Example: "Extract product information as JSON with fields: name (string), price (number), availability (boolean)"
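
One way to keep prompts specific and consistent is to generate them from a field list. The sketch below is illustrative only: `build_prompt` is a hypothetical helper, not part of any Scrapfly SDK. It reproduces the example prompt above and percent-encodes it, which is needed whenever the prompt is passed as a URL query parameter.

```python
from urllib.parse import quote


def build_prompt(entity, fields, output_format="JSON"):
    """Compose a prompt that names every field and its expected type."""
    field_list = ", ".join(f"{name} ({type_})" for name, type_ in fields.items())
    return f"Extract {entity} information as {output_format} with fields: {field_list}"


prompt = build_prompt(
    "product",
    {"name": "string", "price": "number", "availability": "boolean"},
)
print(prompt)

# Percent-encode the prompt before placing it in a query string.
encoded_prompt = quote(prompt, safe="")
print(f"extraction_prompt={encoded_prompt}")
```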

When should I use template extraction?

Use extraction_template when:

You need precise control over data extraction
You're scraping the same website structure repeatedly
You want deterministic, rule-based extraction
You need to extract data from specific HTML elements using CSS or XPath selectors
Cost efficiency is important (template extraction is cheaper than LLM-based methods)

How do I write an extraction template?

Extraction templates are JSON schemas that define extraction rules using CSS/XPath selectors. They support nested structures, arrays, and various data transformations.

See the Template Extraction documentation for detailed syntax, examples, and available selector types.

What content types can I extract from?

The Extraction API supports:

HTML - Web pages and HTML documents
Markdown - Markdown formatted text
Plain Text - Raw text content
JSON - JSON data (for transformation/restructuring)

Specify the content type using the Content-Type header (e.g., text/html, text/markdown).

Can I send compressed content?

Yes! To reduce bandwidth and improve performance, you can send gzip-compressed content by:

Compressing your content with gzip
Setting the Content-Encoding: gzip header
Sending the compressed data in the request body

This is especially useful for large HTML documents.
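
The three steps above might look like this in Python, using only the standard library. The `build_gzip_request` helper, the repeated sample HTML, and the `YOUR_API_KEY` placeholder are assumptions for illustration; the request is built but not sent.

```python
import gzip
from urllib.request import Request


def build_gzip_request(url, html):
    """Build a POST request whose body is the gzip-compressed HTML."""
    body = gzip.compress(html.encode("utf-8"))
    return Request(
        url,
        data=body,
        headers={
            "Content-Type": "text/html",   # type of the original content
            "Content-Encoding": "gzip",    # tells the server the body is compressed
        },
        method="POST",
    )


html = "<html><body>" + "<p>row</p>" * 1000 + "</body></html>"
gzip_request = build_gzip_request(
    "https://api.scrapfly.io/extraction?key=YOUR_API_KEY&extraction_model=product",
    html,
)
print(len(html.encode("utf-8")), "bytes raw ->", len(gzip_request.data), "bytes compressed")
```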

How much does extraction cost?

Extraction API costs vary by method:

Template extraction: Most cost-effective, charged per extraction
Automatic models: Moderate cost, charged per extraction
LLM prompts: Higher cost due to AI processing, charged based on content size and complexity

See the Billing page for detailed pricing information. You can also check extraction costs in the API response headers (X-Scrapfly-Api-Cost).

How do I check extraction costs?

Extraction costs are available in:

API Response: Check the X-Scrapfly-Api-Cost response header
Response Body: Look at the context.cost field in the JSON response
Dashboard: View detailed cost breakdown in the Monitoring dashboard
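
A small sketch of reading the cost from both places in code. The header name and the context.cost field come from this FAQ; the `read_cost` helper and the sample values (a cost of 5) are invented for illustration, and the real shape of the cost value may differ.

```python
import json


def read_cost(headers, body_text):
    """Return (header_cost, body_cost); either is None when absent."""
    # HTTP header names are case-insensitive, so normalise before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    header_cost = lowered.get("x-scrapfly-api-cost")
    body = json.loads(body_text) if body_text else {}
    body_cost = body.get("context", {}).get("cost")
    return header_cost, body_cost


# Made-up sample response parts, for demonstration only.
sample_headers = {"X-Scrapfly-Api-Cost": "5"}
sample_body = json.dumps({"context": {"cost": 5}})
print(read_cost(sample_headers, sample_body))  # prints ('5', 5)
```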

What does HTTP status code 400 mean?

Status 400 indicates a malformed request. Common causes:

Missing required parameters (key, extraction method)
Invalid extraction template JSON syntax
Incorrect content-type header
Invalid parameter values

Check the error message in the response body for specific details. See the Errors page for all error codes.
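
Some of these causes can be caught before the request is sent. The check below is a hedged sketch: `preflight` is a hypothetical helper, and it assumes that exactly one extraction method should be present and that a template is written as JSON text before it is passed to the API. It does not replace reading the error message in the response body.

```python
import json

METHODS = ("extraction_model", "extraction_prompt", "extraction_template")


def preflight(params, template_json=None):
    """Return a list of problems likely to cause a 400; empty means none found."""
    problems = []
    if not params.get("key"):
        problems.append("missing key parameter")
    chosen = [name for name in METHODS if params.get(name)]
    if len(chosen) != 1:
        problems.append(f"expected exactly one extraction method, got {len(chosen)}")
    if template_json is not None:
        try:
            json.loads(template_json)  # syntax check only, not a schema check
        except json.JSONDecodeError as exc:
            problems.append(f"template is not valid JSON: {exc.msg}")
    return problems


print(preflight({"key": "YOUR_API_KEY", "extraction_model": "product"}))  # prints []
print(preflight({"extraction_prompt": "Extract prices"}))  # prints ['missing key parameter']
```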

What does HTTP status code 401 mean?

Status 401 indicates authentication failure. Make sure you:

Include the key parameter with your valid API key
Have an active Scrapfly subscription
Are using the correct API endpoint

Why are my extraction results empty?

Empty or incomplete results can occur when:

Template extraction: CSS/XPath selectors don't match the content structure. Verify selectors using browser dev tools.
Automatic models: Content doesn't match expected patterns. Try a different model or use LLM prompt extraction.
LLM prompts: Prompt is too vague or content is ambiguous. Make your prompt more specific.
Content issue: The source content is loaded dynamically via JavaScript, so it is missing from the HTML you send. Use the Web Scraping API with render_js=true to capture the rendered page.

How can I speed up extraction?

To improve extraction performance:

Use extraction_template for fastest processing (rule-based, no AI)
Send compressed content using gzip to reduce transfer time
Minimize content size by pre-filtering unnecessary HTML
Use specific extraction models instead of generic LLM prompts when possible
Consider caching results if extracting from the same content repeatedly
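
The caching suggestion can be sketched as an in-memory cache keyed by a hash of the content plus the extraction method. Everything here is illustrative: `ExtractionCache` is a hypothetical class, and `fake_extract` stands in for whatever function actually calls the API.

```python
import hashlib


class ExtractionCache:
    """Cache extraction results by content hash and extraction method."""

    def __init__(self, extract):
        self._extract = extract  # callable that performs the real extraction
        self._store = {}

    def get(self, content, method, value):
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        cache_key = (digest, method, value)
        if cache_key not in self._store:
            # Only call the extractor on a cache miss.
            self._store[cache_key] = self._extract(content, method, value)
        return self._store[cache_key]


calls = []


def fake_extract(content, method, value):
    """Stand-in for an API call; records each invocation."""
    calls.append((method, value))
    return {"length": len(content)}


cache = ExtractionCache(fake_extract)
first = cache.get("<h1>Hi</h1>", "extraction_model", "article")
second = cache.get("<h1>Hi</h1>", "extraction_model", "article")
print(len(calls))  # prints 1
```

A persistent store would be needed to keep results across runs; the dictionary here lives only as long as the process.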

Does the Extraction API support webhooks?

Yes! You can use the webhook parameter to receive extraction results asynchronously. This is useful for long-running extractions or integrating with other systems.

See the Webhook documentation for details on setup and payload format.

What is the maximum content size for extraction?

Content size limits vary by extraction method:

Template extraction: Up to 10MB of HTML content
Automatic models: Up to 5MB of HTML content
LLM prompts: Up to 1MB of content (due to LLM token limits)

For larger content, consider splitting it into chunks or pre-processing to extract relevant sections.
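
One possible way to chunk content is to split on line boundaries while tracking the UTF-8 byte size. This sketch suits plain text or Markdown; splitting HTML this way can cut through elements, so treat it as a starting point. The `split_by_bytes` helper and the tiny 20-byte limit are assumptions chosen to keep the example readable.

```python
def split_by_bytes(text, max_bytes):
    """Split text on line boundaries so each chunk is at most max_bytes in UTF-8."""
    chunks, current, size = [], [], 0
    for line in text.splitlines(keepends=True):
        line_size = len(line.encode("utf-8"))
        if line_size > max_bytes:
            raise ValueError("a single line exceeds max_bytes")
        # Start a new chunk when adding this line would exceed the limit.
        if current and size + line_size > max_bytes:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += line_size
    if current:
        chunks.append("".join(current))
    return chunks


text = "line\n" * 10  # 50 bytes in total
chunks = split_by_bytes(text, max_bytes=20)
print([len(chunk.encode("utf-8")) for chunk in chunks])  # prints [20, 20, 10]
```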

What is the URL parameter used for?

The url parameter is optional and provides context to the extraction engine. It helps:

Resolve relative URLs in the content to absolute URLs
Provide additional context for AI models
Track extraction source in monitoring dashboards

While optional, including the URL parameter is recommended for better results.
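
To illustrate what resolving relative URLs means, the snippet below uses Python's `urljoin` with a base URL. It shows the general technique, not Scrapfly's internal implementation, and the paths under web-scraping.dev are made up for the example.

```python
from urllib.parse import urljoin

# Hypothetical page URL, as it would be passed in the url parameter.
base_url = "https://web-scraping.dev/products/list"

# Relative links as they might appear in the page's HTML.
relative_links = ["item?id=1", "../images/a.png", "/about"]

absolute_links = [urljoin(base_url, link) for link in relative_links]
for link in absolute_links:
    print(link)
# prints:
# https://web-scraping.dev/products/item?id=1
# https://web-scraping.dev/images/a.png
# https://web-scraping.dev/about
```

Without the base URL, links such as `item?id=1` cannot be turned into fetchable addresses, which is why supplying the URL tends to improve results.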

Which extraction method should I use?

Choose based on your needs:

Template extraction: Choose this when you know the exact page structure and want the fastest processing at the lowest cost.
Automatic models: Choose these when extracting common data types (products, articles) and you want a balance of ease and accuracy.
LLM prompts: Choose these when you need flexibility or are extracting unique or complex data, and can accept a higher cost for AI-powered extraction.

You can also combine methods: use templates for structured parts and LLM prompts for unstructured content.

Where can I learn about data extraction?

Scrapfly provides comprehensive learning resources:

HTML Parsing Academy - Learn HTML parsing fundamentals
JSON Parsing Academy - Master JSON data extraction
Scrapfly Academy - Complete web scraping roadmap
Scrapfly Blog - Tutorials and guides

Do SDKs support the Extraction API?

Yes! Both the Python and TypeScript SDKs fully support the Extraction API, with convenience methods for all three extraction types. The SDKs handle request formatting, authentication, and response parsing automatically.

See the Scrape API Getting Started guide for SDK installation and usage examples.

What format do extraction results use?

All extraction methods return results as JSON by default. The structure varies:

Template extraction: Matches your template schema exactly
Automatic models: Uses predefined schemas specific to each model (see model documentation)
LLM prompts: Structure depends on your prompt instructions

Results are available in the result.extracted_data.data field of the API response.
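
A defensive way to read that field from a parsed response is sketched below. The field path comes from this FAQ; the `get_path` helper and the sample response content are assumptions for illustration.

```python
def get_path(payload, dotted_path, default=None):
    """Follow a dotted path through nested dicts, returning default if any step is missing."""
    current = payload
    for part in dotted_path.split("."):
        if not isinstance(current, dict) or part not in current:
            return default
        current = current[part]
    return current


# Made-up response body shaped after the field path named above.
sample_response = {
    "result": {"extracted_data": {"data": {"name": "Example product", "price": 9.99}}}
}

data = get_path(sample_response, "result.extracted_data.data", default={})
print(data["name"])  # prints Example product
```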

How can I test extraction before production?

Test your extraction configurations using:

Web Player: Try extraction in the Scrapfly dashboard player (when combining scraping and extraction)
Test websites: Use web-scraping.dev for practice
Small batches: Start with a few examples before scaling
Monitoring: Check extraction logs in the dashboard