LLM Extraction
Harness the power of natural language processing to seamlessly extract data from any website. Our advanced models simplify the extraction process by handling technical complexities such as chunking large documents, tokenization, and other NLP tasks. This allows you to focus on what truly matters while we take care of the heavy lifting.
Benefits
- Ease of Use: No need to worry about technical details; our models manage everything for you.
- Versatile Content Types: Supports various text content types including text, html, xml, markdown, json, rss, and csv. We plan to support additional content types like application/pdf in the future.
Usage
- Retrieve your content: When using the Extraction API, you already have the content. For this example, we use the page at https://web-scraping.dev/product/1. Save its content as product.html in the current directory, where you will run the curl command.
- Prepare your prompt: Extract the product specification in json format
- Call the extraction API with your prompt URL-encoded: extraction_prompt=Extract%20the%20product%20specification%20in%20json%20format
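As a small illustration of the URL-encoding step, the prompt can be percent-encoded with Python's standard library. This is a minimal sketch; the variable names are only for illustration.

```python
from urllib.parse import quote

prompt = "Extract the product specification in json format"

# quote() percent-encodes each space as %20; safe="" also encodes "/" if present.
encoded = quote(prompt, safe="")
query_fragment = f"extraction_prompt={encoded}"
print(query_fragment)
```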
Command Explanation
- curl -X POST: curl is a command-line tool for transferring data with URLs. -X POST specifies the HTTP method to be used, which is POST in this case.
- -H "content-type: text/html": -H is used to specify an HTTP header for the request. "content-type: text/html" sets the Content-Type header to text/html, indicating that the data being sent is HTML.
- URL: The URL of the API endpoint being accessed, including query parameters for authentication and specifying the target URL and extraction prompt.
  - key: An API key for authentication.
  - url: The URL of the web page to be scraped, URL-encoded.
  - extraction_prompt: A prompt specifying what to extract, in this case, "Extract the product specification in json format".
- -d @product.html: -d is used to specify the data to be sent in the POST request body. @product.html indicates that the data should be read from a file named product.html.
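The same request can be sketched in Python. The code below only builds the request and does not send it. The endpoint path https://api.scrapfly.io/extraction is an assumption (this section does not show it), and build_extraction_request is a hypothetical helper name, so treat this as a sketch rather than the official client.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request

# Assumption: the endpoint path is not shown in this section; adjust it to your account's docs.
EXTRACTION_ENDPOINT = "https://api.scrapfly.io/extraction"


def build_extraction_request(api_key: str, page_url: str, prompt: str, html: bytes) -> Request:
    """Build (but do not send) a POST request mirroring the curl options above."""
    query = urlencode(
        {"key": api_key, "url": page_url, "extraction_prompt": prompt},
        quote_via=quote,  # encode spaces as %20 instead of "+"
    )
    return Request(
        f"{EXTRACTION_ENDPOINT}?{query}",
        data=html,                              # request body, like -d
        headers={"content-type": "text/html"},  # like -H
        method="POST",                          # like -X POST
    )


# Placeholder body; in practice, read the bytes of product.html instead.
request = build_extraction_request(
    api_key="__API_KEY__",
    page_url="https://web-scraping.dev/product/1",
    prompt="Extract the product specification in json format",
    html=b"<html></html>",
)
print(request.get_method(), request.full_url)
```

To actually send it, pass the request to urllib.request.urlopen and read the response body.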
The result
You can instruct the model to return JSON. The content_type field of the result tells you the actual format of the extracted data, which we validate.
When the content type is application/json, we return the data as a JSON object rather than as JSON text embedded in a string, so you do not need to decode it a second time.
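A client can branch on content_type when reading the result. The sketch below assumes the result is a dict with a content_type field and a data field; the data field name is an assumption made by analogy with result.extracted_data.data, and decode_extracted is a hypothetical helper.

```python
import json
from typing import Any


def decode_extracted(result: dict) -> Any:
    """Return the extracted data as a Python value, using content_type to decide how."""
    content_type = result.get("content_type", "")
    data = result.get("data")  # assumed field name
    if content_type == "application/json":
        # The data should already be a JSON object; decode defensively if it is a string.
        return json.loads(data) if isinstance(data, str) else data
    # Other content types (markdown, csv, ...) are returned as-is.
    return data


# Placeholder result, not real API output.
sample = {"content_type": "application/json", "data": {"name": "example"}}
decoded = decode_extracted(sample)
print(decoded)
```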
If you receive the error ERR::EXTRACTION::DATA_ERROR, read the description provided with the error code: when the LLM cannot extract the data you asked for, it explains why. Common causes and fixes:
- The data you are asking for is not present in the document.
- Be more precise by adding "In this document," to your prompt.
- Use wording related to data extraction; for example, replace "retrieve" with "extract".
Web Scraping API
In this example, we extract the data with the following LLM prompt: "Present me the product like you are a sales person, summarize it and give the top pro and cons from reviews in bullet point list"
Combined with the cache feature, we cache the raw data from the website, allowing you to re-extract the data with multiple extraction passes at a much faster speed and lower cost. This applies to the following extraction types:
Learn more about cache feature
curl -G \
--request "GET" \
--url "https://api.scrapfly.io/scrape" \
--data-urlencode "tags=player,project:default" \
--data-urlencode "extraction_prompt=Present me the product like you are a sales person, summarize it and give the top pro and cons from reviews in bullet point list" \
--data-urlencode "cache=true" \
--data-urlencode "asp=true" \
--data-urlencode "render_js=true" \
--data-urlencode "key=__API_KEY__" \
--data-urlencode "url=https://web-scraping.dev/product/1"
https://api.scrapfly.io/scrape?tags=player%2Cproject%3Adefault&extraction_prompt=Present%20me%20the%20product%20like%20you%20are%20a%20sales%20person%2C%20summarize%20it%20and%20give%20the%20top%20pro%20and%20cons%20from%20reviews%20in%20bullet%20point%20list&cache=true&asp=true&render_js=true&key=__API_KEY__&url=https%3A%2F%2Fweb-scraping.dev%2Fproduct%2F1
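When building this URL by hand, each query value should be percent-encoded exactly once; a %25 in the result means a percent sign was encoded a second time. The sketch below builds the URL from the same parameters as the curl command using only the standard library, then decodes it again as a round-trip check.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit

params = {
    "tags": "player,project:default",
    "extraction_prompt": (
        "Present me the product like you are a sales person, summarize it "
        "and give the top pro and cons from reviews in bullet point list"
    ),
    "cache": "true",
    "asp": "true",
    "render_js": "true",
    "key": "__API_KEY__",
    "url": "https://web-scraping.dev/product/1",
}

# urlencode() encodes each value once; quote_via=quote yields %20 for spaces.
scrape_url = "https://api.scrapfly.io/scrape?" + urlencode(params, quote_via=quote)

# Round-trip check: decoding the query string gives back the original values.
round_trip = {k: v[0] for k, v in parse_qs(urlsplit(scrape_url).query).items()}
print(scrape_url)
```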
In the full Web Scraping API response structure, the extracted data is available in the result.extracted_data.data field.
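Reading that field from a decoded response can be sketched as follows. Only the path result.extracted_data.data comes from this page; the sample response is a trimmed placeholder, and get_extracted_data is a hypothetical helper name.

```python
from typing import Any, Optional


def get_extracted_data(response: dict) -> Optional[Any]:
    """Return response['result']['extracted_data']['data'], or None if any level is missing."""
    node: Any = response
    for key in ("result", "extracted_data", "data"):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


# Trimmed placeholder, not a real API response.
response = {"result": {"extracted_data": {"data": "placeholder"}}}
extracted = get_extracted_data(response)
print(extracted)
```

Returning None instead of raising keeps the caller's handling simple when extraction was not requested or failed.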
🔥 Popular LLM Integration
Scrapfly is directly integrated with well-known tools to simplify LLM data retrieval.
LlamaIndex
LlamaIndex, formerly known as GPT Index, is a data framework designed to facilitate the connection between large language models (LLMs) and a wide variety of data sources. It provides tools to effectively ingest, index, and query data within these models.
Integrate Scrapfly with LlamaIndex
LangChain
LangChain is a robust framework designed for developing applications powered by language models. It focuses on enabling the creation of applications that can leverage the capabilities of large language models (LLMs) for a variety of use cases.
Integrate Scrapfly with LangChain
Limitations
- The maximum length of the prompt is 10,000 characters.
- We may not be able to fulfill all requests under heavy load (more than 1k req/s), and GPU shortages and quotas limit how far we can scale. We also prioritize requests based on the plan; the free plan has the lowest priority. You will get the error code ERR::EXTRACTION::OUT_OF_CAPACITY if this happens.
- The maximum prompt execution time is 25 seconds. The biggest factor is the output (response) size: the bigger the output, the longer it takes to process. We observe a throughput between 120 and 150 tokens per second (TPS), and we expect to reach 500 tokens per second by the end of the year.
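These limits can be checked on the client side before sending a request. The sketch below validates the prompt length and works out a rough upper bound on output size: 25 s × 120 tokens/s = 3,000 tokens and 25 s × 150 tokens/s = 3,750 tokens. It assumes the whole time budget is spent generating output, so the practical ceiling is likely lower, and it assumes characters are counted the way Python's len() counts them. The function names are hypothetical.

```python
from typing import Tuple

MAX_PROMPT_CHARS = 10_000    # from the limits above
MAX_EXECUTION_SECONDS = 25   # from the limits above
OBSERVED_TPS = (120, 150)    # observed tokens per second, from the limits above


def check_prompt(prompt: str) -> None:
    """Raise ValueError if the prompt exceeds the documented maximum length."""
    if len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError(f"prompt has {len(prompt)} characters, maximum is {MAX_PROMPT_CHARS}")


def output_token_budget() -> Tuple[int, int]:
    """Rough (low, high) upper bound on output tokens within the time limit."""
    low_tps, high_tps = OBSERVED_TPS
    return MAX_EXECUTION_SECONDS * low_tps, MAX_EXECUTION_SECONDS * high_tps


check_prompt("Extract the product specification in json format")
budget = output_token_budget()
print(budget)
```

If a prompt asks for output well beyond this budget, splitting it into several narrower prompts is one way to stay under the time limit.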
Error Handling
All related errors are listed below. You can find the full description and an example error response in the Errors section.
- ERR::EXTRACTION::CONFIG_ERROR - Parameters sent to the API are not valid
- ERR::EXTRACTION::CONTENT_TYPE_NOT_SUPPORTED - The content type of the response is not supported for extraction.
- ERR::EXTRACTION::DATA_ERROR - Extracted data is invalid or has an issue
- ERR::EXTRACTION::INVALID_RULE - The extraction rule is invalid
- ERR::EXTRACTION::INVALID_TEMPLATE - The template used for extraction is invalid
- ERR::EXTRACTION::NO_CONTENT - Target response is empty
- ERR::EXTRACTION::OPERATION_TIMEOUT - Extraction Operation Timeout
- ERR::EXTRACTION::OUT_OF_CAPACITY - Not able to extract more data, the backend is out of capacity, retry later.
- ERR::EXTRACTION::TEMPLATE_NOT_FOUND - The provided template does not exist
- ERR::EXTRACTION::TIMEOUT - The extraction took too long (maximum 25s) or did not have enough time to complete
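Since ERR::EXTRACTION::OUT_OF_CAPACITY asks you to retry later, a client may want to retry with a growing delay. The sketch below is one possible approach, not an official client: call_with_retry is a hypothetical helper, the callable is assumed to return an (error_code, payload) pair, and treating only OUT_OF_CAPACITY as retryable is an assumption you may want to adjust.

```python
import time
from typing import Any, Callable, Optional, Tuple

# Assumption: only this code is retried; the others usually need a changed request.
RETRYABLE = {"ERR::EXTRACTION::OUT_OF_CAPACITY"}


def call_with_retry(
    call: Callable[[], Tuple[Optional[str], Any]],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Run call() until it succeeds, retrying retryable errors with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        error_code, payload = call()
        if error_code is None:
            return payload
        if error_code not in RETRYABLE or attempt == max_attempts - 1:
            raise RuntimeError(f"extraction failed: {error_code}")
        sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

The sleep parameter is injectable so the backoff can be exercised in tests without waiting.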
Pricing
During the beta period, LLM extraction is billed at 5 API Credits. The final pricing will be announced when the beta period ends.
For more information about pricing, see the dedicated section.