LLM Extraction

Harness the power of natural language processing to seamlessly extract data from any website. Our advanced models simplify the extraction process by handling technical complexities such as chunking large documents, tokenization, and other NLP tasks. This allows you to focus on what truly matters while we take care of the heavy lifting.

Benefits

  • Ease of Use: No need to worry about technical details; our models manage everything for you.
  • Versatile Content Types: Supports various text content types, including text, html, xml, markdown, json, rss, and csv. We plan to support additional content types such as application/pdf in the future.

Usage

  1. Retrieve your content: When using the extraction API, you already have the content. For this example, we use the page at https://web-scraping.dev/product/1. Save its content as product.html in the directory where you will run the curl command below.
  2. Prepare your prompt:
  3. Call the extraction API with your prompt URL-encoded: extraction_prompt=Extract%20the%20product%20specification%20in%20json%20format
    Command Explanation
    • curl -X POST:
      • curl is a command-line tool for transferring data with URLs.
      • -X POST specifies the HTTP method to be used, which is POST in this case.
    • -H "content-type: text/html":
      • -H is used to specify an HTTP header for the request.
      • "content-type: text/html" sets the Content-Type header to text/html, indicating that the data being sent is HTML.
    • URL:
      • The URL of the API endpoint being accessed, including query parameters for authentication and specifying the target URL and extraction prompt.
      • key: An API key for authentication.
      • url: The URL of the web page to be scraped, URL-encoded.
      • extraction_prompt: A prompt specifying what to extract, in this case, "Extract the product specification in json format".
    • -d @product.html:
      • -d is used to specify the data to be sent in the POST request body.
      • @product.html indicates that the data should be read from a file named product.html.
  4. The result
    You can ask for the result in JSON format. The content_type field of the response tells you the actual format of the returned data, which we validate.

    When the content type is application/json, we return the data as a JSON object rather than as a JSON string embedded in the JSON response, which makes it simpler to use.
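The request described in step 3 can also be assembled programmatically. The following is a minimal Python sketch, not an official client. It assumes the extraction endpoint is https://api.scrapfly.io/extraction (this page does not spell out the full URL), and build_extraction_request and send are hypothetical helper names. Only the URL and headers are built here; send is defined but never called, so nothing goes over the network until you invoke it with a real key.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen

# Assumption: endpoint of the extraction API (not stated on this page).
EXTRACTION_ENDPOINT = "https://api.scrapfly.io/extraction"


def build_extraction_request(api_key, page_url, prompt, content_type="text/html"):
    """Return (url, headers) for the POST described in the command explanation."""
    query = urlencode(
        {"key": api_key, "url": page_url, "extraction_prompt": prompt},
        quote_via=quote,  # encode spaces as %20 instead of +
    )
    return f"{EXTRACTION_ENDPOINT}?{query}", {"content-type": content_type}


def send(url, headers, html_path="product.html"):
    """POST the saved document as the request body and return the raw response bytes."""
    with open(html_path, "rb") as fh:
        request = Request(url, data=fh.read(), headers=headers, method="POST")
    with urlopen(request) as response:
        return response.read()


url, headers = build_extraction_request(
    "__API_KEY__",
    "https://web-scraping.dev/product/1",
    "Extract the product specification in json format",
)
print(url)
```

Passing quote_via=quote makes the prompt appear as Extract%20the%20product... in the query string, matching the encoded form shown in step 3.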
If you receive the error ERR::EXTRACTION::DATA_ERROR, read the description provided with the error code: when the LLM is not able to extract the data you are asking for, it explains why. Common causes and fixes:
  • The data you are asking for is not present in the document.
  • Be more precise by adding "In this document," to your prompt.
  • Use wording that matches data extraction; for example, replace "retrieve" with "extract".
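The two wording tips above are mechanical enough to apply in code before sending a prompt. The sketch below is only an illustration: refine_prompt is a hypothetical helper, not part of any Scrapfly SDK, and it cannot help when the requested data is simply absent from the document.

```python
import re


def refine_prompt(prompt):
    """Apply the two wording tips: say "extract" and anchor the prompt to the document."""
    # Replace the verb "retrieve" (any casing) with "extract".
    refined = re.sub(r"\bretrieve\b", "extract", prompt.strip(), flags=re.IGNORECASE)
    # Prefix the prompt unless it is empty or already anchored to the document.
    if refined and not refined.lower().startswith("in this document"):
        refined = "In this document, " + refined
    return refined


print(refine_prompt("Retrieve the latest reviews in a JSON format"))
# prints: In this document, extract the latest reviews in a JSON format
```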

Web Scraping API

In this example we will extract the data with the following LLM prompt:

Combined with the cache feature, we cache the raw data from the website, allowing you to re-extract the data with multiple extraction passes at a much faster speed and lower cost. This applies to the following extraction types:
Learn more about the cache feature
curl -G \
--request "GET" \
--url "https://api.scrapfly.io/scrape" \
--data-urlencode "tags=player,project:default" \
--data-urlencode "extraction_prompt=Present me the product like you are a sales person, summarize it and give the top pro and cons from reviews in bullet point list" \
--data-urlencode "cache=true" \
--data-urlencode "asp=true" \
--data-urlencode "render_js=true" \
--data-urlencode "key=__API_KEY__" \
--data-urlencode "url=https://web-scraping.dev/product/1"
https://api.scrapfly.io/scrape?tags=player%2Cproject%3Adefault&extraction_prompt=Present%20me%20the%20product%20like%20you%20are%20a%20sales%20person%2C%20summarize%20it%20and%20give%20the%20top%20pro%20and%20cons%20from%20reviews%20in%20bullet%20point%20list&cache=true&asp=true&render_js=true&key=__API_KEY__&url=https%3A%2F%2Fweb-scraping.dev%2Fproduct%2F1
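The same query string can be built in Python, which makes it easy to check that every value is percent-encoded exactly once. This is a minimal sketch using only the standard library; the endpoint and parameters are copied from the curl command above, and the variable names are illustrative. Encoding a value that is already encoded turns each "%" into "%25", so the API would receive the literal text "player%2Cproject%3Adefault" instead of "player,project:default".

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit

SCRAPE_ENDPOINT = "https://api.scrapfly.io/scrape"

# Raw (unencoded) parameter values, as passed to --data-urlencode in the curl command.
params = {
    "tags": "player,project:default",
    "extraction_prompt": (
        "Present me the product like you are a sales person, summarize it "
        "and give the top pro and cons from reviews in bullet point list"
    ),
    "cache": "true",
    "asp": "true",
    "render_js": "true",
    "key": "__API_KEY__",
    "url": "https://web-scraping.dev/product/1",
}

# Encode once; quote_via=quote writes spaces as %20 rather than +.
scrape_url = f"{SCRAPE_ENDPOINT}?{urlencode(params, quote_via=quote)}"

# Decoding the query once must give back the raw values.
decoded = {name: values[0] for name, values in parse_qs(urlsplit(scrape_url).query).items()}
print(scrape_url)
```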

The extracted data is available in the result.extracted_data.data field of the full Web Scraping API response.
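Reading that field from a decoded response might look like the sketch below. The response body here is hypothetical and heavily trimmed: only the result.extracted_data.data path comes from this page, the placement of content_type next to data is an assumption, and the values are placeholders rather than real API output.

```python
import json

# Hypothetical, trimmed response body. Real responses carry many more fields.
raw_response = """
{"result": {"extracted_data": {"content_type": "application/json",
                               "data": {"name": "placeholder"}}}}
"""


def get_extracted_data(response):
    """Return (content_type, data) from a decoded Web Scraping API response."""
    extracted = response["result"]["extracted_data"]
    # Assumption: content_type sits alongside data inside extracted_data.
    return extracted.get("content_type"), extracted["data"]


content_type, data = get_extracted_data(json.loads(raw_response))
if content_type == "application/json":
    # The data is already a JSON object, so no second json.loads is needed.
    print(data["name"])
else:
    # Any other content type is treated as plain text here.
    print(data)
```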

Scrapfly is directly integrated with well-known tools to simplify LLM data retrieval.

LlamaIndex

LlamaIndex, formerly known as GPT Index, is a data framework designed to facilitate the connection between large language models (LLMs) and a wide variety of data sources. It provides tools to effectively ingest, index, and query data within these models.

Integrate Scrapfly with LlamaIndex

LangChain

LangChain is a robust framework designed for developing applications powered by language models. It focuses on enabling the creation of applications that can leverage the capabilities of large language models (LLMs) for a variety of use cases.

Integrate Scrapfly with LangChain

Limitations

  • The maximum length of the prompt is 10,000 characters.
  • Under heavy load (more than 1k req/s), we may not be able to fulfill every request, and scaling is limited by GPU shortages and quotas. Requests are prioritized by plan, and the free plan has the lowest priority. If this happens, you will receive the error code ERR::EXTRACTION::OUT_OF_CAPACITY.
  • The maximum prompt execution time is 25 seconds. The biggest factor is the output (response) size: the bigger the output, the longer it takes to process. We currently observe between 120 and 150 tokens per second (TPS), and we expect to reach 500 tokens per second by the end of the year.
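These limits translate into a rough output budget. The sketch below is a back-of-the-envelope estimate, not a guarantee: it assumes the whole 25-second window is spent generating output at a constant rate, which makes the figures upper bounds. The function names are illustrative.

```python
MAX_PROMPT_CHARS = 10_000     # maximum prompt length
MAX_EXECUTION_SECONDS = 25    # maximum prompt execution time


def prompt_within_limit(prompt):
    """Return True if the prompt respects the 10,000-character limit."""
    return len(prompt) <= MAX_PROMPT_CHARS


def max_output_tokens(tokens_per_second, seconds=MAX_EXECUTION_SECONDS):
    """Upper bound on output tokens produced within the execution window."""
    return tokens_per_second * seconds


for tps in (120, 150, 500):
    print(f"{tps} tokens/s -> at most {max_output_tokens(tps)} tokens")
# 120 tokens/s -> at most 3000 tokens
# 150 tokens/s -> at most 3750 tokens
# 500 tokens/s -> at most 12500 tokens
```

At the currently observed rates, a prompt whose answer needs more than roughly 3,000 to 3,750 tokens is likely to hit the time limit, so splitting a large request into several smaller prompts is a reasonable precaution.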

All related errors are listed below. You can see the full description and an example error response in the Errors section.
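Of the two error codes mentioned on this page, only ERR::EXTRACTION::OUT_OF_CAPACITY describes a transient condition, so a client may reasonably retry it with a delay, while ERR::EXTRACTION::DATA_ERROR calls for a different prompt or document. The sketch below illustrates that policy under stated assumptions: ExtractionError is a hypothetical exception type, call_with_retry is an illustrative helper, and the backoff values are arbitrary rather than recommended by Scrapfly.

```python
import time

OUT_OF_CAPACITY = "ERR::EXTRACTION::OUT_OF_CAPACITY"


class ExtractionError(Exception):
    """Hypothetical exception carrying the error code returned by the API."""

    def __init__(self, code, message=""):
        super().__init__(f"{code}: {message}")
        self.code = code


def call_with_retry(extract, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call extract(); retry with exponential backoff only on OUT_OF_CAPACITY."""
    for attempt in range(attempts):
        try:
            return extract()
        except ExtractionError as exc:
            last_attempt = attempt == attempts - 1
            if exc.code != OUT_OF_CAPACITY or last_attempt:
                raise  # not retryable, or no attempts left
            sleep(base_delay * 2 ** attempt)


# Demo with a fake extractor that fails twice, then succeeds.
calls = []


def flaky_extract():
    calls.append(1)
    if len(calls) < 3:
        raise ExtractionError(OUT_OF_CAPACITY, "try again later")
    return {"status": "ok"}


delays = []  # record the delays instead of actually sleeping
result = call_with_retry(flaky_extract, sleep=delays.append)
print(result, delays)
# prints: {'status': 'ok'} [1.0, 2.0]
```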

Pricing

LLM extraction is billed at 5 API Credits.

For more information about pricing, see the dedicated section.
