AI Powered Data Extraction

Scrapfly's AI-powered automatic parser converts unstructured HTML data into predefined, structured models. Experience precise and efficient data extraction with our advanced technology. It can ingest any text format, such as HTML, plain text, Markdown, or JSON.

Usage

  1. We will use the https://web-scraping.dev/product/1 page as an example. Save its content as product.html in the directory where you will run the curl command below.
  2. We will use the product model to extract product information
  3. Call the extraction API
    If you have jq available on your machine, you can pretty-print the output JSON by piping the command into it, for example -d @product.html | jq.
    Command Explanation
    • curl -X POST:
      • curl is a command-line tool for transferring data with URLs.
      • -X POST specifies the HTTP method to be used, which is POST in this case.
    • -H "content-type: text/html":
      • -H is used to specify an HTTP header for the request.
      • "content-type: text/html" sets the Content-Type header to text/html, indicating that the data being sent is HTML.
    • URL:
      • The URL of the API endpoint being accessed, including query parameters for authentication, the target URL, and the extraction model.
      • key: An API key for authentication.
      • url: The URL of the web page to be scraped, URL-encoded.
      • extraction_model: The AI model to use for extraction.
    • -d @product.html:
      • -d is used to specify the data to be sent in the POST request body.
      • @product.html indicates that the data should be read from a file named product.html.
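
The same request can be sketched in Python. This is a minimal sketch using only the standard library: the endpoint https://api.scrapfly.io/extraction is an assumption (the curl command itself is not reproduced on this page), and build_extraction_request is a hypothetical helper name. It builds the request without sending it, so each piece described above can be inspected.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: endpoint of the extraction API; verify it against the official docs.
EXTRACTION_ENDPOINT = "https://api.scrapfly.io/extraction"


def build_extraction_request(api_key, page_url, html, model="product"):
    """Build (but do not send) the POST request described above."""
    # Query parameters: key, url (escaped by urlencode) and extraction_model.
    query = urlencode({"key": api_key, "url": page_url, "extraction_model": model})
    return Request(
        f"{EXTRACTION_ENDPOINT}?{query}",
        data=html.encode("utf-8"),              # counterpart of -d @product.html
        headers={"Content-Type": "text/html"},  # counterpart of -H "content-type: text/html"
        method="POST",                          # counterpart of -X POST
    )


req = build_extraction_request(
    "__API_KEY__", "https://web-scraping.dev/product/1", "<html>...</html>"
)
print(req.get_method(), req.full_url)
```

To actually send it, read product.html from disk, pass its content as html, and hand the returned object to urllib.request.urlopen.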

Result:

Models

These models have been tailored based on customer feedback and usage. If you need a specific model that is general enough to be useful to other users, you can contact us through the support link below. If some fields are missing, you can also contact us to add them.

Contact us

For automatic structured data extraction, choose one of the models below and the AI will try to fulfill it from the scraped web page you provide.

Web Scraping API

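import requests

# Replace __API_KEY__ with your API key; the url parameter is already URL-encoded.
url = "https://api.scrapfly.io/scrape?tags=player%2Cproject%3Adefault&extraction_model=product&asp=true&render_js=true&auto_scroll=true&key=__API_KEY__&url=https%3A%2F%2Fweb-scraping.dev%2Fproduct%2F1"
response = requests.get(url)
response.raise_for_status()  # stop on HTTP errors before parsing the body
data = response.json()
# The scrape result, including the extracted data, is under the 'result' key.
print(data['result'])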
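
Writing the query string by hand makes it easy to escape a value twice, which turns %2F into %252F and breaks the target URL. A minimal sketch of building the same scrape URL from raw values follows; build_scrape_url is a hypothetical helper name, and the parameter values are copied from the example above.

```python
from urllib.parse import urlencode

SCRAPE_ENDPOINT = "https://api.scrapfly.io/scrape"  # taken from the example above


def build_scrape_url(api_key, page_url, model="product"):
    """Return the scrape URL with every parameter encoded exactly once."""
    params = {
        "tags": "player,project:default",
        "extraction_model": model,
        "asp": "true",
        "render_js": "true",
        "auto_scroll": "true",
        "key": api_key,
        "url": page_url,  # pass the raw URL; urlencode handles the escaping
    }
    return f"{SCRAPE_ENDPOINT}?{urlencode(params)}"


scrape_url = build_scrape_url("__API_KEY__", "https://web-scraping.dev/product/1")
print(scrape_url)
```

Passing the raw page URL and letting urlencode escape it keeps the encoding in one place.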

API Response

The extraction output is available in the API response under result.extracted_data:

  • result.extracted_data.content_type: Always JSON
  • result.extracted_data.data: The structured extracted data
  • result.extracted_data.data_quality
    • errors: The list of data violations that do not follow the validation schema.
    • fulfilled: A boolean indicating whether the schema is fully satisfied.
    • fulfillment_percent: The percentage of fulfillment, where 0 indicates empty and 100 indicates perfect.
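
The fields listed above can be read defensively, for instance to flag partial extractions before using the data. The following is a minimal sketch: read_extraction is a hypothetical helper, and the sample response is invented for illustration and only mirrors the field names documented here.

```python
def read_extraction(api_response):
    """Return (data, warnings) from a scrape API response dict."""
    extracted = api_response["result"]["extracted_data"]
    quality = extracted.get("data_quality") or {}
    warnings = []
    if not quality.get("fulfilled", False):
        # Report how far the schema was filled, then each validation error.
        percent = quality.get("fulfillment_percent", 0)
        warnings.append(f"schema only {percent}% fulfilled")
        warnings.extend(str(e) for e in quality.get("errors") or [])
    return extracted["data"], warnings


# Hypothetical response shaped after the documented fields (values invented).
sample = {
    "result": {
        "extracted_data": {
            "content_type": "application/json",
            "data": {"name": "Example product"},
            "data_quality": {
                "errors": ["price: field required"],
                "fulfilled": False,
                "fulfillment_percent": 50,
            },
        }
    }
}

data, warnings = read_extraction(sample)
print(data)
print(warnings)
```

Returning warnings instead of raising lets the caller decide whether a partially fulfilled model is still usable.
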
Combined with the cache feature, the raw data from the website is cached, allowing you to re-extract it with multiple extraction passes at a much faster speed and lower cost. This applies to the following extraction types:
Learn more about the cache feature

All related errors are listed below. You can find the full description and an example error response in the Errors section.

Pricing

During the beta period, using an extraction model is billed 5 API Credits. The final pricing will be announced when the beta period ends.

For more information about pricing, see the dedicated section.

Summary