Structured Data Extraction with AI-Powered Automatic Parser
Scrapfly's AI-powered automatic parser effortlessly converts unstructured HTML data into predefined,
structured models. Experience precise and efficient data extraction with our advanced technology.
It can ingest any text format, such as HTML, text, Markdown, or JSON.
Usage
- We will use the https://web-scraping.dev/product/1 page as an example and save its content as product.html in the current directory, where you will run the curl command below
- We will use the product model to extract product information
- Call the extraction API
If you have jq available on your machine, you can pretty-print the output JSON by appending it to the command, like -d @product.html | jq.
Command Explanation
- curl -X POST: curl is a command-line tool for transferring data with URLs. -X POST specifies the HTTP method to be used, which is POST in this case.
- -H "content-type: text/html": -H is used to specify an HTTP header for the request. "content-type: text/html" sets the Content-Type header to text/html, indicating that the data being sent is HTML.
- URL: The URL of the API endpoint being accessed, including query parameters for authentication and for specifying the target URL and the extraction model.
  - key: An API key for authentication.
  - url: The URL of the web page to be scraped, URL-encoded.
  - extraction_model: The AI model to use for extraction.
- -d @product.html: -d is used to specify the data to be sent in the POST request body. @product.html indicates that the data should be read from a file named product.html.
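The request described above can also be assembled in Ruby. The following is a minimal sketch, not the official client: the endpoint URL https://api.scrapfly.io/extraction is an assumption (this page does not spell it out), build_extraction_request is a hypothetical helper name, and the HTML string stands in for the content of product.html.

```ruby
require "uri"
require "net/http"

# Assumed endpoint: not stated on this page, replace it if yours differs.
EXTRACTION_ENDPOINT = "https://api.scrapfly.io/extraction"

# Hypothetical helper mirroring the curl flags explained above:
# the query string carries key, url and extraction_model, the header
# declares the body as HTML, and the body is the document itself.
def build_extraction_request(api_key:, page_url:, model:, html:)
  uri = URI(EXTRACTION_ENDPOINT)
  uri.query = URI.encode_www_form(key: api_key, url: page_url, extraction_model: model)
  request = Net::HTTP::Post.new(uri)       # same role as -X POST
  request["content-type"] = "text/html"    # same role as -H "content-type: text/html"
  request.body = html                      # same role as -d @product.html
  request
end

request = build_extraction_request(
  api_key: "__API_KEY__",
  page_url: "https://web-scraping.dev/product/1",
  model: "product",
  html: "<html><body>placeholder for product.html</body></html>"
)
puts request.path

# To actually send it (needs network access and a real API key):
# uri = request.uri
# response = Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) { |http| http.request(request) }
# puts response.body
```

URI.encode_www_form takes care of URL-encoding the target page URL, so it does not need to be encoded by hand as in the curl command.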
- Result:
Models
These models have been tailored based on customer feedback and usage. If you need a specific model that is general enough to be useful to other users, you can contact us through the support link below. If some fields are missing, you can also contact us to add them.
Contact us
For automatic structured data extraction, choose a model from the list below and the AI will try to fulfill it from the scraped web page you provide.
- Article
- Event
- Food Recipe
- Hotel
- Hotel Listing
- Job Listing
- Job Posting
- Organization
- Product
- Product Listing
- Real Estate Property
- Real Estate Property Listing
- Review List
- Search Engine Results
- Social Media Post
- Stock
- Vehicle Ad
- Vehicle Ad Listing
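The list above gives display names, while the request takes an extraction_model value. As a rough sketch, assuming the parameter value is simply the lowercase, underscore-separated form of the display name (this page only confirms product for Product, so verify the other values before relying on them), a hypothetical helper could look like this:

```ruby
# Assumption: extraction_model values are the display names lowercased,
# with spaces replaced by underscores. Only "product" is confirmed here.
def model_param(display_name)
  display_name.strip.downcase.gsub(/\s+/, "_")
end

puts model_param("Product")
puts model_param("Real Estate Property Listing")
```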
Web Scraping API
require "uri"
require "net/http"

# Scrape the product page and apply the "product" extraction model to the result
url = URI("https://api.scrapfly.io/scrape?tags=player%2Cproject%3Adefault&extraction_model=product&asp=true&render_js=true&auto_scroll=true&key=__API_KEY__&url=https%3A%2F%2Fweb-scraping.dev%2Fproduct%2F1")
https = Net::HTTP.new(url.host, url.port)
https.use_ssl = true
request = Net::HTTP::Get.new(url)
response = https.request(request)
# Print the raw response body returned by the API
puts response.read_body
API Response
You will retrieve the following information from the API response under result.extracted_data:
- result.extracted_data.content_type: Always JSON
- result.extracted_data.data: Structured extracted data
- result.extracted_data.data_quality
  - errors: The list of data violations that do not follow the validation schema
  - fulfilled: A boolean indicating whether the schema is fully satisfied
  - fulfillment_percent: The percentage of fulfillment, where 0 indicates empty and 100 indicates perfect
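To illustrate how these fields might be consumed, here is a small sketch that reads result.extracted_data from a response body and decides whether the extraction is good enough to use. The sample JSON is invented for illustration (only the field names come from the list above), and summarize_extraction and its min_percent threshold are hypothetical.

```ruby
require "json"

# Invented sample body: field names follow the list above, values are made up.
SAMPLE_RESPONSE = <<~JSON
  {
    "result": {
      "extracted_data": {
        "content_type": "application/json",
        "data": { "name": "Example product" },
        "data_quality": {
          "errors": ["price is missing"],
          "fulfilled": false,
          "fulfillment_percent": 50.0
        }
      }
    }
  }
JSON

# Hypothetical helper: accept the data when the schema is fully satisfied
# or when fulfillment_percent reaches a caller-chosen threshold.
def summarize_extraction(body, min_percent: 80)
  extracted = JSON.parse(body).dig("result", "extracted_data") || {}
  quality = extracted["data_quality"] || {}
  {
    data: extracted["data"],
    errors: quality.fetch("errors", []),
    acceptable: quality["fulfilled"] == true ||
                quality["fulfillment_percent"].to_f >= min_percent
  }
end

summary = summarize_extraction(SAMPLE_RESPONSE)
puts summary[:acceptable]
puts summary[:errors].inspect
```

The threshold is a design choice for the caller: a listing page with optional fields may be usable well below 100, while a strict pipeline may require fulfilled to be true.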
When combined with the cache feature, the raw data from the website is cached, allowing you to re-extract the data with multiple extraction passes much faster and at a lower cost. This applies to the following extraction types:
Learn more about cache feature
Error Handling
All related errors are listed below. You can find the full description and an example error response in the Errors section.
- ERR::EXTRACTION::DATA_ERROR - Extracted data is invalid or has an issue
- ERR::EXTRACTION::TIMEOUT - Data extraction timed out
- ERR::EXTRACTION::INVALID_RULE - The extraction rule is invalid
- ERR::EXTRACTION::INVALID_TEMPLATE - The template used for extraction is invalid
- ERR::EXTRACTION::NO_CONTENT - Target response is empty
- ERR::EXTRACTION::OUT_OF_CAPACITY - Unable to extract more data because the backend is out of capacity; retry later
- ERR::EXTRACTION::TEMPLATE_NOT_FOUND - The provided template does not exist
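One way to act on these codes is to separate the ones worth retrying from the ones that need a fix on the caller's side. The sketch below is only an illustration: extraction_error_action is a hypothetical helper, and treating every code other than OUT_OF_CAPACITY as non-retryable is an assumption, since that is the only code the list above describes as "retry later".

```ruby
# Only OUT_OF_CAPACITY is documented above as "retry later".
# Treating all other extraction codes as non-retryable is an assumption.
RETRYABLE_CODES = ["ERR::EXTRACTION::OUT_OF_CAPACITY"].freeze

# Hypothetical helper: returns :retry, :give_up, or :unknown for an error code.
def extraction_error_action(code)
  return :unknown unless code.to_s.start_with?("ERR::EXTRACTION::")
  RETRYABLE_CODES.include?(code) ? :retry : :give_up
end

puts extraction_error_action("ERR::EXTRACTION::OUT_OF_CAPACITY")
puts extraction_error_action("ERR::EXTRACTION::NO_CONTENT")
```

How the error code is exposed in the response is described in the Errors section; this helper only takes the code string once you have it.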
Pricing
During the beta period, using an extraction model is billed at 5 API Credits. The final pricing will be announced when the beta period ends.
For more information about pricing, see the dedicated section.