Scrapfly Product Release Notes
2024-09-13
Python SDK
Python SDK 0.8.18 released. This version introduces support for:
- Extraction API
- Screenshot API
- Web Scraping API transparent large object handling
You can now install the new version with pip install scrapfly-sdk==0.8.18
or upgrade with pip install --upgrade scrapfly-sdk
Web Scraping API
The Web Scraping API HTML to Markdown conversion has been improved: it is more accurate and uses fallbacks to preserve as much information as possible.
The conversion now includes:
- When the alt attribute is missing for images or links, we look for title, then aria-label. If nothing is found, the "Link" text is used.
- When a link's anchor is empty or contains only non-printable content (like svg, images, etc.), we fall back to the alt attribute, then title, then aria-label. If nothing is found, "Link" is used.
- Improved rendering of inline and multiline code blocks
- Some indentation issues have been fixed
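The fallback order described above can be sketched as a small helper. This is only an illustration of the stated order (alt, then title, then aria-label, then "Link"), not Scrapfly's actual converter; the function name `resolve_label` and the way attributes are passed in are assumptions made for the example.

```python
from typing import Mapping

# Fallback order taken from the release note: alt, then title, then aria-label.
FALLBACK_ATTRIBUTES = ("alt", "title", "aria-label")
DEFAULT_LABEL = "Link"


def resolve_label(anchor_text: str, attributes: Mapping[str, str]) -> str:
    """Pick printable text for a link or image (hypothetical helper)."""
    text = anchor_text.strip()
    # Keep the anchor text when it holds at least one printable character.
    if text and any(ch.isprintable() for ch in text):
        return text
    # Otherwise walk the attributes in the documented order.
    for name in FALLBACK_ATTRIBUTES:
        value = attributes.get(name, "").strip()
        if value:
            return value
    # Nothing usable was found, so use the generic label.
    return DEFAULT_LABEL
```

For instance, a link with an empty anchor and only an aria-label attribute would be rendered with that aria-label, while a link with no usable attribute at all would be rendered as "Link".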
When browser rendering is enabled, the Web Scraping API wait_for_selector parameter can now wait for an intermediate request based on a URL pattern.
To use an XHR pattern, simply prefix your selector with xhr: like xhr:https://web-scraping.dev/api/. We also support the wildcard * notation, like xhr:https://web-scraping.dev/product/*/results, to match every request whose URL fits the pattern.
For more information about the XHR selector, please refer to the Web Scraping API documentation
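As a rough sketch, the XHR selector could be built and URL-encoded like this. The `xhr:` prefix and `wait_for_selector` come from the note above; the `key`, `url` and `render_js` query parameter names are assumptions made for illustration and should be checked against the Web Scraping API documentation.

```python
from urllib.parse import urlencode

XHR_PREFIX = "xhr:"


def xhr_selector(url_pattern: str) -> str:
    """Prefix a URL pattern so wait_for_selector targets a request, not a DOM node."""
    if url_pattern.startswith(XHR_PREFIX):
        return url_pattern
    return XHR_PREFIX + url_pattern


def build_scrape_query(api_key: str, url: str, url_pattern: str) -> str:
    """Build an encoded query string (parameter names other than
    wait_for_selector are assumptions)."""
    params = {
        "key": api_key,          # assumption: API key passed as a query parameter
        "url": url,              # assumption: page to scrape
        "render_js": "true",     # assumption: flag that enables browser rendering
        "wait_for_selector": xhr_selector(url_pattern),
    }
    return urlencode(params)
```

Encoding the selector matters here because the pattern itself contains `:`, `/` and `*`, which would otherwise be ambiguous inside a query string.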
Extraction API
Extraction now supports compressed documents: simply send Content-Encoding: gzip in the request headers to indicate that the document is gzipped. We support gzip, deflate and zstd compression.
For more information about the compression support, please refer to the Extraction API documentation
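A minimal sketch of preparing a gzipped document for the Extraction API, using only the Python standard library. The Content-Encoding header and the extraction_model parameter come from these notes; the endpoint URL, the `key` query parameter and the Content-Type header are assumptions for illustration. The function only builds the request and does not send it.

```python
import gzip
import urllib.request
from urllib.parse import urlencode

# Assumption: endpoint URL used for illustration only.
EXTRACTION_ENDPOINT = "https://api.scrapfly.io/extraction"


def build_extraction_request(
    api_key: str, html: str, extraction_model: str
) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying a gzipped document."""
    body = gzip.compress(html.encode("utf-8"))
    query = urlencode({
        "key": api_key,                        # assumption: key as query parameter
        "extraction_model": extraction_model,  # parameter named in these notes
    })
    return urllib.request.Request(
        url=f"{EXTRACTION_ENDPOINT}?{query}",
        data=body,
        method="POST",
        headers={
            "Content-Encoding": "gzip",  # announces that the body is gzipped
            "Content-Type": "text/html",  # assumption: type of the document
        },
    )
```

The request could then be sent with `urllib.request.urlopen(request)`. For deflate or zstd, the body and the Content-Encoding value would change accordingly.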
2024-08-02
Extraction API
We have added new AI-backed models for auto-extraction. The following models are now available:
job_listing
job_posting
vehicle_ad_listing
vehicle_ad
event
food_recipe
hotel
hotel_listing
organization
stock
You can now use those models in the extraction_model parameter of the Extraction API. To learn more about those models, refer to the Extraction API documentation
2024-07-10
Web Scraping API
Starting from 2024-09-01, content larger than 5 MB is handled as a BLOB for binary content and a CLOB for text content. The response.result.content field will then contain a URL to download the content instead of the content itself. The format, blob or clob, will be specified in the response.result.format field.
The SDK will be updated to handle it transparently. For more information, refer to the Web Scraping API documentation
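For clients that do not use the SDK, the handling could look roughly like the sketch below. The `format` and `content` fields come from the note above; the `resolve_content` helper, the injected `download` callable and the UTF-8 decoding of CLOB payloads are assumptions made for illustration (the note does not say how the download URL is authenticated or which text encoding is used).

```python
from typing import Any, Callable, Mapping, Union

LARGE_OBJECT_FORMATS = {"blob", "clob"}


def resolve_content(
    result: Mapping[str, Any],
    download: Callable[[str], bytes],
) -> Union[str, bytes]:
    """Return the real content of a scrape result (hypothetical helper).

    `result` is the response.result object; `download` is any function that
    fetches a URL and returns its raw bytes.
    """
    fmt = result.get("format")
    content = result["content"]
    if fmt not in LARGE_OBJECT_FORMATS:
        # Regular response: the content is inlined as before.
        return content
    # Large object: `content` is a URL pointing at the real payload.
    payload = download(content)
    if fmt == "clob":
        # Assumption: text payloads are UTF-8 encoded.
        return payload.decode("utf-8")
    return payload
```

Passing the downloader in as a parameter keeps the branching logic independent of any particular HTTP client.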
Extraction API
Extraction now supports compressed documents: simply send Content-Encoding: gzip in the request headers to indicate that the document is gzipped. We support gzip, deflate and zstd compression.
For more information about the compression support, please refer to the Extraction API documentation
2024-07-05
Webhook
We are rolling out our new webhook system to fix issues with the previous one. The new system is faster and more reliable, and it unlocks new features for upcoming APIs. The internal scheduler now includes a capacity planner to better enforce concurrency constraints and handle the load.
Web Scraping API
An issue where webhook notifications were replayed indefinitely has been fixed. This issue did not impact billing (only the notification was retried), but it generated a lot of noise in the monitoring dashboards by creating logs and skewing the metrics.
Extraction API
Webhooks are now available for the Extraction API. You can now receive a notification containing the result when an extraction completes.
For more information, refer to the dedicated Extraction API Webhook documentation
Screenshot API
Webhooks are now available for the Screenshot API. You can now receive a notification when the screenshot is complete, with the image included directly in the body.
For more information, refer to the dedicated Screenshot API Webhook documentation
2024-06-10
Web Scraping API
The Web Scraping API now exposes the debug replay URL. When you use the debug parameter, the response contains a content_replay_url in context.debug to replay a scrape against the exact same content.
This URL must be authenticated with the same API key used to perform the scrape.
context: {
  debug: {
    screenshot_url: "https://api.scrapfly.io/11cd6abe-5061-4dce-8d37-5d50e667a071/scrape/screenshot/ee8484c6-ee5f-4775-a665-0a2b57631c1c/debug",
    response_url: "https://api.scrapfly.io/scrape/debug/ee8484c6-ee5f-4775-a665-0a2b57631c1c",
    content_replay_url: "https://api.scrapfly.io/scrape/debug/ee8484c6-ee5f-4775-a665-0a2b57631c1c/replay",
  }
}
For more information, refer to the Web Scraping API documentation
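A small sketch of preparing the replay URL for use. It assumes the API key is supplied as a `key` query parameter, which these notes do not state explicitly; the helper names are hypothetical.

```python
from typing import Any, Mapping
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def authenticated_url(url: str, api_key: str) -> str:
    """Append the API key to a URL (assumption: passed as `key` in the query)."""
    parts = urlsplit(url)
    # Drop any existing key so the parameter is not duplicated.
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != "key"]
    query.append(("key", api_key))
    return urlunsplit(parts._replace(query=urlencode(query)))


def replay_url(context: Mapping[str, Any], api_key: str) -> str:
    """Read content_replay_url from context.debug and authenticate it."""
    return authenticated_url(context["debug"]["content_replay_url"], api_key)
```

The same API key that performed the original scrape has to be used, as stated above.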
2024-06-04
Screenshot API
Screenshot API released. This API allows you to take screenshots of web pages. It is much simpler than the Web Scraping API, and everything is preconfigured (image loading, high quality, rendering wait).
The Screenshot API provides some unique features:
- Multiple image formats (jpg, png, webp, gif)
- Multiple capture modes (custom viewport, fullpage, vertical, elements)
- Custom resolution
- Caching
- Page options (Dark mode, block banners, block ads, print format)
You can now discover the Screenshot API in the API documentation.
Extraction API
Extraction API released in BETA. This API allows you to extract structured data from web pages. It comes with 3 modes of extraction:
- Custom rules with extraction template: define your own extraction rules, formatters, and filters
- LLM Prompt Extraction: Extract data or ask questions about the document using our pre-trained LLM model dedicated to web scraping
- Automatic extraction: Choose an extraction model based on the type of page (product, job, article, etc.) and retrieve the structured data along with metadata to evaluate the quality of the extraction
You can now discover the Extraction API in the API documentation.
Web Scraping API
The Web Scraping API now integrates data extraction from the scraped pages. Refer to the documentation for these new parameters:
- extraction_template: Use your own extraction rules
- extraction_prompt: Use an LLM prompt to retrieve data
- extraction_model: Use automatic extraction mode
Fixed an issue where the Web Scraping API screenshot returned the image with an invalid IANA content type, image/jpg, instead of image/jpeg.
The proxified_response parameter, when used with extraction_template, extraction_prompt or extraction_model, now returns the content-type of the extracted data instead of the original response content-type.
More information about the proxified_response parameter is available in the Web Scraping API documentation
The format parameter now accepts options to configure the output format of the scraped page.
The Markdown format now allows you to:
- Disable images with no_images and use the alt text instead
- Disable links with no_links and use the anchor instead
Options use the notation {format}:{option1},{optionN}, for example markdown:no_links,no_images
To learn more about those formats, refer to the Web Scraping API documentation
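The {format}:{option1},{optionN} notation is simple enough to build by hand, but a tiny helper avoids stray separators. The function name below is hypothetical; only the notation itself comes from these notes.

```python
from typing import Iterable


def format_with_options(fmt: str, options: Iterable[str] = ()) -> str:
    """Build a format value such as "markdown:no_links,no_images"."""
    # Ignore empty or whitespace-only options.
    cleaned = [opt.strip() for opt in options if opt.strip()]
    if not cleaned:
        # No options: the plain format name is used on its own.
        return fmt
    return f"{fmt}:{','.join(cleaned)}"
```

For example, `format_with_options("markdown", ["no_links", "no_images"])` produces the value shown in the note above.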
2024-04-24
Python SDK
Python SDK 0.8.17 released. This version introduces support for:
- Web Scraping API format parameter
- Web Scraping API screenshot_flags parameter
You can now install the new version with pip install scrapfly-sdk==0.8.17
or upgrade with pip install --upgrade scrapfly-sdk
Javascript SDK
Javascript SDK 0.5.0 released. This version introduces support for:
- Web Scraping API format parameter
- Web Scraping API screenshot_flags parameter
You can now install the new version with npm install scrapfly-sdk@0.5.0
or upgrade with npm install scrapfly-sdk@latest
2024-04-22
Python SDK
Scrapfly now has an official integration with LlamaIndex to help you extract data.
Scrapfly now has an official integration with LangChain to help you extract data.
Web Scraping API
Introduced a new format parameter in the Web Scraping API that converts the scraped page to a specific format.
With the rise of LLM usage, you can now convert pages into LLM-friendly formats and more.
You can now convert the scraped page to:
- markdown
- text
- json (auto parse)
- clean_html
If you are using proxified_response to directly retrieve the content, the announced content-type will follow the format you choose.
To learn more about those formats, refer to the Web Scraping API documentation
You can now pass flags to configure screenshot options directly from the Web Scraping API.
Available flags:
load_images
dark_mode
block_banners
high_quality
print_media_format
To learn more about those flags, refer to the Web Scraping API documentation
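A hedged sketch of validating these values client-side before sending a request. The format names and flag names are the ones listed in this entry; the helper name and the comma-separated encoding of screenshot_flags are assumptions, so check the Web Scraping API documentation for the exact syntax.

```python
from typing import Dict, Iterable

# Values listed in this release entry.
KNOWN_FORMATS = {"markdown", "text", "json", "clean_html"}
KNOWN_SCREENSHOT_FLAGS = {
    "load_images",
    "dark_mode",
    "block_banners",
    "high_quality",
    "print_media_format",
}


def scrape_options(fmt: str, screenshot_flags: Iterable[str] = ()) -> Dict[str, str]:
    """Validate and assemble the format and screenshot_flags parameters."""
    if fmt not in KNOWN_FORMATS:
        raise ValueError(f"unknown format: {fmt!r}")
    flags = list(screenshot_flags)
    unknown = [flag for flag in flags if flag not in KNOWN_SCREENSHOT_FLAGS]
    if unknown:
        raise ValueError(f"unknown screenshot flags: {unknown}")
    options = {"format": fmt}
    if flags:
        # Assumption: multiple flags are joined with commas.
        options["screenshot_flags"] = ",".join(flags)
    return options
```

Failing early on a misspelled flag is cheaper than discovering it from an API error after the request has been sent.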