Scrapfly Product Release Notes

2024-09-13

Python SDK

RELEASE

Python SDK 0.8.18 released. This version introduces support for:

You can now install the new version with pip install scrapfly-sdk==0.8.18 or upgrade with pip install --upgrade scrapfly-sdk

See the Python SDK documentation

PyPi package

Github Repository

Web Scraping API

IMPROVEMENT

The Web Scraping API HTML-to-Markdown conversion has been improved to be more accurate and to use fallbacks so that the output keeps as much information as possible.

The conversion now includes:

  • When the alt attribute is missing on an image or link, we look for title, then aria-label. If nothing is found, the text "Link" is used.
  • When a link's anchor is empty or contains only non-printable content (such as SVG or images), we fall back to the alt attribute, then title, then aria-label. If nothing is found, "Link" is used.
  • Improved rendering of inline and multiline code blocks
  • Fixed some indentation issues
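
The fallback order described above can be sketched as follows. This is an illustration only, not Scrapfly's actual implementation; the function name link_label and its parameters are hypothetical.

```python
from typing import Optional


def link_label(anchor_text: Optional[str] = None,
               alt: Optional[str] = None,
               title: Optional[str] = None,
               aria_label: Optional[str] = None) -> str:
    """Return the first non-blank candidate, else the literal "Link"."""
    # Order follows the release note: anchor text, alt, title, aria-label.
    for candidate in (anchor_text, alt, title, aria_label):
        # None, empty, and whitespace-only values count as missing.
        if candidate and candidate.strip():
            return candidate.strip()
    return "Link"


print(link_label(anchor_text="  ", title="Product page"))
print(link_label())
```

The first call skips the blank anchor text and the missing alt attribute and uses the title; the second has nothing to fall back on and returns "Link".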

FEATURE

Web Scraping API: when browser rendering is enabled, wait_for_selector can now wait for an intermediate request matching a URL pattern.

To use an XHR pattern, prefix your selector with xhr:, for example xhr:https://web-scraping.dev/api/. We also support the wildcard * notation, for example xhr:https://web-scraping.dev/product/*/results, to match every request whose URL fits the pattern.
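
A minimal sketch of building such a request URL is shown here. The endpoint, the key, url, and render_js parameter names, and the target URL are assumptions used for illustration; only wait_for_selector and the xhr: prefix come from the note above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed endpoint; check it against the Web Scraping API documentation.
SCRAPE_ENDPOINT = "https://api.scrapfly.io/scrape"


def build_scrape_url(api_key: str, target_url: str, xhr_pattern: str) -> str:
    """Build a scrape URL that waits for a request matching xhr_pattern."""
    params = {
        "key": api_key,            # assumed parameter name
        "url": target_url,         # assumed parameter name
        "render_js": "true",       # assumed name for browser rendering
        "wait_for_selector": f"xhr:{xhr_pattern}",
    }
    # urlencode percent-encodes the pattern, including ":" "/" and "*".
    return f"{SCRAPE_ENDPOINT}?{urlencode(params)}"


url = build_scrape_url(
    "YOUR_API_KEY",
    "https://web-scraping.dev/",  # placeholder target page
    "https://web-scraping.dev/product/*/results",
)
# Decode the query string again to confirm the selector survives encoding.
query = parse_qs(urlsplit(url).query)
print(query["wait_for_selector"][0])
```

Encoding the whole value matters because the pattern is itself a URL; leaving it raw would let its own separators be misread as part of the outer query string.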

For more information about the XHR selector, please refer to the Web Scraping API documentation

Extraction API

FEATURE

The Extraction API now supports compressed documents. Set Content-Encoding: gzip in the request headers to indicate that the document is gzipped. We support gzip, deflate, and zstd compression.
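
As a rough sketch, a gzipped document could be prepared like this. The endpoint URL, the key query parameter, and the Content-Type value are assumptions, and other parameters the Extraction API may require are left out; the request is built but not sent.

```python
import gzip
import urllib.request

# Assumed endpoint; check it against the Extraction API documentation.
EXTRACTION_ENDPOINT = "https://api.scrapfly.io/extraction"


def build_gzip_request(html: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying a gzipped document."""
    body = gzip.compress(html.encode("utf-8"))
    return urllib.request.Request(
        url=f"{EXTRACTION_ENDPOINT}?key={api_key}",  # assumed auth parameter
        data=body,
        method="POST",
        headers={
            "Content-Type": "text/html",   # assumed document type
            "Content-Encoding": "gzip",    # tells the API the body is gzipped
        },
    )


request = build_gzip_request("<html><body>Hello</body></html>", "YOUR_API_KEY")
# urllib stores header names capitalized as "Content-encoding".
print(request.get_header("Content-encoding"))
print(len(request.data), "compressed bytes")
```

For deflate, the standard library's zlib module would play the same role; zstd needs a third-party package or a recent Python version.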

For more information about the compression support, please refer to the Extraction API documentation

2024-08-02

Extraction API

ANNOUNCEMENT

We have added new AI-backed models for auto-extraction. The following models are now available:

  • job_listing
  • job_posting
  • vehicle_ad_listing
  • vehicle_ad
  • event
  • food_recipe
  • hotel
  • hotel_listing
  • organization
  • stock

You can now use these models in the extraction_model parameter of the Extraction API. To learn more about these models, refer to the Extraction API documentation.
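
A small sketch of passing one of these names as extraction_model follows. The key parameter name and the helper extraction_query are hypothetical, and the set only covers the models announced in this entry, so the check is illustrative rather than a complete validator.

```python
from urllib.parse import urlencode

# Model names announced in this release note; other models may exist.
NEW_EXTRACTION_MODELS = frozenset({
    "job_listing", "job_posting", "vehicle_ad_listing", "vehicle_ad",
    "event", "food_recipe", "hotel", "hotel_listing", "organization", "stock",
})


def extraction_query(api_key: str, model: str) -> str:
    """Return a query string selecting one of the newly announced models."""
    if model not in NEW_EXTRACTION_MODELS:
        raise ValueError(f"not one of the newly announced models: {model!r}")
    return urlencode({"key": api_key, "extraction_model": model})


print(extraction_query("YOUR_API_KEY", "job_posting"))
```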

2024-07-10

Web Scraping API

BREAKING

Starting from 2024-09-01, content larger than 5 MB is handled as a BLOB (binary content) or a CLOB (text content). The response.result.content field will then contain a URL to download the content instead of the content itself, and the response.result.format field will be set to blob or clob.

The SDKs will be updated to handle this transparently. For more information, refer to the Web Scraping API documentation.
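
For callers that parse the response themselves, the change can be handled along these lines. This is a sketch under assumptions: the helper name resolve_content is hypothetical, the download callable (including any authentication it needs) is supplied by the caller, CLOB text is assumed to be UTF-8, and the "text" format value in the last call is a placeholder.

```python
from typing import Callable, Union

LARGE_OBJECT_FORMATS = {"clob", "blob"}


def resolve_content(result: dict,
                    download: Callable[[str], bytes]) -> Union[str, bytes]:
    """Return the real content for a response.result mapping."""
    content = result["content"]
    fmt = result.get("format")
    if fmt not in LARGE_OBJECT_FORMATS:
        return content            # small payload: content is inline
    raw = download(content)       # large payload: content is a download URL
    # Assumption: CLOB payloads are UTF-8 text; BLOB stays as raw bytes.
    return raw.decode("utf-8") if fmt == "clob" else raw


# Stand-in for an HTTP download, so the example runs offline.
fake_store = {"https://example.com/large-object": b"<html>big page</html>"}
fetch = fake_store.__getitem__

print(resolve_content({"format": "clob", "content": "https://example.com/large-object"}, fetch))
print(resolve_content({"format": "text", "content": "<html>small</html>"}, fetch))
```

Passing the downloader in as a parameter keeps the branching logic testable without network access.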

Extraction API

FEATURE

The Extraction API now supports compressed documents. Set Content-Encoding: gzip in the request headers to indicate that the document is gzipped. We support gzip, deflate, and zstd compression.

For more information about the compression support, please refer to the Extraction API documentation

2024-07-05

Webhook

ANNOUNCEMENT

We are rolling out our new webhook system to fix issues with the previous one. The new system is faster and more reliable, and it unlocks new features for upcoming APIs. The internal scheduler has been improved with a capacity planner to better enforce concurrency constraints and handle the load.

Web Scraping API

FIXED

Fixed an issue where webhook notifications were replayed indefinitely. This issue did not impact billing (only the notification was retried), but it generated a lot of noise in the monitoring dashboards by creating logs and distorting the metrics.

Extraction API

FEATURE

Webhooks are now available for the Extraction API. You can now receive a notification containing the result when an extraction is completed.

For more information, refer to the dedicated Extraction API Webhook documentation

Screenshot API

FEATURE

Webhooks are now available for the Screenshot API. You can now receive a notification when the screenshot is complete, with the image delivered directly in the body.

For more information, refer to the dedicated Screenshot API Webhook documentation

2024-06-10

Web Scraping API

CHANGED

The Web Scraping API now exposes a debug replay URL. When you use the debug parameter, the response contains a content_replay_url in context.debug that lets you replay a scrape against the exact same content.

Requests to this URL must be authenticated with the same API key used to perform the scrape.

    {
      "context": {
        "debug": {
          "screenshot_url": "https://api.scrapfly.io/11cd6abe-5061-4dce-8d37-5d50e667a071/scrape/screenshot/ee8484c6-ee5f-4775-a665-0a2b57631c1c/debug",
          "response_url": "https://api.scrapfly.io/scrape/debug/ee8484c6-ee5f-4775-a665-0a2b57631c1c",
          "content_replay_url": "https://api.scrapfly.io/scrape/debug/ee8484c6-ee5f-4775-a665-0a2b57631c1c/replay"
        }
      }
    }

For more information, refer to the Web Scraping API documentation
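
One way to attach the API key to the replay URL is sketched here. It assumes the key is passed as a key query parameter, which this note does not state, so check the documentation for the actual authentication mechanism; the helper name with_api_key is hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_api_key(url: str, api_key: str) -> str:
    """Return url with a key=<api_key> query parameter (assumed auth scheme)."""
    parts = urlsplit(url)
    # Keep existing parameters, dropping any previous key to avoid duplicates.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != "key"]
    query.append(("key", api_key))
    return urlunsplit(parts._replace(query=urlencode(query)))


debug = {
    "content_replay_url": "https://api.scrapfly.io/scrape/debug/ee8484c6-ee5f-4775-a665-0a2b57631c1c/replay",
}
print(with_api_key(debug["content_replay_url"], "YOUR_API_KEY"))
```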

2024-06-04

Screenshot API

ANNOUNCEMENT

Screenshot API released. This API lets you take screenshots of web pages. It is much simpler than the Web Scraping API, and everything is preconfigured (image loading, high quality, rendering wait).

The Screenshot API provides some unique features:

  • Multiple image formats (jpg, png, webp, gif)
  • Multiple capture modes (custom viewport, fullpage, vertical, elements)
  • Custom resolution
  • Caching
  • Page options (Dark mode, block banners, block ads, print format)

You can now discover the Screenshot API in the API documentation.

Extraction API

ANNOUNCEMENT

Extraction API released in BETA. This API allows you to extract structured data from web pages. It comes with 3 modes of extraction:

  • Custom rules with extraction template: define your own extraction rules, formatters, and filters
  • LLM Prompt Extraction: extract data or ask questions about the document using our pre-trained LLM dedicated to web scraping
  • Automatic extraction: choose an extraction model based on the type of page (product, job, article, etc.) and retrieve the structured data, along with metadata to evaluate the quality of the extraction

You can now discover the Extraction API in the API documentation.

Web Scraping API

FEATURE

The Web Scraping API now integrates data extraction from scraped pages. Refer to the documentation of these new parameters:

FIXED

Fixed an issue where the Web Scraping API screenshot returned the image with an invalid IANA content type, image/jpg, instead of image/jpeg.

CHANGED

The proxified_response parameter, when used with extraction_template, extraction_prompt, or extraction_model, now returns the content-type of the extracted data instead of the original response content-type.

More information about the proxified_response parameter in the Web Scraping API documentation

CHANGED

The format parameter now accepts options to configure the output format of the scraped page.

The markdown format now allows you to:

  • Disable images (no_images) and use the alt text instead
  • Disable links (no_links) and use the anchor text instead

Use the following notation: {format}:{option1},{optionN}, for example markdown:no_links,no_images
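
The notation can be assembled with a small helper such as the one below. The function name format_spec is hypothetical; only the {format}:{option1},{optionN} shape comes from the note above.

```python
from typing import Iterable


def format_spec(fmt: str, options: Iterable[str] = ()) -> str:
    """Build a format value such as "markdown:no_links,no_images"."""
    opts = list(options)
    # Without options, the bare format name is used.
    return f"{fmt}:{','.join(opts)}" if opts else fmt


print(format_spec("markdown", ["no_links", "no_images"]))  # markdown:no_links,no_images
print(format_spec("text"))                                 # text
```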

To learn more about these formats, refer to the Web Scraping API documentation

2024-04-24

Python SDK

RELEASE

Python SDK 0.8.17 released. This version introduces support for:

  • Web Scraping API format parameter
  • Web Scraping API screenshot_flags parameter

You can now install the new version with pip install scrapfly-sdk==0.8.17 or upgrade with pip install --upgrade scrapfly-sdk

See the Python SDK documentation

PyPi package

Javascript SDK

RELEASE

Javascript SDK 0.5.0 released. This version introduces support for:

  • Web Scraping API format parameter
  • Web Scraping API screenshot_flags parameter

You can now install the new version with npm install scrapfly-sdk@0.5.0 or upgrade with npm install scrapfly-sdk@latest

See the Javascript SDK documentation

NPM package

2024-04-22

Python SDK

ANNOUNCEMENT

Scrapfly now has an official integration with LlamaIndex to help you extract data.

See the LlamaIndex documentation

ANNOUNCEMENT

Scrapfly now has an official integration with LangChain to help you extract data.

See the LangChain documentation

Web Scraping API

FEATURE

The Web Scraping API introduces a new format parameter that converts the scraped page to a specific format. With the rise of LLM usage, you can now convert pages into LLM-friendly formats and more.

You can now convert the scraped page to:

  • markdown
  • text
  • json (auto parse)
  • clean_html

If you are using proxified_response to directly retrieve the content, the announced content-type will follow the format you choose.

To learn more about these formats, refer to the Web Scraping API documentation

FEATURE

You can now pass flags to configure screenshot options directly from the Web Scraping API.

Available flags:

  • load_images
  • dark_mode
  • block_banners
  • high_quality
  • print_media_format
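
A sketch of preparing a screenshot_flags value from these names is given below. Joining the flags with commas is an assumption about the parameter's syntax, and the helper name screenshot_flags_value is hypothetical.

```python
from typing import Iterable

# Flags listed in this release note.
SCREENSHOT_FLAGS = (
    "load_images", "dark_mode", "block_banners",
    "high_quality", "print_media_format",
)


def screenshot_flags_value(flags: Iterable[str]) -> str:
    """Validate flags and join them (comma-separated syntax is assumed)."""
    flags = list(flags)
    unknown = [flag for flag in flags if flag not in SCREENSHOT_FLAGS]
    if unknown:
        raise ValueError(f"unknown screenshot flags: {unknown}")
    # dict.fromkeys drops duplicates while keeping the original order.
    return ",".join(dict.fromkeys(flags))


print(screenshot_flags_value(["dark_mode", "block_banners", "dark_mode"]))
```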

To learn more about these flags, refer to the Web Scraping API documentation
