Scrapfly LangChain Integration

Scrapfly integrates with LangChain, a popular framework for developing applications powered by large language models (LLMs).

In LangChain, Scrapfly is available as a document loader that uses the Scrapfly Web Scrape API to retrieve web page data for use within the LangChain ecosystem.

If you're looking for a more streamlined LLM parsing solution, Scrapfly's Extraction API already covers much of this functionality through its LLM Extraction feature.

Usage

To start, get your Scrapfly API key from your dashboard. Then install the Scrapfly Python SDK and LangChain:
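
The install command itself is not reproduced in this section. Assuming the packages are published on PyPI as scrapfly-sdk, langchain and langchain-community, something like "pip install scrapfly-sdk langchain langchain-community" should cover it. The sketch below is one way to confirm the result from Python; the module list is an assumption based on those package names, and missing_modules is a helper written for this illustration.

```python
from importlib.util import find_spec

# Import names assumed to correspond to the packages above:
#   scrapfly-sdk -> scrapfly, langchain -> langchain,
#   langchain-community -> langchain_community
REQUIRED_MODULES = ["scrapfly", "langchain", "langchain_community"]


def missing_modules(modules=REQUIRED_MODULES):
    """Return the names in `modules` that cannot be found in this environment."""
    return [name for name in modules if find_spec(name) is None]


# Prints an empty list when everything is importable.
print("Missing modules:", missing_modules())
```

If the list is not empty, rerun the install command in the same virtual environment that runs your script.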

Then, the ScrapflyLoader is available for scraping any web page:
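
As a minimal sketch of basic usage, the loader takes a list of URLs and an API key and returns LangChain documents. The helper names load_documents and summarize are made up for this illustration, and reading the source URL from the document metadata under a "url" key is an assumption that may differ between versions.

```python
def load_documents(urls, api_key):
    """Scrape each URL through Scrapfly and return LangChain Document objects."""
    # Imported lazily so this module loads even before the packages are installed.
    from langchain_community.document_loaders import ScrapflyLoader

    loader = ScrapflyLoader(
        urls,
        api_key=api_key,
        continue_on_failure=True,  # log pages that fail instead of raising
    )
    return loader.load()


def summarize(documents, limit=80):
    """Return (source URL, content preview) pairs for a quick sanity check."""
    # Assumption: the loader stores the page URL in metadata under "url".
    return [(doc.metadata.get("url"), doc.page_content[:limit]) for doc in documents]


# Example call (needs a valid key and network access):
#   docs = load_documents(["https://web-scraping.dev/products"], "YOUR SCRAPFLY KEY")
#   print(summarize(docs))
```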

For more advanced use, the integration supports all Scrapfly Web Scrape API options matching the Python SDK signature:
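
A possible shape for this, assuming the options are passed as a plain dictionary through the loader's scrape_config argument: the keys below mirror Web Scrape API parameters from the Python SDK, while the chosen values and the helper names build_scrape_config and make_loader are only illustrative.

```python
def build_scrape_config(country="us", render_js=True):
    """Return Web Scrape API options to pass as the loader's `scrape_config`."""
    return {
        "asp": True,  # bypass anti-scraping protection
        "render_js": render_js,  # render the page with a cloud headless browser
        "proxy_pool": "public_residential_pool",  # proxy pool to route through
        "country": country,  # proxy location
        "auto_scroll": True,  # scroll to the bottom to trigger lazy-loaded content
    }


def make_loader(urls, api_key, scrape_config):
    """Create a ScrapflyLoader that applies `scrape_config` to every URL."""
    # Imported lazily so this module loads even before the packages are installed.
    from langchain_community.document_loaders import ScrapflyLoader

    return ScrapflyLoader(
        urls,
        api_key=api_key,
        continue_on_failure=True,
        scrape_config=scrape_config,
        scrape_format="markdown",  # "text" is the other supported format
    )
```

Keeping the options in a separate function makes it easy to reuse one configuration across several loaders.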

Example Use

LangChain is a large, feature-rich framework that can be daunting at first, but the Scrapfly document loader simplifies things considerably: it returns clean markdown pages that don't need any complex processing.

For this example, let's take a look at a very simple RAG chain which will:

  1. Scrape a product page (https://web-scraping.dev/product/1) as markdown using Scrapfly
  2. Generate a prompt for data extraction using LangChain prompt templates
  3. Submit page data to the prompt and cast the results to JSON

In LangChain, this whole process looks as simple as the script below:
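
The following is a sketch of such a script rather than a definitive version. It assumes OpenAI as the LLM provider (through the separate langchain-openai package and the gpt-4o-mini model), API keys supplied in the SCRAPFLY_API_KEY and OPENAI_API_KEY environment variables, and a prompt wording made up for this illustration. None of these choices come from the integration itself.

```python
import os

# Hypothetical prompt wording; only the two placeholders matter to the chain.
PROMPT_TEMPLATE = (
    "You are a data extraction assistant. Answer the question using only the "
    "page content below and reply with a single JSON object.\n\n"
    "Question: {question}\n\n"
    "Page content:\n{context}"
)


def extract_product_data(url, question):
    """Scrape `url` as markdown and ask an LLM to extract data as a dictionary."""
    # Imported lazily so this module loads even before the packages are installed.
    from langchain_community.document_loaders import ScrapflyLoader
    from langchain_core.output_parsers import JsonOutputParser
    from langchain_core.prompts import ChatPromptTemplate
    from langchain_openai import ChatOpenAI  # assumed provider, reads OPENAI_API_KEY

    # 1. Scrape the page as markdown.
    loader = ScrapflyLoader(
        [url],
        api_key=os.environ["SCRAPFLY_API_KEY"],  # assumed variable name
        scrape_format="markdown",
        continue_on_failure=False,  # fail loudly, we only scrape one page
    )
    documents = loader.load()

    # 2. Build the prompt template with its placeholders.
    prompt = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)

    # 3. Pipe prompt -> model -> JSON parser and run the chain.
    chain = prompt | ChatOpenAI(model="gpt-4o-mini") | JsonOutputParser()
    return chain.invoke(
        {"question": question, "context": documents[0].page_content}
    )


# Example call (needs both API keys and network access):
#   extract_product_data("https://web-scraping.dev/product/1",
#                        "What are the product name and price?")
```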

Let's unpack the above example step-by-step. We first set up our prompt template with placeholders that are filled in when the chain is invoked:
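
To make the placeholder mechanics concrete, here is a small sketch with a made-up template. The preview_prompt helper uses Python's own str.format, which behaves the same way as LangChain's default template format for simple named placeholders like these, so the final prompt text can be inspected without calling any model. The wording of TEMPLATE and both helper names are assumptions for illustration.

```python
# Hypothetical template; {question} and {context} are the placeholders.
TEMPLATE = (
    "Extract the requested data from the page below and reply in JSON.\n"
    "Request: {question}\n"
    "Page:\n{context}"
)


def preview_prompt(question, context):
    """Expand the placeholders with str.format to preview the final prompt text."""
    return TEMPLATE.format(question=question, context=context)


def as_chat_prompt():
    """Wrap the same template in a LangChain ChatPromptTemplate."""
    # Imported lazily so this module loads even before LangChain is installed.
    from langchain_core.prompts import ChatPromptTemplate

    return ChatPromptTemplate.from_template(TEMPLATE)


print(preview_prompt("product name and price", "# Example page"))
```

Avoid literal curly braces in the template text, since the formatter would treat them as placeholders.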

Then, we configure and run our Scrapfly document loader to scrape the page as markdown:
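
A sketch of this step, assuming the API key lives in a SCRAPFLY_API_KEY environment variable (a name chosen for this example). The documents_to_context helper is also an illustrative addition: it joins the scraped pages into a single string that can fill the prompt's context placeholder.

```python
import os

PRODUCT_URL = "https://web-scraping.dev/product/1"


def scrape_as_markdown(url=PRODUCT_URL):
    """Scrape a single page through Scrapfly and return it as markdown documents."""
    # Imported lazily so this module loads even before the packages are installed.
    from langchain_community.document_loaders import ScrapflyLoader

    loader = ScrapflyLoader(
        [url],
        api_key=os.environ["SCRAPFLY_API_KEY"],  # assumed variable name
        scrape_format="markdown",  # ask for markdown rather than plain text
        scrape_config={"asp": True, "render_js": True},  # optional API options
    )
    return loader.load()


def documents_to_context(documents):
    """Join page contents into one string for the prompt's context placeholder."""
    return "\n\n".join(doc.page_content for doc in documents)
```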

Finally, with the prompt template and the page data ready, we can execute the RAG chain with our inputs. This returns a JSON object (a Python dictionary):
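
The sketch below shows one way to wire this last step. run_chain expects an already built prompt and chat model and pipes them into LangChain's JsonOutputParser. parse_json_reply is a standard-library approximation of the casting step, included only to show what "cast to JSON" means in practice; it is not LangChain's implementation, and both function names are made up for this example.

```python
import json
import re

FENCE = "`" * 3  # Markdown code-fence marker that models often wrap JSON in
_FENCED = re.compile(
    re.escape(FENCE) + r"(?:json)?\s*(.*?)\s*" + re.escape(FENCE), re.DOTALL
)


def parse_json_reply(text):
    """Parse a model reply into Python data, tolerating a surrounding code fence."""
    match = _FENCED.search(text)
    payload = match.group(1) if match else text
    return json.loads(payload)


def run_chain(prompt, model, inputs):
    """Pipe prompt -> model -> JSON parser and return the parsed result."""
    # Imported lazily so this module loads even before LangChain is installed.
    from langchain_core.output_parsers import JsonOutputParser

    chain = prompt | model | JsonOutputParser()
    return chain.invoke(inputs)  # `inputs` maps placeholder names to values
```

Because the result is an ordinary dictionary, it can be passed straight to json.dumps or any other downstream code.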

Errors

LangChain displays Scrapfly API errors in the standard Scrapfly API error message format. For more, see the Scrapfly API error documentation.
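
When the loader is created with continue_on_failure=False, a failed scrape raises instead of being logged. A minimal way to capture the message is sketched below. load_or_report is a hypothetical helper that works with any object exposing a load() method, and it catches the generic Exception class because the exact exception types depend on the SDK version.

```python
def load_or_report(loader):
    """Return (documents, None) on success or ([], error_message) on failure."""
    try:
        return loader.load(), None
    except Exception as exc:  # exact exception classes depend on the SDK version
        return [], str(exc)


# Example use with a real loader (needs a key and network access):
#   documents, error = load_or_report(loader)
#   if error:
#       print("Scrape failed:", error)
```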

Pricing

No additional costs.

Summary