Scrapfly LangChain Integration
Scrapfly is available on LangChain
- a popular framework for developing applications powered by large language models (LLMs).
For Langchain, Scrapfly is available as a document loader object which uses Scrapfly Web Scrape API
to retrieve web page data for use within the LangChain ecosystem.
Note that if you're looking for a more streamlined LLM parsing solution, Scrapfly's Extraction API already provides much of LangChain's functionality through its LLM Extraction feature.
Usage
To start, get your Scrapfly API key from your dashboard. Then install the Scrapfly Python SDK and LangChain:
pip install scrapfly-sdk langchain langchain-community
Then, the ScrapflyLoader
is available for scraping any web page:
from langchain_community.document_loaders import ScrapflyLoader
scrapfly_loader = ScrapflyLoader(
["https://web-scraping.dev/products"],
api_key="Your ScrapFly API key", # Get your API key from https://www.scrapfly.io/
continue_on_failure=False, # set to True to ignore unprocessable web pages and log their exceptions
)
# Load documents from URLs as markdown
documents = scrapfly_loader.load()
print(documents)
For more advanced use, the integration supports all Scrapfly Web Scrape API options matching the Python SDK signature:
from langchain_community.document_loaders import ScrapflyLoader
scrapfly_scrape_config = {
"asp": True, # Bypass scraping blocking and antibot solutions, like Cloudflare
"render_js": True, # Enable JavaScript rendering with a cloud headless browser
"proxy_pool": "public_residential_pool", # Select a proxy pool (datacenter or residnetial)
"country": "us", # Select a proxy location
"auto_scroll": True, # Auto scroll the page
"js": "", # Execute custom JavaScript code by the headless browser
}
scrapfly_loader = ScrapflyLoader(
["https://web-scraping.dev/products"],
api_key="Your ScrapFly API key", # Get your API key from https://www.scrapfly.io/
scrape_config=scrapfly_scrape_config, # Pass the scrape_config object
scrape_format="markdown", # The scrape result format, either `markdown`(default) or `text`
continue_on_failure=True, # ignore unprocessable web pages and log their exceptions
)
# Load documents from URLs as markdown
documents = scrapfly_loader.load()
print(documents)
Example Use
LangChain is a large, feature-rich framework that can be daunting at first, but the Scrapfly document loader greatly simplifies its use by returning clean markdown pages that need no complex processing.
For this example, let's take a look at a very simple RAG chain which will:
- Scrape a product page (https://web-scraping.dev/product/1) using Scrapfly as markdown
- Generate a prompt for data extraction using LangChain prompt templating
- Submit page data to the prompt and cast the results to JSON
In LangChain, this whole process is as simple as the script below:
import os
from langchain_community.document_loaders import ScrapflyLoader
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
# 1. add your OpenAI API key
os.environ["OPENAI_API_KEY"] = "YOUR OPENAI API KEY"
# 2. prompt design
prompt = "Given the data fetched from the specified product URLs, \
find the following product fields {fields} in the provided \
markdown and return a JSON"
prompt_template = ChatPromptTemplate.from_messages(
[("system", prompt), ("user", "{markdown}")]
)
# 3. put together in a chain: prompt -> OpenAI -> result parser
chain = (
prompt_template
| ChatOpenAI(model="gpt-3.5-turbo-0125")
| JsonOutputParser()
)
# 4. Retrieve page HTML as markdown using Scrapfly
loader = ScrapflyLoader(
["https://web-scraping.dev/product/1"],
api_key="YOUR SCRAPFLY KEY",
)
docs = loader.load()
# 5. execute RAG chain with your inputs:
print(chain.invoke({
"fields": ["price", "title"], # select product price and field
"markdown": docs # supply the markdown content from Scrapfly scraper
}))
# prints:
# {'price': '$9.99', 'title': 'Box of Chocolate Candy'}
Let's unpack the above example step-by-step. We first set up our prompt template with placeholders that'll be expanded on the chain call:
system:
Given the data fetched from the specified product URLs,
find the following product fields {fields} in the provided
markdown and return a JSON:
user:
{markdown}
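At invocation time, the chain substitutes the {fields} and {markdown} placeholders with the values passed to chain.invoke(). Conceptually this works much like Python's own str.format; a minimal stdlib-only sketch of the substitution (the real ChatPromptTemplate produces chat messages, not a plain string):

```python
# The system prompt from the example above, with its {fields} placeholder
system_template = (
    "Given the data fetched from the specified product URLs, "
    "find the following product fields {fields} in the provided "
    "markdown and return a JSON"
)

# Substitute the placeholder the way the chain call does with its input dict
filled = system_template.format(fields=["price", "title"])
print(filled)
```

The same substitution happens for the user message, where {markdown} is replaced with the documents loaded by Scrapfly.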
Then, we configure and run our Scrapfly document loader to scrape the page as markdown:
# 4. Retrieve page HTML as markdown using Scrapfly
loader = ScrapflyLoader(
["https://web-scraping.dev/product/1"],
api_key="YOUR SCRAPFLY KEY",
)
docs = loader.load()
Finally, with the prompt template and our data in hand, we can execute the RAG chain with our inputs.
This returns a JSON object (a Python dictionary):
{'price': '$9.99', 'title': 'Box of Chocolate Candy'}
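The dictionary above is produced by JsonOutputParser, which reads the model's text reply and parses the JSON it contains. Conceptually that step is close to running json.loads on the reply; a simplified stdlib sketch with a hypothetical model reply (the real parser also handles markdown code fences and streaming partial output):

```python
import json

# A hypothetical model reply containing the extracted fields as a JSON string
model_reply = '{"price": "$9.99", "title": "Box of Chocolate Candy"}'

# Parse the reply into a Python dictionary, as JsonOutputParser does
result = json.loads(model_reply)
print(result["price"])  # -> $9.99
print(result["title"])  # -> Box of Chocolate Candy
```

From here the result is a plain dictionary, ready for storage or further processing in Python.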
Errors
LangChain will surface Scrapfly API errors in the standard Scrapfly API error message format; see the Scrapfly errors documentation for details.
Pricing
No additional costs.