LangChain Integration
Power up your LLM with web scraping
Scrapfly officially integrates with the LangChain framework for LLM tool development in Python, making RAG accessible to anyone:
- Scrape any page using the Web Scraping API and all of its features, like cloud web browsers and blocking bypass
- Extend your LangChain tools with RAG over web-scraped documents using the Scrapfly document loader
- Automatically convert scraped data to Markdown, JSON or other data types for easy ingestion
Get Started with LangChain Web Automation
Create a free Scrapfly Account
Install Python Packages
See Some Usage Examples!
What can LangChain integration do?
import os

from langchain_community.document_loaders import ScrapflyLoader
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# 1. prompt design
prompt = """
Given the data fetched from product URLs,
find the following product fields {fields}
in the provided markdown and return a JSON
"""
prompt_template = ChatPromptTemplate.from_messages(
    [("system", prompt), ("user", "{markdown}")]
)

# 2. chain: form prompt -> execute with OpenAI -> parse result
os.environ["OPENAI_API_KEY"] = "YOUR OPENAI KEY"
chain = (
    prompt_template
    | ChatOpenAI(model="gpt-3.5-turbo-0125")
    | JsonOutputParser()
)

# 3. Retrieve page HTML as markdown using Scrapfly
loader = ScrapflyLoader(
    ["https://web-scraping.dev/product/1"],
    api_key="YOUR SCRAPFLY KEY",
)
docs = loader.load()

# 4. execute RAG chain with your inputs:
print(chain.invoke({
    "fields": ["price", "title"],
    # pass the scraped markdown text rather than the Document objects
    "markdown": "\n\n".join(doc.page_content for doc in docs),
}))
# prints:
# {
#     'price': '$9.99',
#     'title': 'Box of Chocolate Candy'
# }
ScrapflyLoader extends LangChain with the ability to scrape any page and extend your LLM operations with RAG functionality:
- Bypass scraper blocking to collect web page datasets
- Use JavaScript rendering to scrape all data available on the page
- Automatically convert results to Markdown or JSON for better LLM understanding
Scrapfly integration handles all of the document retrieval challenges in your LLM applications so you can focus on delivering real AI products.
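The blocking bypass and JavaScript rendering listed above are switched on through the loader's scrape_config argument. The following is a minimal sketch, assuming the "asp", "render_js" and "country" keys described in LangChain's ScrapflyLoader documentation; the helper names build_scrape_config and make_loader are hypothetical and exist only for this illustration.

```python
from typing import Any, Dict, List, Optional


def build_scrape_config(
    bypass_blocking: bool = True,
    render_js: bool = True,
    country: Optional[str] = None,
) -> Dict[str, Any]:
    """Build the dict passed as ScrapflyLoader's scrape_config (hypothetical helper)."""
    config: Dict[str, Any] = {
        "asp": bypass_blocking,  # anti-scraping protection bypass
        "render_js": render_js,  # render the page in a cloud headless browser
    }
    if country:
        config["country"] = country  # proxy location, e.g. "us"
    return config


def make_loader(urls: List[str], api_key: str, **options: Any):
    """Create a ScrapflyLoader; needs langchain-community and scrapfly-sdk installed."""
    # Imported lazily so build_scrape_config works without the packages.
    from langchain_community.document_loaders import ScrapflyLoader

    return ScrapflyLoader(
        urls,
        api_key=api_key,
        scrape_format="markdown",  # the loader accepts "markdown" or "text"
        continue_on_failure=True,  # skip URLs that fail instead of raising
        scrape_config=build_scrape_config(**options),
    )


# Example (requires a real key and network access):
# docs = make_loader(["https://web-scraping.dev/product/1"], "YOUR SCRAPFLY KEY").load()
```

Keeping the configuration in a small function makes it easy to turn rendering off for static pages, where a headless browser is usually unnecessary.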
Need more functionality?
Scrapfly is accessible through Python and TypeScript SDKs, so you can create your own scripts and integrations in Python or any JavaScript runtime like Node.js, Deno or Bun!
The SDKs include all Scrapfly API features and many useful utilities and shortcuts making for a powerful development experience.
Transform Your Industry with Web Data
Explore web data solutions for your industry. We've got you covered!
AI Training
Crawl the latest images, videos and user generated content for AI training.
Compliance
Scrape online presence to validate compliance and security.
eCommerce
Scrape products, reviews and more to enhance your eCommerce and brand awareness.
Financial Service
Scrape the latest stock, shipping and financial data to enhance your finance datasets.
Fraud Detection
Scrape products and listings to detect fraud and counterfeit activity.
Jobs Data
Scrape the latest job listings, salaries and more to enhance your job search.
Lead Generation
Scrape online profiles and contact details to enhance your lead generation.
Logistics
Scrape logistics data like shipping, tracking, container prices to enhance your deliveries.
Frequently Asked Questions
How to Web Scrape with LangChain?
To web scrape using LangChain, you can leverage its built-in objects called Loaders, designed for reading external data sources. One such tool is the ScrapflyLoader, which allows you to scrape web pages and retrieve data in various formats, including rendered HTML, JSON, or Markdown. These formats are perfect for creating vector indexes used in Retrieval-Augmented Generation (RAG) applications.
How to Use LangChain with Websites?
LangChain enables you to scrape websites and create a vector index from the scraped content, allowing you to integrate real-world data into any LLM. This approach, called RAG, is ideal for enhancing LLMs with up-to-date information. Use the ScrapflyLoader to load documents from the specified URLs, then build a vector index from them. The usage example on this page shows how LangChain works with websites for advanced LLM prompting.
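To make the indexing and retrieval step concrete, here is a toy, dependency-free sketch. It is not how LangChain or Scrapfly implement it: a bag-of-words count stands in for a real embedding model, a plain list stands in for a vector store, and all function names are hypothetical. It only illustrates the flow of chunking scraped markdown, indexing the chunks, and retrieving the most relevant one for a question.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def chunk_markdown(markdown: str, max_chars: int = 200) -> List[str]:
    """Group paragraphs into chunks of roughly max_chars characters."""
    chunks, current = [], ""
    for para in (p.strip() for p in markdown.split("\n\n")):
        if not para:
            continue
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks


def embed(text: str) -> Counter:
    """Toy 'embedding': term counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_index(pages: List[str]) -> List[Tuple[str, Counter]]:
    """Index every chunk of every scraped page as (chunk, vector)."""
    return [(chunk, embed(chunk)) for page in pages for chunk in chunk_markdown(page)]


def retrieve(index: List[Tuple[str, Counter]], question: str, k: int = 1) -> List[str]:
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


# With scraped documents, the pages would come from the loader, e.g.:
# pages = [doc.page_content for doc in loader.load()]
```

The retrieved chunks would then be placed into the prompt in place of the full page, which keeps the context passed to the LLM short and relevant.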
Is Web Scraping with LangChain Legal?
Yes, using LangChain for web scraping is generally legal when scraping publicly available data. However, it's important to be cautious when handling PII (personally identifiable information) or copyrighted material, as laws like GDPR may impose restrictions on how such data is stored or used. For more details, refer to our comprehensive web scraping laws article.
What is a Web Scraping API?
Web Scraping API is a service that abstracts away the complexities and challenges of web scraping and data extraction. This allows developers to focus on creating software rather than dealing with issues like web scraping blocking and other data access challenges.
What is an Extraction API?
Extraction API is a service that abstracts away the complexities and challenges of data extraction and parsing. It does this through AI auto extract and LLM prompt features, as well as manual schema-based instructions for precise control.
What is a Screenshot API?
Screenshot API is a service that abstracts away the complexities and challenges of web browser screenshot capture. This allows you to capture a screenshot of any web page while handling challenges like blocking ads and pop-ups, bypassing browser blocks, and returning the screenshot in any format of any page area you need.