With recent advances in artificial intelligence, it has become easy to access and use LLMs through frameworks such as LlamaIndex and LangChain. But what about extending these LLMs with web-scraped data?
In this article, we'll explain how to use LLMs and web scraping for RAG applications. We'll start by defining the related concepts and then go through a step-by-step tutorial on applying them to both LlamaIndex and LangChain with Python. Let's get started!
Large Language Models (LLMs) are machine learning models specialized in human text. They can understand and generate text based on a given input. LLMs are able to reply to a given prompt input by processing the text and evaluating it using their trained data.
In simple terms, LLMs are built using a specific type of machine learning model called a neural network. These networks are trained on a large amount of plain text data. Once an input is received, it's processed in two major steps:
Using LLMs for web scraping enables various use cases thanks to their text understanding capabilities, such as sentiment analysis, answering questions, summarizing text, or assisting in code generation.
Retrieval Augmented Generation (RAG) is a technique used to improve a large language model's output. To understand why it is used, let's explore a common limitation.
An LLM can be trained with terabytes of data and billions of parameters. However, it may lack understanding of a specific, niche, or private business domain. At the same time, retraining an LLM is time-consuming and requires a lot of engineering resources.
The RAG technique allows for extending a pre-trained LLM with additional datasets. This approach makes the model aware of a specific, up-to-date context, making it far more accurate at answering questions or assisting with submitted prompts.
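To make the idea concrete, here is a minimal sketch of the retrieve-then-augment flow in plain Python. It is an illustration only: the passages are hypothetical stand-ins for scraped data, the `retrieve` and `build_prompt` helpers are names invented for this example, and the word-overlap ranking is a simple substitute for the embedding search that real RAG frameworks use.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, passages: list[str], top_k: int = 2) -> list[str]:
    """Rank passages by how many words they share with the question."""
    question_words = tokenize(question)
    ranked = sorted(
        passages,
        key=lambda passage: len(question_words & tokenize(passage)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question: str, context: list[str]) -> str:
    """Place the retrieved passages in front of the question."""
    context_block = "\n".join(f"- {passage}" for passage in context)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context_block}\n"
        f"Question: {question}"
    )


# Hypothetical passages standing in for scraped web data
passages = [
    "The dark energy potion has a bold cherry cola flavor.",
    "The box of chocolate candy comes in orange and cherry flavors.",
    "Shipping is free for orders above a set amount.",
]
question = "What is the flavor of the dark energy potion?"
prompt = build_prompt(question, retrieve(question, passages, top_k=1))
print(prompt)
```

The resulting prompt is what would be sent to the LLM: the model itself is unchanged, and only the text it receives is enriched with the retrieved context.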
In the following sections, we'll go through a step-by-step guide on applying web scraping with LLMs to create a context-augmented RAG model.
This workflow can be broken down into the following steps:
That being said, there are two challenges associated with this web scraping LLM workflow:
To address the above challenges, we'll use Scrapfly to scrape web pages as text or markdown, as both formats are accessible to LLMs. As for LLM communication, we'll use LlamaIndex and LangChain.
It's common for web scraping tools to send HTTP requests to web pages in order to retrieve their data as HTML. However, when using web scraping as the RAG data source, we have to extract the web data in a format that LLMs understand, either as text or markdown.
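As a rough illustration of what turning HTML into LLM-friendly text involves, the sketch below strips tags with Python's standard `html.parser` module. This is not how Scrapfly performs its conversion; the `TextExtractor` class, the `html_to_text` helper, and the sample HTML are assumptions made for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped tag
        text = data.strip()
        if text and self._skip_depth == 0:
            self.parts.append(text)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML document, one node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


# Hypothetical HTML snippet for demonstration
sample_html = """
<html><head><style>h1 {color: red}</style></head>
<body><h1>Products</h1><p>Box of Chocolate Candy</p>
<script>console.log("tracking")</script></body></html>
"""
print(html_to_text(sample_html))
```

A hand-rolled extractor like this loses links, headings, and list structure, which is one reason a markdown output format is convenient for LLM input.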
ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale.
Here's how to use Scrapfly for LLM web scraping as Markdown using the Python SDK:
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
scrapfly = ScrapflyClient(key="Your Scrapfly API key")
api_response: ScrapeApiResponse = scrapfly.scrape(
    ScrapeConfig(
        # target website URL
        url="https://web-scraping.dev/login",
        # bypass anti scraping protection
        asp=True,
        # set the proxy location to a specific country
        country="US",
        # specify the proxy pool
        proxy_pool="public_residential_pool",
        # enable JavaScript rendering (use a cloud browser)
        render_js=True,
        # specify the web scraping format
        format="markdown"
    )
)
# get the results
data = api_response.scrape_result['content']
print(data)
"""
[web-scraping.dev](https://web-scraping.dev/)
* Docs
* [API](https://web-scraping.dev/docs)
* [Graphql](https://web-scraping.dev/api/graphql)
* [Products](https://web-scraping.dev/products)
* [Reviews](https://web-scraping.dev/reviews)
* [Testimonials](https://web-scraping.dev/testimonials)
* [login](https://web-scraping.dev/login)
....
"""
For the rest of this guide, we'll be using Scrapfly to extract the data required for RAG system building. To follow along, sign up to get your Scrapfly API key.
LlamaIndex is an open-source framework for connecting datasets to large language models. It provides the components required for building context-augmented LLMs.
Context augmentation makes a model aware of the provided datasets, enabling various use cases, including:
To build RAG models with LlamaIndex, we'll feed it web-scraped data. For this, we'll use Scrapfly's LlamaIndex web scraping integration, which retrieves web page data as markdown documents that LLMs can consume.
First, let's install the required Python packages:
The above packages can be installed using the following pip command:
pip install llama-index llama-index-readers-web scrapfly-sdk
Let's start by using LlamaIndex web scraping to retrieve a web page to feed the LLM. For this, we'll use the LlamaIndex ScrapflyReader:
from llama_index.readers.web import ScrapflyReader
# Initiate ScrapflyReader with your Scrapfly API key
scrapfly_reader = ScrapflyReader(
    api_key="Your Scrapfly API key",
    ignore_scrape_failures=True,  # Ignore unprocessable web pages and log their exceptions
)
scrapfly_scrape_config = {
    "asp": True,  # Bypass scraping blocking and antibot solutions, like Cloudflare
    "render_js": True,  # Enable JavaScript rendering with a cloud headless browser
    "proxy_pool": "public_residential_pool",  # Select a proxy pool (datacenter or residential)
    "country": "us",  # Select a proxy location
    "auto_scroll": True,  # Auto scroll the page
    "js": "",  # Execute custom JavaScript code by the headless browser
}
# Load documents from URLs as markdown
documents = scrapfly_reader.load_data(
    urls=["https://web-scraping.dev/products"],  # List of URLs to scrape
    scrape_config=scrapfly_scrape_config,  # Pass the scrape config
    scrape_format="markdown",  # The scrape result format, either `markdown` (default) or `text`
)
print(documents)
The above code is fairly straightforward. Let's break down its workflow:
- ScrapflyReader gets initialized using the Scrapfly API key.
- scrapfly_scrape_config object is created. It represents the Scrapfly API parameters to use with each scrape request.
- load_data method is used to pass a list of URLs to scrape for LLM as markdown and convert them to documents.
Now that the documents are ready, let's proceed with the RAG model creation by augmenting an LLM with the scraped data.
LlamaIndex has integrations with most of the popular LLMs. These include cloud LLMs, such as OpenAI, Mistral, and Gemini, as well as local LLMs, such as those run through Ollama. However, using cloud LLMs requires a subscription with the selected provider, so local models can be a great alternative.
In this guide on using web scraping for retrieval-augmented generation, we'll use OpenAI as the LLM, which is the default LLM for LlamaIndex SDK. For instructions on using other LLMs, refer to the official LlamaIndex examples documentation.
Here's how to use web scraping for RAG models using OpenAI. First, get your OpenAI key and use the following code:
import os
from llama_index.readers.web import ScrapflyReader
from llama_index.core import VectorStoreIndex
scrapfly_reader = ScrapflyReader(
    api_key="Your Scrapfly API key",
    ignore_scrape_failures=True,
)
# Load documents from URLs as markdown
documents = scrapfly_reader.load_data(
    urls=["https://web-scraping.dev/products"]
)
# Set the OpenAI key as an environment variable
os.environ['OPENAI_API_KEY'] = "Your OpenAI Key"
# Create an index store for the documents
index = VectorStoreIndex.from_documents(documents)
# Create the RAG engine using the index store
query_engine = index.as_query_engine()
# Submit a query
response = query_engine.query("What is the flavor of the dark energy potion?")
print(response)
"The flavor of the dark energy potion is bold cherry cola."
Here, we start by creating a VectorStoreIndex, a component required by the RAG model. It splits the documents into chunks, converts each chunk into an embedding vector, and stores the results in memory. Then, we create a query_engine over the index, which uses the LLM to answer queries.
The above query prompt example briefly illustrates how to use retrieval augmented generation with web scraping. We asked a question regarding the scraped data and got the correct result!
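The chunking step performed when indexing can be pictured with a short sketch. The function below splits text into fixed-size character chunks that share an overlap with their neighbors. It is a simplified illustration under stated assumptions, not LlamaIndex's actual splitter, and `split_with_overlap` is a hypothetical helper name.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character chunks that share an overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    # Each new chunk starts this many characters after the previous one
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


chunks = split_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(chunks)
```

With a chunk size of 10 and an overlap of 3, every chunk begins with the last three characters of the previous one, so text near a boundary appears in both chunks instead of being cut off from its surroundings.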
That being said, RAG for web scraping can be used for more advanced data processing tasks. For example, let's attempt to convert the web page data into a clean JSON dataset using a query prompt:
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("Add the product data into a JSON dataset as an array of objects")
print(response)
From the query response, we can observe that the RAG model took care of the data parsing, processing, and cleaning:
[
    {
        "name": "Box of Chocolate Candy",
        "url": "https://web-scraping.dev/product/1",
        "description": "Indulge your sweet tooth with our Box of Chocolate Candy...",
        "price": 24.99
    },
    ....
]
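Since LLM output is free-form text, it's worth validating a response like the one above before using it as data. The sketch below is one possible way to do so with the standard `json` module. The `parse_product_array` helper and the `REQUIRED_KEYS` set are assumptions for this example, with the keys mirroring the fields in the sample output, and the response string is a hypothetical stand-in for the text returned by the query engine.

```python
import json

# Fields expected in each product, mirroring the sample output above
REQUIRED_KEYS = {"name", "url", "description", "price"}


def parse_product_array(response_text: str) -> list[dict]:
    """Extract and validate a JSON array of products from an LLM response."""
    # The model may wrap the array in extra prose, so locate its bounds first
    start = response_text.find("[")
    end = response_text.rfind("]")
    if start == -1 or end <= start:
        raise ValueError("no JSON array found in the response")
    products = json.loads(response_text[start:end + 1])
    for product in products:
        if not isinstance(product, dict):
            raise ValueError("every array item must be a JSON object")
        missing = REQUIRED_KEYS - product.keys()
        if missing:
            raise ValueError(f"product is missing keys: {sorted(missing)}")
    return products


# Hypothetical response text for demonstration
response_text = """Here is the dataset:
[{"name": "Box of Chocolate Candy",
  "url": "https://web-scraping.dev/product/1",
  "description": "Indulge your sweet tooth with our Box of Chocolate Candy...",
  "price": 24.99}]"""
products = parse_product_array(response_text)
print(products[0]["name"], products[0]["price"])
```

Malformed JSON raises `json.JSONDecodeError`, which is a subclass of `ValueError`, so a single `except ValueError` covers both parsing and validation failures.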
LangChain is another popular framework for communicating with LLMs. It provides several components for working with and processing languages for several use cases, including:
To use LLMs and web scraping with LangChain RAG models, we'll use Scrapfly's LangChain web scraping integration. It exposes the Scrapfly API capabilities, including retrieving web page data as markdown and text.
Let's start with the installation process. We'll install the core LangChain Python packages, as well as additional utility packages:
Install the above packages using the following pip command:
pip install langchain langchainhub langchain-community langchain-chroma langchain-openai langchain-text-splitters scrapfly-sdk
The first step in building LangChain RAG models is extracting the data to augment the LLM's context. For this, we'll use the ScrapflyLoader to scrape a web page as markdown:
from langchain_community.document_loaders import ScrapflyLoader
scrapfly_scrape_config = {
    "asp": True,  # Bypass scraping blocking and antibot solutions, like Cloudflare
    "render_js": True,  # Enable JavaScript rendering with a cloud headless browser
    "proxy_pool": "public_residential_pool",  # Select a proxy pool (datacenter or residential)
    "country": "us",  # Select a proxy location
    "auto_scroll": True,  # Auto scroll the page
    "js": "",  # Execute custom JavaScript code by the headless browser
}
scrapfly_loader = ScrapflyLoader(
    urls=["https://web-scraping.dev/products"],
    api_key="Your ScrapFly API key",
    continue_on_failure=True,  # Ignore unprocessable web pages and log their exceptions
    scrape_config=scrapfly_scrape_config,  # Pass the scrape_config object
    scrape_format="markdown",  # The scrape result format, either `markdown` (default) or `text`
)
# Load documents from URLs as markdown
documents = scrapfly_loader.load()
print(documents)
Here, we create a scrapfly_scrape_config object with the desired Scrapfly API parameters to use with the scrape requests. Then, we pass it to the ScrapflyLoader along with the web page URLs to scrape.
The next step is to load the scraped markdown documents into an LLM for the LangChain RAG application building.
LangChain has native integrations with dozens of LLM providers through both cloud and local setups. In this example of a RAG application using web scraping and LangChain, we'll be using OpenAI as the LLM of choice.
The first step is creating an OpenAI key from the account dashboard. An OpenAI subscription is required for this step. A great alternative is using local LLM frameworks, such as Ollama. Refer to the documentation example for the usage instructions.
Here's how to utilize web scraping with LangChain to create a RAG application with OpenAI as an LLM:
import os
from langchain import hub
from langchain_chroma import Chroma
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import ScrapflyLoader
scrapfly_loader = ScrapflyLoader(
    urls=["https://web-scraping.dev/products"],
    api_key="Your Scrapfly API key",
    continue_on_failure=True,
)
# Load the web page data into markdown documents
documents = scrapfly_loader.load()
# Set the OpenAI key as an environment variable
os.environ["OPENAI_API_KEY"] = "Your OpenAI key"
# Create a chunk splitter with 1000 chars each and 200 chars to overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
# Save the documents into splits
splits = text_splitter.split_documents(documents)
# Create a vector store
vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())
# Create a retriever object to support document searches
retriever = vectorstore.as_retriever()
In the above code, we start by retrieving the web pages as markdown documents using ScrapflyLoader. After the documents are retrieved, they get processed through a few steps to create a search vector store:
- A text_splitter to split the documents into chunks. A large chunk makes fitting documents into the limited model context harder. The chunk overlap prevents important words from being separated from their full context during the process.
- A vectorstore with the divided chunks, using OpenAI as the embedding model.
- A retriever object to fetch the relevant documents based on the submitted prompt.
Next, we'll use the vector store retriever with OpenAI to build the RAG chain model:
#....
retriever = vectorstore.as_retriever()
def format_docs(docs):
    # Join the retrieved documents' text into a single context string
    return "\n\n".join(doc.page_content for doc in docs)
# Use OpenAI as the LLM model
model = ChatOpenAI()
# Use rag-prompt as the prompt template https://smith.langchain.com/hub/rlm/rag-prompt
prompt = hub.pull("rlm/rag-prompt")
# Create a QA retriever chain to pass the documents with each prompt
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)
# Submit a prompt query
response = rag_chain.invoke("What are the chocolate candy box flavors?")
print(response)
"The chocolate candy box flavors include zesty orange and sweet cherry."
Let's break down the above code:
- A format_docs function to format the retriever's returned documents as a string.
- A rag_chain as a pipeline to process incoming prompt queries.
From the prompt response, we can see that the LangChain RAG model can effectively understand and query the extracted data!
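The `|` syntax used to build the chain relies on Python operator overloading. The toy sketch below shows the general idea of chaining steps with `|`. It is not LangChain's implementation: the `Step` class and the three fake stages are hypothetical stand-ins for the retriever, prompt, and model.

```python
from typing import Any, Callable


class Step:
    """Wrap a function so steps can be chained with the | operator."""

    def __init__(self, func: Callable[[Any], Any]) -> None:
        self.func = func

    def __or__(self, other: "Step") -> "Step":
        # Run this step first, then feed its result into the next step
        return Step(lambda value: other.func(self.func(value)))

    def invoke(self, value: Any) -> Any:
        return self.func(value)


# Hypothetical stand-ins for the retriever, prompt, and model stages
fake_retrieve = Step(lambda question: {"context": "orange, cherry", "question": question})
fake_prompt = Step(lambda data: f"Context: {data['context']}\nQuestion: {data['question']}")
fake_model = Step(lambda prompt_text: prompt_text.upper())

chain = fake_retrieve | fake_prompt | fake_model
print(chain.invoke("Which flavors?"))
```

Each `|` returns a new step whose output feeds the next one, which is why the submitted question flows from retrieval to prompt formatting to the model in the order the chain is written.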
To wrap up this guide on building a RAG system for web scraping, let's have a look at some frequently asked questions.
Using web scraping for RAG applications can empower various use cases based on the data domain, including:
LLM stands for large language model: a neural network trained on a vast amount of text data, making it able to understand human text. Popular examples are the models behind ChatGPT and Gemini. RAG, on the other hand, stands for retrieval-augmented generation. It refers to enhancing pre-trained LLMs with custom data supplied as context, making the LLM aware of the provided datasets.
The short answer is no. LLMs are trained to comprehend linear text data, but HTML follows a tree-based structure, which is challenging for LLMs to interpret and understand. Hence, using web scraping for LLMs requires the extracted data to be parsed. Such a solution is provided by Scrapfly's format feature, enabling scraping any web page as text or markdown.
In this guide, we have explained what LLMs and RAG applications are and how they compare to each other: LLMs are the text models themselves, which get fed with custom data to build the RAG application.
Then, we went through a step-by-step guide to using LLMs with web scraping to build RAG systems with both LlamaIndex and LangChain. In a nutshell, the required steps are: