Scrapfly LlamaIndex Integration
Scrapfly is available on LlamaIndex - a popular data framework for building LLM applications.
For LlamaIndex, Scrapfly is available as a Web Page Reader object which uses the Scrapfly Web Scrape API to retrieve web page data for use within the LlamaIndex ecosystem.
If you're looking for a more streamlined LLM parsing solution, note that many of these capabilities are already provided by Scrapfly's Extraction API through its LLM Extraction feature.
Usage
To start, get your Scrapfly API key from your dashboard.
Then, install LlamaIndex (including its web readers package) and the Scrapfly Python SDK:
pip install llama-index llama-index-readers-web scrapfly-sdk
Then, the ScrapflyReader is available for scraping any web page:
from llama_index.readers.web import ScrapflyReader

# Initiate ScrapflyReader with your ScrapFly API key
scrapfly_reader = ScrapflyReader(
    api_key="Your ScrapFly API key",  # Get your API key from https://www.scrapfly.io/
    ignore_scrape_failures=True,  # Ignore unprocessable web pages and log their exceptions
)

# Load documents from URLs as markdown
documents = scrapfly_reader.load_data(
    urls=["https://web-scraping.dev/products"]
)
For more advanced use, the integration supports all Scrapfly Web Scrape API options matching the Python SDK signature:
from llama_index.readers.web import ScrapflyReader

# Initiate ScrapflyReader with your ScrapFly API key
scrapfly_reader = ScrapflyReader(
    api_key="Your ScrapFly API key",  # Get your API key from https://www.scrapfly.io/
    ignore_scrape_failures=True,  # Ignore unprocessable web pages and log their exceptions
)

scrapfly_scrape_config = {
    "asp": True,  # Bypass scraping blocking and antibot solutions, like Cloudflare
    "render_js": True,  # Enable JavaScript rendering with a cloud headless browser
    "proxy_pool": "public_residential_pool",  # Select a proxy pool (datacenter or residential)
    "country": "us",  # Select a proxy location
    "auto_scroll": True,  # Auto scroll the page
    "js": "",  # Execute custom JavaScript code by the headless browser
}

# Load documents from URLs as markdown
documents = scrapfly_reader.load_data(
    urls=["https://web-scraping.dev/products"],
    scrape_config=scrapfly_scrape_config,  # Pass the scrape config
    scrape_format="markdown",  # The scrape result format, either `markdown` (default) or `text`
)
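The config dictionary is not limited to the options above; any option matching the Python SDK's ScrapeConfig signature can be passed through. Below is a rough sketch using caching and a selector wait - the parameter names are assumed from the SDK's ScrapeConfig and the URL and selector are purely illustrative, so verify them against the Web Scrape API documentation:

# Assumed ScrapeConfig-style options; check the Web Scrape API docs for exact names
scrapfly_scrape_config = {
    "render_js": True,               # render the page with a headless browser
    "wait_for_selector": ".review",  # wait for a CSS selector before capturing the page (hypothetical selector)
    "cache": True,                   # cache the scrape result on Scrapfly's side
    "cache_ttl": 3600,               # cache lifetime in seconds
}
documents = scrapfly_reader.load_data(
    urls=["https://web-scraping.dev/reviews"],
    scrape_config=scrapfly_scrape_config,
)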
Example Use
LlamaIndex is a very large, feature-rich framework that can be daunting at first, but the Scrapfly document loader greatly simplifies the experience by scraping the provided pages as markdown for easier processing.
For this example, let's explore a simple RAG scenario and use OpenAI to query it:
- We will use the ScrapflyReader to scrape pages as markdown into a local index
- Load the documents into an OpenAI query engine
- Create a prompt template and execute some prompts on the scraped data
In LlamaIndex, this whole process looks as simple as the script below:
import os

from llama_index.core import VectorStoreIndex
from llama_index.llms.openai import OpenAI
from llama_index.readers.web import ScrapflyReader

# 1. Add OpenAI API key
os.environ["OPENAI_API_KEY"] = "YOUR OPENAI API KEY"

# 2. Set up ScrapflyReader with your ScrapFly API key
scrapfly_reader = ScrapflyReader(
    api_key="YOUR SCRAPFLY KEY",
)

# 3. Scrape web pages as markdown
documents = scrapfly_reader.load_data(
    urls=["https://web-scraping.dev/product/1"],
    scrape_config={"render_js": True},  # note: scrape options can be configured here
    scrape_format="markdown",
)

# 4. Create index and query engine
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(
    llm=OpenAI(model="gpt-3.5-turbo-0125"),
)

# 5. Define prompt template
prompt_template = "Given the data fetched from the specified product URLs, \
find the following product fields {fields} in the \
provided markdown and return a JSON".format

# 6. Run query
response = query_engine.query(prompt_template(
    fields=["price", "title"],
))
print(response)
# prints:
# {
#   "price": "$9.99 from $12.99",
#   "title": "Box of Chocolate Candy"
# }
Let's unpack the above example step-by-step. We first set up our OpenAI API key and the ScrapflyReader.
Then, we execute a scrape command to generate a list of documents (in this case, 1 document).
scrapfly_reader = ScrapflyReader(api_key="YOUR SCRAPFLY KEY")
documents = scrapfly_reader.load_data(
    urls=["https://web-scraping.dev/product/1"],
    scrape_config={"render_js": True},  # note: scrape options can be configured here
    scrape_format="markdown",
)
Now, we can generate the index and query engine to prompt our scraped documents using LLMs:
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(
    llm=OpenAI(model="gpt-3.5-turbo-0125"),
)

prompt_template = "Given the data fetched from the specified product URLs, \
find the following product fields {fields} in the \
provided markdown and return a JSON".format

response = query_engine.query(prompt_template(
    fields=["price", "title"],
))
print(response)
# prints:
# {
#   "price": "$9.99 from $12.99",
#   "title": "Box of Chocolate Candy"
# }
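Since the index above lives in memory, a useful follow-up is persisting it to disk so the pages don't need to be re-scraped on every run. A minimal sketch, assuming the default local vector store:

from llama_index.core import StorageContext, load_index_from_storage

# Save the vector index built from the scraped documents
index.storage_context.persist(persist_dir="./scrapfly_index")

# ...and later reload it without calling the Scrapfly API again
storage_context = StorageContext.from_defaults(persist_dir="./scrapfly_index")
index = load_index_from_storage(storage_context)
query_engine = index.as_query_engine(llm=OpenAI(model="gpt-3.5-turbo-0125"))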
This is a very basic example, yet it's already a powerful tool for data parsing using LLMs.
You can load more than one document and build your own scraping-based knowledge base that you can query
with LLMs just as easily, as sketched below.
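For instance, a small knowledge base spanning several product pages could be built and queried like this (the extra URLs are only illustrative):

# Scrape several pages into one index and query across all of them
documents = scrapfly_reader.load_data(
    urls=[
        "https://web-scraping.dev/product/1",
        "https://web-scraping.dev/product/2",
        "https://web-scraping.dev/product/3",
    ],
    scrape_format="markdown",
)
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(llm=OpenAI(model="gpt-3.5-turbo-0125"))
response = query_engine.query("Which of the scraped products is the cheapest and what does it cost?")
print(response)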
Errors
LlamaIndex will surface Scrapfly API errors in the standard Scrapfly API error message format. For more details, see the Scrapfly errors documentation.
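When ignore_scrape_failures is disabled, failed scrapes raise instead of being logged and skipped, so they can be handled explicitly. A minimal sketch, assuming the Scrapfly SDK's base ScrapflyError exception is what propagates (check the SDK for the exact exception classes):

from llama_index.readers.web import ScrapflyReader
from scrapfly.errors import ScrapflyError  # assumed import path of the SDK's base exception

scrapfly_reader = ScrapflyReader(
    api_key="Your ScrapFly API key",
    ignore_scrape_failures=False,  # raise on unprocessable web pages instead of skipping them
)
try:
    documents = scrapfly_reader.load_data(urls=["https://web-scraping.dev/products"])
except ScrapflyError as error:
    # the exception message carries the standard Scrapfly API error description
    print(f"Scrape failed: {error}")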
Pricing
No additional costs.