Scrapfly LlamaIndex Integration
Scrapfly is available on LlamaIndex - a popular data framework for building LLM applications.
For LlamaIndex, Scrapfly is available as a Web Page Reader object which uses the Scrapfly Web Scrape API to retrieve web page data for use within the LlamaIndex ecosystem.
If you're looking for a more streamlined LLM parsing solution, note that many of these capabilities are already provided by Scrapfly's Extraction API through its LLM Extraction feature.
Usage
To start, get your Scrapfly API key from your dashboard.
Then, install LlamaIndex (including its web readers package) and the Scrapfly Python SDK:
pip install llama-index llama-index-readers-web scrapfly-sdk
Then, the ScrapflyReader is available for scraping any web page:
from llama_index.readers.web import ScrapflyReader

# Initiate ScrapflyReader with your ScrapFly API key
scrapfly_reader = ScrapflyReader(
    api_key="Your ScrapFly API key",  # Get your API key from https://www.scrapfly.io/
    ignore_scrape_failures=True,  # Ignore unprocessable web pages and log their exceptions
)

# Load documents from URLs as markdown
documents = scrapfly_reader.load_data(
    urls=["https://web-scraping.dev/products"]
)
For more advanced use, the integration supports all Scrapfly Web Scrape API options matching the Python SDK signature:
from llama_index.readers.web import ScrapflyReader

# Initiate ScrapflyReader with your ScrapFly API key
scrapfly_reader = ScrapflyReader(
    api_key="Your ScrapFly API key",  # Get your API key from https://www.scrapfly.io/
    ignore_scrape_failures=True,  # Ignore unprocessable web pages and log their exceptions
)

scrapfly_scrape_config = {
    "asp": True,  # Bypass scraping blocking and antibot solutions, like Cloudflare
    "render_js": True,  # Enable JavaScript rendering with a cloud headless browser
    "proxy_pool": "public_residential_pool",  # Select a proxy pool (datacenter or residential)
    "country": "us",  # Select a proxy location
    "auto_scroll": True,  # Auto scroll the page
    "js": "",  # Execute custom JavaScript code by the headless browser
}

# Load documents from URLs as markdown
documents = scrapfly_reader.load_data(
    urls=["https://web-scraping.dev/products"],
    scrape_config=scrapfly_scrape_config,  # Pass the scrape config
    scrape_format="markdown",  # The scrape result format, either `markdown` (default) or `text`
)
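The config dictionary is not limited to the options above; any option matching the Python SDK's ScrapeConfig signature can be passed through. Below is a rough sketch using caching and a selector wait - the parameter names are assumed from the SDK's ScrapeConfig and the URL and selector are purely illustrative, so verify them against the Web Scrape API documentation:

# Assumed ScrapeConfig-style options; check the Web Scrape API docs for exact names
scrapfly_scrape_config = {
    "render_js": True,               # render the page with a headless browser
    "wait_for_selector": ".review",  # wait for a CSS selector before capturing the page (hypothetical selector)
    "cache": True,                   # cache the scrape result on Scrapfly's side
    "cache_ttl": 3600,               # cache lifetime in seconds
}
documents = scrapfly_reader.load_data(
    urls=["https://web-scraping.dev/reviews"],
    scrape_config=scrapfly_scrape_config,
)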
Example Use
LlamaIndex is a very large, feature-rich framework that can be daunting at first, but the Scrapfly document loader greatly simplifies the experience by scraping the provided pages as markdown for easier processing.
For this example, let's explore a simple RAG scenario and use OpenAI to query it:
- We will use the ScrapflyReader to scrape pages as markdown into a local index
- Load the documents into an OpenAI query engine
- Create a prompt template and execute some prompts on the scraped data
In LlamaIndex, this whole process looks as simple as the script below:
import os

from llama_index.core import VectorStoreIndex
from llama_index.llms.openai import OpenAI
from llama_index.readers.web import ScrapflyReader

# 1. Add OpenAI API key
os.environ["OPENAI_API_KEY"] = "YOUR OPENAI API KEY"

# 2. Set up ScrapflyReader with your ScrapFly API key
scrapfly_reader = ScrapflyReader(
    api_key="YOUR SCRAPFLY KEY",
)

# 3. Scrape web pages as markdown
documents = scrapfly_reader.load_data(
    urls=["https://web-scraping.dev/product/1"],
    scrape_config={"render_js": True},  # note: scrape options can be configured here
    scrape_format="markdown",
)

# 4. Create index and query engine
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(
    llm=OpenAI(model="gpt-3.5-turbo-0125"),
)

# 5. Define prompt template
prompt_template = "Given the data fetched from the specified product URLs, \
find the following product fields {fields} in the \
provided markdown and return a JSON".format

# 6. Run query
response = query_engine.query(prompt_template(
    fields=["price", "title"],
))
print(response)
# prints:
# {
#   "price": "$9.99 from $12.99",
#   "title": "Box of Chocolate Candy"
# }
Let's unpack the above example step-by-step. We first set up our OpenAI API key and the ScrapflyReader.
Then, we execute a scrape command to generate a list of documents (in this case, 1 document).
scrapfly_reader = ScrapflyReader(api_key="YOUR SCRAPFLY KEY")
documents = scrapfly_reader.load_data(
    urls=["https://web-scraping.dev/product/1"],
    scrape_config={"render_js": True},  # note: scrape options can be configured here
    scrape_format="markdown",
)
Now, we can generate the index and query engine to prompt our scraped documents using LLMs:
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(
    llm=OpenAI(model="gpt-3.5-turbo-0125"),
)

prompt_template = "Given the data fetched from the specified product URLs, \
find the following product fields {fields} in the \
provided markdown and return a JSON".format

response = query_engine.query(prompt_template(
    fields=["price", "title"],
))
print(response)
# prints:
# {
#   "price": "$9.99 from $12.99",
#   "title": "Box of Chocolate Candy"
# }
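Since the index above lives in memory, a useful follow-up is persisting it to disk so the pages don't need to be re-scraped on every run. A minimal sketch, assuming the default local vector store:

from llama_index.core import StorageContext, load_index_from_storage

# Save the vector index built from the scraped documents
index.storage_context.persist(persist_dir="./scrapfly_index")

# ...and later reload it without calling the Scrapfly API again
storage_context = StorageContext.from_defaults(persist_dir="./scrapfly_index")
index = load_index_from_storage(storage_context)
query_engine = index.as_query_engine(llm=OpenAI(model="gpt-3.5-turbo-0125"))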
This is a very basic example, yet it's already a powerful tool for data parsing using LLMs.
You can load more than one document and build your own scraping-based knowledge base that you can query
with LLMs just as easily, as sketched below.
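For instance, a small knowledge base spanning several product pages could be built and queried like this (the extra URLs are only illustrative):

# Scrape several pages into one index and query across all of them
documents = scrapfly_reader.load_data(
    urls=[
        "https://web-scraping.dev/product/1",
        "https://web-scraping.dev/product/2",
        "https://web-scraping.dev/product/3",
    ],
    scrape_format="markdown",
)
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(llm=OpenAI(model="gpt-3.5-turbo-0125"))
response = query_engine.query("Which of the scraped products is the cheapest and what does it cost?")
print(response)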
Errors
LlamaIndex will surface Scrapfly API errors in the standard Scrapfly API error message format. For more details, see the Scrapfly errors documentation.
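When ignore_scrape_failures is disabled, failed scrapes raise instead of being logged and skipped, so they can be handled explicitly. A minimal sketch, assuming the Scrapfly SDK's base ScrapflyError exception is what propagates (check the SDK for the exact exception classes):

from llama_index.readers.web import ScrapflyReader
from scrapfly.errors import ScrapflyError  # assumed import path of the SDK's base exception

scrapfly_reader = ScrapflyReader(
    api_key="Your ScrapFly API key",
    ignore_scrape_failures=False,  # raise on unprocessable web pages instead of skipping them
)
try:
    documents = scrapfly_reader.load_data(urls=["https://web-scraping.dev/products"])
except ScrapflyError as error:
    # the exception message carries the standard Scrapfly API error description
    print(f"Scrape failed: {error}")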
Pricing
No additional costs.