LlamaIndex


Data framework for LLM applications. Connect Scrapfly to LlamaIndex agents and workflows for intelligent web data ingestion and RAG applications.


Prerequisites

Before getting started, make sure you have the following:

  • Python 3.8+ or Node.js 18+ installed (Node.js 18+ is also required from Python to run the `npx mcp-remote` bridge)
  • llama-index and llama-index-tools-mcp packages installed
  • Your Scrapfly API key (only if not using OAuth2)

Setup Instructions

LlamaIndex supports MCP servers through the tools integration. Follow these steps to connect Scrapfly for web data ingestion.

  1. Install Required Packages

    Install LlamaIndex and the MCP tools integration:

    Python:
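For Python, the two packages listed in the prerequisites can be installed with pip:

```shell
pip install llama-index llama-index-tools-mcp
```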

    TypeScript:
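For TypeScript, install the `llamaindex` package from npm (the MCP tooling package name below is an assumption; check your LlamaIndex.TS version's documentation for the exact package):

```shell
npm install llamaindex @llamaindex/tools
```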

    Tip: Development Environment

    Use a virtual environment for Python projects:
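A typical setup, creating and activating a virtual environment before installing:

```shell
python -m venv .venv
source .venv/bin/activate   # on Windows: .venv\Scripts\activate
pip install llama-index llama-index-tools-mcp
```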

  2. Initialize Scrapfly MCP Tools

    Connect to the Scrapfly MCP server and load tools into LlamaIndex:

    Python Example with OAuth2 (Recommended)
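A minimal sketch of loading the Scrapfly MCP tools via the OAuth2 flow, using `BasicMCPClient` and `McpToolSpec` from `llama-index-tools-mcp` together with the `mcp-remote` bridge (which opens a browser window for authentication on first run):

```python
import asyncio

from llama_index.tools.mcp import BasicMCPClient, McpToolSpec

# Launch the Scrapfly MCP server through npx mcp-remote; the OAuth2 flow
# handles authentication in the browser, so no API key appears in code.
client = BasicMCPClient(
    "npx",
    args=["mcp-remote", "https://mcp.scrapfly.io/mcp"],
)
tool_spec = McpToolSpec(client=client)

async def load_tools():
    # Convert the MCP server's tools into LlamaIndex FunctionTool objects
    return await tool_spec.to_tool_list_async()

scrapfly_tools = asyncio.run(load_tools())
print([tool.metadata.name for tool in scrapfly_tools])
```

The printed tool names confirm the connection worked; keep the `scrapfly_tools` list around for the agent step below.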

    Why OAuth2?
    • No API keys stored in code or config files
    • Automatic token rotation for enhanced security
    • Instant revocation if needed
    • Full audit trail of authentication events

    Team collaboration: See project-scoped setup to share configuration with your team via version control.

    Python Example with API Key
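The same setup with API-key authentication instead of OAuth2, passing the key as a query parameter to `mcp-remote` (suitable for headless or production environments):

```python
import asyncio
import os

from llama_index.tools.mcp import BasicMCPClient, McpToolSpec

# Read the key from the environment rather than hard-coding it
api_key = os.environ["SCRAPFLY_API_KEY"]

client = BasicMCPClient(
    "npx",
    args=["mcp-remote", f"https://mcp.scrapfly.io/mcp?key={api_key}"],
)
tool_spec = McpToolSpec(client=client)
scrapfly_tools = asyncio.run(tool_spec.to_tool_list_async())
```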

  3. Create a LlamaIndex Agent with Scrapfly

    Build an agent that can scrape web data and integrate it into your LlamaIndex workflow:
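A sketch of an agent wired up with the tools loaded in the previous step, assuming `scrapfly_tools` from step 2 and an OpenAI model (any function-calling LLM works; the exact agent class may differ across LlamaIndex versions):

```python
from llama_index.core.agent import ReActAgent
from llama_index.llms.openai import OpenAI  # any function-calling LLM works

# `scrapfly_tools` is the tool list loaded from the Scrapfly MCP server
llm = OpenAI(model="gpt-4o")
agent = ReActAgent.from_tools(scrapfly_tools, llm=llm, verbose=True)

response = agent.chat(
    "Scrape https://web-scraping.dev/products and list the product names."
)
print(response)
```

With `verbose=True` the agent prints its reasoning and each tool call, which is useful for confirming the Scrapfly tools are actually being invoked.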

    Pro Tip: The agent will automatically call scraping_instruction_enhanced to get authentication before scraping!
  4. Build a RAG Pipeline with Web Data

    Use Scrapfly to ingest live web content into a LlamaIndex RAG application:
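A sketch of the ingestion path: call a Scrapfly MCP tool directly, wrap the result in a `Document`, and index it. The tool name and argument shape here are assumptions; list the server's tools first (as in step 2) to confirm them for your setup:

```python
import asyncio

from llama_index.core import Document, VectorStoreIndex
from llama_index.tools.mcp import BasicMCPClient

client = BasicMCPClient("npx", args=["mcp-remote", "https://mcp.scrapfly.io/mcp"])

async def scrape(url: str) -> str:
    # Tool name "web_scrape" is an assumption; verify against the
    # actual tool list exposed by the Scrapfly MCP server.
    result = await client.call_tool("web_scrape", {"url": url})
    return str(result)

content = asyncio.run(scrape("https://web-scraping.dev/products"))

# Wrap the scraped text/markdown in a Document and build a vector index
index = VectorStoreIndex.from_documents([Document(text=content)])
query_engine = index.as_query_engine()
print(query_engine.query("What products are listed on the page?"))
```

Coercing the tool response with `str()` before constructing the `Document` sidesteps the conversion issue described in the troubleshooting section below.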

Example Prompts

  • RAG with Live Web Data: "Scrape documentation from https://web-scraping.dev and answer questions about it"
  • Knowledge Base Construction: "Scrape blog posts from multiple sources and build a searchable knowledge base"
  • Research Agent with Web Access: "Research the latest AI trends by scraping tech news sites and summarize findings"
  • Document Ingestion Pipeline: "Scrape product documentation from competitor sites and compare features"

Troubleshooting

Problem: ModuleNotFoundError: No module named 'llama_index.tools.mcp'

Solution:

  • Install the package: pip install llama-index-tools-mcp
  • Verify Python environment: which python
  • Try upgrading: pip install --upgrade llama-index-tools-mcp
  • Check LlamaIndex version is 0.10.0+: pip show llama-index

Problem: MCPToolProvider cannot execute npx command

Solution:

  • Ensure Node.js 18+ is installed: node --version
  • Verify npx is in PATH: npx --version
  • Restart terminal after installing Node.js
  • Try specifying full path: command="/usr/local/bin/npx"

Problem: OAuth2 cannot open browser in headless environment

Solution:

  • Use API key authentication for production deployments
  • Store API key in environment variable: SCRAPFLY_API_KEY
  • Load from environment: args=["mcp-remote", f"https://mcp.scrapfly.io/mcp?key={os.getenv('SCRAPFLY_API_KEY')}"]

Problem: Agent does not call Scrapfly tools when asked to scrape

Solution:

  • Verify tools loaded: print([tool.metadata.name for tool in scrapfly_tools])
  • Check LLM supports function calling (Claude 3+, GPT-4+)
  • Use explicit prompts mentioning "scrape" or "web data"
  • Enable verbose mode: verbose=True in agent creation

Problem: Scraped content cannot be converted to Document objects

Solution:

  • Ensure scraped content is text/markdown format
  • Check response format from Scrapfly MCP tools
  • Parse response before creating Document: Document(text=str(response))
  • Handle empty or malformed responses with error checking
