LlamaIndex
Data framework for LLM applications. Connect Scrapfly to LlamaIndex agents and workflows for intelligent web data ingestion and RAG applications.
Prerequisites
Before getting started, make sure you have the following:
- Python 3.8+ or Node.js 18+ installed
- `llama-index` and `llama-index-tools-mcp` packages installed
- Your Scrapfly API key (only if not using OAuth2)
Setup Instructions
LlamaIndex supports MCP servers through the tools integration. Follow these steps to connect Scrapfly for web data ingestion.
1. Install Required Packages
Install LlamaIndex and the MCP tools integration:
Python:
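```bash
# install the core framework plus the MCP tools integration
pip install llama-index llama-index-tools-mcp
```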
TypeScript:
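```bash
# Assumption: package names for LlamaIndex.TS and its MCP tooling --
# verify against the current LlamaIndex.TS documentation
npm install llamaindex @llamaindex/tools
```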
Tip: Development Environment
Use a virtual environment for Python projects:
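```bash
python -m venv .venv
source .venv/bin/activate  # on Windows: .venv\Scripts\activate
```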
2. Initialize Scrapfly MCP Tools
Connect to the Scrapfly MCP server and load tools into LlamaIndex:
Python Example with OAuth2 (Recommended)
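A minimal sketch using the `BasicMCPClient` and `McpToolSpec` classes from `llama-index-tools-mcp`, assuming the `npx mcp-remote` launcher and the `https://mcp.scrapfly.io/mcp` endpoint referenced later in this guide:

```python
import asyncio

from llama_index.tools.mcp import BasicMCPClient, McpToolSpec


async def load_scrapfly_tools():
    # mcp-remote proxies the remote Scrapfly MCP server over stdio and
    # opens a browser window once for the OAuth2 consent flow.
    mcp_client = BasicMCPClient(
        "npx",
        args=["mcp-remote", "https://mcp.scrapfly.io/mcp"],
    )
    tool_spec = McpToolSpec(client=mcp_client)
    # Convert every MCP tool into a LlamaIndex FunctionTool.
    return await tool_spec.to_tool_list_async()


scrapfly_tools = asyncio.run(load_scrapfly_tools())
print([tool.metadata.name for tool in scrapfly_tools])
```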
Why OAuth2?
- No API keys stored in code or config files
- Automatic token rotation for enhanced security
- Instant revocation if needed
- Full audit trail of authentication events
Team collaboration: See project-scoped setup to share configuration with your team via version control.
Python Example with API Key
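The same setup, but the key is passed as a query parameter (the pattern shown in the troubleshooting section below), so no browser flow is required:

```python
import asyncio
import os

from llama_index.tools.mcp import BasicMCPClient, McpToolSpec


async def load_scrapfly_tools():
    # Read the key from the environment rather than hard-coding it.
    api_key = os.getenv("SCRAPFLY_API_KEY")
    mcp_client = BasicMCPClient(
        "npx",
        args=["mcp-remote", f"https://mcp.scrapfly.io/mcp?key={api_key}"],
    )
    return await McpToolSpec(client=mcp_client).to_tool_list_async()


scrapfly_tools = asyncio.run(load_scrapfly_tools())
```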
3. Create a LlamaIndex Agent with Scrapfly
Build an agent that can scrape web data and integrate it into your LlamaIndex workflow:
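A minimal sketch, assuming the `scrapfly_tools` list loaded in the previous step and an OpenAI model (any function-calling LLM works; the model name and the example URL are placeholders):

```python
from llama_index.core.agent import ReActAgent
from llama_index.llms.openai import OpenAI  # requires OPENAI_API_KEY

llm = OpenAI(model="gpt-4o")

# verbose=True prints each tool call, which makes it easy to confirm
# the agent is actually invoking the Scrapfly tools.
agent = ReActAgent.from_tools(scrapfly_tools, llm=llm, verbose=True)

response = agent.chat(
    "Scrape https://example.com and summarize the page content."
)
print(response)
```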
Pro Tip: The agent will automatically call `scraping_instruction_enhanced` to get authentication before scraping!

4. Build a RAG Pipeline with Web Data
Use Scrapfly to ingest live web content into a LlamaIndex RAG application:
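A minimal sketch: call a Scrapfly scraping tool directly, wrap the output in `Document` objects (the `Document(text=str(response))` pattern from the troubleshooting section), and index them. The `"scrape"` name filter and the `url=` parameter are assumptions; print each tool's metadata to confirm the real names. `VectorStoreIndex` defaults to OpenAI embeddings, so `OPENAI_API_KEY` must be set:

```python
import asyncio

from llama_index.core import Document, VectorStoreIndex
from llama_index.tools.mcp import BasicMCPClient, McpToolSpec


async def scrape_to_documents(urls):
    client = BasicMCPClient("npx", args=["mcp-remote", "https://mcp.scrapfly.io/mcp"])
    tools = await McpToolSpec(client=client).to_tool_list_async()
    # Assumption: pick the first tool whose name mentions "scrape";
    # inspect tool.metadata to find the exact name and parameters.
    scrape_tool = next(t for t in tools if "scrape" in t.metadata.name.lower())
    documents = []
    for url in urls:
        result = await scrape_tool.acall(url=url)  # parameter name assumed
        documents.append(Document(text=str(result), metadata={"source": url}))
    return documents


documents = asyncio.run(scrape_to_documents(["https://example.com"]))

# Embed and index the scraped pages, then query them like any other corpus.
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
print(query_engine.query("What is this page about?"))
```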
Example Prompts
- RAG with Live Web Data: e.g. "Scrape this documentation site and answer questions about its contents."
- Knowledge Base Construction: e.g. "Crawl these product pages and build a searchable knowledge base."
- Research Agent with Web Access: e.g. "Research this topic across these URLs and summarize the findings."
- Document Ingestion Pipeline: e.g. "Extract each article from this listing page and load it as a Document."
Troubleshooting
Problem: ModuleNotFoundError: No module named 'llama_index.tools.mcp'
Solution:
- Install the package: `pip install llama-index-tools-mcp`
- Verify which Python environment is active: `which python`
- Try upgrading: `pip install --upgrade llama-index-tools-mcp`
- Check that the LlamaIndex version is 0.10.0+: `pip show llama-index`
Problem: MCPToolProvider cannot execute npx command
Solution:
- Ensure Node.js 18+ is installed: `node --version`
- Verify `npx` is in PATH: `npx --version`
- Restart the terminal after installing Node.js
- Try specifying the full path: `command="/usr/local/bin/npx"`
Problem: OAuth2 cannot open browser in headless environment
Solution:
- Use API key authentication for production deployments
- Store the API key in an environment variable: `SCRAPFLY_API_KEY`
- Load it from the environment: `args=["mcp-remote", f"https://mcp.scrapfly.io/mcp?key={os.getenv('SCRAPFLY_API_KEY')}"]`
Problem: Agent does not call Scrapfly tools when asked to scrape
Solution:
- Verify the tools loaded: `print([tool.metadata.name for tool in scrapfly_tools])`
- Check that your LLM supports function calling (Claude 3+, GPT-4+)
- Use explicit prompts mentioning "scrape" or "web data"
- Enable verbose mode with `verbose=True` in agent creation
Problem: Scraped content cannot be converted to Document objects
Solution:
- Ensure the scraped content is in text/markdown format
- Check the response format returned by the Scrapfly MCP tools
- Parse the response before creating a Document: `Document(text=str(response))`
- Handle empty or malformed responses with error checking (see the sketch below)
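For the last point, a small sketch of the kind of guard to apply before indexing, assuming `response` holds the raw output of a Scrapfly MCP tool call:

```python
from llama_index.core import Document

raw = str(response).strip()
if not raw:
    # Skip or retry instead of indexing an empty document.
    raise ValueError("Scrapfly MCP tool returned an empty response")
document = Document(text=raw)
```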
Next Steps
- Explore available MCP tools and their capabilities
- See real-world examples of what you can build
- Learn about authentication methods in detail
- Read the FAQ for common questions