LlamaIndex


Data framework for LLM applications. Connect Scrapfly to LlamaIndex agents and workflows for intelligent web data ingestion and RAG applications.


Prerequisites

Before getting started, make sure you have the following:

  • Python 3.8+ or Node.js 18+ installed (Node.js 18+ is also required from Python to run the `npx mcp-remote` bridge)
  • llama-index and llama-index-tools-mcp packages installed
  • Your Scrapfly API key (only if not using OAuth2)

Setup Instructions

LlamaIndex supports MCP servers through the tools integration. Follow these steps to connect Scrapfly for web data ingestion.

  1. Install Required Packages

    Install LlamaIndex and the MCP tools integration:

    Python:
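For Python, the two packages listed in the prerequisites can be installed with pip:

```shell
pip install llama-index llama-index-tools-mcp
```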

    TypeScript:
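For TypeScript, install the `llamaindex` package from npm (the MCP tooling package name below is an assumption; check your LlamaIndex.TS version's documentation for the exact package):

```shell
npm install llamaindex @llamaindex/tools
```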

    Tip: Development Environment

    Use a virtual environment for Python projects:
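A typical setup, creating and activating a virtual environment before installing:

```shell
python -m venv .venv
source .venv/bin/activate   # on Windows: .venv\Scripts\activate
pip install llama-index llama-index-tools-mcp
```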

  2. Initialize Scrapfly MCP Tools

    Connect to the Scrapfly MCP server and load tools into LlamaIndex:

    Python Example with OAuth2 (Recommended)
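A minimal sketch of loading the Scrapfly MCP tools via the OAuth2 flow, using `BasicMCPClient` and `McpToolSpec` from `llama-index-tools-mcp` together with the `mcp-remote` bridge (which opens a browser window for authentication on first run):

```python
import asyncio

from llama_index.tools.mcp import BasicMCPClient, McpToolSpec

# Launch the Scrapfly MCP server through npx mcp-remote; the OAuth2 flow
# handles authentication in the browser, so no API key appears in code.
client = BasicMCPClient(
    "npx",
    args=["mcp-remote", "https://mcp.scrapfly.io/mcp"],
)
tool_spec = McpToolSpec(client=client)

async def load_tools():
    # Convert the MCP server's tools into LlamaIndex FunctionTool objects
    return await tool_spec.to_tool_list_async()

scrapfly_tools = asyncio.run(load_tools())
print([tool.metadata.name for tool in scrapfly_tools])
```

The printed tool names confirm the connection worked; keep the `scrapfly_tools` list around for the agent step below.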

    Why OAuth2?
    • No API keys stored in code or config files
    • Automatic token rotation for enhanced security
    • Instant revocation if needed
    • Full audit trail of authentication events

    Team collaboration: See project-scoped setup to share configuration with your team via version control.

    Python Example with API Key
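The same setup with API-key authentication instead of OAuth2, passing the key as a query parameter to `mcp-remote` (suitable for headless or production environments):

```python
import asyncio
import os

from llama_index.tools.mcp import BasicMCPClient, McpToolSpec

# Read the key from the environment rather than hard-coding it
api_key = os.environ["SCRAPFLY_API_KEY"]

client = BasicMCPClient(
    "npx",
    args=["mcp-remote", f"https://mcp.scrapfly.io/mcp?key={api_key}"],
)
tool_spec = McpToolSpec(client=client)
scrapfly_tools = asyncio.run(tool_spec.to_tool_list_async())
```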

  3. Create a LlamaIndex Agent with Scrapfly

    Build an agent that can scrape web data and integrate it into your LlamaIndex workflow:
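A sketch of an agent wired up with the tools loaded in the previous step, assuming `scrapfly_tools` from step 2 and an OpenAI model (any function-calling LLM works; the exact agent class may differ across LlamaIndex versions):

```python
from llama_index.core.agent import ReActAgent
from llama_index.llms.openai import OpenAI  # any function-calling LLM works

# `scrapfly_tools` is the tool list loaded from the Scrapfly MCP server
llm = OpenAI(model="gpt-4o")
agent = ReActAgent.from_tools(scrapfly_tools, llm=llm, verbose=True)

response = agent.chat(
    "Scrape https://web-scraping.dev/products and list the product names."
)
print(response)
```

With `verbose=True` the agent prints its reasoning and each tool call, which is useful for confirming the Scrapfly tools are actually being invoked.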

    Pro Tip: The agent will automatically call scraping_instruction_enhanced to get authentication before scraping!
  4. Build a RAG Pipeline with Web Data

    Use Scrapfly to ingest live web content into a LlamaIndex RAG application:
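A sketch of the ingestion path: call a Scrapfly MCP tool directly, wrap the result in a `Document`, and index it. The tool name and argument shape here are assumptions; list the server's tools first (as in step 2) to confirm them for your setup:

```python
import asyncio

from llama_index.core import Document, VectorStoreIndex
from llama_index.tools.mcp import BasicMCPClient

client = BasicMCPClient("npx", args=["mcp-remote", "https://mcp.scrapfly.io/mcp"])

async def scrape(url: str) -> str:
    # Tool name "web_scrape" is an assumption; verify against the
    # actual tool list exposed by the Scrapfly MCP server.
    result = await client.call_tool("web_scrape", {"url": url})
    return str(result)

content = asyncio.run(scrape("https://web-scraping.dev/products"))

# Wrap the scraped text/markdown in a Document and build a vector index
index = VectorStoreIndex.from_documents([Document(text=content)])
query_engine = index.as_query_engine()
print(query_engine.query("What products are listed on the page?"))
```

Coercing the tool response with `str()` before constructing the `Document` sidesteps the conversion issue described in the troubleshooting section below.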

Example Prompts

  • RAG with Live Web Data: "Scrape documentation from https://web-scraping.dev and answer questions about it"
  • Knowledge Base Construction: "Scrape blog posts from multiple sources and build a searchable knowledge base"
  • Research Agent with Web Access: "Research the latest AI trends by scraping tech news sites and summarize findings"
  • Document Ingestion Pipeline: "Scrape product documentation from competitor sites and compare features"

Troubleshooting

Problem: ModuleNotFoundError: No module named 'llama_index.tools.mcp'

Solution:

  • Install the package: pip install llama-index-tools-mcp
  • Verify Python environment: which python
  • Try upgrading: pip install --upgrade llama-index-tools-mcp
  • Check LlamaIndex version is 0.10.0+: pip show llama-index

Problem: MCPToolProvider cannot execute npx command

Solution:

  • Ensure Node.js 18+ is installed: node --version
  • Verify npx is in PATH: npx --version
  • Restart terminal after installing Node.js
  • Try specifying full path: command="/usr/local/bin/npx"

Problem: OAuth2 cannot open browser in headless environment

Solution:

  • Use API key authentication for production deployments
  • Store API key in environment variable: SCRAPFLY_API_KEY
  • Load from environment: args=["mcp-remote", f"https://mcp.scrapfly.io/mcp?key={os.getenv('SCRAPFLY_API_KEY')}"]

Problem: Agent does not call Scrapfly tools when asked to scrape

Solution:

  • Verify tools loaded: print([tool.metadata.name for tool in scrapfly_tools])
  • Check LLM supports function calling (Claude 3+, GPT-4+)
  • Use explicit prompts mentioning "scrape" or "web data"
  • Enable verbose mode: verbose=True in agent creation

Problem: Scraped content cannot be converted to Document objects

Solution:

  • Ensure scraped content is text/markdown format
  • Check response format from Scrapfly MCP tools
  • Parse response before creating Document: Document(text=str(response))
  • Handle empty or malformed responses with error checking
