# Scrapfly Documentation

## Table of Contents

### Dashboard

- [Intro](https://scrapfly.io/docs)
- [Project](https://scrapfly.io/docs/project)
- [Account](https://scrapfly.io/docs/account)
- [Workspace & Team](https://scrapfly.io/docs/workspace-and-team)
- [Billing](https://scrapfly.io/docs/billing)

### Products

#### MCP Server

- [Getting Started](https://scrapfly.io/docs/mcp/getting-started)
- [Tools & API Spec](https://scrapfly.io/docs/mcp/tools)
- [Authentication](https://scrapfly.io/docs/mcp/authentication)
- [Examples & Use Cases](https://scrapfly.io/docs/mcp/examples)
- [FAQ](https://scrapfly.io/docs/mcp/faq)

##### Integrations

- [Overview](https://scrapfly.io/docs/mcp/integrations)
- [Claude Desktop](https://scrapfly.io/docs/mcp/integrations/claude-desktop)
- [Claude Code](https://scrapfly.io/docs/mcp/integrations/claude-code)
- [ChatGPT](https://scrapfly.io/docs/mcp/integrations/chatgpt)
- [Cursor](https://scrapfly.io/docs/mcp/integrations/cursor)
- [Cline](https://scrapfly.io/docs/mcp/integrations/cline)
- [Windsurf](https://scrapfly.io/docs/mcp/integrations/windsurf)
- [Zed](https://scrapfly.io/docs/mcp/integrations/zed)
- [Roo Code](https://scrapfly.io/docs/mcp/integrations/roo-code)
- [VS Code](https://scrapfly.io/docs/mcp/integrations/vscode)
- [LangChain](https://scrapfly.io/docs/mcp/integrations/langchain)
- [LlamaIndex](https://scrapfly.io/docs/mcp/integrations/llamaindex)
- [CrewAI](https://scrapfly.io/docs/mcp/integrations/crewai)
- [OpenAI](https://scrapfly.io/docs/mcp/integrations/openai)
- [n8n](https://scrapfly.io/docs/mcp/integrations/n8n)
- [Make](https://scrapfly.io/docs/mcp/integrations/make)
- [Zapier](https://scrapfly.io/docs/mcp/integrations/zapier)
- [Vapi AI](https://scrapfly.io/docs/mcp/integrations/vapi)
- [Agent Builder](https://scrapfly.io/docs/mcp/integrations/agent-builder)
- [Custom Client](https://scrapfly.io/docs/mcp/integrations/custom-client)


#### Web Scraping API

- [Getting Started](https://scrapfly.io/docs/scrape-api/getting-started)
- API Specification
- [Monitoring](https://scrapfly.io/docs/monitoring)
- [Customize Request](https://scrapfly.io/docs/scrape-api/custom)
- [Debug](https://scrapfly.io/docs/scrape-api/debug)
- [Anti Scraping Protection](https://scrapfly.io/docs/scrape-api/anti-scraping-protection)
- [Proxy](https://scrapfly.io/docs/scrape-api/proxy)
- [Proxy Mode](https://scrapfly.io/docs/scrape-api/proxy-mode)
- [Proxy Mode - Screaming Frog](https://scrapfly.io/docs/scrape-api/proxy-mode/screaming-frog)
- [Proxy Mode - Apify](https://scrapfly.io/docs/scrape-api/proxy-mode/apify)
- [(Auto) Data Extraction](https://scrapfly.io/docs/scrape-api/extraction)
- [Javascript Rendering](https://scrapfly.io/docs/scrape-api/javascript-rendering)
- [Javascript Scenario](https://scrapfly.io/docs/scrape-api/javascript-scenario)
- [SSL](https://scrapfly.io/docs/scrape-api/ssl)
- [DNS](https://scrapfly.io/docs/scrape-api/dns)
- [Cache](https://scrapfly.io/docs/scrape-api/cache)
- [Batch (Multi-URL Scraping)](https://scrapfly.io/docs/scrape-api/batch)
- [Session](https://scrapfly.io/docs/scrape-api/session)
- [Webhook](https://scrapfly.io/docs/scrape-api/webhook)
- [Screenshot](https://scrapfly.io/docs/scrape-api/screenshot)
- [Errors](https://scrapfly.io/docs/scrape-api/errors)
- [Timeout](https://scrapfly.io/docs/scrape-api/understand-timeout)
- [Throttling](https://scrapfly.io/docs/throttling)
- [Troubleshoot](https://scrapfly.io/docs/scrape-api/troubleshoot)
- [Billing](https://scrapfly.io/docs/scrape-api/billing)
- [FAQ](https://scrapfly.io/docs/scrape-api/faq)

#### Crawler API

- [Getting Started](https://scrapfly.io/docs/crawler-api/getting-started)
- API Specification
- [Retrieving Results](https://scrapfly.io/docs/crawler-api/results)
- [WARC Format](https://scrapfly.io/docs/crawler-api/warc-format)
- [Data Extraction](https://scrapfly.io/docs/crawler-api/extraction-rules)
- [Webhook](https://scrapfly.io/docs/crawler-api/webhook)
- [Billing](https://scrapfly.io/docs/crawler-api/billing)
- [Errors](https://scrapfly.io/docs/crawler-api/errors)
- [Troubleshoot](https://scrapfly.io/docs/crawler-api/troubleshoot)
- [FAQ](https://scrapfly.io/docs/crawler-api/faq)

#### Screenshot API

- [Getting Started](https://scrapfly.io/docs/screenshot-api/getting-started)
- API Specification
- [Accessibility Testing](https://scrapfly.io/docs/screenshot-api/accessibility)
- [Webhook](https://scrapfly.io/docs/screenshot-api/webhook)
- [Billing](https://scrapfly.io/docs/screenshot-api/billing)
- [Errors](https://scrapfly.io/docs/screenshot-api/errors)

#### Extraction API

- [Getting Started](https://scrapfly.io/docs/extraction-api/getting-started)
- API Specification
- [Rules Template](https://scrapfly.io/docs/extraction-api/rules-and-template)
- [LLM Extraction](https://scrapfly.io/docs/extraction-api/llm-prompt)
- [AI Auto Extraction](https://scrapfly.io/docs/extraction-api/automatic-ai)
- [Webhook](https://scrapfly.io/docs/extraction-api/webhook)
- [Billing](https://scrapfly.io/docs/extraction-api/billing)
- [Errors](https://scrapfly.io/docs/extraction-api/errors)
- [FAQ](https://scrapfly.io/docs/extraction-api/faq)

#### Proxy Saver

- [Getting Started](https://scrapfly.io/docs/proxy-saver/getting-started)
- [Fingerprints](https://scrapfly.io/docs/proxy-saver/fingerprints)
- [Optimizations](https://scrapfly.io/docs/proxy-saver/optimizations)
- [SSL Certificates](https://scrapfly.io/docs/proxy-saver/certificates)
- [Protocols](https://scrapfly.io/docs/proxy-saver/protocols)
- [Pacfile](https://scrapfly.io/docs/proxy-saver/pacfile)
- [Secure Credentials](https://scrapfly.io/docs/proxy-saver/security)
- [Billing](https://scrapfly.io/docs/proxy-saver/billing)

#### Cloud Browser API

- [Getting Started](https://scrapfly.io/docs/cloud-browser-api/getting-started)
- [Proxy & Geo-Targeting](https://scrapfly.io/docs/cloud-browser-api/proxy)
- [Unblock API](https://scrapfly.io/docs/cloud-browser-api/unblock)
- [File Downloads](https://scrapfly.io/docs/cloud-browser-api/file-downloads)
- [Session Resume](https://scrapfly.io/docs/cloud-browser-api/session-resume)
- [Human-in-the-Loop](https://scrapfly.io/docs/cloud-browser-api/human-in-the-loop)
- [Debug Mode](https://scrapfly.io/docs/cloud-browser-api/debug-mode)
- [Bring Your Own Proxy](https://scrapfly.io/docs/cloud-browser-api/bring-your-own-proxy)
- [Browser Extensions](https://scrapfly.io/docs/cloud-browser-api/extensions)
- [Native Browser MCP](https://scrapfly.io/docs/cloud-browser-api/mcp)
- [DevTools Protocol](https://scrapfly.io/docs/cloud-browser-api/cdp-reference)

##### Integrations

- [Puppeteer](https://scrapfly.io/docs/cloud-browser-api/puppeteer)
- [Playwright](https://scrapfly.io/docs/cloud-browser-api/playwright)
- [Selenium](https://scrapfly.io/docs/cloud-browser-api/selenium)
- [Vercel Agent Browser](https://scrapfly.io/docs/cloud-browser-api/agent-browser)
- [Browser Use](https://scrapfly.io/docs/cloud-browser-api/browser-use)
- [Stagehand](https://scrapfly.io/docs/cloud-browser-api/stagehand)

- [Billing](https://scrapfly.io/docs/cloud-browser-api/billing)
- [Errors](https://scrapfly.io/docs/cloud-browser-api/errors)


### Tools

- [Antibot Detector](https://scrapfly.io/docs/tools/antibot-detector)

### SDK

- [Golang](https://scrapfly.io/docs/sdk/golang)
- [Python](https://scrapfly.io/docs/sdk/python)
- [Rust](https://scrapfly.io/docs/sdk/rust)
- [TypeScript](https://scrapfly.io/docs/sdk/typescript)
- [Scrapy](https://scrapfly.io/docs/sdk/scrapy)

### Integrations

- [Getting Started](https://scrapfly.io/docs/integration/getting-started)
- [LangChain](https://scrapfly.io/docs/integration/langchain)
- [LlamaIndex](https://scrapfly.io/docs/integration/llamaindex)
- [CrewAI](https://scrapfly.io/docs/integration/crewai)
- [Zapier](https://scrapfly.io/docs/integration/zapier)
- [Make](https://scrapfly.io/docs/integration/make)
- [n8n](https://scrapfly.io/docs/integration/n8n)

### Academy

- [Overview](https://scrapfly.io/academy)
- [Web Scraping Overview](https://scrapfly.io/academy/scraping-overview)
- [Tools](https://scrapfly.io/academy/tools-overview)
- [Reverse Engineering](https://scrapfly.io/academy/reverse-engineering)
- [Static Scraping](https://scrapfly.io/academy/static-scraping)
- [HTML Parsing](https://scrapfly.io/academy/html-parsing)
- [Dynamic Scraping](https://scrapfly.io/academy/dynamic-scraping)
- [Hidden API Scraping](https://scrapfly.io/academy/hidden-api-scraping)
- [Headless Browsers](https://scrapfly.io/academy/headless-browsers)
- [Hidden Web Data](https://scrapfly.io/academy/hidden-web-data)
- [JSON Parsing](https://scrapfly.io/academy/json-parsing)
- [Data Processing](https://scrapfly.io/academy/data-processing)
- [Scaling](https://scrapfly.io/academy/scaling)
- [Walkthrough Summary](https://scrapfly.io/academy/walkthrough-summary)
- [Scraper Blocking](https://scrapfly.io/academy/scraper-blocking)
- [Proxies](https://scrapfly.io/academy/proxies)

---

# Data Processing, Output & Validation

The last step of every web scraping project is final data processing. This usually involves data validation, cleanup, and storage.

As scrapers deal with data from unknown sources, data processing can be a surprisingly complex challenge. For long-term scraping, data validation is crucial for scraper maintenance: tools that ensure scraped result quality can prevent scrapers from silently breaking or performing sub-optimally.

## Data Formats

Web scraping datasets vary greatly, from small predictable structures to large, complex data graphs.

Most commonly, the CSV and JSON formats are used.

### CSV

CSV is great for flat datasets with a consistent structure. It can be imported directly into spreadsheet software (Excel, Google Sheets, etc.) and rarely needs compression, as the format is already very compact.

 Here's a short example scraper that stores data to CSV:

- [Python](#example-csv)
- [Scrapfly.py](#example-csv-sf)
 
```python
import httpx
import csv
from parsel import Selector

urls = [
    "https://web-scraping.dev/product/1",
    "https://web-scraping.dev/product/2",
    "https://web-scraping.dev/product/3",
    "https://web-scraping.dev/product/4",
    "https://web-scraping.dev/product/5",
]

# open with newline="" so the csv module controls line endings,
# and use a context manager so the file is flushed and closed
with open("products.csv", "w", newline="") as file:
    writer = csv.writer(file)
    writer.writerow(["url", "name", "price"])
    for url in urls:
        resp = httpx.get(url)
        sel = Selector(resp.text)
        price = sel.css(".product-price::text").get()
        name = sel.css(".product-title::text").get()
        writer.writerow([str(resp.url), name, price])
```

```python
import csv
from scrapfly import ScrapflyClient, ScrapeConfig

scrapfly = ScrapflyClient("YOUR SCRAPFLY KEY")

urls = [
    "https://web-scraping.dev/product/1",
    "https://web-scraping.dev/product/2",
    "https://web-scraping.dev/product/3",
    "https://web-scraping.dev/product/4",
    "https://web-scraping.dev/product/5",
]

with open("products.csv", "w", newline="") as file:
    writer = csv.writer(file)
    writer.writerow(["url", "name", "price"])
    for url in urls:
        result = scrapfly.scrape(ScrapeConfig(url))
        price = result.selector.css(".product-price::text").get()
        name = result.selector.css(".product-title::text").get()
        writer.writerow([result.context['url'], name, price])
```

 Some things to note when working with CSV:

- The separator character (`,` by default) has to be escaped when it appears in values
- CSV is a flat structure, so nested datasets have to be flattened
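To illustrate the flattening point, here's a minimal sketch (with a made-up product item; the field names are just examples) that collapses a nested `reviews` list into scalar columns before writing a CSV row. Note that `csv.DictWriter` handles separator escaping automatically:

```python
import csv
import io

# nested scraped item - CSV can't store the "reviews" list directly,
# so we flatten it into scalar summary columns before writing
item = {
    "name": "Box of Chocolate Candy",
    "price": "$9.99",
    "reviews": [{"rating": 5}, {"rating": 4}],
}

flat = {
    "name": item["name"],
    "price": item["price"],
    "review_count": len(item["reviews"]),
    "avg_rating": sum(r["rating"] for r in item["reviews"]) / len(item["reviews"]),
}

buffer = io.StringIO()
# QUOTE_MINIMAL (the default) quotes values only when they contain
# the separator or other special characters
writer = csv.DictWriter(buffer, fieldnames=list(flat), quoting=csv.QUOTE_MINIMAL)
writer.writeheader()
writer.writerow(flat)
print(buffer.getvalue())
```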
 
### JSON

JSON is great for complex structures, as it allows easy nesting and key-to-value structuring. However, JSON datasets can take up a lot of space and benefit from compression for efficient storage.

 Here's a short example scraper that stores data in JSON:

- [Python](#example-json)
- [Scrapfly.py](#example-json-sf)
 
```python
import httpx
import json
from parsel import Selector

urls = [
    "https://web-scraping.dev/product/1",
    "https://web-scraping.dev/product/2",
    "https://web-scraping.dev/product/3",
    "https://web-scraping.dev/product/4",
    "https://web-scraping.dev/product/5",
]

results = []
for url in urls:
    resp = httpx.get(url)
    sel = Selector(resp.text)
    price = sel.css(".product-price::text").get()
    name = sel.css(".product-title::text").get()
    results.append({
        "url": str(resp.url),
        "name": name,
        "price": price
    })
with open('results.json', 'w') as f:
    json.dump(results, f, indent=2, ensure_ascii=False)
```

```python
import json
from scrapfly import ScrapflyClient, ScrapeConfig

scrapfly = ScrapflyClient("YOUR SCRAPFLY API KEY")

urls = [
    "https://web-scraping.dev/product/1",
    "https://web-scraping.dev/product/2",
    "https://web-scraping.dev/product/3",
    "https://web-scraping.dev/product/4",
    "https://web-scraping.dev/product/5",
]

results = []
for url in urls:
    result = scrapfly.scrape(ScrapeConfig(url))
    price = result.selector.css(".product-price::text").get()
    name = result.selector.css(".product-title::text").get()
    results.append({
        "url": result.context['url'],
        "name": name,
        "price": price
    })
with open('results.json', 'w') as f:
    json.dump(results, f, indent=2, ensure_ascii=False)
```

 Some things to note when working with JSON:

- The quote (`"`) characters have to be escaped
- Unicode output is often escaped by default (e.g. Python's `json.dumps` needs `ensure_ascii=False` to emit unicode characters directly)
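Both caveats can be seen directly with Python's built-in `json` module. A small sketch with a made-up item containing quotes and non-ASCII text:

```python
import json

item = {"name": 'Teal "Energy" Potion', "description": "héllo wörld"}

# quotes inside strings are escaped automatically by json.dumps;
# non-ASCII characters are escaped as \uXXXX unless told otherwise
default = json.dumps(item)
readable = json.dumps(item, ensure_ascii=False)

print(default)   # {"name": "Teal \"Energy\" Potion", "description": "h\u00e9llo w\u00f6rld"}
print(readable)  # {"name": "Teal \"Energy\" Potion", "description": "héllo wörld"}
```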
 
### JSONL 

JSONL (JSON Lines) is a particularly popular JSON variant for web scraping datasets, where each line is a standalone JSON object. This allows easy result streaming, which makes scrapers much easier to work with.

 Here's an example of a simple JSON Lines scraper:

- [Python](#example-jsonl)
- [Scrapfly.py](#example-jsonl-sf)
 
```python
import httpx
import json
from parsel import Selector

urls = [
    "https://web-scraping.dev/product/1",
    "https://web-scraping.dev/product/2",
    "https://web-scraping.dev/product/3",
    "https://web-scraping.dev/product/4",
    "https://web-scraping.dev/product/5",
]

for url in urls:
    resp = httpx.get(url)
    sel = Selector(resp.text)
    price = sel.css(".product-price::text").get()
    name = sel.css(".product-title::text").get()
    item = {
        "url": str(resp.url),
        "name": name,
        "price": price
    }
    # append mode: each scraped item is written out immediately
    with open('results.jsonl', 'a') as f:
        f.write(json.dumps(item, ensure_ascii=False) + "\n")
```

```python
import json
from scrapfly import ScrapflyClient, ScrapeConfig

scrapfly = ScrapflyClient("YOUR SCRAPFLY KEY")

urls = [
    "https://web-scraping.dev/product/1",
    "https://web-scraping.dev/product/2",
    "https://web-scraping.dev/product/3",
    "https://web-scraping.dev/product/4",
    "https://web-scraping.dev/product/5",
]

for url in urls:
    result = scrapfly.scrape(ScrapeConfig(url))
    price = result.selector.css(".product-price::text").get()
    name = result.selector.css(".product-title::text").get()
    item = {
        "url": result.context['url'],
        "name": name,
        "price": price
    }
    # append mode: each scraped item is written out immediately
    with open('results.jsonl', 'a') as f:
        f.write(json.dumps(item, ensure_ascii=False) + "\n")
```

Above, we open the output file on each iteration and append a new line. This enables easy data streaming for our web scrapers, which is a much more manageable way to handle large data flows.

### Spreadsheets 

Spreadsheets are a natural fit for web scraping, as they are designed to handle dynamic data, can be streamed to (by appending rows), and are easy to work with. CSV output is already spreadsheet-compatible, but formats like Google Sheets add extra features such as online access, version control, and collaboration.

- [Scraping Data to Google Sheets](https://scrapfly.io/blog/posts/web-scraping-to-google-sheets/) - Google Sheets are free, accessible, and feature-rich online spreadsheets, ideal for storing scraped data.

 

##### FAQ

- [How to scrape HTML table to Excel Spreadsheet (.xlsx)?](https://scrapfly.io/blog/answers/html-table-to-xlsx-python-beautifulsoup/)

## Data Processing

Most datapoints found on the web are in free-text format. Dates and prices, for example, are expressed as text and need to be converted to matching data types. Here are some tips we cover on this subject:
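As a minimal stdlib-only sketch of such conversions (the regex patterns and date format list below are illustrative, not exhaustive; libraries like dateparser guess freeform dates automatically):

```python
import re
from datetime import datetime

def parse_price(text):
    """Extract a float from a freeform price string like '$1,299.99'."""
    match = re.search(r"[\d,]+(?:\.\d+)?", text)
    if not match:
        return None
    return float(match.group().replace(",", ""))

def parse_date(text, formats=("%B %d, %Y", "%Y-%m-%d", "%d/%m/%Y")):
    """Try a list of known formats - unlike dateparser, the stdlib
    needs explicit format candidates."""
    for fmt in formats:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    return None

print(parse_price("$1,299.99"))          # 1299.99
print(parse_date("June 5, 2023").date()) # 2023-06-05
```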

- [Automatic Date Parsing with Dateparser](https://scrapfly.io/blog/posts/parsing-datetime-strings-with-python-and-dateparser/) - Dateparser is a Python library that can accurately guess date objects from freeform date strings.
- [Tips for Scraping Emails](https://scrapfly.io/blog/posts/how-to-scrape-emails-using-python/) - Intro to email scraping and parsing, which has its own unique data processing challenges like deobfuscation.
- [Tips for Scraping Phone Numbers](https://scrapfly.io/blog/posts/how-to-scrape-phone-numbers-with-python/) - Intro to phone number scraping, which can be difficult to validate and process successfully.

## Data Validation

When scraping at scale, data validation is vital for consistent results, as real web pages change unpredictably and often.

There are multiple ways to approach validation, but most importantly, tracking results and matching them against schemas and regular expression patterns can catch the vast majority of failures.
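A minimal sketch of pattern-based validation (the schema and field patterns below are made up for illustration): each scraped item is checked against expected regular expressions, so a missing or malformed field surfaces immediately instead of silently corrupting the dataset:

```python
import re

# expected shape of one scraped product item (illustrative patterns)
SCHEMA = {
    "url": re.compile(r"^https?://"),
    "name": re.compile(r"\S"),                      # non-empty text
    "price": re.compile(r"^\$?\d+(\.\d{1,2})?$"),   # e.g. "$9.99"
}

def validate(item):
    """Return the names of fields that are missing or fail
    their expected pattern."""
    errors = []
    for field, pattern in SCHEMA.items():
        value = item.get(field)
        if not isinstance(value, str) or not pattern.search(value):
            errors.append(field)
    return errors

good = {"url": "https://web-scraping.dev/product/1", "name": "Box of Chocolate Candy", "price": "$9.99"}
bad = {"url": "https://web-scraping.dev/product/2", "name": "", "price": None}

print(validate(good))  # []
print(validate(bad))   # ['name', 'price']
```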

 

- [Intro to Scraped Data Validation](https://scrapfly.io/blog/posts/how-to-ensure-web-scrapped-data-quality/) - This introduction covers two popular data validation tools: strict types and schema validation.

 

## Next - Blocking

We've now covered most of the subjects that developers come across when building web scraping programs. However, by far the biggest barrier in scraping is scraper blocking; next, let's take a look at what it is and how to avoid it.

[Previous: JSON Parsing](https://scrapfly.io/academy/json-parsing) | [Next: Scraper Blocking](https://scrapfly.io/academy/scraper-blocking)