# Scrapfly Documentation

## Table of Contents

### Dashboard

- [Intro](https://scrapfly.io/docs)
- [Project](https://scrapfly.io/docs/project)
- [Account](https://scrapfly.io/docs/account)
- [Workspace & Team](https://scrapfly.io/docs/workspace-and-team)
- [Billing](https://scrapfly.io/docs/billing)

### Products

#### MCP Server

- [Getting Started](https://scrapfly.io/docs/mcp/getting-started)
- [Tools & API Spec](https://scrapfly.io/docs/mcp/tools)
- [Authentication](https://scrapfly.io/docs/mcp/authentication)
- [Examples & Use Cases](https://scrapfly.io/docs/mcp/examples)
- [FAQ](https://scrapfly.io/docs/mcp/faq)

##### Integrations

- [Overview](https://scrapfly.io/docs/mcp/integrations)
- [Claude Desktop](https://scrapfly.io/docs/mcp/integrations/claude-desktop)
- [Claude Code](https://scrapfly.io/docs/mcp/integrations/claude-code)
- [ChatGPT](https://scrapfly.io/docs/mcp/integrations/chatgpt)
- [Cursor](https://scrapfly.io/docs/mcp/integrations/cursor)
- [Cline](https://scrapfly.io/docs/mcp/integrations/cline)
- [Windsurf](https://scrapfly.io/docs/mcp/integrations/windsurf)
- [Zed](https://scrapfly.io/docs/mcp/integrations/zed)
- [Roo Code](https://scrapfly.io/docs/mcp/integrations/roo-code)
- [VS Code](https://scrapfly.io/docs/mcp/integrations/vscode)
- [LangChain](https://scrapfly.io/docs/mcp/integrations/langchain)
- [LlamaIndex](https://scrapfly.io/docs/mcp/integrations/llamaindex)
- [CrewAI](https://scrapfly.io/docs/mcp/integrations/crewai)
- [OpenAI](https://scrapfly.io/docs/mcp/integrations/openai)
- [n8n](https://scrapfly.io/docs/mcp/integrations/n8n)
- [Make](https://scrapfly.io/docs/mcp/integrations/make)
- [Zapier](https://scrapfly.io/docs/mcp/integrations/zapier)
- [Vapi AI](https://scrapfly.io/docs/mcp/integrations/vapi)
- [Agent Builder](https://scrapfly.io/docs/mcp/integrations/agent-builder)
- [Custom Client](https://scrapfly.io/docs/mcp/integrations/custom-client)


#### Web Scraping API

- [Getting Started](https://scrapfly.io/docs/scrape-api/getting-started)
- [API Specification]()
- [Monitoring](https://scrapfly.io/docs/monitoring)
- [Customize Request](https://scrapfly.io/docs/scrape-api/custom)
- [Debug](https://scrapfly.io/docs/scrape-api/debug)
- [Anti Scraping Protection](https://scrapfly.io/docs/scrape-api/anti-scraping-protection)
- [Proxy](https://scrapfly.io/docs/scrape-api/proxy)
- [Proxy Mode](https://scrapfly.io/docs/scrape-api/proxy-mode)
- [Proxy Mode - Screaming Frog](https://scrapfly.io/docs/scrape-api/proxy-mode/screaming-frog)
- [Proxy Mode - Apify](https://scrapfly.io/docs/scrape-api/proxy-mode/apify)
- [(Auto) Data Extraction](https://scrapfly.io/docs/scrape-api/extraction)
- [Javascript Rendering](https://scrapfly.io/docs/scrape-api/javascript-rendering)
- [Javascript Scenario](https://scrapfly.io/docs/scrape-api/javascript-scenario)
- [SSL](https://scrapfly.io/docs/scrape-api/ssl)
- [DNS](https://scrapfly.io/docs/scrape-api/dns)
- [Cache](https://scrapfly.io/docs/scrape-api/cache)
- [Session](https://scrapfly.io/docs/scrape-api/session)
- [Webhook](https://scrapfly.io/docs/scrape-api/webhook)
- [Screenshot](https://scrapfly.io/docs/scrape-api/screenshot)
- [Errors](https://scrapfly.io/docs/scrape-api/errors)
- [Timeout](https://scrapfly.io/docs/scrape-api/understand-timeout)
- [Throttling](https://scrapfly.io/docs/throttling)
- [Troubleshoot](https://scrapfly.io/docs/scrape-api/troubleshoot)
- [Billing](https://scrapfly.io/docs/scrape-api/billing)
- [FAQ](https://scrapfly.io/docs/scrape-api/faq)

#### Crawler API

- [Getting Started](https://scrapfly.io/docs/crawler-api/getting-started)
- [API Specification]()
- [Retrieving Results](https://scrapfly.io/docs/crawler-api/results)
- [WARC Format](https://scrapfly.io/docs/crawler-api/warc-format)
- [Data Extraction](https://scrapfly.io/docs/crawler-api/extraction-rules)
- [Webhook](https://scrapfly.io/docs/crawler-api/webhook)
- [Billing](https://scrapfly.io/docs/crawler-api/billing)
- [Errors](https://scrapfly.io/docs/crawler-api/errors)
- [Troubleshoot](https://scrapfly.io/docs/crawler-api/troubleshoot)
- [FAQ](https://scrapfly.io/docs/crawler-api/faq)

#### Screenshot API

- [Getting Started](https://scrapfly.io/docs/screenshot-api/getting-started)
- [API Specification]()
- [Accessibility Testing](https://scrapfly.io/docs/screenshot-api/accessibility)
- [Webhook](https://scrapfly.io/docs/screenshot-api/webhook)
- [Billing](https://scrapfly.io/docs/screenshot-api/billing)
- [Errors](https://scrapfly.io/docs/screenshot-api/errors)

#### Extraction API

- [Getting Started](https://scrapfly.io/docs/extraction-api/getting-started)
- [API Specification]()
- [Rules Template](https://scrapfly.io/docs/extraction-api/rules-and-template)
- [LLM Extraction](https://scrapfly.io/docs/extraction-api/llm-prompt)
- [AI Auto Extraction](https://scrapfly.io/docs/extraction-api/automatic-ai)
- [Webhook](https://scrapfly.io/docs/extraction-api/webhook)
- [Billing](https://scrapfly.io/docs/extraction-api/billing)
- [Errors](https://scrapfly.io/docs/extraction-api/errors)
- [FAQ](https://scrapfly.io/docs/extraction-api/faq)

#### Proxy Saver

- [Getting Started](https://scrapfly.io/docs/proxy-saver/getting-started)
- [Fingerprints](https://scrapfly.io/docs/proxy-saver/fingerprints)
- [Optimizations](https://scrapfly.io/docs/proxy-saver/optimizations)
- [SSL Certificates](https://scrapfly.io/docs/proxy-saver/certificates)
- [Protocols](https://scrapfly.io/docs/proxy-saver/protocols)
- [Pacfile](https://scrapfly.io/docs/proxy-saver/pacfile)
- [Secure Credentials](https://scrapfly.io/docs/proxy-saver/security)
- [Billing](https://scrapfly.io/docs/proxy-saver/billing)

#### Cloud Browser API

- [Getting Started](https://scrapfly.io/docs/cloud-browser-api/getting-started)
- [Proxy & Geo-Targeting](https://scrapfly.io/docs/cloud-browser-api/proxy)
- [Unblock API](https://scrapfly.io/docs/cloud-browser-api/unblock)
- [File Downloads](https://scrapfly.io/docs/cloud-browser-api/file-downloads)
- [Session Resume](https://scrapfly.io/docs/cloud-browser-api/session-resume)
- [Human-in-the-Loop](https://scrapfly.io/docs/cloud-browser-api/human-in-the-loop)
- [Debug Mode](https://scrapfly.io/docs/cloud-browser-api/debug-mode)
- [Bring Your Own Proxy](https://scrapfly.io/docs/cloud-browser-api/bring-your-own-proxy)
- [Browser Extensions](https://scrapfly.io/docs/cloud-browser-api/extensions)

##### Integrations

- [Puppeteer](https://scrapfly.io/docs/cloud-browser-api/puppeteer)
- [Playwright](https://scrapfly.io/docs/cloud-browser-api/playwright)
- [Selenium](https://scrapfly.io/docs/cloud-browser-api/selenium)
- [Vercel Agent Browser](https://scrapfly.io/docs/cloud-browser-api/agent-browser)
- [Browser Use](https://scrapfly.io/docs/cloud-browser-api/browser-use)
- [Stagehand](https://scrapfly.io/docs/cloud-browser-api/stagehand)

- [Billing](https://scrapfly.io/docs/cloud-browser-api/billing)
- [Errors](https://scrapfly.io/docs/cloud-browser-api/errors)


### Tools

- [Antibot Detector](https://scrapfly.io/docs/tools/antibot-detector)

### SDK

- [Golang](https://scrapfly.io/docs/sdk/golang)
- [Python](https://scrapfly.io/docs/sdk/python)
- [TypeScript](https://scrapfly.io/docs/sdk/typescript)
- [Scrapy](https://scrapfly.io/docs/sdk/scrapy)

### Integrations

- [Getting Started](https://scrapfly.io/docs/integration/getting-started)
- [LangChain](https://scrapfly.io/docs/integration/langchain)
- [LlamaIndex](https://scrapfly.io/docs/integration/llamaindex)
- [CrewAI](https://scrapfly.io/docs/integration/crewai)
- [Zapier](https://scrapfly.io/docs/integration/zapier)
- [Make](https://scrapfly.io/docs/integration/make)
- [n8n](https://scrapfly.io/docs/integration/n8n)

### Academy

- [Overview](https://scrapfly.io/academy)
- [Web Scraping Overview](https://scrapfly.io/academy/scraping-overview)
- [Tools](https://scrapfly.io/academy/tools-overview)
- [Reverse Engineering](https://scrapfly.io/academy/reverse-engineering)
- [Static Scraping](https://scrapfly.io/academy/static-scraping)
- [HTML Parsing](https://scrapfly.io/academy/html-parsing)
- [Dynamic Scraping](https://scrapfly.io/academy/dynamic-scraping)
- [Hidden API Scraping](https://scrapfly.io/academy/hidden-api-scraping)
- [Headless Browsers](https://scrapfly.io/academy/headless-browsers)
- [Hidden Web Data](https://scrapfly.io/academy/hidden-web-data)
- [JSON Parsing](https://scrapfly.io/academy/json-parsing)
- [Data Processing](https://scrapfly.io/academy/data-processing)
- [Scaling](https://scrapfly.io/academy/scaling)
- [Walkthrough Summary](https://scrapfly.io/academy/walkthrough-summary)
- [Scraper Blocking](https://scrapfly.io/academy/scraper-blocking)
- [Proxies](https://scrapfly.io/academy/proxies)

---

# HTML Parsing

Almost every web scraper has to parse data out of HTML for one reason or another, be it finding data on the page or selecting elements to interact with.

Fortunately, HTML is *designed* to be machine parsable. HTML documents are essentially tree structures of `<tag>` elements. For example, the following HTML document can be interpreted as an element tree:

 ```
<head>
  <title>
  </title>
</head>
<body>
  <h1>Introduction</h1>
  <div>
    <p>some description text:</p>
    <a class="link" href="http://web-scraping.dev">
      example link
    </a>
  </div>
</body>

```

From this view, the tree structure is clear. Note that each node can carry any number of attributes, such as `href` for URL links, as well as text content.
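To make the tree idea concrete, here's a minimal sketch that walks the example document using Python's standard-library `xml.etree.ElementTree` (the snippet above happens to be well-formed enough to parse as XML; real-world HTML usually needs a dedicated HTML parser like the ones covered next):

```python
import xml.etree.ElementTree as ET

html = """
<html>
<head><title></title></head>
<body>
  <h1>Introduction</h1>
  <div>
    <p>some description text:</p>
    <a class="link" href="http://web-scraping.dev">example link</a>
  </div>
</body>
</html>
"""
root = ET.fromstring(html)
link = root.find(".//a")   # descend the tree to the <a> node
print(link.get("href"))    # attribute access
print(link.text)           # text content access
```

Each node is reachable by walking down from the root, and its attributes (`href`, `class`) and text are plain data hanging off that node - this is the model that CSS selectors and XPath query against.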

 There are three standard ways to parse HTML trees:

##### [CSS Selectors](https://scrapfly.io/blog/posts/parsing-html-with-css/)

The most common way to select HTML nodes. Simple but robust.

##### [XPath](https://scrapfly.io/blog/posts/parsing-html-with-xpath/)

A powerful XML query language with its own special syntax and customizable functions.

##### [Object Based](https://scrapfly.io/blog/posts/web-scraping-with-python-beautifulsoup/)

HTML nodes can be interpreted as programming objects and composed into object trees. The most popular example of this is Python's BeautifulSoup package.

Which parsing option is best depends on the available tools and personal preference, though at Scrapfly we follow this handy evaluation rule:

1. CSS selectors where possible - they're simple and reliable.
2. XPath as a fallback for more complex selections.
3. BeautifulSoup, lxml (or any other object-based HTML parser) for complex algorithmic parsing.
 
> Scrapfly's [Python](https://scrapfly.io/docs/sdk/python), [TypeScript](https://scrapfly.io/docs/sdk/typescript) and [Golang](https://scrapfly.io/docs/sdk/golang) SDKs come with an HTML parser included for both CSS and XPath selectors through the `result.selector` shortcut!

 Let's continue with our static HTML page example and apply some data parsing.

## Example Scraper

 For this example scraper we'll be using Python with `parsel` as our HTML parser. It can be installed using `pip install parsel`.

We'll be scraping all product URLs from the [web-scraping.dev/products](https://web-scraping.dev/products) static HTML page we retrieved in the previous section.

 Let's select all product links using XPath or CSS selectors. For that, we need to take a look at the HTML page structure to figure out how to build our selectors:

 Using browser devtools we can see that:

- All products are contained under `div class="products"`
- Each product is under `div class="product"`
- Each product URL is in an `a` element under an `h3` heading
 
 With this, we can easily build out our CSS and XPath selectors:

The example below is shown twice: first using plain Python with `parsel` and `httpx`, then using the Scrapfly Python SDK.
 
 ```
from parsel import Selector
import httpx

# 1. retrieve the page HTML (from prev section)
response = httpx.get(url="https://web-scraping.dev/products")

# 2. Build a selector object
selector = Selector(response.text)
# we can use CSS selectors
product_urls = selector.css(".product h3 a::attr(href)").getall()
other_page_urls = set(selector.css(".paging>a::attr(href)").getall())

# or XPath for more advanced selections, e.g. only products that contain "Potion" in the name:
drink_product_urls = selector.xpath("//h3/a[contains(text(), 'Potion')]/@href").getall()

```

The same scraper using the Scrapfly Python SDK:

 ```
from scrapfly import ScrapflyClient, ScrapeConfig

client = ScrapflyClient(key="YOUR SCRAPFLY KEY")
result = client.scrape(ScrapeConfig(
  url="https://web-scraping.dev/products"
))

# we can use CSS selectors
product_urls = result.selector.css(".product h3 a::attr(href)").getall()
other_page_urls = set(result.selector.css(".paging>a::attr(href)").getall())

# or XPath for more advanced selections, e.g. only products that contain "Potion" in the name:
drink_product_urls = result.selector.xpath("//h3/a[contains(text(), 'Potion')]/@href").getall()

```

Above, we retrieve the HTML as covered in the previous section. Then, we load it into a `Selector` object, which builds a tree we can query with XPath or CSS selectors.

We're now ready for the static paging exercise - see it in full below 👇

##### [Scrapeground Exercise: Static Paging Scraping](https://scrapfly.io/scrapeground/paging/static)

See this in-depth tutorial on Scrapfly Scrapeground for a more detailed example of this scraper, including examples in different languages and libraries as well as more scenarios.

## Getting Good at HTML Parsing

HTML can become confusing and complex quickly. This complexity, combined with the fact that pages tend to change often, means that HTML parser code has to be robust and flexible.

 To start, see our interactive cheatsheet pages for quickly navigating CSS Selector and XPath syntax.

##### [CSS Selector Cheatsheet](https://scrapfly.io/blog/posts/css-selector-cheatsheet/)

Interactive reference for all CSS selector features used in web scraping.

##### [XPath Cheatsheet](https://scrapfly.io/blog/posts/xpath-cheatsheet/)

Interactive reference for all XPath features used in web scraping.

Many shortcuts and tools can help with figuring out the right XPath or CSS selectors when developing web scrapers, but honestly, they're not good enough on their own. It's best to take the time to learn the syntax and the right tools for the job, though AI assistants can be good tutors:

##### [Using AI To Parse HTML](https://scrapfly.io/blog/posts/parsing-html-with-chatgpt-code-interpreter/)

ChatGPT and other large language model assistants can be pretty good teachers when it comes to HTML parsing.

##### [Using AI to Create XPath and CSS Selectors](https://scrapfly.io/blog/posts/finding-web-selectors-with-chatgpt/)

AI assistants can also generate XPath and CSS selectors from provided HTML samples.

Related FAQ:

- [How to handle dynamic CSS classes?](https://scrapfly.io/blog/answers/how-to-parse-dynamic-classes/)
- [How to turn HTML to text?](https://scrapfly.io/blog/answers/how-to-turn-html-to-text-in-python/)
- [More FAQ at Scrapfly Knowledgebase](https://scrapfly.io/blog/tags/data-parsing/)

## Parsing XML - Feeds and Sitemaps

XML is encountered less often in web scraping than HTML, and since the two formats share the same tree structure, the same parsing techniques and tools apply to XML documents with a few minor differences.

 Most commonly XML is encountered in scraping additional website structures like feeds and sitemaps.
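A sitemap, for example, can be parsed with the same tree-walking approach. Here's a minimal sketch using Python's standard-library `xml.etree.ElementTree` on a hypothetical sitemap document (the product URLs are made up for illustration); the main difference from HTML parsing is that sitemap tags live under an XML namespace and must be qualified:

```python
import xml.etree.ElementTree as ET

# a minimal sitemap document (hypothetical URLs for illustration):
sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://web-scraping.dev/product/1</loc></url>
  <url><loc>https://web-scraping.dev/product/2</loc></url>
</urlset>"""

root = ET.fromstring(sitemap)
# sitemap tags live under the sitemaps.org namespace,
# so queries must qualify each tag with it:
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
urls = [loc.text for loc in root.findall("sm:url/sm:loc", ns)]
print(urls)
```

The namespace handling is the most common pitfall when moving from HTML to XML parsing - an unqualified `findall("url/loc")` would silently return nothing here.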

##### [Intro to parsing XML](https://scrapfly.io/blog/posts/how-to-parse-xml/)

How to parse XML in web scraping across different programming languages, and what common pitfalls to avoid.

##### [Intro to scraping sitemaps](https://scrapfly.io/blog/posts/how-to-scrape-sitemaps/)

How to discover web scraping targets using official sitemap XML feed structures.

## Next up - Dynamic Page Scraping

Scraping static pages and parsing HTML isn't too difficult, but modern web design is moving away from static pages. Next, let's take a look at how dynamic web pages and web apps, such as SPAs built with Next.js or React, are scraped.



[← Static Scraping](https://scrapfly.io/academy/static-scraping "Previous Page") | [Dynamic Scraping →](https://scrapfly.io/academy/dynamic-scraping "Next Page")