# Python lxml Tutorial: How to Parse HTML and XML

 by [Mazen Ramadan](https://scrapfly.io/blog/author/mazen) Apr 20, 2026 20 min read [\#data-parsing](https://scrapfly.io/blog/tag/data-parsing) [\#python](https://scrapfly.io/blog/tag/python) [\#tools](https://scrapfly.io/blog/tag/tools) 


Most Python developers reach for BeautifulSoup the first time they parse HTML, and most of them eventually hit the same wall: parsing speed becomes the bottleneck once a scraper is processing thousands of pages an hour. The Python lxml library solves that problem by wrapping the libxml2 and libxslt C parsers behind a Pythonic API, delivering parse times that are roughly an order of magnitude faster than pure Python parsers while supporting both XPath and CSS selectors out of the box.

In this tutorial, we walk through Python lxml end to end. We cover installation, the ElementTree API for building and navigating document trees, parsing HTML with the lenient `lxml.html` module, parsing XML and handling namespaces, querying with `.xpath()` and `.cssselect()`, and a complete web scraping example using lxml together with httpx. By the end, you will know which lxml entry point to reach for in any parsing scenario.

## Key Takeaways

Use Python lxml for fast HTML and XML parsing with a single API that exposes the full power of XPath and CSS selectors.

- lxml wraps the C libraries libxml2 and libxslt, giving you parse speeds roughly 10x faster than pure Python parsers.
- The `lxml.etree` module is for strict XML and `lxml.html` is for real-world web pages that may contain broken markup.
- `etree.fromstring()` parses strings while `etree.parse()` reads files and URLs into an ElementTree.
- The `.xpath()` method always returns a list and supports XPath 1.0 with bound variables for safe parameter substitution.
- The `.cssselect()` method converts CSS selectors to XPath internally through the separate cssselect package.
- The `lxml.html` parser silently recovers from missing tags, mismatched nesting, and other real-world HTML defects that break stricter parsers.


## What Is lxml and Why Use It?

[lxml](https://lxml.de/) is a high-performance Python library for parsing and manipulating HTML and XML documents. It wraps the C libraries [libxml2](https://gitlab.gnome.org/GNOME/libxml2) and [libxslt](https://gitlab.gnome.org/GNOME/libxslt), combining native parsing speed with a Pythonic API that follows the standard [ElementTree](https://docs.python.org/3/library/xml.etree.elementtree.html) interface.

lxml treats every document as a tree of `Element` objects. Each element exposes:

- A **tag name** (e.g. `div`, `span`, `item`)
- An **attributes dictionary** (e.g. `{"class": "price", "id": "main"}`)
- A **text value** and optional **tail text**
- A **list of child elements**

This is the same model used by Python's built-in `xml.etree.ElementTree`, so existing ElementTree code typically runs against lxml unchanged while picking up XPath, CSS selectors, and faster parsing for free.
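
For example, porting standard library code can be as small as swapping the import. The sketch below uses only standard ElementTree calls that lxml supports unchanged:

```python
# the only change from stdlib code: xml.etree.ElementTree -> lxml.etree
from lxml import etree

xml = "<products><product id='2'><name>Dark Red Energy Potion</name></product></products>"
root = etree.fromstring(xml)

# findall(), find(), and .get() behave exactly as in the standard library
for product in root.findall("product"):
    print(product.get("id"), product.find("name").text)
# 2 Dark Red Energy Potion
```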

Several popular libraries use lxml under the hood. [Scrapy](https://scrapy.org/) and [Parsel](https://pypi.org/project/parsel/) both wrap lxml to provide higher level scraping APIs, and BeautifulSoup can use lxml as a parser backend by passing `"lxml"` as the parser argument.
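
Switching BeautifulSoup onto the lxml backend is a one-argument change. A minimal sketch, assuming `beautifulsoup4` is installed alongside lxml:

```python
from bs4 import BeautifulSoup  # requires: pip install beautifulsoup4 lxml

html_doc = "<div><p class='price'>$4.99</p></div>"

# "lxml" selects the lxml parser backend for faster parsing
soup = BeautifulSoup(html_doc, "lxml")
print(soup.find("p", class_="price").text)  # $4.99
```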

[Guide to Parsel - the Best HTML Parsing in PythonLearn to extract data from websites with Parsel, a Python library for HTML parsing using CSS selectors and XPath.](https://scrapfly.io/blog/posts/guide-to-html-parsing-with-parsel-python)



## How to Install lxml in Python

Install lxml with a single pip command. On most systems, pip downloads a pre-built wheel so no compilation is required:

```shell
pip install lxml
```



The lxml package is not pure Python and depends on the C libraries `libxml2` and `libxslt`. Pre-built wheels bundle those libraries for Windows, macOS, and most Linux distributions, so the install above usually works out of the box.

If pip falls back to building lxml from source on a stripped-down Linux image, install the system development packages first. On Debian and Ubuntu the packages are `libxml2-dev`, `libxslt1-dev`, and `python3-dev`. After installing those, rerun the pip command above and the source build will succeed.
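
On Debian or Ubuntu that amounts to the following commands (package names differ on other distributions):

```shell
# build prerequisites for compiling lxml from source (Debian/Ubuntu)
sudo apt-get install libxml2-dev libxslt1-dev python3-dev
pip install lxml
```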

Since the web scraping example later in this tutorial fetches pages over HTTP, install the [httpx](https://www.python-httpx.org/) client as well:

```shell
pip install httpx
```





## How Does the ElementTree API Work in lxml?

The ElementTree API treats every HTML or XML document as a tree of `Element` objects. Each element has a tag, an attributes dictionary, a text value, and zero or more child elements. The lxml library implements that API and adds helpers for moving between parents, children, and siblings that the standard library does not include.

This section focuses on building and navigating trees in memory. Parsing strings, files, and URLs is covered in the dedicated HTML and XML sections later in the tutorial.

### How to Create and Build Document Trees with lxml

Use `etree.Element()` to create the root of a new tree and `etree.SubElement()` to attach children to any existing element. Both helpers return an `Element` instance you can keep working with, which makes nested construction straightforward:

```python
from lxml import etree

# create the root element with an attribute
root = etree.Element("products", category="drinks")

# create a child product and set its text
product = etree.SubElement(root, "product", id="2")
name = etree.SubElement(product, "name")
name.text = "Dark Red Energy Potion"
price = etree.SubElement(product, "price", currency="USD")
price.text = "4.99"

# serialize the tree back to a string
xml_bytes = etree.tostring(root, pretty_print=True)
print(xml_bytes.decode())
```



This builds a small product document from scratch and attaches attributes through keyword arguments. The call to `etree.tostring()` serializes the tree back to bytes, and `pretty_print=True` adds newlines and indentation so the output is easy to read:

```shell
<products category="drinks">
  <product id="2">
    <name>Dark Red Energy Potion</name>
    <price currency="USD">4.99</price>
  </product>
</products>
```



### How to Navigate Parent, Child, and Sibling Elements

Every lxml `Element` behaves like a Python list of its direct children. Indexing, slicing, iteration, and `len()` all work as expected. On top of that, lxml adds navigation helpers that the standard ElementTree does not provide: `getparent()`, `getnext()`, and `getprevious()`:

```python
from lxml import etree

xml = """
<products>
  <product id="1"><name>Box of Chocolate Candy</name></product>
  <product id="2"><name>Dark Red Energy Potion</name></product>
  <product id="3"><name>Teal Energy Potion</name></product>
</products>
"""
root = etree.fromstring(xml.strip())

print(f"children count: {len(root)}")
print(f"first child id: {root[0].get('id')}")

second = root[1]
print(f"parent tag: {second.getparent().tag}")
print(f"previous id: {second.getprevious().get('id')}")
print(f"next id: {second.getnext().get('id')}")
```



Here, indexing with `root[0]` and `root[1]` returns direct children. Calling `getparent()` walks up the tree, while `getprevious()` and `getnext()` move sideways across siblings of the same parent:

```shell
children count: 3
first child id: 1
parent tag: products
previous id: 1
next id: 3
```



### How to Access Element Attributes and Text

Each lxml element exposes three key properties for reading content:

- `.text` holds the inner text of the element.
- `.tail` holds any text that follows the element's closing tag.
- `.attrib` is the full attributes dictionary.

The `.get()` and `.set()` methods are convenient shortcuts that mirror standard dict access:

```python
from lxml import etree

html = '<a href="/product/2" class="title">Dark Red Energy Potion</a> in stock'
link = etree.fromstring(f"<wrap>{html}</wrap>")[0]

print(f"text: {link.text}")
print(f"tail: {link.tail}")
print(f"href attribute: {link.get('href')}")
print(f"all attributes: {dict(link.attrib)}")

# update an existing attribute
link.set("class", "title featured")
print(f"updated class: {link.get('class')}")
```



The fragment is wrapped in a `<wrap>` tag so the trailing text after the link can also be parsed. Reading `.text` gives the visible link label, `.tail` captures the `" in stock"` string that sits outside the closing `</a>` tag, and `.set()` overwrites an attribute in place:

```shell
text: Dark Red Energy Potion
tail:  in stock
href attribute: /product/2
all attributes: {'href': '/product/2', 'class': 'title'}
updated class: title featured
```





## How to Parse HTML Documents with lxml

Use `lxml.html.fromstring()` to parse HTML strings and `lxml.html.parse()` to parse HTML files or URLs. The `lxml.html` module is built on top of `lxml.etree` but adds an HTML aware parser that recovers from broken markup and a set of convenience methods designed specifically for web pages.
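
A minimal sketch of the two entry points (the `page.html` path is a placeholder for illustration):

```python
from lxml import html

# parse an HTML snippet from a string; returns the element directly
div = html.fromstring('<div class="price">$4.99</div>')
print(div.get("class"))  # price

# parse a file or URL; returns an ElementTree, so call getroot()
# "page.html" is a hypothetical local file
tree = html.parse("page.html")
print(tree.getroot().tag)  # html
```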

### What Is the Difference Between lxml.etree and lxml.html?

The `lxml.etree` module is a strict XML parser and rejects documents that violate XML well-formedness rules, such as unquoted attributes or unclosed tags. The `lxml.html` module is a lenient HTML parser that auto-recovers from those same defects and adds helper methods like `text_content()`, `find_class()`, and `iterlinks()` that have no XML equivalent.

The simplest way to see the difference is to feed the same intentionally broken document to both modules:

```python
from lxml import etree, html

broken = "<div><p>Hello<span>world</div>"

# strict XML parser fails
try:
    etree.fromstring(broken)
except etree.XMLSyntaxError as e:
    print(f"etree error: {e}")

# lenient HTML parser silently recovers
tree = html.fromstring(broken)
print(f"html parsed root tag: {tree.tag}")
print(f"recovered html: {html.tostring(tree).decode()}")
```



The script above shows how `lxml.etree` raises an `XMLSyntaxError` on the missing `</span>` and `</p>` tags, while `lxml.html` quietly inserts the missing closing tags and returns a usable tree. As a rule, reach for `lxml.html` whenever the input came from a real web page and reserve `lxml.etree` for documents you control or known well-formed XML feeds:

```shell
etree error: Opening and ending tag mismatch: span line 1 and div, line 1, column 31
html parsed root tag: div
recovered html: <div><p>Hello<span>world</span></p></div>
```



### How Does lxml Handle Broken or Malformed HTML?

The `lxml.html` parser ships with three entry points that handle slightly different inputs. Use `fromstring()` for any HTML snippet, `document_fromstring()` when you need a full `<html>` document wrapped around the content, and `fragment_fromstring()` when the input is guaranteed to be a single tag. A minimal sketch of how the three differ:
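
```python
from lxml import etree, html

# a single element is accepted by all three entry points
one = "<span class='price'>$4.99</span>"
print(html.fromstring(one).tag)           # span
print(html.document_fromstring(one).tag)  # html (full document wrapper)
print(html.fragment_fromstring(one).tag)  # span

# two sibling elements: fragment_fromstring() refuses by design
two = "<span>$4.99</span><span>$2.99</span>"
try:
    html.fragment_fromstring(two)
except etree.ParserError as err:
    print(f"fragment error: {err}")
```

Once parsed, `lxml.html` adds three convenience methods that save a lot of XPath boilerplate: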

```python
from lxml import html

page = """
<html><body>
  <h1>Energy Potions</h1>
  <div class="product featured">
    <a href="/product/2">Dark Red Energy Potion</a>
    <span class="price">$4.99</span>
  </div>
  <div class="product">
    <a href="/product/3">Teal Energy Potion</a>
    <span class="price">$4.99</span>
  </div>
</body></html>
"""
tree = html.document_fromstring(page)

print(f"all visible text: {tree.text_content().strip()[:60]}...")

featured = tree.find_class("featured")[0]
print(f"featured product text: {featured.text_content().strip()}")

for element, attribute, link, pos in tree.iterlinks():
    print(f"link: {link} (in <{element.tag}>)")
```



The script above parses a small product listing and then uses three `lxml.html` helpers in sequence. `text_content()` flattens every descendant into a single string, `find_class()` selects by CSS class without writing an XPath predicate, and `iterlinks()` yields every URL the document references along with the element that holds it:

```shell
all visible text: Energy Potions

    Dark Red Energy Potion
    $4.99


    Teal En...
featured product text: Dark Red Energy Potion
    $4.99
link: /product/2 (in <a>)
link: /product/3 (in <a>)
```





## How to Parse XML Documents with lxml

Use `etree.fromstring()` to parse XML from a string and `etree.parse()` to parse XML from a file path, file-like object, or URL. The two functions return slightly different objects, and remembering the difference avoids one of the most common lxml beginner mistakes.

### How to Parse XML from Strings, Files, and URLs

`etree.fromstring()` returns an `Element` that is already the root of the tree. `etree.parse()` returns an `ElementTree` wrapper around the same content, and you call `.getroot()` to reach the root element. Both functions accept a custom parser through `etree.XMLParser()` if you need options like `recover=True` for damaged XML or `ns_clean=True` to strip redundant namespace declarations.

```python
from io import BytesIO
from lxml import etree

xml_string = """
<products>
  <product id="2"><name>Dark Red Energy Potion</name></product>
</products>
"""

# fromstring returns an Element directly
root = etree.fromstring(xml_string.strip())
print(f"fromstring returns: {type(root).__name__}")
print(f"root tag: {root.tag}")

# parse returns an ElementTree, call getroot() for the root Element
tree = etree.parse(BytesIO(xml_string.strip().encode()))
print(f"parse returns: {type(tree).__name__}")
print(f"root tag via getroot: {tree.getroot().tag}")
```



The script above demonstrates both entry points using the same XML payload. `etree.fromstring()` hands back a ready to use `Element`, while `etree.parse()` returns an `ElementTree` object that wraps the document and exposes the root through `.getroot()`:

```shell
fromstring returns: _Element
root tag: products
parse returns: _ElementTree
root tag via getroot: products
```
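
Both functions also accept the custom parser mentioned above. A minimal sketch of `recover=True` salvaging damaged XML (how much survives depends on how badly the input is broken):

```python
from lxml import etree

broken = "<products><product id='2'><name>Dark Red</name></product>"  # missing </products>

# the default strict parser raises
try:
    etree.fromstring(broken)
except etree.XMLSyntaxError as e:
    print(f"strict: {e}")

# recover=True returns whatever libxml2 could salvage instead of raising
lenient = etree.XMLParser(recover=True, ns_clean=True)
root = etree.fromstring(broken, lenient)
print(etree.tostring(root).decode())
```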



For very large XML feeds that do not fit in memory, lxml provides `etree.iterparse()`, which streams elements one at a time and lets you discard each element after processing. The streaming API is the right tool for multi-gigabyte sitemaps or data dumps, and the [official iterparse documentation](https://lxml.de/parsing.html#iterparse-and-iterwalk) covers the patterns in detail.
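
A minimal sketch of the streaming pattern, with an in-memory `BytesIO` feed standing in for a real file:

```python
from io import BytesIO
from lxml import etree

# stand-in for a huge feed; in practice pass a file path or file object
feed = BytesIO(
    b"<products>"
    b"<product id='2'><name>Dark Red Energy Potion</name></product>"
    b"<product id='3'><name>Teal Energy Potion</name></product>"
    b"</products>"
)

# stream <product> elements one at a time instead of loading the whole tree
for event, element in etree.iterparse(feed, tag="product"):
    print(element.get("id"), element.findtext("name"))
    element.clear()  # free the element's memory once processed
```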

### How to Handle XML Namespaces in lxml

XML documents that declare namespaces require extra care because lxml represents namespaced tags using Clark notation, where `{namespace_uri}localname` is the full tag name. When writing XPath against a namespaced document, pass a `namespaces` dictionary that maps short prefixes to the namespace URIs:

```python
from lxml import etree

xml = """
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>Dark Red Energy Potion</title>
  </entry>
</feed>
"""
root = etree.fromstring(xml.strip())

# the default namespace shows up in Clark notation
print(f"root tag: {root.tag}")

# bind a prefix to query namespaced elements with XPath
ns = {"atom": "http://www.w3.org/2005/Atom"}
title = root.xpath("//atom:entry/atom:title/text()", namespaces=ns)[0]
print(f"first entry title: {title}")
```



The script above parses an Atom feed and shows the typical namespace gotcha. The root tag prints in Clark notation, and the XPath query only works once the `atom` prefix is mapped to the feed's default namespace through the `namespaces` parameter:

```shell
root tag: {http://www.w3.org/2005/Atom}feed
first entry title: Dark Red Energy Potion
```



[How to Parse XMLIn this article, we'll explain about XML parsing. We'll start by defining XML files, their format and how to navigate them for data extraction.](https://scrapfly.io/blog/posts/how-to-parse-xml)



## How to Use XPath Selectors with lxml

Call the `.xpath()` method on any lxml `Element` or `ElementTree` to run an XPath 1.0 expression and get the matching nodes, attribute values, or text strings back as a Python list. lxml's XPath support is one of its biggest advantages over BeautifulSoup, which has no native XPath engine.

### What Does the .xpath() Method Return?

The `.xpath()` method returns a list whenever the expression selects nodes, even when only one match is expected. Element nodes come back as `Element` objects, while `text()` and `@attr` selections come back as plain Python strings. Expressions that evaluate to a single value rather than a node-set, such as `count()`, return a bare float or boolean instead of a list. Always index with `[0]` or guard against an empty list before unpacking node results.
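
A quick sketch of the possible return shapes:

```python
from lxml import html

tree = html.fromstring('<div class="product"><span class="price">$4.99</span></div>')

elements = tree.xpath("//span")          # list of Element objects
texts = tree.xpath("//span/text()")      # list of strings
attrs = tree.xpath("//span/@class")      # list of strings
count = tree.xpath("count(//span)")      # bare float, not a list

print(type(elements[0]).__name__, texts, attrs, count)
# HtmlElement ['$4.99'] ['price'] 1.0
```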

XPath also supports parameter substitution through keyword arguments, which keeps user input out of the expression string and avoids the XPath equivalent of injection bugs:

```python
from lxml import html

snippet = """
<div class="product">
  <h3 class="title">Dark Red Energy Potion</h3>
  <span class="price">$4.99</span>
</div>
"""
tree = html.fromstring(snippet)

target_class = "title"
title = tree.xpath("//h3[@class=$cls]/text()", cls=target_class)[0]
print(f"title via bound variable: {title}")
```



### What Are the Most Useful XPath Patterns for Extracting Data?

Most real-world scraping reduces to a handful of XPath patterns. The four below cover element selection, attribute filtering, text extraction, and attribute extraction, which together handle the majority of structured data on a typical product page:

```python
from lxml import html

page = """
<div id="products">
  <div class="product" data-id="2">
    <a href="/product/2"><h3>Dark Red Energy Potion</h3></a>
    <span class="price">$4.99</span>
    <span class="rating" data-reviews="774">4.7</span>
  </div>
  <div class="product" data-id="3">
    <a href="/product/3"><h3>Teal Energy Potion</h3></a>
    <span class="price">$4.99</span>
    <span class="rating" data-reviews="392">4.5</span>
  </div>
</div>
"""
tree = html.fromstring(page)

# 1. select all product cards by class
products = tree.xpath("//div[@class='product']")
print(f"products found: {len(products)}")

# 2. extract names from headings inside each card
names = tree.xpath("//div[@class='product']//h3/text()")
print(f"names: {names}")

# 3. extract attribute values directly with @
links = tree.xpath("//div[@class='product']/a/@href")
print(f"links: {links}")

# 4. combine selectors with predicates for precise targeting
top_rated = tree.xpath("//span[@class='rating' and number(text()) > 4.6]/text()")
print(f"top rated values: {top_rated}")
```



The script above demonstrates the four core patterns against a small product listing. Selecting by class returns `Element` objects, the `//h3/text()` form returns visible text, the `@href` form returns attribute strings, and the predicate in the last query filters by a numeric comparison on the rating value:

```shell
products found: 2
names: ['Dark Red Energy Potion', 'Teal Energy Potion']
links: ['/product/2', '/product/3']
top rated values: ['4.7']
```



[Parsing HTML with XpathIntroduction to xpath in the context of web-scraping. How to extract data from HTML documents using xpath, best practices and available tools.](https://scrapfly.io/blog/posts/parsing-html-with-xpath)



## How to Use CSS Selectors with lxml

Use the `.cssselect()` method on any lxml element to select descendants with CSS selector syntax. The method depends on the separate [cssselect](https://cssselect.readthedocs.io/en/latest/) package, which can be installed alongside lxml with `pip install lxml[cssselect]` or on its own with `pip install cssselect`.

### How Does .cssselect() Work in lxml?

Under the hood, `.cssselect()` translates the CSS expression to an equivalent XPath query and runs the query through lxml's native engine. The translation step means that anything you can express in CSS is also expressible as XPath, and any limitation of CSS as a selector language carries over to `.cssselect()` as well.
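
You can inspect that translation yourself through the cssselect package (a quick sketch, assuming cssselect is installed):

```python
from cssselect import GenericTranslator

# print the XPath expression that a CSS selector compiles down to
print(GenericTranslator().css_to_xpath(".product .title"))
# descendant-or-self::*[... ' product ' ...]/descendant-or-self::*/*[... ' title ' ...]
```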

Practical CSS patterns work the way they do in a browser, with dot notation for class, hash for id, square brackets for attributes, and whitespace for descendant combinators.

```python
from lxml import html

page = """
<div id="products">
  <div class="product featured" data-id="2">
    <h3 class="title">Dark Red Energy Potion</h3>
    <span class="price">$4.99</span>
  </div>
  <div class="product" data-id="3">
    <h3 class="title">Teal Energy Potion</h3>
    <span class="price">$4.99</span>
  </div>
</div>
"""
tree = html.fromstring(page)

# select by class
titles = [h.text for h in tree.cssselect(".product .title")]
print(f"titles: {titles}")

# select by id then descendant
prices = [s.text for s in tree.cssselect("#products .price")]
print(f"prices: {prices}")

# select by attribute presence and value
featured = tree.cssselect(".product.featured")[0]
print(f"featured id: {featured.get('data-id')}")
```



The script above runs three CSS selectors against the same product listing. The first uses descendant combinators, the second narrows by id, and the third stacks two class selectors together to find the featured product card:

```shell
titles: ['Dark Red Energy Potion', 'Teal Energy Potion']
prices: ['$4.99', '$4.99']
featured id: 2
```



### When Should You Use CSS Selectors vs XPath in lxml?

CSS selectors are concise and familiar to anyone who has written stylesheets, which makes CSS the more readable choice for plain element selection. CSS cannot extract text or attribute values directly, cannot navigate upward in the tree, and has no equivalent of XPath functions like `contains()` or `starts-with()`.

A practical rule is to use CSS for element selection when the goal is "find this card on the page" and switch to XPath the moment the goal is "extract this attribute or text" or "filter by computed conditions". Many lxml users mix the two by selecting cards with `.cssselect()` and then running `.xpath()` on the resulting elements to pull out the actual data.
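
In practice the mixed pattern looks like this (a small sketch over a single product card):

```python
from lxml import html

page = """
<div class="product" data-id="2">
  <h3 class="title">Dark Red Energy Potion</h3>
  <span class="price">$4.99</span>
</div>
"""
tree = html.fromstring(page)

# CSS finds the card, XPath pulls the values out of it
for card in tree.cssselect("div.product"):
    name = card.xpath(".//h3[@class='title']/text()")[0]
    price = card.xpath(".//span[@class='price']/text()")[0]
    print(card.get("data-id"), name, price)
```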

[Parsing HTML with CSS SelectorsIntroduction to using CSS selectors to parse web-scraped content. Best practices, available tools and common challenges by interactive examples.](https://scrapfly.io/blog/posts/parsing-html-with-css)



## How to Scrape Web Pages with Python lxml

To scrape web pages with lxml, fetch the HTML with an HTTP client such as httpx and then parse the response body with `lxml.html.fromstring()`. From there, the same XPath and CSS patterns from the previous sections extract structured data into Python dictionaries.

The walkthrough below scrapes the public sandbox at `https://web-scraping.dev/products`, which mirrors a small e-commerce listing and is safe to use in tutorials. The first step is inspecting the page in browser DevTools to see which elements hold the data.

Each product card uses a `row product` class, the title sits inside an `<a>` tag wrapped in a description block, and the price lives in a `.price` element. With those selectors in mind, the scraper below fetches every page concurrently with httpx and parses each response with lxml:

```python
import asyncio
import json
from typing import List, Dict
from httpx import AsyncClient, Response
from lxml import html

client = AsyncClient(
    headers={
        "Accept-Language": "en-US,en;q=0.9",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/120.0.0.0 Safari/537.36",
    },
    timeout=20,
)


def parse_products(response: Response) -> List[Dict]:
    """Extract product data from a single listing page."""
    tree = html.fromstring(response.text)
    products = []
    for card in tree.cssselect("div.row.product"):
        link = card.xpath(".//div[contains(@class,'description')]/h3/a/@href")[0]
        name = card.xpath(".//div[contains(@class,'description')]/h3/a/text()")[0]
        price = float(card.xpath(".//div[@class='price']/text()")[0])
        products.append({
            "product_id": int(link.rsplit("/", 1)[-1]),
            "name": name.strip(),
            "link": link,
            "price": price,
        })
    return products


async def scrape_products(base_url: str) -> List[Dict]:
    """Fetch every paginated listing page and parse each response."""
    first = await client.get(base_url)
    results = parse_products(first)

    other_pages = [client.get(f"{base_url}?page={n}") for n in range(2, 6)]
    for response in asyncio.as_completed(other_pages):
        results.extend(parse_products(await response))

    print(f"scraped {len(results)} products")
    return results


async def run():
    data = await scrape_products("https://web-scraping.dev/products")
    print(json.dumps(data[:3], indent=2))


if __name__ == "__main__":
    asyncio.run(run())
```



The scraper combines `.cssselect()` for the outer card selection with `.xpath()` for the precise text and attribute extraction, which mirrors the decision rule from the CSS section. Concurrent requests are launched with `asyncio.as_completed()` so that pages are processed as soon as they arrive, and `html.fromstring()` rather than `etree.fromstring()` is used because the response is real-world HTML that may contain minor defects:

```shell
scraped 25 products
[
  {
    "product_id": 1,
    "name": "Box of Chocolate Candy",
    "link": "https://web-scraping.dev/product/1",
    "price": 24.99
  },
  {
    "product_id": 2,
    "name": "Dark Red Energy Potion",
    "link": "https://web-scraping.dev/product/2",
    "price": 4.99
  },
  {
    "product_id": 3,
    "name": "Teal Energy Potion",
    "link": "https://web-scraping.dev/product/3",
    "price": 4.99
  }
]
```



[How to Web Scrape with HTTPX and PythonIntro to using Python's httpx library for web scraping. Proxy and user agent rotation and common web scraping challenges, tips and tricks.](https://scrapfly.io/blog/posts/web-scraping-with-python-httpx)



## How Does lxml Compare to BeautifulSoup and Other Parsers?

lxml is significantly faster than BeautifulSoup, supports XPath natively, and ships with a lenient HTML parser plus a strict XML parser in the same package. BeautifulSoup has a friendlier API for absolute beginners and is somewhat more forgiving on extreme edge cases, while [Parsel](https://pypi.org/project/parsel/) sits on top of lxml and adds a unified Selector class designed for web scraping.

The table below puts the four most common Python parsing options side by side:

| Feature | lxml | BeautifulSoup | Python ElementTree | Parsel |
|---|---|---|---|---|
| **Speed** | Very fast (C bindings) | Slow (~10x slower) | Moderate | Fast (wraps lxml) |
| **XPath support** | Native | None | None | Native |
| **CSS selectors** | Via `.cssselect()` | Via `.select()` | None | Native |
| **Handles broken HTML** | Yes (`lxml.html`) | Yes (very forgiving) | No | Yes (uses lxml) |
| **Learning curve** | Moderate | Easy | Easy | Moderate |
| **Best for** | Performance critical parsing | Quick scripts and learning | Simple, well-formed XML | Web scraping projects |

A quick decision rule covers most situations. Pick lxml when parse speed matters, when XPath is involved, or when the same project handles both HTML and XML. Pick BeautifulSoup for short scripts where readability matters more than throughput.

[How to Parse Web Data with Python and BeautifulsoupBeautifulsoup is one the most popular libraries in web scraping. In this tutorial, we'll take a hand-on overview of how to use it, what is it good for and explore a real -life web scraping example.](https://scrapfly.io/blog/posts/web-scraping-with-python-beautifulsoup)



## FAQ

**Is lxml faster than BeautifulSoup?**

Yes. lxml is roughly 10x faster than BeautifulSoup for HTML parsing because lxml runs on the libxml2 C parser while BeautifulSoup is implemented in pure Python. In community benchmarks, lxml parses a typical Wikipedia page in around 6 milliseconds while BeautifulSoup with `html.parser` takes closer to 75 milliseconds on the same input.







**Can I use lxml as a BeautifulSoup parser backend?**

Yes. Pass `"lxml"` as the second argument when constructing the soup, for example `BeautifulSoup(html, "lxml")`. The combination keeps BeautifulSoup's API and gives the underlying parse the speed of lxml. The lxml package still has to be installed separately for the backend to be available.







**Why use lxml instead of Python's built-in ElementTree?**

lxml extends the standard `xml.etree.ElementTree` API with XPath 1.0 support, CSS selectors via `.cssselect()`, a lenient HTML parser in `lxml.html`, and significantly faster parse times. The standard library version cannot parse HTML, has no XPath engine, and slows down quickly on documents in the megabyte range.












**Legal Disclaimer and Precautions**

This tutorial covers popular web scraping techniques for education. Interacting with public servers requires diligence and respect:

- Do not scrape at rates that could damage the website.
- Do not scrape data that's not available publicly.
- Do not store PII of EU citizens protected by GDPR.
- Do not repurpose *entire* public datasets which can be illegal in some countries.

Scrapfly does not offer legal advice, but these are good general rules to follow. For more, you should consult a lawyer.

 
