Python is by far the most popular language used for web scraping. It's easy to learn, has a huge community and a massive ecosystem of libraries.
In this quick overview article, we'll take a look at the top 10 web scraping packages that every web scraper should know, covering niches like:
- HTTP Connections
- Browser Automation
- HTML and JSON Parsing
- Data Quality and Validation
We use all of these libraries in our web scraping guide series if you want to see them in action.
HTTPX
HTTPX is by far the most complete and modern HTTP client package for Python. It is inspired by the popular requests library but takes things even further by supporting modern Python and HTTP features. To start, it supports both asynchronous and synchronous APIs, which makes it easy to scale httpx-powered requests while staying accessible in simple scripts:
import httpx

# use Sync API
with httpx.Client() as client:
    response = client.get("https://web-scraping.dev/product/1")
    print(response.text)

# use Async API:
import asyncio

async def run():
    async with httpx.AsyncClient() as client:
        response = await client.get("https://web-scraping.dev/product/1")
        print(response.text)

asyncio.run(run())
HTTPX also supports HTTP/2, which is much less likely to be blocked since most real human traffic uses HTTP/2 or HTTP/3:
with httpx.Client(http2=True) as client:
    response = client.get("https://web-scraping.dev/product/1")
    print(response.text)
Finally, httpx respects HTTP standards and forms requests more like a real web browser, which can drastically reduce the likelihood of being blocked. One such detail is header ordering:
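For example, here's a minimal sketch of passing a browser-like header profile to an httpx client (the header values below are illustrative assumptions, not a real browser capture):

import httpx

# browser-like headers defined in a browser-like order (values are illustrative)
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
}
with httpx.Client(http2=True, headers=headers) as client:
    response = client.get("https://web-scraping.dev/product/1")
    print(response.status_code)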
How to Web Scrape with HTTPX and Python
Intro to using Python's httpx library for web scraping. Proxy and user agent rotation and common web scraping challenges, tips and tricks.
Parsel and LXML
LXML is a fast and feature-rich HTML/XML parser for Python. It is a wrapper around the C library libxml2 and it is the fastest and most reliable way to parse HTML in Python. LXML uses powerful and flexible XPath selectors to find elements in HTML:
from lxml import html

html_text = """
<html>
<title>My Title</title>
<body>
<div class="title">Product Title</div>
</body>
</html>
"""
tree = html.fromstring(html_text)
print(tree.xpath("//title/text()"))
['My Title']
print(tree.xpath("//div[@class='title']/text()"))
['Product Title']
To take things even further, lxml is extended by Parsel, which simplifies the API and adds CSS selector support:
from parsel import Selector
html = """
<html>
<title>My Title</title>
<body>
<div class="title">Product Title</div>
</body>
</html>
"""
selector = Selector(html)
# use XPath
print(selector.xpath("//title/text()").get())
"My Title"
# or CSS selectors
print(selector.css(".title::text").get())
"Product Title"
Parsel is becoming the de facto way to parse large sets of HTML documents as it combines the speed of lxml with the ease of use of CSS and XPath selectors.
BeautifulSoup
Beautifulsoup (aka bs4) is another HTML parser library in Python, though it's much more than that. Unlike LXML and Parsel, bs4 supports parsing using accessible Pythonic methods like find and find_all:
from bs4 import BeautifulSoup
html = """
<html>
<title>My Title</title>
<body>
<div class="title">Product Title</div>
</body>
</html>
"""
soup = BeautifulSoup(html, 'lxml') # note: bs4 can use lxml under the hood which makes it super fast!
# find single element by node name
print(soup.find("title").text)
"My Title"
# find multiple using find_all and attribute matching
for element in soup.find_all("div", class_="title"):
    print(element.text)
This approach is more accessible and readable for beginners than CSS or XPath selectors, and it's often easier to maintain and develop when working with highly complex HTML documents.
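For comparison, bs4 also bundles CSS selector support through select() and select_one(), so the same lookups can be written either way. Here's a small sketch reusing the soup object from the example above:

# the same queries written with bs4's CSS selector API
print(soup.select_one("title").text)
for element in soup.select("div.title"):
    print(element.text)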
BeautifulSoup4 also comes with many utility functions like HTML formatting and editing. For example:
from bs4 import BeautifulSoup

html = """
<div><h1>The Avengers: </h1><a>End Game</a><p>is one of the most popular Marvel movies</p></div>
"""
soup = BeautifulSoup(html, "lxml")
soup.prettify()
"""
<html>
 <body>
  <div>
   <h1>
    The Avengers:
   </h1>
   <a>
    End Game
   </a>
   <p>
    is one of the most popular Marvel movies
   </p>
  </div>
 </body>
</html>
"""
As well as many other utility functions like HTML modification, selective parsing and cleanup.
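For instance, here's a minimal sketch of tree modification with decompose() and selective parsing with SoupStrainer, reusing the movie snippet from above:

from bs4 import BeautifulSoup, SoupStrainer

html = """
<div><h1>The Avengers: </h1><a>End Game</a><p>is one of the most popular Marvel movies</p></div>
"""
# modify the tree: drop the <p> node entirely
soup = BeautifulSoup(html, "lxml")
soup.p.decompose()
print(soup.div)
# <div><h1>The Avengers: </h1><a>End Game</a></div>

# selective parsing: only build <a> nodes to save time and memory on large documents
links_only = BeautifulSoup(html, "lxml", parse_only=SoupStrainer("a"))
print(links_only.get_text())
# End Game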
How to Parse Web Data with Python and Beautifulsoup
Beautifulsoup is one of the most popular libraries in web scraping. In this tutorial, we'll take a hands-on overview of how to use it, what it is good for and explore a real-life web scraping example.
JMESPath and JSONPath
JMESPath and JSONPath are two libraries that allow you to query JSON data using a query language similar to XPath. As JSON becomes more and more prevalent in the web scraping space, these libraries are becoming essential.
For example, with JMESPath we can query, reshape and flatten JSON datasets with ease:
import jmespath

data = {
    "people": [
        {
            "name": "Foo",
            "age": 33,
            "addresses": [
                "123 Main St", "California", "US"
            ],
            "primary_email": "foo@email.com",
            "secondary_email": "bar@email.com",
        },
        ...
    ]
}
jmespath.search("""
    people[].{
        first_name: name,
        age_in_years: age,
        address: addresses[0],
        state: addresses[1],
        country: addresses[2],
        emails: [primary_email, secondary_email]
    }
""", data)
[
    {
        'address': '123 Main St',
        'state': 'California',
        'country': 'US',
        'age_in_years': 33,
        'emails': ['foo@email.com', 'bar@email.com'],
        'first_name': 'Foo',
    },
    ...
]
This feature is especially useful when scraping hidden web data which can result in massive JSON datasets that are hard to navigate and digest.
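As a quick sketch of that workflow: hidden JSON is often embedded in <script> elements, which we can extract and then flatten with JMESPath. Note that the script selector and the dataset keys below are hypothetical assumptions rather than a real page structure:

import json
import httpx
import jmespath
from parsel import Selector

response = httpx.get("https://web-scraping.dev/product/1")
selector = Selector(response.text)
# the script id and JSON keys here are hypothetical examples
raw = selector.css("script#product-data::text").get()
data = json.loads(raw)
# reduce the large hidden dataset to just the fields we need
product = jmespath.search("{name: name, price: price, images: images[].url}", data)
print(product)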
Quick Intro to Parsing JSON with JMESPath in Python
Introduction to JMESPath - a JSON query language used in web scraping to parse scraped JSON datasets.
Alternatively, JSONPath focuses less on reshaping datasets and more on selecting values from complex, heavily nested JSON datasets, and supports advanced matching functions:
import jsonpath_ng.ext as jp

data = {
    "products": [
        {"name": "Apple", "price": 12.88, "tags": ["fruit", "red"]},
        {"name": "Peach", "price": 27.25, "tags": ["fruit", "yellow"]},
        {"name": "Cake", "tags": ["pastry", "sweet"]},
    ]
}
# find all product names:
query = jp.parse("products[*].name")
for match in query.find(data):
    print(match.value)

# find all products with price > 20
query = jp.parse("products[?price>20].name")
for match in query.find(data):
    print(match.value)
The killer feature of JSONPath is the recursive $.. selector, which allows you to quickly select values by key anywhere in the dataset (similar to XPath's beloved // selector):
import jsonpath_ng.ext as jp

data = {
    "products": [
        {"name": "Apple", "price": 12.88},
        {"name": "Peach", "price": 27.25},
        {"name": "Cake"},
        {"multiproduct": [{"name": "Carrot"}, {"name": "Pineapple"}]}
    ]
}
# find all "name" fields no matter where they are:
query = jp.parse("$..name")
for match in query.find(data):
    print(match.value)
Introduction to Parsing JSON with Python JSONPath
Intro to using Python and the JSONPath library, a query language for parsing JSON datasets.
Playwright and Selenium
Headless browsers are becoming very popular in web scraping as a way to deal with dynamic JavaScript and scraper blocking. Many websites use complex front-ends that generate data on demand, either through background requests or JavaScript functions. To access this data we need a JavaScript execution environment, and there's no better way than to employ a real headless web browser.
JavaScript fingerprinting is also becoming a required step to scrape many modern websites. To control a headless browser for web scraping in Python we have two popular libraries: Playwright and Selenium.
Selenium is one of the first major browser automation libraries and it can do almost anything a browser can do:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# configure browser
options = Options()
options.headless = True
options.add_argument("--window-size=1920,1080")
options.add_argument("start-maximized")
driver = webdriver.Chrome(options=options)
driver.get("https://web-scraping.dev/product/1")

# wait for page to load by waiting for reviews to appear on the page
wait = WebDriverWait(driver, timeout=5)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '#reviews')))

# click buttons:
button = wait.until(EC.element_to_be_clickable((By.ID, "load-more-reviews")))
button.click()

# return rendered HTML:
print(driver.page_source)
Web Scraping with Selenium and Python
Introduction to web scraping dynamic JavaScript-powered websites and web apps using the Selenium browser automation library and Python.
Alternatively, Playwright is a modern take on browser automation, offering the same capabilities through both asynchronous and synchronous APIs:
from playwright.sync_api import sync_playwright

# Using synchronous Python:
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context(viewport={'width': 1920, 'height': 1080})
    page = context.new_page()
    page.goto("https://web-scraping.dev/product/1")
    # wait for the element to be present in the DOM
    page.wait_for_selector('#reviews')
    # click the button
    page.wait_for_selector("#load-more-reviews", state="attached")
    page.click("#load-more-reviews")
    # print the HTML
    print(page.content())
    browser.close()
# or asynchronous
import asyncio
from playwright.async_api import async_playwright

async def run():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(viewport={'width': 1920, 'height': 1080})
        page = await context.new_page()
        await page.goto("https://web-scraping.dev/product/1")
        # wait for the element to be present in the DOM
        await page.wait_for_selector('#reviews')
        # click the button
        await page.wait_for_selector("#load-more-reviews", state="attached")
        await page.click("#load-more-reviews")
        # print the HTML
        print(await page.content())
        await browser.close()

asyncio.run(run())
Web Scraping with Playwright and Python
Playwright is the new, big browser automation toolkit - can it be used for web scraping? In this introduction article, we'll take a look at how we can use Playwright and Python to scrape dynamic websites.
Cerberus and Pydantic
An often overlooked step of the web scraping process is data quality assurance. Web scraping is a unique niche where result datasets are highly dynamic, which makes quality testing very difficult.
Fortunately, there are several tools that can help out with ensuring web scraped data quality.
For web scraping applications that require real-time data validation, Pydantic is a great choice. Pydantic allows specifying strict data models that can be used to validate and morph scraped data:
from typing import Optional
from pydantic import BaseModel, validator

# example for scraped company data
class Company(BaseModel):
    # define allowed field names and types:
    size: int
    founded: int
    revenue_currency: str
    hq_city: str
    hq_state: Optional[str]  # some fields can be optional (i.e. have value of None)

    # then we can define any custom validation functions:
    @validator("size")
    def must_be_reasonable_size(cls, v):
        if not (0 < v < 20_000):
            raise ValueError(f"unreasonable company size: {v}")
        return v

    @validator("founded")
    def must_be_reasonable_year(cls, v):
        if not (1900 < v < 2022):
            raise ValueError(f"unreasonable founding year: {v}")
        return v

    @validator("hq_state")
    def looks_like_state(cls, v):
        if len(v) != 2:
            raise ValueError(f'state should be 2 character long, got "{v}"')
        return v
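As a brief usage sketch (the field values are made up and the code assumes the pydantic v1 validator API used above), valid records are coerced into Company instances while invalid ones raise a ValidationError that pinpoints the offending fields:

from pydantic import ValidationError

scraped = {
    "size": 180,
    "founded": 2011,
    "revenue_currency": "USD",
    "hq_city": "Boston",
    "hq_state": "MA",
}
try:
    company = Company(**scraped)  # coerces types and runs the validators above
    print(company.dict())
except ValidationError as e:
    print(e.json())  # lists every field that failed validation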
For web scrapers that require more flexibility and less strict data validation, Cerberus is a great choice. Cerberus is a schema validation library that allows specifying data models using a simple dictionary syntax:
from cerberus import Validator

def validate_name(field, value, error):
    """function for validating"""
    if "." in value:
        error(field, f"contains a dot character: {value}")
    if value.lower() in ["classified", "redacted", "missing"]:
        error(field, f"redacted value: {value}")
    if "<" in value.lower() and ">" in value.lower():
        error(field, f"contains html nodes: {value}")

schema = {
    "name": {
        # name should be a string
        "type": "string",
        # between 2 and 20 characters
        "minlength": 2,
        "maxlength": 20,
        # extra validation
        "check_with": validate_name,
    },
}
v = Validator(schema)
v.validate({"name": "H."})
print(v.errors)
# {'name': ['contains a dot character: H.']}
The killer feature of Cerberus is the ease of use and ability to define flexible schemas that can work with highly dynamic web-scraped data. At Scrapfly, we use Cerberus to test all of our example scrapers.
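As a small sketch of that flexibility (the schema and item below are illustrative), rules like nullable and the allow_unknown option let a schema tolerate the gaps and extra fields that dynamic scraped data tends to have:

from cerberus import Validator

schema = {
    "name": {"type": "string", "required": True},
    "price": {"type": "float", "nullable": True},  # some pages list no price at all
}
# allow_unknown lets items carry extra, unmodeled fields without failing validation
v = Validator(schema, allow_unknown=True)
print(v.validate({"name": "Box of Chocolate Candy", "price": None, "currency": "USD"}))
# True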
How to Ensure Web Scraped Data Quality
Ensuring consistent web scraped data quality can be a difficult and exhausting task. In this article we'll be taking a look at two popular tools in Python - Cerberus and Pydantic - and how we can use them to validate data.
Scrapfly Python SDK
ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale.
- Anti-bot protection bypass - scrape web pages without blocking!
- Rotating residential proxies - prevent IP address and geographic blocks.
- JavaScript rendering - scrape dynamic web pages through cloud browsers.
- Full browser automation - control browsers to scroll, input and click on objects.
- Format conversion - scrape as HTML, JSON, Text, or Markdown.
- Python and Typescript SDKs, as well as Scrapy and no-code tool integrations.
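For example, here's a minimal sketch of enabling some of these features through ScrapeConfig (the parameter names reflect common Scrapfly SDK usage and should be confirmed against the current documentation):

from scrapfly import ScrapeConfig, ScrapflyClient

scrapfly = ScrapflyClient(key="YOUR KEY")
result = scrapfly.scrape(ScrapeConfig(
    url="https://web-scraping.dev/product/1",
    asp=True,        # enable anti-bot protection bypass
    country="us",    # route the request through a specific proxy country
    render_js=True,  # render the page in a cloud browser
))
print(result.content)  # the resulting page HTML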
Scraped HTML can also be parsed directly using the .selector attribute of a Scrapfly result:
from scrapfly import ScrapeConfig, ScrapflyClient
scrapfly = ScrapflyClient(key="YOUR KEY")
result = scrapfly.scrape(ScrapeConfig("https://httpbin.dev/html"))
# scrapfly results build a parsel.Selector automatically:
page_title = result.selector.xpath("//h1/text()").get()
"Herman Melville - Moby-Dick"
Summary
In this article, we've taken a look at the top 10 Python packages for web scraping, covering several steps of the web scraping process.
For web scraping connections, we've covered httpx, a brilliant, feature-rich HTTP client, as well as selenium and playwright, which drive a full headless web browser.
For data parsing, we've looked at the three most popular HTML parsing tools: lxml, parsel and beautifulsoup, and for JSON parsing, jmespath and jsonpath, which work great in web scraping environments.
Finally, for the last step of the web scraping process, we've looked at two different data validation tools: the strict pydantic and the flexible cerberus.