Top 10 Web Scraping Packages for Python

Python is by far the most popular language used for web scraping. It's easy to learn, has a huge community and a massive ecosystem of libraries.

In this quick overview, we'll take a look at the top 10 web scraping packages that every web scraper should know. They cover several niches:

  • HTTP Connections
  • Browser Automation
  • HTML and JSON Parsing
  • Data Quality and Validation

We use all of these libraries in our web scraping guide series if you want to see them in action.

HTTPX

HTTPX is one of the most complete and modern HTTP client packages for Python. It is inspired by the popular requests library but goes further by supporting asynchronous code and newer HTTP features such as HTTP/2.

To start, it supports both asynchronous and synchronous Python APIs. This makes it easy to scale httpx-powered requests while also staying accessible in simple scripts:

import httpx

# use Sync API
with httpx.Client() as client:
    response = client.get("https://web-scraping.dev/product/1")
    print(response.text)

# use Async API:
import asyncio

async def run():
    async with httpx.AsyncClient() as client:
        response = await client.get("https://web-scraping.dev/product/1")
        print(response.text)

asyncio.run(run())

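HTTPX also supports HTTP/2, which is less likely to be blocked because most real browser traffic uses HTTP/2 or HTTP/3. Note that HTTP/2 support is an optional extra that has to be installed with pip install "httpx[http2]":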

with httpx.Client(http2=True) as client:
    response = client.get("https://web-scraping.dev/product/1")
    print(response.text)

Finally, httpx respects HTTP standards and forms requests more similarly to a real web browser, for example by respecting header ordering. This can drastically reduce the likelihood of being blocked.

How to Web Scrape with HTTPX and Python

See our full introduction to HTTPX use in web scraping and solutions to commonly encountered issues like retries and blocking.

Parsel and LXML

LXML is a fast and feature-rich HTML/XML parser for Python. It is a wrapper around the C library libxml2 and is one of the fastest and most reliable ways to parse HTML in Python.

LXML uses powerful and flexible XPath Selectors to find elements in HTML:

from lxml import html

# named html_text so it doesn't shadow the imported lxml `html` module
html_text = """
<html>
  <title>My Title</title>
  <body>
    <div class="title">Product Title</div>
  </body>
</html>
"""
tree = html.fromstring(html_text)
print(tree.xpath("//title/text()"))
# ['My Title']
print(tree.xpath("//div[@class='title']/text()"))
# ['Product Title']

To take things even further, lxml is extended by Parsel, which simplifies the API and adds CSS selector support:

from parsel import Selector

html = """
<html>
  <title>My Title</title>
  <body>
    <div class="title">Product Title</div>
  </body>
</html>
"""

selector = Selector(text=html)
# use XPath
print(selector.xpath("//title/text()").get())
# My Title
# or CSS selectors
print(selector.css(".title::text").get())
# Product Title

Parsel is becoming the de facto way to parse large sets of HTML documents as it combines the speed of lxml with the ease of use of CSS and XPath selectors.

BeautifulSoup

BeautifulSoup (aka bs4) is another HTML parsing library for Python, though it's much more than just a parser.

Unlike LXML and Parsel, bs4 supports parsing using accessible Pythonic methods like find and find_all:

from bs4 import BeautifulSoup

html = """
<html>
  <title>My Title</title>
  <body>
    <div class="title">Product Title</div>
  </body>
</html>
"""

soup = BeautifulSoup(html, "lxml")  # note: bs4 can use lxml under the hood, which makes it fast
# find single element by node name
print(soup.find("title").text)
# My Title
# find multiple using find_all and attribute matching
for element in soup.find_all("div", class_="title"):
    print(element.text)
# Product Title

This approach is more accessible and more readable than CSS or XPath selectors for beginners and often easier to maintain and develop when working with highly complex HTML documents.

BeautifulSoup4 also comes with many utility functions like HTML formatting and editing. For example:

from bs4 import BeautifulSoup

html = """
<div><h1>The Avengers: </h1><a>End Game</a><p>is one of the most popular Marvel movies</p></div>
"""
soup = BeautifulSoup(html, "lxml")  # name the parser explicitly to avoid a bs4 warning
print(soup.prettify())
"""
<html>
 <body>
  <div>
   <h1>
    The Avengers:
   </h1>
   <a>
    End Game
   </a>
   <p>
    is one of the most popular Marvel movies
   </p>
  </div>
 </body>
</html>
"""

It also offers many other utilities such as HTML modification, selective parsing and cleanup.
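
As a rough sketch of what cleanup and selective parsing can look like, the example below removes script tags with decompose() before extracting text, and then uses SoupStrainer to parse only the link elements. The HTML snippet is made up for illustration, and the built-in html.parser is used here so the example doesn't depend on lxml being installed.

```python
from bs4 import BeautifulSoup, SoupStrainer

# hypothetical page fragment with a tracking script we don't want in the text
html = '<div><script>track()</script><h1>Title</h1><a href="/a">A</a><a href="/b">B</a></div>'

# cleanup: remove unwanted nodes from the tree before extracting text
soup = BeautifulSoup(html, "html.parser")
for script in soup.find_all("script"):
    script.decompose()
text = soup.get_text(" ", strip=True)
print(text)

# selective parsing: only build a tree for <a> elements
only_links = BeautifulSoup(html, "html.parser", parse_only=SoupStrainer("a"))
hrefs = [a["href"] for a in only_links.find_all("a")]
print(hrefs)
```

Selective parsing can save time and memory on large documents when only a small part of the page is needed.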

Web Scraping with Python and BeautifulSoup

See our complete introduction to web scraping with beautifulsoup, all of its utilities and an example project.

JMESPath and JSONPath

JMESPath and JSONPath are two JSON query languages, available in Python as the jmespath and jsonpath-ng packages, that let you query JSON data in a way similar to XPath.

As more and more websites deliver their data as JSON, these libraries are becoming essential in web scraping.

For example, with JMESPath we can query, reshape and flatten JSON datasets with ease:

import jmespath

data = {
  "people": [
    {
      "name": "Foo", 
      "age": 33, 
      "addresses": [
        "123 Main St", "California", "US"
      ],
      "primary_email": "foo@email.com",
      "secondary_email": "bar@email.com",
    },
    ...
  ]
}
jmespath.search("""
  people[].{
    first_name: name, 
    age_in_years: age,
    address: addresses[0],
    state: addresses[1],
    country: addresses[2],
    emails: [primary_email, secondary_email]
  }
""", data)
[
  {
    'address': '123 Main St',
    'state': 'California',
    'country': 'US',
    'age_in_years': 33,
    'emails': ['foo@email.com', 'bar@email.com'],
    'first_name': 'Foo',
  },
  ...
]

This feature is especially useful when scraping hidden web data which can result in massive JSON datasets that are hard to navigate and digest.

Quick Intro to Parsing JSON with JMESPath in Python

See our full introduction to JMESPath and an example web scraping project with it.

Alternatively, JSONPath focuses less on reshaping datasets and more on selecting values from complex, heavily nested JSON datasets, and it supports advanced matching functions:

import jsonpath_ng.ext as jp

data = {
    "products": [
        {"name": "Apple", "price": 12.88, "tags": ["fruit", "red"]},
        {"name": "Peach", "price": 27.25, "tags": ["fruit", "yellow"]},
        {"name": "Cake", "tags": ["pastry", "sweet"]},
    ]
}

# find all product names:
query = jp.parse("products[*].name")
for match in query.find(data):
    print(match.value)

# find all products with price > 20
query = jp.parse("products[?price>20].name")
for match in query.find(data):
    print(match.value)

The killer feature of JSONPath is the recursive $.. selector, which allows us to quickly select values by key anywhere in the dataset (similar to XPath's beloved // selector):

import jsonpath_ng.ext as jp

data = {
    "products": [
        {"name": "Apple", "price": 12.88},
        {"name": "Peach", "price": 27.25},
        {"name": "Cake"},
        {"multiproduct": [{"name": "Carrot"}, {"name": "Pineapple"}]}
    ]
}

# find all "name" fields no matter where they are:
query = jp.parse("$..name")
for match in query.find(data):
    print(match.value)

Quick Intro to Parsing JSON with JSONPath in Python

See our full introduction to JSONPath and common challenges it helps to solve through an example project.

Playwright and Selenium

Headless browsers are becoming very popular in web scraping as a way to deal with dynamic JavaScript and scraper blocking.

Many websites use complex front ends that generate data on demand, either through background requests or JavaScript functions. To access this data we need a JavaScript execution environment, and there's no better way than to employ a real headless web browser.

Javascript fingerprinting is also becoming a required step to scrape many modern websites.

So, to control a headless browser for web scraping in Python we have two popular libraries: Playwright and Selenium.

Selenium is one of the first major browser automation libraries and it can do almost anything a browser can do:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# configure browser
options = Options()
# the old `options.headless = True` was deprecated and later removed in Selenium 4
options.add_argument("--headless=new")
options.add_argument("--window-size=1920,1080")
options.add_argument("--start-maximized")

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://web-scraping.dev/product/1")
    # wait for page to load by waiting for reviews to appear on the page
    wait = WebDriverWait(driver, timeout=5)
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "#reviews")))
    # click buttons:
    button = wait.until(EC.element_to_be_clickable((By.ID, "load-more-reviews")))
    button.click()

    # return rendered HTML:
    print(driver.page_source)
finally:
    # always close the browser, even if a wait times out
    driver.quit()

Web Scraping with Selenium and Python Tutorial + Example Project

For more, see our full introduction to web scraping with Selenium through an example Python project.

Alternatively, Playwright is a modern take on browser automation offering all of these capabilities in modern asynchronous and synchronous APIs:

from playwright.sync_api import sync_playwright

# Using synchronous Python:
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context(viewport={'width': 1920, 'height': 1080})
    page = context.new_page()
    page.goto("https://web-scraping.dev/product/1")

    # wait for the element to be present in the DOM
    page.wait_for_selector('#reviews')

    # click the button
    page.wait_for_selector("#load-more-reviews", state="attached")
    page.click("#load-more-reviews")

    # print the HTML
    print(page.content())

    browser.close()

# or asynchronous
import asyncio
from playwright.async_api import async_playwright

async def run():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(viewport={'width': 1920, 'height': 1080})
        page = await context.new_page()
        await page.goto("https://web-scraping.dev/product/1")

        # wait for the element to be present in the DOM
        await page.wait_for_selector('#reviews')

        # click the button
        await page.wait_for_selector("#load-more-reviews", state="attached")
        await page.click("#load-more-reviews")

        # print the HTML
        print(await page.content())

        await browser.close()

asyncio.run(run())

Web Scraping with Playwright and Python

See our full introduction to web scraping using Playwright through an example project for more.

Cerberus and Pydantic

An often overlooked part of web scraping is data quality assurance. Web scraping is a unique niche where the resulting datasets are highly dynamic, which makes quality testing very difficult.

Fortunately, there are several tools that can help out with ensuring web scraped data quality.

For web scraping applications that require real-time data validation, Pydantic is a great choice. Pydantic allows specifying strict data models that can be used to validate and morph scraped data:

from typing import Optional
from pydantic import BaseModel, field_validator  # Pydantic v2 (v1 used `validator` instead)

# example for scraped company data
class Company(BaseModel):
    # define allowed field names and types:
    size: int
    founded: int
    revenue_currency: str
    hq_city: str
    hq_state: Optional[str] = None  # some fields can be optional (i.e. have value of None)

    # then we can define any custom validation functions:
    @field_validator("size")
    @classmethod
    def must_be_reasonable_size(cls, v):
        if not (0 < v < 20_000):
            raise ValueError(f"unreasonable company size: {v}")
        return v

    @field_validator("founded")
    @classmethod
    def must_be_reasonable_year(cls, v):
        if not (1900 < v < 2022):
            raise ValueError(f"unreasonable founding year: {v}")
        return v

    @field_validator("hq_state")
    @classmethod
    def looks_like_state(cls, v):
        # the state is optional, so only check the length when a value is present
        if v is not None and len(v) != 2:
            raise ValueError(f'state should be 2 characters long, got "{v}"')
        return v
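
To show how such a model fits into a scraper, here is a minimal sketch that validates scraped rows one by one and sets aside the ones that fail. To keep it self-contained it defines a simplified version of the Company model that expresses the range checks with Field constraints rather than custom validator functions; the sample rows are made up for illustration.

```python
from typing import Optional
from pydantic import BaseModel, Field, ValidationError


# simplified, hypothetical variant of the company model
class Company(BaseModel):
    size: int = Field(gt=0, lt=20_000)
    founded: int = Field(gt=1900, lt=2022)
    hq_city: str
    hq_state: Optional[str] = None


# made-up scraped rows: the second one has an impossible company size
scraped = [
    {"size": "150", "founded": 1999, "hq_city": "Austin", "hq_state": "TX"},
    {"size": -5, "founded": 2010, "hq_city": "Boston"},
]

valid, rejected = [], []
for row in scraped:
    try:
        # numeric strings like "150" are converted to int on the way in
        valid.append(Company(**row))
    except ValidationError as e:
        # keep the row together with the names of the fields that failed
        rejected.append((row, [error["loc"][0] for error in e.errors()]))

print(valid)
print(rejected)
```

Collecting rejected rows instead of crashing on the first one makes it easier to spot when a website changes its layout and a parser starts returning bad values.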

For web scrapers that require more flexibility and less strict data validation, Cerberus is a great choice. Cerberus is a schema validation library that allows specifying data models using a simple dictionary syntax:

from cerberus import Validator


def validate_name(field, value, error):
    """function for validating"""
    if "." in value:
        error(field, f"contains a dot character: {value}")
    if value.lower() in ["classified", "redacted", "missing"]:
        error(field, f"redacted value: {value}")
    if "<" in value.lower() and ">" in value.lower():
        error(field, f"contains html nodes: {value}")


schema = {
    "name": {
        # name should be a string
        "type": "string",
        # between 2 and 20 characters
        "minlength": 2,
        "maxlength": 20,
        # extra validation
        "check_with": validate_name,
    },
}

v = Validator(schema)

v.validate({"name": "H."})
print(v.errors)
# {'name': ['contains a dot character: H.']}

The killer feature of Cerberus is the ease of use and ability to define flexible schemas that can work with highly dynamic web-scraped data. At Scrapfly, we use Cerberus to test all of our example scrapers.

How to Ensure Web Scrapped Data Quality

See our complete introduction to data validation, which covers both Pydantic and Cerberus use in web scraping.

Scrapfly Python SDK

ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale.

For example, HTML can be parsed directly using the .selector attribute of a Scrapfly result, which is a Parsel selector:

from scrapfly import ScrapeConfig, ScrapflyClient
scrapfly = ScrapflyClient(key="YOUR KEY")
result = scrapfly.scrape(ScrapeConfig("https://httpbin.dev/html"))

# the scrapfly result builds a parsel.Selector automatically:
page_title = result.selector.xpath("//h1/text()").get()
# "Herman Melville - Moby-Dick"

Summary

In this article, we've covered the top 10 Python packages for web scraping that cover several steps of the web scraping process.

For web scraping connections, we've covered httpx, a feature-rich HTTP client, as well as selenium and playwright, which drive a whole headless web browser instead.

For data parsing, we've taken a look at three of the most popular HTML parsing tools: lxml, parsel and beautifulsoup. For JSON parsing, we've covered jmespath and jsonpath, which work great in web scraping environments.

Finally, for the last step of the web scraping process, we've looked at two data validation tools: the strict pydantic and the flexible cerberus.
