Python is by far the most popular language used for web scraping. It's easy to learn, has a huge community and a massive ecosystem of libraries.
In this quick overview article, we'll take a look at the top 10 web scraping packages that every web scraper should know, covering niches such as HTTP clients, HTML and JSON parsing, browser automation and data validation.
HTTPX
HTTPX is by far the most complete and modern HTTP client package for Python. It is inspired by the popular requests library but it takes things even further by supporting modern Python and HTTP features.
To start, it supports both asynchronous and synchronous Python APIs. This makes it easy to scale httpx-powered requests while also staying accessible in simple scripts:
import httpx

# use Sync API
with httpx.Client() as client:
    response = client.get("https://web-scraping.dev/product/1")
    print(response.text)

# use Async API:
import asyncio

async def run():
    async with httpx.AsyncClient() as client:
        response = await client.get("https://web-scraping.dev/product/1")
        print(response.text)

asyncio.run(run())
HTTPX also supports HTTP/2, which is much less likely to be blocked since most real human traffic uses HTTP/2 or HTTP/3:
with httpx.Client(http2=True) as client:
    response = client.get("https://web-scraping.dev/product/1")
    print(response.text)
Finally, httpx respects HTTP standards and forms requests more like a real web browser does, which can drastically reduce the likelihood of being blocked. One example is respecting header ordering:
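Here's a minimal sketch (the header values are illustrative and abbreviated): httpx sends custom headers in the order they are defined, so they can be arranged the way a real browser sends them:

import httpx

# httpx preserves the insertion order of custom headers,
# so they can be arranged to mimic a real browser's header order
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
}
response = httpx.get("https://web-scraping.dev/product/1", headers=headers)
print(response.status_code)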
Parsel and LXML
LXML is a fast and feature-rich HTML/XML parser for Python. It is a wrapper around the C library libxml2 and it is the fastest and most reliable way to parse HTML in Python.
LXML uses powerful and flexible XPath Selectors to find elements in HTML:
from lxml import html

# note: the variable is named html_text to avoid shadowing the imported html module
html_text = """
<html>
    <title>My Title</title>
    <body>
        <div class="title">Product Title</div>
    </body>
</html>
"""
tree = html.fromstring(html_text)
print(tree.xpath("//title/text()"))
['My Title']

print(tree.xpath("//div[@class='title']/text()"))
['Product Title']
To take things even further, lxml is extended by Parsel, which simplifies the API and adds CSS selector support:
from parsel import Selector

html = """
<html>
    <title>My Title</title>
    <body>
        <div class="title">Product Title</div>
    </body>
</html>
"""
selector = Selector(html)
# use XPath
print(selector.xpath("//title/text()").get())
"My Title"

# or CSS selectors
print(selector.css(".title::text").get())
"Product Title"
Parsel is becoming the de facto way to parse large sets of HTML documents as it combines the speed of lxml with the ease of use of CSS and XPath selectors.
BeautifulSoup
BeautifulSoup (aka bs4) is another HTML parser library in Python, though it's much more than just a parser.
Unlike LXML and Parsel, bs4 supports parsing through accessible, Pythonic methods like find and find_all:
from bs4 import BeautifulSoup

html = """
<html>
    <title>My Title</title>
    <body>
        <div class="title">Product Title</div>
    </body>
</html>
"""
soup = BeautifulSoup(html, "lxml")  # note: bs4 can use lxml under the hood which makes it super fast!

# find a single element by node name
print(soup.find("title").text)
"My Title"

# find multiple elements using find_all and attribute matching
for element in soup.find_all("div", class_="title"):
    print(element.text)
This approach is more accessible and readable for beginners than CSS or XPath selectors, and it's often easier to maintain and develop when working with highly complex HTML documents.
BeautifulSoup4 also comes with many utility functions for HTML formatting and editing. For example:
from bs4 import BeautifulSoup

html = """
<div><h1>The Avengers: </h1><a>End Game</a><p>is one of the most popular Marvel movies</p></div>
"""
soup = BeautifulSoup(html, "lxml")
print(soup.prettify())
"""
<html>
 <body>
  <div>
   <h1>
    The Avengers:
   </h1>
   <a>
    End Game
   </a>
   <p>
    is one of the most popular Marvel movies
   </p>
  </div>
 </body>
</html>
"""
It also includes many other utilities for HTML modification, selective parsing and cleanup.
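Here's a minimal sketch of these utilities (reusing the example HTML above): SoupStrainer restricts parsing to selected nodes, and parsed elements can be edited or removed in place:

from bs4 import BeautifulSoup, SoupStrainer

html = """
<div><h1>The Avengers: </h1><a>End Game</a><p>is one of the most popular Marvel movies</p></div>
"""

# selective parsing: only parse <a> nodes to speed up large documents
links = BeautifulSoup(html, "lxml", parse_only=SoupStrainer("a"))
print(links.get_text())  # End Game

# modification: edit a node's text and remove the <p> element entirely
soup = BeautifulSoup(html, "lxml")
soup.a.string = "Endgame"
soup.p.decompose()
print(soup.div)
# <div><h1>The Avengers: </h1><a>Endgame</a></div>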
JMESPath and JSONPath
JMESPath and JSONPath are two libraries that allow you to query JSON data using a query language similar to XPath.
As JSON becomes more and more common in the web scraping space, these libraries are becoming essential.
For example, with JMESPath we can query, reshape and flatten JSON datasets with ease:
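Here's a minimal sketch of JMESPath's search function (the dataset is made up for illustration) querying and reshaping a nested dataset into a flat list of objects:

import jmespath

data = {
    "products": [
        {"name": "Apple", "pricing": {"price": 12.88, "currency": "USD"}, "tags": ["fruit", "red"]},
        {"name": "Peach", "pricing": {"price": 27.25, "currency": "USD"}, "tags": ["fruit", "yellow"]},
    ]
}

# project every product into a new, flattened shape:
result = jmespath.search(
    "products[].{name: name, price: pricing.price, first_tag: tags[0]}",
    data,
)
print(result)
# [{'name': 'Apple', 'price': 12.88, 'first_tag': 'fruit'},
#  {'name': 'Peach', 'price': 27.25, 'first_tag': 'fruit'}]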
This feature is especially useful when scraping hidden web data which can result in massive JSON datasets that are hard to navigate and digest.
Alternatively, JSONPath focuses less on reshaping datasets and more on selecting values from complex, heavily nested JSON datasets, and it supports advanced matching functions:
import jsonpath_ng.ext as jp

data = {
    "products": [
        {"name": "Apple", "price": 12.88, "tags": ["fruit", "red"]},
        {"name": "Peach", "price": 27.25, "tags": ["fruit", "yellow"]},
        {"name": "Cake", "tags": ["pastry", "sweet"]},
    ]
}

# find all product names:
query = jp.parse("products[*].name")
for match in query.find(data):
    print(match.value)

# find all products with price > 20
query = jp.parse("products[?price>20].name")
for match in query.find(data):
    print(match.value)
The killer feature of JSONPath is the recursive $.. selector which lets you quickly select values by key anywhere in the dataset (similar to XPath's beloved // selector):
import jsonpath_ng.ext as jp

data = {
    "products": [
        {"name": "Apple", "price": 12.88},
        {"name": "Peach", "price": 27.25},
        {"name": "Cake"},
        {"multiproduct": [{"name": "Carrot"}, {"name": "Pineapple"}]},
    ]
}

# find all "name" fields no matter where they are:
query = jp.parse("$..name")
for match in query.find(data):
    print(match.value)
Playwright and Selenium
Headless browsers are becoming very popular in web scraping as a way to deal with dynamic, JavaScript-heavy pages and scraper blocking.
Many websites use complex front-ends that generate data on demand, either through background requests or JavaScript functions. To access this data we need a JavaScript execution environment, and there's no better way than to employ a real headless web browser.
So, to control a headless browser for web scraping in Python we have two popular libraries: Playwright and Selenium.
Selenium is one of the first major browser automation libraries and it can do almost anything a browser can do:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# configure browser
options = Options()
options.add_argument("--headless")
options.add_argument("--window-size=1920,1080")
options.add_argument("start-maximized")

driver = webdriver.Chrome(options=options)
driver.get("https://web-scraping.dev/product/1")

# wait for the page to load by waiting for reviews to appear on the page
wait = WebDriverWait(driver, timeout=5)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "#reviews")))

# click buttons:
button = wait.until(EC.element_to_be_clickable((By.ID, "load-more-reviews")))
button.click()

# return rendered HTML:
print(driver.page_source)
Alternatively, Playwright is a modern take on browser automation offering all of these capabilities in modern asynchronous and synchronous APIs:
from playwright.sync_api import sync_playwright

# Using synchronous Python:
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context(viewport={"width": 1920, "height": 1080})
    page = context.new_page()
    page.goto("https://web-scraping.dev/product/1")
    # wait for the element to be present in the DOM
    page.wait_for_selector("#reviews")
    # click the button
    page.wait_for_selector("#load-more-reviews", state="attached")
    page.click("#load-more-reviews")
    # print the HTML
    print(page.content())
    browser.close()

# or asynchronous
import asyncio
from playwright.async_api import async_playwright

async def run():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(viewport={"width": 1920, "height": 1080})
        page = await context.new_page()
        await page.goto("https://web-scraping.dev/product/1")
        # wait for the element to be present in the DOM
        await page.wait_for_selector("#reviews")
        # click the button
        await page.wait_for_selector("#load-more-reviews", state="attached")
        await page.click("#load-more-reviews")
        # print the HTML
        print(await page.content())
        await browser.close()

asyncio.run(run())
Cerberus and Pydantic
An often overlooked step of the web scraping process is data quality assurance. Web scraping is a unique niche where result datasets are highly dynamic, which makes quality testing very difficult.
Fortunately, there are several tools that can help ensure web scraped data quality.
For web scraping applications that require real-time data validation, Pydantic is a great choice. Pydantic allows specifying strict data models that can be used to validate and morph scraped data:
from typing import Optional
from pydantic import BaseModel, validator

# example for scraped company data
class Company(BaseModel):
    # define allowed field names and types:
    size: int
    founded: int
    revenue_currency: str
    hq_city: str
    hq_state: Optional[str]  # some fields can be optional (i.e. have a value of None)

    # then we can define any custom validation functions:
    @validator("size")
    def must_be_reasonable_size(cls, v):
        if not (0 < v < 20_000):
            raise ValueError(f"unreasonable company size: {v}")
        return v

    @validator("founded")
    def must_be_reasonable_year(cls, v):
        if not (1900 < v < 2022):
            raise ValueError(f"unreasonable founded date: {v}")
        return v

    @validator("hq_state")
    def looks_like_state(cls, v):
        if len(v) != 2:
            raise ValueError(f'state should be 2 characters long, got "{v}"')
        return v
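As a quick usage sketch (with made-up values), passing scraped results through the model validates and coerces each field, raising a ValidationError that lists every problem found:

from pydantic import ValidationError

try:
    company = Company(
        size=30_000,          # fails the company size check
        founded="1995",       # strings are coerced to int automatically
        revenue_currency="USD",
        hq_city="Chicago",
        hq_state="Illinois",  # fails the 2-character state check
    )
except ValidationError as e:
    print(e)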
For web scrapers that require more flexibility and less strict data validation, Cerberus is a great choice. Cerberus is a schema validation library that lets you specify data models using a simple dictionary syntax:
from cerberus import Validator

def validate_name(field, value, error):
    """function for validating"""
    if "." in value:
        error(field, f"contains a dot character: {value}")
    if value.lower() in ["classified", "redacted", "missing"]:
        error(field, f"redacted value: {value}")
    if "<" in value.lower() and ">" in value.lower():
        error(field, f"contains html nodes: {value}")

schema = {
    "name": {
        # name should be a string
        "type": "string",
        # between 2 and 20 characters
        "minlength": 2,
        "maxlength": 20,
        # extra validation
        "check_with": validate_name,
    },
}
v = Validator(schema)
v.validate({"name": "H."})
print(v.errors)
# {'name': ['contains a dot character: H.']}
The killer feature of Cerberus is the ease of use and ability to define flexible schemas that can work with highly dynamic web-scraped data. At Scrapfly, we use Cerberus to test all of our example scrapers.
The Scrapfly Python SDK also integrates some of these tools directly. For example, HTML can be parsed using the .selector attribute of a Scrapfly result:
from scrapfly import ScrapeConfig, ScrapflyClient
scrapfly = ScrapflyClient(key="YOUR KEY")
result = scrapfly.scrape(ScrapeConfig("https://httpbin.dev/html"))
# the scrapfly result builds a parsel.Selector automatically:
page_title = result.selector.xpath("//h1/text()").get()
"Herman Melville - Moby-Dick"
Summary
In this article, we've covered the top 10 Python packages for web scraping that cover several steps of the web scraping process.
For web scraping connections, we've covered httpx which is a brilliant, feature-rich HTTP client. On the other hand, there's selenium and playwright which use a whole headless web browser.
For data parsing, we've taken a look at the three most popular HTML parsing tools: lxml, parsel and beautifulsoup. For JSON parsing, we covered jmespath and jsonpath, which work great in web scraping environments.
Finally, for the final step of a web scraping process, we've looked at two different data validation tools: strict pydantic and the flexible cerberus.