Python is by far the most popular language used for web scraping. It's easy to learn, has a huge community and a massive ecosystem of libraries.
In this quick overview article, we'll take a look at the top 10 web scraping packages that every web scraper should know, covering niches like:
- HTTP Connections
- Browser Automation
- HTML and JSON Parsing
- Data Quality and Validation
We use all of these libraries in our web scraping guide series if you want to see them in action.
HTTPX
HTTPX is by far the most complete and modern HTTP client package for Python. It is inspired by the popular requests library but takes things even further by supporting modern Python and HTTP features. To start, it supports both asynchronous and synchronous APIs, which makes it easy to scale httpx-powered requests while staying accessible in simple scripts:
import httpx

# use Sync API
with httpx.Client() as client:
    response = client.get("https://web-scraping.dev/product/1")
    print(response.text)

# use Async API:
import asyncio

async def run():
    async with httpx.AsyncClient() as client:
        response = await client.get("https://web-scraping.dev/product/1")
        print(response.text)

asyncio.run(run())
HTTPX also supports HTTP/2, which is much less likely to be blocked since most real human traffic uses HTTP/2 or HTTP/3:
with httpx.Client(http2=True) as client:
    response = client.get("https://web-scraping.dev/product/1")
    print(response.text)
Finally, httpx respects HTTP standards and forms requests more like a real web browser, which can drastically reduce the likelihood of being blocked. One such detail is header ordering:
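For example, here's a minimal sketch of passing a browser-like header profile to an httpx client (the header values below are illustrative assumptions, not a real browser capture):

import httpx

# browser-like headers defined in a browser-like order (values are illustrative)
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
}
with httpx.Client(http2=True, headers=headers) as client:
    response = client.get("https://web-scraping.dev/product/1")
    print(response.status_code)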
How to Web Scrape with HTTPX and Python
Intro to using Python's httpx library for web scraping. Proxy and user agent rotation and common web scraping challenges, tips and tricks.
Parsel and LXML
LXML is a fast and feature-rich HTML/XML parser for Python. It is a wrapper around the C library libxml2 and it is the fastest and most reliable way to parse HTML in Python. LXML uses powerful and flexible XPath selectors to find elements in HTML:
from lxml import html

html_text = """
<html>
<title>My Title</title>
<body>
<div class="title">Product Title</div>
</body>
</html>
"""
tree = html.fromstring(html_text)
print(tree.xpath("//title/text()"))
['My Title']
print(tree.xpath("//div[@class='title']/text()"))
['Product Title']
To take things even further, lxml is extended by Parsel, which simplifies the API and adds CSS selector support:
from parsel import Selector
html = """
<html>
<title>My Title</title>
<body>
<div class="title">Product Title</div>
</body>
</html>
"""
selector = Selector(html)
# use XPath
print(selector.xpath("//title/text()").get())
"My Title"
# or CSS selectors
print(selector.css(".title::text").get())
"Product Title"
Parsel is becoming the de facto way to parse large sets of HTML documents as it combines the speed of lxml with the ease of use of CSS and XPath selectors.
BeautifulSoup
Beautifulsoup (aka bs4) is another HTML parser library in Python, though it's much more than that. Unlike LXML and Parsel, bs4 supports parsing using accessible Pythonic methods like find and find_all:
from bs4 import BeautifulSoup
html = """
<html>
<title>My Title</title>
<body>
<div class="title">Product Title</div>
</body>
</html>
"""
soup = BeautifulSoup(html, 'lxml') # note: bs4 can use lxml under the hood which makes it super fast!
# find single element by node name
print(soup.find("title").text)
"My Title"
# find multiple using find_all and attribute matching
for element in soup.find_all("div", class_="title"):
    print(element.text)
This approach is more accessible and readable for beginners than CSS or XPath selectors, and it's often easier to maintain and develop when working with highly complex HTML documents.
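For comparison, bs4 also bundles CSS selector support through select() and select_one(), so the same lookups can be written either way. Here's a small sketch reusing the soup object from the example above:

# the same queries written with bs4's CSS selector API
print(soup.select_one("title").text)
for element in soup.select("div.title"):
    print(element.text)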
BeautifulSoup4 also comes with many utility functions like HTML formatting and editing. For example:
from bs4 import BeautifulSoup

html = """
<div><h1>The Avengers: </h1><a>End Game</a><p>is one of the most popular Marvel movies</p></div>
"""
soup = BeautifulSoup(html, "lxml")
soup.prettify()
"""
<html>
 <body>
  <div>
   <h1>
    The Avengers:
   </h1>
   <a>
    End Game
   </a>
   <p>
    is one of the most popular Marvel movies
   </p>
  </div>
 </body>
</html>
"""
As well as many other utility functions like HTML modification, selective parsing and cleanup.
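For instance, here's a minimal sketch of tree modification with decompose() and selective parsing with SoupStrainer, reusing the movie snippet from above:

from bs4 import BeautifulSoup, SoupStrainer

html = """
<div><h1>The Avengers: </h1><a>End Game</a><p>is one of the most popular Marvel movies</p></div>
"""
# modify the tree: drop the <p> node entirely
soup = BeautifulSoup(html, "lxml")
soup.p.decompose()
print(soup.div)
# <div><h1>The Avengers: </h1><a>End Game</a></div>

# selective parsing: only build <a> nodes to save time and memory on large documents
links_only = BeautifulSoup(html, "lxml", parse_only=SoupStrainer("a"))
print(links_only.get_text())
# End Game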
How to Parse Web Data with Python and Beautifulsoup
Beautifulsoup is one of the most popular libraries in web scraping. In this tutorial, we'll take a hands-on overview of how to use it, what it is good for and explore a real-life web scraping example.
JMESPath and JSONPath
JMESPath and JSONPath are two libraries that allow you to query JSON data using a query language similar to XPath. As JSON becomes more and more prevalent in the web scraping space, these libraries are becoming essential.
For example, with JMESPath we can query, reshape and flatten JSON datasets with ease:
import jmespath

data = {
    "people": [
        {
            "name": "Foo",
            "age": 33,
            "addresses": [
                "123 Main St", "California", "US"
            ],
            "primary_email": "foo@email.com",
            "secondary_email": "bar@email.com",
        },
        ...
    ]
}
jmespath.search("""
    people[].{
        first_name: name,
        age_in_years: age,
        address: addresses[0],
        state: addresses[1],
        country: addresses[2],
        emails: [primary_email, secondary_email]
    }
""", data)
[
    {
        'address': '123 Main St',
        'state': 'California',
        'country': 'US',
        'age_in_years': 33,
        'emails': ['foo@email.com', 'bar@email.com'],
        'first_name': 'Foo',
    },
    ...
]
This feature is especially useful when scraping hidden web data which can result in massive JSON datasets that are hard to navigate and digest.
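As a quick sketch of that workflow: hidden JSON is often embedded in <script> elements, which we can extract and then flatten with JMESPath. Note that the script selector and the dataset keys below are hypothetical assumptions rather than a real page structure:

import json
import httpx
import jmespath
from parsel import Selector

response = httpx.get("https://web-scraping.dev/product/1")
selector = Selector(response.text)
# the script id and JSON keys here are hypothetical examples
raw = selector.css("script#product-data::text").get()
data = json.loads(raw)
# reduce the large hidden dataset to just the fields we need
product = jmespath.search("{name: name, price: price, images: images[].url}", data)
print(product)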
Quick Intro to Parsing JSON with JMESPath in Python
Introduction to JMESPath - a JSON query language used in web scraping to parse scraped JSON datasets.
Alternatively, JSONPath focuses less on reshaping datasets and more on selecting values from complex, heavily nested JSON datasets, and supports advanced matching functions:
import jsonpath_ng.ext as jp

data = {
    "products": [
        {"name": "Apple", "price": 12.88, "tags": ["fruit", "red"]},
        {"name": "Peach", "price": 27.25, "tags": ["fruit", "yellow"]},
        {"name": "Cake", "tags": ["pastry", "sweet"]},
    ]
}
# find all product names:
query = jp.parse("products[*].name")
for match in query.find(data):
    print(match.value)

# find all products with price > 20
query = jp.parse("products[?price>20].name")
for match in query.find(data):
    print(match.value)
The killer feature of JSONPath is the recursive $.. selector, which allows you to quickly select values by key anywhere in the dataset (similar to XPath's beloved // selector):
import jsonpath_ng.ext as jp

data = {
    "products": [
        {"name": "Apple", "price": 12.88},
        {"name": "Peach", "price": 27.25},
        {"name": "Cake"},
        {"multiproduct": [{"name": "Carrot"}, {"name": "Pineapple"}]}
    ]
}
# find all "name" fields no matter where they are:
query = jp.parse("$..name")
for match in query.find(data):
    print(match.value)
Introduction to Parsing JSON with Python JSONPath
Intro to using Python and the JSONPath library, a query language for parsing JSON datasets.
Playwright and Selenium
Headless browsers are becoming very popular in web scraping as a way to deal with dynamic JavaScript and scraper blocking. Many websites use complex front-ends that generate data on demand, either through background requests or JavaScript functions. To access this data we need a JavaScript execution environment, and there's no better way than to employ a real headless web browser.
JavaScript fingerprinting is also becoming a required step to scrape many modern websites. To control a headless browser for web scraping in Python we have two popular libraries: Playwright and Selenium.
Selenium is one of the first major browser automation libraries and it can do almost anything a browser can do:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# configure browser
options = Options()
options.headless = True
options.add_argument("--window-size=1920,1080")
options.add_argument("start-maximized")
driver = webdriver.Chrome(options=options)
driver.get("https://web-scraping.dev/product/1")

# wait for page to load by waiting for reviews to appear on the page
wait = WebDriverWait(driver, timeout=5)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '#reviews')))

# click buttons:
button = wait.until(EC.element_to_be_clickable((By.ID, "load-more-reviews")))
button.click()

# return rendered HTML:
print(driver.page_source)
Web Scraping with Selenium and Python
Introduction to web scraping dynamic JavaScript-powered websites and web apps using the Selenium browser automation library and Python.
Alternatively, Playwright is a modern take on browser automation, offering the same capabilities through both asynchronous and synchronous APIs:
from playwright.sync_api import sync_playwright

# Using synchronous Python:
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context(viewport={'width': 1920, 'height': 1080})
    page = context.new_page()
    page.goto("https://web-scraping.dev/product/1")
    # wait for the element to be present in the DOM
    page.wait_for_selector('#reviews')
    # click the button
    page.wait_for_selector("#load-more-reviews", state="attached")
    page.click("#load-more-reviews")
    # print the HTML
    print(page.content())
    browser.close()
# or asynchronous
import asyncio
from playwright.async_api import async_playwright

async def run():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(viewport={'width': 1920, 'height': 1080})
        page = await context.new_page()
        await page.goto("https://web-scraping.dev/product/1")
        # wait for the element to be present in the DOM
        await page.wait_for_selector('#reviews')
        # click the button
        await page.wait_for_selector("#load-more-reviews", state="attached")
        await page.click("#load-more-reviews")
        # print the HTML
        print(await page.content())
        await browser.close()

asyncio.run(run())
Web Scraping with Playwright and Python
Playwright is the new, big browser automation toolkit - can it be used for web scraping? In this introduction article, we'll take a look at how we can use Playwright and Python to scrape dynamic websites.
Cerberus and Pydantic
An often overlooked step of the web scraping process is data quality assurance. Web scraping is a unique niche where result datasets are highly dynamic, which makes quality testing very difficult.
Fortunately, there are several tools that can help out with ensuring web scraped data quality.
For web scraping applications that require real-time data validation, Pydantic is a great choice. Pydantic allows specifying strict data models that can be used to validate and morph scraped data:
from typing import Optional
from pydantic import BaseModel, validator

# example for scraped company data
class Company(BaseModel):
    # define allowed field names and types:
    size: int
    founded: int
    revenue_currency: str
    hq_city: str
    hq_state: Optional[str]  # some fields can be optional (i.e. have value of None)

    # then we can define any custom validation functions:
    @validator("size")
    def must_be_reasonable_size(cls, v):
        if not (0 < v < 20_000):
            raise ValueError(f"unreasonable company size: {v}")
        return v

    @validator("founded")
    def must_be_reasonable_year(cls, v):
        if not (1900 < v < 2022):
            raise ValueError(f"unreasonable founding year: {v}")
        return v

    @validator("hq_state")
    def looks_like_state(cls, v):
        if len(v) != 2:
            raise ValueError(f'state should be 2 character long, got "{v}"')
        return v
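As a brief usage sketch (the field values are made up and the code assumes the pydantic v1 validator API used above), valid records are coerced into Company instances while invalid ones raise a ValidationError that pinpoints the offending fields:

from pydantic import ValidationError

scraped = {
    "size": 180,
    "founded": 2011,
    "revenue_currency": "USD",
    "hq_city": "Boston",
    "hq_state": "MA",
}
try:
    company = Company(**scraped)  # coerces types and runs the validators above
    print(company.dict())
except ValidationError as e:
    print(e.json())  # lists every field that failed validation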
For web scrapers that require more flexibility and less strict data validation, Cerberus is a great choice. Cerberus is a schema validation library that allows specifying data models using a simple dictionary syntax:
from cerberus import Validator

def validate_name(field, value, error):
    """function for validating"""
    if "." in value:
        error(field, f"contains a dot character: {value}")
    if value.lower() in ["classified", "redacted", "missing"]:
        error(field, f"redacted value: {value}")
    if "<" in value.lower() and ">" in value.lower():
        error(field, f"contains html nodes: {value}")

schema = {
    "name": {
        # name should be a string
        "type": "string",
        # between 2 and 20 characters
        "minlength": 2,
        "maxlength": 20,
        # extra validation
        "check_with": validate_name,
    },
}
v = Validator(schema)
v.validate({"name": "H."})
print(v.errors)
# {'name': ['contains a dot character: H.']}
The killer feature of Cerberus is the ease of use and ability to define flexible schemas that can work with highly dynamic web-scraped data. At Scrapfly, we use Cerberus to test all of our example scrapers.
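As a small sketch of that flexibility (the schema and item below are illustrative), rules like nullable and the allow_unknown option let a schema tolerate the gaps and extra fields that dynamic scraped data tends to have:

from cerberus import Validator

schema = {
    "name": {"type": "string", "required": True},
    "price": {"type": "float", "nullable": True},  # some pages list no price at all
}
# allow_unknown lets items carry extra, unmodeled fields without failing validation
v = Validator(schema, allow_unknown=True)
print(v.validate({"name": "Box of Chocolate Candy", "price": None, "currency": "USD"}))
# True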
How to Ensure Web Scraped Data Quality
Ensuring consistent web scraped data quality can be a difficult and exhausting task. In this article we'll be taking a look at two popular tools in Python - Cerberus and Pydantic - and how we can use them to validate data.
Scrapfly Python SDK
ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale.
- Anti-bot protection bypass - scrape web pages without blocking!
- Rotating residential proxies - prevent IP address and geographic blocks.
- JavaScript rendering - scrape dynamic web pages through cloud browsers.
- Full browser automation - control browsers to scroll, input and click on objects.
- Format conversion - scrape as HTML, JSON, Text, or Markdown.
- Python and Typescript SDKs, as well as Scrapy and no-code tool integrations.
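For example, here's a minimal sketch of enabling some of these features through ScrapeConfig (the parameter names reflect common Scrapfly SDK usage and should be confirmed against the current documentation):

from scrapfly import ScrapeConfig, ScrapflyClient

scrapfly = ScrapflyClient(key="YOUR KEY")
result = scrapfly.scrape(ScrapeConfig(
    url="https://web-scraping.dev/product/1",
    asp=True,        # enable anti-bot protection bypass
    country="us",    # route the request through a specific proxy country
    render_js=True,  # render the page in a cloud browser
))
print(result.content)  # the resulting page HTML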
Scraped HTML can also be parsed directly using the .selector attribute of a Scrapfly result:
from scrapfly import ScrapeConfig, ScrapflyClient
scrapfly = ScrapflyClient(key="YOUR KEY")
result = scrapfly.scrape(ScrapeConfig("https://httpbin.dev/html"))
# scrapfly results build a parsel.Selector automatically:
page_title = result.selector.xpath("//h1/text()").get()
"Herman Melville - Moby-Dick"
Summary
In this article, we've taken a look at the top 10 Python packages for web scraping, covering several steps of the web scraping process.
For web scraping connections, we've covered httpx, a brilliant, feature-rich HTTP client, as well as selenium and playwright, which drive a full headless web browser.
For data parsing, we've looked at the three most popular HTML parsing tools: lxml, parsel and beautifulsoup, and for JSON parsing, jmespath and jsonpath, which work great in web scraping environments.
Finally, for the last step of the web scraping process, we've looked at two different data validation tools: the strict pydantic and the flexible cerberus.