Web Scraping with Python and BeautifulSoup


When it comes to web scraping, we're generally interested in two main steps: collecting raw data (such as HTML documents) and parsing this data into something we can digest in our business logic. Today, we'll cover one of the most popular HTML parsing tools in Python: BeautifulSoup (beautifulsoup4).

In this Python and BeautifulSoup tutorial, we'll take a look at how BeautifulSoup is used to parse website data,
what some of the library's most useful functions are, as well as tips, tricks, and common web scraping scenarios.

Finally, to illustrate some of the best practices of web scraping with beautifulsoup and python, we'll take on a real web scraping project: scraping job listings from remotepython.com. Let's jump in!

Web Scraping With Python Tutorial

For a more detailed introduction to general web scraping with Python, see our full introduction tutorial, which covers connections, scaling, and other questions not related to beautifulsoup.


What is HTML?

HTML (HyperText Markup Language) is designed to be easily machine-readable and parsable. In other words, HTML follows a tree-like structure of nodes (HTML tags) and their attributes, which we can easily navigate programmatically.

Let's start off with a small example page and illustrate its structure:

<head>
  <title>
  </title>
</head>
<body>
  <div>
    <h1>Introduction</h1>
    <p>some description text: </p>
    <a class="link" href="http://example.com">example link</a>
  </div>
</body>

In this basic example of a simple web page's source code, we can see that the document already resembles a data tree just from its indentation.
Let's go a bit further and illustrate this:

illustration of an HTML tree

Here, we can wrap our heads around it a bit more easily - it's a tree of nodes and each node consists of:

  • Node Name - aka the HTML tag, e.g. <div>
  • Natural Properties - the text value and position.
  • Keyword Properties - keyword values like class, href, etc.

With this basic understanding, we can see how python and beautifulsoup can help us traverse this tree to extract the data we need.
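
To make this concrete, here's a minimal sketch using beautifulsoup (which we cover next) that reads these three properties off the example link node from the page above:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<a class="link" href="http://example.com">example link</a>', "html.parser")
node = soup.a
print(node.name)   # node name: a
print(node.text)   # natural property: example link
print(node.attrs)  # keyword properties: {'class': ['link'], 'href': 'http://example.com'}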

Parsing HTML with BeautifulSoup

Beautifulsoup is a python library which is, essentially, an HTML parser. Using it we can navigate HTML data to extract, delete, or replace particular HTML elements. Bs4 also comes with utility functions like visual formatting and parse tree cleanup.

Let's take a quick dive into the most useful beautiful soup features in the context of web scraping. We'll start off with source code for a simple HTML page:

<head>
  <title class="page-title">Hello World!</title>
</head>
<body>
  <div id="content">
    <h1>Title</h1>
    <p>first paragraph</p>
    <p>second paragraph</p>
    <h2>Subtitle</h2>
    <p>first paragraph of subtitle</p>
  </div>
</body>

Here, we have a very simple piece of HTML data which contains basic elements of an article: title, subtitle, and some text paragraphs.

First things first, we need to choose a backend for our beautifulsoup parser. Beautifulsoup doesn't implement its own HTML parser and instead relies on one of 3 available backends:

  • html.parser - python's built-in parser, which is written in pure python and thus a bit slower.
  • lxml - a C-based library for HTML parsing: really fast, but a bit more difficult to install.
  • html5lib - another parser written in python that is intended to be fully html5-compliant.

To summarize, it's best to stick with the lxml backend because it's much faster; however, html.parser should be just as good at the beginning. As for html5lib, it's mostly useful for edge cases where html5 specification compliance is necessary.
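
The backend is selected by passing its name as the second argument when constructing the soup object; a minimal sketch:

from bs4 import BeautifulSoup

html = "<div><p>hello</p></div>"

soup = BeautifulSoup(html, "html.parser")  # built-in, no extra dependencies
# soup = BeautifulSoup(html, "lxml")       # requires: pip install lxml
# soup = BeautifulSoup(html, "html5lib")   # requires: pip install html5lib
print(soup.p.text)
# hello

If the backend argument is omitted, beautifulsoup guesses the best parser installed on the system and emits a warning, so it's good practice to always name one explicitly.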

Let's take a look at how we can parse our document with BeautifulSoup and the html.parser backend. First, to install beautifulsoup in Python we can use the pip install command:

$ pip install beautifulsoup4

Now, we can start experimenting:

from bs4 import BeautifulSoup

html = """
<head>
  <title class="page-title">Hello World!</title>
</head>
<body>
  <div id="content">
    <h1>Title</h1>
    <p>first paragraph</p>
    <p>second paragraph</p>
    <h2>Subtitle</h2>
    <p>first paragraph of subtitle</p>
  </div>
</body>
"""

# build the soup object from the html text using the html.parser backend
soup = BeautifulSoup(html, "html.parser")
# then we can navigate the html tree via the python API:
# for example, <title> is nested under the <head> node

print(soup.head.title)
# <title class="page-title">Hello World!</title>

# further we can get the text attribute instead of this entire node:
print(soup.head.title.text)
# Hello World!

# or its other attributes (class is a multi-valued attribute, so a list is returned):
print(soup.head.title["class"])
# ['page-title']

Here, we built a beautifulsoup object from an HTML string and performed simple dot-based navigation. However, this type of navigation is too explicit for the deeper tree structures that are much more common in real web pages: if a page's parse tree had 7 levels of depth, we'd have to write something like:

soup.body.div.div.div.p.a['href']

This is rather inconvenient. For this, beautiful soup introduces two special methods called find() and find_all(), which allow non-root-based navigation and are much more convenient when it comes to web scraping:

import re

from bs4 import BeautifulSoup

html = """
<head>
  <title class="page-title">Hello World!</title>
</head>
<body>
  <div id="content">
    <h1>Title</h1>
    <p>first paragraph</p>
    <p>second paragraph</p>
    <h2>Subtitle</h2>
    <p>first paragraph of subtitle</p>
  </div>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

soup.find('title').text
# Hello World!

# we can also perform searching by attribute values
soup.find(class_='page-title').text
# Hello World!

# We can even combine these two approaches:
soup.find('div', id='content').h2.text
# Subtitle

# Finally, we can perform partial text matches using regular expressions
# let's select paragraphs that contain the word "first" in their text:
soup.find_all('p', text=re.compile('first'))
# [<p>first paragraph</p>, <p>first paragraph of subtitle</p>]

As you can see, by combining beautiful soup's dot-based navigation with the find() and find_all() methods, we can easily and reliably navigate the HTML tree to extract specific information with very little code.

Beautifulsoup is a really accessible and powerful HTML parser that fits web scraping perfectly! Next, let's take a look at some special extra features of bs4 and some real-life web scraping scenarios.

Beautifulsoup's Extras

Other than being a great HTML parser, beautifulsoup also includes a lot of HTML-related utils and helper functions. Let's take a quick overview of the utils that are often used in web scraping.

  1. Automatic text formatting
    Often detailed text is scattered across multiple HTML nodes; for this, beautifulsoup implements a convenient util function called get_text():
from bs4 import BeautifulSoup

html = """
<div>
  <h1>The Avengers: </h1>
  <a>End Game</a>
  <p>is one of the most popular Marvel movies</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
soup.div.get_text(' ', strip=True)  # strip removes trailing/leading whitespace
# 'The Avengers: End Game is one of the most popular Marvel movies'
  2. Pretty formatting HTML
    Another great util offered by beautifulsoup is its HTML formatter. Frequently, when web scraping, we want to store or display HTML content somewhere, either to ingest it with other tools or to debug it. Beautifulsoup's .prettify() method restructures the HTML output to be more readable by humans:
from bs4 import BeautifulSoup

html = """
<div><h1>The Avengers: </h1><a>End Game</a><p>is one of the most popular Marvel movies</p></div>
"""
soup = BeautifulSoup(html, "lxml")  # the lxml backend wraps fragments in <html>/<body>
print(soup.prettify())
# <html>
#  <body>
#   <div>
#    <h1>
#     The Avengers:
#    </h1>
#    <a>
#     End Game
#    </a>
#    <p>
#     is one of the most popular Marvel movies
#    </p>
#   </div>
#  </body>
# </html>
  3. Parsing only parts of the content
    Some web scrapers might not need the entire HTML document to extract valuable data. For example, when web crawling we typically only want to parse <a> nodes for their links. For this, BeautifulSoup offers the SoupStrainer object, which allows restricting the parse tree to specific HTML elements only:
from bs4 import BeautifulSoup, SoupStrainer
html = """
<head><title>hello world</title></head>
<body>
  <div>
      <a>Link 1</a>
      <a>Link 2</a>
      <div>
        <a>Link 3</a>
      </div>
  </div>
</body>
"""
link_strainer = SoupStrainer('a')
# note: SoupStrainer works with the html.parser and lxml backends, but not with html5lib
soup = BeautifulSoup(html, "html.parser", parse_only=link_strainer)
print(soup)
#<a>Link 1</a><a>Link 2</a><a>Link 3</a>
  4. Tree modification
    Since beautifulsoup wraps the whole HTML tree in python objects, we can easily modify the document by adjusting the objects it contains:
from bs4 import BeautifulSoup
html = """
<div>
  <button class="flat-button red">Subscribe</button>
</div>
"""
soup = BeautifulSoup(html, "lxml")
soup.div.button['class'] = "shiny-button blue"
soup.div.button.string = "Unsubscribe"
print(soup.prettify())
# <html>
#  <body>
#   <div>
#    <button class="shiny-button blue">
#     Unsubscribe
#    </button>
#   </div>
#  </body>
# </html>
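
Tree modification also covers deletion: nodes can be removed entirely with bs4's decompose() method. A quick sketch:

from bs4 import BeautifulSoup

html = """
<div>
  <p>keep this paragraph</p>
  <span class="ad">remove this ad</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
# decompose() removes the node (and all of its children) from the tree
soup.find("span", class_="ad").decompose()
print(soup.div.get_text(" ", strip=True))
# keep this paragraph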

In this section, we've covered the 4 most common beautifulsoup extra features. As you can see, beautifulsoup is not only an HTML parsing library but a whole HTML suite! Finally, let's finish off this article with a real-world example.

Real Life Scenario

Now that we have a bit of an understanding of HTML's structure and how to use beautifulsoup to parse it, let's take it for a real spin! In this example, we'll be using ScrapFly's API and beautifulsoup to parse Python job listings from https://www.remotepython.com/jobs/.

First, let's ensure we have beautifulsoup4 and ScrapFly SDK installed:

$ pip install beautifulsoup4 scrapfly-sdk

Our goal is to extract the following job fields: the job URL, title, company, location, and when it was posted:

Job listings on remotepython.com

As you can see, the page contains 15 job listings; let's extract them with ScrapFly + beautifulsoup:

import json
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup
from scrapfly import ScrapeConfig, ScrapflyClient

with ScrapflyClient(key="<YOUR_KEY>") as client:
    response = client.scrape(
        scrape_config=ScrapeConfig(
            url="https://www.remotepython.com/jobs/",
        )
    )
    soup = BeautifulSoup(response.scrape_result["content"], "html.parser")
    collected = []
    for item in soup.find_all(class_="item"):
        collected.append(
            {
                "title": item.find("h3").text,
                "url": urljoin(response.scrape_result['url'], item.find("h3").a["href"]),
                "company": item.find("h5").find("span", class_="color-black").text,
                "location": item.find("h5").find("span", class_="color-white-mute").text,
                "posted_on": item.find("span", class_="color-white-mute", text=re.compile("Posted:")).text,
            }
        )
print(json.dumps(collected, indent=2))

In this little web scraper, we're using the ScrapFly API to retrieve the HTML contents and build a soup object from them. Using the soup object, we can iterate through each job listing box, denoted by the item class, and extract the job listing details. If we run this script, we should get 15 results that look something like this:

[
  {
    "title": "Full Stack Python Developer  ",
    "url": "https://www.remotepython.com/jobs/4ea2882056334d20a376abce8d59c034/",
    "company": "ChangeEngine",
    "location": "London, United Kingdom",
    "posted_on": "Posted: Dec. 15, 2021"
  },
  ...
]

FAQ

How is beautifulsoup different from other HTML parsing libraries (like the one used by Scrapy)?

Beautifulsoup implements many ways to select HTML elements and supports multiple parse tree backends - it provides a lot of options. That being said, all of these options can be confusing for beginners, so we advise sticking with CSS and XPATH selectors and the lxml backend, as these are the de facto ways of parsing HTML.
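
Note that beautifulsoup supports CSS selectors out of the box through its select() and select_one() methods, while XPath requires a separate library such as lxml or parsel. A quick sketch of the CSS approach:

from bs4 import BeautifulSoup

html = '<div id="content"><p class="intro">first</p><p>second</p></div>'
soup = BeautifulSoup(html, "html.parser")

# select_one() returns the first match, select() returns all matches
print(soup.select_one("div#content p.intro").text)
# first
print([p.text for p in soup.select("div#content p")])
# ['first', 'second']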

Is BeautifulSoup a web scraping library?

No, beautifulsoup is an HTML parsing library, so while it's used for web scraping, it's not a full web scraping suite/framework like scrapy. The beautifulsoup HTML parser needs to be paired with an HTTP client library like requests (or alternatives like httpx, or browser automation tools like selenium) to retrieve HTML pages.

Why does my beautifulsoup code structure the parse tree differently than the web browser?

When developing parsing code with web browser developer tools, we can sometimes see a mismatch between the browser's parse tree and beautifulsoup's. Why? This happens because the beautifulsoup backends have slightly different HTML interpretation rules. For a parse tree that looks the most like that of a web browser, make sure to use the html5lib backend.
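
We can see this for ourselves by feeding the same broken markup to each backend (exact outputs may vary slightly between library versions):

from bs4 import BeautifulSoup

broken_html = "<a></p>"

print(BeautifulSoup(broken_html, "html.parser"))
# <a></a>
print(BeautifulSoup(broken_html, "lxml"))
# <html><body><a></a></body></html>
print(BeautifulSoup(broken_html, "html5lib"))
# <html><head></head><body><a><p></p></a></body></html>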

Parsing an HTML table in python with beautifulsoup

One of the most common parsing targets is the HTML table; here's a quick scraper snippet to extract one:

from bs4 import BeautifulSoup
import requests

response = requests.get("https://www.w3schools.com/html/html_tables.asp")
soup = BeautifulSoup(response.text, "html.parser")
# first we should find our table object:
table = soup.find('table', {"id": "customers"})
# then we can iterate through each row and extract either header or row values:
header = []
rows = []
for i, row in enumerate(table.find_all('tr')):
    if i == 0:
        header = [el.text.strip() for el in row.find_all('th')]
    else:
        rows.append([el.text.strip() for el in row.find_all('td')])
print(header)
for row in rows:
    print(row)
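
Continuing from the snippet above, the header and row lists can be zipped into dictionaries for easier downstream use (the example values assume the w3schools demo table):

# combine the header row with each data row into dictionaries
table_data = [dict(zip(header, row)) for row in rows]
print(table_data[0])
# e.g. {'Company': 'Alfreds Futterkiste', 'Contact': 'Maria Anders', 'Country': 'Germany'}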

Summary and Further Reading

In this short article, we introduced ourselves to web scraping with python and beautifulsoup. We learned how beautifulsoup allows us to easily parse information from HTML files using the .find() and .find_all() methods, which let us find HTML nodes by name, attribute values, or text content.

Finally, we wrapped everything up with a real python and beautifulsoup example by scraping job listing information from remotepython.com.

For more advanced HTML parsing, we recommend taking a look at CSS and XPATH selectors, both of which are compact HTML parsing mini-languages that can be used from python. For this, refer to our Web Scraping With Python article.
