Web Scraping with Python and BeautifulSoup

When it comes to web scraping, we're generally interested in two main steps: collecting raw data (such as HTML documents) and parsing the desired data out of it into something we can use in our business logic. In this article, we'll cover one of the most popular HTML parsing tools in Python: BeautifulSoup (beautifulsoup4).

We'll take a brief look at HTML structure and BeautifulSoup itself, and finally illustrate some best practices with a real-life example: scraping job listings from https://www.remotepython.com. Let's jump in!

What is HTML?

HTML (HyperText Markup Language) is designed to be easily machine-readable and parsable: it follows a tree-like structure of nodes and their attributes, which we can easily navigate programmatically.

Let's start off with a small example page and illustrate its structure:

<head>
  <title>
  </title>
</head>
<body>
  <div>
    <h1>Introduction</h1>
    <p>some description text: </p>
    <a class="link" href="http://example.com">example link</a>
  </div>
</body>

In this basic example of a simple web page, we can see that the document already resembles a data tree. Let's go a bit further and illustrate this:

An HTML tree is made of nodes which can carry attributes such as classes, ids and the text itself

Here we can wrap our heads around it a bit more easily: it's a tree of nodes, and each node can have properties attached to it, such as keyword attributes (like class and href) and natural attributes like text.

For more on HTML and parsing, see our in-depth guide Web Scraping With Python 102: Parsing.

With this basic understanding, we can see how BeautifulSoup can help us traverse this tree to extract the data we need.

Parsing HTML with BeautifulSoup

BeautifulSoup is a Python package that implements pythonic HTML navigation as well as many parsing utilities, such as tree modification and text formatting, that are often used in web scraping.

Let's take a quick dive into the most useful features of this package in the context of web scraping. We'll start off with an imaginary HTML page and later take a look at a real life example too:

<head>
  <title class="page-title">Hello World!</title>
</head>
<body>
  <div id="content">
    <h1>Title</h1>
    <p>first paragraph</p>
    <p>second paragraph</p>
    <h2>Subtitle</h2>
    <p>first paragraph of subtitle</p>
  </div>
</body>

Here we have a very simple article-like web page which contains the basic elements of an article: a title, a subtitle and some paragraphs.

First things first, we need to choose a backend for BeautifulSoup. BeautifulSoup doesn't implement HTML parsing itself and instead relies on one of three available backends:

  • html.parser - Python's built-in parser: written in pure Python, so it's a bit slower.
  • lxml - a C-based library for HTML parsing: really fast, but can be a bit more difficult to install.
  • html5lib - another parser written in pure Python, intended to be fully html5 compliant.

To summarize, it's best to stick with the lxml backend because it's much faster; however, html.parser is just as good for getting started. As for html5lib, it's mostly useful for edge cases where html5 specification compliance is necessary.
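The backend is selected by passing its name as the second argument when building the soup object. Note that lxml and html5lib are not part of beautifulsoup4 and have to be installed separately, so here's a quick sketch:

from bs4 import BeautifulSoup

html = "<div><p>hello</p></div>"

# python's built-in parser - always available
soup = BeautifulSoup(html, "html.parser")
# lxml backend - requires "$ pip install lxml"
# soup = BeautifulSoup(html, "lxml")
# html5lib backend - requires "$ pip install html5lib"
# soup = BeautifulSoup(html, "html5lib")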

Let's take a look at how we can parse our document with BeautifulSoup and the html.parser backend:

# note: you can install beautifulsoup package through "$ pip install beautifulsoup4" command in your terminal
from bs4 import BeautifulSoup

html = """
<head>
  <title class="page-title">Hello World!</title>
</head>
<body>
  <div id="content">
    <h1>Title</h1>
    <p>first paragraph</p>
    <p>second paragraph</p>
    <h2>Subtitle</h2>
    <p>first paragraph of subtitle</p>
  </div>
</body>
"""

# build a soup object from the html text using the html.parser backend
soup = BeautifulSoup(html, "html.parser")
# then we can navigate the html tree via the python API:
# for example, the title node is under the <head> node

print(soup.head.title)
# <title class="page-title">Hello World!</title>

# further we can get the text attribute instead of this entire node:
print(soup.head.title.text)
# Hello World!

# or its other attributes (note: class is a multi-valued attribute, so a list is returned):
print(soup.head.title["class"])
# ['page-title']

Here we built a soup object from an HTML string and did simple dot-based navigation. However, this type of navigation is a bit too explicit for the deeper tree structures that are much more common in real web pages: imagine if a page tree had 7 levels of depth; we'd have to write something like:

soup.body.div.div.div.p.a['href']

This is rather inconvenient, so BeautifulSoup introduces two special methods: find() and find_all(). These allow navigation that isn't rooted at the top of the tree, which is much more convenient when it comes to web scraping:

import re

from bs4 import BeautifulSoup

html = """
<head>
  <title class="page-title">Hello World!</title>
</head>
<body>
  <div id="content">
    <h1>Title</h1>
    <p>first paragraph</p>
    <p>second paragraph</p>
    <h2>Subtitle</h2>
    <p>first paragraph of subtitle</p>
  </div>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

soup.find('title').text
# Hello World!

# we can also search by attribute values
soup.find(class_='page-title').text
# Hello World!

# We can even combine these two approaches:
soup.find('div', id='content').h2.text
# Subtitle

# Finally, we can perform partial matches using regular expressions
# let's select paragraphs that contain the word "first" in their text:
soup.find_all('p', text=re.compile('first'))
# [<p>first paragraph</p>, <p>first paragraph of subtitle</p>]

As you can see, by combining BeautifulSoup's dot-based navigation with the find() and find_all() methods, we can easily and reliably navigate the HTML tree and extract specific information with very little code.
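Since find_all() returns a list of matching nodes, we can also iterate it to extract data in bulk. Continuing with the soup object from the example above:

# find_all() returns a list of nodes which we can loop over:
for p in soup.find_all('p'):
    print(p.text)
# or collect values in bulk with a list comprehension:
paragraphs = [p.text for p in soup.find_all('p')]
# ['first paragraph', 'second paragraph', 'first paragraph of subtitle']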

BeautifulSoup is a really accessible and powerful HTML parser that fits web scraping perfectly! Next, let's take a look at some extra features of bs4 and some real-life web scraping scenarios.

Beautifulsoup's Extras

Other than being a great HTML parser, BeautifulSoup also includes many HTML-related utilities and helper functions. Let's take a quick overview of the ones often used in web scraping.

  1. Automatic text formatting
    Often, detailed text is scattered across multiple HTML nodes; for this, BeautifulSoup implements a convenient utility method called get_text():
from bs4 import BeautifulSoup

html = """
<div>
  <h1>The Avengers: </h1>
  <a>End Game</a>
  <p>is one of the most popular Marvel movies</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
soup.div.get_text(' ', strip=True)  # strip removes trailing/leading whitespace
# 'The Avengers: End Game is one of the most popular Marvel movies'
  2. Pretty-formatting HTML
    Another great utility offered by BeautifulSoup is its HTML formatter. Frequently when web scraping, we want to store or display HTML content somewhere, whether for ingestion by other tools or for debugging. BeautifulSoup's .prettify() method restructures the HTML output to be more readable by humans:
from bs4 import BeautifulSoup

html = """
<div><h1>The Avangers: </h1><a>End Game</a><p>is one of the most popular Marvel movies</p></div>
"""
soup = BeautifulSoup(html, "lxml")
print(soup.prettify())
# <html>
#  <body>
#   <div>
#    <h1>
#     The Avangers:
#    </h1>
#    <a>
#     End Game
#    </a>
#    <p>
#     is one of the most popular Marvel movies
#    </p>
#   </div>
#  </body>
# </html>
  3. Parsing only parts of the content
    Some web scrapers might not need the entire HTML document to extract valuable data. For example, when web crawling we typically only want to parse <a> nodes for links. For this, BeautifulSoup offers the SoupStrainer object, which allows restricting our tree to specific elements only:
from bs4 import BeautifulSoup, SoupStrainer
html = """
<head><title>hello world</title></head>
<body>
  <div>
      <a>Link 1</a>
      <a>Link 2</a>
      <div>
        <a>Link 3</a>
      </div>
  </div>
</body>
"""
link_strainer = SoupStrainer('a')
soup = BeautifulSoup(html, "html.parser", parse_only=link_strainer)
print(soup)
#<a>Link 1</a><a>Link 2</a><a>Link 3</a>
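Note that SoupStrainer accepts the same matching arguments as find(), so we can restrict the tree by attributes as well. A quick sketch with a hypothetical class name:

# only parse <div> nodes that have the class "content"
div_strainer = SoupStrainer('div', class_='content')
soup = BeautifulSoup(html, "html.parser", parse_only=div_strainer)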
  4. Tree modification
    Since BeautifulSoup is a full Python wrapper for the HTML tree, we can easily modify the tree by adjusting the Python objects it contains:
from bs4 import BeautifulSoup
html = """
<div>
  <button class="flat-button red">Subscribe</button>
</div>
"""
soup = BeautifulSoup(html, "lxml")
soup.div.button['class'] = "shiny-button blue"
soup.div.button.string = "Unsubscribe"
print(soup.prettify())
# <html>
#  <body>
#   <div>
#    <button class="shiny-button blue">
#     Unsubscribe
#    </button>
#   </div>
#  </body>
# </html>
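Tree modification also comes in handy for cleanup: when scraping, we often want to strip unwanted nodes (such as scripts or ads) before extracting text. Here's a small sketch using bs4's decompose() method, which removes a node and all of its children from the tree:

from bs4 import BeautifulSoup
html = """
<div>
  <p>real content</p>
  <script>alert('not content');</script>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
# remove the <script> node and everything under it
soup.div.script.decompose()
print(soup.div.get_text(strip=True))
# real content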

In this section, we've covered the four most common BeautifulSoup extras. As you can see, BeautifulSoup is not only an HTML parsing library but a whole HTML suite! Finally, let's finish off this article with a real-world example.

Real Life Scenario

Now that we have a bit of an understanding of HTML's structure and how to use BeautifulSoup to parse it, let's take it for a real spin! In this example, we'll be using ScrapFly's API and BeautifulSoup to parse Python job listings from https://www.remotepython.com/jobs/.

First, let's ensure we have beautifulsoup4 and ScrapFly SDK installed:

$ pip install beautifulsoup4 scrapfly-sdk

Then our goal is to extract these job listing fields: the job URL, title, company, location and posting date:

Job listings on remotepython.com

As you can see, the page contains 15 job listings; let's extract them with ScrapFly + BeautifulSoup:

from urllib.parse import urljoin
from bs4 import BeautifulSoup
from scrapfly import ScrapeConfig, ScrapflyClient
import json
import re

with ScrapflyClient(key="<YOUR_KEY>") as client:
    response = client.scrape(
        scrape_config=ScrapeConfig(
            url="https://www.remotepython.com/jobs/",
        )
    )
    soup = BeautifulSoup(response.scrape_result["content"], "html.parser")
    collected = []
    for item in soup.find_all(class_="item"):
        collected.append(
            {
                "title": item.find("h3").text,
                "url": urljoin(response.scrape_result['url'], item.find("h3").a["href"]),
                "company": item.find("h5").find("span", class_="color-black").text,
                "location": item.find("h5").find("span", class_="color-white-mute").text,
                "posted_on": item.find("span", class_="color-white-mute", text=re.compile("Posted:")).text,
            }
        )
print(json.dumps(collected, indent=2))

In this little web scraper, we're using ScrapFly's API to retrieve the HTML contents and build a soup object from them. Using the soup object, we can iterate through each job listing box, denoted by the item class, and extract the job listing details. If we run this script, we should get 15 results that look something like this:

[
  {
    "title": "Full Stack Python Developer  ",
    "url": "https://www.remotepython.com/jobs/4ea2882056334d20a376abce8d59c034/",
    "company": "ChangeEngine",
    "location": "London, United Kingdom",
    "posted_on": "Posted: Dec. 15, 2021"
  },
  ...
]

Summary and Further Reading

In this short article, we introduced a brilliant tool for web scraping: beautifulsoup4, which allows us to easily parse information from HTML trees using node names and various attribute matchers.
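As a quick taste of what's ahead, beautifulsoup itself already supports CSS selectors through its select() and select_one() methods:

from bs4 import BeautifulSoup

html = '<div id="content"><p class="first">hello</p><p>world</p></div>'
soup = BeautifulSoup(html, "html.parser")
# select() takes a CSS selector and returns all matching nodes
print(soup.select('div#content > p.first'))
# [<p class="first">hello</p>]
# select_one() returns only the first match
print(soup.select_one('p.first').text)
# hello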

From here on, we recommend taking a look at CSS and XPath selectors, both of which are HTML parsing mini-languages that can be used in Python to make parsing even easier! For this, refer to our guide Web Scraping With Python 102: Parsing.

