Web Scraping With Python 102: Parsing

One of the biggest revolutions of the 21st century is the realization of how valuable data can be. The great news is that the internet is full of great, public data for you to take advantage of, and that's exactly the purpose of web-scraping: collecting this public data to bootstrap a project or a newly founded business.

In this multipart tutorial series we'll take an in-depth look at what web scraping is, how everything around it functions and how we can write our own web-scrapers in the Python programming language.

Previous: Web Scraping With Python 101: Connection


In the last chapter, we learned how to retrieve documents from the web using the httpx package, and it's time we start making sense of these documents rather than just hoarding them.

In this chapter, we'll take a look at how we can parse HTML documents to extract data, format it and ensure that our methods are reliable. For this, we'll have a quick overview of CSS and XPATH selectors and how they are implemented in popular HTML parsing libraries in Python.

Parsing HTML Content

HTML (HyperText Markup Language) is a text data structure that powers the web. The great thing about it is that it's intended to be machine-readable text content, which is great news for web-scraping as we can easily parse the data with code!

HTML is a tree-type structure that lends itself easily to parsing. For example, let's take this simple HTML content:

<head>
    <title>My Website</title>
</head>
<body>
    <h1>Welcome to my website!</h1>
    <div class="content">
        <p>This is my website</p>
        <p>Isn't it great?</p>
    </div>
</body>

Here we see an extremely basic HTML document that a simple website might serve. You can already see the tree-like structure just from the indentation of the text, but we can go even further and illustrate it:

Example of an HTML node tree; note that branches are ordered (left-to-right).

This tree structure is brilliant for web-scraping as we can easily navigate the whole document.
For example, to find the title of the website, we can see that it's under the <head> and <title> nodes. In other words - if we wanted to extract 1000 titles from 1000 different pages, we would write a rule that finds head->title->text for every one of them.

When it comes to HTML parsing, there are two standard ways to write these rules: CSS selectors and XPATH selectors - let's dive further and see how we can use them to parse web-scraped data!

Intro to CSS and XPATH Selectors

There are two HTML parsing standards: CSS selectors, which are briefer but less powerful, and XPATH selectors, which are more complex but also more powerful.

Generally, modern websites can be parsed with CSS selectors alone; however, sometimes the HTML structure is more complex and calls for the extra parsing power of XPATH. In this section, we'll take a look at both - where CSS selectors are sufficient and where XPATH selectors are needed.

First, since Python's standard library offers no convenient HTML parsing capability (the built-in html.parser module has no CSS or XPATH selector support), we must choose a library which provides it. In Python there are several options, but the two biggest packages are beautifulsoup4 and parsel.
We'll be using parsel in this chapter, but once we understand CSS and XPATH selectors, we can apply this knowledge to many clients across many languages - not only Python.

Let's install parsel and take a look at a quick introduction example:

$ pip install parsel
$ pip show parsel
Name: parsel
Version: 1.6.0
...

Official parsel documentation can be found here: https://github.com/scrapy/parsel

Now with our package installed, let's give it a spin with this imaginary HTML content:

from parsel import Selector

# for this example we're using a simple website page
HTML = """
<head>
    <title>My Website</title>
</head>
<body>
    <div class="content">
        <h1>First blog post</h1>
        <p>Just started this blog!</p>
        <a href="http://github.com/myuser">Checkout My Github</a>
    </div>
</body>
"""
# first we must build a parsable tree object from the HTML text string
tree = Selector(text=HTML)
# once we have the tree object we can execute css and xpath selectors on it:
title = tree.css('title').get()
github_link = tree.css('.content a::attr(href)').get()
article_text = ''.join(tree.css('.content ::text').getall()).strip()
print(title)
print(github_link)
print(article_text)
# will print:
# <title>My Website</title>
# http://github.com/myuser
# First blog post
# Just started this blog!
# Checkout My Github

Alright, let's quickly unpack what's going on here.
First, we define our HTML content, which we'd usually get by requesting a page, but for convenience we just made one of our own.
Then, we have to build a Selector object which parses the text into a navigable tree structure.
Finally, we can execute css/xpath selectors on this tree to parse out the details we actually want.

In this example, we've extracted the title, the Github link from the article's body and all of the text the article contains. Let's quickly overview how CSS selectors work.

CSS selectors were created for applying CSS styles, so if you've ever worked with stylesheets, you're probably already familiar with them!

CSS selectors allow HTML node selection by either node name (e.g. <body>, <div>, <a> etc.) or by node attributes such as class (e.g. <div class="content"></div>) or id (e.g. <div id="name"></div>).
These features can be combined, mixed and matched to parse exactly what we want from the page.
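
For example, here's a minimal sketch with parsel (the HTML snippet, class and id names are made up for illustration) showing all three selection methods:

from parsel import Selector

# a made-up snippet with a node name, a class and an id to select by
html = """
<div id="header">My Blog</div>
<div class="content">
    <p>first paragraph</p>
    <a href="/about">about me</a>
</div>
"""
tree = Selector(text=html)
print(tree.css('a').get())           # select by node name
print(tree.css('.content p').get())  # select by class
print(tree.css('#header').get())     # select by id
# will print:
# <a href="/about">about me</a>
# <p>first paragraph</p>
# <div id="header">My Blog</div>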

In our first parsel example above we used the simple selector title, which selects all <title> nodes in the HTML tree:

tree = Selector(text="<title>My Website</title>")
tree.css('title').getall()
['<title>My Website</title>']
# since we can safely assume the website will have only one title, we can take just the first one
tree.css('title').get()
'<title>My Website</title>'

Further, we also selected the first link in the article's content. For this we used a multipart selector made from multiple relative instructions separated by spaces: .content a::attr(href), which translates to:

  1. take the node with class content
  2. then take any a nodes that are under it
  3. then take their href attribute value (see the short sketch right after this list).
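
To make this concrete, here's a quick sketch (the links are made up) showing the same ::attr(href) selector returning either just the first match or all of the matches:

from parsel import Selector

# a made-up snippet with several links under the .content node
html = """
<div class="content">
    <a href="http://github.com/myuser">Github</a>
    <a href="http://twitter.com/myuser">Twitter</a>
</div>
<a href="http://unrelated.com">unrelated link</a>
"""
tree = Selector(text=html)
# .get() returns only the first match, .getall() returns every match
print(tree.css('.content a::attr(href)').get())
print(tree.css('.content a::attr(href)').getall())
# will print:
# http://github.com/myuser
# ['http://github.com/myuser', 'http://twitter.com/myuser']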

Finally, for our last selector we're using the special keyword ::text which, as you might have guessed, selects the text value of the node:

html = """
<div class='content'>
    <p>paragraph 1</p>
    <a>link 1</a>
    <p>paragraph 2</p>
    <a>link 2</a>
</div>
"""
tree = Selector(text=html)
text = tree.css('.content ::text').getall()
#                        ^ note: the space here means any descendant under .content
print(''.join(text))
# will print:
# paragraph 1
# link 1
# paragraph 2
# link 2

Now that we have a decent grasp of CSS selectors, let's take a look at more complex examples and how XPATH selectors can lend us a hand.

XPath Selectors

XPATH stands for XML Path Language, and it's a powerful tool when it comes to traversing complex HTML trees. To understand what advantages XPATH brings over CSS selectors, let's take a quick overview of this mini language, and then we'll experiment with real-life examples (a short sketch putting these pieces to work follows right after the list):


Navigation:

  • // selects any descendant in the document
    e.g. //a will select all <a> nodes in the document
  • / selects direct child of the node
    e.g. //div/a will select all <a> nodes that are children of <div> node
  • .. selects direct parent of the node
    e.g. //a/.. will select all parents of <a> nodes
  • @ is attribute selector
    e.g. @class will select class attribute value of the node
  • following-sibling::<node> / preceding-sibling::<node> select the node's following and preceding siblings:
    e.g. //div/a/following-sibling::a will select the <a> sibling that comes after the current node, while preceding-sibling would select the one before it

Constraints:

  • [] is condition matcher
    e.g. //div[@class="content"] will select all <div> nodes that have class attribute equal to "content"

Functions:

  • text() is a special function that selects the text content of the node
    e.g. //a/text() will select text value of all <a> nodes.
  • contains(a, b) checks whether a contains b
    e.g. //a[contains(@class, "social")] will select all <a> nodes that contain "social" class
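
Before we get to real examples, here's a quick sketch (the HTML snippet below is made up for illustration) that exercises these primitives with parsel:

from parsel import Selector

# a made-up snippet to try out the XPATH primitives listed above
html = """
<div class="content">
    <h2>Links</h2>
    <a class="social link" href="http://github.com/myuser">Github</a>
    <a class="link" href="/about">About</a>
</div>
"""
tree = Selector(text=html)
# // descendant + [] condition + / direct child + @ attribute
print(tree.xpath('//div[@class="content"]/a/@href').getall())
# text() function
print(tree.xpath('//h2/text()').get())
# contains() to match one class value among several
print(tree.xpath('//a[contains(@class, "social")]/@href').get())
# following-sibling:: to select the node right after <h2>
print(tree.xpath('//h2/following-sibling::a/text()').get())
# will print:
# ['http://github.com/myuser', '/about']
# Links
# http://github.com/myuser
# Github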

As you might have noticed, compared to CSS selectors, XPATH provides some additional powerful features: inline functions and multi-directional tree parsing.
Let's take a look at how these two features can help us create really powerful extraction rules.

You can find more information about xpath selectors in the official MDN documentation: https://developer.mozilla.org/en-US/docs/Web/XPath

First, let's start with multi-directional tree parsing. As we discovered in the last section, CSS selectors let us select nodes and their descendants; XPATH selectors go beyond that and allow selecting ancestors and siblings as well:

from parsel import Selector

# here we have 2 <a> nodes
# 1st has bold text and indicates important document
# 2nd one has no bold text and indicates non-important document
html = """
<body>
  <a href="http://important.com">
    <b>important: </b>
    this should be read by everyone 
  </a>
  <a href="http://not-important.com">
    additional content to read
  </a>
</body>
"""
# we want to select only the important link
tree = Selector(text=html)
# for this we can select any <b> node, then select its parent and
# finally its "href" attribute:
important = tree.xpath("//b/../@href").get()
print(important)
# will print:
# http://important.com

As you can see in the example above, we've built an imaginary scenario where two different links are present: one contains bold text and the other doesn't. We can easily select just the important one by taking advantage of the multi-directional parsing capabilities of XPATH selectors!

For testing xpath selectors there are many helper utilities available, like https://extendsclass.com/xpath-tester.html

The other feature that sets XPATH selectors apart is inline functions, which allow us to filter results based on some fancy logic. The most widely used function in this regard is probably the contains() function, which checks whether a value contains a text string. Let's take a quick look at how it can be used in action with a variation of our previous example:

from parsel import Selector

# this time around our important link has no bold
html = """
<body>
  <a href="http://important.com">
    important: this should be read by everyone 
  </a>
  <a href="http://not-important.com">
    additional content to read
  </a>
</body>
"""

tree = Selector(text=html)
# for this we can select any <a> node which contains "important:" in its
# text value and then select this node's href attribute
important = tree.xpath("//a[contains(text(), 'important:')]/@href").get()
print(important)
# will print:
# http://important.com

In this example, we relied on the contains() function to find the right link node, and from there we could select its attribute. Pretty powerful!
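
These inline functions also chain nicely with the navigation axes we covered earlier. Here's a minimal sketch (the HTML and URLs below are made up) that combines contains() with following-sibling to pick a link based on the label node next to it:

from parsel import Selector

# a made-up snippet where labels and links are sibling nodes
html = """
<div class="links">
    <span>docs:</span>
    <a href="http://example.com/docs">read the docs</a>
    <span>source:</span>
    <a href="http://example.com/code">browse the code</a>
</div>
"""
tree = Selector(text=html)
# find the <span> whose text contains "source", then take the <a> right after it
source_link = tree.xpath(
    "//span[contains(text(), 'source')]/following-sibling::a[1]/@href"
).get()
print(source_link)
# will print:
# http://example.com/code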

We can combine, mix and match all of these logic elements to create really powerful and reliable selectors for our web scrapers. The time has come to put everything we know together!

Putting It All Together

In the last article we covered how to download HTML documents using the httpx client, and in this article we've figured out how to use CSS and XPATH selectors to parse HTML data using the parsel package. Now let's put all of this together and write a small scraper!

In this section we'll be scraping https://producthunt.com which is essentially a technical product directory where people submit and discuss new products.

import httpx
import json
from parsel import Selector

headers = {
    # let's use a Chrome browser on Windows:
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
}
urls = [
    "https://www.producthunt.com/posts/notion-7",
    "https://www.producthunt.com/posts/obsidian-3",
    "https://www.producthunt.com/posts/evernote",
]

def parse_product(response):
    tree = Selector(text=response.text)
    return {
        "url": str(response.url),
        'name': tree.css('h1 ::text').get(),
        'subtitle': tree.css('h2 ::text').get(),
        # votes are located under <span> which contains bigButtonCount in class names
        'votes': tree.css("span[class*='bigButtonCount']::text").get(),
        # tags are our most complex selection:
        # tag links are under a div which contains the topicPriceWrap class
        # and tag links are only valid if they have /topics/ in them
        'tags': tree.xpath(
            "//div[contains(@class,'topicPriceWrap')]"
            "//a[contains(@href, '/topics/')]/text()"
        ).getall(),
    }

results = []
with httpx.Client(headers=headers) as session:
    for url in urls:
        response = session.get(url)
        results.append(parse_product(response))

print(json.dumps(results, indent=2))

Which results in:

[
  {
    "url": "https://www.producthunt.com/posts/notion-7",
    "name": "Notion",
    "subtitle": "Artificial intelligence-powered email.",
    "tags": [
      "Android",
      "iPhone",
      "Email"
    ],
    "votes": "1,650"
  },
  {
    "url": "https://www.producthunt.com/posts/obsidian-3",
    "name": "Obsidian",
    "subtitle": "A powerful knowledge base that works on local Markdown files",
    "tags": [
      "Productivity",
      "Note"
    ],
    "votes": "1,706"
  },
  {
    "url": "https://www.producthunt.com/posts/evernote",
    "name": "Evernote",
    "subtitle": "Note taking made easy",
    "tags": [
      "Android",
      "iPhone",
      "iPad"
    ],
    "votes": "300"
  }
]

In this little web scraper, we've combined everything we've learned:

  • We used User-Agent and Accept headers to mask our scraper as a web browser.
  • We used httpx.Client to establish a persistent connection.
  • We defined our parsing function where we used both css and xpath selectors to find the product name, subtitle, tags and vote count.
  • Finally, we wrapped everything in a for loop to gather all results and dumped them to JSON format.

Thanks to Python's rich ecosystem, we've accomplished this tiny scraper in under 40 lines of code - awesome!

ScrapFly and GraphQL

Now that we've learned how to write web scrapers in Python, let's take a look at how we can go further with ScrapFly's GraphQL feature!

You can read more about GraphQL technology over at https://graphql.org/

ScrapFly allows convenient GraphQL-based parsing straight in the Python SDK. Let's take a look at what the scraper we wrote would look like in ScrapFly:

from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
import json

graphql = """
{
    page {
        products: query(selector: "body") {
            title: text(selector: "h1")
            website: text(selector: "h2")
            votes: int(selector: "span[class*='bigButtonCount']")
            tags: query(selector: "div[class*='topicPriceWrap'] a[href*='/topics/']") {
                name: text(selector: "")
            }
        }
    }
}
"""

urls = [
    "https://www.producthunt.com/posts/notion-7",
    "https://www.producthunt.com/posts/obsidian-3",
    "https://www.producthunt.com/posts/evernote",
]
results = []
with ScrapflyClient(key="2fbb226543e44335adb496d30c3ba92d") as client:
    for url in urls:
        response = client.scrape(
            scrape_config=ScrapeConfig(
                url=url,
                graphql=graphql,
            )
        )
        results.extend(response.scrape_result["content"]["page"]["products"])

print(json.dumps(results, indent=2))

Which results in:

[
  {
    "title": "Notion",
    "website": "Artificial intelligence-powered email.",
    "votes": 1650,
    "tags": [
      {
        "name": "Android"
      },
      {
        "name": "iPhone"
      },
      {
        "name": "Email"
      }
    ]
  },
  {
    "title": "Obsidian",
    "website": "A powerful knowledge base that works on local Markdown files",
    "votes": 1706,
    "tags": [
      {
        "name": "Productivity"
      },
      {
        "name": "Note"
      }
    ]
  },
  {
    "title": "Evernote",
    "website": "Note taking made easy",
    "votes": 300,
    "tags": [
      {
        "name": "Android"
      },
      {
        "name": "iPhone"
      },
      {
        "name": "iPad"
      }
    ]
  }
]

You can read more about ScrapFly's graphql feature here: https://scrapfly.io/docs/scrape-api/graphql

Summary and Further Reading

In this article, we took a deep dive into HTML parsing. We learned how we can use CSS and XPATH selectors to extract specific parts of an HTML page, as well as the difference between these two approaches.
We used the parsel HTML parsing library, which is also used by the scrapy web scraping framework we glossed over in chapter 1, as well as ScrapFly's own GraphQL-based approach to data extraction.

For more on data parsing using these technologies, we recommend checking out the xpath and css-selectors tags on StackOverflow - both of which have very active communities.

Finally, what are your next web-scraping steps? For small scrapers, you're ready to go and build something! However, we have not covered any advanced scaling techniques like asynchronous programming, proxy usage or task queues.
These should be your next subjects of interest - or just check out ScrapFly's free plan, where we handle all of these difficult things for you!

