BeautifulSoup is a popular HTML parsing library used for web scraping in Python. To find all links on a page with BeautifulSoup, we can use either the find_all() method or CSS selectors with the select() method:
import bs4

soup = bs4.BeautifulSoup("""
<a href="/pricing">Pricing</a>
<a href="https://example.com/blog">Blog</a>
<a href="https://twitter.com/@company">Twitter</a>
""", "html.parser")  # an explicit parser avoids bs4's GuessedAtParserWarning

links = [node.get('href') for node in soup.find_all("a")]
print(links)
# will print:
# ['/pricing', 'https://example.com/blog', 'https://twitter.com/@company']

# or with CSS selectors:
links = [node.get('href') for node in soup.select('a')]
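Since select() accepts full CSS selectors, the query can also be narrowed at parse time. For example, an attribute selector can match only links whose href starts with "http" (a minimal sketch reusing the soup object from above):

# keep only links that are already absolute (href starts with "http")
absolute_links = [node.get('href') for node in soup.select('a[href^="http"]')]
# ['https://example.com/blog', 'https://twitter.com/@company']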
It should be noted that bs4 extracts links exactly as they appear on the page. A link can be one of three kinds, illustrated in the sketch below:
- Relative to the current website, like /pricing
- Absolute, like https://example.com/blog
- Absolute and outbound (pointing to another website), like https://twitter.com/@company
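One way to tell these categories apart is to inspect each link with urllib.parse.urlparse: relative links have an empty netloc, while absolute links carry a hostname. A rough sketch (the classify_link helper and its site_host parameter are hypothetical):

from urllib.parse import urlparse

def classify_link(href, site_host="example.com"):
    # hypothetical helper: site_host is the hostname we treat as "ours"
    parsed = urlparse(href)
    if not parsed.netloc:  # no hostname means the link is relative
        return "relative"
    if parsed.netloc == site_host:
        return "absolute"
    return "outbound"

classify_link("/pricing")                      # 'relative'
classify_link("https://example.com/blog")      # 'absolute'
classify_link("https://twitter.com/@company")  # 'outbound'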
We can convert all relative URLs to absolute ones using the urllib.parse.urljoin function:
from urllib.parse import urljoin
base_url = "https://example.com"
links = [urljoin(base_url, link) for link in links]
print(links)
# will print:
# ['https://example.com/pricing', 'https://example.com/blog', 'https://twitter.com/@company']
We can also filter out outbound URLs if we want to restrict our scraper to a particular website. For this, the tldextract library (https://pypi.org/project/tldextract/) can be used to find the registered domain (the domain name plus its public suffix, e.g. example.com) of each URL:
import tldextract

allowed_domain = "example.com"
for link in links:
    # registered_domain returns the domain plus its public suffix, e.g. "example.com"
    domain = tldextract.extract(link).registered_domain
    if domain == allowed_domain:
        print(link)
# will print:
# https://example.com/pricing
# https://example.com/blog
# notice the Twitter URL is missing
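Putting the three steps together, here is a hypothetical extract_internal_links helper (a sketch under the assumptions above, not a standard bs4 or tldextract function) that parses the HTML, resolves relative links, and keeps only links on the allowed domain:

import bs4
import tldextract
from urllib.parse import urljoin

def extract_internal_links(html, base_url):
    """Return absolute links that stay on base_url's registered domain."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    allowed_domain = tldextract.extract(base_url).registered_domain
    internal = []
    for node in soup.find_all("a", href=True):  # skip <a> tags without an href
        absolute = urljoin(base_url, node["href"])  # resolve relative links
        if tldextract.extract(absolute).registered_domain == allowed_domain:
            internal.append(absolute)
    return internal

# for the example HTML above this returns:
# ['https://example.com/pricing', 'https://example.com/blog']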