How to find all links using BeautifulSoup and Python?

BeautifulSoup is a popular HTML parsing library used in web scraping with Python. To find all links on a page with BeautifulSoup, we can use the find_all() method or CSS selectors with the select() method:

import bs4

soup = bs4.BeautifulSoup("""
<a href="/pricing">Pricing</a>
<a href="https://example.com/blog">Blog</a>
<a href="https://twitter.com/@company">Twitter</a>
""", "html.parser")

links = [node.get('href') for node in soup.find_all("a")]
print(links)
# ['/pricing', 'https://example.com/blog', 'https://twitter.com/@company']

# or with CSS selectors:
links = [node.get('href') for node in soup.select('a')]

It should be noted that bs4 extracts links exactly as they appear on the page. Links can be:

  • Relative to the current website like /pricing
  • Absolute like https://example.com/blog
  • Absolute outbound like https://twitter.com/@company
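To tell these apart programmatically, here's a minimal sketch using only the standard library's urllib.parse: a URL with no scheme and no network location is relative (the helper name is ours, not from any library).

from urllib.parse import urlparse

def is_relative(url: str) -> bool:
    """A URL is relative when it has neither a scheme nor a network location."""
    parsed = urlparse(url)
    return not parsed.scheme and not parsed.netloc

print(is_relative("/pricing"))                  # True
print(is_relative("https://example.com/blog"))  # False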

We can convert all relative URLs to absolute ones using the urllib.parse.urljoin function:

from urllib.parse import urljoin

base_url = "https://example.com"
links = [urljoin(base_url, link) for link in links]
print(links)
# will print:
# ['https://example.com/pricing', 'https://example.com/blog', 'https://twitter.com/@company']

We can also filter out outbound URLs if we want to restrict our scraper to a particular website. For this, the tldextract library (https://pypi.org/project/tldextract/) can be used to find the registered domain of each URL:

import tldextract

allowed_domain = "example.com"
for link in links:
    domain = tldextract.extract(link).registered_domain
    if domain == allowed_domain:
        print(link)
# will print:
# https://example.com/pricing
# https://example.com/blog
# notice the twitter url is missing
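If we'd rather avoid the extra dependency, a stdlib-only sketch can combine the joining and filtering steps with urllib.parse. Note this compares the exact host rather than the registered domain, so subdomains such as blog.example.com would be excluded, unlike the tldextract approach above (the function name here is illustrative, not from any library):

from urllib.parse import urljoin, urlparse

def filter_internal(links, base_url="https://example.com"):
    """Resolve relative links against base_url and keep only same-host URLs."""
    allowed_host = urlparse(base_url).netloc
    absolute = [urljoin(base_url, link) for link in links]
    return [url for url in absolute if urlparse(url).netloc == allowed_host]

links = ["/pricing", "https://example.com/blog", "https://twitter.com/@company"]
print(filter_internal(links))
# ['https://example.com/pricing', 'https://example.com/blog']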
