How to find all links using BeautifulSoup and Python?

BeautifulSoup is a popular HTML parsing library used for web scraping in Python. To find all links on a page with BeautifulSoup, we can use either the find_all() method or CSS selectors with the select() method:

import bs4

soup = bs4.BeautifulSoup("""
<a href="/pricing">Pricing</a>
<a href="https://example.com/blog">Blog</a>
<a href="https://twitter.com/@company">Twitter</a>
""", "html.parser")

links = [node.get('href') for node in soup.find_all("a")]
# links now contains:
[
    "/pricing",
    "https://example.com/blog",
    "https://twitter.com/@company",
]
# or with CSS selectors:
links = [node.get('href') for node in soup.select('a')]
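
Note that <a> elements without an href attribute would yield None values in the list above. BeautifulSoup's href=True keyword filter (or the a[href] CSS selector) matches only anchors that actually carry a link. A minimal sketch:

# keep only <a> nodes that have an href attribute
links = [node['href'] for node in soup.find_all("a", href=True)]
# equivalent with a CSS attribute selector:
links = [node['href'] for node in soup.select("a[href]")]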

It should be noted that bs4 extracts links as they appear on the page. Links can be:

  • Relative to the current website like /pricing
  • Absolute like https://example.com/blog
  • Absolute outbound like https://twitter.com/@company

We can convert all relative URLs to absolute ones using the urllib.parse.urljoin function:

from urllib.parse import urljoin

base_url = "https://example.com"
links = [urljoin(base_url, link) for link in links]
print(links)
# will print:
[
    "https://example.com/pricing",
    "https://example.com/blog",
    "https://twitter.com/@company",
]

We can also filter out outbound URLs if we want to restrict our scraper to a particular website. For this, the tldextract library (https://pypi.org/project/tldextract/) can be used to find the registered domain (the domain name together with its top-level domain, e.g. example.com):

import tldextract

allowed_domain = "example.com"
for link in links:
    # registered_domain combines the domain and its TLD, e.g. "example.com"
    domain = tldextract.extract(link).registered_domain
    if domain == allowed_domain:
        print(link)
# will print
"https://example.com/pricing"
"https://example.com/blog"
# notice the twitter url is missing
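
If subdomains don't need special handling, a similar filter can be built with the standard library alone by comparing each URL's host. A rough sketch, assuming the links have already been made absolute with urljoin; note that unlike tldextract, a plain host comparison would treat www.example.com as a different site:

from urllib.parse import urlparse

allowed_netloc = "example.com"
for link in links:
    # netloc is the host portion of the URL, e.g. "twitter.com"
    if urlparse(link).netloc == allowed_netloc:
        print(link)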
