BeautifulSoup is a popular HTML parsing library used for web scraping in Python. To find all links on a page with BeautifulSoup, we can use either the find_all() method or CSS selectors with the select() method:
import bs4

soup = bs4.BeautifulSoup("""
<a href="/pricing">Pricing</a>
<a href="https://example.com/blog">Blog</a>
<a href="https://twitter.com/@company">Twitter</a>
""", "html.parser")  # an explicit parser avoids bs4's GuessedAtParserWarning

links = [node.get('href') for node in soup.find_all("a")]
print(links)
# will print:
# ['/pricing', 'https://example.com/blog', 'https://twitter.com/@company']

# or with CSS selectors:
links = [node.get('href') for node in soup.select('a')]
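Since select() accepts full CSS selectors, the query can also be narrowed at parse time. For example, an attribute selector can match only links whose href starts with "http" (a minimal sketch reusing the soup object from above):

# keep only links that are already absolute (href starts with "http")
absolute_links = [node.get('href') for node in soup.select('a[href^="http"]')]
# ['https://example.com/blog', 'https://twitter.com/@company']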
It should be noted that bs4 extracts links exactly as they appear on the page. A link can be one of three kinds, illustrated in the sketch below:
- Relative to the current website, like /pricing
- Absolute, like https://example.com/blog
- Absolute and outbound (pointing to another website), like https://twitter.com/@company
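One way to tell these categories apart is to inspect each link with urllib.parse.urlparse: relative links have an empty netloc, while absolute links carry a hostname. A rough sketch (the classify_link helper and its site_host parameter are hypothetical):

from urllib.parse import urlparse

def classify_link(href, site_host="example.com"):
    # hypothetical helper: site_host is the hostname we treat as "ours"
    parsed = urlparse(href)
    if not parsed.netloc:  # no hostname means the link is relative
        return "relative"
    if parsed.netloc == site_host:
        return "absolute"
    return "outbound"

classify_link("/pricing")                      # 'relative'
classify_link("https://example.com/blog")      # 'absolute'
classify_link("https://twitter.com/@company")  # 'outbound'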
We can convert all relative URLs to absolute ones using the urllib.parse.urljoin function:
from urllib.parse import urljoin
base_url = "https://example.com"
links = [urljoin(base_url, link) for link in links]
print(links)
# will print:
# ['https://example.com/pricing', 'https://example.com/blog', 'https://twitter.com/@company']
We can also filter out outbound URLs if we want to restrict our scraper to a particular website. For this, the tldextract library (https://pypi.org/project/tldextract/) can be used to find the registered domain (the domain name plus its public suffix, e.g. example.com) of each URL:
import tldextract

allowed_domain = "example.com"
for link in links:
    # registered_domain returns the domain plus its public suffix, e.g. "example.com"
    domain = tldextract.extract(link).registered_domain
    if domain == allowed_domain:
        print(link)
# will print:
# https://example.com/pricing
# https://example.com/blog
# notice the Twitter URL is missing
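Putting the three steps together, here is a hypothetical extract_internal_links helper (a sketch under the assumptions above, not a standard bs4 or tldextract function) that parses the HTML, resolves relative links, and keeps only links on the allowed domain:

import bs4
import tldextract
from urllib.parse import urljoin

def extract_internal_links(html, base_url):
    """Return absolute links that stay on base_url's registered domain."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    allowed_domain = tldextract.extract(base_url).registered_domain
    internal = []
    for node in soup.find_all("a", href=True):  # skip <a> tags without an href
        absolute = urljoin(base_url, node["href"])  # resolve relative links
        if tldextract.extract(absolute).registered_domain == allowed_domain:
            internal.append(absolute)
    return internal

# for the example HTML above this returns:
# ['https://example.com/pricing', 'https://example.com/blog']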