Web Scraping with Python and BeautifulSoup
BeautifulSoup is one of the most popular libraries in web scraping. In this tutorial, we'll take a hands-on overview of how to use it and what it is good for, and explore a real-life web scraping example.
BeautifulSoup is a popular HTML parsing library used in web scraping with Python. With BeautifulSoup, to find all links on the page we can use the find_all() method, or CSS selectors and the select() method:
    import bs4

    soup = bs4.BeautifulSoup("""
    <a href="/pricing">Pricing</a>
    <a href="https://example.com/blog">Blog</a>
    <a href="https://twitter.com/@company">Twitter</a>
    """, "html.parser")  # name the parser explicitly to avoid a warning

    # collect the href attribute of every <a> node
    links = [node.get('href') for node in soup.find_all("a")]
    print(links)
    # will print:
    # ['/pricing', 'https://example.com/blog', 'https://twitter.com/@company']

    # or with css selectors:
    links = [node.get('href') for node in soup.select('a')]
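One detail worth guarding against: an anchor without an href attribute makes node.get('href') return None. The following is a minimal sketch, assuming a made-up HTML snippet, of skipping such anchors with the attribute filter of find_all() or with the a[href] CSS selector:

```python
import bs4

# hypothetical markup: the second anchor has no href attribute
html = """
<a href="/pricing">Pricing</a>
<a name="top">Back to top</a>
<a href="https://example.com/blog">Blog</a>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

# a plain find_all("a") also returns the href-less anchor, so .get() yields None
all_hrefs = [node.get("href") for node in soup.find_all("a")]

# href=True keeps only anchors that actually carry an href attribute
hrefs = [node["href"] for node in soup.find_all("a", href=True)]

# the equivalent CSS attribute selector
css_hrefs = [node["href"] for node in soup.select("a[href]")]

print(all_hrefs)
print(hrefs)
```

Filtering at selection time keeps None values out of later steps that expect strings.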
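It should be noted that bs4 extracts links exactly as they appear on the page. Links can be relative (such as /pricing), absolute (such as https://example.com/blog), or absolute and pointing to another website (such as https://twitter.com/@company).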
We can convert all relative URLs to absolute ones using the urljoin() function from Python's urllib.parse module:
    from urllib.parse import urljoin

    base_url = "https://example.com"
    # relative links are resolved against base_url; absolute links are left unchanged
    links = [urljoin(base_url, link) for link in links]
    print(links)
    # will print:
    # ['https://example.com/pricing', 'https://example.com/blog', 'https://twitter.com/@company']
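In practice the base is usually the URL of the page the links were scraped from, since links without a leading slash resolve against that page's path. A small sketch of this behavior, where page_url and the scraped list are made-up values for illustration:

```python
from urllib.parse import urljoin

# hypothetical URL of the page the links were found on
page_url = "https://example.com/blog/post-1"

# hypothetical links as they might appear in the page's HTML
scraped = ["/pricing", "post-2", "../about", "https://twitter.com/@company"]

# urljoin resolves each link against the page URL
absolute = [urljoin(page_url, link) for link in scraped]

for link in absolute:
    print(link)
# https://example.com/pricing
# https://example.com/blog/post-2
# https://example.com/about
# https://twitter.com/@company
```

Using only the site root as the base would resolve "post-2" to https://example.com/post-2, which is a different page.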
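We can also filter out outbound URLs if we want to restrict our scraper to a particular website. For this, the tldextract library (https://pypi.org/project/tldextract/) can be used to find the registered domain of each link, that is, the domain name together with its top level domain (TLD), such as example.com: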
    import tldextract

    allowed_domain = "example.com"
    for link in links:
        # registered domain of the link, e.g. "example.com"
        domain = tldextract.extract(link).registered_domain
        if domain != allowed_domain:
            continue  # skip outbound links
        print(link)
    # will print:
    # https://example.com/pricing
    # https://example.com/blog
    # notice the twitter url is missing
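If adding a dependency is not an option, a rougher alternative to the tldextract approach above is to compare hostnames using only the standard library. The following is a minimal sketch, where is_internal is a hypothetical helper name and the link list is made up. Unlike tldextract, it knows nothing about public suffixes such as co.uk, so allowed_domain has to be spelled out exactly:

```python
from urllib.parse import urlparse


def is_internal(url: str, allowed_domain: str) -> bool:
    """Return True if the URL's host is allowed_domain or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return host == allowed_domain or host.endswith("." + allowed_domain)


# hypothetical absolute links, e.g. the output of the urljoin step
links = [
    "https://example.com/pricing",
    "https://blog.example.com/post",
    "https://twitter.com/@company",
    "https://notexample.com/",
]

internal = [link for link in links if is_internal(link, "example.com")]
print(internal)
# ['https://example.com/pricing', 'https://blog.example.com/post']
```

Checking for a leading dot before the allowed domain keeps lookalike hosts such as notexample.com from passing the filter.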