Have you ever wondered how businesses seem to have an endless list of email contacts? Email scraping makes it possible!
In this article, we'll explore how to scrape emails from websites with Python. We'll also cover the most common email scraping challenges and how to overcome them. Let's dig in!
How Do Websites Store Emails in HTML?
The most common way of storing emails in HTML is through mailto links: clickable links that open the default email client with the recipient's address filled in automatically. These email links can be found in the HTML under a tags:
<a href="mailto:email@example.com">Send Email</a>
Websites can also store emails in HTML as plain text:
<p>Email: email@example.com</p>
Scraping emails from the above HTML sources is straightforward. However, some websites use obfuscation techniques to prevent scrapers from harvesting their emails for spam.
We'll cover these obfuscation techniques in a dedicated section. But before that, let's take a look at the tools we'll use in this tutorial.
Setup
In this email scraping guide, we'll use this page on Yellowpages as our scraping target:
We have covered scraping Yellowpages in detail before. In this guide, we'll use it to scrape emails.
To scrape emails, we'll use httpx for sending requests and BeautifulSoup for HTML parsing. These libraries can be installed using the pip command:
pip install httpx bs4
How to Scrape a Website For Email Addresses?
To scrape emails from websites, we need to focus on HTML parsing techniques. Since all emails follow a predictable structure like something@something.something, the easiest way to find emails in an HTML page is to use regex. These regex patterns can match the full email structure or a specific criterion, such as mailto links:
# Matching against the email structure:
email_pattern = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
email_matches = re.findall(email_pattern, page_html)

# Matching against email links with a mailto: href
email_matches = soup.findAll("a", attrs={"href": re.compile("^mailto:")})
The first method is suitable for scraping emails stored as plain text. However, it can be slow, as the script has to search through the entire HTML.
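Here is a minimal, self-contained sketch of both approaches against a made-up HTML snippet (the addresses and markup are invented for illustration):

```python
import re
from bs4 import BeautifulSoup

# A made-up HTML snippet containing both storage styles
page_html = """
<p>Email: support@example.com</p>
<a href="mailto:sales@example.com">Send Email</a>
"""

# Method 1: regex over the raw HTML catches emails stored as plain text
email_pattern = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
print(email_pattern.findall(page_html))
# ['support@example.com', 'sales@example.com']

# Method 2: target mailto links directly with BeautifulSoup
soup = BeautifulSoup(page_html, "html.parser")
for a in soup.find_all("a", attrs={"href": re.compile("^mailto:")}):
    print(a["href"].replace("mailto:", ""))
# sales@example.com
```

Note that the regex method also picks up the address inside the mailto link, since it scans the raw HTML text.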
Now let's apply email regex matching to our target website. It has multiple company pages, and each page stores emails as links under a tags with a mailto: href:
We'll use httpx and BeautifulSoup to get all page links. Then, we'll scrape emails from all pages using each page link:
Python
import httpx
from bs4 import BeautifulSoup
import re
import json
def scrape_yellowpages_search(main_url: str):
    links = []
    response = httpx.get(url=main_url)
    soup = BeautifulSoup(response.text, "html.parser")
    # Loop through all page boxes in the search result page
    for link_box in soup.select("div.info-section.info-primary"):
        # Extract the link of each page and add it to the main website URL
        link = "https://www.yellowpages.com" + link_box.select_one("a").attrs["href"]
        links.append(link)
    return links


def scrape_emails(links: list):
    emails = {}
    for link in links:
        # Send a GET request to each company page
        page_response = httpx.get(url=link)
        soup = BeautifulSoup(page_response.text, "html.parser")
        # Extract the company name from the HTML
        company_name = soup.select_one("h1.dockable.business-name").text
        # Find all a tags with an href that starts with mailto:
        for email_link in soup.findAll("a", attrs={"href": re.compile("^mailto:")}):
            # Extract the email address from the href attribute
            email = email_link.get("href").replace("mailto:", "")
            # Check if the company name exists in the emails dictionary and add it if not
            if company_name not in emails:
                emails[company_name] = []
            emails[company_name].append(email)
    return emails


# Scrape all links and save them to a list
links = scrape_yellowpages_search("https://www.yellowpages.com/search?search_terms=software+company&geo_location_terms=San+Francisco%2C+CA")

# Scrape all emails from each page link
emails = scrape_emails(links)

# Print the result in JSON format
print(json.dumps(emails, indent=4))
ScrapFly

from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
import json

scrapfly = ScrapflyClient(key="Your API key")


def scrape_yellowpages_search(main_url: str):
    links = []
    api_response: ScrapeApiResponse = scrapfly.scrape(
        scrape_config=ScrapeConfig(
            url=main_url
        )
    )
    # Use the built-in selector in the api_response
    selector = api_response.selector
    # Loop through all page boxes in the search result page
    for link_box in selector.css("div.info-section.info-primary"):
        link = "https://www.yellowpages.com" + link_box.css("a").attrib["href"]
        links.append(link)
    return links


def scrape_emails(links: list):
    emails = {}
    for link in links:
        # Scrape each company page
        api_response: ScrapeApiResponse = scrapfly.scrape(scrape_config=ScrapeConfig(url=link))
        selector = api_response.selector
        # Extract the company name from the HTML
        company_name = selector.css("h1.dockable.business-name::text").get()
        # Find all a tags with an href that starts with mailto:
        email_links = selector.css('a[href^="mailto:"]')
        for email_link in email_links:
            email = email_link.attrib["href"].replace("mailto:", "")
            if company_name not in emails:
                emails[company_name] = []
            emails[company_name].append(email)
    return emails


# Scrape all links and save them to a list
links = scrape_yellowpages_search("https://www.yellowpages.com/search?search_terms=software+company&geo_location_terms=San+Francisco%2C+CA")

# Scrape all emails from each page link
emails = scrape_emails(links)

# Print the result in JSON format
print(json.dumps(emails, indent=4))
The above code is split into two main functions. The scrape_yellowpages_search function scrapes all company page links from the main search page and appends them to the links list.
Next, scrape_emails sends a request to each page link and gets the page HTML. Then, we use the findAll method with a compiled regex to select only the a tags whose href starts with mailto:. Finally, we group all emails found on each page under its company_name key and print the final emails dictionary.
Found all emails using a simple regex pattern ^mailto:
Our email scraper got all the emails it could find in the HTML of every Yellowpages business page.
However, due to email obfuscation techniques, scraping emails isn't always this easy. Let's take a look at some of the challenges of creating the perfect free email scraper tool.
How to Scrape Emails With Obfuscation Challenges?
The goal of email obfuscation is to prevent automated tools like web scrapers from finding emails. Most implementations are very similar: render the emails into the HTML using JavaScript. This works against bots because most bots don't execute JavaScript, as that requires real web browser automation, which is much more expensive.
There are different types of obfuscation challenges. Here's a list of the most popular ones:
Encoding email links.
This requires clicking a button to decode the link and reveal the real email address:
Encrypting email addresses.
This requires enabling JavaScript to decrypt email tokens, or reverse engineering and replicating the decryption algorithm:
We can see the email address on the web page and the HTML code. Let's try to scrape it:
import httpx
from bs4 import BeautifulSoup

url = "https://yellowpages.com.eg/en"
r = httpx.get(url)
soup = BeautifulSoup(r.text, "html.parser")
# Select the span that should contain the email address
email = soup.select("div.row-bottom-flex")[0].select("a")[2].select_one("span")
print(email)
This website uses Cloudflare email obfuscation, which uses JavaScript to decode email tokens on page load. Since we sent the request with an HTTP client that doesn't run JavaScript, we only see the encoded email token:
To de-obfuscate Cloudflare emails, we can use a simple Python algorithm:
def DecodeEmail(encodedString):
    # The first two hex characters are the XOR key
    r = int(encodedString[:2], 16)
    # XOR every following hex byte with the key to recover each character
    email = ''.join([chr(int(encodedString[i:i+2], 16) ^ r) for i in range(2, len(encodedString), 2)])
    return email

print(DecodeEmail('a9eadcdaddc6c4ccdbeac8dbcce9f0ccc5c5c6de87cac6c487ccce'))
# CustomerCare@Yellow.com.eg
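To make the scheme clearer, here's a round-trip sketch. The encoder below is our own reconstruction of the obfuscation for illustration, not Cloudflare's actual code: the first hex byte stores an XOR key, and every email character is XORed with that key.

```python
def decode_email(encoded: str) -> str:
    # The first two hex characters are the XOR key
    key = int(encoded[:2], 16)
    # XOR every following hex byte with the key to recover each character
    return "".join(
        chr(int(encoded[i:i + 2], 16) ^ key) for i in range(2, len(encoded), 2)
    )

def encode_email(email: str, key: int = 0xA9) -> str:
    # Store the key as the first hex byte, then each character XORed with it
    return f"{key:02x}" + "".join(f"{ord(ch) ^ key:02x}" for ch in email)

token = encode_email("CustomerCare@Yellow.com.eg")
print(token)                # a9eadcdaddc6c4ccdbeac8dbcce9f0ccc5c5c6de87cac6c487ccce
print(decode_email(token))  # CustomerCare@Yellow.com.eg
```

On real Cloudflare-protected pages, the encoded token typically sits in the data-cfemail attribute of the protected element, so in practice you'd extract that attribute first and then run the decoder on it.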
Email decoding and deobfuscation aren't always this easy. Many custom methods can be surprisingly complex to reverse engineer, so sometimes running a real headless browser for web scraping is a more robust solution.
That being said, headless browsers are really slow and consume a lot of resources. Let's take a look at a better solution: Scrapfly's cloud browsers!
Scrape Emails with ScrapFly
Scraping emails is not particularly challenging, but scaling up an operation like this can quickly introduce unforeseen challenges such as blocking, and this is where Scrapfly can be of assistance.
For example, by using the ScrapFly render_js feature with the previous example, we can easily enable JavaScript and scrape emails without facing obfuscation challenges:
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

scrapfly = ScrapflyClient(key="YOUR SCRAPFLY KEY")

api_response: ScrapeApiResponse = scrapfly.scrape(
    ScrapeConfig(
        url="https://www.yellowpages.com/mill-valley-ca/mip/buckeye-roadhouse-4193022",
        # Activate the render_js feature to scrape dynamically loaded content
        render_js=True,
        # Bypass anti-scraping protections
        asp=True,
        # Set the proxy country
        country="US",
    )
)

email = api_response.selector.css(".email-business::attr(href)").get()
print(email.split("mailto:")[-1])
# buckeyeroadhouse@comcast.net
FAQ
To wrap up this guide on how to scrape emails, let's take a look at some frequently asked questions.
Is email scraping legal?
It is generally legal to scrape publicly available emails for non-commercial use, as long as you respect the website's terms of service. However, in some regions, such as the European Union, email scraping may violate personal data protection laws such as the GDPR.
What's regex for email?
For general text, a solid email regex is ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$. For clickable emails in HTML, the pattern "mailto:.+?" captures every address that uses the mailto: protocol.
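As a quick illustration, here's how the two patterns behave in Python (the sample strings and markup are made up, and a capture group is added to the mailto pattern to extract just the address):

```python
import re

# Anchored pattern: validates that a whole string is an email
full_pattern = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"
print(bool(re.match(full_pattern, "info@example.com")))  # True
print(bool(re.match(full_pattern, "not-an-email")))      # False

# mailto pattern: pulls clickable emails out of HTML
html = '<a href="mailto:hr@example.com">HR</a> <a href="mailto:jobs@example.com">Jobs</a>'
print(re.findall(r'"mailto:(.+?)"', html))
# ['hr@example.com', 'jobs@example.com']
```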
What is an email scraper?
Email scrapers are small programs or scripts that collect public email addresses available on the web. They are a great way to discover new leads or to collect emails for analytics and sentiment analysis.
Email Scraping Summary
In this article, we created an email scraping tool using nothing but Python. To start, we used httpx to crawl pages that might contain email addresses. Then, we used the HTML parsing library BeautifulSoup to find emails using CSS selectors and email regex patterns.
We also explained the most common obfuscation challenges that are used to block email scraping by rendering emails with JavaScript. In a nutshell, these challenges include:
Encoding email links, which must be decoded to reveal the real email address.
Encrypting email addresses, which requires JavaScript to decrypt email tokens or reverse engineering the decryption algorithm.