Web Scraping Emails using Python

Have you ever wondered how businesses seem to have an endless list of email contacts? Email scraping is one way to build such lists!

In this article, we'll explore how to scrape emails from websites with Python. We'll also cover the most common email scraping challenges and how to overcome them. Let's dig in!

How Do Websites Store Emails in HTML?

The most common way of storing emails in HTML is using mailto links. A mailto link is a clickable link that opens the default email client and fills in the recipient's address automatically. These email links can be found in the HTML under a tags:

<a href="mailto:email@example.com">Send Email</a>

Websites can also store emails in HTML as plain text:

<p>Email: email@example.com</p>

It can be straightforward to scrape emails from the above HTML sources. However, some websites use obfuscation techniques to prevent scrapers from accessing emails, for fear of email spam.

We'll cover these obfuscation techniques in a dedicated section. But before that, let's take a look at the tools we'll use in this tutorial.

Setup

In this email scraping guide, we'll use this page on Yellowpages as our scraping target:

Email scraping target website

We have covered scraping Yellowpages in detail before. In this guide, we'll use it to scrape emails.

How to Scrape YellowPages.com

For more on Yellowpages scraping, see our complete tutorial on how to scrape YellowPages.com using Python.

To scrape emails, we'll use httpx for sending requests and BeautifulSoup for HTML parsing. These libraries can be installed using the pip command:

pip install httpx bs4

How to Scrape a Website For Email Addresses?

To scrape emails from websites, we need to focus on HTML parsing techniques. Since all emails follow a predictable structure like something@something.something, the easiest way to find emails in an HTML page is to use regex. These regex patterns can match against the full email structure or against a specific criterion like mailto links:

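import re

# Matching against the email structure (page_html is the raw HTML string).
# {2,} instead of {2,4} so that longer TLDs such as .software also match.
email_pattern = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
email_matches = email_pattern.findall(page_html)

# Matching against email links whose href starts with mailto:
# (soup is a BeautifulSoup object; find_all is the current name of findAll)
mailto_links = soup.find_all("a", attrs={"href": re.compile("^mailto:")})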

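The first method is suitable for scraping emails when they are found as plain text. However, it scans the entire HTML, which can be time-consuming on large pages and may match strings that merely look like emails. The second method only looks at a tags whose href starts with mailto:, so it is more precise but misses emails that aren't links.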
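
To see how the two approaches differ, here is a minimal, self-contained sketch that runs both on a small HTML string. The sample HTML, the example.com addresses, and the helper names find_plain_emails and find_mailto_emails are made up for illustration. To stay dependency-free, it uses only the standard re module rather than BeautifulSoup:

```python
import re

# Hypothetical HTML used only for this illustration.
SAMPLE_HTML = """
<p>Email: sales@example.com</p>
<a href="mailto:support@example.com?subject=Hello">Send Email</a>
<a href="/contact">Contact page</a>
"""

# Pattern for the general something@something.something structure.
EMAIL_PATTERN = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
# Pattern for the address inside href="mailto:..." (stops at a quote or a ?query).
MAILTO_PATTERN = re.compile(r'href="mailto:([^"?]+)')


def find_plain_emails(html: str) -> list[str]:
    """Return every email-shaped string in the HTML, in order, without duplicates."""
    return list(dict.fromkeys(EMAIL_PATTERN.findall(html)))


def find_mailto_emails(html: str) -> list[str]:
    """Return only the addresses stored in mailto: links."""
    return list(dict.fromkeys(MAILTO_PATTERN.findall(html)))


print(find_plain_emails(SAMPLE_HTML))
# ['sales@example.com', 'support@example.com']
print(find_mailto_emails(SAMPLE_HTML))
# ['support@example.com']
```

The plain-text pattern finds both addresses because it scans the whole document, while the mailto pattern only returns the address stored in a link.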

Now let's apply email regex matching to our target website. It has multiple company pages, and each page stores emails as links under a tags with a mailto: href:

Inspect the page to see how it stores emails in HTML

We'll use httpx and BeautifulSoup to get all page links. Then, we'll scrape emails from all pages using each page link:

Python version, followed by the ScrapFly SDK version:
import httpx
from bs4 import BeautifulSoup
import re
import json

def scrape_yellowpages_search(main_url: str):
    links = []
    response = httpx.get(url=main_url)
    soup = BeautifulSoup(response.text, "html.parser")

    # Loop through all page boxes in the search result page
    for link_box in soup.select("div.info-section.info-primary"):
        # Extract the link of each page and add it to the main website URL
        link = "https://www.yellowpages.com" + link_box.select_one("a").attrs["href"]
        links.append(link)

    return links

def scrape_emails(links: list):
    emails = {}
    for link in links:
        # Send a GET request to each company page
        page_response = httpx.get(url=link)
        soup = BeautifulSoup(page_response.text, "html.parser")

        # Extract the company name from the HTML
        name_tag = soup.select_one("h1.dockable.business-name")
        if name_tag is None:
            # Skip pages without a business name (e.g. blocked or changed pages)
            continue
        company_name = name_tag.get_text(strip=True)
        # Find all a tags whose href starts with mailto: (find_all is the current name of findAll)
        for email_link in soup.find_all("a", attrs={"href": re.compile("^mailto:")}):
            # Extract the email address from the href attribute
            email = email_link.get("href").replace("mailto:", "")
            # Create the company's list on first use, then add the email
            emails.setdefault(company_name, []).append(email)

    return emails

# Scrape all links and save them to a list
links = scrape_yellowpages_search("https://www.yellowpages.com/search?search_terms=software+company&geo_location_terms=San+Francisco%2C+CA")
# Scrape all emails from each page link
emails = scrape_emails(links)

# Print the result in JSON format
print(json.dumps(emails, indent=4))

ScrapFly
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
import json

scrapfly = ScrapflyClient(key="Your API key")

def scrape_yellowpages_search(main_url: str):
    links = []
    api_response: ScrapeApiResponse = scrapfly.scrape(
        scrape_config=ScrapeConfig(
            url=main_url
        )
    )
    # Use the built in selector in the api_response
    selector = api_response.selector
    # Loop through all page boxes in the search result page
    for link_box in selector.css("div.info-section.info-primary"):
        link = "https://www.yellowpages.com" + link_box.css("a").attrib["href"]
        links.append(link)

    return links


def scrape_emails(links: list):
    emails = {}
    for link in links:
        # Scrape each company page
        api_response: ScrapeApiResponse = scrapfly.scrape(scrape_config=ScrapeConfig(url=link))
        selector = api_response.selector
        # Extract the company name from the HTML
        company_name = selector.css("h1.dockable.business-name::text").get()
        # Find all a tags whose href starts with mailto:
        for email_link in selector.css('a[href^="mailto:"]'):
            # Extract the email address from the href attribute
            email = email_link.attrib["href"].replace("mailto:", "")
            # Create the company's list on first use, then add the email
            emails.setdefault(company_name, []).append(email)
    
    return emails

# Scrape all links and save them to a list
links = scrape_yellowpages_search("https://www.yellowpages.com/search?search_terms=software+company&geo_location_terms=San+Francisco%2C+CA")
# Scrape all emails from each page link
emails = scrape_emails(links)

# Print the result in JSON format
print(json.dumps(emails, indent=4))

The above code is split into two main functions. The scrape_yellowpages_search function scrapes all company page links from the main search page and appends them to the links list.

Next, we use scrape_emails to send a request to each page link and get the page HTML. Then, we use BeautifulSoup's find_all method (findAll is its older alias) with a compiled regex to select only the a tags whose href starts with mailto:. Finally, we append every email found on the page to the list stored under that page's company_name key and print the final emails dictionary.

Here are the emails we scraped:

Output
{
    "Webinopoly | Ecommerce Shopify Experts": [
        "hello@webinopoly.com"
    ],
    "My Computer Works Inc": [
        "larrybuckmiller1@google.com"
    ],
    "Geeks on Site": [
        "vborja@ottepolo.com"
    ],
    "Playfirst Inc": [
        "mmeunier@madsenstaffing.com"
    ],
    "Piston Cloud Computing, Inc.": [
        "info@pistoncloud.com"
    ],
    "Ecairn Inc": [
        "conversation+sale@ecairn.com"
    ],
    "Revel Systems iPad POS": [
        "info@revelsystems.com"
    ],
    "Forecross Corp": [
        "business-development@forecross.com"
    ],
    "Appstem": [
        "info@appstem.com"
    ],
    "Sliderocket": [
        "sliderocket@clearslide.com"
    ],
    "Wiser": [
        "info@wiser.com"
    ],
    "Cider": [
        "ilya.lipovich@getcider.com",
        "info@getcider.com"
    ],
    "Cygent Inc.": [
        "hostmaster@cygent.com"
    ],
    "Lecco Technology Inc": [
        "kinchan@rll.com"
    ],
    "Tapjoy": [
        "jdrobick@ebay.com",
        "accountpayable@tapjoy.com",
        "bbb-hotline@tapjoy.com",
        "brett.nicholson@tapjoy.com",
        "dan.bellig@tapjoy.com",
        "escalations@tapjoy.com",
        "jinah.haytko@tapjoy.com",
        "sherry.zarabi@tapjoy.com",
        "info@tapjoy.com"
    ],
    "Splunk Inc": [
        "parvesh.jain@splunk.com",
        "DMCA@splunk.com",
        "splunkbr@splunk.com"
    ],
    "Centaur Multimedia": [
        "sales@1focus-medical-billing.com"
    ],
    "Cloud Admin Solution": [
        "benny@cloudadminsolution.com"
    ],
    "VFIX Onsite Computer Service": [
        "nikan@myvfix.com"
    ],
    "Kylie.ai": [
        "admin@kylie.ai"
    ],
    "Trash Warrior": [
        "support@trashwarrior.com",
        "raymond@trashwarrior.com"
    ],
    "TechDaddy": [
        "info@techdaddy.net"
    ],
    "DSIDEX": [
        "office@dsidex.com",
        "sales@dsidex.com",
        "info@dsidex.com"
    ],
    "RT Dynamic": [
        "katy@rtdynamic.com"
    ]
}

So, to scrape emails from a website, we:

  1. Retrieve the HTML pages using httpx
  2. Parse the HTML using BeautifulSoup
  3. Find all emails using a simple regex pattern: ^mailto:

Our email scraper got all the emails it could find in the HTML of every Yellowpages business page.

However, due to email obfuscation techniques, scraping emails isn't always this easy. Let's take a look at some of the challenges you'll face when building an email scraper.

How to Scrape Emails With Obfuscation Challenges?

The goal of email obfuscation challenges is to prevent automated tools like web scrapers from finding emails. Most implementations are very similar: the email is rendered into the HTML using JavaScript. This works against bots because most of them don't run JavaScript, as doing so requires automating a real web browser, which is much more expensive.

There are different types of obfuscation challenges. Here's a list of the most popular ones:

  • Encoding email links.
    This requires decoding the link, which some sites only do after a button click, to reveal the real email address:
<a href="mailto:%65%6d%61%69%6c%40%65%78%61%6d%70%6c%65%2e%63%6f%6d">email</a>
  • Concatenating email addresses.
    This requires running JavaScript code to write the full email into the HTML:
<script>document.write('<a href="mailto:'+'e'+'m'+'a'+'i'+'l'+'@'+'e'+'x'+'a'+'m'+'p'+'l'+'e'+'.'+'c'+'o'+'m'+'">email</a>');</script>
  • Encrypting email addresses.
    This requires enabling JavaScript to decrypt email tokens, or reverse engineering and replicating the decryption algorithm:
<a class="email" href="7179757d7854716c75796478713a777b79" rel="nofollow, noindex">email</a>
<script src="js/decrypt_email.js" defer></script>
  • Storing emails in images.
    This requires downloading the image and running OCR to extract the email:
<img src="images/email.jpg" width="216" height="18" alt="email address">
  • CAPTCHA challenges.
    Some websites require bypassing CAPTCHA challenges to reveal emails.
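
The first two techniques in this list can often be reversed without a browser. The sketch below is a minimal illustration under assumptions: the helper names are hypothetical, the inputs mirror the snippets above, and the concatenation handling is naive (it only joins single-quoted string literals and would not cope with escaped quotes or more elaborate JavaScript):

```python
import re
from urllib.parse import unquote

# Inputs copied from the obfuscation examples above.
ENCODED_LINK = '<a href="mailto:%65%6d%61%69%6c%40%65%78%61%6d%70%6c%65%2e%63%6f%6d">email</a>'
CONCAT_SCRIPT = """<script>document.write('<a href="mailto:'+'e'+'m'+'a'+'i'+'l'+'@'+'e'+'x'+'a'+'m'+'p'+'l'+'e'+'.'+'c'+'o'+'m'+'">email</a>');</script>"""

# Captures whatever follows mailto: inside a double-quoted href.
MAILTO_PATTERN = re.compile(r'href="mailto:([^"]+)"')


def decode_percent_encoded(html: str) -> list[str]:
    """Percent-decode (%65 -> e) every mailto: value found in the HTML."""
    return [unquote(value) for value in MAILTO_PATTERN.findall(html)]


def join_concatenated(script: str) -> list[str]:
    """Naively rebuild the written HTML by joining all single-quoted literals."""
    pieces = re.findall(r"'([^']*)'", script)
    return MAILTO_PATTERN.findall("".join(pieces))


print(decode_percent_encoded(ENCODED_LINK))
# ['email@example.com']
print(join_concatenated(CONCAT_SCRIPT))
# ['email@example.com']
```

Encrypted tokens, images, and CAPTCHAs need site-specific work, so they are left out of this sketch.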

For example, let's take a look at this localized Yellowpages website for Egypt:

Localized version of Yellowpages

We can see the email address on the web page and the HTML code. Let's try to scrape it:

import httpx
from bs4 import BeautifulSoup

url = "https://yellowpages.com.eg/en"

r = httpx.get(url)
soup = BeautifulSoup(r.text, "html.parser")

email = soup.select("div.row-bottom-flex")[0].select("a")[2].select_one("span")
print(email)

This website uses Cloudflare email obfuscation, which uses JavaScript to decode email tokens on page load. Since we sent a request with a client that doesn't run JavaScript, we can only see the encoded email token:

<span class="__cf_email__" data-cfemail="a9eadcdaddc6c4ccdbeac8dbcce9f0ccc5c5c6de87cac6c487ccce">[email protected]</span>

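To de-obfuscate Cloudflare emails, we can use a simple Python algorithm. The first two hex characters of the token act as an XOR key, and every following pair of hex characters is one character of the email XORed with that key: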

def decode_email(encoded_string: str) -> str:
    # The first byte (two hex characters) is the XOR key
    key = int(encoded_string[:2], 16)
    # XOR every following byte with the key to recover each character
    return "".join(
        chr(int(encoded_string[i:i + 2], 16) ^ key)
        for i in range(2, len(encoded_string), 2)
    )

print(decode_email("a9eadcdaddc6c4ccdbeac8dbcce9f0ccc5c5c6de87cac6c487ccce"))
# CustomerCare@Yellow.com.eg
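
On a real page there may be several protected addresses. As a rough sketch, we can collect every data-cfemail token with a regex and decode each one using the same XOR logic. The names decode_cfemail and extract_cf_emails are our own, and the regex assumes the attribute is written with double quotes, as in the output above:

```python
import re

# Matches the hex token stored in each data-cfemail attribute.
CFEMAIL_PATTERN = re.compile(r'data-cfemail="([0-9a-fA-F]+)"')


def decode_cfemail(token: str) -> str:
    """XOR every byte after the first with the first byte (the key)."""
    key = int(token[:2], 16)
    return "".join(
        chr(int(token[i:i + 2], 16) ^ key) for i in range(2, len(token), 2)
    )


def extract_cf_emails(html: str) -> list[str]:
    """Decode every Cloudflare-protected email token found in the HTML."""
    return [decode_cfemail(token) for token in CFEMAIL_PATTERN.findall(html)]


html = (
    '<span class="__cf_email__" '
    'data-cfemail="a9eadcdaddc6c4ccdbeac8dbcce9f0ccc5c5c6de87cac6c487ccce">'
    "[email protected]</span>"
)
print(extract_cf_emails(html))
# ['CustomerCare@Yellow.com.eg']
```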

Email decoding and deobfuscation is not always this easy. Many custom methods can be surprisingly complex to reverse engineer, so sometimes running a real web browser in headless mode is a more robust solution.

That being said, headless browsers are really slow and consume a lot of resources. Let's take a look at a better solution with Scrapfly's cloud browsers!

Scrape Emails with ScrapFly

Scraping emails is not particularly challenging, but scaling up operations like this can quickly introduce unforeseen challenges like blocking. This is where Scrapfly can be of assistance.

scrapfly middleware

ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale.

For example, by using the ScrapFly render_js feature, we can easily enable JavaScript and scrape emails without facing obfuscation challenges:

from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

scrapfly = ScrapflyClient(key="YOUR SCRAPFLY KEY")

api_response: ScrapeApiResponse = scrapfly.scrape(
    ScrapeConfig(
        url="https://www.yellowpages.com/mill-valley-ca/mip/buckeye-roadhouse-4193022",
        # Activate the render_js feature to scrape dynamically loaded content
        render_js=True,
        # Bypass anti scraping protections
        asp=True,
        # Set any geographical country
        country="US",
    )
)

email = api_response.selector.css(".email-business::attr(href)").get()
print(email.split('mailto:')[-1])
# buckeyeroadhouse@comcast.net

FAQ

To wrap up this guide on how to scrape emails, let's take a look at some frequently asked questions.

Is it legal to scrape emails?

It's perfectly legal to scrape emails for non-commercial use as long as you respect the website's terms of service. However, in some regions like the European Union, email scraping may violate the laws protecting individuals' personal data, known as the GDPR.

What's the regex for email?

To check whether a whole string is an email, a commonly used regex is ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$ (drop the ^ and $ anchors to find emails inside longer text). For clickable emails in HTML, the regex "mailto:.+?" (including the quotes) captures every address that uses the mailto: protocol.
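
As a quick illustration of how the anchors change what these patterns do, here is a small sketch using Python's re module. The constant and function names and the example.com addresses are made up for this example, and the mailto pattern has a capture group added so that only the address is returned:

```python
import re

# Anchored pattern: the whole string must be an email address.
FULL_EMAIL = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")
# Same pattern without anchors: finds addresses inside longer text.
EMAIL_IN_TEXT = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
# Quoted mailto: value, with a group around the address itself.
MAILTO_VALUE = re.compile(r'"mailto:(.+?)"')


def is_email(value: str) -> bool:
    """Return True only if the entire string looks like an email."""
    return FULL_EMAIL.match(value) is not None


print(is_email("info@example.com"))           # True
print(is_email("Contact: info@example.com"))  # False
print(EMAIL_IN_TEXT.findall("Contact: info@example.com or sales@example.org."))
# ['info@example.com', 'sales@example.org']
print(MAILTO_VALUE.findall('<a href="mailto:info@example.com">Send Email</a>'))
# ['info@example.com']
```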

What is an email scraper?

Email scrapers are small programs or scripts that collect public email addresses available on the web. These tools are a great way to discover new leads or to collect emails for analytics and sentiment analysis.

Email Scraping Summary

In this article, we created an email scraping tool using nothing but Python. To start, we used httpx to crawl pages that might contain email addresses. Then, we used the HTML parsing library beautifulsoup4 to find emails using CSS selectors and email regex patterns.

We also explained the most common obfuscation challenges that are used to block email scraping, most of which render emails using JavaScript. In a nutshell, these challenges include:

  • Encoding email links.
  • Concatenating emails using JavaScript.
  • Encrypting emails.
  • Storing emails in images.
  • CAPTCHA challenges.
