In this tutorial, we'll explain how to scrape YellowPages.com - an online directory of various US-based businesses.
YellowPages.com is the digital version of telephone directories called yellow pages. It contains business information such as phone numbers, websites, and addresses as well as business reviews.
In this tutorial, we'll be using Python to scrape all of that business and review information. We'll also apply a few HTML parsing tricks to extract the data from its pages effectively. Let's dive in!
This tutorial covers popular web scraping techniques for education. Interacting with public servers requires diligence and respect; here's a good summary of what not to do:
Do not scrape at rates that could damage the website.
Do not scrape data that's not available publicly.
Do not store PII of EU citizens who are protected by GDPR.
Do not repurpose entire public datasets, which can be illegal in some countries.
ScrapFly does not offer legal advice, but these are good general rules to follow in web scraping; for more, you should consult a lawyer.
Why Scrape YellowPages.com?
YellowPages contains thousands of businesses and their details, such as phone numbers, websites and locations. We can therefore utilize YellowPages web scraping for various use cases, from market and business analytics to lead generation and competitive intelligence.
Furthermore, YellowPages features user reviews of each business. Scraping YellowPages lets us retrieve this data quickly, and it can then be fed into machine learning techniques for analyzing and gaining insights into users' experiences and opinions.
Before we start, keep in mind that YellowPages is only accessible from US-based IP addresses. So, if you are located outside the US, you will need a US-based proxy or VPN to access the website. Alternatively, you can run the ScrapFly version of the full YellowPages scraper code, available on GitHub.
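For example, here's a minimal sketch of routing httpx traffic through a US proxy; the proxy URL below is a placeholder you'd replace with your own provider's credentials:
import httpx

# hypothetical proxy URL - replace with your own US-based proxy credentials
US_PROXY = "http://username:password@us.proxy.example.com:8000"

# older httpx releases accept a proxies= mapping; newer ones use the proxy= argument instead
client = httpx.AsyncClient(proxies={"all://": US_PROXY})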
To scrape YellowPages, we'll use Python with a few community packages:
httpx - An HTTP client library we'll use to request the YellowPages server.
parsel - An HTML parsing library we'll use to parse the HTML we get using selectors like CSS and XPath.
loguru - A logging library we'll use to monitor our YellowPages scraper.
Note that asyncio comes pre-installed in Python, so you only have to install the other packages using the following pip command:
$ pip install httpx parsel loguru
Alternatively, feel free to swap httpx out with any other HTTP client package, such as requests. As for parsel, another great alternative is beautifulsoup.
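For illustration, here's roughly what a single fetch-and-parse step would look like with requests and beautifulsoup instead; this is just a sketch and isn't used in the rest of the tutorial:
import requests
from bs4 import BeautifulSoup

response = requests.get(
    "https://www.yellowpages.com/search?search_terms=Japanese+Restaurants&geo_location_terms=San+Francisco%2C+CA",
    # a browser-like user agent helps avoid being blocked instantly
    headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"},
)
soup = BeautifulSoup(response.text, "html.parser")
# the same CSS selectors we'll use with parsel also work with beautifulsoup
business_names = [a.get_text(strip=True) for a in soup.select("a.business-name")]
print(business_names)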
How to Find Companies on YellowPages
Before we scrape YellowPages for company data, we need to find the companies first. For that, we can use either of two approaches. The first one is using the YellowPages sitemap, which contains links for all categories and pages on the website, as sketched below.
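For reference, sitemap-based discovery would look roughly like this; the exact sitemap URL is an assumption here - the real entries are usually listed in the site's robots.txt file:
import httpx
from parsel import Selector

# hypothetical sitemap URL - check https://www.yellowpages.com/robots.txt for the real entries
response = httpx.get("https://www.yellowpages.com/sitemap.xml")
sel = Selector(text=response.text, type="xml")
# sitemap XML files list page and category URLs in <loc> elements
urls = sel.xpath("//*[local-name()='loc']/text()").getall()
print(f"found {len(urls)} sitemap URLs")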
However, we'll use a more flexible approach: the search pages.
We can see that upon submitting a search request, YellowPages redirects us to a new URL containing pages of results. Let's scrape these results in the following section.
How to Scrape YellowPages Search
To scrape YellowPages, we need to form a search URL using a search query and a few parameters. Below is an example of using the base search URL with the minimum parameters:
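https://www.yellowpages.com/search?search_terms=<query>&geo_location_terms=<location>&page=<page number>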
The above URL includes the search query, the location and the search page number. Let's apply this URL structure to an example search query. We'll search for Japanese restaurants in San Francisco, California:
https://www.yellowpages.com/search?search_terms=Japanese+Restaurants&geo_location_terms=San+Francisco%2C+CA
Here is the page we got by requesting this URL:
We'll scrape YellowPages search page data from the marked fields above. Let's start by defining our parsing logic:
def parse_search(response) -> List[Preview]:
    """parse yellowpages.com search page for business preview data"""
    sel = Selector(text=response.text)
    parsed = []
    for result in sel.css(".organic div.result"):
        links = {}
        for link in result.css("div.links>a"):
            name = link.xpath("text()").get()
            url = link.xpath("@href").get()
            links[name] = url
        first = lambda css: result.css(css).get("").strip()
        many = lambda css: [value.strip() for value in result.css(css).getall()]
        parsed.append(
            {
                "name": first("a.business-name ::text"),
                "url": urljoin("https://www.yellowpages.com/", first("a.business-name::attr(href)")),
                "links": links,
                "phone": first("div.phone::text"),
                "categories": many(".categories>a::text"),
                "address": first(".adr .street-address::text"),
                "location": first(".adr .locality::text"),
                "rating": first(".ratings .rating div::attr(class)").split(" ", 1)[-1],
                "rating_count": first(".ratings .rating span::text").strip("()"),
            }
        )
    return parsed
Here, we define a parse_search function. It iterates over the result boxes and uses CSS selectors to extract business preview information, such as the phone number, rating, name and, most importantly, the link to the full business information page.
Next, we'll utilize the parsing logic while requesting the search pages to scrape the data:
import asyncio
import json
from urllib.parse import urljoin

import httpx
from parsel import Selector
from loguru import logger as log
from typing_extensions import TypedDict
from typing import Dict, List


class Preview(TypedDict):
    """Type hint container for business preview data. This object just helps us to keep track of what results we'll be getting"""

    name: str
    url: str
    links: Dict[str, str]
    phone: str
    categories: List[str]
    address: str
    location: str
    rating: str
    rating_count: str


def parse_search(response) -> List[Preview]:
    """parse yellowpages.com search page for business preview data"""
    sel = Selector(text=response.text)
    parsed = []
    for result in sel.css(".organic div.result"):
        links = {}
        for link in result.css("div.links>a"):
            name = link.xpath("text()").get()
            url = link.xpath("@href").get()
            links[name] = url
        first = lambda css: result.css(css).get("").strip()
        many = lambda css: [value.strip() for value in result.css(css).getall()]
        parsed.append(
            {
                "name": first("a.business-name ::text"),
                "url": urljoin("https://www.yellowpages.com/", first("a.business-name::attr(href)")),
                "links": links,
                "phone": first("div.phone::text"),
                "categories": many(".categories>a::text"),
                "address": first(".adr .street-address::text"),
                "location": first(".adr .locality::text"),
                "rating": first(".ratings .rating div::attr(class)").split(" ", 1)[-1],
                "rating_count": first(".ratings .rating span::text").strip("()"),
            }
        )
    return parsed


# to avoid being instantly blocked we should use request headers of a common web browser:
BASE_HEADERS = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US,en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
}


# to run our scraper we need to start an httpx session:
async def run():
    limits = httpx.Limits(max_connections=5)
    async with httpx.AsyncClient(limits=limits, timeout=httpx.Timeout(15.0), headers=BASE_HEADERS) as session:
        response = await session.get("https://www.yellowpages.com/search?search_terms=Japanese+Restaurants&geo_location_terms=San+Francisco%2C+CA")
        result_search = parse_search(response)
        # print the results in JSON format
        print(json.dumps(result_search, indent=2))


if __name__ == "__main__":
    asyncio.run(run())
The above code can scrape a single search page. Let's modify it to crawl over other search pages:
import asyncio
import json
import math
from urllib.parse import urlencode, urljoin

import httpx
from parsel import Selector
from loguru import logger as log
from typing_extensions import TypedDict
from typing import Dict, List, Optional


class Preview(TypedDict):
    """Type hint container for business preview data. This object just helps us to keep track of what results we'll be getting"""

    name: str
    url: str
    links: Dict[str, str]
    phone: str
    categories: List[str]
    address: str
    location: str
    rating: str
    rating_count: str


def parse_search(response) -> List[Preview]:
    """parse yellowpages.com search page for business preview data"""
    sel = Selector(text=response.text)
    parsed = []
    for result in sel.css(".organic div.result"):
        links = {}
        for link in result.css("div.links>a"):
            name = link.xpath("text()").get()
            url = link.xpath("@href").get()
            links[name] = url
        first = lambda css: result.css(css).get("").strip()
        many = lambda css: [value.strip() for value in result.css(css).getall()]
        parsed.append(
            {
                "name": first("a.business-name ::text"),
                "url": urljoin("https://www.yellowpages.com/", first("a.business-name::attr(href)")),
                "links": links,
                "phone": first("div.phone::text"),
                "categories": many(".categories>a::text"),
                "address": first(".adr .street-address::text"),
                "location": first(".adr .locality::text"),
                "rating": first(".ratings .rating div::attr(class)").split(" ", 1)[-1],
                "rating_count": first(".ratings .rating span::text").strip("()"),
            }
        )
    return parsed


async def search(query: str, session: httpx.AsyncClient, location: Optional[str] = None) -> List[Preview]:
    """search yellowpages.com for business preview information scraping all of the pages"""

    def make_search_url(page):
        base_url = "https://www.yellowpages.com/search?"
        parameters = {"search_terms": query, "geo_location_terms": location, "page": page}
        return base_url + urlencode(parameters)

    log.info(f'scraping "{query}" in "{location}"')
    first_page = await session.get(make_search_url(1))
    sel = Selector(text=first_page.text)
    total_results = int(sel.css(".pagination>span::text").re(r"of (\d+)")[0])
    total_pages = int(math.ceil(total_results / 30))
    log.info(f"{query} in {location}: scraping {total_pages} pages of business previews")
    previews = parse_search(first_page)
    for result in await asyncio.gather(*[session.get(make_search_url(page)) for page in range(2, total_pages + 1)]):
        previews.extend(parse_search(result))
    log.success(f"{query} in {location}: scraped {len(previews)} business previews in total")
    return previews
The above function implements a complete scraping loop. We generate search URLs from the given query and location parameters. Then, we scrape the first results page to extract the total result count and scrape the remaining pages concurrently. This is a common pagination idiom in web scraping.
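Stripped of the YellowPages specifics, the idiom looks roughly like this sketch, where fetch_page and parse_page are hypothetical stand-ins for the request and parsing steps:
import asyncio
import math


async def paginate(fetch_page, parse_page, results_per_page: int):
    """generic pagination idiom: scrape page 1, derive the page count, then fetch the rest concurrently"""
    first_page = await fetch_page(1)
    results, total_results = parse_page(first_page)
    total_pages = math.ceil(total_results / results_per_page)
    # the remaining pages are independent of each other, so we can request them concurrently
    other_pages = await asyncio.gather(*[fetch_page(page) for page in range(2, total_pages + 1)])
    for page in other_pages:
        page_results, _ = parse_page(page)
        results.extend(page_results)
    return results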
Our YellowPages scraper can find and scrape business data from search pages. Next, we'll scrape the dedicated business pages.
How to Scrape YellowPages Company Data
To scrape company data, we need to request each company URL that we found previously. Let's start with an example URL of a restaurant business, Ozumo Japanese Restaurant:
We'll scrape the marked fields in the above image. First, let's start with the scraping logic:
class Company(TypedDict):
    """type hint container for company data found on yellowpages.com"""

    name: str
    categories: List[str]
    rating: str
    rating_count: str
    phone: str
    website: str
    address: str
    work_hours: Dict[str, str]


def parse_company(response) -> Company:
    """extract company details from yellowpages.com company's page"""
    sel = Selector(text=response.text)
    # here we define some lambda shortcuts for parsing common data:
    # selecting the first element, selecting many elements and joining all elements together
    first = lambda css: sel.css(css).get("").strip()
    many = lambda css: [value.strip() for value in sel.css(css).getall()]
    together = lambda css, sep=" ": sep.join(sel.css(css).getall())

    # to parse working hours we need to do a bit of complex string parsing
    def _parse_datetime(values: List[str]):
        """
        parse datetime from yellow pages datetime strings
        >>> _parse_datetime(["Fr-Sa 12:00-22:00"])
        {'Fr': '12:00-22:00', 'Sa': '12:00-22:00'}
        >>> _parse_datetime(["Fr 12:00-22:00"])
        {'Fr': '12:00-22:00'}
        >>> _parse_datetime(["Fr-Sa 12:00-22:00", "We 10:00-18:00"])
        {'Fr': '12:00-22:00', 'Sa': '12:00-22:00', 'We': '10:00-18:00'}
        """
        WEEKDAYS = ["Mo", "Tu", "We", "Th", "Fr", "Sa", "Su"]
        results = {}
        for text in values:
            days, hours = text.split(" ")
            if "-" in days:
                day_start, day_end = days.split("-")
                for day in WEEKDAYS[WEEKDAYS.index(day_start) : WEEKDAYS.index(day_end) + 1]:
                    results[day] = hours
            else:
                results[days] = hours
        return results

    return {
        "name": first("h1.business-name::text"),
        "categories": many(".categories>a::text"),
        "rating": first(".ratings div::attr(class)").split(" ", 1)[-1],
        "rating_count": first(".ratings .count::text").strip("()"),
        "phone": first(".phone::attr(href)").replace("(", "").replace(")", ""),
        "website": first(".website-link::attr(href)"),
        "address": together(".address::text"),
        "work_hours": _parse_datetime(many(".open-details tr time::attr(datetime)")),
    }
Here, we use CSS selectors to extract the specific company fields we marked earlier. We also process and clean a few fields, such as the phone number, and unpack the work days from ranges like Mo-We into individual values like Mo, Tu, We.
Next, let's use the parse_company function we defined while requesting the company pages:
import asyncio
import json

import httpx
from parsel import Selector
from typing_extensions import TypedDict
from typing import Dict, List


class Company(TypedDict):
    """type hint container for company data found on yellowpages.com"""

    name: str
    categories: List[str]
    rating: str
    rating_count: str
    phone: str
    website: str
    address: str
    work_hours: Dict[str, str]


def parse_company(response) -> Company:
    """extract company details from yellowpages.com company's page"""
    sel = Selector(text=response.text)
    # here we define some lambda shortcuts for parsing common data:
    # selecting the first element, selecting many elements and joining all elements together
    first = lambda css: sel.css(css).get("").strip()
    many = lambda css: [value.strip() for value in sel.css(css).getall()]
    together = lambda css, sep=" ": sep.join(sel.css(css).getall())

    # to parse working hours we need to do a bit of complex string parsing
    def _parse_datetime(values: List[str]):
        """
        parse datetime from yellow pages datetime strings
        >>> _parse_datetime(["Fr-Sa 12:00-22:00"])
        {'Fr': '12:00-22:00', 'Sa': '12:00-22:00'}
        >>> _parse_datetime(["Fr 12:00-22:00"])
        {'Fr': '12:00-22:00'}
        >>> _parse_datetime(["Fr-Sa 12:00-22:00", "We 10:00-18:00"])
        {'Fr': '12:00-22:00', 'Sa': '12:00-22:00', 'We': '10:00-18:00'}
        """
        WEEKDAYS = ["Mo", "Tu", "We", "Th", "Fr", "Sa", "Su"]
        results = {}
        for text in values:
            days, hours = text.split(" ")
            if "-" in days:
                day_start, day_end = days.split("-")
                for day in WEEKDAYS[WEEKDAYS.index(day_start) : WEEKDAYS.index(day_end) + 1]:
                    results[day] = hours
            else:
                results[days] = hours
        return results

    return {
        "name": first("h1.business-name::text"),
        "categories": many(".categories>a::text"),
        "rating": first(".ratings div::attr(class)").split(" ", 1)[-1],
        "rating_count": first(".ratings .count::text").strip("()"),
        "phone": first(".phone::attr(href)").replace("(", "").replace(")", ""),
        "website": first(".website-link::attr(href)"),
        "address": together(".address::text"),
        "work_hours": _parse_datetime(many(".open-details tr time::attr(datetime)")),
    }


async def scrape_company(url: str, session: httpx.AsyncClient) -> Company:
    """scrape yellowpages.com company page details"""
    first_page = await session.get(url)
    return parse_company(first_page)


# to avoid being instantly blocked we should use request headers of a common web browser:
BASE_HEADERS = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US,en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
}


async def run():
    limits = httpx.Limits(max_connections=5)
    async with httpx.AsyncClient(limits=limits, timeout=httpx.Timeout(15.0), headers=BASE_HEADERS) as session:
        result_company = await scrape_company(
            "https://www.yellowpages.com/san-francisco-ca/mip/ozumo-japanese-restaurant-8083027",
            session=session,
        )
        print(json.dumps(result_company, indent=2))


if __name__ == "__main__":
    asyncio.run(run())
Cool! With just a few lines of code, our YellowPages scraper was able to get all the essential business details. Next, we'll scrape the business reviews!
How to Scrape YellowPages Reviews
To scrape business reviews, we'll have to send additional requests to the review pages. For example, if we go back to our Japanese restaurant listing and scroll to the bottom, we can find the review paging URL format:
From the above image, we can see that we can paginate over reviews using the page parameter. And since we know the total number of reviews, we can crawl over review pages to extract all the reviews:
import asyncio
import json
import math
from urllib.parse import urlencode

import httpx
from typing import List
from typing_extensions import TypedDict
from parsel import Selector


class Company(TypedDict):
    """type hint container for company data found on yellowpages.com"""
    # ... rest of the Company fields we defined earlier


def parse_company(response) -> Company:
    """extract company details from yellowpages.com company's page"""
    # ... the parse_company logic we defined earlier


class Review(TypedDict):
    """type hint for yellowpages.com scraped review"""

    id: str
    author: str
    source: str
    date: str
    stars: int
    title: str
    text: str


def parse_reviews(response) -> List[Review]:
    """parse company page for visible reviews"""
    sel = Selector(text=response.text)
    reviews = []
    for box in sel.css("#reviews-container>article"):
        first = lambda css: box.css(css).get("").strip()
        many = lambda css: [value.strip() for value in box.css(css).getall()]
        reviews.append(
            {
                "id": box.attrib.get("id"),
                "author": first("div.author::text"),
                "source": first("span.attribution>a::text"),
                "date": first("p.date-posted>span::text"),
                "stars": len(many(".result-ratings ul>li.rating-star")),
                "title": first(".review-title::text"),
                "text": first(".review-response p::text"),
            }
        )
    return reviews


class CompanyData(TypedDict):
    info: Company
    reviews: List[Review]


# Now we can extend our company scraper to pick up reviews as well!
async def scrape_company(url: str, session: httpx.AsyncClient, get_reviews=True) -> CompanyData:
    """scrape yellowpages.com company page details"""
    first_page = await session.get(url)
    sel = Selector(text=first_page.text)
    if not get_reviews:
        return parse_company(first_page)
    reviews = parse_reviews(first_page)
    if reviews:
        total_reviews = int(sel.css(".pagination-stats::text").re(r"of (\d+)")[0])
        total_pages = int(math.ceil(total_reviews / 20))
        for response in await asyncio.gather(
            *[session.get(url + "?" + urlencode({"page": page})) for page in range(2, total_pages + 1)]
        ):
            reviews.extend(parse_reviews(response))
    return {
        "info": parse_company(first_page),
        "reviews": reviews,
    }
In the above code, we apply the same pagination approach we used in the search scraping logic. We also utilize the company parsing logic to extract the company information alongside the reviews.
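To run this YellowPages scraping code, we can reuse the BASE_HEADERS and httpx session setup from the earlier sections; here's a minimal sketch:
async def run():
    limits = httpx.Limits(max_connections=5)
    async with httpx.AsyncClient(limits=limits, timeout=httpx.Timeout(15.0), headers=BASE_HEADERS) as session:
        # scrape the company page together with all of its reviews
        result = await scrape_company(
            "https://www.yellowpages.com/san-francisco-ca/mip/ozumo-japanese-restaurant-8083027",
            session=session,
        )
        print(json.dumps(result, indent=2))


if __name__ == "__main__":
    asyncio.run(run())
Running it produces results like the following: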
{
  "info": {
    "name": "Ozumo Japanese Restaurant",
    "categories": [
      "Japanese Restaurants",
      "Asian Restaurants",
      "Caterers",
      "Japanese Restaurants",
      "Asian Restaurants",
      "Caterers",
      "Family Style Restaurants",
      "Restaurants",
      "Sushi Bars"
    ],
    "rating": "three half",
    "rating_count": "72",
    "phone": "(415) 882-1333",
    "website": "http://www.ozumo.com",
    "address": "161 Steuart St San Francisco, CA 94105",
    "work_hours": {
      "Mo": "16:00-22:00",
      "Tu": "16:00-22:00",
      "We": "16:00-22:00",
      "Th": "16:00-22:00",
      "Fr": "12:00-22:00",
      "Sa": "12:00-22:00",
      "Su": "12:00-21:00"
    }
  },
  "reviews": [
    {
      "id": "<redacted for blog use>",
      "author": "<redacted for blog use>",
      "source": "Citysearch",
      "date": "03/18/2010",
      "stars": 5,
      "title": "Mindblowing Japanese!",
      "text": "Wow what a dinner! I went to Ozumo last night with a friend for a complimentary meal I had won by being a Citysearch Dictator. It was AMAZING! We ordered the Hanabi (halibut) and Dohyo (ahi tuna) small plates as well as the Gindara (black cod) and Burikama (roasted yellowtail). Everything was absolutely delicious. They paired our meal with a variety of unique wines and sakes. The manager, Hiro, and our waitress were extremely knowledgeable about the food and how it was prepared. We started to tease the manager that he had a story for everything. His most boring story, he said, was about edamame. It was a great experience!"
    },
    ...
  ]
}
With this last feature, we can scrape YellowPages business data from company, search and review pages. However, our YellowPages scraper is very likely to get blocked after sending a few additional requests. Let's explore how we can scale it!
Bypass YellowPages Scraping Blocking
Scraping YellowPages isn't very complicated, but scaling up such scraping operations can be difficult, and this is where ScrapFly can lend a hand!
To take advantage of ScrapFly's API in our YellowPages web scraper, all we need to do is replace our httpx session code with scrapfly-sdk client requests:
# standard web scraping code
import httpx
from parsel import Selector

response = httpx.get("some yellowpages.com URL")
selector = Selector(response.text)

# in ScrapFly becomes this 👇
from scrapfly import ScrapeConfig, ScrapflyClient

# replaces your HTTP client (httpx in this case)
scrapfly = ScrapflyClient(key="Your ScrapFly API key")

response = scrapfly.scrape(ScrapeConfig(
    url="some yellowpages.com URL",
    asp=True,  # enable the anti scraping protection to bypass blocking
    country="US",  # set the proxy location to a specific country
    render_js=True,  # enable rendering JavaScript (like headless browsers) to scrape dynamic content if needed
))

# use the built-in Parsel selector
selector = response.selector
# access the HTML content
html = response.scrape_result['content']
FAQ
To wrap this guide up let's take a look at some frequently asked questions about web scraping YellowPages.
Is it legal to scrape YellowPages.com?
Yes, YellowPages's data is publicly available, and it's legal to scrape it. Scraping YellowPages.com at slow, respectful rates would fall under the ethical scraping definition. For more details, refer to our Is Web Scraping Legal? article.
Is there an API for YellowPages?
No, unfortunately, YellowPages.com doesn't offer APIs for public use. However, as we've covered in this tutorial - scraping YellowPages using Python is straightforward.
Are there alternatives for scraping YellowPages?
Yes, Yelp.com is another public website for business directories. We have covered how to scrape Yelp in a previous guide.
In this article, we explained how to scrape YellowPages in Python. We started by reverse engineering the website behavior to understand its search system and find company pages on the website. Then, we used CSS selectors to parse the HTML pages and extract business details.
Finally, we have explained how to bypass YellowPages scraping blocking using ScrapFly's web scraping API.