In this tutorial, we'll be taking a look at how to scrape YellowPages.com - an online directory of various US-based businesses.
YellowPages is the digital version of telephone directories called yellow pages. It contains business information such as phone numbers, websites, and addresses as well as business reviews.
In this tutorial, we'll be using Python to scrape all of that business and review information. Let's dive in!
YellowPages.com contains millions of businesses and their details like phone numbers, websites and locations. All of this data can be used in various market and business analytics to acquire a competitive advantage or generate leads. Not only that, but YellowPages also contains review data, business images and service menus which can further be used in market analysis. For more on scraping use cases see our extensive web scraping use case article.
Project Setup
To begin, we should first note that yellowpages.com is only accessible from US-based IP addresses. So, if you're located outside of the US, you'll need a US-based proxy or VPN to access yellowpages.com.
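For example, here's a minimal sketch of how a proxy could be attached to the httpx client we'll be using throughout this tutorial. The proxy address below is a made-up placeholder, so swap in your own provider's details (note that recent httpx versions use the `proxy=` argument instead of `proxies=`):

```python
import httpx

# hypothetical US-based proxy address - replace with your own proxy provider's details
US_PROXY = "http://username:password@us-proxy.example.com:8000"

# route all of this client's requests through the US proxy
client = httpx.AsyncClient(proxies=US_PROXY)
```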
As for code, in this tutorial we'll be using Python and two major community packages:
httpx - HTTP client library which will let us communicate with YellowPages.com's servers
parsel - HTML parsing library which will help us parse the HTML we scrape. In this tutorial, we'll stick with CSS selectors as YellowPages' HTML is quite simple.
Optionally we'll also use loguru - a pretty logging library that'll help us keep track of what's going on.
These packages can be easily installed via the pip command:
$ pip install httpx parsel loguru
Alternatively, feel free to swap httpx out with any other HTTP client package such as requests, as we'll only need basic HTTP functions which are almost interchangeable in every library. As for parsel, another great alternative is the beautifulsoup package or anything else that supports CSS selectors, which is what we'll be using in this tutorial.
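To illustrate how interchangeable these parsing packages are, here's a quick sketch selecting the same element with parsel and with beautifulsoup - the HTML snippet is made up purely for demonstration:

```python
from parsel import Selector
from bs4 import BeautifulSoup

html = '<a class="business-name"><span>Ozumo Japanese Restaurant</span></a>'

# parsel - what we'll be using throughout this tutorial
print(Selector(text=html).css("a.business-name span::text").get())

# beautifulsoup - the same element through an equivalent CSS selector
print(BeautifulSoup(html, "html.parser").select_one("a.business-name span").text)
```

Both print "Ozumo Japanese Restaurant" - the parsing approach stays the same, only the API differs.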
Our first goal is to figure out how to find the companies we want to scrape on YellowPages.com. There are a few ways of achieving this. First, if we wanted to scrape all of the businesses in a specific area, we could use the yellowpages.com/sitemap page which contains links to all categories and locations.
However, we'll be using a more flexible and easier approach by scraping YellowPages search:
Once we click "Find" we are redirected to the results page - in this example, with 427 results.
We can see that when we submit a search request, YellowPages takes us to a new URL containing pages of results. Let's see how we can scrape this.
Scraping Yellowpages Search
To scrape YellowPages search, we'll form the search URL from given parameters and then iterate through multiple page URLs to collect all business listings.
If we take a look at the URL format:

https://www.yellowpages.com/search?search_terms=Japanese+Restaurants&geo_location_terms=San+Francisco%2C+CA&page=2

We can see that it takes in a few key parameters: the search query (e.g. "Japanese Restaurants"), the location and the page number. Let's take a look at how we can scrape this efficiently.
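For example, here's a quick sketch of how such a search URL can be assembled with Python's standard urlencode - we'll reuse this exact approach in the scraper below:

```python
from urllib.parse import urlencode

parameters = {
    "search_terms": "Japanese Restaurants",     # search query
    "geo_location_terms": "San Francisco, CA",  # location
    "page": 2,                                  # result page number
}
print("https://www.yellowpages.com/search?" + urlencode(parameters))
# https://www.yellowpages.com/search?search_terms=Japanese+Restaurants&geo_location_terms=San+Francisco%2C+CA&page=2
```

With the URL figured out, the remaining work is parsing the returned HTML for business previews.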
For this, we'll be using parsel package with a few CSS selectors:
import asyncio
import math
from urllib.parse import urlencode, urljoin
import httpx
from parsel import Selector
from loguru import logger as log
from typing_extensions import TypedDict
from typing import Dict, List, Optional
class Preview(TypedDict):
"""Type hint container for business preview data. This object just helps us to keep track what results we'll be getting"""
name: str
url: str
links: List[str]
phone: str
    categories: List[str]
address: str
location: str
rating: str
rating_count: str
def parse_search(response) -> List[Preview]:
"""parse yellowpages.com search page for business preview data"""
sel = Selector(text=response.text)
parsed = []
for result in sel.css(".organic div.result"):
links = {}
for link in result.css("div.links>a"):
name = link.xpath("text()").get()
url = link.xpath("@href").get()
links[name] = url
first = lambda css: result.css(css).get("").strip()
many = lambda css: [value.strip() for value in result.css(css).getall()]
parsed.append(
{
"name": first("a.business-name ::text"),
"url": urljoin("https://www.yellowpages.com/", first("a.business-name::attr(href)")),
"links": links,
"phone": first("div.phone::text"),
"categories": many(".categories>a::text"),
"address": first(".adr .street-address::text"),
"location": first(".adr .locality::text"),
"rating": first(".ratings .rating div::attr(class)").split(" ", 1)[-1],
"rating_count": first(".ratings .rating span::text").strip("()"),
}
)
return parsed
In the above code, we first isolate each result by its bounding box and iterate through each of the 30 result boxes:
search page parsing markup
In each iteration, we use relative CSS selectors to collect business preview information such as phone number, rating, name and most importantly, link to their full information page.
Let's run our scraper and see the results it generates:
Run code & example output
import asyncio
import json
# to avoid being instantly blocked we should use request headers of a common web browser:
BASE_HEADERS = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US,en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
}
# to run our scraper we need to start httpx session:
async def run():
limits = httpx.Limits(max_connections=5)
async with httpx.AsyncClient(limits=limits, timeout=httpx.Timeout(15.0), headers=BASE_HEADERS) as session:
response = await session.get("https://www.yellowpages.com/search?search_terms=Japanese+Restaurants&geo_location_terms=San+Francisco%2C+CA")
result_search = parse_search(response)
print(json.dumps(result_search, indent=2))
if __name__ == "__main__":
asyncio.run(run())
We can scrape a single search page so now all we have to do is wrap this logic in a scraping loop:
async def search(query: str, session: httpx.AsyncClient, location: Optional[str] = None) -> List[Preview]:
"""search yellowpages.com for business preview information scraping all of the pages"""
def make_search_url(page):
base_url = "https://www.yellowpages.com/search?"
parameters = {"search_terms": query, "geo_location_terms": location, "page": page}
return base_url + urlencode(parameters)
log.info(f'scraping "{query}" in "{location}"')
first_page = await session.get(make_search_url(1))
sel = Selector(text=first_page.text)
    total_results = int(sel.css(".pagination>span::text").re(r"of (\d+)")[0])
total_pages = int(math.ceil(total_results / 30))
log.info(f'{query} in {location}: scraping {total_pages} of business preview pages')
previews = parse_search(first_page)
for result in await asyncio.gather(*[session.get(make_search_url(page)) for page in range(2, total_pages + 1)]):
previews.extend(parse_search(result))
log.info(f'{query} in {location}: scraped {len(previews)} total of business previews')
return previews
Run code
async def run():
limits = httpx.Limits(max_connections=5)
async with httpx.AsyncClient(limits=limits, timeout=httpx.Timeout(15.0), headers=BASE_HEADERS) as session:
result_search = await search("japanese restaurants", location="San Francisco, CA", session=session)
print(json.dumps(result_search, indent=2))
if __name__ == "__main__":
asyncio.run(run())
The above function implements a complete scraping loop. We generate the search URL from the given query and location parameters. Then, we scrape the first results page to extract the total number of results, and scrape the remaining pages concurrently. This is a common pagination web scraping idiom:
efficient pagination scraping: get total results from first page and then scrape the rest of the pages together!
Now that we know how to find businesses and their preview data, let's take a look at how we can scrape all of the business data by visiting each of the pages we found.
Scraping Yellowpages Company Data
To scrape company data we need to request each company's YellowPages URL which we found previously. Let's start with an example URL of a restaurant business: yellowpages.com/san-francisco-ca/mip/ozumo-japanese-restaurant-8083027
We can see that the page contains a lot of business data we can scrape. Let's scrape these marked-up fields:
import httpx
from parsel import Selector
from typing_extensions import TypedDict
from typing import Dict, List, Optional
from loguru import logger as log
class Company(TypedDict):
"""type hint container for company data found on yellowpages.com"""
name: str
categories: List[str]
rating: str
rating_count: str
phone: str
website: str
address: str
work_hours: Dict[str, str]
def parse_company(response) -> Company:
"""extract company details from yellowpages.com company's page"""
sel = Selector(text=response.text)
    # here we define some lambda shortcuts for parsing common data like
    # selecting the first element, many elements or joining all elements together
first = lambda css: sel.css(css).get("").strip()
many = lambda css: [value.strip() for value in sel.css(css).getall()]
together = lambda css, sep=" ": sep.join(sel.css(css).getall())
# to parse working hours we need to do a bit of complex string parsing
def _parse_datetime(values: List[str]):
"""
parse datetime from yellow pages datetime strings
>>> _parse_datetime(["Fr-Sa 12:00-22:00"])
{'Fr': '12:00-22:00', 'Sa': '12:00-22:00'}
>>> _parse_datetime(["Fr 12:00-22:00"])
{'Fr': '12:00-22:00'}
>>> _parse_datetime(["Fr-Sa 12:00-22:00", "We 10:00-18:00"])
{'Fr': '12:00-22:00', 'Sa': '12:00-22:00', 'We': '10:00-18:00'}
"""
WEEKDAYS = ["Mo", "Tu", "We", "Th", "Fr", "Sa", "Su"]
results = {}
for text in values:
days, hours = text.split(" ")
if "-" in days:
day_start, day_end = days.split("-")
for day in WEEKDAYS[
WEEKDAYS.index(day_start) : WEEKDAYS.index(day_end) + 1
]:
results[day] = hours
else:
results[days] = hours
return results
return {
"name": first("h1.business-name::text"),
"categories": many(".categories>a::text"),
"rating": first(".ratings div::attr(class)").split(" ", 1)[-1],
"ratingCount": first(".ratings .count::text").strip("()"),
"phone": first(".phone::attr(href)").replace("(", "").replace(")", ""),
"website": first(".website-link::attr(href)"),
"address": together(".address::text"),
"workingHours": _parse_datetime(many(".open-details tr time::attr(datetime)")),
}
async def scrape_company(url: str, session: httpx.AsyncClient) -> Company:
"""scrape yellowpage.com company page details"""
first_page = await session.get(url)
return parse_company(first_page)
As you can see, most of this code is HTML parsing. We retrieved the business page, built a parsel.Selector from it, and then used a few CSS selectors to extract the specific fields we marked up earlier.
We can also see how easy it is to process data in Python:
We cleaned up scraped Yellow Pages phone numbers.
We unpacked work day ranges like Mo-We into individual values like Mo, Tu, We
Let's run our Yellow Pages scraper and see the results it produces:
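Here's a minimal run sketch, reusing the BASE_HEADERS and session setup from the search section (the URL is the Ozumo restaurant page we've been using as an example):

```python
import asyncio
import json

import httpx

async def run():
    limits = httpx.Limits(max_connections=5)
    async with httpx.AsyncClient(limits=limits, timeout=httpx.Timeout(15.0), headers=BASE_HEADERS) as session:
        result_company = await scrape_company(
            "https://www.yellowpages.com/san-francisco-ca/mip/ozumo-japanese-restaurant-8083027",
            session=session,
        )
        print(json.dumps(result_company, indent=2, ensure_ascii=False))

if __name__ == "__main__":
    asyncio.run(run())
```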
We see that we can scrape all of this information with just a few lines of code! There are a few more interesting data points available on the page, like menu details and photos; however, let's stick to the basics in this tutorial and continue with reviews.
Scraping Yellowpages Reviews
To scrape business reviews we'll have to make several additional requests as they are scattered across several pages. For example, if we go back to our Japanese restaurant listing and scroll all the way to the bottom, we can see the review paging URL format:
Using the inspect function of browser developer tools (right-click -> Inspect) we can see the next page link structure
We can see that to get the next page all we need to do is add a ?page=2 parameter, and since we know the total number of reviews, we can scrape them the same way we scraped the search results:
import asyncio
import math
from typing import List
from typing_extensions import TypedDict
from urllib.parse import urlencode
import httpx
from parsel import Selector
class Review(TypedDict):
"""type hint for yellowpages.com scraped review"""
id: str
author: str
source: str
date: str
    stars: int
title: str
text: str
def parse_reviews(response) -> List[Review]:
"""parse company page for visible reviews"""
sel = Selector(text=response.text)
reviews = []
for box in sel.css("#reviews-container>article"):
first = lambda css: box.css(css).get("").strip()
many = lambda css: [value.strip() for value in box.css(css).getall()]
reviews.append(
{
"id": box.attrib.get("id"),
"author": first("div.author::text"),
"source": first("span.attribution>a::text"),
"date": first("p.date-posted>span::text"),
"stars": len(many(".result-ratings ul>li.rating-star")),
"title": first(".review-title::text"),
"text": first(".review-response p::text"),
}
)
return reviews
class CompanyData(TypedDict):
info: Company
reviews: List[Review]
# Now we can extend our company scraper to pick up reviews as well!
async def scrape_company(url, session: httpx.AsyncClient, get_reviews=True) -> CompanyData:
"""scrape yellowpage.com company page details"""
first_page = await session.get(url)
sel = Selector(text=first_page.text)
if not get_reviews:
return parse_company(first_page)
reviews = parse_reviews(first_page)
if reviews:
total_reviews = int(sel.css(".pagination-stats::text").re(r"of (\d+)")[0])
total_pages = int(math.ceil(total_reviews / 20))
for response in await asyncio.gather(
*[session.get(url + "?" + urlencode({"page": page})) for page in range(2, total_pages + 1)]
):
reviews.extend(parse_reviews(response))
return {
"info": parse_company(first_page),
"reviews": reviews,
}
Above we combined what we've learned from scraping search - paging through multiple pages - and what we've learned from scraping company info - parsing HTML using CSS selectors. With these additional features our scraper can collect company information and review data. Let's take it for a spin:
Run code & example output
import asyncio
import json
import math
from typing import Dict, List, Optional
from urllib.parse import urlencode, urljoin
import httpx
from loguru import logger as log
from parsel import Selector
from typing_extensions import TypedDict
BASE_HEADERS = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US,en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
}
async def run():
limits = httpx.Limits(max_connections=5)
async with httpx.AsyncClient(limits=limits, timeout=httpx.Timeout(15.0), headers=BASE_HEADERS) as session:
result_company = await scrape_company(
"https://www.yellowpages.com/san-francisco-ca/mip/ozumo-japanese-restaurant-8083027", session=session,
)
print(json.dumps(result_company, indent=2, ensure_ascii=False))
if __name__ == "__main__":
asyncio.run(run())
{
"info": {
"name": "Ozumo Japanese Restaurant",
"categories": [
"Japanese Restaurants",
"Asian Restaurants",
"Caterers",
"Japanese Restaurants",
"Asian Restaurants",
"Caterers",
"Family Style Restaurants",
"Restaurants",
"Sushi Bars"
],
"rating": "three half",
"rating_count": "72",
"phone": "(415) 882-1333",
"website": "http://www.ozumo.com",
"address": "161 Steuart St San Francisco, CA 94105",
"work_hours": {
"Mo": "16:00-22:00",
"Tu": "16:00-22:00",
"We": "16:00-22:00",
"Th": "16:00-22:00",
"Fr": "12:00-22:00",
"Sa": "12:00-22:00",
"Su": "12:00-21:00"
}
},
"reviews": [
{
"id": "<redacted for blog use>",
"author": "<redacted for blog use>",
"source": "Citysearch",
"date": "03/18/2010",
"stars": 5,
"title": "Mindblowing Japanese!",
"text": "Wow what a dinner! I went to Ozumo last night with a friend for a complimentary meal I had won by being a Citysearch Dictator. It was AMAZING! We ordered the Hanabi (halibut) and Dohyo (ahi tuna) small plates as well as the Gindara (black cod) and Burikama (roasted yellowtail). Everything was absolutely delicious. They paired our meal with a variety of unique wines and sakes. The manager, Hiro, and our waitress were extremely knowledgeable about the food and how it was prepared. We started to tease the manager that he had a story for everything. His most boring story, he said, was about edamame. It was a great experience!"
},
...
]
}
We've learned how to scrape YellowPages' search, company data and reviews. However, YellowPages often blocks web scrapers from accessing its public data, so to scrape this target at scale, let's take a look at how we can avoid these hurdles using ScrapFly's web scraping API.
Bypass Yellowpages Blocks with Scrapfly
We looked at how to scrape YellowPages.com; however, when scraping at scale we are likely to be either blocked or served captchas to solve, which will hinder or completely disable our web scraper.
To get around this, let's take advantage of the ScrapFly API, which can avoid all of these blocks for us with just a few extra lines of Python code!
ScrapFly offers several powerful features that'll help us get around YellowPages.com's web scraper blocking.
For this, we'll be using the scrapfly-sdk Python package. To start, let's install it using pip:
$ pip install scrapfly-sdk
To take advantage of ScrapFly's API in our YellowPages web scraper, all we need to do is replace our httpx session code with scrapfly-sdk client requests. See the Full Scraper Code section for more.
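As an illustration, here's a rough sketch of what a ScrapFly-powered request could look like using the SDK's ScrapflyClient and ScrapeConfig objects - treat it as a starting point and check the current SDK documentation for exact parameters:

```python
import asyncio
from parsel import Selector
from scrapfly import ScrapflyClient, ScrapeConfig

scrapfly = ScrapflyClient(key="YOUR SCRAPFLY API KEY")

async def scrape_url(url: str) -> Selector:
    # asp=True enables the anti-scraping protection bypass,
    # country="US" routes the request through a US proxy (yellowpages.com is US-only)
    result = await scrapfly.async_scrape(ScrapeConfig(url=url, asp=True, country="US"))
    return Selector(text=result.content)

async def run():
    sel = await scrape_url(
        "https://www.yellowpages.com/search?search_terms=Japanese+Restaurants&geo_location_terms=San+Francisco%2C+CA"
    )
    print(sel.css("a.business-name ::text").getall())

if __name__ == "__main__":
    asyncio.run(run())
```

The parsing functions we wrote earlier can then be pointed at the returned HTML exactly as before.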
FAQ
To wrap this guide up let's take a look at some frequently asked questions about web scraping YellowPages.com:
Is it legal to scrape YellowPages.com?
Yes, YellowPages's data is publicly available, and we're not extracting anything personal or private. Scraping YellowPages.com at slow, respectful rates would fall under the ethical scraping definition.
That being said, attention should be paid to GDPR compliance in the EU when scraping personal data, such as reviewers' personal details from the reviews section. For more, see our Is Web Scraping Legal? article.
Is there a Yellow Pages API?
No, unfortunately, YellowPages.com doesn't offer an API, but as we've covered in this tutorial - this website is easy to scrape in Python!
Summary
In this tutorial, we built a Yellow Pages data scraper in Python. We first familiarized ourselves with how the website works using browser developer tools. Then, we replicated the search system in our scraper to find business listings from given queries.
To scrape the business data itself, we built CSS selectors for various fields like business ratings and contact details (such as phone numbers, websites, etc.).
We used Python with the httpx and parsel packages, and to prevent blocking we used ScrapFly's API, which smartly configures every web scraper connection to avoid being blocked. For more on ScrapFly see our documentation and try it out for free!