How to Scrape YellowPages.com

In this web scraping tutorial, we'll be scraping YellowPages.com - an online directory of various US-based businesses.

YellowPages is the digital version of telephone directories called yellow pages. It contains business information such as phone numbers, websites, addresses as well as business reviews.
In this tutorial, we'll be using Python to scrape all of that business and review information. Let's dive in!

Web Scraping With Python Tutorial

If you're new to web scraping with Python we recommend checking out our full introduction tutorial to web scraping with Python and common best practices.


Why Scrape YellowPages.com?

YellowPages.com contains millions of businesses and their details like phone numbers, websites and locations. All of this data can be used in various market and business analytics to acquire a competitive advantage or generate leads. On top of that, YellowPages also contains review data, business images and service menus, which can further be used in market analysis. For more on scraping use cases, see our extensive web scraping use case article.

Setup & Prerequisites

To begin, we should first note that yellowpages.com is only accessible to US-based IP addresses. So, if you're located outside of the US, you'll need a US-based proxy or VPN to access yellowpages.com.
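
If you go the proxy route, routing the httpx client we'll install below through a US-based proxy is just one extra argument. Here's a minimal sketch - the proxy endpoint and credentials are placeholders you'd replace with your own provider's details:

import asyncio
import httpx

# hypothetical US proxy endpoint - replace with your own provider's details
US_PROXY = "http://username:password@us.proxy.example.com:8000"

async def check_access():
    # `proxies` applies this proxy to all outgoing requests
    # (on newer httpx versions the argument is called `proxy` instead)
    async with httpx.AsyncClient(proxies=US_PROXY, timeout=15.0) as client:
        response = await client.get("https://www.yellowpages.com/")
        print(response.status_code)  # expect 200 when routed through a US IP

asyncio.run(check_access())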

As for code, in this tutorial we'll be using Python and two major community packages:

  • httpx - HTTP client library which will let us communicate with YellowPages.com's servers
  • parsel - HTML parsing library which will help us parse our web scraped HTML files. In this tutorial, we'll stick with CSS selectors as YellowPages' HTML is quite simple.

Optionally we'll also use loguru - a pretty logging library that'll help us keep track of what's going on.

These packages can be easily installed via pip command:

$ pip install httpx parsel loguru

Alternatively, feel free to swap httpx out with any other HTTP client package such as requests, as we'll only need basic HTTP functions, which are almost interchangeable between libraries. As for parsel, another great alternative is the beautifulsoup package, or anything else that supports CSS selectors, which is what we'll be using in this tutorial.
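
For illustration, here's a rough sketch of what fetching and parsing a single search page could look like with requests and beautifulsoup4 instead - note that requests is synchronous, so we'd lose the concurrency we rely on later in this tutorial:

import requests
from bs4 import BeautifulSoup

url = "https://www.yellowpages.com/search?search_terms=Japanese+Restaurants&geo_location_terms=San+Francisco%2C+CA"
response = requests.get(url, headers={"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36"})
soup = BeautifulSoup(response.text, "html.parser")
# BeautifulSoup's .select() accepts the same CSS selectors we'll use with parsel
for result in soup.select(".organic div.result"):
    name = result.select_one("a.business-name")
    print(name.get_text(strip=True) if name else None)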

Parsing HTML with CSS Selectors

If you're new to CSS selectors check out our complete and interactive introduction article that goes over essential CSS selector syntax and common usage in web scraping.


Finding Companies

Our first goal is to figure out how to find the companies we want to scrape on YellowPages.com. There are a few ways of achieving this. First, if we wanted to scrape all of the businesses in a specific area, we could use the yellowpages.com/sitemap page, which contains links to all categories and locations.

However, we'll be using a more flexible and easier approach by scraping YellowPages search:

once we click "Find" we are redirected to a results page with 427 results

We can see that when we submit a search request, YellowPages takes us to a new URL containing pages of results. Let's see how we can scrape this.

To scrape YellowPages search, we'll be forming a search URL from the given parameters and then iterating through multiple page URLs to collect all business listings.
If we take a look at the URL format:

search page url structure

We can see that it takes in a few key parameters: query (e.g. "Japanese Restaurant"), location and the page number. Let's take a look at how we can scrape this efficiently.

First, let's take a look at scraping a single search page, like Japanese restaurants in San Francisco, California:
yellowpages.com/search?search_terms=Japanese+Restaurants&geo_location_terms=San+Francisco%2C+CA
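
Before writing any parsing code, we can assemble such search URLs with urllib's urlencode - the same pattern we'll reuse in the full scraping loop later on:

from urllib.parse import urlencode

def make_search_url(query: str, location: str, page: int = 1) -> str:
    """build a yellowpages.com search URL from query, location and page number"""
    parameters = {"search_terms": query, "geo_location_terms": location, "page": page}
    return "https://www.yellowpages.com/search?" + urlencode(parameters)

print(make_search_url("Japanese Restaurants", "San Francisco, CA", page=2))
# https://www.yellowpages.com/search?search_terms=Japanese+Restaurants&geo_location_terms=San+Francisco%2C+CA&page=2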

For this, we'll be using parsel package with a few CSS selectors:

import asyncio
import math
from urllib.parse import urlencode, urljoin

import httpx
from parsel import Selector
from loguru import logger as log
from typing_extensions import TypedDict
from typing import Dict, List, Optional


class Preview(TypedDict):
    """Type hint container for business preview data. This object just helps us to keep track what results we'll be getting"""
    name: str
    url: str
    links: List[str]
    phone: str
    categories: List[str]
    address: str
    location: str
    rating: str
    rating_count: str


def parse_search(response) -> List[Preview]:
    """parse yellowpages.com search page for business preview data"""
    sel = Selector(text=response.text)
    parsed = []
    for result in sel.css(".organic div.result"):
        links = {}
        for link in result.css("div.links>a"):
            name = link.xpath("text()").get()
            url = link.xpath("@href").get()
            links[name] = url
        first = lambda css: result.css(css).get("").strip()
        many = lambda css: [value.strip() for value in result.css(css).getall()]
        parsed.append(
            {
                "name": first("a.business-name ::text"),
                "url": urljoin("https://www.yellowpages.com/", first("a.business-name::attr(href)")),
                "links": links,
                "phone": first("div.phone::text"),
                "categories": many(".categories>a::text"),
                "address": first(".adr .street-address::text"),
                "location": first(".adr .locality::text"),
                "rating": first("section.ratings .rating-stars::attr(class)").split(" ", 1)[-1],
                "rating_count": first("section.ratings .count::text").strip("()"),
            }
        )
    return parsed

In the above code, we first isolate each result by its bounding box and iterate through each of the 30 result boxes on the page:

search page parsing markup

In each iteration, we use relative CSS selectors to collect business preview information such as phone number, rating, name and most importantly, link to their full information page.

Let's run this little scraper and see the results it generates:

Run code & example output
import asyncio
import json

# to avoid being instantly blocked we should use request headers of a common web browser:
BASE_HEADERS = {
    "accept-language": "en-US,en;q=0.9",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-encoding": "gzip, deflate, br",
}


# to run our scraper we need to start httpx session:
async def run():
    limits = httpx.Limits(max_connections=5)
    async with httpx.AsyncClient(limits=limits, timeout=httpx.Timeout(15.0), headers=BASE_HEADERS) as session:
        response = await session.get("https://www.yellowpages.com/search?search_terms=Japanese+Restaurants&geo_location_terms=San+Francisco%2C+CA")
        result_search = parse_search(response)
        print(json.dumps(result_search, indent=2))

if __name__ == "__main__":
    asyncio.run(run())
[
  {
    "name": "Ichiraku",
    "url": "https://www.yellowpages.com/san-francisco-ca/mip/ichiraku-6317061",
    "links": {
      "View Menu": "/san-francisco-ca/mip/ichiraku-6317061#open-menu"
    },
    "phone": "(415) 668-9918",
    "categories": [
      "Japanese Restaurants",
      "Asian Restaurants",
      "Take Out Restaurants"
    ],
    "address": "3750 Geary Blvd",
    "location": "San Francisco, CA 94118",
    "rating": "four half",
    "rating_count": "13"
  },
  {
    "name": "Benihana",
    "url": "https://www.yellowpages.com/san-francisco-ca/mip/benihana-458857411",
    "links": {
      "Website": "http://www.benihana.com/locations/sanfrancisco-ca-sf"
    },
    "phone": "(415) 563-4844",
    "categories": [
      "Japanese Restaurants",
      "Bar & Grills",
      "Restaurants"
    ],
    "address": "1737 Post St",
    "location": "San Francisco, CA 94115",
    "rating": "three half",
    "rating_count": "10"
  },
  ...
]

We can now scrape a single search page, so all that's left is to wrap this logic in a scraping loop:

async def search(query: str, session: httpx.AsyncClient, location: Optional[str] = None) -> List[Preview]:
    """search yellowpages.com for business preview information scraping all of the pages"""
    def make_search_url(page):
        base_url = "https://www.yellowpages.com/search?"
        parameters = {"search_terms": query, "geo_location_terms": location, "page": page}
        return base_url + urlencode(parameters)

    log.info(f'scraping "{query}" in "{location}"')
    first_page = await session.get(make_search_url(1))
    sel = Selector(text=first_page.text)
    total_results = int(sel.css(".pagination>span::text").re(r"of (\d+)")[0])
    total_pages = int(math.ceil(total_results / 30))
    log.info(f'{query} in {location}: scraping {total_pages} pages of business previews')
    previews = parse_search(first_page)
    for result in await asyncio.gather(*[session.get(make_search_url(page)) for page in range(2, total_pages + 1)]):
        previews.extend(parse_search(result))
    log.info(f'{query} in {location}: scraped {len(previews)} business previews in total')
    return previews
Run code
async def run():
    limits = httpx.Limits(max_connections=5)
    async with httpx.AsyncClient(limits=limits, timeout=httpx.Timeout(15.0), headers=BASE_HEADERS) as session:
        result_search = await search("japanese restaurants", location="San Francisco, CA", session=session)
        print(json.dumps(result_search, indent=2))

if __name__ == "__main__":
    asyncio.run(run())

The above function implements a complete scraping loop. We generate the search URL from the given query and location parameters. Then, we scrape the first results page to extract the total result count and scrape the remaining pages concurrently. This is a common pagination web scraping idiom:

illustration of pagination scraping idiom

efficient pagination scraping: get total results from first page and then scrape the rest of the pages together!

Now that we know how to find businesses and their preview data, let's take a look at how we can scrape the full business data by visiting each of the pages we've found.

Scraping Company Data

To scrape company data we need to request each company's YellowPages URL that we found previously. Let's start with an example page of a restaurant business: ozumo-japanese-restaurant-8083027

company page field markup

We can see that the page contains a lot of business data we can scrape. Let's scrape these marked-up fields:

import httpx
from parsel import Selector
from typing_extensions import TypedDict
from typing import Dict, List, Optional
from loguru import logger as log


class Company(TypedDict):
    """type hint container for company data found on yellowpages.com"""
    name: str
    categories: List[str]
    rating: str
    rating_count: str
    phone: str
    website: str
    address: str
    work_hours: Dict[str, str] 


def parse_company(response) -> Company:
    """extract company details from yellowpages.com company's page"""
    sel = Selector(text=response.text)
    # here we define some lambda shortcuts for parsing common data:
    # selecting the first element, selecting many elements and joining all elements together
    first = lambda css: sel.css(css).get("").strip()
    many = lambda css: [value.strip() for value in sel.css(css).getall()]
    together = lambda css, sep=" ": sep.join(sel.css(css).getall())

    # to parse working hours we need to do a bit of complex string parsing
    def _parse_datetime(values: List[str]):
        """
        parse datetime from yellow pages datetime strings

        >>> _parse_datetime(["Fr-Sa 12:00-22:00"])
        {'Fr': '12:00-22:00', 'Sa': '12:00-22:00'}
        >>> _parse_datetime(["Fr 12:00-22:00"])
        {'Fr': '12:00-22:00'}
        >>> _parse_datetime(["Fr-Sa 12:00-22:00", "We 10:00-18:00"])
        {'Fr': '12:00-22:00', 'Sa': '12:00-22:00', 'We': '10:00-18:00'}
        """

        WEEKDAYS = ["Mo", "Tu", "We", "Th", "Fr", "Sa", "Su"]
        results = {}
        for text in values:
            days, hours = text.split(" ")
            if "-" in days:
                day_start, day_end = days.split("-")
                for day in WEEKDAYS[WEEKDAYS.index(day_start) : WEEKDAYS.index(day_end) + 1]:
                    results[day] = hours
            else:
                results[days] = hours
        return results

    return {
        "name": first("h1.business-name::text"),
        "categories": many(".categories>a::text"),
        "rating": first(".rating .result-rating::attr(class)").split(" ", 1)[-1],
        "rating_count": first(".rating .count::text").strip("()"),
        "phone": first("#main-aside .phone>strong::text"),
        "website": first("#main-aside .website-link::attr(href)"),
        "address": together("#main-aside .address ::text"),
        "work_hours": _parse_datetime(many(".open-details tr time::attr(datetime)")),
    }



async def scrape_company(url: str, session: httpx.AsyncClient) -> Company:
    """scrape yellowpage.com company page details"""
    first_page = await session.get(url)
    return parse_company(first_page)

As you can see, most of this code is HTML parsing. We simply retrieve the business page, build a parsel.Selector and, with a few clever CSS selectors, extract all of the fields we've marked up! We also unpack the working hours from ranges spanning several days (e.g. Monday-Friday) into individual days to illustrate how easily we can process text data in Python.

Let's run our scraper and see the results it produces:

Run code & example output
import httpx
import json
import asyncio


BASE_HEADERS = {
    "accept-language": "en-US,en;q=0.9",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-encoding": "gzip, deflate, br",
}


async def run():
    limits = httpx.Limits(max_connections=5)
    async with httpx.AsyncClient(limits=limits, timeout=httpx.Timeout(15.0), headers=BASE_HEADERS) as session:
        result_company = await scrape_company(
            "https://www.yellowpages.com/san-francisco-ca/mip/ozumo-japanese-restaurant-8083027", session=session,
        )
        print(json.dumps(result_company))


if __name__ == "__main__":
    asyncio.run(run())
{
  "name": "Ozumo Japanese Restaurant",
  "categories": [
    "Japanese Restaurants",
    "Asian Restaurants",
    "Caterers",
    "Japanese Restaurants",
    "Asian Restaurants",
    "Caterers",
    "Family Style Restaurants",
    "Restaurants",
    "Sushi Bars"
  ],
  "rating": "three half",
  "rating_count": "72",
  "phone": "(415) 882-1333",
  "website": "http://www.ozumo.com",
  "address": "161 Steuart St San Francisco, CA 94105",
  "work_hours": {
    "Mo": "16:00-22:00",
    "Tu": "16:00-22:00",
    "We": "16:00-22:00",
    "Th": "16:00-22:00",
    "Fr": "12:00-22:00",
    "Sa": "12:00-22:00",
    "Su": "12:00-21:00"
  }
}

We can scrape all of this information with just a few lines of code! There are a few more interesting data points available on the page, like menu details and photos; however, let's stick to the basics in this tutorial and continue with reviews.

Scraping Reviews

To scrape business reviews we'll have to make several additional requests, as reviews are spread across multiple pages. For example, if we go back to our Japanese restaurant listing and scroll all the way to the bottom, we can see the review paging URL format:

review page url structure
using the inspect function of browser developer tools (right-click -> inspect) we can see the next page link structure

We can see that for the next page all we need to do is add a ?page=2 parameter, and since we know the total number of reviews, we can scrape them the same way we scraped the search results:

import asyncio
import math
from typing import List
from typing_extensions import TypedDict
from urllib.parse import urlencode
import httpx

from parsel import Selector


class Review(TypedDict):
    """type hint for yellowpages.com scraped review"""
    id: str
    author: str
    source: str
    date: str
    stars: int
    title: str
    text: str


def parse_reviews(response) -> List[Review]:
    """parse company page for visible reviews"""
    sel = Selector(text=response.text)
    reviews = []
    for box in sel.css("#reviews-container>article"):
        first = lambda css: box.css(css).get("").strip()
        many = lambda css: [value.strip() for value in box.css(css).getall()]
        reviews.append(
            {
                "id": box.attrib.get("id"),
                "author": first("div.author::text"),
                "source": first("span.attribution>a::text"),
                "date": first("p.date-posted>span::text"),
                "stars": len(many(".result-ratings ul>li.rating-star")),
                "title": first(".review-title::text"),
                "text": first(".review-response p::text"),
            }
        )
    return reviews

class CompanyData(TypedDict):
    info: Company
    reviews: List[Review]

# Now we can extend our company scraper to pick up reviews as well!
async def scrape_company(url, session: httpx.AsyncClient, get_reviews=True) -> CompanyData:
    """scrape yellowpage.com company page details"""
    first_page = await session.get(url)
    sel = Selector(text=first_page.text)
    if not get_reviews:
        return parse_company(first_page)
    reviews = parse_reviews(first_page)
    if reviews:
        total_reviews = int(sel.css(".pagination-stats::text").re(r"of (\d+)")[0])
        total_pages = int(math.ceil(total_reviews / 20))
        for response in await asyncio.gather(
            *[session.get(url + "?" + urlencode({"page": page})) for page in range(2, total_pages + 1)]
        ):
            reviews.extend(parse_reviews(response))
    return {
        "info": parse_company(first_page),
        "reviews": reviews,
    }

Above we combined what we've learned from scraping search - paging through multiple pages - and what we've learned from scraping company info - parsing HTML using CSS selectors. With these additional features our scraper can collect company information and review data. Let's take it for a spin:

Run code & example output
import asyncio
import json
import math
from typing import Dict, List, Optional
from urllib.parse import urlencode, urljoin

import httpx
from loguru import logger as log
from parsel import Selector
from typing_extensions import TypedDict


BASE_HEADERS = {
    "accept-language": "en-US,en;q=0.9",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-encoding": "gzip, deflate, br",
}


async def run():
    limits = httpx.Limits(max_connections=5)
    async with httpx.AsyncClient(limits=limits, timeout=httpx.Timeout(15.0), headers=BASE_HEADERS) as session:
        result_company = await scrape_company(
            "https://www.yellowpages.com/san-francisco-ca/mip/ozumo-japanese-restaurant-8083027", session=session,
        )
        print(json.dumps(result_company, indent=2, ensure_ascii=False))


if __name__ == "__main__":
    asyncio.run(run())
{
  "info": {
    "name": "Ozumo Japanese Restaurant",
    "categories": [
      "Japanese Restaurants",
      "Asian Restaurants",
      "Caterers",
      "Japanese Restaurants",
      "Asian Restaurants",
      "Caterers",
      "Family Style Restaurants",
      "Restaurants",
      "Sushi Bars"
    ],
    "rating": "three half",
    "rating_count": "72",
    "phone": "(415) 882-1333",
    "website": "http://www.ozumo.com",
    "address": "161 Steuart St San Francisco, CA 94105",
    "work_hours": {
      "Mo": "16:00-22:00",
      "Tu": "16:00-22:00",
      "We": "16:00-22:00",
      "Th": "16:00-22:00",
      "Fr": "12:00-22:00",
      "Sa": "12:00-22:00",
      "Su": "12:00-21:00"
    }
  },
  "reviews": [
    {
      "id": "<redacted for blog use>",
      "author": "<redacted for blog use>",
      "source": "Citysearch",
      "date": "03/18/2010",
      "stars": 5,
      "title": "Mindblowing Japanese!",
      "text": "Wow what a dinner!  I went to Ozumo last night with a friend for a complimentary meal I had won by being a Citysearch Dictator.  It was AMAZING!  We ordered the Hanabi (halibut) and Dohyo (ahi tuna) small plates as well as the Gindara (black cod) and Burikama (roasted yellowtail).  Everything was absolutely delicious.  They paired our meal with a variety of unique wines and sakes.  The manager, Hiro, and our waitress were extremely knowledgeable about the food and how it was prepared.  We started to tease the manager that he had a story for everything.  His most boring story, he said, was about edamame.  It was a great experience!"
    },
  ...
  ]
}

We've learned how to scrape YellowPages' search, company data and reviews. However, YellowPages often blocks web scrapers from accessing its public data, so to scrape this target at scale, let's take a look at how we can avoid these hurdles using ScrapFly's web scraping API.

Avoiding Blocking with ScrapFly

We looked at how to scrape YellowPages.com, though when scraping at scale we are likely to either get blocked or start being served captchas, which will hinder or completely disable our web scraper.

To get around this, let's take advantage of ScrapFly API which can avoid all of these blocks for us just with a few extra lines of Python code!

illustration of scrapfly's middleware

ScrapFly offers several powerful features that'll help us get around YellowPages.com's web scraper blocking.

For this, we'll be using the scrapfly-sdk python package. To start, let's install it using pip:

$ pip install scrapfly-sdk

To take advantage of ScrapFly's API in our YellowPages web scraper, all we need to do is replace our httpx session code with scrapfly-sdk client requests.
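
For example, a minimal fetch helper based on the same SDK calls we use in the full scraper below could look like this (the helper function itself is just for illustration):

from scrapfly import ScrapflyClient, ScrapeConfig

scrapfly = ScrapflyClient(key="YOUR_SCRAPFLY_KEY")

async def fetch(url: str):
    # country="US" routes the request through US proxies,
    # which also covers yellowpages.com's US-only access requirement
    result = await scrapfly.async_scrape(ScrapeConfig(url, country="US"))
    # result.selector is a ready-made parsel Selector, so our parse functions stay unchanged
    return result.selector

With this in place, our parse_search, parse_company and parse_reviews functions need no changes - only the request calls do.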

Full Scraper Code

Finally, let's put everything together: finding companies using search and scraping their info and review data with ScrapFly integration:

Full Scraper Code with ScrapFly integration
import asyncio
import math
from typing import Dict, List, Optional
from urllib.parse import urlencode, urljoin

from loguru import logger as log
from typing_extensions import TypedDict
from scrapfly import ScrapflyClient, ScrapeApiResponse, ScrapeConfig


class Preview(TypedDict):
    """Type hint container for business preview data. This object just helps us to keep track what results we'll be getting"""

    name: str
    url: str
    links: List[str]
    phone: str
    categories: List[str]
    address: str
    location: str
    rating: str
    rating_count: str


def parse_search(result: ScrapeApiResponse) -> List[Preview]:
    """parse yellowpages.com search page for business preview data"""
    parsed = []
    for result in result.selector.css(".organic div.result"):
        links = {}
        for link in result.css("div.links>a"):
            name = link.xpath("text()").get()
            url = link.xpath("@href").get()
            links[name] = url
        first = lambda css: result.css(css).get("").strip()
        many = lambda css: [value.strip() for value in result.css(css).getall()]
        parsed.append(
            {
                "name": first("a.business-name ::text"),
                "url": urljoin("https://www.yellowpages.com/", first("a.business-name::attr(href)")),
                "links": links,
                "phone": first("div.phone::text"),
                "categories": many(".categories>a::text"),
                "address": first(".adr .street-address::text"),
                "location": first(".adr .locality::text"),
                "rating": first(".ratings .result-rating::attr(class) ").split(" ", 1)[-1],
                "rating_count": first(".ratings span.count::text").strip("()"),
            }
        )
    return parsed


async def search(query: str, session: ScrapflyClient, location: Optional[str] = None) -> List[Preview]:
    """search yellowpages.com for business preview information scraping all of the pages"""

    def make_search_url(page):
        base_url = "https://www.yellowpages.com/search?"
        parameters = {"search_terms": query, "geo_location_terms": location, "page": page}
        return base_url + urlencode(parameters)

    log.info(f'scraping "{query}" in "{location}"')
    first_page = await session.async_scrape(ScrapeConfig(make_search_url(1), country="US"))
    total_results = int(first_page.selector.css(".pagination>span::text").re(r"of (\d+)")[0])
    total_pages = int(math.ceil(total_results / 30))
    log.info(f"{query} in {location}: scraping {total_pages} of business preview pages")
    previews = parse_search(first_page)
    async for result in session.concurrent_scrape(
        [ScrapeConfig(make_search_url(page), country="US") for page in range(2, total_pages + 1)]
    ):
        previews.extend(parse_search(result))
    log.info(f"{query} in {location}: scraped {len(previews)} total of business previews")
    return previews


class Review(TypedDict):
    id: str
    author: str
    source: str
    date: str
    stars: int
    title: str
    text: str


def parse_reviews(result: ScrapeApiResponse) -> List[Review]:
    reviews = []
    for box in result.selector.css("#reviews-container>article"):
        first = lambda css: box.css(css).get("").strip()
        many = lambda css: [value.strip() for value in box.css(css).getall()]
        reviews.append(
            {
                "id": box.attrib.get("id"),
                "author": first("div.author::text"),
                "source": first("span.attribution>a::text"),
                "date": first("p.date-posted>span::text"),
                "stars": len(many(".result-ratings ul>li.rating-star")),
                "title": first(".review-title::text"),
                "text": first(".review-response p::text"),
            }
        )
    return reviews


def _parse_datetime(values: List[str]):
    """
    parse datetime from yellow pages datetime strings

    >>> _parse_datetime(["Fr-Sa 12:00-22:00"])
    {'Fr': '12:00-22:00', 'Sa': '12:00-22:00'}
    >>> _parse_datetime(["Fr 12:00-22:00"])
    {'Fr': '12:00-22:00'}
    >>> _parse_datetime(["Fr-Sa 12:00-22:00", "We 10:00-18:00"])
    {'Fr': '12:00-22:00', 'Sa': '12:00-22:00', 'We': '10:00-18:00'}
    """

    WEEKDAYS = ["Mo", "Tu", "We", "Th", "Fr", "Sa", "Su"]
    results = {}
    for text in values:
        days, hours = text.split(" ")
        if "-" in days:
            day_start, day_end = days.split("-")
            for day in WEEKDAYS[WEEKDAYS.index(day_start) : WEEKDAYS.index(day_end) + 1]:
                results[day] = hours
        else:
            results[days] = hours
    return results


class Company(TypedDict):
    """type hint container for company data found on yellowpages.com"""

    name: str
    categories: List[str]
    rating: str
    rating_count: str
    phone: str
    website: str
    address: str
    work_hours: Dict[str, str]


def parse_company(result: ScrapeApiResponse) -> Company:
    """extract company details from yellowpages.com company's page"""
    # here we define some lambda shortcuts for parsing common data:
    # selecting the first element, selecting many elements and joining all elements together
    first = lambda css: result.selector.css(css).get("").strip()
    together = lambda css, sep=" ": sep.join(result.selector.css(css).getall())
    many = lambda css: [value.strip() for value in result.selector.css(css).getall()]
    return {
        "name": first("h1.business-name::text"),
        "categories": many(".categories>a::text"),
        "rating": first(".rating .result-rating::attr(class)").split(" ", 1)[-1],
        "rating_count": first(".rating .count::text").strip("()"),
        "phone": first("#main-aside .phone>strong::text"),
        "website": first("#main-aside .website-link::attr(href)"),
        "address": together("#main-aside .address ::text"),
        "work_hours": _parse_datetime(many(".open-details tr time::attr(datetime)")),
    }


class CompanyData(TypedDict):
    info: Company
    reviews: List[Review]


async def scrape_company(url: str, session: ScrapflyClient, get_reviews=True) -> CompanyData:
    """scrape yellowpage.com company page details"""
    first_page = await session.async_scrape(ScrapeConfig(url, country="US"))
    if not get_reviews:
        return parse_company(first_page)
    reviews = parse_reviews(first_page)
    if reviews:
        total_reviews = int(first_page.selector.css(".pagination-stats::text").re(r"of (\d+)")[0])
        total_pages = int(math.ceil(total_reviews / 20))
        other_page_urls = [url + "?" + urlencode({"page": page}) for page in range(2, total_pages + 1)]
        async for result in session.concurrent_scrape([ScrapeConfig(url=url) for url in other_page_urls]):
            reviews.extend(parse_reviews(result))
    return {
        "info": parse_company(first_page),
        "reviews": reviews,
    }


BASE_HEADERS = {
    "accept-language": "en-US,en;q=0.9",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-encoding": "gzip, deflate, br",
}


async def run():
    with ScrapflyClient(key="YOUR_SCRAPFLY_KEY", max_concurrency=10) as session:
        result_search = await search("japanese restaurants", location="San Francisco, CA", session=session)
        result_company = await scrape_company(
            "https://www.yellowpages.com/san-francisco-ca/mip/ozumo-japanese-restaurant-8083027",
            session=session,
        )
        log.info(f"scraped {len(result_search)} company previews and {len(result_company['reviews'])} reviews")


if __name__ == "__main__":
    asyncio.run(run())


Summary

In this tutorial, we built a YellowPages.com scraper by understanding how the website functions, so we could replicate its functionality in our web scraper. First, we replicated the search function to find companies in a specific area, then we scraped each company's detailed information and review data.

For this, we used Python with the httpx and parsel packages, and to prevent blocking we used ScrapFly's API, which smartly configures every web scraper connection to avoid being blocked. For more on ScrapFly, see our documentation and try it out for free!

Related Posts

How to Scrape Zoominfo

Practical tutorial on how to web scrape public company and people data from Zoominfo.com using Python and how to avoid being blocked using ScrapFly API.

How to Scrape Google Maps

We'll take a look at how to find businesses through the Google Maps search system and how to scrape their details using either Selenium, Playwright or ScrapFly's javascript rendering feature - all of that in Python.

How to Scrape Angel.co

Tutorial for web scraping AngelList - angel.co - tech startup company and job directory using Python. Practical code and best practices.

How to Scrape Crunchbase.com

Tutorial on how to scrape crunchbase.com business and related data using Python. How to avoid blocking to scrape data at scale and other tips.