How to Scrape YellowPages.com Business Data and Reviews (2024 Update)

In this tutorial, we'll explain how to scrape Yellowpages - an online directory of various US-based businesses.

YellowPages.com is the digital version of telephone directories called yellow pages. It contains business information such as phone numbers, websites, and addresses as well as business reviews.

In this tutorial, we'll be using Python to scrape all of that business and review information. We'll also apply a few HTML parsing tricks to extract the data from its pages effectively. Let's dive in!

If you're new to web scraping with Python, we recommend checking out our full introduction tutorial covering the basics and common best practices.

Hands on Python Web Scraping Tutorial and Example Project

Latest YellowPages Scraper Code

https://github.com/scrapfly/scrapfly-scrapers/

Why Scrape YellowPages.com?

YellowPages contains thousands of businesses and their details, such as phone numbers, websites and locations. Therefore, we can utilize YellowPages web scraping for various use cases, from market and business analytics to acquiring a competitive advantage or generating leads.

Furthermore, YellowPages features user reviews for each business. Scraping YellowPages allows for retrieving this data quickly, which can then be fed into machine learning techniques to analyze and gain insights into users' experiences and opinions.

For more details, refer to our previous guide on web scraping use cases.

Project Setup

Before we start, keep in mind that YellowPages is only accessible to US-based IP addresses. So, if you are located outside the US, you will need to use a US-based proxy or VPN to access the website. Alternatively, run the ScrapFly version of the full YellowPages scraper code on GitHub.
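For illustration, here is a minimal sketch of routing httpx requests through a US proxy. The proxy URL below is a hypothetical placeholder, and note that newer httpx versions (0.26+) use the proxy argument instead of proxies:

import httpx

# hypothetical placeholder - replace with your own US-based proxy credentials
US_PROXY = "http://username:password@us.proxy.example.com:8000"

# on httpx>=0.26 pass proxy=US_PROXY instead of proxies=US_PROXY
client = httpx.Client(proxies=US_PROXY)
response = client.get("https://www.yellowpages.com/")
print(response.status_code)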

To scrape YellowPages, we'll use Python with a few community packages:

  • httpx - An HTTP client library we'll use to send requests to the YellowPages server.
  • parsel - An HTML parsing library we'll use to parse the scraped HTML using CSS and XPath selectors.
  • loguru - A logging library we'll use to monitor our YellowPages scraper.
  • asyncio - A library we'll use to run our code asynchronously, increasing our web scraping speed.

Note that asyncio comes pre-installed in Python, so you will only have to install the other packages using the following pip command:

$ pip install httpx parsel loguru

Alternatively, feel free to swap httpx out with any other HTTP client package, such as requests. As for parsel, another great alternative is BeautifulSoup.
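For example, here is a rough sketch of the same fetch-and-parse flow using requests and BeautifulSoup instead. We won't use these packages in the rest of the tutorial; the URL and CSS selectors are the same ones we'll use later with httpx and parsel:

import requests
from bs4 import BeautifulSoup

# fetch a YellowPages search page (the same kind of page we'll scrape below)
url = "https://www.yellowpages.com/search?search_terms=Japanese+Restaurants&geo_location_terms=San+Francisco%2C+CA"
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(response.text, "html.parser")
# the same CSS selectors work with BeautifulSoup's .select()
for result in soup.select(".organic div.result"):
    name = result.select_one("a.business-name")
    print(name.get_text(strip=True) if name else None)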

How to Find Companies on YellowPages

Before we scrape YellowPages for company data, we need to find the companies first. For that, we can use one of two approaches. The first is the YellowPages sitemap, which contains links to all categories and pages on the website.

However, we'll use a more flexible approach, the search pages.

Once we click "Find", we are redirected to a results page with 427 results

We can see that upon submitting a search request, YellowPages redirects us to a new URL containing pages of results. Let's scrape these results in the following section.

To scrape YellowPages, we need to form a search URL using a search query and a few parameters. Below is an example of using the base search URL with the minimum parameters:

search page url structure
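In plain text, the search URL follows this pattern (the same parameters our scraper code will build with urlencode later):

https://www.yellowpages.com/search?search_terms=<query>&geo_location_terms=<location>&page=<page>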

The above URL includes the search query, location and search page number. Let's apply this URL structure with an example search query: we'll search for japanese restaurants in San Francisco, California. Here is the page we get by requesting this URL:

search page parsing markup illustration

We'll scrape YellowPages search page data from the marked fields above. Let's start by defining our parsing logic:

def parse_search(response) -> List[Preview]:
    """parse yellowpages.com search page for business preview data"""
    sel = Selector(text=response.text)
    parsed = []
    for result in sel.css(".organic div.result"):
        links = {}
        for link in result.css("div.links>a"):
            name = link.xpath("text()").get()
            url = link.xpath("@href").get()
            links[name] = url
        first = lambda css: result.css(css).get("").strip()
        many = lambda css: [value.strip() for value in result.css(css).getall()]
        parsed.append(
            {
                "name": first("a.business-name ::text"),
                "url": urljoin("https://www.yellowpages.com/", first("a.business-name::attr(href)")),
                "links": links,
                "phone": first("div.phone::text"),
                "categories": many(".categories>a::text"),
                "address": first(".adr .street-address::text"),
                "location": first(".adr .locality::text"),
                "rating": first(".ratings .rating div::attr(class)").split(" ", 1)[-1],
                "rating_count": first(".ratings .rating span::text").strip("()"),
            }
        )
    return parsed

Here, we define a parse_search function. It iterates over the result boxes and uses CSS selectors to extract business preview information, such as the phone number, rating, name and, most importantly, the link to the full business information page. Note that the star rating is encoded in the element's CSS class, which is why we split the class attribute to get values like "four half".

Next, we'll utilize the parsing logic while requesting the search pages to scrape the data:

import asyncio
import json
from urllib.parse import urljoin

import httpx
from parsel import Selector
from loguru import logger as log
from typing_extensions import TypedDict
from typing import Dict, List


class Preview(TypedDict):
    """Type hint container for business preview data. This object just helps us to keep track of what results we'll be getting"""
    name: str
    url: str
    links: Dict[str, str]
    phone: str
    categories: List[str]
    address: str
    location: str
    rating: str
    rating_count: str


def parse_search(response) -> List[Preview]:
    """parse yellowpages.com search page for business preview data"""
    sel = Selector(text=response.text)
    parsed = []
    for result in sel.css(".organic div.result"):
        links = {}
        for link in result.css("div.links>a"):
            name = link.xpath("text()").get()
            url = link.xpath("@href").get()
            links[name] = url
        first = lambda css: result.css(css).get("").strip()
        many = lambda css: [value.strip() for value in result.css(css).getall()]
        parsed.append(
            {
                "name": first("a.business-name ::text"),
                "url": urljoin("https://www.yellowpages.com/", first("a.business-name::attr(href)")),
                "links": links,
                "phone": first("div.phone::text"),
                "categories": many(".categories>a::text"),
                "address": first(".adr .street-address::text"),
                "location": first(".adr .locality::text"),
                "rating": first(".ratings .rating div::attr(class)").split(" ", 1)[-1],
                "rating_count": first(".ratings .rating span::text").strip("()"),
            }
        )
    return parsed


# to avoid being instantly blocked we should use request headers of a common web browser:
BASE_HEADERS = {
    "accept-language": "en-US,en;q=0.9",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-encoding": "gzip, deflate, br",
}


# to run our scraper we need to start httpx session:
async def run():
    limits = httpx.Limits(max_connections=5)
    async with httpx.AsyncClient(limits=limits, timeout=httpx.Timeout(15.0), headers=BASE_HEADERS) as session:
        response = await session.get("https://www.yellowpages.com/search?search_terms=Japanese+Restaurants&geo_location_terms=San+Francisco%2C+CA")
        result_search = parse_search(response)
        # print the results in JSON format
        print(json.dumps(result_search, indent=2))

if __name__ == "__main__":
    asyncio.run(run())

Here is a sample of the results we got:

Sample output
[
  {
    "name": "Ichiraku",
    "url": "https://www.yellowpages.com/san-francisco-ca/mip/ichiraku-6317061",
    "links": {
      "View Menu": "/san-francisco-ca/mip/ichiraku-6317061#open-menu"
    },
    "phone": "(415) 668-9918",
    "categories": [
      "Japanese Restaurants",
      "Asian Restaurants",
      "Take Out Restaurants"
    ],
    "address": "3750 Geary Blvd",
    "location": "San Francisco, CA 94118",
    "rating": "four half",
    "rating_count": "13"
  },
  {
    "name": "Benihana",
    "url": "https://www.yellowpages.com/san-francisco-ca/mip/benihana-458857411",
    "links": {
      "Website": "http://www.benihana.com/locations/sanfrancisco-ca-sf"
    },
    "phone": "(415) 563-4844",
    "categories": [
      "Japanese Restaurants",
      "Bar & Grills",
      "Restaurants"
    ],
    "address": "1737 Post St",
    "location": "San Francisco, CA 94115",
    "rating": "three half",
    "rating_count": "10"
  },
  ...
]

The above code can scrape a single search page. Let's modify it to crawl over other search pages:

import asyncio
import json
import math
from urllib.parse import urlencode, urljoin

import httpx
from parsel import Selector
from loguru import logger as log
from typing_extensions import TypedDict
from typing import Dict, List, Optional


class Preview(TypedDict):
    """Type hint container for business preview data. This object just helps us to keep track of what results we'll be getting"""
    name: str
    url: str
    links: Dict[str, str]
    phone: str
    categories: List[str]
    address: str
    location: str
    rating: str
    rating_count: str


def parse_search(response) -> List[Preview]:
    """parse yellowpages.com search page for business preview data"""
    sel = Selector(text=response.text)
    parsed = []
    for result in sel.css(".organic div.result"):
        links = {}
        for link in result.css("div.links>a"):
            name = link.xpath("text()").get()
            url = link.xpath("@href").get()
            links[name] = url
        first = lambda css: result.css(css).get("").strip()
        many = lambda css: [value.strip() for value in result.css(css).getall()]
        parsed.append(
            {
                "name": first("a.business-name ::text"),
                "url": urljoin("https://www.yellowpages.com/", first("a.business-name::attr(href)")),
                "links": links,
                "phone": first("div.phone::text"),
                "categories": many(".categories>a::text"),
                "address": first(".adr .street-address::text"),
                "location": first(".adr .locality::text"),
                "rating": first(".ratings .rating div::attr(class)").split(" ", 1)[-1],
                "rating_count": first(".ratings .rating span::text").strip("()"),
            }
        )
    return parsed


async def search(query: str, session: httpx.AsyncClient, location: Optional[str] = None) -> List[Preview]:
    """search yellowpages.com for business preview information scraping all of the pages"""
    def make_search_url(page):
        base_url = "https://www.yellowpages.com/search?"
        parameters = {"search_terms": query, "geo_location_terms": location, "page": page}
        return base_url + urlencode(parameters)

    log.info(f'scraping "{query}" in "{location}"')
    first_page = await session.get(make_search_url(1))
    sel = Selector(text=first_page.text)
    total_results = int(sel.css(".pagination>span::text").re(r"of (\d+)")[0])
    total_pages = int(math.ceil(total_results / 30))  # YellowPages shows 30 results per search page
    log.info(f'{query} in {location}: scraping {total_pages} pages of business previews')
    previews = parse_search(first_page)
    for result in await asyncio.gather(*[session.get(make_search_url(page)) for page in range(2, total_pages + 1)]):
        previews.extend(parse_search(result))
    log.success(f'{query} in {location}: scraped {len(previews)} business previews in total')
    return previews
Run the code
BASE_HEADERS = {
    "accept-language": "en-US,en;q=0.9",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-encoding": "gzip, deflate, br",
}

async def run():
    limits = httpx.Limits(max_connections=5)
    async with httpx.AsyncClient(limits=limits, timeout=httpx.Timeout(15.0), headers=BASE_HEADERS) as session:
        result_search = await search("japanese restaurants", location="San Francisco, CA", session=session)
        print(json.dumps(result_search, indent=2))

if __name__ == "__main__":
    asyncio.run(run())

The above function implements a complete scraping loop. We generate search URLs from the given query and location parameters. Then, we scrape the first results page to extract the total number of results and scrape the remaining pages concurrently. This is a common pagination web scraping idiom:

illustration of pagination scraping idiom
efficient pagination scraping: get total results from first page and then scrape the rest of the pages together!
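As a quick sanity check of that math: our example search reported 427 results, and the search() function above assumes 30 results per page, so:

import math

total_results = 427    # reported on the first results page of our example search
results_per_page = 30  # what the search() function above assumes
total_pages = math.ceil(total_results / results_per_page)
print(total_pages)  # 15 -> page 1 is already parsed, pages 2-15 are fetched concurrently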

Our YellowPages scraper can find and scrape business data from search pages. Next, we'll scrape the dedicated business pages.

How to Scrape Yellowpages Company Data

To scrape company data, we need to request each company URL that we found previously. Let's start with an example URL for a restaurant business, the Ozumo Japanese Restaurant:

company page field markup

We'll scrape the marked fields in the above image. First, let's start with the scraping logic:

class Company(TypedDict):
    """type hint container for company data found on yellowpages.com"""
    name: str
    categories: List[str]
    rating: str
    rating_count: str
    phone: str
    website: str
    address: str
    work_hours: Dict[str, str] 


def parse_company(response) -> Company:
    """extract company details from yellowpages.com company's page"""
    sel = Selector(text=response.text)
    # here we define some lambda shortcuts for common parsing operations:
    # selecting the first element, selecting many elements, and joining all matching elements together
    first = lambda css: sel.css(css).get("").strip()
    many = lambda css: [value.strip() for value in sel.css(css).getall()]
    together = lambda css, sep=" ": sep.join(sel.css(css).getall())

    # to parse working hours we need to do a bit of complex string parsing
    def _parse_datetime(values: List[str]):
        """
        parse datetime from yellow pages datetime strings

        >>> _parse_datetime(["Fr-Sa 12:00-22:00"])
        {'Fr': '12:00-22:00', 'Sa': '12:00-22:00'}
        >>> _parse_datetime(["Fr 12:00-22:00"])
        {'Fr': '12:00-22:00'}
        >>> _parse_datetime(["Fr-Sa 12:00-22:00", "We 10:00-18:00"])
        {'Fr': '12:00-22:00', 'Sa': '12:00-22:00', 'We': '10:00-18:00'}
        """

        WEEKDAYS = ["Mo", "Tu", "We", "Th", "Fr", "Sa", "Su"]
        results = {}
        for text in values:
            days, hours = text.split(" ")
            if "-" in days:
                day_start, day_end = days.split("-")
                for day in WEEKDAYS[
                    WEEKDAYS.index(day_start) : WEEKDAYS.index(day_end) + 1
                ]:
                    results[day] = hours
            else:
                results[days] = hours
        return results

    return {
        "name": first("h1.business-name::text"),
        "categories": many(".categories>a::text"),
        "rating": first(".ratings div::attr(class)").split(" ", 1)[-1],
        "ratingCount": first(".ratings .count::text").strip("()"),
        "phone": first(".phone::attr(href)").replace("(", "").replace(")", ""),
        "website": first(".website-link::attr(href)"),
        "address": together(".address::text"),
        "workingHours": _parse_datetime(many(".open-details tr time::attr(datetime)")),
    }

Here, we use CSS selectors to extract the specific company fields we marked earlier. We also process and clean a few fields, such as the phone number, and unpack the work days from a range like Mo-We into individual values like Mo, Tu, We.

Next, let's use the parse_company function we defined while requesting the company pages:

import httpx
import json
import asyncio
from parsel import Selector
from typing_extensions import TypedDict
from typing import Dict, List

class Company(TypedDict):
    """type hint container for company data found on yellowpages.com"""
    name: str
    categories: List[str]
    rating: str
    rating_count: str
    phone: str
    website: str
    address: str
    work_hours: Dict[str, str] 


def parse_company(response) -> Company:
    """extract company details from yellowpages.com company's page"""
    sel = Selector(text=response.text)
    # here we define some lambda shortcuts for common parsing operations:
    # selecting the first element, selecting many elements, and joining all matching elements together
    first = lambda css: sel.css(css).get("").strip()
    many = lambda css: [value.strip() for value in sel.css(css).getall()]
    together = lambda css, sep=" ": sep.join(sel.css(css).getall())

    # to parse working hours we need to do a bit of complex string parsing
    def _parse_datetime(values: List[str]):
        """
        parse datetime from yellow pages datetime strings

        >>> _parse_datetime(["Fr-Sa 12:00-22:00"])
        {'Fr': '12:00-22:00', 'Sa': '12:00-22:00'}
        >>> _parse_datetime(["Fr 12:00-22:00"])
        {'Fr': '12:00-22:00'}
        >>> _parse_datetime(["Fr-Sa 12:00-22:00", "We 10:00-18:00"])
        {'Fr': '12:00-22:00', 'Sa': '12:00-22:00', 'We': '10:00-18:00'}
        """

        WEEKDAYS = ["Mo", "Tu", "We", "Th", "Fr", "Sa", "Su"]
        results = {}
        for text in values:
            days, hours = text.split(" ")
            if "-" in days:
                day_start, day_end = days.split("-")
                for day in WEEKDAYS[
                    WEEKDAYS.index(day_start) : WEEKDAYS.index(day_end) + 1
                ]:
                    results[day] = hours
            else:
                results[days] = hours
        return results

    return {
        "name": first("h1.business-name::text"),
        "categories": many(".categories>a::text"),
        "rating": first(".ratings div::attr(class)").split(" ", 1)[-1],
        "ratingCount": first(".ratings .count::text").strip("()"),
        "phone": first(".phone::attr(href)").replace("(", "").replace(")", ""),
        "website": first(".website-link::attr(href)"),
        "address": together(".address::text"),
        "workingHours": _parse_datetime(many(".open-details tr time::attr(datetime)")),
    }


async def scrape_company(url: str, session: httpx.AsyncClient) -> Company:
    """scrape yellowpage.com company page details"""
    first_page = await session.get(url)
    return parse_company(first_page)


BASE_HEADERS = {
    "accept-language": "en-US,en;q=0.9",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-encoding": "gzip, deflate, br",
}


async def run():
    limits = httpx.Limits(max_connections=5)
    async with httpx.AsyncClient(limits=limits, timeout=httpx.Timeout(15.0), headers=BASE_HEADERS) as session:
        result_company = await scrape_company(
            "https://www.yellowpages.com/san-francisco-ca/mip/ozumo-japanese-restaurant-8083027", session=session,
        )
        print(json.dumps(result_company))

if __name__ == "__main__":
    asyncio.run(run())

Here is a sample output of the result we got:

Sample output
{
  "name": "Ozumo Japanese Restaurant",
  "categories": [
    "Japanese Restaurants",
    "Asian Restaurants",
    "Caterers",
    "Japanese Restaurants",
    "Asian Restaurants",
    "Caterers",
    "Family Style Restaurants",
    "Restaurants",
    "Sushi Bars"
  ],
  "rating": "three half",
  "rating_count": "72",
  "phone": "(415) 882-1333",
  "website": "http://www.ozumo.com",
  "address": "161 Steuart St San Francisco, CA 94105",
  "work_hours": {
    "Mo": "16:00-22:00",
    "Tu": "16:00-22:00",
    "We": "16:00-22:00",
    "Th": "16:00-22:00",
    "Fr": "12:00-22:00",
    "Sa": "12:00-22:00",
    "Su": "12:00-21:00"
  }
}

Cool! With just a few lines of code, our YellowPages scraper was able to get all the essential business details. Next, we'll scrape the business reviews!

How to Scrape Yellowpages Reviews

To scrape business reviews, we'll have to send additional requests to the review pages. For example, if we go back to our Japanese restaurant listing and scroll to the bottom, we can find the review paging URL format:

review page url structure
using the inspect function of browser developer tools (right-click -> inspect) we can see the next page link structure
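In practice, a review page URL is just the company URL with a page query parameter appended, for example (using the Ozumo listing from earlier):

https://www.yellowpages.com/san-francisco-ca/mip/ozumo-japanese-restaurant-8083027?page=2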

From the above image, we can see that reviews can be paginated using the page parameter. And since we know the total number of reviews, we can crawl over the review pages to extract them all:

import asyncio
import httpx
import math
import json
from typing import List
from typing_extensions import TypedDict
from parsel import Selector
from urllib.parse import urlencode


class Company(TypedDict):
    ...  # rest of the Company class we defined earlier

def parse_company(response) -> Company:
    ...  # the parse_company logic we defined earlier

class Review(TypedDict):
    """type hint for yellowpages.com scraped review"""
    id: str
    author: str
    source: str
    date: str
    stars: int
    title: str
    text: str


def parse_reviews(response) -> List[Review]:
    """parse company page for visible reviews"""
    sel = Selector(text=response.text)
    reviews = []
    for box in sel.css("#reviews-container>article"):
        first = lambda css: box.css(css).get("").strip()
        many = lambda css: [value.strip() for value in box.css(css).getall()]
        reviews.append(
            {
                "id": box.attrib.get("id"),
                "author": first("div.author::text"),
                "source": first("span.attribution>a::text"),
                "date": first("p.date-posted>span::text"),
                "stars": len(many(".result-ratings ul>li.rating-star")),
                "title": first(".review-title::text"),
                "text": first(".review-response p::text"),
            }
        )
    return reviews


class CompanyData(TypedDict):
    info: Company
    reviews: List[Review]


# Now we can extend our company scraper to pick up reviews as well!
async def scrape_company(url, session: httpx.AsyncClient, get_reviews=True) -> CompanyData:
    """scrape yellowpage.com company page details"""
    first_page = await session.get(url)
    sel = Selector(text=first_page.text)
    if not get_reviews:
        return parse_company(first_page)
    reviews = parse_reviews(first_page)
    if reviews:
        total_reviews = int(sel.css(".pagination-stats::text").re(r"of (\d+)")[0])
        total_pages = int(math.ceil(total_reviews / 20))
        for response in await asyncio.gather(
            *[session.get(url + "?" + urlencode({"page": page})) for page in range(2, total_pages + 1)]
        ):
            reviews.extend(parse_reviews(response))
    return {
        "info": parse_company(first_page),
        "reviews": reviews,
    }

In the above code, we apply the pagination approach we used in the search scraping logic. We also utilize the company parsing logic to extract the company information alongside the reviews. Let's run our YellowPages scraping code and have a look at the results:

Run code & example output
BASE_HEADERS = {
    "accept-language": "en-US,en;q=0.9",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-encoding": "gzip, deflate, br",
}


async def run():
    limits = httpx.Limits(max_connections=5)
    async with httpx.AsyncClient(limits=limits, timeout=httpx.Timeout(15.0), headers=BASE_HEADERS) as session:
        result_company = await scrape_company(
            "https://www.yellowpages.com/san-francisco-ca/mip/ozumo-japanese-restaurant-8083027", session=session,
        )
        print(json.dumps(result_company, indent=2, ensure_ascii=False))


if __name__ == "__main__":
    asyncio.run(run())
{
  "info": {
    "name": "Ozumo Japanese Restaurant",
    "categories": [
      "Japanese Restaurants",
      "Asian Restaurants",
      "Caterers",
      "Japanese Restaurants",
      "Asian Restaurants",
      "Caterers",
      "Family Style Restaurants",
      "Restaurants",
      "Sushi Bars"
    ],
    "rating": "three half",
    "rating_count": "72",
    "phone": "(415) 882-1333",
    "website": "http://www.ozumo.com",
    "address": "161 Steuart St San Francisco, CA 94105",
    "work_hours": {
      "Mo": "16:00-22:00",
      "Tu": "16:00-22:00",
      "We": "16:00-22:00",
      "Th": "16:00-22:00",
      "Fr": "12:00-22:00",
      "Sa": "12:00-22:00",
      "Su": "12:00-21:00"
    }
  },
  "reviews": [
    {
      "id": "<redacted for blog use>",
      "author": "<redacted for blog use>",
      "source": "Citysearch",
      "date": "03/18/2010",
      "stars": 5,
      "title": "Mindblowing Japanese!",
      "text": "Wow what a dinner!  I went to Ozumo last night with a friend for a complimentary meal I had won by being a Citysearch Dictator.  It was AMAZING!  We ordered the Hanabi (halibut) and Dohyo (ahi tuna) small plates as well as the Gindara (black cod) and Burikama (roasted yellowtail).  Everything was absolutely delicious.  They paired our meal with a variety of unique wines and sakes.  The manager, Hiro, and our waitress were extremely knowledgeable about the food and how it was prepared.  We started to tease the manager that he had a story for everything.  His most boring story, he said, was about edamame.  It was a great experience!"
    },
  ...
  ]
}

With this last feature, we can scrape YellowPages business data from company, search and review pages. However, our YellowPages scraper is very likely to get blocked after sending a few additional requests. Let's explore how we can scale it!

Bypass Yellowpages Scraping Blocking

Scraping YellowPages isn't very complicated, but scaling up such scraping operations can be difficult, and this is where ScrapFly can lend a hand!

scrapfly middleware

ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale.

To take advantage of ScrapFly's API in our YellowPages web scraper, all we need to do is replace our httpx session code with scrapfly-sdk client requests:

# standard web scraping code
import httpx
from parsel import Selector

response = httpx.get("some yellowpages.com URL")
selector = Selector(response.text)

# in ScrapFly becomes this 👇
from scrapfly import ScrapeConfig, ScrapflyClient

# replaces your HTTP client (httpx in this case)
scrapfly = ScrapflyClient(key="Your ScrapFly API key")

response = scrapfly.scrape(ScrapeConfig(
    url="some yellowpages.com URL",
    asp=True, # enable the anti scraping protection to bypass blocking
    country="US", # set the proxy location to a specfic country
    render_js=True # enable rendering JavaScript (like headless browsers) to scrape dynamic content if needed
))

# use the built-in Parsel selector
selector = response.selector
# and query the parsed HTML content with it, e.g.:
print(selector.css("h1.business-name::text").get())

FAQ

To wrap this guide up let's take a look at some frequently asked questions about web scraping YellowPages.

Is it legal to scrape YellowPages.com?

Yes, YellowPages's data is publicly available, and it's legal to scrape it. Scraping YellowPages.com at slow, respectful rates falls under the ethical scraping definition. For more details, refer to our Is Web Scraping Legal? article.

Is there an API for YellowPages?

No, unfortunately, YellowPages.com doesn't offer APIs for public use. However, as we've covered in this tutorial - scraping YellowPages using Python is straightforward.

Are there alternatives for scraping YellowPages?

Yes, Yelp.com is another public website for business directories. We have covered how to scrape Yelp in a previous guide.

Latest YellowPages Scraper Code
https://github.com/scrapfly/scrapfly-scrapers/

YellowPages Scraping Summary

In this article, we explained how to scrape YellowPages in Python. We started by reverse engineering the website behavior to understand its search system and find company pages on the website. Then, we used CSS selectors to parse the HTML pages and extract business details.

Finally, we explained how to bypass YellowPages scraping blocking using ScrapFly's web scraping API.

Related Posts

How to Scrape Reddit Posts, Subreddits and Profiles

In this article, we'll explore how to scrape Reddit. We'll extract various social data types from subreddits, posts, and user pages. All of which through plain HTTP requests without headless browser usage.

How to Scrape LinkedIn in 2024

In this scrape guide we'll be taking a look at one of the most popular web scraping targets - LinkedIn.com. We'll be scraping people profiles, company profiles as well as job listings and search.

How to Scrape SimilarWeb Website Traffic Analytics

In this guide, we'll explain how to scrape SimilarWeb through a step-by-step guide. We'll scrape comprehensive website traffic insights, websites comparing data, sitemaps, and trending industry domains.