How to Web Scrape Yelp.com (2023 update)


Yelp.com is one of the oldest and best-known yellow pages websites. It contains company information like address, website, and location, as well as user reviews of these companies.

In this web scraping tutorial, we'll take a look at how to scrape yelp.com in Python. We'll start with a bit of reverse engineering of the search functionality, so we can find businesses, and then we'll scrape and parse the business data itself. Finally, we'll take a look at how to avoid our scraper getting blocked when scraping at scale since Yelp is notorious for blocking web scraping.

Project Setup

We'll be using Python in this tutorial as well as a few popular community packages:

  • httpx - HTTP client library which will let us communicate with Yelp.com's servers.
  • parsel - HTML parsing library which will help us to parse our web scraped HTML files for yelp data.

We can easily install them using the pip command:

$ pip install httpx parsel

Alternatively, feel free to swap httpx out with any other HTTP client package, such as requests, as we'll only need basic HTTP functions that are almost interchangeable in every library. As for parsel, another great alternative is the beautifulsoup package, or anything else that supports CSS and XPath selectors, which is what we'll be using in this tutorial.
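For instance, a basic GET request looks nearly identical in both libraries:

import httpx
import requests

# the same basic GET request in either library - the APIs are nearly interchangeable
response_httpx = httpx.get("https://www.yelp.com")
response_requests = requests.get("https://www.yelp.com")
print(response_httpx.status_code, response_requests.status_code)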

Discovering Yelp Company Pages

To start scraping, we need to find a way to discover businesses on yelp.

Unfortunately, if we take a look at yelp.com/robots.txt we can see that yelp.com doesn't provide a sitemap or any directory pages which might contain all the businesses. This means we have to reverse-engineer their search functionality and replicate that in our yelp scraper.
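We can quickly verify this ourselves by fetching the robots.txt file:

import httpx

# fetch yelp's robots.txt and check for sitemap or directory entries ourselves
robots = httpx.get("https://www.yelp.com/robots.txt")
print(robots.text)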

Let's start by taking a look at yelp's front page and what happens when we submit our search:

yelp.com search functionality

We can see that upon entering search details we are redirected to a URL containing our search keywords:

https://www.yelp.com/search?find_desc=plumbers&find_loc=Toronto%2C+Ontario%2C+Canada&ns=1&start=220
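The URL is just a handful of query parameters; for illustration, we could assemble the same URL from a plain dictionary with urllib:

from urllib.parse import urlencode

# rebuild the search URL from its query parameters
params = {
    "find_desc": "plumbers",  # search keyword
    "find_loc": "Toronto, Ontario, Canada",  # search location
    "ns": 1,
    "start": 220,  # result offset used for pagination
}
print("https://www.yelp.com/search?" + urlencode(params))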

This is our search seed request, but we can go even further and look for data requests by examining the pagination. Let's click on the next page link and see what is happening in our browser's web inspector XHR tab:

yelp page 2 network traffic inspector
https://www.yelp.com/search/snippet?find_desc=plumbers&find_loc=Toronto%2C+Ontario%2C+Canada&ns=1&start=210&parent_request_id=54233ce74d09d270&request_origin=user

We found the data endpoint of yelp's backend API: the /search/snippet endpoint takes our search parameters and returns search results containing business IDs and preview details like:

{
    // Business ID which we'll need later
    "bizId": "oIff0iLkEiPsWcDATe6mfA",
    // Business preview data
    "searchResultBusiness": {
        "ranking": null, "isAd": true,
        "renderAdInfo": true,
        "name": "Smooth Air",
        "alternateNames": [],
        "businessUrl": "/adredir?ad_business_id=oIff0iLkEiPsWcDATe6mfA&campaign_id=VcMvmxKjXiH2peL8g1c_jw&click_origin=search_results&placement=carousel_0&placement_slot=0&redirect_url=https%3A%2F%2Fwww.yelp.com%2Fbiz%2Fsmooth-air-brampton&request_id=daed206f44c35b85&signature=e537121fa6eb5d95fe240274d63ae189267de71994e5908c824eab5cea323c55&slot=1",
        "categories": [{
            "title": "Plumbing",
            "url": "/search?cflt=plumbing&find_loc=Toronto%2C+Ontario%2C+Canada"
        }, {
            "title": "Heating & Air Conditioning/HVAC",
            "url": "/search?cflt=hvac&find_loc=Toronto%2C+Ontario%2C+Canada"
        }, {
            "title": "Water Heater Installation/Repair",
            "url": "/search?cflt=waterheaterinstallrepair&find_loc=Toronto%2C+Ontario%2C+Canada"
        }],
        "priceRange": "",
        "rating": 0.0,
        "reviewCount": 0,
        "formattedAddress": "",
        "neighborhoods": [],
        "phone": "",
        "serviceArea": null,
        "parentBusiness": null,
        "servicePricing": null,
        "bizSiteUrl": "https://biz.yelp.com"
    }
}

So, we can use this API endpoint to find all business IDs for a given location and search term. With this information, we can start working on our web scraper.

We can start on our web scraper by replicating the search request we saw earlier:

import asyncio
import httpx


async def _search_yelp_page(keyword: str, location: str, session: httpx.AsyncClient, offset=0):
    """scrape single page of yelp search"""
    # final url example:
    # https://www.yelp.com/search/snippet?find_desc=plumbers&find_loc=Toronto%2C+Ontario%2C+Canada&ns=1&start=210&parent_request_id=54233ce74d09d270&request_origin=user
    resp = await session.get(
        "https://www.yelp.com/search/snippet",
        params={
            "find_desc": keyword,
            "find_loc": location,
            "start": offset,
            "parent_request": "",
            "ns": 1,
            "request_origin": "user"
        }
    )
    assert resp.status_code == 200
    return resp.json()

Note: We're using asynchronous Python, so later we can schedule multiple requests concurrently, which will give us a huge speed boost.

In the script above we are replicating the /search/snippet endpoint request which returns search result data for a single search page. Let's see the results this scraper generates:

Run code & example output
BASE_HEADERS = {
    "authority": "www.yelp.com",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US;en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
}
async def run():
    async with httpx.AsyncClient(headers=BASE_HEADERS) as session:
        results = await _search_yelp_page('plumbers', 'Toronto, Ontario, Canada', session=session)
        print(results)

if __name__ == "__main__":
    asyncio.run(run())
{
  "pageTitle": [
    "Yelp"
  ],
  "loggingConfig": {
    "sitRepConfig": {
      "isSitRepEnabled": true,
      "enabledSitRepChannels": {
        "vertical_search_reservation": true,
        "vertical_search_platform": true,
        "frontend_performance": true,
        "search_suggest_events": true,
        "vertical_search_waitlist": true,
        "ad_syndication_cookie_sync_errors": true,
        "traffic_quality": true,
        "search_ux": true,
        "message_the_business": true,
        "ytp_session_events": true,
        "ad_syndication": true
      }
    }
  },
  "searchPageProps": {
    ...
  },
...
}

Next, we need to parse this search data and implement the ability to scrape all the pages.
Let's start with parsing:

from typing import Dict, List, Tuple

def parse_search(search_results: Dict) -> Tuple[List[Dict], Dict]:
    """
    Parses yelp search results for business results
    Returns list of businesses and search metadata
    """
    results = search_results['searchPageProps']['mainContentComponentsListProps']
    businesses = [r for r in results if r.get('searchResultBusiness') and not r.get('adLoggingInfo')]
    search_meta = next(r for r in results if r.get('type') == 'pagination')['props']
    return businesses, search_meta

Backend APIs often include loads of metadata, including ads, tracking info, etc. However, we only need the business info and the total number of pages in the search query so that we can retrieve all the results.
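For reference, the pagination props we keep look roughly like this (only the totalResults field is used by our scraper; other fields omitted):

{
    "totalResults": 240,
    ...
}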

Finally, let's wrap everything up with a loop function that scrapes all available search pages asynchronously. We'll scrape the first page and then scrape the rest of the pages asynchronously:

async def search_yelp(keyword: str, location: str, session: httpx.AsyncClient):
    """scrape all pages of yelp search for business preview data"""
    # get the first page data
    first_page = await _search_yelp_page(keyword, location, session=session)
    # parse first page for first page of businesses and total amount of pages
    businesses, search_meta = parse_search(first_page)
    # scrape remaining pages asynchronously
    tasks = []
    for page in range(10, search_meta['totalResults'], 10):
        tasks.append(
            _search_yelp_page(keyword, location, session=session, offset=page)
        )
        )
    for result in await asyncio.gather(*tasks):
        businesses.extend(parse_search(result)[0])

    return businesses

This common pagination scraping idiom allows us to greatly speed up web scraping via asynchronous requests. We retrieve the first page for the total page count, and then we can schedule concurrent requests for the rest of the pages.

import asyncio
from typing import Dict, List, Tuple
import httpx

def parse_search(search_results: Dict) -> Tuple[List[Dict], Dict]:
    """
    Parses yelp search results for business results
    Returns list of businesses and search metadata
    """
    results = search_results['searchPageProps']['mainContentComponentsListProps']
    businesses = [r for r in results if r.get('searchResultBusiness') and not r.get('adLoggingInfo')]
    search_meta = next(r for r in results if r.get('type') == 'pagination')['props']
    return businesses, search_meta

async def _search_yelp_page(keyword: str, location: str, session: httpx.AsyncClient, offset=0):
    """scrape single page of yelp search"""
    # final url example:
    # https://www.yelp.com/search/snippet?find_desc=plumbers&find_loc=Toronto%2C+Ontario%2C+Canada&ns=1&start=210&parent_request_id=54233ce74d09d270&request_origin=user
    resp = await session.get(
        "https://www.yelp.com/search/snippet",
        params={
            "find_desc": keyword,
            "find_loc": location,
            "start": offset,
            "parent_request": "",
            "ns": 1,
            "request_origin": "user"
        }
    )
    assert resp.status_code == 200
    return resp.json()


async def search_yelp(keyword: str, location: str, session: httpx.AsyncClient):
    """scrape all pages of yelp search for business preview data"""
    first_page = await _search_yelp_page(keyword, location, session=session)
    businesses, search_meta = parse_search(first_page)
    tasks = []
    for page in range(10, search_meta['totalResults'], 10):
        tasks.append(
            _search_yelp_page(keyword, location, session=session, offset=page)
        )
    for result in await asyncio.gather(*tasks):
        businesses.extend(parse_search(result)[0])
    return businesses
Run code & example output
BASE_HEADERS = {
    "authority": "www.yelp.com",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US;en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
}
async def run():
    async with httpx.AsyncClient(headers=BASE_HEADERS) as session:
        results = await search_yelp('plumbers', 'Toronto, Ontario, Canada', session=session)
        print(results)

if __name__ == "__main__":
    asyncio.run(run())

Scraping Yelp Company Data

Now that we have our company discovery scraper we can further retrieve the details of each company we've discovered. For this, we need to scrape each company URL.

Let's start by taking a look at the company page itself and where the data is located:

yelp.com business parsing field markup

We can see that the HTML contains all the business data we might need, like phone number, address, etc. However, if we fire up the web inspector, we can see that the structure itself is not very tidy:

yelp.com dynamic class screenshot
highlight of dynamic class names that can change unpredictably

Such complex class names indicate that they are dynamically generated, meaning we cannot rely on class names in our HTML parsing selectors, or at least we have to be very careful about how we use them. Instead, we'll build our selectors relative to text matching. In other words, we'll find keyword text like "Get Directions" and navigate the tree to the address value:

usage of relative selectors illustration

We can easily achieve this by taking advantage of XPath's contains() function and the .. parent selector:

//a[contains(text(),"Get Directions")]/../following-sibling::p/text()
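To illustrate, here's a minimal sketch of this relative selection technique in parsel, using simplified stand-in HTML (the real page structure is more complex):

from parsel import Selector

# simplified, hypothetical HTML that mimics yelp's address markup
html = """
<section>
  <div><a href="/map/smooth-air">Get Directions</a></div>
  <p>305 Fleetwood Crescent Brampton, ON L6T 2E7</p>
</section>
"""
selector = Selector(text=html)
# find the "Get Directions" link, step up to its parent and take the sibling <p>
address = selector.xpath('//a[contains(text(),"Get Directions")]/../following-sibling::p/text()').get()
print(address)  # 305 Fleetwood Crescent Brampton, ON L6T 2E7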

We'll be using this technique to get most of the values, so let's get to it. For our XPath selectors we'll be using the parsel HTML parsing library:

$ pip install parsel

Using parsel and XPATH we can fully extract all visible details on the page:

import httpx
import asyncio
import json
from typing import Dict, List
from parsel import Selector

def parse_company(resp: httpx.Response):
    sel = Selector(text=resp.text)
    xpath = lambda xp: sel.xpath(xp).get(default="").strip()
    open_hours = {}
    for day in sel.xpath('//th/p[contains(@class,"day-of-the-week")]'):
        name = day.xpath('text()').get().strip()
        value = day.xpath('../following-sibling::td//p/text()').get().strip()
        open_hours[name.lower()] = value
    return dict(
        name=xpath('//h1/text()'),
        website=xpath('//p[contains(text(),"Business website")]/following-sibling::p/a/text()'),
        phone=xpath('//p[contains(text(),"Phone number")]/following-sibling::p/text()'),
        address=xpath('//a[contains(text(),"Get Directions")]/../following-sibling::p/text()'),
        logo=xpath('//img[contains(@class,"businessLogo")]/@src'),
        claim_status=''.join(sel.xpath('//span[contains(@class,"claim-text")]/text()').getall()).strip().lower(),
        open_hours=open_hours,
    )


async def _scrape_companies_by_url(company_urls: List[str], session: httpx.AsyncClient) -> List[Dict]:
    """Scrape yelp company details from given yelp company urls"""
    responses = await asyncio.gather(*[
        session.get(url) for url in company_urls
    ])
    results = []
    for resp in responses:
        results.append(parse_company(resp))
    return results

Here, we've added our parse_company function, where we're using the XPath techniques we covered earlier to extract our highlighted fields. If we run this scraper, we'd see results similar to:

Run code
BASE_HEADERS = {
    "authority": "www.yelp.com",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US;en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
}

async def run():
    async with httpx.AsyncClient(headers=BASE_HEADERS) as session:
        results = await _scrape_companies_by_url(["https://www.yelp.com/biz/smooth-air-brampton"], session=session)
        print(json.dumps(results, indent=2))

if __name__ == "__main__":
    asyncio.run(run())
[{
    "name": "Smooth Air",
    "website": "https://www.smoothairhvac.com",
    "phone": "(647) 828-6789",
    "address": "305 Fleetwood Crescent Brampton, ON L6T 2E7 Canada",
    "logo": "https://s3-media0.fl.yelpcdn.com/businessregularlogo/c90545xfS2yr7R7yKe9gZg/ms.jpg",
    "claim_status": "claimed",
    "open_hours": {
        "mon": "Open 24 hours",
        "tue": "Open 24 hours",
        "wed": "Open 24 hours",
        "thu": "Open 24 hours",
        "fri": "Open 24 hours",
        "sat": "Open 24 hours",
        "sun": "Open 24 hours"
    }
},
...
]

Scraping Yelp Reviews

To scrape Yelp company reviews, we have to take a look at another hidden API request. The easiest way to find this API endpoint is to simply click on the 2nd review page and observe the web inspector for outgoing requests:

web inspector when clicking on review page

Here we can see that a request to the /review_feed endpoint is being made. It uses a few parameters, like the BUSINESS_ID we scraped earlier during the Yelp search step.

How to find yelp's business ID?

Yelp's business ID can also be found in the HTML source of the business page itself:

import httpx
from parsel import Selector

def scrape_business_id(url):
    response = httpx.get(url)
    selector = Selector(response.text)
    return selector.css('meta[name="yelp-biz-id"]::attr(content)').get()

print(scrape_business_id("https://www.yelp.com/biz/capri-laguna-laguna-beach"))
"Yz7qwi0GipbeLBFAjSr_PQ"

For example, let's take this business https://www.yelp.com/biz/Yz7qwi0GipbeLBFAjSr_PQ.
Its reviews would be located at: https://www.yelp.com/biz/Yz7qwi0GipbeLBFAjSr_PQ/review_feed?rl=en&q=&sort_by=relevance_desc&start=10 which we can even click on and see the JSON results in our browser.
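As a quick sanity check, we can fetch this endpoint directly and peek at the pagination metadata we'll rely on below:

import httpx

# request one page of the hidden review API and inspect pagination info
url = (
    "https://www.yelp.com/biz/Yz7qwi0GipbeLBFAjSr_PQ"
    "/review_feed?rl=en&q=&sort_by=relevance_desc&start=10"
)
response = httpx.get(url, headers={"user-agent": "Mozilla/5.0"})
data = response.json()
print(data["pagination"]["totalResults"])  # total review count
print(len(data["reviews"]))  # reviews on this page (10 per page)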

Let's take a look at how we can scrape yelp reviews in Python:

import asyncio
from typing import TypedDict, List
from parsel import Selector
import httpx
import json


class Review(TypedDict):
    id: str
    userId: str
    business: dict
    user: dict
    comment: dict
    rating: int
    ...


async def scrape_reviews(business_url: str, session: httpx.AsyncClient) -> List[Review]:
    # first find business ID from business URL
    response_business = await session.get(business_url)
    selector = Selector(text=response_business.text)
    business_id = selector.css('meta[name="yelp-biz-id"]::attr(content)').get()
    # then scrape first page
    first_page = await session.get(
        f"https://www.yelp.com/biz/{business_id}/review_feed?rl=en&q=&sort_by=relevance_desc&start=0"
    )
    first_page_data = json.loads(first_page.text)
    reviews = first_page_data["reviews"]
    total_reviews = first_page_data["pagination"]["totalResults"]
    print(f"scraping {total_reviews} of business {business_id}")
    to_scrape = [
        session.get(
            f"https://www.yelp.com/biz/{business_id}/review_feed?rl=en&q=&sort_by=relevance_desc&start={offset}"
        )
        for offset in range(10, total_reviews + 10, 10)
    ]
    for page in asyncio.as_completed(to_scrape):
        response = await page
        data = json.loads(response.text)
        reviews.extend(data["reviews"])
    return reviews
Run Code & Example Output
BASE_HEADERS = {
    "authority": "www.yelp.com",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US;en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
}


async def run():
    async with httpx.AsyncClient(headers=BASE_HEADERS) as session:
        results = await scrape_reviews("https://www.yelp.com/biz/capri-laguna-laguna-beach", session=session)
        print(json.dumps(results, indent=2))


if __name__ == "__main__":
    asyncio.run(run())
[
  {
    "id": "eB6j_V2LILubb2i0O6pODw",
    "userId": "ANYfELwm1rX-Z__Ryi_pQQ",
    "business": {
      "id": "Yz7qwi0GipbeLBFAjSr_PQ",
      "alias": "capri-laguna-laguna-beach",
      "name": "Capri Laguna",
      "photoSrc": "https://s3-media0.fl.yelpcdn.com/bphoto/nLCrCo0iobpoB6dIpEifXw/60s.jpg"
    },
    "user": {
      "link": "HIDDEN",
      "src": "HIDDEN",
      "srcSet": null,
      "markupDisplayName": "HIDDEN",
      "displayLocation": "HIDDEN",
      "altText": "HIDDEN",
      "userUrl": "HIDDEN",
      "partnerAlias": null,
      "friendCount": 0,
      "photoCount": 3,
      "reviewCount": 1,
      "eliteYear": null
    },
    "comment": {
      "text": "Very nice getaway for the family! I have been in Capri Laguna three times this summer already and the place never fails to amaze me. <br>The hotel has the best view over the ocean. You can watch the sunset from any deck or terrace in this hotel. As well some rooms have private balconies. <br>The service was great. The rooms are very clean and comfortable. The area is so calm and relaxing, you can sleep peacefully and comfortably. <br>The staff is so welcoming and respectful. Bachir was great, he is kind, friendly and very professional. Amazing customer service. Thank you Bachir!",
      "language": "en"
    },
    "localizedDate": "9/23/2022",
    "localizedDateVisited": null,
    "rating": 5,
    "photos": [
      {
        "src": "https://s3-media0.fl.yelpcdn.com/bphoto/0OUID9ZaHT89dEgZpU9wmA/180s.jpg",
        "caption": null,
        ...
      }
      ...
    ],
    "lightboxMediaItems": [
      { ... },
    ],
    "photosUrl": "/biz_photos/capri-laguna-laguna-beach?userid=ANYfELwm1rX-Z__Ryi_pQQ",
    "totalPhotos": 3,
    "feedback": {
      "counts": {
        "useful": 0,
        "funny": 0,
        "cool": 0
      },
      "userFeedback": {
        "useful": false,
        "funny": false,
        "cool": false
      },
      "voterText": null
    },
    "isUpdated": false,
    "businessOwnerReplies": null,
    "appreciatedBy": null,
    "previousReviews": null,
    "tags": [
      {
        "label": "3 photos",
        "title": null,
        "href": "HIDDEN",
        "iconName": "18x18_camera",
        "iconColor": ""
      }
    ]
  },
  ...
]

In our scraper above, to download yelp review data, we first scrape the business ID from the business's profile page. Then, we use this ID to scrape the first page of the reviews to find the total review count and scrape the rest of the review pages concurrently.

The code snippet above retrieved over 500 yelp reviews in mere seconds! That's because hidden APIs are much faster to scrape than HTML pages.

Bypass Yelp Blocking with Scrapfly

Yelp.com is a major web scraping target, meaning they employ many techniques to block web scrapers at scale. To retrieve the pages we used custom headers that replicate a common web browser, but if we were to scale this scraper to thousands of companies, Yelp would eventually catch on and block us.

yelp blocked: this page is not available

Once Yelp realizes the client is a web scraper, it will start redirecting all requests to a "This page is not available" page. How can we avoid this?
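A first step is simply detecting the block. Here's a minimal sketch, assuming the block page can be recognized by an error status code or its "This page is not available" message:

import httpx

def is_blocked(response: httpx.Response) -> bool:
    # common blocking signals: error status codes or the block page text
    if response.status_code in (403, 429, 503):
        return True
    return "This page is not available" in response.text

response = httpx.get("https://www.yelp.com/biz/smooth-air-brampton")
if is_blocked(response):
    print("blocked - slow down, rotate proxies or headers before retrying")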

There's a lot more we can do to avoid scraper blocking; for all of the details, refer to our in-depth guide:

How to Scrape Without Getting Blocked? In-Depth Tutorial

For an in-depth look at web scraping blocking, see our complete guide, which covers what technologies are used to detect web scrapers and how to get around them.


For this project, to avoid blocking, we'll be using ScrapFly's web scraping API.

illustration of scrapfly's middleware

It offers several powerful features that'll help us get around yelp's blocking, such as anti-scraping protection bypass and country-specific proxies.

To scrape Yelp.com using ScrapFly and Python all we have to do is install scrapfly-sdk:

$ pip install scrapfly-sdk

Then, replace the httpx functionality with ScrapFly's SDK client functions. For example, to scrape a business phone number:

import httpx
from parsel import Selector

response = httpx.get("https://www.yelp.com/biz/smooth-air-brampton")
selector = Selector(text=response.text)
phone_number = selector.xpath('//p[contains(text(),"Phone number")]/following-sibling::p/text()').get()

# in ScrapFly SDK becomes
from scrapfly import ScrapflyClient, ScrapeConfig
client = ScrapflyClient("YOUR SCRAPFLY KEY")
result = client.scrape(ScrapeConfig(
    "https://www.yelp.com/biz/smooth-air-brampton",
    # we can select specific proxy country
    country="US",
    # and enable anti scraping protection bypass:
    asp=True
))
phone_number = result.selector.xpath('//p[contains(text(),"Phone number")]/following-sibling::p/text()').get()

See the Full Scraper Code section for more.

FAQ

To wrap this guide up let's take a look at some frequently asked questions about web scraping Yelp:

Is web scraping Yelp.com legal?

Yes. Yelp hosts only public data, and we're not extracting anything personal or private. Scraping yelp.com at slow, respectful rates falls under the ethical scraping definition. When scraping Yelp reviews, we should ensure that we don't collect any personal data in GDPR-protected countries, or consult a lawyer. See our Is Web Scraping Legal? article for more.
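For example, one minimal way to keep request rates respectful in our asynchronous scraper is to cap concurrency with a semaphore (an illustrative sketch, not part of the scraper above):

import asyncio
import httpx

semaphore = asyncio.Semaphore(3)  # allow at most 3 requests in flight

async def polite_get(session: httpx.AsyncClient, url: str) -> httpx.Response:
    """GET a url while respecting the global concurrency limit"""
    async with semaphore:
        response = await session.get(url)
        await asyncio.sleep(1)  # brief pause to keep the request rate low
        return response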

How to scrape Yelp reviews?

To retrieve reviews of the business page, we need to replicate yet another backend API request. If we click the 2nd page in the review container, we can see a request to https://www.yelp.com/biz/BUSINESS_ID/review_feed?rl=en&q=&sort_by=relevance_desc&start=10 being made:

web inspector when clicking on review page

Where BUSINESS_ID is the ID we extracted earlier during the search step; alternatively, it can be found in the HTML source of the business page itself.

For example, the reviews of https://www.yelp.com/biz/capri-laguna-laguna-beach would be located under this URL: https://www.yelp.com/biz/Yz7qwi0GipbeLBFAjSr_PQ/review_feed?rl=en&q=&sort_by=relevance_desc&start=10

Yelp Scraping Summary

In this tutorial, we built a small yelp.com scraper which discovers companies from a provided keyword and location input and retrieves their contact details such as phone number, website, and other information fields.

For this, we used Python with the httpx and parsel packages, and to avoid being blocked we used ScrapFly's API, which smartly configures every web scraper connection to avoid blocking. For more on ScrapFly, see our documentation and try it out for free!

Full Yelp Scraper Code

Here's the full Yelp.com scraper using Python and Scrapfly Python SDK:

💙 This code should only be used as a reference. To scrape data from Yelp.com at scale you'll need to adjust it to your preferences and environment

import asyncio
import json
from pathlib import Path
from typing import Dict, List, Optional, Tuple, TypedDict
from urllib.parse import urlencode, urljoin

from scrapfly import ScrapeApiResponse, ScrapeConfig, ScrapflyClient

scrapfly = ScrapflyClient(key="YOUR SCRAPFLY KEY")


def parse_search(result: ScrapeApiResponse) -> Tuple[List[Dict], Dict]:
    """
    Parses yelp search results for business results
    Returns list of businesses and search metadata
    """
    search_results = json.loads(result.content)
    results = search_results["searchPageProps"]["mainContentComponentsListProps"]
    businesses = [r for r in results if r.get("searchResultBusiness") and not r.get("adLoggingInfo")]
    search_meta = next(r for r in results if r.get("type") == "pagination")["props"]
    return businesses, search_meta


class Company(TypedDict):
    """Type hint data for a yelp company scrape result"""
    name: str
    website: str
    phone: str
    address: str
    logo: str
    open_hours: dict[str, str]
    claim_status: str


def parse_company(result: ScrapeApiResponse):
    """Parse yelp company page scrape result for company details"""
    xpath = lambda xp: result.selector.xpath(xp).get(default="").strip()
    open_hours = {}
    for day in result.selector.xpath('//th/p[contains(@class,"day-of-the-week")]'):
        name = day.xpath("text()").get().strip()
        value = day.xpath("../following-sibling::td//p/text()").get().strip()
        open_hours[name.lower()] = value

    claim_status = (
        "".join(result.selector.xpath('//span[contains(@class,"claim-text")]/text()').getall()).strip().lower()
    )
    return dict(
        name=xpath("//h1/text()"),
        website=xpath('//p[contains(text(),"Business website")]/following-sibling::p/a/text()'),
        phone=xpath('//p[contains(text(),"Phone number")]/following-sibling::p/text()'),
        address=xpath('//a[contains(text(),"Get Directions")]/../following-sibling::p/text()'),
        logo=xpath('//img[contains(@class,"businessLogo")]/@src'),
        open_hours=open_hours,
        claim_status=claim_status,
    )


async def search_yelp(keyword: str, location: str, max_results: Optional[int] = None):
    """scrape all pages of yelp search for business preview data"""

    def create_search_url(offset=0):
        """scrape single page of yelp search"""
        return "https://www.yelp.com/search/snippet?" + urlencode(
            {
                "find_desc": keyword,
                "find_loc": location,
                "start": offset,
                "parent_request": "",
                "ns": 1,
                "request_origin": "user",
            }
        )

    # scrape first page
    first_page = await scrapfly.async_scrape(ScrapeConfig(create_search_url()))
    businesses, search_meta = parse_search(first_page)
    total_results = search_meta["totalResults"]
    if max_results and total_results > max_results:
        total_results = max_results
    # scrape other pages concurrently
    other_urls = [create_search_url(page_offset) for page_offset in range(10, total_results, 10)]
    async for result in scrapfly.concurrent_scrape([ScrapeConfig(url) for url in other_urls]):
        businesses.extend(parse_search(result)[0])
    return businesses


async def scrape_companies_by_url(urls: List[str]) -> List[Dict]:
    """Scrape yelp company details from given yelp company urls"""
    results = []
    async for result in scrapfly.concurrent_scrape([ScrapeConfig(url) for url in urls]):
        results.append(parse_company(result))
    return results


async def scrape_companies_by_search(keyword: str, location: str, max_results: Optional[int] = None):
    """Scrape yelp company detail from given search details"""
    found_company_previews = await search_yelp(keyword, location, max_results=max_results)
    company_urls = [
        urljoin(
            "https://www.yelp.com",
            company_preview["searchResultBusiness"]["businessUrl"],
        )
        for company_preview in found_company_previews
    ]
    return await scrape_companies_by_url(company_urls)


class Review(TypedDict):
    """Type hint data for a yelp review scrape result"""
    id: str
    userId: str
    business: dict
    user: dict
    comment: dict
    rating: int
    ...


async def scrape_reviews(business_url: str, max_reviews: Optional[int] = None) -> List[Review]:
    """Scrape yelp reviews for a given business url"""
    # get business ID from business url
    result_business = await scrapfly.async_scrape(ScrapeConfig(business_url))
    business_id = result_business.selector.css('meta[name="yelp-biz-id"]::attr(content)').get()

    # scrape first page of reviews
    first_page = await scrapfly.async_scrape(
        ScrapeConfig(f"https://www.yelp.com/biz/{business_id}/review_feed?rl=en&q=&sort_by=relevance_desc&start=0")
    )
    first_page_data = json.loads(first_page.content)
    reviews = first_page_data["reviews"]
    total_reviews = first_page_data["pagination"]["totalResults"]
    if max_reviews and total_reviews > max_reviews:
        total_reviews = max_reviews

    # scrape remaining pages of reviews
    print(f"scraping {total_reviews} of business {business_id}")
    to_scrape = [
        ScrapeConfig(
            f"https://www.yelp.com/biz/{business_id}/review_feed?rl=en&q=&sort_by=relevance_desc&start={offset}"
        )
        for offset in range(10, total_reviews + 10, 10)
    ]
    async for result in scrapfly.concurrent_scrape(to_scrape):
        data = json.loads(result.content)
        reviews.extend(data["reviews"])
    return reviews


async def example_run():
    """
    This example demonstrates:
    - business review scraping and saves reviews of a single business to /results/reviews.json
    - business page scraping by direct url and saves details of a single business to /results/companies.json
    - business page scraping by search keyword and saves details of the first 20 businesses to /results/search.json
    """
    out = Path(__file__).parent / "results"
    out.mkdir(exist_ok=True)

    result_reviews = await scrape_reviews("https://www.yelp.com/biz/capri-laguna-laguna-beach", max_reviews=50)
    out.joinpath("reviews.json").write_text(json.dumps(result_reviews, indent=2, ensure_ascii=False))

    result_companies = await scrape_companies_by_url(["https://www.yelp.com/biz/capri-laguna-laguna-beach"])
    out.joinpath("companies.json").write_text(json.dumps(result_companies, indent=2, ensure_ascii=False))

    result_search = await scrape_companies_by_search("plumbers", "Toronto, Ontario, Canada", max_results=20)
    out.joinpath("search.json").write_text(json.dumps(result_search, indent=2, ensure_ascii=False))


if __name__ == "__main__":
    asyncio.run(example_run())

Related Posts

How to Web Scrape with HTTPX and Python

Intro to using Python's httpx library for web scraping. Proxy and user agent rotation and common web scraping challenges, tips and tricks.

How to Scrape Goat.com for Fashion Apparel Data in Python

Goat.com is a rising storefront for luxury fashion apparel items. It's known for high-quality apparel data, so in this tutorial we'll take a look at how to scrape it using Python.

How to Scrape Fashionphile for Second Hand Fashion Data

In this fashion scrape guide we'll be taking a look at Fashionphile - another major second-hand luxury fashion marketplace. We'll be using Python and hidden web data scraping to grab all of this data in just a few lines of code.