How to Web Scrape Yelp.com


Yelp.com is one of the oldest and best-known yellow page websites. It contains company information such as address, website and location, as well as user reviews of these companies.

In this web scraping tutorial, we'll take a look at how we can scrape yelp.com in Python. We'll start off with a bit of reverse engineering of the search functionality so we can find businesses, then we'll scrape and parse the business data itself. Finally, we'll take a look at how to avoid our scraper getting blocked when scraping at scale, since Yelp is notorious for blocking web scrapers.

Setup

We'll be using Python in this tutorial as well as a few popular community packages:

  • httpx - HTTP client library which will let us communicate with yelp.com's servers.
  • parsel - HTML parsing library which will help us parse the scraped HTML for business data.
  • loguru [optional] - for prettier logging, so we can follow along more easily.

We can easily install them using pip command:

$ pip install httpx parsel loguru

Alternatively, feel free to swap httpx out for any other HTTP client package such as requests, as we'll only need basic HTTP functions which are almost interchangeable in every library. As for parsel, another great alternative is the beautifulsoup package, or anything else that supports CSS selectors, which is what we'll be using in this tutorial.
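To illustrate just how interchangeable the HTTP clients are, here's the same basic GET request made with both httpx and requests:

import httpx
import requests

# the same basic GET request in either library - for our purposes they're interchangeable
resp = httpx.get("https://www.yelp.com/search", params={"find_desc": "plumbers"})
resp = requests.get("https://www.yelp.com/search", params={"find_desc": "plumbers"})
print(resp.status_code)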

Discovering Company Pages

To start scraping, we need a way to discover businesses on Yelp.

Unfortunately, if we take a look at yelp.com/robots.txt we can see that yelp.com doesn't provide a sitemap or any directory pages which might contain all the businesses. This means we have to reverse engineer their search functionality and replicate that in our web scraper.

Research

Let's start by taking a look at yelp's front page and what is happening when we submit our search:

yelp.com search functionality

We can see that upon submitting our search details we are redirected to a URL containing the search keywords:

https://www.yelp.com/search?find_desc=plumbers&find_loc=Toronto%2C+Ontario%2C+Canada&ns=1&start=220

This is our search seed request, but we can go even further and look for data requests by examining the pagination. Let's click on the next page link and see what is happening in our browser's web inspector XHR tab:

yelp page 2 network traffic inspector
https://www.yelp.com/search/snippet?find_desc=plumbers&find_loc=Toronto%2C+Ontario%2C+Canada&ns=1&start=210&parent_request_id=54233ce74d09d270&request_origin=user

We found the data endpoint of Yelp's backend API. We can see that the /search/snippet endpoint takes search parameters and returns search results containing business IDs and preview details like:

{
    // Business ID which we'll need later
    "bizId": "oIff0iLkEiPsWcDATe6mfA",
    // Business preview data
    "searchResultBusiness": {
        "ranking": null, "isAd": true,
        "renderAdInfo": true,
        "name": "Smooth Air",
        "alternateNames": [],
        "businessUrl": "/adredir?ad_business_id=oIff0iLkEiPsWcDATe6mfA&campaign_id=VcMvmxKjXiH2peL8g1c_jw&click_origin=search_results&placement=carousel_0&placement_slot=0&redirect_url=https%3A%2F%2Fwww.yelp.com%2Fbiz%2Fsmooth-air-brampton&request_id=daed206f44c35b85&signature=e537121fa6eb5d95fe240274d63ae189267de71994e5908c824eab5cea323c55&slot=1",
        "categories": [{
            "title": "Plumbing",
            "url": "/search?cflt=plumbing&find_loc=Toronto%2C+Ontario%2C+Canada"
        }, {
            "title": "Heating & Air Conditioning/HVAC",
            "url": "/search?cflt=hvac&find_loc=Toronto%2C+Ontario%2C+Canada"
        }, {
            "title": "Water Heater Installation/Repair",
            "url": "/search?cflt=waterheaterinstallrepair&find_loc=Toronto%2C+Ontario%2C+Canada"
        }],
        "priceRange": "",
        "rating": 0.0,
        "reviewCount": 0,
        "formattedAddress": "",
        "neighborhoods": [],
        "phone": "",
        "serviceArea": null,
        "parentBusiness": null,
        "servicePricing": null,
        "bizSiteUrl": "https://biz.yelp.com"
}

So, we can use this API endpoint to find all business IDs for a given location and search term. With this information, we can start working on our web scraper.

We can start on our web scraper by replicating the search request we saw earlier:

import asyncio
import httpx


async def _search_yelp_page(keyword: str, location: str, session: httpx.AsyncClient, offset=0):
    """scrape single page of yelp search"""
    # final url example:
    # https://www.yelp.com/search/snippet?find_desc=plumbers&find_loc=Toronto%2C+Ontario%2C+Canada&ns=1&start=210&parent_request_id=54233ce74d09d270&request_origin=user
    resp = await session.get(
        "https://www.yelp.com/search/snippet",
        params={
            "find_desc": keyword,
            "find_loc": location,
            "start": offset,
            "parent_request": "",
            "ns": 1,
            "request_origin": "user"
        }
    )
    assert resp.status_code == 200
    return resp.json()

Note: We're using asynchronous Python so that later we can schedule multiple requests concurrently, which will give us a huge speed boost.

In the script above we are replicating the /search/snippet endpoint request which returns search result data for a single search page. Let's see the results this scraper generates:

Run code & example output
BASE_HEADERS = {
    "authority": "www.yelp.com",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US;en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
}
async def run():
    async with httpx.AsyncClient(headers=BASE_HEADERS) as session:
        results = await _search_yelp_page('plumbers', 'Toronto, Ontario, Canada', session=session)
        print(results)

if __name__ == "__main__":
    asyncio.run(run())
{
  "pageTitle": [
    "Yelp"
  ],
  "loggingConfig": {
    "sitRepConfig": {
      "isSitRepEnabled": true,
      "enabledSitRepChannels": {
        "vertical_search_reservation": true,
        "vertical_search_platform": true,
        "frontend_performance": true,
        "search_suggest_events": true,
        "vertical_search_waitlist": true,
        "ad_syndication_cookie_sync_errors": true,
        "traffic_quality": true,
        "search_ux": true,
        "message_the_business": true,
        "ytp_session_events": true,
        "ad_syndication": true
      }
    }
  },
  "searchPageProps": {
    ...
  },
...
}

Further, we need to parse this search data and implement the ability to scrape all the pages.
Let's start with parsing:

from typing import Dict, List, Tuple


def parse_search(search_results: Dict) -> Tuple[List[Dict], Dict]:
    """
    Parses yelp search results for business results
    Returns list of businesses and search metadata
    """
    results = search_results['searchPageProps']['mainContentComponentsListProps']
    businesses = [r for r in results if r.get('searchResultBusiness') and not r.get('adLoggingInfo')]
    search_meta = next(r for r in results if r.get('type') == 'pagination')['props']
    return businesses, search_meta

Backend APIs often include loads of metadata, including ads, tracking info etc. However, we only need the business info and the total number of results in the search query so that we can retrieve all of the result pages.
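For example, since Yelp serves 10 results per search page, computing the remaining page offsets from the pagination metadata is simple arithmetic (the totalResults value below is made up for illustration):

# hypothetical pagination metadata - yelp serves 10 results per search page
search_meta = {"totalResults": 237}
# offsets for the remaining pages; offset 0 (the first page) is already scraped
offsets = list(range(10, search_meta["totalResults"], 10))
print(offsets)  # [10, 20, 30, ..., 230]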

Finally, let's wrap everything up with a loop function that scrapes all available search pages asynchronously. We'll scrape the first page and then scrape the rest of the pages asynchronously:

async def search_yelp(keyword: str, location: str, session: httpx.AsyncClient):
    """scrape all pages of yelp search for business preview data"""
    # get the first page data
    first_page = await _search_yelp_page(keyword, location, session=session)
    # parse the first page for businesses and the total result count
    businesses, search_meta = parse_search(first_page)
    # scrape remaining pages concurrently
    tasks = []
    for page in range(10, search_meta['totalResults'], 10):
        tasks.append(
            _search_yelp_page(keyword, location, session=session, offset=page)
        )
    for result in await asyncio.gather(*tasks):
        businesses.extend(parse_search(result)[0])

    return businesses

This common pagination scraping idiom allows us to greatly speed up web scraping via asynchronous requests. We retrieve the first page for the total page count, and then we can schedule concurrent requests for the rest of the pages.

import asyncio
from typing import Dict, List, Tuple
import httpx

def parse_search(search_results: Dict) -> Tuple[List[Dict], Dict]:
    """
    Parses yelp search results for business results
    Returns list of businesses and search metadata
    """
    results = search_results['searchPageProps']['mainContentComponentsListProps']
    businesses = [r for r in results if r.get('searchResultBusiness') and not r.get('adLoggingInfo')]
    search_meta = next(r for r in results if r.get('type') == 'pagination')['props']
    return businesses, search_meta

async def _search_yelp_page(keyword: str, location: str, session: httpx.AsyncClient, offset=0):
    """scrape single page of yelp search"""
    # final url example:
    # https://www.yelp.com/search/snippet?find_desc=plumbers&find_loc=Toronto%2C+Ontario%2C+Canada&ns=1&start=210&parent_request_id=54233ce74d09d270&request_origin=user
    resp = await session.get(
        "https://www.yelp.com/search/snippet",
        params={
            "find_desc": keyword,
            "find_loc": location,
            "start": offset,
            "parent_request": "",
            "ns": 1,
            "request_origin": "user"
        }
    )
    assert resp.status_code == 200
    return resp.json()


async def search_yelp(keyword: str, location: str, session: httpx.AsyncClient):
    """scrape all pages of yelp search for business preview data"""
    first_page = await _search_yelp_page(keyword, location, session=session)
    businesses, search_meta = parse_search(first_page)
    tasks = []
    for page in range(10, search_meta['totalResults'], 10):
        tasks.append(
            _search_yelp_page(keyword, location, session=session, offset=page)
        )
    for result in await asyncio.gather(*tasks):
        businesses.extend(parse_search(result)[0])
    return businesses
Run code & example output
BASE_HEADERS = {
    "authority": "www.yelp.com",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US;en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
}
async def run():
    async with httpx.AsyncClient(headers=BASE_HEADERS) as session:
        results = await search_yelp('plumbers', 'Toronto, Ontario, Canada', session=session)
        print(results)

if __name__ == "__main__":
    asyncio.run(run())

Getting Full Company Data

Now that we have our company discovery scraper, we can retrieve the details of each company we've discovered. For this, we need to scrape each company's page.

Let's start by taking a look at the company page itself and where the data is located:

yelp.com business parsing field markup

We can see that the HTML contains all the business data we might need, like phone number, address etc. However, if we fire up the web inspector, we can see that the structure itself is not very tidy:

yelp.com dynamic class screenshot
highlight of dynamic class names that can change unpredictably

Such complex class names indicate that they are dynamically generated, meaning we cannot rely on class names in our HTML parsing selectors - or at least we have to be very careful about how we use them. Instead, we'll build our selectors relative to text matching. In other words, we'll find keyword text like "Get Directions" and navigate the tree to the address value:

usage of relative selectors illustration

We can easily achieve this by taking advantage of XPath's contains() and .. features:

//a[contains(text(),"Get Directions")]/../following-sibling::p/text()
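To see how this works, here's the same relative selection applied to a simplified, made-up HTML snippet (real Yelp markup is messier, but the navigation logic is identical):

from parsel import Selector

# simplified, hypothetical markup mimicking yelp's address section
html = """
<div>
  <div><a href="/map">Get Directions</a></div>
  <p>305 Fleetwood Crescent Brampton, ON</p>
</div>
"""
sel = Selector(text=html)
# match the <a> by its text, step up to its parent, then take the following sibling <p>
address = sel.xpath('//a[contains(text(),"Get Directions")]/../following-sibling::p/text()').get()
print(address)  # 305 Fleetwood Crescent Brampton, ON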

We'll be using this technique to get most of the values, so let's get to it. For our XPath selectors, we'll be using the parsel HTML parsing library:

$ pip install parsel

Using parsel and XPATH we can fully extract all visible details on the page:

import asyncio
import json
from typing import Dict, List

import httpx
from parsel import Selector

def parse_company(resp: httpx.Response):
    sel = Selector(text=resp.text)
    xpath = lambda xp: sel.xpath(xp).get(default="").strip()
    open_hours = {}
    for day in sel.xpath('//th/p[contains(@class,"day-of-the-week")]'):
        name = day.xpath('text()').get().strip()
        value = day.xpath('../following-sibling::td//p/text()').get().strip()
        open_hours[name.lower()] = value
    return dict(
        name=xpath('//h1/text()'),
        website=xpath('//p[contains(text(),"Business website")]/following-sibling::p/a/text()'),
        phone=xpath('//p[contains(text(),"Phone number")]/following-sibling::p/text()'),
        address=xpath('//a[contains(text(),"Get Directions")]/../following-sibling::p/text()'),
        logo=xpath('//img[contains(@class,"businessLogo")]/@src'),
        claim_status=''.join(sel.xpath('//span[contains(@class,"claim-text")]/text()').getall()).strip().lower(),
        open_hours=open_hours,
    )


async def _scrape_companies_by_url(company_urls: List[str], session: httpx.AsyncClient) -> List[Dict]:
    """Scrape yelp company details from given yelp company urls"""
    responses = await asyncio.gather(*[
        session.get(url) for url in company_urls
    ])
    results = []
    for resp in responses:
        results.append(parse_company(resp))
    return results

Here, we've added a parse_company function which uses the XPath techniques we covered earlier to extract the highlighted fields. If we run this scraper, we'd see results similar to:

Run code
BASE_HEADERS = {
    "authority": "www.yelp.com",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US;en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
}

async def run():
    async with httpx.AsyncClient(headers=BASE_HEADERS) as session:
        results = await _scrape_companies_by_url(
            ["https://www.yelp.com/biz/smooth-air-brampton"], session=session
        )
        print(json.dumps(results, indent=2))

if __name__ == "__main__":
    asyncio.run(run())
[{
    "name": "Smooth Air",
    "website": "https://www.smoothairhvac.com",
    "phone": "(647) 828-6789",
    "address": "305 Fleetwood Crescent Brampton, ON L6T 2E7 Canada",
    "logo": "https://s3-media0.fl.yelpcdn.com/businessregularlogo/c90545xfS2yr7R7yKe9gZg/ms.jpg",
    "claim_status": "claimed",
    "open_hours": {
        "mon": "Open 24 hours",
        "tue": "Open 24 hours",
        "wed": "Open 24 hours",
        "thu": "Open 24 hours",
        "fri": "Open 24 hours",
        "sat": "Open 24 hours",
        "sun": "Open 24 hours"
    }
},
...
]

Finally, we can put everything together into a comprehensive scraper that searches for companies and scrapes their full profile details:

import asyncio
import json
from typing import Dict, List, Tuple
from urllib.parse import urljoin

import httpx
from parsel import Selector


def parse_search(search_results: Dict) -> Tuple[List[Dict], Dict]:
    """
    Parses yelp search results for business results
    Returns list of businesses and search metadata
    """
    results = search_results["searchPageProps"]["mainContentComponentsListProps"]
    businesses = [
        r
        for r in results
        if r.get("searchResultBusiness") and not r.get("adLoggingInfo")
    ]
    search_meta = next(r for r in results if r.get("type") == "pagination")["props"]
    return businesses, search_meta


def parse_company(resp: httpx.Response):
    sel = Selector(text=resp.text)
    xpath = lambda xp: sel.xpath(xp).get(default="").strip()
    open_hours = {}
    for day in sel.xpath('//th/p[contains(@class,"day-of-the-week")]'):
        name = day.xpath("text()").get().strip()
        value = day.xpath("../following-sibling::td//p/text()").get().strip()
        open_hours[name.lower()] = value
    return dict(
        name=xpath("//h1/text()"),
        website=xpath('//p[contains(text(),"Business website")]/following-sibling::p/a/text()'),
        phone=xpath('//p[contains(text(),"Phone number")]/following-sibling::p/text()'),
        address=xpath('//a[contains(text(),"Get Directions")]/../following-sibling::p/text()'),
        logo=xpath('//img[contains(@class,"businessLogo")]/@src'),
        claim_status="".join(
            sel.xpath('//span[contains(@class,"claim-text")]/text()').getall()
        ).strip().lower(),
        open_hours=open_hours,
    )


async def _search_yelp_page(keyword: str, location: str, session: httpx.AsyncClient, offset=0):
    """scrape single page of yelp search"""
    # final url example:
    # https://www.yelp.com/search/snippet?find_desc=plumbers&find_loc=Toronto%2C+Ontario%2C+Canada&ns=1&start=210&parent_request_id=54233ce74d09d270&request_origin=user
    resp = await session.get(
        "https://www.yelp.com/search/snippet",
        params={
            "find_desc": keyword,
            "find_loc": location,
            "start": offset,
            "parent_request": "",
            "ns": 1,
            "request_origin": "user",
        },
    )
    assert resp.status_code == 200
    return resp.json()


async def search_yelp(keyword: str, location: str, session: httpx.AsyncClient):
    """scrape all pages of yelp search for business preview data"""
    first_page = await _search_yelp_page(keyword, location, session=session)
    businesses, search_meta = parse_search(first_page)
    tasks = []
    for page in range(10, search_meta["totalResults"], 10):
        tasks.append(_search_yelp_page(keyword, location, session=session, offset=page))
    for result in await asyncio.gather(*tasks):
        businesses.extend(parse_search(result)[0])
    return businesses


async def _scrape_companies_by_url(company_urls: List[str], session: httpx.AsyncClient) -> List[Dict]:
    """Scrape yelp company details from given yelp company urls"""
    responses = await asyncio.gather(*[session.get(url) for url in company_urls])
    results = []
    for resp in responses:
        results.append(parse_company(resp))
    return results


async def scrape_companies_by_search(keyword: str, location: str, session: httpx.AsyncClient):
    """Scrape yelp company detail from given search details"""
    found_company_previews = await search_yelp(keyword, location, session=session)
    company_urls = [
        urljoin(
            "https://www.yelp.com",
            company_preview["searchResultBusiness"]["businessUrl"],
        )
        for company_preview in found_company_previews
    ]
    return await _scrape_companies_by_url(company_urls, session=session)
Run code
BASE_HEADERS = {
    "authority": "www.yelp.com",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US;en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
}


async def run():
    limits = httpx.Limits(max_keepalive_connections=5, max_connections=5)
    async with httpx.AsyncClient(headers=BASE_HEADERS, limits=limits) as session:
        results = await scrape_companies_by_search(
            "plumbers", "Toronto, Ontario, Canada", session=session
        )
        print(json.dumps(results, indent=2))


if __name__ == "__main__":
    asyncio.run(run())

Avoiding Scraper Blocking

Yelp.com is a major web scraping target, meaning it employs many techniques to block web scrapers at scale. To retrieve the pages we did use custom headers that replicate a common web browser, but if we were to scale this scraper to thousands of companies, Yelp would eventually catch up to us and block us.

yelp blocked: this page is not available

Once Yelp realizes the client is a web scraper, it will start redirecting all requests to a "This page is not available" page. How can we avoid this?

There's a lot we can do to avoid scraper blocking; for all of these details, refer to our in-depth guide:

How to Scrape Without Getting Blocked Tutorial

It covers what technologies are being used to detect web scrapers and how to get around them.
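Basic mitigations, such as rotating realistic browser headers and backing off on failed requests, can help at a small scale but won't defeat Yelp's protection on their own. Here's a minimal sketch of that approach (the User-Agent pool is just an example):

import asyncio
import random
import httpx

# a small pool of realistic browser User-Agent strings to rotate through
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36",
]

async def get_with_backoff(url: str, session: httpx.AsyncClient, retries: int = 3) -> httpx.Response:
    """GET with a rotated user-agent and exponential backoff on non-200 responses"""
    for attempt in range(retries):
        resp = await session.get(url, headers={"user-agent": random.choice(USER_AGENTS)})
        if resp.status_code == 200:
            return resp
        await asyncio.sleep(2 ** attempt)  # back off before retrying
    raise RuntimeError(f"failed to retrieve {url} after {retries} retries")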

For this project, to avoid blocking, we'll be using ScrapFly's web scraping API.

illustration of scrapfly's middleware

ScrapFly offers several powerful features that'll help us get around Yelp's blocking, such as anti scraping protection bypass and smart proxies.

Let's re-implement our scraper to use ScrapFly's API via the scrapfly-sdk Python package:

$ pip install scrapfly-sdk

For this, all we have to do is replace the httpx functionality with ScrapFly's SDK client functions.

Full Scraper Code

Full Scraper Code with ScrapFly integration
import asyncio
import json
from typing import Dict, List, Tuple, TypedDict
from urllib.parse import urlencode, urljoin

from scrapfly import ScrapeApiResponse, ScrapeConfig, ScrapflyClient


def parse_search(result: ScrapeApiResponse) -> Tuple[List[Dict], Dict]:
    """
    Parses yelp search results for business results
    Returns list of businesses and search metadata
    """
    search_results = json.loads(result.content)
    results = search_results["searchPageProps"]["mainContentComponentsListProps"]
    businesses = [r for r in results if r.get("searchResultBusiness") and not r.get("adLoggingInfo")]
    search_meta = next(r for r in results if r.get("type") == "pagination")["props"]
    return businesses, search_meta


class Company(TypedDict):
    name: str
    website: str
    phone: str
    address: str
    logo: str
    open_hours: Dict[str, str]
    claim_status: str


def parse_company(result: ScrapeApiResponse) -> Company:
    xpath = lambda xp: result.selector.xpath(xp).get(default="").strip()
    open_hours = {}
    for day in result.selector.xpath('//th/p[contains(@class,"day-of-the-week")]'):
        name = day.xpath("text()").get().strip()
        value = day.xpath("../following-sibling::td//p/text()").get().strip()
        open_hours[name.lower()] = value

    claim_status = (
        "".join(result.selector.xpath('//span[contains(@class,"claim-text")]/text()').getall()).strip().lower()
    )
    return dict(
        name=xpath("//h1/text()"),
        website=xpath('//p[contains(text(),"Business website")]/following-sibling::p/a/text()'),
        phone=xpath('//p[contains(text(),"Phone number")]/following-sibling::p/text()'),
        address=xpath('//a[contains(text(),"Get Directions")]/../following-sibling::p/text()'),
        logo=xpath('//img[contains(@class,"businessLogo")]/@src'),
        open_hours=open_hours,
        claim_status=claim_status,
    )


def create_search_url(keyword: str, location: str, offset=0):
    """scrape single page of yelp search"""
    return "https://www.yelp.com/search/snippet?" + urlencode(
        {
            "find_desc": keyword,
            "find_loc": location,
            "start": offset,
            "parent_request": "",
            "ns": 1,
            "request_origin": "user",
        }
    )


async def search_yelp(keyword: str, location: str, session: ScrapflyClient):
    """scrape all pages of yelp search for business preview data"""
    first_page = await session.async_scrape(ScrapeConfig(create_search_url(keyword, location)))
    businesses, search_meta = parse_search(first_page)

    other_urls = [create_search_url(keyword, location, page) for page in range(10, search_meta["totalResults"], 10)]
    async for result in session.concurrent_scrape([ScrapeConfig(url) for url in other_urls]):
        businesses.extend(parse_search(result)[0])
    return businesses


async def _scrape_companies_by_url(urls: List[str], session: ScrapflyClient) -> List[Dict]:
    """Scrape yelp company details from given yelp company urls"""
    results = []
    async for result in session.concurrent_scrape([ScrapeConfig(url) for url in urls]):
        results.append(parse_company(result))
    return results


async def scrape_companies_by_search(keyword: str, location: str, session: ScrapflyClient):
    """Scrape yelp company detail from given search details"""
    found_company_previews = await search_yelp(keyword, location, session=session)
    company_urls = [
        urljoin(
            "https://www.yelp.com",
            company_preview["searchResultBusiness"]["businessUrl"],
        )
        for company_preview in found_company_previews
    ]
    return await _scrape_companies_by_url(company_urls, session=session)


async def run():
    scrapfly = ScrapflyClient(key="YOUR_SCRAPFLY_KEY", max_concurrency=20)
    with scrapfly as session:
        results = await scrape_companies_by_search("plumbers", "Toronto, Ontario, Canada", session=session)
        print(json.dumps(results, indent=2))


if __name__ == "__main__":
    asyncio.run(run())

In our updated scraper above, we've replaced the httpx calls with ScrapflyClient calls, so all of our requests go through ScrapFly's API, which smartly avoids web scraper blocking.

FAQ

To wrap this guide up let's take a look at some frequently asked questions about web scraping yelp.com:

How to scrape Yelp reviews?

To retrieve the reviews of a business page, we need to replicate yet another backend API request. If we click the 2nd page in the review container, we can see a request to https://www.yelp.com/biz/BUSINESS_ID/review_feed?rl=en&q=&sort_by=relevance_desc&start=10 being made:

web inspector when clicking on review page

Here, BUSINESS_ID is the ID we extracted earlier during the search step; alternatively, it can be found in the HTML source of the business page itself.

For example, the reviews of https://www.yelp.com/biz/capri-laguna-laguna-beach would be located under this URL: https://www.yelp.com/biz/Yz7qwi0GipbeLBFAjSr_PQ/review_feed?rl=en&q=&sort_by=relevance_desc&start=10
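Putting this together, a minimal review scraper sketch could look like the following. Note that the shape of the JSON response isn't covered in this article, so the "reviews" key below is an assumption to verify against the actual response:

import httpx

async def scrape_reviews(business_id: str, session: httpx.AsyncClient, max_pages: int = 5):
    """scrape review pages of a yelp business through the review_feed endpoint"""
    reviews = []
    for start in range(0, max_pages * 10, 10):  # reviews are paged 10 per request
        resp = await session.get(
            f"https://www.yelp.com/biz/{business_id}/review_feed",
            params={"rl": "en", "q": "", "sort_by": "relevance_desc", "start": start},
        )
        data = resp.json()
        # assumption: reviews live under a "reviews" key - inspect the response and adjust
        reviews.extend(data.get("reviews", []))
    return reviews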

Summary

In this tutorial, we built a small yelp.com scraper which discovers companies from a provided keyword and location and retrieves their contact details such as phone numbers, websites and other information fields.

For this, we used Python with the httpx and parsel packages. To avoid blocking, we used ScrapFly's API, which smartly configures every web scraper connection to prevent detection. For more on ScrapFly, see our documentation and try it out for free!
