How to Scrape Yelp.com
Tutorial on how to scrape yelp.com business and review data using pure Python. How to scrape yelp.com without logging in or being blocked.
Yelp.com is one of the oldest and best-known yellow pages websites. It contains company information such as address, website and location, as well as user reviews of these companies.
In this web scraping tutorial, we'll take a look at how we can scrape yelp.com in Python. We'll start off with a bit of reverse engineering of the search functionality so we can find businesses, and then we'll scrape and parse the business data itself. Finally, we'll take a look at how to avoid getting our scraper blocked when scraping at scale, since Yelp is notorious for blocking web scrapers.
To start scraping, we need a way to discover businesses on Yelp.
Unfortunately, if we take a look at https://www.yelp.com/robots.txt we can see that yelp.com doesn't provide a sitemap or any directory pages which might list all the businesses. This means we have to reverse engineer the search functionality and replicate it in our web scraper.
Let's start by taking a look at yelp's front page and what happens when we submit a search:
We can see that upon entering the search details we are redirected to a URL containing our search keywords:
https://www.yelp.com/search?find_desc=plumbers&find_loc=Toronto%2C+Ontario%2C+Canada&ns=1&start=220
This is our search seed request, but we can go even further and look for data requests by examining the pagination. Let's click on the next page and see what happens in our browser's web inspector XHR tab:
https://www.yelp.com/search/snippet?find_desc=plumbers&find_loc=Toronto%2C+Ontario%2C+Canada&ns=1&start=210&parent_request_id=54233ce74d09d270&request_origin=user
We found the data endpoint of yelp's backend API. We can see that the /search/snippet endpoint takes some parameters and returns search results containing business IDs and preview details like:
{
  // Business ID which we'll need later
  "bizId": "oIff0iLkEiPsWcDATe6mfA",
  // Business preview data
  "searchResultBusiness": {
    "ranking": null,
    "isAd": true,
    "renderAdInfo": true,
    "name": "Smooth Air",
    "alternateNames": [],
    "businessUrl": "/adredir?ad_business_id=oIff0iLkEiPsWcDATe6mfA&campaign_id=VcMvmxKjXiH2peL8g1c_jw&click_origin=search_results&placement=carousel_0&placement_slot=0&redirect_url=https%3A%2F%2Fwww.yelp.com%2Fbiz%2Fsmooth-air-brampton&request_id=daed206f44c35b85&signature=e537121fa6eb5d95fe240274d63ae189267de71994e5908c824eab5cea323c55&slot=1",
    "categories": [
      {
        "title": "Plumbing",
        "url": "/search?cflt=plumbing&find_loc=Toronto%2C+Ontario%2C+Canada"
      },
      {
        "title": "Heating & Air Conditioning/HVAC",
        "url": "/search?cflt=hvac&find_loc=Toronto%2C+Ontario%2C+Canada"
      },
      {
        "title": "Water Heater Installation/Repair",
        "url": "/search?cflt=waterheaterinstallrepair&find_loc=Toronto%2C+Ontario%2C+Canada"
      }
    ],
    "priceRange": "",
    "rating": 0.0,
    "reviewCount": 0,
    "formattedAddress": "",
    "neighborhoods": [],
    "phone": "",
    "serviceArea": null,
    "parentBusiness": null,
    "servicePricing": null,
    "bizSiteUrl": "https://biz.yelp.com"
  }
}
So, we can use this API endpoint to find all business IDs for a given location and search term. With this information we can start working on our web scraper.
We'll be using httpx as our HTTP client combined with Python's asyncio, so we can scrape yelp quickly and asynchronously. To install httpx let's use pip:
$ pip install httpx
We can start on our web scraper by replicating the search request we saw earlier:
import asyncio

import httpx


async def _search_yelp_page(keyword: str, location: str, session: httpx.AsyncClient, offset=0):
    """scrape single page of yelp search"""
    # final url example:
    # https://www.yelp.com/search/snippet?find_desc=plumbers&find_loc=Toronto%2C+Ontario%2C+Canada&ns=1&start=210&parent_request_id=54233ce74d09d270&request_origin=user
    resp = await session.get(
        "https://www.yelp.com/search/snippet",
        params={
            "find_desc": keyword,
            "find_loc": location,
            "start": offset,
            "parent_request": "",
            "ns": 1,
            "request_origin": "user",
        },
    )
    assert resp.status_code == 200
    return resp.json()


BASE_HEADERS = {
    "authority": "www.yelp.com",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US;en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
}


async def run():
    async with httpx.AsyncClient(headers=BASE_HEADERS) as session:
        results = await _search_yelp_page('plumbers', 'Toronto, Ontario, Canada', session=session)
        print(results)


if __name__ == "__main__":
    asyncio.run(run())
Note: we're using asynchronous Python so that later we can schedule multiple requests concurrently, which will give us a huge speed boost.
In the script above we replicate the /search/snippet endpoint request, which returns search result data for a single search page. Next, we need to parse this search data and implement the ability to scrape all of the result pages.
Let's start with parsing:
from typing import Dict, List, Tuple


def parse_search(search_results: Dict) -> Tuple[List[Dict], Dict]:
    """
    Parses yelp search results for business results
    Returns list of businesses and search metadata
    """
    results = search_results['searchPageProps']['mainContentComponentsListProps']
    businesses = [r for r in results if r.get('searchResultBusiness') and not r.get('adLoggingInfo')]
    search_meta = next(r for r in results if r.get('type') == 'pagination')['props']
    return businesses, search_meta
Backend APIs often include loads of metadata, including ads, tracking info etc. However, we only need the business info and the total result count of the search query so that we can retrieve all of the result pages.
Finally, let's wrap everything up with a loop function that scrapes all available search pages. We'll scrape the first page and then scrape the rest of the pages concurrently:
async def search_yelp(keyword: str, location: str, session: httpx.AsyncClient):
    """scrape all pages of yelp search for business preview data"""
    # get the first page data
    first_page = await _search_yelp_page(keyword, location, session=session)
    # parse first page for the first batch of businesses and the total result count
    businesses, search_meta = parse_search(first_page)
    # scrape remaining pages concurrently (10 results per page)
    tasks = []
    for page in range(10, search_meta['totalResults'], 10):
        tasks.append(
            _search_yelp_page(keyword, location, session=session, offset=page)
        )
    for result in await asyncio.gather(*tasks):
        businesses.extend(parse_search(result)[0])
    return businesses
This common pagination scraping idiom allows us to greatly speed up web scraping via concurrent requests: we retrieve the first page for the total result count, and then we schedule concurrent requests for the rest of the pages. Here's our full search scraper so far:
import asyncio
from typing import Dict, List, Tuple

import httpx


def parse_search(search_results: Dict) -> Tuple[List[Dict], Dict]:
    """
    Parses yelp search results for business results
    Returns list of businesses and search metadata
    """
    results = search_results['searchPageProps']['mainContentComponentsListProps']
    businesses = [r for r in results if r.get('searchResultBusiness') and not r.get('adLoggingInfo')]
    search_meta = next(r for r in results if r.get('type') == 'pagination')['props']
    return businesses, search_meta


async def _search_yelp_page(keyword: str, location: str, session: httpx.AsyncClient, offset=0):
    """scrape single page of yelp search"""
    # final url example:
    # https://www.yelp.com/search/snippet?find_desc=plumbers&find_loc=Toronto%2C+Ontario%2C+Canada&ns=1&start=210&parent_request_id=54233ce74d09d270&request_origin=user
    resp = await session.get(
        "https://www.yelp.com/search/snippet",
        params={
            "find_desc": keyword,
            "find_loc": location,
            "start": offset,
            "parent_request": "",
            "ns": 1,
            "request_origin": "user",
        },
    )
    assert resp.status_code == 200
    return resp.json()


async def search_yelp(keyword: str, location: str, session: httpx.AsyncClient):
    """scrape all pages of yelp search for business preview data"""
    first_page = await _search_yelp_page(keyword, location, session=session)
    businesses, search_meta = parse_search(first_page)
    tasks = []
    for page in range(10, search_meta['totalResults'], 10):
        tasks.append(
            _search_yelp_page(keyword, location, session=session, offset=page)
        )
    for result in await asyncio.gather(*tasks):
        businesses.extend(parse_search(result)[0])
    return businesses


BASE_HEADERS = {
    "authority": "www.yelp.com",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US;en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
}


async def run():
    async with httpx.AsyncClient(headers=BASE_HEADERS) as session:
        results = await search_yelp('plumbers', 'Toronto, Ontario, Canada', session=session)
        print(results)


if __name__ == "__main__":
    asyncio.run(run())
Now that we have our company discovery scraper, we can retrieve the details of each company we've discovered. For this we need to scrape each individual company URL.
Let's start by taking a look at the company page itself and where the data is located:
We can see that the HTML contains all the business data we might need, like phone number, address etc. However, if we fire up the web inspector we can see that the structure itself is not very tidy:
Such complex class names indicate that they are dynamically generated, meaning we cannot rely on class names in our HTML parsing selectors, or at least we have to be very careful about how we use them. Instead, we'll build our selectors around text matching. In other words, we'll find keyword text like "Get Directions" and navigate the tree to the address value:
We can easily achieve this by taking advantage of XPath's contains() and .. features:
//a[contains(text(),"Get Directions")]/../following-sibling::p/text()
We'll be using this technique to get most of the values, so let's get to it. For our XPath selectors we'll be using the parsel HTML parsing library:
$ pip install parsel
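Before we write the full parser, here's a minimal sketch of how this text-matching selector behaves on a made-up HTML fragment (the markup below is hypothetical, purely to illustrate the contains() and .. navigation):

from parsel import Selector

# hypothetical markup imitating yelp's layout, for illustration only
html = """
<div>
    <p><a href="/map">Get Directions</a></p>
    <p>305 Fleetwood Crescent Brampton, ON L6T 2E7</p>
</div>
"""
sel = Selector(text=html)
# match the link by its visible text, step up to the parent <p> with "..",
# then take the text of the following sibling <p>
address = sel.xpath('//a[contains(text(),"Get Directions")]/../following-sibling::p/text()').get()
print(address)  # 305 Fleetwood Crescent Brampton, ON L6T 2E7

This way the selector survives dynamic class name changes as long as the visible labels stay the same.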
Using parsel and XPath, we can fully extract all of the visible details on the page:
import asyncio
import json
from typing import Dict, List

import httpx
from parsel import Selector


def parse_company(resp: httpx.Response):
    sel = Selector(text=resp.text)
    xpath = lambda xp: sel.xpath(xp).get(default="").strip()
    open_hours = {}
    for day in sel.xpath('//th/p[contains(@class,"day-of-the-week")]'):
        name = day.xpath('text()').get().strip()
        value = day.xpath('../following-sibling::td//p/text()').get().strip()
        open_hours[name.lower()] = value
    return dict(
        name=xpath('//h1/text()'),
        website=xpath('//p[contains(text(),"Business website")]/following-sibling::p/a/text()'),
        phone=xpath('//p[contains(text(),"Phone number")]/following-sibling::p/text()'),
        address=xpath('//a[contains(text(),"Get Directions")]/../following-sibling::p/text()'),
        logo=xpath('//img[contains(@class,"businessLogo")]/@src'),
        claim_status=''.join(sel.xpath('//span[contains(@class,"claim-text")]/text()').getall()).strip().lower(),
        open_hours=open_hours,
    )


async def _scrape_companies_by_url(company_urls: List[str], session: httpx.AsyncClient) -> List[Dict]:
    """Scrape yelp company details from given yelp company urls"""
    responses = await asyncio.gather(*[
        session.get(url) for url in company_urls
    ])
    results = []
    for resp in responses:
        results.append(parse_company(resp))
    return results


BASE_HEADERS = {
    "authority": "www.yelp.com",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US;en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
}


async def run():
    async with httpx.AsyncClient(headers=BASE_HEADERS) as session:
        results = await _scrape_companies_by_url(["https://www.yelp.com/biz/smooth-air-brampton"], session=session)
        print(json.dumps(results, indent=2))


if __name__ == "__main__":
    asyncio.run(run())
Here, we've added our parse_company function, which uses the XPath techniques we covered earlier to extract the highlighted fields. If we run this scraper we'd see results similar to:
[{
  "name": "Smooth Air",
  "website": "https://www.smoothairhvac.com",
  "phone": "(647) 828-6789",
  "address": "305 Fleetwood Crescent Brampton, ON L6T 2E7 Canada",
  "logo": "https://s3-media0.fl.yelpcdn.com/businessregularlogo/c90545xfS2yr7R7yKe9gZg/ms.jpg",
  "claim_status": "claimed",
  "open_hours": {
    "mon": "Open 24 hours",
    "tue": "Open 24 hours",
    "wed": "Open 24 hours",
    "thu": "Open 24 hours",
    "fri": "Open 24 hours",
    "sat": "Open 24 hours",
    "sun": "Open 24 hours"
  }
}]
Finally, we can put everything together into a comprehensive scraper which searches for companies and scrapes their full profile details:
import asyncio
import json
from typing import Dict, List, Tuple
from urllib.parse import urljoin

import httpx
from parsel import Selector


def parse_search(search_results: Dict) -> Tuple[List[Dict], Dict]:
    """
    Parses yelp search results for business results
    Returns list of businesses and search metadata
    """
    results = search_results["searchPageProps"]["mainContentComponentsListProps"]
    businesses = [
        r
        for r in results
        if r.get("searchResultBusiness") and not r.get("adLoggingInfo")
    ]
    search_meta = next(r for r in results if r.get("type") == "pagination")["props"]
    return businesses, search_meta


def parse_company(resp: httpx.Response):
    sel = Selector(text=resp.text)
    xpath = lambda xp: sel.xpath(xp).get(default="").strip()
    open_hours = {}
    for day in sel.xpath('//th/p[contains(@class,"day-of-the-week")]'):
        name = day.xpath("text()").get().strip()
        value = day.xpath("../following-sibling::td//p/text()").get().strip()
        open_hours[name.lower()] = value
    return dict(
        name=xpath("//h1/text()"),
        website=xpath('//p[contains(text(),"Business website")]/following-sibling::p/a/text()'),
        phone=xpath('//p[contains(text(),"Phone number")]/following-sibling::p/text()'),
        address=xpath('//a[contains(text(),"Get Directions")]/../following-sibling::p/text()'),
        logo=xpath('//img[contains(@class,"businessLogo")]/@src'),
        claim_status="".join(
            sel.xpath('//span[contains(@class,"claim-text")]/text()').getall()
        ).strip().lower(),
        open_hours=open_hours,
    )


async def _search_yelp_page(keyword: str, location: str, session: httpx.AsyncClient, offset=0):
    """scrape single page of yelp search"""
    # final url example:
    # https://www.yelp.com/search/snippet?find_desc=plumbers&find_loc=Toronto%2C+Ontario%2C+Canada&ns=1&start=210&parent_request_id=54233ce74d09d270&request_origin=user
    resp = await session.get(
        "https://www.yelp.com/search/snippet",
        params={
            "find_desc": keyword,
            "find_loc": location,
            "start": offset,
            "parent_request": "",
            "ns": 1,
            "request_origin": "user",
        },
    )
    assert resp.status_code == 200
    return resp.json()


async def search_yelp(keyword: str, location: str, session: httpx.AsyncClient):
    """scrape all pages of yelp search for business preview data"""
    first_page = await _search_yelp_page(keyword, location, session=session)
    businesses, search_meta = parse_search(first_page)
    tasks = []
    for page in range(10, search_meta["totalResults"], 10):
        tasks.append(_search_yelp_page(keyword, location, session=session, offset=page))
    for result in await asyncio.gather(*tasks):
        businesses.extend(parse_search(result)[0])
    return businesses


async def _scrape_companies_by_url(company_urls: List[str], session: httpx.AsyncClient) -> List[Dict]:
    """Scrape yelp company details from given yelp company urls"""
    responses = await asyncio.gather(*[session.get(url) for url in company_urls])
    results = []
    for resp in responses:
        results.append(parse_company(resp))
    return results


async def scrape_companies_by_search(keyword: str, location: str, session: httpx.AsyncClient):
    """Scrape yelp company detail from given search details"""
    found_company_previews = await search_yelp(keyword, location, session=session)
    company_urls = [
        urljoin(
            "https://www.yelp.com",
            company_preview["searchResultBusiness"]["businessUrl"],
        )
        for company_preview in found_company_previews
    ]
    return await _scrape_companies_by_url(company_urls, session=session)


BASE_HEADERS = {
    "authority": "www.yelp.com",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US;en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
}


async def run():
    # limit concurrent connections to avoid hammering yelp too hard
    limits = httpx.Limits(max_keepalive_connections=5, max_connections=5)
    async with httpx.AsyncClient(headers=BASE_HEADERS, limits=limits) as session:
        results = await scrape_companies_by_search(
            "plumbers", "Toronto, Ontario, Canada", session=session
        )
        print(json.dumps(results, indent=2))


if __name__ == "__main__":
    asyncio.run(run())
Yelp.com is a major web scraping target, meaning it employs many techniques to block web scrapers at scale. To retrieve the pages we used custom headers that replicate a common web browser, but if we were to scale this scraper to thousands of companies, Yelp would eventually catch up to us and block us.
Once Yelp realizes the client is a web scraper, it starts redirecting all requests to a "This page is not available" web page. How can we avoid this?
There's a lot we can do to avoid scraper blocking; for all of these details refer to our in-depth guide:
For an in-depth look at web scraping blocking, see our complete guide which covers what technologies are used to detect web scrapers and how to get around them.
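In the meantime, a small safeguard we can add ourselves is detecting when we've been blocked and backing off before retrying. Here's a minimal sketch, assuming the block page contains the "This page is not available" message mentioned above (the status codes are assumptions too):

import asyncio

import httpx


def is_blocked(resp: httpx.Response) -> bool:
    # assumption for illustration: blocked requests end up on yelp's
    # "This page is not available" page or return an error status
    return resp.status_code in (403, 503) or "This page is not available" in resp.text


async def get_with_backoff(session: httpx.AsyncClient, url: str, retries: int = 3) -> httpx.Response:
    """retry a request with exponential backoff while we appear to be blocked"""
    for attempt in range(retries):
        resp = await session.get(url, follow_redirects=True)
        if not is_blocked(resp):
            return resp
        await asyncio.sleep(2 ** attempt)  # wait 1s, 2s, 4s...
    raise RuntimeError(f"still blocked after {retries} attempts: {url}")

Backoff like this only delays the inevitable at scale, though, which is where a dedicated service comes in.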
For this project, to avoid blocking we'll be using ScrapFly's web scraping API, which offers several powerful features that'll help us get around Yelp's blocking.
Let's re-implement our scraper to use ScrapFly's API via scrapfly-sdk in Python:
$ pip install scrapfly-sdk
For this, all we have to do is replace the httpx functionality with ScrapFly's SDK client functions:
import asyncio
import json
from typing import Dict, List, Tuple
from urllib.parse import urlencode, urljoin

from parsel import Selector
from scrapfly import ScrapeApiResponse, ScrapeConfig, ScrapflyClient


def parse_search(search_results: Dict) -> Tuple[List[Dict], Dict]:
    """
    Parses yelp search results for business results
    Returns list of businesses and search metadata
    """
    results = search_results["searchPageProps"]["mainContentComponentsListProps"]
    businesses = [r for r in results if r.get("searchResultBusiness") and not r.get("adLoggingInfo")]
    search_meta = next(r for r in results if r.get("type") == "pagination")["props"]
    return businesses, search_meta


def parse_company(html: str):
    sel = Selector(text=html)
    xpath = lambda xp: sel.xpath(xp).get(default="").strip()
    open_hours = {}
    for day in sel.xpath('//th/p[contains(@class,"day-of-the-week")]'):
        name = day.xpath("text()").get().strip()
        value = day.xpath("../following-sibling::td//p/text()").get().strip()
        open_hours[name.lower()] = value
    return dict(
        name=xpath("//h1/text()"),
        website=xpath('//p[contains(text(),"Business website")]/following-sibling::p/a/text()'),
        phone=xpath('//p[contains(text(),"Phone number")]/following-sibling::p/text()'),
        address=xpath('//a[contains(text(),"Get Directions")]/../following-sibling::p/text()'),
        logo=xpath('//img[contains(@class,"businessLogo")]/@src'),
        claim_status="".join(sel.xpath('//span[contains(@class,"claim-text")]/text()').getall()).strip().lower(),
        open_hours=open_hours,
    )


async def _search_yelp_page(keyword: str, location: str, session: ScrapflyClient, offset=0):
    """scrape single page of yelp search"""
    url = "https://www.yelp.com/search/snippet?" + urlencode(
        {
            "find_desc": keyword,
            "find_loc": location,
            "start": offset,
            "parent_request": "",
            "ns": 1,
            "request_origin": "user",
        }
    )
    resp = await session.async_scrape(ScrapeConfig(url=url))
    assert resp.response.status_code == 200
    return json.loads(resp.content)


async def search_yelp(keyword: str, location: str, session: ScrapflyClient):
    """scrape all pages of yelp search for business preview data"""
    first_page = await _search_yelp_page(keyword, location, session=session)
    businesses, search_meta = parse_search(first_page)
    tasks = []
    for page in range(10, search_meta["totalResults"], 10):
        tasks.append(_search_yelp_page(keyword, location, session=session, offset=page))
    for result in await asyncio.gather(*tasks):
        businesses.extend(parse_search(result)[0])
    return businesses


async def _scrape_companies_by_url(company_urls: List[str], session: ScrapflyClient) -> List[Dict]:
    """Scrape yelp company details from given yelp company urls"""
    scrapfly_responses: List[ScrapeApiResponse] = await session.concurrent_scrape(
        [ScrapeConfig(url) for url in company_urls]
    )
    results = []
    for resp in scrapfly_responses:
        results.append(parse_company(resp.content))
    return results


async def scrape_companies_by_search(keyword: str, location: str, session: ScrapflyClient):
    """Scrape yelp company detail from given search details"""
    found_company_previews = await search_yelp(keyword, location, session=session)
    company_urls = [
        urljoin(
            "https://www.yelp.com",
            company_preview["searchResultBusiness"]["businessUrl"],
        )
        for company_preview in found_company_previews
    ]
    return await _scrape_companies_by_url(company_urls, session=session)


async def run():
    scrapfly = ScrapflyClient(key="YOUR_SCRAPFLY_API_KEY", max_concurrency=5)
    with scrapfly as session:
        results = await scrape_companies_by_search("plumbers", "Toronto, Ontario, Canada", session=session)
        print(json.dumps(results, indent=2))


if __name__ == "__main__":
    asyncio.run(run())
In our updated scraper above we've replaced the httpx calls with ScrapflyClient calls, so all of our requests go through the ScrapFly API, which smartly avoids web scraper blocking.
To wrap this guide up, let's take a look at some frequently asked questions about web scraping yelp.com:
Is it legal to scrape yelp.com?
Yes. Yelp hosts only public data, and we're not extracting anything personal or private. Scraping yelp.com at slow, respectful rates falls under the ethical scraping definition. When scraping Yelp reviews, however, we should ensure that we don't collect any personal data in GDPR-protected countries, or consult a lawyer first. See our Is Web Scraping Legal? article for more.
How to scrape Yelp reviews?
To retrieve the reviews of a business page we need to replicate yet another backend API request. If we click the 2nd page in the review container, we can see a request to https://www.yelp.com/biz/BUSINESS_ID/review_feed?rl=en&q=&sort_by=relevance_desc&start=10 being made:
Where BUSINESS_ID is the ID we extracted earlier during the search step; alternatively, it can be found in the HTML source of the business page itself.
For example, for https://www.yelp.com/biz/capri-laguna-laguna-beach the reviews would be located under https://www.yelp.com/biz/Yz7qwi0GipbeLBFAjSr_PQ/review_feed?rl=en&q=&sort_by=relevance_desc&start=10
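As a quick sketch of how this could be fetched with the same httpx approach we used for search (a hedged example; the shape of the returned JSON should be confirmed in the browser's web inspector before parsing it):

import asyncio

import httpx


async def scrape_review_page(business_id: str, session: httpx.AsyncClient, offset: int = 0) -> dict:
    """scrape a single review_feed page (10 reviews) for a given yelp business ID"""
    resp = await session.get(
        f"https://www.yelp.com/biz/{business_id}/review_feed",
        params={"rl": "en", "q": "", "sort_by": "relevance_desc", "start": offset},
    )
    assert resp.status_code == 200
    return resp.json()


async def run():
    # in practice, reuse the BASE_HEADERS from the scraper above
    async with httpx.AsyncClient(headers={"user-agent": "Mozilla/5.0"}) as session:
        # business ID taken from the capri-laguna example above
        reviews = await scrape_review_page("Yz7qwi0GipbeLBFAjSr_PQ", session=session)
        print(reviews)


if __name__ == "__main__":
    asyncio.run(run())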
In this tutorial we built a small yelp.com scraper which discovers companies from a provided keyword and location and then retrieves their contact details, such as phone numbers, websites and other information fields.
For this we used Python with the httpx and parsel packages, and to avoid being blocked we used ScrapFly's API, which smartly configures every web scraper connection to avoid blocking. For more on ScrapFly, see our documentation and try it out for free!