In this web scraping guide we'll be scraping TripAdvisor.com - one of the biggest service portals in the hospitality industry, which contains hotel, activity and restaurant data.
In this tutorial, we'll take a look at how to scrape TripAdvisor reviews as well as other details like hotel information and pricing, and how to find hotel pages by scraping search.
Everything we'll learn can be applied to other TripAdvisor targets such as restaurants, tours and activities.
TripAdvisor is one of the biggest sources of hospitality industry data. Most people are interested in scraping TripAdvisor reviews, but this public source also contains data like hotel, tour and restaurant information and pricing. So, by scraping TripAdvisor we can collect not only information about hospitality industry targets but public opinions about them as well!
All of this data is highly valuable for business intelligence tasks like market and competitive analysis. In other words, the data available on TripAdvisor can give us a glimpse into the hospitality industry, which can be used to generate leads and improve business performance.
For more on scraping use cases see our extensive write-up Scraping Use Cases
In this tutorial we'll be using Python with a few community packages:
These packages can be easily installed via the pip command:
$ pip install "httpx[http2,brotli]" parsel
Alternatively, you're free to swap httpx out with any other HTTP client package such as requests, as we'll only need basic HTTP functions which are almost interchangeable in every library. As for parsel, another great alternative is the beautifulsoup package.
TripAdvisor is a tough target to scrape - if you're new to web scraping with Python we recommend checking out our full introduction tutorial to web scraping with Python and common best practices.
Let's start our scraper by figuring out how we can find hotels on TripAdvisor. For this, let's see how TripAdvisor's search function works:
In the short video above, we can see that a GraphQL-powered POST request is being made in the background when we type in our search query. This request returns search page recommendations, each of which contains preview data of hotels, restaurants or tours.
Let's replicate this GraphQL request in our Python-based scraper. For this, we'll establish an HTTP session and submit a POST request that mimics what we've observed above:
import asyncio
import json
import random
import string
from typing import List, TypedDict

import httpx
from loguru import logger as log


class LocationData(TypedDict):
    """result dataclass for TripAdvisor location data"""

    localizedName: str
    url: str
    HOTELS_URL: str
    ATTRACTIONS_URL: str
    RESTAURANTS_URL: str
    placeType: str
    latitude: float
    longitude: float


async def scrape_location_data(query: str, client: httpx.AsyncClient) -> List[LocationData]:
    """
    scrape search location data from a given query.
    e.g. "New York" will return us TripAdvisor's location details for this query
    """
    log.info(f"scraping location data: {query}")
    # the graphql payload that defines our search
    # note: changing values outside of expected ranges can get the web scraper blocked
    payload = [
        {
            # Every graphql query has a query ID that doesn't change often:
            "query": "5eec1d8288aa8741918a2a5051d289ef",
            # the variables define the search
            "variables": {
                "request": {
                    "query": query,
                    "limit": 10,
                    "scope": "WORLDWIDE",
                    "locale": "en-US",
                    "scopeGeoId": 1,
                    "searchCenter": None,
                    # we can define search result types, in this case we want to search locations
                    "types": [
                        "LOCATION",
                        # "QUERY_SUGGESTION",
                        # "USER_PROFILE",
                        # "RESCUE_RESULT"
                    ],
                    # we can further narrow down locations to these types:
                    "locationTypes": [
                        "GEO",
                        "AIRPORT",
                        "ACCOMMODATION",
                        "ATTRACTION",
                        "ATTRACTION_PRODUCT",
                        "EATERY",
                        "NEIGHBORHOOD",
                        "AIRLINE",
                        "SHOPPING",
                        "UNIVERSITY",
                        "GENERAL_HOSPITAL",
                        "PORT",
                        "FERRY",
                        "CORPORATION",
                        "VACATION_RENTAL",
                        "SHIP",
                        "CRUISE_LINE",
                        "CAR_RENTAL_OFFICE",
                    ],
                    "userId": None,
                    "articleCategories": [
                        "default",
                        "love_your_local",
                        "insurance_lander",
                    ],
                    "enabledFeatures": ["typeahead-q"],
                }
            },
        }
    ]
    # we need to generate a random request ID for this request to succeed
    random_request_id = "".join(
        random.choice(string.ascii_lowercase + string.digits) for i in range(180)
    )
    headers = {
        "X-Requested-By": random_request_id,
        "Referer": "https://www.tripadvisor.com/Hotels",
        "Origin": "https://www.tripadvisor.com",
    }
    result = await client.post(
        url="https://www.tripadvisor.com/data/graphql/ids",
        json=payload,
        headers=headers,
    )
    data = json.loads(result.content)
    results = data[0]["data"]["Typeahead_autocomplete"]["results"]
    results = [r["details"] for r in results]  # strip metadata
    log.info(f"found {len(results)} results")
    return results


# To avoid being instantly blocked we'll be using request headers that
# mimic Chrome browser on Windows
BASE_HEADERS = {
    "authority": "www.tripadvisor.com",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US,en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
}

# start HTTP session client with our headers and HTTP2
client = httpx.AsyncClient(
    http2=True,  # http2 connections are significantly less likely to get blocked
    headers=BASE_HEADERS,
    timeout=httpx.Timeout(15.0),
    limits=httpx.Limits(max_connections=5),
)


async def run():
    result = await scrape_location_data("Malta", client)
    print(json.dumps(result, indent=2))


if __name__ == "__main__":
    asyncio.run(run())
This GraphQL request might appear complicated, but we're mostly using values we took from our browser, only changing the query string itself. There are a few things to note here:
- The Referer and Origin headers are required, or TripAdvisor will block the request.
- X-Requested-By is a tracking ID header; in this case, we just generate a string of random characters.
- We're also using httpx with HTTP/2 enabled to make our requests faster and less likely to be blocked.
For more details on how to scrape GraphQL-powered websites see our introduction tutorial which covers what GraphQL is, how to scrape it and common tools, tips and tricks.
Let's take our scraper for a spin and see what it finds for the "Malta" keyword:
{
  "localizedName": "Malta",
  "localizedAdditionalNames": {
    "longOnlyHierarchy": "Europe"
  },
  "streetAddress": {
    "street1": null
  },
  "locationV2": {
    "placeType": "COUNTRY",
    "names": {
      "longOnlyHierarchyTypeaheadV2": "Europe"
    },
    "vacationRentalsRoute": {
      "url": "/VacationRentals-g190311-Reviews-Malta-Vacation_Rentals.html"
    }
  },
  "url": "/Tourism-g190311-Malta-Vacations.html",
  "HOTELS_URL": "/Hotels-g190311-Malta-Hotels.html",
  "ATTRACTIONS_URL": "/Attractions-g190311-Activities-Malta.html",
  "RESTAURANTS_URL": "/Restaurants-g190311-Malta.html",
  "placeType": "COUNTRY",
  "latitude": 35.892,
  "longitude": 14.42979,
  "isGeo": true,
  "thumbnail": {
    "photoSizeDynamic": {
      "maxWidth": 2880,
      "maxHeight": 1920,
      "urlTemplate": "https://dynamic-media-cdn.tripadvisor.com/media/photo-o/21/66/c5/99/caption.jpg?w={width}&h={height}&s=1&cx=1203&cy=677&chk=v1_cf397a9cdb4fbd9239a9"
    }
  }
}
We can see that we get URLs to Hotel, Restaurant and Attraction searches! We can use these URLs to scrape search results themselves.
We figured out how to use TripAdvisor's search suggestions to find search pages; now let's scrape these pages for hotel preview data like links and names.
Let's take a look at how we can do that by extending our scraping code:
import asyncio
import json
import math
from typing import List, Optional, TypedDict
from urllib.parse import urljoin

import httpx
from loguru import logger as log
from parsel import Selector

from snippet1 import scrape_location_data, client


class Preview(TypedDict):
    url: str
    name: str


def parse_search_page(response: httpx.Response) -> List[Preview]:
    """parse result previews from TripAdvisor search page"""
    log.info(f"parsing search page: {response.url}")
    parsed = []
    # Search results are contained in boxes which can be in two locations.
    # this is location #1:
    selector = Selector(response.text)
    for box in selector.css("span.listItem"):
        title = box.css("div[data-automation=hotel-card-title] a ::text").getall()[1]
        url = box.css("div[data-automation=hotel-card-title] a::attr(href)").get()
        parsed.append(
            {
                "url": urljoin(str(response.url), url),  # turn url absolute
                "name": title,
            }
        )
    if parsed:
        return parsed
    # location #2
    for box in selector.css("div.listing_title>a"):
        parsed.append(
            {
                "url": urljoin(str(response.url), box.xpath("@href").get()),  # turn url absolute
                "name": box.xpath("text()").get("").split(". ")[-1],
            }
        )
    return parsed


async def scrape_search(query: str, max_pages: Optional[int] = None) -> List[Preview]:
    """scrape search results of a search query"""
    # first scrape location data and the first page of results
    log.info(f"{query}: scraping first search results page")
    try:
        location_data = (await scrape_location_data(query, client))[0]  # take first result
    except IndexError:
        log.error(f"could not find location data for query {query}")
        return []
    hotel_search_url = "https://www.tripadvisor.com" + location_data["HOTELS_URL"]
    log.info(f"found hotel search url: {hotel_search_url}")
    first_page = await client.get(hotel_search_url)
    assert first_page.status_code == 200, "scraper is being blocked"
    # parse first page
    results = parse_search_page(first_page)
    if not results:
        log.error(f"query {query} found no results")
        return []
    # extract pagination metadata to scrape all pages concurrently
    page_size = len(results)
    selector = Selector(first_page.text)
    total_results = selector.xpath("//span/text()").re(r"(\d*\,*\d+) properties")[0]
    total_results = int(total_results.replace(",", ""))
    next_page_url = selector.css('a[aria-label="Next page"]::attr(href)').get()
    next_page_url = urljoin(hotel_search_url, next_page_url)  # turn url absolute
    total_pages = int(math.ceil(total_results / page_size))
    if max_pages and total_pages > max_pages:
        log.debug(f"{query}: only scraping {max_pages} max pages from {total_pages} total")
        total_pages = max_pages
    # scrape remaining pages
    log.info(f"{query}: found {total_results=}, {page_size=}. Scraping {total_pages} pagination pages")
    other_page_urls = [
        # note: "oa" stands for "offset anchors"
        next_page_url.replace(f"oa{page_size}", f"oa{page_size * i}")
        for i in range(1, total_pages)
    ]
    # we use assert to ensure that we don't accidentally produce duplicates which means something went wrong
    assert len(set(other_page_urls)) == len(other_page_urls)
    to_scrape = [client.get(url) for url in other_page_urls]
    for response in asyncio.as_completed(to_scrape):
        results.extend(parse_search_page(await response))
    return results


# example use:
if __name__ == "__main__":

    async def run():
        result = await scrape_search("Malta")
        print(json.dumps(result, indent=2))

    asyncio.run(run())
[
  {
    "id": "573828",
    "url": "/Hotel_Review-g230152-d573828-Reviews-Radisson_Blu_Resort_Spa_Malta_Golden_Sands-Mellieha_Island_of_Malta.html",
    "name": "Radisson Blu Resort & Spa, Malta Golden Sands"
  },
  ...
]
Here, we created our scrape_search() function, which takes in a query and finds the correct search page. Then we scrape the whole search listing, which spans multiple paginated pages.
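The "oa" offset trick used for pagination can be illustrated in isolation. The snippet below uses a made-up next-page URL and page counts purely for demonstration:

```python
# hypothetical next-page URL as found on the first search page;
# "oa30" is the offset anchor - the result offset of the second page
next_page_url = "https://www.tripadvisor.com/Hotels-g190311-oa30-Malta-Hotels.html"
page_size = 30
total_pages = 4

# replace the offset to address every remaining pagination page
other_page_urls = [
    next_page_url.replace(f"oa{page_size}", f"oa{page_size * i}")
    for i in range(1, total_pages)
]
print(other_page_urls)
```

This produces the URLs for offsets 30, 60 and 90 - i.e. pages 2 through 4, since page 1 was already scraped.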
With preview results in hand, we can scrape information, pricing and review data of each TripAdvisor hotel listing - let's take a look at how to do that.
To scrape hotel information we'll have to collect each hotel page we found using the search.
Before we start scraping though, let's take a look at an individual hotel page to see where the data is located on the page itself.
For example, let's look at this 1926 Hotel & Spa hotel. If we take a look at the page source in our browser, we can see a GraphQL cache state which contains a colossal amount of data:
Since TripAdvisor is a highly dynamic website, it stores its data both in the visible part of the page (the HTML) and in the hidden part (the javascript page state). The latter often contains much more data than is displayed on the page, and it's often easier to parse - perfect for our scraper!
For more on hidden web data scraping see our full introduction article which explains what is hidden web data and the many methods of scraping it.
We can easily pull all of this hidden data by extracting the hidden JSON web data and parsing it in Python:
import json
import re


def extract_page_manifest(html):
    """extract javascript state data hidden in TripAdvisor HTML pages"""
    data = re.findall(r"pageManifest:({.+?})};", html, re.DOTALL)[0]
    return json.loads(data)
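To see what this function extracts, here's a quick self-contained demo on a mock HTML snippet - the page manifest below is heavily simplified for illustration and is not TripAdvisor's real structure:

```python
import json
import re


def extract_page_manifest(html):
    """extract javascript state data hidden in TripAdvisor HTML pages"""
    data = re.findall(r"pageManifest:({.+?})};", html, re.DOTALL)[0]
    return json.loads(data)


# a heavily simplified stand-in for a real TripAdvisor page source
mock_html = """
<script>
window.__WEB_CONTEXT__ = {pageManifest:{"urqlCache": {"results": {}}, "redux": {"api": {}}}};
</script>
"""

manifest = extract_page_manifest(mock_html)
print(manifest["urqlCache"])  # {'results': {}}
```

The lazy `.+?` pattern stops at the first `};` that closes the assignment, so the captured text is exactly the pageManifest JSON object.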
By using a simple regular expression pattern we can extract page manifest data from any TripAdvisor page. Let's put this function to use in our hotel data scraper:
import asyncio
import json
import re

import httpx


def extract_page_manifest(html):
    """extract javascript state data hidden in TripAdvisor HTML pages"""
    data = re.findall(r"pageManifest:({.+?})};", html, re.DOTALL)[0]
    return json.loads(data)


def extract_named_urql_cache(urql_cache: dict, pattern: str):
    """extract named urql response cache from hidden javascript state data"""
    data = json.loads(next(v["data"] for k, v in urql_cache.items() if pattern in v["data"]))
    return data


async def scrape_hotel(url, session):
    """Scrape TripAdvisor's hotel information"""
    first_page = await session.get(url)
    page_data = extract_page_manifest(first_page.text)
    hotel_cache = extract_named_urql_cache(page_data["urqlCache"]["results"], '"locationDescription"')
    hotel_info = hotel_cache["locations"][0]
    return hotel_info
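The urqlCache maps cache keys to GraphQL responses serialized as JSON strings, which is why extract_named_urql_cache searches for a marker substring and then parses the first match. A small self-contained demo with a made-up miniature cache illustrates this:

```python
import json


def extract_named_urql_cache(urql_cache: dict, pattern: str):
    """extract named urql response cache from hidden javascript state data"""
    data = json.loads(next(v["data"] for k, v in urql_cache.items() if pattern in v["data"]))
    return data


# a made-up miniature of TripAdvisor's urql cache: each entry stores
# a GraphQL response as a JSON *string* under the "data" key
mock_cache = {
    "key1": {"data": json.dumps({"locations": [{"reviewListPage": {"reviews": []}}]})},
    "key2": {"data": json.dumps({"locations": [{"locationDescription": "a hotel"}]})},
}

hotel_cache = extract_named_urql_cache(mock_cache, '"locationDescription"')
print(hotel_cache["locations"][0]["locationDescription"])  # a hotel
```

Searching for the quoted field name (e.g. '"locationDescription"') is what lets us pick the right cache entry without knowing its dynamic key.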
Let's run this scraper for an example hotel page:
async def run():
    limits = httpx.Limits(max_connections=5)
    BASE_HEADERS = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "accept-language": "en-US,en;q=0.9,lt;q=0.8,et;q=0.7,de;q=0.6",
        "accept-encoding": "gzip, deflate, br",
    }
    async with httpx.AsyncClient(limits=limits, timeout=httpx.Timeout(15.0), headers=BASE_HEADERS) as session:
        result = await scrape_hotel(
            "https://www.tripadvisor.com/Hotel_Review-g190327-d264936-Reviews-1926_Hotel_Spa-Sliema_Island_of_Malta.html",
            session,
        )
        print(json.dumps(result, indent=2))


if __name__ == "__main__":
    asyncio.run(run())
Which results in a dataset similar to:
{
  "locationId": 264936,
  "name": "1926 Hotel & Spa",
  "accommodationType": "T_HOTEL",
  "parent": {
    "locationId": 190320
  },
  "parents": [
    {
      "name": "Sliema",
      "hotelsUrl": "/Hotels-g190327-Sliema_Island_of_Malta-Hotels.html"
    },
    "..."
  ],
  "locationDescription": "Inspired by the life and passions of one man and featuring a touch of the roaring twenties, 1926 Hotel & Spa offers luxury rooms and suites in the central city of Sliema. The hotel is located 200 meters from the seafront and also offers a splendid Beach Club on the water’s edge as well as a luxury SPA. Beach club is located 200 meters away from the hotel and is a seasonal operation. Our concept of ‘Lean Luxury’ includes the following: • Luxury rooms at affordable prices • Uncomplicated comfort and a great sleep • Smart design technology • Raindance showerheads • Flat screens • SuitePad Tablets • Self check in and check out (if desired) • Coffee & tea making facilities",
  "businessAdvantageData": {
    "specialOffer": null,
    "contactLinks": [
      {
        "contactLinkType": "PHONE",
        "linkUrl": null
      },
      "..."
    ]
  },
  "writeUserReviewUrl": "/UserReview-g190327-d264936-1926_Hotel_Spa-Sliema_Island_of_Malta.html",
  "reviewSummary": {
    "rating": 4.5,
    "count": 955
  },
  "accommodationCategory": "HOTEL",
  "popIndexDetails": {
    "popIndexRank": 5,
    "popIndexTotal": 29
  },
  "detail": {
    "hotelAmenities": {
      "highlightedAmenities": {
        "roomFeatures": [
          {
            "tagId": 18898,
            "amenityNameLocalized": "Bathrobes",
            "amenityCategoryName": "Comfort",
            "amenityIcon": "hotels"
          },
          "..."
        ],
        "roomTypes": [
          {
            "tagId": 9184,
            "amenityNameLocalized": "Non-smoking rooms",
            "amenityCategoryName": "RoomTypes",
            "amenityIcon": "hotels"
          },
          "..."
        ],
        "propertyAmenities": [
          {
            "tagId": 18970,
            "amenityNameLocalized": "Free public parking nearby",
            "amenityCategoryName": "Parking",
            "amenityIcon": "parking"
          },
          "..."
        ]
      },
      "nonHighlightedAmenities": {
        "roomFeatures": [
          {
            "tagId": 19104,
            "amenityNameLocalized": "Telephone",
            "amenityCategoryName": "RoomAmenities",
            "amenityIcon": "hotels"
          },
          "..."
        ],
        "roomTypes": [],
        "propertyAmenities": [
          {
            "tagId": 19052,
            "amenityNameLocalized": "Paid private parking nearby",
            "amenityCategoryName": "Parking",
            "amenityIcon": "parking"
          },
          "..."
        ]
      },
      "languagesSpoken": [
        {
          "tagId": 18950,
          "amenityNameLocalized": "English"
        },
        "..."
      ]
    },
    "userPartialFilterMatch": {
      "locations": []
    },
    "starRating": [],
    "styleRankings": [
      {
        "tagId": 6216,
        "tagName": "Family",
        "geoRanking": 1,
        "translatedName": "Family",
        "score": 0.8039135983441473
      },
      "..."
    ],
    "hotel": {
      "reviewSubratingAvgs": [
        {
          "avgRating": 4.368532,
          "questionId": 10
        },
        "..."
      ],
      "greenLeader": null
    }
  },
  "heroPhoto": {
    "id": 599471239
  }
}
In this section, we scraped the hotel's information just by extracting javascript state data and parsing it in Python. We can further use this technique to retrieve the hotel's pricing data - let's see how to do that.
For pricing information, it seems that we need to supply check-in and check-out dates. However, an easier approach is to explore the pricing calendar, which contains pricing data for several months:
For pricing calendar information, let's explore our javascript state cache further. An easy way to inspect this is to search one of the dates present in the calendar in our cache data (e.g. just ctrl+f "2022-06-20"):
This means we can use the same technique we used to parse hotel information to extract hotel pricing data:
import asyncio
import json

import httpx

from snippet3 import extract_named_urql_cache, extract_page_manifest


async def scrape_hotel(url: str, session: httpx.AsyncClient):
    """Scrape hotel data: information and pricing"""
    first_page = await session.get(url)
    page_data = extract_page_manifest(first_page.text)
    # price data keys are dynamic, so first we need to find the full key name
    _pricing_key = next(
        key for key in page_data["redux"]["api"]["responses"]
        if "/hotelDetail" in key and "/heatMap" in key
    )
    pricing_details = page_data["redux"]["api"]["responses"][_pricing_key]["data"]["items"]
    hotel_cache = extract_named_urql_cache(page_data["urqlCache"]["results"], '"locationDescription"')
    hotel_info = hotel_cache["locations"][0]
    hotel_data = {
        "price": pricing_details,
        "info": hotel_info,
    }
    return hotel_data
If we run our scraper now, we can see several months of pricing data that looks something like this:
async def run():
    limits = httpx.Limits(max_connections=5)
    BASE_HEADERS = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "accept-language": "en-US,en;q=0.9,lt;q=0.8,et;q=0.7,de;q=0.6",
        "accept-encoding": "gzip, deflate, br",
    }
    async with httpx.AsyncClient(limits=limits, timeout=httpx.Timeout(15.0), headers=BASE_HEADERS) as session:
        result = await scrape_hotel(
            "https://www.tripadvisor.com/Hotel_Review-g190327-d264936-Reviews-1926_Hotel_Spa-Sliema_Island_of_Malta.html",
            session,
        )
        print(json.dumps(result['price'], indent=2))


if __name__ == "__main__":
    asyncio.run(run())
[
  {
    "date": "2022-08-31",
    "priceUSD": 13852,
    "priceDisplay": "USD 138.52"
  },
  {
    "date": "2022-08-30",
    "priceUSD": 14472,
    "priceDisplay": "USD 144.72"
  },
  ...
]
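Note that priceUSD appears to be expressed in cents (13852 next to "USD 138.52") - an interpretation inferred from comparing the two fields above, so verify it against your own results. A small post-processing sketch:

```python
# convert the scraped calendar entries to dollar floats;
# the cents interpretation is inferred from comparing the
# priceUSD and priceDisplay fields in the scraped data
calendar = [
    {"date": "2022-08-31", "priceUSD": 13852, "priceDisplay": "USD 138.52"},
    {"date": "2022-08-30", "priceUSD": 14472, "priceDisplay": "USD 144.72"},
]

prices = {entry["date"]: entry["priceUSD"] / 100 for entry in calendar}
print(prices)  # {'2022-08-31': 138.52, '2022-08-30': 144.72}
```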
With this piece of our scraper complete, we have scraping functionality for hotel information and hotel pricing - we're only missing hotel reviews. So, let's take a look at how we can scrape hotel review data.
Finally, to scrape TripAdvisor reviews in Python we'll continue with our javascript state cache scraping approach. However, since hotel reviews are scattered across multiple pages, we'll have to make a few additional requests. Our scraping flow will look something like this:
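In short: fetch the hotel page, read the total review count from it, then generate the remaining review page URLs - in review URLs "or" stands for "offset reviews". The URL arithmetic can be sketched in isolation (the counts below are made up for illustration):

```python
import math

url = "https://www.tripadvisor.com/Hotel_Review-g190327-d264936-Reviews-1926_Hotel_Spa-Sliema_Island_of_Malta.html"
total_reviews = 35  # taken from the hotel info's reviewSummary
page_size = 10      # TripAdvisor shows 10 reviews per page

total_pages = math.ceil(total_reviews / page_size)
# the first page is the hotel page itself; offset the rest with "-orN-"
review_urls = [
    url.replace("-Reviews-", f"-Reviews-or{page_size * i}-")
    for i in range(1, total_pages)
]
print(len(review_urls))  # 3
```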
Let's update our scrape_hotel() function with review scraping logic:
import asyncio
import json
import math
from typing import List, TypedDict

import httpx

from snippet3 import extract_named_urql_cache, extract_page_manifest


class Review(TypedDict):
    """storage type hint for review data"""

    id: str
    date: str
    rating: str
    title: str
    text: str
    votes: int
    url: str
    language: str
    platform: str
    author_id: str
    author_name: str
    author_username: str


def parse_reviews(html) -> List[Review]:
    """Parse reviews from a review page"""
    page_data = extract_page_manifest(html)
    review_cache = extract_named_urql_cache(page_data["urqlCache"]["results"], '"reviewListPage"')
    parsed = []
    # review data contains loads of information, let's parse only the basics in this tutorial
    for review in review_cache["locations"][0]["reviewListPage"]["reviews"]:
        parsed.append(
            {
                "id": review["id"],
                "date": review["publishedDate"],
                "rating": review["rating"],
                "title": review["title"],
                "text": review["text"],
                "votes": review["helpfulVotes"],
                "url": review["route"]["url"],
                "language": review["language"],
                "platform": review["publishPlatform"],
                "author_id": review["userProfile"]["id"],
                "author_name": review["userProfile"]["displayName"],
                "author_username": review["userProfile"]["username"],
            }
        )
    return parsed


async def scrape_hotel(url, session):
    """Scrape all hotel data: information, pricing and reviews"""
    first_page = await session.get(url)
    page_data = extract_page_manifest(first_page.text)
    _pricing_key = next(
        key for key in page_data["redux"]["api"]["responses"]
        if "/hotelDetail" in key and "/heatMap" in key
    )
    pricing_details = page_data["redux"]["api"]["responses"][_pricing_key]["data"]["items"]
    hotel_cache = extract_named_urql_cache(page_data["urqlCache"]["results"], '"locationDescription"')
    hotel_info = hotel_cache["locations"][0]
    # ------- NEW CODE ----------------
    # for reviews we first need to scrape multiple pages
    # so, first let's find the total amount of pages
    total_reviews = hotel_info["reviewSummary"]["count"]
    _review_page_size = 10
    total_review_pages = int(math.ceil(total_reviews / _review_page_size))
    # then we can scrape all review pages concurrently
    # note: in the review url "or" stands for "offset reviews"
    review_urls = [
        url.replace("-Reviews-", f"-Reviews-or{_review_page_size * i}-")
        for i in range(1, total_review_pages)
    ]
    assert len(set(review_urls)) == len(review_urls)
    review_responses = await asyncio.gather(*[session.get(url) for url in review_urls])
    reviews = []
    for response in [first_page, *review_responses]:
        reviews.extend(parse_reviews(response.text))
    # ---------------------------------
    hotel_data = {
        "price": pricing_details,
        "info": hotel_info,
        "reviews": reviews,
    }
    return hotel_data
Above, we're using the same technique we used to scrape hotel information: we extract the initial review data from the javascript state, and then iterate through the remaining pages to gather the rest of the reviews in the same way.
One thing to note here is that we're using a common pagination scraping idiom: we retrieve the first page to get the total number of results and then collect the remaining pages concurrently.
This approach allows us to scrape many pagination pages at the same time, which gives us a huge speed boost!
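The idiom is reusable beyond TripAdvisor: fetch page 1, compute the page count, then gather the rest concurrently. Here's a generic sketch where fetch() is a stand-in for a real HTTP call:

```python
import asyncio
import math


async def fetch(page: int) -> dict:
    """stand-in for a real HTTP request returning one result page"""
    await asyncio.sleep(0)  # simulate network I/O
    return {"page": page, "total_results": 45, "page_size": 10}


async def scrape_all_pages() -> list:
    # step 1: fetch the first page to learn the total result count
    first = await fetch(1)
    total_pages = math.ceil(first["total_results"] / first["page_size"])
    # step 2: fetch the remaining pages concurrently
    rest = await asyncio.gather(*[fetch(p) for p in range(2, total_pages + 1)])
    return [first, *rest]


pages = asyncio.run(scrape_all_pages())
print(len(pages))  # 5
```

Only the first request is sequential; the remaining four run concurrently, so total time is roughly two round-trips instead of five.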
async def run():
    limits = httpx.Limits(max_connections=5)
    BASE_HEADERS = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "accept-language": "en-US,en;q=0.9,lt;q=0.8,et;q=0.7,de;q=0.6",
        "accept-encoding": "gzip, deflate, br",
    }
    async with httpx.AsyncClient(limits=limits, timeout=httpx.Timeout(15.0), headers=BASE_HEADERS) as session:
        result = await scrape_hotel(
            "https://www.tripadvisor.com/Hotel_Review-g190327-d264936-Reviews-1926_Hotel_Spa-Sliema_Island_of_Malta.html",
            session,
        )
        print(json.dumps(result['reviews'], indent=2))


if __name__ == "__main__":
    asyncio.run(run())
Which should result in a review dataset similar to:
[
  {
    "id": 843669952,
    "date": "2022-06-20",
    "rating": 5,
    "title": "A birthday to remember",
    "text": "Memorable visit for a special birthday. Room was just perfect. Staff were lovely and on the whole very helpful. Used the beach club and loved it. Lovely hotel to spend some time with friends and so handy for sight seeing and local bars and restaurants.",
    "votes": 0,
    "url": "/ShowUserReviews-g190327-d264936-r843669952-1926_Hotel_Spa-Sliema_Island_of_Malta.html",
    "language": "en",
    "platform": "OTHER",
    "author_id": "removed from blog for privacy purposes",
    "author_name": "removed from blog for privacy purposes",
    "author_username": "removed from blog for privacy purposes"
  },
  {
    "id": 843644452,
    "date": "2022-06-19",
    "rating": 5,
    "title": "t mini bre",
    "text": "We stayed here for a friends wedding and it was lovely staff were great. Breakfast had a good range of food and drink. Couldn\u2019t fault the hotel had everything you needed. Beach club was really good and served lovely food. ",
    "votes": 0,
    "url": "/ShowUserReviews-g190327-d264936-r843644452-1926_Hotel_Spa-Sliema_Island_of_Malta.html",
    "language": "en",
    "platform": "OTHER",
    "author_id": "removed from blog for privacy purposes",
    "author_name": "removed from blog for privacy purposes",
    "author_username": "removed from blog for privacy purposes"
  },
  ...
]
With this final feature, we have our full TripAdvisor scraper ready to scrape hotel information, pricing data and reviews. We can easily apply the same scraping logic to scrape other TripAdvisor details like activities and restaurant data since the underlying web technology is the same.
However, to successfully scrape TripAdvisor at scale we need to fortify our scraper against blocking and captchas. For that, let's take a look at the ScrapFly web scraping API service, which allows us to achieve this with a few minor modifications to our scraper code.
Scraping TripAdvisor.com data doesn't seem too difficult, though unfortunately when scraping at scale we'll likely be blocked or asked to solve captchas, which will hinder or completely stop our web scraping process.
To get around this, let's take advantage of ScrapFly API which can avoid all of these blocks for us!
ScrapFly offers several powerful features that'll help us to get around TripAdvisor's blocking:
For this we'll be using the scrapfly-sdk python package. First, let's install scrapfly-sdk using pip:
$ pip install scrapfly-sdk
To take advantage of ScrapFly's API in our TripAdvisor web scraper, all we need to do is replace our httpx session code with scrapfly-sdk client requests.
To wrap this guide up let's take a look at some frequently asked questions about web scraping tripadvisor.com:
Yes. TripAdvisor's data is publicly available, and we're not extracting anything personal or private. Scraping tripadvisor.com at slow, respectful rates would fall under the ethical scraping definition. That being said, when scraping reviews we should avoid collecting personal information such as users' names in GDPR-compliant countries (like the EU). For more, see our Is Web Scraping Legal? article.
Unfortunately, TripAdvisor's API is difficult to use and very limited. For example, it provides only 3 reviews per location. By scraping public TripAdvisor pages we can collect all of the reviews and hotel details, which we couldn't get through TripAdvisor API otherwise.
In this tutorial, we've taken a look at scraping TripAdvisor.com for hotel overview, review and pricing data. We've also taken a look at how to discover hotel listings using TripAdvisor's search.
For all of this we used Python with popular community packages like httpx and parsel. To scrape TripAdvisor we used classic HTML parsing as well as modern hidden web data scraping techniques.
Finally, to avoid being blocked and to scale up our scraper we've taken a look at Scrapfly web scraping API through Scrapfly-SDK package.