How to Scrape TripAdvisor.com


In this web scraping guide we'll be scraping TripAdvisor.com - one of the biggest service portals in the hospitality industry, which contains hotel, activity and restaurant data.

In this tutorial, we'll focus on scraping hotel information, reviews and pricing data, though everything we'll learn can be applied to other TripAdvisor subjects such as restaurants or activities.

Web Scraping With Python Tutorial

TripAdvisor is a tough target to scrape - if you're new to web scraping with Python we recommend checking out our full introduction tutorial to web scraping with Python and common best practices.


Why Scrape TripAdvisor?

TripAdvisor is one of the biggest sources of hospitality industry data. It contains hotel, tour and restaurant information, pricing data and reviews - all of which have great value in business intelligence tasks like market and competitive analysis. In other words, the data available on TripAdvisor gives us a glimpse into the hospitality industry that can be used to generate leads or inform market decisions.

For more on scraping use cases see our extensive write-up Scraping Use Cases

Setup

In this tutorial we'll be using Python with a few community packages:

  • httpx - HTTP client library which will let us communicate with TripAdvisor.com's servers
  • parsel - HTML parsing library which will help us to parse our web scraped HTML files.
  • loguru[optional] - pretty logging library that'll help us to keep track of what's going on.

These packages can be easily installed via pip command:

$ pip install httpx parsel loguru

Alternatively, you're free to swap httpx out for any other HTTP client package such as requests, as we'll only need basic HTTP functions which are almost interchangeable between libraries. As for parsel, another great alternative is the beautifulsoup package.

Finding Hotels

Let's start off our scraper by taking a look at how we can find hotels on TripAdvisor. For this, let's see how TripAdvisor's search function works:

(video: typing a search query while observing TripAdvisor's requests in the browser's network inspector)
In the short video above, we can see that a GraphQL-powered POST request is made in the background as we type in our search query. This request returns search page recommendations, each of which contains preview data of hotels, restaurants or tours.

Let's replicate this request in our Python-based scraper. For this, we'll establish an HTTP client session and submit a POST request that mimics what we observed above:

import json
import random
import string

from loguru import logger as log
import httpx

async def search_location(query: str, session: httpx.AsyncClient):
    """
    search for location data from given query. 
    e.g. "New York" will return us TripAdvisor's location details for this query
    """
    log.info(f"searching: {query}")
    url = "https://www.tripadvisor.com/data/graphql/ids"
    # For this request, we'll be sending JSON graphql type query
    # we indicate query ID and query variables
    payload = json.dumps(
        [
            {
                # Every graphql query has a query ID, in this case it's:
                "query": "c9d791589f937ec371723f236edc7c6b",
                "variables": {
                    "request": {
                        "query": query,
                        "limit": 10,
                        "scope": "WORLDWIDE",
                        "locale": "en-US",
                        "scopeGeoId": 1,
                        "searchCenter": None,
                        # we can define search result types, in this case we want to search locations
                        "types": [
                            "LOCATION",
                            # other options are:
                            #   "QUERY_SUGGESTION",
                            #   "USER_PROFILE",
                            #   "RESCUE_RESULT"
                        ],
                        # we can further narrow down locations to specific items here like
                        # attractions (tours), accommodations (hotels) etc.
                        "locationTypes": [
                            "GEO",
                            "AIRPORT",
                            "ACCOMMODATION",
                            "ATTRACTION",
                            "ATTRACTION_PRODUCT",
                            "EATERY",
                            "NEIGHBORHOOD",
                            "AIRLINE",
                            "SHOPPING",
                            "UNIVERSITY",
                            "GENERAL_HOSPITAL",
                            "PORT",
                            "FERRY",
                            "CORPORATION",
                            "VACATION_RENTAL",
                            "SHIP",
                            "CRUISE_LINE",
                            "CAR_RENTAL_OFFICE",
                        ],
                        "userId": None,
                        "articleCategories": ["default", "love_your_local", "insurance_lander"],
                        "enabledFeatures": ["typeahead-q"],
                    }
                },
            }
        ]
    )

    headers = {
        **session.headers,
        "content-type": "application/json",
        # we need to generate a random request ID for this request to succeed
        "x-requested-by": "".join(random.choice(string.ascii_lowercase + string.digits) for i in range(64)),
    }
    response = await session.post(url, headers=headers, content=payload)
    data = response.json()
    # return first result
    log.info(f'found {len(data[0]["data"]["Typeahead_autocomplete"]["results"])} results, taking first one')
    return data[0]["data"]["Typeahead_autocomplete"]["results"][0]["details"]

This GraphQL request might appear complicated, but we're mostly reusing values taken from our browser, only changing the query string itself. The important thing to note here is that TripAdvisor requires all GraphQL requests to be signed - in our scraper we create a random 64-character signature and include it with every request to comply with this API.
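The signing requirement itself is easy to satisfy - the x-requested-by header just needs a random 64-character alphanumeric value. As a small standalone sketch (the helper name is our own):

```python
import random
import string


def generate_request_id(length: int = 64) -> str:
    """generate a random alphanumeric token for TripAdvisor's x-requested-by header"""
    alphabet = string.ascii_lowercase + string.digits
    return "".join(random.choice(alphabet) for _ in range(length))


headers = {"x-requested-by": generate_request_id()}
```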

Web Scraping Graphql with Python

For more details on how to scrape GraphQL-powered websites see our introduction tutorial which covers what GraphQL is, how to scrape it, and common tools, tips and tricks.


Let's take our scraper for a spin and see what it finds for the "Malta" keyword:

Run code & example output
import asyncio
import json

# To avoid being instantly blocked we'll be using request headers that
# mimic Chrome browser on Windows
BASE_HEADERS = {
    "authority": "www.tripadvisor.com",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US,en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
}


async def run():
    limits = httpx.Limits(max_connections=5)
    async with httpx.AsyncClient(limits=limits, timeout=httpx.Timeout(15.0), headers=BASE_HEADERS) as session:
        result = await search_location("Malta", session=session)
        print(json.dumps(result, indent=2))

if __name__ == "__main__":
    asyncio.run(run())
{
  "localizedName": "Malta",
  "localizedAdditionalNames": {
    "longOnlyHierarchy": "Europe"
  },
  "streetAddress": {
    "street1": null
  },
  "locationV2": {
    "placeType": "COUNTRY",
    "names": {
      "longOnlyHierarchyTypeaheadV2": "Europe"
    },
    "vacationRentalsRoute": {
      "url": "/VacationRentals-g190311-Reviews-Malta-Vacation_Rentals.html"
    }
  },
  "url": "/Tourism-g190311-Malta-Vacations.html",
  "HOTELS_URL": "/Hotels-g190311-Malta-Hotels.html",
  "ATTRACTIONS_URL": "/Attractions-g190311-Activities-Malta.html",
  "RESTAURANTS_URL": "/Restaurants-g190311-Malta.html",
  "placeType": "COUNTRY",
  "latitude": 35.892,
  "longitude": 14.42979,
  "isGeo": true,
  "thumbnail": {
    "photoSizeDynamic": {
      "maxWidth": 2880,
      "maxHeight": 1920,
      "urlTemplate": "https://dynamic-media-cdn.tripadvisor.com/media/photo-o/21/66/c5/99/caption.jpg?w={width}&h={height}&s=1&cx=1203&cy=677&chk=v1_cf397a9cdb4fbd9239a9"
    }
  }
}

We can see that we get URLs to Hotel, Restaurant and Attraction searches! We can use these URLs to scrape search results themselves.

We've figured out how to use TripAdvisor's search suggestions to find search pages; now let's scrape these pages for hotel preview data like links and names.
Let's take a look at how we can do that by extending our scraping code:

import math
from urllib.parse import urljoin

from parsel import Selector

def parse_search_page(response):
    """parsed results from TripAdvisor search page"""
    sel = Selector(text=response.text)
    parsed = []
    # we go through each result box and extract id, url and name:
    for result_box in sel.css("div.listing_title>a"):
        parsed.append(
            {
                "id": result_box.xpath("@id").get("").split("_")[-1],
                "url": result_box.xpath("@href").get(""),
                "name": result_box.xpath("text()").get("").split(". ")[-1],
            }
        )
    return parsed


async def scrape_search(query: str, session: httpx.AsyncClient):
    """Scrape all search results of a search query"""
    # scrape first page
    log.info(f"{query}: scraping first search results page")
    hotel_search_url = 'https://www.tripadvisor.com/' + (await search_location(query, session))['HOTELS_URL']
    log.info(f"found hotel search url: {hotel_search_url}")
    first_page = await session.get(hotel_search_url)

    # extract paging meta information from the first page: how many pages are there?
    sel = Selector(text=first_page.text)
    total_results = int(sel.xpath("//div[@data-main-list-match-count]/@data-main-list-match-count").get())
    next_page_url = sel.css('a[data-page-number="2"]::attr(href)').get()
    page_size = int(sel.css('a[data-page-number="2"]::attr(data-offset)').get())
    total_pages = int(math.ceil(total_results / page_size))

    # scrape remaining pages concurrently
    log.info(f"{query}: found total {total_results} results, {page_size} results per page ({total_pages} pages)")
    other_page_urls = [
        # note "oa" stands for "offset anchors"
        urljoin(str(first_page.url), next_page_url.replace(f"oa{page_size}", f"oa{page_size * i}"))
        for i in range(1, total_pages)
    ]
    # we use assert to ensure that we don't accidentally produce duplicates which means something went wrong
    assert len(set(other_page_urls)) == len(other_page_urls)
    other_pages = await asyncio.gather(*[session.get(url) for url in other_page_urls])

    # parse all data and return listing preview results
    results = []
    for response in [first_page, *other_pages]:
        results.extend(parse_search_page(response))
    return results

Here, we create our scrape_search() function that takes in a query and finds the correct search page. Then we scrape the whole search, which spans multiple paginated pages. Let's give it a spin!
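As a quick aside, the "oa" offset arithmetic used above can be sketched in isolation - here with hypothetical result counts rather than live TripAdvisor numbers:

```python
import math

# hypothetical paging metadata as found on a first search page
total_results = 95
page_size = 30  # page 2's url contains "oa30"
total_pages = math.ceil(total_results / page_size)

# replacing the offset in page 2's url produces urls for all remaining pages
next_page_url = "/Hotels-g190311-oa30-Malta-Hotels.html"
other_page_urls = [
    next_page_url.replace(f"oa{page_size}", f"oa{page_size * i}")
    for i in range(1, total_pages)
]
print(other_page_urls)
# ['/Hotels-g190311-oa30-Malta-Hotels.html',
#  '/Hotels-g190311-oa60-Malta-Hotels.html',
#  '/Hotels-g190311-oa90-Malta-Hotels.html']
```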

Run code & example output
async def run():
    limits = httpx.Limits(max_connections=5)
    async with httpx.AsyncClient(limits=limits, timeout=httpx.Timeout(15.0), headers=BASE_HEADERS) as session:
        result = await scrape_search("Malta", session)
        print(json.dumps(result, indent=2))
        return

if __name__ == "__main__":
    asyncio.run(run())
[
  {
    "id": "573828",
    "url": "/Hotel_Review-g230152-d573828-Reviews-Radisson_Blu_Resort_Spa_Malta_Golden_Sands-Mellieha_Island_of_Malta.html",
    "name": "Radisson Blu Resort & Spa, Malta Golden Sands"
  },
  ...
]

With preview results in hand, we can scrape information, pricing and review data of each TripAdvisor hotel listing - let's take a look at how to do that.

Scraping Hotel Data

To scrape hotel information we'll have to collect each hotel page we found using the search. Before we start scraping though, let's take a look at an individual hotel page to see where the data is located on the page itself.

For example, let's take this 1926 Hotel & Spa hotel. If we look at the page source in our browser, we can see a GraphQL cache state which contains a colossal amount of data:

page source illustration - we can see data hidden in a javascript variable

We can see hotel data by exploring page source in our browser

Since TripAdvisor is a highly dynamic website, it stores its data both in the visible part of the page (the page's HTML) and in a hidden part (the javascript page state). The latter often contains much more data than is displayed on the visible page, and it's often easier to parse - perfect for our scraper!

We can easily pull all of this hidden data by extracting the hidden JSON state object and parsing it in Python:

import json
import re

def extract_page_manifest(html):
    """extract javascript state data hidden in TripAdvisor HTML pages"""
    data = re.findall(r"pageManifest:({.+?})};", html, re.DOTALL)[0]
    return json.loads(data)
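We can sanity-check this function against a minimal mock of TripAdvisor's page source (a hypothetical snippet - real pages embed a far larger object):

```python
import json
import re


def extract_page_manifest(html):
    """extract javascript state data hidden in TripAdvisor HTML pages"""
    data = re.findall(r"pageManifest:({.+?})};", html, re.DOTALL)[0]
    return json.loads(data)


# hypothetical page snippet mimicking the embedded state structure
mock_html = '<script>window.__WEB_CONTEXT__={pageManifest:{"urqlCache": {}, "redux": {}}};</script>'
print(extract_page_manifest(mock_html))  # {'urqlCache': {}, 'redux': {}}
```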

By using a simple regular expression pattern we can extract page manifest data from any TripAdvisor page. Let's put this function to use in our hotel data scraper:

def parse_hotel_info(data):
    """parse hotel data from TripAdvisor javascript state to something more readable"""
    parsed = {}
    # there's a lot of information in hotel data, in this tutorial let's extract the basics:
    parsed["name"] = data["name"]
    parsed["id"] = data["locationId"]
    parsed["type"] = data["accommodationType"]
    parsed["description"] = data["locationDescription"]
    parsed["rating"] = data["reviewSummary"]["rating"]
    parsed["rating_count"] = data["reviewSummary"]["count"]
    # for hotel "features" lets just extract the names:
    parsed["features"] = []
    for amenity_type, values in data["detail"]["hotelAmenities"]["highlightedAmenities"].items():
        for value in values:
            parsed["features"].append(f"{amenity_type}_{value['amenityNameLocalized'].lower()}")

    parsed["stars"] = data["detail"]["starRating"][0]["tagNameLocalized"]
    return parsed

def extract_named_urql_cache(urql_cache: dict, pattern: str):
    """extract named urql response cache from hidden javascript state data"""
    data = json.loads(next(v["data"] for k, v in urql_cache.items() if pattern in v["data"]))
    return data


async def scrape_hotel(url, session):
    """Scrape TripAdvisor's hotel information"""
    first_page = await session.get(url)
    page_data = extract_page_manifest(first_page.text)
    hotel_cache = extract_named_urql_cache(page_data["urqlCache"], '"locationDescription"')
    hotel_info = hotel_cache["locations"][0]
    return parse_hotel_info(hotel_info)
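To make the cache lookup in extract_named_urql_cache() more concrete, here's how it behaves on a hypothetical, heavily simplified urqlCache structure (each cache entry stores its GraphQL response as a JSON string under "data"):

```python
import json


def extract_named_urql_cache(urql_cache: dict, pattern: str):
    """extract named urql response cache from hidden javascript state data"""
    return json.loads(next(v["data"] for k, v in urql_cache.items() if pattern in v["data"]))


# hypothetical cache with two entries: hotel details and a review listing
mock_cache = {
    "1": {"data": json.dumps({"locations": [{"locationDescription": "A seaside hotel"}]})},
    "2": {"data": json.dumps({"locations": [{"reviewListPage": {"reviews": []}}]})},
}
hotel_cache = extract_named_urql_cache(mock_cache, '"locationDescription"')
print(hotel_cache["locations"][0]["locationDescription"])  # A seaside hotel
```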

If we run our scraper now, we can see hotel information results similar to:

Run code & example output
async def run():
    limits = httpx.Limits(max_connections=5)
    async with httpx.AsyncClient(limits=limits, timeout=httpx.Timeout(15.0), headers=BASE_HEADERS) as session:
        result = await scrape_hotel(
            "https://www.tripadvisor.com/Hotel_Review-g190327-d264936-Reviews-1926_Hotel_Spa-Sliema_Island_of_Malta.html",
            session,
        )
        print(json.dumps(result, indent=2))


if __name__ == "__main__":
    asyncio.run(run())
{
  "name": "1926 Hotel & Spa",
  "id": 264936,
  "type": "T_HOTEL",
  "description": "Inspired by the life and passions of one man and featuring a touch of the roaring twenties, 1926 Hotel & Spa offers luxury rooms and suites in the central city of Sliema. The hotel is located 200 meters from the seafront and also offers a splendid Beach Club on the water\u2019s edge as well as a luxury SPA. Beach club is located 200 meters away from the hotel and is a seasonal operation. Our concept of \u2018Lean Luxury\u2019 includes the following: \u2022 Luxury rooms at affordable prices \u2022 Uncomplicated comfort and a great sleep \u2022 Smart design technology \u2022 Raindance showerheads \u2022 Flat screens \u2022 SuitePad Tablets \u2022 Self check in and check out (if desired) \u2022 Coffee & tea making facilities",
  "rating": 4.5,
  "rating_count": 674,
  "features": [
    "roomFeatures_bathrobes",
    "roomFeatures_air conditioning",
    "roomFeatures_desk",
    "roomFeatures_housekeeping",
    "roomFeatures_interconnected rooms available",
    "roomFeatures_refrigerator",
    "roomFeatures_cable / satellite tv",
    "roomFeatures_walk-in shower",
    "roomTypes_non-smoking rooms",
    "propertyAmenities_free public parking nearby",
    "propertyAmenities_free internet",
    "propertyAmenities_pool",
    "propertyAmenities_fitness center with gym / workout room",
    "propertyAmenities_bar / lounge",
    "propertyAmenities_airport transportation",
    "propertyAmenities_meeting rooms",
    "propertyAmenities_spa"
  ],
  "stars": "3 Star"
}

In this section, we scraped the hotel's information just by extracting javascript state data and parsing it in Python. We can further use this technique to retrieve the hotel's pricing data - let's see how to do that.

Scraping Hotel Price Data

For pricing information, it seems that we need to supply check-in and check-out dates. However, an easier approach is to explore the pricing calendar, which contains pricing data for several months:

hotel pricing calendar of TripAdvisor.com

we can see some basic pricing data here - is there something more?

For the pricing calendar information, let's explore our javascript state cache further. An easy way to inspect it is to search for one of the dates present in the calendar (e.g. just ctrl+f "2022-06-20"):

page source illustration - we can see pricing data hidden in a javascript variable

We can see pricing calendar data present in the page state variable

This means we can use the same technique we used to parse hotel information to extract hotel pricing data:

async def scrape_hotel(url: str, session: httpx.AsyncClient):
    """Scrape hotel data: information and pricing"""
    first_page = await session.get(url)
    page_data = extract_page_manifest(first_page.text)

    # price data keys are dynamic, so first we need to find the full key name
    _pricing_key = next(
        (key for key in page_data["redux"]["api"]["responses"] if "/hotelDetail" in key and "/heatMap" in key)
    )
    pricing_details = page_data["redux"]["api"]["responses"][_pricing_key]["data"]["items"]

    hotel_cache = extract_named_urql_cache(page_data["urqlCache"], '"locationDescription"')
    hotel_info = hotel_cache["locations"][0]

    hotel_data = {
        "price": pricing_details,
        "info": parse_hotel_info(hotel_info),
    }
    return hotel_data

If we run our scraper now, we can see several months of pricing data that looks something like this:

Run code & example output
async def run():
    limits = httpx.Limits(max_connections=5)
    async with httpx.AsyncClient(limits=limits, timeout=httpx.Timeout(15.0), headers=BASE_HEADERS) as session:
        result = await scrape_hotel(
            "https://www.tripadvisor.com/Hotel_Review-g190327-d264936-Reviews-1926_Hotel_Spa-Sliema_Island_of_Malta.html",
            session,
        )
        print(json.dumps(result['price'], indent=2))


if __name__ == "__main__":
    asyncio.run(run())
[
  {
    "date": "2022-08-31",
    "priceUSD": 13852,
    "priceDisplay": "USD 138.52"
  },
  {
    "date": "2022-08-30",
    "priceUSD": 14472,
    "priceDisplay": "USD 144.72"
  },
  ...
]

With this piece of our scraper complete, we have scraping functionality for hotel information and hotel pricing - we're only missing hotel reviews. So, let's take a look at how we can scrape hotel review data.

Scraping Hotel Reviews

Finally, to scrape hotel reviews we'll continue with our javascript state cache scraping approach. However, since hotel reviews are scattered through multiple pages we'll have to make a few additional requests.

First, we'll extract review data present on the first page, then we'll extract the number of reviews available in total and scrape other review pages concurrently. Let's update our scrape_hotel() function with review scraping logic:

async def scrape_hotel(url, session):
    """Scrape all hotel data: information, pricing and reviews"""
    first_page = await session.get(url)
    page_data = extract_page_manifest(first_page.text)

    # price data keys are dynamic, so first we need to find the full key name
    _pricing_key = next(
        (key for key in page_data["redux"]["api"]["responses"] if "/hotelDetail" in key and "/heatMap" in key)
    )
    pricing_details = page_data["redux"]["api"]["responses"][_pricing_key]["data"]["items"]

    # We can extract data from the Graphql cache embedded in the page
    # TripAdvisor is using: https://github.com/FormidableLabs/urql as their graphql client
    hotel_cache = extract_named_urql_cache(page_data["urqlCache"], '"locationDescription"')
    hotel_info = hotel_cache["locations"][0]

    # for reviews we first need to scrape multiple pages
    # so, first let's find the total number of pages
    total_reviews = hotel_info["reviewSummary"]["count"]
    _review_page_size = 10
    total_review_pages = int(math.ceil(total_reviews / _review_page_size))
    # then we can scrape all review pages concurrently
    # note: in review url "or" stands for "offset reviews"
    review_urls = [
        url.replace("-Reviews-", f"-Reviews-or{_review_page_size * i}-") for i in range(1, total_review_pages)
    ]
    assert len(set(review_urls)) == len(review_urls)
    review_responses = await asyncio.gather(*[session.get(url) for url in review_urls])
    reviews = []
    for response in [first_page, *review_responses]:
        reviews.extend(parse_reviews(response.text))
    hotel_data = {
        "price": pricing_details,
        "info": parse_hotel_info(hotel_info),
        "reviews": reviews,
    }
    return hotel_data

Above, we're using the same technique we used to scrape hotel information. We extract the initial review data from the javascript state, and then we iterate through all pages to gather the remaining reviews in the same way.

One thing to note here is that we're using a common pagination scraping idiom: we retrieve the first page to get the total amount of results and then collect the remaining pages concurrently.
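The review URL arithmetic itself is plain string substitution; sketched here with hypothetical counts (25 reviews at 10 per page):

```python
import math

url = "/Hotel_Review-g190327-d264936-Reviews-1926_Hotel_Spa-Sliema_Island_of_Malta.html"
total_reviews = 25  # hypothetical count
page_size = 10
total_pages = math.ceil(total_reviews / page_size)

# "or" stands for "offset reviews"; the first page needs no offset
review_urls = [
    url.replace("-Reviews-", f"-Reviews-or{page_size * i}-") for i in range(1, total_pages)
]
print(review_urls)  # urls for review offsets 10 and 20
```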

efficient pagination scraping illustration

Using this approach allows us to scrape many pagination pages concurrently which gives us a huge speed boost!

Run code & example output
async def run():
    limits = httpx.Limits(max_connections=5)
    async with httpx.AsyncClient(limits=limits, timeout=httpx.Timeout(15.0), headers=BASE_HEADERS) as session:
        result = await scrape_hotel(
            "https://www.tripadvisor.com/Hotel_Review-g190327-d264936-Reviews-1926_Hotel_Spa-Sliema_Island_of_Malta.html",
            session,
        )
        print(json.dumps(result['reviews'], indent=2))


if __name__ == "__main__":
    asyncio.run(run())
[
    {
      "id": 843669952,
      "date": "2022-06-20",
      "rating": 5,
      "title": "A birthday to remember",
      "text": "Memorable visit for a special birthday. Room was just perfect. Staff were lovely and on the whole very helpful. Used the beach club and loved it. Lovely hotel to spend some time with friends and so handy for sight seeing and local bars and restaurants.",
      "votes": 0,
      "url": "/ShowUserReviews-g190327-d264936-r843669952-1926_Hotel_Spa-Sliema_Island_of_Malta.html",
      "language": "en",
      "platform": "OTHER",
      "author_id": "removed from blog for privacy purposes",
      "author_name": "removed from blog for privacy purposes",
      "author_username": "removed from blog for privacy purposes"
    },
    {
      "id": 843644452,
      "date": "2022-06-19",
      "rating": 5,
      "title": "Perfect mini break",
      "text": "We stayed here for a friends wedding and it was lovely staff were great. Breakfast had a good range of food and drink. Couldn\u2019t fault the hotel had everything you needed. Beach club was really good and served lovely food. ",
      "votes": 0,
      "url": "/ShowUserReviews-g190327-d264936-r843644452-1926_Hotel_Spa-Sliema_Island_of_Malta.html",
      "language": "en",
      "platform": "OTHER",
      "author_id": "removed from blog for privacy purposes",
      "author_name": "removed from blog for privacy purposes",
      "author_username": "removed from blog for privacy purposes"
    }
...
]

With this final feature, we have our full TripAdvisor scraper ready to scrape hotel information, pricing data and reviews. We can easily apply the same scraping logic to scrape other TripAdvisor details like activities and restaurant data since the underlying web technology is the same.

However, to successfully scrape TripAdvisor at scale we need to fortify our scraper to avoid blocking and captchas. For that, let's take a look at ScrapFly web scraping API service which can easily allow us to achieve this by adding a few minor modifications to our scraper code.

ScrapFly - Avoiding Blocking and Captchas

Scraping TripAdvisor.com data doesn't seem to be too difficult, though unfortunately, when scraping at scale we'll likely be blocked or asked to solve captchas, which will hinder or completely stop our web scraping process.

To get around this, let's take advantage of ScrapFly API which can avoid all of these blocks for us!

illustration of scrapfly's middleware

ScrapFly offers several powerful features that'll help us get around TripAdvisor's blocking - in this scraper we'll be using its anti scraping protection bypass and proxy country selection.

For this, we'll be using the scrapfly-sdk python package, which can be installed via pip:

$ pip install scrapfly-sdk

To take advantage of ScrapFly's API in our TripAdvisor web scraper, all we need to do is replace our httpx session calls with scrapfly-sdk client requests.

Full Scraper Code

Let's take a look at how our full scraper code would look with ScrapFly integration:

Full Scraper Code with ScrapFly integration
import asyncio
import json
import math
import random
import re
import string
from typing import List, TypedDict
from urllib.parse import urljoin

from loguru import logger as log
from scrapfly import ScrapeApiResponse, ScrapeConfig, ScrapflyClient


async def search_location(query: str, session: ScrapflyClient):
    """
    search for location data from given query.
    e.g. "New York" will return us TripAdvisor's location details for this query
    """
    log.info(f"searching: {query}")
    url = "https://www.tripadvisor.com/data/graphql/ids"
    payload = json.dumps(
        [
            {
                # Every graphql query has a query ID, in this case it's:
                "query": "c9d791589f937ec371723f236edc7c6b",
                "variables": {
                    "request": {
                        "query": query,
                        "limit": 10,
                        "scope": "WORLDWIDE",
                        "locale": "en-US",
                        "scopeGeoId": 1,
                        "searchCenter": None,
                        # we can define search result types, in this case we want to search locations
                        "types": [
                            "LOCATION",
                            #   "QUERY_SUGGESTION",
                            #   "USER_PROFILE",
                            #   "RESCUE_RESULT"
                        ],
                        # we can further narrow down locations to specific items here like
                        # attractions (tours), accommodations (hotels) etc.
                        "locationTypes": [
                            "GEO",
                            "AIRPORT",
                            "ACCOMMODATION",
                            "ATTRACTION",
                            "ATTRACTION_PRODUCT",
                            "EATERY",
                            "NEIGHBORHOOD",
                            "AIRLINE",
                            "SHOPPING",
                            "UNIVERSITY",
                            "GENERAL_HOSPITAL",
                            "PORT",
                            "FERRY",
                            "CORPORATION",
                            "VACATION_RENTAL",
                            "SHIP",
                            "CRUISE_LINE",
                            "CAR_RENTAL_OFFICE",
                        ],
                        "userId": None,
                        "articleCategories": ["default", "love_your_local", "insurance_lander"],
                        "enabledFeatures": ["typeahead-q"],
                    }
                },
            }
        ]
    )

    headers = {
        "content-type": "application/json",
        # we need to generate a random request ID for this request to succeed
        "x-requested-by": "".join(random.choice(string.ascii_lowercase + string.digits) for i in range(64)),
    }
    result = await session.async_scrape(
        ScrapeConfig(
            url=url,
            country="US",
            headers=headers,
            body=payload,
            method="POST",
            asp=True,
        )
    )
    data = json.loads(result.content)
    # return first result
    log.info(f'found {len(data[0]["data"]["Typeahead_autocomplete"]["results"])} results, taking first one')
    return data[0]["data"]["Typeahead_autocomplete"]["results"][0]["details"]


class Preview(TypedDict):
    id: str
    url: str
    name: str


def parse_search_page(result: ScrapeApiResponse) -> List[Preview]:
    """parsed results from TripAdvisor search page"""
    parsed = []
    # we go through each result box and extract id, url and name:
    for result_box in result.selector.css("div.listing_title>a"):
        parsed.append(
            {
                "id": result_box.xpath("@id").get("").split("_")[-1],
                "url": result_box.xpath("@href").get(""),
                "name": result_box.xpath("text()").get("").split(". ")[-1],
            }
        )
    return parsed


async def scrape_search(query: str, session: ScrapflyClient) -> List[Preview]:
    """Scrape all search results of a search query"""
    # scrape first page
    log.info(f"{query}: scraping first search results page")
    hotel_search_url = "https://www.tripadvisor.com/" + (await search_location(query, session))["HOTELS_URL"]
    log.info(f"found hotel search url: {hotel_search_url}")
    first_page = await session.async_scrape(ScrapeConfig(url=hotel_search_url))

    # extract paging meta information from the first page: How many pages there are?
    total_results = int(
        first_page.selector.xpath("//div[@data-main-list-match-count]/@data-main-list-match-count").get()
    )
    next_page_url = first_page.selector.css('a[data-page-number="2"]::attr(href)').get()
    page_size = int(first_page.selector.css('a[data-page-number="2"]::attr(data-offset)').get())
    total_pages = int(math.ceil(total_results / page_size))

    # scrape remaining pages
    log.info(f"{query}: found total {total_results} results, {page_size} results per page ({total_pages} pages)")
    other_page_urls = [
        # note "oa" stands for "offset anchors"
        urljoin(first_page.context["url"], next_page_url.replace(f"oa{page_size}", f"oa{page_size * i}"))
        for i in range(1, total_pages)
    ]
    # we use assert to ensure that we don't accidentally produce duplicates which means something went wrong
    assert len(set(other_page_urls)) == len(other_page_urls)

    results = parse_search_page(first_page)
    async for result in session.concurrent_scrape([ScrapeConfig(url=url, country="US") for url in other_page_urls]):
        results.extend(parse_search_page(result))
    return results


def extract_page_manifest(html):
    """extract javascript state data hidden in TripAdvisor HTML pages"""
    data = re.findall(r"pageManifest:({.+?})};", html, re.DOTALL)[0]
    return json.loads(data)


def extract_named_urql_cache(urql_cache: dict, pattern: str):
    """extract named urql response cache from hidden javascript state data"""
    data = json.loads(next(v["data"] for k, v in urql_cache.items() if pattern in v["data"]))
    return data
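Each cache entry holds a JSON-encoded graphql response under its `data` key, so the helper simply scans those strings for a distinctive substring. A sketch with a hypothetical cache:

```python
import json

def extract_named_urql_cache(urql_cache: dict, pattern: str):
    """extract named urql response cache from hidden javascript state data"""
    data = json.loads(next(v["data"] for k, v in urql_cache.items() if pattern in v["data"]))
    return data

# hypothetical urqlCache: each entry stores a JSON-encoded graphql response
mock_cache = {
    "1001": {"data": json.dumps({"locations": [{"locationDescription": "A seaside hotel"}]})},
    "1002": {"data": json.dumps({"currency": "USD"})},
}
hotel = extract_named_urql_cache(mock_cache, '"locationDescription"')
print(hotel["locations"][0]["locationDescription"])  # -> A seaside hotel
```

Searching for a quoted key name like `'"locationDescription"'` is a cheap way to pick the right response without decoding every entry first.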


class Review(TypedDict):
    id: str
    date: str
    rating: str
    title: str
    text: str
    votes: int
    url: str
    language: str
    platform: str
    author_id: str
    author_name: str
    author_username: str


def parse_reviews(result: ScrapeApiResponse) -> List[Review]:
    """Parse reviews from a review page"""
    page_data = extract_page_manifest(result.content)
    review_cache = extract_named_urql_cache(page_data["urqlCache"], '"reviewListPage"')
    parsed = []
    # review data contains loads of information, let's parse only the basic in this tutorial
    for review in review_cache["locations"][0]["reviewListPage"]["reviews"]:
        parsed.append(
            {
                "id": review["id"],
                "date": review["publishedDate"],
                "rating": review["rating"],
                "title": review["title"],
                "text": review["text"],
                "votes": review["helpfulVotes"],
                "url": review["route"]["url"],
                "language": review["language"],
                "platform": review["publishPlatform"],
                "author_id": review["userProfile"]["id"],
                "author_name": review["userProfile"]["displayName"],
                "author_username": review["userProfile"]["username"],
            }
        )
    return parsed


class Hotel(TypedDict):
    name: str
    id: str
    type: str
    description: str
    rating: float
    rating_count: int
    features: List[str]
    stars: str  # localized star rating label


def parse_hotel_info(data: dict) -> Hotel:
    """parse hotel data from TripAdvisor javascript state to something more readable"""
    parsed = {}
    # there's a lot of information in hotel data, in this tutorial let's extract the basics:
    parsed["name"] = data["name"]
    parsed["id"] = data["locationId"]
    parsed["type"] = data["accommodationType"]
    parsed["description"] = data["locationDescription"]
    parsed["rating"] = data["reviewSummary"]["rating"]
    parsed["rating_count"] = data["reviewSummary"]["count"]
    # for hotel "features" lets just extract the names:
    parsed["features"] = []
    for amenity_type, values in data["detail"]["hotelAmenities"]["highlightedAmenities"].items():
        for value in values:
            parsed["features"].append(f"{amenity_type}_{value['amenityNameLocalized'].lower()}")

    if star_rating := data["detail"]["starRating"]:
        parsed["stars"] = star_rating[0]["tagNameLocalized"]
    return parsed
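The least obvious step above is the amenity flattening, which joins each amenity group name with a lowercased amenity name. A standalone sketch using a hypothetical slice of the hotel payload:

```python
# hypothetical slice of TripAdvisor's hotel data (keys mirror the scraper's assumptions)
data = {
    "detail": {
        "hotelAmenities": {
            "highlightedAmenities": {
                "roomFeatures": [{"amenityNameLocalized": "Air Conditioning"}],
                "propertyAmenities": [{"amenityNameLocalized": "Pool"}, {"amenityNameLocalized": "Free Wifi"}],
            }
        }
    }
}
features = []
for amenity_type, values in data["detail"]["hotelAmenities"]["highlightedAmenities"].items():
    for value in values:
        features.append(f"{amenity_type}_{value['amenityNameLocalized'].lower()}")
print(features)
# -> ['roomFeatures_air conditioning', 'propertyAmenities_pool', 'propertyAmenities_free wifi']
```

Prefixing each amenity with its group keeps e.g. room-level and property-level features distinguishable in the flat list.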


class HotelAllData(TypedDict):
    info: Hotel
    reviews: List[Review]
    price: dict


async def scrape_hotel(url: str, session: ScrapflyClient) -> HotelAllData:
    """Scrape all hotel data: information, pricing and reviews"""
    first_page = await session.async_scrape(ScrapeConfig(url=url, country="US"))
    page_data = extract_page_manifest(first_page.content)

    # price data keys are dynamic, so first we need to find the full key name
    _pricing_key = next(
        (key for key in page_data["redux"]["api"]["responses"] if "/hotelDetail" in key and "/heatMap" in key)
    )
    pricing_details = page_data["redux"]["api"]["responses"][_pricing_key]["data"]["items"]

    # We can extract data from the graphql cache embedded in the page
    # TripAdvisor is using: https://github.com/FormidableLabs/urql as their graphql client
    hotel_cache = extract_named_urql_cache(page_data["urqlCache"], '"locationDescription"')
    hotel_info = hotel_cache["locations"][0]

    # reviews are spread over multiple pages,
    # so first let's find the total number of pages
    total_reviews = hotel_info["reviewSummary"]["count"]
    _review_page_size = 10
    total_review_pages = int(math.ceil(total_reviews / _review_page_size))
    # then we can scrape all review pages concurrently
    # note: in review url "or" stands for "offset reviews"
    review_urls = [
        url.replace("-Reviews-", f"-Reviews-or{_review_page_size * i}-") for i in range(2, total_review_pages + 1)
    ]
    assert len(set(review_urls)) == len(review_urls)
    reviews = parse_reviews(first_page)
    async for result in session.concurrent_scrape([ScrapeConfig(url=url, country="US") for url in review_urls]):
        reviews.extend(parse_reviews(result))

    return {
        "price": pricing_details,
        "info": parse_hotel_info(hotel_info),
        "reviews": reviews,
    }
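Review pagination follows the same offset idea as search, but rewrites an `or` (offset reviews) marker inside the hotel URL. A sketch with hypothetical counts:

```python
import math

# hypothetical: a hotel with 25 reviews, 10 reviews per page
url = "https://www.tripadvisor.com/Hotel_Review-g190327-d264936-Reviews-1926_Hotel_Spa-Sliema_Island_of_Malta.html"
total_reviews = 25
page_size = 10
total_pages = int(math.ceil(total_reviews / page_size))  # 3 pages

# page 1 is the hotel url itself; page N starts at offset page_size * (N - 1)
review_urls = [url.replace("-Reviews-", f"-Reviews-or{page_size * i}-") for i in range(1, total_pages)]
for u in review_urls:
    print(u)
```

Here `or10` points at reviews 11-20 and `or20` at the remaining reviews 21-25, covering pages 2 and 3 on top of the already-scraped first page.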


async def run():
    async with ScrapflyClient(key="YOUR_SCRAPFLY_KEY", max_concurrency=20) as session:
        result_location = await search_location("Malta", session=session)
        result_search = await scrape_search("Malta", session)
        result_hotel = await scrape_hotel(
            "https://www.tripadvisor.com/Hotel_Review-g190327-d264936-Reviews-1926_Hotel_Spa-Sliema_Island_of_Malta.html",
            session,
        )


if __name__ == "__main__":
    asyncio.run(run())

By dropping in ScrapflyClient in place of httpx.AsyncClient, we defer all of the connection masking logic to ScrapFly, which makes our web scraper code lighter, simpler and more efficient. Most importantly, it allows us to scrape TripAdvisor without being blocked!

Summary

In this tutorial, we built a TripAdvisor.com scraper which is capable of using search to discover hotels and then scraping hotel data, pricing information and user reviews.

For this we used Python with the httpx and parsel packages, and to avoid blocking we used ScrapFly's API, which smartly configures every web scraper connection. For more on ScrapFly, see our documentation and try it out for free!
