How to Scrape Booking.com

Booking.com is the biggest travel reservation service out there, and it contains public data on thousands of hotels, resorts, vacation rentals and so on.

In this practical tutorial, we'll take a look at how we can scrape all of this data using the Python programming language.
We'll start off by taking a look at how booking.com's website functions, then we'll replicate its behavior in our Python scraper to scrape and parse hotel information and price data. Finally, we'll wrap everything up with some tips and tricks and frequently encountered challenges when web scraping booking.com. So, let's dive in!

Setup

In this tutorial we'll be using Python with two packages:

  • httpx - an HTTP client library which will let us communicate with Booking.com's servers
  • parsel - an HTML parsing library which will help us parse the scraped HTML files for hotel data.

Both of these packages can be easily installed via the pip command:

$ pip install httpx parsel

Alternatively, you're free to swap httpx out with any other HTTP client library, such as requests, as we'll only need basic HTTP functions which are almost interchangeable between libraries. As for parsel, another great alternative is the beautifulsoup package.
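
For instance, here's a quick illustration of how similar a basic GET request looks in both libraries (this snippet is just for comparison and isn't part of our scraper):

import httpx
import requests

# both clients expose near-identical basic HTTP functions:
response = httpx.get("https://www.booking.com/robots.txt")
response = requests.get("https://www.booking.com/robots.txt")  # equivalent
print(response.status_code)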

Finding Hotels

Our first step is to figure out how we can discover hotel pages so we can start scraping their data. On the Booking.com platform, we have several ways to achieve that.

Using Sitemaps

Booking.com is easily accessible through its vast sitemap system. Sitemaps are compressed XML files that list all of the URLs available on the website, categorized by subject.

To find the sitemaps, we first visit the /robots.txt page, where we can find the Sitemap links:

Sitemap: https://www.booking.com/sitembk-airport-index.xml
Sitemap: https://www.booking.com/sitembk-articles-index.xml
Sitemap: https://www.booking.com/sitembk-attractions-index.xml
Sitemap: https://www.booking.com/sitembk-beaches-index.xml
Sitemap: https://www.booking.com/sitembk-beach-holidays-index.xml
Sitemap: https://www.booking.com/sitembk-cars-index.xml
Sitemap: https://www.booking.com/sitembk-city-index.xml
Sitemap: https://www.booking.com/sitembk-country-index.xml
Sitemap: https://www.booking.com/sitembk-district-index.xml
Sitemap: https://www.booking.com/sitembk-hotel-index.xml
Sitemap: https://www.booking.com/sitembk-landmark-index.xml
Sitemap: https://www.booking.com/sitembk-region-index.xml
Sitemap: https://www.booking.com/sitembk-tourism-index.xml
Sitemap: https://www.booking.com/sitembk-themed-city-villas-index.xml
Sitemap: https://www.booking.com/sitembk-themed-country-golf-index.xml
Sitemap: https://www.booking.com/sitembk-themed-region-budget-index.xml

Here, we can see URLs categorized by city, landmark and even theme. For example, if we take a look at https://www.booking.com/sitembk-hotel-index.xml, we can see that it splits into another set of sitemaps (as a single sitemap is only allowed to contain 50,000 results):

<sitemapindex xmlns="http://www.google.com/schemas/sitemap/0.9">
<sitemap>
<loc>https://www.booking.com/sitembk-hotel-zh-tw.0037.xml.gz</loc>
<lastmod>2022-05-17</lastmod>
</sitemap>
<sitemap>
<loc>https://www.booking.com/sitembk-hotel-zh-tw.0036.xml.gz</loc>
<lastmod>2022-05-17</lastmod>
</sitemap>
...

Here, we have 1,710 sitemaps, meaning up to 85 million links to various hotel pages. Of course, not all of them are unique hotel pages, but this is an easy way to discover hotel listings on booking.com.
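
For example, here's a minimal sketch of how we could collect hotel page URLs from one of these sitemaps using httpx and parsel (the sub-sitemap files are gzip-compressed XML, so we decompress them manually; in practice we may also need the browser-like headers covered later to avoid blocking):

import gzip

import httpx
from parsel import Selector

# retrieve the sitemap index and pick the first sub-sitemap it references:
index = httpx.get("https://www.booking.com/sitembk-hotel-index.xml")
first_sitemap_url = Selector(text=index.text).xpath("//loc/text()").get()

# sub-sitemaps are gzip-compressed XML files of up to 50,000 <loc> links each:
sitemap_xml = gzip.decompress(httpx.get(first_sitemap_url).content).decode()
urls = Selector(text=sitemap_xml).xpath("//loc/text()").getall()
print(f"found {len(urls)} hotel urls, e.g. {urls[0]}")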

Using sitemaps is an efficient and easy way to find hotels, but it's not a very flexible discovery method. If we want to scrape hotels available in a specific area or containing certain features, we should scrape Booking.com's search system instead - so let's take a look at how we can do that.

Using the Search System

Alternatively, we can take advantage of the search system running on booking.com, just like a human user would.

booking.com's search system looking up hotels located in London

Booking's search might appear complex at first because of the long URLs, but if we dig a bit deeper, we can see that it's rather simple, as most of the URL parameters are optional. For example, let's take a look at this query of "Hotels in London":

https://www.booking.com/searchresults.html?label=gen173nr-1DCAEoggI46AdIM1gEaN0BiAEBmAExuAEKyAEF2AED6AEB-AECiAIBqAIDuAK-u5eUBsACAdICJGRlN2VhYzYyLTJiYzItNDE0MS1iYmY4LWYwZjkxNTc0OGY4ONgCBOACAQ
&sid=51b2c8cd7b3c8377e83692903e6f19ca
&sb=1
&sb_lp=1
&src=index
&src_elem=sb
&error_url=https%3A%2F%2Fwww.booking.com%2Findex.html%3Flabel%3Dgen173nr-1DCAEoggI46AdIM1gEaN0BiAEBmAExuAEKyAEF2AED6AEB-AECiAIBqAIDuAK-u5eUBsACAdICJGRlN2VhYzYyLTJiYzItNDE0MS1iYmY4LWYwZjkxNTc0OGY4ONgCBOACAQ%26sid%3D51b2c8cd7b3c8377e83692903e6f19ca%26sb_price_type%3Dtotal%26%26
&ss=London%2C+Greater+London%2C+United+Kingdom
&is_ski_area=
&ssne=London
&ssne_untouched=London
&checkin_year=2022
&checkin_month=6
&checkin_monthday=9
&checkout_year=2022
&checkout_month=6
&checkout_monthday=11
&group_adults=2
&group_children=0
&no_rooms=1
&b_h4u_keep_filters=
&from_sf=1
&search_pageview_id=f25c2a9ee3630134
&ac_suggestion_list_length=5
&ac_suggestion_theme_list_length=0
&ac_position=0
&ac_langcode=en
&ac_click_type=b
&dest_id=-2601889
&dest_type=city
&iata=LON
&place_id_lat=51.507393
&place_id_lon=-0.127634
&search_pageview_id=f25c2a9ee3630134
&search_selected=true
&ss_raw=London

Lots of scary parameters! However, we can distill them down to a few mandatory ones in our Python web scraper. Let's write the first part of our scraper - a function to retrieve a single search result page:

import asyncio
from urllib.parse import urlencode
from httpx import AsyncClient


async def search_page(
    query,
    session: AsyncClient,
    checkin: str = "",
    checkout: str = "",
    number_of_rooms=1,
    offset: int = 0,
):
    """scrapes a single hotel search page of booking.com"""
    checkin_year, checkin_month, checkin_day = checkin.split("-") if checkin else ("", "", "")
    checkout_year, checkout_month, checkout_day = checkout.split("-") if checkout else ("", "", "")

    url = "https://www.booking.com/searchresults.html"
    url += "?" + urlencode(
        {
            "ss": query,
            "checkin_year": checkin_year,
            "checkin_month": checking_month,
            "checkin_monthday": checking_day,
            "checkout_year": checkout_year,
            "checkout_month": checkout_month,
            "checkout_monthday": checkout_day,
            "no_rooms": number_of_rooms,
            "offset": offset,
        }
    )
    return await session.get(url, follow_redirects=True)

# Example use:
# first we need to imitate web browser headers to not get blocked instantly
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Connection": "keep-alive",
    "Accept-Language": "en-US,en;q=0.9,lt;q=0.8,et;q=0.7,de;q=0.6",
}


async def run():
    async with AsyncClient(headers=HEADERS) as session:
        await search_page("London", session)

if __name__ == "__main__":
    asyncio.run(run())

Here, we've defined a function that requests a single search result page for a given search query and check-in dates. We're also using some common scraping idioms:

  • We set our HTTP client HEADERS to those of a common web browser to avoid being instantly blocked. In this case, we're using Chrome on Windows OS.
  • We use the follow_redirects keyword to automatically follow all 30X responses, as our generated query parameters are missing some optional values.

Another key parameter here is offset, which controls the search result paging. Providing an offset tells the server that we want 25 results starting from that point of the result set. So, let's use it to implement full paging and hotel preview data parsing:

# continuing from the previous snippet, we also need parsel's Selector for HTML parsing:
from parsel import Selector


def parse_search_total_results(html: str) -> int:
    """parse total number of results from search page HTML"""
    sel = Selector(text=html)
    # parse total amount of pages from heading1 text:
    # e.g. "London: 1,232 properties found"
    total_results = int(sel.css("h1").re(r"([\d,]+) properties found")[0].replace(",", ""))
    return total_results


def parse_search_page(html: str):
    """parse hotel preview data from search page HTML"""
    sel = Selector(text=html)

    hotel_previews = {}
    for hotel_box in sel.xpath('//div[@data-testid="property-card"]'):
        url = hotel_box.xpath('.//h3/a[@data-testid="title-link"]/@href').get("").split("?")[0]
        hotel_previews[url] = {
            "name": hotel_box.xpath('.//h3/a[@data-testid="title-link"]/div/text()').get(""),
            "location": hotel_box.xpath('.//span[@data-testid="address"]/text()').get(""),
            "score": hotel_box.xpath('.//div[@data-testid="review-score"]/div/text()').get(""),
            "review_count": hotel_box.xpath('.//div[@data-testid="review-score"]/div[2]/div[2]/text()').get(""),
            "stars": len(hotel_box.xpath('.//div[@data-testid="rating-stars"]/span').getall()),
            "image": hotel_box.xpath('.//img[@data-testid="image"]/@src').get(),
        }
    return hotel_previews


async def scrape_search(
    query,
    session: AsyncClient,
    checkin: str = "",
    checkout: str = "",
    number_of_rooms=1,
):
    """scrape all hotel previews from a given search query"""
    first_page = await search_page(
        query=query, session=session, checkin=checkin, checkout=checkout, number_of_rooms=number_of_rooms
    )
    total_results = parse_search_total_results(first_page.text)
    other_pages = await asyncio.gather(
        *[
            search_page(
                query=query,
                session=session,
                checkin=checkin,
                checkout=checkout,
                number_of_rooms=number_of_rooms,
                offset=offset,
            )
            for offset in range(25, total_results, 25)
        ]
    )
    hotel_previews = {}
    for response in [first_page, *other_pages]:
        hotel_previews.update(parse_search_page(response.text))
    return hotel_previews

There's quite a bit of code here so let's unpack it bit by bit:

First, we define our scrape_search() function, which loops through our previously defined search_page() function to scrape all pages instead of just the first one. We do this by taking advantage of a common web scraping idiom for known-size pagination - we scrape the first page, find the total number of results and scrape the rest of the pages concurrently:

illustration of concurrent search paging scraping

Then, we parse preview data from each result page using XPath selectors. We do this by iterating through each of the 25 hotel preview boxes present on the page and extracting details such as the name, location, score, URL, review count and star count.

Let's run our search scraper:

Run code & example output
import json

async def run():
    async with AsyncClient(headers=HEADERS) as session:
        results = await scrape_search("London", session)
        print(json.dumps(results, indent=2))

if __name__ == "__main__":
    asyncio.run(run())
{
  "https://www.booking.com/hotel/gb/nobu-hotel-london-portman-square.html": {
    "name": "Nobu Hotel London Portman Square",
    "location": "Westminster Borough, London",
    "score": "8.9",
    "review_count": "445 reviews",
    "stars": 5,
    "image": "https://cf.bstatic.com/xdata/images/hotel/square200/339532965.webp?k=ba363634cf1e7c91ac2e64f701bf702d520b133c311ac91e2b3df118d0570aaa&o=&s=1"
  },
...

We've successfully scraped booking.com's search page to discover hotels located in London. Furthermore, we got some valuable metadata and the URL of each hotel page, so next we can scrape detailed hotel data and pricing!

Scraping Hotel Data

Now that we have a scraper that can collect booking.com's hotel preview data, we can collect the remaining hotel data like the description, address, feature list etc. by scraping each individual hotel URL.

Hotel page field markup - we will scrape these fields

We'll continue using httpx for connecting to booking.com and parsel for processing the hotel HTML pages:

import asyncio
import json
import re
from collections import defaultdict
from typing import List

from parsel import Selector
from httpx import AsyncClient


def parse_hotel(html: str) -> dict:
    sel = Selector(text=html)
    # helper shortcuts for css selector extraction:
    css = lambda selector, sep="": sep.join(sel.css(selector).getall()).strip()
    css_first = lambda selector: sel.css(selector).get("")

    # get latitude and longitude of the hotel address:
    lat, lng = css_first(".show_map_hp_link::attr(data-atlas-latlng)").split(",")

    # get hotel features by type
    features = defaultdict(list)
    for feat_box in sel.css(".hotel-facilities-group"):
        type_ = "".join(feat_box.css(".bui-title__text::text").getall()).strip()
        features[type_].extend([f.strip() for f in feat_box.css(".bui-list__description::text").getall() if f.strip()])

    # extract remaining details:
    return {
        "title": css("h2#hp_hotel_name::text"),
        "description": css("div#property_description_content ::text", "\n"),
        "address": css(".hp_address_subtitle::text"),
        "lat": lat,
        "lng": lng,
        "features": dict(features),
        "id": re.findall(r"b_hotel_id:\s*'(.+?)'", html)[0],
    }


async def scrape_hotels(urls: List[str], session: AsyncClient):
    async def scrape_hotel(url: str):
        resp = await session.get(url)
        hotel = parse_hotel(resp.text)
        hotel["url"] = str(resp.url)
        return hotel

    hotels = await asyncio.gather(*[scrape_hotel(url) for url in urls])
    return hotels

Here, we define our hotel page scraping logic. Our scrape_hotels function takes a list of hotel URLs, which we scrape via simple GET requests for the HTML data. We then use our HTML parsing library to extract hotel information using CSS selectors.

We can test our scraper out:

Run code & example output
async def run():
    async with AsyncClient(headers=HEADERS) as session:
        hotels = await scrape_hotels(["https://www.booking.com/hotel/gb/gardencourthotel.html"], session)
        print(json.dumps(hotels, indent=2))

if __name__ == "__main__":
    asyncio.run(run())
[
  {
    "title": "Garden Court Hotel",
    "description": "You're eligible for a Genius discount at Garden Court Hotel! To save at this property, all you have to do is \nsign in\n.\n\n\nThe 19th-century Garden Court Hotel is superbly located in Kensington Gardens Square. It offers stylish, family-run accommodations, a short walk from Bayswater Underground Station.\n\n\nEach comfortable room is individually designed, with an LCD Freeview cable TV. All rooms have their own private internal private bathrooms, except for a few cozy single rooms which have their own exclusive private external bathrooms.\n\n\nFree Wi-Fi internet access is available throughout the hotel, and there is also free luggage storage and a safe for guests to use at the 24-hour reception.\n\n\nThe hotel is located in fashionable Notting Hill, close to Portobello Antiques Markets and the Royal Parks. Kings Cross Station is 4.8 km away.",
    "address": "30-31 Kensington Gardens Square, Notting Hill, Westminster Borough, London, W2 4BG, United Kingdom",
    "lat": "51.51431706",
    "lng": "-0.19066349",
    "features": {
      "Food & Drink": [ "Breakfast in the room" ],
      "Internet": [],
      "Parking": [ "Electric vehicle charging station", "Street parking" ],
      "Front Desk Services": [ "Invoice provided", "..." ],
      "Cleaning Services": [ "Daily housekeeping", ],
      "Business Facilities": [ "Fax/Photocopying" ],
      "Safety & security": [ "Fire extinguishers", "..." ],
      "General": [ "Shared lounge/TV area", "..." ],
      "Languages Spoken": [ "Arabic", "..." ]
    },
    "id": "102764",
    "url": "https://www.booking.com/hotel/gb/gardencourthotel.html?cur_currency=usd"
  }
]

There's significantly more data available on the page, but to keep this tutorial brief, we only focused on a few example fields. However, we're missing a very important detail - the price! For that, we need to extend our scraper with an additional request.

Scraping Hotel Pricing

Booking.com's hotel pages do not contain pricing in the HTML data, so we'll have to make additional requests to retrieve pricing calendar data. If we scroll down on the hotel page and open up our web inspector we can see how Booking.com populates its pricing calendar:

Pricing calendar request of Booking.com hotel

Here, we can see that a background request is being made to retrieve pricing data for 61 days! Let's add this functionality to our scraper:

async def scrape_hotels(urls: List[str], session: AsyncClient, price_start_dt: str, price_n_days=30):
    async def scrape_hotel(url: str):
        resp = await session.get(url)
        hotel = parse_hotel(resp.text)
        hotel["url"] = str(resp.url)
        # for background requests we need to find the CSRF (cross-site request forgery) token
        csrf_token = re.findall(r"b_csrf_token:\s*'(.+?)'", resp.text)[0]
        hotel['price'] = await scrape_prices(csrf_token=csrf_token, hotel_id=hotel['id'])
        return hotel

    async def scrape_prices(hotel_id, csrf_token):
        data = {
            "name": "hotel.availability_calendar",
            "result_format": "price_histogram",
            "hotel_id": hotel_id,
            "search_config": json.dumps({
                # we can adjust pricing configuration here but this is the default
                "b_adults_total": 2,
                "b_nr_rooms_needed": 1,
                "b_children_total": 0,
                "b_children_ages_total": [],
                "b_is_group_search": 0,
                "b_pets_total": 0,
                "b_rooms": [{"b_adults": 2, "b_room_order": 1}],
            }),
            "checkin": price_start_dt,
            "n_days": price_n_days,
            "respect_min_los_restriction": 1,
            "los": 1,
        }
        resp = await session.post(
            "https://www.booking.com/fragment.json?cur_currency=usd",
            headers={**session.headers, "X-Booking-CSRF": csrf_token},
            data=data,
        )
        return resp.json()["data"]

    hotels = await asyncio.gather(*[scrape_hotel(url) for url in urls])
    return hotels

We've extended our scrape_hotels function with price scraping functionality by replicating the pricing calendar request we saw in our web inspector. If we run this code, our results should contain pricing data similar to this:

Run code & example output
async def run():
    async with AsyncClient(headers=HEADERS) as session:
        hotels = await scrape_hotels(["https://www.booking.com/hotel/gb/gardencourthotel.html"], session, '2022-05-20')
        print(json.dumps(hotels, indent=2))

if __name__ == "__main__":
    asyncio.run(run())
[
  {
    "title": "Garden Court Hotel",
    "description": "You're eligible for a Genius discount at Garden Court Hotel! To save at this property, all you have to do is \nsign in\n.\n\n\nThe 19th-century Garden Court Hotel is ...",
    "address": "30-31 Kensington Gardens Square, Notting Hill, Westminster Borough, London, W2 4BG, United Kingdom",
    "lat": "51.51431706",
    "lng": "-0.19066349",
    "features": {
      "Food & Drink": [
        "Breakfast in the room"
      ],
      "Internet": [],
      "Parking": [ "Electric vehicle charging station", "Street parking" ],
      "Front Desk Services": [ "Invoice provided", ""],
      "Cleaning Services": [ "Daily housekeeping", "Ironing service" ],
      "Business Facilities": [ "Fax/Photocopying" ],
      "Safety & security": [ "Fire extinguishers", "..."],
      "General": [ "Shared lounge/TV area", "..." ],
      "Languages Spoken": [ "Arabic", "English", "Spanish", "French", "Romanian" ]
    },
    "id": "102764",
    "url": "https://www.booking.com/hotel/gb/gardencourthotel.html?cur_currency=usd",
    "price": {
      "min_los": 1,
      "days": [
        {
          "b_is_weekend": 0,
          "b_month": "05",
          "b_month_name": "May",
          "b_price_pretty": "USD\u00a0276,69",
          "checkin": "2022-05-20",
          "b_full_year": "2022",
          "b_length_of_stay": 1,
          "b_price": 276.6988213958,
          "b_day": "20",
          "b_avg_price_pretty": "276",
          "b_short_month_name": "May",
          "b_checkout": "2022-05-21",
          "b_weekday": 5,
          "b_avg_price_raw": "276",
          "b_available": 1,
          "b_epoch": 1652997600,
          "b_weekday_name": "Fr",
          "b_min_length_of_stay": 1,
          "b_url_hp": "/hotel/gb/gardencourthotel.html?label=gen173nr-1DEghmcmFnbWVudCiCAjjoB0gzWARo3QGIAQGYATG4ARfIAQzYAQPoAQH4AQOIAgGoAgO4AoLPnZQGwAIB0gIkNzllMjNlMDItMjRlNC00M2U0LTk0YzYtY2JlNDlkMjA5NzI52AIE4AIB&sid=3af1cb864972f2e88ef99b900927c6f1&checkin=2022-05-20&checkout=2022-05-21&room1=A%2CA%2C&#maxotel_rooms",
          "b_checkin": "2022-05-20"
        },
     ...

We can see that this request generates price and availability data for each day of the calendar range we've specified.
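
Since this calendar data is quite verbose, we could flatten it into a simple date-to-price mapping. Here's a small helper sketch based on the "days" entries seen in the output above:

def parse_price_calendar(price_data: dict) -> dict:
    """flatten the scraped price calendar into a {checkin date: price} mapping"""
    return {
        day["checkin"]: day["b_price"]
        for day in price_data["days"]
        if day["b_available"]  # skip sold-out dates
    }

# e.g. parse_price_calendar(hotels[0]["price"])
# would produce {"2022-05-20": 276.6988213958, ...}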

Scraping Hotel Reviews

To scrape booking.com hotel reviews, let's take a look at what happens when we explore the reviews page. Let's click the 2nd page link and observe what happens in our browser's web inspector (F12 in major browsers):

We can see a background request being made when we click on the page 2 link

We can see that a request to the reviewlist.html endpoint is made, which returns an HTML page with 10 reviews. We can easily replicate this in our scraper:

def parse_reviews(html: str) -> List[dict]:
    """parse review page for review data """
    sel = Selector(text=html)
    parsed = []
    for review_box in sel.css('.review_list_new_item_block'):
        get_css = lambda css: review_box.css(css).get("").strip()
        parsed.append({
            "id": review_box.xpath('@data-review-url').get(),
            "score": get_css('.bui-review-score__badge::text'),
            "title": get_css('.c-review-block__title::text'),
            "date": get_css('.c-review-block__date::text'),
            "user_name": get_css('.bui-avatar-block__title::text'),
            "user_country": get_css('.bui-avatar-block__subtitle::text'),
            "text": ''.join(review_box.css('.c-review__body ::text').getall()),
            "lang": review_box.css('.c-review__body::attr(lang)').get(),
        })
    return parsed


async def scrape_reviews(hotel_id: str, session) -> List[dict]:
    """scrape all reviews of a hotel"""
    async def scrape_page(page, page_size=25):  # 25 is largest possible page size for this endpoint
        url = "https://www.booking.com/reviewlist.html?" + urlencode(
            {
                "type": "total",
                # we can configure language preference
                "lang": "en-us",
                # we can configure sorting order here, in this case recent reviews are first
                "sort": "f_recent_desc",
                "cc1": "gb",
                "dist": 1,
                "pagename": hotel_id,
                "rows": page_size,
                "offset": page * page_size,
            }
        )
        return await session.get(url)

    first_page = await scrape_page(1)
    total_pages = Selector(text=first_page.text).css(".bui-pagination__link::attr(data-page-number)").getall()
    total_pages = max(int(page) for page in total_pages)
    other_pages = await asyncio.gather(*[scrape_page(i) for i in range(2, total_pages + 1)])

    results = []
    for response in [first_page, *other_pages]:
        results.extend(parse_reviews(response.text))
    return results

In our scraper code above, we're using what we learned before: we collect the first page to extract the total number of pages and then scrape the rest of the pages concurrently. Another thing to note is that we can adjust the default URL parameters to our preference. Above, we use a page size of 25 instead of the default 10, meaning we have to perform fewer requests to retrieve all reviews.
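
We can test our review scraper the same way as the previous ones. Note that the hotel id here is the "pagename" slug taken from the hotel URL (and the cc1 parameter above assumes a UK hotel):

async def run():
    async with AsyncClient(headers=HEADERS) as session:
        reviews = await scrape_reviews("gardencourthotel", session)
        print(json.dumps(reviews[:2], indent=2))

if __name__ == "__main__":
    asyncio.run(run())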


Finally - our scraper can discover hotels, extract hotel preview data and then scrape each hotel page for hotel information, pricing data and reviews!
However, to run this scraper at scale, we need one final thing - web scraper blocking avoidance.

Avoiding Blocking and Captchas

Scraping Booking.com seems easy, though unfortunately, when scraping at scale we'll quickly be blocked or served captchas, which will prevent us from accessing the hotel and search data.

To get around this, let's take advantage of ScrapFly API which can avoid all of these blocks for us!

ScrapFly offers several powerful features that'll help us get around Booking.com's blocking.

For this, we'll be using the scrapfly-sdk python package with ScrapFly's anti scraping protection bypass feature. First, let's install scrapfly-sdk using pip:

$ pip install scrapfly-sdk

To take advantage of ScrapFly's API in our Booking.com web scraper, all we need to do is replace our httpx requests with scrapfly-sdk requests.

Full Scraper Code

Let's take a look at how our full scraper code would look with ScrapFly integration:

Full Scraper Code with ScrapFly integration
import asyncio
import json
import re
from collections import defaultdict
from typing import Dict, List, TypedDict
from urllib.parse import urlencode
from requests.structures import CaseInsensitiveDict

from scrapfly import ScrapeApiResponse, ScrapeConfig, ScrapflyClient


def create_search_page_url(
    query,
    checkin: str = "",
    checkout: str = "",
    number_of_rooms=1,
    offset: int = 0,
):
    """scrapes a single hotel search page of booking.com"""
    checkin_year, checking_month, checking_day = checkin.split("-") if checkin else "", "", ""
    checkout_year, checkout_month, checkout_day = checkout.split("-") if checkout else "", "", ""

    url = "https://www.booking.com/searchresults.html"
    url += "?" + urlencode(
        {
            "ss": query,
            "checkin_year": checkin_year,
            "checkin_month": checking_month,
            "checkin_monthday": checking_day,
            "checkout_year": checkout_year,
            "checkout_month": checkout_month,
            "checkout_monthday": checkout_day,
            "no_rooms": number_of_rooms,
            "offset": offset,
        }
    )
    return url


def parse_search_total_results(result: ScrapeApiResponse) -> int:
    """parse total number of results from search page HTML"""
    # parse total amount of pages from heading1 text:
    # e.g. "London: 1,232 properties found"
    total_results = int(result.selector.css("h1").re(r"([\d,]+) properties found")[0].replace(",", ""))
    return total_results


class HotelPreview(TypedDict):
    """type hint for hotel preview storage (extracted from search page)"""

    name: str
    location: str
    score: str
    review_count: str
    stars: int
    image: str


def parse_search_page(result: ScrapeApiResponse) -> Dict[str, HotelPreview]:
    """parse hotel preview data from search page HTML"""
    hotel_previews = {}
    for hotel_box in result.selector.xpath('//div[@data-testid="property-card"]'):
        url = hotel_box.xpath('.//h3/a[@data-testid="title-link"]/@href').get("").split("?")[0]
        hotel_previews[url] = {
            "name": hotel_box.xpath('.//h3/a[@data-testid="title-link"]/div/text()').get(""),
            "location": hotel_box.xpath('.//span[@data-testid="address"]/text()').get(""),
            "score": hotel_box.xpath('.//div[@data-testid="review-score"]/div/text()').get(""),
            "review_count": hotel_box.xpath('.//div[@data-testid="review-score"]/div[2]/div[2]/text()').get(""),
            "stars": len(hotel_box.xpath('.//div[@data-testid="rating-stars"]/span').getall()),
            "image": hotel_box.xpath('.//img[@data-testid="image"]/@src').get(),
        }
    return hotel_previews


async def scrape_search(
    query,
    session: ScrapflyClient,
    checkin: str = "",
    checkout: str = "",
    number_of_rooms=1,
):
    """scrape all hotel previews from a given search query"""
    first_page_url = create_search_page_url(
        query=query, checkin=checkin, checkout=checkout, number_of_rooms=number_of_rooms
    )
    first_page = await session.async_scrape(ScrapeConfig(url=first_page_url, country="US"))
    total_results = parse_search_total_results(first_page)
    hotel_previews = parse_search_page(first_page)
    other_page_urls = [
        create_search_page_url(
            query=query,
            checkin=checkin,
            checkout=checkout,
            number_of_rooms=number_of_rooms,
            offset=offset,
        )
        for offset in range(25, total_results, 25)
    ]
    async for result in session.concurrent_scrape([ScrapeConfig(url, country="US") for url in other_page_urls]):
        hotel_previews.update(parse_search_page(result))
    return hotel_previews


class Hotel(TypedDict):
    """type hint for hotel data storage"""

    title: str
    description: str
    address: str
    lat: str
    lng: str
    features: dict
    id: str
    url: str
    price: dict


def parse_hotel(result: ScrapeApiResponse) -> Hotel:
    """parse hotel page for hotel information (no pricing or reviews)"""
    css = lambda selector, sep="": sep.join(result.selector.css(selector).getall()).strip()
    css_first = lambda selector: result.selector.css(selector).get("")
    lat, lng = css_first(".show_map_hp_link::attr(data-atlas-latlng)").split(",")
    features = defaultdict(list)
    for feat_box in result.selector.css(".hotel-facilities-group"):
        type_ = "".join(feat_box.css(".bui-title__text::text").getall()).strip()
        features[type_].extend([f.strip() for f in feat_box.css(".bui-list__description::text").getall() if f.strip()])
    data = {
        "title": css("h2#hp_hotel_name::text"),
        "description": css("div#property_description_content ::text", "\n"),
        "address": css(".hp_address_subtitle::text"),
        "lat": lat,
        "lng": lng,
        "features": dict(features),
        "id": re.findall(r"b_hotel_id:\s*'(.+?)'", result.content)[0],
    }
    return data


async def scrape_hotels(urls: List[str], session: ScrapflyClient, price_start_dt: str, price_n_days=30) -> List[Hotel]:
    """scrape list of hotel urls with pricing details"""

    async def scrape_hotel(url: str) -> Hotel:
        url += "?" + urlencode({"cur_currency": "usd"})
        _scrapfly_session = ""
        result_hotel = await session.async_scrape(ScrapeConfig(url, country="US"))
        hotel = parse_hotel(result_hotel)
        hotel["url"] = str(result_hotel.context["url"])

        # for background requests we need to find some secret tokens:
        csrf_token = re.findall(r"b_csrf_token:\s*'(.+?)'", result_hotel.content)[0]
        aid = re.findall(r"b_aid:\s*'(.+?)'", result_hotel.content)[0]
        sid = re.findall(r"b_sid:\s*'(.+?)'", result_hotel.content)[0]
        price_calendar_form = {
            "name": "hotel.availability_calendar",
            "result_format": "price_histogram",
            "hotel_id": hotel["id"],
            "search_config": json.dumps(
                {
                    # we can adjust pricing configuration here but this is the default
                    "b_adults_total": 2,
                    "b_nr_rooms_needed": 1,
                    "b_children_total": 0,
                    "b_children_ages_total": [],
                    "b_is_group_search": 0,
                    "b_pets_total": 0,
                    "b_rooms": [{"b_adults": 2, "b_room_order": 1}],
                }
            ),
            "checkin": price_start_dt,
            "n_days": price_n_days,
            "respect_min_los_restriction": 1,
            "los": 1,
        }
        result_price = await session.async_scrape(
            ScrapeConfig(
                url="https://www.booking.com/fragment.json?cur_currency=usd",
                method="POST",
                data=price_calendar_form,
                # we need to use cookies we received from hotel scrape to access background requests like this one
                cookies=CaseInsensitiveDict({v['name']: v['value'] for v in result_hotel.scrape_result['cookies']}),
                headers={
                    "X-Booking-CSRF": csrf_token,
                    "X-Requested-With": "XMLHttpRequest",
                    "X-Booking-AID": aid,
                    "X-Booking-Session-Id": sid,
                },
                country="US",
            )
        )
        hotel["price"] = json.loads(result_price.content)["data"]
        return hotel

    hotels = await asyncio.gather(*[scrape_hotel(url) for url in urls])
    return hotels


class Review(TypedDict):
    """type hint for review information storage"""

    id: str
    score: str
    title: str
    date: str
    user_name: str
    user_country: str
    text: str
    lang: str


def parse_reviews(result: ScrapeApiResponse) -> List[Review]:
    """parse review page for review data"""
    parsed = []
    for review_box in result.selector.css(".review_list_new_item_block"):
        get_css = lambda css: review_box.css(css).get("").strip()
        parsed.append(
            {
                "id": review_box.xpath("@data-review-url").get(),
                "score": get_css(".bui-review-score__badge::text"),
                "title": get_css(".c-review-block__title::text"),
                "date": get_css(".c-review-block__date::text"),
                "user_name": get_css(".bui-avatar-block__title::text"),
                "user_country": get_css(".bui-avatar-block__subtitle::text"),
                "text": "".join(review_box.css(".c-review__body ::text").getall()),
                "lang": review_box.css(".c-review__body::attr(lang)").get(),
            }
        )
    return parsed


async def scrape_reviews(hotel_id: str, session: ScrapflyClient) -> List[dict]:
    """scrape all reviews of a hotel"""

    def create_review_url(page, page_size=25):  # 25 is largest possible page size for this endpoint
        """create url for specific page of hotel review pagination"""
        return "https://www.booking.com/reviewlist.html?" + urlencode(
            {
                "type": "total",
                "lang": "en-us",
                "sort": "f_recent_desc",
                "cc1": "gb",
                "dist": 1,
                "pagename": hotel_id,
                "rows": page_size,
                "offset": page * page_size,
            }
        )

    first_page = await session.async_scrape(ScrapeConfig(url=create_review_url(1), country="US"))
    total_pages = first_page.selector.css(".bui-pagination__link::attr(data-page-number)").getall()
    total_pages = max(int(page) for page in total_pages)
    other_page_urls = [create_review_url(page) for page in range(2, total_pages + 1)]
    reviews = parse_reviews(first_page)
    async for result in session.concurrent_scrape([ScrapeConfig(url, country="US") for url in other_page_urls]):
        reviews.extend(parse_reviews(result))
    return reviews


# Example use:
async def run():
    with ScrapflyClient(key="YOUR_SCRAPFLY_KEY", max_concurrency=10) as session:
        # we can find hotel previews
        result_search = await scrape_search("London", session)
        # and scrape hotel data itself
        result_hotels = await scrape_hotels(
            urls=["https://www.booking.com/hotel/gb/gardencourthotel.html"],
            session=session,
            # get pricing data for 7 days starting from this date
            price_start_dt="2022-05-25",
            price_n_days=7,
        )
        result_reviews = await scrape_reviews("gardencourthotel", session)
        return result_search, result_hotels, result_reviews


if __name__ == "__main__":
    asyncio.run(run())

In the code above, to enable ScrapFly all we had to do is replace the httpx session object with a ScrapflyClient one! Now we can scrape the whole of booking.com without worrying about blocking or captchas.

FAQ

To wrap this guide up let's take a look at some frequently asked questions about web scraping Booking.com:

How to change currency when scraping booking.com?

Booking.com automatically chooses a currency based on the geographic location of the web scraper's IP address. The easiest approach is to use a proxy of a specific location; for example, in ScrapFly we can use the country=US argument in our request to receive USD prices.
Alternatively, we can manually change the currency for our scraping session via a GET request with the selected_currency parameter:

import httpx
with httpx.Client() as session:
    currency = 'USD'
    url = f"https://www.booking.com/?change_currency=1;selected_currency={currency};top_currency=1"
    response = session.get(url)

This request will return currency cookies which will be preserved in our session, making any further requests respond with the same currency. Note that this has to be done every time we start a new HTTP session.
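
For example, any requests made with the same session afterwards will carry the currency cookies:

import httpx

with httpx.Client() as session:
    # set the session currency to USD first:
    session.get("https://www.booking.com/?change_currency=1;selected_currency=USD;top_currency=1")
    # further requests in this session should now be priced in USD:
    response = session.get("https://www.booking.com/hotel/gb/gardencourthotel.html")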

How to scrape more than 1000 booking.com hotels?

Like many result paging systems, Booking.com's search returns a limited number of results. In this case, it's 1,000 results per query, and that might not be enough to fully cover some broader queries.
The best approach to this is to split the query into several smaller queries. For example, instead of searching "London", we can split the search by scraping each neighborhood:

illustration of booking.com search split
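
As a rough sketch, this could be as simple as running our scrape_search() function once per area query (the neighborhood names below are purely illustrative):

# hypothetical example: split one broad query into several smaller area queries
LONDON_AREAS = ["Westminster", "Camden", "Kensington", "Hackney", "Greenwich"]

async def scrape_london_in_parts(session: AsyncClient) -> dict:
    hotel_previews = {}
    for area in LONDON_AREAS:
        hotel_previews.update(await scrape_search(f"London {area}", session))
    return hotel_previews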

Summary

In this web scraping tutorial, we built a small Booking.com scraper which uses search to discover hotel listing previews and then scrapes hotel data and pricing information.

For this, we used Python with the httpx and parsel packages, and to avoid being blocked we used ScrapFly's API, which smartly configures every web scraper connection to avoid being blocked. For more on ScrapFly, see our documentation and try it out for free!
