How to Scrape Booking.com (2024 Update)


Booking.com is the biggest travel reservation service out there, and it contains public data on thousands of hotels, resorts, vacation rentals and other travel properties.

In this tutorial, we'll take a look at how to scrape booking.com using the Python programming language.
We'll start with a quick overview of booking.com's website functionality. Then, we'll replicate its behavior in our Python scraper to collect hotel information and price data.
Finally, we'll wrap everything up by taking a look at some tips and tricks and frequently encountered challenges when web scraping booking.com. So, let's dive in!

Latest Booking.com Scraper Code

https://github.com/scrapfly/scrapfly-scrapers/

Project Setup

In this tutorial, we'll be using Python with two packages:

  • httpx - HTTP client library which will let us communicate with Booking.com's servers
  • parsel - HTML parsing library which will help us to parse our web scraped HTML files for hotel data.

Both of these packages can be easily installed via pip command:

$ pip install "httpx[http2,brotli]" parsel

We're using httpx with the http2 and brotli features enabled to improve our chances of bypassing booking.com's blocking.
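Note that httpx negotiates HTTP/2 only when it's explicitly enabled on the client - a quick sketch:

import httpx

# HTTP/2 must be enabled explicitly; brotli decoding is picked up
# automatically once the brotli extra is installed
client = httpx.AsyncClient(http2=True)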

Alternatively, other HTTP client libraries such as requests (https://pypi.org/project/requests/) will work the same but are much more likely to cause blocking.

As for parsel, another great alternative is the beautifulsoup package.
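To illustrate, here's the same tiny extraction task in both libraries - a rough sketch using the hotel title markup we'll parse later (beautifulsoup requires the beautifulsoup4 package):

from parsel import Selector
from bs4 import BeautifulSoup

html = "<h2 id='hp_hotel_name'>Garden Court Hotel</h2>"
# parsel: CSS selector with a ::text pseudo-element
print(Selector(text=html).css("h2#hp_hotel_name::text").get())
# beautifulsoup: equivalent lookup
print(BeautifulSoup(html, "html.parser").select_one("h2#hp_hotel_name").text)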

Finding Booking Hotels

Our first step is to figure out how we can discover hotel pages so that we can start scraping their data. On the Booking.com platform, we have several ways to achieve that.

Using Sitemaps

Booking.com is easily accessible through its vast sitemap system. Sitemaps are compressed XML files that contain all URLs available on the website, categorized by subject.

To find sitemaps, we first must visit the /robots.txt page, and here we can find Sitemap links:

Sitemap: https://www.booking.com/sitembk-airport-index.xml
Sitemap: https://www.booking.com/sitembk-articles-index.xml
Sitemap: https://www.booking.com/sitembk-attractions-index.xml
Sitemap: https://www.booking.com/sitembk-beaches-index.xml
Sitemap: https://www.booking.com/sitembk-beach-holidays-index.xml
Sitemap: https://www.booking.com/sitembk-cars-index.xml
Sitemap: https://www.booking.com/sitembk-city-index.xml
Sitemap: https://www.booking.com/sitembk-country-index.xml
Sitemap: https://www.booking.com/sitembk-district-index.xml
Sitemap: https://www.booking.com/sitembk-hotel-index.xml
Sitemap: https://www.booking.com/sitembk-landmark-index.xml
Sitemap: https://www.booking.com/sitembk-region-index.xml
Sitemap: https://www.booking.com/sitembk-tourism-index.xml
Sitemap: https://www.booking.com/sitembk-themed-city-villas-index.xml
Sitemap: https://www.booking.com/sitembk-themed-country-golf-index.xml
Sitemap: https://www.booking.com/sitembk-themed-region-budget-index.xml
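This lookup is easy to automate. Here's a minimal sketch that pulls the Sitemap entries out of robots.txt:

import httpx

# list all sitemap URLs advertised in robots.txt
response = httpx.get("https://www.booking.com/robots.txt")
sitemap_urls = [
    line.split(": ", 1)[1]
    for line in response.text.splitlines()
    if line.startswith("Sitemap:")
]
print(sitemap_urls)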

Here, we can see URLs categorized by cities, landmarks or even themes. For example, if we take a look at /sitembk-hotel-index.xml we can see that it splits into another set of sitemaps, as a single sitemap is only allowed to contain 50,000 results:

<sitemapindex xmlns="http://www.google.com/schemas/sitemap/0.9">
<sitemap>
<loc>https://www.booking.com/sitembk-hotel-zh-tw.0037.xml.gz</loc>
<lastmod>2022-05-17</lastmod>
</sitemap>
<sitemap>
<loc>https://www.booking.com/sitembk-hotel-zh-tw.0036.xml.gz</loc>
<lastmod>2022-05-17</lastmod>
</sitemap>
...

Here, we have 1710 sitemaps - meaning 85 million links to various hotel pages. Of course, not all are unique hotel pages (some are duplicates), but that's the easiest and most efficient way to discover hotel listing pages on booking.com.
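Here's a short sketch of how one of these gzipped sitemaps could be downloaded and parsed for hotel URLs, using one of the sitemap files listed above:

import gzip

import httpx
from parsel import Selector


def scrape_hotel_sitemap(url: str):
    """download a single gzipped sitemap and extract hotel page URLs"""
    response = httpx.get(url)
    # .xml.gz files are gzip-compressed payloads, so decompress explicitly
    xml = gzip.decompress(response.content).decode()
    return Selector(text=xml).xpath("//loc/text()").getall()


urls = scrape_hotel_sitemap("https://www.booking.com/sitembk-hotel-zh-tw.0037.xml.gz")
print(f"found {len(urls)} hotel URLs")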

Using sitemaps is an efficient and easy way to find hotels, but it's not a flexible discovery method. Usually, when we scrape this type of data we have specific areas or categories in mind.

To scrape hotels available in a specific area or containing certain features, we need to scrape Booking.com's search system instead. So, let's take a look at how to do that.

Using the Search System

Alternatively to sitemaps, we can take advantage of the search system running on booking.com just like a human user would.

illustration of booking.com search bar
booking.com's search system looking up hotels located in London

Booking's search might appear to be complex at first because of long URLs, but if we dig a bit deeper, we can see that it's rather simple, as most URL parameters are optional. For example, let's take a look at this query of "Hotels in London":

https://www.booking.com/searchresults.html?label=gen173nr-1DCAEoggI46AdIM1gEaN0BiAEBmAExuAEKyAEF2AED6AEB-AECiAIBqAIDuAK-u5eUBsACAdICJGRlN2VhYzYyLTJiYzItNDE0MS1iYmY4LWYwZjkxNTc0OGY4ONgCBOACAQ
&sid=51b2c8cd7b3c8377e83692903e6f19ca
&sb=1
&sb_lp=1
&src=index
&src_elem=sb
&error_url=https%3A%2F%2Fwww.booking.com%2Findex.html%3Flabel%3Dgen173nr-1DCAEoggI46AdIM1gEaN0BiAEBmAExuAEKyAEF2AED6AEB-AECiAIBqAIDuAK-u5eUBsACAdICJGRlN2VhYzYyLTJiYzItNDE0MS1iYmY4LWYwZjkxNTc0OGY4ONgCBOACAQ%26sid%3D51b2c8cd7b3c8377e83692903e6f19ca%26sb_price_type%3Dtotal%26%26
&ss=London%2C+Greater+London%2C+United+Kingdom
&is_ski_area=
&ssne=London
&ssne_untouched=London
&checkin_year=2022
&checkin_month=6
&checkin_monthday=9
&checkout_year=2022
&checkout_month=6
&checkout_monthday=11
&group_adults=2
&group_children=0
&no_rooms=1
&b_h4u_keep_filters=
&from_sf=1
&search_pageview_id=f25c2a9ee3630134
&ac_suggestion_list_length=5
&ac_suggestion_theme_list_length=0
&ac_position=0
&ac_langcode=en
&ac_click_type=b
&dest_id=-2601889
&dest_type=city
&iata=LON
&place_id_lat=51.507393
&place_id_lon=-0.127634
&search_pageview_id=f25c2a9ee3630134
&search_selected=true
&ss_raw=London

Lots of scary parameters! Fortunately, we can distill them down to a few mandatory ones in our Python web scraper. Let's write the first part of our scraper - a function to retrieve a single search result page:

import asyncio
from urllib.parse import urlencode

from httpx import AsyncClient


async def search_page(
    query,
    session: AsyncClient,
    checkin: str = "",
    checkout: str = "",
    number_of_rooms=1,
    offset: int = 0,
):
    """scrapes a single hotel search page of booking.com"""
    checkin_year, checkin_month, checkin_day = checkin.split("-") if checkin else ("", "", "")
    checkout_year, checkout_month, checkout_day = checkout.split("-") if checkout else ("", "", "")

    url = "https://www.booking.com/searchresults.html"
    url += "?" + urlencode(
        {
            "ss": query,
            "checkin_year": checkin_year,
            "checkin_month": checking_month,
            "checkin_monthday": checking_day,
            "checkout_year": checkout_year,
            "checkout_month": checkout_month,
            "checkout_monthday": checkout_day,
            "no_rooms": number_of_rooms,
            "offset": offset,
        }
    )
    return await session.get(url, follow_redirects=True)

# Example use:
# first we need to imitate web browser headers to not get blocked instantly
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Connection": "keep-alive",
    "Accept-Language": "en-US,en;q=0.9,lt;q=0.8,et;q=0.7,de;q=0.6",
}


async def run():
    async with AsyncClient(headers=HEADERS) as session:
        await search_page("London", session)

if __name__ == "__main__":
    asyncio.run(run())

Here, we've defined a function that requests a single search result page for a given search query and check-in dates. We're also using some common scraping idioms:

  • We set our HTTP client HEADERS to that of a common web browser to avoid being instantly blocked. In this case, we're using Chrome on Windows OS.
  • We use the follow_redirects keyword to automatically follow all 30X responses, as our generated query parameters are missing some optional values.

Another key parameter here is offset, which controls search result paging. Providing an offset tells the server to return the 25 results starting from that point of the result set. So, let's use it to implement full paging and hotel preview data parsing:

import asyncio

from parsel import Selector


def parse_search_total_results(html: str):
    """parse total number of results from search page HTML"""
    sel = Selector(text=html)
    # parse total amount of pages from heading1 text:
    # e.g. "London: 1,232 properties found"
    total_results = int(sel.css("h1").re(r"([\d,]+) properties found")[0].replace(",", ""))
    return total_results


def parse_search_page(html: str):
    """parse hotel preview data from search page HTML"""
    sel = Selector(text=html)

    hotel_previews = {}
    for hotel_box in sel.xpath('//div[@data-testid="property-card"]'):
        url = hotel_box.xpath('.//h3/a[@data-testid="title-link"]/@href').get("").split("?")[0]
        hotel_previews[url] = {
            "name": hotel_box.xpath('.//h3/a[@data-testid="title-link"]/div/text()').get(""),
            "location": hotel_box.xpath('.//span[@data-testid="address"]/text()').get(""),
            "score": hotel_box.xpath('.//div[@data-testid="review-score"]/div/text()').get(""),
            "review_count": hotel_box.xpath('.//div[@data-testid="review-score"]/div[2]/div[2]/text()').get(""),
            "stars": len(hotel_box.xpath('.//div[@data-testid="rating-stars"]/span').getall()),
            "image": hotel_box.xpath('.//img[@data-testid="image"]/@src').get(),
        }
    return hotel_previews


async def scrape_search(
    query,
    session: AsyncClient,
    checkin: str = "",
    checkout: str = "",
    number_of_rooms=1,
):
    """scrape all hotel previews from a given search query"""
    first_page = await search_page(
        query=query, session=session, checkin=checkin, checkout=checkout, number_of_rooms=number_of_rooms
    )
    total_results = parse_search_total_results(first_page.text)
    other_pages = await asyncio.gather(
        *[
            search_page(
                query=query,
                session=session,
                checkin=checkin,
                checkout=checkout,
                number_of_rooms=number_of_rooms,
                offset=offset,
            )
            # note: cap total_results to a smaller number (e.g. 100)
            # to avoid long waiting times when testing
            for offset in range(25, total_results, 25)
        ]
    )
    hotel_previews = {}
    for response in [first_page, *other_pages]:
        hotel_previews.update(parse_search_page(response.text))
    return hotel_previews

There's quite a bit of code here so let's unpack it bit by bit:

First, we define our scrape_search() function which loops through our previously defined search_page() function to scrape all pages instead of just the first one. We do this by taking advantage of a common web scraping idiom for scraping known size pagination -- we scrape the first page, find the number of results and scrape the rest of the pages concurrently:

illustration of efficient paging scraping idiom
illustration of concurrent search paging scraping

Then, we parse preview data from each result page using XPath selectors. We do this by iterating through each of the 25 hotel preview boxes present on the page and extracting details such as name, location, score, URL, review count and star rating.

Let's run our search scraper:

Run code & example output
import asyncio
import json

async def run():
    async with AsyncClient(headers=HEADERS) as session:
        results = await scrape_search("London", session)
        print(json.dumps(results, indent=2))

if __name__ == "__main__":
    asyncio.run(run())
{
  "https://www.booking.com/hotel/gb/nobu-hotel-london-portman-square.html": {
    "name": "Nobu Hotel London Portman Square",
    "location": "Westminster Borough, London",
    "score": "8.9",
    "review_count": "445 reviews",
    "stars": 5,
    "image": "https://cf.bstatic.com/xdata/images/hotel/square200/339532965.webp?k=ba363634cf1e7c91ac2e64f701bf702d520b133c311ac91e2b3df118d0570aaa&o=&s=1"
  },
...

We've successfully scraped booking.com's search page to discover hotels located in London. Furthermore, we got some valuable metadata and the URL of each hotel page, so next we can scrape detailed hotel data and pricing!

Scraping Booking.com Hotel Data

Now that we have a scraper that can collect booking.com's hotel preview data, we can further collect the remaining hotel data such as description, address and feature list by scraping each individual hotel URL.

hotel page parsing markup
Hotel page field markup - we will scrape these fields

We'll continue with using httpx for connecting to booking.com and parsel for processing hotel HTML pages:

import asyncio
import re
from collections import defaultdict
from typing import List

from parsel import Selector
from httpx import AsyncClient

def parse_hotel(html: str):
    sel = Selector(text=html)
    css = lambda selector, sep="": sep.join(sel.css(selector).getall()).strip()
    css_first = lambda selector: sel.css(selector).get("")
    # get latitude and longitude of the hotel address:
    lat, lng = css_first(".show_map_hp_link::attr(data-atlas-latlng)").split(",")
    # get hotel features by type
    features = defaultdict(list)
    for feat_box in sel.css("[data-capla-component*=FacilitiesBlock]>div>div>div"):
        type_ = feat_box.xpath('.//span[contains(@data-testid, "facility-group-icon")]/../text()').get()
        feats = [f.strip() for f in feat_box.css("li ::text").getall() if f.strip()]
        features[type_] = feats
    data = {
        "title": css("h2#hp_hotel_name::text"),
        "description": css("div#property_description_content ::text", "\n"),
        "address": css(".hp_address_subtitle::text"),
        "lat": lat,
        "lng": lng,
        "features": dict(features),
        "id": re.findall(r"b_hotel_id:\s*'(.+?)'", html)[0],
    }
    return data

async def scrape_hotels(urls: List[str], session: AsyncClient):
    async def scrape_hotel(url: str):
        resp = await session.get(url)
        hotel = parse_hotel(resp.text)
        hotel["url"] = str(resp.url)
        return hotel

    hotels = await asyncio.gather(*[scrape_hotel(url) for url in urls])
    return hotels

Here, we define our hotel page scraping logic. Our scrape_hotels function takes a list of hotel URLs, which we scrape via simple GET requests for the HTML data. We then use our HTML parsing library to extract hotel information using CSS selectors.

We can test our scraper out:

Run code & example output
async def run():
    async with AsyncClient(headers=HEADERS) as session:
        hotels = await scrape_hotels(["https://www.booking.com/hotel/gb/gardencourthotel.html"], session)
        print(json.dumps(hotels, indent=2))

if __name__ == "__main__":
    asyncio.run(run())
[
  {
    "title": "Garden Court Hotel",
    "description": "You're eligible for a Genius discount at Garden Court Hotel! To save at this property, all you have to do is \nsign in\n.\n\n\nThe 19th-century Garden Court Hotel is superbly located in Kensington Gardens Square. It offers stylish, family-run accommodations, a short walk from Bayswater Underground Station.\n\n\nEach comfortable room is individually designed, with an LCD Freeview cable TV. All rooms have their own private internal private bathrooms, except for a few cozy single rooms which have their own exclusive private external bathrooms.\n\n\nFree Wi-Fi internet access is available throughout the hotel, and there is also free luggage storage and a safe for guests to use at the 24-hour reception.\n\n\nThe hotel is located in fashionable Notting Hill, close to Portobello Antiques Markets and the Royal Parks. Kings Cross Station is 4.8 km away.",
    "address": "30-31 Kensington Gardens Square, Notting Hill, Westminster Borough, London, W2 4BG, United Kingdom",
    "lat": "51.51431706",
    "lng": "-0.19066349",
    "features": {
      "Food & Drink": [ "Breakfast in the room" ],
      "Internet": [],
      "Parking": [ "Electric vehicle charging station", "Street parking" ],
      "Front Desk Services": [ "Invoice provided", "..." ],
      "Cleaning Services": [ "Daily housekeeping", ],
      "Business Facilities": [ "Fax/Photocopying" ],
      "Safety & security": [ "Fire extinguishers", "..." ],
      "General": [ "Shared lounge/TV area", "..." ],
      "Languages Spoken": [ "Arabic", "..." ]
    },
    "id": "102764",
    "url": "https://www.booking.com/hotel/gb/gardencourthotel.html?cur_currency=usd"
  }
]

There's significantly more data available on the page but to keep this tutorial brief we only focused on a few example fields. However, we're missing a very important detail - the price! For that, we need to modify our scraper with an additional request.

Scraping Booking.com Hotel Pricing

Booking.com's hotel pages do not contain pricing in the HTML data, so we'll have to make additional requests to retrieve pricing calendar data. If we scroll down on the hotel page and open up our web inspector we can see how Booking.com populates its pricing calendar:

illustration of booking.com hotel pricing calendar
Pricing calendar request of Booking.com hotel

Here, we can see that a background request is being made to retrieve pricing data for 61 days!

Let's add this functionality to our scraper:

import asyncio
import json
import re
from datetime import datetime
from typing import List

from httpx import AsyncClient

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Connection": "keep-alive",
    "Accept-Language": "en-US,en;q=0.9,lt;q=0.8,et;q=0.7,de;q=0.6",
}


def parse_hotel(html):
    # placeholder - use the parse_hotel() implementation from the previous section
    return {}


async def scrape_hotels(
    urls: List[str], session: AsyncClient, price_start_dt: str, price_n_days=30
):
    async def scrape_hotel(url: str):
        resp = await session.get(url)
        hotel = parse_hotel(resp.text)
        hotel["url"] = str(resp.url)
        # for background requests we need to find some variables:
        _hotel_country = re.findall(r'hotelCountry:\s*"(.+?)"', resp.text)[0]
        _hotel_name = re.findall(r'hotelName:\s*"(.+?)"', resp.text)[0]
        _csrf_token = re.findall(r"b_csrf_token:\s*'(.+?)'", resp.text)[0]
        hotel["price"] = await scrape_prices(
            hotel_name=_hotel_name, hotel_country=_hotel_country, csrf=_csrf_token
        )
        return hotel

    async def scrape_prices(hotel_name, hotel_country, csrf):
        # make graphql query from our variables
        gql_body = json.dumps(
            {
                "operationName": "AvailabilityCalendar",
                # hotel variables go here
                # you can adjust number of adults, room number etc.
                "variables": {
                    "input": {
                        "travelPurpose": 2,
                        "pagenameDetails": {
                            "countryCode": hotel_country,
                            "pagename": hotel_name,
                        },
                        "searchConfig": {
                            "searchConfigDate": {
                                "startDate": price_start_dt,
                                "amountOfDays": price_n_days,
                            },
                            "nbAdults": 2,
                            "nbRooms": 1,
                        },
                    }
                },
                "extensions": {},
                # this is the query itself, don't alter it
                "query": "query AvailabilityCalendar($input: AvailabilityCalendarQueryInput!) {\n  availabilityCalendar(input: $input) {\n    ... on AvailabilityCalendarQueryResult {\n      hotelId\n      days {\n        available\n        avgPriceFormatted\n        checkin\n        minLengthOfStay\n        __typename\n      }\n      __typename\n    }\n    ... on AvailabilityCalendarQueryError {\n      message\n      __typename\n    }\n    __typename\n  }\n}\n",
            },
            # note: this removes unnecessary whitespace in JSON output
            separators=(",", ":"),
        )
        # scrape booking graphql
        result_price = await session.post(
            "https://www.booking.com/dml/graphql?lang=en-gb",
            content=gql_body,
            # note that we need to set headers to avoid being blocked
            headers={
                "content-type": "application/json",
                "x-booking-csrf-token": csrf,
                "origin": "https://www.booking.com",
            },
        )
        price_data = json.loads(result_price.content)
        return price_data["data"]["availabilityCalendar"]["days"]

    hotels = await asyncio.gather(*[scrape_hotel(url) for url in urls])
    return hotels


# example use:
if __name__ == "__main__":

    async def run():
        async with AsyncClient(headers=HEADERS) as session:
            hotels = await scrape_hotels(
                ["https://www.booking.com/hotel/gb/gardencourthotel.html"],
                session,
                datetime.now().strftime("%Y-%m-%d"),  # today
            )
            print(json.dumps(hotels, indent=2))

    asyncio.run(run())

We've extended our scrape_hotels function with price scraping functionality by replicating the pricing calendar request we saw in our web inspector. If we run this code, our results should contain pricing data similar to this:

Run code & example output
[
  {
    "title": "Garden Court Hotel",
    "description": "You're eligible for a Genius discount at Garden Court Hotel! To save at this property, all you have to do is \nsign in\n.\n\n\nThe 19th-century Garden Court Hotel is ...",
    "address": "30-31 Kensington Gardens Square, Notting Hill, Westminster Borough, London, W2 4BG, United Kingdom",
    "lat": "51.51431706",
    "lng": "-0.19066349",
    "features": {
      "Food & Drink": [
        "Breakfast in the room"
      ],
      "Internet": [],
      "Parking": [ "Electric vehicle charging station", "Street parking" ],
      "Front Desk Services": [ "Invoice provided", ""],
      "Cleaning Services": [ "Daily housekeeping", "Ironing service" ],
      "Business Facilities": [ "Fax/Photocopying" ],
      "Safety & security": [ "Fire extinguishers", "..."],
      "General": [ "Shared lounge/TV area", "..." ],
      "Languages Spoken": [ "Arabic", "English", "Spanish", "French", "Romanian" ]
    },
    "id": "102764",
    "url": "https://www.booking.com/hotel/gb/gardencourthotel.html?cur_currency=usd",
    "price": [
        {
          "available": true,
          "__typename": "AvailabilityCalendarDay",
          "checkin": "2023-07-05",
          "minLengthOfStay": 1,
          "avgPriceFormatted": "386"
        },
        {
          "available": true,
          "__typename": "AvailabilityCalendarDay",
          "avgPriceFormatted": "623",
          "minLengthOfStay": 1,
          "checkin": "2023-07-07"
        },
        ...
    ]
  }
]

We can see that this request generates price and availability data for each day of the calendar range we've specified.
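This calendar data is easy to summarize further. For example, here's a small sketch that computes availability and the average nightly price from the hotels result structure shown above:

# count available days and average the formatted prices
days = hotels[0]["price"]
available = [day for day in days if day["available"]]
prices = [float(day["avgPriceFormatted"]) for day in available if day.get("avgPriceFormatted")]
print(f"{len(available)}/{len(days)} days available, avg nightly price: {sum(prices) / len(prices):.2f}")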

Scraping Booking.com Hotel Reviews

To scrape booking.com hotel reviews, let's take a look at what happens when we explore the reviews page. Let's click the 2nd page and observe what happens in our browser's web inspector (F12 in major browsers):

We can see a background request being made when we click on page 2 link

We can see that a request to the reviewlist.html endpoint is made, which returns an HTML page of 10 reviews. We can easily replicate this in our scraper:

import asyncio
from typing import List
from urllib.parse import urlencode

from parsel import Selector


def parse_reviews(html: str) -> List[dict]:
    """parse review page for review data"""
    sel = Selector(text=html)
    parsed = []
    for review_box in sel.css('.review_list_new_item_block'):
        get_css = lambda css: review_box.css(css).get("").strip()
        parsed.append({
            "id": review_box.xpath('@data-review-url').get(),
            "score": get_css('.bui-review-score__badge::text'),
            "title": get_css('.c-review-block__title::text'),
            "date": get_css('.c-review-block__date::text'),
            "user_name": get_css('.bui-avatar-block__title::text'),
            "user_country": get_css('.bui-avatar-block__subtitle::text'),
            "text": ''.join(review_box.css('.c-review__body ::text').getall()),
            "lang": review_box.css('.c-review__body::attr(lang)').get(),
        })
    return parsed


async def scrape_reviews(hotel_id: str, session) -> List[dict]:
    """scrape all reviews of a hotel"""
    async def scrape_page(page, page_size=25):  # 25 is largest possible page size for this endpoint
        url = "https://www.booking.com/reviewlist.html?" + urlencode(
            {
                "type": "total",
                # we can configure language preference
                "lang": "en-us",
                # we can configure sorting order here, in this case recent reviews are first
                "sort": "f_recent_desc",
                "cc1": "gb",
                "dist": 1,
                "pagename": hotel_id,
                "rows": page_size,
                "offset": page * page_size,
            }
        )
        return await session.get(url)

    first_page = await scrape_page(1)
    total_pages = Selector(text=first_page.text).css(".bui-pagination__link::attr(data-page-number)").getall()
    total_pages = max((int(page) for page in total_pages), default=1)
    other_pages = await asyncio.gather(*[scrape_page(i) for i in range(2, total_pages + 1)])

    results = []
    for response in [first_page, *other_pages]:
        results.extend(parse_reviews(response.text))
    return results

In our scraper code above, we're using what we learned before: we collect the first page to extract the total number of pages and then scrape the remaining pages concurrently. Another thing to note is that we can adjust the default URL parameters to our preference. Above, we use a page size of 25 instead of the default 10, meaning we have to perform fewer requests to retrieve all reviews.
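Here's a minimal usage sketch for this review scraper. It assumes the HEADERS dictionary defined earlier; the pagename slug comes from the hotel page URL:

import asyncio
import json

from httpx import AsyncClient


async def run():
    async with AsyncClient(headers=HEADERS) as session:
        # "gardencourthotel" is the pagename slug from the hotel page URL
        reviews = await scrape_reviews("gardencourthotel", session)
        print(json.dumps(reviews, indent=2))

if __name__ == "__main__":
    asyncio.run(run())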


Finally - our scraper can discover hotels, extract hotel preview data and then scrape each hotel page for hotel information, pricing data and reviews!
However, to run this scraper at scale we need one final thing - web scraper blocking avoidance.

Bypass Blocking and Captchas using Scrapfly

Scraping Booking.com seems easy, though unfortunately when scraping at scale we'll quickly get blocked or served captchas, which will prevent us from accessing the hotel/search data.

To get around this, let's take advantage of ScrapFly API which can avoid all of these blocks for us!

illustration of scrapfly's middleware

ScrapFly offers several powerful features that'll help us get around Booking.com's blocking.

For this, we'll be using scrapfly-sdk python package and ScrapFly's anti scraping protection bypass feature. First, let's install scrapfly-sdk using pip:

$ pip install scrapfly-sdk

To take advantage of ScrapFly's API in our Booking.com web scraper, all we need to do is replace httpx requests with scrapfly-sdk requests.
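For instance, here's a minimal sketch of fetching a single hotel page through ScrapFly with the anti scraping protection bypass enabled (assuming your own API key):

from scrapfly import ScrapflyClient, ScrapeConfig

scrapfly = ScrapflyClient(key="YOUR SCRAPFLY KEY")
result = scrapfly.scrape(ScrapeConfig(
    url="https://www.booking.com/hotel/gb/gardencourthotel.html",
    asp=True,  # enable anti scraping protection bypass
    country="US",  # proxy location - also affects the displayed currency
))
html = result.content  # resulting page HTML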

Full Booking.com Scraper Code

Let's take a look at how our full scraper code would look with ScrapFly integration:

Full Scraper Code with ScrapFly integration
import asyncio
from collections import defaultdict
import json
from pathlib import Path
import re
from typing import List, Optional
from urllib.parse import urlencode
from scrapfly import ScrapflyClient, ScrapeConfig, ScrapeApiResponse
from parsel import Selector

scrapfly = ScrapflyClient(key="YOUR SCRAPFLY KEY")


async def request_hotels_page(
    query,
    checkin: str = "",
    checkout: str = "",
    number_of_rooms=1,
    offset: int = 0,
):
    """scrapes a single hotel search page of booking.com"""
    checkin_year, checkin_month, checkin_day = checkin.split("-") if checkin else ("", "", "")
    checkout_year, checkout_month, checkout_day = checkout.split("-") if checkout else ("", "", "")

    url = "https://www.booking.com/searchresults.html"
    url += "?" + urlencode(
        {
            "ss": query,
            "checkin_year": checkin_year,
            "checkin_month": checking_month,
            "checkin_monthday": checking_day,
            "checkout_year": checkout_year,
            "checkout_month": checkout_month,
            "checkout_monthday": checkout_day,
            "no_rooms": number_of_rooms,
            "offset": offset,
        }
    )
    return await scrapfly.async_scrape(ScrapeConfig(url, country="US"))


def parse_search_total_results(html: str):
    sel = Selector(text=html)
    # parse total amount of pages from heading1 text:
    # e.g. "London: 1,232 properties found"
    total_results = int(sel.css("h1").re(r"([\d,]+) properties found")[0].replace(",", ""))
    return total_results


def parse_search_hotels(html: str):
    sel = Selector(text=html)

    hotel_previews = {}
    for hotel_box in sel.xpath('//div[@data-testid="property-card"]'):
        url = hotel_box.xpath('.//h3/a[@data-testid="title-link"]/@href').get("").split("?")[0]
        hotel_previews[url] = {
            "name": hotel_box.xpath('.//h3/a[@data-testid="title-link"]/div/text()').get(""),
            "location": hotel_box.xpath('.//span[@data-testid="address"]/text()').get(""),
            "score": hotel_box.xpath('.//div[@data-testid="review-score"]/div/text()').get(""),
            "review_count": hotel_box.xpath('.//div[@data-testid="review-score"]/div[2]/div[2]/text()').get(""),
            "stars": len(hotel_box.xpath('.//div[@data-testid="rating-stars"]/span').getall()),
            "image": hotel_box.xpath('.//img[@data-testid="image"]/@src').get(),
        }
    return hotel_previews


async def scrape_search(
    query,
    checkin: str = "",
    checkout: str = "",
    number_of_rooms=1,
    max_results: Optional[int] = None,
):
    first_page = await request_hotels_page(
        query=query, checkin=checkin, checkout=checkout, number_of_rooms=number_of_rooms
    )
    hotel_previews = parse_search_hotels(first_page.content)
    total_results = parse_search_total_results(first_page.content)
    if max_results and total_results > max_results:
        total_results = max_results
    other_pages = await asyncio.gather(
        *[
            request_hotels_page(
                query=query,
                checkin=checkin,
                checkout=checkout,
                number_of_rooms=number_of_rooms,
                offset=offset,
            )
            for offset in range(25, total_results, 25)
        ]
    )
    for result in other_pages:
        hotel_previews.update(parse_search_hotels(result.content))
    return hotel_previews


def parse_hotel(html: str):
    sel = Selector(text=html)
    css = lambda selector, sep="": sep.join(sel.css(selector).getall()).strip()
    css_first = lambda selector: sel.css(selector).get("")
    lat, lng = css_first(".show_map_hp_link::attr(data-atlas-latlng)").split(",")
    features = defaultdict(list)
    for feat_box in sel.css("[data-capla-component*=FacilitiesBlock]>div>div>div"):
        type_ = feat_box.xpath('.//span[contains(@data-testid, "facility-group-icon")]/../text()').get()
        feats = [f.strip() for f in feat_box.css("li ::text").getall() if f.strip()]
        features[type_] = feats
    data = {
        "title": css("h2#hp_hotel_name::text"),
        "description": css("div#property_description_content ::text", "\n"),
        "address": css(".hp_address_subtitle::text"),
        "lat": lat,
        "lng": lng,
        "features": dict(features),
        "id": re.findall(r"b_hotel_id:\s*'(.+?)'", html)[0],
    }
    return data


async def scrape_hotels(urls: List[str], price_start_dt: str, price_n_days=30):
    async def scrape_hotel(url: str):
        result = await scrapfly.async_scrape(ScrapeConfig(
            url, 
            session=url.split("/")[-1].split(".")[0],
            country="US",
        ))
        hotel = parse_hotel(result.content)
        hotel["url"] = result.context['url']
        csrf_token = re.findall(r"b_csrf_token:\s*'(.+?)'", result.content)[0]
        hotel["price"] = await scrape_prices(csrf_token=csrf_token, hotel_id=hotel["id"], hotel_url=url)
        return hotel

    async def scrape_prices(hotel_id, csrf_token, hotel_url):
        data = {
            "name": "hotel.availability_calendar",
            "result_format": "price_histogram",
            "hotel_id": hotel_id,
            "search_config": json.dumps(
                {
                    # we can adjust pricing configuration here but this is the default
                    "b_adults_total": 2,
                    "b_nr_rooms_needed": 1,
                    "b_children_total": 0,
                    "b_children_ages_total": [],
                    "b_is_group_search": 0,
                    "b_pets_total": 0,
                    "b_rooms": [{"b_adults": 2, "b_room_order": 1}],
                }
            ),
            "checkin": price_start_dt,
            "n_days": price_n_days,
            "respect_min_los_restriction": 1,
            "los": 1,
        }
        result = await scrapfly.async_scrape(
            ScrapeConfig(
                url="https://www.booking.com/fragment.json?cur_currency=usd",
                method="POST",
                data=data,
                headers={"X-Booking-CSRF": csrf_token},
                session=hotel_url.split("/")[-1].split(".")[0],
                country="US",
            )
        )
        return json.loads(result.content)["data"]

    hotels = await asyncio.gather(*[scrape_hotel(url) for url in urls])
    return hotels


async def run():
    out = Path(__file__).parent / "results"
    out.mkdir(exist_ok=True)

    result_hotels = await scrape_hotels(
        ["https://www.booking.com/hotel/gb/gardencourthotel.html"],
        price_start_dt="2023-04-20",
        price_n_days=7,
    )
    out.joinpath("hotels.json").write_text(json.dumps(result_hotels, indent=2, ensure_ascii=False))

    result_search = await scrape_search("London", checkin="2023-04-20", checkout="2023-04-27", max_results=100)
    out.joinpath("search.json").write_text(json.dumps(result_search, indent=2, ensure_ascii=False))


if __name__ == "__main__":
    asyncio.run(run())

In the code above, to enable ScrapFly all we had to do was replace the httpx session object with ScrapflyClient! Now we can scrape the whole of booking.com without worrying about blocking or captchas.

FAQ

To wrap this guide up let's take a look at some frequently asked questions about web scraping Booking.com:

Is it legal to scrape Booking.com?

Yes. Booking hotel data is publicly available, and we're not extracting anything personal or private. Scraping booking.com at slow, respectful rates falls under the ethical scraping definition. See our Is Web Scraping Legal? article for more.

How to change currency when scraping booking.com?

Booking.com automatically chooses a currency based on the client IP's geographical location. The easiest approach is to use a proxy from a specific location; for example, in ScrapFly we can use the country=US argument in our request to receive USD prices.
Alternatively, we can manually change the currency for our scraping session via a GET request with the selected_currency parameter:

import httpx
with httpx.Client() as session:
    currency = 'USD'
    url = f"https://www.booking.com/?change_currency=1;selected_currency={currency};top_currency=1"
    response = session.get(url)

This request will return currency cookies which we can reuse to retrieve any other page in this currency. Note that this has to be done every time we start a new HTTP session.
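For example, here's a quick sketch of reusing the same session - and thus the currency cookies - for a subsequent hotel page request:

import httpx

currency = "USD"
with httpx.Client() as session:
    # this response sets currency cookies on the session...
    session.get(f"https://www.booking.com/?change_currency=1;selected_currency={currency};top_currency=1")
    # ...and they are reused here, so prices are rendered in USD
    response = session.get("https://www.booking.com/hotel/gb/gardencourthotel.html")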

How to scrape more than 1000 booking.com hotels?

Like many result paging systems, Booking.com's search returns a limited number of results - 1,000 per query - which might not be enough to fully cover some broader queries.
The best approach here is to split the query into several smaller queries. For example, instead of searching "London", we can scrape each of London's neighborhoods, as sketched below:

illustration of booking.com search split
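A rough sketch of this approach reusing our scrape_search() function; the neighborhood names here are only illustrative:

import asyncio

from httpx import AsyncClient

# example area queries - adjust to the neighborhoods you need
AREAS = ["Westminster, London", "Camden, London", "Hackney, London"]


async def scrape_all_areas():
    async with AsyncClient(headers=HEADERS) as session:
        all_hotels = {}
        for area in AREAS:
            # merging dicts also deduplicates hotels that appear in several areas
            all_hotels.update(await scrape_search(area, session))
        return all_hotels


if __name__ == "__main__":
    all_hotels = asyncio.run(scrape_all_areas())
    print(f"found {len(all_hotels)} unique hotels")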

Booking.com Scraping Summary

In this web scraping tutorial, we built a small Booking.com scraper that uses search to discover hotel listing previews and then scrapes hotel data and pricing information.

For this, we've used Python with the httpx and parsel packages. To avoid blocking, we used ScrapFly's API, which smartly configures every web scraper connection to get around it. For more on ScrapFly, see our documentation and try it out for free!
