Booking.com is the biggest travel reservation service out there, and it contains public data on thousands of hotels, resorts, vacation rentals and more.
In this tutorial, we'll take a look at how to scrape booking.com using Python.
We'll start with a quick overview of how booking.com's website functions. Then, we'll replicate its behavior in our Python scraper to extract hotel information and price data.
Finally, we'll wrap everything up by taking a look at some tips and tricks and frequently encountered challenges when web scraping booking.com. So, let's dive in!
This tutorial covers popular web scraping techniques for educational purposes. Interacting with public servers requires diligence and respect; here's a good summary of what not to do:
Do not scrape at rates that could damage the website.
Do not scrape data that's not available publicly.
Do not store PII of EU citizens who are protected by GDPR.
Do not repurpose entire public datasets, which can be illegal in some countries.
Scrapfly does not offer legal advice, but these are good general rules to follow in web scraping; for more, you should consult a lawyer.
Project Setup
In this tutorial, we'll be using Python with two packages:
httpx - HTTP client library which will let us communicate with Booking.com's servers
parsel - HTML parsing library which will help us to parse our web scraped HTML files for hotel data.
Both of these packages can be easily installed via the pip command:
$ pip install "httpx[http2,brotli]" parsel
We're using httpx with the http2 and brotli features to improve our chances of bypassing booking.com's blocking.
Alternatively, other HTTP client libraries such as requests (https://pypi.org/project/requests/) will work the same but are much more likely to cause blocking.
As for parsel, another great alternative is beautifulsoup package.
Finding Booking Hotels
Our first step is to figure out how we can discover hotel pages so that we can start scraping their data. On the Booking.com platform, we have several ways to achieve that.
Using Sitemaps
Booking.com is easily accessible through its vast sitemap system. Sitemaps are compressed XML files that contain all URLs available on the website, categorized by subject.
To find sitemaps, we first must visit the /robots.txt page, and here we can find Sitemap links:
Here, we can see URLs categorized by cities, landmarks or even themes. For example, if we take a look at /sitembk-hotel-index.xml, we can see that it splits into another set of sitemaps, as a single sitemap is only allowed to contain 50,000 results:
Here, we have 1710 sitemaps - meaning up to 85 million links to various hotel pages. Of course, not all of them are unique hotel pages (some are duplicates), but this is the easiest and most efficient way to discover hotel listing pages on booking.com.
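To make this concrete, here's a minimal sketch of how the sitemap index could be crawled with httpx and parsel - note that the sitemap filename is taken from the robots.txt above and may change over time, and blocked requests may additionally need the browser-like headers we use later in this tutorial:
import gzip

import httpx
from parsel import Selector

# the hotel sitemap index discovered through /robots.txt (filename may change)
index = httpx.get("https://www.booking.com/sitembk-hotel-index.xml")
# the index lists further gzip-compressed sitemaps in <loc> elements
sitemap_urls = Selector(index.text).xpath("//loc/text()").getall()
print(f"found {len(sitemap_urls)} hotel sitemaps")

# each sitemap is a .xml.gz file containing up to 50,000 hotel page URLs
sitemap = httpx.get(sitemap_urls[0])
hotel_urls = Selector(gzip.decompress(sitemap.content).decode()).xpath("//loc/text()").getall()
print(f"first sitemap contains {len(hotel_urls)} hotel page URLs")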
Using sitemaps is an efficient and easy way to find hotels, but it's not a flexible discovery method. Usually, when we scrape this type of data we have specific areas or categories in mind.
To scrape hotels available in a specific area or containing certain features, we need to scrape Booking.com's search system instead. So, let's take a look at how to do that.
Scraping Booking.com Search
Alternatively, we can take advantage of the search system running on booking.com just like a human user would.
Booking's search might appear to be complex at first because of long URLs, but if we dig a bit deeper, we can see that it's rather simple, as most URL parameters are optional. For example, let's take a look at this query of "Hotels in London":
Lots of scary parameters! Fortunately, we can distill them down to a few mandatory ones in our Python web scraper. Let's write the first part of our scraper - a function to retrieve a single search result page.
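Here's a minimal sketch of what such a search_page() function could look like - the header values are illustrative browser-like defaults, and only a few of the mandatory URL parameters are kept:
import asyncio
from urllib.parse import urlencode

from httpx import AsyncClient

# browser-like default headers to reduce the chance of instant blocking
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

async def search_page(query: str, checkin: str = "", number_of_rooms=1):
    """request a single booking.com search result page for a given query and check-in date"""
    params = {"ss": query, "no_rooms": number_of_rooms}
    if checkin:  # e.g. "2023-05-30"
        year, month, day = checkin.split("-")
        params.update({"checkin_year": year, "checkin_month": month, "checkin_monthday": day})
    url = "https://www.booking.com/searchresults.en-gb.html?" + urlencode(params)
    async with AsyncClient(headers=HEADERS, http2=True, follow_redirects=True) as client:
        return await client.get(url)
And a quick entry point to run it: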
async def run():
await search_page("London")
if __name__ == "__main__":
asyncio.run(run())
Here, we've defined a function that requests a single search result page for a given search query and check-in date. We're also using some common scraping idioms:
We set our HTTP client headers to that of a common web browser to avoid being instantly blocked. In this case, we're using Chrome on Windows OS.
We use the follow_redirects keyword to automatically follow all 30X responses as our generated query parameters are missing some optional values.
We can successfully automate the search functionalities and retrieve a search page as HTML. The next step is parsing Booking search pages.
Search pages use dynamic scrolling to load more results. Simulating scroll actions with headless browser automation is possible but not recommended: since a search can include thousands of results, scrolling through dozens of pages in a single browser session is impractical.
Alternatively, we'll scrape the hidden API responsible for fetching search results while scrolling. To capture this API, follow the steps below:
Open the browser developer tools by pressing the F12 key.
Select the network tab and filter by Fetch/XHR requests.
Scroll down the page to load more data.
After following the above steps, you will find a request sent to the booking.com/dml/graphql endpoint captured:
The above request represents a GraphQL call. In order to scrape it, we have to fetch the required request body from the search HTML page:
import json
from typing import Dict, List
from httpx import AsyncClient, Response
from parsel import Selector
# initialize an async httpx client
client = AsyncClient(
# ....
)
def retrieve_graphql_body(response: Response) -> Dict:
"""parse the graphql search query from the HTML and return the full graphql body"""
selector = Selector(response.text)
script_data = selector.xpath("//script[@data-capla-store-data='apollo']/text()").get()
json_script_data = json.loads(script_data)
keys_list = list(json_script_data["ROOT_QUERY"]["searchQueries"].keys())
second_key = keys_list[1]
search_query_string = second_key[len("search("):-1]
input_json_object = json.loads(search_query_string)
return {
"operationName": "FullSearch",
"variables": {
"input": input_json_object["input"],
"carouselLowCodeExp": False
},
"extensions": {},
"query": "" # use the full query from the browser
}
def generate_graphql_request(url_params: str, body: Dict, offset: int):
"""create a scrape config for the search graphql request"""
body["variables"]["input"]["pagination"]["offset"] = offset
client.headers = {
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
"accept":"*/*",
"cache-control":"no-cache",
"content-type": "application/json",
"origin":"https://www.booking.com",
"pragma":"no-cache",
"priority":"u=1, i",
"referer":"https://www.booking.com/searchresults.en-gb.html?" + url_params,
}
return client.post(
"https://www.booking.com/dml/graphql?" + url_params,
json=body
)
def parse_graphql_response(response: Response) -> List[Dict]:
"""parse the search results from the graphql response"""
data = json.loads(response.text)
parsed_data = data["data"]["searchQueries"]["search"]["results"]
return parsed_data
Here, we define three functions. Let's break them down:
generate_graphql_request: creates a request with the required body, offset, and headers. It represents the main search API call for the search data.
retrieve_graphql_body: parses the GraphQL request body from the search page HTML. Note that the body uses a query object, which can be retrieved from the browser's XHR calls.
parse_graphql_response: parses the search results from the search API responses.
Another key parameter here is offset, which controls the search result paging: providing an offset requests 25 results starting from that point in the result set. So, let's use it to implement full paging and hotel preview data parsing.
We'll utilize the above logic to crawl and scrape Booking search:
import json
import asyncio
from typing import Dict, List, Optional
from urllib.parse import urlencode
from loguru import logger as log
from httpx import AsyncClient, Response
from parsel import Selector
# initialize an async httpx client
client = AsyncClient(
# ...
)
def parse_graphql_response(response: Response) -> List[Dict]:
    ...  # as defined earlier

def retrieve_graphql_body(response: Response) -> Dict:
    ...  # as defined earlier

def generate_graphql_request(url_params: str, body: Dict, offset: int):
    ...  # as defined earlier
async def scrape_search(
query,
checkin: str = "", # e.g. 2023-05-30
checkout: str = "", # e.g. 2023-06-26
number_of_rooms=1,
max_pages: Optional[int] = None,
) -> List[Dict]:
"""scrapes a single hotel search page of booking.com"""
    checkin_year, checkin_month, checkin_day = checkin.split("-") if checkin else ("", "", "")
    checkout_year, checkout_month, checkout_day = checkout.split("-") if checkout else ("", "", "")
url_params = urlencode(
{
"ss": query,
"checkin_year": checkin_year,
"checkin_month": checking_month,
"checkin_monthday": checking_day,
"checkout_year": checkout_year,
"checkout_month": checkout_month,
"checkout_monthday": checkout_day,
"no_rooms": number_of_rooms
}
)
search_url = "https://www.booking.com/searchresults.en-gb.html?" + url_params
first_page = await client.get(search_url)
body = retrieve_graphql_body(first_page)
selector = Selector(first_page.text)
_total_results = int(selector.css("h1").re(r"([\d,]+) properties found")[0].replace(",", ""))
    _max_scrape_results = max_pages * 25 if max_pages else None
    if _max_scrape_results and _max_scrape_results < _total_results:
        _total_results = _max_scrape_results
data = []
to_scrape = [
generate_graphql_request(url_params, body, offset)
for offset in range(0, _total_results, 25)
]
for response in asyncio.as_completed(to_scrape):
response = await response
data.extend(parse_graphql_response(response))
log.success(f"scraped {len(data)} results from search pages")
return data
Run the code
async def run():
data = await scrape_search(
query="London",
max_pages=3
)
with open("search_results.json", "w", encoding="utf-8") as f:
json.dump(data, f, indent=2, ensure_ascii=False)
if __name__ == "__main__":
asyncio.run(run())
There's quite a bit of code here so let's unpack it bit by bit.
First, we define our scrape_search() function. It requests the first search result page to retrieve the GraphQL request body and the total number of results, then takes advantage of a common web scraping idiom for scraping known-size pagination: knowing the total, it schedules GraphQL requests for all remaining pages and scrapes them concurrently.
Here's a sample output of the results we got:
[
{
"bookerExperienceContentUIComponentProps": [],
"matchingUnitConfigurations": null,
"location": {
"publicTransportDistanceDescription": "Bethnal Green station is within 150 metres",
"__typename": "SearchResultsPropertyLocation",
"beachDistance": null,
"nearbyBeachNames": [],
"geoDistanceMeters": null,
"beachWalkingTime": null,
"displayLocation": "Tower Hamlets, London",
"mainDistance": "5.5 km from centre",
"skiLiftDistance": null
},
"showAdLabel": false,
"relocationMode": null,
"mlBookingHomeTags": [],
"priceDisplayInfoIrene": null,
"basicPropertyData": {
"isTestProperty": false,
"ufi": -2601889,
"externalReviewScore": null,
"id": 3788761,
"isClosed": false,
"pageName": "yotel-london-shoreditch",
"photos": {},
"paymentConfig": null,
"accommodationTypeId": 204,
"__typename": "BasicPropertyData",
"starRating": {
"value": 3,
"caption": {
"translation": "This star rating is provided to Booking.com by the property, and is usually determined by an official hotel rating organisation or another third party. ",
"__typename": "TranslationTag"
},
"tocLink": {
"__typename": "TranslationTag",
"translation": "Learn more on the \"How we work\" page"
},
"showAdditionalInfoIcon": false,
"__typename": "StarRating",
"symbol": "STARS"
},
"location": {
"address": "309 - 317 Cambridge Heath Road Bethnal Green",
"countryCode": "gb",
"city": "London",
"__typename": "Location"
},
"reviewScore": {
"showScore": true,
"score": 7.6,
"secondaryTextTag": {
"__typename": "TranslationTag",
"translation": null
},
"showSecondaryScore": false,
"secondaryScore": 0,
"totalScoreTextTag": {
"translation": "Good",
"__typename": "TranslationTag"
},
"__typename": "Reviews",
"reviewCount": 4116
}
},
},
...
]
We've successfully scraped booking.com's search pages to discover hotels located in London. Furthermore, we got some valuable metadata and the URL to each hotel page itself, so next we can scrape detailed hotel data and pricing!
Scraping Booking.com Hotel Data
Now that we have a scraper that can collect booking.com's hotel preview data, we can further collect the remaining hotel data, like the description, address and feature list, by scraping each individual hotel URL.
We'll continue using httpx to connect to booking.com and parsel to process the hotel HTML pages:
import asyncio
import json
import re
from collections import defaultdict
from typing import List

from httpx import AsyncClient
from parsel import Selector
def parse_hotel(html: str):
sel = Selector(text=html)
css = lambda selector, sep="": sep.join(sel.css(selector).getall()).strip()
css_first = lambda selector: sel.css(selector).get("")
# get latitude and longitude of the hotel address:
lat, lng = css_first(".show_map_hp_link::attr(data-atlas-latlng)").split(",")
# get hotel features by type
features = defaultdict(list)
for feat_box in sel.css("[data-capla-component*=FacilitiesBlock]>div>div>div"):
type_ = feat_box.xpath('.//span[contains(@data-testid, "facility-group-icon")]/../text()').get()
feats = [f.strip() for f in feat_box.css("li ::text").getall() if f.strip()]
features[type_] = feats
data = {
"title": css("h2#hp_hotel_name::text"),
"description": css("div#property_description_content ::text", "\n"),
"address": css(".hp_address_subtitle::text"),
"lat": lat,
"lng": lng,
"features": dict(features),
"id": re.findall(r"b_hotel_id:\s*'(.+?)'", html)[0],
}
return data
async def scrape_hotels(urls: List[str], session: AsyncClient):
async def scrape_hotel(url: str):
resp = await session.get(url)
hotel = parse_hotel(resp.text)
hotel["url"] = str(resp.url)
return hotel
hotels = await asyncio.gather(*[scrape_hotel(url) for url in urls])
return hotels
Here, we define our hotel page scraping logic. Our scrape_hotels function takes a list of hotel URLs, which we scrape via simple GET requests for the HTML data. We then use our HTML parsing library to extract hotel information using CSS selectors.
We can test our scraper out:
Run code & example output
async def run():
async with AsyncClient(headers=HEADERS) as session:
hotels = await scrape_hotels(["https://www.booking.com/hotel/gb/gardencourthotel.html"], session)
print(json.dumps(hotels, indent=2))
if __name__ == "__main__":
asyncio.run(run())
[
{
"title": "Garden Court Hotel",
"description": "You're eligible for a Genius discount at Garden Court Hotel! To save at this property, all you have to do is \nsign in\n.\n\n\nThe 19th-century Garden Court Hotel is superbly located in Kensington Gardens Square. It offers stylish, family-run accommodations, a short walk from Bayswater Underground Station.\n\n\nEach comfortable room is individually designed, with an LCD Freeview cable TV. All rooms have their own private internal private bathrooms, except for a few cozy single rooms which have their own exclusive private external bathrooms.\n\n\nFree Wi-Fi internet access is available throughout the hotel, and there is also free luggage storage and a safe for guests to use at the 24-hour reception.\n\n\nThe hotel is located in fashionable Notting Hill, close to Portobello Antiques Markets and the Royal Parks. Kings Cross Station is 4.8 km away.",
"address": "30-31 Kensington Gardens Square, Notting Hill, Westminster Borough, London, W2 4BG, United Kingdom",
"lat": "51.51431706",
"lng": "-0.19066349",
"features": {
"Food & Drink": [ "Breakfast in the room" ],
"Internet": [],
"Parking": [ "Electric vehicle charging station", "Street parking" ],
"Front Desk Services": [ "Invoice provided", "..." ],
"Cleaning Services": [ "Daily housekeeping", ],
"Business Facilities": [ "Fax/Photocopying" ],
"Safety & security": [ "Fire extinguishers", "..." ],
"General": [ "Shared lounge/TV area", "..." ],
"Languages Spoken": [ "Arabic", "..." ]
},
"id": "102764",
"url": "https://www.booking.com/hotel/gb/gardencourthotel.html?cur_currency=usd"
}
]
There's significantly more data available on the page but to keep this tutorial brief we only focused on a few example fields. However, we're missing a very important detail - the price! For that, we need to modify our scraper with an additional request.
Scraping Booking.com Hotel Pricing
Booking.com's hotel pages do not contain pricing in the HTML data, so we'll have to make additional requests to retrieve pricing calendar data. If we scroll down on the hotel page and open up our web inspector we can see how Booking.com populates its pricing calendar:
Here, we can see that a background request is being made to retrieve pricing data for 61 days!
Let's add this functionality to our scraper:
import asyncio
import json
import re
from datetime import datetime
from typing import List
from httpx import AsyncClient
HEADERS = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
"Accept-Encoding": "gzip, deflate, br",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
"Connection": "keep-alive",
"Accept-Language": "en-US,en;q=0.9,lt;q=0.8,et;q=0.7,de;q=0.6",
}
def parse_hotel(html):
    return {}  # hotel HTML parsing, as defined in the previous section
async def scrape_hotels(
urls: List[str], session: AsyncClient, price_start_dt: str, price_n_days=30
):
async def scrape_hotel(url: str):
resp = await session.get(url)
hotel = parse_hotel(resp.text)
hotel["url"] = str(resp.url)
# for background requests we need to find some variables:
_hotel_country = re.findall(r'hotelCountry:\s*"(.+?)"', resp.text)[0]
_hotel_name = re.findall(r'hotelName:\s*"(.+?)"', resp.text)[0]
_csrf_token = re.findall(r"b_csrf_token:\s*'(.+?)'", resp.text)[0]
hotel["price"] = await scrape_prices(
hotel_name=_hotel_name, hotel_country=_hotel_country, csrf=_csrf_token
)
return hotel
async def scrape_prices(hotel_name, hotel_country, csrf):
# make graphql query from our variables
gql_body = json.dumps(
{
"operationName": "AvailabilityCalendar",
                # hotel variables go here
# you can adjust number of adults, room number etc.
"variables": {
"input": {
"travelPurpose": 2,
"pagenameDetails": {
"countryCode": hotel_country,
"pagename": hotel_name,
},
"searchConfig": {
"searchConfigDate": {
"startDate": price_start_dt,
"amountOfDays": price_n_days,
},
"nbAdults": 2,
"nbRooms": 1,
},
}
},
"extensions": {},
# this is the query itself, don't alter it
"query": "query AvailabilityCalendar($input: AvailabilityCalendarQueryInput!) {\n availabilityCalendar(input: $input) {\n ... on AvailabilityCalendarQueryResult {\n hotelId\n days {\n available\n avgPriceFormatted\n checkin\n minLengthOfStay\n __typename\n }\n __typename\n }\n ... on AvailabilityCalendarQueryError {\n message\n __typename\n }\n __typename\n }\n}\n",
},
# note: this removes unnecessary whitespace in JSON output
separators=(",", ":"),
)
# scrape booking graphql
        result_price = await session.post(
            "https://www.booking.com/dml/graphql?lang=en-gb",
            content=gql_body,
            # note that we need to set headers to avoid being blocked
            headers={
                "content-type": "application/json",
                "x-booking-csrf-token": csrf,
                "origin": "https://www.booking.com",
            },
        )
price_data = json.loads(result_price.content)
return price_data["data"]["availabilityCalendar"]["days"]
hotels = await asyncio.gather(*[scrape_hotel(url) for url in urls])
return hotels
# example use:
if __name__ == "__main__":
async def run():
async with AsyncClient(headers=HEADERS) as session:
hotels = await scrape_hotels(
["https://www.booking.com/hotel/gb/gardencourthotel.html"],
session,
datetime.now().strftime("%Y-%m-%d"), # today
)
print(json.dumps(hotels, indent=2))
asyncio.run(run())
We've extended our scrape_hotels function with price scraping functionality by replicating the pricing calendar request we saw in our web inspector. If we run this code, our results should contain pricing data similar to this:
Run code & example output
[
{
"title": "Garden Court Hotel",
"description": "You're eligible for a Genius discount at Garden Court Hotel! To save at this property, all you have to do is \nsign in\n.\n\n\nThe 19th-century Garden Court Hotel is ...",
"address": "30-31 Kensington Gardens Square, Notting Hill, Westminster Borough, London, W2 4BG, United Kingdom",
"lat": "51.51431706",
"lng": "-0.19066349",
"features": {
"Food & Drink": [
"Breakfast in the room"
],
"Internet": [],
"Parking": [ "Electric vehicle charging station", "Street parking" ],
"Front Desk Services": [ "Invoice provided", ""],
"Cleaning Services": [ "Daily housekeeping", "Ironing service" ],
"Business Facilities": [ "Fax/Photocopying" ],
"Safety & security": [ "Fire extinguishers", "..."],
"General": [ "Shared lounge/TV area", "..." ],
"Languages Spoken": [ "Arabic", "English", "Spanish", "French", "Romanian" ]
},
"id": "102764",
"url": "https://www.booking.com/hotel/gb/gardencourthotel.html?cur_currency=usd",
"price": [
{
"available": true,
"__typename": "AvailabilityCalendarDay",
"checkin": "2023-07-05",
"minLengthOfStay": 1,
"avgPriceFormatted": "386"
},
{
"available": true,
"__typename": "AvailabilityCalendarDay",
"avgPriceFormatted": "623",
"minLengthOfStay": 1,
"checkin": "2023-07-07"
},
...
]
}
]
We can see that this request generates price and availability data for each day of the calendar range we've specified.
Scraping Booking.com Hotel Reviews
To scrape booking.com hotel reviews, let's take a look at what happens when we explore the reviews page. Let's click the 2nd page and observe what happens in our browser's web inspector (F12 in major browsers):
We can see that a request to the reviewlist.html endpoint is made, which returns an HTML page of 10 reviews. We can easily replicate this in our scraper:
import asyncio
from typing import List
from urllib.parse import urlencode

from parsel import Selector

def parse_reviews(html: str) -> List[dict]:
"""parse review page for review data """
sel = Selector(text=html)
parsed = []
for review_box in sel.css('.review_list_new_item_block'):
get_css = lambda css: review_box.css(css).get("").strip()
parsed.append({
"id": review_box.xpath('@data-review-url').get(),
"score": get_css('.bui-review-score__badge::text'),
"title": get_css('.c-review-block__title::text'),
"date": get_css('.c-review-block__date::text'),
"user_name": get_css('.bui-avatar-block__title::text'),
"user_country": get_css('.bui-avatar-block__subtitle::text'),
"text": ''.join(review_box.css('.c-review__body ::text').getall()),
"lang": review_box.css('.c-review__body::attr(lang)').get(),
})
return parsed
async def scrape_reviews(hotel_id: str, session) -> List[dict]:
"""scrape all reviews of a hotel"""
async def scrape_page(page, page_size=25): # 25 is largest possible page size for this endpoint
url = "https://www.booking.com/reviewlist.html?" + urlencode(
{
"type": "total",
# we can configure language preference
"lang": "en-us",
# we can configure sorting order here, in this case recent reviews are first
"sort": "f_recent_desc",
"cc1": "gb",
"dist": 1,
"pagename": hotel_id,
"rows": page_size,
"offset": page * page_size,
}
)
return await session.get(url)
first_page = await scrape_page(1)
total_pages = Selector(text=first_page.text).css(".bui-pagination__link::attr(data-page-number)").getall()
total_pages = max(int(page) for page in total_pages)
other_pages = await asyncio.gather(*[scrape_page(i) for i in range(2, total_pages + 1)])
results = []
for response in [first_page, *other_pages]:
results.extend(parse_reviews(response.text))
return results
In our scraper code above, we're using what we learned before: we collect the first page to extract the total number of pages and then scrape the remaining pages concurrently. Another thing to note is that we can adjust the default URL parameters to our preference. Above, we use a page size of 25 instead of the default 10, meaning we have to perform fewer requests to retrieve all reviews.
Finally - our scraper can discover hotels, extract hotel preview data and then scrape each hotel page for hotel information, pricing data and reviews!
However, to adopt this scraper at scale we need one final thing - web scraper blocking avoidance.
Bypass Blocking and Captchas using Scrapfly
Scraping Booking.com seems easy, though unfortunately, when scraping at scale we'll quickly get blocked or served captchas, which will prevent us from accessing the hotel/search data.
For scraping booking.com with Scrapfly, we'll be using the scrapfly-sdk Python package and ScrapFly's anti scraping protection bypass feature. First, let's install scrapfly-sdk using pip:
$ pip install scrapfly-sdk
To take advantage of ScrapFly's API in our Booking.com web scraper, all we need to do is replace httpx requests with scrapfly-sdk requests.
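For example, a single page fetch could look like this - a minimal sketch, where asp=True enables the anti scraping protection bypass and country selects the proxy location:
from scrapfly import ScrapflyClient, ScrapeConfig

scrapfly = ScrapflyClient(key="YOUR SCRAPFLY KEY")

# before: response = await client.get(url); html = response.text
# after: the same page through ScrapFly with blocking bypass enabled
result = scrapfly.scrape(ScrapeConfig(
    url="https://www.booking.com/hotel/gb/gardencourthotel.html",
    asp=True,  # enable anti scraping protection bypass
    country="US",  # proxy country (also controls the displayed currency)
))
html = result.content  # the scraped page HTML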
Full Booking.com Scraper Code
Let's take a look at how our full scraper code would look with ScrapFly integration:
Full Scraper Code with ScrapFly integration
import asyncio
from collections import defaultdict
import json
from pathlib import Path
import re
from typing import List, Optional
from urllib.parse import urlencode
from scrapfly import ScrapflyClient, ScrapeConfig, ScrapeApiResponse
from parsel import Selector
scrapfly = ScrapflyClient(key="YOUR SCRAPFLY KEY")
async def request_hotels_page(
query,
checkin: str = "",
checkout: str = "",
number_of_rooms=1,
offset: int = 0,
):
"""scrapes a single hotel search page of booking.com"""
    checkin_year, checkin_month, checkin_day = checkin.split("-") if checkin else ("", "", "")
    checkout_year, checkout_month, checkout_day = checkout.split("-") if checkout else ("", "", "")
url = "https://www.booking.com/searchresults.html"
url += "?" + urlencode(
{
"ss": query,
"checkin_year": checkin_year,
"checkin_month": checking_month,
"checkin_monthday": checking_day,
"checkout_year": checkout_year,
"checkout_month": checkout_month,
"checkout_monthday": checkout_day,
"no_rooms": number_of_rooms,
"offset": offset,
}
)
return await scrapfly.async_scrape(ScrapeConfig(url, country="US"))
def parse_search_total_results(html: str):
sel = Selector(text=html)
# parse total amount of pages from heading1 text:
# e.g. "London: 1,232 properties found"
    total_results = int(sel.css("h1").re(r"([\d,]+) properties found")[0].replace(",", ""))
return total_results
def parse_search_hotels(html: str):
sel = Selector(text=html)
hotel_previews = {}
for hotel_box in sel.xpath('//div[@data-testid="property-card"]'):
url = hotel_box.xpath('.//h3/a[@data-testid="title-link"]/@href').get("").split("?")[0]
hotel_previews[url] = {
"name": hotel_box.xpath('.//h3/a[@data-testid="title-link"]/div/text()').get(""),
"location": hotel_box.xpath('.//span[@data-testid="address"]/text()').get(""),
"score": hotel_box.xpath('.//div[@data-testid="review-score"]/div/text()').get(""),
"review_count": hotel_box.xpath('.//div[@data-testid="review-score"]/div[2]/div[2]/text()').get(""),
"stars": len(hotel_box.xpath('.//div[@data-testid="rating-stars"]/span').getall()),
"image": hotel_box.xpath('.//img[@data-testid="image"]/@src').get(),
}
return hotel_previews
async def scrape_search(
query,
checkin: str = "",
checkout: str = "",
number_of_rooms=1,
max_results: Optional[int] = None,
):
first_page = await request_hotels_page(
query=query, checkin=checkin, checkout=checkout, number_of_rooms=number_of_rooms
)
hotel_previews = parse_search_hotels(first_page.content)
total_results = parse_search_total_results(first_page.content)
if max_results and total_results > max_results:
total_results = max_results
other_pages = await asyncio.gather(
*[
request_hotels_page(
query=query,
checkin=checkin,
checkout=checkout,
number_of_rooms=number_of_rooms,
offset=offset,
)
for offset in range(25, total_results, 25)
]
)
for result in other_pages:
hotel_previews.update(parse_search_hotels(result.content))
return hotel_previews
def parse_hotel(html: str):
sel = Selector(text=html)
css = lambda selector, sep="": sep.join(sel.css(selector).getall()).strip()
css_first = lambda selector: sel.css(selector).get("")
lat, lng = css_first(".show_map_hp_link::attr(data-atlas-latlng)").split(",")
features = defaultdict(list)
for feat_box in sel.css("[data-capla-component*=FacilitiesBlock]>div>div>div"):
type_ = feat_box.xpath('.//span[contains(@data-testid, "facility-group-icon")]/../text()').get()
feats = [f.strip() for f in feat_box.css("li ::text").getall() if f.strip()]
features[type_] = feats
data = {
"title": css("h2#hp_hotel_name::text"),
"description": css("div#property_description_content ::text", "\n"),
"address": css(".hp_address_subtitle::text"),
"lat": lat,
"lng": lng,
"features": dict(features),
"id": re.findall(r"b_hotel_id:\s*'(.+?)'", html)[0],
}
return data
async def scrape_hotels(urls: List[str], price_start_dt: str, price_n_days=30):
async def scrape_hotel(url: str):
result = await scrapfly.async_scrape(ScrapeConfig(
url,
session=url.split("/")[-1].split(".")[0],
country="US",
))
hotel = parse_hotel(result.content)
hotel["url"] = result.context['url']
csrf_token = re.findall(r"b_csrf_token:\s*'(.+?)'", result.content)[0]
hotel["price"] = await scrape_prices(csrf_token=csrf_token, hotel_id=hotel["id"], hotel_url=url)
return hotel
async def scrape_prices(hotel_id, csrf_token, hotel_url):
data = {
"name": "hotel.availability_calendar",
"result_format": "price_histogram",
"hotel_id": hotel_id,
"search_config": json.dumps(
{
# we can adjust pricing configuration here but this is the default
"b_adults_total": 2,
"b_nr_rooms_needed": 1,
"b_children_total": 0,
"b_children_ages_total": [],
"b_is_group_search": 0,
"b_pets_total": 0,
"b_rooms": [{"b_adults": 2, "b_room_order": 1}],
}
),
"checkin": price_start_dt,
"n_days": price_n_days,
"respect_min_los_restriction": 1,
"los": 1,
}
result = await scrapfly.async_scrape(
ScrapeConfig(
url="https://www.booking.com/fragment.json?cur_currency=usd",
method="POST",
data=data,
headers={"X-Booking-CSRF": csrf_token},
session=hotel_url.split("/")[-1].split(".")[0],
country="US",
)
)
return json.loads(result.content)["data"]
hotels = await asyncio.gather(*[scrape_hotel(url) for url in urls])
return hotels
async def run():
out = Path(__file__).parent / "results"
out.mkdir(exist_ok=True)
result_hotels = await scrape_hotels(
["https://www.booking.com/hotel/gb/gardencourthotel.html"],
price_start_dt="2023-04-20",
price_n_days=7,
)
out.joinpath("hotels.json").write_text(json.dumps(result_hotels, indent=2, ensure_ascii=False))
result_search = await scrape_search("London", checkin="2023-04-20", checkout="2023-04-27", max_results=100)
out.joinpath("search.json").write_text(json.dumps(result_search, indent=2, ensure_ascii=False))
if __name__ == "__main__":
asyncio.run(run())
In the code above, to enable ScrapFly all we had to do was replace the httpx session object with the ScrapflyClient! Now we can scrape the whole of booking.com without worrying about blocking or captchas.
FAQ
To wrap this guide up let's take a look at some frequently asked questions about web scraping Booking.com:
Is web scraping booking.com legal?
Yes. Booking hotel data is publicly available; we're not extracting anything personal or private. Scraping booking.com at slow, respectful rates would fall under the ethical scraping definition. See our Is Web Scraping Legal? article for more.
How to change currency when scraping booking.com?
Booking.com automatically chooses a currency based on the client IP's geographic location. The easiest approach is to use a proxy of a specific location; for example, in ScrapFly we can use the country=US argument in our request to receive USD prices.
Alternatively, we can manually change the currency for our scraping session via GET request with the selected_currency parameter.
import httpx

with httpx.Client() as session:
    currency = 'USD'
    url = f"https://www.booking.com/?change_currency=1;selected_currency={currency};top_currency=1"
    response = session.get(url)
    # the currency cookies are now set on the session and will be reused
    # by any further requests made with it
This request will return currency cookies which we can reuse to retrieve any other page in this currency. Note that this has to be done every time we start a new HTTP session.
How to scrape more than 1000 booking.com hotels?
Like many result paging systems, Booking.com's search returns a limited number of results - 1,000 per query - which might not be enough to fully cover some broader queries.
The best approach here is to split the query into several smaller ones. For example, instead of searching all of "London", we can run a search for each of London's neighborhoods, as sketched below.
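Here's a minimal sketch of that idea, reusing the scrape_search() function we defined earlier - the neighborhood queries are purely illustrative:
import asyncio

# illustrative area queries - any sufficiently granular location split works
LONDON_AREAS = ["Westminster, London", "Camden, London", "Hackney, London"]

async def scrape_all_of_london():
    results = []
    for area in LONDON_AREAS:
        # scrape_search() is the search scraper we wrote earlier in this tutorial
        results.extend(await scrape_search(query=area))
    return results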
Summary
In this web scraping tutorial, we built a small Booking.com scraper that uses search to discover hotel listing previews and then scrapes hotel data and pricing information.
For this, we used Python with the httpx and parsel packages, and to avoid blocking we used ScrapFly's API, which smartly configures every web scraper connection. For more on ScrapFly, see our documentation and try it out for free!