How to Scrape Booking.com Hotel Data in Python
Booking.com is the biggest travel reservation service out there, and it contains public data on thousands of hotels, resorts, vacation rentals and so on.
In this tutorial, we'll take a look at how to scrape booking.com using the Python programming language.
We'll start with a quick overview of booking.com's website functionality. Then, we'll replicate its behavior in our Python scraper to collect hotel information and price data.
Finally, we'll wrap everything up by taking a look at some tips, tricks and frequently encountered challenges when web scraping booking.com. So, let's dive in!
In this tutorial, we'll be using Python with two packages:

- httpx - an HTTP client library that will let us retrieve booking.com's pages.
- parsel - an HTML parsing library that will let us extract data using CSS and XPath selectors.

Both of these packages can be easily installed via the pip command:

$ pip install httpx parsel

Alternatively, you're free to swap httpx out with any other HTTP client library, such as requests, as we'll only need basic HTTP functions which are almost interchangeable in every library. As for parsel, another great alternative is the beautifulsoup package.
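For instance, here's a quick sketch of the same title extraction done with both parsing libraries - this is just an illustrative comparison, not part of our scraper:

import httpx
from bs4 import BeautifulSoup
from parsel import Selector

html = httpx.get("https://www.booking.com/").text
# parsel uses CSS (or XPath) selector expressions:
title_with_parsel = Selector(text=html).css("title::text").get()
# beautifulsoup uses pythonic attribute/method lookups instead:
title_with_bs4 = BeautifulSoup(html, "html.parser").title.get_text()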
Our first step is to figure out how we can discover hotel pages, so we can start scraping their data. On the Booking.com platform, we have several ways to achieve that.
Booking.com is easily accessible through its vast sitemap system. Sitemaps are compressed XML files that contain all URLs available on the website, categorized by subject.
To find the sitemaps, we first must visit the /robots.txt page, where we can find the Sitemap links:
Sitemap: https://www.booking.com/sitembk-airport-index.xml
Sitemap: https://www.booking.com/sitembk-articles-index.xml
Sitemap: https://www.booking.com/sitembk-attractions-index.xml
Sitemap: https://www.booking.com/sitembk-beaches-index.xml
Sitemap: https://www.booking.com/sitembk-beach-holidays-index.xml
Sitemap: https://www.booking.com/sitembk-cars-index.xml
Sitemap: https://www.booking.com/sitembk-city-index.xml
Sitemap: https://www.booking.com/sitembk-country-index.xml
Sitemap: https://www.booking.com/sitembk-district-index.xml
Sitemap: https://www.booking.com/sitembk-hotel-index.xml
Sitemap: https://www.booking.com/sitembk-landmark-index.xml
Sitemap: https://www.booking.com/sitembk-region-index.xml
Sitemap: https://www.booking.com/sitembk-tourism-index.xml
Sitemap: https://www.booking.com/sitembk-themed-city-villas-index.xml
Sitemap: https://www.booking.com/sitembk-themed-country-golf-index.xml
Sitemap: https://www.booking.com/sitembk-themed-region-budget-index.xml
Here, we can see URLs categorized by cities, landmarks or even themes. For example, if we take a look at https://www.booking.com/sitembk-hotel-index.xml we can see that it splits into another set of sitemaps (as a single sitemap is only allowed to contain 50,000 results):
<sitemapindex xmlns="http://www.google.com/schemas/sitemap/0.9">
<sitemap>
<loc>https://www.booking.com/sitembk-hotel-zh-tw.0037.xml.gz</loc>
<lastmod>2022-05-17</lastmod>
</sitemap>
<sitemap>
<loc>https://www.booking.com/sitembk-hotel-zh-tw.0036.xml.gz</loc>
<lastmod>2022-05-17</lastmod>
</sitemap>
...
Here, we have 1710 sitemaps - meaning 85 million links to various hotel pages. Of course, not all of them are unique hotel pages (some are duplicates), but this is an easy way to discover hotel listings on booking.com.
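For example, here's a minimal sketch of how we could collect hotel page URLs from this sitemap system using the httpx and parsel packages (the index URL is the one from robots.txt above):

import gzip

import httpx
from parsel import Selector

# fetch the hotel sitemap index and extract the individual sitemap URLs:
index = httpx.get("https://www.booking.com/sitembk-hotel-index.xml")
sitemap_urls = Selector(text=index.text).css("sitemap>loc::text").getall()

# each sitemap is a gzip-compressed XML file of up to 50,000 hotel URLs:
sitemap = httpx.get(sitemap_urls[0])
hotel_urls = Selector(text=gzip.decompress(sitemap.content).decode()).css("loc::text").getall()
print(f"found {len(hotel_urls)} hotel urls in the first sitemap")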
Using sitemaps is an efficient and easy way to find hotels, but it's not a flexible discovery method. To scrape hotels available in a specific area or containing certain features, we need to scrape Booking.com's search system instead. So, let's take a look at how to do that.
Alternatively, we can take advantage of the search system running on booking.com just like a human user would.
Booking's search might appear to be complex at first because of long URLs, but if we dig a bit deeper, we can see that it's rather simple, as most URL parameters are optional. For example, let's take a look at this query of "Hotels in London":
https://www.booking.com/searchresults.html?label=gen173nr-1DCAEoggI46AdIM1gEaN0BiAEBmAExuAEKyAEF2AED6AEB-AECiAIBqAIDuAK-u5eUBsACAdICJGRlN2VhYzYyLTJiYzItNDE0MS1iYmY4LWYwZjkxNTc0OGY4ONgCBOACAQ
&sid=51b2c8cd7b3c8377e83692903e6f19ca
&sb=1
&sb_lp=1
&src=index
&src_elem=sb
&error_url=https%3A%2F%2Fwww.booking.com%2Findex.html%3Flabel%3Dgen173nr-1DCAEoggI46AdIM1gEaN0BiAEBmAExuAEKyAEF2AED6AEB-AECiAIBqAIDuAK-u5eUBsACAdICJGRlN2VhYzYyLTJiYzItNDE0MS1iYmY4LWYwZjkxNTc0OGY4ONgCBOACAQ%26sid%3D51b2c8cd7b3c8377e83692903e6f19ca%26sb_price_type%3Dtotal%26%26
&ss=London%2C+Greater+London%2C+United+Kingdom
&is_ski_area=
&ssne=London
&ssne_untouched=London
&checkin_year=2022
&checkin_month=6
&checkin_monthday=9
&checkout_year=2022
&checkout_month=6
&checkout_monthday=11
&group_adults=2
&group_children=0
&no_rooms=1
&b_h4u_keep_filters=
&from_sf=1
&search_pageview_id=f25c2a9ee3630134
&ac_suggestion_list_length=5
&ac_suggestion_theme_list_length=0
&ac_position=0
&ac_langcode=en
&ac_click_type=b
&dest_id=-2601889
&dest_type=city
&iata=LON
&place_id_lat=51.507393
&place_id_lon=-0.127634
&search_pageview_id=f25c2a9ee3630134
&search_selected=true
&ss_raw=London
Lots of scary parameters! Fortunately, we can distill them down to a few mandatory ones in our Python web scraper. Let's write the first part of our scraper - a function to retrieve a single search result page:
import asyncio
from urllib.parse import urlencode

from httpx import AsyncClient


async def search_page(
    query,
    session: AsyncClient,
    checkin: str = "",
    checkout: str = "",
    number_of_rooms=1,
    offset: int = 0,
):
    """scrapes a single hotel search page of booking.com"""
    # note: parentheses around the fallback tuple are required for correct unpacking
    checkin_year, checkin_month, checkin_day = checkin.split("-") if checkin else ("", "", "")
    checkout_year, checkout_month, checkout_day = checkout.split("-") if checkout else ("", "", "")
    url = "https://www.booking.com/searchresults.html"
    url += "?" + urlencode(
        {
            "ss": query,
            "checkin_year": checkin_year,
            "checkin_month": checkin_month,
            "checkin_monthday": checkin_day,
            "checkout_year": checkout_year,
            "checkout_month": checkout_month,
            "checkout_monthday": checkout_day,
            "no_rooms": number_of_rooms,
            "offset": offset,
        }
    )
    return await session.get(url, follow_redirects=True)


# Example use:
# first we need to imitate web browser headers to not get blocked instantly
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Connection": "keep-alive",
    "Accept-Language": "en-US,en;q=0.9,lt;q=0.8,et;q=0.7,de;q=0.6",
}


async def run():
    async with AsyncClient(headers=HEADERS) as session:
        await search_page("London", session)


if __name__ == "__main__":
    asyncio.run(run())
Here, we've defined a function that requests a single search result page from a given search query and check-in dates. We're also using some common scraping idioms:

- We set the HEADERS to those of a common web browser to avoid being instantly blocked. In this case, we're using Chrome on Windows OS.
- We use the follow_redirects keyword to automatically follow all 30X responses, as our generated query parameters are missing some optional values.

Another key parameter here is offset, which controls the search result paging. Providing an offset tells booking.com that we want 25 results starting from the X point of the result set. So, let's use it to implement full paging and hotel preview data parsing:
import asyncio

from httpx import AsyncClient
from parsel import Selector


def parse_search_total_results(html: str):
    """parse total number of results from search page HTML"""
    sel = Selector(text=html)
    # parse total amount of pages from heading1 text:
    # e.g. "London: 1,232 properties found"
    total_results = int(sel.css("h1").re(r"([\d,]+) properties found")[0].replace(",", ""))
    return total_results


def parse_search_page(html: str):
    """parse hotel preview data from search page HTML"""
    sel = Selector(text=html)
    hotel_previews = {}
    for hotel_box in sel.xpath('//div[@data-testid="property-card"]'):
        url = hotel_box.xpath('.//h3/a[@data-testid="title-link"]/@href').get("").split("?")[0]
        hotel_previews[url] = {
            "name": hotel_box.xpath('.//h3/a[@data-testid="title-link"]/div/text()').get(""),
            "location": hotel_box.xpath('.//span[@data-testid="address"]/text()').get(""),
            "score": hotel_box.xpath('.//div[@data-testid="review-score"]/div/text()').get(""),
            "review_count": hotel_box.xpath('.//div[@data-testid="review-score"]/div[2]/div[2]/text()').get(""),
            "stars": len(hotel_box.xpath('.//div[@data-testid="rating-stars"]/span').getall()),
            "image": hotel_box.xpath('.//img[@data-testid="image"]/@src').get(),
        }
    return hotel_previews


async def scrape_search(
    query,
    session: AsyncClient,
    checkin: str = "",
    checkout: str = "",
    number_of_rooms=1,
):
    """scrape all hotel previews from a given search query"""
    first_page = await search_page(
        query=query, session=session, checkin=checkin, checkout=checkout, number_of_rooms=number_of_rooms
    )
    total_results = parse_search_total_results(first_page.text)
    other_pages = await asyncio.gather(
        *[
            search_page(
                query=query,
                session=session,
                checkin=checkin,
                checkout=checkout,
                number_of_rooms=number_of_rooms,
                offset=offset,
            )
            for offset in range(25, total_results, 25)
        ]
    )
    hotel_previews = {}
    for response in [first_page, *other_pages]:
        hotel_previews.update(parse_search_page(response.text))
    return hotel_previews
There's quite a bit of code here, so let's unpack it bit by bit:

First, we define our scrape_search() function which loops through our previously defined search_page() function to scrape all pages instead of just the first one. We do this by taking advantage of a common web scraping idiom for scraping known-size pagination: we scrape the first page, find the total number of results and scrape the rest of the pages concurrently.

Then, we parse the preview data from each result page using XPath selectors. We do this by iterating through each of the 25 hotel preview boxes present on the page and extracting details such as name, location, score, URL, review count and star rating.
Let's run our search scraper:
import asyncio
import json


async def run():
    async with AsyncClient(headers=HEADERS) as session:
        results = await scrape_search("London", session)
        print(json.dumps(results, indent=2))


if __name__ == "__main__":
    asyncio.run(run())
{
  "https://www.booking.com/hotel/gb/nobu-hotel-london-portman-square.html": {
    "name": "Nobu Hotel London Portman Square",
    "location": "Westminster Borough, London",
    "score": "8.9",
    "review_count": "445 reviews",
    "stars": 5,
    "image": "https://cf.bstatic.com/xdata/images/hotel/square200/339532965.webp?k=ba363634cf1e7c91ac2e64f701bf702d520b133c311ac91e2b3df118d0570aaa&o=&s=1"
  },
  ...
We've successfully scraped booking.com's search page to discover hotels located in London. Furthermore, we got some valuable metadata and the URL of each hotel page, so next we can scrape detailed hotel data and pricing!
Now that we have a scraper that can collect booking.com's hotel preview data, we can further collect the remaining hotel data like description, address, feature list etc. by scraping each individual hotel URL.
We'll continue using httpx for connecting to booking.com and parsel for processing the hotel HTML pages:
import asyncio
import re
from collections import defaultdict
from typing import List

from httpx import AsyncClient
from parsel import Selector


def parse_hotel(html: str):
    sel = Selector(text=html)
    css = lambda selector, sep="": sep.join(sel.css(selector).getall()).strip()
    css_first = lambda selector: sel.css(selector).get("")
    # get latitude and longitude of the hotel address:
    lat, lng = css_first(".show_map_hp_link::attr(data-atlas-latlng)").split(",")
    # get hotel features by type:
    features = defaultdict(list)
    for feat_box in sel.css("[data-capla-component*=FacilitiesBlock]>div>div>div"):
        type_ = feat_box.xpath('.//span[contains(@data-testid, "facility-group-icon")]/../text()').get()
        feats = [f.strip() for f in feat_box.css("li ::text").getall() if f.strip()]
        features[type_] = feats
    data = {
        "title": css("h2#hp_hotel_name::text"),
        "description": css("div#property_description_content ::text", "\n"),
        "address": css(".hp_address_subtitle::text"),
        "lat": lat,
        "lng": lng,
        "features": dict(features),
        "id": re.findall(r"b_hotel_id:\s*'(.+?)'", html)[0],
    }
    return data


async def scrape_hotels(urls: List[str], session: AsyncClient):
    async def scrape_hotel(url: str):
        resp = await session.get(url)
        hotel = parse_hotel(resp.text)
        hotel["url"] = str(resp.url)
        return hotel

    hotels = await asyncio.gather(*[scrape_hotel(url) for url in urls])
    return hotels
Here, we define our hotel page scraping logic. Our scrape_hotels function takes a list of hotel URLs, which we scrape via simple GET requests for the HTML data. We then use our HTML parsing library to extract hotel information using CSS selectors.
We can test our scraper out:
async def run():
    async with AsyncClient(headers=HEADERS) as session:
        hotels = await scrape_hotels(["https://www.booking.com/hotel/gb/gardencourthotel.html"], session)
        print(json.dumps(hotels, indent=2))


if __name__ == "__main__":
    asyncio.run(run())
[
  {
    "title": "Garden Court Hotel",
    "description": "You're eligible for a Genius discount at Garden Court Hotel! To save at this property, all you have to do is \nsign in\n.\n\n\nThe 19th-century Garden Court Hotel is superbly located in Kensington Gardens Square. It offers stylish, family-run accommodations, a short walk from Bayswater Underground Station.\n\n\nEach comfortable room is individually designed, with an LCD Freeview cable TV. All rooms have their own private internal private bathrooms, except for a few cozy single rooms which have their own exclusive private external bathrooms.\n\n\nFree Wi-Fi internet access is available throughout the hotel, and there is also free luggage storage and a safe for guests to use at the 24-hour reception.\n\n\nThe hotel is located in fashionable Notting Hill, close to Portobello Antiques Markets and the Royal Parks. Kings Cross Station is 4.8 km away.",
    "address": "30-31 Kensington Gardens Square, Notting Hill, Westminster Borough, London, W2 4BG, United Kingdom",
    "lat": "51.51431706",
    "lng": "-0.19066349",
    "features": {
      "Food & Drink": [ "Breakfast in the room" ],
      "Internet": [],
      "Parking": [ "Electric vehicle charging station", "Street parking" ],
      "Front Desk Services": [ "Invoice provided", "..." ],
      "Cleaning Services": [ "Daily housekeeping" ],
      "Business Facilities": [ "Fax/Photocopying" ],
      "Safety & security": [ "Fire extinguishers", "..." ],
      "General": [ "Shared lounge/TV area", "..." ],
      "Languages Spoken": [ "Arabic", "..." ]
    },
    "id": "102764",
    "url": "https://www.booking.com/hotel/gb/gardencourthotel.html?cur_currency=usd"
  }
]
There's significantly more data available on the page but to keep this tutorial brief we only focused on a few example fields. However, we're missing a very important detail - the price! For that, we need to modify our scraper with an additional request.
Booking.com's hotel pages do not contain pricing in the HTML data, so we'll have to make additional requests to retrieve pricing calendar data. If we scroll down on the hotel page and open up our web inspector we can see how Booking.com populates its pricing calendar:
Here, we can see that a background request is being made to retrieve pricing data for 61 days! Let's add this functionality to our scraper:
async def scrape_hotels(urls: List[str], session: AsyncClient, price_start_dt: str, price_n_days=30):
    async def scrape_hotel(url: str):
        resp = await session.get(url)
        hotel = parse_hotel(resp.text)
        hotel["url"] = str(resp.url)
        # for background requests we need to find the cross-site request forgery (CSRF) token:
        csrf_token = re.findall(r"b_csrf_token:\s*'(.+?)'", resp.text)[0]
        hotel["price"] = await scrape_prices(csrf_token=csrf_token, hotel_id=hotel["id"])
        return hotel

    async def scrape_prices(hotel_id, csrf_token):
        data = {
            "name": "hotel.availability_calendar",
            "result_format": "price_histogram",
            "hotel_id": hotel_id,
            "search_config": json.dumps({
                # we can adjust the pricing configuration here, but this is the default:
                "b_adults_total": 2,
                "b_nr_rooms_needed": 1,
                "b_children_total": 0,
                "b_children_ages_total": [],
                "b_is_group_search": 0,
                "b_pets_total": 0,
                "b_rooms": [{"b_adults": 2, "b_room_order": 1}],
            }),
            "checkin": price_start_dt,
            "n_days": price_n_days,
            "respect_min_los_restriction": 1,
            "los": 1,
        }
        resp = await session.post(
            "https://www.booking.com/fragment.json?cur_currency=usd",
            headers={**session.headers, "X-Booking-CSRF": csrf_token},
            data=data,
        )
        return resp.json()["data"]

    hotels = await asyncio.gather(*[scrape_hotel(url) for url in urls])
    return hotels
We've extended our scrape_hotels function with price scraping functionality by replicating the pricing calendar request we saw in our web inspector. If we run this code, our results should contain pricing data similar to this:
async def run():
    async with AsyncClient(headers=HEADERS) as session:
        hotels = await scrape_hotels(["https://www.booking.com/hotel/gb/gardencourthotel.html"], session, "2022-05-20")
        print(json.dumps(hotels, indent=2))


if __name__ == "__main__":
    asyncio.run(run())
[
  {
    "title": "Garden Court Hotel",
    "description": "You're eligible for a Genius discount at Garden Court Hotel! To save at this property, all you have to do is \nsign in\n.\n\n\nThe 19th-century Garden Court Hotel is ...",
    "address": "30-31 Kensington Gardens Square, Notting Hill, Westminster Borough, London, W2 4BG, United Kingdom",
    "lat": "51.51431706",
    "lng": "-0.19066349",
    "features": {
      "Food & Drink": [ "Breakfast in the room" ],
      "Internet": [],
      "Parking": [ "Electric vehicle charging station", "Street parking" ],
      "Front Desk Services": [ "Invoice provided", "..." ],
      "Cleaning Services": [ "Daily housekeeping", "Ironing service" ],
      "Business Facilities": [ "Fax/Photocopying" ],
      "Safety & security": [ "Fire extinguishers", "..." ],
      "General": [ "Shared lounge/TV area", "..." ],
      "Languages Spoken": [ "Arabic", "English", "Spanish", "French", "Romanian" ]
    },
    "id": "102764",
    "url": "https://www.booking.com/hotel/gb/gardencourthotel.html?cur_currency=usd",
    "price": {
      "min_los": 1,
      "days": [
        {
          "b_is_weekend": 0,
          "b_month": "05",
          "b_month_name": "May",
          "b_price_pretty": "USD\u00a0276,69",
          "checkin": "2022-05-20",
          "b_full_year": "2022",
          "b_length_of_stay": 1,
          "b_price": 276.6988213958,
          "b_day": "20",
          "b_avg_price_pretty": "276",
          "b_short_month_name": "May",
          "b_checkout": "2022-05-21",
          "b_weekday": 5,
          "b_avg_price_raw": "276",
          "b_available": 1,
          "b_epoch": 1652997600,
          "b_weekday_name": "Fr",
          "b_min_length_of_stay": 1,
          "b_url_hp": "/hotel/gb/gardencourthotel.html?label=gen173nr-1DEghmcmFnbWVudCiCAjjoB0gzWARo3QGIAQGYATG4ARfIAQzYAQPoAQH4AQOIAgGoAgO4AoLPnZQGwAIB0gIkNzllMjNlMDItMjRlNC00M2U0LTk0YzYtY2JlNDlkMjA5NzI52AIE4AIB&sid=3af1cb864972f2e88ef99b900927c6f1&checkin=2022-05-20&checkout=2022-05-21&room1=A%2CA%2C&#maxotel_rooms",
          "b_checkin": "2022-05-20"
        },
        ...
We can see that this request returns price and availability data for each day of the calendar range we've specified.
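As a small usage example, here's how we might flatten this calendar into simple date/price rows - a sketch that assumes only the response fields shown above:

def summarize_prices(hotel: dict) -> list:
    """flatten the pricing calendar into (date, price, availability) rows"""
    return [
        {
            "date": day["checkin"],
            "price": day["b_price"],
            "available": bool(day["b_available"]),
        }
        for day in hotel["price"]["days"]
    ]

# e.g. summarize_prices(hotels[0])[0]
# {'date': '2022-05-20', 'price': 276.6988213958, 'available': True}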
To scrape booking.com hotel reviews, let's take a look at what happens when we explore the reviews page. Let's click the 2nd page and observe what happens in our browser's web inspector (F12 in major browsers):
We can see that a request to the reviewlist.html endpoint is made, which returns an HTML page of 10 reviews. We can easily replicate this in our scraper:
import asyncio
from typing import List
from urllib.parse import urlencode

from parsel import Selector


def parse_reviews(html: str) -> List[dict]:
    """parse review page for review data"""
    sel = Selector(text=html)
    parsed = []
    for review_box in sel.css('.review_list_new_item_block'):
        get_css = lambda css: review_box.css(css).get("").strip()
        parsed.append({
            "id": review_box.xpath('@data-review-url').get(),
            "score": get_css('.bui-review-score__badge::text'),
            "title": get_css('.c-review-block__title::text'),
            "date": get_css('.c-review-block__date::text'),
            "user_name": get_css('.bui-avatar-block__title::text'),
            "user_country": get_css('.bui-avatar-block__subtitle::text'),
            "text": ''.join(review_box.css('.c-review__body ::text').getall()),
            "lang": review_box.css('.c-review__body::attr(lang)').get(),
        })
    return parsed


async def scrape_reviews(hotel_id: str, session) -> List[dict]:
    """scrape all reviews of a hotel"""

    async def scrape_page(page, page_size=25):  # 25 is the largest possible page size for this endpoint
        url = "https://www.booking.com/reviewlist.html?" + urlencode(
            {
                "type": "total",
                # we can configure language preference:
                "lang": "en-us",
                # we can configure sorting order here, in this case recent reviews are first:
                "sort": "f_recent_desc",
                "cc1": "gb",
                "dist": 1,
                "pagename": hotel_id,
                "rows": page_size,
                # offset is the number of reviews to skip, so page 1 starts at 0:
                "offset": (page - 1) * page_size,
            }
        )
        return await session.get(url)

    first_page = await scrape_page(1)
    total_pages = Selector(text=first_page.text).css(".bui-pagination__link::attr(data-page-number)").getall()
    total_pages = max(int(page) for page in total_pages)
    other_pages = await asyncio.gather(*[scrape_page(i) for i in range(2, total_pages + 1)])
    results = []
    for response in [first_page, *other_pages]:
        results.extend(parse_reviews(response.text))
    return results
In our scraper code above, we're using what we learned before: we collect the first page to extract the total number of pages and then scrape the rest of the pages concurrently. Another thing to note is that we can adjust the default URL parameters a bit to our preference. Above, we use a page size of 25 instead of the default 10, meaning we have to perform fewer requests to retrieve all reviews.
Finally - our scraper can discover hotels, extract hotel preview data and then scrape each hotel page for hotel information, pricing data and reviews!
However, to adopt this scraper at scale we need one final thing - web scraper blocking avoidance.
Scraping Booking.com seems easy, though unfortunately, when scraping at scale we'll quickly get blocked or served captchas, which will prevent us from accessing the hotel and search data.
To get around this, let's take advantage of the ScrapFly API, which can avoid all of these blocks for us and offers several powerful features that'll help us get around booking.com's blocking.
For this, we'll be using the scrapfly-sdk python package and ScrapFly's anti scraping protection bypass feature. First, let's install scrapfly-sdk using pip:
$ pip install scrapfly-sdk
To take advantage of ScrapFly's API in our booking.com web scraper, all we need to do is replace httpx requests with scrapfly-sdk requests.
Let's take a look at how our full scraper code would look with ScrapFly integration:
import asyncio
from collections import defaultdict
import json
from pathlib import Path
import re
from typing import List, Optional
from urllib.parse import urlencode

from scrapfly import ScrapflyClient, ScrapeConfig, ScrapeApiResponse
from parsel import Selector

scrapfly = ScrapflyClient(key="YOUR SCRAPFLY KEY")


async def request_hotels_page(
    query,
    checkin: str = "",
    checkout: str = "",
    number_of_rooms=1,
    offset: int = 0,
):
    """scrapes a single hotel search page of booking.com"""
    checkin_year, checkin_month, checkin_day = checkin.split("-") if checkin else ("", "", "")
    checkout_year, checkout_month, checkout_day = checkout.split("-") if checkout else ("", "", "")
    url = "https://www.booking.com/searchresults.html"
    url += "?" + urlencode(
        {
            "ss": query,
            "checkin_year": checkin_year,
            "checkin_month": checkin_month,
            "checkin_monthday": checkin_day,
            "checkout_year": checkout_year,
            "checkout_month": checkout_month,
            "checkout_monthday": checkout_day,
            "no_rooms": number_of_rooms,
            "offset": offset,
        }
    )
    return await scrapfly.async_scrape(ScrapeConfig(url, country="US"))


def parse_search_total_results(html: str):
    sel = Selector(text=html)
    # parse total amount of pages from heading1 text:
    # e.g. "London: 1,232 properties found"
    total_results = int(sel.css("h1").re(r"([\d,]+) properties found")[0].replace(",", ""))
    return total_results


def parse_search_hotels(html: str):
    sel = Selector(text=html)
    hotel_previews = {}
    for hotel_box in sel.xpath('//div[@data-testid="property-card"]'):
        url = hotel_box.xpath('.//h3/a[@data-testid="title-link"]/@href').get("").split("?")[0]
        hotel_previews[url] = {
            "name": hotel_box.xpath('.//h3/a[@data-testid="title-link"]/div/text()').get(""),
            "location": hotel_box.xpath('.//span[@data-testid="address"]/text()').get(""),
            "score": hotel_box.xpath('.//div[@data-testid="review-score"]/div/text()').get(""),
            "review_count": hotel_box.xpath('.//div[@data-testid="review-score"]/div[2]/div[2]/text()').get(""),
            "stars": len(hotel_box.xpath('.//div[@data-testid="rating-stars"]/span').getall()),
            "image": hotel_box.xpath('.//img[@data-testid="image"]/@src').get(),
        }
    return hotel_previews


async def scrape_search(
    query,
    checkin: str = "",
    checkout: str = "",
    number_of_rooms=1,
    max_results: Optional[int] = None,
):
    first_page = await request_hotels_page(
        query=query, checkin=checkin, checkout=checkout, number_of_rooms=number_of_rooms
    )
    hotel_previews = parse_search_hotels(first_page.content)
    total_results = parse_search_total_results(first_page.content)
    if max_results and total_results > max_results:
        total_results = max_results
    other_pages = await asyncio.gather(
        *[
            request_hotels_page(
                query=query,
                checkin=checkin,
                checkout=checkout,
                number_of_rooms=number_of_rooms,
                offset=offset,
            )
            for offset in range(25, total_results, 25)
        ]
    )
    for result in other_pages:
        hotel_previews.update(parse_search_hotels(result.content))
    return hotel_previews


def parse_hotel(html: str):
    sel = Selector(text=html)
    css = lambda selector, sep="": sep.join(sel.css(selector).getall()).strip()
    css_first = lambda selector: sel.css(selector).get("")
    lat, lng = css_first(".show_map_hp_link::attr(data-atlas-latlng)").split(",")
    features = defaultdict(list)
    for feat_box in sel.css("[data-capla-component*=FacilitiesBlock]>div>div>div"):
        type_ = feat_box.xpath('.//span[contains(@data-testid, "facility-group-icon")]/../text()').get()
        feats = [f.strip() for f in feat_box.css("li ::text").getall() if f.strip()]
        features[type_] = feats
    data = {
        "title": css("h2#hp_hotel_name::text"),
        "description": css("div#property_description_content ::text", "\n"),
        "address": css(".hp_address_subtitle::text"),
        "lat": lat,
        "lng": lng,
        "features": dict(features),
        "id": re.findall(r"b_hotel_id:\s*'(.+?)'", html)[0],
    }
    return data


async def scrape_hotels(urls: List[str], price_start_dt: str, price_n_days=30):
    async def scrape_hotel(url: str):
        result = await scrapfly.async_scrape(
            ScrapeConfig(
                url,
                session=url.split("/")[-1].split(".")[0],
                country="US",
            )
        )
        hotel = parse_hotel(result.content)
        hotel["url"] = result.context["url"]
        csrf_token = re.findall(r"b_csrf_token:\s*'(.+?)'", result.content)[0]
        hotel["price"] = await scrape_prices(csrf_token=csrf_token, hotel_id=hotel["id"], hotel_url=url)
        return hotel

    async def scrape_prices(hotel_id, csrf_token, hotel_url):
        data = {
            "name": "hotel.availability_calendar",
            "result_format": "price_histogram",
            "hotel_id": hotel_id,
            "search_config": json.dumps(
                {
                    # we can adjust the pricing configuration here, but this is the default:
                    "b_adults_total": 2,
                    "b_nr_rooms_needed": 1,
                    "b_children_total": 0,
                    "b_children_ages_total": [],
                    "b_is_group_search": 0,
                    "b_pets_total": 0,
                    "b_rooms": [{"b_adults": 2, "b_room_order": 1}],
                }
            ),
            "checkin": price_start_dt,
            "n_days": price_n_days,
            "respect_min_los_restriction": 1,
            "los": 1,
        }
        result = await scrapfly.async_scrape(
            ScrapeConfig(
                url="https://www.booking.com/fragment.json?cur_currency=usd",
                method="POST",
                data=data,
                headers={"X-Booking-CSRF": csrf_token},
                session=hotel_url.split("/")[-1].split(".")[0],
                country="US",
            )
        )
        return json.loads(result.content)["data"]

    hotels = await asyncio.gather(*[scrape_hotel(url) for url in urls])
    return hotels


async def run():
    out = Path(__file__).parent / "results"
    out.mkdir(exist_ok=True)
    result_hotels = await scrape_hotels(
        ["https://www.booking.com/hotel/gb/gardencourthotel.html"],
        price_start_dt="2023-04-20",
        price_n_days=7,
    )
    out.joinpath("hotels.json").write_text(json.dumps(result_hotels, indent=2, ensure_ascii=False))
    result_search = await scrape_search("London", checkin="2023-04-20", checkout="2023-04-27", max_results=100)
    out.joinpath("search.json").write_text(json.dumps(result_search, indent=2, ensure_ascii=False))


if __name__ == "__main__":
    asyncio.run(run())
In the code above, to enable ScrapFly all we had to do is replace the httpx session object with ScrapflyClient! Now, we can scrape the whole of booking.com without worrying about blocking or captchas.
To wrap this guide up let's take a look at some frequently asked questions about web scraping Booking.com:
Yes. Booking hotel data is publicly available; we're not extracting anything personal or private. Scraping booking.com at slow, respectful rates would fall under the ethical scraping definition. See our Is Web Scraping Legal? article for more.
Booking.com automatically chooses a currency based on the web scraper's IP address geographical location. The easiest approach is to use a proxy of a specific location; for example, in ScrapFly we can use the country=US argument in our request to receive USD prices.
Alternatively, we can manually change the currency for our scraping session via a GET request with the selected_currency parameter:
import httpx

with httpx.Client() as session:
    currency = "USD"
    url = f"https://www.booking.com/?change_currency=1;selected_currency={currency};top_currency=1"
    response = session.get(url)
This request will return currency cookies which will be preserved in our session, making any further requests respond with the same currency. Note that this has to be done every time we start a new HTTP session.
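For example, continuing in the same session, a hotel page fetched afterwards should now be priced in USD (the hotel URL here is just our earlier example):

import httpx

with httpx.Client() as session:
    # set the currency preference cookies first:
    session.get("https://www.booking.com/?change_currency=1;selected_currency=USD;top_currency=1")
    # further requests in this session will now respond with USD prices:
    response = session.get("https://www.booking.com/hotel/gb/gardencourthotel.html")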
Like many result paging systems, Booking.com's search returns a limited number of results. In this case, the 1,000 results per query might not be enough to cover some broader queries fully.
The best approach here is to split the query into several smaller queries. For example, instead of searching "London", we can split the search by scraping each of London's neighborhoods, as sketched below.
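Here's a minimal sketch of this approach using the scrape_search() function we wrote earlier - the neighborhood names are just illustrative examples:

# an illustrative (incomplete) list of London neighborhoods:
NEIGHBORHOODS = ["Westminster", "Camden", "Kensington and Chelsea", "Hackney"]

async def scrape_london_by_neighborhood(session):
    all_previews = {}
    for neighborhood in NEIGHBORHOODS:
        # each smaller query stays well below the ~1,000 result cap:
        previews = await scrape_search(f"{neighborhood}, London", session)
        all_previews.update(previews)  # keyed by URL, so overlaps deduplicate
    return all_previews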
In this web scraping tutorial, we built a small Booking.com scraper that uses search to discover hotel listing previews and then scrapes hotel data and pricing information.
For this, we've used Python with the httpx and parsel packages, and to avoid blocking we used ScrapFly's API, which smartly configures every web scraper connection to avoid being blocked. For more on ScrapFly, see our documentation and try it out for free!