How to Scrape BestBuy in 2025
Learn how to scrape BestBuy product, search, review, and sitemap data using Python, directly in JSON.
In this article, we'll explain how to scrape BestBuy, one of the most popular electronics retailers in the United States. We'll scrape different data types from product, search, review, and sitemap pages. Additionally, we'll employ a wide range of web scraping tricks, such as hidden JSON data, hidden APIs, and HTML and XML parsing. So, this guide serves as a comprehensive web scraping introduction!
This tutorial covers popular web scraping techniques for education. Interacting with public servers requires diligence and respect. Scrapfly does not offer legal advice, but these are good general rules to follow in web scraping; for more, you should consult a lawyer.
Web scraping BestBuy can unlock a wealth of data that empowers both businesses and retail buyers in different ways.
For further details, refer to our introduction on web scraping use cases.
To web scrape BestBuy, we'll use Python with a few community libraries:
- httpx: an HTTP client with HTTP/2 support for requesting BestBuy pages.
- parsel: an HTML/XML parsing library with XPath and CSS selector support.
- jmespath: a JSON parsing library for querying nested JSON datasets.
- loguru: a logging library for pretty scraper logs.
Since asyncio comes pre-installed in Python, we'll only have to install the other packages using the following pip command:
pip install httpx parsel jmespath loguru
Scraping sitemaps is an efficient way to discover thousands of organized URLs. They are provided for search engine crawlers to index the pages, which we can use to discover web scraping targets on a website.
BestBuy's sitemaps can be found at bestbuy.com/robots.txt. It's a text file that provides crawling instructions along with the website's sitemap directory:
Sitemap: https://sitemaps.bestbuy.com/sitemaps_discover_learn.xml
Sitemap: https://sitemaps.bestbuy.com/sitemaps_pdp.xml
Sitemap: https://sitemaps.bestbuy.com/sitemaps_promos.xml
Sitemap: https://sitemaps.bestbuy.com/sitemaps_qna.xml
Sitemap: https://sitemaps.bestbuy.com/sitemaps_rnr.xml
Sitemap: https://sitemaps.bestbuy.com/sitemaps_search_plps.xml
Sitemap: https://sitemaps.bestbuy.com/sitemaps_standalone_qa.xml
Sitemap: https://www.bestbuy.com/sitemap.xml
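As a quick aside, the Sitemap entries can also be pulled out of robots.txt programmatically. Here's a minimal sketch using httpx (note that direct requests without browser-like headers may get blocked):

import httpx

# fetch robots.txt and list the Sitemap entries it declares
response = httpx.get("https://www.bestbuy.com/robots.txt")
sitemaps = [
    line.split(":", 1)[1].strip()
    for line in response.text.splitlines()
    if line.lower().startswith("sitemap:")
]
print(sitemaps)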
Each of these sitemaps represents a group of related page URLs listed in an XML file that's compressed into a gzip file to reduce its size:
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap><loc>https://sitemaps.bestbuy.com/sitemaps_pdp.0000.xml.gz</loc><lastmod>2024-03-08T10:16:14.901109+00:00</lastmod></sitemap>
<sitemap><loc>https://sitemaps.bestbuy.com/sitemaps_pdp.0001.xml.gz</loc><lastmod>2024-03-08T10:16:14.901109+00:00</lastmod></sitemap>
</sitemapindex>
The above gz file looks like the following after extraction:
<?xml version='1.0' encoding='utf-8'?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xhtml="http://www.w3.org/1999/xhtml"><url><loc>https://www.bestbuy.com/site/aventon-aventure-step-over-ebike-w-45-mile-max-operating-range-and-28-mph-max-speed-medium-fire-black/6487149.p?skuId=6487149</loc></url>
<url><loc>https://www.bestbuy.com/site/detective-story-1951/34804554.p?skuId=34804554</loc></url>
<url><loc>https://www.bestbuy.com/site/flowers-lp-vinyl/35944053.p?skuId=35944053</loc></url>
<url><loc>https://www.bestbuy.com/site/apple-iphone-15-pro-max-1tb-natural-titanium-verizon/6525500.p?skuId=6525500</loc></url>
<url><loc>https://www.bestbuy.com/site/geeni-dual-outlet-outdoor-wi-fi-smart-plug-gray/6388590.p?skuId=6388590</loc></url>
<url><loc>https://www.bestbuy.com/site/dynasty-the-sixth-season-vol-1-4-discs-dvd/20139655.p?skuId=20139655</loc></url>
To scrape BestBuy's sitemaps, we'll request the compressed XML file, decode it, and parse it for the URLs. For this example, we'll use the promotions sitemap:
import asyncio
import json
import gzip
from typing import List

from httpx import AsyncClient, Response
from parsel import Selector
from loguru import logger as log

# initialize an async httpx client
client = AsyncClient(
    # enable http2
    http2=True,
    # add basic browser-like headers to prevent getting blocked
    headers={
        "Accept-Language": "en-US,en;q=0.9",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, br",
    },
)


def parse_sitemaps(response: Response) -> List[str]:
    """parse links for bestbuy sitemaps"""
    # decode the .gz file
    xml = str(gzip.decompress(response.content), 'utf-8')
    selector = Selector(xml)
    data = []
    for url in selector.xpath("//url/loc/text()"):
        data.append(url.get())
    return data


async def scrape_sitemaps(url: str) -> List[str]:
    """scrape link data from bestbuy sitemaps"""
    response = await client.get(url)
    promo_urls = parse_sitemaps(response)
    log.success(f"scraped {len(promo_urls)} urls from sitemaps")
    return promo_urls
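For the rest of this guide we'll use the ScrapFly SDK; here is the same sitemap scraper implemented with it: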
import asyncio
import json
import gzip
from typing import List

from parsel import Selector
from loguru import logger as log
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

SCRAPFLY = ScrapflyClient(key="Your ScrapFly API key")


def parse_sitemaps(response: ScrapeApiResponse) -> List[str]:
    """parse links for bestbuy sitemaps"""
    # decode the .gz file
    bytes_data = response.scrape_result['content'].getvalue()
    xml = str(gzip.decompress(bytes_data), 'utf-8')
    selector = Selector(xml)
    data = []
    for url in selector.xpath("//url/loc/text()"):
        data.append(url.get())
    return data


async def scrape_sitemaps(url: str) -> List[str]:
    """scrape link data from bestbuy sitemaps"""
    response = await SCRAPFLY.async_scrape(ScrapeConfig(url, country="US"))
    promo_urls = parse_sitemaps(response)
    log.success(f"scraped {len(promo_urls)} urls from sitemaps")
    return promo_urls


async def run():
    promo_urls = await scrape_sitemaps(
        url="https://sitemaps.bestbuy.com/sitemaps_promos.0000.xml.gz"
    )
    # save the data to a JSON file
    with open("promos.json", "w", encoding="utf-8") as file:
        json.dump(promo_urls, file, indent=2, ensure_ascii=False)


if __name__ == "__main__":
    asyncio.run(run())
In the above code, we define an httpx client with common browser headers to minimize the chances of getting blocked. Additionally, we define two functions; let's break them down:
- scrape_sitemaps: requests the sitemap URL using the defined httpx client.
- parse_sitemaps: decodes the gz file into its XML content and then parses the XML for the URLs using an XPath selector.
Here is a sample output of the results we got:
[
"https://www.bestbuy.com/site/promo/4k-capable-memory-cards",
"https://www.bestbuy.com/site/promo/all-total-by-verizon",
"https://www.bestbuy.com/site/promo/shop-featured-intel-evo",
"https://www.bestbuy.com/site/promo/laser-heat-therapy",
"https://www.bestbuy.com/site/promo/save-on-select-grills",
....
]
For further details on scraping and discovering sitemaps, refer to our dedicated guide:
Introduction to scraping and discovering sitemaps. You will learn how to find, navigate, and use Python and JavaScript tools for XML parsing.
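Note that the scraper above parses a urlset file directly. To discover the child .gz files from a sitemap index like sitemaps_pdp.xml shown earlier, the sitemap/loc entries can be parsed the same way. A brief sketch:

import httpx
from parsel import Selector

# fetch the sitemap index and list its child .gz sitemap files
response = httpx.get("https://sitemaps.bestbuy.com/sitemaps_pdp.xml")
gz_urls = Selector(response.text).xpath("//sitemap/loc/text()").getall()
print(gz_urls)
# e.g. ['https://sitemaps.bestbuy.com/sitemaps_pdp.0000.xml.gz', ...]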
Let's start with the first part of our BestBuy scraper code: search pages. Search for any product on the website, like the "macbook" keyword, and you will get a page that looks like the following:
To scrape BestBuy search pages, we'll request the search page URL and then parse the HTML. First, let's start with the parsing logic:
def parse_search(response: ScrapeApiResponse):
    """parse search data from search pages"""
    selector = response.selector
    data = []
    for item in selector.css("#main-results li"):
        name = item.css(".product-title::attr(title)").get()
        link = item.css("a.product-list-item-link::attr(href)").get()
        # price fields are scoped to the item box, not the whole page
        price = (item.css('div.customer-price::text').re(r'\d+\.\d{2}') or [None])[0]
        original_price = (item.css('div.regular-price::text').re(r'\d+\.\d{2}') or [None])[0]
        sku = item.xpath("@data-testid").get()
        _rating_data = item.css(".c-ratings-reviews p::text")
        rating = (_rating_data.re(r"\d+\.*\d*") or [None])[0]
        rating_count = int((_rating_data.re(r'(\d+) reviews') or [0])[0])
        images = item.css("img[data-testid='product-image']::attr(srcset)").getall()
        data.append({
            "name": name,
            "link": "https://www.bestbuy.com" + link if link else None,
            "images": images,
            "sku": sku,
            "price": price,
            "original_price": original_price,
            "rating": rating,
            "rating_count": rating_count,
        })
    if len(data):
        # derive the page count from the total result count and page size
        _total_count = selector.css("div.results-title span:nth-of-type(2)::text").re(r'\d+')[0]
        total_pages = int(_total_count) // len(data)
    else:
        total_pages = 1
    return {"data": data, "total_pages": total_pages}
Here, we define a parse_search function, which does the following:
- Iterate over each product box on the search page and extract its name, link, images, SKU, price, and rating fields.
- Calculate the total number of search pages from the total result count and the number of results per page.
Next, we'll utilize the above parsing logic while sending requests to scrape and crawl the search pages:
import asyncio
import json
from typing import Literal, Optional
from urllib.parse import urlencode

from loguru import logger as log
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

SCRAPFLY = ScrapflyClient(key="Your Scrapfly API key")

BASE_CONFIG = {
    # bypass bestbuy.com web scraping blocking
    "asp": True,
    # set the proxy country to US
    "country": "US",
    "headers": {
        "cookie": "intl_splash=false"
    },
}


def parse_search(response: ScrapeApiResponse):
    """parse search data from search pages"""
    # the same function logic


async def scrape_search(
    search_query: str,
    sort: Optional[Literal["-bestsellingsort", "-Best-Discount"]] = None,
    max_pages=None,
):
    """scrape search data from bestbuy search"""

    def form_search_url(page_number: int):
        """form the search url"""
        base_url = "https://www.bestbuy.com/site/searchpage.jsp?"
        # search parameters (urlencode handles the escaping)
        params = {"st": search_query}
        if page_number > 1:
            params["cp"] = page_number
        if sort:
            params["sp"] = sort
        return base_url + urlencode(params)

    first_page = await SCRAPFLY.async_scrape(
        ScrapeConfig(
            form_search_url(1), render_js=True, rendering_wait=5000, auto_scroll=True,
            wait_for_selector="#main-results li", **BASE_CONFIG
        )
    )
    data = parse_search(first_page)
    search_data = data["data"]
    total_pages = data["total_pages"]

    # get the number of total search pages to scrape
    if max_pages and max_pages < total_pages:
        total_pages = max_pages
    log.info(f"scraping search pagination, {total_pages - 1} more pages")

    # add the remaining pages to a scraping list to scrape them concurrently
    to_scrape = [
        ScrapeConfig(form_search_url(page_number), **BASE_CONFIG)
        for page_number in range(2, total_pages + 1)
    ]
    async for response in SCRAPFLY.concurrent_scrape(to_scrape):
        data = parse_search(response)["data"]
        search_data.extend(data)
    log.success(f"scraped {len(search_data)} products from search pages")
    return search_data


async def run():
    search_data = await scrape_search(
        search_query="macbook",
        max_pages=3
    )
    # save the results to a JSON file
    with open("search.json", "w", encoding="utf-8") as file:
        json.dump(search_data, file, indent=2, ensure_ascii=False)


if __name__ == "__main__":
    asyncio.run(run())
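For reference, the URLs produced by the form_search_url helper follow this pattern (values illustrative):

# page 1: no cp (current page) parameter
"https://www.bestbuy.com/site/searchpage.jsp?st=macbook"
# page 2, sorted by best-selling: st=search term, cp=page, sp=sort
"https://www.bestbuy.com/site/searchpage.jsp?st=macbook&cp=2&sp=-bestsellingsort"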
Let's break down the execution flow of the above scrape_search function:
- Request the first search page and extract its data using the parse_search function.
- Crawl the remaining search pages concurrently, limiting the crawl with the max_pages parameter.
The above BestBuy scraping code will extract product data from three search pages. Here is what the results should look like:
[
{
"name": "MacBook Pro 13.3\" Laptop - Apple M2 chip - 24GB Memory - 1TB SSD (Latest Model) - Silver",
"link": "https://www.bestbuy.com/site/macbook-pro-13-3-laptop-apple-m2-chip-24gb-memory-1tb-ssd-latest-model-silver/6382795.p?skuId=6382795",
"image": "https://pisces.bbystatic.com/image2/BestBuy_US/images/products/6382/6382795_sd.jpg;maxHeight=200;maxWidth=300",
"sku": "6382795",
"model": "MNEX3LL/A",
"price": 1499,
"original_price": 2099,
"save": "28.59%",
"rating": 4.8,
"rating_count": 4,
"is_sold_out": false
},
....
]
The above code can scrape the product data that is visible on the search pages. However, it can be extended with crawling logic to scrape the full details of each product from its respective URL. For further details on crawling while scraping, refer to our dedicated guide.
Take a deep dive into building web crawlers with Python. We'll start by defining the common crawling concepts and challenges. Then, we'll go through a practical example of creating a web crawler for a target website.
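As a rough illustration of that crawling idea, the search results could be fed into the product scraper covered in the next section. A hedged sketch (scrape_search is defined above; scrape_products is defined below, and crawl_search_products is a hypothetical helper):

# sketch: follow product links found on search pages
async def crawl_search_products(query: str, max_pages: int = 1):
    search_data = await scrape_search(search_query=query, max_pages=max_pages)
    product_urls = [item["link"] for item in search_data if item["link"]]
    return await scrape_products(urls=product_urls)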
Let's add support for scraping product pages to our BestBuy scraper. But before we start, let's have a look at what product pages look like. Go to any product page on the website, like this one, and you will get a page similar to this:
Data on product pages is comprehensive, and it's scattered across the page, which makes it challenging to scrape with selectors alone. Instead, we'll scrape it as JSON datasets from script tags. To locate these script tags, follow the below steps:
- Open the browser developer tools by pressing the F12 key.
- Search the page HTML for the XPath selector //script[contains(text(),'productBySkuId')]/text().
After following the above steps, you will find several script tags that include JSON data. Each script tag contains a certain type of data about the product, such as pricing, shipping, reviews, etc. For example, here is what the product specification data looks like:
The above JSON data is the same data found on the page, captured before it gets rendered into the HTML. This is often known as hidden web data.
Learn what hidden data is through some common examples. You will also learn how to scrape it using regular expressions and other clever parsing algorithms.
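As a toy illustration of the concept (assuming a script tag whose payload is a single flat JSON object; the actual extraction below uses a more robust brace-counting approach):

import json
import re

# a simplified, made-up script tag payload for illustration
script = 'self.__next_f.push({"skuId": "6534615", "customerPrice": 1599})'
# naive approach: grab everything between the first "{" and the last "}"
match = re.search(r"\{.*\}", script, re.DOTALL)
if match:
    data = json.loads(match.group())
    print(data["customerPrice"])  # 1599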
To scrape BestBuy product data, we will select the script tags containing the JSON data and parse them:
import json
import asyncio
from typing import Dict, List

from loguru import logger as log
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

SCRAPFLY = ScrapflyClient(key="Your Scrapfly API key")

BASE_CONFIG = {
    # bypass bestbuy.com web scraping blocking
    "asp": True,
    # set the proxy country to US
    "country": "US",
    "headers": {
        "cookie": "intl_splash=false"
    },
}


def extract_json(script: str) -> Dict:
    """extract JSON data from a script tag content"""
    start_index = script.find('.push(')
    brace_start = script.find('{', start_index)
    # find the JSON block by counting matching braces
    brace_count = 0
    for i in range(brace_start, len(script)):
        if script[i] == '{':
            brace_count += 1
        elif script[i] == '}':
            brace_count -= 1
            if brace_count == 0:
                brace_end = i + 1
                break
    raw_json = script[brace_start:brace_end]
    cleaned_json = raw_json.replace("undefined", "null")
    parsed_data = json.loads(cleaned_json)
    return parsed_data


def _extract_nested(data, keys, default=None):
    """safely traverse a chain of nested dictionary keys"""
    for key in keys:
        data = data.get(key, {})
    return data or default


def parse_product(response: ScrapeApiResponse) -> Dict:
    """parse product data from bestbuy product pages"""
    selector = response.selector
    data = {}
    product_info = extract_json(
        selector.xpath("//script[contains(text(),'productBySkuId')]/text()").get()
    )
    product_features = extract_json(
        selector.xpath("//script[contains(text(),'R1eapefmjttrkq')]/text()").get()
    )
    buying_options = extract_json(
        selector.xpath("//script[contains(text(), 'R3vmipefmjttrkqH1')]/text()").get()
    )
    product_faq = extract_json(
        selector.xpath("//script[contains(text(), 'ProductQuestionConnection')]/text()").get()
    )
    data["product-info"] = _extract_nested(product_info, ["rehydrate", ":Rp9efmjttrkq:", "data", "productBySkuId"])
    data["product-features"] = _extract_nested(product_features, ["rehydrate", ":R1eapefmjttrkq:", "data", "productBySkuId", "features"])
    data["buying-options"] = _extract_nested(buying_options, ["rehydrate", ":R3vmipefmjttrkqH1:", "data", "productBySkuId", "buyingOptions"])
    data["product-faq"] = _extract_nested(product_faq, ["rehydrate", ":R1fapefmjttrkq:", "data", "productBySkuId", "questions"])
    return data


async def scrape_products(urls: List[str]) -> List[Dict]:
    """scrape product data from bestbuy product pages"""
    to_scrape = [ScrapeConfig(url, **BASE_CONFIG, render_js=True) for url in urls]
    data = []
    async for response in SCRAPFLY.concurrent_scrape(to_scrape):
        try:
            product_data = parse_product(response)
            data.append(product_data)
        except Exception:
            # some product pages may be expired or unavailable
            log.debug("expired product page")
    log.success(f"scraped {len(data)} products from product pages")
    return data


async def run():
    product_data = await scrape_products(
        urls=[
            "https://www.bestbuy.com/site/apple-macbook-air-13-inch-apple-m4-chip-built-for-apple-intelligence-16gb-memory-256gb-ssd-midnight/6565862.p",
            "https://www.bestbuy.com/site/apple-geek-squad-certified-refurbished-macbook-pro-16-display-intel-core-i7-16gb-memory-amd-radeon-pro-5300m-512gb-ssd-space-gray/6489615.p",
            "https://www.bestbuy.com/site/apple-macbook-pro-14-inch-apple-m4-chip-built-for-apple-intelligence-16gb-memory-512gb-ssd-space-black/6602741.p",
            "https://www.bestbuy.com/site/apple-macbook-pro-14-laptop-m3-pro-chip-18gb-memory-14-core-gpu-512gb-ssd-latest-model-space-black/6534615.p"
        ])
    # save the data to a JSON file
    with open("products.json", "w", encoding="utf-8") as file:
        json.dump(product_data, file, indent=2, ensure_ascii=False)


if __name__ == "__main__":
    asyncio.run(run())
Let's break down the functions we use in the above BestBuy scraping code:
- extract_json: extracts the JSON dataset from a given script tag's content.
- parse_product: parses the script tags containing the product data from the page HTML.
- scrape_products: adds the product page URLs to a scraping list and scrapes them concurrently.
The output is a comprehensive JSON dataset that looks like the following:
[
{
"product-info": {
"__typename": "Product",
"brand": "Apple",
"skuId": "6534615",
"name": {
"__typename": "ProductName",
"short": "Apple - MacBook Pro 14\" Laptop - M3 Pro chip Built for Apple Intelligence - 18GB Memory - 14-core GPU - 512GB SSD - Space Black"
},
"manufacturer": {
"__typename": "Manufacturer",
"modelNumber": "MRX33LL/A"
},
"hierarchy": {
"__typename": "ProductHierarchy",
"bbypres": [
{
"__typename": "ProductHierarchyLink",
"id": "pcmcat247400050001",
"primary": true,
"href": "http://data.bestbuy.com/v2/hierarchy/bbypres/id/pcmcat247400050001",
"categoryDetail": {
"__typename": "CategoryDetail",
"hierarchyId": "bbypres",
"name": "MacBooks",
"seoUrl": "https://www.bestbuy.com/site/all-laptops/macbooks/pcmcat247400050001.c?id=pcmcat247400050001",
"startDate": "2011-07-13T05:00Z",
"template": null,
"broaderTerms": {
"__typename": "HierarchyBroaderTerms",
"primaryLineage": [
{
"__typename": "HierarchyLineage",
"id": "pcmcat138500050001",
"name": "All Laptops",
"seoUrl": "https://www.bestbuy.com/site/laptop-computers/all-laptops/pcmcat138500050001.c?id=pcmcat138500050001",
"sequence": 0,
"startDate": "2007-12-09T06:00Z"
},
....
]
}
}
},
]
},
"esrbRating": null,
"releaseDateDisplayValue": null,
"dotComStreetDate": "2023-11-07T06:00Z",
"inStoreServiceType": null,
"badges": [],
"openBoxCondition": null,
"whatItIs": [
"Laptop Computer",
"MacBook"
],
"specificationGroups": [
{
"__typename": "ProductSpecificationGroup",
"name": "Key Specs",
"specifications": [
{
"__typename": "ProductSpecification",
"definition": null,
"displayName": "Screen Type",
"value": "Retina Display"
},
....
]
}
],
"highlights": {
"__typename": "Highlights",
"entries": [
{
"__typename": "Highlight",
"name": "Processor Model",
"classification": "High",
"description": "The CPU, or central processing unit, is essentially the brain of your computer. The faster your CPU, the faster your computer will run.",
"link": "Why is the processor important?",
"key": "d2c3dcc5-ac5e-411d-9bf1-5344c4ec9cf6",
"classifications": [
{
"__typename": "Classification",
"bullets": [
"Great portability",
"Budget friendly",
"Basic internet tasks"
],
"description": "Works well for very basic Internet tasks, such as casual browsing. Commonly found in the most portable laptops, which tend to have smaller screens and less storage.",
"icon": "https://pisces.bbystatic.com/image2/vector/BestBuy_US/dam/icon-highlight-cpu-budget-dd17b005-a44b-49e9-87d0-661ac5cefa5a.svg",
"key": "14fab14b-e67b-42fd-aefb-e08def5464ee",
"name": "Budget",
"sampleValue": null
}
],
"value": "Apple M3 Pro"
}
],
"typeInfoDefinition": "Laptop_Computers",
"highlightsCollectionId": "0bbc3112-2558-4a49-bc86-71b6de7b47af",
"skuId": "6534615"
},
"operationalAttributes": [
{
"__typename": "ProductOperationalAttribute",
"displayName": "Box_Contents",
"values": [
"14-inch MacBook Pro",
"70W USB-C Power Adapter",
"USB-C to MagSafe 3 Cable (2 m)"
]
},
....
],
"productSelectorId": null
},
"product-features": [
{
"__typename": "ProductFeature",
"description": "SUPERCHARGED BY M3 PRO OR M3 MAX—The Apple M3 Pro chip, with an up to 12-core CPU and up to 18-core GPU using hardware-accelerated ray tracing, delivers amazing.",
"sequence": 0,
"title": null
},
....
],
"buying-options": [
{
"__typename": "InboundBuyingOption",
"type": "New",
"product": {
"__typename": "Product",
"brand": "Apple",
"skuId": "6534615",
"url": {
"__typename": "ProductUrl",
"pdp": "https://www.bestbuy.com/site/apple-macbook-pro-14-laptop-m3-pro-chip-built-for-apple-intelligence-18gb-memory-14-core-gpu-512gb-ssd-space-black/6534615.p?skuId=6534615",
"relativePdp": "/site/apple-macbook-pro-14-laptop-m3-pro-chip-built-for-apple-intelligence-18gb-memory-14-core-gpu-512gb-ssd-space-black/6534615.p?skuId=6534615"
},
"price": {
"__typename": "ItemPrice",
"customerPrice": 1599,
"skuId": "6534615"
},
"fulfillmentOptions": {
"__typename": "FulfillmentOptionsList",
"shippingDetails": [
{
"__typename": "FulfillmentShippingDetail",
"shippingAvailability": [
{
"__typename": "FulfillmentShippingAvailability",
"shippingEligible": false,
"customerLOSGroup": null
}
]
}
],
"ispuDetails": [
{
"__typename": "InStorePickupDetail",
"ispuAvailability": [
{
"__typename": "InStorePickupAvailability",
"pickupEligible": true,
"instoreInventoryAvailable": false,
"quantity": null,
"minPickupInHours": null,
"maxDate": null
}
]
}
],
"buttonStates": [
{
"__typename": "ButtonState",
"buttonState": "SOLD_OUT"
}
]
},
"openBoxCondition": null
},
"description": "New",
"code": null,
"skuId": "6534615"
}
],
"product-faq": {
"__typename": "ProductQuestionConnection",
"results": [
{
"__typename": "ProductQuestion",
"answerCount": 4,
"bazaarvoiceId": "10325575",
"id": "e5dae9a9-45d4-3da6-9c4d-55827db44478",
"isAiGenerated": false,
"negativeFeedbackCount": 0,
"positiveFeedbackCount": 0,
"submissionTime": "2023-11-21T07:42:05.000-06:00",
"text": null,
"title": "does it include apple guarantee?",
"userNickname": "sofia",
"answers": [
{
"__typename": "ProductQuestionAnswer",
"brandImageUrl": null,
"id": "bf7cdf34-e646-3e28-b86e-5cee0139e98d",
"negativeFeedbackCount": 0,
"positiveFeedbackCount": 10,
"submissionTime": "2023-11-21T22:35:34.000-06:00",
"text": "All Apple products come with a 60-day AppleCare warranty. If you have the TotalTech package membership, they threw in 3 years of AppleCare+ for free. If you want to get a TV mounted or have some other use of BestBuy's package, its very much worth it. I got AppleCare+ for free with a recent MacBook Air because I had it from a TV purchase/install. So, you should ask about that to see if it works for your situation.\n\nSide note: I got this MBA before the new MBP M3s were out. I love my fan-less MBA, but i'd of probably paid extra for the MBP M3, if it was an option then.",
"userNickname": "JustinL",
"badges": [
{
"__typename": "ProductQuestionBadge",
"code": "rewardZoneNumberV3",
"description": "My Best Buy members receive promotional considerations or entries into drawings for writing reviews.",
"name": "My Best Buy\\u00ae Member"
},
.....
],
"images": []
}
....
],
"images": []
}
],
"pageInfo": {
"__typename": "ProductQuestionPageInfo",
"page": 1,
"pageSize": 8,
"totalResults": 50
},
"totalResults": 50
}
},
....
]
🙋 Note that the HTML structure of the BestBuy product pages differs based on product type and category. Therefore, the above product parsing logic should be adjusted for other product types.
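Since the dataset is deeply nested, this is where the jmespath package we installed at the start comes in handy. Here's a hedged sketch of flattening a few key fields (the key paths follow the sample output above; products.json is the file saved by the run() function):

import json
import jmespath

# load the scraped product data saved by run()
with open("products.json", "r", encoding="utf-8") as file:
    products = json.load(file)

for product in products:
    # multiselect hash: pull a few fields out of the nested dataset
    summary = jmespath.search(
        '{name: "product-info".name.short,'
        ' sku: "product-info".skuId,'
        ' price: "buying-options"[0].product.price.customerPrice}',
        product,
    )
    print(summary)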
Cool! The above BestBuy scraping code can extract the full details of each product. However, it lacks the product reviews - let's scrape them in the next section!
Reviews on BestBuy can be found on each product page:
The above review data is split into two categories:
- Product ratings: review and rating summaries included in each product's specification, which we scraped earlier from the product page itself.
- User reviews: detailed user reviews of the product, which we'll scrape in this section.
To scrape BestBuy reviews, we'll utilize the hidden reviews API. To locate this API, follow the below steps:
- Open the browser developer tools by pressing the F12 key.
- Select the network tab and filter the requests by Fetch/XHR.
- Browse to a product's reviews so the browser fires the underlying API request.
After following the above steps, you will find the reviews API recorded in the browser:
The API above is called in the background by the browser, and its response is then rendered into HTML. The request can be copied as cURL and imported into HTTP clients like Postman.
Learn how to find hidden APIs, how to scrape them, and some common challenges faced when developing web scrapers for hidden APIs.
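For instance, the reviews API endpoint (the same URL used in the scraper below) can be requested directly. A minimal sketch with httpx, though without proxies and full browser-like headers such direct requests may get blocked:

import httpx

sku = "6565065"  # example product SKU from this guide
url = f"https://www.bestbuy.com/ugc/v2/reviews?page=1&pageSize=20&sku={sku}&sort=MOST_RECENT"
response = httpx.get(url, headers={"User-Agent": "Mozilla/5.0"})
data = response.json()
# "totalPages" and "topics" are the fields parsed by the scraper below
print(data["totalPages"], len(data["topics"]))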
To scrape the product reviews, we'll request the above API and paginate it:
import asyncio
import json
from typing import Dict, List

from loguru import logger as log
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

SCRAPFLY = ScrapflyClient(key="Your ScrapFly API key")


def parse_reviews(response: ScrapeApiResponse) -> Dict:
    """parse review data from the review API responses"""
    data = json.loads(response.scrape_result['content'])
    total_count = data["totalPages"]
    review_data = data["topics"]
    return {"data": review_data, "total_count": total_count}


async def scrape_reviews(skuid: str, max_pages: int = None) -> List[Dict]:
    """scrape review data from the reviews API"""
    first_page = await SCRAPFLY.async_scrape(ScrapeConfig(
        f"https://www.bestbuy.com/ugc/v2/reviews?page=1&pageSize=20&sku={skuid}&sort=MOST_RECENT",
        asp=True, country="US"
    ))
    data = parse_reviews(first_page)
    review_data = data["data"]
    total_count = data["total_count"]

    # get the number of total review pages to scrape
    if max_pages and max_pages < total_count:
        total_count = max_pages
    log.info(f"scraping reviews pagination, {total_count - 1} more pages")

    # add the remaining pages to a scraping list to scrape them concurrently
    to_scrape = [
        ScrapeConfig(
            f"https://www.bestbuy.com/ugc/v2/reviews?page={page_number}&pageSize=20&sku={skuid}&sort=MOST_RECENT",
            asp=True, country="US"
        )
        for page_number in range(2, total_count + 1)
    ]
    async for response in SCRAPFLY.concurrent_scrape(to_scrape):
        data = parse_reviews(response)["data"]
        review_data.extend(data)
    log.success(f"scraped {len(review_data)} reviews from the reviews API")
    return review_data


async def run():
    review_data = await scrape_reviews(
        skuid="6565065",
        max_pages=3
    )
    with open("reviews.json", "w", encoding="utf-8") as file:
        json.dump(review_data, file, indent=2, ensure_ascii=False)


if __name__ == "__main__":
    asyncio.run(run())
The above part of our BestBuy scraper is fairly straightforward. We only use two functions:
- scrape_reviews: requests the reviews API, which accepts a product skuId, sorting option, and page number. It starts by requesting the first page and then adds the remaining API URLs to a scraping list to request them concurrently.
- parse_reviews: parses the JSON response of the reviews API. The response contains various review data types, but the function only parses the user reviews.
Here is a sample output of the above BestBuy scraping code:
[
{
"id": "6b88383f-3830-3c78-915c-d3cf9f16596d",
"topicType": "review",
"rating": 5,
"recommended": true,
"title": "Amazing!",
"text": "An absolutly amazing console very fast and smooth.",
"author": "CocaNoot",
"positiveFeedbackCount": 0,
"negativeFeedbackCount": 0,
"commentCount": 0,
"writeCommentUrl": "/site/reviews/submission/6565065/review/337294210?campaignid=RR_&return=",
"submissionTime": "2024-03-02T10:52:07.000-06:00",
"brandResponses": [],
"badges": [
{
"badgeCode": "Incentivized",
"badgeDescription": "This reviewer received promo considerations or sweepstakes entry for writing a review.",
"badgeName": "Incentivized",
"badgeType": "Custom",
"fileName": null,
"iconText": null,
"iconPath": null,
"index": 90900
},
{
"badgeCode": "VerifiedPurchaser",
"badgeDescription": "We’ve verified that this content was written by people who purchased this item at Best Buy.",
"badgeName": "Verified Purchaser",
"badgeType": "Custom",
"fileName": "badgeContextual-verifiedPurchaser.jpg",
"imageURL": "https://bestbuy.ugc.bazaarvoice.com/static/3545w/badgeContextual-verifiedPurchaser.jpg",
"iconText": "Verified Purchase",
"iconPath": "/ugc-raas/ugc-common-assets/ugc-badge-verified-check.svg",
"index": 100000,
"iconUrl": "https://www.bestbuy.com/~assets/bby/_com/ugc-raas/ugc-common-assets/ugc-badge-verified-check.svg"
},
{
"badgeCode": "rewardZoneNumberV3",
"badgeDescription": "My Best Buy members receive promotional considerations or entries into drawings for writing reviews.",
"badgeName": "My Best Buy\\u00ae Member",
"badgeType": "Custom",
"fileName": "badgeRewardZoneStd.gif",
"imageURL": "https://bestbuy.ugc.bazaarvoice.com/static/3545w/badgeRewardZoneStd.gif",
"iconText": "",
"iconPath": "/ugc-raas/ugc-common-assets/badge-my-bestbuy-core.svg",
"index": 100500,
"iconUrl": "https://www.bestbuy.com/~assets/bby/_com/ugc-raas/ugc-common-assets/badge-my-bestbuy-core.svg"
}
],
"photos": [
{
"photoId": "008b1a1e-ba1b-38ea-b86e-effb7c0ca162",
"caption": null,
"normalUrl": "https://photos-us.bazaarvoice.com/photo/2/cGhvdG86YmVzdGJ1eQ/e79a5ff1-e891-57fa-ae03-e9f52bb4d7c4",
"piscesUrl": "https://pisces.bbystatic.com/image2/BestBuy_US/ugc/photos/thumbnail/8db68b60f7a60bcea8f6cd1470938da9.jpg",
"thumbnailUrl": "https://photos-us.bazaarvoice.com/photo/2/cGhvdG86YmVzdGJ1eQ/bd287ee8-1c8b-52ae-9c12-4a379d7ecb24",
"reviewId": "6b88383f-3830-3c78-915c-d3cf9f16596d"
}
],
"qualityRating": null,
"valueRating": null,
"easeOfUseRating": null,
"daysOfOwnership": 70,
"pros": null,
"cons": null,
"secondaryRatings": [
{
"attribute": "Performance",
"value": 5,
"attributeLabel": "Performance",
"valueLabel": "Excellent"
},
{
"attribute": "StorageCapacity",
"value": 5,
"attributeLabel": "Storage Capacity",
"valueLabel": "Excellent"
},
{
"attribute": "Controller",
"value": 5,
"attributeLabel": "Controller",
"valueLabel": "Excellent"
}
]
},
....
]
With this last feature, our BestBuy scraper is complete. It can scrape sitemaps, search, product, and review data.
We have successfully scraped BestBuy data from various pages. However, attempting to scale up the scraping rate will get our IP address blocked by the website.
ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale.
Here is how we can scrape without getting blocked with ScrapFly. All we have to do is replace the HTTP client with the ScrapFly client, enable the asp parameter, and select a proxy country:
# standard web scraping code
import httpx
from parsel import Selector

response = httpx.get("some bestbuy.com URL")
selector = Selector(response.text)

# in ScrapFly becomes this 👇
from scrapfly import ScrapeConfig, ScrapflyClient

# replaces your HTTP client (httpx in this case)
scrapfly = ScrapflyClient(key="Your ScrapFly API key")

response = scrapfly.scrape(ScrapeConfig(
    url="website URL",
    asp=True,  # enable the anti-scraping protection to bypass blocking
    country="US",  # set the proxy location to a specific country
    render_js=True  # enable JavaScript rendering (like headless browsers) to scrape dynamic content if needed
))

# use the built-in Parsel selector
selector = response.selector
# access the HTML content
html = response.scrape_result['content']
To wrap up this guide on web scraping BestBuy, let's have a look at some frequently asked questions.
Does BestBuy have an API?
Yes, BestBuy offers APIs for developers. We have also scraped review data from hidden BestBuy APIs, and the same approach can be utilized for other data sources on the website.
Are there alternatives to scraping BestBuy?
Yes, other popular e-commerce platforms include Amazon and Walmart. We have covered scraping Amazon and Walmart in previous tutorials. For more guides on similar scraping targets, refer to our #scrapeguide blog tag.
In this guide, we explained how to scrape BestBuy. We went through a step-by-step tutorial on scraping BestBuy with Python across different pages of the website: sitemap, search, product, and review pages.