In this tutorial, we'll take a look at how to scrape Amazon - the biggest e-commerce website in the world!
Amazon contains millions of products and operates in many different countries, making it a great target for public market analytics data.
To scrape Amazon product data, prices and reviews, we'll be using Python with a few community packages and common Amazon web scraping idioms. So, let's dive in!
This tutorial covers popular web scraping techniques for education. Interacting with public servers requires diligence and respect, and here's a good summary of what not to do:
Do not scrape at rates that could damage the website.
Do not scrape data that's not available publicly.
Do not store PII of EU citizens who are protected by GDPR.
Do not repurpose entire public datasets, which can be illegal in some countries.
Scrapfly does not offer legal advice, but these are good general rules to follow in web scraping - for more, you should consult a lawyer.
Why Scrape Amazon.com?
Amazon contains loads of valuable e-commerce data: product details, prices and reviews. It's a leading e-commerce platform in many regions around the world. This makes Amazon's public data ideal for market analytics, business intelligence and many niches of data science.
Amazon is often used by companies to track the performance of their products sold by 3rd party resellers. So, needless to say, there are almost countless ways to make use of this public data! For more on scraping use cases, see our extensive web scraping use case article.
Project Setup
In this tutorial, we'll be using Python and two major community packages:
httpx - HTTP client library which will let us communicate with amazon.com's servers
parsel - HTML parsing library which will help us to parse our web scraped HTML files. In this tutorial, we'll be using a mixture of CSS selectors and XPath selectors to parse HTML - both of which are supported by parsel.
Optionally we'll also use loguru - a pretty logging library that'll help us keep track of what's going on.
These packages can be easily installed via pip install command:
$ pip install httpx parsel loguru
Alternatively, feel free to swap httpx out with any other HTTP client package such as requests, as we'll only need basic HTTP functions, which are almost interchangeable in every library. As for parsel, another great alternative is the beautifulsoup package, or anything else that supports CSS selectors, which is what we'll be using in this tutorial.
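For illustration, here's a rough sketch of what a single fetch-and-parse step could look like with those alternatives - requests for HTTP and beautifulsoup4 for parsing. This snippet is just an illustration and reuses the same CSS selector we'll apply with parsel later in this tutorial:
# a minimal sketch using the alternative packages (pip install requests beautifulsoup4)
import requests
from bs4 import BeautifulSoup

response = requests.get(
    "https://www.amazon.com/s?k=kindle",
    headers={"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"},
)
soup = BeautifulSoup(response.text, "html.parser")
# .select() accepts the same CSS selectors as parsel's .css()
for box in soup.select("div.s-result-item[data-component-type=s-search-result]"):
    title = box.select_one("h2 > a > span")
    print(title.get_text(strip=True) if title else None)
The rest of this tutorial sticks with httpx and parsel.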
Finding Amazon Products
There are several ways of finding products on Amazon, though the most flexible and powerful one is Amazon's search system.
We can see that when we type in our search term, Amazon redirects us to the search page https://www.amazon.com/s?k=<search query>, which we can use in our scraper:
import asyncio

import httpx
from loguru import logger as log
from parsel import Selector


def parse_search(response):
    return []  # we'll fill this in later


async def search(query: str, session: httpx.AsyncClient):
    """Search for amazon products using searchbox"""
    log.info(f"{query}: scraping first page")
    # first, let's scrape the first query page to find out how many pages we have in total:
    first_page = await session.get(f"https://www.amazon.com/s?k={query}&page=1")
    sel = Selector(text=first_page.text)
    _page_numbers = sel.xpath(
        '//a[has-class("s-pagination-item")][not(has-class("s-pagination-separator"))]/text()'
    ).getall()
    total_pages = max(int(number) for number in _page_numbers)

    # now we can scrape the remaining pages concurrently
    log.info(f"{query}: found {total_pages} pages, scraping them concurrently")
    other_pages = await asyncio.gather(
        *[session.get(f"https://www.amazon.com/s?k={query}&page={page}") for page in range(2, total_pages + 1)]
    )

    # parse all of the search pages for product preview data:
    previews = []
    for response in [first_page, *other_pages]:
        previews.extend(parse_search(response))
    log.info(f"{query}: found total of {len(previews)} product previews")
    return previews
Here, in our search function, we collect the first results page of a given query. Then, we find the total number of pages the query contains and scrape the remaining pages concurrently. This is a common pagination scraping idiom for when the total number of pages is known upfront, which lets us take advantage of concurrent web scraping. Next, let's fill in the parse_search function that extracts product previews from each result page:
import asyncio
from typing import List
from urllib.parse import urljoin

import httpx
from loguru import logger as log
from parsel import Selector
from typing_extensions import TypedDict


class ProductPreview(TypedDict):
    """result generated by search scraper"""
    url: str
    title: str
    price: str
    real_price: str
    rating: str
    rating_count: str


def parse_search(resp) -> List[ProductPreview]:
    """Parse search result page for product previews"""
    previews = []
    sel = Selector(text=resp.text)
    # find boxes of each product preview
    product_boxes = sel.css("div.s-result-item[data-component-type=s-search-result]")
    for box in product_boxes:
        url = urljoin(str(resp.url), box.css("h2>a::attr(href)").get()).split("?")[0]
        if "/slredirect/" in url:  # skip ads etc.
            continue
        previews.append(
            {
                "url": url,
                "title": box.css("h2>a>span::text").get(),
                # big price text is the discounted price
                "price": box.css(".a-price[data-a-size=xl] .a-offscreen::text").get(),
                # small price text is the "real" price
                "real_price": box.css(".a-price[data-a-size=b] .a-offscreen::text").get(),
                "rating": (box.css("span[aria-label~=stars]::attr(aria-label)").re(r"(\d+\.*\d*) out") or [None])[0],
                "rating_count": box.xpath(".//div[contains(@data-csa-c-content-id, 'ratings-count')]/span/@aria-label").get(),
            }
        )
    log.debug(f"found {len(previews)} product listings in {resp.url}")
    return previews


async def search(query: str, session: httpx.AsyncClient) -> List[ProductPreview]:
    """Search for amazon products using searchbox"""
    log.info(f"{query}: scraping first page")
    # first, let's scrape the first query page to find out how many pages we have in total:
    first_page = await session.get(f"https://www.amazon.com/s?k={query}&page=1")
    sel = Selector(text=first_page.text)
    _page_numbers = sel.xpath(
        '//a[has-class("s-pagination-item")][not(has-class("s-pagination-separator"))]/text()'
    ).getall()
    total_pages = max(int(number) for number in _page_numbers)

    # now we can scrape the remaining pages concurrently
    log.info(f"{query}: found {total_pages} pages, scraping them concurrently")
    other_pages = await asyncio.gather(
        *[session.get(f"https://www.amazon.com/s?k={query}&page={page}") for page in range(2, total_pages + 1)]
    )

    # parse all of the search pages for product preview data:
    previews = []
    for response in [first_page, *other_pages]:
        previews.extend(parse_search(response))
    log.info(f"{query}: found total of {len(previews)} product previews")
    return previews
We are using parsel's CSS selector feature to select product preview containers and iterate through each one of them. Each container holds the product's preview information, which we can extract with a few relative CSS selectors. Let's run our current Amazon scraper and see the results it generates:
Run code and example output
import asyncio
import json

import httpx

# We need to use browser-like headers for our requests to avoid being blocked
# here we set the headers of a Chrome browser on Windows:
BASE_HEADERS = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US,en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
}


async def run():
    limits = httpx.Limits(max_connections=5)
    async with httpx.AsyncClient(limits=limits, timeout=httpx.Timeout(15.0), headers=BASE_HEADERS) as session:
        data = await search("kindle", session=session)
        print(json.dumps(data, indent=2))


if __name__ == "__main__":
    asyncio.run(run())
[
  {
    "url": "https://www.amazon.com/Kindle-Now-with-Built-in-Front-Light/dp/B07978J597/ref=sr_1_1",
    "title": "Kindle - With a Built-in Front Light - Black",
    "price": "$59.99",
    "real_price": "$89.99",
    "rating": "4.6",
    "rating_count": "36,856"
  },
  {
    "url": "https://www.amazon.com/All-new-Kindle-Paperwhite-adjustable-Ad-Supported/dp/B08KTZ8249/ref=sr_1_2",
    "title": "Kindle Paperwhite (8 GB) \u2013 Now with a 6.8\" display and adjustable warm light",
    "price": "$139.99",
    "real_price": null,
    "rating": "4.7",
    "rating_count": "10,775"
  },
  ...
]
Now that we can effectively find products, let's take a look at how we can scrape the product data itself.
Scraping Amazon Products
To scrape product info we'll retrieve each product's HTML page and parse it using our parsel package. For this, we'll be using parsel's CSS selector feature.
Scraping Product Info
To retrieve product data we mostly just need the ASIN (Amazon Standard Identification Number) code. This unique 10-character identifier is assigned to every product and product variant on Amazon, and we can usually extract it from the product URL itself, e.g. https://www.amazon.com/<product-name>/dp/<ASIN>.
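For example, here's a quick sketch of pulling the ASIN out of a product URL with a simple regular expression (the pattern is an illustration and may need extending for other Amazon URL formats):
import re

def extract_asin(url: str):
    """find the 10-character ASIN code that follows /dp/ in an amazon product URL"""
    match = re.search(r"/dp/([A-Z0-9]{10})", url)
    return match.group(1) if match else None

print(extract_asin("https://www.amazon.com/Kindle-Now-with-Built-in-Front-Light/dp/B07978J597/ref=sr_1_1"))
# B07978J597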
This also means that we can find the URL of any product as long as we know its ASIN code. Let's give it a shot:
import asyncio
import json
import re
from typing import List

import httpx
from parsel import Selector
from typing_extensions import TypedDict


class ProductInfo(TypedDict):
    """type hint for our scraped product result"""
    name: str
    stars: str
    rating_count: str
    features: List[str]
    images: dict


def parse_product(result) -> ProductInfo:
    """parse Amazon's product page (e.g. https://www.amazon.com/dp/B07KR2N2GF) for essential product data"""
    # images are stored in javascript state data found in the html
    # for this we can use a simple regex pattern - the data can be in one of two locations:
    color_images = re.findall(r"colorImages':.*'initial':\s*(\[.+?\])},\n", result.text)
    image_gallery = re.findall(r"imageGalleryData'\s*:\s*(\[.+\]),\n", result.text)
    if color_images:
        images = [img['large'] for img in json.loads(color_images[0])]
    elif image_gallery:
        images = [img['mainUrl'] for img in json.loads(image_gallery[0])]
    else:
        images = []
        print(f"no images found for {result.url}")

    # the other fields can be extracted with simple css selectors
    sel = Selector(text=result.text)
    parsed = {
        "name": sel.css("#productTitle::text").get("").strip(),
        "asin": sel.css("input[name=ASIN]::attr(value)").get("").strip(),
        "style": sel.xpath("//span[@class='selection']/text()").get("").strip(),
        "description": '\n'.join(sel.css("#productDescription p span ::text").getall()).strip(),
        "stars": sel.css("i[data-hook=average-star-rating] ::text").get("").strip(),
        "rating_count": sel.css("span[data-hook=total-review-count] ::text").get("").strip(),
        "features": [value.strip() for value in sel.css("#feature-bullets li ::text").getall()],
        "images": images,
    }
    # extract details from the "Product Information" table:
    info_table = {}
    for row in sel.css('#productDetails_detailBullets_sections1 tr'):
        label = row.css("th::text").get("").strip()
        value = row.css("td::text").get("").strip()
        if not value:
            value = row.css("td span::text").get("").strip()
        info_table[label] = value
    info_table['Customer Reviews'] = sel.xpath("//td[div[@id='averageCustomerReviews']]//span[@class='a-icon-alt']/text()").get()
    rank = sel.xpath("//tr[th[text()=' Best Sellers Rank ']]//td//text()").getall()
    info_table['Best Sellers Rank'] = ' '.join([text.strip() for text in rank if text.strip()])
    parsed['info_table'] = info_table
    return parsed


async def scrape_product(asin: str, session: httpx.AsyncClient) -> ProductInfo:
    print(f"scraping {asin}")
    response = await session.get(f"https://www.amazon.com/dp/{asin}")
    return parse_product(response)
Above, we define our Amazon product scraper that retrieves the product's page from the given ASIN code and parses basic information like name, ratings, etc. Let's run it:
Run code and example output
async def run():
    BASE_HEADERS = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "accept-language": "en-US,en;q=0.9",
        "accept-encoding": "gzip, deflate, br",
    }
    limits = httpx.Limits(max_connections=5)
    async with httpx.AsyncClient(limits=limits, timeout=httpx.Timeout(15.0), headers=BASE_HEADERS) as session:
        data = await scrape_product("B0BCNKKZ91", session=session)
        print(json.dumps(data, indent=2))


if __name__ == "__main__":
    asyncio.run(run())
{
  "name": "PlayStation 5 Console (PS5)",
  "asin": "B0BCNKKZ91",
  "style": "Disc",
  "description": "The PS5 console unleashes new gaming possibilities that you never anticipated. Experience lightning fast loading with an ultra-high speed SSD, deeper immersion with support for haptic feedback, adaptive triggers, and 3D Audio, and an all-new generation of incredible PlayStation games.",
  "stars": "4.8 out of 5 stars",
  "rating_count": "8,300 global ratings",
  "features": [
    "Model Number CFI-1215A01X.",
    "Stunning Games - Marvel at incredible graphics and experience new PS5 features.",
    "Breathtaking Immersion - Discover a deeper gaming experience with support for haptic feedback, adaptive triggers, and 3D Audio technology.",
    "Lightning Speed - Harness the power of a custom CPU, GPU, and SSD with Integrated I/O that rewrite the rules of what a PlayStation console can do."
  ],
  "images": [
    "https://m.media-amazon.com/images/I/31i9Fft3dqL.jpg",
    "https://m.media-amazon.com/images/I/31RcmxBRhdL.jpg",
    "https://m.media-amazon.com/images/I/3104a50oATL.jpg"
  ],
  "info_table": {
    "ASIN": "B0BCNKKZ91",
    "Release date": "October 1, 2022",
    "Customer Reviews": "4.8 out of 5 stars",
    "Best Sellers Rank": "#1,527 in Video Games ( See Top 100 in Video Games ) #7 in PlayStation 5 Consoles",
    "Product Dimensions": "19 x 17 x 8 inches; 9.74 Pounds",
    "Type of item": "Video Game",
    "Language": "Multilingual",
    "Item model number": "CFI-1215",
    "Item Weight": "9.72 pounds",
    "Manufacturer": "Sony Interactive Entertainment",
    "Batteries": "1 Lithium Ion batteries required. (included)",
    "Date First Available": "October 1, 2022"
  }
}
However, this code is missing an essential detail - prices! For that, let's take a look at how Amazon.com prices its products and how we can scrape that information.
Scraping Product Variants and Prices
Every product on Amazon can have multiple variants. For example, let's take a look at this product:
We can see that this product can be customized through several options. Each of these option combinations is represented by its own ASIN identifier. So, if we take a look at the page source and search for the product's identifiers, we can see multiple ASIN codes:
We can see that the variant ASIN codes and descriptions are present in a javascript variable hidden in the HTML source of the page. To be more exact, it's in the dimensionValuesDisplayData field of the dataToReturn variable. We can easily extract this using a small regular expression pattern:
import re

import httpx

product_html = httpx.get("https://www.amazon.com/dp/B07F7TLZF4").text
# this pattern selects the value between curly braces that follows the dimensionValuesDisplayData key:
variant_data = re.findall(r'dimensionValuesDisplayData"\s*:\s* ({.+?}),\n', product_html)
print(variant_data)
Now, we can implement this logic into our scraper by extracting the variant ASIN identifiers and scraping each variant for its price details.
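The variant request and parsing helpers aren't shown here, so below is a minimal sketch of what parse_variant and scrape_variant could look like. Note that the exact price selector is an assumption based on the offscreen price elements we used when parsing search:
def parse_variant(response) -> dict:
    """parse a variant page for its ASIN and displayed price (sketch - price selector is an assumption)"""
    sel = Selector(text=response.text)
    return {
        "asin": sel.css("input[name=ASIN]::attr(value)").get(),
        # amazon duplicates the visible price in an offscreen element for accessibility
        "price": sel.css(".a-price .a-offscreen::text").get(),
    }


async def scrape_variant(asin: str, session: httpx.AsyncClient) -> dict:
    """retrieve a single variant page by its ASIN and parse its pricing details"""
    response = await session.get(f"https://www.amazon.com/dp/{asin}")
    return parse_variant(response)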
With these helpers in place, we can extract each product variant's price. Let's extend our product scraper function with the variant scraping logic:
async def scrape_product(asin: str, session: httpx.AsyncClient, reviews=True, pricing=True) -> dict:
    log.info(f"scraping {asin}")
    response_product = await session.get(f"https://www.amazon.com/dp/{asin}")
    # parse the current page as the first variant
    variants = [parse_variant(response_product)]

    # if the product has more variants - we want to scrape all of them
    _variation_data = re.findall(r'dimensionValuesDisplayData"\s*:\s* ({.+?}),\n', response_product.text)
    if _variation_data:
        variant_asins = list(json.loads(_variation_data[0]))
        log.info(f"scraping {len(variant_asins)} variants: {variant_asins}")
        variants.extend(await asyncio.gather(*[scrape_variant(asin, session) for asin in variant_asins]))

    return {
        "info": parse_product(response_product),
        "variants": variants,
    }
The interesting thing to note here is that not every product has multiple variants, but every product has at least one variant. Let's take this scraper for a spin:
Run code and example output
async def run():
    limits = httpx.Limits(max_connections=5)
    async with httpx.AsyncClient(limits=limits, timeout=httpx.Timeout(15.0), headers=BASE_HEADERS) as session:
        data = await scrape_product("B07L5G6M1Q", session=session, reviews=True)
        print(json.dumps(data, indent=2))


if __name__ == "__main__":
    asyncio.run(run())
{
  "info": {
    "name": "Kindle Oasis \u2013 Now with adjustable warm light - Without Lockscreen Ads",
    "stars": "4.6 out of 5 stars",
    "rating_count": "19,779 global ratings",
    "features": [
      " Our best 7\", 300 ppi flush-front Paperwhite display. ",
      " Adjustable warm light to shift screen shade from white to amber. ",
      " Waterproof (IPX8) so you can read in the bath or by the pool. Your Kindle has been tested to withstand accidental immersion in water. ",
      " Thin and light ergonomic design with page turn buttons. ",
      " Reads like real paper with the latest e-ink technology for fast page turns. ",
      " Instant access to millions of books, newspapers, and audiobooks. ",
      " Works with Audible - pair with Bluetooth headphones or speakers to switch seamlessly between reading and listening. "
    ],
    "images": [
      {
        "hiRes": "https://m.media-amazon.com/images/I/614TlIaYBvL._AC_SL1000_.jpg",
        "thumb": "https://m.media-amazon.com/images/I/41XunSGTqvL._AC_US40_.jpg",
        "large": "https://m.media-amazon.com/images/I/41XunSGTqvL._AC_.jpg",
        ...
      },
      ...
    ]
  },
  "variants": [
    {
      "asin": "B07L5G6M1Q",
      "price": "$299.99"
    },
    {
      "asin": "B07F7TLZF4",
      "price": "$249.99"
    },
    ...
  ]
}
We can see that our scraper now generates the product information as well as a list of variant data points, each containing its own ASIN identifier and price.
The only details we're missing now are the product reviews, so next, let's take a look at how to scrape Amazon product reviews.
Scraping Amazon Reviews
To scrape Amazon product reviews, let's take a look at where we can find them. If we scroll to the bottom of the page, we can see a link that says "See All Reviews", and clicking it takes us to a new location that follows the URL format https://www.amazon.com/product-reviews/<ASIN>/.
We can see that just like for product information, all we need is the ASIN identifier to find the review page of a product. Let's add this logic to our scraper:
import asyncio
import math
from typing import List
from urllib.parse import urljoin

import httpx
from loguru import logger as log
from parsel import Selector
from typing_extensions import TypedDict


class ReviewData(TypedDict):
    """storage type hint for amazon's review object"""
    title: str
    text: str
    location_and_date: str
    verified: bool
    rating: float


def parse_reviews(response) -> List[ReviewData]:
    """parse reviews from a single review page"""
    sel = Selector(text=response.text)
    review_boxes = sel.css("#cm_cr-review_list div.review")
    parsed = []
    for box in review_boxes:
        parsed.append({
            "text": "".join(box.css("span[data-hook=review-body] ::text").getall()).strip(),
            "title": box.css("*[data-hook=review-title]>span::text").get(),
            "location_and_date": box.css("span[data-hook=review-date] ::text").get(),
            "verified": bool(box.css("span[data-hook=avp-badge] ::text").get()),
            "rating": box.css("*[data-hook*=review-star-rating] ::text").re(r"(\d+\.*\d*) out")[0],
        })
    return parsed


async def scrape_reviews(asin, session: httpx.AsyncClient) -> List[ReviewData]:
    """scrape all reviews of a given ASIN of an amazon product"""
    url = f"https://www.amazon.com/product-reviews/{asin}/"
    log.info(f"scraping review page: {url}")
    # find the first page
    first_page = await session.get(url)
    sel = Selector(text=first_page.text)

    # find the total amount of pages (each page holds up to 10 reviews)
    total_reviews = sel.css("div[data-hook=cr-filter-info-review-rating-count] ::text").re(r"(\d+,*\d*)")[1]
    total_reviews = int(total_reviews.replace(",", ""))
    total_pages = int(math.ceil(total_reviews / 10.0))

    log.info(f"found total {total_reviews} reviews across {total_pages} pages -> scraping")
    _next_page = sel.css(".a-pagination .a-last>a::attr(href)").get()
    if _next_page:
        _next_page = urljoin(url, _next_page)
        next_page_urls = [_next_page.replace("pageNumber=2", f"pageNumber={i}") for i in range(2, total_pages + 1)]
        assert len(set(next_page_urls)) == len(next_page_urls)
        other_pages = await asyncio.gather(*[session.get(url) for url in next_page_urls])
    else:
        other_pages = []

    # parse all of the review pages
    reviews = []
    for response in [first_page, *other_pages]:
        reviews.extend(parse_reviews(response))
    log.info(f"scraped total {len(reviews)} reviews")
    return reviews
In the above scraper, we put together everything we've learned in this tutorial:
To scrape the pagination we use the same technique we used when scraping search: scrape the first page, find the total number of pages and scrape the rest concurrently.
To parse the reviews we also use the same technique we used when parsing search: iterate through each box containing a review and parse the data using CSS selectors.
Let's run this scraper and see what output it generates:
Run code and example output
async def run():
    limits = httpx.Limits(max_connections=5)
    async with httpx.AsyncClient(limits=limits, timeout=httpx.Timeout(15.0), headers=BASE_HEADERS) as session:
        data = await scrape_reviews("B07L5G6M1Q", session=session)
        print(json.dumps(data, indent=2))


if __name__ == "__main__":
    asyncio.run(run())
[
  {
    "text": "I have the previous generation oasis as well (on the left side in the pics) and wanted this one for reading at night. Overall there's not many differences between the two, so if the light tone customizability isn't important to you I wouldn't particularly recommend this one over the 9th gen. However, the lighting is noticeably more even with the 10th gen (my older one visibly fades from one side to the other) and there's a ton of variability in the tone of the screen. Overall, for me it was worth it, but your mileage may vary if you don't read in a dark room (so as not to wake the spouse) before bed very often.",
    "title": "Loving it so far",
    "location_and_date": "Reviewed in the United States on July 29, 2019",
    "verified": true,
    "rating": "5.0"
  },
  {
    "text": "So I've been using a Kindle Paperwhite since 2014 and absolutely loved it. Despite it being five years old, it still worked great and has been a pleasure as a reading device. ",
    "title": "From 2014 Paperwhite to 2019 Oasis",
    "location_and_date": "Reviewed in the United States on August 9, 2019",
    "verified": true,
    "rating": "3.0"
  },
  ...
]
By this point, we've learned how to find products on Amazon and scrape their description, pricing and review data. However, to scrape Amazon at scale we have to protect our scraper from blocking - let's see how we can do that using ScrapFly's web scraping API!
Bypass Blocking and Captchas with ScrapFly
We looked at how to scrape Amazon.com, though unfortunately, when scraping at scale it's very likely that Amazon will start to either block us or serve us captchas, which will hinder or completely disable our web scraper.
For this, we'll be using the scrapfly-sdk python package. First, let's install scrapfly-sdk using pip:
$ pip install scrapfly-sdk
To take advantage of ScrapFly's API in our Amazon web scraper, all we need to do is replace our httpx session code with scrapfly-sdk client requests. For more, see the latest scraper code on github.
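For illustration, a single product page request through the SDK could look roughly like this (a minimal sketch; the asp option enables anti scraping protection bypass, and "YOUR SCRAPFLY KEY" is a placeholder for your own API key):
from parsel import Selector
from scrapfly import ScrapflyClient, ScrapeConfig

scrapfly = ScrapflyClient(key="YOUR SCRAPFLY KEY")
result = scrapfly.scrape(ScrapeConfig(
    url="https://www.amazon.com/dp/B07L5G6M1Q",
    asp=True,      # enable anti scraping protection bypass
    country="US",  # route the request through a US proxy for amazon.com
))
# the returned HTML can be parsed just like the httpx responses in this tutorial
sel = Selector(text=result.content)
print(sel.css("#productTitle::text").get("").strip())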
To wrap this guide up let's take a look at some frequently asked questions about web scraping Amazon:
Is it legal to scrape Amazon.com?
Yes. Amazon's data is publicly available, and we're not extracting anything personal or private. Scraping Amazon.com at slow, respectful rates falls under the ethical scraping definition.
That being said, attention should be paid to GDPR compliance in the EU when scraping personal data, such as reviewers' personal details in the reviews section. For more, see our Is Web Scraping Legal? article.
How to crawl Amazon.com?
It's easy to crawl Amazon products because of the extensive related-product and recommendation systems featured on every page. In other words, we can write a crawler that takes a seed of Amazon product URLs, scrapes them, extracts more product URLs from the related product sections and keeps repeating this loop, as sketched below. For more on crawling see How to Crawl the Web with Python.
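As a rough illustration, such a crawler loop could be sketched like this (the link extraction is deliberately naive - any /dp/ link on the page is treated as a related product, which is an assumption you'd want to refine):
import re
from urllib.parse import urljoin

import httpx
from parsel import Selector


def crawl_products(seed_urls, headers, max_products=100):
    """breadth-first crawl of amazon product pages starting from a seed of product URLs (sketch)"""
    queue = list(seed_urls)
    seen = set()
    responses = []
    with httpx.Client(headers=headers, follow_redirects=True) as session:
        while queue and len(responses) < max_products:
            url = queue.pop(0)
            asin = re.search(r"/dp/([A-Z0-9]{10})", url)
            if not asin or asin.group(1) in seen:
                continue
            seen.add(asin.group(1))
            response = session.get(url)
            responses.append(response)  # each response can be fed into parse_product() from this tutorial
            # queue up more product URLs found on the page (related products, recommendations etc.)
            sel = Selector(text=response.text)
            for href in sel.css("a::attr(href)").getall():
                if "/dp/" in href:
                    queue.append(urljoin(str(response.url), href).split("?")[0])
    return responses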
Summary
In this tutorial, we built an Amazon product scraper by understanding how the website functions so we could replicate its functionality in our web scraper. First, we replicated the search function to find products, then we scraped product information and variant data and finally, we scraped all of the reviews of each product.
We can see that web scraping Amazon in Python is pretty easy thanks to brilliant community tools like httpx and parsel.
To prevent being blocked, we used ScrapFly's API which smartly configures every web scraper connection to avoid being blocked. For more on ScrapFly see our documentation and try it out for free!