How to Scrape Amazon.com Product Data and Reviews


In this tutorial, we'll look at how to scrape Amazon - the biggest e-commerce website in the world!

Amazon contains millions of products and operates in many different countries, making it a great target for public market analytics data.
To scrape Amazon product data, prices and reviews, we'll be using Python with a few community packages and common Amazon web scraping idioms. So, let's dive in!

Latest Amazon.com Scraper Code

https://github.com/scrapfly/scrapfly-scrapers/

Why Scrape Amazon.com?

Amazon contains loads of valuable e-commerce data: product details, prices and reviews. It's a leading e-commerce platform in many regions around the world. This makes Amazon's public data ideal for market analytics, business intelligence and many niches of data science.

Amazon is often used by companies to track the performance of their products sold by 3rd party resellers. So, needless to say, there are almost countless ways to make use of this public data! For more on scraping use cases, see our extensive web scraping use case article.

Project Setup

In this tutorial, we'll be using Python and two major community packages:

  • httpx - HTTP client library which will let us communicate with amazon.com's servers
  • parsel - HTML parsing library which will help us to parse our web scraped HTML files. In this tutorial, we'll be using a mixture of CSS selectors and XPath selectors to parse HTML - both of which are supported by parsel.

Optionally we'll also use loguru - a pretty logging library that'll help us keep track of what's going on.

These packages can be easily installed via the pip install command:

$ pip install httpx parsel loguru

Alternatively, feel free to swap httpx out for any other HTTP client package such as requests, as we'll only need basic HTTP functions, which are almost interchangeable in every library. As for parsel, another great alternative is the beautifulsoup package or anything else that supports CSS selectors, which is what we'll be using in this tutorial.
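If you haven't used parsel before, here's a minimal, self-contained sketch of the selector API we'll rely on throughout this tutorial (the HTML snippet is made up for illustration):

from parsel import Selector

# a tiny, made-up HTML snippet to demonstrate the API
html = '<div class="product"><h2><a href="/dp/B000000000">Example product</a></h2></div>'
sel = Selector(text=html)
# CSS selectors with parsel's ::text and ::attr() extensions:
print(sel.css("div.product h2>a::text").get())        # Example product
print(sel.css("div.product h2>a::attr(href)").get())  # /dp/B000000000
# the equivalent XPath query:
print(sel.xpath('//div[@class="product"]//a/@href').get())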

Parsing HTML with CSS Selectors

If you're new to CSS selectors, check out our complete and interactive introduction article that goes over essential CSS selector syntax and common usage in web scraping.


Finding Amazon Products

There are several ways of finding products on Amazon, though the most flexible and powerful one is Amazon's search system.


We can see that when we type our search term, Amazon redirects us to the search page https://www.amazon.com/s?k=<search query>, which we can use in our scraper:

import asyncio
import httpx
from parsel import Selector
from loguru import logger as log

def parse_search(response):
    pass  # we'll fill this in later

async def search(query:str, session: httpx.AsyncClient):
    """Search for amazon products using searchbox"""
    log.info(f"{query}: scraping first page")

    # first, let's scrape first query page to find out how many pages we have in total:
    first_page = await session.get(f"https://www.amazon.com/s?k={query}&page=1")
    sel = Selector(text=first_page.text)
    _page_numbers = sel.xpath(
        '//a[has-class("s-pagination-item")][not(has-class("s-pagination-separator"))]/text()'
    ).getall()
    # single-page results have no pagination links, so default to 1 page
    total_pages = max(int(number) for number in _page_numbers) if _page_numbers else 1

    # now we can scrape remaining pages concurrently
    log.info(f"{query}: found {total_pages}, scraping them concurrently")
    other_pages = await asyncio.gather(
        *[session.get(f"https://www.amazon.com/s?k={query}&page={page}") for page in range(2, total_pages + 1)]
    )

    # parse all of search pages for product preview data:
    previews = []
    for response in [first_page, *other_pages]:
        previews.extend(parse_search(response))

    log.info(f"{query}: found total of {len(previews)} product previews")
    return previews

Here, in our search function, we collect the first results page of a given query. Then, we find how many pages the query contains and scrape the remaining pages concurrently. This is a common pagination scraping idiom for when we can determine the total number of pages upfront, which allows us to take advantage of concurrent web scraping.

illustration of pagination scraping idiom
efficient pagination scraping: get total results from first page and then scrape the rest of the pages together!

Further, let's parse our collected search page HTMLs for product preview data:

from typing import List, Optional
from typing_extensions import TypedDict
from urllib.parse import urljoin
from parsel import Selector
from loguru import logger as log

class ProductPreview(TypedDict):
    """result generated by search scraper"""
    url: str
    title: str
    price: str
    real_price: str
    rating: str
    rating_count: str


def parse_search(resp) -> List[ProductPreview]:
    """Parse search result page for product previews"""
    previews = []
    sel = Selector(text=resp.text)
    # find boxes of each product preview
    product_boxes = sel.css("div.s-result-item[data-component-type=s-search-result]")
    for box in product_boxes:
        url = urljoin(str(resp.url), box.css("h2>a::attr(href)").get()).split("?")[0]
        if "/slredirect/" in url:  # skip ads etc.
            continue
        previews.append(
            {
                "url": url,
                "title": box.css("h2>a>span::text").get(),
                # big price text is discounted price
                "price": box.css(".a-price[data-a-size=xl] .a-offscreen::text").get(),
                # small price text is "real" price
                "real_price": box.css(".a-price[data-a-size=b] .a-offscreen::text").get(),
                "rating": (box.css("span[aria-label~=stars]::attr(aria-label)").re(r"(\d+\.*\d*) out") or [None])[0],
                "rating_count": box.xpath("//div[contains(@data-csa-c-content-id, 'ratings-count')]/span/@aria-label").get()
            }
        )
    log.debug(f"found {len(previews)} product listings in {resp.url}")
    return previews

async def search(query, session):
    """Search for amazon products using searchbox"""
    log.info(f"{query}: scraping first page")

    # first, let's scrape first query page to find out how many pages we have in total:
    first_page = await session.get(f"https://www.amazon.com/s?k={query}&page=1")
    sel = Selector(text=first_page.text)
    _page_numbers = sel.xpath('//a[has-class("s-pagination-item")][not(has-class("s-pagination-separator"))]/text()').getall()
    # single-page results have no pagination links, so default to 1 page
    total_pages = max(int(number) for number in _page_numbers) if _page_numbers else 1

    # now we can scrape remaining pages concurrently
    log.info(f"{query}: found {total_pages}, scraping them concurrently")
    other_pages = await asyncio.gather(
        *[session.get(f"https://www.amazon.com/s?k={query}&page={page}") for page in range(2, total_pages + 1)]
    )

    # parse all of search pages for product preview data:
    previews = []
    for response in [first_page, *other_pages]:
        previews.extend(parse_search(response))

    log.info(f"{query}: found total of {len(previews)} product previews")
    return previews

We are using parsel's CSS selector feature to select product preview containers and iterate through each one of them:

illustration of amazon's search
We can see that each product is contained within its own box that we can capture

Each container contains preview information of the product that we can extract using a few relative CSS selectors. Let's run our current Amazon scraper and see the results it generates:

Run code and example output
import httpx
import json
import asyncio

# We need to use browser-like headers for our requests to avoid being blocked
# here we set headers of Chrome browser on Windows:
BASE_HEADERS = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US,en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
}

async def run():
    limits = httpx.Limits(max_connections=5)
    async with httpx.AsyncClient(limits=limits, timeout=httpx.Timeout(15.0), headers=BASE_HEADERS) as session:
        data = await search("kindle", session=session)
        print(json.dumps(data, indent=2))

if __name__ == "__main__":
    asyncio.run(run())
[
  {
    "url": "https://www.amazon.com/Kindle-Now-with-Built-in-Front-Light/dp/B07978J597/ref=sr_1_1",
    "title": "Kindle - With a Built-in Front Light - Black",
    "price": "$59.99",
    "real_price": "$89.99",
    "rating": "4.6",
    "rating_count": "36,856"
  },
  {
    "url": "https://www.amazon.com/All-new-Kindle-Paperwhite-adjustable-Ad-Supported/dp/B08KTZ8249/ref=sr_1_2",
    "title": "Kindle Paperwhite (8 GB) \u2013 Now with a 6.8\" display and adjustable warm light",
    "price": "$139.99",
    "real_price": null,
    "rating": "4.7",
    "rating_count": "10,775"
  },
  ...
]

Now that we can effectively find products, let's take a look at how we can scrape the product data itself.

Scraping Amazon Products

To scrape product info we'll retrieve each product's HTML page and parse it using our parsel package. For this, we'll be using parsel's CSS selector feature.

Scraping Product Info

To retrieve product data we mostly just need the ASIN (Amazon Standard Identification Number) code. This unique 10-character identifier is assigned to every product and product variant on Amazon. We can usually extract it from the product URL:

illustration of amazon's product URL
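As a quick illustration (the helper name and the example URL are our own, not part of the scraper code), the ASIN can be pulled out of a product URL with a small regular expression:

import re
from typing import Optional

def extract_asin(url: str) -> Optional[str]:
    """hypothetical helper: pull the 10-character ASIN out of an Amazon product URL"""
    # ASINs follow /dp/ (or /gp/product/) and are 10 alphanumeric characters
    match = re.search(r"/(?:dp|gp/product)/([A-Z0-9]{10})", url)
    return match.group(1) if match else None

print(extract_asin("https://www.amazon.com/Kindle-Now-with-Built-in-Front-Light/dp/B07978J597/ref=sr_1_1"))
# B07978J597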

This also means that we can find the URL of any product as long as we know its ASIN code. Let's give it a shot:

import json
import re
import httpx
from typing import List
from typing_extensions import TypedDict
from parsel import Selector
from loguru import logger as log

class ProductInfo(TypedDict):
    """type hint for our scraped product result"""
    name: str
    stars: str
    rating_count: str
    features: List[str]
    images: dict

def parse_product(response) -> ProductInfo:
    """parse Amazon's product page (e.g. https://www.amazon.com/dp/B07KR2N2GF) for essential product data"""
    sel = Selector(text=response.text)
    # images are stored in javascript state data found in the html
    # for this we can use a simple regex pattern:
    image_data = json.loads(re.findall(r"colorImages':.*'initial':\s*(\[.+?\])},\n", response.text)[0])
    # the other fields can be extracted with simple css selectors
    # we can define our helper functions to keep our code clean
    return {
        "name": sel.css("#productTitle::text").get("").strip(),
        "stars": sel.css("i[data-hook=average-star-rating] ::text").get("").strip(),
        "rating_count": sel.css("span[data-hook=total-review-count] ::text").get("").strip(),
        "features": sel.css("#feature-bullets li ::text").getall(),
        "images": image_data
    }


async def scrape_product(asin: str, session: httpx.AsyncClient) -> ProductInfo:
    log.info(f"scraping {asin}")
    response = await session.get(f"https://www.amazon.com/dp/{asin}")
    return parse_product(response)

Above, we define our Amazon product scraper that retrieves the product's page from the given ASIN code and parses basic information like name, ratings, etc. Let's run it:

Run code and example output
async def run():
    limits = httpx.Limits(max_connections=5)
    async with httpx.AsyncClient(limits=limits, timeout=httpx.Timeout(15.0), headers=BASE_HEADERS) as session:
        data = await scrape_product("B07L5G6M1Q", session=session)
        print(json.dumps(data, indent=2))

if __name__ == "__main__":
    asyncio.run(run())
{
  "name": "Kindle Oasis \u2013 Now with adjustable warm light - Without Lockscreen Ads",
  "stars": "4.6 out of 5 stars",
  "rating_count": "19,779 global ratings",
  "features": [
    " Our best 7\", 300 ppi flush-front Paperwhite display.  ",
    " Adjustable warm light to shift screen shade from white to amber.  ",
    " Waterproof (IPX8) so you can read in the bath or by the pool. Your Kindle has been tested to withstand accidental immersion in water.  ",
    " Thin and light ergonomic design with page turn buttons.  ",
    " Reads like real paper with the latest e-ink technology for fast page turns.  ",
    " Instant access to millions of books, newspapers, and audiobooks.  ",
    " Works with Audible - pair with Bluetooth headphones or speakers to switch seamlessly between reading and listening.  "
  ],
  "images": [
    {
      "hiRes": "https://m.media-amazon.com/images/I/614TlIaYBvL._AC_SL1000_.jpg",
      "thumb": "https://m.media-amazon.com/images/I/41XunSGTqvL._AC_US40_.jpg",
      "large": "https://m.media-amazon.com/images/I/41XunSGTqvL._AC_.jpg",
...
}

However, this code is missing an essential detail - prices! For that, let's take a look at how Amazon.com prices its products and how we can scrape that information.

Scraping Product Variants and Prices

Every product on Amazon can have multiple variants. For example, let's take a look at this product:

illustration of amazon's variants

We can see that this product is customizable through several options. Each of these option combos is represented by its own ASIN identifier. So, if we take a look at the page source and find all identifiers of this product, we can see multiple ASIN codes:

illustration of amazon's variants in page source

We can see that variant ASIN codes and descriptions are present in a javascript variable hidden in the HTML source of the page. To be more exact, it's in the dimensionValuesDisplayData field of a dataToReturn variable. We can easily extract this using a small regular expression pattern:

import re
import httpx

product_html = httpx.get("https://www.amazon.com/dp/B07F7TLZF4").text
# this pattern selects the value between curly braces that follows the dimensionValuesDisplayData key:
variant_data = re.findall(r'dimensionValuesDisplayData"\s*:\s* ({.+?}),\n', product_html)
print(variant_data)

Now, we can implement this logic in our scraper by extracting the variant ASIN identifiers and scraping each variant page for its price details.
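The variant helpers used below aren't shown in the snippets above, so here's a minimal sketch of what parse_variant, scrape_variant and the ProductData result type could look like, assuming the variant price sits in the usual .a-price offscreen element:

import httpx
from typing import List
from typing_extensions import TypedDict
from parsel import Selector


class VariantData(TypedDict):
    """assumed shape of a single scraped variant"""
    asin: str
    price: str


class ProductData(TypedDict):
    """assumed shape of the full product result"""
    info: ProductInfo
    variants: List[VariantData]


def parse_variant(response) -> VariantData:
    """sketch: parse a variant's product page for its ASIN and displayed price"""
    sel = Selector(text=response.text)
    return {
        # the ASIN is the path segment that follows /dp/ in the requested URL
        "asin": str(response.url).split("/dp/")[-1].split("/")[0].split("?")[0],
        # the displayed price is mirrored in an offscreen accessibility element
        "price": sel.css(".a-price .a-offscreen::text").get(),
    }


async def scrape_variant(asin: str, session: httpx.AsyncClient) -> VariantData:
    """sketch: retrieve a single variant page and parse it"""
    response = await session.get(f"https://www.amazon.com/dp/{asin}")
    return parse_variant(response)

With these helpers in place, let's extend our product scraper function with variant scraping logic: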

async def scrape_product(asin: str, session: httpx.AsyncClient, reviews=True, pricing=True) -> ProductData:
    log.info(f"scraping {asin}")
    response_product = await session.get(f"https://www.amazon.com/dp/{asin}")
    # parse current page as first variant
    variants = [parse_variant(response_product)]

    # if product has variants - we want to scrape all of them
    _variation_data = re.findall(r'dimensionValuesDisplayData"\s*:\s* ({.+?}),\n', response_product.text)
    if _variation_data:
        variant_asins = list(json.loads(_variation_data[0]))
        log.info(f"scraping {len(variant_asins)} variants: {variant_asins}")
        variants.extend(await asyncio.gather(*[scrape_variant(asin, session) for asin in variant_asins]))

    return {
        "info": parse_product(response_product),
        "variants": variants,
    }

The interesting thing to note here is that not every product has multiple variants, but every product has at least one variant. Let's take this scraper for a spin:

Run code and example output
async def run():
    limits = httpx.Limits(max_connections=5)
    async with httpx.AsyncClient(limits=limits, timeout=httpx.Timeout(15.0), headers=BASE_HEADERS) as session:
        data = await scrape_product("B07L5G6M1Q", session=session, reviews=True)
        print(json.dumps(data, indent=2))

if __name__ == "__main__":
    asyncio.run(run())
{
  "info": {
    "name": "Kindle Oasis \u2013 Now with adjustable warm light - Without Lockscreen Ads",
    "stars": "4.6 out of 5 stars",
    "rating_count": "19,779 global ratings",
    "features": [
      " Our best 7\", 300 ppi flush-front Paperwhite display.  ",
      " Adjustable warm light to shift screen shade from white to amber.  ",
      " Waterproof (IPX8) so you can read in the bath or by the pool. Your Kindle has been tested to withstand accidental immersion in water.  ",
      " Thin and light ergonomic design with page turn buttons.  ",
      " Reads like real paper with the latest e-ink technology for fast page turns.  ",
      " Instant access to millions of books, newspapers, and audiobooks.  ",
      " Works with Audible - pair with Bluetooth headphones or speakers to switch seamlessly between reading and listening.  "
    ],
    "images": [
      {
        "hiRes": "https://m.media-amazon.com/images/I/614TlIaYBvL._AC_SL1000_.jpg",
        "thumb": "https://m.media-amazon.com/images/I/41XunSGTqvL._AC_US40_.jpg",
        "large": "https://m.media-amazon.com/images/I/41XunSGTqvL._AC_.jpg",
        ...
      },
      ...
    ]
  },
  "variants": [
    {
      "asin": "B07L5G6M1Q",
      "price": "$299.99"
    },
    {
      "asin": "B07F7TLZF4",
      "price": "$249.99"
    },
    ...
  ]
}

We can see that our scraper now generates product information and a list of variant data points, each containing a price and its own ASIN identifier.
The only details we're missing now are product reviews, so next, let's take a look at how to scrape Amazon product reviews.

Scraping Amazon Reviews

To scrape Amazon product reviews, let's take a look at where we can find them. If we scroll to the bottom of the page, we can see a link that says "See All Reviews", and clicking it takes us to a new location that follows this URL format:

illustration of amazon review page url

We can see that just like for product information, all we need is the ASIN identifier to find the review page of a product. Let's add this logic to our scraper:

import asyncio
import math
import httpx
from typing import List
from typing_extensions import TypedDict
from urllib.parse import urljoin
from parsel import Selector
from loguru import logger as log

class ReviewData(TypedDict):
    """storage type hint for amazons review object"""
    title: str
    text: str
    location_and_date: str
    verified: bool
    rating: float


def parse_reviews(response) -> List[ReviewData]:
    """parse review from single review page"""
    sel = Selector(text=response.text)
    review_boxes = sel.css("#cm_cr-review_list div.review")
    parsed = []
    for box in review_boxes:
        parsed.append({
                "text": "".join(box.css("span[data-hook=review-body] ::text").getall()).strip(),
                "title": box.css("*[data-hook=review-title]>span::text").get(),
                "location_and_date": box.css("span[data-hook=review-date] ::text").get(),
                "verified": bool(box.css("span[data-hook=avp-badge] ::text").get()),
                "rating": box.css("*[data-hook*=review-star-rating] ::text").re(r"(\d+\.*\d*) out")[0],
        })
    return parsed


async def scrape_reviews(asin, session: httpx.AsyncClient) -> List[ReviewData]:
    """scrape all reviews of a given ASIN of an amazon product"""
    url = f"https://www.amazon.com/product-reviews/{asin}/"
    log.info(f"scraping review page: {url}")
    # find first page
    first_page = await session.get(url)
    sel = Selector(text=first_page.text)
    # find total amount of pages 
    total_reviews = sel.css("div[data-hook=cr-filter-info-review-rating-count] ::text").re(r"(\d+,*\d*)")[1]
    total_reviews = int(total_reviews.replace(",", ""))
    total_pages = int(math.ceil(total_reviews / 10.0))

    log.info(f"found total {total_reviews} reviews across {total_pages} pages -> scraping")
    _next_page = urljoin(url, sel.css(".a-pagination .a-last>a::attr(href)").get())
    if _next_page:
        next_page_urls = [_next_page.replace("pageNumber=2", f"pageNumber={i}") for i in range(2, total_pages + 1)]
        assert len(set(next_page_urls)) == len(next_page_urls)
        other_pages = await asyncio.gather(*[session.get(url) for url in next_page_urls])
    else:
        other_pages = []
    reviews = []
    for response in [first_page, *other_pages]:
        reviews.extend(parse_reviews(response))
    log.info(f"scraped total {len(reviews)} reviews")
    return reviews

In the above scraper we are putting together everything we've learned in this tutorial:

  • To scrape pagination we are using the same technique we used in scraping search: scrape first page, find total pages and scrape the rest concurrently.
  • To parse reviews, we are also using the same technique we used in parsing search: iterate through each box containing a review and parse the data using CSS selectors.
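
One thing to note is that the location_and_date field comes back as a single string like "Reviewed in the United States on July 29, 2019". If you need the location and date separately, a small post-processing helper (our own addition, not part of the original scraper) can split it:

import re
from datetime import datetime

def split_review_date(location_and_date: str):
    """hypothetical helper: split Amazon's review date line into location and date"""
    # the line follows the pattern "Reviewed in <location> on <month day, year>"
    match = re.match(r"Reviewed in (.+?) on (.+)", location_and_date)
    if not match:
        return None, None
    location, date_text = match.groups()
    return location, datetime.strptime(date_text, "%B %d, %Y").date()

print(split_review_date("Reviewed in the United States on July 29, 2019"))
# ('the United States', datetime.date(2019, 7, 29))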

Let's run this scraper and see what output it generates:

Run code and example output
async def run():
    limits = httpx.Limits(max_connections=5)
    async with httpx.AsyncClient(limits=limits, timeout=httpx.Timeout(15.0), headers=BASE_HEADERS) as session:
        data = await scrape_reviews("B07L5G6M1Q", session=session)
        print(json.dumps(data, indent=2))

if __name__ == "__main__":
    asyncio.run(run())
[
  {
    "text": "I have the previous generation oasis as well (on the left side in the pics) and wanted this one for reading at night. Overall there's not many differences between the two, so if the light tone customizability isn't important to you I wouldn't particularly recommend this one over the 9th gen. However, the lighting is noticeably more even with the 10th gen (my older one visibly fades from one side to the other) and there's a ton of variability in the tone of the screen. Overall, for me it was worth it, but your mileage may vary if you don't read in a dark room (so as not to wake the spouse) before bed very often.",
    "title": "Loving it so far",
    "location_and_date": "Reviewed in the United States on July 29, 2019",
    "verified": true,
    "rating": "5.0"
  },
  {
    "text": "So I've been using a Kindle Paperwhite since 2014 and absolutely loved it.  Despite it being five years old, it still worked great and has been a pleasure as a reading device. ",
    "title": "From 2014 Paperwhite to 2019 Oasis",
    "location_and_date": "Reviewed in the United States on August 9, 2019",
    "verified": true,
    "rating": "3.0"
  },
... 
]

By this point, we've learned how to find products on Amazon and scrape their description, pricing and review data. However, to scrape Amazon at scale we have to fortify our scraper from being blocked - let's see how we can do that using ScrapFly web scraping API!

Bypass Blocking and Captchas with ScrapFly

We looked at how to scrape Amazon.com, though unfortunately, when scraping at scale, it's very likely that Amazon will start to either block us or serve us captchas, which will hinder or completely disable our web scraper.

To get around this, let's take advantage of ScrapFly API which can avoid all of these blocks for us just with a few extra lines of Python code!

illustration of scrapfly's middleware

ScrapFly offers several powerful features that'll help us get around Amazon's web scraper blocking.

For this, we'll be using the scrapfly-sdk Python package. First, let's install it using pip:

$ pip install scrapfly-sdk

To take advantage of ScrapFly's API in our Amazon web scraper, all we need to do is replace our httpx session code with scrapfly-sdk client requests.
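As a rough sketch (the API key is a placeholder, and the exact parameters you enable will depend on your use case), the swap looks something like this:

from scrapfly import ScrapflyClient, ScrapeConfig

client = ScrapflyClient(key="YOUR SCRAPFLY KEY")
result = client.scrape(ScrapeConfig(
    url="https://www.amazon.com/dp/B07L5G6M1Q",
    # enable anti scraping protection bypass and route through US proxies
    asp=True,
    country="US",
))
# the response exposes the page HTML and a ready-made parsel selector
print(result.selector.css("#productTitle::text").get())

For the complete, up-to-date scraper code, see GitHub: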

Latest Amazon.com Scraper Code
https://github.com/scrapfly/scrapfly-scrapers/

FAQ

To wrap this guide up let's take a look at some frequently asked questions about web scraping Amazon:

Is it legal to scrape Amazon.com?

Yes. Amazon's data is publicly available, and we're not extracting anything personal or private. Scraping Amazon.com at slow, respectful rates falls under the ethical scraping definition.
That being said, attention should be paid to GDPR compliance in the EU when scraping personal data, such as reviewers' personal details in the reviews section. For more, see our Is Web Scraping Legal? article.

How to crawl Amazon.com?

It's easy to crawl Amazon products because of the extensive related product and recommendation system featured on every page. In other words, we can write a crawler that takes a seed of Amazon product URLs, scrapes them, extracts more product URLs from the related products section, and repeats the loop. For more on crawling, see How to Crawl the Web with Python.
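
As a minimal sketch of that idea (extract_related_asins() is a hypothetical parser for the related-products carousel, not something defined in this tutorial), a breadth-first crawl could look like this:

import httpx

async def crawl_products(seed_asins, session: httpx.AsyncClient, max_products=100):
    """sketch: breadth-first crawl of Amazon products through related-product links"""
    seen = set(seed_asins)
    queue = list(seed_asins)
    results = []
    while queue and len(results) < max_products:
        asin = queue.pop(0)
        response = await session.get(f"https://www.amazon.com/dp/{asin}")
        results.append(parse_product(response))
        # extract_related_asins() is a hypothetical helper that would parse
        # ASINs out of the "products related to this item" carousel
        for related_asin in extract_related_asins(response):
            if related_asin not in seen:
                seen.add(related_asin)
                queue.append(related_asin)
    return results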

Summary

In this tutorial, we built an Amazon product scraper by understanding how the website functions so we could replicate its functionality in our web scraper. First, we replicated the search function to find products, then we scraped product information and variant data and finally, we scraped all of the reviews of each product.

We can see that web scraping Amazon in Python is pretty easy thanks to brilliant community tools like httpx and parsel.

To prevent being blocked, we used ScrapFly's API which smartly configures every web scraper connection to avoid being blocked. For more on ScrapFly see our documentation and try it out for free!
