How to Scrape Amazon.com
Tutorial on how to scrape amazon.com product, pricing and review data using Python. How to avoid blocking to scrape data at scale and other tips.
In this web scraping tutorial we'll be scraping Amazon.com - the biggest ecommerce website in the world!
Amazon contains millions of products and operates in many different countries making it a great target for public market analytics data. To retrieve all of this product data, prices and reviews we'll be using Python with a few community packages and common web scraping idioms. So, let's dive in!
If you're new to web scraping with Python we recommend checking out our full introduction tutorial to web scraping with Python and common best practices.
Amazon contains loads of valuable e-commerce data: product details, prices and reviews. It's also a leading e-commerce platform in many regions around the world. All of this makes Amazon's public data ideal for market analytics and business intelligence.
Not only that, Amazon is often used by companies to track the performance of their products sold by 3rd party resellers. So, needless to say, there are almost countless ways to make use of this public data! For more on scraping use cases, see our extensive web scraping use case article.
In this tutorial we'll be using Python and two major community packages:
httpx - an HTTP client library which we'll use to retrieve Amazon's pages.
parsel - an HTML parsing library which supports CSS and XPath selectors for extracting data from the retrieved pages.
Optionally, we'll also use loguru - a pretty logging library that'll help us keep track of what's going on.
These packages can be easily installed via the pip command:
$ pip install httpx parsel loguru
Alternatively, feel free to swap httpx out with any other HTTP client package such as requests, as we'll only need basic HTTP functions which are almost interchangeable in every library. As for parsel, another great alternative is the beautifulsoup package or anything else that supports CSS selectors, which is what we'll be using in this tutorial.
If you're new to CSS selectors check out our complete and interactive introduction article that goes over essential CSS selector syntax and common usage in web scraping.
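As a quick refresher, here's a tiny, self-contained parsel example of the kind of CSS selectors we'll be writing throughout this tutorial (the HTML snippet below is made up purely for illustration):

from parsel import Selector

html = """
<div class="product">
  <h2><a href="/dp/B000000000">Example product</a></h2>
  <span class="price">$19.99</span>
</div>
"""
sel = Selector(text=html)
print(sel.css("h2>a::text").get())        # Example product
print(sel.css("h2>a::attr(href)").get())  # /dp/B000000000
print(sel.css(".price::text").get())      # $19.99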
There are several ways of finding products on Amazon; however, the most flexible and powerful one is Amazon's search system.
We can see that when we type in our search term, Amazon redirects us to the search page https://www.amazon.com/s?k=<search query>, which we can use in our scraper:
import asyncio
import httpx
from loguru import logger as log
from parsel import Selector
def parse_search(response):
pass # we'll fill this in later
async def search(query: str, session: httpx.AsyncClient):
"""Search for amazon products using searchbox"""
log.info(f"{query}: scraping first page")
# first, let's scrape first query page to find out how many pages we have in total:
first_page = await session.get(f"https://www.amazon.com/s?k={query}&page=1")
sel = Selector(text=first_page.text)
_page_numbers = sel.xpath(
'//a[has-class("s-pagination-item")][not(has-class("s-pagination-separator"))]/text()'
).getall()
total_pages = max(int(number) for number in _page_numbers)
# now we can scrape remaining pages concurrently
log.info(f"{query}: found {total_pages}, scraping them concurrently")
other_pages = await asyncio.gather(
*[session.get(f"https://www.amazon.com/s?k={query}&page={page}") for page in range(2, total_pages + 1)]
)
# parse all of search pages for product preview data:
previews = []
for response in [first_page, *other_pages]:
previews.extend(parse_search(response))
log.info(f"{query}: found total of {len(previews)} product previews")
return previews
Here, in our search function, we collect the first results page of the given query. Then we find the total number of pages this query contains and scrape the remaining pages concurrently. This is a common pagination scraping idiom: when we can find the total number of pages up front, we can take full advantage of concurrent web scraping.
Efficient pagination scraping: get the total number of pages from the first page, then scrape the rest of the pages concurrently!
Further, let's parse our collected search page HTMLs for product preview data:
import asyncio
from typing import List, Optional
from urllib.parse import urljoin
from loguru import logger as log
from parsel import Selector
from typing_extensions import TypedDict
class ProductPreview(TypedDict):
"""result generated by search scraper"""
url: str
title: str
price: str
real_price: str
rating: str
rating_count: str
def parse_search(resp) -> List[ProductPreview]:
"""Parse search result page for product previews"""
previews = []
sel = Selector(text=resp.text)
# find boxes of each product preview
product_boxes = sel.css("div.s-result-item[data-component-type=s-search-result]")
for box in product_boxes:
url = urljoin(str(resp.url), box.css("h2>a::attr(href)").get()).split("?")[0]
if "/slredirect/" in url: # skip ads etc.
continue
previews.append(
{
"url": url,
"title": box.css("h2>a>span::text").get(),
# big price text is discounted price
"price": box.css(".a-price[data-a-size=xl] .a-offscreen::text").get(),
# small price text is "real" price
"real_price": box.css(".a-price[data-a-size=b] .a-offscreen::text").get(),
"rating": (box.css("span[aria-label~=stars]::attr(aria-label)").re(r"(\d+\.*\d*) out") or [None])[0],
"rating_count": box.css("span[aria-label~=stars] + span::attr(aria-label)").get(),
}
)
log.debug(f"found {len(previews)} product listings in {resp.url}")
return previews
async def search(query, session):
"""Search for amazon products using searchbox"""
log.info(f"{query}: scraping first page")
# first, let's scrape first query page to find out how many pages we have in total:
first_page = await session.get(f"https://www.amazon.com/s?k={query}&page=1")
sel = Selector(text=first_page.text)
_page_numbers = sel.xpath('//a[has-class("s-pagination-item")][not(has-class("s-pagination-separator"))]/text()').getall()
total_pages = max(int(number) for number in _page_numbers)
# now we can scrape remaining pages concurrently
log.info(f"{query}: found {total_pages}, scraping them concurrently")
other_pages = await asyncio.gather(
*[session.get(f"https://www.amazon.com/s?k={query}&page={page}") for page in range(2, total_pages + 1)]
)
# parse all of search pages for product preview data:
previews = []
for response in [first_page, *other_pages]:
previews.extend(parse_search(response))
log.info(f"{query}: found total of {len(previews)} product previews")
return previews
We are using parsel's CSS selector feature to select the product preview containers and iterate through each one of them. Each product is contained within its own box which we can capture, and each container holds the product's preview information which we can extract using a few relative CSS selectors. Let's run this scraper and see the results it generates:
import httpx
import json
import asyncio
# We need to use browser-like headers for our requests to avoid being blocked
# here we set headers of Chrome browser on Windows:
BASE_HEADERS = {
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US,en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
}
async def run():
limits = httpx.Limits(max_connections=5)
async with httpx.AsyncClient(limits=limits, timeout=httpx.Timeout(15.0), headers=BASE_HEADERS) as session:
data = await search("kindle", session=session)
print(json.dumps(data, indent=2))
if __name__ == "__main__":
asyncio.run(run())
[
{
"url": "https://www.amazon.com/Kindle-Now-with-Built-in-Front-Light/dp/B07978J597/ref=sr_1_1",
"title": "Kindle - With a Built-in Front Light - Black",
"price": "$59.99",
"real_price": "$89.99",
"rating": "4.6",
"rating_count": "36,856"
},
{
"url": "https://www.amazon.com/All-new-Kindle-Paperwhite-adjustable-Ad-Supported/dp/B08KTZ8249/ref=sr_1_2",
"title": "Kindle Paperwhite (8 GB) \u2013 Now with a 6.8\" display and adjustable warm light",
"price": "$139.99",
"real_price": null,
"rating": "4.7",
"rating_count": "10,775"
},
...
]
Now that we can effectively find products, let's take a look at how we can scrape the product data itself.
To scrape product info we'll retrieve each product's HTML page and parse it using our parsel
package. For this, we'll be using parsel's CSS selector feature.
To retrieve product data we mostly just need the ASIN (Amazon Standard Identification Number) code. This unique 10-character identifier is assigned to every product and product variant on Amazon, and we can usually extract it straight from the product URL, e.g. https://www.amazon.com/<product-name>/dp/<ASIN>.
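For example, here's a small sketch of pulling the ASIN out of a product URL with a regular expression (the URL is taken from our earlier search results):

import re

url = "https://www.amazon.com/Kindle-Now-with-Built-in-Front-Light/dp/B07978J597/ref=sr_1_1"
# the ASIN is the 10-character code that follows the /dp/ part of the URL
asin = re.search(r"/dp/([A-Z0-9]{10})", url).group(1)
print(asin)  # B07978J597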
This also means that we can find the URL of any product as long as we know its ASIN code. Let's give it a shot:
import json
import re
from typing import List
import httpx
from loguru import logger as log
from parsel import Selector
from typing_extensions import TypedDict
class ProductInfo(TypedDict):
"""type hint for our scraped product result"""
name: str
stars: str
    rating_count: str
features: List[str]
images: dict
def parse_product(response) -> ProductInfo:
"""parse Amazon's product page (e.g. https://www.amazon.com/dp/B07KR2N2GF) for essential product data"""
sel = Selector(text=response.text)
# images are stored in javascript state data found in the html
# for this we can use a simple regex pattern:
image_data = json.loads(re.findall(r"colorImages':.*'initial':\s*(\[.+?\])},\n", response.text)[0])
# the other fields can be extracted with simple css selectors
# we can define our helper functions to keep our code clean
return {
"name": sel.css("#productTitle::text").get("").strip(),
"stars": sel.css("i[data-hook=average-star-rating] ::text").get("").strip(),
"rating_count": sel.css("div[data-hook=total-review-count] ::text").get("").strip(),
"features": sel.css("#feature-bullets li ::text").getall(),
"images": image_data
}
async def scrape_product(asin: str, session: httpx.AsyncClient) -> ProductInfo:
log.info(f"scraping {asin}")
response = await session.get(f"https://www.amazon.com/dp/{asin}")
return parse_product(response)
Above, we define our product scraper that retrieves the product's page from the given ASIN code and parses basic information like name, ratings etc. Let's run it:
async def run():
limits = httpx.Limits(max_connections=5)
async with httpx.AsyncClient(limits=limits, timeout=httpx.Timeout(15.0), headers=BASE_HEADERS) as session:
data = await scrape_product("B07L5G6M1Q", session=session)
print(json.dumps(data, indent=2))
if __name__ == "__main__":
asyncio.run(run())
{
"name": "Kindle Oasis \u2013 Now with adjustable warm light - Without Lockscreen Ads",
"stars": "4.6 out of 5 stars",
"rating_count": "19,779 global ratings",
"features": [
" Our best 7\", 300 ppi flush-front Paperwhite display. ",
" Adjustable warm light to shift screen shade from white to amber. ",
" Waterproof (IPX8) so you can read in the bath or by the pool. Your Kindle has been tested to withstand accidental immersion in water. ",
" Thin and light ergonomic design with page turn buttons. ",
" Reads like real paper with the latest e-ink technology for fast page turns. ",
" Instant access to millions of books, newspapers, and audiobooks. ",
" Works with Audible - pair with Bluetooth headphones or speakers to switch seamlessly between reading and listening. "
],
"images": [
{
"hiRes": "https://m.media-amazon.com/images/I/614TlIaYBvL._AC_SL1000_.jpg",
"thumb": "https://m.media-amazon.com/images/I/41XunSGTqvL._AC_US40_.jpg",
"large": "https://m.media-amazon.com/images/I/41XunSGTqvL._AC_.jpg",
...
}
However, this code is missing an essential detail - prices! For that, let's take a look at how Amazon.com prices its products and how we can scrape that information.
Every product on Amazon can have multiple variants. For example, let's take a look at this product:
We can see that this product is customizable with several options. Each of these option combinations is represented by its own ASIN identifier. So, if we take a look at the page source and find all identifiers of this product, we can see multiple ASIN codes:
We can see that variant ASIN codes and descriptions are present in a javascript variable hidden in the HTML source of the page. To be more exact, it's in dimensionValuesDisplayData
field of a dataToReturn
variable. We can easily extract it using a small regular expression pattern:
import re
import httpx
product_html = httpx.get("https://www.amazon.com/dp/B07F7TLZF4").text
# this pattern selects the value between curly braces that follows the dimensionValuesDisplayData key:
variant_data = re.findall(r'dimensionValuesDisplayData"\s*:\s* ({.+?}),\n', product_html)
print(variant_data)
Now we can implement this logic in our scraper by extracting the variant ASIN identifiers and scraping each variant for its price details.
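The extended scraper below relies on two helpers we haven't defined yet - parse_variant and scrape_variant. Here's a sketch of what they can look like in our httpx setup, reusing the same CSS selectors as the ScrapFly version in the full scraper code at the end of this article:

import httpx
from parsel import Selector
from typing_extensions import TypedDict

class VariantData(TypedDict):
    """type hint storage of Amazon's product variant information"""
    asin: str
    price: str

def parse_variant(response) -> VariantData:
    """parse the currently displayed variant's ASIN and price from a product page"""
    sel = Selector(text=response.text)
    return {
        # the canonical link contains the current variant's ASIN
        "asin": sel.css("link[rel=canonical]::attr(href)").re("dp/([^/]+)")[0],
        # the price box of the currently selected offer
        "price": sel.css("#apex_offerDisplay_desktop .a-price>.a-offscreen::text").get(),
    }

async def scrape_variant(asin: str, session: httpx.AsyncClient) -> VariantData:
    """scrape a single product variant page for its price data"""
    response = await session.get(f"https://www.amazon.com/dp/{asin}")
    return parse_variant(response)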
With these helpers, we can extract each product variant's price. Let's extend our product scraper function with the variant scraping logic:
async def scrape_product(asin: str, session: httpx.AsyncClient, reviews=True, pricing=True) -> dict:
log.info(f"scraping {asin}")
response_product = await session.get(f"https://www.amazon.com/dp/{asin}")
# parse current page as first variant
variants = [parse_variant(response_product)]
# if product has variants - we want to scrape all of them
    _variation_data = re.findall(r'dimensionValuesDisplayData"\s*:\s* ({.+?}),\n', response_product.text)
    if _variation_data:
        variant_asins = list(json.loads(_variation_data[0]))
        log.info(f"scraping {len(variant_asins)} variants: {variant_asins}")
        variants.extend(await asyncio.gather(*[scrape_variant(variant_asin, session) for variant_asin in variant_asins]))
return {
"info": parse_product(response_product),
"variants": variants,
}
The interesting thing to note here is that not every product has multiple variants but every product has at least 1 variant. Let's take this scraper for a spin:
async def run():
limits = httpx.Limits(max_connections=5)
async with httpx.AsyncClient(limits=limits, timeout=httpx.Timeout(15.0), headers=BASE_HEADERS) as session:
data = await scrape_product("B07L5G6M1Q", session=session, reviews=True)
print(json.dumps(data, indent=2))
if __name__ == "__main__":
asyncio.run(run())
{
"info": {
"name": "Kindle Oasis \u2013 Now with adjustable warm light - Without Lockscreen Ads",
"stars": "4.6 out of 5 stars",
"rating_count": "19,779 global ratings",
"features": [
" Our best 7\", 300 ppi flush-front Paperwhite display. ",
" Adjustable warm light to shift screen shade from white to amber. ",
" Waterproof (IPX8) so you can read in the bath or by the pool. Your Kindle has been tested to withstand accidental immersion in water. ",
" Thin and light ergonomic design with page turn buttons. ",
" Reads like real paper with the latest e-ink technology for fast page turns. ",
" Instant access to millions of books, newspapers, and audiobooks. ",
" Works with Audible - pair with Bluetooth headphones or speakers to switch seamlessly between reading and listening. "
],
"images": [
{
"hiRes": "https://m.media-amazon.com/images/I/614TlIaYBvL._AC_SL1000_.jpg",
"thumb": "https://m.media-amazon.com/images/I/41XunSGTqvL._AC_US40_.jpg",
"large": "https://m.media-amazon.com/images/I/41XunSGTqvL._AC_.jpg",
...
},
...
]
},
"variants": [
{
"asin": "B07L5G6M1Q",
"price": "$299.99"
},
{
"asin": "B07F7TLZF4",
"price": "$249.99"
},
...
]
}
We can see that our scraper now generates product information and a list of variant data points, each containing a price and its own ASIN identifier.
The only details we're missing now are product reviews. For that, we'll have to scrape a few more pages so let's take a look at how we can do that.
To scrape product reviews, first let's take a look at where we can find them. If we scroll to the bottom of a product page, we can see a link that says "See All Reviews", and if we click it we are taken to a URL that follows this format: https://www.amazon.com/product-reviews/<ASIN>/
Just like for product information, all we need is the ASIN identifier to find a product's review page. Let's add this logic to our scraper:
import asyncio
import math
from typing import List
from urllib.parse import urljoin
import httpx
from loguru import logger as log
from parsel import Selector
from typing_extensions import TypedDict
class ReviewData(TypedDict):
"""storage type hint for amazons review object"""
title: str
text: str
location_and_date: str
verified: bool
rating: float
def parse_reviews(response) -> List[ReviewData]:
"""parse review from single review page"""
sel = Selector(text=response.text)
review_boxes = sel.css("#cm_cr-review_list div.review")
parsed = []
for box in review_boxes:
parsed.append({
"text": "".join(box.css("span[data-hook=review-body] ::text").getall()).strip(),
"title": box.css("*[data-hook=review-title]>span::text").get(),
"location_and_date": box.css("span[data-hook=review-date] ::text").get(),
"verified": bool(box.css("span[data-hook=avp-badge] ::text").get()),
"rating": box.css("*[data-hook*=review-star-rating] ::text").re(r"(\d+\.*\d*) out")[0],
})
return parsed
async def scrape_reviews(asin: str, session: httpx.AsyncClient) -> List[ReviewData]:
"""scrape all reviews of a given ASIN of an amazon product"""
url = f"https://www.amazon.com/product-reviews/{asin}/"
log.info(f"scraping review page: {url}")
# find first page
first_page = await session.get(url)
sel = Selector(text=first_page.text)
# find total amount of pages
total_reviews = sel.css("div[data-hook=cr-filter-info-review-rating-count] ::text").re(r"(\d+,*\d*)")[1]
total_reviews = int(total_reviews.replace(",", ""))
total_pages = int(math.ceil(total_reviews / 10.0))
log.info(f"found total {total_reviews} reviews across {total_pages} pages -> scraping")
    _next_page_href = sel.css(".a-pagination .a-last>a::attr(href)").get()
    if _next_page_href:
        _next_page = urljoin(url, _next_page_href)
        next_page_urls = [_next_page.replace("pageNumber=2", f"pageNumber={i}") for i in range(2, total_pages + 1)]
assert len(set(next_page_urls)) == len(next_page_urls)
other_pages = await asyncio.gather(*[session.get(url) for url in next_page_urls])
else:
other_pages = []
reviews = []
for response in [first_page, *other_pages]:
reviews.extend(parse_reviews(response))
log.info(f"scraped total {len(reviews)} reviews")
return reviews
In the above scraper we are putting together everything we've learned in this tutorial: we retrieve the first review page to find the total review count, generate the remaining page URLs, scrape them concurrently and parse each page with relative CSS selectors - the same pagination idiom we used for product search.
Let's run this scraper and see what output it generates:
async def run():
limits = httpx.Limits(max_connections=5)
async with httpx.AsyncClient(limits=limits, timeout=httpx.Timeout(15.0), headers=BASE_HEADERS) as session:
data = await scrape_reviews("B07L5G6M1Q", session=session)
print(json.dumps(data, indent=2))
if __name__ == "__main__":
asyncio.run(run())
[
{
"text": "I have the previous generation oasis as well (on the left side in the pics) and wanted this one for reading at night. Overall there's not many differences between the two, so if the light tone customizability isn't important to you I wouldn't particularly recommend this one over the 9th gen. However, the lighting is noticeably more even with the 10th gen (my older one visibly fades from one side to the other) and there's a ton of variability in the tone of the screen. Overall, for me it was worth it, but your mileage may vary if you don't read in a dark room (so as not to wake the spouse) before bed very often.",
"title": "Loving it so far",
"location_and_date": "Reviewed in the United States on July 29, 2019",
"verified": true,
"rating": "5.0"
},
{
"text": "So I've been using a Kindle Paperwhite since 2014 and absolutely loved it. Despite it being five years old, it still worked great and has been a pleasure as a reading device. ",
"title": "From 2014 Paperwhite to 2019 Oasis",
"location_and_date": "Reviewed in the United States on August 9, 2019",
"verified": true,
"rating": "3.0"
},
...
]
By this point, we've learned how to find products on Amazon and scrape their description, pricing and review data. However, to scrape Amazon at scale we have to fortify our scraper against being blocked - let's see how we can do that using the ScrapFly web scraping API!
We looked at how to scrape Amazon.com, though unfortunately, when scraping at scale it's very likely that Amazon will start to either block us or serve us captchas, which will hinder or completely disable our web scraper.
To get around this, let's take advantage of the ScrapFly API, which can avoid all of these blocks for us with just a few extra lines of Python code!
ScrapFly offers several powerful features that'll help us get around Amazon's web scraper blocking, such as its anti scraping protection bypass, JavaScript rendering and large proxy pools.
For this, we'll be using the scrapfly-sdk Python package. First, let's install scrapfly-sdk using pip:
$ pip install scrapfly-sdk
To take advantage of ScrapFly's API in our Amazon web scraper, all we need to do is replace our httpx session code with scrapfly-sdk client requests.
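Roughly, the change looks like the sketch below - instead of awaiting session.get() on an httpx.AsyncClient, we await async_scrape() on a ScrapflyClient and work with the result object's built-in parsel selector. The scrape_product_title function here is just an illustration; the real integration is in the full scraper code below:

from scrapfly import ScrapeConfig, ScrapflyClient

async def scrape_product_title(asin: str, session: ScrapflyClient) -> str:
    # httpx version: response = await session.get(f"https://www.amazon.com/dp/{asin}")
    result = await session.async_scrape(
        ScrapeConfig(url=f"https://www.amazon.com/dp/{asin}", country="US")
    )
    # result.selector is a ready-made parsel Selector and result.content holds the page HTML
    return result.selector.css("#productTitle::text").get("").strip()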
Finally, let's put everything together: product discovery, product and variant scraping, review scraping and ScrapFly integration into our final scraper code:
import asyncio
import json
import math
import re
from typing import List, Optional
from urllib.parse import urljoin
from loguru import logger as log
from scrapfly import ScrapeApiResponse, ScrapeConfig, ScrapflyClient
from typing_extensions import TypedDict
class ProductPreview(TypedDict):
"""result generated by search scraper"""
url: str
title: str
price: str
real_price: str
rating: str
rating_count: str
def parse_search(result: ScrapeApiResponse) -> List[ProductPreview]:
"""Parse search result page for product previews"""
previews = []
product_boxes = result.selector.css("div.s-result-item[data-component-type=s-search-result]")
for box in product_boxes:
url = urljoin("https://www.amazon.com/", box.css("h2>a::attr(href)").get()).split("?")[0]
if "/slredirect/" in url: # skip ads etc.
continue
previews.append(
{
"url": url,
"title": box.css("h2>a>span::text").get(),
# big price text is discounted price
"price": box.css(".a-price[data-a-size=xl] .a-offscreen::text").get(),
# small price text is "real" price
"real_price": box.css(".a-price[data-a-size=b] .a-offscreen::text").get(),
"rating": (box.css("span[aria-label~=stars]::attr(aria-label)").re(r"(\d+\.*\d*) out") or [None])[0],
"rating_count": box.css("span[aria-label~=stars] + span::attr(aria-label)").get(),
}
)
log.info(f"parsed {len(previews)} product previews from search page {result.context['url']}")
return previews
async def search(query: str, session: ScrapflyClient):
"""Search for amazon products using searchbox"""
log.info(f"{query}: scraping first page")
# first, let's scrape first query page to find out how many pages we have in total:
first_page_result = await session.async_scrape(
ScrapeConfig(url=f"https://www.amazon.com/s?k={query}&page=1", country="US")
)
_page_numbers = first_page_result.selector.xpath(
'//a[has-class("s-pagination-item")][not(has-class("s-pagination-separator"))]/text()'
).getall()
total_pages = max(int(number) for number in _page_numbers)
# now we can scrape remaining pages concurrently
log.info(f"{query}: found {total_pages}, scraping them concurrently")
next_urls = [f"https://www.amazon.com/s?k={query}&page={page}" for page in range(2, total_pages + 1)]
previews = parse_search(first_page_result)
async for result in session.concurrent_scrape([ScrapeConfig(url=url, country="US") for url in next_urls]):
result: ScrapeApiResponse
previews.extend(parse_search(result))
log.info(f"{query}: found total of {len(previews)} product previews")
return previews
# --------------------------------------------------
# REVIEWS
# --------------------------------------------------
class ReviewData(TypedDict):
title: str
text: str
location_and_date: str
verified: bool
rating: float
def parse_reviews(result: ScrapeApiResponse):
"""parse review from single review page"""
review_boxes = result.selector.css("#cm_cr-review_list div.review")
parsed = []
for box in review_boxes:
parsed.append(
{
"text": "".join(box.css("span[data-hook=review-body] ::text").getall()).strip(),
"title": box.css("*[data-hook=review-title]>span::text").get(),
"location_and_date": box.css("span[data-hook=review-date] ::text").get(),
"verified": bool(box.css("span[data-hook=avp-badge] ::text").get()),
"rating": box.css("*[data-hook*=review-star-rating] ::text").re(r"(\d+\.*\d*) out")[0],
}
)
return parsed
async def scrape_reviews(asin: str, session: ScrapflyClient):
"""scrape all reviews of a given ASIN of an amazon product"""
url = f"https://www.amazon.com/product-reviews/{asin}/"
log.info(f"scraping review page: {url}")
# find first page
first_page_result = await session.async_scrape(ScrapeConfig(url, country="US"))
total_reviews = first_page_result.selector.css("div[data-hook=cr-filter-info-review-rating-count] ::text").re(
r"(\d+,*\d*)"
)[1]
total_reviews = int(total_reviews.replace(",", ""))
total_pages = int(math.ceil(total_reviews / 10.0))
log.info(f"found total {total_reviews} reviews across {total_pages} pages -> scraping")
    _next_page_href = first_page_result.selector.css(".a-pagination .a-last>a::attr(href)").get()
    next_page_urls = []
    if _next_page_href:
        _next_page = urljoin(url, _next_page_href)
        next_page_urls = [_next_page.replace("pageNumber=2", f"pageNumber={i}") for i in range(2, total_pages + 1)]
assert len(set(next_page_urls)) == len(next_page_urls)
reviews = parse_reviews(first_page_result)
async for result in session.concurrent_scrape([ScrapeConfig(url, country="US") for url in next_page_urls]):
reviews.extend(parse_reviews(result))
log.info(f"scraped total {len(reviews)} reviews")
return reviews
# --------------------------------------------------
# PRODUCT
# --------------------------------------------------
class ProductInfo(TypedDict):
"""type hint storage of Amazons product information"""
name: str
stars: str
    rating_count: str
features: List[str]
images: dict
def parse_product(result) -> ProductInfo:
"""parse Amazon's product page (e.g. https://www.amazon.com/dp/B07KR2N2GF) for essential product data"""
# images are stored in javascript state data found in the html
# for this we can use a simple regex pattern:
image_data = json.loads(re.findall(r"colorImages':.*'initial':\s*(\[.+?\])},\n", result.content)[0])
# the other fields can be extracted with simple css selectors
# we can define our helper functions to keep our code clean
sel = result.selector
return {
"name": sel.css("#productTitle::text").get("").strip(),
"stars": sel.css("i[data-hook=average-star-rating] ::text").get("").strip(),
"rating_count": sel.css("div[data-hook=total-review-count] ::text").get("").strip(),
"features": sel.css("#feature-bullets li ::text").getall(),
"images": image_data,
}
class VariantData(TypedDict):
"""type hint storage of Amazon's product variant information"""
asin: str
price: str
name: str
def parse_variant(result: ScrapeApiResponse) -> VariantData:
"""parse current variant data from product page result"""
parsed = {
"asin": result.selector.css("link[rel=canonical]::attr(href)").re("dp/([^/]+)")[0],
"price": result.selector.css("#apex_offerDisplay_desktop .a-price>.a-offscreen::text").get(),
}
# features like color, storage etc. - anything that identifies the variant
features = result.selector.css("#twister>div[id*=variation_]")
for feature in features:
label = feature.xpath("@id").get("").split("variation_")[-1]
selection = feature.css(".selection::text").get("").strip()
parsed[label] = selection
return parsed
async def scrape_variant(asin: str, session: ScrapflyClient) -> VariantData:
"""Scrape Amazon's product current variant data"""
result = await session.async_scrape(ScrapeConfig(f"https://www.amazon.com/dp/{asin}", country="US"))
return parse_variant(result)
class ProductData(TypedDict):
info: dict
variants: List[dict]
reviews: Optional[List[dict]]
async def scrape_product(asin: str, session: ScrapflyClient, reviews=True) -> ProductData:
"""scrape Amazon.com product"""
log.info(f"scraping {asin}")
product_result = await session.async_scrape(ScrapeConfig(f"https://www.amazon.com/dp/{asin}", country="US"))
variants = [parse_variant(product_result)]
# if product has variants - we want to scrape all of them
    _variation_data = re.findall(r'dimensionValuesDisplayData"\s*:\s* ({.+?}),\n', product_result.content)
    if _variation_data:
        variant_asins = [variant_asin for variant_asin in json.loads(_variation_data[0]) if variant_asin != asin]
        log.info(f"scraping {len(variant_asins)} variants: {variant_asins}")
        variants.extend(await asyncio.gather(*[scrape_variant(variant_asin, session) for variant_asin in variant_asins]))
result = {
"info": parse_product(product_result),
"variants": variants,
}
if reviews:
# find review parameters
result["reviews"] = await scrape_reviews(asin, session=session)
return result
async def scrape_products(asins: List[str], session: ScrapflyClient, reviews=True):
    return await asyncio.gather(*[scrape_product(asin, session, reviews=reviews) for asin in asins])
async def run():
with ScrapflyClient(key="YOUR_SCRAPFLY_KEY", max_concurrency=20) as session:
search_results = await search("lego", session=session)
product_result = await scrape_product("B07L5G6M1Q", session=session)
return
if __name__ == "__main__":
asyncio.run(run())
To wrap this guide up let's take a look at some frequently asked questions about web scraping Amazon.com:
Is it legal to scrape Amazon.com?
Yes. Amazon's data is publicly available, and we're not extracting anything personal or private. Scraping Amazon.com at slow, respectful rates falls under the ethical scraping definition. That being said, attention should be paid to GDPR compliance in the EU when scraping personal data, such as reviewers' personal details from the reviews section. For more, see our Is Web Scraping Legal? article.
In this tutorial, we built an Amazon.com scraper by understanding how the website functions and replicating that functionality in our web scraper. First, we replicated the search function to find products, then we scraped product information and variant data, and finally we scraped all of the reviews of each product.
For this, we used Python with the httpx and parsel packages, and to prevent blocking we used ScrapFly's API, which smartly configures every web scraper connection to avoid being blocked. For more on ScrapFly, see our documentation and try it out for free!