How to Scrape Aliexpress.com (2023 Update)
Guide on how to scrape Aliexpress - one of the biggest global e-commerce platforms - covering product search, product data, pricing and customer reviews using Python and popular libraries like httpx and parsel.
Aliexpress is one of the biggest global e-commerce stores from China as well as being a popular web scraping target.
Aliexpress contains millions of products and product reviews that can be used in market analytics, business intelligence and dropshipping.
In this tutorial, we'll take a look at how to scrape Aliexpress. We'll start by finding products by scraping the search system. Then we'll scrape the found product data, pricing and customer reviews.
This will be a relatively easy scraper in just a few lines of Python code. Let's dive in!
There are many reasons to scrape Aliexpress data. For starters, because Aliexpress is one of the biggest e-commerce platforms in the world, it's a prime target for business intelligence or market analytics. Awareness of top products and their meta-information on Aliexpress can be a great advantage in business and market analysis.
Another common use is e-commerce, primarily via dropshipping - one of the biggest emergent markets of this century - where sellers curate a list of products and resell them directly rather than managing a warehouse. In this case, many shop curators scrape Aliexpress products to generate curated product lists for their dropshipping shops.
In this tutorial, we'll be using Python with two packages: httpx as our HTTP client to retrieve the pages and parsel to parse the HTML for product data. Both packages can be easily installed via the pip command:
$ pip install httpx parsel
Alternatively, you're free to swap httpx out for any other HTTP client library, such as requests, as we'll only need basic HTTP functions, which are almost interchangeable between libraries. As for parsel, another great alternative is the beautifulsoup package.
While our Aliexpress scraper is pretty easy, if you're new to web scraping with Python, we recommend checking out our full introduction tutorial to web scraping with Python and common best practices.
There are many ways to discover products on Aliexpress.
We could use the search system to find products we want to scrape or explore the many product categories. Whichever approach we take, our key target is the same - scraping product previews and pagination.
Let's take a look at an Aliexpress listing page, which is used in both the search and category views:
If we take a look at the page source of either a search or category page, we can see that all the product previews are stored in a javascript variable window.runParams tucked away in a <script> tag in the HTML source of the page:
This is a common web development pattern that enables dynamic data management using javascript.
It's good news for us though, as we can pick this data up with a simple regex pattern and parse it like a Python dictionary! This is generally called hidden web data scraping and it's a common pattern in modern web scraping.
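To illustrate the idea in isolation, here's a minimal sketch of the hidden web data technique. Note that the html string below is a made-up stand-in for a real page source:

import json
import re

# a made-up stand-in for a real page source containing the hidden data variable:
html = '<script>window.runParams = { data: {"products": [{"id": 123}]} };</script>'

# capture the javascript object assigned to the variable and load it as a Python dict
match = re.search(r"window\.runParams\s*=\s*{\s*data:\s*({.+?})\s*}", html)
data = json.loads(match.group(1))
print(data["products"][0]["id"])  # 123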
With this, we can write the first piece of our scraper code - the product preview parser. We'll be using it to extract product preview data from category or search result pages:
import json
from typing import Dict

import httpx
from parsel import Selector


def extract_search(response) -> Dict:
    """extract json data from search page"""
    sel = Selector(text=response.text)
    # find the script tag holding the window.runParams page data
    script_with_data = sel.xpath('//script[contains(text(),"window.runParams")]')
    # select page data from the javascript variable in the script tag using regex
    data = json.loads(script_with_data.re(r"_init_data_\s*=\s*{\s*data:\s*({.+}) }")[0])
    return data["data"]["root"]["fields"]
def parse_search(response):
    """Parse search page response for product preview results"""
    data = extract_search(response)
    parsed = []
    for result in data["mods"]["itemList"]["content"]:
        parsed.append(
            {
                "id": result["productId"],
                "url": f"https://www.aliexpress.com/item/{result['productId']}.html",
                "type": result["productType"],  # can be either natural or ad
                "title": result["title"]["displayTitle"],
                "price": result["prices"]["salePrice"]["minPrice"],
                "currency": result["prices"]["salePrice"]["currencyCode"],
                "trade": result.get("trade", {}).get("tradeDesc"),  # trade line is not always present
                "thumbnail": result["image"]["imgUrl"].lstrip("/"),
                "store": {
                    "url": result["store"]["storeUrl"],
                    "name": result["store"]["storeName"],
                    "id": result["store"]["storeId"],
                    "ali_id": result["store"]["aliMemberId"],
                },
            }
        )
    return parsed
Let's try our parser out by scraping a single Aliexpress listing page (category page or search results page):
if __name__ == "__main__":
    # for example, this category is for android phones:
    resp = httpx.get("https://www.aliexpress.com/category/5090301/cellphones.html", follow_redirects=True)
    print(json.dumps(parse_search(resp), indent=2))
[
  {
    "id": "3256804075561256",
    "url": "https://www.aliexpress.com/item/3256804075561256.html",
    "type": "ad",
    "title": "2G/3G Smartphones Original 512MB RAM/1G RAM 4GB ROM android mobile phones new cheap celulares FM unlocked 4.0inch cell",
    "price": 21.99,
    "currency": "USD",
    "trade": "8 sold",
    "thumbnail": "ae01.alicdn.com/kf/S1317aeee4a064fad8810a58959c3027dm/2G-3G-Smartphones-Original-512MB-RAM-1G-RAM-4GB-ROM-android-mobile-phones-new-cheap-celulares.jpg_220x220xz.jpg",
    "store": {
      "url": "www.aliexpress.com/store/1101690689",
      "name": "New 123 Store",
      "id": 1101690689,
      "ali_id": 247497658
    }
  },
  ...
]
There's a lot of useful information, but we've limited our parser to bare essentials to keep things brief. Let's put this parser to use in actual scraping next.
Now that we have our product preview parser ready, we need a scraper loop that will iterate through search results to collect all available results - not just the first page:
import asyncio
import math

import httpx


async def scrape_search(query: str, session: httpx.AsyncClient, sort_type="default"):
    """Scrape all search results and return parsed search result data"""
    query = query.replace(" ", "+")

    async def scrape_search_page(page):
        """Scrape a single aliexpress search page and return all embedded JSON search data"""
        print(f"scraping search query {query}:{page} sorted by {sort_type}")
        resp = await session.get(
            "https://www.aliexpress.com/wholesale?trafficChannel=main"
            f"&d=y&CatId=0&SearchText={query}&ltype=wholesale&SortType={sort_type}&page={page}"
        )
        return resp

    # scrape the first search page and find the total result count
    first_page = await scrape_search_page(page=1)
    first_page_data = extract_search(first_page)
    page_size = first_page_data["pageInfo"]["pageSize"]
    total_pages = int(math.ceil(first_page_data["pageInfo"]["totalResults"] / page_size))
    if total_pages > 60:
        print(f"query has {total_pages} pages; lowering to max allowed 60 pages")
        total_pages = 60
    # scrape the remaining pages concurrently
    print(f'scraping search "{query}" of total {total_pages} pages sorted by {sort_type}')
    other_pages = await asyncio.gather(*[scrape_search_page(page=i) for i in range(2, total_pages + 1)])
    product_previews = []
    for response in [first_page, *other_pages]:
        product_previews.extend(parse_search(response))
    return product_previews
Above, we defined our scrape_search function, where we use a common web scraping idiom for known-length pagination: we scrape the first page to extract the total number of pages and then scrape the remaining pages concurrently.
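In its most stripped-down form, the idiom looks like this (scrape_page and parse_total_pages here are hypothetical placeholders for any page fetcher and page count parser):

import asyncio


async def scrape_paginated(scrape_page, parse_total_pages):
    """known-length pagination: the first page reveals the page count,
    the rest are fetched concurrently in a single batch"""
    first_page = await scrape_page(1)
    total_pages = parse_total_pages(first_page)
    # fan out the remaining page requests concurrently
    other_pages = await asyncio.gather(*[scrape_page(page) for page in range(2, total_pages + 1)])
    return [first_page, *other_pages]

Our scrape_search function is exactly this pattern with Aliexpress-specific fetching and parsing plugged in.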
Now that we can find products, let's take a look at how we can scrape product data, pricing info and reviews!
To scrape Aliexpress products, all we need is a product's numeric ID, which we already found in the previous chapter by scraping product previews from Aliexpress search. For example, this hand drill product aliexpress.com/item/4000927436411.html has the numeric ID of 4000927436411.
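If we only have product URLs (e.g. collected from search previews), the numeric ID can be pulled out with a short regex. Here's a hypothetical helper for that:

import re


def product_id_from_url(url: str) -> str:
    """extract the numeric product ID from an aliexpress product URL"""
    return re.search(r"/item/(\d+)\.html", url).group(1)


print(product_id_from_url("https://www.aliexpress.com/item/4000927436411.html"))  # 4000927436411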
To parse product data, we can use the same technique we used in our search parser - the data is hidden in the HTML document under the window.runParams variable's data key:
import asyncio
import json

import httpx
from parsel import Selector


def parse_product(response):
    """parse product HTML page for product data"""
    sel = Selector(text=response.text)
    # find the script tag containing our data:
    script_with_data = sel.xpath('//script[contains(text(),"window.runParams")]')
    # extract data using a regex pattern:
    data = json.loads(script_with_data.re(r"data: ({.+?}),\n")[0])
    product = {
        "name": data["titleModule"]["subject"],
        "total_orders": data["titleModule"]["formatTradeCount"],
        "feedback": data["titleModule"]["feedbackRating"],
        "variants": [],
    }
    # every product variant has its own price and ID number (sku):
    for sku in data["skuModule"]["skuPriceList"]:
        product["variants"].append(
            {
                "name": sku["skuAttr"].split("#", 1)[1].split(";")[0],
                "sku": sku["skuId"],
                "available": sku["skuVal"]["availQuantity"],
                "full_price": sku["skuVal"]["skuAmount"]["value"],
                "discount_price": sku["skuVal"]["skuActivityAmount"]["value"],
                "currency": sku["skuVal"]["skuAmount"]["currency"],
            }
        )
    # the data variable contains much more information - feel free to explore it,
    # but to keep things brief we focus on the essentials in this article
    return product


async def scrape_products(ids, session: httpx.AsyncClient):
    """scrape aliexpress products by id"""
    print(f"scraping {len(ids)} products")
    responses = await asyncio.gather(*[session.get(f"https://www.aliexpress.com/item/{id_}.html") for id_ in ids])
    results = []
    for response in responses:
        results.append(parse_product(response))
    return results
Here, we defined our product scraping function, which takes in product IDs, scrapes the HTML contents and extracts the hidden product JSON of each product. If we run it for our drill product, we should see a nicely formatted response:
# Let's use browser-like request headers for this scrape to reduce the chance of being blocked or asked to solve a captcha
BASE_HEADERS = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US,en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
}


async def run():
    async with httpx.AsyncClient(headers=BASE_HEADERS) as session:
        print(json.dumps(await scrape_products(["4000927436411"], session), indent=2))


if __name__ == "__main__":
    asyncio.run(run())
[
  {
    "name": "Mini Wireless Drill Electric Carving Pen Variable Speed USB Cordless Drill Rotary Tools Kit Engraver Pen for Grinding Polishing",
    "total_orders": "3824",
    "feedback": {
      "averageStar": "4.8",
      "averageStarRage": "96.4",
      "display": true,
      "evarageStar": "4.8",
      "evarageStarRage": "96.4",
      "fiveStarNum": 1724,
      "fiveStarRate": "88",
      "fourStarNum": 170,
      "fourStarRate": "9",
      "oneStarNum": 21,
      "oneStarRate": "1",
      "positiveRate": "87.6",
      "threeStarNum": 45,
      "threeStarRate": "2",
      "totalValidNum": 1967,
      "trialReviewNum": 0,
      "twoStarNum": 7,
      "twoStarRate": "0"
    },
    "variants": [
      {
        "name": "Red",
        "sku": 10000011265318724,
        "available": 1601,
        "full_price": 16.24,
        "discount_price": 12.99,
        "currency": "USD"
      },
      ...
    ]
  }
]
Using this approach, we scraped much more data than we could see in the visible HTML of the page. We got SKU numbers, stock availability, detailed pricing and review score meta information. We're only missing the reviews themselves, so let's take a look at how we can retrieve the review data.
Aliexpress' product reviews require an additional request to its backend API. If we fire up the Network Inspector devtools (F12 in major browsers, then the "Network" tab), we can see a background request being made when we click on the next review page:
Let's replicate this request in our scraper:
import asyncio
import math

import httpx
from parsel import Selector


def parse_review_page(response):
    """parse a single review page"""
    sel = Selector(text=response.text)
    parsed = []
    for review_box in sel.css(".feedback-item"):
        # to get the star score we have to rely on styling where 1 star == 20% width, e.g. 4 stars is 80%
        stars = int(review_box.css(".star-view>span::attr(style)").re(r"width:(\d+)%")[0]) / 20
        # to get options we must iterate through every option container
        options = {}
        for option in review_box.css("div.user-order-info>span"):
            name = option.css("strong::text").get("").strip()
            value = "".join(option.xpath("text()").getall()).strip()
            options[name] = value
        # parse remaining fields
        parsed.append(
            {
                "country": review_box.css(".user-country>b::text").get("").strip(),
                "text": review_box.xpath('.//dt[contains(@class,"buyer-feedback")]/span[1]/text()').get("").strip(),
                "post_time": review_box.xpath('.//dt[contains(@class,"buyer-feedback")]/span[2]/text()').get("").strip(),
                "stars": stars,
                "order_info": options,
                "user_name": review_box.css(".user-name>a::text").get(),
                "user_url": review_box.css(".user-name>a::attr(href)").get(),
            }
        )
    return parsed


async def scrape_product_reviews(seller_id: str, product_id: str, session: httpx.AsyncClient):
    """scrape all reviews of an aliexpress product"""

    async def scrape_page(page):
        print(f"scraping review page {page} of product {product_id}")
        data = f"ownerMemberId={seller_id}&memberType=seller&productId={product_id}&companyId=&evaStarFilterValue=all+Stars&evaSortValue=sortlarest%40feedback&page={page}&currentPage={page - 1}&startValidDate=&i18n=true&withPictures=false&withAdditionalFeedback=false&onlyFromMyCountry=false&version=&isOpened=true&translate=+Y+&jumpToTop=true&v=2"
        resp = await session.post(
            "https://feedback.aliexpress.com/display/productEvaluation.htm",
            # send the form-encoded payload as the raw request body:
            content=data,
            headers={**session.headers, "Content-Type": "application/x-www-form-urlencoded"},
        )
        return resp

    # scrape the first page of reviews and find the total count of review pages
    first_page = await scrape_page(page=1)
    sel = Selector(text=first_page.text)
    total_reviews = sel.css("div.customer-reviews").re(r"\((\d+)\)")[0]
    total_pages = int(math.ceil(int(total_reviews) / 10))
    # then scrape the remaining review pages concurrently
    print(f"scraping reviews of product {product_id}, found {total_reviews} total reviews")
    other_pages = await asyncio.gather(*[scrape_page(page) for page in range(2, total_pages + 1)])
    reviews = []
    for resp in [first_page, *other_pages]:
        reviews.extend(parse_review_page(resp))
    return reviews
For scraping reviews, we're using the same paging idiom we learned earlier - we request the first page, find the total page count and retrieve the rest concurrently.
Further, since the reviews are only available as HTML, we have to dig into HTML parsing a bit more. We iterated through each review box and extracted core details such as the star rating, review text and order info - all with a few clever XPath and CSS selectors!
For more on parsing HTML using XPath, see our complete, interactive introduction course.
Now that we have our Aliexpress review scraper, let's take it for a spin. For that, we'll need a seller ID and product ID, which we found previously in our product data scraper (fields sellerId and productId):
async def run():
    async with httpx.AsyncClient(headers=BASE_HEADERS) as session:
        print(json.dumps(await scrape_product_reviews("220712488", "4000714658687", session), indent=2))


if __name__ == "__main__":
    asyncio.run(run())
[
  {
    "country": "BR",
    "text": "As requested and",
    "post_time": "31 May 2022 16:11",
    "stars": 5.0,
    "order_info": {
      "Color:": "DKCD20FU-Li SET2",
      "Ships From:": "China",
      "Logistics:": "Seller's Shipping Method"
    },
    "user_name": "S***s",
    "user_url": "feedback.aliexpress.com/display/detail.htm?ownerMemberId=XXXXXXXXX==&memberType=buyer"
  },
  ...
]
With this, we've covered the main scrape targets of Aliexpress - we scraped search to find products, product pages to find product data and product reviews to gather feedback intelligence. Finally, to scrape at scale, let's take a look at how we can avoid blocking and captchas.
Scraping product data of Aliexpress.com seems easy, though unfortunately, when scraping at scale we might get blocked or asked to solve captchas, which will hinder our web scraping process.
To get around this, let's take advantage of the ScrapFly API, which offers several powerful features that'll help us get around AliExpress's blocking!
For this, we'll be using the scrapfly-sdk python package and ScrapFly's anti scraping protection bypass feature. First, let's install scrapfly-sdk using pip:
$ pip install scrapfly-sdk
To take advantage of ScrapFly's API in our AliExpress product scraper, all we need to do is replace our httpx session code with scrapfly-sdk requests.
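As a minimal sketch (assuming your API key goes in the YOUR_SCRAPFLY_KEY placeholder), retrieving a product page through the SDK could look like this:

from scrapfly import ScrapflyClient, ScrapeConfig

scrapfly = ScrapflyClient(key="YOUR_SCRAPFLY_KEY")  # placeholder for your API key
result = scrapfly.scrape(ScrapeConfig(
    url="https://www.aliexpress.com/item/4000927436411.html",
    # enable the anti scraping protection bypass feature:
    asp=True,
))
# the resulting HTML can be fed into the parse_product function we wrote earlier
print(result.content)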
To wrap this guide up, let's take a look at some frequently asked questions about web scraping aliexpress.com:
Is web scraping Aliexpress.com legal?
Yes. Aliexpress product data is publicly available, and we're not extracting anything personal or private. Scraping aliexpress.com at slow, respectful rates falls under the ethical scraping definition. See our Is Web Scraping Legal? article for more.
Does Aliexpress.com have a public API?
No. Currently, there's no public API for retrieving product data from Aliexpress.com. Fortunately, as covered in this tutorial, web scraping Aliexpress is easy and can be done with a few lines of Python code!
Why does scraped Aliexpress data differ from what I see in the browser?
The main cause of data differences is geo location. Aliexpress.com shows different prices and products based on the user's location, so the scraper needs to match the location of the desired data. If you're using the Scrapfly API, see our geo location selection feature.
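For example, with the scrapfly-sdk the request can be routed through a proxy in a specific country (a sketch, reusing the hypothetical client setup from the previous example):

from scrapfly import ScrapflyClient, ScrapeConfig

scrapfly = ScrapflyClient(key="YOUR_SCRAPFLY_KEY")  # placeholder for your API key
result = scrapfly.scrape(ScrapeConfig(
    url="https://www.aliexpress.com/item/4000927436411.html",
    # match the geo location of the desired data, e.g. the United States:
    country="US",
))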
In this tutorial, we built an Aliexpress data scraper capable of using the search system to discover products and scraping product data and product reviews.
We used Python with the httpx and parsel packages, and to avoid being blocked we used ScrapFly's API, which smartly configures every web scraper connection to bypass blocking. For more on ScrapFly, see our documentation and try it out for free!