How to Scrape Aliexpress.com (2024 Update)

Aliexpress is one of the biggest global e-commerce stores from China and a popular web scraping target.

Aliexpress contains millions of products and product reviews that can be used in market analytics, business intelligence and dropshipping.

In this tutorial, we'll take a look at how to scrape Aliexpress. We'll start by finding products by scraping the search system. Then we'll scrape the found product data, pricing and customer reviews.

This will be a relatively easy scraper in just a few lines of Python code. Let's dive in!

Latest Aliexpress.com Scraper Code

https://github.com/scrapfly/scrapfly-scrapers/

Why Scrape Aliexpress?

There are many reasons to scrape Aliexpress data. For starters, because Aliexpress is one of the biggest e-commerce platforms in the world, it's a prime target for business intelligence and market analytics. Knowing the top products and their meta-information on Aliexpress can be a great advantage in business and market analysis.

Another common use is e-commerce, primarily via dropshipping. One of the biggest emergent markets of this century is curating a list of products and reselling them directly rather than managing a warehouse. In this case, many shop curators scrape Aliexpress products to generate curated product lists for their dropshipping shops.

Project Setup

In this tutorial we'll be using Python with three packages:

  • httpx - HTTP client library, which will let us communicate with AliExpress.com's servers
  • parsel - HTML parsing library, which will help us to parse our web scraped HTML files for product data.
  • jmespath - JSON parsing library, which we'll use to refine very long JSON datasets.

All of these packages can be easily installed via pip command:

$ pip install httpx parsel jmespath

Alternatively, you're free to swap httpx out for any other HTTP client library, such as requests, as we'll only need basic HTTP functions, which are almost interchangeable across libraries. As for parsel, another great alternative is the beautifulsoup package.

Hands on Python Web Scraping Tutorial and Example Project

While our Aliexpress scraper is pretty easy, if you're new to web scraping with Python we recommend checking out our full introduction tutorial to web scraping with Python and common best practices.

Finding Aliexpress Products

There are many ways to discover products on Aliexpress.
We could use the search system to find products we want to scrape or explore the many product categories. Whichever approach we take, our key target is the same: scrape product previews and pagination.

Let's take a look at an Aliexpress listing page that is used in the search or category view:

If we take a look at the page source of either a search or category page, we can see that all the product previews are stored in a JavaScript variable tucked away in a <script> tag in the HTML source of the page (window.runParams in the screenshot; the parser code further below matches the _init_data_ variable):

We can see product preview data by exploring page source in our browser

This is a common web development pattern that enables dynamic data management using JavaScript.

It's good news for us though, as we can pick this data up with a simple regex pattern and parse it like a Python dictionary! This is generally called hidden web data scraping and it's a common pattern in modern web scraping.
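As a minimal sketch of the idea, the snippet below pulls a JSON object out of a script tag using only the standard library. The HTML string and the helper name extract_hidden_json are made-up stand-ins, so treat the pattern as illustrative rather than as the exact markup Aliexpress serves.

```python
import json
import re

# Hypothetical, heavily simplified page source; real pages are much larger.
SAMPLE_HTML = """
<html><body>
<script>window._init_data_ = { data: {"data": {"root": {"fields": {"pageInfo": {"pageSize": 60}}}}} }</script>
</body></html>
"""


def extract_hidden_json(html: str) -> dict:
    """Find the JSON object assigned to _init_data_ and decode it."""
    match = re.search(r"_init_data_\s*=\s*\{\s*data:\s*(\{.+\}) \}", html)
    if match is None:
        raise ValueError("hidden data variable not found in page source")
    return json.loads(match.group(1))


hidden = extract_hidden_json(SAMPLE_HTML)
page_size = hidden["data"]["root"]["fields"]["pageInfo"]["pageSize"]
print(page_size)
```

Once decoded, the hidden dataset behaves like any other Python dictionary, so no HTML selectors are needed for the individual fields.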

With this, we can write the first piece of our Aliexpress scraper code - the product preview parser. We'll be using it to extract product preview data from category or search result pages:

from parsel import Selector
from typing import Dict
import httpx
import json

def extract_search(response) -> Dict:
    """extract json data from search page"""
    sel = Selector(response.text)
    # find the script tag that contains the _init_data_ variable
    script_with_data = sel.xpath('//script[contains(.,"_init_data_=")]')
    # select page data from javascript variable in script tag using regex
    data = json.loads(script_with_data.re(r'_init_data_\s*=\s*{\s*data:\s*({.+}) }')[0])
    return data['data']['root']['fields']

def parse_search(response):
    """Parse search page response for product preview results"""
    data = extract_search(response)
    parsed = []
    for result in data["mods"]["itemList"]["content"]:
        parsed.append(
            {
                "id": result["productId"],
                "url": f"https://www.aliexpress.com/item/{result['productId']}.html",
                "type": result["productType"],  # can be either natural or ad
                "title": result["title"]["displayTitle"],
                "price": result["prices"]["salePrice"]["minPrice"],
                "currency": result["prices"]["salePrice"]["currencyCode"],
                "trade": result.get("trade", {}).get("tradeDesc"),  # trade line is not always present
                "thumbnail": result["image"]["imgUrl"].lstrip("/"),
                "store": {
                    "url": result["store"]["storeUrl"],
                    "name": result["store"]["storeName"],
                    "id": result["store"]["storeId"],
                    "ali_id": result["store"]["aliMemberId"],
                },
            }
        )
    return parsed

Let's try our parser out by scraping a single Aliexpress listing page (category page or search results page):

Run code & example output
if __name__ == "__main__":
    # for example, this category is for android phones:
    resp = httpx.get("https://www.aliexpress.com/category/5090301/cellphones.html", follow_redirects=True)
    print(json.dumps(parse_search(resp), indent=2, ensure_ascii=False))

Example output:

[
  {
    "id": "3256804075561256",
    "url": "https://www.aliexpress.com/item/3256804075561256.html",
    "type": "ad",
    "title": "2G/3G Smartphones Original 512MB RAM/1G RAM 4GB ROM android mobile phones new cheap celulares FM unlocked 4.0inch cell",
    "price": 21.99,
    "currency": "USD",
    "trade": "8 sold",
    "thumbnail": "ae01.alicdn.com/kf/S1317aeee4a064fad8810a58959c3027dm/2G-3G-Smartphones-Original-512MB-RAM-1G-RAM-4GB-ROM-android-mobile-phones-new-cheap-celulares.jpg_220x220xz.jpg",
    "store": {
      "url": "www.aliexpress.com/store/1101690689",
      "name": "New 123 Store",
      "id": 1101690689,
      "ali_id": 247497658
    }
  }
  ...
]

There's a lot of useful information, but we've limited our parser to bare essentials to keep things brief. Next, let's put this parser to use in actual scraping.

Now that we have our product preview parser ready, we need a scraper loop that will iterate through search results to collect all available results - not just the first page:

from parsel import Selector
from typing import Dict
import httpx
import math
import json
import asyncio

def extract_search(response) -> Dict:
    ...  # same function body as in the previous example

def parse_search(response):
    ...  # same function body as in the previous example

async def scrape_search(query: str, session: httpx.AsyncClient, sort_type="default", max_pages: int = None):
    """Scrape all search results and return parsed search result data"""
    query = query.replace(" ", "+")

    async def scrape_search_page(page):
        """Scrape a single aliexpress search page and return all embedded JSON search data"""
        print(f"scraping search query {query}:{page} sorted by {sort_type}")
        resp = await session.get(
            "https://www.aliexpress.com/wholesale?trafficChannel=main"
            f"&d=y&CatId=0&SearchText={query}&ltype=wholesale&SortType={sort_type}&page={page}"
        )
        return resp

    # scrape first search page and find total result count
    first_page = await scrape_search_page(1)
    first_page_data = extract_search(first_page)
    page_size = first_page_data["pageInfo"]["pageSize"]
    total_pages = int(math.ceil(first_page_data["pageInfo"]["totalResults"] / page_size))
    if total_pages > 60:
        print(f"query has {total_pages} pages; lowering to max allowed 60 pages")
        total_pages = 60

    # get the number of total pages to scrape
    if max_pages and max_pages < total_pages:
        total_pages = max_pages

    # scrape remaining pages concurrently
    print(f'scraping search "{query}" of total {total_pages} sorted by {sort_type}')

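    # page 1 is already scraped above, so only request pages 2..total_pages
    other_pages = await asyncio.gather(*[scrape_search_page(page=i) for i in range(2, total_pages + 1)])
    # collect previews from every page into a single list
    product_previews = []
    for response in [first_page, *other_pages]:
        product_previews.extend(parse_search(response))

    return product_previews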
Run the code
async def run():
    client = httpx.AsyncClient(follow_redirects=True)
    data = await scrape_search(query="cell phones", session=client, max_pages=3)
    print(json.dumps(data, indent=2, ensure_ascii=False))

if __name__ == "__main__":
    asyncio.run(run()) 

Above, we defined our scrape_search function, which uses a common web scraping idiom for known-length pagination:

We scrape the first page to extract the total number of pages and scrape the remaining pages concurrently. We also add a max_pages parameter to control the number of pagination pages.
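The page arithmetic behind this idiom can be sketched on its own. The helper name plan_pages and the sample numbers are hypothetical; the default cap of 60 mirrors the page limit used in the scraper. Because page 1 is already fetched to learn the total, only pages 2 onward remain.

```python
import math
from typing import List, Optional


def plan_pages(
    total_results: int,
    page_size: int,
    max_pages: Optional[int] = None,
    hard_cap: int = 60,
) -> List[int]:
    """Return the page numbers still to be scraped after page 1."""
    total_pages = math.ceil(total_results / page_size)
    # never go past the assumed site-wide page cap
    total_pages = min(total_pages, hard_cap)
    # optionally honor a caller-provided limit
    if max_pages is not None:
        total_pages = min(total_pages, max_pages)
    return list(range(2, total_pages + 1))


print(plan_pages(total_results=150, page_size=60))
print(plan_pages(total_results=10_000, page_size=60, max_pages=4))
```

The returned list is what would be handed to asyncio.gather, one request per page number.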

Now that we can find products, let's take a look at how we can scrape product data, pricing info and reviews!

Scraping Aliexpress Products

To scrape Aliexpress products, all we need is a numeric product ID, which we already found in the previous section by scraping product previews from Aliexpress search. For example, this hand drill product aliexpress.com/item/4000927436411.html has the numeric ID of 4000927436411.
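If we only have product URLs at hand, the ID can be recovered with a small regular expression. This is a sketch that assumes the /item/<id>.html URL shape shown in this tutorial; the helper name product_id_from_url is hypothetical.

```python
import re
from typing import Optional


def product_id_from_url(url: str) -> Optional[str]:
    """Return the numeric ID from an /item/<id>.html URL, or None if absent."""
    match = re.search(r"/item/(\d+)\.html", url)
    return match.group(1) if match else None


print(product_id_from_url("https://www.aliexpress.com/item/4000927436411.html"))
```

The ID is kept as a string since it is only ever interpolated back into URLs.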

To parse product data, we can use the same technique we used in our search parser - the data is hidden in the HTML document under the data key of the window.runParams variable:

import re
import json
import asyncio

import httpx
import jmespath
from parsel import Selector

def parse_product(response):
    """parse product HTML page for product data"""
    print(response)
    sel = Selector(text=response.text)
    # find the script tag containing our data:
    script_with_data = sel.xpath('//script[contains(text(),"window.runParams")]/text()').get()
    # extract data using a regex pattern:    
    data = re.findall(r".+?data:\s*({.+?)};", script_with_data, re.DOTALL)
    data = json.loads(data[0])
    # debugging aid: dump the full hidden dataset to a file for inspection
    with open("data.json", "w", encoding="utf-8") as file:
        json.dump(data, file, indent=2, ensure_ascii=False)
    if "skuModule" not in data:
        product = jmespath.search("""{
            name: productInfoComponent.subject,
            total_orders: tradeComponent.formatTradeCount,
            feedback: feedbackComponent,
            description_url: productDescComponent.descriptionUrl,
            description_short: metaDataComponent.description,
            keywords: metaDataComponent.keywords,
            images: imageComponent.imagePathList,
            stock: inventoryComponent.totalAvailQuantity,
            seller: sellerComponent.{
                id: storeNum,
                url: storeURL,
                name: storeName,
                country: countryCompleteName,
                positive_rating: positiveRate,
                positive_rating_count: positiveNum,
                started_on: openTime,
                is_top_rated: topRatedSeller
            },
            specification: productPropComponent.props[].{
                name: attrName,
                value: attrValue
            },
            variants: priceComponent.skuPriceList[].{
                name: skuAttr,
                sku: skuId,
                available: skuVal.availQuantity,
                stock: skuVal.inventory,
                full_price: skuVal.skuAmount.value,
                discount_price: skuVal.skuActivityAmount.value,
                currency: skuVal.skuAmount.currency
            }
        }""", data)
    else:
        product = jmespath.search("""{
            name: titleModule.subject,
            total_orders: titleModule.formatTradeCount,
            feedback: titleModule.feedbackRating,
            description_url: descriptionModule.descriptionUrl,
            description_short: pageModule.description,
            keywords: pageModule.keywords,
            images: imageModule.imagePathList,
            stock: quantityModule.totalAvailQuantity,
            seller: storeModule.{
                id: storeNum,
                url: storeURL,
                name: storeName,
                country: countryCompleteName,
                positive_rating: positiveRate,
                positive_rating_count: positiveNum,
                started_on: openTime,
                is_top_rated: topRatedSeller
            },
            specification: specsModule.props[].{
                name: attrName,
                value: attrValue
            },
            variants: skuModule.skuPriceList[].{
                name: skuAttr,
                sku: skuId,
                available: skuVal.availQuantity,
                stock: skuVal.inventory,
                full_price: skuVal.skuAmount.value,
                discount_price: skuVal.skuActivityAmount.value,
                currency: skuVal.skuAmount.currency
            }
        }""", data)
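    # convert the [{name, value}, ...] list into a {name: value} dictionary;
    # jmespath returns None when the specification path is missing
    product["specification"] = {
        spec["name"]: spec["value"] for spec in product.get("specification") or []
    }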
    return product


async def scrape_products(ids, session: httpx.AsyncClient):
    """scrape aliexpress products by id"""
    print(f"scraping {len(ids)} products")
    responses = await asyncio.gather(*[session.get(f"https://www.aliexpress.com/item/{id_}.html") for id_ in ids])
    results = []
    for response in responses:
        results.append(parse_product(response))
    return results

Here, we defined our product scraping function which takes in product IDs, scrapes HTML contents and extracts hidden product JSON of each product. The JSON results are very long datasets, so we used JMESPath to look for the desired data points.
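The final reshaping step of the parser, turning the list of specification entries into a flat dictionary, can be sketched in isolation. The sample values and the helper name flatten_specification are made up for illustration, and the None handling assumes the specification path may be absent from some product datasets.

```python
from typing import Dict, List, Optional


def flatten_specification(props: Optional[List[Dict[str, str]]]) -> Dict[str, str]:
    """Turn [{"name": ..., "value": ...}, ...] into {name: value}."""
    return {prop["name"]: prop["value"] for prop in props or []}


# hypothetical specification entries in the shape produced by the JMESPath query
sample = [
    {"name": "Brand Name", "value": "ExampleBrand"},
    {"name": "Power Source", "value": "Battery"},
]
print(flatten_specification(sample))
print(flatten_specification(None))
```

A flat dictionary is easier to query later, for example when comparing one attribute across many products.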

Quick Intro to Parsing JSON with JMESPath in Python

In this JMESPath tutorial, we'll take a quick overview of this path language in web scraping and Python. You will learn how to trim very large JSON datasets to extract specific data.

If we run it for our drill product, we should see a nicely formatted response:

Run code & example output
# Let's use browser like request headers for this scrape to reduce chance of being blocked or asked to solve a captcha
BASE_HEADERS = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    # languages are separated by commas; ";q=" only sets a weight
    "accept-language": "en-US,en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
}


async def run():
    async with httpx.AsyncClient(headers=BASE_HEADERS, follow_redirects=True) as session:
        print(json.dumps(await scrape_products(["4000927436411"], session), indent=2, ensure_ascii=False))

if __name__ == "__main__":
    import asyncio
    asyncio.run(run())

Example output:

[
  {
    "name": "Mini Wireless Drill Electric Carving Pen Variable Speed USB Cordless Drill Rotary Tools Kit Engraver Pen for Grinding Polishing",
    "total_orders": "3824",
    "feedback": {
      "averageStar": "4.8",
      "averageStarRage": "96.4",
      "display": true,
      "evarageStar": "4.8",
      "evarageStarRage": "96.4",
      "fiveStarNum": 1724,
      "fiveStarRate": "88",
      "fourStarNum": 170,
      "fourStarRate": "9",
      "oneStarNum": 21,
      "oneStarRate": "1",
      "positiveRate": "87.6",
      "threeStarNum": 45,
      "threeStarRate": "2",
      "totalValidNum": 1967,
      "trialReviewNum": 0,
      "twoStarNum": 7,
      "twoStarRate": "0"
    },
    "variants": [
      {
        "name": "Red",
        "sku": 10000011265318724,
        "available": 1601,
        "full_price": 16.24,
        "discount_price": 12.99,
        "currency": "USD"
      },
      ...
  }
]

Using this approach, we scraped much more data than we could see in the visible HTML of the page. We got SKU numbers, stock availability, detailed pricing and review score meta information. We're only missing the reviews themselves, so let's take a look at how we can retrieve the review data.

Scraping Aliexpress Reviews

Aliexpress product reviews require an additional request to its backend API. If we fire up the Network Inspector in devtools (F12 in major browsers, then the "Network" tab), we can see a background request being made when we click on the next review page:

We can see a background request being made when we click on page 2 link

Let's replicate this request in our scraper:

import json
import math
import asyncio

import httpx
from parsel import Selector

def parse_review_page(response):
    """parse single review page"""
    sel = Selector(response.text)
    parsed = []
    for review_box in sel.css(".feedback-item"):
        # the star score is only exposed through styling, where 1 star == 20% width, e.g. 4 stars is 80%
        stars = int(review_box.css(".star-view>span::attr(style)").re(r"width:(\d+)%")[0]) / 20
        # to get options we must iterate through every options container
        options = {}
        for option in review_box.css("div.user-order-info>span"):
            name = option.css("strong::text").get("").strip()
            value = "".join(option.xpath("text()").getall()).strip()
            options[name] = value
        # parse remaining fields
        parsed.append(
            {
                "country": review_box.css(".user-country>b::text").get("").strip(),
                "text": review_box.xpath('.//dt[contains(@class,"buyer-feedback")]/span[1]/text()').get("").strip(),
                "post_time": review_box.xpath('.//dt[contains(@class,"buyer-feedback")]/span[2]/text()').get("").strip(),
                "stars": stars,
                "order_info": options,
                "user_name": review_box.css(".user-name>a::text").get(),
                "user_url": review_box.css(".user-name>a::attr(href)").get(),
            }
        )
    return parsed


async def scrape_product_reviews(seller_id: str, product_id: str, session: httpx.AsyncClient):
    """scrape all reviews of aliexpress product"""

    async def scrape_page(page):
        print(f"scraping review page {page} of product {product_id}")
        data = f"ownerMemberId={seller_id}&memberType=seller&productId={product_id}&companyId=&evaStarFilterValue=all+Stars&evaSortValue=sortlarest%40feedback&page={page}&currentPage={page-1}&startValidDate=&i18n=true&withPictures=false&withAdditionalFeedback=false&onlyFromMyCountry=false&version=&isOpened=true&translate=+Y+&jumpToTop=true&v=2"
        resp = await session.post(
            "https://feedback.aliexpress.com/display/productEvaluation.htm",
            data=data,
            headers={**session.headers, "Content-Type": "application/x-www-form-urlencoded"},
        )
        return resp

    # scrape first page of reviews and find total count of review pages
    first_page = await scrape_page(page=1)

    sel = Selector(text=first_page.text)
    total_reviews = sel.css("div.customer-reviews").re(r"\((\d+)\)")[0]
    total_pages = int(math.ceil(int(total_reviews) / 10))

    # then scrape remaining review pages concurrently
    print(f"scraping reviews of product {product_id}, found {total_reviews} total reviews")
    # page 1 is already scraped above, so only request pages 2..total_pages
    other_pages = await asyncio.gather(*[scrape_page(page) for page in range(2, total_pages + 1)])
    reviews = []
    for resp in [first_page, *other_pages]:
        reviews.extend(parse_review_page(resp))
    return reviews

For scraping reviews, we're using the same paging idiom we learned earlier: we request the first page, find the total count and retrieve the rest concurrently.
Further, since reviews are only available as HTML, we have to dig into HTML parsing a bit more. We iterated through each review box and extracted core details such as star rating, review text, post time and order options, all with a few XPath and CSS selectors!
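The star rating conversion is the least obvious part of the parser, so here it is as a standalone sketch. It assumes the style attribute carries a width percentage, as in the review markup parsed above; the helper name stars_from_style is hypothetical.

```python
import re
from typing import Optional


def stars_from_style(style: str) -> Optional[float]:
    """Convert a 'width:NN%' style value into a 0-5 star score."""
    match = re.search(r"width:\s*(\d+)%", style)
    if match is None:
        return None  # no width found, so the rating is unknown
    # each star accounts for 20% of the bar width
    return int(match.group(1)) / 20


print(stars_from_style("width:80%"))
print(stars_from_style("width:100%"))
```

Returning None instead of raising keeps a single malformed review box from aborting the whole page.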

Parsing HTML with XPath

For more on parsing HTML using XPath, see our complete, interactive introduction course.

Now that we have our Aliexpress review scraper, let's take it for a spin. For that we'll need a seller ID and a product ID, which we found previously in our product data scraper (fields sellerId and productId).

Run code & example output
BASE_HEADERS = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    # languages are separated by commas; ";q=" only sets a weight
    "accept-language": "en-US,en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
}

async def run():
    async with httpx.AsyncClient(headers=BASE_HEADERS) as session:
        print(json.dumps(await scrape_product_reviews("220712488", "4000714658687", session), indent=2, ensure_ascii=False))
        

if __name__ == "__main__":
    asyncio.run(run())

Example output:

[
  {
    "country": "RU",
    "text": "The capacity of the bank in comparison with the order in the year before last leaves much to be desired, the build quality this time is also not up to par even for such a budget segment.",
    "post_time": "09 Jan 2024 20:56",
    "stars": 1.0,
    "order_info": {
      "Color:": "DKCD12FU-Li SET2",
      "Plug Type:": "EU",
      "Ships From:": "Russian Federation",
      "Logistics:": "Цайняо доставка-в пункт выдачи"
    },
    "user_name": "M***v",
    "user_url": "//feedback.aliexpress.com/display/detail.htm?ownerMemberId=sw0rkbrVNYg8OlaUP1Bvtw==&memberType=buyer"
  },
...
]

With this last feature, we've covered the main scrape targets of Aliexpress - we scraped search to find products, product pages to find product data and product reviews to gather feedback intelligence. Finally, to scrape at scale, let's take a look at how we can avoid blocking and captchas.

Bypass Aliexpress Blocking and Captchas

Scraping product data from Aliexpress.com seems easy. Unfortunately, when scraping at scale we might get blocked or be asked to solve captchas, which will hinder our web scraping process.

To get around this, let's take advantage of ScrapFly API, which can avoid all of these blocks for us!

ScrapFly offers several powerful features that'll help us get around AliExpress's blocking.

For this, we'll be using the scrapfly-sdk Python package and ScrapFly's anti-scraping protection bypass feature. First, let's install scrapfly-sdk using pip:

$ pip install scrapfly-sdk

To take advantage of ScrapFly's API in our AliExpress product scraper, all we need to do is replace our httpx session code with scrapfly-sdk requests.

FAQ

To wrap this guide up, let's take a look at some frequently asked questions about web scraping aliexpress.com:

Is it legal to scrape Aliexpress?

Yes. Aliexpress product data is publicly available, and we're not extracting anything personal or private. Scraping aliexpress.com at slow, respectful rates would fall under the ethical scraping definition. See our Is Web Scraping Legal? article for more.
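One way to keep request rates modest is to cap how many requests run at once. The sketch below uses asyncio.Semaphore with a placeholder fetch function; the names fetch and scrape_slowly and the concurrency limit of 2 are assumptions for illustration, and the sleep stands in for a real session.get call.

```python
import asyncio
from typing import List


async def fetch(page: int, limiter: asyncio.Semaphore) -> str:
    """Pretend to download one page while holding a concurrency slot."""
    async with limiter:
        await asyncio.sleep(0.01)  # placeholder for an actual HTTP request
        return f"page {page}"


async def scrape_slowly(pages: List[int], max_concurrency: int = 2) -> List[str]:
    """Fetch all pages, never running more than max_concurrency at a time."""
    limiter = asyncio.Semaphore(max_concurrency)
    return await asyncio.gather(*[fetch(page, limiter) for page in pages])


results = asyncio.run(scrape_slowly([1, 2, 3, 4, 5]))
print(results)
```

asyncio.gather returns results in the order the tasks were passed in, so the output order matches the page order even though the requests overlap.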

Is there an Aliexpress API?

No. Currently there's no public API for retrieving product data from Aliexpress.com. Fortunately, as covered in this tutorial, web scraping Aliexpress is easy and can be done with a few lines of Python code!

Scraped Aliexpress data is not accurate, what can I do?

The main cause of data difference is geo location. Aliexpress.com shows different prices and products based on the user's location so the scraper needs to match the location of the desired data. See our previous guide on web scraping localization for more details on changing the web scraping language, price or location.

Aliexpress Scraping Summary

In this tutorial, we built an Aliexpress data scraper capable of using the search system to discover products and scraping product data and product reviews.

We used Python with the httpx, parsel and jmespath packages, and to avoid being blocked we used ScrapFly's API, which smartly configures every web scraper connection to avoid being blocked. For more on ScrapFly, see our documentation and try it out for free!
