How to Scrape Aliexpress.com


In this web scraping guide we'll be scraping aliexpress.com - one of the biggest global e-commerce stores from China.

Aliexpress contains millions of products and product reviews that can be used in market analytics, business intelligence and dropshipping. We'll start by taking a look at how we can discover products listed on Aliexpress and then proceed to scrape them and their reviews. For all of that we'll be using just a few lines of idiomatic Python code, so let's jump in!

Web Scraping With Python Tutorial

While our Aliexpress scraper is pretty easy, if you're new to web scraping with Python we recommend checking out our full introduction tutorial to web scraping with Python and common best practices.


Why Scrape Aliexpress?

There are many reasons to scrape Aliexpress data. For starters, because Aliexpress is one of the biggest e-commerce platforms in the world, it's a prime target for business intelligence and market analytics. Awareness of top products and their meta-information on Aliexpress can be a great advantage in business and market analysis.

Another common use is e-commerce, primarily via dropshipping. One of the biggest emergent markets of this century is curating a list of products and reselling them directly rather than managing a warehouse. In this case, many shop curators scrape Aliexpress products to generate curated product lists for their dropshipping shops.

Setup

In this tutorial we'll be using Python with two packages, plus an optional third one for logging:

  • httpx - HTTP client library which will let us communicate with AliExpress.com's servers
  • parsel - HTML parsing library which will help us parse the scraped HTML files for product data.
  • loguru[optional] - pretty logging library that'll help us keep track of what's going on.

These packages can be easily installed via the pip command:

$ pip install httpx parsel loguru

Alternatively, you're free to swap httpx out with any other HTTP client library such as requests, as we'll only need basic HTTP functions which are almost interchangeable in every library. As for parsel, another great alternative is the beautifulsoup package.

Finding Products

There are many ways to discover products on Aliexpress. We could use the search system to find the products we want to scrape, or explore the many product categories. Either way, our scraping approach would be very similar. Let's take a look at the Aliexpress listing page that is used in the search and category views:


If we take a look at the page source of either a search or a category page, we can see that all the product previews are stored in a JavaScript variable window.runParams tucked away in a <script> tag in the HTML source of the page:


We can see product preview data by exploring page source in our browser

This is a common web development pattern, which enables dynamic data management using JavaScript. It's good news for us though, as we can pick this data up with a regex pattern and parse it into a Python dictionary!
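To see the idea in isolation, here is a minimal sketch that uses only Python's standard library. The HTML snippet and its contents are made up for illustration; real AliExpress pages embed far more data and their structure may differ.

```python
import json
import re

# A made-up page: real pages are much larger, this only mimics the pattern
SAMPLE_HTML = """
<html><body>
<script>
window.runParams = {"pageInfo": {"pageSize": 60, "totalResults": 180}};
</script>
</body></html>
"""


def extract_run_params(html: str) -> dict:
    """Find the window.runParams assignment and decode its JSON value."""
    match = re.search(r"window\.runParams\s*=\s*({.+?});", html, flags=re.DOTALL)
    if match is None:
        raise ValueError("window.runParams not found in page")
    return json.loads(match.group(1))


data = extract_run_params(SAMPLE_HTML)
print(data["pageInfo"]["totalResults"])
```

Note that the non-greedy pattern stops at the first `};` it meets, so it would likely fail if that character sequence ever appeared inside a string value of the embedded data.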

Let's write the first piece of our scraper code - the product preview parser, which we'll be using to extract product data from category or search result pages:

import json

import httpx
from parsel import Selector

def extract_search(response):
    """extract json data from search page"""
    sel = Selector(response.text)
    # find script with page data in it
    script_with_data = sel.xpath('//script[contains(text(),"window.runParams")]')
    # select page data from javascript variable in script tag using regex
    return json.loads(script_with_data.re(r"window.runParams\s*=\s*({.+?});")[0])


def parse_search(response):
    """Parse search page response for product preview results"""
    data = extract_search(response)
    parsed = []
    for result in data["mods"]["itemList"]["content"]:
        parsed.append(
            {
                "id": result["productId"],
                "url": f"https://www.aliexpress.com/item/{result['productId']}.html",
                "type": result["productType"],  # can be either natural or ad
                "title": result["title"]["displayTitle"],
                "price": result["prices"]["salePrice"]["minPrice"],
                "currency": result["prices"]["salePrice"]["currencyCode"],
                "trade": result.get("trade", {}).get("tradeDesc"),  # trade line is not always present
                "thumbnail": result["image"]["imgUrl"].lstrip("/"),
                "store": {
                    "url": result["store"]["storeUrl"],
                    "name": result["store"]["storeName"],
                    "id": result["store"]["storeId"],
                    "ali_id": result["store"]["aliMemberId"],
                },
            }
        )
    return parsed

Let's try our parser out by scraping a single Aliexpress listing page (category page or search results page):

Run code & example output
if __name__ == "__main__":
    # for example, this category is for android phones:
    resp = httpx.get("https://www.aliexpress.com/category/5090301/cellphones.html", follow_redirects=True)
    print(json.dumps(parse_search(resp), indent=2))
[
  {
    "id": "3256804075561256",
    "url": "https://www.aliexpress.com/item/3256804075561256.html",
    "type": "ad",
    "title": "2G/3G Smartphones Original 512MB RAM/1G RAM 4GB ROM android mobile phones new cheap celulares FM unlocked 4.0inch cell",
    "price": 21.99,
    "currency": "USD",
    "trade": "8 sold",
    "thumbnail": "ae01.alicdn.com/kf/S1317aeee4a064fad8810a58959c3027dm/2G-3G-Smartphones-Original-512MB-RAM-1G-RAM-4GB-ROM-android-mobile-phones-new-cheap-celulares.jpg_220x220xz.jpg",
    "store": {
      "url": "www.aliexpress.com/store/1101690689",
      "name": "New 123 Store",
      "id": 1101690689,
      "ali_id": 247497658
    }
  }
  ...
]

There's a lot of useful information, but we've limited our parser to bare essentials to keep things brief. Let's put this parser to use in actual scraping next.
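One detail worth a note: the parser strips leading slashes from thumbnail URLs, which suggests these values arrive in protocol-relative form (such as `//ae01.alicdn.com/...`). That is an assumption based on the code above, not something confirmed here. If absolute URLs are preferred, a small helper along these lines could be used; `absolute_url` and `BASE_URL` are hypothetical names introduced for this sketch.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.aliexpress.com/"


def absolute_url(url: str, base: str = BASE_URL) -> str:
    """Resolve protocol-relative or site-relative URLs against the base URL."""
    return urljoin(base, url)


# protocol-relative URL (assumed input shape) gets the base URL's scheme
print(absolute_url("//ae01.alicdn.com/kf/example.jpg"))
# site-relative path gets the base URL's scheme and host
print(absolute_url("/item/4000927436411.html"))
# already absolute URLs are returned unchanged
print(absolute_url("https://www.aliexpress.com/store/1101690689"))
```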

Now that we have our product preview parser ready, we need a scraper loop that will iterate through search results to collect all available results - not just the first page:

import asyncio
import math

import httpx
from loguru import logger as log


async def scrape_search(query: str, session: httpx.AsyncClient, sort_type="default"):
    """Scrape all search results and return parsed search result data"""
    query = query.replace(" ", "+")

    async def scrape_search_page(page):
        """Scrape a single aliexpress search page and return all embedded JSON search data"""
        log.info(f"scraping search query {query}:{page} sorted by {sort_type}")
        resp = await session.get(
            "https://www.aliexpress.com/wholesale?trafficChannel=main"
            f"&d=y&CatId=0&SearchText={query}&ltype=wholesale&SortType={sort_type}&page={page}"
        )
        return resp

    # scrape first search page and find total result count
    first_page = await scrape_search_page(1)
    first_page_data = extract_search(first_page)
    page_size = first_page_data["pageInfo"]["pageSize"]
    total_pages = int(math.ceil(first_page_data["pageInfo"]["totalResults"] / page_size))
    if total_pages > 60:
        log.warning(f"query has {total_pages} pages; lowering to max allowed 60 pages")
        total_pages = 60

    # scrape remaining pages concurrently (page 1 is already scraped)
    log.info(f'scraping search "{query}" of total {total_pages} sorted by {sort_type}')
    other_pages = await asyncio.gather(*[scrape_search_page(page=i) for i in range(2, total_pages + 1)])

    product_previews = []
    for response in [first_page, *other_pages]:
        product_previews.extend(parse_search(response))
    return product_previews

Above, in our scrape_search function, we use a common web scraping idiom for known-length pagination:


We scrape the first page to extract the total number of pages and scrape the remaining pages concurrently.
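The idiom can be sketched without any network access. In the example below `fake_fetch` is a stand-in for a real HTTP request, and the result counts are passed in directly rather than read from the first page. The `remaining_pages` and `scrape_all` names and the concurrency limit of 5 are assumptions made for illustration. The semaphore is an optional extra that caps how many requests run at once.

```python
import asyncio
import math
from typing import List


def remaining_pages(total_results: int, page_size: int, max_pages: int = 60) -> List[int]:
    """Page numbers still to scrape once page 1 has been fetched."""
    total_pages = min(math.ceil(total_results / page_size), max_pages)
    return list(range(2, total_pages + 1))


async def fake_fetch(page: int) -> str:
    """Stand-in for an HTTP request; a real scraper would call its HTTP client here."""
    await asyncio.sleep(0.01)
    return f"page {page}"


async def scrape_all(total_results: int, page_size: int, concurrency: int = 5) -> List[str]:
    """Fetch page 1 first, then the rest concurrently, at most `concurrency` at a time."""
    first = await fake_fetch(1)
    limiter = asyncio.Semaphore(concurrency)

    async def limited_fetch(page: int) -> str:
        async with limiter:
            return await fake_fetch(page)

    pages = remaining_pages(total_results, page_size)
    others = await asyncio.gather(*[limited_fetch(page) for page in pages])
    return [first, *others]


if __name__ == "__main__":
    print(asyncio.run(scrape_all(total_results=150, page_size=60)))
```

Because `asyncio.gather` returns results in the order the coroutines were passed in, the collected pages stay in page order even though they complete concurrently.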

Now that we can find products, let's take a look at how we can scrape product data, pricing info and reviews!

Scraping Products

To scrape Aliexpress products all we need is a product's numeric ID, which we already found in the previous section by scraping product previews from Aliexpress search. For example, this hand drill product aliexpress.com/item/4000927436411.html has the numeric ID of 4000927436411.

To parse product data we can use the same technique we used in our search parser - the data is hidden in the HTML document under the data key of the window.runParams variable:
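If we start from product URLs rather than IDs, the ID can be recovered with a short regular expression. This is a minimal sketch that assumes the `/item/<id>.html` URL shape shown above; `product_id_from_url` is a hypothetical helper name.

```python
import re
from typing import Optional


def product_id_from_url(url: str) -> Optional[str]:
    """Return the numeric ID from an /item/<id>.html URL, or None if absent."""
    match = re.search(r"/item/(\d+)\.html", url)
    return match.group(1) if match else None


print(product_id_from_url("https://www.aliexpress.com/item/4000927436411.html"))
```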

from parsel import Selector

def parse_product(response):
    """parse product HTML page for product data"""
    sel = Selector(text=response.text)
    # find the script tag containing our data:
    script_with_data = sel.xpath('//script[contains(text(),"window.runParams")]')
    # extract data using a regex pattern:
    data = json.loads(script_with_data.re(r"data: ({.+?}),\n")[0])
    product = {
        "name": data["titleModule"]["subject"],
        "total_orders": data["titleModule"]["formatTradeCount"],
        "feedback": data["titleModule"]["feedbackRating"],
        "variants": [],
    }
    # every product variant has its own price and ID number (sku):
    for sku in data["skuModule"]["skuPriceList"]:
        product["variants"].append(
            {
                "name": sku["skuAttr"].split("#", 1)[1].split(";")[0],
                "sku": sku["skuId"],
                "available": sku["skuVal"]["availQuantity"],
                "full_price": sku["skuVal"]["skuAmount"]["value"],
                "discount_price": sku["skuVal"]["skuActivityAmount"]["value"],
                "currency": sku["skuVal"]["skuAmount"]["currency"],
            }
        )
    # data variable contains much more information - so feel free to explore it,
    # but to keep things brief we focus on essentials in this article
    return product


async def scrape_products(ids, session: httpx.AsyncClient):
    """scrape aliexpress products by id"""
    log.info(f"scraping {len(ids)} products")
    responses = await asyncio.gather(*[session.get(f"https://www.aliexpress.com/item/{id_}.html") for id_ in ids])
    results = []
    for response in responses:
        results.append(parse_product(response))
    return results

Here, we defined our product scraping function which takes in product IDs, scrapes HTML contents and extracts hidden product JSON of each product. If we run it for our drill product we should see a nicely formatted response:

Run code & example output
# Let's use browser-like request headers for this scrape to reduce the chance of being blocked or asked to solve a captcha
BASE_HEADERS = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US,en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
}


async def run():
    async with httpx.AsyncClient(headers=BASE_HEADERS) as session:
        print(json.dumps(await scrape_products(["4000927436411"], session), indent=2))

if __name__ == "__main__":
    import asyncio
    asyncio.run(run())
[
  {
    "name": "Mini Wireless Drill Electric Carving Pen Variable Speed USB Cordless Drill Rotary Tools Kit Engraver Pen for Grinding Polishing",
    "total_orders": "3824",
    "feedback": {
      "averageStar": "4.8",
      "averageStarRage": "96.4",
      "display": true,
      "evarageStar": "4.8",
      "evarageStarRage": "96.4",
      "fiveStarNum": 1724,
      "fiveStarRate": "88",
      "fourStarNum": 170,
      "fourStarRate": "9",
      "oneStarNum": 21,
      "oneStarRate": "1",
      "positiveRate": "87.6",
      "threeStarNum": 45,
      "threeStarRate": "2",
      "totalValidNum": 1967,
      "trialReviewNum": 0,
      "twoStarNum": 7,
      "twoStarRate": "0"
    },
    "variants": [
      {
        "name": "Red",
        "sku": 10000011265318724,
        "available": 1601,
        "full_price": 16.24,
        "discount_price": 12.99,
        "currency": "USD"
      },
      ...
  }
]

Using this approach, we scraped much more data than we could see in the visible HTML of the page. We got SKU numbers, stock availability, detailed pricing and review score meta information. We're only missing the reviews themselves, so let's take a look at how we can retrieve the review data.
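Before moving on, here is a small sketch of how the scraped variant data might be put to use. It takes the variant shape from the example output above and derives a discount percentage and an in-stock flag; `summarize_variants` is a hypothetical helper, not part of the scraper itself.

```python
from typing import Dict, List

# One variant copied from the example output above
variants = [
    {
        "name": "Red",
        "sku": 10000011265318724,
        "available": 1601,
        "full_price": 16.24,
        "discount_price": 12.99,
        "currency": "USD",
    },
]


def summarize_variants(variants: List[Dict]) -> List[Dict]:
    """Add a discount percentage and an in-stock flag to each variant."""
    summary = []
    for variant in variants:
        full, discounted = variant["full_price"], variant["discount_price"]
        # guard against a zero full price to avoid division by zero
        discount_pct = round((full - discounted) / full * 100, 1) if full else 0.0
        summary.append(
            {
                "name": variant["name"],
                "in_stock": variant["available"] > 0,
                "discount_pct": discount_pct,
            }
        )
    return summary


print(summarize_variants(variants))
```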

Scraping Reviews

Aliexpress' product reviews require an additional request to its backend API. If we fire up the Network Inspector devtools (F12 in major browsers and then the "Network" tab) we can see a background request being made when we click on the next review page:

We can see a background request being made when we click on page 2 link

Let's replicate this request in our scraper:

def parse_review_page(response):
    """parse single review page"""
    sel = Selector(response.text)
    parsed = []
    for review_box in sel.css(".feedback-item"):
        # to get star score we have to rely on styling where's 1 star == 20% width, e.g. 4 stars is 80%
        stars = int(review_box.css(".star-view>span::attr(style)").re(r"width:(\d+)%")[0]) / 20
        # to get options we must iterate through every options container
        options = {}
        for option in review_box.css("div.user-order-info>span"):
            name = option.css("strong::text").get("").strip()
            value = "".join(option.xpath("text()").getall()).strip()
            options[name] = value
        # parse remaining fields
        parsed.append(
            {
                "country": review_box.css(".user-country>b::text").get("").strip(),
                "text": review_box.xpath('.//dt[contains(@class,"buyer-feedback")]/span[1]/text()').get("").strip(),
                "post_time": review_box.xpath('.//dt[contains(@class,"buyer-feedback")]/span[2]/text()').get("").strip(),
                "stars": stars,
                "order_info": options,
                "user_name": review_box.css(".user-name>a::text").get(),
                "user_url": review_box.css(".user-name>a::attr(href)").get(),
            }
        )
    return parsed


async def scrape_product_reviews(seller_id: str, product_id: str, session: httpx.AsyncClient):
    """scrape all reviews of aliexpress product"""

    async def scrape_page(page):
        log.debug(f"scraping review page {page} of product {product_id}")
        data = f"ownerMemberId={seller_id}&memberType=seller&productId={product_id}&companyId=&evaStarFilterValue=all+Stars&evaSortValue=sortlarest%40feedback&page={page}&currentPage={page-1 if page > 1 else 1}&startValidDate=&i18n=true&withPictures=false&withAdditionalFeedback=false&onlyFromMyCountry=false&version=&isOpened=true&translate=+Y+&jumpToTop=true&v=2"
        resp = await session.post(
            "https://feedback.aliexpress.com/display/productEvaluation.htm",
            data=data,
            headers={**session.headers, "Content-Type": "application/x-www-form-urlencoded"},
        )
        return resp

    # scrape first page of reviews and find total count of review pages
    first_page = await scrape_page(page=1)

    sel = Selector(text=first_page.text)
    total_reviews = sel.css("div.customer-reviews").re(r"\((\d+)\)")[0]
    total_pages = int(math.ceil(int(total_reviews) / 10))

    # then scrape remaining review pages concurrently
    log.info(f"scraping reviews of product {product_id}, found {total_reviews} total reviews")
    other_pages = await asyncio.gather(*[scrape_page(page) for page in range(2, total_pages + 1)])
    reviews = []
    for resp in [first_page, *other_pages]:
        reviews.extend(parse_review_page(resp))
    return reviews

For scraping reviews we're using the same paging idiom we've learned earlier - we request the first page, find the total count and retrieve the rest concurrently.
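The long form body used for the review request can also be assembled from a dictionary, which tends to be easier to read and modify. The sketch below uses the standard library's `urlencode` with the same field names and values as the request in this article. `review_form_body` and `review_page_count` are hypothetical helper names, and the 10 reviews per page figure is taken from the division in the scraper code.

```python
import math
from urllib.parse import urlencode

REVIEWS_PER_PAGE = 10  # the page size the scraper code divides by


def review_form_body(seller_id: str, product_id: str, page: int) -> str:
    """Build the url-encoded POST body for one page of reviews."""
    fields = {
        "ownerMemberId": seller_id,
        "memberType": "seller",
        "productId": product_id,
        "companyId": "",
        "evaStarFilterValue": "all Stars",  # encoded as all+Stars
        "evaSortValue": "sortlarest@feedback",  # encoded as sortlarest%40feedback
        "page": page,
        "currentPage": page - 1 if page > 1 else 1,
        "startValidDate": "",
        "i18n": "true",
        "withPictures": "false",
        "withAdditionalFeedback": "false",
        "onlyFromMyCountry": "false",
        "version": "",
        "isOpened": "true",
        "translate": " Y ",  # encoded as +Y+
        "jumpToTop": "true",
        "v": "2",
    }
    return urlencode(fields)


def review_page_count(total_reviews: int) -> int:
    """Number of review pages needed to cover all reviews."""
    return math.ceil(total_reviews / REVIEWS_PER_PAGE)


print(review_form_body("220712488", "4000714658687", page=2))
print(review_page_count(25))
```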
Further, since reviews are only available as HTML, we have to dig into HTML parsing a bit more. We iterate through each review box and extract core details such as star rating, review text, post time etc. - all with a few clever XPath and CSS selectors!
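The star rating is the least obvious of these fields, since it is derived from a CSS width rather than from text. Here is a minimal standalone sketch of that conversion, where each star corresponds to 20% of the width. `stars_from_style` is a hypothetical helper that returns None instead of raising when no width is found.

```python
import re
from typing import Optional


def stars_from_style(style: str) -> Optional[float]:
    """Convert a CSS width such as 'width:80%' into a 0-5 star rating."""
    match = re.search(r"width:\s*(\d+)%", style)
    if match is None:
        return None  # no width found, e.g. the markup has changed
    return int(match.group(1)) / 20


print(stars_from_style("width:80%"))
```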

Parsing HTML with Xpath

For more on parsing HTML using XPath see our complete, interactive introduction course.


Now that we have our review scraper, let's take it for a spin. For that we'll need the seller ID and product ID, which we found previously in our product data scraper (fields sellerId and productId):

Run code & example output
async def run():
    async with httpx.AsyncClient(headers=BASE_HEADERS) as session:
        print(json.dumps(await scrape_product_reviews("220712488", "4000714658687", session), indent=2))
        

if __name__ == "__main__":
    asyncio.run(run())
[
  {
    "country": "BR",
    "text": "As requested and",
    "post_time": "31 May 2022 16:11",
    "stars": 5.0,
    "order_info": {
      "Color:": "DKCD20FU-Li SET2",
      "Ships From:": "China",
      "Logistics:": "Seller's Shipping Method"
    },
    "user_name": "S***s",
    "user_url": "feedback.aliexpress.com/display/detail.htm?ownerMemberId=XXXXXXXXX==&memberType=buyer"
  },
...
]

With this, we've covered the main scrape targets of Aliexpress - we scraped search to find products, product pages to find product data and product reviews to gather feedback intelligence. Finally, to scrape at scale let's take a look at how we can avoid blocking and captchas.

ScrapFly - Avoiding Blocking and Captchas

Scraping product data from Aliexpress.com seems easy. Unfortunately, when scraping at scale we might be blocked or asked to solve captchas, which will hinder our web scraping process.

To get around this, let's take advantage of ScrapFly API which can avoid all of these blocks for us!


ScrapFly offers several powerful features that'll help us get around AliExpress's blocking.

For this, we'll be using the scrapfly-sdk Python package and ScrapFly's anti scraping protection bypass feature. First, let's install scrapfly-sdk using pip:

$ pip install scrapfly-sdk

To take advantage of ScrapFly's API in our AliExpress web scraper all we need to do is replace our httpx session code with scrapfly-sdk requests.

Full Scraper Code

Let's take a look at how our full scraper code would look with ScrapFly integration:

Full Scraper Code with ScrapFly integration
import asyncio
import json
import math
from typing import Dict, List

from loguru import logger as log
from scrapfly import ScrapeApiResponse, ScrapeConfig, ScrapflyClient
from typing_extensions import TypedDict


def extract_search(result: ScrapeApiResponse) -> Dict:
    """extract json data from search page"""
    # find script with page data in it
    script_with_data = result.selector.xpath('//script[contains(text(),"window.runParams")]')
    # select page data from javascript variable in script tag using regex
    return json.loads(script_with_data.re(r"window.runParams\s*=\s*({.+?});")[0])


def parse_search(result: ScrapeApiResponse):
    """Parse search page response for product preview results"""
    data = extract_search(result)
    parsed = []
    for result in data["mods"]["itemList"]["content"]:
        parsed.append(
            {
                "id": result["productId"],
                "url": f"https://www.aliexpress.com/item/{result['productId']}.html",
                "type": result["productType"],  # can be either natural or ad
                "title": result["title"]["displayTitle"],
                "price": result["prices"]["salePrice"]["minPrice"],
                "currency": result["prices"]["salePrice"]["currencyCode"],
                "trade": result.get("trade", {}).get("tradeDesc"),  # trade line is not always present
                "thumbnail": result["image"]["imgUrl"].lstrip("/"),
                "store": {
                    "url": result["store"]["storeUrl"],
                    "name": result["store"]["storeName"],
                    "id": result["store"]["storeId"],
                    "ali_id": result["store"]["aliMemberId"],
                },
            }
        )
    return parsed


async def scrape_search(query: str, session: ScrapflyClient, sort_type="default"):
    """Scrape all search results and return parsed search result data"""
    query = query.replace(" ", "+")
    # scrape first search page and find total result count
    first_page_result = await session.async_scrape(
        ScrapeConfig(
            f"https://www.aliexpress.com/wholesale?trafficChannel=main&d=y&CatId=0&SearchText={query}&ltype=wholesale&SortType={sort_type}&page=1"
        )
    )
    _first_page_data = extract_search(first_page_result)
    page_size = _first_page_data["pageInfo"]["pageSize"]
    total_pages = int(math.ceil(_first_page_data["pageInfo"]["totalResults"] / page_size))
    if total_pages > 60:
        log.warning(f"query has {total_pages} pages; lowering to max allowed 60 pages")
        total_pages = 60

    # scrape remaining pages concurrently
    log.info(f'scraping search "{query}" of total {total_pages} sorted by {sort_type}')
    other_page_urls = [
        f"https://www.aliexpress.com/wholesale?trafficChannel=main&d=y&CatId=0&SearchText={query}&ltype=wholesale&SortType={sort_type}&page={page}"
        for page in range(2, total_pages + 1)
    ]
    product_previews = parse_search(first_page_result)
    async for result in session.concurrent_scrape([ScrapeConfig(url, country="US") for url in other_page_urls]):
        product_previews.extend(parse_search(result))
    return product_previews


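class ProductVariant(TypedDict):
    name: str
    sku: int
    available: int
    full_price: float
    discount_price: float
    currency: str


class Product(TypedDict):
    name: str
    total_orders: str
    feedback: Dict
    variants: List[ProductVariant]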


def parse_product(result: ScrapeApiResponse) -> Product:
    """parse product HTML page for product data"""
    script_with_data = result.selector.xpath('//script[contains(text(),"window.runParams")]')
    data = json.loads(script_with_data.re(r"data: ({.+?}),\n")[0])
    product = {
        "name": data["titleModule"]["subject"],
        "total_orders": data["titleModule"]["formatTradeCount"],
        "feedback": data["titleModule"]["feedbackRating"],
        "variants": [],
    }
    for sku in data["skuModule"]["skuPriceList"]:
        product["variants"].append(
            {
                "name": sku["skuAttr"].split("#", 1)[1].split(";")[0],
                "sku": sku["skuId"],
                "available": sku["skuVal"]["availQuantity"],
                "full_price": sku["skuVal"]["skuAmount"]["value"],
                "discount_price": sku["skuVal"]["skuActivityAmount"]["value"],
                "currency": sku["skuVal"]["skuAmount"]["currency"],
            }
        )
    return product


async def scrape_products(ids: List[str], session: ScrapflyClient) -> List[Product]:
    """scrape aliexpress products by id"""
    log.info(f"scraping {len(ids)} products concurrently")
    urls = [f"https://www.aliexpress.com/item/{id_}.html" for id_ in ids]
    results = []
    async for result in session.concurrent_scrape([ScrapeConfig(url, country="US") for url in urls]):
        results.append(parse_product(result))
    return results


def parse_review_page(result: ScrapeApiResponse):
    """parse single review page"""
    parsed = []
    for review_box in result.selector.css(".feedback-item"):
        # to get star score we have to rely on styling where's 1 star == 20% width, e.g. 4 stars is 80%
        stars = int(review_box.css(".star-view>span::attr(style)").re(r"width:(\d+)%")[0]) / 20
        # to get options we must iterate through every options container
        options = {}
        for option in review_box.css("div.user-order-info>span"):
            name = option.css("strong::text").get("").strip()
            value = "".join(option.xpath("text()").getall()).strip()
            options[name] = value
        # parse remaining fields
        parsed.append(
            {
                "country": review_box.css(".user-country>b::text").get("").strip(),
                "text": review_box.xpath('.//dt[contains(@class,"buyer-feedback")]/span[1]/text()').get("").strip(),
                "post_time": review_box.xpath('.//dt[contains(@class,"buyer-feedback")]/span[2]/text()')
                .get("")
                .strip(),
                "stars": stars,
                "order_info": options,
                "user_name": review_box.css(".user-name>a::text").get(),
                "user_url": review_box.css(".user-name>a::attr(href)").get(),
            }
        )
    return parsed


async def scrape_product_reviews(seller_id: str, product_id: str, session: ScrapflyClient):
    """scrape all reviews of aliexpress product"""

    def scrape_config_for_page(page):
        data = f"ownerMemberId={seller_id}&memberType=seller&productId={product_id}&companyId=&evaStarFilterValue=all+Stars&evaSortValue=sortlarest%40feedback&page={page}&currentPage={page-1 if page > 1 else 1}&startValidDate=&i18n=true&withPictures=false&withAdditionalFeedback=false&onlyFromMyCountry=false&version=&isOpened=true&translate=+Y+&jumpToTop=true&v=2"
        return ScrapeConfig(
            "https://feedback.aliexpress.com/display/productEvaluation.htm",
            body=data,
            method="POST",
            headers={"Content-Type": "application/x-www-form-urlencoded"},
        )

    # scrape first page of reviews and find total count of review pages
    first_page_result = await session.async_scrape(scrape_config_for_page(1))
    total_reviews = first_page_result.selector.css("div.customer-reviews").re(r"\((\d+)\)")[0]
    total_pages = int(math.ceil(int(total_reviews) / 10))

    # create scrape configs for other pages
    # then scrape remaining review pages concurrently
    log.info(f"scraping reviews of product {product_id}, found {total_reviews} total reviews")
    scrape_configs = [scrape_config_for_page(page) for page in range(2, total_pages + 1)]
    reviews = parse_review_page(first_page_result)
    async for result in session.concurrent_scrape(scrape_configs):
        reviews.extend(parse_review_page(result))
    return reviews


async def run():
    with ScrapflyClient(key="YOUR_SCRAPFLY_KEY", max_concurrency=20) as session:
        search_results = await scrape_search("drill", session)
        product_results = await scrape_products(["4000927436411"], session)
        review_results = await scrape_product_reviews("220712488", "4000714658687", session)


if __name__ == "__main__":
    asyncio.run(run())


Summary

In this tutorial, we built an aliexpress.com scraper which is capable of using search to discover products and scraping product data as well as all of the product reviews.

For this, we used Python with the httpx and parsel packages, and to avoid being blocked we used ScrapFly's API, which smartly configures every web scraper connection. For more on ScrapFly see our documentation and try it out for free!
