How to Web Scrape Walmart.com (2023 Update)


Walmart.com is one of the biggest retailers in the world, with a major online presence in the United States. Because of such enormous reach, Walmart's public product data is often in demand for competitive intelligence analytics. So, how can we scrape this valuable product data?

In this web scraping tutorial, we'll look at how to scrape Walmart product data with Python.

We'll start by looking at how to find product URLs either using sitemaps, category links or Walmart's search API.
Then, we'll take a look at scraping the Walmart product pages themselves and how to use a common hidden JavaScript data parsing technique to quickly and easily extract vast amounts of product data.

Finally, we'll take a look at how to avoid the web scraper blocking that Walmart is so notorious for.

Project Setup

For our Walmart scraper, we'll be using Python with a few community libraries:

  • httpx - an HTTP client library which will let us communicate with Walmart.com's servers
  • parsel - an HTML parsing library which will help us parse the scraped HTML for product data
  • loguru (optional) - for prettier logging, so we can follow along more easily

We can easily install them using pip:

$ pip install httpx parsel loguru

Alternatively, feel free to swap httpx out with any other HTTP client package, such as requests, as we'll only need basic HTTP functions, which are almost interchangeable between libraries.
As for parsel, another great alternative is the beautifulsoup package or anything else that supports CSS or XPath selectors, which is what we'll be using to extract the data in this tutorial.
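
For a quick illustration, here's a minimal sketch comparing the two parsers on a made-up HTML snippet (the .price element is purely hypothetical):

from parsel import Selector
from bs4 import BeautifulSoup  # alternative parser

html = '<div class="price">$3.48</div>'  # hypothetical snippet

# parsel supports both CSS and XPath selectors:
print(Selector(text=html).css(".price::text").get())  # $3.48
print(Selector(text=html).xpath('//div[@class="price"]/text()').get())  # $3.48

# roughly the same selection with beautifulsoup (CSS selectors):
print(BeautifulSoup(html, "html.parser").select_one(".price").text)  # $3.48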

Finding Walmart Products

To start scraping Walmart, we first need a way to discover Walmart products, and there are two common ways of achieving this.

The easiest approach is to take advantage of Walmart's sitemaps. If we take a look at the scraping rules in walmart.com/robots.txt, we can see that multiple sitemaps are listed:

Sitemap: https://www.walmart.com/sitemap_browse.xml
Sitemap: https://www.walmart.com/sitemap_category.xml
Sitemap: https://www.walmart.com/sitemap_store_main.xml

Sitemap: https://www.walmart.com/help/sitemap_gm.xml
Sitemap: https://www.walmart.com/sitemap_browse_fst.xml
Sitemap: https://www.walmart.com/sitemap_store_dept.xml

Sitemap: https://www.walmart.com/sitemap_bf_2020.xml
Sitemap: https://www.walmart.com/sitemap_tp_legacy.xml
...

Unfortunately, this doesn't give us much room for result filtering. By the looks of it, we can only filter results by category using the walmart.com/sitemap_category.xml sitemap (see the fetching sketch after the sample below):

<url>
<loc>https://www.walmart.com/cp/-degree/928899</loc>
<lastmod>2022-04-01</lastmod>
</url>
<url>
<loc>https://www.walmart.com/cp/-depend/1092729</loc>
<lastmod>2022-04-01</lastmod>
</url>
<url>
<loc>https://www.walmart.com/cp/-hungergames/1095300</loc>
<lastmod>2022-04-01</lastmod>
</url>
<url>
<loc>https://www.walmart.com/cp/-jackson/1103987</loc>
<lastmod>2022-04-01</lastmod>
</url>
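
As a quick illustration, here's a minimal sketch that downloads this category sitemap with httpx and extracts the category URLs from its <loc> tags (assuming the request isn't blocked - Walmart may require browser-like headers):

import httpx
from parsel import Selector


def scrape_category_sitemap() -> list:
    """download Walmart's category sitemap and extract the category page URLs"""
    resp = httpx.get(
        "https://www.walmart.com/sitemap_category.xml",
        headers={"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36"},
    )
    # sitemaps are XML - every category URL sits inside a <loc> element
    return Selector(text=resp.text).xpath("//loc/text()").getall()

# e.g. ['https://www.walmart.com/cp/-degree/928899', ...]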

Each URL in this sitemap will take us to the pagination page of a single category which we can further customize with additional filters:

walmart.com filters for search queries

Alternatively, we can use the search system ourselves which brings us to the same filter-capable page:
walmart.com/search?q=spider&sort=price_low&page=2&affinityOverride=default

So, whichever way we choose to approach this, we'll end up parsing the same kind of page, which is great - we can write one scraper function to handle both scenarios.

In this tutorial, we'll stick with parsing search pages; to parse category pages, all we'd have to do is swap in the category URL. First, let's pick an example search page, such as a search for the word "spider":

https://www.walmart.com/search?q=spider&sort=price_low&page=1&affinityOverride=default

We see this URL contains a few parameters like:

  • q stands for "search query", in this case, it's the word "spider"
  • page stands for page number, in this case, it's the 1st page
  • sort stands for sorting order, in this case, price_low means sorted ascending by price

Now, since our scraper doesn't execute JavaScript, the dynamically rendered results will not be visible to us. Instead, let's open up the page source and search for some product details - we can see that there's state data under:

<script id="__NEXT_DATA__">{"...PRODUCT_PAGINATION_DATA..."}</script>

Highly dynamic websites (especially ones built on React/Next.js frameworks) often ship their data hidden in the HTML and only unpack it into visible HTML on page load using JavaScript.
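
The extraction step itself takes only a few lines of parsel and json - here's a minimal sketch of the pattern our parse functions below will follow (assuming html holds an already-scraped Walmart page):

import json
from parsel import Selector


def extract_next_data(html: str) -> dict:
    """pull the hidden __NEXT_DATA__ JSON state object out of a Walmart page"""
    sel = Selector(text=html)
    raw = sel.xpath('//script[@id="__NEXT_DATA__"]/text()').get()
    return json.loads(raw)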

How to Scrape Hidden Web Data

For more on hidden data scraping, see our full introduction article, which covers this type of web scraping in greater detail.

Hidden web data is great news for us as it makes Walmart product scraping dead easy!

Though first, let's start with our search scraper:

import asyncio
import json
import math
from urllib.parse import urlencode

import httpx
from parsel import Selector


async def _search_walmart_page(query: str = "", page=1, sort="price_low", *, session: httpx.AsyncClient) -> httpx.Response:
    """scrape single walmart search page"""
    url = "https://www.walmart.com/search?" + urlencode(
        {
            "q": query,
            "sort": sort,
            "page": page,
            "affinityOverride": "default",
        },
    )
    resp = await session.get(url)
    assert resp.status_code == 200
    return resp


def parse_search(html_text: str):
    """extract search results from search HTML response"""
    sel = Selector(text=html_text)
    data = sel.xpath('//script[@id="__NEXT_DATA__"]/text()').get()
    data = json.loads(data)

    total_results = data["props"]["pageProps"]["initialData"]["searchResult"]["itemStacks"][0]["count"]
    results = data["props"]["pageProps"]["initialData"]["searchResult"]["itemStacks"][0]["items"]
    return results, total_results

In this scraper, we start with two functions:

  • asynchronous _search_walmart_page() that creates a query URL from given parameters and scrapes the HTML of the search page
  • parse_search() that takes the search page HTML, finds the __NEXT_DATA__ hidden page data and parses out the search results as well as the total result count.

We now have a way to retrieve the results of a single search page; next, let's improve it to scrape all 25 pages of results:

async def discover_walmart(search:str, session:httpx.AsyncClient):
    _resp_page1 = await _search_walmart_page(query=search, session=session)
    results, total_items = parse_search(_resp_page1.text)
    max_page = math.ceil(total_items / 40)
    if max_page > 25:
        max_page = 25
    for response in await asyncio.gather(
        *[_search_walmart_page(query=search, page=i, session=session) for i in range(2, max_page + 1)]
    ):
        results.extend(parse_search(response.text)[0])
    return results

Here, we've added a wrapper function that first scrapes the first page to find the total number of pages and then scrapes the remaining pages concurrently (which is super fast).

Let's run our current scraper code and see the results it generates:

Run code & example output
BASE_HEADERS = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US;en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
}


async def run():
    # limit connection speed to prevent scraping too fast
    limits = httpx.Limits(max_keepalive_connections=5, max_connections=5)
    async with httpx.AsyncClient(headers=BASE_HEADERS, limits=limits) as session:
        results = await discover_walmart("spider", session=session)
        print(json.dumps(results, indent=2))


if __name__ == "__main__":
    asyncio.run(run())

Note: our runner code here applies some custom headers to our web connection session to avoid scraper blocking. We create an asynchronous httpx client and call our discover function to find all the results for this query:

[
  {
    "__typename": "Product",
    "availabilityStatusDisplayValue": "In stock",
    "productLocationDisplayValue": null,
    "externalInfoUrl": "",
    "canonicalUrl": "/ip/Eliminator-Ant-Roach-Spider-Killer4-20-oz-Kills-Insects-Spiders/795033156",
    "canAddToCart": true,
    "showOptions": false,
    "showBuyNow": false,
    "description": "<li>KILLS ON CONTACT: Eliminator Ant, Roach & Spider Killer4 kills cockroaches, ants, carpenter ants, crickets, firebrats, fleas, silverfish and spiders</li><li>NON-STAINING: This water-based product</li>",
    "flag": "",
    "badge": {
      "text": "",
      "id": "",
      "type": "",
      "key": ""
    },
    "fulfillmentBadges": [
      "Pickup",
      "Delivery",
      "1-day shipping"
    ],
    "fulfillmentIcon": {
      "key": "SAVE_WITH_W_PLUS",
      "label": "Save with"
    },
    "fulfillmentBadge": "Tomorrow",
    "fulfillmentSpeed": [
      "TOMORROW"
    ],
    "fulfillmentType": "FC",
    "groupMetaData": {
      "groupType": null,
      "groupSubType": null,
      "numberOfComponents": 0,
      "groupComponents": null
    },
    "id": "5D3NBXRMIZK4",
    "itemType": null,
    "usItemId": "795033156",
    "image": "https://i5.walmartimages.com/asr/c9c0c51c-f30f-4eb2-aaf1-88f599167584.d824f7ff13f10b3dcfb9dadd2a04686d.jpeg?odnHeight=180&odnWidth=180&odnBg=ffffff",
    "isOutOfStock": false,
    "esrb": "",
    "mediaRating": "",
    "name": "Eliminator Ant, Roach & Spider Killer4, 20 oz, Kills Insects & Spiders",
    "price": 3.48,
    "preOrder": {
      "isPreOrder": false,
      "preOrderMessage": null,
      "preOrderStreetDateMessage": null
    },
    "..."
]

Handling Paging Limits

There's one minor issue with our search discovery approach - page limit. Walmart returns only 25 pages (1000 products) per query - what if our query has more than that?

The best way to deal with this is to split our query into multiple smaller queries and we can do this by applying filters:

walmart filters that fit for batching

The first thing we can do is reverse the ordering: scrape the results sorted lowest-to-highest and then again highest-to-lowest, doubling our coverage to 50 pages or 2,000 products!

Further, we can split our query into smaller queries by using single-choice filters (radio buttons) like "department" or go even further and use price ranges.

With a bit of clever query splitting, this 2000 product limit doesn't look that intimidating!
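
For example, here's a rough sketch of the reverse-ordering trick. It assumes discover_walmart() (defined earlier) is extended with a sort parameter that is passed through to _search_walmart_page(), and that "price_high" is the reverse sort value - both assumptions, not confirmed details:

async def discover_walmart_both_orders(search: str, session: httpx.AsyncClient):
    """run the same query in both sort orders to cover up to 50 pages of results"""
    # assumes discover_walmart() forwards `sort` to _search_walmart_page()
    low_to_high = await discover_walmart(search, session=session, sort="price_low")
    high_to_low = await discover_walmart(search, session=session, sort="price_high")  # assumed sort value
    # deduplicate by product id since the two orderings overlap for queries under 2000 results
    unique = {product["usItemId"]: product for product in low_to_high + high_to_low}
    return list(unique.values())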

Walmart Product Scraper

Our scraper can use Walmart's search functionality to discover product preview details, which contain the price, a few images, the product URL and a short description.
To collect the full product data we'll need to scrape each product URL individually, so let's extend our scraper with this functionality.

from typing import Dict, List

def parse_product(html_text: str) -> Dict:
    ...  # full implementation in the next code block

async def _scrape_products_by_url(urls: List[str], session: httpx.AsyncClient):
    responses = await asyncio.gather(*[session.get(url) for url in urls])
    results = []
    for resp in responses:
        assert resp.status_code == 200
        results.append(parse_product(resp.text))
    return results

For parsing we can employ the same strategy we've been using in parsing search - extracting the __NEXT_DATA__ JSON state object. In the product case it contains all of the product data in JSON format which is very convenient for us:

def parse_product(html_text: str) -> Dict:
    """parse walmart product"""
    sel = Selector(text=html_text)
    data = sel.xpath('//script[@id="__NEXT_DATA__"]/text()').get()
    data = json.loads(data)
    _product_raw = data["props"]["pageProps"]["initialData"]["data"]["product"]
    # There's a lot of product data, including private meta keywords, so we need to do some filtering:
    wanted_product_keys = [
        "availabilityStatus",
        "averageRating",
        "brand",
        "id",
        "imageInfo",
        "manufacturerName",
        "name",
        "orderLimit",
        "orderMinLimit",
        "priceInfo",
        "shortDescription",
        "type",
    ]
    product = {k: v for k, v in _product_raw.items() if k in wanted_product_keys}
    reviews_raw = data["props"]["pageProps"]["initialData"]["data"]["reviews"]
    return {"product": product, "reviews": reviews_raw}

In this parse function, we're picking up the __NEXT_DATA__ object and parsing it for product information. There's a lot of data here, so we're using a key whitelist to select only the most important keys like product name, price, description and media.

Run code & example output
BASE_HEADERS = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US;en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
}


async def run():
    # limit connection speed to prevent scraping too fast
    limits = httpx.Limits(max_keepalive_connections=5, max_connections=5)
    async with httpx.AsyncClient(headers=BASE_HEADERS, limits=limits) as session:
        results = await _scrape_products_by_url(["Some product url"], session=session)
        print(json.dumps(results, indent=2))


if __name__ == "__main__":
    asyncio.run(run())
{
  "product": {
    "availabilityStatus": "IN_STOCK",
    "averageRating": 2.3,
    "brand": "Sony Pictures Entertainment",
    "shortDescription": "It&apos;s great to be Spider-Man (Andrew Garfield). for Peter Parker, there&apos;s no feeling quite like swinging between skyscrapers, embracing being the hero, and spending time with Gwen (Emma Stone). But being Spider-Man comes at a price: only Spider-Man can protect his fellow New Yorkers from the formidable villains that threaten the city. With the emergence of Electro (Jamie Foxx), Peter must confront a foe far more powerful than himself. And as his old friend, Harry Osborn (Dane DeHaan), returns, Peter comes to realize that all of his enemies have one thing in common: Oscorp.",
    "id": "43N352NZTVIQ",
    "imageInfo": {
      "allImages": [
        {
          "id": "E832A8930EF64D37B408265925B61573",
          "url": "https://i5.walmartimages.com/asr/175ec13e-b95a-4e6f-a57c-13e9ea7b703e_1.1047259e6d0655743cac1997d2b1a16e.jpeg",
          "zoomable": false
        },
        {
          "id": "A2C3299D21A34FADB84047E627CFD9E4",
          "url": "https://i5.walmartimages.com/asr/c8f793a3-5ebf-4f83-a2e1-a71fda15dbd3_1.f8b6234fb668f7c4f8d72f1a1c0f21c4.jpeg",
          "zoomable": false
        }
      ],
      "thumbnailUrl": "https://i5.walmartimages.com/asr/175ec13e-b95a-4e6f-a57c-13e9ea7b703e_1.1047259e6d0655743cac1997d2b1a16e.jpeg"
    },
    "manufacturerName": "Sony",
    "name": "The Amazing Spider-Man 2 (Blu-ray + DVD)",
    "orderMinLimit": 1,
    "orderLimit": 5,
    "priceInfo": {
      "priceDisplayCodes": {
        "clearance": null,
        "eligibleForAssociateDiscount": true,
        "finalCostByWeight": null,
        "priceDisplayCondition": null,
        "reducedPrice": null,
        "rollback": null,
        "submapType": null
      },
      "currentPrice": {
        "price": 7.82,
        "priceString": "$7.82",
        "variantPriceString": "$7.82",
        "currencyUnit": "USD"
      },
      "wasPrice": {
        "price": 14.99,
        "priceString": "$14.99",
        "variantPriceString": null,
        "currencyUnit": "USD"
      },
      "unitPrice": null,
      "savings": null,
      "subscriptionPrice": null,
      "priceRange": {
        "minPrice": null,
        "maxPrice": null,
        "priceString": null,
        "currencyUnit": null,
        "denominations": null
      },
      "capType": null,
      "walmartFundedAmount": null
    },
    "type": "Movies"
  },
  "reviews": {
    "averageOverallRating": 2.3333,
    "customerReviews": [
      {
        "rating": 1,
        "reviewSubmissionTime": "9/7/2019",
        "reviewText": "I received this and was so disappointed.  the pic advertised shows a digital copy is included,  but it's just the Blu-ray.  immediately returned bc that is not what I ordered nor does it match the photo shown.",
        "reviewTitle": "no digital copy",
        "userNickname": "Tbaby",
        "photos": [],
        "badges": null,
        "syndicationSource": null
      },
      {
        "rating": 1,
        "reviewSubmissionTime": "3/11/2019",
        "reviewText": "Advertised as \"VUDU Instawatch Included\", this is not true.\nPicture shows BluRay + DVD + Digital HD, what actually ships is just the BluRay + DVD.",
        "reviewTitle": "WARNING: You don't get what's advertised.",
        "userNickname": "Reviewer",
        "photos": [],
        "badges": null,
        "syndicationSource": null
      },
      {
        "rating": 5,
        "reviewSubmissionTime": "1/4/2021",
        "reviewText": null,
        "reviewTitle": null,
        "userNickname": null,
        "photos": [],
        "badges": [
          {
            "badgeType": "Custom",
            "id": "VerifiedPurchaser",
            "contentType": "REVIEW",
            "glassBadge": {
              "id": "VerifiedPurchaser",
              "text": "Verified Purchaser"
            }
          }
        ],
        "syndicationSource": null
      }
    ],
    "ratingValueFiveCount": 1,
    "ratingValueFourCount": 0,
    "ratingValueOneCount": 2,
    "ratingValueThreeCount": 0,
    "ratingValueTwoCount": 0,
    "roundedAverageOverallRating": 2.3,
    "topNegativeReview": null,
    "topPositiveReview": null,
    "totalReviewCount": 3
  }
}

Final Walmart Scraper

We can find Walmart products using the search and scrape each individual product - let's put these two together in our final web scraper script:

Full Scraper Code
import asyncio
import json
import math
from typing import Dict, List, Tuple
from urllib.parse import urljoin

import httpx
from loguru import logger as log
from parsel import Selector
from urllib.parse import urlencode


async def _search_walmart_page(query: str = "", page=1, sort="price_low", *, session: httpx.AsyncClient) -> httpx.Response:
    """scrape single walmart search page"""
    url = "https://www.walmart.com/search?" + urlencode(
        {
            "q": query,
            "sort": sort,
            "page": page,
            "affinityOverride": "default",
        },
    )
    log.debug(f'searching walmart page {page} of "{query}" sorted by {sort}')
    resp = await session.get(url)
    assert resp.status_code == 200
    return resp


def parse_search(html_text: str) -> Tuple[Dict, int]:
    """extract search results from search HTML response"""
    log.debug(f"parsing search page")
    sel = Selector(text=html_text)
    data = sel.xpath('//script[@id="__NEXT_DATA__"]/text()').get()
    data = json.loads(data)

    total_results = data["props"]["pageProps"]["initialData"]["searchResult"]["itemStacks"][0]["count"]
    results = data["props"]["pageProps"]["initialData"]["searchResult"]["itemStacks"][0]["items"]
    # there are other results types such as ads or placeholders - filter them out:
    results = [result for result in results if result["__typename"] == "Product"]
    log.info(f"parsed {len(results)} search product previews")
    return results, total_results


async def discover_walmart(search: str, session: httpx.AsyncClient) -> List[Dict]:
    log.info(f"searching walmart for {search}")
    _resp_page1 = await _search_walmart_page(query=search, session=session)
    results, total_items = parse_search(_resp_page1.text)
    max_page = math.ceil(total_items / 40)
    log.info(f"found total {max_page} pages of results ({total_items} products)")
    if max_page > 25:
        max_page = 25
    for response in await asyncio.gather(
        *[_search_walmart_page(query=search, page=i, session=session) for i in range(2, max_page + 1)]
    ):
        results.extend(parse_search(response.text)[0])
    log.info(f"parsed total {len(results)} product previews ({total_items} products)")
    return results


def parse_product(html_text):
    """parse walmart product"""
    sel = Selector(text=html_text)
    data = sel.xpath('//script[@id="__NEXT_DATA__"]/text()').get()
    data = json.loads(data)
    _product_raw = data["props"]["pageProps"]["initialData"]["data"]["product"]
    wanted_product_keys = [
        "availabilityStatus",
        "averageRating",
        "brand",
        "id",
        "imageInfo",
        "manufacturerName",
        "name",
        "orderLimit",
        "orderMinLimit",
        "priceInfo",
        "shortDescription",
        "type",
    ]
    product = {k: v for k, v in _product_raw.items() if k in wanted_product_keys}
    reviews_raw = data["props"]["pageProps"]["initialData"]["data"]["reviews"]
    return {"product": product, "reviews": reviews_raw}


async def _scrape_products_by_url(urls: List[str], session: httpx.AsyncClient):
    """scrape walmart products by urls"""
    log.info(f"scraping {len(urls)} product urls (in chunks of 50)")
    results = []
    # we chunk requests to reduce memory usage and scraping speeds
    for i in range(0, len(urls), 50):
        log.debug(f"scraping product chunk: {i}:{i+50}")
        chunk = urls[i : i + 50]
        responses = await asyncio.gather(*[session.get(url) for url in chunk])
        for resp in responses:
            assert resp.status_code == 200
            results.append(parse_product(resp.text))
    return results


async def scrape_walmart(search: str, session: httpx.AsyncClient):
    """scrape walmart products by search term"""
    search_results = await discover_walmart(search, session=session)
    product_urls = [
        urljoin("https://www.walmart.com/", product_preview["canonicalUrl"]) for product_preview in search_results
    ]
    return await _scrape_products_by_url(product_urls, session=session)


BASE_HEADERS = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US;en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
}


async def run():
    limits = httpx.Limits(max_keepalive_connections=5, max_connections=5)
    async with httpx.AsyncClient(headers=BASE_HEADERS, limits=limits) as session:
        results = await scrape_walmart("spider", session=session)
        print(json.dumps(results, indent=2))
        return results


if __name__ == "__main__":
    asyncio.run(run())

In this short scraper we've implemented two basic functions:

  • discover_walmart() which finds products for a given keyword in the form of product previews, which provide basic product information and, most importantly, URLs to the product pages.
  • _scrape_products_by_url() which scrapes full product data from these discovered urls.

As for parsing, we took advantage of the data Walmart's frontend stores in the __NEXT_DATA__ HTML variable by extracting it and parsing it as a JSON object for a whitelisted set of keys. This approach is much easier to implement and maintain than HTML parsing.

Bypass Walmart Blocking with Scrapfly

Walmart is one of the biggest retailers in the world, so unsurprisingly it's protective of its product data. If we scrape more than a few products, we'll soon be greeted with 307 redirects to the /blocked endpoint or a captcha page.

walmart captcha request page
one of many block/captcha pages Walmart.com might display

Walmart uses a complex anti-scraping protection system that analyses the scraper's IP address, HTTP capabilities and JavaScript environment. This essentially means that Walmart can easily block our web scraper unless we put significant effort into fortifying all of these elements.
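
Whichever route we take, it's worth detecting block responses in the scraper so it can back off instead of retrying blindly - a minimal sketch based on the behaviour described above (the exact status codes Walmart uses may vary):

async def get_or_raise_on_block(url: str, session: httpx.AsyncClient) -> httpx.Response:
    """fetch a Walmart URL and raise if the response looks like a block/captcha page"""
    resp = await session.get(url)
    redirect_target = resp.headers.get("location", "")
    if resp.status_code != 200 or "/blocked" in redirect_target or "/blocked" in str(resp.url):
        # back off, rotate proxies or retry later instead of hammering the block page
        raise Exception(f"Walmart blocked request to {url} (status {resp.status_code})")
    return resp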

How to Scrape Without Getting Blocked? In-Depth Tutorial

For more on how web scrapers are detected and blocked, see our full tutorial on web scraping blocking.

Instead, let's take advantage of ScrapFly API which can avoid all of these blocks for us!

illustration of scrapfly's middleware

ScrapFly offers several powerful features that'll help us get around Walmart's blocking.

For this, we'll be using the scrapfly-sdk Python package and ScrapFly's anti-scraping protection bypass feature. First, let's install scrapfly-sdk using pip:

$ pip install scrapfly-sdk

To take advantage of ScrapFly's API in our Walmart web scraper, all we need to do is replace our httpx requests with scrapfly-sdk requests:

import httpx
session: httpx.AsyncClient

response = session.get(url)

# replace with scrapfly's SDK:

from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
session: ScrapflyClient

response = session.scrape(
    ScrapeConfig(url=url, asp=True, country="US")
)

We can enable specific ScrapFly features using ScrapeConfig arguments. For Walmart, we set asp=True for the anti-scraping protection bypass and set the proxy's geographical location to the US to scrape the US version of Walmart.

Full Walmart Scraper Code

Let's take a look at how our full scraper code would look with ScrapFly integration:

Full Scraper Code with ScrapFly integration
import os
import asyncio
import json
import math
from pathlib import Path
from typing import Dict, List, Tuple
from urllib.parse import urlencode, urljoin

from loguru import logger as log
from scrapfly import ScrapeApiResponse, ScrapeConfig, ScrapflyClient

scrapfly = ScrapflyClient(key=os.environ["SCRAPFLY_KEY"], max_concurrency=5)


def parse_search(result: ScrapeApiResponse) -> Tuple[List[Dict], int]:
    """parse Walmart search results page for product previews"""
    log.debug(f"parsing search page {result.context['url']}")
    data = result.selector.xpath('//script[@id="__NEXT_DATA__"]/text()').get()
    data = json.loads(data)

    total_results = data["props"]["pageProps"]["initialData"]["searchResult"]["itemStacks"][0]["count"]
    results = data["props"]["pageProps"]["initialData"]["searchResult"]["itemStacks"][0]["items"]
    # there are other results types such as ads or placeholders - filter them out:
    results = [result for result in results if result["__typename"] == "Product"]
    log.info(f"parsed {len(results)} search product previews")
    return results, total_results


async def scrape_search(search: str, max_pages: int = 25) -> List[Dict]:
    """scrape walmart search for product previews"""

    def create_search_url(query: str, page=1, sort="price_low") -> str:
        """create url for a single walmart search page"""
        return "https://www.walmart.com/search?" + urlencode(
            {
                "q": query,
                "sort": sort,
                "page": page,
                "affinityOverride": "default",
            }
        )

    log.info(f"searching walmart for {search}")
    first_page = await scrapfly.async_scrape(ScrapeConfig(create_search_url(query=search), country="US", asp=True))
    previews, total_items = parse_search(first_page)

    total_pages = math.ceil(total_items / 40)
    log.info(f"found total {total_pages} pages of results ({total_items} products)")
    if max_pages and total_pages > max_pages:
        total_pages = max_pages

    other_pages = [
        ScrapeConfig(url=create_search_url(query=search, page=i), asp=True, country="US")
        for i in range(2, total_pages + 1)
    ]
    async for result in scrapfly.concurrent_scrape(other_pages):
        previews.extend(parse_search(result)[0])
    log.info(f"parsed total {len(previews)} pages of results ({total_items} products)")
    return previews


def parse_product(result: ScrapeApiResponse):
    """parse walmart product from product page response"""
    data = result.selector.xpath('//script[@id="__NEXT_DATA__"]/text()').get()
    if not data:
        log.error(f"{result.context['url']} has no product data")
    data = json.loads(data)
    _product_raw = data["props"]["pageProps"]["initialData"]["data"]["product"]
    wanted_product_keys = [
        "availabilityStatus",
        "averageRating",
        "brand",
        "id",
        "imageInfo",
        "manufacturerName",
        "name",
        "orderLimit",
        "orderMinLimit",
        "priceInfo",
        "shortDescription",
        "type",
    ]
    product = {k: v for k, v in _product_raw.items() if k in wanted_product_keys}
    reviews_raw = data["props"]["pageProps"]["initialData"]["data"]["reviews"]
    return {"product": product, "reviews": reviews_raw}


async def scrape_products(urls: List[str]):
    """scrape walmart products by urls"""
    log.info(f"scraping {len(urls)} product urls (in chunks of 50)")
    results = []
    to_scrape = [ScrapeConfig(url=url, asp=True, country="US") for url in urls]
    async for result in scrapfly.concurrent_scrape(to_scrape):
        results.append(parse_product(result))
    return results


async def scrape_search_and_products(search: str, max_pages: int = 25):
    """scrape walmart search to find products and then scrape complete product details"""
    search_results = await scrape_search(search, max_pages=max_pages)
    product_urls = [
        urljoin("https://www.walmart.com/", product_preview["canonicalUrl"]) for product_preview in search_results
    ]
    return await scrape_products(product_urls)


async def example_run():
    out = Path(__file__).parent / "results"
    out.mkdir(exist_ok=True)

    result_products = await scrape_search_and_products("spider", max_pages=1)
    out.joinpath("products.json").write_text(json.dumps(result_products, indent=2, ensure_ascii=False))


if __name__ == "__main__":
    asyncio.run(example_run())

In our updated scraper above, we've replaced httpx calls with ScrapflyClient calls, so all of our requests go through the ScrapFly API, which smartly avoids web scraper blocking.

FAQ

To wrap this guide up let's take a look at some frequently asked questions about web scraping walmart.com:

Is it legal to scrape Walmart.com?

Yes. Walmart product data is publicly available, and we're not extracting anything personal or private. Scraping walmart.com at slow, respectful rates falls under the ethical scraping definition. See our Is Web Scraping Legal? article for more.

Walmart Scraping Summary

In this tutorial, we built a small scraper for https://www.walmart.com/ which uses search to discover products and then scrapes all of the products rapidly while avoiding blocking.

For this we've used Python with httpx and parsel packages and to avoid being blocked we used ScrapFly's API which smartly configures every web scraper connection to avoid being blocked. For more on ScrapFly see our documentation and try it out for free!

Related Posts

How to Scrape Google SEO Keyword Data and Rankings

In this article, we’ll take a look at SEO web scraping, what it is and how to use it for better SEO keyword optimization. We’ll also create an SEO keyword scraper that scrapes Google search rankings and suggested keywords.

How to Effectively Use User Agents for Web Scraping

In this article, we’ll take a look at the User-Agent header, what it is and how to use it in web scraping. We'll also generate and rotate user agents to avoid web scraping blocking.

How to Observe E-Commerce Trends using Web Scraping

In this example web scraping project we'll be taking a look at monitoring E-Commerce trends using Python, web scraping and data visualization tools.