How to Web Scrape Walmart.com


Walmart.com is one of the biggest retailers in the world with a major online presence in the United States. Because of this enormous reach, Walmart's public product data is often in demand for competitive intelligence and analytics. So, how can we collect this valuable data?

In this web scraping tutorial we'll take a look at scraping public product data from walmart.com. We'll start with how to find product URLs using sitemaps, category links or the search API. Then we'll move on to product scraping itself and how we can use a common javascript state parsing technique to quickly and easily scrape vast amounts of data. Finally, we'll cover how to avoid the web scraper blocking Walmart is so notorious for!

Finding Products

To start web scraping we first need a way to discover Walmart products, and there are two common ways of achieving this.

The easiest approach is to take advantage of Walmart's sitemaps. If we take a look at the scraping rules in https://www.walmart.com/robots.txt we can see that multiple sitemaps are listed:

Sitemap: https://www.walmart.com/sitemap_browse.xml
Sitemap: https://www.walmart.com/sitemap_category.xml
Sitemap: https://www.walmart.com/sitemap_store_main.xml

Sitemap: https://www.walmart.com/help/sitemap_gm.xml
Sitemap: https://www.walmart.com/sitemap_browse_fst.xml
Sitemap: https://www.walmart.com/sitemap_store_dept.xml

Sitemap: https://www.walmart.com/sitemap_bf_2020.xml
Sitemap: https://www.walmart.com/sitemap_tp_legacy.xml
...

Unfortunately, these sitemaps don't give us much room for result filtering. By the looks of it, we can only filter results by category using the https://www.walmart.com/sitemap_category.xml sitemap:

<url>
<loc>https://www.walmart.com/cp/-degree/928899</loc>
<lastmod>2022-04-01</lastmod>
</url>
<url>
<loc>https://www.walmart.com/cp/-depend/1092729</loc>
<lastmod>2022-04-01</lastmod>
</url>
<url>
<loc>https://www.walmart.com/cp/-hungergames/1095300</loc>
<lastmod>2022-04-01</lastmod>
</url>
<url>
<loc>https://www.walmart.com/cp/-jackson/1103987</loc>
<lastmod>2022-04-01</lastmod>
</url>
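For completeness, here's a minimal sketch of how these category URLs could be collected programmatically with httpx and parsel (the libraries we'll install later in this tutorial). It assumes the sitemap stays publicly accessible and keeps the standard sitemap XML layout:

import httpx
from parsel import Selector


def scrape_category_sitemap():
    """download walmart's category sitemap and return all category URLs"""
    resp = httpx.get("https://www.walmart.com/sitemap_category.xml")
    sel = Selector(text=resp.text, type="xml")
    # sitemap XML uses a default namespace, so match <loc> elements by local name
    return sel.xpath('//*[local-name()="loc"]/text()').getall()


if __name__ == "__main__":
    category_urls = scrape_category_sitemap()
    print(f"found {len(category_urls)} category urls, e.g.: {category_urls[:3]}")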

Each URL in this sitemap takes us to the pagination page of a single category, which we can further customize with additional filters:

walmart.com filters for search queries

Alternatively, we can use the search system ourselves, which brings us to the same filter-capable page: https://www.walmart.com/search?q=spider&sort=price_low&page=2&affinityOverride=default

So, whichever way we choose to approach this, we'll have to parse the same kind of page - which is great, as we can write one scraper function to deal with both scenarios.

Getting Search Results

In this tutorial let's stick with parsing search pages; to parse category pages all we'd have to do is replace the scraped URL. First, let's pick an example search page, like a search for the word "spider":

https://www.walmart.com/search?q=spider&sort=price_low&page=1&affinityOverride=default

We can see this URL contains a few parameters:

  • q for search query, in this case it's the word "spider"
  • page for page number, in this case it's the 1st page
  • sort for sorting order, in this case price_low means sorted ascending by price
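These parameters can be assembled back into a full search URL programmatically. As a quick sketch, here's how that could be done with w3lib's add_or_replace_parameters helper (the same URL utility we'll use in the scraper below):

from w3lib.url import add_or_replace_parameters

# build the search URL for page 1 of the "spider" query
url = add_or_replace_parameters(
    "https://www.walmart.com/search",
    {"q": "spider", "sort": "price_low", "page": 1, "affinityOverride": "default"},
)
print(url)
# https://www.walmart.com/search?q=spider&sort=price_low&page=1&affinityOverride=default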

Since our scraper doesn't execute javascript, the dynamically rendered result content won't be visible to us. Instead, let's open up the page source and search for a product name - we can see that the state data is hidden under:

<script id="__NEXT_DATA__">{"...PRODUCT_PAGINATION_DATA..."}</script>

Highly dynamic websites (especially those built with React/Next.js) often embed their data in the HTML as a hidden state object and then unpack it into visible HTML on page load using javascript. This is great news for us, as we can still access these results without running any javascript in our web scraper.
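In other words, all we need to do is pull that script element out of the raw HTML and load its contents as JSON. Here's a minimal sketch of that technique (using the parsel library introduced just below) - it's the same extraction our parse functions are built on:

import json
from parsel import Selector


def extract_next_data(html_text: str) -> dict:
    """grab the <script id="__NEXT_DATA__"> contents and load them as a Python dict"""
    sel = Selector(text=html_text)
    raw = sel.xpath('//script[@id="__NEXT_DATA__"]/text()').get()
    return json.loads(raw)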


For our scraper we'll be using Python with:

  • [httpx] library for handling HTTP connections
  • [parsel] for parsing HTML content
  • [w3lib] for formatting URLs
  • [loguru] for prettier logging, so we can follow along easier

We can easily install them using pip:

$ pip install httpx parsel w3lib loguru

Let's start with our search scraper:

import asyncio
import json
import math
import httpx
from parsel import Selector
from w3lib.url import add_or_replace_parameters


async def _search_walmart_page(query: str = "", page=1, sort="price_low", *, session: httpx.AsyncClient):
    """scrape single walmart search page"""
    url = add_or_replace_parameters(
        "https://www.walmart.com/search?",
        {
            "q": query,
            "sort": sort,
            "page": page,
            "affinityOverride": "default",
        },
    )
    resp = await session.get(url)
    assert resp.status_code == 200
    return resp


def parse_search(html_text: str):
    """extract search results from search HTML response"""
    sel = Selector(text=html_text)
    data = sel.xpath('//script[@id="__NEXT_DATA__"]/text()').get()
    data = json.loads(data)

    total_results = data["props"]["pageProps"]["initialData"]["searchResult"]["itemStacks"][0]["count"]
    results = data["props"]["pageProps"]["initialData"]["searchResult"]["itemStacks"][0]["items"]
    return results, total_results

In this scraper we start off with two functions:

  • asynchronous _search_walmart_page() which builds a query URL from the given parameters and scrapes the HTML of the search page
  • parse_search() which takes the search page HTML, finds the __NEXT_DATA__ javascript state object and parses out the search results as well as the total result count.

Now that we have a way to retrieve the results of a single search page, let's extend it to scrape all 25 pages of results:

async def discover_walmart(search: str, session: httpx.AsyncClient):
    _resp_page1 = await _search_walmart_page(query=search, session=session)
    results, total_items = parse_search(_resp_page1.text)
    max_page = math.ceil(total_items / 40)  # walmart search shows up to 40 results per page
    if max_page > 25:
        max_page = 25
    for response in await asyncio.gather(
        *[_search_walmart_page(query=search, page=i, session=session) for i in range(2, max_page + 1)]
    ):
        results.extend(parse_search(response.text)[0])
    return results

Here, we've added a wrapper function that scrapes the first search page, calculates the total number of pages and then scrapes the remaining pages concurrently.

We need some execution code to run this scraper:

BASE_HEADERS = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US;en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
}


async def run():
    # limit concurrent connections to avoid scraping too fast
    limits = httpx.Limits(max_keepalive_connections=5, max_connections=5)
    async with httpx.AsyncClient(headers=BASE_HEADERS, limits=limits) as session:
        results = await discover_walmart("spider", session=session)
        print(json.dumps(results, indent=2))


if __name__ == "__main__":
    asyncio.run(run())

Here, we apply some browser-like headers to our connection session to reduce the chance of blocking, create an asynchronous httpx client and call our discover function to collect all of the search results:

Example Output
[
  {
    "__typename": "Product",
    "availabilityStatusDisplayValue": "In stock",
    "productLocationDisplayValue": null,
    "externalInfoUrl": "",
    "canonicalUrl": "/ip/Eliminator-Ant-Roach-Spider-Killer4-20-oz-Kills-Insects-Spiders/795033156",
    "canAddToCart": true,
    "showOptions": false,
    "showBuyNow": false,
    "description": "<li>KILLS ON CONTACT: Eliminator Ant, Roach & Spider Killer4 kills cockroaches, ants, carpenter ants, crickets, firebrats, fleas, silverfish and spiders</li><li>NON-STAINING: This water-based product</li>",
    "flag": "",
    "badge": {
      "text": "",
      "id": "",
      "type": "",
      "key": ""
    },
    "fulfillmentBadges": [
      "Pickup",
      "Delivery",
      "1-day shipping"
    ],
    "fulfillmentIcon": {
      "key": "SAVE_WITH_W_PLUS",
      "label": "Save with"
    },
    "fulfillmentBadge": "Tomorrow",
    "fulfillmentSpeed": [
      "TOMORROW"
    ],
    "fulfillmentType": "FC",
    "groupMetaData": {
      "groupType": null,
      "groupSubType": null,
      "numberOfComponents": 0,
      "groupComponents": null
    },
    "id": "5D3NBXRMIZK4",
    "itemType": null,
    "usItemId": "795033156",
    "image": "https://i5.walmartimages.com/asr/c9c0c51c-f30f-4eb2-aaf1-88f599167584.d824f7ff13f10b3dcfb9dadd2a04686d.jpeg?odnHeight=180&odnWidth=180&odnBg=ffffff",
    "isOutOfStock": false,
    "esrb": "",
    "mediaRating": "",
    "name": "Eliminator Ant, Roach & Spider Killer4, 20 oz, Kills Insects & Spiders",
    "price": 3.48,
    "preOrder": {
      "isPreOrder": false,
      "preOrderMessage": null,
      "preOrderStreetDateMessage": null
    },
    "..."
]

Dealing With Paging Limits

There's one minor issue with our search discovery approach - the page limit. Walmart returns at most 25 pages (1,000 products) per query - what if our query matches more products than that?

The best way to deal with this is to split our query into multiple smaller queries, which we can do by applying filters:

walmart filters that fit for batching

The first thing we can do is reverse the ordering: we can scrape results sorted lowest-to-highest and then highest-to-lowest - doubling our coverage to 50 pages, or 2,000 products!

Further, we can split our query into smaller queries by using single-choice filters (radio buttons) like "department" or go even further and use price ranges.

With a bit of clever query splitting this 2000 product limit doesn't look that intimidating!
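For example, here's a hedged sketch of how such query splitting could be generated. Note that the "price_high" sort value and the min_price/max_price parameter names are assumptions for illustration - copy the exact names from the URL your browser produces when you apply these filters:

from w3lib.url import add_or_replace_parameters


def split_query_urls(query: str, price_brackets=((0, 10), (10, 25), (25, 50), (50, 1000))):
    """generate search URL variants so that each variant stays under the pagination cap"""
    urls = []
    for low, high in price_brackets:
        # scrape each price bracket in both sort directions to double the coverage
        for sort in ("price_low", "price_high"):
            urls.append(
                add_or_replace_parameters(
                    "https://www.walmart.com/search",
                    {"q": query, "sort": sort, "min_price": low, "max_price": high},
                )
            )
    return urls


for url in split_query_urls("spider"):
    print(url)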

Product Scraper

Our scraper can use Walmart's search functionality to discover product previews, which contain the price, a few images, the product URL and a short description.
To collect full product data we'll need to scrape each product URL individually, so let's extend our scraper with this functionality.

from typing import Dict, List


def parse_product(html_text: str) -> Dict:
    ...  # implemented below


async def _scrape_products_by_url(urls: List[str], session: httpx.AsyncClient) -> List[Dict]:
    responses = await asyncio.gather(*[session.get(url) for url in urls])
    results = []
    for resp in responses:
        assert resp.status_code == 200
        results.append(parse_product(resp.text))
    return results

For parsing we can employ the same strategy we used for the search pages - extracting the __NEXT_DATA__ JSON state object. In the product's case it contains all of the product data in JSON format, which is very convenient for us:

def parse_product(html_text: str) -> Dict:
    """parse walmart product"""
    sel = Selector(text=html_text)
    data = sel.xpath('//script[@id="__NEXT_DATA__"]/text()').get()
    data = json.loads(data)
    _product_raw = data["props"]["pageProps"]["initialData"]["data"]["product"]
    # There's a lot of product data, including private meta keywords, so we need to do some filtering:
    wanted_product_keys = [
        "availabilityStatus",
        "averageRating",
        "brand",
        "id",
        "imageInfo",
        "manufacturerName",
        "name",
        "orderLimit",
        "orderMinLimit",
        "priceInfo",
        "shortDescription",
        "type",
    ]
    product = {k: v for k, v in _product_raw.items() if k in wanted_product_keys}
    reviews_raw = data["props"]["pageProps"]["initialData"]["data"]["reviews"]
    return {"product": product, "reviews": reviews_raw}

In this parse function we pick up the __NEXT_DATA__ object and parse it for product information. There's a lot of data here, so we use a key whitelist to select only the most important fields like product name, price, description and media.

Example Output
{
  "product": {
    "availabilityStatus": "IN_STOCK",
    "averageRating": 2.3,
    "brand": "Sony Pictures Entertainment",
    "shortDescription": "It&apos;s great to be Spider-Man (Andrew Garfield). for Peter Parker, there&apos;s no feeling quite like swinging between skyscrapers, embracing being the hero, and spending time with Gwen (Emma Stone). But being Spider-Man comes at a price: only Spider-Man can protect his fellow New Yorkers from the formidable villains that threaten the city. With the emergence of Electro (Jamie Foxx), Peter must confront a foe far more powerful than himself. And as his old friend, Harry Osborn (Dane DeHaan), returns, Peter comes to realize that all of his enemies have one thing in common: Oscorp.",
    "id": "43N352NZTVIQ",
    "imageInfo": {
      "allImages": [
        {
          "id": "E832A8930EF64D37B408265925B61573",
          "url": "https://i5.walmartimages.com/asr/175ec13e-b95a-4e6f-a57c-13e9ea7b703e_1.1047259e6d0655743cac1997d2b1a16e.jpeg",
          "zoomable": false
        },
        {
          "id": "A2C3299D21A34FADB84047E627CFD9E4",
          "url": "https://i5.walmartimages.com/asr/c8f793a3-5ebf-4f83-a2e1-a71fda15dbd3_1.f8b6234fb668f7c4f8d72f1a1c0f21c4.jpeg",
          "zoomable": false
        }
      ],
      "thumbnailUrl": "https://i5.walmartimages.com/asr/175ec13e-b95a-4e6f-a57c-13e9ea7b703e_1.1047259e6d0655743cac1997d2b1a16e.jpeg"
    },
    "manufacturerName": "Sony",
    "name": "The Amazing Spider-Man 2 (Blu-ray + DVD)",
    "orderMinLimit": 1,
    "orderLimit": 5,
    "priceInfo": {
      "priceDisplayCodes": {
        "clearance": null,
        "eligibleForAssociateDiscount": true,
        "finalCostByWeight": null,
        "priceDisplayCondition": null,
        "reducedPrice": null,
        "rollback": null,
        "submapType": null
      },
      "currentPrice": {
        "price": 7.82,
        "priceString": "$7.82",
        "variantPriceString": "$7.82",
        "currencyUnit": "USD"
      },
      "wasPrice": {
        "price": 14.99,
        "priceString": "$14.99",
        "variantPriceString": null,
        "currencyUnit": "USD"
      },
      "unitPrice": null,
      "savings": null,
      "subscriptionPrice": null,
      "priceRange": {
        "minPrice": null,
        "maxPrice": null,
        "priceString": null,
        "currencyUnit": null,
        "denominations": null
      },
      "capType": null,
      "walmartFundedAmount": null
    },
    "type": "Movies"
  },
  "reviews": {
    "averageOverallRating": 2.3333,
    "customerReviews": [
      {
        "rating": 1,
        "reviewSubmissionTime": "9/7/2019",
        "reviewText": "I received this and was so disappointed.  the pic advertised shows a digital copy is included,  but it's just the Blu-ray.  immediately returned bc that is not what I ordered nor does it match the photo shown.",
        "reviewTitle": "no digital copy",
        "userNickname": "Tbaby",
        "photos": [],
        "badges": null,
        "syndicationSource": null
      },
      {
        "rating": 1,
        "reviewSubmissionTime": "3/11/2019",
        "reviewText": "Advertised as \"VUDU Instawatch Included\", this is not true.\nPicture shows BluRay + DVD + Digital HD, what actually ships is just the BluRay + DVD.",
        "reviewTitle": "WARNING: You don't get what's advertised.",
        "userNickname": "Reviewer",
        "photos": [],
        "badges": null,
        "syndicationSource": null
      },
      {
        "rating": 5,
        "reviewSubmissionTime": "1/4/2021",
        "reviewText": null,
        "reviewTitle": null,
        "userNickname": null,
        "photos": [],
        "badges": [
          {
            "badgeType": "Custom",
            "id": "VerifiedPurchaser",
            "contentType": "REVIEW",
            "glassBadge": {
              "id": "VerifiedPurchaser",
              "text": "Verified Purchaser"
            }
          }
        ],
        "syndicationSource": null
      }
    ],
    "ratingValueFiveCount": 1,
    "ratingValueFourCount": 0,
    "ratingValueOneCount": 2,
    "ratingValueThreeCount": 0,
    "ratingValueTwoCount": 0,
    "roundedAverageOverallRating": 2.3,
    "topNegativeReview": null,
    "topPositiveReview": null,
    "totalReviewCount": 3
  }
}

Final Scraper

We can now discover Walmart products through search and scrape each individual product - let's put the two together in our final web scraper script:

Full Scraper Code
import asyncio
import json
import math
from typing import Dict, List, Tuple
from urllib.parse import urljoin

import httpx
from loguru import logger as log
from parsel import Selector
from w3lib.url import add_or_replace_parameters


async def _search_walmart_page(query: str = "", page=1, sort="price_low", *, session: httpx.AsyncClient) -> httpx.Response:
    """scrape single walmart search page"""
    url = add_or_replace_parameters(
        "https://www.walmart.com/search?",
        {
            "q": query,
            "sort": sort,
            "page": page,
            "affinityOverride": "default",
        },
    )
    log.debug(f'searching walmart page {page} of "{query}" sorted by {sort}')
    resp = await session.get(url)
    assert resp.status_code == 200
    return resp


def parse_search(html_text: str) -> Tuple[List[Dict], int]:
    """extract search results from search HTML response"""
    log.debug(f"parsing search page")
    sel = Selector(text=html_text)
    data = sel.xpath('//script[@id="__NEXT_DATA__"]/text()').get()
    data = json.loads(data)

    total_results = data["props"]["pageProps"]["initialData"]["searchResult"]["itemStacks"][0]["count"]
    results = data["props"]["pageProps"]["initialData"]["searchResult"]["itemStacks"][0]["items"]
    # there are other results types such as ads or placeholders - filter them out:
    results = [result for result in results if result["__typename"] == "Product"]
    log.info(f"parsed {len(results)} search product previews")
    return results, total_results


async def discover_walmart(search: str, session: httpx.AsyncClient) -> List[Dict]:
    log.info(f"searching walmart for {search}")
    _resp_page1 = await _search_walmart_page(query=search, session=session)
    results, total_items = parse_search(_resp_page1.text)
    max_page = math.ceil(total_items / 40)
    log.info(f"found total {max_page} pages of results ({total_items} products)")
    if max_page > 25:
        max_page = 25
    for response in await asyncio.gather(
        *[_search_walmart_page(query=search, page=i, session=session) for i in range(2, max_page + 1)]
    ):
        results.extend(parse_search(response.text)[0])
    log.info(f"parsed total of {len(results)} product previews")
    return results


def parse_product(html_text):
    """parse walmart product"""
    sel = Selector(text=html_text)
    data = sel.xpath('//script[@id="__NEXT_DATA__"]/text()').get()
    data = json.loads(data)
    _product_raw = data["props"]["pageProps"]["initialData"]["data"]["product"]
    wanted_product_keys = [
        "availabilityStatus",
        "averageRating",
        "brand",
        "id",
        "imageInfo",
        "manufacturerName",
        "name",
        "orderLimit",
        "orderMinLimit",
        "priceInfo",
        "shortDescription",
        "type",
    ]
    product = {k: v for k, v in _product_raw.items() if k in wanted_product_keys}
    reviews_raw = data["props"]["pageProps"]["initialData"]["data"]["reviews"]
    return {"product": product, "reviews": reviews_raw}


async def _scrape_products_by_url(urls: List[str], session: httpx.AsyncClient):
    """scrape walmart products by urls"""
    log.info(f"scraping {len(urls)} product urls (in chunks of 50)")
    results = []
    # scrape in chunks of 50 to limit memory usage and avoid hammering the server
    for i in range(0, len(urls), 50):
        log.debug(f"scraping product chunk: {i}:{i+50}")
        chunk = urls[i : i + 50]
        responses = await asyncio.gather(*[session.get(url) for url in chunk])
        for resp in responses:
            assert resp.status_code == 200
            results.append(parse_product(resp.text))
    return results


async def scrape_walmart(search: str, session: httpx.AsyncClient):
    """scrape walmart products by search term"""
    search_results = await discover_walmart(search, session=session)
    product_urls = [
        urljoin("https://www.walmart.com/", product_preview["canonicalUrl"]) for product_preview in search_results
    ]
    return await _scrape_products_by_url(product_urls, session=session)


BASE_HEADERS = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US;en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
}


async def run():
    limits = httpx.Limits(max_keepalive_connections=5, max_connections=5)
    async with httpx.AsyncClient(headers=BASE_HEADERS, limits=limits) as session:
        results = await scrape_walmart("spider", session=session)
        print(json.dumps(results, indent=2))


if __name__ == "__main__":
    asyncio.run(run())

In this short scraper we've implemented two basic functions:

  • discover_walmart() which finds products for a given keyword in the form of product previews, which provide basic product information and, most importantly, URLs to the product pages.
  • _scrape_products_by_url() which scrapes full product data from these discovered URLs.

As for parsing, we took advantage of Walmart's frontend state stored in the __NEXT_DATA__ variable by extracting it and parsing it as a JSON object with a whitelisted set of keys. This approach is much easier to implement and maintain than HTML parsing.

Dealing With Blocking

Walmart is one of the biggest retailers in the world, so unsurprisingly it's protective of its product data. If we scrape more than a few products we'll soon be greeted with 307 redirects to the /blocked endpoint, or a captcha page.

walmart captcha request page

one of many block/captcha pages Walmart.com might display

Walmart uses a complex anti-scraping protection system that analyzes the scraper's IP address, HTTP capabilities and javascript environment. This essentially means that Walmart can easily block our web scraper unless we put in significant effort fortifying all of these elements.

How to Scrape Without Getting Blocked Tutorial

For more on how web scrapers are detected and blocked, see our full tutorial on web scraping blocking.

Instead, let's take advantage of the ScrapFly API, which can avoid all of these blocks for us!

illustration of scrapfly's middleware

ScrapFly offers several powerful features that'll help us get around Walmart's blocking - most notably, the anti scraping protection bypass.

For this we'll be using the scrapfly-sdk python package. First, let's install it using pip:

$ pip install scrapfly-sdk

To take advantage of ScrapFly's API in our Walmart web scraper, all we need to do is replace our httpx requests with scrapfly-sdk requests:

# before - plain httpx:
import httpx
session: httpx.AsyncClient

response = await session.get(url)

# after - the ScrapFly SDK:
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
session: ScrapflyClient

response = await session.async_scrape(
    ScrapeConfig(url=url, asp=True, country="US")
)

We can enable specific ScrapFly features using ScrapeConfig arguments. For Walmart, we set asp=True to enable the anti scraping protection bypass and country="US" so requests are routed through US proxies and we only scrape the US version of Walmart.

In full, our scraper code needs only a few minor changes:


import asyncio
import json
import math
from typing import Dict, List, Tuple
from urllib.parse import urljoin

from loguru import logger as log
from parsel import Selector
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
from w3lib.url import add_or_replace_parameters


async def _search_walmart_page(query: str = "", page=1, sort="price_low", *, session: ScrapflyClient) -> ScrapeApiResponse:
    """scrape single walmart search page"""
    url = add_or_replace_parameters(
        "https://www.walmart.com/search?",
        {
            "q": query,
            "sort": sort,
            "page": page,
            "affinityOverride": "default",
        },
    )
    log.debug(f'searching walmart page {page} of "{query}" sorted by {sort}')
    resp = await session.async_scrape(ScrapeConfig(url=url, asp=True, country="US"))
    assert resp.status_code == 200
    return resp


def parse_search(html_text: str) -> Tuple[List[Dict], int]:
    """extract search results from search HTML response"""
    log.debug(f"parsing search page")
    sel = Selector(text=html_text)
    data = sel.xpath('//script[@id="__NEXT_DATA__"]/text()').get()
    data = json.loads(data)

    total_results = data["props"]["pageProps"]["initialData"]["searchResult"]["itemStacks"][0]["count"]
    results = data["props"]["pageProps"]["initialData"]["searchResult"]["itemStacks"][0]["items"]
    # there are other results types such as ads or placeholders - filter them out:
    results = [result for result in results if result["__typename"] == "Product"]
    log.info(f"parsed {len(results)} search product previews")
    return results, total_results


async def discover_walmart(search: str, session: ScrapflyClient) -> List[Dict]:
    log.info(f"searching walmart for {search}")
    _resp_page1 = await _search_walmart_page(query=search, session=session)
    results, total_items = parse_search(_resp_page1.content)
    max_page = math.ceil(total_items / 40)
    log.info(f"found total {max_page} pages of results ({total_items} products)")
    if max_page > 25:
        max_page = 25
    for response in await asyncio.gather(
        *[_search_walmart_page(query=search, page=i, session=session) for i in range(2, max_page + 1)]
    ):
        results.extend(parse_search(response.content)[0])
    log.info(f"parsed total {len(results)} pages of results ({total_items} products)")
    return results


def parse_product(html_text):
    """parse walmart product"""
    sel = Selector(text=html_text)
    data = sel.xpath('//script[@id="__NEXT_DATA__"]/text()').get()
    data = json.loads(data)
    _product_raw = data["props"]["pageProps"]["initialData"]["data"]["product"]
    wanted_product_keys = [
        "availabilityStatus",
        "averageRating",
        "brand",
        "id",
        "imageInfo",
        "manufacturerName",
        "name",
        "orderLimit",
        "orderMinLimit",
        "priceInfo",
        "shortDescription",
        "type",
    ]
    product = {k: v for k, v in _product_raw.items() if k in wanted_product_keys}
    reviews_raw = data["props"]["pageProps"]["initialData"]["data"]["reviews"]
    return {"product": product, "reviews": reviews_raw}


async def _scrape_products_by_url(urls: List[str], session: ScrapflyClient):
    """scrape walmart products by urls"""
    log.info(f"scraping {len(urls)} product urls (in chunks of 50)")
    results = []
    # we chunk requests to reduce memory usage and scraping speeds
    for i in range(0, len(urls), 50):
        log.debug(f"scraping product chunk: {i}:{i+50}")
        chunk = urls[i : i + 50]
        responses = await session.concurrent_scrape([ScrapeConfig(url=url, asp=True, country="US") for url in chunk])
        for resp in responses:
            assert resp.status_code == 200
            results.append(parse_product(resp.content))
    return results


async def scrape_walmart(search: str, session: ScrapflyClient):
    """scrape walmart products by search term"""
    search_results = await discover_walmart(search, session=session)
    product_urls = [
        urljoin("https://www.walmart.com/", product_preview["canonicalUrl"]) for product_preview in search_results
    ]
    return await _scrape_products_by_url(product_urls, session=session)


async def run():
    scrapfly = ScrapflyClient(key="YOUR_SCRAPFLY_API_KEY", max_concurrency=5)
    with scrapfly as session:
        results = await scrape_walmart("spider", session=session)
        print(json.dumps(results, indent=2))


if __name__ == "__main__":
    asyncio.run(run())

In our updated scraper above we've replaced the httpx calls with ScrapflyClient calls, so all of our requests go through the ScrapFly API, which smartly avoids web scraper blocking.


Summary

In this tutorial we built a small walmart.com scraper which uses search to discover products and then scrapes each product rapidly while avoiding blocking.

For this we used Python with the httpx and parsel packages, and to avoid blocking we used ScrapFly's API, which smartly configures every web scraper connection to avoid detection. For more on ScrapFly, see our documentation and try it out for free!
