How to Scrape Walmart
Tutorial on how to scrape walmart.com product and search data using pure Python. How to scrape Walmart without being blocked.
Walmart.com is one of the biggest retailers in the world with a major online presence in the United States. Because of this enormous reach, Walmart's public product data is often in demand for competitive intelligence and analytics. So, how can we collect this valuable data?
In this web scraping tutorial we'll take a look at scraping public product data from walmart.com. We'll start with how to find product urls using sitemaps, category links or the search API. Then we'll cover product scraping itself and how we can use a common javascript state parsing technique to quickly and easily scrape vast amounts of data. Finally, we'll see how to avoid the web scraper blocking Walmart is so notorious for!
To start, we first must find a way to discover Walmart products - there are a couple of common ways of achieving this.
The easiest approach is to take advantage of walmart's sitemaps. If we take a look at https://www.walmart.com/robots.txt scraping rules we can see that there are multiple sitemaps:
Sitemap: https://www.walmart.com/sitemap_browse.xml
Sitemap: https://www.walmart.com/sitemap_category.xml
Sitemap: https://www.walmart.com/sitemap_store_main.xml
Sitemap: https://www.walmart.com/help/sitemap_gm.xml
Sitemap: https://www.walmart.com/sitemap_browse_fst.xml
Sitemap: https://www.walmart.com/sitemap_store_dept.xml
Sitemap: https://www.walmart.com/sitemap_bf_2020.xml
Sitemap: https://www.walmart.com/sitemap_tp_legacy.xml
...
Unfortunately, this doesn't give us much room for result filtering. By the looks of it, we can only filter results by category using the https://www.walmart.com/sitemap_category.xml sitemap:
<url>
<loc>https://www.walmart.com/cp/-degree/928899</loc>
<lastmod>2022-04-01</lastmod>
</url>
<url>
<loc>https://www.walmart.com/cp/-depend/1092729</loc>
<lastmod>2022-04-01</lastmod>
</url>
<url>
<loc>https://www.walmart.com/cp/-hungergames/1095300</loc>
<lastmod>2022-04-01</lastmod>
</url>
<url>
<loc>https://www.walmart.com/cp/-jackson/1103987</loc>
<lastmod>2022-04-01</lastmod>
</url>
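As a quick illustration, here is a minimal sketch of collecting these category urls with the same httpx and parsel packages we'll use throughout this tutorial (assuming the sitemap stays available at this address and keeps the <loc> structure shown above):
import httpx
from parsel import Selector

def scrape_category_sitemap() -> list:
    """download walmart's category sitemap and extract category urls"""
    resp = httpx.get("https://www.walmart.com/sitemap_category.xml")
    resp.raise_for_status()
    # sitemap XML declares a default namespace - removing it keeps the xpath simple
    sel = Selector(text=resp.text, type="xml")
    sel.remove_namespaces()
    return sel.xpath("//url/loc/text()").getall()

print(scrape_category_sitemap()[:3])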
Each url in this sitemap will take us to the pagination page of a single category, which we can further customize with additional filters.
Alternatively, we can use the search system ourselves which brings us to the same filter-capable page: https://www.walmart.com/search?q=spider&sort=price_low&page=2&affinityOverride=default
So, whichever way we choose to approach this, we'll have to parse the same kind of page, which is great - we can write one scraper function to deal with both scenarios.
In this tutorial let's stick with parsing search pages; to parse category pages, all we'd have to do is replace the scraped url. First, let's pick an example search page, like a search for the word "spider":
https://www.walmart.com/search?q=spider&sort=price_low&page=1&affinityOverride=default
We see this url contains a few parameters:
- q for the search query, in this case the word "spider"
- page for the page number, in this case the 1st page
- sort for the sorting order, in this case price_low, meaning sorted ascending by price
Now, since our scraper doesn't execute javascript, the dynamic result content will not be visible to us. Instead, let's open up the page source and search for some product name - we can see that there's state data under:
<script id="__NEXT_DATA__">{"...PRODUCT_PAGINATION_DATA..."}</script>
Highly dynamic websites (especially ones built with React/Next.js frameworks) often ship data hidden in the HTML and then unpack it into visible results on load using javascript. This is great news for us, as we will still be able to access these results without running any javascript in our web scraper.
For our scraper we'll be using Python with the httpx, parsel, w3lib and loguru packages. We can easily install them using pip:
$ pip install httpx parsel w3lib loguru
Let's start with our search scraper:
import asyncio
import json
import math
import httpx
from parsel import Selector
from w3lib.url import add_or_replace_parameters
async def _search_walmart_page(query: str = "", page=1, sort="price_low", *, session: httpx.AsyncClient) -> httpx.Response:
"""scrape single walmart search page"""
url = add_or_replace_parameters(
"https://www.walmart.com/search?",
{
"q": query,
"sort": sort,
"page": page,
"affinityOverride": "default",
},
)
resp = await session.get(url)
assert resp.status_code == 200
return resp
def parse_search(html_text: str):
"""extract search results from search HTML response"""
sel = Selector(text=html_text)
data = sel.xpath('//script[@id="__NEXT_DATA__"]/text()').get()
data = json.loads(data)
total_results = data["props"]["pageProps"]["initialData"]["searchResult"]["itemStacks"][0]["count"]
results = data["props"]["pageProps"]["initialData"]["searchResult"]["itemStacks"][0]["items"]
return results, total_results
In this scraper we start off with two functions:
- _search_walmart_page() which creates a query url from the given parameters and scrapes the HTML of the search page
- parse_search() which takes the search page HTML, finds the __NEXT_DATA__ javascript state object and parses out the search results as well as the total result count
Now that we have a way to retrieve the results of a single search page, let's improve it to scrape all 25 pages of results:
async def discover_walmart(search:str, session:httpx.AsyncClient):
_resp_page1 = await _search_walmart_page(query=search, session=session)
results, total_items = parse_search(_resp_page1.text)
max_page = math.ceil(total_items / 40)
if max_page > 25:
max_page = 25
for response in await asyncio.gather(
*[_search_walmart_page(query=search, page=i, session=session) for i in range(2, max_page + 1)]
):
results.extend(parse_search(response.text)[0])
return results
Here, we've added a wrapper function that will scrape the first page of search results and then scrape the remaining pages concurrently.
We need some execution code to run this scraper:
BASE_HEADERS = {
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
"accept-language": "en-US;en;q=0.9",
"accept-encoding": "gzip, deflate, br",
}
async def run():
# limit connection speed to prevent scraping too fast
limits = httpx.Limits(max_keepalive_connections=5, max_connections=5)
async with httpx.AsyncClient(headers=BASE_HEADERS, limits=limits) as session:
results = await discover_walmart("spider", session=session)
print(json.dumps(results, indent=2))
if __name__ == "__main__":
asyncio.run(run())
Here, we're applying some custom headers to our web connection session to avoid scraper blocking. We create an asynchronous httpx client and call our discover function to find all results of this query:
[
{
"__typename": "Product",
"availabilityStatusDisplayValue": "In stock",
"productLocationDisplayValue": null,
"externalInfoUrl": "",
"canonicalUrl": "/ip/Eliminator-Ant-Roach-Spider-Killer4-20-oz-Kills-Insects-Spiders/795033156",
"canAddToCart": true,
"showOptions": false,
"showBuyNow": false,
"description": "<li>KILLS ON CONTACT: Eliminator Ant, Roach & Spider Killer4 kills cockroaches, ants, carpenter ants, crickets, firebrats, fleas, silverfish and spiders</li><li>NON-STAINING: This water-based product</li>",
"flag": "",
"badge": {
"text": "",
"id": "",
"type": "",
"key": ""
},
"fulfillmentBadges": [
"Pickup",
"Delivery",
"1-day shipping"
],
"fulfillmentIcon": {
"key": "SAVE_WITH_W_PLUS",
"label": "Save with"
},
"fulfillmentBadge": "Tomorrow",
"fulfillmentSpeed": [
"TOMORROW"
],
"fulfillmentType": "FC",
"groupMetaData": {
"groupType": null,
"groupSubType": null,
"numberOfComponents": 0,
"groupComponents": null
},
"id": "5D3NBXRMIZK4",
"itemType": null,
"usItemId": "795033156",
"image": "https://i5.walmartimages.com/asr/c9c0c51c-f30f-4eb2-aaf1-88f599167584.d824f7ff13f10b3dcfb9dadd2a04686d.jpeg?odnHeight=180&odnWidth=180&odnBg=ffffff",
"isOutOfStock": false,
"esrb": "",
"mediaRating": "",
"name": "Eliminator Ant, Roach & Spider Killer4, 20 oz, Kills Insects & Spiders",
"price": 3.48,
"preOrder": {
"isPreOrder": false,
"preOrderMessage": null,
"preOrderStreetDateMessage": null
},
"..."
]
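Each preview item already contains useful fields such as the name, price, canonicalUrl and stock status. As a small sketch (field names taken from the example output above, and subject to change on Walmart's side), we could reduce these previews to a compact summary like this:
from urllib.parse import urljoin

def summarize_previews(previews: list) -> list:
    """reduce search preview items to a few key fields"""
    summaries = []
    for item in previews:
        summaries.append({
            "id": item.get("usItemId"),
            "name": item.get("name"),
            "price": item.get("price"),
            "url": urljoin("https://www.walmart.com/", item.get("canonicalUrl", "")),
            "in_stock": not item.get("isOutOfStock", False),
        })
    return summaries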
There's one minor issue with our search discovery approach - page limit. Walmart returns only 25 pages (1000 products) per query - what if our query has more than that?
The best way to deal with this is to split our query into multiple smaller queries and we can do this by applying filters:
The first thing we can do is take advantage of ordering: we can scrape results sorted lowest-to-highest and then again sorted highest-to-lowest - doubling our coverage to 50 pages or 2000 products!
Further, we can split our query into smaller queries by using single-choice filters (radio buttons) like "department" or go even further and use price ranges.
With a bit of clever query splitting this 2000 product limit doesn't look that intimidating!
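For example, here's a minimal sketch of the reverse-ordering trick using the functions we've defined above. Note that price_high is an assumed sort value (the opposite of price_low) and should be verified against the live search page:
async def discover_walmart_reversed(search: str, session: httpx.AsyncClient):
    """scrape a query sorted ascending and descending to double page coverage"""
    results = []
    # "price_high" is an assumption - confirm the exact sort value on walmart.com
    for sort in ("price_low", "price_high"):
        first_page = await _search_walmart_page(query=search, sort=sort, session=session)
        page_results, total_items = parse_search(first_page.text)
        results.extend(page_results)
        max_page = min(math.ceil(total_items / 40), 25)
        other_pages = await asyncio.gather(
            *[_search_walmart_page(query=search, page=i, sort=sort, session=session) for i in range(2, max_page + 1)]
        )
        for response in other_pages:
            results.extend(parse_search(response.text)[0])
    # the two directions overlap when a query has under 2000 results - deduplicate by id
    unique = {item.get("usItemId") or item.get("id"): item for item in results}
    return list(unique.values())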
Our scraper can use Walmart's search functionality to discover product preview details, which contain the price, a few images, the product url and some description.
To collect full product data we'll need to scrape each product url individually so let's extend our scraper with this functionality.
def parse_product(html):
...
async def _scrape_products_by_url(urls: List[str], session:httpx.AsyncClient):
responses = await asyncio.gather(*[session.get(url) for url in urls])
results = []
for resp in responses:
assert resp.status_code == 200
results.append(parse_product(resp.text))
return results
For parsing we can employ the same strategy we've been using for search pages - extracting the __NEXT_DATA__ json state object. In the product page's case it contains all of the product data in JSON format, which is very convenient for us:
def parse_product(html_text: str) -> Dict:
"""parse walmart product"""
sel = Selector(text=html_text)
data = sel.xpath('//script[@id="__NEXT_DATA__"]/text()').get()
data = json.loads(data)
_product_raw = data["props"]["pageProps"]["initialData"]["data"]["product"]
# There's a lot of product data, including private meta keywords, so we need to do some filtering:
wanted_product_keys = [
"availabilityStatus",
"averageRating",
"brand",
"id",
"imageInfo",
"manufacturerName",
"name",
"orderLimit",
"orderMinLimit",
"priceInfo",
"shortDescription",
"type",
]
product = {k: v for k, v in _product_raw.items() if k in wanted_product_keys}
reviews_raw = data["props"]["pageProps"]["initialData"]["data"]["reviews"]
return {"product": product, "reviews": reviews_raw}
In this parse function we pick up the __NEXT_DATA__ object and parse it for product information. There's a lot of data here, so we use a key whitelist to select only the most important fields like product name, price, description and media.
{
"product": {
"availabilityStatus": "IN_STOCK",
"averageRating": 2.3,
"brand": "Sony Pictures Entertainment",
"shortDescription": "It's great to be Spider-Man (Andrew Garfield). for Peter Parker, there's no feeling quite like swinging between skyscrapers, embracing being the hero, and spending time with Gwen (Emma Stone). But being Spider-Man comes at a price: only Spider-Man can protect his fellow New Yorkers from the formidable villains that threaten the city. With the emergence of Electro (Jamie Foxx), Peter must confront a foe far more powerful than himself. And as his old friend, Harry Osborn (Dane DeHaan), returns, Peter comes to realize that all of his enemies have one thing in common: Oscorp.",
"id": "43N352NZTVIQ",
"imageInfo": {
"allImages": [
{
"id": "E832A8930EF64D37B408265925B61573",
"url": "https://i5.walmartimages.com/asr/175ec13e-b95a-4e6f-a57c-13e9ea7b703e_1.1047259e6d0655743cac1997d2b1a16e.jpeg",
"zoomable": false
},
{
"id": "A2C3299D21A34FADB84047E627CFD9E4",
"url": "https://i5.walmartimages.com/asr/c8f793a3-5ebf-4f83-a2e1-a71fda15dbd3_1.f8b6234fb668f7c4f8d72f1a1c0f21c4.jpeg",
"zoomable": false
}
],
"thumbnailUrl": "https://i5.walmartimages.com/asr/175ec13e-b95a-4e6f-a57c-13e9ea7b703e_1.1047259e6d0655743cac1997d2b1a16e.jpeg"
},
"manufacturerName": "Sony",
"name": "The Amazing Spider-Man 2 (Blu-ray + DVD)",
"orderMinLimit": 1,
"orderLimit": 5,
"priceInfo": {
"priceDisplayCodes": {
"clearance": null,
"eligibleForAssociateDiscount": true,
"finalCostByWeight": null,
"priceDisplayCondition": null,
"reducedPrice": null,
"rollback": null,
"submapType": null
},
"currentPrice": {
"price": 7.82,
"priceString": "$7.82",
"variantPriceString": "$7.82",
"currencyUnit": "USD"
},
"wasPrice": {
"price": 14.99,
"priceString": "$14.99",
"variantPriceString": null,
"currencyUnit": "USD"
},
"unitPrice": null,
"savings": null,
"subscriptionPrice": null,
"priceRange": {
"minPrice": null,
"maxPrice": null,
"priceString": null,
"currencyUnit": null,
"denominations": null
},
"capType": null,
"walmartFundedAmount": null
},
"type": "Movies"
},
"reviews": {
"averageOverallRating": 2.3333,
"customerReviews": [
{
"rating": 1,
"reviewSubmissionTime": "9/7/2019",
"reviewText": "I received this and was so disappointed. the pic advertised shows a digital copy is included, but it's just the Blu-ray. immediately returned bc that is not what I ordered nor does it match the photo shown.",
"reviewTitle": "no digital copy",
"userNickname": "Tbaby",
"photos": [],
"badges": null,
"syndicationSource": null
},
{
"rating": 1,
"reviewSubmissionTime": "3/11/2019",
"reviewText": "Advertised as \"VUDU Instawatch Included\", this is not true.\nPicture shows BluRay + DVD + Digital HD, what actually ships is just the BluRay + DVD.",
"reviewTitle": "WARNING: You don't get what's advertised.",
"userNickname": "Reviewer",
"photos": [],
"badges": null,
"syndicationSource": null
},
{
"rating": 5,
"reviewSubmissionTime": "1/4/2021",
"reviewText": null,
"reviewTitle": null,
"userNickname": null,
"photos": [],
"badges": [
{
"badgeType": "Custom",
"id": "VerifiedPurchaser",
"contentType": "REVIEW",
"glassBadge": {
"id": "VerifiedPurchaser",
"text": "Verified Purchaser"
}
}
],
"syndicationSource": null
}
],
"ratingValueFiveCount": 1,
"ratingValueFourCount": 0,
"ratingValueOneCount": 2,
"ratingValueThreeCount": 0,
"ratingValueTwoCount": 0,
"roundedAverageOverallRating": 2.3,
"topNegativeReview": null,
"topPositiveReview": null,
"totalReviewCount": 3
}
}
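If only pricing is of interest, the parsed output can be flattened further. Here's a small sketch based on the field names seen in the example output above (these names reflect Walmart's internal structure and may change):
def summarize_price(parsed: dict) -> dict:
    """flatten a parsed product into a simple price record"""
    product = parsed["product"]
    price_info = product.get("priceInfo") or {}
    current = price_info.get("currentPrice") or {}
    was = price_info.get("wasPrice") or {}
    return {
        "name": product.get("name"),
        "current_price": current.get("price"),
        "was_price": was.get("price"),
        "currency": current.get("currencyUnit"),
        "rating": product.get("averageRating"),
    }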
We can find Walmart products using the search and scrape each individual product - let's put these two together in our final web scraper script:
import asyncio
import json
import math
from typing import Dict, List, Tuple
from urllib.parse import urljoin
import httpx
from loguru import logger as log
from parsel import Selector
from w3lib.url import add_or_replace_parameters
async def _search_walmart_page(query: str = "", page=1, sort="price_low", *, session: httpx.AsyncClient) -> httpx.Response:
"""scrape single walmart search page"""
url = add_or_replace_parameters(
"https://www.walmart.com/search?",
{
"q": query,
"sort": sort,
"page": page,
"affinityOverride": "default",
},
)
log.debug(f'searching walmart page {page} of "{query}" sorted by {sort}')
resp = await session.get(url)
assert resp.status_code == 200
return resp
def parse_search(html_text: str) -> Tuple[Dict, int]:
"""extract search results from search HTML response"""
log.debug(f"parsing search page")
sel = Selector(text=html_text)
data = sel.xpath('//script[@id="__NEXT_DATA__"]/text()').get()
data = json.loads(data)
total_results = data["props"]["pageProps"]["initialData"]["searchResult"]["itemStacks"][0]["count"]
results = data["props"]["pageProps"]["initialData"]["searchResult"]["itemStacks"][0]["items"]
# there are other results types such as ads or placeholders - filter them out:
results = [result for result in results if result["__typename"] == "Product"]
log.info(f"parsed {len(results)} search product previews")
return results, total_results
async def discover_walmart(search: str, session: httpx.AsyncClient) -> List[Dict]:
log.info(f"searching walmart for {search}")
_resp_page1 = await _search_walmart_page(query=search, session=session)
results, total_items = parse_search(_resp_page1.text)
max_page = math.ceil(total_items / 40)
log.info(f"found total {max_page} pages of results ({total_items} products)")
if max_page > 25:
max_page = 25
for response in await asyncio.gather(
*[_search_walmart_page(query=search, page=i, session=session) for i in range(2, max_page + 1)]
):
results.extend(parse_search(response.text)[0])
log.info(f"parsed total {len(results)} product previews")
return results
def parse_product(html_text):
"""parse walmart product"""
sel = Selector(text=html_text)
data = sel.xpath('//script[@id="__NEXT_DATA__"]/text()').get()
data = json.loads(data)
_product_raw = data["props"]["pageProps"]["initialData"]["data"]["product"]
wanted_product_keys = [
"availabilityStatus",
"averageRating",
"brand",
"id",
"imageInfo",
"manufacturerName",
"name",
"orderLimit",
"orderMinLimit",
"priceInfo",
"shortDescription",
"type",
]
product = {k: v for k, v in _product_raw.items() if k in wanted_product_keys}
reviews_raw = data["props"]["pageProps"]["initialData"]["data"]["reviews"]
return {"product": product, "reviews": reviews_raw}
async def _scrape_products_by_url(urls: List[str], session: httpx.AsyncClient):
"""scrape walmart products by urls"""
log.info(f"scraping {len(urls)} product urls (in chunks of 50)")
results = []
# we chunk requests to reduce memory usage and scraping speeds
for i in range(0, len(urls), 50):
log.debug(f"scraping product chunk: {i}:{i+50}")
chunk = urls[i : i + 50]
responses = await asyncio.gather(*[session.get(url) for url in chunk])
for resp in responses:
assert resp.status_code == 200
results.append(parse_product(resp.text))
return results
async def scrape_walmart(search: str, session: httpx.AsyncClient):
"""scrape walmart products by search term"""
search_results = await discover_walmart(search, session=session)
product_urls = [
urljoin("https://www.walmart.com/", product_preview["canonicalUrl"]) for product_preview in search_results
]
return await _scrape_products_by_url(product_urls, session=session)
BASE_HEADERS = {
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
"accept-language": "en-US;en;q=0.9",
"accept-encoding": "gzip, deflate, br",
}
async def run():
limits = httpx.Limits(max_keepalive_connections=5, max_connections=5)
async with httpx.AsyncClient(headers=BASE_HEADERS, limits=limits) as session:
results = await scrape_walmart("spider", session=session)
total = json.dumps(results)
return total
if __name__ == "__main__":
asyncio.run(run())
In this short scraper we've implemented two basic functions:
- discover_walmart() which finds products for a given keyword in the form of product previews, which provide basic product information and, most importantly, urls to the product pages
- _scrape_products_by_url() which scrapes full product data from these discovered urls
As for parsing, we took advantage of Walmart's frontend storing data in the __NEXT_DATA__ HTML/JS variable by extracting it and parsing it as a JSON object for a whitelisted set of keys. This approach is much easier to implement and maintain than HTML parsing.
Walmart is one of the biggest retailers in the world, so unsurprisingly it's protective of its product data. If we scrape more than a few products we'll soon be greeted with 307 redirect responses to the /blocked endpoint, or a captcha page.
one of many block/captcha pages Walmart.com might display
Walmart is using a complex anti scraping protection system that analyzes the scraper's IP address, HTTP capabilities and javascript environment. This essentially means that Walmart can easily block our web scraper unless we put in significant effort fortifying all of these elements.
For more on how web scrapers are detected and blocked, see our full tutorial on web scraping blocking.
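Before reaching for a dedicated service, we could at least detect these blocks in our httpx-based scraper instead of failing on the status code assert. Here's a small sketch of such a check, based on the 307 redirect behaviour described above (the exact contents of the captcha page are an assumption):
def is_blocked(response: httpx.Response) -> bool:
    """detect walmart's blocking responses: 307 redirects to /blocked or captcha pages"""
    # httpx does not follow redirects by default, so the 307 is visible here
    if response.status_code == 307 and "/blocked" in response.headers.get("location", ""):
        return True
    # assumption: the captcha page mentions "captcha" somewhere in its HTML
    if response.status_code == 200 and "captcha" in response.text.lower():
        return True
    return False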
Instead, let's take advantage of the ScrapFly API, which offers several powerful features that'll help us get around Walmart's blocking.
For this we'll be using scrapfly-sdk python package and ScrapFly's anti scraping protection bypass feature. First, let's install scrapfly-sdk
using pip:
$ pip install scrapfly-sdk
To take advantage of ScrapFly's API in our Walmart web scraper, all we need to do is replace httpx requests with scrapfly-sdk requests:
import httpx
session: httpx.AsyncClient
response = session.get(url)
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
session: ScrapflyClient
response = session.scrape(
ScrapeConfig(url=url, asp=True, country="US")
)
We can enable specific ScrapFly features using ScrapeConfig arguments. For Walmart, we'll be setting asp=True for anti scraping protection bypass and setting the proxy's geographical location to the US to scrape only the US version of Walmart.
In full, our scraper code has only a few minor changes (see the highlighted areas):
import asyncio
import json
import math
from typing import Dict, List, Tuple
from urllib.parse import urljoin
from loguru import logger as log
from parsel import Selector
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
from w3lib.url import add_or_replace_parameters
async def _search_walmart_page(query: str = "", page=1, sort="price_low", *, session: ScrapflyClient) -> ScrapeApiResponse:
"""scrape single walmart search page"""
url = add_or_replace_parameters(
"https://www.walmart.com/search?",
{
"q": query,
"sort": sort,
"page": page,
"affinityOverride": "default",
},
)
log.debug(f'searching walmart page {page} of "{query}" sorted by {sort}')
resp = await session.async_scrape(ScrapeConfig(url=url, asp=True, country="US"))
assert resp.status_code == 200
return resp
def parse_search(html_text: str) -> Tuple[Dict, int]:
"""extract search results from search HTML response"""
log.debug(f"parsing search page")
sel = Selector(text=html_text)
data = sel.xpath('//script[@id="__NEXT_DATA__"]/text()').get()
data = json.loads(data)
total_results = data["props"]["pageProps"]["initialData"]["searchResult"]["itemStacks"][0]["count"]
results = data["props"]["pageProps"]["initialData"]["searchResult"]["itemStacks"][0]["items"]
# there are other results types such as ads or placeholders - filter them out:
results = [result for result in results if result["__typename"] == "Product"]
log.info(f"parsed {len(results)} search product previews")
return results, total_results
async def discover_walmart(search: str, session: ScrapflyClient) -> List[Dict]:
log.info(f"searching walmart for {search}")
_resp_page1 = await _search_walmart_page(query=search, session=session)
results, total_items = parse_search(_resp_page1.content)
max_page = math.ceil(total_items / 40)
log.info(f"found total {max_page} pages of results ({total_items} products)")
if max_page > 25:
max_page = 25
for response in await asyncio.gather(
*[_search_walmart_page(query=search, page=i, session=session) for i in range(2, max_page + 1)]
):
results.extend(parse_search(response.content)[0])
log.info(f"parsed total {len(results)} pages of results ({total_items} products)")
return results
def parse_product(html_text):
"""parse walmart product"""
sel = Selector(text=html_text)
data = sel.xpath('//script[@id="__NEXT_DATA__"]/text()').get()
data = json.loads(data)
_product_raw = data["props"]["pageProps"]["initialData"]["data"]["product"]
wanted_product_keys = [
"availabilityStatus",
"averageRating",
"brand",
"id",
"imageInfo",
"manufacturerName",
"name",
"orderLimit",
"orderMinLimit",
"priceInfo",
"shortDescription",
"type",
]
product = {k: v for k, v in _product_raw.items() if k in wanted_product_keys}
reviews_raw = data["props"]["pageProps"]["initialData"]["data"]["reviews"]
return {"product": product, "reviews": reviews_raw}
async def _scrape_products_by_url(urls: List[str], session: ScrapflyClient):
"""scrape walmart products by urls"""
log.info(f"scraping {len(urls)} product urls (in chunks of 50)")
results = []
# we chunk requests to reduce memory usage and scraping speeds
for i in range(0, len(urls), 50):
log.debug(f"scraping product chunk: {i}:{i+50}")
chunk = urls[i : i + 50]
responses = await session.concurrent_scrape([ScrapeConfig(url=url, asp=True, country="US") for url in chunk])
for resp in responses:
assert resp.status_code == 200
results.append(parse_product(resp.content))
return results
async def scrape_walmart(search: str, session: ScrapflyClient):
"""scrape walmart products by search term"""
search_results = await discover_walmart(search, session=session)
product_urls = [
urljoin("https://www.walmart.com/", product_preview["canonicalUrl"]) for product_preview in search_results
]
return await _scrape_products_by_url(product_urls, session=session)
async def run():
scrapfly = ScrapflyClient(key="YOUR_SCRAPFLY_API_KEY", max_concurrency=5)
with scrapfly as session:
results = await scrape_walmart("spider", session=session)
total = json.dumps(results)
return total
if __name__ == "__main__":
asyncio.run(run())
In our updated scraper above we've replaced httpx calls with ScrapflyClient calls (see the highlighted lines), so all of our requests go through the ScrapFly API, which smartly avoids web scraper blocking.
To wrap this guide up let's take a look at some frequently asked questions about web scraping walmart.com:
Is it legal to scrape walmart.com?
Yes. Walmart product data is publicly available, and we're not extracting anything personal or private. Scraping walmart.com at slow, respectful rates falls under the ethical scraping definition. See our Is Web Scraping Legal? article for more.
In this tutorial we built a small https://www.walmart.com/ scraper which uses search to discover products and then scrapes all of them rapidly while avoiding blocking.
For this we've used Python with the httpx and parsel packages, and to avoid blocking we used ScrapFly's API, which smartly configures every web scraper connection to avoid being blocked. For more on ScrapFly, see our documentation and try it out for free!