StockX is an online marketplace for buying and selling authentic sneakers, streetwear, watches, and designer handbags. The most interesting part about StockX is that it treats apparel items as a commodity and tracks their value over time. This makes StockX a prime target for web scraping as tracking market movement and product data is a great way to build a data-driven business.
In this web scraping tutorial, we'll be taking a look at how to scrape StockX using Python. We'll be scraping StockX's hidden web data which is an incredibly easy way to scrape e-commerce websites with just a few lines of code.
We'll start with a quick Python environment setup and tool overview, then scrape some products and product search pages. Let's dive in!
This tutorial covers popular web scraping techniques for education. Interacting with public servers requires diligence and respect, and here's a good summary of what not to do:
Do not scrape at rates that could damage the website.
Do not scrape data that's not available publicly.
Do not store PII of EU citizens who are protected by GDPR.
Do not repurpose entire public datasets, which can be illegal in some countries.
Scrapfly does not offer legal advice, but these are good general rules to follow in web scraping; for more, you should consult a lawyer.
Why scrape StockX?
Just like with most e-commerce targets, StockX's public data provides an important overview of the market. So, scraping StockX is a great way to gain a competitive advantage through data-driven decision-making.
Additionally, since StockX treats its products as commodities, by scraping the data we can perform various data analysis tasks to keep on top of market trends, outbidding and outmaneuvering our competitors.
StockX Dataset Preview
Since we'll be using hidden web data scraping, we'll be extracting whole product datasets which contain fields like:
Product data like descriptions, images, sizes, ids, traits and everything visible on the page.
Product market performance like sale numbers, prices, asks and bids.
For trimming these JSON datasets down to something smaller, see our JMESPath and JSONPath tool introductions.
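For example, here's a quick sketch of trimming a scraped product dataset with jmespath; the dataset and its field names below are made up purely for illustration:

import jmespath

# a made-up product dataset resembling what our scraper returns
product = {
    "title": "Nike Air Max 90 SE Running Club",
    "urlKey": "nike-air-max-90-se-running-club",
    "market": {"salesInformation": {"lastSale": 120, "salesThisPeriod": 45}},
    "media": {"imageUrl": "https://images.stockx.com/..."},
}

# keep only the fields we care about (a JMESPath multi-select hash)
summary = jmespath.search(
    "{name: title, url_key: urlKey, last_sale: market.salesInformation.lastSale}",
    product,
)
print(summary)
# {'name': 'Nike Air Max 90 SE Running Club', 'url_key': 'nike-air-max-90-se-running-club', 'last_sale': 120}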
Project Setup
In this web scraping tutorial, we'll be using Python with three popular libraries:
httpx - HTTP client library which we'll use to retrieve StockX's web pages.
parsel - HTML parsing library which we'll use to find <script> elements containing hidden web data.
nested_lookup - Allows extracting nested values from a dictionary by key. Since StockX's datasets are huge and deeply nested, this library will help us find the product data quickly and reliably.
These packages can be easily installed via the pip install command:
$ pip install httpx parsel nested_lookup
Alternatively, feel free to swap httpx out with any other HTTP client package such as requests, as we'll only need basic HTTP functions which are almost interchangeable in every library. As for parsel, another great alternative is the beautifulsoup package.
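To illustrate why nested_lookup is so useful here, a tiny sketch with a made-up nested dictionary that loosely resembles the page caches we'll be parsing:

from nested_lookup import nested_lookup

# a made-up, deeply nested cache similar in spirit to StockX's page data
cache = {
    "props": {
        "pageProps": {
            "req": {
                "appContext": {
                    "query": {"value": {"product": {"urlKey": "example-shoe"}}}
                }
            }
        }
    }
}

# find every value stored under the "product" key, no matter how deeply it's nested
print(nested_lookup("product", cache))
# [{'urlKey': 'example-shoe'}]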
Next, let's start by taking a look at how to scrape StockX's single product data.
Scraping StockX Product Data
To scrape single product data we'll be using the hidden web data technique.
StockX is powered by React and Next.js technologies so we'll be looking for hidden data in the <script> elements. In particular, hidden web data is usually available in one of these two places:
<script id="__NEXT_DATA__" type="application/json">{...}</script>
<!-- or -->
<script data-name="query">window.__REACT_QUERY_STATE__ = {...};</script>
To parse this HTML for these hidden datasets we can use XPath or CSS Selectors:
import json
from parsel import Selector
def parse_hidden_data(html: str) -> dict:
    """extract nextjs cache from page"""
    selector = Selector(html)
    data = selector.css("script#__NEXT_DATA__::text").get()
    if not data:
        data = selector.css("script[data-name=query]::text").get()
        data = data.split("=", 1)[-1].strip().strip(";")
    data = json.loads(data)
    return data
Here we're building a parsel.Selector and looking up <script> elements based on CSS selectors.
Next, let's add HTTP capabilities to complete our product scraper and take it for a spin:
Python
Scrapfly
import asyncio
import json

import httpx
from nested_lookup import nested_lookup
from parsel import Selector

# create HTTPX client with headers that resemble a web browser
client = httpx.AsyncClient(
    http2=True,
    follow_redirects=True,
    headers={
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "en-US,en;q=0.9",
        "Cache-Control": "no-cache",
    },
)


def parse_nextjs(html: str) -> dict:
    """extract nextjs cache from page"""
    selector = Selector(html)
    data = selector.css("script#__NEXT_DATA__::text").get()
    if not data:
        data = selector.css("script[data-name=query]::text").get()
        data = data.split("=", 1)[-1].strip().strip(";")
    data = json.loads(data)
    return data


async def scrape_product(url: str) -> dict:
    """scrape a single stockx product page for product data"""
    response = await client.get(url)
    assert response.status_code == 200
    data = parse_nextjs(response.text)
    # extract all product datasets from page cache
    products = nested_lookup("product", data)
    # find the current product dataset
    try:
        product = next(p for p in products if p.get("urlKey") in str(response.url))
    except StopIteration:
        raise ValueError("Could not find product dataset in page cache", response)
    return product


# example use:
url = "https://stockx.com/nike-air-max-90-se-running-club"
print(asyncio.run(scrape_product(url)))
import asyncio
import json

from nested_lookup import nested_lookup
from scrapfly import ScrapeApiResponse, ScrapeConfig, ScrapflyClient

scrapfly = ScrapflyClient(key="YOUR SCRAPFLY API KEY", max_concurrency=10)


def parse_nextjs(result: ScrapeApiResponse) -> dict:
    """extract nextjs cache from page"""
    data = result.selector.css("script#__NEXT_DATA__::text").get()
    if not data:
        data = result.selector.css("script[data-name=query]::text").get()
        data = data.split("=", 1)[-1].strip().strip(";")
    data = json.loads(data)
    return data


async def scrape_product(url: str) -> dict:
    """scrape a single stockx product page for product data"""
    result = await scrapfly.async_scrape(
        ScrapeConfig(
            url=url,
            render_js=True,
            proxy_pool="public_residential_pool",
            country="US",
            asp=True,
        )
    )
    data = parse_nextjs(result)
    # extract all product datasets from page cache
    products = nested_lookup("product", data)
    # find the current product dataset
    try:
        product = next(p for p in products if p.get("urlKey") in result.context["url"])
    except StopIteration:
        raise ValueError("Could not find product dataset in page cache", result.context)
    return product


# example use:
url = "https://stockx.com/nike-air-max-90-se-running-club"
print(asyncio.run(scrape_product(url)))
Above, in just a few lines of code, we've scraped the entire product's dataset available on StockX's website.
Now that we can scrape a single item, let's take a look at how to find products on StockX to scrape all data or just select categories.
Scraping Product Prices
In the above section, we were able to scrape StockX product pages successfully. The extracted data includes various details about product specifications and variants. However, it's missing an essential detail: pricing!
The product details we extracted earlier were rendered on the server side, hence they were accessible upon requesting the product page URL. Pricing data, on the other hand, is loaded through a GraphQL endpoint that's called dynamically using JavaScript. Therefore, pricing data isn't available in the script tag we extracted earlier.
To view the GraphQL endpoint used for retrieving price data, follow these steps:
Open the browser developer tools by pressing the F12 key.
Head over to the network tab and filter by Fetch/XHR requests.
Hard reload the product web page.
Upon following the above steps, you will find the below XHR call captured:
To scrape StockX product pricing data, we'll extract the above XHR response:
import json
import asyncio
from typing import Dict

from nested_lookup import nested_lookup
from scrapfly import ScrapeApiResponse, ScrapeConfig, ScrapflyClient

scrapfly = ScrapflyClient(key="Your ScrapFly API key")


def parse_pricing(result: ScrapeApiResponse, sku: str = None) -> Dict:
    """extract product pricing data from xhr responses"""
    _xhr_calls = result.scrape_result["browser_data"]["xhr_call"]
    # keep only the XHR calls that returned a JSON body
    json_calls = []
    for xhr in _xhr_calls:
        if xhr["response"]["body"] is None:
            continue
        try:
            data = json.loads(xhr["response"]["body"])
        except json.JSONDecodeError:
            continue
        json_calls.append(data)
    # find the GraphQL response containing the market data for our sku
    for xhr in json_calls:
        if (
            "data" not in xhr
            or "product" not in xhr["data"]
            or "uuid" not in xhr["data"]["product"]
        ):
            continue
        if sku == xhr["data"]["product"]["uuid"]:
            data = xhr["data"]["product"]
            return {
                "minimumBid": data["minimumBid"],
                "market": data["market"],
                "variants": data["variants"],
            }
    return None


async def scrape_product(url: str) -> dict:
    """scrape a single stockx product page for product data"""
    result = await scrapfly.async_scrape(
        ScrapeConfig(
            url=url,
            render_js=True,
            proxy_pool="public_residential_pool",
            country="US",
            asp=True,
            rendering_wait=5000,
            wait_for_selector="//h2[@data-testid='trade-box-buy-amount']",
        )
    )
    # previous product parsing logic (parse_nextjs from the earlier scraper)
    data = parse_nextjs(result)
    products = nested_lookup("product", data)
    product = next(p for p in products if p.get("urlKey") in result.context["url"])
    # extract product price
    product["pricing"] = parse_pricing(result, product["id"])
    return product["pricing"]
Run the code
if __name__ == "__main__":
    pricing = asyncio.run(
        scrape_product("https://stockx.com/nike-air-max-90-se-running-club")
    )
    print("extracted pricing details")
    with open("pricing.json", "w") as f:
        json.dump(pricing, f, indent=2, ensure_ascii=False)
Above, we request the product page with JavaScript rendering enabled and wait until the pricing details have loaded. Then, we use the parse_pricing function to extract the pricing data from the captured XHR responses.
Here's an example output of the results extracted:
Note that the above approach is only possible through headless browsers. For more, refer to our guide on web scraping background requests.
Scraping StockX Search
To discover products we have two choices: sitemaps and search.
Sitemaps are ideal for discovering all products and they can usually be found by inspecting the /robots.txt URL. For example, StockX's robots.txt points to the sitemap index at /sitemap/sitemap-index.xml, where every product URL is located.
However, if we want to narrow down our scope and scrape only specific items, we can scrape StockX's search pages instead. Let's take a look at how we can do that.
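As a side note, here's a minimal sketch of how the sitemap approach could look. It assumes the sitemaps follow the standard XML schema and that, like the rest of the site, these URLs may need the same blocking countermeasures as our other requests:

import asyncio

import httpx
from parsel import Selector


async def discover_product_urls() -> list:
    """sketch: collect product URLs from StockX's sitemap index"""
    async with httpx.AsyncClient(follow_redirects=True) as session:
        index = await session.get("https://stockx.com/sitemap/sitemap-index.xml")
        # sitemap indexes list child sitemaps under <sitemap><loc> elements
        sitemap_urls = Selector(index.text).xpath("//*[local-name()='loc']/text()").getall()
        product_urls = []
        for sitemap_url in sitemap_urls[:1]:  # only the first child sitemap for this demo
            sitemap = await session.get(sitemap_url)
            product_urls += Selector(sitemap.text).xpath("//*[local-name()='loc']/text()").getall()
        return product_urls


# example run:
# print(asyncio.run(discover_product_urls()))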
To start, we can see that StockX's search is capable of searching by product category as well as by free-text query. Each of these search pages can be further refined and sorted, which results in a unique URL.
For this example, let's take the top sold apparel items that match the query "indigo":
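The search URLs used throughout this article look like this:

Free-text query search: https://stockx.com/search?s=nike
Category search refined by a query and sorted by sales: https://stockx.com/search/apparel/top-selling?s=indigo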
To scrape this we'll be using the same hidden web data approach as before. The hidden data is located in the same place so we can reuse our hidden data parser (parse_nextjs() in the code below), though this time around it only contains product preview data rather than whole product datasets.
Python
Scrapfly
import asyncio
import json
import math
from typing import Dict, List

import httpx
from nested_lookup import nested_lookup
from parsel import Selector

# create HTTPX client with headers that resemble a web browser
client = httpx.AsyncClient(
    http2=True,
    follow_redirects=True,
    limits=httpx.Limits(max_connections=3),  # keep this low to avoid being blocked
    headers={
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "en-US,en;q=0.9",
        "Cache-Control": "no-cache",
    },
)


# From previous chapter:
def parse_nextjs(html: str) -> dict:
    """extract nextjs cache from page"""
    selector = Selector(html)
    data = selector.css("script#__NEXT_DATA__::text").get()
    if not data:
        data = selector.css("script[data-name=query]::text").get()
        data = data.split("=", 1)[-1].strip().strip(";")
    data = json.loads(data)
    return data


async def scrape_search(url: str, max_pages: int = 25) -> List[Dict]:
    """Scrape StockX search"""
    print(f"scraping first search page: {url}")
    first_page = await client.get(url)
    assert first_page.status_code == 200, "scrape was blocked"  # this should be retried, handled etc.

    # parse first page for product search data and total amount of pages:
    data = parse_nextjs(first_page.text)
    _first_page_results = nested_lookup("results", data)[0]
    _paging_info = _first_page_results["pageInfo"]
    # note: pageCount can be missing but we can calculate it ourselves
    total_pages = _paging_info["pageCount"] or math.ceil(_paging_info["total"] / _paging_info["limit"])
    if max_pages < total_pages:
        total_pages = max_pages
    product_previews = [edge["node"] for edge in _first_page_results["edges"]]

    # then scrape other pages concurrently:
    print(f"  scraping remaining {total_pages - 1} search pages")
    _other_pages = [  # create GET task for each page url
        asyncio.create_task(client.get(f"{first_page.url}&page={page}"))
        for page in range(2, total_pages + 1)
    ]
    for response in asyncio.as_completed(_other_pages):  # run all tasks concurrently
        response = await response
        data = parse_nextjs(response.text)
        _page_results = nested_lookup("results", data)[0]
        product_previews.extend([edge["node"] for edge in _page_results["edges"]])
    return product_previews


# example run
result = asyncio.run(scrape_search("https://stockx.com/search?s=nike", max_pages=2))
print(json.dumps(result, indent=2))
import asyncio
import json
import math
from typing import Dict, List

from nested_lookup import nested_lookup
from scrapfly import ScrapeApiResponse, ScrapeConfig, ScrapflyClient

scrapfly = ScrapflyClient(key="YOUR SCRAPFLY KEY", max_concurrency=10)


def parse_nextjs(result: ScrapeApiResponse) -> dict:
    """extract nextjs cache from page"""
    data = result.selector.css("script#__NEXT_DATA__::text").get()
    if not data:
        data = result.selector.css("script[data-name=query]::text").get()
        data = data.split("=", 1)[-1].strip().strip(";")
    data = json.loads(data)
    return data


async def scrape_search(url: str, max_pages: int = 25) -> List[Dict]:
    """Scrape StockX search"""
    print(f"scraping first search page: {url}")
    first_page = await scrapfly.async_scrape(
        ScrapeConfig(
            url=url,
            country="US",
            render_js=True,
            proxy_pool="public_residential_pool",
            asp=True,
        )
    )

    # parse first page for product search data and total amount of pages:
    data = parse_nextjs(first_page)
    _first_page_results = nested_lookup("results", data)[0]
    _paging_info = _first_page_results["pageInfo"]
    total_pages = _paging_info["pageCount"] or math.ceil(_paging_info["total"] / _paging_info["limit"])
    if max_pages < total_pages:
        total_pages = max_pages
    product_previews = [edge["node"] for edge in _first_page_results["edges"]]

    # then scrape other pages concurrently:
    print(f"  scraping remaining {total_pages - 1} search pages")
    _other_pages = [
        ScrapeConfig(
            url=f"{first_page.context['url']}&page={page}",
            render_js=True,
            proxy_pool="public_residential_pool",
            country="US",
            asp=True,
        )
        for page in range(2, total_pages + 1)
    ]
    async for result in scrapfly.concurrent_scrape(_other_pages):
        data = parse_nextjs(result)
        _page_results = nested_lookup("results", data)[0]
        product_previews.extend([edge["node"] for edge in _page_results["edges"]])
    return product_previews


# example run
result = asyncio.run(scrape_search("https://stockx.com/search?s=nike"))
print(json.dumps(result, indent=2))
While these product previews already offer a lot of detail, we might want to scrape the entire product dataset using the product scraper we wrote in the previous section. The urlKey field of each preview gives us the full product URL.
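For example, a small sketch of feeding search previews into our earlier product scraper could look like this; it assumes the scrape_search() and scrape_product() functions defined above and builds product URLs from each preview's urlKey:

async def scrape_search_products(search_url: str, max_pages: int = 1) -> list:
    """sketch: scrape the full product dataset for every search result"""
    previews = await scrape_search(search_url, max_pages=max_pages)
    # build full product URLs from each preview's urlKey field
    product_urls = [f"https://stockx.com/{preview['urlKey']}" for preview in previews]
    products = []
    for product_url in product_urls:
        products.append(await scrape_product(product_url))
    return products


# example run:
# products = asyncio.run(scrape_search_products("https://stockx.com/search?s=nike", max_pages=1))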
Bypass StockX Blocking with Scrapfly
StockX is a popular website and it's not uncommon for them to block scraping attempts. To scale up our scraper and bypass blocking we can use Scrapfly's web scraping API which fortifies scrapers against blocking and much more!
Using the Python SDK we can easily integrate Scrapfly into our Python scrapers:
from scrapfly import ScrapflyClient, ScrapeConfig

scrapfly = ScrapflyClient(key="YOUR SCRAPFLY KEY")
result = scrapfly.scrape(ScrapeConfig(
    "https://stockx.com/search/apparel/top-selling?s=indigo",
    # anti scraping protection bypass
    asp=True,
    # proxy country selection
    country="US",
    # we can enable features like:
    # cloud headless browser use
    render_js=True,
    # screenshot taking
    screenshots={"all": "fullpage"},
))
# full result data
print(result.content)  # html body
print(result.selector.css("h1"))  # CSS selector and XPath parser built-in
For more, see the complete StockX scraper code using Scrapfly in our GitHub repository:
To wrap up this scraping guide, let's take a look at some frequently asked questions regarding StockX scraping:
Is it legal to scrape StockX.com?
Yes, it is legal to scrape StockX.com. StockX e-commerce data is publicly available, and as long as the scraper doesn't inflict damage on the website, it's perfectly legal to scrape.
Can StockX.com be crawled?
Yes, StockX.com can be crawled. Crawling is an alternative web scraping approach where the scraper is capable of discovering pages on its own. StockX offers sitemaps and recommended product areas that can be used to develop crawling logic. For more see our Crawling With Python introduction.
StockX Scraping Summary
In this guide, we've learned how to scrape StockX.com using Python and a few community packages.
For this, we've used the hidden web data scraping technique where, instead of traditional HTML parsing, we retrieved the product HTML pages and extracted the JavaScript cache data.
With just a few lines of Python code, we've extracted the entire product dataset from StockX.com.
We've also taken a look at how to discover StockX product pages using sitemaps or search pages.
To scale up our scraper we've also taken a look at Scrapfly API which fortifies scrapers against blocking and much more - try it out for free!