Vestiaire Collective is a luxury fashion resale platform from France. It's a popular web scraping target as it's one of the biggest second-hand markets for luxury fashion items.
In this tutorial, we'll take a quick look at how to scrape Vestiaire Collective using Python. We'll cover how to:
Scrape Vestiaire Collective product listing data.
Find product listings using Vestiaire Collective sitemaps.
This is a very easy scraper as we'll be using hidden web data scraping to effortlessly collect product and seller data.
This tutorial covers popular web scraping techniques for education. Interacting with public servers requires diligence and respect and here's a good summary of what not to do:
Do not scrape at rates that could damage the website.
Do not scrape data that's not available publicly.
Do not store PII of EU citizens who are protected by GDPR.
Do not repurpose entire public datasets, which can be illegal in some countries.
Scrapfly does not offer legal advice, but these are good general rules to follow in web scraping. For more, you should consult a lawyer.
Why Scrape Vestiaire Collective?
Vestiaire Collective is a major exchange for luxury fashion items. Scraping this website can be useful for a number of reasons:
To scrape this target we'll need a few Python packages commonly used in web scraping. Since we'll be using the hidden web data scraping approach, all we need are two packages:
httpx - powerful HTTP client which we'll be using to retrieve the HTML pages.
parsel - HTML parser which we'll be using to extract hidden JSON datasets.
These packages can be installed using Python's pip console command:
$ pip install httpx parsel
For Scrapfly users there's also a Scrapfly SDK version of each code example. The SDK can be installed using pip as well:
$ pip install "scrapfly-sdk[all]"
Scrape Vestiaire Collective Product Data
Let's start by taking a look at a single product page and how we can scrape it using Python. For example, let's take this product page:
We could parse the page HTML using CSS selectors or XPath, but since Vestiaire Collective uses the Next.js JavaScript framework, we can extract the dataset directly from the page source:
We can find this by inspecting the page source and searching (ctrl+f) for a unique product identifier such as the name or id. In the example above we can see it's under the <script id="__NEXT_DATA__"> HTML element.
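To make the structure of this hidden dataset concrete, here is a minimal offline sketch. It assumes a made-up miniature page (SAMPLE_HTML and its product fields are illustrative, not real Vestiaire Collective data) and uses Python's built-in html.parser rather than parsel, so it runs without any extra packages:

```python
import json
from html.parser import HTMLParser


class NextDataParser(HTMLParser):
    """Collect the text found inside <script id="__NEXT_DATA__">."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "script" and dict(attrs).get("id") == "__NEXT_DATA__":
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        # script content may arrive in several chunks, so collect all of them
        if self._inside:
            self.chunks.append(data)


def extract_next_data(html: str) -> dict:
    """Return the hidden JSON dataset of a Next.js page as a dictionary."""
    parser = NextDataParser()
    parser.feed(html)
    parser.close()
    return json.loads("".join(parser.chunks))


# hypothetical miniature page; the product fields are illustrative only
SAMPLE_HTML = """
<html><body>
<div id="__next">rendered page</div>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"pageProps": {"product": {"id": "123", "name": "example watch"}}}}
</script>
</body></html>
"""

data = extract_next_data(SAMPLE_HTML)
print(data["props"]["pageProps"]["product"])
```

The same props > pageProps > product path is what the scraper in this section reads from the real page.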
This is called hidden web data scraping and it's a really simple and effective way to scrape data from websites that use JavaScript frameworks like Next.js. To scrape it, all we have to do is:
Retrieve the product HTML page.
Find the hidden JSON dataset using CSS selectors and parsel.
Load JSON as Python dictionary using json.loads.
Select the product fields.
In practical Python this would look something like this:
Python
ScrapFly
import asyncio
import json

import httpx
from parsel import Selector

# create an HTTP client with default headers that look like a web browser and enable HTTP/2
client = httpx.AsyncClient(
    follow_redirects=True,
    http2=True,
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
    },
)


def find_hidden_data(html) -> dict:
    """extract hidden web cache from page html"""
    # use CSS selectors to find script tag with data
    data = Selector(html).css("script#__NEXT_DATA__::text").get()
    return json.loads(data)


async def scrape_product(url: str):
    # retrieve page HTML
    response = await client.get(url)
    # find hidden web data
    data = find_hidden_data(response.text)
    # extract only product data from the page dataset
    product = data["props"]["pageProps"]["product"]
    return product


# example scrape run:
print(asyncio.run(scrape_product("https://www.vestiairecollective.com/men-accessories/watches/patek-philippe/metallic-steel-nautilus-patek-philippe-watch-21827899.shtml")))
import asyncio
import json

from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

scrapfly = ScrapflyClient(key="YOUR SCRAPFLY KEY", max_concurrency=10)


def find_hidden_data(result: ScrapeApiResponse) -> dict:
    """extract hidden NEXT_DATA from page html"""
    data = result.selector.css("script#__NEXT_DATA__::text").get()
    return json.loads(data)


async def scrape_product(url: str) -> dict:
    """scrape a single Vestiaire Collective product page for product data"""
    result = await scrapfly.async_scrape(
        ScrapeConfig(
            url=url,
            cache=True,  # use cache while developing to speed up scraping for repeated script runs
            asp=True,  # Anti-Scraping Protection bypass allows scraping protected pages
        )
    )
    data = find_hidden_data(result)
    product = data["props"]["pageProps"]["product"]
    return product


# example run of 1 product scrape
print(asyncio.run(scrape_product("https://www.vestiairecollective.com/men-accessories/watches/patek-philippe/metallic-steel-nautilus-patek-philippe-watch-21827899.shtml")))
In just a few lines of Python code, we extracted the whole product dataset which includes all of the product details and seller information!
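One caveat: indexing with ["props"]["pageProps"]["product"] raises a KeyError whenever the page does not carry a product dataset, for instance when a sold or deleted listing redirects elsewhere. A minimal sketch of a more forgiving lookup follows; get_nested is a hypothetical helper and both sample dictionaries are made up for illustration:

```python
from typing import Any, Optional


def get_nested(data: dict, path: str, default: Optional[Any] = None) -> Any:
    """Follow a dot-separated key path through nested dicts.

    Returns `default` as soon as a key is missing or a non-dict is reached.
    """
    current: Any = data
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# illustrative datasets: one page with a product, one without
product_page = {"props": {"pageProps": {"product": {"id": "123"}}}}
category_page = {"props": {"pageProps": {}}}

print(get_nested(product_page, "props.pageProps.product"))
print(get_nested(category_page, "props.pageProps.product"))
```

Returning None instead of raising lets the caller skip unavailable listings without wrapping every lookup in try/except.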
Next up, let's take a look at how to find product listings using Vestiaire Collective sitemaps.
Finding Vestiaire Collective Products
Vestiaire Collective has an extensive sitemap suite that can be used to find all of the product listings, so to find product pages we'll be scraping sitemaps.
The sitemap index contains sitemaps split into various categories, such as by brand, new listings and item type (clothing, shoes):
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<!-- sitemap url and category clues, this one is for brands -->
<loc>https://www.vestiairecollective.com/sitemaps/https_en-brands-1.xml</loc>
<!-- when the sitemap was updated -->
<lastmod>2023-04-07</lastmod>
</sitemap>
<sitemap>
<loc>https://www.vestiairecollective.com/sitemaps/https_en-new_items-1.xml</loc>
<lastmod>2023-04-07</lastmod>
</sitemap>
...
</sitemapindex>
Each of these sitemaps contains up to 50,000 product listings.
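As a rough sketch of how such an index could be read, the snippet below parses a trimmed copy of the XML above with Python's built-in xml.etree.ElementTree and filters the entries by a keyword in the URL. The list_sitemaps helper and its keyword parameter are assumptions made for this illustration:

```python
import xml.etree.ElementTree as ET

# sitemap files declare this XML namespace, so every tag lookup must include it
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# trimmed copy of the sitemap index shown above
SAMPLE_INDEX = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.vestiairecollective.com/sitemaps/https_en-brands-1.xml</loc>
    <lastmod>2023-04-07</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://www.vestiairecollective.com/sitemaps/https_en-new_items-1.xml</loc>
    <lastmod>2023-04-07</lastmod>
  </sitemap>
</sitemapindex>"""


def list_sitemaps(index_xml: str, keyword: str = "") -> list:
    """Return sitemap entries whose URL contains `keyword` (all entries if empty)."""
    root = ET.fromstring(index_xml)
    results = []
    for node in root.findall("sm:sitemap", NS):
        loc = node.findtext("sm:loc", default="", namespaces=NS).strip()
        lastmod = node.findtext("sm:lastmod", default="", namespaces=NS).strip()
        if keyword in loc:
            results.append({"loc": loc, "lastmod": lastmod})
    return results


# pick only the sitemaps that list new items
for entry in list_sitemaps(SAMPLE_INDEX, keyword="new_items"):
    print(entry["loc"], entry["lastmod"])
```

The lastmod value can also be used to skip sitemaps that have not changed since the previous scrape.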
For our example, let's scrape the newest listings, which can be found in the new_items sitemaps.
The new_items-1.xml sitemap contains the newest 50,000 items. Let's see how to scrape it:
Python
ScrapFly
import asyncio
import json
from typing import Dict, List

import httpx
from parsel import Selector

client = httpx.AsyncClient(
    follow_redirects=True,
    http2=True,
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
    },
)


def find_hidden_data(html) -> dict:
    """extract hidden web cache from page html"""
    # use CSS selectors to find script tag with data
    data = Selector(html).css("script#__NEXT_DATA__::text").get()
    return json.loads(data)


async def scrape_product(url: str):
    # retrieve page HTML
    response = await client.get(url)
    # catch products that are no longer available: they redirect with a 308 status
    for redirect in response.history:
        if redirect.status_code == 308:
            print(f"product {redirect.url} is no longer available")
            return None
    # find hidden web data
    data = find_hidden_data(response.text)
    # extract only product data from the page dataset
    product = data["props"]["pageProps"]["product"]
    return product


async def scrape_sitemap(url: str, max_pages: int = 100) -> List[Dict]:
    """Scrape Vestiaire Collective sitemap for products"""
    # retrieve sitemap
    print(f"scraping sitemap page: {url}")
    response_sitemap = await client.get(url)
    product_urls = Selector(response_sitemap.text).css("url>loc::text").getall()
    print(f"found {len(product_urls)} products in the sitemap: {url}\n scraping the first {max_pages} products")
    # scrape products concurrently using asyncio
    product_scrapes = [asyncio.create_task(scrape_product(product_url)) for product_url in product_urls[:max_pages]]
    return await asyncio.gather(*product_scrapes)


# example scrape run:
print(asyncio.run(scrape_sitemap("https://www.vestiairecollective.com/sitemaps/https_en-new_items-1.xml", max_pages=5)))
import asyncio
import json
from typing import Dict, List

from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

scrapfly = ScrapflyClient(key="YOUR SCRAPFLY KEY", max_concurrency=10)


def find_hidden_data(result: ScrapeApiResponse) -> dict:
    """extract hidden NEXT_DATA from page html"""
    data = result.selector.css("script#__NEXT_DATA__::text").get()
    return json.loads(data)


async def scrape_sitemap(url: str, max_pages: int = 100) -> List[Dict]:
    """Scrape Vestiaire Collective sitemap for products"""
    print(f"scraping sitemap page: {url}")
    result_sitemap = await scrapfly.async_scrape(ScrapeConfig(url=url, asp=True))
    product_urls = result_sitemap.selector.css("url>loc::text").getall()
    print(f"found {len(product_urls)} products in the sitemap: {url}\n scraping the first {max_pages} products")
    product_pages = [ScrapeConfig(url=product_url, asp=True) for product_url in product_urls[:max_pages]]
    products = []
    async for result in scrapfly.concurrent_scrape(product_pages):
        # Vestiaire Collective redirects to the product category if the product is no longer available (sold, deleted etc.)
        if any(redirect["http_code"] == 308 for redirect in result.context["redirects"]):
            print(f"Product page {result.scrape_config.url} is no longer available")
            continue
        data = find_hidden_data(result)
        products.append(data["props"]["pageProps"]["product"])
    return products


# example scrape: scrape the first 10 newest listings
asyncio.run(scrape_sitemap("https://www.vestiairecollective.com/sitemaps/https_en-new_items-1.xml", max_pages=10))
Above, we've used simple XML parsing using parsel to extract URLs from the new listings sitemap. Then we scrape hidden web data of each product like we've done in the previous chapter.
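When raising max_pages, launching every request at once can put unnecessary load on the website. One common way to cap concurrency is an asyncio.Semaphore. The sketch below is only an illustration under stated assumptions: gather_limited is a hypothetical helper, and fake_scrape stands in for scrape_product so the example runs without any network access:

```python
import asyncio


async def gather_limited(coroutines, limit: int = 5):
    """Run coroutines concurrently, with at most `limit` running at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def run(coro):
        # each coroutine waits here until one of the `limit` slots is free
        async with semaphore:
            return await coro

    # gather returns results in the same order as the input
    return await asyncio.gather(*(run(coro) for coro in coroutines))


async def fake_scrape(url: str) -> dict:
    # stand-in for scrape_product(): sleeps instead of sending a request
    await asyncio.sleep(0.01)
    return {"url": url}


# placeholder URLs for the demonstration
urls = [f"https://example.com/product-{i}" for i in range(10)]
results = asyncio.run(gather_limited([fake_scrape(url) for url in urls], limit=3))
print(len(results))
```

In the sitemap scraper, the same helper could wrap the scrape_product calls in place of the plain asyncio.gather.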
Sitemaps are great for finding scrape targets quickly and efficiently. To scale our scraper up further, let's take a look at how to avoid blocking using the Scrapfly SDK.
Bypass Vestiaire Collective Blocking with Scrapfly
Using the Python SDK, our scraper code can be easily adapted to use the Scrapfly API:
from scrapfly import ScrapeConfig, ScrapflyClient

client = ScrapflyClient(key="YOUR SCRAPFLY KEY")
result = client.scrape(ScrapeConfig(
    url="https://www.vestiairecollective.com/women-clothing/knitwear/anine-bing/beige-cotton-anine-bing-knitwear-32147447.shtml",
    # enable scraper blocking service bypass
    asp=True,
    # optional - render javascript using headless browsers:
    render_js=True,
))
print(result.content)
For more on web scraping Vestiaire Collective with ScrapFly check out the Full Scraper Code section.
FAQ
To wrap up our guide on how to scrape Vestiaire Collective, let's take a look at some frequently asked questions.
Is it legal to scrape Vestiaire Collective?
Yes. All of the data we scraped in this tutorial is publicly available, which is perfectly legal to scrape. However, care should be taken when using scraped seller data, as it can be protected by GDPR or copyright in Europe.
Can Vestiaire Collective be crawled?
Yes. Crawling is a form of web scraping where the scraper discovers product listings on its own, and Vestiaire Collective offers many discovery points such as recommendations, search and sitemaps.
In this quick tutorial, we took a look at how to scrape Vestiaire Collective using Python. We covered how to use the hidden web data scraping approach to quickly extract product datasets from HTML pages. To find the products, we covered how to use sitemaps to quickly collect all of the product listings by category.
To avoid blocking, we took a look at the Scrapfly API, which can be used to scale your scraping projects and collect public datasets like this one in a matter of minutes!