Leboncoin.fr is one of the largest peer-to-peer marketplaces in France. It's a popular data target, but it can be challenging to scrape because of its many anti-scraping measures.
In this article, we'll explain how to scrape leboncoin.fr without getting blocked, and how to extract data from its search and ad pages. Let's dive in!
This tutorial covers popular web scraping techniques for educational purposes. Interacting with public servers requires diligence and respect. Here's a good summary of what not to do:
Do not scrape at rates that could damage the website.
Do not scrape data that's not available publicly.
Do not store PII of EU citizens who are protected by GDPR.
Do not repurpose entire public datasets, which can be illegal in some countries.
Scrapfly does not offer legal advice, but these are good general rules to follow in web scraping. For more, you should consult a lawyer.
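As a rough illustration of the first rule, scraping rates can be kept modest by capping how many requests run in parallel and pausing after each one. The sketch below assumes a hypothetical `fetch` coroutine (here replaced by a stand-in that makes no network calls); `polite_gather`, `max_concurrency` and `delay` are illustrative names, not part of any SDK:

```python
import asyncio
from typing import Awaitable, Callable, List


async def polite_gather(
    urls: List[str],
    fetch: Callable[[str], Awaitable[str]],
    max_concurrency: int = 2,
    delay: float = 1.0,
) -> List[str]:
    """Fetch URLs with a cap on parallel requests and a pause after each one."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def fetch_one(url: str) -> str:
        async with semaphore:
            result = await fetch(url)
            # keep the slot occupied for a moment so requests stay spaced out
            await asyncio.sleep(delay)
            return result

    # gather returns results in the same order as the input URLs
    return await asyncio.gather(*(fetch_one(url) for url in urls))


async def fake_fetch(url: str) -> str:
    """Stand-in for a real request, used only for this demonstration."""
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


pages = asyncio.run(
    polite_gather(["page-1", "page-2", "page-3"], fake_fetch, delay=0.1)
)
print(len(pages))
```

Suitable values for the concurrency cap and the delay depend on the target website, so treat the defaults here as placeholders.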
Why Scrape Leboncoin.fr?
Leboncoin.fr includes millions of ads in various categories, from household essentials and vehicles to real estate offerings. Therefore, web scraping leboncoin can provide valuable insights by allowing for:
Market Research
Listing data can be scraped and analyzed to get insights into trends, pricing patterns and product demand.
Competitive Analysis
Scraping Leboncoin can help businesses gain a competitive edge by analyzing their competitors' offerings.
Price Tracking
Individuals looking to buy or sell products can use web scraping to track product prices over time to predict future price changes or score great deals.
Inventory Management
Sellers can scrape Leboncoin to update their personal inventory with products available online on leboncoin.
Project Setup
In this leboncoin scraping guide, we'll use the following Python libraries:
Scrapfly-sdk - a web scraping API and Python SDK that allows for scraping at scale without blocking.
parsel - an HTML parsing library that supports CSS and XPath selectors.
Both libraries can be installed using pip:
pip install scrapfly-sdk parsel
Bypass Leboncoin Scraping Blocking With ScrapFly
Leboncoin.fr is a highly protected website that can detect web scrapers. For example, let's try to scrape leboncoin with a browser controlled by the Playwright browser automation library for Python:
from playwright.sync_api import sync_playwright

with sync_playwright() as playwright:
    # Launch a Chromium browser (headless=False opens a visible window)
    browser = playwright.chromium.launch(headless=False)
    page = browser.new_page()
    # Go to leboncoin.fr
    page.goto("https://www.leboncoin.fr")
    # Take a screenshot
    page.screenshot(path="screenshot.png")
    # Close the browser when done
    browser.close()
The website detected us as web scrapers and we were required to solve a captcha challenge:
To bypass leboncoin.fr web scraping blocking, check out Scrapfly!
For example, by enabling the ScrapFly asp feature through the ScrapFly SDK, we can bypass leboncoin.fr scraper blocking:
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

scrapfly = ScrapflyClient(key="Your ScrapFly API key")

api_response: ScrapeApiResponse = scrapfly.scrape(
    ScrapeConfig(
        url="https://www.leboncoin.fr",
        # Cloud headless browser similar to Playwright
        render_js=True,
        # Bypass anti scraping protection
        asp=True,
        # Set the geographical location to France
        country="FR",
    )
)
# Print the website's status code
print(api_response.upstream_status_code)
# 200
Now that we can bypass leboncoin.fr blocking with ScrapFly, let's use it to create a leboncoin scraper.
How to Scrape Leboncoin Search?
To start, let's take a look at how searching works on leboncoin.fr.
First, if we go to the homepage and search for any keyword we'll see a result page similar to this:
This example search page for real estate listings supports pagination with the following URL structure:
We'll use this URL as our main search URL and use the page parameter to crawl over search pages.
We'll start by taking a look at how to scrape the first page, then we'll add paging to scrape the remainder.
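Before moving on, here is a minimal sketch of how the page URLs can be generated with the standard library. It assumes the search URL used later in this guide and the `page` query parameter described above; `build_page_urls` is an illustrative helper name, and the first page is requested without a `page` parameter:

```python
from typing import List
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def build_page_urls(search_url: str, max_pages: int) -> List[str]:
    """Return the search URL followed by URLs for pages 2..max_pages."""
    parts = urlsplit(search_url)
    query = dict(parse_qsl(parts.query))
    urls = [search_url]
    for page in range(2, max_pages + 1):
        # set (or overwrite) the page parameter, keeping the other parameters
        query["page"] = str(page)
        urls.append(urlunsplit(parts._replace(query=urlencode(query))))
    return urls


for url in build_page_urls("https://www.leboncoin.fr/recherche?text=coffe", 3):
    print(url)
# https://www.leboncoin.fr/recherche?text=coffe
# https://www.leboncoin.fr/recherche?text=coffe&page=2
# https://www.leboncoin.fr/recherche?text=coffe&page=3
```

Building URLs through `urllib.parse` rather than string concatenation keeps the query string valid even when the search URL already carries several parameters.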
As for result parsing, we'll use a hidden web data approach. Instead of using parsing selectors like XPath or CSS, we'll get all the data as JSON directly from a script tag in the HTML.
To locate this script tag, open browser developer tools by pressing the F12 key. Then, scroll down the page till you find the script tag with the __NEXT_DATA__ ID:
We'll select this script tag from the HTML and extract its data within our scraper:
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
import asyncio
from typing import Dict, List
import json

SCRAPFLY = ScrapflyClient(key="Your API key")

# scrapfly config
BASE_CONFIG = {
    # bypass web scraping blocking
    "asp": True,
    # set the proxy location to France
    "country": "fr",
}


def parse_search(result: ScrapeApiResponse) -> List[Dict]:
    """parse search result data from nextjs cache"""
    # select the __NEXT_DATA__ script from the HTML
    next_data = result.selector.css("script[id='__NEXT_DATA__']::text").get()
    # extract ads listing data from the search page
    ads_data = json.loads(next_data)["props"]["pageProps"]["initialProps"]["searchData"]["ads"]
    return ads_data


async def scrape_search(url: str) -> List[Dict]:
    """scrape leboncoin search"""
    print(f"scraping search {url}")
    first_page = await SCRAPFLY.async_scrape(ScrapeConfig(url, **BASE_CONFIG))
    search_data = parse_search(first_page)
    # print the data in JSON format
    print(json.dumps(search_data, indent=2))
    return search_data


# run the scraping search function
asyncio.run(scrape_search(url="https://www.leboncoin.fr/recherche?text=coffe"))
Here, we use the parse_search function to select the __NEXT_DATA__ script from the HTML and extract the ads list from its JSON. Next, we use the scrape_search function to scrape the first search page using ScrapFly. Finally, we print the results in JSON format and run the code using asyncio.
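To see the hidden web data technique in isolation, the following sketch extracts the `__NEXT_DATA__` payload from a small hand-written HTML string using only the Python standard library. The sample HTML and the `subject` field are made up for illustration and are far smaller than a real leboncoin page; `NextDataParser` and `extract_next_data` are illustrative names:

```python
import json
from html.parser import HTMLParser
from typing import Optional


class NextDataParser(HTMLParser):
    """Collect the text inside the <script id="__NEXT_DATA__"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self._inside = False
        self.payload: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == "__NEXT_DATA__":
            self._inside = True
            self.payload = ""

    def handle_data(self, data):
        # script content may arrive in several chunks, so accumulate it
        if self._inside:
            self.payload += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False


def extract_next_data(html: str) -> dict:
    """Return the JSON stored in the __NEXT_DATA__ script as a dict."""
    parser = NextDataParser()
    parser.feed(html)
    if parser.payload is None:
        raise ValueError("__NEXT_DATA__ script not found")
    return json.loads(parser.payload)


# made-up sample that mimics the key path used by parse_search
SAMPLE_HTML = """
<html><body>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"pageProps": {"initialProps": {"searchData": {"ads": [{"subject": "Machine a cafe"}]}}}}}
</script>
</body></html>
"""

data = extract_next_data(SAMPLE_HTML)
ads = data["props"]["pageProps"]["initialProps"]["searchData"]["ads"]
print(ads[0]["subject"])  # Machine a cafe
```

Raising an explicit error when the script tag is missing makes blocked or changed pages easier to spot than a failure deep inside `json.loads`.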
The above leboncoin scraper can scrape data from the first search page only. Let's modify it to scrape more than one page:
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
from typing import Dict, List
import asyncio
import json

SCRAPFLY = ScrapflyClient(key="Your API key")

# scrapfly config
BASE_CONFIG = {
    # bypass web scraping blocking
    "asp": True,
    # set the proxy location to France
    "country": "fr",
}


def parse_search(result: ScrapeApiResponse) -> List[Dict]:
    """parse search result data from nextjs cache"""
    # select the __NEXT_DATA__ script from the HTML
    next_data = result.selector.css("script[id='__NEXT_DATA__']::text").get()
    # extract ads listing data from the search page
    ads_data = json.loads(next_data)["props"]["pageProps"]["initialProps"]["searchData"]["ads"]
    return ads_data


async def scrape_search(url: str, max_pages: int) -> List[Dict]:
    """scrape leboncoin search"""
    print(f"scraping search {url}")
    first_page = await SCRAPFLY.async_scrape(ScrapeConfig(url, **BASE_CONFIG))
    search_data = parse_search(first_page)

    # add the remaining pages to a scraping list
    _other_pages = [
        ScrapeConfig(f"{first_page.context['url']}&page={page}", **BASE_CONFIG)
        for page in range(2, max_pages + 1)
    ]
    # scrape the remaining pages concurrently
    async for result in SCRAPFLY.concurrent_scrape(_other_pages):
        ads_data = parse_search(result)
        search_data.extend(ads_data)
    print(json.dumps(search_data, indent=2))
    return search_data


# run the scraping search function
asyncio.run(scrape_search(url="https://www.leboncoin.fr/recherche?text=coffe", max_pages=2))
Here, we add a max_pages parameter to the scrape_search function, which specifies the number of search pages to scrape. The scraping result is a list that contains all ad data found in two search pages:
We successfully got all listing data using Leboncoin's search. Next, let's take a look at how to scrape individual listing pages!
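Since listings can shift between pages while a crawl is running, the combined list may contain the same ad more than once. A minimal sketch for removing such duplicates is shown below. It assumes each ad carries a unique identifier under a `list_id` key; that field name is an assumption, so check it against the JSON you actually receive:

```python
from typing import Dict, List


def dedupe_ads(ads: List[Dict], key: str = "list_id") -> List[Dict]:
    """Keep the first occurrence of each ad, identified by the given key."""
    seen = set()
    unique = []
    for ad in ads:
        ad_id = ad.get(key)
        if ad_id is not None:
            if ad_id in seen:
                # already collected from an earlier page
                continue
            seen.add(ad_id)
        # ads without an identifier are kept, since they can't be compared
        unique.append(ad)
    return unique


sample_ads = [
    {"list_id": 1, "subject": "first"},
    {"list_id": 2, "subject": "second"},
    {"list_id": 1, "subject": "first again"},
]
print(len(dedupe_ads(sample_ads)))  # 2
```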
How to Scrape Leboncoin.fr Listing Ads?
Although the listing data on search pages and listing pages is the same, its location in the HTML differs. For this, we'll need to slightly change the object keys we use to obtain the hidden web data:
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
from typing import Dict
import asyncio
import json

SCRAPFLY = ScrapflyClient(key="Your ScrapFly API key")

# scrapfly config
BASE_CONFIG = {
    # bypass web scraping blocking
    "asp": True,
    # set the proxy location to France
    "country": "fr",
}


def parse_ad(result: ScrapeApiResponse) -> Dict:
    """parse ad data from nextjs cache"""
    next_data = result.selector.css("script[id='__NEXT_DATA__']::text").get()
    # extract ad data from the ad page
    ad_data = json.loads(next_data)["props"]["pageProps"]["ad"]
    return ad_data


async def scrape_ad(url: str, _retries: int = 0) -> Dict:
    """scrape ad page"""
    print(f"scraping ad {url}")
    try:
        result = await SCRAPFLY.async_scrape(ScrapeConfig(url, **BASE_CONFIG))
        ad_data = parse_ad(result)
    except Exception:
        # retry up to 2 times, then re-raise the last error
        if _retries < 2:
            print("retrying failed request")
            return await scrape_ad(url, _retries=_retries + 1)
        raise
    return ad_data


# run the code
async def run():
    ad_data = []
    to_scrape = [
        scrape_ad(url)
        for url in [
            "https://www.leboncoin.fr/ad/ventes_immobilieres/2809308201",
            "https://www.leboncoin.fr/ad/ventes_immobilieres/2820947069",
            "https://www.leboncoin.fr/ad/ventes_immobilieres/2787737700",
        ]
    ]
    # collect results as each scrape finishes
    for response in asyncio.as_completed(to_scrape):
        ad_data.append(await response)

    # save to JSON file
    with open("ads.json", "w", encoding="utf-8") as file:
        json.dump(ad_data, file, indent=2, ensure_ascii=False)


asyncio.run(run())
As we did earlier, we use the parse_ad function to parse and select ad data from the listing page HTML. Next, we use the scrape_ad function to scrape the ad page using ScrapFly. Here is the result we got:
Cool - we are able to scrape leboncoin.fr data from both search and ad pages!
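The retry logic in scrape_ad can also be factored into a reusable helper. The sketch below is one possible variant, not the approach used above: it wraps any coroutine-producing callable and waits longer after each failure (exponential backoff). The names `retry_async`, `retries` and `base_delay` are illustrative:

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def retry_async(
    task: Callable[[], Awaitable[T]],
    retries: int = 2,
    base_delay: float = 1.0,
) -> T:
    """Run task(), retrying failures with exponentially growing pauses."""
    attempt = 0
    while True:
        try:
            return await task()
        except Exception:
            if attempt >= retries:
                # out of retries: surface the last error to the caller
                raise
            # wait base_delay, then 2x, 4x, ... before the next attempt
            await asyncio.sleep(base_delay * 2 ** attempt)
            attempt += 1
```

It could be called as `await retry_async(lambda: fetch_once(url))`, where `fetch_once` is a hypothetical coroutine that performs a single scrape attempt without retrying on its own.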
FAQ
To wrap up this guide, let's take a look at some frequently asked questions about leboncoin.fr web scraping.
Is it legal to scrape leboncoin.fr?
Yes, all ad data on leboncoin is public, so it's legal to scrape it as long as you keep your scraping rate reasonable. However, you should pay attention to GDPR compliance in the EU when scraping personal data, such as sellers' data. For more information, refer to our article on web scraping legality.
Is there a public API for leboncoin.fr?
Currently, there is no available public API for leboncoin.fr. However, scraping leboncoin.fr is straightforward and you can easily use it to create your own web scraping API.
How to avoid leboncoin.fr web scraping blocking?
There are many factors that lead to web scraping blocking including headers, IP addresses and security handshakes. To avoid leboncoin.fr web scraping blocking, you need to pay attention to these details. For more information, refer to our previous guide on scraping without getting blocked.
Leboncoin.fr is one of the most popular marketplaces for ads in France. It's a highly protected website that can detect and block web scrapers, which requires the use of an anti-scraping solution.
In this article, we took a deep dive into how to scrape leboncoin.fr to get ad and search data. We have also seen how to avoid leboncoin web scraping blocking using ScrapFly.