Immobilienscout24.de is a popular website for real estate ads in Germany, featuring millions of property listings across the country. However, due to its high level of protection, scraping immobilienscout24.de can be challenging.
In this article, we'll explain how to scrape immobilienscout24.de and how to avoid its web scraping blocking. We'll also go over a step-by-step guide on scraping immobilienscout24.de real estate listings from search and property pages. Let's dive in!
This tutorial covers popular web scraping techniques for educational purposes. Interacting with public servers requires diligence and respect; here's a good summary of what not to do:
Do not scrape at rates that could damage the website.
Do not scrape data that's not available publicly.
Do not store PII of EU citizens who are protected by GDPR.
Do not repurpose entire public datasets, which can be illegal in some countries.
Scrapfly does not offer legal advice, but these are good general rules to follow in web scraping; for more, you should consult a lawyer.
Why Scrape Immobilienscout24.de?
Manually navigating property listings on immobilienscout24.de can be a time-consuming task. Scraping immobilienscout24.de automates this process, allowing you to gather thousands of property listings in no time.
Web scraping immobilienscout24.de opens the door to a wide range of investing opportunities. Investors and buyers can identify properties with high potential returns by filtering through a wide set of property listings, helping them make strategic investment decisions.
Immobilienscout24.de scraping enables buyers to tailor their property search based on specific criteria, allowing them to look for properties in a particular location, with specific facilities, or within a certain price range.
Project Setup
In this guide, we'll be scraping immobilienscout24.de using a few Python libraries:
scrapfly-sdk: A Python SDK for the ScrapFly web scraping API, which allows for web scraping at scale without getting blocked.
parsel: For parsing the HTML using XPath and CSS selectors.
We'll also use asyncio to run our code asynchronously, increasing our web scraping speed.
Since asyncio comes pre-installed with Python, we only need to install the other libraries using the following pip command:
pip install scrapfly-sdk parsel
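To quickly verify the setup, here's a tiny example (the HTML snippet is invented for illustration) showing how parsel selects elements with both XPath and CSS selectors:

from parsel import Selector

# parse a small HTML snippet (invented for illustration)
html = "<div><h1>Wohnung in Berlin</h1><span class='price'>1.200 €</span></div>"
selector = Selector(text=html)
print(selector.xpath("//h1/text()").get())     # Wohnung in Berlin
print(selector.css("span.price::text").get())  # 1.200 €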
How to Avoid Immobilienscout24.de Scraping Blocking?
Immobilienscout24.de uses anti-scraping technologies that can detect and block bots from accessing the website. For example, let's try to scrape it using a plain Playwright-controlled browser:
from playwright.sync_api import sync_playwright

with sync_playwright() as playwright:
    # Launch a Chrome browser
    browser = playwright.chromium.launch(headless=False)
    page = browser.new_page()
    # Go to immobilienscout24.de
    page.goto("https://www.immobilienscout24.de/expose/147036156#/")
    # Take a screenshot
    page.screenshot(path="screenshot.png")
    browser.close()
By running this code, we can see that the website detected us as web scrapers, requiring us to solve a CAPTCHA challenge before proceeding to the web page:
To bypass immobilienscout24.de web scraping blocking, we'll use ScrapFly, a web scraping API that allows for scraping at scale without getting blocked.
Here is how we can bypass any website blocking using the ScrapFly asp feature with the Python SDK:
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

scrapfly = ScrapflyClient(key="Your ScrapFly API key")

api_response: ScrapeApiResponse = scrapfly.scrape(
    ScrapeConfig(
        url="https://www.immobilienscout24.de/expose/147036156#/",
        # Cloud headless browser similar to Playwright
        render_js=True,
        # Bypass anti-scraping protection
        asp=True,
        # Set the geographical location to Germany
        country="DE",
    )
)

# Print the website's status code
print(api_response.upstream_status_code)
# output: 200
Now that we can scrape immobilienscout24.de without getting blocked using ScrapFly, let's use it to scrape some real estate data!
How to Scrape Immobilienscout24.de Property Pages?
In this section, we'll scrape property pages from immobilienscout24.de. Go to any property page on the website and you will get a page similar to this:
First, we'll start by parsing the HTML using XPath and CSS selectors to get the property data from the web page:
def parse_property_page(response: ScrapeApiResponse):
    """parse property listing data from property pages"""

    def strip_text(text):
        """remove extra spaces while handling None values"""
        if text is not None:
            text = text.strip()
        return text

    selector = response.selector
    property_link = selector.xpath("//link[@rel='canonical']").attrib["href"]
    title = strip_text(selector.xpath("//h1[@id='expose-title']/text()").get())
    description = selector.xpath("//meta[@name='description']").attrib["content"]
    address = selector.xpath("//div[@class='address-block']/div/span[2]/text()").get()
    floors_number = strip_text(selector.xpath("//dd[contains(@class, 'etage')]/text()").get())
    living_space = strip_text(selector.xpath("//dd[contains(@class, 'wohnflaeche')]/text()").get())
    vacant_from = strip_text(selector.xpath("//dd[contains(@class, 'bezugsfrei')]/text()").get())
    number_of_rooms = strip_text(selector.xpath("//dd[contains(@class, 'zimmer')]/text()").get())
    garage = strip_text(selector.xpath("//dd[contains(@class, 'garage-stellplatz')]/text()").get())
    additional_specs = []
    for spec in selector.xpath("//div[contains(@class, 'criteriagroup boolean-listing')]//span[contains(@class, 'palm-hide')]"):
        additional_specs.append(spec.xpath("./text()").get())
    price_without_heating = strip_text(selector.xpath("//dd[contains(@class, 'kaltmiete')]/text()").get())
    price_per_meter = strip_text(selector.xpath("//dd[contains(@class, 'preism')]/text()").get())
    basic_rent = strip_text(selector.xpath("//div[contains(@class, 'kaltmiete')]/span/text()").get())
    additional_costs = strip_text(selector.xpath("//dd[contains(@class, 'nebenkosten')]/text()").extract()[1].replace("\n", ""))
    heating_costs = strip_text(selector.xpath("//dd[contains(@class, 'heizkosten')]/text()").extract()[1].replace("\n", ""))
    total_rent = strip_text(selector.xpath("//dd[contains(@class, 'gesamtmiete')]/text()").get())
    deposit = strip_text(selector.xpath("//dd[contains(@class, 'ex-spacelink')]/div/text()").get())
    garage_parking_rent = selector.xpath("//dd[contains(@class, 'garagestellplatz')]/text()").get()
    if garage_parking_rent:
        garage_parking_rent = strip_text(garage_parking_rent)
    construction_year = strip_text(selector.xpath("//dd[contains(@class, 'baujahr')]/text()").get())
    energy_sources = strip_text(selector.xpath("//dd[contains(@class, 'wesentliche-energietraeger')]/text()").get())
    energy_certificate = strip_text(selector.xpath("//dd[@class='is24qa-energieausweis grid-item three-fifths']/text()").get())
    energy_certificate_type = strip_text(selector.xpath("//dd[contains(@class, 'energieausweis')]/text()").get())
    energy_certificate_date = strip_text(selector.xpath("//dd[contains(@class, 'baujahr-laut-energieausweis')]/text()").get())
    final_energy_requirement = strip_text(selector.xpath("//dd[contains(@class, 'endenergiebedarf')]/text()").get())
    property_images = []
    for image in selector.xpath("//div[@class='sp-slides']//div[contains(@class, 'sp-slide')]"):
        try:
            if image.xpath("./img").attrib["data-src"]:
                property_images.append(image.xpath("./img").attrib["data-src"].split("/ORIG")[0])
        # skip slides that don't expose an image data-src attribute
        except KeyError:
            pass
    video_available = bool(selector.xpath("//button[contains(@class, 'gallery-video')]/text()").get())
    internet_speed = selector.xpath("//a[contains(@class, 'mediaavailcheck')]/text()").get()
    internet_available = bool(internet_speed)
    agency_name = selector.xpath("//span[@data-qa='companyName']/text()").get()
    agency_address = ""
    for text in selector.xpath("//ul[li[span[@data-qa='companyName']]]/li[position() >= 3 and position() <= 4]/text()").getall():
        agency_address = agency_address + text
    # return the data as a JSON-serializable dictionary
    data = {
        "id": int(property_link.split("/")[-1]),
        "title": title,
        "description": description,
        "address": address,
        "propertyLink": property_link,
        "propertySpecs": {
            "floorsNumber": floors_number,
            "livingSpace": living_space,
            "livingSpaceUnit": "Square Meter",
            "vacantFrom": vacant_from,
            "numberOfRooms": int(number_of_rooms) if number_of_rooms is not None else None,
            "Garage/parking space": garage,
            "additionalSpecs": additional_specs,
            "internetAvailable": internet_available,
        },
        "price": {
            "priceWithoutHeating": price_without_heating,
            "pricePerMeter": price_per_meter,
            "additionalCosts": additional_costs,
            "heatingCosts": heating_costs,
            "totalRent": total_rent,
            "basicRent": basic_rent,
            "deposit": deposit,
            "garage/parkingRent": garage_parking_rent,
            "priceCurrency": price_without_heating.split(" ")[1] if price_without_heating else None,
        },
        "building": {
            "constructionYear": int(construction_year) if construction_year is not None else None,
            "energySources": energy_sources,
            "energyCertificate": energy_certificate,
            "energyCertificateType": energy_certificate_type,
            "energyCertificateDate": int(energy_certificate_date) if energy_certificate_date is not None else None,
            "finalEnergyRequirement": final_energy_requirement,
        },
        "attachments": {
            "propertyImages": property_images,
            "videoAvailable": video_available,
        },
        "agencyName": agency_name,
        "agencyAddress": agency_address,
    }
    return data
This looks complex, but all we're doing is requesting the HTML page and then selecting parts of it using XPath selectors, bundling all the results into a final property JSON dataset.
Next, we'll use this function with the rest of our web scraping logic to scrape property data from each page:
import asyncio
import json
from typing import Dict, List

from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

scrapfly = ScrapflyClient(key="Your ScrapFly API key")

BASE_CONFIG = {
    # bypass web scraping blocking
    "asp": True,
    # set the proxy country to Germany
    "country": "de",
}

def parse_property_page(response: ScrapeApiResponse):
    # the rest of the parse_property_page function
    ...

async def scrape_properties(urls: List[str]) -> List[Dict]:
    """scrape listing data from immobilienscout24 property pages"""
    # add the property pages to a scraping list
    to_scrape = [ScrapeConfig(url, **BASE_CONFIG) for url in urls]
    properties = []
    # scrape all property pages concurrently
    async for response in scrapfly.concurrent_scrape(to_scrape):
        data = parse_property_page(response)
        properties.append(data)
    print(f"scraped {len(properties)} property listings")
    return properties
Run the code
if __name__ == "__main__":
    properties_data = asyncio.run(scrape_properties(
        urls=[
            "https://www.immobilienscout24.de/expose/153142187#/",
            "https://www.immobilienscout24.de/expose/150757843#/",
            "https://www.immobilienscout24.de/expose/151476545#/",
        ]
    ))
    # print the result in JSON format
    print(json.dumps(properties_data, indent=2, ensure_ascii=False))
Here, we add all property page URLs to the scrape_properties function as a scraping list and scrape them concurrently. Then, we use the parsing function we created earlier to extract the data from each page response. Finally, we append the data to the properties list.
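If you'd like to persist these results rather than just print them, a minimal sketch using Python's built-in json module could look like this (the properties.json filename is arbitrary):

import json

# save the scraped property data to a JSON file
with open("properties.json", "w", encoding="utf-8") as f:
    json.dump(properties_data, f, indent=2, ensure_ascii=False)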
The result is an array containing the property data of each page:
Our immobilienscout24.de scraper can scrape property pages. Next, we'll scrape search pages, which will help us find the right properties for our real estate datasets.
How to Scrape Immobilienscout24.de Search Pages?
Before we start scraping immobilienscout24.de search pages, let's take a look at them. Go to any search page and you will get a page similar to this:
Instead of scraping the search pages by parsing the HTML, we'll extract all the data directly as JSON from hidden script tags.
Hidden web data is found in script tags in the HTML document. To view the hidden search data on Immobilienscout24, open the browser developer tools on any search page and search the HTML for the XPath selector //script[contains(text(),'searchResponseModel')]
You will then locate a script tag containing the search result data nested as JSON:
Since our target JSON data are nested in HTML and JavaScript, we'll use a utility to extract it:
def find_json_objects(text: str, decoder=json.JSONDecoder()):
    """Find JSON objects in text, and generate decoded JSON data"""
    pos = 0
    while True:
        match = text.find("{", pos)
        if match == -1:
            break
        try:
            result, index = decoder.raw_decode(text[match:])
            yield result
            pos = match + index
        except ValueError:
            pos = match + 1
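To illustrate how this generator behaves, here's a quick run on a made-up string that mixes two JSON objects with JavaScript code:

# made-up example: two JSON objects embedded in arbitrary text
sample = 'var model = {"a": 1}; console.log({"b": [2, 3]});'
for obj in find_json_objects(sample):
    print(obj)
# {'a': 1}
# {'b': [2, 3]}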
The above find_json_objects function can extract any JSON dataset from a string value. Let's use it while requesting the search pages to scrape them:
import asyncio
import json
from typing import Dict, List

from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

scrapfly = ScrapflyClient(key="Your ScrapFly API key")

BASE_CONFIG = {
    # bypass web scraping blocking
    "asp": True,
    # set the proxy country to Germany
    "country": "de",
}

def find_json_objects(text: str, decoder=json.JSONDecoder()):
    """Find JSON objects in text, and generate decoded JSON data"""
    pos = 0
    while True:
        match = text.find("{", pos)
        if match == -1:
            break
        try:
            result, index = decoder.raw_decode(text[match:])
            yield result
            pos = match + index
        except ValueError:
            pos = match + 1

def parse_search(response: ScrapeApiResponse) -> Dict:
    """parse script tags for JSON search results"""
    selector = response.selector
    script = selector.xpath("//script[contains(text(),'searchResponseModel')]/text()").get()
    json_data = [i for i in find_json_objects(script) if "searchResponseModel" in i][0]["searchResponseModel"]["resultlist.resultlist"]
    search_data = json_data["resultlistEntries"][0]["resultlistEntry"]
    max_pages = json_data["paging"]["numberOfPages"]
    return {"search_data": search_data, "max_pages": max_pages}

async def scrape_search(url: str, scrape_all_pages: bool, max_scrape_pages: int = 10) -> List[Dict]:
    """scrape immobilienscout24 search pages"""
    # scrape the first search page and find the total number of search pages
    first_page = await scrapfly.async_scrape(ScrapeConfig(url, **BASE_CONFIG))
    data = parse_search(first_page)
    search_data = data["search_data"]
    max_search_pages = data["max_pages"]
    # scrape all available pages if scrape_all_pages=True or the limit exceeds what's available
    if scrape_all_pages or max_scrape_pages > max_search_pages:
        max_scrape_pages = max_search_pages
    print(f"scraping search {url} pagination ({max_scrape_pages - 1} more pages)")
    # scrape the remaining search pages concurrently
    to_scrape = [
        ScrapeConfig(url + f"?pagenumber={page}", **BASE_CONFIG)
        for page in range(2, max_scrape_pages + 1)
    ]
    async for response in scrapfly.concurrent_scrape(to_scrape):
        search_data.extend(parse_search(response)["search_data"])
    print(f"scraped {len(search_data)} properties from {url}")
    return search_data
Run the code
if __name__ == "__main__":
    search_data = asyncio.run(scrape_search(
        url="https://www.immobilienscout24.de/Suche/de/bayern/muenchen/wohnung-mieten",
        scrape_all_pages=False,
        max_scrape_pages=3
    ))
    # save the results to a JSON file
    with open("search_results.json", "w", encoding="utf-8") as f:
        json.dump(search_data, f, indent=2, ensure_ascii=False)
Here, we use the parse_search function to parse the hidden web data. The scrape_search function then scrapes the first search page to get the total number of available search pages, and iterates up to the max_scrape_pages value to crawl the remaining search pagination concurrently.
The result is a list containing all the property data found in three search pages:
Cool! Our immobilienscout24.de scraper can scrape loads of real estate data with a few lines of code.
FAQ
To wrap up this guide on immobilienscout24.de web scraping, let's take a look at some frequently asked questions.
Is scraping Immobilienscout24.de legal?
Yes, all the data on Immobilienscout24.de is publicly available, so it's legal to scrape at reasonable rates. However, using personal data commercially (such as private realtors' information) may be difficult due to GDPR rules in EU countries. Refer to our previous guide on web scraping legality for more information.
Is there a public API for immobilienscout24.de?
At the time of writing this article, there is no public API available for immobilienscout24.de. However, as covered in this guide, the search result data can be extracted from the hidden JSON that immobilienscout24.de embeds in its search pages.
Summary
Immobilienscout24.de is a popular website for real estate ads in Germany, and a highly protected one, capable of detecting and blocking web scrapers.
In this article, we explained how to bypass Immobilienscout24.de web scraping blocking. We also went through a step-by-step guide on how to scrape immobilienscout24.de with Python to get real estate data from property and search pages.