Immobilienscout24.de is a popular website for real estate ads in Germany, featuring millions of property listings across the country. However, due to its high level of protection, scraping immobilienscout24.de can be challenging.
In this article, we'll explain how to avoid immobilienscout24.de web scraping blocking. We'll also go through a step-by-step guide on scraping immobilienscout24.de real estate listings from search and property pages. Let's dive in!
Manually navigating property listings on immobilienscout24.de is a time-consuming task. Scraping immobilienscout24.de automates this process, allowing you to collect thousands of property listings in no time.
Web scraping immobilienscout24.de opens the door to a wide range of investing opportunities. Investors and buyers can identify properties with high potential returns by filtering through a wide set of property listings, helping them make strategic investment decisions.
Immobilienscout24.de scraping enables buyers to tailor their property search based on specific criteria by allowing them to look for properties in a particular location, with specific facilities or within a certain price range.
Project Setup
In this guide, we'll be scraping immobilienscout24.de using a few Python libraries:
scrapfly-sdk: A Python SDK for ScrapFly web scraping API, which allows for web scraping at scale without getting blocked.
parsel: For parsing the HTML using XPath and CSS selectors.
We'll also use asyncio to run our code asynchronously, increasing our web scraping speed.
asyncio comes pre-installed with Python, so we only need to install the other libraries using the following pip command:
pip install scrapfly-sdk parsel
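Before diving in, here's a minimal, self-contained sketch (no scraping involved) of the asyncio pattern we'll rely on throughout: several coroutines awaited concurrently with asyncio.gather, so three simulated one-second "requests" finish in about one second instead of three:

```python
import asyncio

async def fetch(url: str) -> str:
    # simulate a network request that takes one second
    await asyncio.sleep(1)
    return f"scraped {url}"

async def main() -> list:
    urls = ["page-1", "page-2", "page-3"]
    # run all three "requests" concurrently instead of sequentially
    return await asyncio.gather(*[fetch(url) for url in urls])

results = asyncio.run(main())
print(results)
```

The same idea powers the concurrent scraping functions later in this guide: instead of awaiting each page one after another, all pending requests run at the same time.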
How to Avoid Immobilienscout24.de Scraping Blocking?
Immobilienscout24.de uses anti-scraping technologies that can detect and block bots from accessing the website. For example, let's try to scrape it using a simple Playwright headless browser:
from playwright.sync_api import sync_playwright

with sync_playwright() as playwright:
    # launch a Chromium browser
    browser = playwright.chromium.launch(headless=False)
    page = browser.new_page()
    # go to immobilienscout24.de
    page.goto("https://www.immobilienscout24.de/expose/147036156#/")
    # take a screenshot
    page.screenshot(path="screenshot.png")
    browser.close()
By running this code, we can see that the website detected us as web scrapers, requiring us to solve a CAPTCHA challenge before proceeding to the web page:
Web scraping detection on immobilienscout24.de
To bypass immobilienscout24.de web scraping blocking, we'll use ScrapFly, a web scraping API that allows for scraping at scale without getting blocked.
Here is how we can bypass any website blocking using the ScrapFly asp feature with the Python SDK:
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

scrapfly = ScrapflyClient(key="Your ScrapFly API key")

api_response: ScrapeApiResponse = scrapfly.scrape(
    ScrapeConfig(
        url="https://www.immobilienscout24.de/expose/147036156#/",
        # cloud headless browser, similar to Playwright
        render_js=True,
        # bypass anti-scraping protection
        asp=True,
        # set the geographical location to Germany
        country="DE",
    )
)

# print the website's status code
print(api_response.upstream_status_code)
200
Now that we can scrape immobilienscout24.de without getting blocked using ScrapFly, let's use it to scrape some real estate data!
How to Scrape Immobilienscout24.de Property Pages?
In this section, we'll scrape property pages from immobilienscout24.de. Go to any property page on the website and you will get a page similar to this:
Property page on immobilienscout24.de
First, we'll start by parsing the HTML using XPath and CSS selectors to get the property data from the web page:
This looks complex, but all we're doing is requesting the HTML page, selecting parts of it with XPath selectors, and bundling the results into a final property JSON dataset.
Next, we'll use this function with the rest of our web scraping logic to scrape property data from each page:
import asyncio
import json
from typing import Dict, List

from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

scrapfly = ScrapflyClient(key="Your ScrapFly API key")

BASE_CONFIG = {
    # bypass web scraping blocking
    "asp": True,
    # set the proxy country to Germany
    "country": "de"
}

def parse_property_page(response: ScrapeApiResponse):
    # the rest of the parse_property_page function
    ...

async def scrape_properties(urls: List[str]) -> List[Dict]:
    """scrape listing data from immoscout24 property pages"""
    # add the property pages to a scraping list
    to_scrape = [ScrapeConfig(url, **BASE_CONFIG) for url in urls]
    properties = []
    # scrape all property pages concurrently
    async for response in scrapfly.concurrent_scrape(to_scrape):
        data = parse_property_page(response)
        properties.append(data)
    print(f"scraped {len(properties)} property listings")
    return properties
Run the code
if __name__ == "__main__":
    properties_data = asyncio.run(scrape_properties(
        urls=[
            "https://www.immobilienscout24.de/expose/147036156#/",
            "https://www.immobilienscout24.de/expose/147175589#/",
            "https://www.immobilienscout24.de/expose/139851227#/",
            "https://www.immobilienscout24.de/expose/146053176#/"
        ]
    ))
    # print the result in JSON format
    print(json.dumps(properties_data, indent=2, ensure_ascii=False))
Here, we add all property page URLs to the scrape_properties function as a scraping list and scrape them concurrently. Then, we use the parsing function we created earlier to extract the data from each page response. Finally, we append the data into the properties array.
The result is an array containing the property data of each page:
Our immobilienscout24.de scraper can scrape property pages. Next, we'll scrape search pages which will help us to find the right properties for our real estate datasets.
How to Scrape Immobilienscout24.de Search Pages?
Before we start scraping immobilienscout24.de search pages, let's take a look at them. Go to any search page and you will get a page similar to this:
Search page on immobilienscout24.de
Instead of scraping the search pages by parsing the HTML, we'll extract all the data directly in JSON from the API.
Head over to the Network tab and filter requests by Fetch/XHR.
Click on the next search page button.
By following the above steps, you will see all the background requests sent by the browser to the web server while changing the search page:
Background requests on browser developer tools
To identify the request responsible for fetching the search data, click on the XHR request that has the following response data:
Search page data from the API
The data in the above request is the same data shown on the web page, just as JSON before it's rendered into HTML. We can also see all the headers sent along with the request by clicking on the Headers tab.
To scrape search data, we'll replicate this request within our scraper by using the headers and URL of the API request:
import asyncio
import json
from typing import Dict, List

from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

scrapfly = ScrapflyClient(key="Your ScrapFly API key")

BASE_CONFIG = {
    # bypass web scraping blocking
    "asp": True,
    # set the proxy country to Germany
    "country": "de",
    # the API request headers
    "headers": {
        "accept": "application/json",
        "accept-language": "en-US,en;q=0.9"
    }
}

def parse_search_api(response: ScrapeApiResponse):
    """parse JSON data from the search API"""
    # skip invalid API responses
    if response.scrape_result["content_type"].split(";")[0] != "application/json":
        return
    data = json.loads(response.scrape_result["content"])
    max_search_pages = data["searchResponseModel"]["resultlist.resultlist"]["paging"]["numberOfPages"]
    search_data = data["searchResponseModel"]["resultlist.resultlist"]["resultlistEntries"][0]["resultlistEntry"]
    # remove similar property listings from each property's data
    for json_object in search_data:
        if "similarObjects" in json_object:
            json_object.pop("similarObjects")
    return {
        "max_search_pages": max_search_pages,
        "search_data": search_data
    }

async def scrape_search(url: str, scrape_all_pages: bool, max_scrape_pages: int = 10) -> List[Dict]:
    """scrape property listings from the search API, which follows the same search page URLs"""
    first_page = await scrapfly.async_scrape(ScrapeConfig(url, **BASE_CONFIG))
    result_data = parse_search_api(first_page)
    search_data = result_data["search_data"]
    max_search_pages = result_data["max_search_pages"]
    # scrape all available pages if scrape_all_pages=True or max_scrape_pages exceeds the total
    if scrape_all_pages or max_scrape_pages > max_search_pages:
        max_scrape_pages = max_search_pages
    print(f"scraping search {url} pagination ({max_scrape_pages - 1} more pages)")
    # scrape the remaining search pages
    for page in range(2, max_scrape_pages + 1):
        response = await scrapfly.async_scrape(ScrapeConfig(
            first_page.context["url"].split("?pagenumber")[0] + f"?pagenumber={page}", **BASE_CONFIG
        ))
        try:
            data = parse_search_api(response)["search_data"]
            search_data.extend(data)
        except (TypeError, KeyError):
            print("invalid search page")
    print(f"scraped {len(search_data)} properties from {url}")
    return search_data
Run the code
if __name__ == "__main__":
    search_data = asyncio.run(scrape_search(
        url="https://www.immobilienscout24.de/Suche/de/bayern/muenchen/wohnung-mieten?pagenumber=1",
        scrape_all_pages=False,
        max_scrape_pages=3
    ))
    # print the result in JSON format
    print(json.dumps(search_data, indent=2, ensure_ascii=False))
Here, we use the parse_search_api function to parse the search API's JSON response. The scrape_search function then scrapes the first search page to get the total number of available search pages and loops over the remaining page numbers to scrape the desired pages.
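The scraped search data can then feed the property scraper from the previous section. Here's a small sketch of turning search entries into property page URLs, assuming each resultlistEntry carries its listing ID in an "@id" field (as seen in the API response above):

```python
from typing import Dict, List

def property_urls_from_search(search_data: List[Dict]) -> List[str]:
    """build property (expose) page URLs from search API entries.
    assumes each entry exposes its listing ID under the "@id" key"""
    return [
        f"https://www.immobilienscout24.de/expose/{entry['@id']}#/"
        for entry in search_data
        if "@id" in entry
    ]

# demonstration with minimal, made-up search entries
sample = [{"@id": "147036156"}, {"@id": "147175589"}, {"no_id": True}]
print(property_urls_from_search(sample))
```

The resulting URL list can be passed straight to the scrape_properties function to build a complete search-to-property pipeline.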
The result is a list containing all the property data found in three search pages:
Cool! Our immobilienscout24.de scraper can scrape loads of real estate data with a few lines of code.
FAQ
To wrap up this guide on immobilienscout24.de web scraping, let's take a look at some frequently asked questions.
Is scraping Immobilienscout24.de legal?
Yes, all the data on Immobilienscout24.de are publicly available. So it's legal to scrape at reasonable scraping rates. However, using personal data commercially (such as private realtors' information) may be difficult due to GDPR rules in the EU countries. Refer to our previous guide on web scraping legality for more information.
Is there a public API for immobilienscout24.de?
At the time of writing this article, there is no public API available for immobilienscout24.de. However, as we have seen, we can use the private API immobilienscout24.de itself uses to fetch property listings for search data.
Immobilienscout24.de is a popular website for real estate ads in Germany. It's a highly protected website with the ability to detect and block web scrapers.
In this article, we explained how to bypass Immobilienscout24.de web scraping blocking. We also went through a step-by-step guide on scraping immobilienscout24.de with Python to get real estate data from property and search pages.
We're taking yet another look at real estate websites. This time we're going down under! Realestate.com.au is the biggest real estate portal in Australia, so let's take a look at how to scrape it.
Immowelt.de is a major real estate website in Germany and it's surprisingly easy to scrape. In this tutorial, we'll be using Python and the hidden web data scraping technique to scrape real estate property data.
For this scrape guide we'll be taking a look at another real estate website, this one in Switzerland - Homegate. For this we'll be using hidden web data scraping and JSON parsing.