Immoscout24.ch is a popular website for real estate ads in Switzerland, covering properties for both sale and rent.
In this article, we'll explore how to scrape immoscout24.ch for property listing data. We'll also explain how to avoid immoscout24.ch web scraping blocking. Let's dive in!
This tutorial covers popular web scraping techniques for education. Interacting with public servers requires diligence and respect; here's a good summary of what not to do:
Do not scrape at rates that could damage the website.
Do not scrape data that's not available publicly.
Do not store PII of EU citizens who are protected by GDPR.
Do not repurpose entire public datasets, which can be illegal in some countries.
Scrapfly does not offer legal advice, but these are good general rules to follow in web scraping; for more, you should consult a lawyer.
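For instance, one simple way to follow the first rule is to cap the scraper's concurrency and pace requests. Here's a minimal sketch using asyncio.Semaphore with httpx; the limit and delay values are arbitrary examples, not recommendations:

import asyncio
from httpx import AsyncClient

# allow at most 3 requests in flight at any time (example value)
semaphore = asyncio.Semaphore(3)

async def polite_get(client: AsyncClient, url: str):
    """fetch a URL while respecting a global concurrency limit"""
    async with semaphore:
        response = await client.get(url)
        await asyncio.sleep(1)  # brief pause so requests don't hammer the server
        return response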
Why Scrape Immoscout24.ch?
Scraping Immoscout24.ch opens the door to comprehensive property listing data, allowing sellers and buyers to track market trends, including property prices and demand and supply changes over time.
Real estate data from scraping immoscout24.ch can help investors identify underpriced properties or areas with growth potential, leading to better decision-making.
Moreover, manually exploring real estate data on websites is tedious and time-consuming. Web scraping immoscout24.ch can save a lot of manual effort by quickly retrieving thousands of listings.
Project Setup
To scrape immoscout24.ch, we'll use a few Python libraries:
httpx - for sending HTTP requests to the website and retrieving its HTML.
parsel - for parsing the HTML and selecting elements using XPath selectors.
scrapfly-sdk - for avoiding immoscout24.ch scraping blocking using the ScrapFly web scraping API.
asyncio - for increasing web scraping speed by running our code asynchronously.
Python already includes asyncio and the rest can be installed using the following pip command:
pip install httpx parsel scrapfly-sdk
Scrape Immoscout24 Search
Let's start by scraping immoscout24.ch search pages, which contain property preview data and links to the full property listing details. For our example, let's take this search URL: immoscout24.ch/en/real-estate/rent/city-bern
We can manipulate the search results by changing parts of the URL, such as the URL parameters or the location, as sketched in the example after this list:
Search page number
By adding a pn parameter at the end of the URL with the desired page number.
Search location
By replacing city-bern with the desired city name.
Property types
By changing either rent or buy in the search URL.
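For instance, here's how these URL variations can be composed in Python (the city, property type, and page values below are just example inputs):

def make_search_url(city: str, offer_type: str = "rent", page: int = 1) -> str:
    """build an immoscout24.ch search URL for a given city, offer type and page"""
    url = f"https://www.immoscout24.ch/en/real-estate/{offer_type}/city-{city}"
    if page > 1:
        url += f"?pn={page}"  # pn sets the search page number
    return url

print(make_search_url("bern", "rent", page=2))
# https://www.immoscout24.ch/en/real-estate/rent/city-bern?pn=2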
Instead of scraping immoscout24.ch property listings by parsing the visible HTML elements, we'll extract the hidden JSON property datasets directly from script tags. This type of data is typically known as hidden web data, and it's the same data that gets rendered into the HTML we see on the page.
To view this data in the HTML, open the browser developer tools by pressing the F12 key. Then, scroll down to the script tag that contains HTML similar to this:
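As a rough illustration inferred from the parsing code that follows (the real dataset embedded by the page is much larger), the script tag assigns a JSON state object:

<script>
    window.__INITIAL_STATE__={"resultList":{"search":{"fullSearch":{"result":{"listings":[...],"resultCount":...}}}}}
</script>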
To scrape this data, we'll select and parse this script tag data:
Python
ScrapFly
import asyncio
import json
from typing import List, Dict

from httpx import AsyncClient, Response
from parsel import Selector

client = AsyncClient(
    headers={
        # use the same headers as a popular web browser (Chrome on macOS in this case)
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "Accept-Language": "en-US,en;q=0.9",
    }
)


def parse_next_data(response: Response) -> Dict:
    """parse the hidden JSON dataset from the script tag"""
    selector = Selector(response.text)
    # select the script tag containing the hidden JSON dataset
    next_data = selector.xpath("//script[contains(text(),'INITIAL_STATE')]/text()").get()
    if not next_data:
        return
    # remove the JavaScript assignment to leave plain JSON
    next_data = next_data.strip().removeprefix("window.__INITIAL_STATE__=")
    # replace undefined values as Python's JSON parser doesn't recognize them
    next_data = next_data.replace("undefined", "null")
    return json.loads(next_data)


async def scrape_search(url: str, max_pages: int) -> List[Dict]:
    """scrape listing data from immoscout24 search pages"""
    # scrape the first search page first
    first_page = await client.get(url)
    data = parse_next_data(first_page)["resultList"]["search"]["fullSearch"]["result"]
    search_data = data["listings"]
    # get the maximum number of search pages available
    max_search_pages = data["resultCount"]
    print(f"scraped first search page, remaining ({max_search_pages} search pages)")
    # limit the number of pages to scrape
    if max_pages and max_pages < max_search_pages:
        max_search_pages = max_pages
    # add the remaining search pages to a scraping list
    other_pages = [
        client.get(url=str(first_page.url) + f"?pn={page}")
        for page in range(2, max_search_pages + 1)
    ]
    # scrape the remaining search pages concurrently
    for response in asyncio.as_completed(other_pages):
        data = parse_next_data(await response)
        search_data.extend(data["resultList"]["search"]["fullSearch"]["result"]["listings"])
    print(f"scraped {len(search_data)} property listings from search")
    return search_data
import asyncio
import json
from typing import List, Dict

from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

scrapfly = ScrapflyClient(key="Your ScrapFly API key")


def parse_next_data(response: ScrapeApiResponse) -> Dict:
    """parse the hidden JSON dataset from the script tag"""
    selector = response.selector
    # select the script tag containing the hidden JSON dataset
    next_data = selector.xpath("//script[contains(text(),'INITIAL_STATE')]/text()").get()
    if not next_data:
        return
    # remove the JavaScript assignment to leave plain JSON
    next_data = next_data.strip().removeprefix("window.__INITIAL_STATE__=")
    # replace undefined values as Python's JSON parser doesn't recognize them
    next_data = next_data.replace("undefined", "null")
    return json.loads(next_data)


async def scrape_search(url: str, max_pages: int) -> List[Dict]:
    """scrape listing data from immoscout24 search pages"""
    # scrape the first search page first
    first_page = await scrapfly.async_scrape(ScrapeConfig(url, asp=True, country="CH"))
    data = parse_next_data(first_page)["resultList"]["search"]["fullSearch"]["result"]
    search_data = data["listings"]
    # get the maximum number of search pages available
    max_search_pages = data["resultCount"]
    # limit the number of pages to scrape
    if max_pages and max_pages < max_search_pages:
        max_search_pages = max_pages
    print(f"scraped first search page, remaining ({max_search_pages} search pages)")
    # add the remaining search pages to a scraping list
    other_pages = [
        ScrapeConfig(first_page.context["url"] + f"?pn={page}", asp=True, country="CH")
        for page in range(2, max_search_pages + 1)
    ]
    # scrape the remaining search pages concurrently
    async for response in scrapfly.concurrent_scrape(other_pages):
        data = parse_next_data(response)
        search_data.extend(data["resultList"]["search"]["fullSearch"]["result"]["listings"])
    print(f"scraped {len(search_data)} property listings from search")
    return search_data
Run the code
if __name__ == "__main__":
    search_data = asyncio.run(
        scrape_search(
            url="https://www.immoscout24.ch/en/real-estate/rent/city-bern",
            max_pages=3,
        )
    )
    # print the result in JSON format
    print(json.dumps(search_data, indent=2))
Here, we create an async httpx client with a few HTTP headers to mimic a normal web browser. Then, we define a parse_next_data function to find the script tag that contains the property JSON data and clean it up. Next, we define a scrape_search function to scrape the first search page and determine the maximum number of search pages available. Finally, we scrape the remaining pages concurrently.
The result is a list containing the real estate property listings found on all of the scraped search pages.
Now that our immoscout24.ch scraper can successfully scrape search pages, let's scrape property pages.
Scrape Immoscout24 Property Pages
Similar to hidden JSON data in search pages, there are also hidden JSON datasets in property listing pages:
To scrape immoscout24.ch property page data, we will extend our search scraping code with a scrape_properties() function:
Python
ScrapFly
import asyncio
import json
from typing import List, Dict

from httpx import AsyncClient, Response
from parsel import Selector

client = AsyncClient(
    headers={
        # use the same headers as a popular web browser (Chrome on macOS in this case)
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "Accept-Language": "en-US,en;q=0.9",
    }
)


def parse_next_data(response: Response) -> Dict:
    """parse the hidden JSON dataset from the script tag"""
    selector = Selector(response.text)
    # select the script tag containing the hidden JSON dataset
    next_data = selector.xpath("//script[contains(text(),'INITIAL_STATE')]/text()").get()
    if not next_data:
        return
    # remove the JavaScript assignment to leave plain JSON
    next_data = next_data.strip().removeprefix("window.__INITIAL_STATE__=")
    # replace undefined values as Python's JSON parser doesn't recognize them
    next_data = next_data.replace("undefined", "null")
    return json.loads(next_data)


async def scrape_properties(urls: List[str]) -> List[Dict]:
    """scrape listing data from immoscout24 property pages"""
    # add the property pages to a scraping list
    to_scrape = [client.get(url) for url in urls]
    properties = []
    # scrape all property pages concurrently
    for response in asyncio.as_completed(to_scrape):
        data = parse_next_data(await response)
        # handle expired property pages that lack listing data
        try:
            properties.append(data["listing"]["listing"])
        except (TypeError, KeyError):
            print("expired property page")
    print(f"scraped {len(properties)} property pages")
    return properties
import asyncio
import json
from typing import List, Dict

from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

scrapfly = ScrapflyClient(key="Your ScrapFly API key")


def parse_next_data(response: ScrapeApiResponse) -> Dict:
    """parse the hidden JSON dataset from the script tag"""
    selector = response.selector
    # select the script tag containing the hidden JSON dataset
    next_data = selector.xpath("//script[contains(text(),'INITIAL_STATE')]/text()").get()
    if not next_data:
        return
    # remove the JavaScript assignment to leave plain JSON
    next_data = next_data.strip().removeprefix("window.__INITIAL_STATE__=")
    # replace undefined values as Python's JSON parser doesn't recognize them
    next_data = next_data.replace("undefined", "null")
    return json.loads(next_data)


async def scrape_properties(urls: List[str]) -> List[Dict]:
    """scrape listing data from immoscout24 property pages"""
    # add the property pages to a scraping list
    to_scrape = [ScrapeConfig(url, asp=True, country="CH") for url in urls]
    properties = []
    # scrape all property pages concurrently
    async for response in scrapfly.concurrent_scrape(to_scrape):
        data = parse_next_data(response)
        # handle expired property pages that lack listing data
        try:
            properties.append(data["listing"]["listing"])
        except (TypeError, KeyError):
            print("expired property page")
    print(f"scraped {len(properties)} property pages")
    return properties
Run the code
if __name__ == "__main__":
    properties = asyncio.run(scrape_properties(
        urls=[
            "https://www.immoscout24.ch/rent/4001413696",
            "https://www.immoscout24.ch/rent/4001377896",
            "https://www.immoscout24.ch/rent/4000759629",
            "https://www.immoscout24.ch/rent/4000924213",
        ]
    ))
    # print the result in JSON format
    print(json.dumps(properties, indent=2))
Here, we reuse the parse_next_data function from earlier to extract the script tag data as JSON. Then, we add all the property page URLs to a scraping list and scrape them concurrently.
The result is a list containing the full data of each property page.
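The two scrapers can also be chained together. Here's a minimal sketch that feeds search results into scrape_properties by building property page URLs from each listing; note that the listing['listing']['id'] key path is an assumption - inspect the scraped search JSON to confirm where the listing ID actually lives:

async def scrape_all(search_url: str, max_pages: int) -> List[Dict]:
    """chain search scraping with property page scraping (illustrative sketch)"""
    listings = await scrape_search(search_url, max_pages=max_pages)
    # NOTE: the exact key path to the listing id is assumed here
    urls = [
        f"https://www.immoscout24.ch/rent/{listing['listing']['id']}"
        for listing in listings
    ]
    return await scrape_properties(urls)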
With this addition, our immoscout24.ch scraper can successfully scrape property listings from both search and property pages. However, we can also get this data through the hidden immoscout24.ch API. Let's take a look!
Scraping Hidden API
Currently, there is no public API available for immoscout24.ch. However, we can use the hidden API that powers the Immoscout24 front-end by reverse engineering it and replicating the browser's behavior in our scraper.
To view this API, go to any search page on immoscout24.ch and open the browser developer tools. Then, go to the Network tab and filter by Fetch/XHR requests. Next, select the next search page, which will display all the requests sent by the browser to the immoscout24.ch API:
On the left side, we can see all the XHR requests sent by the browser to the web server. On the right side, we can see the response and the headers sent along each request.
To get this data directly, we can replicate this API usage in our Python web scraper:
Python
ScrapFly
import asyncio
import json
from typing import List, Dict

from httpx import AsyncClient


async def scrape_api(page_number: int) -> List[Dict]:
    """send a GET request to the immoscout24.ch API"""
    client = AsyncClient(
        headers={
            # request headers copied from the browser developer tools
            "authority": "rest-api.immoscout24.ch",
            "is24-meta-pagenumber": str(page_number),
            "is24-meta-pagesize": "24",
            "origin": "https://www.immoscout24.ch",
            "referer": "https://www.immoscout24.ch/",
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
            "Accept-Language": "en-US,en;q=0.9",
        }
    )
    api_url = "https://rest-api.immoscout24.ch/v4/en/properties?l=436&s=1&t=1&inp=1"
    data = await client.get(api_url)
    return json.loads(data.text)


async def crawl_api(start_page: int, end_page: int, scrape_all_pages: bool) -> List[Dict]:
    """crawl the API by changing the page number in each request"""
    first_page = await scrape_api(1)
    max_search_pages = first_page["pagingInfo"]["totalPages"]
    result = []
    # scrape all pages if scrape_all_pages is True or end_page exceeds the available pages
    if scrape_all_pages or end_page > max_search_pages:
        end_page = max_search_pages
    # scrape the desired API pages
    for page_number in range(start_page, end_page + 1):
        data = await scrape_api(page_number)
        result.extend(data["properties"])
    return result
import asyncio
import json
from typing import List, Dict

from scrapfly import ScrapeConfig, ScrapflyClient

scrapfly = ScrapflyClient(key="Your ScrapFly API key")


async def scrape_api(page_number: int) -> List[Dict]:
    """send a GET request to the immoscout24.ch API"""
    headers = {
        # request headers copied from the browser developer tools
        "authority": "rest-api.immoscout24.ch",
        "is24-meta-pagenumber": str(page_number),
        "is24-meta-pagesize": "24",
        "origin": "https://www.immoscout24.ch",
        "referer": "https://www.immoscout24.ch/",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "Accept-Language": "en-US,en;q=0.9",
    }
    api_url = "https://rest-api.immoscout24.ch/v4/en/properties?l=436&s=1&t=1&inp=1"
    data = await scrapfly.async_scrape(
        ScrapeConfig(url=api_url, asp=True, country="CH", headers=headers)
    )
    return json.loads(data.scrape_result["content"])


async def crawl_api(start_page: int, end_page: int, scrape_all_pages: bool) -> List[Dict]:
    """crawl the API by changing the page number in each request"""
    first_page = await scrape_api(1)
    max_search_pages = first_page["pagingInfo"]["totalPages"]
    result = []
    # scrape all pages if scrape_all_pages is True or end_page exceeds the available pages
    if scrape_all_pages or end_page > max_search_pages:
        end_page = max_search_pages
    # scrape the desired API pages
    for page_number in range(start_page, end_page + 1):
        data = await scrape_api(page_number)
        result.extend(data["properties"])
    return result
Run the code
if __name__ == "__main__":
    result = asyncio.run(crawl_api(start_page=1, end_page=10, scrape_all_pages=False))
    # print the result in JSON format
    print(json.dumps(result, indent=2))
    # save the result to a JSON file
    with open("api-data.json", "w", encoding="utf-8") as file:
        json.dump(result, file, indent=2)
The result is the same data we got earlier from scraping the search pages.
With this addition, we can scrape immoscout24.ch for property listing data in two different ways, ensuring that our scrapers always stay on top of the latest property listings in Switzerland.
However, the chances of getting our scraper blocked increase as we send more requests. Let's take a look at a solution!
Bypass Immoscout24.ch Scraping Blocking
To avoid immoscout24.ch web scraping blocking, we'll use ScrapFly API - a web scraping API that powers up web scrapers for scraping at scale.
For example, we can use ScrapFly's asp feature with the ScrapFly Python SDK to scrape any amount of data from immoscout24.ch without getting blocked:
import httpx
from parsel import Selector

response = httpx.get("some immoscout24.ch url")
selector = Selector(response.text)

# in ScrapFly SDK becomes
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

scrapfly_client = ScrapflyClient("Your ScrapFly API key")
result: ScrapeApiResponse = scrapfly_client.scrape(ScrapeConfig(
    # some immoscout24.ch URL
    "https://www.immoscout24.ch/en/d/flat-rent-bern/8068164",
    # we can select a specific proxy country
    country="CH",
    # and enable anti scraping protection bypass:
    asp=True,
    # enable JavaScript rendering, similar to headless browsers
    render_js=True,
))
# use the built-in parsel selector
selector = result.selector
To wrap up this guide on scraping immoscout24.ch, let's take a look at some frequently asked questions.
Is web scraping immoscout24.ch legal?
Since immoscout24.ch data is publicly available, it's generally legal to scrape as long as scrapers don't directly damage the website. Note that using scraped private seller data can be restricted by the GDPR in the EU. For more information, refer to our is web scraping legal? page.
Is there a public API for immoscout24.ch?
Currently, there is no public API for immoscout24.ch. However, Immoscout24 can be scraped with Python through multiple publicly accessible endpoints, as shown in this guide.
In this guide, we explained how to scrape immoscout24.ch - a popular real estate listing website - using nothing but a bit of Python.
We went through a step-by-step guide on scraping immoscout24.ch search and property pages. Furthermore, we explored how to use the immoscout24.ch hidden API to get property listing data, and how to avoid immoscout24.ch web scraper blocking.