In this web scraping tutorial, we'll be scraping idealista.com - the biggest real estate marketplace in Spain, Portugal and Italy.
In this guide, we'll be exploring real estate data scraping by taking a look at Idealista.com. We'll be scraping common property data points like property pricing, addresses, photos and agent phone numbers.
When it comes to web scraping, Idealista.com is a classic scraping target. To scrape it, we'll cover popular Python web scraping techniques like HTML parsing with CSS selectors and concurrent requests with asyncio.
Finally, we'll also cover tracking newly listed properties, giving us an upper hand in real estate discovery and bidding.
In this article, we'll focus on the Spanish version of the website (Idealista.com), though the Italian and Portuguese versions function the same way and our scraper code should work for those sources as well.
This tutorial covers popular web scraping techniques for education. Interacting with public servers requires diligence and respect, and here's a good summary of what not to do:
Do not scrape at rates that could damage the website.
Do not scrape data that's not available publicly.
Do not store PII of EU citizens who are protected by GDPR.
Do not repurpose entire public datasets, which can be illegal in some countries.
Scrapfly does not offer legal advice but these are good general rules to follow in web scraping, and for more you should consult a lawyer.
Why Scrape Idealista.com?
Idealista.com is one of the biggest real estate websites in Spain (as well as Italy and Portugal), making it the biggest public real estate dataset for these areas, containing fields like real estate prices, listing locations, sale dates and general property information.
This is valuable information for market analytics, the study of the housing industry, and a general competitor overview.
For more on scraping use cases see our extensive write-up Scraping Use Cases
Project Setup
In this tutorial, we'll be using Python with two community packages:
httpx - an HTTP client library which will let us communicate with Idealista.com's servers
parsel - an HTML parsing library which we'll use to extract data with CSS and XPath selectors
These packages can be easily installed via the pip install command:
$ pip install httpx parsel
Alternatively, feel free to swap httpx out with any other HTTP client package such as requests, as we'll only need basic HTTP functions which are almost interchangeable in every library. As for parsel, another great alternative is the beautifulsoup package.
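To illustrate that interchangeability, here's a minimal sketch of the same fetch-and-parse flow using requests and beautifulsoup4 instead. This is a synchronous example for illustration only - it reuses the property URL and selectors introduced later in this tutorial, and without further anti-blocking measures the request may well get blocked:

import requests
from bs4 import BeautifulSoup

# browser-like headers reduce (but don't eliminate) the chance of being blocked
headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36"}
response = requests.get("https://www.idealista.com/en/inmueble/97028172/", headers=headers)
soup = BeautifulSoup(response.text, "html.parser")
# BeautifulSoup's select_one() accepts the same CSS selectors we use with parsel
title = soup.select_one(".main-info__title-main").get_text(strip=True)
price = soup.select_one(".info-data-price span").get_text(strip=True)
print(title, price)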
Scraping Idealista Property Data
Let's start by taking a look at how to scrape Idealista for a single property. In later sections, we'll also take a look at how to find any properties and scrape them using this property scraper.
For example, let's take a look at a listing page and where all of the information is stored on it. We'll pick a random property listing, like:
For parsing data on Idealista, we'll be using CSS selectors, so let's mark up the fields we want to scrape:
Idealista is a pure HTML website with very convenient styling markup which we can take advantage of in our scraper. For example, if we right-click on the price and inspect the HTML element, we can see how clear the HTML structure is:
We can see that all of the data points are under clear class names like info-data-price for price or main-info__title-main for property name.
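To get a feel for how the selectors we'll write map onto this structure, here's a rough, illustrative sketch (simplified, made-up markup - not the site's verbatim HTML) parsed with parsel:

from parsel import Selector

# simplified, made-up markup mirroring the class names we saw in the inspector
html = """
<div class="main-info">
  <h1><span class="main-info__title-main">Penthouse for sale in La Dreta de l'Eixample</span></h1>
  <span class="main-info__title-minor">Eixample, Barcelona</span>
  <span class="info-data-price"><span>5,200,000</span> €</span>
</div>
"""
sel = Selector(text=html)
print(sel.css("h1 .main-info__title-main::text").get())    # property name
print(sel.css(".info-data-price span::text").get())         # price number
print(sel.css(".info-data-price::text").get().strip())      # currency symbol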
Let's scrape it:
Python
ScrapFly
import asyncio
import json
import re
from typing import Dict, List
from collections import defaultdict
from urllib.parse import urljoin
import httpx
from parsel import Selector
from typing_extensions import TypedDict
# Establish persistent HTTPX session with browser-like headers to avoid blocking
BASE_HEADERS = {
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
"accept-language": "en-US;en;q=0.9",
"accept-encoding": "gzip, deflate, br",
}
session = httpx.AsyncClient(headers=BASE_HEADERS, follow_redirects=True)
# type hints for expected results so we can visualize our scraper more easily:
class PropertyResult(TypedDict):
url: str
title: str
location: str
price: int
currency: str
description: str
updated: str
features: Dict[str, List[str]]
images: Dict[str, List[str]]
plans: List[str]
def parse_property(response: httpx.Response) -> PropertyResult:
"""parse Idealista.com property page"""
# load response's HTML tree for parsing:
selector = Selector(text=response.text)
css = lambda x: selector.css(x).get("").strip()
css_all = lambda x: selector.css(x).getall()
data = {}
# Meta data
data["url"] = str(response.url)
# Basic information
data['title'] = css("h1 .main-info__title-main::text")
data['location'] = css(".main-info__title-minor::text")
data['currency'] = css(".info-data-price::text")
data['price'] = int(css(".info-data-price span::text").replace(",", ""))
data['description'] = "\n".join(css_all("div.comment ::text")).strip()
data["updated"] = selector.xpath(
"//p[@class='stats-text']"
"[contains(text(),'updated on')]/text()"
).get("").split(" on ")[-1]
# Features
data["features"] = {}
# first we extract each feature block like "Basic Features" or "Amenities"
for feature_block in selector.css(".details-property-h2"):
# then for each block we extract all bullet points underneath them
label = feature_block.xpath("text()").get()
features = feature_block.xpath("following-sibling::div[1]//li")
data["features"][label] = [
''.join(feat.xpath(".//text()").getall()).strip()
for feat in features
]
# Images
# the images are tucked away in a javascript variable.
# We can use regular expressions to find the variable and parse it as a dictionary:
image_data = re.findall(r"fullScreenGalleryPics\s*:\s*(\[.+?\]),",
response.text
)[0]
# we also need to replace unquoted keys to quoted keys (i.e. title -> "title"):
images = json.loads(re.sub(r'(\w+?):([^/])', r'"\1":\2', image_data))
data['images'] = defaultdict(list)
data['plans'] = []
for image in images:
url = urljoin(str(response.url), image['imageUrl'])
if image['isPlan']:
data['plans'].append(url)
else:
data['images'][image['tag']].append(url)
return data
async def scrape_properties(urls: List[str]) -> List[PropertyResult]:
"""Scrape Idealista.com properties"""
properties = []
to_scrape = [session.get(url) for url in urls]
# tip: asyncio.as_completed allows concurrent scraping - super fast!
for response in asyncio.as_completed(to_scrape):
response = await response
print(response.status_code)
if response.status_code != 200:
print(f"can't scrape property: {response.url}")
continue
properties.append(parse_property(response))
return properties
import asyncio
import json
import re
from typing import Dict, List
from typing_extensions import TypedDict
from collections import defaultdict
from urllib.parse import urljoin
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
scrapfly = ScrapflyClient(key="Your ScrapFly Key")
# type hints for expected results so we can visualize our scraper more easily:
class PropertyResult(TypedDict):
url: str
title: str
location: str
price: int
currency: str
description: str
updated: str
features: Dict[str, List[str]]
images: Dict[str, List[str]]
plans: List[str]
def parse_property(response: ScrapeApiResponse) -> PropertyResult:
"""parse Idealista.com property page"""
# load response's HTML tree for parsing:
selector = response.selector
css = lambda x: selector.css(x).get("").strip()
css_all = lambda x: selector.css(x).getall()
data = {}
# Meta data
data["url"] = str(response.context["url"])
# Basic information
data['title'] = css("h1 .main-info__title-main::text")
data['location'] = css(".main-info__title-minor::text")
data['currency'] = css(".info-data-price::text")
data['price'] = int(css(".info-data-price span::text").replace(",", ""))
data['description'] = "\n".join(css_all("div.comment ::text")).strip()
data["updated"] = selector.xpath(
"//p[@class='stats-text']"
"[contains(text(),'updated on')]/text()"
).get("").split(" on ")[-1]
# Features
data["features"] = {}
# first we extract each feature block like "Basic Features" or "Amenities"
for feature_block in selector.css(".details-property-h2"):
# then for each block we extract all bullet points underneath them
label = feature_block.xpath("text()").get()
features = feature_block.xpath("following-sibling::div[1]//li")
data["features"][label] = [
''.join(feat.xpath(".//text()").getall()).strip()
for feat in features
]
# Images
# the images are tucked away in a javascript variable.
# We can use regular expressions to find the variable and parse it as a dictionary:
image_data = re.findall(r"fullScreenGalleryPics\s*:\s*(\[.+?\]),",
response.scrape_result['content']
)[0]
# we also need to replace unquoted keys to quoted keys (i.e. title -> "title"):
images = json.loads(re.sub(r'(\w+?):([^/])', r'"\1":\2', image_data))
data['images'] = defaultdict(list)
data['plans'] = []
for image in images:
url = urljoin(str(response.context["url"]), image['imageUrl'])
if image['isPlan']:
data['plans'].append(url)
else:
data['images'][image['tag']].append(url)
return data
async def scrape_properties(urls: List[str]) -> List[PropertyResult]:
"""Scrape Idealista.com properties"""
properties = []
to_scrape = [ScrapeConfig(url, asp=True, country="ES") for url in urls]
async for response in scrapfly.concurrent_scrape(to_scrape):
if response.upstream_status_code != 200:
print(f"can't scrape property: {response.context['url']}")
continue
properties.append(parse_property(response))
return properties
🧙 If you are experiencing errors while running the Python code tabs, it may be due to getting blocked. To bypass blocking, use the ScrapFly code tabs instead.
Run Code & Example Output
async def run():
urls = ["https://www.idealista.com/en/inmueble/97028172/"]
data = await scrape_properties(urls)
print(json.dumps(data, indent=2, ensure_ascii=False))
if __name__ == "__main__":
asyncio.run(run())
Which would result in a dataset similar to this:
[
{
"title": "Penthouse for sale in La Dreta de l'Eixample",
"location": "Eixample, Barcelona",
"price": 5200000,
"currency": "€",
"description": "This stunning penthouse hosts 269 m2 distributed across two floors and a turret with 360º exposures. With straight access from the main lift, we walk through a hall that leads to a large central space composed by the living area and a dining room and an access to a terrace at the same level. A full equipped and red lacquered kitchen, is directly connected to the dining room and features a large window framing Gaudi's masterpiece, Sagrada Familia. On the same floor there are three bedrooms, one en-suite and two double bedrooms with their own bathroom. All the rooms are exterior facing and are surrounded by terraces. Moreover, oversize windows allow for abundant light to stream across the interiors with high-ceilings. \nOn the upper floor we find a room with access to 200 m2 of terraces hosting chill-out areas, a swimming pool and a jacuzzi. In addition, an interior spiral staircase on the same floor, leads to a turret on a third level spanning 360º views over Barcelona. \nThe penthouse is well preserved with high quality finishes, air conditioning and heating, but it also offers the opportunity have the interiors renovated to contemporary standards, to convert it into one-of-a-kind piece in Barcelona city. \nContact us for more information or to arrange a viewing.",
"features": {
"Basic features": [
"367 m² built",
"5 bedrooms",
"4 bathrooms",
"Terrace",
"Second hand/good condition",
"Fitted wardrobes",
"Built in 1954"
],
"Building": [
"exterior",
"With lift"
],
"Amenities": [
"Air conditioning",
"Swimming pool"
],
"Energy performance certificate": [
"Not indicated"
]
},
"updated": "2 November",
"url": "https://www.idealista.com/en/inmueble/97028172/",
"images": {
"Communal areas": [
"https://www.idealista.com/inmueble/97028172/foto/1/",
"https://www.idealista.com/inmueble/97028172/foto/3/",
"https://www.idealista.com/inmueble/97028172/foto/5/",
"https://www.idealista.com/inmueble/97028172/foto/6/",
"https://www.idealista.com/inmueble/97028172/foto/9/",
"https://www.idealista.com/inmueble/97028172/foto/10/",
"https://www.idealista.com/inmueble/97028172/foto/11/"
],
"Swimming pool": [
"https://www.idealista.com/inmueble/97028172/foto/2/",
"https://www.idealista.com/inmueble/97028172/foto/4/",
"https://www.idealista.com/inmueble/97028172/foto/7/",
"https://www.idealista.com/inmueble/97028172/foto/8/"
],
"Views": [
"https://www.idealista.com/inmueble/97028172/foto/12/",
"https://www.idealista.com/inmueble/97028172/foto/28/",
"https://www.idealista.com/inmueble/97028172/foto/48/"
],
"Living room": [
"https://www.idealista.com/inmueble/97028172/foto/13/",
"https://www.idealista.com/inmueble/97028172/foto/14/",
"https://www.idealista.com/inmueble/97028172/foto/16/",
"https://www.idealista.com/inmueble/97028172/foto/17/",
"https://www.idealista.com/inmueble/97028172/foto/18/",
"https://www.idealista.com/inmueble/97028172/foto/19/"
],
"Dining room": [
"https://www.idealista.com/inmueble/97028172/foto/15/",
"https://www.idealista.com/inmueble/97028172/foto/25/"
],
"Terrace": [
"https://www.idealista.com/inmueble/97028172/foto/20/",
"https://www.idealista.com/inmueble/97028172/foto/21/",
"https://www.idealista.com/inmueble/97028172/foto/22/",
"https://www.idealista.com/inmueble/97028172/foto/24/",
"https://www.idealista.com/inmueble/97028172/foto/36/",
"https://www.idealista.com/inmueble/97028172/foto/40/",
"https://www.idealista.com/inmueble/97028172/foto/41/",
"https://www.idealista.com/inmueble/97028172/foto/42/"
],
"Bedroom": [
"https://www.idealista.com/inmueble/97028172/foto/23/",
"https://www.idealista.com/inmueble/97028172/foto/31/",
"https://www.idealista.com/inmueble/97028172/foto/34/",
"https://www.idealista.com/inmueble/97028172/foto/35/",
"https://www.idealista.com/inmueble/97028172/foto/38/",
"https://www.idealista.com/inmueble/97028172/foto/39/",
"https://www.idealista.com/inmueble/97028172/foto/43/"
],
"Kitchen": [
"https://www.idealista.com/inmueble/97028172/foto/26/",
"https://www.idealista.com/inmueble/97028172/foto/27/",
"https://www.idealista.com/inmueble/97028172/foto/29/",
"https://www.idealista.com/inmueble/97028172/foto/30/"
],
"Bathroom": [
"https://www.idealista.com/inmueble/97028172/foto/32/",
"https://www.idealista.com/inmueble/97028172/foto/37/",
"https://www.idealista.com/inmueble/97028172/foto/44/"
],
"Office": [
"https://www.idealista.com/inmueble/97028172/foto/33/",
"https://www.idealista.com/inmueble/97028172/foto/46/"
],
"Staircase": [
"https://www.idealista.com/inmueble/97028172/foto/45/",
"https://www.idealista.com/inmueble/97028172/foto/47/"
],
"Reception": [
"https://www.idealista.com/inmueble/97028172/foto/49/"
]
},
"plans": [
"https://www.idealista.com/inmueble/97028172/foto/50/",
"https://www.idealista.com/inmueble/97028172/foto/51/"
]
}
]
In this demonstration, we used a few CSS and XPath selectors with parsel to extract property details like the price, description and features.
However, the images are where things get a bit more complex. For image carousels, many websites use javascript to generate dynamic HTML on demand. Idealista is no exception: it hides all of the image URLs in a javascript variable and renders the carousel with javascript.
To scrape this, we used a regular expression pattern to find the hidden javascript variable, loaded it as a Python dictionary object and parsed out the image and floor plan URLs.
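To make that step concrete, here's a small standalone sketch of the technique applied to a made-up snippet of such a javascript variable (the real variable is much longer, but the parsing steps are identical):

import re
import json

# made-up example of the kind of javascript embedded in the page source
page_source = 'var config = { fullScreenGalleryPics : [{imageUrl:"/inmueble/97028172/foto/1/",isPlan:false,tag:"Views"}], other: 1 };'
# 1. extract the array with a regular expression
image_data = re.findall(r"fullScreenGalleryPics\s*:\s*(\[.+?\]),", page_source)[0]
# 2. quote the bare javascript keys (imageUrl -> "imageUrl") so json.loads can read it;
#    the [^/] guard prevents mangling the "//" in absolute urls like https://
quoted = re.sub(r'(\w+?):([^/])', r'"\1":\2', image_data)
print(json.loads(quoted))
# [{'imageUrl': '/inmueble/97028172/foto/1/', 'isPlan': False, 'tag': 'Views'}]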
For the scraping itself, we used the asynchronous capabilities of httpx and asyncio.as_completed to schedule multiple properties concurrently, making our scraper super fast!
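As a stripped-down, generic illustration of that concurrency pattern (placeholder URLs, not Idealista), scheduling requests with httpx and asyncio.as_completed looks roughly like this:

import asyncio
import httpx

async def fetch_all(urls):
    async with httpx.AsyncClient() as client:
        # start all requests at once and handle each response as soon as it finishes
        tasks = [client.get(url) for url in urls]
        statuses = []
        for task in asyncio.as_completed(tasks):
            response = await task
            statuses.append(response.status_code)
        return statuses

# example usage with placeholder URLs:
# asyncio.run(fetch_all(["https://httpbin.dev/html", "https://httpbin.dev/json"]))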
Next, let's take a look at how we can scale up this scraper by implementing exploration functionality.
Finding Idealista Properties
There are several ways to find properties listed on Idealista. The most popular and reliable is to explore by area. In this section, we'll take a look at how to scrape property listings with a little bit of crawling - we'll explore the location directory.
To find the location directory we can scroll to the bottom of the page:
Each link leads to a province listing URL which further leads to area listings URLs. We can easily scrape this with the same CSS selector technique we've used previously:
Python
ScrapFly
def parse_province(response: httpx.Response) -> List[str]:
"""parse province page for area search urls"""
selector = Selector(text=response.text)
urls = selector.css("#location_list li>a::attr(href)").getall()
return [urljoin(str(response.url), url) for url in urls]
async def scrape_provinces(urls: List[str]) -> List[str]:
"""
Scrape province pages like:
https://www.idealista.com/en/venta-viviendas/malaga-provincia/con-chalets/municipios
for search page urls like:
https://www.idealista.com/en/venta-viviendas/marbella-malaga/con-chalets/
"""
to_scrape = [session.get(url) for url in urls]
search_urls = []
for response in asyncio.as_completed(to_scrape):
search_urls.extend(parse_province(await response))
return search_urls
def parse_province(response: ScrapeApiResponse) -> List[str]:
"""parse province page for area search urls"""
selector = response.selector
urls = selector.css("#location_list li>a::attr(href)").getall()
return [urljoin(str(response.context["url"]), url) for url in urls]
async def scrape_provinces(urls: List[str]) -> List[str]:
"""
Scrape province pages like:
https://www.idealista.com/en/venta-viviendas/malaga-provincia/con-chalets/municipios
for search page urls like:
https://www.idealista.com/en/venta-viviendas/marbella-malaga/con-chalets/
"""
to_scrape = [ScrapeConfig(url, asp=True, country="ES") for url in urls]
search_urls = []
async for response in scrapfly.concurrent_scrape(to_scrape):
search_urls.extend(parse_province(response))
return search_urls
Run Code & Example Output
async def run():
data = await scrape_provinces([
"https://www.idealista.com/en/venta-viviendas/balears-illes/con-chalets/municipios"
])
print(json.dumps(data, indent=2))
if __name__ == "__main__":
asyncio.run(run())
This scraper will scrape all area pages for the given provinces. To discover all property listings, all we'd have to do is scrape all provinces. Next, let's scrape the search results page itself:
Python
ScrapFly
import math  # the page-count calculation below needs math.ceil
def parse_search(response: httpx.Response) -> List[str]:
"""Parse search result page for 30 listings"""
selector = Selector(text=response.text)
urls = selector.css("article.item .item-link::attr(href)").getall()
return [urljoin(str(response.url), url) for url in urls]
async def scrape_search(url: str, paginate=True, max_pages: int = None) -> List[str]:
"""
Scrape search urls like:
https://www.idealista.com/en/venta-viviendas/marbella-malaga/con-chalets/
for property urls
"""
first_page = await session.get(url)
property_urls = parse_search(first_page)
if not paginate:
return property_urls
total_results = Selector(text=first_page.text).css("h1#h1-container").re(": (.+) houses")[0]
total_pages = math.ceil(int(total_results.replace(",", "")) / 30)
if total_pages > 60:
print(f"search contains more than max page limit ({total_pages}/60)")
total_pages = 60
# scrape all available pages unless max_pages caps the search
if max_pages and max_pages < total_pages:
total_pages = max_pages
print(f"scraping {total_pages} of search results concurrently")
to_scrape = [
session.get(str(first_page.url) + f"pagina-{page}.htm")
for page in range(2, total_pages + 1)
]
for response in asyncio.as_completed(to_scrape):
property_urls.extend(parse_search(await response))
return property_urls
def parse_search(response: ScrapeApiResponse) -> List[str]:
"""Parse search result page for 30 listings"""
selector = response.selector
urls = selector.css("article.item .item-link::attr(href)").getall()
return [urljoin(str(response.context["url"]), url) for url in urls]
async def scrape_search(url: str, paginate=True, max_pages: int = None) -> List[str]:
"""
Scrape search urls like:
https://www.idealista.com/en/venta-viviendas/marbella-malaga/con-chalets/
for property urls
"""
first_page = await scrapfly.async_scrape(ScrapeConfig(url, asp=True, country="ES"))
property_urls = parse_search(first_page)
if not paginate:
return property_urls
total_results = first_page.selector.css("h1#h1-container").re(": (.+) houses")[0]
total_pages = math.ceil(int(total_results.replace(",", "")) / 30)
if total_pages > 60:
print(f"search contains more than max page limit ({total_pages}/60)")
total_pages = 60
# scrape all available pages unless max_pages caps the search
if max_pages and max_pages < total_pages:
total_pages = max_pages
print(f"scraping {total_pages} of search results concurrently")
to_scrape = [
ScrapeConfig(first_page.context["url"] + f"pagina-{page}.htm", asp=True, country="ES")
for page in range(2, total_pages + 1)
]
async for response in scrapfly.concurrent_scrape(to_scrape):
property_urls.extend(parse_search(response))
return property_urls
Run Code & Example Output
async def run():
data = await scrape_search(
url="https://www.idealista.com/en/venta-viviendas/marbella-malaga/con-chalets/",
max_pages=1
)
print(json.dumps(data, indent=2))
if __name__ == "__main__":
asyncio.run(run())
For scraping paginated content like the area result pages, we first scrape the first page to extract the total result count. Then, we can scrape the remaining pages concurrently, retrieving all listings in just a few seconds!
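To illustrate how these pieces fit together, here's a rough sketch (using the functions from the Python tabs above; the province URL is just an example) that chains province discovery, area search and property scraping:

async def scrape_full_pipeline():
    # 1. discover area search URLs for a province
    search_urls = await scrape_provinces([
        "https://www.idealista.com/en/venta-viviendas/balears-illes/con-chalets/municipios"
    ])
    # 2. collect property URLs from the first page of each area search
    property_urls = []
    for search_url in search_urls:
        property_urls.extend(await scrape_search(search_url, max_pages=1))
    # 3. scrape the full property details for every discovered listing
    return await scrape_properties(property_urls)

# asyncio.run(scrape_full_pipeline())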
With this discovery scraper and our previous property scraper we can collect all of the existing real estate data on Idealista.com - though what if we want to be the first to know about new property listings? Next, let's take a look at how we can scrape Idealista's search results.
How to Scrape Idealista Search
In this section, we'll scrape Idealista's search pages. These search pages enable finding specific property listings and sorting them. For example, let's find properties in Malaga, Spain:
To build an Idealista scraper for search pages, we'll request search pages and parse their results while incrementing the pagina parameter for pagination:
Python
ScrapFly
import json
import math
import httpx
import asyncio
from typing import Dict, List
from parsel import Selector
# Establish persistent HTTPX session with browser-like headers to avoid blocking
BASE_HEADERS = {
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
"accept-language": "en-US;en;q=0.9",
"accept-encoding": "gzip, deflate, br",
}
session = httpx.AsyncClient(headers=BASE_HEADERS, follow_redirects=True)
def parse_search_data(response: httpx.Response) -> Dict:
"""parse search result data"""
selector = Selector(response.text)
total_results = selector.css("h1#h1-container").re(": (.+) houses")[0]
max_pages = math.ceil(int(total_results.replace(",", "")) / 30)
max_pages = 60 if max_pages > 60 else max_pages
search_data = []
for box in selector.xpath("//section[contains(@class, 'items-list')]/article[contains(@class, 'item')]"):
ad = box.xpath(".//p[@class='adv_txt']") # ignore ad listings
if ad:
continue
price = box.xpath(".//span[contains(@class, 'item-price')]/text()").get()
parking = box.xpath(".//span[@class='item-parking']").get()
company_url = box.xpath(".//picture[@class='logo-branding']/a/@href").get()
search_data.append({
"title": box.xpath(".//div/a/@title").get(),
"link": "https://www.idealista.com" + box.xpath(".//div/a/@href").get(),
"picture": box.xpath(".//img/@src").get(),
"price": int(price.replace(",", '')) if price else None,
"currency": box.xpath(".//span[contains(@class, 'item-price')]/span/text()").get(),
"parking_included": True if parking else False,
"details": box.xpath(".//div[@class='item-detail-char']/span/text()").getall(),
"description": box.xpath(".//div[contains(@class, 'item-description')]/p/text()").get().replace('\n', ''),
"tags": box.xpath(".//div[@class='listing-tags-container']/span/text()").getall(),
"listing_company": box.xpath(".//picture[@class='logo-branding']/a/@title").get(),
"listing_company_url": "https://www.idealista.com" + company_url if company_url else None
})
return {"max_pages": max_pages, "search_data": search_data}
async def scrape_search(url: str, max_scrape_pages: int = None) -> List[Dict]:
"""scrape Idealista search results"""
first_page = await session.get(url)
assert first_page.status_code == 200, "request is blocked, use ScrapFly code tab"
data = parse_search_data(first_page)
search_data = data["search_data"]
max_pages = data["max_pages"]
# get the number of total pages to scrape
if max_scrape_pages and max_scrape_pages < max_pages:
max_pages = max_scrape_pages
# scrape the remaining pages concurrently
to_scrape = [
session.get(url + f"pagina-{page}.htm")
for page in range(2, max_pages + 1)
]
print(f"scraping search pagination, {max_pages - 1} pages remaining")
for response in asyncio.as_completed(to_scrape):
search_data.extend(parse_search_data(await response)["search_data"])
print(f"scraped {len(search_data)} property listings from search pages")
return search_data
import json
import math
import asyncio
from typing import Dict, List
from loguru import logger as log
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
SCRAPFLY = ScrapflyClient(key="Your ScrapFly API key")
def parse_search_data(response: ScrapeApiResponse) -> Dict:
"""parse search result data"""
selector = response.selector
total_results = selector.css("h1#h1-container").re(": (.+) houses")[0]
max_pages = math.ceil(int(total_results.replace(",", "")) / 30)
max_pages = 60 if max_pages > 60 else max_pages
search_data = []
for box in selector.xpath("//section[contains(@class, 'items-list')]/article[contains(@class, 'item')]"):
ad = box.xpath(".//p[@class='adv_txt']") # ignore ad listings
if ad:
continue
price = box.xpath(".//span[contains(@class, 'item-price')]/text()").get()
parking = box.xpath(".//span[@class='item-parking']").get()
company_url = box.xpath(".//picture[@class='logo-branding']/a/@href").get()
search_data.append({
"title": box.xpath(".//div/a/@title").get(),
"link": "https://www.idealista.com" + box.xpath(".//div/a/@href").get(),
"picture": box.xpath(".//img/@src").get(),
"price": int(price.replace(",", '')) if price else None,
"currency": box.xpath(".//span[contains(@class, 'item-price')]/span/text()").get(),
"parking_included": True if parking else False,
"details": box.xpath(".//div[@class='item-detail-char']/span/text()").getall(),
"description": box.xpath(".//div[contains(@class, 'item-description')]/p/text()").get().replace('\n', ''),
"tags": box.xpath(".//div[@class='listing-tags-container']/span/text()").getall(),
"listing_company": box.xpath(".//picture[@class='logo-branding']/a/@title").get(),
"listing_company_url": "https://www.idealista.com" + company_url if company_url else None
})
return {"max_pages": max_pages, "search_data": search_data}
async def scrape_search(url: str, max_scrape_pages: int = None) -> List[Dict]:
"""scrape Idealista search results"""
first_page = await SCRAPFLY.async_scrape(ScrapeConfig(url, asp=True, country="ES"))
data = parse_search_data(first_page)
search_data = data["search_data"]
max_pages = data["max_pages"]
# get the number of total pages to scrape
if max_scrape_pages and max_scrape_pages < max_pages:
max_pages = max_scrape_pages
# scrape the remaining pages concurrently
to_scrape = [
ScrapeConfig(url + f"pagina-{page}.htm", asp=True, country="ES")
for page in range(2, max_pages + 1)
]
log.info(f"scraping search pagination, {max_pages - 1} pages remaining")
async for response in SCRAPFLY.concurrent_scrape(to_scrape):
# skip invalid property pages
search_data.extend(parse_search_data(response)["search_data"])
log.success(f"scraped {len(search_data)} property listings from search pages")
return search_data
Run the code
if __name__ == "__main__":
search_data = asyncio.run(scrape_search(
url="https://www.idealista.com/en/venta-viviendas/marbella-malaga/con-chalets/",
# remove the max_scrape_pages parameter to scrape all pages
max_scrape_pages=3
))
with open("search_data.json", "w", encoding="utf-8") as file:
json.dump(search_data, file, indent=2, ensure_ascii=False)
Above, we define a parse_search_data utility that parses the HTML page using XPath selectors to extract the search results. The scrape_search function then handles pagination by requesting the first page, adding the remaining pages to a scraping list and scraping them concurrently.
Here's an example result from the above Idealista scraper:
Example output
[
{
"title": "Detached house in calle Verdi, 128 -27, Sierra Blanca, Marbella",
"link": "https://www.idealista.com/en/inmueble/105709329/",
"picture": "https://img4.idealista.com/blur/WEB_LISTING-M/0/id.pro.es.image.master/7e/17/de/1260664883.jpg",
"price": 12450000,
"currency": "€",
"parking_included": true,
"details": [
"6 bed.",
"774 m²"
],
"description": "Nestled within Marbella's prestigious Sierra Blanca community, Villa Verdi epitomises luxury and refinement, a testament to the artistry of AMES arquitectos. Set amidst lush greenery on a generous plot, this unique villa seamlessly merges Andalusian heritage with contemporary design, offering an unparalleled living exp",
"tags": [
"Sea views",
"Luxury",
"Villa"
],
"listing_company": "Solvilla",
"listing_company_url": "https://www.idealista.com/en/pro/solvilla/"
},
....
]
We scraped Idealista data from discovery, property, and search pages - all that's left is to scale our scraper. If we were to increase our scraping load, Idealista is very likely to block us, so let's take a look at how to avoid blocking using ScrapFly's web scraping API next.
Bypass Idealista Blocking with ScrapFly
As we've seen, scraping Idealista.com using Python is pretty straightforward, though when scraping at scale our scrapers are likely to be blocked or asked to solve captchas.
To take advantage of ScrapFly's API in our Idealista web scraper, all we need to do is replace our httpx session code with scrapfly-sdk client requests:
import httpx
response = httpx.get("some idealista.com URL")
# in ScrapFly SDK becomes
from scrapfly import ScrapflyClient, ScrapeConfig
client = ScrapflyClient("YOUR SCRAPFLY KEY")
result = client.scrape(ScrapeConfig(
"some Idealista.ocm url",
# we can select specific proxy country like Spain:
country="ES",
# and enable anti scraping protection bypass:
asp=True
))
For more on how to scrape Idealista.com using ScrapFly, see the Full Scraper Code section.
FAQ
To wrap this guide up, let's take a look at some frequently asked questions about web scraping Idealista.com data:
Is it legal to scrape Idealista.com?
Yes. Idealista.com's data is publicly available; we're not extracting anything personal or private. Scraping Idealista.com at slow, respectful rates is perfectly legal and ethical.
That being said, attention should be paid to GDPR compliance in the EU when scraping personal data (like the seller's name, phone number, etc.). For more, see our Is Web Scraping Legal? article.
Does Idealista.com have a public API?
No, Idealista.com (and its sister websites) do not offer a public API for property data. However, as seen in this guide, it's easy to scrape and crawl using a little bit of Python.
In this web scraping tutorial, we wrote a short Idealista scraper for real estate property data. We started by scraping a single property page and parsing details using CSS and XPath selectors.
Then, we've taken a look at how to find properties using Idealista's directory and search system. We wrote a small web crawler that can crawl and scrape all property listings in provided provinces of Spain.
Finally, we've taken a look at how to track new listings being posted on Idealista by creating a looping scraper that constantly checks for new listings.
For all of this, we used Python with the httpx and parsel packages, and to avoid being blocked we used ScrapFly's API, which smartly configures every web scraper connection to avoid blocking.
For more about ScrapFly, see our documentation and try it out for FREE!
Full Scraper Code
import re
import asyncio
import json
import math
from collections import defaultdict
from pathlib import Path
from typing import List
from urllib.parse import urljoin
from scrapfly import ScrapeApiResponse, ScrapeConfig, ScrapflyClient
scrapfly = ScrapflyClient(key="YOUR SCRAPFLY KEY", max_concurrency=2)
# -------------------------------------------------
# Property
# -------------------------------------------------
def parse_property(result: ScrapeApiResponse):
sel = result.selector
css = lambda x: result.selector.css(x).get("").strip()
css_all = lambda x: result.selector.css(x).getall()
data = {}
# Meta data
data["url"] = result.context["url"]
# Basic information
data['title'] = css("h1 .main-info__title-main::text")
data['location'] = css(".main-info__title-minor::text")
data['currency'] = css(".info-data-price::text")
data['price'] = int(css(".info-data-price span::text").replace(",", ""))
data['description'] = "\n".join(css_all("div.comment ::text")).strip()
data["updated"] = sel.xpath(
"//p[@class='stats-text']"
"[contains(text(),'updated on')]/text()"
).get("").split(" on ")[-1]
# Features
data["features"] = {}
# first we extract each feature block like "Basic Features" or "Amenities"
for feature_block in result.selector.css(".details-property-h3"):
# then for each block we extract all bullet points underneath them
label = feature_block.xpath("text()").get()
features = feature_block.xpath("following-sibling::div[1]//li")
data["features"][label] = [
''.join(feat.xpath(".//text()").getall()).strip()
for feat in features
]
# Images
# the images are tucked away in a javascript variable.
# We can use regular expressions to find the variable and parse it as a dictionary:
image_data = re.findall(r"fullScreenGalleryPics\s*:\s*(\[.+?\]),", result.scrape_result['content'])[0]
# we also need to replace unquoted keys to quoted keys (i.e. title -> "title"):
images = json.loads(re.sub(r'(\w+?):([^/])', r'"\1":\2', image_data))
data['images'] = defaultdict(list)
data['plans'] = []
for image in images:
url = urljoin(result.context['url'], image['imageUrl'])
if image['isPlan']:
data['plans'].append(url)
else:
data['images'][image['tag']].append(url)
return data
async def scrape_properties(urls: List[str]):
to_scrape = [ScrapeConfig(url=url, country="ES", asp=True) for url in urls]
results = []
async for result in scrapfly.concurrent_scrape(to_scrape):
results.append(parse_property(result))
return results
# -------------------------------------------------
# Search
# -------------------------------------------------
def parse_province(result: ScrapeApiResponse) -> List[str]:
"""parse province page for area search urls"""
urls = result.selector.css("#location_list li>a::attr(href)").getall()
return [urljoin(result.context["url"], url) for url in urls]
async def scrape_provinces(urls: List[str]) -> List[str]:
"""
Scrape province pages like:
https://www.idealista.com/en/venta-viviendas/malaga-provincia/con-chalets/municipios
for search page urls like:
https://www.idealista.com/en/venta-viviendas/marbella-malaga/con-chalets/
"""
to_scrape = [ScrapeConfig(url, country="ES", asp=True) for url in urls]
search_urls = []
async for result in scrapfly.concurrent_scrape(to_scrape):
search_urls.extend(parse_province(result))
return search_urls
def parse_search(result: ScrapeApiResponse) -> List[str]:
"""Parse search result page for 30 listings"""
urls = result.selector.css("article.item .item-link::attr(href)").getall()
return [urljoin(result.context["url"], url) for url in urls]
async def scrape_search(url: str, paginate=True, max_pages: int = 2) -> List[str]:
"""
Scrape search urls like:
https://www.idealista.com/en/venta-viviendas/marbella-malaga/con-chalets/
for property urls
"""
first_page = await scrapfly.async_scrape(ScrapeConfig(url=url, country="ES", asp=True))
property_urls = parse_search(first_page)
if not paginate:
return property_urls
total_results = first_page.selector.css("h1#h1-container").re(": (.+) houses")[0]
total_pages = math.ceil(int(total_results.replace(",", "")) / 30)
if total_pages > max_pages: # note idealista allows max 60 pages per search
print(f"search contains more than max page limit ({total_pages}/60)")
total_pages = max_pages
print(f"scraping {total_pages} of search results concurrently")
to_scrape = [
ScrapeConfig(
url=first_page.context["url"] + f"pagina-{page}.htm",
asp=True,
country="ES",
)
for page in range(2, total_pages + 1)
]
async for result in scrapfly.concurrent_scrape(to_scrape):
property_urls.extend(parse_search(result))
return property_urls
# -------------------------------------------------
# Track Search
# -------------------------------------------------
async def track_search(url: str, output: Path, interval=60):
"""Track Idealista.com results page, scrape new listings and append them as JSON to the output file"""
seen = set()
output.touch(exist_ok=True)
try:
while True:
properties = await scrape_search(url=url, paginate=False)
# check deduplication filter
properties = [prop for prop in properties if prop not in seen]
if properties:
# scrape properties and save to file - 1 property as JSON per line
results = await scrape_properties(properties)
with output.open("a") as f:
f.write("\n".join(json.dumps(property) for property in results))
# add seen to deduplication filter
for prop in properties:
seen.add(prop)
print(f"scraped {len(results)} new properties; waiting {interval} seconds")
await asyncio.sleep(interval)
except KeyboardInterrupt:
print("stopping price tracking")
async def run():
# scrape properties:
urls = ["https://www.idealista.com/en/inmueble/97028172/"]
result_properties = await scrape_properties(urls)
# find properties
result_search = await scrape_search("https://www.idealista.com/en/venta-viviendas/marbella-malaga/con-chalets/")
result_province = await scrape_provinces(["https://www.idealista.com/en/venta-viviendas/balears-illes/con-chalets/municipios"])
# track properties
await track_search(
"https://www.idealista.com/en/venta-viviendas/barcelona/eixample/?ordenado-por=fecha-publicacion-desc",
Path("new-properties.jsonl"),
)
if __name__ == "__main__":
asyncio.run(run())