In this web scraping tutorial, we'll be scraping idealista.com - the biggest real estate marketplace in Spain, Portugal and Italy.
In this guide, we'll be exploring real estate data scraping by taking a look at Idealista.com. We'll be scraping common property data points like property pricing, addresses, photos and agent phone numbers.
When it comes to web scraping, Idealista.com is a classic scraping target. To scrape it, we'll cover popular Python web scraping techniques like HTML parsing with CSS selectors and concurrent requests with asyncio.
Finally, we'll also cover tracking newly listed properties, giving us an upper hand in real estate discovery and bidding.
In this article, we'll focus on the Spanish version of the website (Idealista.com), though the Italian and Portuguese versions function the same way and our scraper code should work for those sources as well.
This tutorial covers popular web scraping techniques for education. Interacting with public servers requires diligence and respect, and here's a good summary of what not to do:
Do not scrape at rates that could damage the website.
Do not scrape data that's not available publicly.
Do not store PII of EU citizens who are protected by GDPR.
Do not repurpose entire public datasets, which can be illegal in some countries.
Scrapfly does not offer legal advice but these are good general rules to follow in web scraping, and for more you should consult a lawyer.
Why Scrape Idealista.com?
Idealista.com is one of the biggest real estate websites in Spain (as well as Italy and Portugal), making it the biggest public real estate dataset for these areas, containing fields like real estate prices, listing locations, sale dates and general property information.
This is valuable information for market analytics, the study of the housing industry, and a general competitor overview.
For more on scraping use cases see our extensive write-up Scraping Use Cases
Project Setup
In this tutorial, we'll be using Python with two community packages:
httpx - an HTTP client library which will let us communicate with Idealista.com's servers
parsel - an HTML parsing library which we'll use to extract data with CSS and XPath selectors
These packages can be easily installed via the pip install command:
$ pip install httpx parsel
Alternatively, feel free to swap httpx out with any other HTTP client package such as requests, as we'll only need basic HTTP functions which are almost interchangeable in every library. As for parsel, another great alternative is the beautifulsoup package.
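To illustrate that interchangeability, here's a minimal sketch of the same fetch-and-parse flow using requests and beautifulsoup4 instead. This is a synchronous example for illustration only - it reuses the property URL and selectors introduced later in this tutorial, and without further anti-blocking measures the request may well get blocked:

import requests
from bs4 import BeautifulSoup

# browser-like headers reduce (but don't eliminate) the chance of being blocked
headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36"}
response = requests.get("https://www.idealista.com/en/inmueble/97028172/", headers=headers)
soup = BeautifulSoup(response.text, "html.parser")
# BeautifulSoup's select_one() accepts the same CSS selectors we use with parsel
title = soup.select_one(".main-info__title-main").get_text(strip=True)
price = soup.select_one(".info-data-price span").get_text(strip=True)
print(title, price)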
Scraping Idealista Property Data
Let's start by taking a look at how to scrape Idealista for a single property. In later sections, we'll also take a look at how to find any properties and scrape them using this property scraper.
For example, let's take a look at a listing page and where all of the information is stored on it. We'll pick a random property listing, like:
For parsing data on Idealista, we'll be using CSS selectors, so let's mark up the fields we want to scrape:
Idealista is a pure HTML website with very convenient styling markup which we can take advantage of in our scraper. For example, if we right-click on the price and inspect the HTML element, we can see how clear the HTML structure is:
We can see that all of the data points are under clear class names like info-data-price for price or main-info__title-main for property name.
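To get a feel for how the selectors we'll write map onto this structure, here's a rough, illustrative sketch (simplified, made-up markup - not the site's verbatim HTML) parsed with parsel:

from parsel import Selector

# simplified, made-up markup mirroring the class names we saw in the inspector
html = """
<div class="main-info">
  <h1><span class="main-info__title-main">Penthouse for sale in La Dreta de l'Eixample</span></h1>
  <span class="main-info__title-minor">Eixample, Barcelona</span>
  <span class="info-data-price"><span>5,200,000</span> €</span>
</div>
"""
sel = Selector(text=html)
print(sel.css("h1 .main-info__title-main::text").get())    # property name
print(sel.css(".info-data-price span::text").get())         # price number
print(sel.css(".info-data-price::text").get().strip())      # currency symbol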
Let's scrape it:
Python
ScrapFly
import asyncio
import json
import re
from typing import Dict, List
from collections import defaultdict
from urllib.parse import urljoin
import httpx
from parsel import Selector
from typing_extensions import TypedDict
# Establish persistent HTTPX session with browser-like headers to avoid blocking
BASE_HEADERS = {
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
"accept-language": "en-US;en;q=0.9",
"accept-encoding": "gzip, deflate, br",
}
session = httpx.AsyncClient(headers=BASE_HEADERS, follow_redirects=True)
# type hints for expected results so we can visualize our scraper more easily:
class PropertyResult(TypedDict):
url: str
title: str
location: str
price: int
currency: str
description: str
updated: str
features: Dict[str, List[str]]
images: Dict[str, List[str]]
plans: List[str]
def parse_property(response: httpx.Response) -> PropertyResult:
"""parse Idealista.com property page"""
# load response's HTML tree for parsing:
selector = Selector(text=response.text)
css = lambda x: selector.css(x).get("").strip()
css_all = lambda x: selector.css(x).getall()
data = {}
# Meta data
data["url"] = str(response.url)
# Basic information
data['title'] = css("h1 .main-info__title-main::text")
data['location'] = css(".main-info__title-minor::text")
data['currency'] = css(".info-data-price::text")
data['price'] = int(css(".info-data-price span::text").replace(",", ""))
data['description'] = "\n".join(css_all("div.comment ::text")).strip()
data["updated"] = selector.xpath(
"//p[@class='stats-text']"
"[contains(text(),'updated on')]/text()"
).get("").split(" on ")[-1]
# Features
data["features"] = {}
# first we extract each feature block like "Basic Features" or "Amenities"
for feature_block in selector.css(".details-property-h2"):
# then for each block we extract all bullet points underneath them
label = feature_block.xpath("text()").get()
features = feature_block.xpath("following-sibling::div[1]//li")
data["features"][label] = [
''.join(feat.xpath(".//text()").getall()).strip()
for feat in features
]
# Images
# the images are tucked away in a javascript variable.
# We can use regular expressions to find the variable and parse it as a dictionary:
image_data = re.findall(r"fullScreenGalleryPics\s*:\s*(\[.+?\]),",
response.text
)[0]
# we also need to replace unquoted keys to quoted keys (i.e. title -> "title"):
images = json.loads(re.sub(r'(\w+?):([^/])', r'"\1":\2', image_data))
data['images'] = defaultdict(list)
data['plans'] = []
for image in images:
url = urljoin(str(response.url), image['imageUrl'])
if image['isPlan']:
data['plans'].append(url)
else:
data['images'][image['tag']].append(url)
return data
async def scrape_properties(urls: List[str]) -> List[PropertyResult]:
"""Scrape Idealista.com properties"""
properties = []
to_scrape = [session.get(url) for url in urls]
# tip: asyncio.as_completed allows concurrent scraping - super fast!
for response in asyncio.as_completed(to_scrape):
response = await response
print(response.status_code)
if response.status_code != 200:
print(f"can't scrape property: {response.url}")
continue
properties.append(parse_property(response))
return properties
import asyncio
import json
import re
from typing import Dict, List
from typing_extensions import TypedDict
from collections import defaultdict
from urllib.parse import urljoin
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
scrapfly = ScrapflyClient(key="Your ScrapFly Key")
# type hints for expected results so we can visualize our scraper more easily:
class PropertyResult(TypedDict):
url: str
title: str
location: str
price: int
currency: str
description: str
updated: str
features: Dict[str, List[str]]
images: Dict[str, List[str]]
plans: List[str]
def parse_property(response: ScrapeApiResponse) -> PropertyResult:
"""parse Idealista.com property page"""
# load response's HTML tree for parsing:
selector = response.selector
css = lambda x: selector.css(x).get("").strip()
css_all = lambda x: selector.css(x).getall()
data = {}
# Meta data
data["url"] = str(response.context["url"])
# Basic information
data['title'] = css("h1 .main-info__title-main::text")
data['location'] = css(".main-info__title-minor::text")
data['currency'] = css(".info-data-price::text")
data['price'] = int(css(".info-data-price span::text").replace(",", ""))
data['description'] = "\n".join(css_all("div.comment ::text")).strip()
data["updated"] = selector.xpath(
"//p[@class='stats-text']"
"[contains(text(),'updated on')]/text()"
).get("").split(" on ")[-1]
# Features
data["features"] = {}
# first we extract each feature block like "Basic Features" or "Amenities"
for feature_block in selector.css(".details-property-h2"):
# then for each block we extract all bullet points underneath them
label = feature_block.xpath("text()").get()
features = feature_block.xpath("following-sibling::div[1]//li")
data["features"][label] = [
''.join(feat.xpath(".//text()").getall()).strip()
for feat in features
]
# Images
# the images are tucked away in a javascript variable.
# We can use regular expressions to find the variable and parse it as a dictionary:
image_data = re.findall(r"fullScreenGalleryPics\s*:\s*(\[.+?\]),",
response.scrape_result['content']
)[0]
# we also need to replace unquoted keys to quoted keys (i.e. title -> "title"):
images = json.loads(re.sub(r'(\w+?):([^/])', r'"\1":\2', image_data))
data['images'] = defaultdict(list)
data['plans'] = []
for image in images:
url = urljoin(str(response.context["url"]), image['imageUrl'])
if image['isPlan']:
data['plans'].append(url)
else:
data['images'][image['tag']].append(url)
return data
async def scrape_properties(urls: List[str]) -> List[PropertyResult]:
"""Scrape Idealista.com properties"""
properties = []
to_scrape = [ScrapeConfig(url, asp=True, country="ES") for url in urls]
async for response in scrapfly.concurrent_scrape(to_scrape):
if response.upstream_status_code != 200:
print(f"can't scrape property: {response.context['url']}")
continue
properties.append(parse_property(response))
return properties
🧙 If you are experiencing errors while running the Python code tabs, it may be due to getting blocked. To bypass blocking, use the ScrapFly code tabs instead.
Run Code & Example Output
async def run():
urls = ["https://www.idealista.com/en/inmueble/97028172/"]
data = await scrape_properties(urls)
print(json.dumps(data, indent=2, ensure_ascii=False))
if __name__ == "__main__":
asyncio.run(run())
Which would result in a dataset similar to this:
[
{
"title": "Penthouse for sale in La Dreta de l'Eixample",
"location": "Eixample, Barcelona",
"price": 5200000,
"currency": "€",
"description": "This stunning penthouse hosts 269 m2 distributed across two floors and a turret with 360º exposures. With straight access from the main lift, we walk through a hall that leads to a large central space composed by the living area and a dining room and an access to a terrace at the same level. A full equipped and red lacquered kitchen, is directly connected to the dining room and features a large window framing Gaudi's masterpiece, Sagrada Familia. On the same floor there are three bedrooms, one en-suite and two double bedrooms with their own bathroom. All the rooms are exterior facing and are surrounded by terraces. Moreover, oversize windows allow for abundant light to stream across the interiors with high-ceilings. \nOn the upper floor we find a room with access to 200 m2 of terraces hosting chill-out areas, a swimming pool and a jacuzzi. In addition, an interior spiral staircase on the same floor, leads to a turret on a third level spanning 360º views over Barcelona. \nThe penthouse is well preserved with high quality finishes, air conditioning and heating, but it also offers the opportunity have the interiors renovated to contemporary standards, to convert it into one-of-a-kind piece in Barcelona city. \nContact us for more information or to arrange a viewing.",
"features": {
"Basic features": [
"367 m² built",
"5 bedrooms",
"4 bathrooms",
"Terrace",
"Second hand/good condition",
"Fitted wardrobes",
"Built in 1954"
],
"Building": [
"exterior",
"With lift"
],
"Amenities": [
"Air conditioning",
"Swimming pool"
],
"Energy performance certificate": [
"Not indicated"
]
},
"updated": "2 November",
"url": "https://www.idealista.com/en/inmueble/97028172/",
"images": {
"Communal areas": [
"https://www.idealista.com/inmueble/97028172/foto/1/",
"https://www.idealista.com/inmueble/97028172/foto/3/",
"https://www.idealista.com/inmueble/97028172/foto/5/",
"https://www.idealista.com/inmueble/97028172/foto/6/",
"https://www.idealista.com/inmueble/97028172/foto/9/",
"https://www.idealista.com/inmueble/97028172/foto/10/",
"https://www.idealista.com/inmueble/97028172/foto/11/"
],
"Swimming pool": [
"https://www.idealista.com/inmueble/97028172/foto/2/",
"https://www.idealista.com/inmueble/97028172/foto/4/",
"https://www.idealista.com/inmueble/97028172/foto/7/",
"https://www.idealista.com/inmueble/97028172/foto/8/"
],
"Views": [
"https://www.idealista.com/inmueble/97028172/foto/12/",
"https://www.idealista.com/inmueble/97028172/foto/28/",
"https://www.idealista.com/inmueble/97028172/foto/48/"
],
"Living room": [
"https://www.idealista.com/inmueble/97028172/foto/13/",
"https://www.idealista.com/inmueble/97028172/foto/14/",
"https://www.idealista.com/inmueble/97028172/foto/16/",
"https://www.idealista.com/inmueble/97028172/foto/17/",
"https://www.idealista.com/inmueble/97028172/foto/18/",
"https://www.idealista.com/inmueble/97028172/foto/19/"
],
"Dining room": [
"https://www.idealista.com/inmueble/97028172/foto/15/",
"https://www.idealista.com/inmueble/97028172/foto/25/"
],
"Terrace": [
"https://www.idealista.com/inmueble/97028172/foto/20/",
"https://www.idealista.com/inmueble/97028172/foto/21/",
"https://www.idealista.com/inmueble/97028172/foto/22/",
"https://www.idealista.com/inmueble/97028172/foto/24/",
"https://www.idealista.com/inmueble/97028172/foto/36/",
"https://www.idealista.com/inmueble/97028172/foto/40/",
"https://www.idealista.com/inmueble/97028172/foto/41/",
"https://www.idealista.com/inmueble/97028172/foto/42/"
],
"Bedroom": [
"https://www.idealista.com/inmueble/97028172/foto/23/",
"https://www.idealista.com/inmueble/97028172/foto/31/",
"https://www.idealista.com/inmueble/97028172/foto/34/",
"https://www.idealista.com/inmueble/97028172/foto/35/",
"https://www.idealista.com/inmueble/97028172/foto/38/",
"https://www.idealista.com/inmueble/97028172/foto/39/",
"https://www.idealista.com/inmueble/97028172/foto/43/"
],
"Kitchen": [
"https://www.idealista.com/inmueble/97028172/foto/26/",
"https://www.idealista.com/inmueble/97028172/foto/27/",
"https://www.idealista.com/inmueble/97028172/foto/29/",
"https://www.idealista.com/inmueble/97028172/foto/30/"
],
"Bathroom": [
"https://www.idealista.com/inmueble/97028172/foto/32/",
"https://www.idealista.com/inmueble/97028172/foto/37/",
"https://www.idealista.com/inmueble/97028172/foto/44/"
],
"Office": [
"https://www.idealista.com/inmueble/97028172/foto/33/",
"https://www.idealista.com/inmueble/97028172/foto/46/"
],
"Staircase": [
"https://www.idealista.com/inmueble/97028172/foto/45/",
"https://www.idealista.com/inmueble/97028172/foto/47/"
],
"Reception": [
"https://www.idealista.com/inmueble/97028172/foto/49/"
]
},
"plans": [
"https://www.idealista.com/inmueble/97028172/foto/50/",
"https://www.idealista.com/inmueble/97028172/foto/51/"
]
}
]
In this demonstration, we used a few CSS and XPath selectors with parsel to extract property details like the price, description and features.
However, the images are where things get a bit more complex. For image carousels, many websites use javascript to generate dynamic HTML on demand. Idealista is no exception: it hides all of the image URLs in a javascript variable and renders the carousel with javascript.
To scrape this, we used a regular expression pattern to find the hidden javascript variable, loaded it as a Python dictionary object and parsed out the image and floor plan URLs.
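To make that step concrete, here's a small standalone sketch of the technique applied to a made-up snippet of such a javascript variable (the real variable is much longer, but the parsing steps are identical):

import re
import json

# made-up example of the kind of javascript embedded in the page source
page_source = 'var config = { fullScreenGalleryPics : [{imageUrl:"/inmueble/97028172/foto/1/",isPlan:false,tag:"Views"}], other: 1 };'
# 1. extract the array with a regular expression
image_data = re.findall(r"fullScreenGalleryPics\s*:\s*(\[.+?\]),", page_source)[0]
# 2. quote the bare javascript keys (imageUrl -> "imageUrl") so json.loads can read it;
#    the [^/] guard prevents mangling the "//" in absolute urls like https://
quoted = re.sub(r'(\w+?):([^/])', r'"\1":\2', image_data)
print(json.loads(quoted))
# [{'imageUrl': '/inmueble/97028172/foto/1/', 'isPlan': False, 'tag': 'Views'}]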
For the scraping itself, we used the asynchronous capabilities of httpx and asyncio.as_completed to schedule multiple properties concurrently, making our scraper super fast!
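As a stripped-down, generic illustration of that concurrency pattern (placeholder URLs, not Idealista), scheduling requests with httpx and asyncio.as_completed looks roughly like this:

import asyncio
import httpx

async def fetch_all(urls):
    async with httpx.AsyncClient() as client:
        # start all requests at once and handle each response as soon as it finishes
        tasks = [client.get(url) for url in urls]
        statuses = []
        for task in asyncio.as_completed(tasks):
            response = await task
            statuses.append(response.status_code)
        return statuses

# example usage with placeholder URLs:
# asyncio.run(fetch_all(["https://httpbin.dev/html", "https://httpbin.dev/json"]))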
Next, let's take a look at how we can scale up this scraper by implementing exploration functionality.
Finding Idealista Properties
There are several ways to find properties listed on Idealista. The most popular and reliable is to explore by area. In this section, we'll take a look at how to scrape property listings with a little bit of crawling - we'll explore the location directory.
To find the location directory we can scroll to the bottom of the page:
Each link leads to a province listing URL which further leads to area listings URLs. We can easily scrape this with the same CSS selector technique we've used previously:
Python
ScrapFly
def parse_province(response: httpx.Response) -> List[str]:
"""parse province page for area search urls"""
selector = Selector(text=response.text)
urls = selector.css("#location_list li>a::attr(href)").getall()
return [urljoin(str(response.url), url) for url in urls]
async def scrape_provinces(urls: List[str]) -> List[str]:
"""
Scrape province pages like:
https://www.idealista.com/en/venta-viviendas/malaga-provincia/con-chalets/municipios
for search page urls like:
https://www.idealista.com/en/venta-viviendas/marbella-malaga/con-chalets/
"""
to_scrape = [session.get(url) for url in urls]
search_urls = []
for response in asyncio.as_completed(to_scrape):
search_urls.extend(parse_province(await response))
return search_urls
def parse_province(response: ScrapeApiResponse) -> List[str]:
"""parse province page for area search urls"""
selector = response.selector
urls = selector.css("#location_list li>a::attr(href)").getall()
return [urljoin(str(response.context["url"]), url) for url in urls]
async def scrape_provinces(urls: List[str]) -> List[str]:
"""
Scrape province pages like:
https://www.idealista.com/en/venta-viviendas/malaga-provincia/con-chalets/municipios
for search page urls like:
https://www.idealista.com/en/venta-viviendas/marbella-malaga/con-chalets/
"""
to_scrape = [ScrapeConfig(url, asp=True, country="ES") for url in urls]
search_urls = []
async for response in scrapfly.concurrent_scrape(to_scrape):
search_urls.extend(parse_province(response))
return search_urls
Run Code & Example Output
async def run():
data = await scrape_provinces([
"https://www.idealista.com/en/venta-viviendas/balears-illes/con-chalets/municipios"
])
print(json.dumps(data, indent=2))
if __name__ == "__main__":
asyncio.run(run())
This scraper will scrape all area pages for the given provinces. To discover all property listings, all we'd have to do is scrape all provinces. Next, let's scrape the search results page itself:
Python
ScrapFly
import math  # the page-count calculation below needs math.ceil
def parse_search(response: httpx.Response) -> List[str]:
"""Parse search result page for 30 listings"""
selector = Selector(text=response.text)
urls = selector.css("article.item .item-link::attr(href)").getall()
return [urljoin(str(response.url), url) for url in urls]
async def scrape_search(url: str, paginate=True, max_pages: int = None) -> List[str]:
"""
Scrape search urls like:
https://www.idealista.com/en/venta-viviendas/marbella-malaga/con-chalets/
for property urls
"""
first_page = await session.get(url)
property_urls = parse_search(first_page)
if not paginate:
return property_urls
total_results = Selector(text=first_page.text).css("h1#h1-container").re(": (.+) houses")[0]
total_pages = math.ceil(int(total_results.replace(",", "")) / 30)
if total_pages > 60:
print(f"search contains more than max page limit ({total_pages}/60)")
total_pages = 60
# scrape all available pages unless max_pages caps the search
if max_pages and max_pages < total_pages:
total_pages = max_pages
print(f"scraping {total_pages} of search results concurrently")
to_scrape = [
session.get(str(first_page.url) + f"pagina-{page}.htm")
for page in range(2, total_pages + 1)
]
for response in asyncio.as_completed(to_scrape):
property_urls.extend(parse_search(await response))
return property_urls
def parse_search(response: ScrapeApiResponse) -> List[str]:
"""Parse search result page for 30 listings"""
selector = response.selector
urls = selector.css("article.item .item-link::attr(href)").getall()
return [urljoin(str(response.context["url"]), url) for url in urls]
async def scrape_search(url: str, paginate=True, max_pages: int = None) -> List[str]:
"""
Scrape search urls like:
https://www.idealista.com/en/venta-viviendas/marbella-malaga/con-chalets/
for property urls
"""
first_page = await scrapfly.async_scrape(ScrapeConfig(url, asp=True, country="ES"))
property_urls = parse_search(first_page)
if not paginate:
return property_urls
total_results = first_page.selector.css("h1#h1-container").re(": (.+) houses")[0]
total_pages = math.ceil(int(total_results.replace(",", "")) / 30)
if total_pages > 60:
print(f"search contains more than max page limit ({total_pages}/60)")
total_pages = 60
# scrape all available pages unless max_pages caps the search
if max_pages and max_pages < total_pages:
total_pages = max_pages
print(f"scraping {total_pages} of search results concurrently")
to_scrape = [
ScrapeConfig(first_page.context["url"] + f"pagina-{page}.htm", asp=True, country="ES")
for page in range(2, total_pages + 1)
]
async for response in scrapfly.concurrent_scrape(to_scrape):
property_urls.extend(parse_search(response))
return property_urls
Run Code & Example Output
async def run():
data = await scrape_search(
url="https://www.idealista.com/en/venta-viviendas/marbella-malaga/con-chalets/",
max_pages=1
)
print(json.dumps(data, indent=2))
if __name__ == "__main__":
asyncio.run(run())
For scraping paginated content like the area result pages, we first scrape the first page to extract the total result count. Then, we can scrape the remaining pages concurrently, retrieving all listings in just a few seconds!
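To illustrate how these pieces fit together, here's a rough sketch (using the functions from the Python tabs above; the province URL is just an example) that chains province discovery, area search and property scraping:

async def scrape_full_pipeline():
    # 1. discover area search URLs for a province
    search_urls = await scrape_provinces([
        "https://www.idealista.com/en/venta-viviendas/balears-illes/con-chalets/municipios"
    ])
    # 2. collect property URLs from the first page of each area search
    property_urls = []
    for search_url in search_urls:
        property_urls.extend(await scrape_search(search_url, max_pages=1))
    # 3. scrape the full property details for every discovered listing
    return await scrape_properties(property_urls)

# asyncio.run(scrape_full_pipeline())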
With this discovery scraper and our previous property scraper we can collect all of the existing real estate data on Idealista.com - though what if we want to be the first to know about new property listings? Next, let's take a look at how we can scrape Idealista's search results.
How to Scrape Idealista Search
In this section, we'll scrape Idealista's search pages. These search pages enable finding specific property listings and sorting them. For example, let's find properties in Malaga, Spain:
To build an Idealista scraper for search pages, we'll request search pages and parse their results while incrementing the pagina parameter for pagination:
Python
ScrapFly
import json
import math
import httpx
import asyncio
from typing import Dict, List
from parsel import Selector
# Establish persistent HTTPX session with browser-like headers to avoid blocking
BASE_HEADERS = {
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
"accept-language": "en-US;en;q=0.9",
"accept-encoding": "gzip, deflate, br",
}
session = httpx.AsyncClient(headers=BASE_HEADERS, follow_redirects=True)
def parse_search_data(response: httpx.Response) -> Dict:
"""parse search result data"""
selector = Selector(response.text)
total_results = selector.css("h1#h1-container").re(": (.+) houses")[0]
max_pages = math.ceil(int(total_results.replace(",", "")) / 30)
max_pages = 60 if max_pages > 60 else max_pages
search_data = []
for box in selector.xpath("//section[contains(@class, 'items-list')]/article[contains(@class, 'item')]"):
ad = box.xpath(".//p[@class='adv_txt']") # ignore ad listings
if ad:
continue
price = box.xpath(".//span[contains(@class, 'item-price')]/text()").get()
parking = box.xpath(".//span[@class='item-parking']").get()
company_url = box.xpath(".//picture[@class='logo-branding']/a/@href").get()
search_data.append({
"title": box.xpath(".//div/a/@title").get(),
"link": "https://www.idealista.com" + box.xpath(".//div/a/@href").get(),
"picture": box.xpath(".//img/@src").get(),
"price": int(price.replace(",", '')) if price else None,
"currency": box.xpath(".//span[contains(@class, 'item-price')]/span/text()").get(),
"parking_included": True if parking else False,
"details": box.xpath(".//div[@class='item-detail-char']/span/text()").getall(),
"description": box.xpath(".//div[contains(@class, 'item-description')]/p/text()").get().replace('\n', ''),
"tags": box.xpath(".//div[@class='listing-tags-container']/span/text()").getall(),
"listing_company": box.xpath(".//picture[@class='logo-branding']/a/@title").get(),
"listing_company_url": "https://www.idealista.com" + company_url if company_url else None
})
return {"max_pages": max_pages, "search_data": search_data}
async def scrape_search(url: str, max_scrape_pages: int = None) -> List[Dict]:
"""scrape Idealista search results"""
first_page = await session.get(url)
assert first_page.status_code == 200, "request is blocked, use ScrapFly code tab"
data = parse_search_data(first_page)
search_data = data["search_data"]
max_pages = data["max_pages"]
# get the number of total pages to scrape
if max_scrape_pages and max_scrape_pages < max_pages:
max_pages = max_scrape_pages
# scrape the remaining pages concurrently
to_scrape = [
session.get(url + f"pagina-{page}.htm")
for page in range(2, max_pages + 1)
]
print(f"scraping search pagination, {max_pages - 1} pages remaining")
for response in asyncio.as_completed(to_scrape):
search_data.extend(parse_search_data(await response)["search_data"])
print(f"scraped {len(search_data)} property listings from search pages")
return search_data
import json
import math
import asyncio
from typing import Dict, List
from loguru import logger as log
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
SCRAPFLY = ScrapflyClient(key="Your ScrapFly API key")
def parse_search_data(response: ScrapeApiResponse) -> Dict:
"""parse search result data"""
selector = response.selector
total_results = selector.css("h1#h1-container").re(": (.+) houses")[0]
max_pages = math.ceil(int(total_results.replace(",", "")) / 30)
max_pages = 60 if max_pages > 60 else max_pages
search_data = []
for box in selector.xpath("//section[contains(@class, 'items-list')]/article[contains(@class, 'item')]"):
ad = box.xpath(".//p[@class='adv_txt']") # ignore ad listings
if ad:
continue
price = box.xpath(".//span[contains(@class, 'item-price')]/text()").get()
parking = box.xpath(".//span[@class='item-parking']").get()
company_url = box.xpath(".//picture[@class='logo-branding']/a/@href").get()
search_data.append({
"title": box.xpath(".//div/a/@title").get(),
"link": "https://www.idealista.com" + box.xpath(".//div/a/@href").get(),
"picture": box.xpath(".//img/@src").get(),
"price": int(price.replace(",", '')) if price else None,
"currency": box.xpath(".//span[contains(@class, 'item-price')]/span/text()").get(),
"parking_included": True if parking else False,
"details": box.xpath(".//div[@class='item-detail-char']/span/text()").getall(),
"description": box.xpath(".//div[contains(@class, 'item-description')]/p/text()").get().replace('\n', ''),
"tags": box.xpath(".//div[@class='listing-tags-container']/span/text()").getall(),
"listing_company": box.xpath(".//picture[@class='logo-branding']/a/@title").get(),
"listing_company_url": "https://www.idealista.com" + company_url if company_url else None
})
return {"max_pages": max_pages, "search_data": search_data}
async def scrape_search(url: str, max_scrape_pages: int = None) -> List[Dict]:
"""scrape Idealista search results"""
first_page = await SCRAPFLY.async_scrape(ScrapeConfig(url, asp=True, country="ES"))
data = parse_search_data(first_page)
search_data = data["search_data"]
max_pages = data["max_pages"]
# get the number of total pages to scrape
if max_scrape_pages and max_scrape_pages < max_pages:
max_pages = max_scrape_pages
# scrape the remaining pages concurrently
to_scrape = [
ScrapeConfig(url + f"pagina-{page}.htm", asp=True, country="ES")
for page in range(2, max_pages + 1)
]
log.info(f"scraping search pagination, {max_pages - 1} pages remaining")
async for response in SCRAPFLY.concurrent_scrape(to_scrape):
# skip invalid property pages
search_data.extend(parse_search_data(response)["search_data"])
log.success(f"scraped {len(search_data)} property listings from search pages")
return search_data
Run the code
if __name__ == "__main__":
search_data = asyncio.run(scrape_search(
url="https://www.idealista.com/en/venta-viviendas/marbella-malaga/con-chalets/",
# remove the max_scrape_pages parameter to scrape all pages
max_scrape_pages=3
))
with open("search_data.json", "w", encoding="utf-8") as file:
json.dump(search_data, file, indent=2, ensure_ascii=False)
Above, we define a parse_search_data utility that parses the HTML page using XPath selectors to extract the search results. The scrape_search function then handles pagination by requesting the first page, adding the remaining pages to a scraping list and scraping them concurrently.
Here's an example result from the above Idealista scraper:
Example output
[
{
"title": "Detached house in calle Verdi, 128 -27, Sierra Blanca, Marbella",
"link": "https://www.idealista.com/en/inmueble/105709329/",
"picture": "https://img4.idealista.com/blur/WEB_LISTING-M/0/id.pro.es.image.master/7e/17/de/1260664883.jpg",
"price": 12450000,
"currency": "€",
"parking_included": true,
"details": [
"6 bed.",
"774 m²"
],
"description": "Nestled within Marbella's prestigious Sierra Blanca community, Villa Verdi epitomises luxury and refinement, a testament to the artistry of AMES arquitectos. Set amidst lush greenery on a generous plot, this unique villa seamlessly merges Andalusian heritage with contemporary design, offering an unparalleled living exp",
"tags": [
"Sea views",
"Luxury",
"Villa"
],
"listing_company": "Solvilla",
"listing_company_url": "https://www.idealista.com/en/pro/solvilla/"
},
....
]
We scraped Idealista data from discovery, property, and search pages - all that's left is to scale our scraper. If we were to increase our scraping load, Idealista is very likely to block us, so let's take a look at how to avoid blocking using ScrapFly's web scraping API next.
Bypass Idealista Blocking with ScrapFly
As we've seen, scraping Idealista.com using Python is pretty straightforward, though when scraping at scale our scrapers are likely to be blocked or asked to solve captchas.
To take advantage of ScrapFly's API in our Idealista web scraper, all we need to do is replace our httpx session code with scrapfly-sdk client requests:
import httpx
response = httpx.get("some idealista.com URL")
# in ScrapFly SDK becomes
from scrapfly import ScrapflyClient, ScrapeConfig
client = ScrapflyClient("YOUR SCRAPFLY KEY")
result = client.scrape(ScrapeConfig(
"some Idealista.ocm url",
# we can select specific proxy country like Spain:
country="ES",
# and enable anti scraping protection bypass:
asp=True
))
For more on how to scrape Idealista.com using ScrapFly, see the Full Scraper Code section.
FAQ
To wrap this guide up, let's take a look at some frequently asked questions about web scraping Idealista.com data:
Is it legal to scrape Idealista.com?
Yes. Idealista.com's data is publicly available; we're not extracting anything personal or private. Scraping Idealista.com at slow, respectful rates is perfectly legal and ethical.
That being said, attention should be paid to GDPR compliance in the EU when scraping personal data (like the seller's name, phone number, etc.). For more, see our Is Web Scraping Legal? article.
Does Idealista.com have a public API?
No, Idealista.com (and its sister websites) do not offer a public API for property data. However, as seen in this guide, it's easy to scrape and crawl using a little bit of Python.
In this web scraping tutorial, we wrote a short Idealista scraper for real estate property data. We started by scraping a single property page and parsing details using CSS and XPath selectors.
Then, we've taken a look at how to find properties using Idealista's directory and search system. We wrote a small web crawler that can crawl and scrape all property listings in provided provinces of Spain.
Finally, we've taken a look at how to track new listings being posted on Idealista by creating a looping scraper that constantly checks for new listings.
For all of this, we used Python with the httpx and parsel packages, and to avoid being blocked we used ScrapFly's API, which smartly configures every web scraper connection to avoid blocking.
For more about ScrapFly, see our documentation and try it out for FREE!
Full Scraper Code
import re
import asyncio
import json
import math
from collections import defaultdict
from pathlib import Path
from typing import List
from urllib.parse import urljoin
from scrapfly import ScrapeApiResponse, ScrapeConfig, ScrapflyClient
scrapfly = ScrapflyClient(key="YOUR SCRAPFLY KEY", max_concurrency=2)
# -------------------------------------------------
# Property
# -------------------------------------------------
def parse_property(result: ScrapeApiResponse):
sel = result.selector
css = lambda x: result.selector.css(x).get("").strip()
css_all = lambda x: result.selector.css(x).getall()
data = {}
# Meta data
data["url"] = result.context["url"]
# Basic information
data['title'] = css("h1 .main-info__title-main::text")
data['location'] = css(".main-info__title-minor::text")
data['currency'] = css(".info-data-price::text")
data['price'] = int(css(".info-data-price span::text").replace(",", ""))
data['description'] = "\n".join(css_all("div.comment ::text")).strip()
data["updated"] = sel.xpath(
"//p[@class='stats-text']"
"[contains(text(),'updated on')]/text()"
).get("").split(" on ")[-1]
# Features
data["features"] = {}
# first we extract each feature block like "Basic Features" or "Amenities"
for feature_block in result.selector.css(".details-property-h3"):
# then for each block we extract all bullet points underneath them
label = feature_block.xpath("text()").get()
features = feature_block.xpath("following-sibling::div[1]//li")
data["features"][label] = [
''.join(feat.xpath(".//text()").getall()).strip()
for feat in features
]
# Images
# the images are tucked away in a javascript variable.
# We can use regular expressions to find the variable and parse it as a dictionary:
image_data = re.findall(r"fullScreenGalleryPics\s*:\s*(\[.+?\]),", result.scrape_result['content'])[0]
# we also need to replace unquoted keys to quoted keys (i.e. title -> "title"):
images = json.loads(re.sub(r'(\w+?):([^/])', r'"\1":\2', image_data))
data['images'] = defaultdict(list)
data['plans'] = []
for image in images:
url = urljoin(result.context['url'], image['imageUrl'])
if image['isPlan']:
data['plans'].append(url)
else:
data['images'][image['tag']].append(url)
return data
async def scrape_properties(urls: List[str]):
to_scrape = [ScrapeConfig(url=url, country="ES", asp=True) for url in urls]
results = []
async for result in scrapfly.concurrent_scrape(to_scrape):
results.append(parse_property(result))
return results
# -------------------------------------------------
# Search
# -------------------------------------------------
def parse_province(result: ScrapeApiResponse) -> List[str]:
"""parse province page for area search urls"""
urls = result.selector.css("#location_list li>a::attr(href)").getall()
return [urljoin(result.context["url"], url) for url in urls]
async def scrape_provinces(urls: List[str]) -> List[str]:
"""
Scrape province pages like:
https://www.idealista.com/en/venta-viviendas/malaga-provincia/con-chalets/municipios
for search page urls like:
https://www.idealista.com/en/venta-viviendas/marbella-malaga/con-chalets/
"""
to_scrape = [ScrapeConfig(url, country="ES", asp=True) for url in urls]
search_urls = []
async for result in scrapfly.concurrent_scrape(to_scrape):
search_urls.extend(parse_province(result))
return search_urls
def parse_search(result: ScrapeApiResponse) -> List[str]:
"""Parse search result page for 30 listings"""
urls = result.selector.css("article.item .item-link::attr(href)").getall()
return [urljoin(result.context["url"], url) for url in urls]
async def scrape_search(url: str, paginate=True, max_pages: int = 2) -> List[str]:
"""
Scrape search urls like:
https://www.idealista.com/en/venta-viviendas/marbella-malaga/con-chalets/
for property urls
"""
first_page = await scrapfly.async_scrape(ScrapeConfig(url=url, country="ES", asp=True))
property_urls = parse_search(first_page)
if not paginate:
return property_urls
total_results = first_page.selector.css("h1#h1-container").re(": (.+) houses")[0]
total_pages = math.ceil(int(total_results.replace(",", "")) / 30)
if total_pages > max_pages: # note idealista allows max 60 pages per search
print(f"search contains more than max page limit ({total_pages}/60)")
total_pages = max_pages
print(f"scraping {total_pages} of search results concurrently")
to_scrape = [
ScrapeConfig(
url=first_page.context["url"] + f"pagina-{page}.htm",
asp=True,
country="ES",
)
for page in range(2, total_pages + 1)
]
async for result in scrapfly.concurrent_scrape(to_scrape):
property_urls.extend(parse_search(result))
return property_urls
# -------------------------------------------------
# Track Search
# -------------------------------------------------
async def track_search(url: str, output: Path, interval=60):
"""Track Idealista.com results page, scrape new listings and append them as JSON to the output file"""
seen = set()
output.touch(exist_ok=True)
try:
while True:
properties = await scrape_search(url=url, paginate=False)
# check deduplication filter
properties = [prop for prop in properties if prop not in seen]
if properties:
# scrape properties and save to file - 1 property as JSON per line
results = await scrape_properties(properties)
with output.open("a") as f:
f.write("\n".join(json.dumps(property) for property in results))
# add seen to deduplication filter
for prop in properties:
seen.add(prop)
print(f"scraped {len(results)} new properties; waiting {interval} seconds")
await asyncio.sleep(interval)
except KeyboardInterrupt:
print("stopping price tracking")
async def run():
# scrape properties:
urls = ["https://www.idealista.com/en/inmueble/97028172/"]
result_properties = await scrape_properties(urls)
# find properties
result_search = await scrape_search("https://www.idealista.com/en/venta-viviendas/marbella-malaga/con-chalets/")
result_province = await scrape_provinces(["https://www.idealista.com/en/venta-viviendas/balears-illes/con-chalets/municipios"])
# track properties
await track_search(
"https://www.idealista.com/en/venta-viviendas/barcelona/eixample/?ordenado-por=fecha-publicacion-desc",
Path("new-properties.jsonl"),
)
if __name__ == "__main__":
asyncio.run(run())