How to Scrape Idealista.com in Python - Real Estate Property Data


In this web scraping tutorial, we'll be scraping Idealista.com - the biggest real estate marketplace in Spain, Portugal and Italy. We'll extract common property data points like property pricing, addresses, photos and agent phone numbers.

Idealista.com is a classic web scraping target. To scrape it, we'll cover popular Python web scraping techniques like HTML parsing with CSS selectors and concurrent requests with asyncio.

Finally, we'll also cover tracking newly listed properties - giving us an upper hand in real estate discovery and bidding.

In this article, we'll focus on the Spanish version of the website (Idealista.com), though the Italian and Portuguese versions function the same way and our scraper code should work for them as well.

Latest Idealista.com Scraper Code

https://github.com/scrapfly/scrapfly-scrapers/

Why Scrape Idealista.com?

Idealista.com is one of the biggest real estate websites in Spain (as well as Italy and Portugal), making it the biggest public real estate dataset for these areas. It contains fields like real estate prices, listing locations, sale dates and general property information.

This is valuable information for market analytics, the study of the housing industry, and a general competitor overview.

For more on scraping use cases see our extensive write-up Scraping Use Cases

Project Setup

In this tutorial, we'll be using Python with two community packages:

  • httpx - HTTP client library which will let us communicate with Idealista.com's servers
  • parsel - HTML parsing library which will help us to parse our web scraped HTML files using CSS selectors and XPath selectors.

These packages can be easily installed via the pip install command:

$ pip install httpx parsel

Alternatively, feel free to swap httpx out for any other HTTP client package, such as requests, as we'll only need basic HTTP functions which are almost interchangeable across libraries. As for parsel, another great alternative is the beautifulsoup package.
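As a quick illustration, here's a minimal sketch of the same fetch-and-parse step with either pair of libraries. The URL is just a placeholder test page (not Idealista), and the requests and beautifulsoup4 packages would need to be installed separately:

import httpx
import requests
from parsel import Selector
from bs4 import BeautifulSoup

url = "https://httpbin.dev/html"  # placeholder test page

# httpx + parsel (used throughout this tutorial)
html = httpx.get(url).text
heading = Selector(text=html).css("h1::text").get()

# requests + beautifulsoup equivalents
html = requests.get(url).text
heading = BeautifulSoup(html, "html.parser").select_one("h1").get_text()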

Scraping Idealista Property Data

Let's start by taking a look at how to scrape Idealista for a single property. In later sections, we'll also cover how to find properties and scrape them using this property scraper.

First, let's take a look at the listing page and where all of the information is stored on it. We'll pick a random property listing, like:

idealista.com/en/inmueble/94156485/

For parsing data on Idealista, we'll be using CSS selectors, so let's markup the fields we want to scrape:

screenshot and markup of idealista property page
We'll scrape fields highlighted in blue in this example

Idealista is a pure HTML website with very convenient styling markup which we can take advantage of in our scraper. For example, if we right-click on the price and inspect the HTML element, we can see how clear the HTML structure is:

illustration of idealista's source page

We can see that all of the data points are under clear class names like info-data-price for price or main-info__title-main for property name.
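As a small stepping stone before the full scraper, here's a hypothetical sketch of how these class names translate into parsel selectors. The markup below is made up to mirror the selectors we'll use, not copied from an actual Idealista page:

from parsel import Selector

html = """
<h1><span class="main-info__title-main">Penthouse for sale in Eixample</span></h1>
<span class="info-data-price"><span>5,200,000</span> €</span>
"""
sel = Selector(text=html)
print(sel.css(".main-info__title-main::text").get())  # property name
print(sel.css(".info-data-price span::text").get())   # price number
print(sel.css(".info-data-price::text").get())        # currency symbol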

Parsing HTML with CSS Selectors

For more on HTML parsing using CSS selectors see our interactive introduction article.

Parsing HTML with CSS Selectors

Let's scrape it:

Python
ScrapFly
import asyncio
import json
import re
from typing import Dict, List
from collections import defaultdict 
from urllib.parse import urljoin
import httpx
from parsel import Selector
from typing_extensions import TypedDict

# Establish a persistent HTTPX session with browser-like headers to avoid blocking
BASE_HEADERS = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US;en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
}
session = httpx.AsyncClient(headers=BASE_HEADERS, follow_redirects=True)

# type hints for the expected results so we can visualize our scraper more easily:
class PropertyResult(TypedDict):
    url: str
    title: str
    location: str
    price: int
    currency: str
    description: str
    updated: str
    features: Dict[str, List[str]]
    images: Dict[str, List[str]]
    plans: List[str]


def parse_property(response: httpx.Response) -> PropertyResult:
    """parse Idealista.com property page"""
    # load response's HTML tree for parsing:
    selector = Selector(text=response.text)
    css = lambda x: selector.css(x).get("").strip()
    css_all = lambda x: selector.css(x).getall()

    data = {}
    # Meta data
    data["url"] = str(response.url)

    # Basic information
    data['title'] = css("h1 .main-info__title-main::text")
    data['location'] = css(".main-info__title-minor::text")
    data['currency'] = css(".info-data-price::text")
    data['price'] = int(css(".info-data-price span::text").replace(",", ""))
    data['description'] = "\n".join(css_all("div.comment ::text")).strip()
    data["updated"] = selector.xpath(
        "//p[@class='stats-text']"
        "[contains(text(),'updated on')]/text()"
    ).get("").split(" on ")[-1]

    # Features
    data["features"] = {}
    #  first we extract each feature block like "Basic Features" or "Amenities"
    for feature_block in selector.css(".details-property-h2"):
        # then for each block we extract all bullet points underneath them
        label = feature_block.xpath("text()").get()
        features = feature_block.xpath("following-sibling::div[1]//li")
        data["features"][label] = [
            ''.join(feat.xpath(".//text()").getall()).strip()
            for feat in features
        ]

    # Images
    # the images are tucked away in a javascript variable.
    # We can use regular expressions to find the variable and parse it as a dictionary:
    image_data = re.findall(
        r"fullScreenGalleryPics\s*:\s*(\[.+?\]),",
        response.text
    )[0]
    # we also need to replace unquoted keys to quoted keys (i.e. title -> "title"):
    images = json.loads(re.sub(r'(\w+?):([^/])', r'"\1":\2', image_data))
    data['images'] = defaultdict(list)
    data['plans'] = []
    for image in images:
        url = urljoin(str(response.url), image['imageUrl'])
        if image['isPlan']:
            data['plans'].append(url)
        else:
            data['images'][image['tag']].append(url)
    return data


async def scrape_properties(urls: List[str]) -> List[PropertyResult]:
    """Scrape Idealista.com properties"""
    properties = []
    to_scrape = [session.get(url) for url in urls]
    # tip: asyncio.as_completed allows concurrent scraping - super fast!
    for response in asyncio.as_completed(to_scrape):
        response = await response
        print(response.status_code)
        if response.status_code != 200:
            print(f"can't scrape property: {response.url}")
            continue
        properties.append(parse_property(response))
    return properties    
import asyncio
import json
import re
from typing import Dict, List
from typing_extensions import TypedDict
from collections import defaultdict 
from urllib.parse import urljoin
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

scrapfly = ScrapflyClient(key="Your ScrapFly Key")

# type hints for the expected results so we can visualize our scraper more easily:
class PropertyResult(TypedDict):
    url: str
    title: str
    location: str
    price: int
    currency: str
    description: str
    updated: str
    features: Dict[str, List[str]]
    images: Dict[str, List[str]]
    plans: List[str]

def parse_property(response: ScrapeApiResponse) -> PropertyResult:
    """parse Idealista.com property page"""
    # load response's HTML tree for parsing:
    selector = response.selector
    css = lambda x: selector.css(x).get("").strip()
    css_all = lambda x: selector.css(x).getall()

    data = {}
    # Meta data
    data["url"] = str(response.context["url"])

    # Basic information
    data['title'] = css("h1 .main-info__title-main::text")
    data['location'] = css(".main-info__title-minor::text")
    data['currency'] = css(".info-data-price::text")
    data['price'] = int(css(".info-data-price span::text").replace(",", ""))
    data['description'] = "\n".join(css_all("div.comment ::text")).strip()
    data["updated"] = selector.xpath(
        "//p[@class='stats-text']"
        "[contains(text(),'updated on')]/text()"
    ).get("").split(" on ")[-1]

    # Features
    data["features"] = {}
    #  first we extract each feature block like "Basic Features" or "Amenities"
    for feature_block in selector.css(".details-property-h2"):
        # then for each block we extract all bullet points underneath them
        label = feature_block.xpath("text()").get()
        features = feature_block.xpath("following-sibling::div[1]//li")
        data["features"][label] = [
            ''.join(feat.xpath(".//text()").getall()).strip()
            for feat in features
        ]

    # Images
    # the images are tucked away in a javascript variable.
    # We can use regular expressions to find the variable and parse it as a dictionary:
    image_data = re.findall(r"fullScreenGalleryPics\s*:\s*(\[.+?\]),", 
        response.scrape_result['content']
    )[0]
    # we also need to replace unquoted keys to quoted keys (i.e. title -> "title"):
    images = json.loads(re.sub(r'(\w+?):([^/])', r'"\1":\2', image_data))
    data['images'] = defaultdict(list)
    data['plans'] = []
    for image in images:
        url = urljoin(str(response.context["url"]), image['imageUrl'])
        if image['isPlan']:
            data['plans'].append(url)
        else:
            data['images'][image['tag']].append(url)
    return data

async def scrape_properties(urls: List[str]) -> List[PropertyResult]:
    """Scrape Idealista.com properties"""
    properties = []
    to_scrape = [ScrapeConfig(url, asp=True, country="ES") for url in urls]
    async for response in scrapfly.concurrent_scrape(to_scrape):
        if response.upstream_status_code != 200:
            print(f"can't scrape property: {response.context['url']}")
            continue
        properties.append(parse_property(response))
    return properties    

🧙‍ If you are experiencing errors while running the Python code tabs, it may be due to getting blocked. To bypass blocking, use the ScrapFly code tabs instead.

Run Code & Example Output
async def run():
    urls = ["https://www.idealista.com/en/inmueble/97028172/"]
    data = await scrape_properties(urls)
    print(json.dumps(data, indent=2, ensure_ascii=False))

if __name__ == "__main__":
    asyncio.run(run())

Which would result in a dataset similar to this:

[
  {
    "title": "Penthouse for sale in La Dreta de l'Eixample",
    "location": "Eixample, Barcelona",
    "price": 5200000,
    "currency": "€",
    "description": "This stunning penthouse hosts 269 m2 distributed across two floors and a turret with 360º exposures. With straight access from the main lift, we walk through a hall that leads to a large central space composed by the living area and a dining room and an access to a terrace at the same level.  A full equipped and red lacquered kitchen, is directly connected to the dining room and features a large window framing Gaudi's masterpiece, Sagrada Familia. On the same floor there are three bedrooms, one en-suite and two double bedrooms with their own bathroom. All the rooms are exterior facing and are surrounded by terraces. Moreover, oversize windows allow for abundant light to stream across the interiors with high-ceilings. \nOn the upper floor we find a room with access to 200 m2 of terraces hosting chill-out areas, a swimming pool and a jacuzzi.  In addition, an interior spiral staircase on the same floor, leads to a turret on a third level spanning 360º views over Barcelona. \nThe penthouse is well preserved with high quality finishes, air conditioning and heating, but it also offers the opportunity have the interiors renovated to contemporary standards, to convert it into one-of-a-kind piece in Barcelona city. \nContact us for more information or to arrange a viewing.",
    "features": {
      "Basic features": [
        "367 m² built",
        "5 bedrooms",
        "4 bathrooms",
        "Terrace",
        "Second hand/good condition",
        "Fitted wardrobes",
        "Built in 1954"
      ],
      "Building": [
        "exterior",
        "With lift"
      ],
      "Amenities": [
        "Air conditioning",
        "Swimming pool"
      ],
      "Energy performance certificate": [
        "Not indicated"
      ]
    },
    "updated": "2 November",
    "url": "https://www.idealista.com/en/inmueble/97028172/",
    "images": {
      "Communal areas": [
        "https://www.idealista.com/inmueble/97028172/foto/1/",
        "https://www.idealista.com/inmueble/97028172/foto/3/",
        "https://www.idealista.com/inmueble/97028172/foto/5/",
        "https://www.idealista.com/inmueble/97028172/foto/6/",
        "https://www.idealista.com/inmueble/97028172/foto/9/",
        "https://www.idealista.com/inmueble/97028172/foto/10/",
        "https://www.idealista.com/inmueble/97028172/foto/11/"
      ],
      "Swimming pool": [
        "https://www.idealista.com/inmueble/97028172/foto/2/",
        "https://www.idealista.com/inmueble/97028172/foto/4/",
        "https://www.idealista.com/inmueble/97028172/foto/7/",
        "https://www.idealista.com/inmueble/97028172/foto/8/"
      ],
      "Views": [
        "https://www.idealista.com/inmueble/97028172/foto/12/",
        "https://www.idealista.com/inmueble/97028172/foto/28/",
        "https://www.idealista.com/inmueble/97028172/foto/48/"
      ],
      "Living room": [
        "https://www.idealista.com/inmueble/97028172/foto/13/",
        "https://www.idealista.com/inmueble/97028172/foto/14/",
        "https://www.idealista.com/inmueble/97028172/foto/16/",
        "https://www.idealista.com/inmueble/97028172/foto/17/",
        "https://www.idealista.com/inmueble/97028172/foto/18/",
        "https://www.idealista.com/inmueble/97028172/foto/19/"
      ],
      "Dining room": [
        "https://www.idealista.com/inmueble/97028172/foto/15/",
        "https://www.idealista.com/inmueble/97028172/foto/25/"
      ],
      "Terrace": [
        "https://www.idealista.com/inmueble/97028172/foto/20/",
        "https://www.idealista.com/inmueble/97028172/foto/21/",
        "https://www.idealista.com/inmueble/97028172/foto/22/",
        "https://www.idealista.com/inmueble/97028172/foto/24/",
        "https://www.idealista.com/inmueble/97028172/foto/36/",
        "https://www.idealista.com/inmueble/97028172/foto/40/",
        "https://www.idealista.com/inmueble/97028172/foto/41/",
        "https://www.idealista.com/inmueble/97028172/foto/42/"
      ],
      "Bedroom": [
        "https://www.idealista.com/inmueble/97028172/foto/23/",
        "https://www.idealista.com/inmueble/97028172/foto/31/",
        "https://www.idealista.com/inmueble/97028172/foto/34/",
        "https://www.idealista.com/inmueble/97028172/foto/35/",
        "https://www.idealista.com/inmueble/97028172/foto/38/",
        "https://www.idealista.com/inmueble/97028172/foto/39/",
        "https://www.idealista.com/inmueble/97028172/foto/43/"
      ],
      "Kitchen": [
        "https://www.idealista.com/inmueble/97028172/foto/26/",
        "https://www.idealista.com/inmueble/97028172/foto/27/",
        "https://www.idealista.com/inmueble/97028172/foto/29/",
        "https://www.idealista.com/inmueble/97028172/foto/30/"
      ],
      "Bathroom": [
        "https://www.idealista.com/inmueble/97028172/foto/32/",
        "https://www.idealista.com/inmueble/97028172/foto/37/",
        "https://www.idealista.com/inmueble/97028172/foto/44/"
      ],
      "Office": [
        "https://www.idealista.com/inmueble/97028172/foto/33/",
        "https://www.idealista.com/inmueble/97028172/foto/46/"
      ],
      "Staircase": [
        "https://www.idealista.com/inmueble/97028172/foto/45/",
        "https://www.idealista.com/inmueble/97028172/foto/47/"
      ],
      "Reception": [
        "https://www.idealista.com/inmueble/97028172/foto/49/"
      ]
    },
    "plans": [
      "https://www.idealista.com/inmueble/97028172/foto/50/",
      "https://www.idealista.com/inmueble/97028172/foto/51/"
    ]
  }
]

In this demonstration, we used a few CSS and XPath selectors with parsel to extract property details like the price, description and features.

However, the images are where things get a bit more complex. For image carousels, many websites use JavaScript to generate dynamic HTML on demand. Idealista is no exception: it hides all of the image URLs in a JavaScript variable and then renders the carousel with JavaScript.
To scrape this, we used a regular expression to find the hidden JavaScript variable, loaded it as a Python dictionary and parsed out the images and floor plans.
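Here's a condensed sketch of that hidden-data technique on a made-up script snippet - the real variable on Idealista is much larger, but the mechanics are the same:

import re
import json

html = """
<script>
    var config = {
        fullScreenGalleryPics : [
            {imageUrl: "/foto/1.jpg", isPlan: false, tag: "Kitchen"},
            {imageUrl: "/foto/2.jpg", isPlan: true, tag: null}
        ],
    };
</script>
"""
# 1. find the javascript array with a regular expression
raw = re.search(r"fullScreenGalleryPics\s*:\s*(\[.+?\]),", html, re.DOTALL).group(1)
# 2. quote the bare keys so the array becomes valid JSON (the [^/] skips "://" in URLs)
raw = re.sub(r'(\w+?):([^/])', r'"\1":\2', raw)
images = json.loads(raw)
print(images[0]["imageUrl"])  # /foto/1.jpg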

How to Scrape Hidden Web Data

For more on hidden web data scraping see our introduction article which covers several popular examples and how to scrape them in Python.

How to Scrape Hidden Web Data

For the scraping itself, we used the asynchronous capabilities of httpx together with asyncio.as_completed to schedule multiple property requests concurrently, making our scraper very fast.
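Stripped of the Idealista-specific parsing, that concurrency pattern looks roughly like this minimal sketch (the URLs are placeholders):

import asyncio
import httpx

async def fetch_all(urls):
    async with httpx.AsyncClient(follow_redirects=True) as client:
        tasks = [client.get(url) for url in urls]
        responses = []
        # as_completed yields whichever request finishes first
        for task in asyncio.as_completed(tasks):
            responses.append(await task)
        return responses

# asyncio.run(fetch_all(["https://httpbin.dev/html", "https://httpbin.dev/html"]))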

Next, let's take a look at how we can scale up this scraper by implementing exploration functionality.

Finding Idealista Properties

There are several ways to find properties listed in Idealista. The most popular and reliable is to explore by area. In this section, we'll take a look at how to scrape property listings with a little bit of crawling - we'll explore the location directory.

To find the location directory we can scroll to the bottom of the page:

screenshot of idealista location directory page
Location directory found at the bottom of the page.

Each link leads to a province listing page, which in turn leads to area listing URLs. We can easily scrape these with the same CSS selector technique we used previously:

Python
ScrapFly
def parse_province(response: httpx.Response) -> List[str]:
    """parse province page for area search urls"""
    selector = Selector(text=response.text)
    urls = selector.css("#location_list li>a::attr(href)").getall()
    return [urljoin(str(response.url), url) for url in urls]


async def scrape_provinces(urls: List[str]) -> List[str]:
    """
    Scrape province pages like:
    https://www.idealista.com/en/venta-viviendas/malaga-provincia/con-chalets/municipios
    for search page urls like:
    https://www.idealista.com/en/venta-viviendas/marbella-malaga/con-chalets/
    """
    to_scrape = [session.get(url) for url in urls]
    search_urls = []
    for response in asyncio.as_completed(to_scrape):
        search_urls.extend(parse_province(await response))
    return search_urls    
def parse_province(response: ScrapeApiResponse) -> List[str]:
    """parse province page for area search urls"""
    selector = response.selector
    urls = selector.css("#location_list li>a::attr(href)").getall()
    return [urljoin(str(response.context["url"]), url) for url in urls]


async def scrape_provinces(urls: List[str]) -> List[str]:
    """
    Scrape province pages like:
    https://www.idealista.com/en/venta-viviendas/malaga-provincia/con-chalets/municipios
    for search page urls like:
    https://www.idealista.com/en/venta-viviendas/marbella-malaga/con-chalets/
    """
    to_scrape = [ScrapeConfig(url, asp=True, country="ES") for url in urls]
    search_urls = []
    async for response in scrapfly.concurrent_scrape(to_scrape):
        search_urls.extend(parse_province(response))
    return search_urls    
Run Code & Example Output
async def run():
    data = await scrape_provinces([
        "https://www.idealista.com/en/venta-viviendas/balears-illes/con-chalets/municipios"
    ])
    print(json.dumps(data, indent=2))
if __name__ == "__main__":
    asyncio.run(run())    

Will result in a dataset similar to this:

[
  "https://www.idealista.com/en/venta-viviendas/alaior-balears-illes/con-chalets/",
  "https://www.idealista.com/en/venta-viviendas/alaro-balears-illes/con-chalets/",
  "https://www.idealista.com/en/venta-viviendas/alcudia-balears-illes/con-chalets/",
  "https://www.idealista.com/en/venta-viviendas/algaida-balears-illes/con-chalets/",
  "https://www.idealista.com/en/venta-viviendas/andratx-balears-illes/con-chalets/",
  "https://www.idealista.com/en/venta-viviendas/ariany-balears-illes/con-chalets/",
  "https://www.idealista.com/en/venta-viviendas/arta-balears-illes/con-chalets/",
  "https://www.idealista.com/en/venta-viviendas/santa-maria-del-cami-balears-illes/con-chalets/",
  ...
]

This scraper will scrape all area pages for given provinces. To discover all property listings all we'd have to do is scrape all provinces. Next, let's scrape the search results page itself:

Python
ScrapFly
import math

def parse_search(response: httpx.Response) -> List[str]:
    """Parse search result page for 30 listings"""
    selector = Selector(text=response.text)
    urls = selector.css("article.item .item-link::attr(href)").getall()
    return [urljoin(str(response.url), url) for url in urls]


async def scrape_search(url: str, paginate=True, max_pages: int = None) -> List[str]:
    """
    Scrape search urls like:
    https://www.idealista.com/en/venta-viviendas/marbella-malaga/con-chalets/
    for property urls
    """
    first_page = await session.get(url)
    property_urls = parse_search(first_page)
    if not paginate:
        return property_urls
    total_results = Selector(text=first_page.text).css("h1#h1-container").re(": (.+) houses")[0]
    total_pages = math.ceil(int(total_results.replace(",", "")) / 30)
    if total_pages > 60:
        print(f"search contains more than max page limit ({total_pages}/60)")
        total_pages = 60
    # scrape all available pages unless max_pages is set and lower
    if max_pages and max_pages < total_pages:
        total_pages = max_pages
    print(f"scraping {total_pages} of search results concurrently")
    to_scrape = [
        session.get(str(first_page.url) + f"pagina-{page}.htm")
        for page in range(2, total_pages + 1)
    ]
    for response in asyncio.as_completed(to_scrape):
        property_urls.extend(parse_search(await response))
    return property_urls    
import math

def parse_search(response: ScrapeApiResponse) -> List[str]:
    """Parse search result page for 30 listings"""
    selector = response.selector
    urls = selector.css("article.item .item-link::attr(href)").getall()
    return [urljoin(str(response.context["url"]), url) for url in urls]


async def scrape_search(url: str, paginate=True, max_pages: int = None) -> List[str]:
    """
    Scrape search urls like:
    https://www.idealista.com/en/venta-viviendas/marbella-malaga/con-chalets/
    for proprety urls
    """
    first_page = await scrapfly.async_scrape(ScrapeConfig(url, asp=True, country="ES"))
    property_urls = parse_search(first_page)
    if not paginate:
        return property_urls
    total_results = first_page.selector.css("h1#h1-container").re(": (.+) houses")[0]
    total_pages = math.ceil(int(total_results.replace(",", "")) / 30)
    if total_pages > 60:
        print(f"search contains more than max page limit ({total_pages}/60)")
        total_pages = 60
    # scrape all available pages unless max_pages is set and lower
    if max_pages and max_pages < total_pages:
        total_pages = max_pages
    print(f"scraping {total_pages} of search results concurrently")
    to_scrape = [
        ScrapeConfig(first_page.context["url"] + f"pagina-{page}.htm", asp=True, country="ES")
        for page in range(2, total_pages + 1)
    ]
    async for response in scrapfly.concurrent_scrape(to_scrape):
        property_urls.extend(parse_search(response))
    return property_urls
Run Code & Example Output
async def run():
    data = await scrape_search(
        url="https://www.idealista.com/en/venta-viviendas/marbella-malaga/con-chalets/",
        max_pages=1
    )
    print(json.dumps(data, indent=2))
if __name__ == "__main__":
    asyncio.run(run())  

Will result in a dataset similar to this:

[
  "https://www.idealista.com/en/inmueble/98935300/",
  "https://www.idealista.com/en/inmueble/102479109/",
  "https://www.idealista.com/en/inmueble/102051911/",
  "https://www.idealista.com/en/inmueble/99394819/",
  "https://www.idealista.com/en/inmueble/102695949/",
  "https://www.idealista.com/en/inmueble/102645488/",
  "https://www.idealista.com/en/inmueble/102953607/",
  "https://www.idealista.com/en/inmueble/86941032/",
  "https://www.idealista.com/en/inmueble/102953907/",
  "https://www.idealista.com/en/inmueble/103130779/",
  "https://www.idealista.com/en/inmueble/100285134/",
   .....
]

To scrape paginated content like these area result pages, we first request the first page to extract the total result count. Then, we can scrape the remaining pages concurrently, retrieving all listings in just a few seconds!
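To make the pagination math concrete, here's a quick illustration with made-up numbers (Idealista shows 30 listings per page and caps searches at 60 pages):

import math

total_results = 1245   # parsed from the "h1#h1-container" heading
total_pages = math.ceil(total_results / 30)   # 42
total_pages = min(total_pages, 60)            # idealista's 60 page cap
page_urls = [
    f"https://www.idealista.com/en/venta-viviendas/marbella-malaga/con-chalets/pagina-{page}.htm"
    for page in range(2, total_pages + 1)
]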


With this discovery scraper and our previous property scraper, we can collect all of the existing real estate data on Idealista.com.
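For instance, here's a hypothetical sketch of how the scrape_provinces, scrape_search and scrape_properties functions above could be chained into a single province crawl (assuming they live in the same module; the URL and the 2-page cap are just examples):

async def crawl_province(province_url: str):
    search_urls = await scrape_provinces([province_url])
    property_urls = []
    for search_url in search_urls:
        property_urls.extend(await scrape_search(search_url, max_pages=2))
    return await scrape_properties(property_urls)

# asyncio.run(crawl_province(
#     "https://www.idealista.com/en/venta-viviendas/balears-illes/con-chalets/municipios"
# ))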

Scraping Idealista Search Pages

But what if we want to be the first to know about new property listings? For that, we can scrape Idealista's search pages, which enable finding specific property listings and sorting them. For example, let's find properties in Malaga, Spain:

screenshot of idealista search pages

To build an Idealista scraper for search pages, we'll request them while incrementing the pagina parameter for pagination and parse each result page:

Python
ScrapFly
import json
import math
import httpx
import asyncio

from typing import Dict, List
from parsel import Selector

# Establish a persistent HTTPX session with browser-like headers to avoid blocking
BASE_HEADERS = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US;en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
}

session = httpx.AsyncClient(headers=BASE_HEADERS, follow_redirects=True)


def parse_search_data(response) -> Dict:
    """parse search result data"""
    selector = Selector(response.text)
    total_results = selector.css("h1#h1-container").re(": (.+) houses")[0]
    max_pages = math.ceil(int(total_results.replace(",", "")) / 30)
    max_pages = 60  if max_pages > 60 else max_pages
    search_data = []
    for box in selector.xpath("//section[contains(@class, 'items-list')]/article[contains(@class, 'item')]"):
        ad = box.xpath(".//p[@class='adv_txt']") # ignore ad listings
        if ad:
            continue
        price = box.xpath(".//span[contains(@class, 'item-price')]/text()").get()
        parking = box.xpath(".//span[@class='item-parking']").get()
        company_url = box.xpath(".//picture[@class='logo-branding']/a/@href").get()
        search_data.append({
            "title": box.xpath(".//div/a/@title").get(),
            "link": "https://www.idealista.com" + box.xpath(".//div/a/@href").get(),
            "picture": box.xpath(".//img/@src").get(),
            "price": int(price.replace(",", '')) if price else None,
            "currency": box.xpath(".//span[contains(@class, 'item-price')]/span/text()").get(),
            "parking_included": True if parking else False,
            "details": box.xpath(".//div[@class='item-detail-char']/span/text()").getall(),
            "description": box.xpath(".//div[contains(@class, 'item-description')]/p/text()").get().replace('\n', ''),
            "tags": box.xpath(".//div[@class='listing-tags-container']/span/text()").getall(),
            "listing_company": box.xpath(".//picture[@class='logo-branding']/a/@title").get(),
            "listing_company_url": "https://www.idealista.com" + company_url if company_url else None
        })
    return {"max_pages": max_pages, "search_data": search_data}


async def scrape_search(url: str, max_scrape_pages: int = None) -> List[Dict]:
    """scrape Idealista search results"""
    first_page = await session.get(url)
    assert first_page.status_code == 200, "request is blocked, use ScrapFly code tab"
    data = parse_search_data(first_page)
    search_data = data["search_data"]
    max_pages = data["max_pages"]

    # get the number of total pages to scrape
    if max_scrape_pages and max_scrape_pages < max_pages:
        max_pages = max_scrape_pages

    # scrape the remaining pages concurrently
    to_scrape = [
        session.get(url + f"pagina-{page}.htm")
        for page in range(2, max_pages + 1)
    ]
    print(f"scraping search pagination, {max_pages - 1} pages remaining")
    for response in asyncio.as_completed(to_scrape):
        search_data.extend(parse_search_data(await response)["search_data"])
    print(f"scraped {len(search_data)} property listings from search pages")
    return search_data
import json
import math
import asyncio

from typing import Dict, List
from loguru import logger as log
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

SCRAPFLY = ScrapflyClient(key="Your ScrapFly API key")


def parse_search_data(response: ScrapeApiResponse) -> Dict:
    """parse search result data"""
    selector = response.selector
    total_results = selector.css("h1#h1-container").re(": (.+) houses")[0]
    max_pages = math.ceil(int(total_results.replace(",", "")) / 30)
    max_pages = 60  if max_pages > 60 else max_pages
    search_data = []
    for box in selector.xpath("//section[contains(@class, 'items-list')]/article[contains(@class, 'item')]"):
        ad = box.xpath(".//p[@class='adv_txt']") # ignore ad listings
        if ad:
            continue
        price = box.xpath(".//span[contains(@class, 'item-price')]/text()").get()
        parking = box.xpath(".//span[@class='item-parking']").get()
        company_url = box.xpath(".//picture[@class='logo-branding']/a/@href").get()
        search_data.append({
            "title": box.xpath(".//div/a/@title").get(),
            "link": "https://www.idealista.com" + box.xpath(".//div/a/@href").get(),
            "picture": box.xpath(".//img/@src").get(),
            "price": int(price.replace(",", '')) if price else None,
            "currency": box.xpath(".//span[contains(@class, 'item-price')]/span/text()").get(),
            "parking_included": True if parking else False,
            "details": box.xpath(".//div[@class='item-detail-char']/span/text()").getall(),
            "description": box.xpath(".//div[contains(@class, 'item-description')]/p/text()").get().replace('\n', ''),
            "tags": box.xpath(".//div[@class='listing-tags-container']/span/text()").getall(),
            "listing_company": box.xpath(".//picture[@class='logo-branding']/a/@title").get(),
            "listing_company_url": "https://www.idealista.com" + company_url if company_url else None
        })
    return {"max_pages": max_pages, "search_data": search_data}


async def scrape_search(url: str, max_scrape_pages: int = None) -> List[Dict]:
    """scrape Idealista search results"""
    first_page = await SCRAPFLY.async_scrape(ScrapeConfig(url, asp=True, country="ES"))
    data = parse_search_data(first_page)
    search_data = data["search_data"]
    max_pages = data["max_pages"]

    # get the number of total pages to scrape
    if max_scrape_pages and max_scrape_pages < max_pages:
        max_pages = max_scrape_pages

    # scrape the remaining pages concurrently
    to_scrape = [
        ScrapeConfig(url + f"pagina-{page}.htm", asp=True, country="ES")
        for page in range(2, max_pages + 1)
    ]
    log.info(f"scraping search pagination, {max_pages - 1} pages remaining")

    async for response in SCRAPFLY.concurrent_scrape(to_scrape):
        # skip invalid property pages
        search_data.extend(parse_search_data(response)["search_data"])
    log.success(f"scraped {len(search_data)} property listings from search pages")
    return search_data
Run the code
if __name__ == "__main__":
    search_data = asyncio.run(scrape_search(
        url="https://www.idealista.com/en/venta-viviendas/marbella-malaga/con-chalets/",
        # remove the max_scrape_pages parameter to scrape all pages
        max_scrape_pages=3
    ))
    
    with open("search_data.json", "w", encoding="utf-8") as file:
        json.dump(search_data, file, indent=2, ensure_ascii=False)

Above, we defined a parse_search_data utility that parses the HTML page using XPath selectors to extract the search results. The scrape_search function then paginates the search by requesting the first page, adding the remaining pages to a scraping list, and scraping them concurrently.

Here's an example result from the above Idealista scraper:

Example output
[
  {
    "title": "Detached house in calle Verdi, 128 -27, Sierra Blanca, Marbella",
    "link": "https://www.idealista.com/en/inmueble/105709329/",
    "picture": "https://img4.idealista.com/blur/WEB_LISTING-M/0/id.pro.es.image.master/7e/17/de/1260664883.jpg",
    "price": 12450000,
    "currency": "€",
    "parking_included": true,
    "details": [
      "6 bed.",
      "774 m²"
    ],
    "description": "Nestled within Marbella's prestigious Sierra Blanca community, Villa Verdi epitomises luxury and refinement, a testament to the artistry of AMES arquitectos. Set amidst lush greenery on a generous plot, this unique villa seamlessly merges Andalusian heritage with contemporary design, offering an unparalleled living exp",
    "tags": [
      "Sea views",
      "Luxury",
      "Villa"
    ],
    "listing_company": "Solvilla",
    "listing_company_url": "https://www.idealista.com/en/pro/solvilla/"
  },
  ....
]

We scraped Idealista data from discovery, property, and search pages - all that's left is to scale our scraper. If we increase our scraping volume, Idealista is very likely to block us, so let's take a look at how to avoid blocking using the ScrapFly web scraping API next.

Bypass Idealista Blocking with ScrapFly

As we've seen, scraping Idealista.com using Python is pretty straightforward, though when scraping at scale our scrapers are likely to be blocked or asked to solve captchas.

scrapfly middleware

ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale.

For example, we can use the scrapfly-sdk python package and the Anti Scraping Protection Bypass feature. First, let's install scrapfly-sdk using pip:

$ pip install scrapfly-sdk

To take advantage of ScrapFly's API in our Idealista web scraper, all we need to do is replace our httpx session code with scrapfly-sdk client requests:

import httpx

response = httpx.get("some idealista.com URL")
# in ScrapFly SDK becomes
from scrapfly import ScrapflyClient, ScrapeConfig
client = ScrapflyClient("YOUR SCRAPFLY KEY")
result = client.scrape(ScrapeConfig(
    "some Idealista.com URL",
    # we can select specific proxy country like Spain:
    country="ES",
    # and enable anti scraping protection bypass:
    asp=True
))

For more on how to scrape Idealista.com using ScrapFly, see the Full Scraper Code section.

FAQ

To wrap this guide up, let's take a look at some frequently asked questions about web scraping Idealista.com data:

Is it legal to scrape Idealista.com?

Yes. Idealista.com's data is publicly available; we're not extracting anything personal or private. Scraping Idealista.com at slow, respectful rates is perfectly legal and ethical.
That being said, attention should be paid to GDPR compliance in the EU when scraping personal data (like a seller's name or phone number). For more, see our Is Web Scraping Legal? article.

Does Idealista.com have a public API?

No, Idealista.com (and its sister websites) do not offer a public API for property data. However, as seen in this guide, it's easy to scrape and crawl using a little bit of Python.

Latest Idealista.com Scraper Code
https://github.com/scrapfly/scrapfly-scrapers/

Idealista Scraping Summary

In this web scraping tutorial, we wrote a short Idealista scraper for real estate property data. We started by scraping a single property page and parsing details using CSS and XPath selectors.

Then, we took a look at how to find properties using Idealista's directory and search system and wrote a small web crawler that can crawl and scrape all property listings in the provided provinces of Spain.

Finally, we took a look at how to track new listings being posted on Idealista by creating a looping scraper that periodically re-checks the newest search results.

For all of this, we used Python with the httpx and parsel packages, and to get around blocking we used ScrapFly's API, which smartly configures every web scraper connection to avoid being blocked.
For more about ScrapFly, see our documentation and try it out for FREE!

Full Idealista Scraper Code

import re
import asyncio
import json
import math
from pathlib import Path
from typing import List
from urllib.parse import urljoin
from collections import defaultdict

from scrapfly import ScrapeApiResponse, ScrapeConfig, ScrapflyClient

scrapfly = ScrapflyClient(key="YOUR SCRAPFLY KEY", max_concurrency=2)

# -------------------------------------------------
# Property
# -------------------------------------------------
def parse_property(result: ScrapeApiResponse):
    sel = result.selector
    css = lambda x: result.selector.css(x).get("").strip()
    css_all = lambda x: result.selector.css(x).getall()

    data = {}
    # Meta data
    data["url"] = result.context["url"]

    # Basic information
    data['title'] = css("h1 .main-info__title-main::text")
    data['location'] = css(".main-info__title-minor::text")
    data['currency'] = css(".info-data-price::text")
    data['price'] = int(css(".info-data-price span::text").replace(",", ""))
    data['description'] = "\n".join(css_all("div.comment ::text")).strip()
    data["updated"] = sel.xpath(
        "//p[@class='stats-text']"
        "[contains(text(),'updated on')]/text()"
    ).get("").split(" on ")[-1]

    # Features
    data["features"] = {}
    #  first we extract each feature block like "Basic Features" or "Amenities"
    for feature_block in result.selector.css(".details-property-h3"):
        # then for each block we extract all bullet points underneath them
        label = feature_block.xpath("text()").get()
        features = feature_block.xpath("following-sibling::div[1]//li")
        data["features"][label] = [
            ''.join(feat.xpath(".//text()").getall()).strip()
            for feat in features
        ]

    # Images
    # the images are tucked away in a javascript variable.
    # We can use regular expressions to find the variable and parse it as a dictionary:
    image_data = re.findall(r"fullScreenGalleryPics\s*:\s*(\[.+?\]),", result.content)[0]
    # we also need to replace unquoted keys to quoted keys (i.e. title -> "title"):
    images = json.loads(re.sub(r'(\w+?):([^/])', r'"\1":\2', image_data))
    data['images'] = defaultdict(list)
    data['plans'] = []
    for image in images:
        url = urljoin(result.context['url'], image['imageUrl'])
        if image['isPlan']:
            data['plans'].append(url)
        else:
            data['images'][image['tag']].append(url)
    return data



async def scrape_properties(urls: List[str]):
    to_scrape = [ScrapeConfig(url=url, country="ES", asp=True) for url in urls]
    results = []
    async for result in scrapfly.concurrent_scrape(to_scrape):
        results.append(parse_property(result))
    return results


# -------------------------------------------------
# Search
# -------------------------------------------------
def parse_province(result: ScrapeApiResponse) -> List[str]:
    """parse province page for area search urls"""
    urls = result.selector.css("#location_list li>a::attr(href)").getall()
    return [urljoin(result.context["url"], url) for url in urls]


async def scrape_provinces(urls: List[str]) -> List[str]:
    """
    Scrape province pages like:
    https://www.idealista.com/en/venta-viviendas/malaga-provincia/con-chalets/municipios
    for search page urls like:
    https://www.idealista.com/en/venta-viviendas/marbella-malaga/con-chalets/
    """
    to_scrape = [ScrapeConfig(url, country="ES", asp=True) for url in urls]
    search_urls = []
    async for result in scrapfly.concurrent_scrape(to_scrape):
        search_urls.extend(parse_province(result))
    return search_urls


def parse_search(result: ScrapeApiResponse) -> List[str]:
    """Parse search result page for 30 listings"""
    urls = result.selector.css("article.item .item-link::attr(href)").getall()
    return [urljoin(result.context["url"], url) for url in urls]


async def scrape_search(url: str, paginate=True, max_pages: int = 2) -> List[str]:
    """
    Scrape search urls like:
    https://www.idealista.com/en/venta-viviendas/marbella-malaga/con-chalets/
    for property urls
    """
    first_page = await scrapfly.async_scrape(ScrapeConfig(url=url, country="ES", asp=True))
    property_urls = parse_search(first_page)
    if not paginate:
        return property_urls

    total_results = first_page.selector.css("h1#h1-container").re(": (.+) houses")[0]
    total_pages = math.ceil(int(total_results.replace(",", "")) / 30)
    if total_pages > max_pages:  # note idealista allows max 60 pages per search
        print(f"search contains more than max page limit ({total_pages}/60)")
        total_pages = max_pages 
    print(f"scraping {total_pages} of search results concurrently")
    to_scrape = [
        ScrapeConfig(
            url=first_page.context["url"] + f"pagina-{page}.htm",
            asp=True,
            country="ES",
        )
        for page in range(2, total_pages + 1)
    ]
    async for result in scrapfly.concurrent_scrape(to_scrape):
        property_urls.extend(parse_search(result))
    return property_urls


# -------------------------------------------------
# Track Search
# -------------------------------------------------
async def track_search(url: str, output: Path, interval=60):
    """Track Idealista.com results page, scrape new listings and append them as JSON to the output file"""
    seen = set()
    output.touch(exist_ok=True)
    try:
        while True:
            properties = await scrape_search(url=url, paginate=False)
            # check deduplication filter
            properties = [prop for prop in properties if prop not in seen]
            if properties:
                # scrape properties and save to file - 1 property as JSON per line
                results = await scrape_properties(properties)
                with output.open("a") as f:
                    f.write("\n".join(json.dumps(prop) for prop in results) + "\n")

                # add seen to deduplication filter
                for prop in properties:
                    seen.add(prop)
            print(f"scraped {len(properties)} new properties; waiting {interval} seconds")
            await asyncio.sleep(interval)
    except KeyboardInterrupt:
        print("stopping price tracking")


async def run():
    # scrape properties:
    urls = ["https://www.idealista.com/en/inmueble/97028172/"]
    result_properties = await scrape_properties(urls)
    # find properties
    result_search = await scrape_search("https://www.idealista.com/en/venta-viviendas/marbella-malaga/con-chalets/")
    result_province = await scrape_provinces(["https://www.idealista.com/en/venta-viviendas/balears-illes/con-chalets/municipios"])
    # track properties
    await track_search(
        "https://www.idealista.com/en/venta-viviendas/barcelona/eixample/?ordenado-por=fecha-publicacion-desc",
        Path("new-properties.jsonl"),
    )


if __name__ == "__main__":
    asyncio.run(run())

Related Posts

How to Scrape Reddit Posts, Subreddits and Profiles

In this article, we'll explore how to scrape Reddit. We'll extract various social data types from subreddits, posts, and user pages. All of which through plain HTTP requests without headless browser usage.

How to Scrape LinkedIn in 2024

In this scrape guide we'll be taking a look at one of the most popular web scraping targets - LinkedIn.com. We'll be scraping people profiles, company profiles as well as job listings and search.

How to Scrape SimilarWeb Website Traffic Analytics

In this guide, we'll explain how to scrape SimilarWeb through a step-by-step guide. We'll scrape comprehensive website traffic insights, websites comparing data, sitemaps, and trending industry domains.