How to Scrape Idealista.com in Python - Real Estate Property Data


In this web scraping tutorial, we'll be scraping idealista.com - the biggest real estate marketplace in Spain, Portugal and Italy.

We'll be scraping common property data points like property pricing, addresses, photos and agent phone numbers.

When it comes to web scraping, Idealista.com is a classic scraping target. To scrape it, we'll cover popular Python web scraping techniques like HTML parsing with CSS selectors and concurrent requests with asyncio.

Finally, we'll also cover tracking newly listed properties - giving us an upper hand in real estate discovery and bidding.

In this article, we'll focus on the Spanish version of the website (Idealista.com), though the Italian and Portuguese versions function the same way and our scraper code should work for those sources as well.

Latest Idealista.com Scraper Code

https://github.com/scrapfly/scrapfly-scrapers/

Why Scrape Idealista.com?

Idealista.com is one of the biggest real estate websites in Spain (as well as Italy and Portugal), making it the biggest public real estate dataset for these areas, with fields like real estate prices, listing locations, sale dates and general property information.

This is valuable information for market analytics, the study of the housing industry, and a general competitor overview.

For more on scraping use cases see our extensive write-up Scraping Use Cases

Project Setup

In this tutorial, we'll be using Python with two community packages:

  • httpx - HTTP client library which will let us communicate with Idealista.com's servers
  • parsel - HTML parsing library which will help us to parse our web scraped HTML files using CSS selectors and XPath selectors.

These packages can be easily installed via the pip install command:

$ pip install httpx parsel

Alternatively, feel free to swap httpx out for any other HTTP client package such as requests, as we only need basic HTTP functions which are almost interchangeable across libraries. As for parsel, another great alternative is the beautifulsoup package.
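For example, here's a minimal sketch of the same fetch-and-parse flow using requests and beautifulsoup4 instead (the price selector is one we'll meet later in this guide; treat the snippet as illustrative rather than a drop-in replacement):

# hypothetical equivalent of the httpx + parsel setup using requests + beautifulsoup4
import requests
from bs4 import BeautifulSoup

response = requests.get(
    "https://www.idealista.com/en/inmueble/94156485/",
    headers={"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36"},
)
soup = BeautifulSoup(response.text, "html.parser")
# beautifulsoup supports CSS selectors through the .select() / .select_one() methods:
price = soup.select_one(".info-data-price")
print(price.get_text(strip=True) if price else "price element not found")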

Scraping Idealista Property Data

Let's start by taking a look at how to scrape Idealista for a single property. In later sections, we'll also take a look at how to find properties and feed them into this property scraper.

Let's take a look at a listing page and where all of the information is stored on it. We'll pick a random property listing, like:

idealista.com/en/inmueble/94156485/

For parsing data on Idealista, we'll be using CSS selectors, so let's markup the fields we want to scrape:

screenshot and markup of idealista property page
We'll scrape fields highlighted in blue in this example

Idealista is a pure HTML website with very convenient styling markup which we can take advantage of in our scraper. For example, if we right-click the price and inspect the HTML element, we can see how clear the HTML structure is:

illustration of idealista's source page

We can see that all of the data points are under clear class names, like info-data-price for the price or main-info__title-main for the property name.
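For instance, once we have the page HTML we can query these class names directly with parsel - a quick illustrative snippet using a simplified fragment that mimics Idealista's markup:

from parsel import Selector

# simplified HTML fragment mimicking Idealista's property page markup
html = """
<h1><span class="main-info__title-main">Penthouse for sale in La Dreta de l'Eixample</span></h1>
<span class="info-data-price"><span>5,200,000</span> €</span>
"""
selector = Selector(text=html)
print(selector.css("h1 .main-info__title-main::text").get())  # property title
print(selector.css(".info-data-price span::text").get())      # "5,200,000"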

Parsing HTML with CSS Selectors

For more on HTML parsing using CSS selectors see our interactive introduction article.


Let's scrape it:

Python
ScrapFly
import asyncio
import json
import math
import re
from typing import Dict, List
from collections import defaultdict 
from urllib.parse import urljoin
import httpx
from parsel import Selector
from typing_extensions import TypedDict

# Establish persistent HTTPX session with browser-like headers to avoid blocking
BASE_HEADERS = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US;en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
}
session = httpx.AsyncClient(headers=BASE_HEADERS, follow_redirects=True)

# type hints for expected results so we can visualize our scraper more easily:
class PropertyResult(TypedDict):
    url: str
    title: str
    location: str
    price: int
    currency: str
    description: str
    updated: str
    features: Dict[str, List[str]]
    images: Dict[str, List[str]]
    plans: List[str]


def parse_property(response: httpx.Response) -> PropertyResult:
    """parse Idealista.com property page"""
    # load response's HTML tree for parsing:
    selector = Selector(text=response.text)
    css = lambda x: selector.css(x).get("").strip()
    css_all = lambda x: selector.css(x).getall()

    data = {}
    # Meta data
    data["url"] = str(response.url)

    # Basic information
    data['title'] = css("h1 .main-info__title-main::text")
    data['location'] = css(".main-info__title-minor::text")
    data['currency'] = css(".info-data-price::text")
    data['price'] = int(css(".info-data-price span::text").replace(",", ""))
    data['description'] = "\n".join(css_all("div.comment ::text")).strip()
    data["updated"] = selector.xpath(
        "//p[@class='stats-text']"
        "[contains(text(),'updated on')]/text()"
    ).get("").split(" on ")[-1]

    # Features
    data["features"] = {}
    #  first we extract each feature block like "Basic Features" or "Amenities"
    for feature_block in selector.css(".details-property-h2"):
        # then for each block we extract all bullet points underneath them
        label = feature_block.xpath("text()").get()
        features = feature_block.xpath("following-sibling::div[1]//li")
        data["features"][label] = [
            ''.join(feat.xpath(".//text()").getall()).strip()
            for feat in features
        ]

    # Images
    # the images are tucked away in a javascript variable.
    # We can use regular expressions to find the variable and parse it as a dictionary:
    image_data = re.findall(
        "fullScreenGalleryPics\s*:\s*(\[.+?\]),", 
        response.text
    )[0]
    # we also need to replace unquoted keys to quoted keys (i.e. title -> "title"):
    images = json.loads(re.sub(r'(\w+?):([^/])', r'"\1":\2', image_data))
    data['images'] = defaultdict(list)
    data['plans'] = []
    for image in images:
        url = urljoin(str(response.url), image['imageUrl'])
        if image['isPlan']:
            data['plans'].append(url)
        else:
            data['images'][image['tag']].append(url)
    return data


async def scrape_properties(urls: List[str]) -> List[PropertyResult]:
    """Scrape Idealista.com properties"""
    properties = []
    to_scrape = [session.get(url) for url in urls]
    # tip: asyncio.as_completed allows concurrent scraping - super fast!
    for response in asyncio.as_completed(to_scrape):
        response = await response
        print(response.status_code)
        if response.status_code != 200:
            print(f"can't scrape property: {response.url}")
            continue
        properties.append(parse_property(response))
    return properties    
import asyncio
import json
import math
import re
from typing import Dict, List
from typing_extensions import TypedDict
from collections import defaultdict 
from urllib.parse import urljoin
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

scrapfly = ScrapflyClient(key="Your ScrapFly Key")

# type hints for expected results so we can visualize our scraper more easily:
class PropertyResult(TypedDict):
    url: str
    title: str
    location: str
    price: int
    currency: str
    description: str
    updated: str
    features: Dict[str, List[str]]
    images: Dict[str, List[str]]
    plans: List[str]

def parse_property(response: ScrapeApiResponse) -> PropertyResult:
    """parse Idealista.com property page"""
    # load response's HTML tree for parsing:
    selector = response.selector
    css = lambda x: selector.css(x).get("").strip()
    css_all = lambda x: selector.css(x).getall()

    data = {}
    # Meta data
    data["url"] = str(response.context["url"])

    # Basic information
    data['title'] = css("h1 .main-info__title-main::text")
    data['location'] = css(".main-info__title-minor::text")
    data['currency'] = css(".info-data-price::text")
    data['price'] = int(css(".info-data-price span::text").replace(",", ""))
    data['description'] = "\n".join(css_all("div.comment ::text")).strip()
    data["updated"] = selector.xpath(
        "//p[@class='stats-text']"
        "[contains(text(),'updated on')]/text()"
    ).get("").split(" on ")[-1]

    # Features
    data["features"] = {}
    #  first we extract each feature block like "Basic Features" or "Amenities"
    for feature_block in selector.css(".details-property-h2"):
        # then for each block we extract all bullet points underneath them
        label = feature_block.xpath("text()").get()
        features = feature_block.xpath("following-sibling::div[1]//li")
        data["features"][label] = [
            ''.join(feat.xpath(".//text()").getall()).strip()
            for feat in features
        ]

    # Images
    # the images are tucked away in a javascript variable.
    # We can use regular expressions to find the variable and parse it as a dictionary:
    image_data = re.findall(
        "fullScreenGalleryPics\s*:\s*(\[.+?\]),", 
        response.scrape_result['content']
    )[0]
    # we also need to replace unquoted keys to quoted keys (i.e. title -> "title"):
    images = json.loads(re.sub(r'(\w+?):([^/])', r'"\1":\2', image_data))
    data['images'] = defaultdict(list)
    data['plans'] = []
    for image in images:
        url = urljoin(str(response.context["url"]), image['imageUrl'])
        if image['isPlan']:
            data['plans'].append(url)
        else:
            data['images'][image['tag']].append(url)
    return data

async def scrape_properties(urls: List[str]) -> List[PropertyResult]:
    """Scrape Idealista.com properties"""
    properties = []
    to_scrape = [ScrapeConfig(url, asp=True, country="ES") for url in urls]
    async for response in scrapfly.concurrent_scrape(to_scrape):
        if response.upstream_status_code != 200:
            print(f"can't scrape property: {response.context['url']}")
            continue
        properties.append(parse_property(response))
    return properties    

🧙‍ If you are experiencing errors while running the Python code tabs, it may be due to getting blocked. To bypass blocking, use the ScrapFly code tabs instead.

Run Code & Example Output
async def run():
    urls = ["https://www.idealista.com/en/inmueble/97028172/"]
    data = await scrape_properties(urls)
    print(json.dumps(data, indent=2, ensure_ascii=False))

if __name__ == "__main__":
    asyncio.run(run())

Which would result in a dataset similar to this:

[
  {
    "title": "Penthouse for sale in La Dreta de l'Eixample",
    "location": "Eixample, Barcelona",
    "price": 5200000,
    "currency": "€",
    "description": "This stunning penthouse hosts 269 m2 distributed across two floors and a turret with 360º exposures. With straight access from the main lift, we walk through a hall that leads to a large central space composed by the living area and a dining room and an access to a terrace at the same level.  A full equipped and red lacquered kitchen, is directly connected to the dining room and features a large window framing Gaudi's masterpiece, Sagrada Familia. On the same floor there are three bedrooms, one en-suite and two double bedrooms with their own bathroom. All the rooms are exterior facing and are surrounded by terraces. Moreover, oversize windows allow for abundant light to stream across the interiors with high-ceilings. \nOn the upper floor we find a room with access to 200 m2 of terraces hosting chill-out areas, a swimming pool and a jacuzzi.  In addition, an interior spiral staircase on the same floor, leads to a turret on a third level spanning 360º views over Barcelona. \nThe penthouse is well preserved with high quality finishes, air conditioning and heating, but it also offers the opportunity have the interiors renovated to contemporary standards, to convert it into one-of-a-kind piece in Barcelona city. \nContact us for more information or to arrange a viewing.",
    "features": {
      "Basic features": [
        "367 m² built",
        "5 bedrooms",
        "4 bathrooms",
        "Terrace",
        "Second hand/good condition",
        "Fitted wardrobes",
        "Built in 1954"
      ],
      "Building": [
        "exterior",
        "With lift"
      ],
      "Amenities": [
        "Air conditioning",
        "Swimming pool"
      ],
      "Energy performance certificate": [
        "Not indicated"
      ]
    },
    "updated": "2 November",
    "url": "https://www.idealista.com/en/inmueble/97028172/",
    "images": {
      "Communal areas": [
        "https://www.idealista.com/inmueble/97028172/foto/1/",
        "https://www.idealista.com/inmueble/97028172/foto/3/",
        "https://www.idealista.com/inmueble/97028172/foto/5/",
        "https://www.idealista.com/inmueble/97028172/foto/6/",
        "https://www.idealista.com/inmueble/97028172/foto/9/",
        "https://www.idealista.com/inmueble/97028172/foto/10/",
        "https://www.idealista.com/inmueble/97028172/foto/11/"
      ],
      "Swimming pool": [
        "https://www.idealista.com/inmueble/97028172/foto/2/",
        "https://www.idealista.com/inmueble/97028172/foto/4/",
        "https://www.idealista.com/inmueble/97028172/foto/7/",
        "https://www.idealista.com/inmueble/97028172/foto/8/"
      ],
      "Views": [
        "https://www.idealista.com/inmueble/97028172/foto/12/",
        "https://www.idealista.com/inmueble/97028172/foto/28/",
        "https://www.idealista.com/inmueble/97028172/foto/48/"
      ],
      "Living room": [
        "https://www.idealista.com/inmueble/97028172/foto/13/",
        "https://www.idealista.com/inmueble/97028172/foto/14/",
        "https://www.idealista.com/inmueble/97028172/foto/16/",
        "https://www.idealista.com/inmueble/97028172/foto/17/",
        "https://www.idealista.com/inmueble/97028172/foto/18/",
        "https://www.idealista.com/inmueble/97028172/foto/19/"
      ],
      "Dining room": [
        "https://www.idealista.com/inmueble/97028172/foto/15/",
        "https://www.idealista.com/inmueble/97028172/foto/25/"
      ],
      "Terrace": [
        "https://www.idealista.com/inmueble/97028172/foto/20/",
        "https://www.idealista.com/inmueble/97028172/foto/21/",
        "https://www.idealista.com/inmueble/97028172/foto/22/",
        "https://www.idealista.com/inmueble/97028172/foto/24/",
        "https://www.idealista.com/inmueble/97028172/foto/36/",
        "https://www.idealista.com/inmueble/97028172/foto/40/",
        "https://www.idealista.com/inmueble/97028172/foto/41/",
        "https://www.idealista.com/inmueble/97028172/foto/42/"
      ],
      "Bedroom": [
        "https://www.idealista.com/inmueble/97028172/foto/23/",
        "https://www.idealista.com/inmueble/97028172/foto/31/",
        "https://www.idealista.com/inmueble/97028172/foto/34/",
        "https://www.idealista.com/inmueble/97028172/foto/35/",
        "https://www.idealista.com/inmueble/97028172/foto/38/",
        "https://www.idealista.com/inmueble/97028172/foto/39/",
        "https://www.idealista.com/inmueble/97028172/foto/43/"
      ],
      "Kitchen": [
        "https://www.idealista.com/inmueble/97028172/foto/26/",
        "https://www.idealista.com/inmueble/97028172/foto/27/",
        "https://www.idealista.com/inmueble/97028172/foto/29/",
        "https://www.idealista.com/inmueble/97028172/foto/30/"
      ],
      "Bathroom": [
        "https://www.idealista.com/inmueble/97028172/foto/32/",
        "https://www.idealista.com/inmueble/97028172/foto/37/",
        "https://www.idealista.com/inmueble/97028172/foto/44/"
      ],
      "Office": [
        "https://www.idealista.com/inmueble/97028172/foto/33/",
        "https://www.idealista.com/inmueble/97028172/foto/46/"
      ],
      "Staircase": [
        "https://www.idealista.com/inmueble/97028172/foto/45/",
        "https://www.idealista.com/inmueble/97028172/foto/47/"
      ],
      "Reception": [
        "https://www.idealista.com/inmueble/97028172/foto/49/"
      ]
    },
    "plans": [
      "https://www.idealista.com/inmueble/97028172/foto/50/",
      "https://www.idealista.com/inmueble/97028172/foto/51/"
    ]
  }
]

In this demonstration, we used a few CSS and XPath selectors using parsel to extract property details like price, description, features etc.

However, the images are where things get a bit more complex. For image carousels, many websites use javascript to generate dynamic HTML on demand. Idealista is no exception: it hides all of the image URLs in a javascript variable and then displays them using javascript.
To scrape this, we used a regular expression pattern to find the hidden javascript variable, loaded it as a Python dictionary and parsed out the images and floor plans.
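Here's the same technique in isolation, applied to a simplified stand-in for the javascript found in the page HTML (the snippet below is illustrative, not Idealista's real markup):

import re
import json

# simplified stand-in for the javascript variable found in the property page HTML
html = 'fullScreenGalleryPics : [{imageUrl: "https://example.com/photos/1.jpg", isPlan: false, tag: "Terrace"}],'
# 1. find the javascript array with a regular expression
image_data = re.findall(r"fullScreenGalleryPics\s*:\s*(\[.+?\]),", html)[0]
# 2. quote the bare javascript keys so the array becomes valid JSON
images = json.loads(re.sub(r'(\w+?):([^/])', r'"\1":\2', image_data))
print(images[0]["tag"])  # Terrace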

How to Scrape Hidden Web Data

For more on hidden web data scraping see our introduction article which covers several popular examples and how to scrape them in Python.


For the scraping itself, we used the asynchronous capabilities of httpx together with asyncio.as_completed to schedule multiple property requests concurrently, making our scraper super fast!

Next, let's take a look at how we can scale up this scraper by implementing exploration functionality.

Finding Idealista Properties

There are several ways to find properties listed in Idealista. The most popular and reliable is to explore by area. In this section, we'll take a look at how to scrape property listings with a little bit of crawling - we'll explore the location directory.

To find the location directory we can scroll to the bottom of the page:

screenshot of idealista location directory page
Location directory found at the bottom of the page.

Each link leads to a province listing URL which further leads to area listings URLs. We can easily scrape this with the same CSS selector technique we've used previously:

Python
ScrapFly
def parse_province(response: httpx.Response) -> List[str]:
    """parse province page for area search urls"""
    selector = Selector(text=response.text)
    urls = selector.css("#location_list li>a::attr(href)").getall()
    return [urljoin(str(response.url), url) for url in urls]


async def scrape_provinces(urls: List[str]) -> List[str]:
    """
    Scrape province pages like:
    https://www.idealista.com/en/venta-viviendas/malaga-provincia/con-chalets/municipios
    for search page urls like:
    https://www.idealista.com/en/venta-viviendas/marbella-malaga/con-chalets/
    """
    to_scrape = [session.get(url) for url in urls]
    search_urls = []
    for response in asyncio.as_completed(to_scrape):
        search_urls.extend(parse_province(await response))
    return search_urls    
def parse_province(response: ScrapeApiResponse) -> List[str]:
    """parse province page for area search urls"""
    selector = response.selector
    urls = selector.css("#location_list li>a::attr(href)").getall()
    return [urljoin(str(response.context["url"]), url) for url in urls]


async def scrape_provinces(urls: List[str]) -> List[str]:
    """
    Scrape province pages like:
    https://www.idealista.com/en/venta-viviendas/malaga-provincia/con-chalets/municipios
    for search page urls like:
    https://www.idealista.com/en/venta-viviendas/marbella-malaga/con-chalets/
    """
    to_scrape = [ScrapeConfig(url, asp=True, country="ES") for url in urls]
    search_urls = []
    async for response in scrapfly.concurrent_scrape(to_scrape):
        search_urls.extend(parse_province(response))
    return search_urls    
Run Code & Example Output
async def run():
    data = await scrape_provinces([
        "https://www.idealista.com/en/venta-viviendas/balears-illes/con-chalets/municipios"
    ])
    print(json.dumps(data, indent=2))
if __name__ == "__main__":
    asyncio.run(run())    

Will result in a dataset similar to this:

[
  "https://www.idealista.com/en/venta-viviendas/alaior-balears-illes/con-chalets/",
  "https://www.idealista.com/en/venta-viviendas/alaro-balears-illes/con-chalets/",
  "https://www.idealista.com/en/venta-viviendas/alcudia-balears-illes/con-chalets/",
  "https://www.idealista.com/en/venta-viviendas/algaida-balears-illes/con-chalets/",
  "https://www.idealista.com/en/venta-viviendas/andratx-balears-illes/con-chalets/",
  "https://www.idealista.com/en/venta-viviendas/ariany-balears-illes/con-chalets/",
  "https://www.idealista.com/en/venta-viviendas/arta-balears-illes/con-chalets/",
  "https://www.idealista.com/en/venta-viviendas/santa-maria-del-cami-balears-illes/con-chalets/",
  ...
]

This scraper will scrape all area pages for the given provinces. To discover all property listings, all we'd have to do is scrape every province. Next, let's scrape the search results page itself:

Python
ScrapFly
def parse_search(response: httpx.Response) -> List[str]:
    """Parse search result page for 30 listings"""
    selector = Selector(text=response.text)
    urls = selector.css("article.item .item-link::attr(href)").getall()
    return [urljoin(str(response.url), url) for url in urls]


async def scrape_search(url: str, paginate=True, max_pages: int = None) -> List[str]:
    """
    Scrape search urls like:
    https://www.idealista.com/en/venta-viviendas/marbella-malaga/con-chalets/
    for property urls
    """
    first_page = await session.get(url)
    property_urls = parse_search(first_page)
    if not paginate:
        return property_urls
    total_results = Selector(text=first_page.text).css("h1#h1-container").re(": (.+) houses")[0]
    total_pages = math.ceil(int(total_results.replace(",", "")) / 30)
    if total_pages > 60:
        print(f"search contains more than max page limit ({total_pages}/60)")
        total_pages = 60
    # limit the scrape to max_pages if it's provided and lower than the total page count
    if max_pages and max_pages < total_pages:
        total_pages = max_pages
    print(f"scraping {total_pages} of search results concurrently")
    to_scrape = [
        session.get(str(first_page.url) + f"pagina-{page}.htm")
        for page in range(2, total_pages + 1)
    ]
    for response in asyncio.as_completed(to_scrape):
        property_urls.extend(parse_search(await response))
    return property_urls    
def parse_search(response: ScrapeApiResponse) -> List[str]:
    """Parse search result page for 30 listings"""
    selector = response.selector
    urls = selector.css("article.item .item-link::attr(href)").getall()
    return [urljoin(str(response.context["url"]), url) for url in urls]


async def scrape_search(url: str, paginate=True, max_pages: int = None) -> List[str]:
    """
    Scrape search urls like:
    https://www.idealista.com/en/venta-viviendas/marbella-malaga/con-chalets/
    for property urls
    """
    first_page = await scrapfly.async_scrape(ScrapeConfig(url, asp=True, country="ES"))
    property_urls = parse_search(first_page)
    if not paginate:
        return property_urls
    total_results = first_page.selector.css("h1#h1-container").re(": (.+) houses")[0]
    total_pages = math.ceil(int(total_results.replace(",", "")) / 30)
    if total_pages > 60:
        print(f"search contains more than max page limit ({total_pages}/60)")
        total_pages = 60
    # limit the scrape to max_pages if it's provided and lower than the total page count
    if max_pages and max_pages < total_pages:
        total_pages = max_pages
    print(f"scraping {total_pages} of search results concurrently")
    to_scrape = [
        ScrapeConfig(first_page.context["url"] + f"pagina-{page}.htm", asp=True, country="ES")
        for page in range(2, total_pages + 1)
    ]
    async for response in scrapfly.concurrent_scrape(to_scrape):
        property_urls.extend(parse_search(response))
    return property_urls
Run Code & Example Output
async def run():
    data = await scrape_search(
        url="https://www.idealista.com/en/venta-viviendas/marbella-malaga/con-chalets/",
        max_pages=1
    )
    print(json.dumps(data, indent=2))
if __name__ == "__main__":
    asyncio.run(run())  

Will result in a dataset similar to this:

[
  "https://www.idealista.com/en/inmueble/98935300/",
  "https://www.idealista.com/en/inmueble/102479109/",
  "https://www.idealista.com/en/inmueble/102051911/",
  "https://www.idealista.com/en/inmueble/99394819/",
  "https://www.idealista.com/en/inmueble/102695949/",
  "https://www.idealista.com/en/inmueble/102645488/",
  "https://www.idealista.com/en/inmueble/102953607/",
  "https://www.idealista.com/en/inmueble/86941032/",
  "https://www.idealista.com/en/inmueble/102953907/",
  "https://www.idealista.com/en/inmueble/103130779/",
  "https://www.idealista.com/en/inmueble/100285134/",
   .....
]

For scraping paginated content like the area results pages, we first scrape the first page to extract the total result count. Then we can scrape the remaining pages concurrently, retrieving all listings in just a few seconds!
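Putting these pieces together, here's a rough end-to-end sketch that chains the functions defined above - province discovery, search pagination and property scraping (the URL and the area/page limits are purely illustrative):

async def scrape_all_baleares_chalets():
    # 1. discover area search pages from the province directory
    search_urls = await scrape_provinces([
        "https://www.idealista.com/en/venta-viviendas/balears-illes/con-chalets/municipios"
    ])
    # 2. collect property urls from the first result page of each area search
    property_urls = []
    for search_url in search_urls[:3]:  # limit to a few areas for this example
        property_urls.extend(await scrape_search(search_url, max_pages=1))
    # 3. scrape the property pages themselves
    return await scrape_properties(property_urls)

# asyncio.run(scrape_all_baleares_chalets())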


With this discovery scraper and our previous property scraper we can collect all of the existing real estate data on Idealista.com - though what if we want to be the first to know about new property listings? Next, let's take a look at how we can track Idealista for new property listings.

Tracking New Idealista Listings

To track new listings, we can take advantage of Idealista's result sorting, for which we can reuse our search scraper. Each search result page on Idealista can be sorted by "most recent", which we can scrape continuously to be the first to know when a new property gets listed.

For example, let's take a look at properties in Eixample, Barcelona:

screenshot of idealista most recent listing page

If we click the "most recent" button, we can see that each result page can be ordered via the URL parameter ordenado-por, which in the "most recent" case is ordenado-por=fecha-publicacion-desc. Let's take advantage of this fact and build a tracker scraper.
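Building such a URL is just a matter of appending the query parameter to a search URL - a quick sketch using the Eixample search from the example below:

from urllib.parse import urlencode

# append the "most recent" sorting parameter to an Idealista search URL
base = "https://www.idealista.com/en/venta-viviendas/barcelona/eixample/"
newest_first = base + "?" + urlencode({"ordenado-por": "fecha-publicacion-desc"})
print(newest_first)
# https://www.idealista.com/en/venta-viviendas/barcelona/eixample/?ordenado-por=fecha-publicacion-desc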

To scrape this in Python we can start an endless while loop that keeps checking this page for new listings:

...  # include code from previous sections
from pathlib import Path

async def track_search(url: str, output: Path, interval=60):
    """Track Idealista.com results page, scrape new listings and append them as JSON to the output file"""
    seen = set()
    output.touch(exist_ok=True)  # create file if it doesn't exist
    try:
        while True:
            properties = await scrape_search(url=url, max_pages=1)
            # deduplication filter - only keep listings we haven't seen before
            properties = [prop for prop in properties if prop not in seen]
            results = []
            if properties:
                # scrape new properties and save to file - 1 property as JSON per line
                results = await scrape_properties(properties)
                with output.open("a") as f:
                    f.write("\n".join(json.dumps(prop) for prop in results) + "\n")

                # remember scraped listings for deduplication
                for prop in properties:
                    seen.add(prop)
            print(f"scraped {len(results)} new properties; waiting {interval} seconds")
            await asyncio.sleep(interval)
    except KeyboardInterrupt:
        print("stopping price tracking")
        
# Example run:
from pathlib import Path
asyncio.run(track_search(
    "https://www.idealista.com/en/venta-viviendas/barcelona/eixample/?ordenado-por=fecha-publicacion-desc",
    Path("new-barcelona-eixample-area-properties.jsonl"),
))

This short tracker scraper will scrape the provided results page for new listings. It keeps a memory of seen listings to prevent duplicates and appends results to a JSON-lines file (1 JSON object per line).
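Since the output is JSON lines, reading the collected data back is straightforward - a quick sketch using the file name from the example above:

import json
from pathlib import Path

# read the tracker output back - one JSON object per line
with Path("new-barcelona-eixample-area-properties.jsonl").open() as f:
    tracked = [json.loads(line) for line in f if line.strip()]
print(f"{len(tracked)} properties tracked so far")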


We wrote property discovery, scraping and tracking - all that's left is to scale our scraper. If we were to increase our scraping load, Idealista is very likely to block us, so let's take a look at how to avoid blocking using the ScrapFly web scraping API next.

Bypass Idealista Blocking with ScrapFly

As we've seen, scraping Idealista.com using Python is pretty straightforward, though when scraping at scale our scrapers are likely to be blocked or asked to solve captchas.

screenshot of idealista blocking page
Idealista blocking with 'Se ha detectado un uso indebido El acceso se ha bloqueado' message

To get around this, let's take advantage of the ScrapFly API, which can avoid all of these blocks for us!
ScrapFly offers several powerful features that'll help us get around Idealista's scraper blocking, most notably the Anti Scraping Protection bypass and country-specific proxies.

For this, we'll be using the scrapfly-sdk python package and the Anti Scraping Protection Bypass feature. First, let's install scrapfly-sdk using pip:

$ pip install scrapfly-sdk

To take advantage of ScrapFly's API in our Idealista web scraper, all we need to do is replace our httpx session code with scrapfly-sdk client requests:

import httpx

response = httpx.get("some idealista.com URL")
# in ScrapFly SDK becomes
from scrapfly import ScrapflyClient, ScrapeConfig
client = ScrapflyClient("YOUR SCRAPFLY KEY")
result = client.scrape(ScrapeConfig(
    "some Idealista.ocm url",
    # we can select specific proxy country like Spain:
    country="ES",
    # and enable anti scraping protection bypass:
    asp=True
))

For more on how to scrape Idealista.com using ScrapFly, see the Full Scraper Code section.

FAQ

To wrap this guide up, let's take a look at some frequently asked questions about web scraping Idealista.com data:

Is it legal to scrape Idealista.com?

Yes. Idealista.com's data is publicly available; we're not extracting anything personal or private. Scraping Idealista.com at slow, respectful rates is perfectly legal and ethical.
That being said, attention should be paid to GDPR compliance in the EU when scraping personal data (seller's name, phone number, etc.). For more, see our Is Web Scraping Legal? article.

Does Idealista.com have a public API?

No, Idealista.com (and its sister websites) does not offer a public API for property data. However, as seen in this guide, it's easy to scrape and crawl using a little bit of Python.

Latest Idealista.com Scraper Code
https://github.com/scrapfly/scrapfly-scrapers/

Idealista Scraping Summary

In this web scraping tutorial, we wrote a short Idealista scraper for real estate property data. We started by scraping a single property page and parsing details using CSS and XPath selectors.

Then we took a look at how to find properties using Idealista's directory and search system. We wrote a small web crawler that can crawl and scrape all property listings in the provided provinces of Spain.

Finally, we took a look at how to track new listings being posted on Idealista by creating a looping scraper that constantly checks for new listings.

For all of this, we used Python with the httpx and parsel packages, and to avoid being blocked we used ScrapFly's API, which smartly configures every web scraper connection to bypass blocking.
For more about ScrapFly, see our documentation and try it out for FREE!

Full Idealista Scraper Code

import re
import asyncio
import json
import math
from collections import defaultdict
from pathlib import Path
from typing import List
from urllib.parse import urljoin

from scrapfly import ScrapeApiResponse, ScrapeConfig, ScrapflyClient

scrapfly = ScrapflyClient(key="YOUR SCRAPFLY KEY", max_concurrency=2)

# -------------------------------------------------
# Property
# -------------------------------------------------
def parse_property(result: ScrapeApiResponse):
    sel = result.selector
    css = lambda x: result.selector.css(x).get("").strip()
    css_all = lambda x: result.selector.css(x).getall()

    data = {}
    # Meta data
    data["url"] = result.context["url"]

    # Basic information
    data['title'] = css("h1 .main-info__title-main::text")
    data['location'] = css(".main-info__title-minor::text")
    data['currency'] = css(".info-data-price::text")
    data['price'] = int(css(".info-data-price span::text").replace(",", ""))
    data['description'] = "\n".join(css_all("div.comment ::text")).strip()
    data["updated"] = sel.xpath(
        "//p[@class='stats-text']"
        "[contains(text(),'updated on')]/text()"
    ).get("").split(" on ")[-1]

    # Features
    data["features"] = {}
    #  first we extract each feature block like "Basic Features" or "Amenities"
    for feature_block in result.selector.css(".details-property-h3"):
        # then for each block we extract all bullet points underneath them
        label = feature_block.xpath("text()").get()
        features = feature_block.xpath("following-sibling::div[1]//li")
        data["features"][label] = [
            ''.join(feat.xpath(".//text()").getall()).strip()
            for feat in features
        ]

    # Images
    # the images are tucked away in a javascript variable.
    # We can use regular expressions to find the variable and parse it as a dictionary:
    image_data = re.findall(r"fullScreenGalleryPics\s*:\s*(\[.+?\]),", result.content)[0]
    # we also need to replace unquoted keys to quoted keys (i.e. title -> "title"):
    images = json.loads(re.sub(r'(\w+?):([^/])', r'"\1":\2', image_data))
    data['images'] = defaultdict(list)
    data['plans'] = []
    for image in images:
        url = urljoin(result.context['url'], image['imageUrl'])
        if image['isPlan']:
            data['plans'].append(url)
        else:
            data['images'][image['tag']].append(url)
    return data



async def scrape_properties(urls: List[str]):
    to_scrape = [ScrapeConfig(url=url, country="ES", asp=True) for url in urls]
    results = []
    async for result in scrapfly.concurrent_scrape(to_scrape):
        results.append(parse_property(result))
    return results


# -------------------------------------------------
# Search
# -------------------------------------------------
def parse_province(result: ScrapeApiResponse) -> List[str]:
    """parse province page for area search urls"""
    urls = result.selector.css("#location_list li>a::attr(href)").getall()
    return [urljoin(result.context["url"], url) for url in urls]


async def scrape_provinces(urls: List[str]) -> List[str]:
    """
    Scrape province pages like:
    https://www.idealista.com/en/venta-viviendas/malaga-provincia/con-chalets/municipios
    for search page urls like:
    https://www.idealista.com/en/venta-viviendas/marbella-malaga/con-chalets/
    """
    to_scrape = [ScrapeConfig(url, country="ES", asp=True) for url in urls]
    search_urls = []
    async for result in scrapfly.concurrent_scrape(to_scrape):
        search_urls.extend(parse_province(result))
    return search_urls


def parse_search(result: ScrapeApiResponse) -> List[str]:
    """Parse search result page for 30 listings"""
    urls = result.selector.css("article.item .item-link::attr(href)").getall()
    return [urljoin(result.context["url"], url) for url in urls]


async def scrape_search(url: str, paginate=True, max_pages: int = 2) -> List[str]:
    """
    Scrape search urls like:
    https://www.idealista.com/en/venta-viviendas/marbella-malaga/con-chalets/
    for property urls
    """
    first_page = await scrapfly.async_scrape(ScrapeConfig(url=url, country="ES", asp=True))
    property_urls = parse_search(first_page)
    if not paginate:
        return property_urls

    total_results = first_page.selector.css("h1#h1-container").re(": (.+) houses")[0]
    total_pages = math.ceil(int(total_results.replace(",", "")) / 30)
    if total_pages > max_pages:  # note idealista allows max 60 pages per search
        print(f"search contains more than max page limit ({total_pages}/60)")
        total_pages = max_pages 
    print(f"scraping {total_pages} of search results concurrently")
    to_scrape = [
        ScrapeConfig(
            url=first_page.context["url"] + f"pagina-{page}.htm",
            asp=True,
            country="ES",
        )
        for page in range(2, total_pages + 1)
    ]
    async for result in scrapfly.concurrent_scrape(to_scrape):
        property_urls.extend(parse_search(result))
    return property_urls


# -------------------------------------------------
# Track Search
# -------------------------------------------------
async def track_search(url: str, output: Path, interval=60):
    """Track Idealista.com results page, scrape new listings and append them as JSON to the output file"""
    seen = set()
    output.touch(exist_ok=True)
    try:
        while True:
            properties = await scrape_search(url=url, paginate=False)
            # deduplication filter - only keep listings we haven't seen before
            properties = [prop for prop in properties if prop not in seen]
            results = []
            if properties:
                # scrape new properties and save to file - 1 property as JSON per line
                results = await scrape_properties(properties)
                with output.open("a") as f:
                    f.write("\n".join(json.dumps(prop) for prop in results) + "\n")

                # remember scraped listings for deduplication
                for prop in properties:
                    seen.add(prop)
            print(f"scraped {len(results)} new properties; waiting {interval} seconds")
            await asyncio.sleep(interval)
    except KeyboardInterrupt:
        print("stopping price tracking")


async def run():
    # scrape properties:
    urls = ["https://www.idealista.com/en/inmueble/97028172/"]
    result_properties = await scrape_properties(urls)
    # find properties
    result_search = await scrape_search("https://www.idealista.com/en/venta-viviendas/marbella-malaga/con-chalets/")
    result_province = await scrape_provinces(["https://www.idealista.com/en/venta-viviendas/balears-illes/con-chalets/municipios"])
    # track properties
    await track_search(
        "https://www.idealista.com/en/venta-viviendas/barcelona/eixample/?ordenado-por=fecha-publicacion-desc",
        Path("new-properties.jsonl"),
    )


if __name__ == "__main__":
    asyncio.run(run())

Related Posts

How to Scrape Reddit Posts, Subreddits and Profiles

In this article, we'll explore how to scrape Reddit. We'll extract various social data types from subreddits, posts, and user pages. All of which through plain HTTP requests without headless browser usage.

How to Scrape LinkedIn.com Profile, Company, and Job Data

In this scrape guide we'll be taking a look at one of the most popular web scraping targets - LinkedIn.com. We'll be scraping people profiles, company profiles as well as job listings and search.

How to Scrape SimilarWeb Website Traffic Analytics

In this guide, we'll explain how to scrape SimilarWeb through a step-by-step guide. We'll scrape comprehensive website traffic insights, websites comparing data, sitemaps, and trending industry domains.