
Imovelweb is one of Brazil’s biggest real estate marketplaces. If you’re comparing prices, tracking supply, or building a lead pipeline, scraping it can save days of manual work. The catch: Imovelweb uses modern protections (including DataDome and regional controls), so a naïve scraper will often get blocked or see limited content from non‑Brazil IPs.
In this guide, we’ll build a clean, function‑based scraper in Python to extract:
- listings from search pages (title, price, area, bedrooms, link, thumbnail)
- complete details from property pages (address, price, size, amenities, images)
- structured data from JSON‑LD when available
- pagination flow so you can iterate across multiple pages
We’ll start with requests + BeautifulSoup to keep things simple. Then we’ll show a production approach using Scrapfly with Brazilian geolocation and JavaScript rendering, which sidesteps most blocking headaches. For reference, you can also apply the same ideas you’ve seen in our other how‑to guides like Algolia scraping and Allegro scraping.
Legal Disclaimer and Precautions
This tutorial covers popular web scraping techniques for education. Interacting with public servers requires diligence and respect; here’s a good summary of what not to do:
- Do not scrape at rates that could damage the website.
- Do not scrape data that's not available publicly.
- Do not store PII of EU citizens who are protected by GDPR.
- Do not repurpose entire public datasets, which can be illegal in some countries.
What we’ll build
- A listings scraper that extracts summary data from search pages and follows pagination
- A property details scraper that prefers JSON‑LD but falls back to HTML selectors
- Anti‑block tactics (headers, retry, delays) and a practical geolocation setup
- A reliable, production path with Scrapfly (BR IP, JS rendering, session/cookie handling)
If you want the full code as a single file, check content/posts/how-to-scrape-imovelweb/code.py in this article’s folder.
Prerequisites
Install the basics:
pip install requests beautifulsoup4 lxml
We’ll use requests for HTTP and BeautifulSoup for parsing. The lxml parser makes parsing faster and more tolerant. If you plan to use Scrapfly for production scraping:
pip install scrapfly-sdk
If you only need an HTTP API, you can also call Scrapfly’s API endpoint with requests. We’ll show both.
Anatomy of an Imovelweb page
Imovelweb pages often ship structured data via JSON‑LD (<script type="application/ld+json">). When present, this is the cleanest way to pull price, address, number of rooms, and images. If JSON‑LD is missing or partial, we’ll extract from the HTML.
Common elements to look for:
- title: h1 or a header container near the top
- price: a price container in BRL (R$); sometimes in JSON‑LD offers.price
- area: square meters (m²) and land size
- bedrooms/bathrooms/parking: small icon + label sets near the price
- address: neighborhood, city, state, sometimes full address
- gallery: img tags or a slider; JSON‑LD often includes image URLs
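Before writing selectors, it helps to check what JSON‑LD a given page actually exposes. Here’s a minimal inspection sketch; it assumes you have already saved a page locally as listing.html (the file name is just an example):
import json
from bs4 import BeautifulSoup

# Assumes a page was saved beforehand as "listing.html" (hypothetical file name)
with open("listing.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "lxml")

for script in soup.find_all("script", attrs={"type": "application/ld+json"}):
    try:
        data = json.loads(script.string or "{}")
    except json.JSONDecodeError:
        continue
    # JSON-LD may be a single object or a list of objects
    items = data if isinstance(data, list) else [data]
    for item in items:
        if isinstance(item, dict):
            print(item.get("@type"), sorted(item.keys()))
If the output shows keys like offers, address, and image, JSON‑LD is worth preferring on that page.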
A simple, requests-based scraper
We’ll keep code modular so you can reuse parts. First we import dependencies, then create a session with realistic headers, and add a tiny fetch helper with retries.
import time
import json
import random
import re
from typing import Any, Dict, List, Optional
import requests
from bs4 import BeautifulSoup
These imports are all you need for a plain HTML workflow with requests and BeautifulSoup.
def create_session() -> requests.Session:
"""Create a requests session with realistic headers and BR Portuguese preferences."""
session = requests.Session()
user_agents = [
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0",
]
session.headers.update({
"User-Agent": random.choice(user_agents),
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
"Accept-Language": "pt-BR,pt;q=0.9,en;q=0.8",
"Accept-Encoding": "gzip, deflate, br",
"Connection": "keep-alive",
"Upgrade-Insecure-Requests": "1",
"DNT": "1",
})
return session
This session sets a realistic User-Agent, localizes language to pt-BR, and mirrors a normal browser’s headers to avoid sticking out.
def get_html(session: requests.Session, url: str, max_retries: int = 3, delay_range: tuple = (1.0, 2.5)) -> Optional[str]:
"""Fetch URL with basic retry and small random delay. Returns HTML text or None."""
for attempt in range(1, max_retries + 1):
try:
time.sleep(random.uniform(*delay_range))
resp = session.get(url, timeout=30)
if resp.status_code == 200 and "text/html" in resp.headers.get("Content-Type", ""):
return resp.text
if resp.status_code in (403, 429) or "datadome" in resp.text.lower():
return None
except requests.RequestException:
pass
if attempt < max_retries:
time.sleep(random.uniform(2.0, 4.0))
return None
This helper adds a tiny random pause, retries on transient errors, and bails out on 403/429 or DataDome pages.
Extracting listing cards
Selectors change over time, but here we’ll stick to the exact, current hooks (no fallbacks). Cards use div.postingsList-module__card-container > div.postingCardLayout-module__posting-card-layout with:
- price: div.postingPrices-module__price[data-qa="POSTING_CARD_PRICE"]
- features: h3[data-qa="POSTING_CARD_FEATURES"] span (e.g., 1250 m² tot., 4 quartos, 4 ban., 5 vagas)
- title/description link: h3[data-qa="POSTING_CARD_DESCRIPTION"] a
- URL: also present in data-to-posting on .postingCardLayout-module__posting-card-layout
- thumbnail: the first img inside .postingGallery-module__gallery-container
def extract_listings(html: str) -> List[Dict[str, Any]]:
"""Parse a listings page and return a list of listing summaries."""
soup = BeautifulSoup(html, "lxml")
listings: List[Dict[str, Any]] = []
# Use the current card container selector only
card_candidates = soup.select(".postingCardLayout-module__posting-card-layout")
for card in card_candidates:
try:
# Title/description (anchor inside description block)
title_el = card.select_one('[data-qa="POSTING_CARD_DESCRIPTION"] a')
# Price
price_block = card.select_one('[data-qa="POSTING_CARD_PRICE"]')
price_el = price_block.get_text(strip=True) if price_block else None
# Main features: area (m²), bedrooms (quartos), bathrooms (ban./banheiros)
feature_spans = card.select('[data-qa="POSTING_CARD_FEATURES"] span')
area_el = None
beds_el = None
baths_el = None
for sp in feature_spans:
txt = sp.get_text(strip=True)
if not area_el and "m²" in txt:
area_el = txt
elif not beds_el and re.search(r"\bquartos?\b", txt, flags=re.I):
beds_el = txt
elif not baths_el and re.search(r"\bban(\.|heiros?)\b", txt, flags=re.I):
baths_el = txt
# URL from data-to-posting attribute
link = card.get("data-to-posting")
if link and link.startswith("/"):
link = f"https://www.imovelweb.com.br{link}"
# Thumbnail from gallery (first image src)
thumb_el = card.select_one(".postingGallery-module__gallery-container img")
thumb = thumb_el.get("src") if thumb_el else None
listings.append({
"title": (title_el.get_text(strip=True) if title_el else None),
"price": (price_el.strip() if isinstance(price_el, str) else price_el),
"area": (area_el.strip() if isinstance(area_el, str) else area_el),
"bedrooms": (beds_el.strip() if isinstance(beds_el, str) else beds_el),
"bathrooms": (baths_el.strip() if isinstance(baths_el, str) else baths_el),
"url": link,
"thumbnail": thumb,
})
except Exception:
continue
return [l for l in listings if l.get("url")]
This extracts the basics you see on a card: title, price, area, room counts, the link, and a thumbnail. We now use only data-qa hooks (POSTING_CARD_PRICE, POSTING_CARD_FEATURES, POSTING_CARD_DESCRIPTION) and data-to-posting for the URL, with no fallbacks.
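The card fields come back as display strings (e.g., "R$ 600.000", "90 m²", "1 quartos"). If you need numbers downstream, a small normalization pass can run over each dict. A sketch, where the helper name and the derived field names are my own:
def normalize_listing(item: Dict[str, Any]) -> Dict[str, Any]:
    """Convert display strings like 'R$ 600.000' and '90 m²' into numeric fields."""
    out = dict(item)
    # Price: keep digits only ("R$ 600.000" -> 600000)
    digits = re.sub(r"\D", "", item.get("price") or "")
    out["price_brl"] = int(digits) if digits else None
    # Area: take the number before "m²" and handle BR thousand separators / decimal commas
    m = re.search(r"(\d+(?:[\.,]\d+)?)\s*m²", item.get("area") or "")
    out["area_m2"] = float(m.group(1).replace(".", "").replace(",", ".")) if m else None
    # Bedrooms/bathrooms: first integer in the label ("1 quartos" -> 1)
    for key in ("bedrooms", "bathrooms"):
        m = re.search(r"\d+", item.get(key) or "")
        out[f"{key}_count"] = int(m.group()) if m else None
    return out
For example: rows = [normalize_listing(x) for x in extract_listings(html)].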
Extracting property details (JSON‑LD first)
On detail pages, JSON‑LD (if present) is the cleanest source. We’ll parse JSON blocks and look for objects that smell like real estate listings (they often include offers, address, and image).
def parse_first_jsonld(soup: BeautifulSoup) -> Optional[Dict[str, Any]]:
for script in soup.find_all("script", attrs={"type": "application/ld+json"}):
try:
data = json.loads(script.string or "{}")
except json.JSONDecodeError:
continue
# Sometimes JSON-LD is a list
if isinstance(data, list):
for item in data:
if isinstance(item, dict) and (item.get("@type") or item.get("offers")):
return item
elif isinstance(data, dict) and (data.get("@type") or data.get("offers")):
return data
return None
This scans <script type="application/ld+json"> blocks and returns the first relevant object that looks like a listing.
def extract_property_details(html: str) -> Dict[str, Any]:
"""Extract rich property data from a detail page using JSON‑LD with HTML fallbacks."""
soup = BeautifulSoup(html, "lxml")
out: Dict[str, Any] = {}
jsonld = parse_first_jsonld(soup)
if jsonld:
out["jsonld"] = jsonld
# Common fields
out["title"] = jsonld.get("name") or jsonld.get("headline")
if isinstance(jsonld.get("offers"), dict):
price = jsonld["offers"].get("price")
currency = jsonld["offers"].get("priceCurrency", "BRL")
out["price"] = f"{price} {currency}" if price else None
addr = jsonld.get("address")
if isinstance(addr, dict):
out["address"] = ", ".join(filter(None, [
addr.get("streetAddress"),
addr.get("addressLocality"),
addr.get("addressRegion"),
addr.get("postalCode"),
])) or None
images = jsonld.get("image")
if isinstance(images, list):
out["images"] = images
elif isinstance(images, str):
out["images"] = [images]
# HTML fallbacks
title_el = soup.select_one("h1")
if title_el and not out.get("title"):
out["title"] = title_el.get_text(strip=True)
price_text = soup.find(string=re.compile(r"R\$\s?\d"))
if price_text and not out.get("price"):
out["price"] = price_text.strip()
# Room counts (Portuguese labels vary: quartos, suítes, banheiros, vagas)
def find_label(pattern: str) -> Optional[str]:
el = soup.find(string=re.compile(pattern, re.I))
return el.strip() if isinstance(el, str) else None
out.setdefault("bedrooms", find_label(r"\bquartos?\b|\bdormitórios?\b"))
out.setdefault("bathrooms", find_label(r"\bbanheiros?\b"))
out.setdefault("parking", find_label(r"\bvagas?\b"))
out.setdefault("area", find_label(r"\d+[\.,]?\d*\s*m²"))
# Description
desc_el = soup.select_one("[data-testid='description'], .description, #description")
if desc_el:
out["description"] = desc_el.get_text(" ", strip=True)
return out
This prefers JSON‑LD for clean fields (title, price, address, images) and fills any gaps with simple HTML lookups.
Pagination helper
We’ll look for a “next” link and return an absolute URL.
from urllib.parse import urljoin
def find_next_page(html: str, current_url: str) -> Optional[str]:
soup = BeautifulSoup(html, "lxml")
next_link = soup.find("a", attrs={"rel": "next"}) or soup.find("a", string=re.compile(r"Próxima|Seguinte|Próximo", re.I))
if next_link and next_link.get("href"):
return urljoin(current_url, next_link["href"])
return None
This looks for a rel="next" link or a localized “Próximo” label and returns an absolute URL.
Putting it together: scrape N listing pages
The function below walks listing pages, collects summary data, and yields property URLs to process downstream.
def scrape_list_pages(start_url: str, max_pages: int = 3) -> List[Dict[str, Any]]:
session = create_session()
url = start_url
all_listings: List[Dict[str, Any]] = []
for _ in range(max_pages):
html = get_html(session, url)
if not html:
break
page_listings = extract_listings(html)
all_listings.extend(page_listings)
nxt = find_next_page(html, url)
if not nxt:
break
url = nxt
# Deduplicate by URL
seen: set = set()
unique: List[Dict[str, Any]] = []
for item in all_listings:
u = item.get("url")
if u and u not in seen:
seen.add(u)
unique.append(item)
return unique
Example usage:
if __name__ == "__main__":
# Example: São Paulo houses for sale (adjust filters on site and paste the URL)
start = "https://www.imovelweb.com.br/casas-venda-sao-paulo-sp.html"
listings = scrape_list_pages(start, max_pages=2)
print(f"Collected {len(listings)} listings")
for it in listings[:5]:
print(it["title"], it["price"], it["url"]) # sample
Here’s how the output might look for a single listing:
{
"title": "Rua Luigi Alamanni",
"price": "R$ 600.000",
"area": "90 m²",
"bedrooms": "1 quartos",
"bathrooms": "1 banheiros",
"url": "https://www.imovelweb.com.br/propriedades/casa-a-venda-sacoma-1-quarto-90-m-sao-paulo-3000738541.html",
"thumbnail": "https://imgbr.imovelwebcdn.com/avisos/2/30/00/73/85/41/360x266/4556613970.jpg?isFirstImage=true"
}
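To move these summaries into a pipeline, dumping them to CSV is usually enough for a first pass. A minimal sketch using the standard library (save_listings_csv is a name I’m introducing; the columns mirror the dict keys above):
import csv

def save_listings_csv(listings: List[Dict[str, Any]], path: str = "listings.csv") -> None:
    """Write listing summaries to a CSV file, one row per listing."""
    fields = ["title", "price", "area", "bedrooms", "bathrooms", "url", "thumbnail"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(listings)
For example, call save_listings_csv(listings) right after scrape_list_pages() returns.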
Scraping property details
Given a list of property URLs from the step above, fetch each page and extract details. Prefer JSON‑LD, then fall back to HTML.
def scrape_property(url: str) -> Optional[Dict[str, Any]]:
session = create_session()
html = get_html(session, url)
if not html:
return None
return extract_property_details(html)
This fetches a single property page with a fresh session and returns a dict built by extract_property_details.
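To process many properties, reuse one session, pace the requests, and collect the results. A sketch under those assumptions (scrape_properties is a helper name of my own):
def scrape_properties(urls: List[str], pause: tuple = (2.0, 4.0)) -> List[Dict[str, Any]]:
    """Fetch each property URL with a shared session and extract its details."""
    session = create_session()
    results: List[Dict[str, Any]] = []
    for url in urls:
        html = get_html(session, url)
        if html:
            details = extract_property_details(html)
            details["url"] = url
            results.append(details)
        # Extra pause between detail pages, on top of get_html's own delay
        time.sleep(random.uniform(*pause))
    return results
For example: details = scrape_properties([item["url"] for item in listings[:10]]).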
Handling anti‑bot, DataDome, and geolocation (BR IPs)
Imovelweb inspects browser fingerprints and often enforces Brazilian geolocation. If you hit 403/429 responses or a DataDome page, switch to a rendering proxy with country targeting. The most straightforward path is Scrapfly.
Option A: Scrapfly Python SDK
When you need reliable BR IPs and JavaScript rendering, use the Scrapfly SDK to fetch pages consistently.
# pip install scrapfly-sdk
from typing import Optional
try:
from scrapfly import ScrapflyClient, ScrapeConfig
except Exception:
ScrapflyClient = None # type: ignore
ScrapeConfig = None # type: ignore
def scrapfly_fetch_html(api_key: str, url: str, render_js: bool = True, country: str = "br") -> Optional[str]:
"""Use Scrapfly to fetch HTML with BR geolocation and (optional) JS rendering."""
if ScrapflyClient is None or ScrapeConfig is None:
return None
client = ScrapflyClient(key=api_key)
cfg = ScrapeConfig(url=url, render_js=render_js, country=country)
result = client.scrape(cfg)
return result.content
With Scrapfly you get session handling, high‑quality Brazilian IPs, and automatic mitigation for common bot challenges. You can then pass the returned HTML into the same extract_listings / extract_property_details functions you already wrote.
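Option B: Scrapfly HTTP API with requests
If you’d rather not add the SDK, you can call Scrapfly’s scrape endpoint directly with requests. Treat the endpoint, parameter names, and response shape below as assumptions to verify against the current Scrapfly API documentation:
import requests

def scrapfly_fetch_html_http(api_key: str, url: str, render_js: bool = True, country: str = "br") -> Optional[str]:
    """Fetch HTML via Scrapfly's HTTP API (endpoint and params are assumptions; check the docs)."""
    resp = requests.get(
        "https://api.scrapfly.io/scrape",
        params={
            "key": api_key,
            "url": url,
            "render_js": str(render_js).lower(),
            "country": country,
        },
        timeout=90,
    )
    if resp.status_code != 200:
        return None
    data = resp.json()
    # The rendered HTML is expected under result.content in the response JSON
    return data.get("result", {}).get("content")
As with the SDK, the returned HTML feeds straight into extract_listings and extract_property_details.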
ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale.
- Anti-bot protection bypass - extract web pages without blocking!
- Rotating residential proxies - prevent IP address and geographic blocks.
- LLM prompts - extract data or ask questions using LLMs
- Extraction models - automatically find objects like products, articles, jobs, and more.
- Extraction templates - extract data using your own specification.
- Python and Typescript SDKs, as well as Scrapy and no-code tool integrations.
Tips to avoid blocks
- Keep headers realistic and localized (Accept-Language: pt-BR).
- Add short, random delays between requests; back off on error spikes.
- Use retries for network hiccups; don’t retry instantly on 403/429.
- Prefer JSON‑LD over brittle CSS selectors.
- For multi‑page crawls, use a render proxy with BR geolocation.
- Cache pages during development so you can debug parser logic offline (see the sketch after this list).
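The last tip deserves a concrete sketch: a thin wrapper around get_html that writes each page to disk, so repeated runs during development parse offline instead of re-fetching. The function name and cache directory are my own choices:
import hashlib
from pathlib import Path

def cached_get_html(session: requests.Session, url: str, cache_dir: str = ".cache") -> Optional[str]:
    """get_html wrapper that stores each fetched page on disk, keyed by a hash of the URL."""
    Path(cache_dir).mkdir(exist_ok=True)
    path = Path(cache_dir) / (hashlib.sha256(url.encode("utf-8")).hexdigest()[:16] + ".html")
    if path.exists():
        return path.read_text(encoding="utf-8")
    html = get_html(session, url)
    if html:
        path.write_text(html, encoding="utf-8")
    return html
Swap it in for get_html while iterating on selectors, then drop back to live fetching for real runs.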
For a deep dive into anti‑block tactics (TLS fingerprints, headless browsers, etc.), see 5 Tools to Scrape Without Blocking and How it All Works, a tutorial on how to avoid web scraper blocking that covers JavaScript and TLS (JA3) fingerprinting and the role request headers play in blocking.
FAQ
Why do I see a DataDome page or get 403/429?
Your requests likely lack a real browser fingerprint or the IP isn’t in Brazil. Switch to a render proxy with BR geolocation (e.g., Scrapfly with country="br" and render_js=True).
Do I need JavaScript rendering for Imovelweb?
Often yes, especially for listings and dynamic widgets. JSON‑LD can still load server‑side, but JS rendering increases success rates and consistency.
Is HTML parsing enough if JSON‑LD is missing?
Yes. Target stable containers around price, area (m²), and room counts. Keep multiple selectors and review them periodically as the site evolves.
Summary
We put together a small, readable toolkit for Imovelweb: a listings scraper, a detail extractor that prefers JSON‑LD, simple pagination, and a few guardrails (headers, delays, retries). The code is intentionally plain so you can change selectors fast when the UI shifts, and you can drop each function into your own pipeline without dragging along extra structure.
When you need reliability at scale, fetch pages through Scrapfly with Brazilian geolocation and, when necessary, JavaScript rendering. Add modest rate limiting, retries with backoff, and a cache for debugging. From here, wire results into your storage (CSV/DB), schedule runs, and keep an eye on error rates so you can adjust selectors and pacing before anything breaks.