How to Scrape Imovelweb.com

Imovelweb is one of Brazil’s biggest real estate marketplaces. If you’re comparing prices, tracking supply, or building a lead pipeline, scraping it can save days of manual work. The catch: Imovelweb uses modern protections (including DataDome and regional controls), so a naïve scraper will often get blocked or see limited content from non‑Brazil IPs.

In this guide, we’ll build a clean, function‑based scraper in Python to extract:

  • listings from search pages (title, price, area, bedrooms, link, thumbnail)
  • complete details from property pages (address, price, size, amenities, images)
  • structured data from JSON‑LD when available
  • pagination flow so you can iterate across multiple pages

We’ll start with requests + BeautifulSoup to keep things simple. Then we’ll show a production approach using Scrapfly with Brazilian geolocation and JavaScript rendering, which sidesteps most blocking headaches. For reference, you can also apply the same ideas you’ve seen in our other how‑to guides like Algolia scraping and Allegro scraping.

What we’ll build

  • A listings scraper that extracts summary data from search pages and follows pagination
  • A property details scraper that prefers JSON‑LD but falls back to HTML selectors
  • Anti‑block tactics (headers, retry, delays) and a practical geolocation setup
  • A reliable, production path with Scrapfly (BR IP, JS rendering, session/cookie handling)

If you want the full code as a single file, check content/posts/how-to-scrape-imovelweb/code.py in this article’s folder.

Prerequisites

Install the basics:

pip install requests beautifulsoup4 lxml

We’ll use requests for HTTP and BeautifulSoup for parsing. The lxml parser makes parsing faster and more tolerant. If you plan to use Scrapfly for production scraping:

pip install scrapfly-sdk

If you only need an HTTP API, you can also call Scrapfly’s API endpoint with requests. We’ll show both.

Anatomy of an Imovelweb page

Imovelweb pages often ship structured data via JSON‑LD (<script type="application/ld+json">). When present, this is the cleanest way to pull price, address, number of rooms, and images. If JSON‑LD is missing or partial, we’ll extract from the HTML.

Common elements to look for:

  • title: h1 or a header container near the top
  • price: a price container in BRL (R$); sometimes in JSON‑LD offers.price
  • area: square meters (m²) and land size
  • bedrooms/bathrooms/parking: small icon + label sets near the price
  • address: neighborhood, city, state, sometimes full address
  • gallery: img tags or a slider; JSON‑LD often includes image URLs
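
For orientation, here's a pared-down illustration of the kind of JSON-LD object you might find. The shape follows schema.org conventions and all values here are made up; the real payload will be larger and may differ:

{
  "@context": "https://schema.org",
  "@type": "House",
  "name": "Casa à venda, 3 quartos, Vila Mariana",
  "offers": { "@type": "Offer", "price": "850000", "priceCurrency": "BRL" },
  "address": {
    "@type": "PostalAddress",
    "streetAddress": "Rua Exemplo, 123",
    "addressLocality": "São Paulo",
    "addressRegion": "SP"
  },
  "image": ["https://imgbr.imovelwebcdn.com/avisos/.../1.jpg"]
}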

A simple, requests-based scraper

We’ll keep code modular so you can reuse parts. First we import dependencies, then create a session with realistic headers, and add a tiny fetch helper with retries.

import time
import json
import random
import re
from typing import Any, Dict, List, Optional

import requests
from bs4 import BeautifulSoup

These imports are all you need for a plain HTML workflow with requests and BeautifulSoup.

def create_session() -> requests.Session:
    """Create a requests session with realistic headers and BR Portuguese preferences."""
    session = requests.Session()
    user_agents = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0",
    ]
    session.headers.update({
        "User-Agent": random.choice(user_agents),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language": "pt-BR,pt;q=0.9,en;q=0.8",
        "Accept-Encoding": "gzip, deflate, br",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
        "DNT": "1",
    })
    return session

This session sets a realistic User-Agent, localizes language to pt-BR, and mirrors a normal browser’s headers to avoid sticking out.

def get_html(session: requests.Session, url: str, max_retries: int = 3, delay_range: tuple = (1.0, 2.5)) -> Optional[str]:
    """Fetch URL with basic retry and small random delay. Returns HTML text or None."""
    for attempt in range(1, max_retries + 1):
        try:
            time.sleep(random.uniform(*delay_range))
            resp = session.get(url, timeout=30)
            if resp.status_code == 200 and "text/html" in resp.headers.get("Content-Type", ""):
                # DataDome challenges can arrive with a 200 status, so check the body too
                if "datadome" not in resp.text.lower():
                    return resp.text
                return None
            if resp.status_code in (403, 429):
                return None
        except requests.RequestException:
            pass
        if attempt < max_retries:
            time.sleep(random.uniform(2.0, 4.0))
    return None

This helper adds a tiny random pause, retries on transient errors, and bails out on 403/429 or DataDome pages.

Extracting listing cards

Selectors change over time, but here we’ll stick to the exact, current hooks (no fallbacks). Cards use div.postingsList-module__card-container > div.postingCardLayout-module__posting-card-layout with:

  • price: div.postingPrices-module__price[data-qa="POSTING_CARD_PRICE"]
  • features: h3[data-qa="POSTING_CARD_FEATURES"] span (e.g., 1250 m² tot., 4 quartos, 4 ban., 5 vagas)
  • title/description link: h3[data-qa="POSTING_CARD_DESCRIPTION"] a
  • URL: also present in data-to-posting on .postingCardLayout-module__posting-card-layout
  • thumbnail: the first img inside .postingGallery-module__gallery-container

def extract_listings(html: str) -> List[Dict[str, Any]]:
    """Parse a listings page and return a list of listing summaries."""
    soup = BeautifulSoup(html, "lxml")
    listings: List[Dict[str, Any]] = []

    # Use the current card container selector only
    card_candidates = soup.select(".postingCardLayout-module__posting-card-layout")

    for card in card_candidates:
        try:
            # Title/description (anchor inside description block)
            title_el = card.select_one('[data-qa="POSTING_CARD_DESCRIPTION"] a')

            # Price
            price_block = card.select_one('[data-qa="POSTING_CARD_PRICE"]')
            price_el = price_block.get_text(strip=True) if price_block else None

            # Main features: area (m²), bedrooms (quartos), bathrooms (ban./banheiros)
            feature_spans = card.select('[data-qa="POSTING_CARD_FEATURES"] span')
            area_el = None
            beds_el = None
            baths_el = None
            for sp in feature_spans:
                txt = sp.get_text(strip=True)
                if not area_el and "m²" in txt:
                    area_el = txt
                elif not beds_el and re.search(r"\bquartos?\b", txt, flags=re.I):
                    beds_el = txt
                elif not baths_el and re.search(r"\bban(?:\.|heiros?\b)", txt, flags=re.I):  # matches "ban." and "banheiro(s)"
                    baths_el = txt

            # URL from data-to-posting attribute
            link = card.get("data-to-posting")
            if link and link.startswith("/"):
                link = f"https://www.imovelweb.com.br{link}"

            # Thumbnail from gallery (first image src)
            thumb_el = card.select_one(".postingGallery-module__gallery-container img")
            thumb = thumb_el.get("src") if thumb_el else None

            listings.append({
                "title": (title_el.get_text(strip=True) if title_el else None),
                "price": (price_el.strip() if isinstance(price_el, str) else price_el),
                "area": (area_el.strip() if isinstance(area_el, str) else area_el),
                "bedrooms": (beds_el.strip() if isinstance(beds_el, str) else beds_el),
                "bathrooms": (baths_el.strip() if isinstance(baths_el, str) else baths_el),
                "url": link,
                "thumbnail": thumb,
            })
        except Exception:
            continue

    return [item for item in listings if item.get("url")]

This extracts the basics you see on a card: title, price, area, room counts, the link, and a thumbnail. We rely only on data-qa hooks (POSTING_CARD_PRICE, POSTING_CARD_FEATURES, POSTING_CARD_DESCRIPTION) and the data-to-posting attribute for the URL, with no fallbacks.
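
To try it quickly, feed it a fetched search page (the URL below is the same São Paulo example used later in this guide):

session = create_session()
html = get_html(session, "https://www.imovelweb.com.br/casas-venda-sao-paulo-sp.html")
if html:
    for item in extract_listings(html)[:3]:
        print(item["price"], "-", item["url"])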

Extracting property details (JSON‑LD first)

On detail pages, JSON‑LD (if present) is the cleanest source. We’ll parse JSON blocks and look for objects that smell like real estate listings (they often include offers, address, and image).

def parse_first_jsonld(soup: BeautifulSoup) -> Optional[Dict[str, Any]]:
    for script in soup.find_all("script", attrs={"type": "application/ld+json"}):
        try:
            data = json.loads(script.string or "{}")
        except json.JSONDecodeError:
            continue
        # Sometimes JSON-LD is a list
        if isinstance(data, list):
            for item in data:
                if isinstance(item, dict) and (item.get("@type") or item.get("offers")):
                    return item
        elif isinstance(data, dict) and (data.get("@type") or data.get("offers")):
            return data
    return None

This scans <script type="application/ld+json"> blocks and returns the first relevant object that looks like a listing.

def extract_property_details(html: str) -> Dict[str, Any]:
    """Extract rich property data from a detail page using JSON‑LD with HTML fallbacks."""
    soup = BeautifulSoup(html, "lxml")
    out: Dict[str, Any] = {}

    jsonld = parse_first_jsonld(soup)
    if jsonld:
        out["jsonld"] = jsonld
        # Common fields
        out["title"] = jsonld.get("name") or jsonld.get("headline")
        if isinstance(jsonld.get("offers"), dict):
            price = jsonld["offers"].get("price")
            currency = jsonld["offers"].get("priceCurrency", "BRL")
            out["price"] = f"{price} {currency}" if price else None
        addr = jsonld.get("address")
        if isinstance(addr, dict):
            out["address"] = ", ".join(filter(None, [
                addr.get("streetAddress"),
                addr.get("addressLocality"),
                addr.get("addressRegion"),
                addr.get("postalCode"),
            ])) or None
        images = jsonld.get("image")
        if isinstance(images, list):
            out["images"] = images
        elif isinstance(images, str):
            out["images"] = [images]

    # HTML fallbacks
    title_el = soup.select_one("h1")
    if title_el and not out.get("title"):
        out["title"] = title_el.get_text(strip=True)

    price_text = soup.find(string=re.compile(r"R\$\s?\d"))
    if price_text and not out.get("price"):
        out["price"] = price_text.strip()

    # Room counts (Portuguese labels vary: quartos, suítes, banheiros, vagas)
    def find_label(pattern: str) -> Optional[str]:
        el = soup.find(string=re.compile(pattern, re.I))
        return el.strip() if isinstance(el, str) else None

    out.setdefault("bedrooms", find_label(r"\bquartos?\b|\bdormitórios?\b"))
    out.setdefault("bathrooms", find_label(r"\bbanheiros?\b"))
    out.setdefault("parking", find_label(r"\bvagas?\b"))
    out.setdefault("area", find_label(r"\d+[\.,]?\d*\s*m²"))

    # Description
    desc_el = soup.select_one("[data-testid='description'], .description, #description")
    if desc_el:
        out["description"] = desc_el.get_text(" ", strip=True)

    return out

This prefers JSON‑LD for clean fields (title, price, address, images) and fills any gaps with simple HTML lookups.

Pagination helper

We’ll look for a “next” link and return an absolute URL.

from urllib.parse import urljoin

def find_next_page(html: str, current_url: str) -> Optional[str]:
    soup = BeautifulSoup(html, "lxml")
    next_link = soup.find("a", attrs={"rel": "next"}) or soup.find("a", string=re.compile(r"Próxima|Seguinte|Próximo", re.I))
    if next_link and next_link.get("href"):
        return urljoin(current_url, next_link["href"])
    return None

This looks for a rel="next" link or a localized “Próximo” label and returns an absolute URL.

Putting it together: scrape N listing pages

The function below walks listing pages, collects summary data, and yields property URLs to process downstream.

def scrape_list_pages(start_url: str, max_pages: int = 3) -> List[Dict[str, Any]]:
    session = create_session()
    url = start_url
    all_listings: List[Dict[str, Any]] = []

    for _ in range(max_pages):
        html = get_html(session, url)
        if not html:
            break
        page_listings = extract_listings(html)
        all_listings.extend(page_listings)
        nxt = find_next_page(html, url)
        if not nxt:
            break
        url = nxt

    # Deduplicate by URL
    seen: set = set()
    unique: List[Dict[str, Any]] = []
    for item in all_listings:
        u = item.get("url")
        if u and u not in seen:
            seen.add(u)
            unique.append(item)
    return unique

Example usage:

if __name__ == "__main__":
    # Example: São Paulo houses for sale (adjust filters on site and paste the URL)
    start = "https://www.imovelweb.com.br/casas-venda-sao-paulo-sp.html"
    listings = scrape_list_pages(start, max_pages=2)
    print(f"Collected {len(listings)} listings")
    for it in listings[:5]:
        print(it["title"], it["price"], it["url"])  # sample

Here's what a single collected listing looks like:

{
  "title": "Rua Luigi Alamanni",
  "price": "R$ 600.000",
  "area": "90 m²",
  "bedrooms": "1 quartos",
  "bathrooms": "1 banheiros",
  "url": "https://www.imovelweb.com.br/propriedades/casa-a-venda-sacoma-1-quarto-90-m-sao-paulo-3000738541.html",
  "thumbnail": "https://imgbr.imovelwebcdn.com/avisos/2/30/00/73/85/41/360x266/4556613970.jpg?isFirstImage=true"
}

Scraping property details

Given a list of property URLs from the step above, fetch each page and extract details. Prefer JSON‑LD, then fall back to HTML.

def scrape_property(url: str, session: Optional[requests.Session] = None) -> Optional[Dict[str, Any]]:
    """Fetch a single property page and extract its details."""
    # Reuse a passed-in session so consecutive requests share cookies
    session = session or create_session()
    html = get_html(session, url)
    if not html:
        return None
    return extract_property_details(html)

This fetches a single property page and returns a dict built by extract_property_details. Passing in an existing session lets consecutive requests share cookies instead of spinning up a fresh session per URL.
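
For example, you can walk the listing URLs collected earlier, letting get_html's built-in delay pace the requests:

session = create_session()
details = []
for item in listings[:10]:  # listings from scrape_list_pages()
    data = scrape_property(item["url"], session=session)
    if data:
        details.append(data)
print(f"Scraped {len(details)} property pages")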

Handling anti‑bot, DataDome, and geolocation (BR IPs)

Imovelweb inspects browser fingerprints and often enforces Brazilian geolocation. If you hit 403/429 responses or a DataDome page, switch to a rendering proxy with country targeting. The most straightforward path is Scrapfly.

Option A: Scrapfly Python SDK

When you need reliable BR IPs and JavaScript rendering, use the Scrapfly SDK to fetch pages consistently.

# pip install scrapfly-sdk

from typing import Optional

try:
    from scrapfly import ScrapflyClient, ScrapeConfig
except Exception:
    ScrapflyClient = None  # type: ignore
    ScrapeConfig = None  # type: ignore


def scrapfly_fetch_html(api_key: str, url: str, render_js: bool = True, country: str = "br") -> Optional[str]:
    """Use Scrapfly to fetch HTML with BR geolocation and (optional) JS rendering."""
    if ScrapflyClient is None or ScrapeConfig is None:
        return None
    client = ScrapflyClient(key=api_key)
    cfg = ScrapeConfig(url=url, render_js=render_js, country=country, asp=True)  # asp enables Scrapfly's anti-bot bypass
    result = client.scrape(cfg)
    return result.content

With Scrapfly you get session handling, high‑quality Brazilian IPs, and automatic mitigation for common bot challenges. You can then pass the returned HTML into the same extract_listings / extract_property_details functions you already wrote.
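
Option B: Scrapfly HTTP API

If you'd rather not add the SDK, you can call Scrapfly's API endpoint directly with requests (the second option mentioned in the prerequisites). This is a minimal sketch assuming the /scrape endpoint and its key, url, country, render_js, and asp query parameters; check the API docs for the current parameter set:

import requests
from typing import Optional

def scrapfly_fetch_html_api(api_key: str, url: str) -> Optional[str]:
    """Fetch HTML through Scrapfly's HTTP API with BR geolocation and JS rendering."""
    resp = requests.get(
        "https://api.scrapfly.io/scrape",
        params={
            "key": api_key,
            "url": url,
            "country": "br",      # Brazilian IPs
            "render_js": "true",  # render JavaScript
            "asp": "true",        # anti-bot mitigation
        },
        timeout=90,
    )
    resp.raise_for_status()
    # The API wraps the page HTML in a JSON envelope
    return resp.json().get("result", {}).get("content")

Either way, the HTML you get back feeds straight into the same extract_listings and extract_property_details functions.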


Tips to avoid blocks

  • Keep headers realistic and localized (Accept-Language: pt-BR).
  • Add short, random delays between requests; back off on error spikes.
  • Use retries for network hiccups; don’t retry instantly on 403/429.
  • Prefer JSON‑LD over brittle CSS selectors.
  • For multi‑page crawls, use a render proxy with BR geolocation.
  • Cache pages during development so you debug parser logic offline (a small helper is sketched below).
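
Here's a minimal disk cache you might use while iterating on selectors (the .html_cache directory name is arbitrary):

import hashlib
from pathlib import Path

CACHE_DIR = Path(".html_cache")

def get_html_cached(session: requests.Session, url: str) -> Optional[str]:
    """Like get_html, but saves each page to disk and reuses it on later runs."""
    CACHE_DIR.mkdir(exist_ok=True)
    path = CACHE_DIR / (hashlib.sha1(url.encode()).hexdigest() + ".html")
    if path.exists():
        return path.read_text(encoding="utf-8")
    html = get_html(session, url)
    if html:
        path.write_text(html, encoding="utf-8")
    return html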

For a deep dive into anti-block tactics, see our guide 5 Tools to Scrape Without Blocking and How it All Works, which covers JavaScript and TLS (JA3) fingerprinting and the role request headers play in blocking.

FAQ

Why do I see a DataDome page or get 403/429?

Your requests likely lack a real browser fingerprint or the IP isn’t in Brazil. Switch to a render proxy with BR geolocation (e.g., Scrapfly with country="br" and render_js=True).

Do I need JavaScript rendering for Imovelweb?

Often yes, especially for listings and dynamic widgets. JSON‑LD can still load server‑side, but JS rendering increases success rates and consistency.

Is HTML parsing enough if JSON‑LD is missing?

Yes. Target stable containers around price, area (m²), and room counts. Keep multiple selectors and review them periodically as the site evolves.

Summary

We put together a small, readable toolkit for Imovelweb: a listings scraper, a detail extractor that prefers JSON‑LD, simple pagination, and a few guardrails (headers, delays, retries). The code is intentionally plain so you can change selectors fast when the UI shifts, and you can drop each function into your own pipeline without dragging along extra structure.

When you need reliability at scale, fetch pages through Scrapfly with Brazilian geolocation and, when necessary, JavaScript rendering. Add modest rate limiting, retries with backoff, and a cache for debugging. From here, wire results into your storage (CSV/DB), schedule runs, and keep an eye on error rates so you can adjust selectors and pacing before anything breaks.
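
As a starting point for the storage step, here's a minimal CSV export of the listing summaries (the field names match the dicts built by extract_listings):

import csv

def save_listings_csv(listings: List[Dict[str, Any]], path: str = "listings.csv") -> None:
    """Write listing summaries to a CSV file."""
    fields = ["title", "price", "area", "bedrooms", "bathrooms", "url", "thumbnail"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(listings)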
