
Imovelweb is one of Brazil’s biggest real estate marketplaces. If you’re comparing prices, tracking supply, or building a lead pipeline, scraping it can save days of manual work. The catch: Imovelweb uses modern protections (including DataDome and regional controls), so a naïve scraper will often get blocked or see limited content from non‑Brazil IPs.
In this guide, we’ll build a clean, function‑based scraper in Python to extract:
- listings from search pages (title, price, area, bedrooms, link, thumbnail)
- complete details from property pages (address, price, size, amenities, images)
- structured data from JSON‑LD when available
- pagination flow so you can iterate across multiple pages
We’ll start with requests + BeautifulSoup to keep things simple. Then we’ll show a production approach using Scrapfly with Brazilian geolocation and JavaScript rendering, which sidesteps most blocking headaches. For reference, you can also apply the same ideas you’ve seen in our other how‑to guides like Algolia scraping and Allegro scraping.
Legal Disclaimer and Precautions
This tutorial covers popular web scraping techniques for education. Interacting with public servers requires diligence and respect; here’s a good summary of what not to do:
- Do not scrape at rates that could damage the website.
- Do not scrape data that's not available publicly.
- Do not store PII of EU citizens who are protected by GDPR.
- Do not repurpose entire public datasets, which can be illegal in some countries.
What we’ll build
- A listings scraper that extracts summary data from search pages and follows pagination
- A property details scraper that prefers JSON‑LD but falls back to HTML selectors
- Anti‑block tactics (headers, retry, delays) and a practical geolocation setup
- A reliable, production path with Scrapfly (BR IP, JS rendering, session/cookie handling)
If you want the full code as a single file, check content/posts/how-to-scrape-imovelweb/code.py in this article’s folder.
Prerequisites
Install the basics:
pip install requests beautifulsoup4 lxml
We’ll use requests for HTTP and BeautifulSoup for parsing. The lxml parser makes parsing faster and more tolerant. If you plan to use Scrapfly for production scraping:
pip install scrapfly-sdk
If you only need an HTTP API, you can also call Scrapfly’s API endpoint with requests. We’ll show both.
Anatomy of an Imovelweb page
Imovelweb pages often ship structured data via JSON‑LD (<script type="application/ld+json">). When present, this is the cleanest way to pull price, address, number of rooms, and images. If JSON‑LD is missing or partial, we’ll extract from the HTML.
Common elements to look for:
- title: h1 or a header container near the top
- price: a price container in BRL (R$); sometimes in JSON‑LD offers.price
- area: square meters (m²) and land size
- bedrooms/bathrooms/parking: small icon + label sets near the price
- address: neighborhood, city, state, sometimes full address
- gallery: img tags or a slider; JSON‑LD often includes image URLs
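Before writing selectors, it helps to check what JSON‑LD a given page actually exposes. Here’s a minimal inspection sketch; it assumes you have already saved a page locally as listing.html (the file name is just an example):
import json
from bs4 import BeautifulSoup

# Assumes a page was saved beforehand as "listing.html" (hypothetical file name)
with open("listing.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "lxml")

for script in soup.find_all("script", attrs={"type": "application/ld+json"}):
    try:
        data = json.loads(script.string or "{}")
    except json.JSONDecodeError:
        continue
    # JSON-LD may be a single object or a list of objects
    items = data if isinstance(data, list) else [data]
    for item in items:
        if isinstance(item, dict):
            print(item.get("@type"), sorted(item.keys()))
If the output shows keys like offers, address, and image, JSON‑LD is worth preferring on that page.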
A simple, requests-based scraper
We’ll keep code modular so you can reuse parts. First we import dependencies, then create a session with realistic headers, and add a tiny fetch helper with retries.
import time
import json
import random
import re
from typing import Any, Dict, List, Optional
import requests
from bs4 import BeautifulSoup
These imports are all you need for a plain HTML workflow with requests and BeautifulSoup.
def create_session() -> requests.Session:
"""Create a requests session with realistic headers and BR Portuguese preferences."""
session = requests.Session()
user_agents = [
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0",
]
session.headers.update({
"User-Agent": random.choice(user_agents),
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
"Accept-Language": "pt-BR,pt;q=0.9,en;q=0.8",
"Accept-Encoding": "gzip, deflate, br",
"Connection": "keep-alive",
"Upgrade-Insecure-Requests": "1",
"DNT": "1",
})
return session
This session sets a realistic User-Agent, localizes language to pt-BR, and mirrors a normal browser’s headers to avoid sticking out.
def get_html(session: requests.Session, url: str, max_retries: int = 3, delay_range: tuple = (1.0, 2.5)) -> Optional[str]:
"""Fetch URL with basic retry and small random delay. Returns HTML text or None."""
for attempt in range(1, max_retries + 1):
try:
time.sleep(random.uniform(*delay_range))
resp = session.get(url, timeout=30)
if resp.status_code == 200 and "text/html" in resp.headers.get("Content-Type", ""):
return resp.text
if resp.status_code in (403, 429) or "datadome" in resp.text.lower():
return None
except requests.RequestException:
pass
if attempt < max_retries:
time.sleep(random.uniform(2.0, 4.0))
return None
This helper adds a tiny random pause, retries on transient errors, and bails out on 403/429 or DataDome pages.
Extracting listing cards
Selectors change over time, but here we’ll stick to the exact, current hooks (no fallbacks). Cards use div.postingsList-module__card-container > div.postingCardLayout-module__posting-card-layout with:
- price: div.postingPrices-module__price[data-qa="POSTING_CARD_PRICE"]
- features: h3[data-qa="POSTING_CARD_FEATURES"] span (e.g., 1250 m² tot., 4 quartos, 4 ban., 5 vagas)
- title/description link: h3[data-qa="POSTING_CARD_DESCRIPTION"] a
- URL: also present in data-to-posting on .postingCardLayout-module__posting-card-layout
- thumbnail: the first img inside .postingGallery-module__gallery-container
def extract_listings(html: str) -> List[Dict[str, Any]]:
"""Parse a listings page and return a list of listing summaries."""
soup = BeautifulSoup(html, "lxml")
listings: List[Dict[str, Any]] = []
# Use the current card container selector only
card_candidates = soup.select(".postingCardLayout-module__posting-card-layout")
for card in card_candidates:
try:
# Title/description (anchor inside description block)
title_el = card.select_one('[data-qa="POSTING_CARD_DESCRIPTION"] a')
# Price
price_block = card.select_one('[data-qa="POSTING_CARD_PRICE"]')
price_el = price_block.get_text(strip=True) if price_block else None
# Main features: area (m²), bedrooms (quartos), bathrooms (ban./banheiros)
feature_spans = card.select('[data-qa="POSTING_CARD_FEATURES"] span')
area_el = None
beds_el = None
baths_el = None
for sp in feature_spans:
txt = sp.get_text(strip=True)
if not area_el and "m²" in txt:
area_el = txt
elif not beds_el and re.search(r"\bquartos?\b", txt, flags=re.I):
beds_el = txt
elif not baths_el and re.search(r"\bban(\.|heiros?)\b", txt, flags=re.I):
baths_el = txt
# URL from data-to-posting attribute
link = card.get("data-to-posting")
if link and link.startswith("/"):
link = f"https://www.imovelweb.com.br{link}"
# Thumbnail from gallery (first image src)
thumb_el = card.select_one(".postingGallery-module__gallery-container img")
thumb = thumb_el.get("src") if thumb_el else None
listings.append({
"title": (title_el.get_text(strip=True) if title_el else None),
"price": (price_el.strip() if isinstance(price_el, str) else price_el),
"area": (area_el.strip() if isinstance(area_el, str) else area_el),
"bedrooms": (beds_el.strip() if isinstance(beds_el, str) else beds_el),
"bathrooms": (baths_el.strip() if isinstance(baths_el, str) else baths_el),
"url": link,
"thumbnail": thumb,
})
except Exception:
continue
return [l for l in listings if l.get("url")]
This extracts the basics you see on a card: title, price, area, room counts, the link, and a thumbnail. We now use only data-qa hooks (POSTING_CARD_PRICE, POSTING_CARD_FEATURES, POSTING_CARD_DESCRIPTION) and data-to-posting for the URL, with no fallbacks.
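The card fields come back as display strings (e.g., "R$ 600.000", "90 m²", "1 quartos"). If you need numbers downstream, a small normalization pass can run over each dict. A sketch, where the helper name and the derived field names are my own:
def normalize_listing(item: Dict[str, Any]) -> Dict[str, Any]:
    """Convert display strings like 'R$ 600.000' and '90 m²' into numeric fields."""
    out = dict(item)
    # Price: keep digits only ("R$ 600.000" -> 600000)
    digits = re.sub(r"\D", "", item.get("price") or "")
    out["price_brl"] = int(digits) if digits else None
    # Area: take the number before "m²" and handle BR thousand separators / decimal commas
    m = re.search(r"(\d+(?:[\.,]\d+)?)\s*m²", item.get("area") or "")
    out["area_m2"] = float(m.group(1).replace(".", "").replace(",", ".")) if m else None
    # Bedrooms/bathrooms: first integer in the label ("1 quartos" -> 1)
    for key in ("bedrooms", "bathrooms"):
        m = re.search(r"\d+", item.get(key) or "")
        out[f"{key}_count"] = int(m.group()) if m else None
    return out
For example: rows = [normalize_listing(x) for x in extract_listings(html)].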
Extracting property details (JSON‑LD first)
On detail pages, JSON‑LD (if present) is the cleanest source. We’ll parse JSON blocks and look for objects that smell like real estate listings (they often include offers, address, and image).
def parse_first_jsonld(soup: BeautifulSoup) -> Optional[Dict[str, Any]]:
for script in soup.find_all("script", attrs={"type": "application/ld+json"}):
try:
data = json.loads(script.string or "{}")
except json.JSONDecodeError:
continue
# Sometimes JSON-LD is a list
if isinstance(data, list):
for item in data:
if isinstance(item, dict) and (item.get("@type") or item.get("offers")):
return item
elif isinstance(data, dict) and (data.get("@type") or data.get("offers")):
return data
return None
This scans <script type="application/ld+json"> blocks and returns the first relevant object that looks like a listing.
def extract_property_details(html: str) -> Dict[str, Any]:
"""Extract rich property data from a detail page using JSON‑LD with HTML fallbacks."""
soup = BeautifulSoup(html, "lxml")
out: Dict[str, Any] = {}
jsonld = parse_first_jsonld(soup)
if jsonld:
out["jsonld"] = jsonld
# Common fields
out["title"] = jsonld.get("name") or jsonld.get("headline")
if isinstance(jsonld.get("offers"), dict):
price = jsonld["offers"].get("price")
currency = jsonld["offers"].get("priceCurrency", "BRL")
out["price"] = f"{price} {currency}" if price else None
addr = jsonld.get("address")
if isinstance(addr, dict):
out["address"] = ", ".join(filter(None, [
addr.get("streetAddress"),
addr.get("addressLocality"),
addr.get("addressRegion"),
addr.get("postalCode"),
])) or None
images = jsonld.get("image")
if isinstance(images, list):
out["images"] = images
elif isinstance(images, str):
out["images"] = [images]
# HTML fallbacks
title_el = soup.select_one("h1")
if title_el and not out.get("title"):
out["title"] = title_el.get_text(strip=True)
price_text = soup.find(string=re.compile(r"R\$\s?\d"))
if price_text and not out.get("price"):
out["price"] = price_text.strip()
# Room counts (Portuguese labels vary: quartos, suítes, banheiros, vagas)
def find_label(pattern: str) -> Optional[str]:
el = soup.find(string=re.compile(pattern, re.I))
return el.strip() if isinstance(el, str) else None
out.setdefault("bedrooms", find_label(r"\bquartos?\b|\bdormitórios?\b"))
out.setdefault("bathrooms", find_label(r"\bbanheiros?\b"))
out.setdefault("parking", find_label(r"\bvagas?\b"))
out.setdefault("area", find_label(r"\d+[\.,]?\d*\s*m²"))
# Description
desc_el = soup.select_one("[data-testid='description'], .description, #description")
if desc_el:
out["description"] = desc_el.get_text(" ", strip=True)
return out
This prefers JSON‑LD for clean fields (title, price, address, images) and fills any gaps with simple HTML lookups.
Pagination helper
We’ll look for a “next” link and return an absolute URL.
from urllib.parse import urljoin
def find_next_page(html: str, current_url: str) -> Optional[str]:
soup = BeautifulSoup(html, "lxml")
next_link = soup.find("a", attrs={"rel": "next"}) or soup.find("a", string=re.compile(r"Próxima|Seguinte|Próximo", re.I))
if next_link and next_link.get("href"):
return urljoin(current_url, next_link["href"])
return None
This looks for a rel="next" link or a localized “Próximo” label and returns an absolute URL.
Putting it together: scrape N listing pages
The function below walks listing pages, collects summary data, and yields property URLs to process downstream.
def scrape_list_pages(start_url: str, max_pages: int = 3) -> List[Dict[str, Any]]:
session = create_session()
url = start_url
all_listings: List[Dict[str, Any]] = []
for _ in range(max_pages):
html = get_html(session, url)
if not html:
break
page_listings = extract_listings(html)
all_listings.extend(page_listings)
nxt = find_next_page(html, url)
if not nxt:
break
url = nxt
# Deduplicate by URL
seen: set = set()
unique: List[Dict[str, Any]] = []
for item in all_listings:
u = item.get("url")
if u and u not in seen:
seen.add(u)
unique.append(item)
return unique
Example usage:
if __name__ == "__main__":
# Example: São Paulo houses for sale (adjust filters on site and paste the URL)
start = "https://www.imovelweb.com.br/casas-venda-sao-paulo-sp.html"
listings = scrape_list_pages(start, max_pages=2)
print(f"Collected {len(listings)} listings")
for it in listings[:5]:
print(it["title"], it["price"], it["url"]) # sample
Here’s how the output might look for a single listing:
{
"title": "Rua Luigi Alamanni",
"price": "R$ 600.000",
"area": "90 m²",
"bedrooms": "1 quartos",
"bathrooms": "1 banheiros",
"url": "https://www.imovelweb.com.br/propriedades/casa-a-venda-sacoma-1-quarto-90-m-sao-paulo-3000738541.html",
"thumbnail": "https://imgbr.imovelwebcdn.com/avisos/2/30/00/73/85/41/360x266/4556613970.jpg?isFirstImage=true"
}
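To move these summaries into a pipeline, dumping them to CSV is usually enough for a first pass. A minimal sketch using the standard library (save_listings_csv is a name I’m introducing; the columns mirror the dict keys above):
import csv

def save_listings_csv(listings: List[Dict[str, Any]], path: str = "listings.csv") -> None:
    """Write listing summaries to a CSV file, one row per listing."""
    fields = ["title", "price", "area", "bedrooms", "bathrooms", "url", "thumbnail"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(listings)
For example, call save_listings_csv(listings) right after scrape_list_pages() returns.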
Scraping property details
Given a list of property URLs from the step above, fetch each page and extract details. Prefer JSON‑LD, then fall back to HTML.
def scrape_property(url: str) -> Optional[Dict[str, Any]]:
session = create_session()
html = get_html(session, url)
if not html:
return None
return extract_property_details(html)
This fetches a single property page with a fresh session and returns a dict built by extract_property_details.
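To process many properties, reuse one session, pace the requests, and collect the results. A sketch under those assumptions (scrape_properties is a helper name of my own):
def scrape_properties(urls: List[str], pause: tuple = (2.0, 4.0)) -> List[Dict[str, Any]]:
    """Fetch each property URL with a shared session and extract its details."""
    session = create_session()
    results: List[Dict[str, Any]] = []
    for url in urls:
        html = get_html(session, url)
        if html:
            details = extract_property_details(html)
            details["url"] = url
            results.append(details)
        # Extra pause between detail pages, on top of get_html's own delay
        time.sleep(random.uniform(*pause))
    return results
For example: details = scrape_properties([item["url"] for item in listings[:10]]).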
Handling anti‑bot, DataDome, and geolocation (BR IPs)
Imovelweb inspects browser fingerprints and often enforces Brazilian geolocation. If you hit 403/429 responses or a DataDome page, switch to a rendering proxy with country targeting. The most straightforward path is Scrapfly.
Option A: Scrapfly Python SDK
When you need reliable BR IPs and JavaScript rendering, use the Scrapfly SDK to fetch pages consistently.
# pip install scrapfly-sdk
from typing import Optional
try:
from scrapfly import ScrapflyClient, ScrapeConfig
except Exception:
ScrapflyClient = None # type: ignore
ScrapeConfig = None # type: ignore
def scrapfly_fetch_html(api_key: str, url: str, render_js: bool = True, country: str = "br") -> Optional[str]:
"""Use Scrapfly to fetch HTML with BR geolocation and (optional) JS rendering."""
if ScrapflyClient is None or ScrapeConfig is None:
return None
client = ScrapflyClient(key=api_key)
cfg = ScrapeConfig(url=url, render_js=render_js, country=country)
result = client.scrape(cfg)
return result.content
With Scrapfly you get session handling, high‑quality Brazilian IPs, and automatic mitigation for common bot challenges. You can then pass the returned HTML into the same extract_listings / extract_property_details functions you already wrote.
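Option B: Scrapfly HTTP API with requests
If you’d rather not add the SDK, you can call Scrapfly’s scrape endpoint directly with requests. Treat the endpoint, parameter names, and response shape below as assumptions to verify against the current Scrapfly API documentation:
import requests

def scrapfly_fetch_html_http(api_key: str, url: str, render_js: bool = True, country: str = "br") -> Optional[str]:
    """Fetch HTML via Scrapfly's HTTP API (endpoint and params are assumptions; check the docs)."""
    resp = requests.get(
        "https://api.scrapfly.io/scrape",
        params={
            "key": api_key,
            "url": url,
            "render_js": str(render_js).lower(),
            "country": country,
        },
        timeout=90,
    )
    if resp.status_code != 200:
        return None
    data = resp.json()
    # The rendered HTML is expected under result.content in the response JSON
    return data.get("result", {}).get("content")
As with the SDK, the returned HTML feeds straight into extract_listings and extract_property_details.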
ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale.
- Anti-bot protection bypass - extract web pages without blocking!
- Rotating residential proxies - prevent IP address and geographic blocks.
- LLM prompts - extract data or ask questions using LLMs
- Extraction models - automatically find objects like products, articles, jobs, and more.
- Extraction templates - extract data using your own specification.
- Python and Typescript SDKs, as well as Scrapy and no-code tool integrations.
Tips to avoid blocks
- Keep headers realistic and localized (Accept-Language: pt-BR).
- Add short, random delays between requests; back off on error spikes.
- Use retries for network hiccups; don’t retry instantly on 403/429.
- Prefer JSON‑LD over brittle CSS selectors.
- For multi‑page crawls, use a render proxy with BR geolocation.
- Cache pages during development so you can debug parser logic offline (see the sketch after this list).
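The last tip deserves a concrete sketch: a thin wrapper around get_html that writes each page to disk, so repeated runs during development parse offline instead of re-fetching. The function name and cache directory are my own choices:
import hashlib
from pathlib import Path

def cached_get_html(session: requests.Session, url: str, cache_dir: str = ".cache") -> Optional[str]:
    """get_html wrapper that stores each fetched page on disk, keyed by a hash of the URL."""
    Path(cache_dir).mkdir(exist_ok=True)
    path = Path(cache_dir) / (hashlib.sha256(url.encode("utf-8")).hexdigest()[:16] + ".html")
    if path.exists():
        return path.read_text(encoding="utf-8")
    html = get_html(session, url)
    if html:
        path.write_text(html, encoding="utf-8")
    return html
Swap it in for get_html while iterating on selectors, then drop back to live fetching for real runs.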
For a deep dive into anti‑block tactics (TLS fingerprints, headless browsers, etc.), see 5 Tools to Scrape Without Blocking and How it All Works, a tutorial on how to avoid web scraper blocking that covers JavaScript and TLS (JA3) fingerprinting and the role request headers play in blocking.
FAQ
Why do I see a DataDome page or get 403/429?
Your requests likely lack a real browser fingerprint or the IP isn’t in Brazil. Switch to a render proxy with BR geolocation (e.g., Scrapfly with country="br" and render_js=True).
Do I need JavaScript rendering for Imovelweb?
Often yes, especially for listings and dynamic widgets. JSON‑LD can still load server‑side, but JS rendering increases success rates and consistency.
Is HTML parsing enough if JSON‑LD is missing?
Yes. Target stable containers around price, area (m²), and room counts. Keep multiple selectors and review them periodically as the site evolves.
Summary
We put together a small, readable toolkit for Imovelweb: a listings scraper, a detail extractor that prefers JSON‑LD, simple pagination, and a few guardrails (headers, delays, retries). The code is intentionally plain so you can change selectors fast when the UI shifts, and you can drop each function into your own pipeline without dragging along extra structure.
When you need reliability at scale, fetch pages through Scrapfly with Brazilian geolocation and, when necessary, JavaScript rendering. Add modest rate limiting, retries with backoff, and a cache for debugging. From here, wire results into your storage (CSV/DB), schedule runs, and keep an eye on error rates so you can adjust selectors and pacing before anything breaks.