How to Scrape Realtor.com - Real Estate Property Data


In this web scraping tutorial, we'll be taking a look at how to scrape Realtor.com - the biggest real estate marketplace in the United States.

In this guide, we'll be scraping real estate data such as pricing info, addresses, photos and phone numbers displayed on Realtor.com property pages. Realtor.com is easy to scrape, and we'll be taking advantage of its hidden web data systems to quickly scrape entire property datasets. We'll also take a look at how to use the search system to find all property listings in specific areas.

Finally, we'll also cover tracking listing changes so we can scrape newly listed or sold properties, or properties whose prices have changed - giving us an upper hand in real estate bidding!

We'll be using Python with a few community packages that'll make this web scraper a breeze. Let's dive in!


Why Scrape Realtor.com?

Realtor.com is one of the biggest real estate websites in the United States, making it one of the biggest public real estate datasets out there. It contains fields like real estate prices, listing locations, sale dates and general property information.

This is valuable information for market analytics, the study of the housing industry, and a general competitor overview.

For more on scraping use cases see our extensive write-up Scraping Use Cases

Project Setup

In this tutorial, we'll be using Python with two community packages:

  • httpx - HTTP client library which will let us communicate with Realtor.com's servers
  • parsel - HTML parsing library which will help us to parse our web scraped HTML files.

These packages can be easily installed via the pip install command:

$ pip install httpx parsel

Alternatively, feel free to swap httpx out for any other HTTP client package such as requests, as we'll only need basic HTTP functions which are almost interchangeable in every library. As for parsel, another great alternative is the beautifulsoup package.

Hands on Python Web Scraping Tutorial and Example Project

If you're new to web scraping with Python we recommend checking out our full introduction tutorial to web scraping with Python and common best practices.


Scraping Property Data

Let's dive right in and see how we can scrape the data of a single property listed on Realtor.com. Then, we'll also take a look at how to find these properties and scale up our scraper.

For example, let's start by taking a look at a listing page and where all of the information is stored on it. Any property listing page will do.

We can see that this page contains a lot of data and parsing everything using CSS selectors or XPath selectors would be a lot of work. Instead, let's take a look at the page source of this page.

If we look up some unique identifier (like the realtor's phone number or address) in the page source we can see that this page contains hidden web data which holds the whole property dataset:

We can see the entire property dataset hidden in a script element

Let's see how we can scrape it using Python. We'll retrieve the property HTML page, find the <script> containing the hidden web data and parse it as a JSON document:

import asyncio
import json
from typing import List

import httpx
from parsel import Selector
from typing_extensions import TypedDict

# First, we need to establish a persistent HTTPX session
# with browser-like headers to avoid instant blocking
BASE_HEADERS = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US,en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
}
session = httpx.AsyncClient(headers=BASE_HEADERS)

# type hints for expected results - property listing has a lot of data!
class PropertyResult(TypedDict):
    property_id: str
    listing_id: str
    href: str
    status: str
    list_price: int
    list_date: str
    ...  # and much more!

def parse_property(response: httpx.Response) -> PropertyResult:
    """parse property page"""
    # load response's HTML tree for parsing:
    selector = Selector(text=response.text)
    # find <script id="__NEXT_DATA__"> node and select its text:
    data = selector.css("script#__NEXT_DATA__::text").get()
    if not data:
        # no hidden data script means this is not a property listing page
        raise ValueError(f"page {response.url} is not a property listing page")
    # load JSON as python dictionary and select property value:
    data = json.loads(data)

    return data["props"]["pageProps"]["initialProps"]["property"]

async def scrape_properties(urls: List[str]) -> List[PropertyResult]:
    """Scrape properties"""
    properties = []
    to_scrape = [session.get(url) for url in urls]
    # tip: asyncio.as_completed allows concurrent scraping - super fast!
    for response in asyncio.as_completed(to_scrape):
        response = await response
        if response.status_code != 200:
            print(f"can't scrape property: {response.url}")
            continue  # skip failed requests
        try:
            properties.append(parse_property(response))
        except ValueError as e:
            print(e)  # skip pages that carry no property data
    return properties
Run Code & Example Output
async def run():
    # some property urls
    urls = []  # add property page URLs here
    results = await scrape_properties(urls)
    print(json.dumps(results, indent=2))

if __name__ == "__main__":
    asyncio.run(run())

The resulting dataset is too big to embed into this article.

We can see that our simple scraper received an extensive property dataset containing everything we see on the page like property price, address, photos and realtor's phone number as well as meta information fields that are not visible on the page.
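
Since the full dataset is large, it can be handy to reduce it to the few fields we care about. Below is a minimal sketch, assuming the top-level keys match the PropertyResult type hints above; the summarize_property helper and the sample values are hypothetical and only for illustration.

```python
from typing import Any, Dict, Iterable

# top-level fields named in the PropertyResult type hints
SUMMARY_FIELDS = ("property_id", "listing_id", "href", "status", "list_price", "list_date")


def summarize_property(data: Dict[str, Any], fields: Iterable[str] = SUMMARY_FIELDS) -> Dict[str, Any]:
    """Keep only the selected top-level fields; missing fields become None."""
    return {field: data.get(field) for field in fields}


# hypothetical sample record (made-up values, not real scraped data)
sample = {
    "property_id": "123",
    "listing_id": "456",
    "status": "for_sale",
    "list_price": 500000,
    "photos": [{"href": ""}],
}
summary = summarize_property(sample)
print(summary)
```

Using dict.get here means a field that is absent from a particular listing does not raise a KeyError, which is useful when listings differ in completeness.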

How to Scrape Hidden Web Data

For more on hidden web data scraping see our full introduction article which covers many different methods of scraping this data.


Now that we know how to scrape a single property, let's take a look at how we can find properties to scrape next!

Finding Properties

There are several ways to find properties on Realtor.com; however, the easiest and most reliable way is to use their search system. Let's take a look at how Realtor.com's search works and how we can scrape it.

If we type a location into Realtor.com's search bar we can see a few important bits of information:

Search page contains metadata such as total pages and total results

The search automatically redirects us to the results URL which contains property listings and pagination metadata (how many listings are available in this area). If we click through to the second page, we can see a clear URL pattern that we can use in our scraper: <CITY>_<STATE>/pg-<PAGE>
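
As a rough sketch of this pattern, the helper below builds the path portion for any result page. The search_path name is hypothetical, and the site prefix of the URL is deliberately left out; the sketch also assumes that multi-word cities are dash-separated (as in San-Francisco) and that states are upper-case codes.

```python
def search_path(city: str, state: str, page: int = 1) -> str:
    """Build the <CITY>_<STATE>/pg-<PAGE> part of a search URL."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    # assumption: multi-word cities use dashes and states are upper-case codes
    city = city.strip().replace(" ", "-")
    return f"{city}_{state.strip().upper()}/pg-{page}"


print(search_path("San Francisco", "ca"))
print(search_path("San-Francisco", "CA", page=2))
```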

Knowing this, we can write a scraper that collects all property listings for a given location, defined by two variables, city and state:

import asyncio
import json
import math
from typing import List, Optional

import httpx
from parsel import Selector
from typing_extensions import TypedDict

# 1. Establish persistent HTTPX session with browser-like headers to avoid blocking
BASE_HEADERS = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US,en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
}
session = httpx.AsyncClient(headers=BASE_HEADERS, follow_redirects=True)


# Type hints for search results
# note: the property preview contains a lot of data though not the whole dataset
class PropertyPreviewResult(TypedDict):
    property_id: str
    listing_id: str
    permalink: str
    list_price: int
    price_reduced_amount: Optional[int]
    description: dict
    location: dict
    photos: List[dict]
    list_date: str
    last_update_date: str
    tags: List[str]
    ...  # and more

# Type hint for search results of a single page
class SearchResults(TypedDict):
    count: int  # results on this page
    total: int  # total results for all pages
    results: List[PropertyPreviewResult]

def parse_search(response: httpx.Response) -> SearchResults:
    """Parse search for hidden search result data"""
    selector = Selector(text=response.text)
    data = selector.css("script#__NEXT_DATA__::text").get()
    if not data:
        # no hidden data script means this is not a search results page
        raise ValueError(f"page {response.url} is not a search results page")
    data = json.loads(data)
    return data["props"]["pageProps"]["searchResults"]["home_search"]

async def find_properties(state: str, city: str):
    """Scrape search for property preview data"""
    print(f"scraping first result page for {city}, {state}")
    first_page = f"{city}_{state.upper()}/pg-1"
    first_result = await session.get(first_page)
    first_data = parse_search(first_result)
    results = first_data["results"]

    total_pages = math.ceil(first_data["total"] / first_data["count"])
    print(f"found {total_pages} total pages ({first_data['total']} total properties)")
    to_scrape = []
    # the first page is already scraped, so continue from page 2
    assert "pg-1" in str(first_result.url)  # make sure we don't accidentally scrape duplicate pages
    for page in range(2, total_pages + 1):
        page_url = str(first_result.url).replace("pg-1", f"pg-{page}")
        to_scrape.append(session.get(page_url))
    for response in asyncio.as_completed(to_scrape):
        parsed = parse_search(await response)
        results.extend(parsed["results"])
    print(f"scraped search of {len(results)} results for {city}, {state}")
    return results
Run Code & Example Output
async def run():
    results_search = await find_properties("CA", "San-Francisco")
    print(json.dumps(results_search, indent=2))

if __name__ == "__main__":
    asyncio.run(run())

This will result in a list of property preview items:

[
    {
        "property_id": "1601714990",
        "list_price": 4780000,
        "primary": true,
        "primary_photo": { "href": "" },
        "source": { "id": "SFCA", "agents": [ { "office_name": null } ], "type": "mls", "spec_id": null, "plan_id": null },
        "community": null,
        "products": { "brand_name": "basic_opt_in", "products": [ "co_broke" ] },
        "listing_id": "2949512103",
        "matterport": false,
        "virtual_tours": null,
        "status": "for_sale",
        "permalink": "149-3rd-Ave_San-Francisco_CA_94118_M16017-14990",
        "price_reduced_amount": null,
        "other_listings": { "rdc": [ { "listing_id": "2949512103", "status": "for_sale", "listing_key": null, "primary": true } ] },
        "description": {
            "beds": 4,
            "baths": 5,
            "baths_full": 4,
            "baths_half": 1,
            "baths_1qtr": null,
            "baths_3qtr": null,
            "garage": null,
            "stories": null,
            "type": "single_family",
            "sub_type": null,
            "lot_sqft": 3000,
            "sqft": 2748,
            "year_built": 1902,
            "sold_price": 1825000,
            "sold_date": "2021-03-29",
            "name": null
        },
        "location": {
            "street_view_url": "",
            "address": {
                "line": "149 3rd Ave",
                "postal_code": "94118",
                "state": "California",
                "state_code": "CA",
                "city": "San Francisco",
                "coordinate": {
                    "lat": 37.785838,
                    "lon": -122.461732
                }
            },
            "county": { "name": "San Francisco", "fips_code": "06075" }
        },
        "tax_record": { "public_record_id": "5121F94BD44030C02787624D7249A34D" },
        "lead_attributes": { "show_contact_an_agent": true, "opcity_lead_attributes": { "cashback_enabled": false, "flip_the_market_enabled": false }, "lead_type": "co_broke", "ready_connect_mortgage": { "show_contact_a_lender": true, "show_veterans_united": true }, "flip_the_market_enabled": false },
        "open_houses": null,
        "flags": {
            "is_coming_soon": null,
            "is_pending": null,
            "is_foreclosure": null,
            "is_contingent": null,
            "is_new_construction": null,
            "is_new_listing": true,
            "is_price_reduced": null,
            "is_plan": null,
            "is_subdivision": null
        },
        "list_date": "2022-11-03T07:10:44Z",
        "last_update_date": "2022-11-03T00:09:33Z",
        "coming_soon_date": null,
        "photos": [
            { "href": "" },
            { "href": "" }
        ],
        "tags": [ ... ],
        "branding": [ { "type": "Office", "photo": null, "name": "Marcus & Millichap" } ],
        "home_photos": {
            "collection": [
                { "href": "" },
                { "href": "" }
            ],
            "count": 2
        }
    },
    ...
]

Above, our scraper first scrapes the first page for its results and for the total number of pages in this query. Then, it scrapes the remaining pages concurrently, returning a list of property previews.
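
The page count comes from the pagination metadata: the total number of results divided by the results per page, rounded up. Here is a small sketch with made-up numbers; the page_numbers helper is hypothetical and not part of the scraper itself.

```python
import math
from typing import List


def page_numbers(total: int, per_page: int) -> List[int]:
    """Return the page numbers left to scrape after the first page."""
    if per_page <= 0:
        return []  # an empty first page means there is nothing to paginate
    total_pages = math.ceil(total / per_page)
    return list(range(2, total_pages + 1))


# made-up example: 100 results at 42 per page make 3 pages, so pages 2 and 3 remain
remaining = page_numbers(total=100, per_page=42)
print(remaining)
```

Guarding against a zero per-page count avoids a division error when a search returns no results at all.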

Following Listing Changes

Realtor.com offers several RSS feeds that announce listed property changes such as price changes, new listings and sales.

These are great resources if we want to track events in the real estate market. We can observe property price changes, new listings and sales in real-time!
Let's take a look at how we can write a tracker scraper for these feeds that will keep scraping them periodically.

Each of these feeds is split by US state, and each is a simple RSS XML file that contains announcements and dates. For example, let's take a look at the Price Change Feed for California:

<?xml version="1.0" encoding="utf-8"?>
<rss xmlns:atom="" version="2.0">
  <channel>
    <title>Price Changed</title>
    <atom:link href="" rel="hub"/>
    <atom:link href="" rel="self"/>
    <item>
      <link></link>
      <pubDate>Fri, 04 Nov 2022 08:54:48</pubDate>
    </item>
    <item>
      <link></link>
      <pubDate>Fri, 04 Nov 2022 08:55:03</pubDate>
    </item>
  </channel>
</rss>

We can see it contains links to properties and dates when the price has changed. So, to write our tracker scraper all we have to do is:

  1. Scrape the feed every X seconds
  2. Parse <link> elements for property URLs
  3. Use our property scraper to collect the datasets
  4. Save the data (to database or file) and repeat #1
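
Steps 2 and 4 can be tried offline before touching the network. The sketch below parses a small hand-written feed with the standard library's xml.etree.ElementTree (swapped in here for parsel so it runs without extra packages) and filters out entries that were already seen. The example.com links and the parse_feed and new_entries helpers are hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import datetime
from typing import Dict, Set

# hand-written sample feed with placeholder links (not real listings)
SAMPLE_FEED = """<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
  <channel>
    <title>Price Changed</title>
    <item>
      <link>https://example.com/property-1</link>
      <pubDate>Fri, 04 Nov 2022 08:54:48</pubDate>
    </item>
    <item>
      <link>https://example.com/property-2</link>
      <pubDate>Fri, 04 Nov 2022 08:55:03</pubDate>
    </item>
  </channel>
</rss>"""


def parse_feed(xml_text: str) -> Dict[str, datetime]:
    """Map each item's link to its publish date."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    entries = {}
    for item in root.iter("item"):
        link = item.findtext("link")
        pub_date = item.findtext("pubDate")
        if link and pub_date:
            entries[link.strip()] = datetime.strptime(pub_date.strip(), "%a, %d %b %Y %H:%M:%S")
    return entries


def new_entries(entries: Dict[str, datetime], seen: Set[str]) -> Dict[str, datetime]:
    """Drop entries whose "url:date" key was seen before and remember the rest."""
    fresh = {url: date for url, date in entries.items() if f"{url}:{date}" not in seen}
    seen.update(f"{url}:{date}" for url, date in fresh.items())
    return fresh


seen: Set[str] = set()
first = new_entries(parse_feed(SAMPLE_FEED), seen)   # everything is new
second = new_entries(parse_feed(SAMPLE_FEED), seen)  # nothing is new the second time
print(len(first), len(second))
```

Keying the filter on both the URL and the date means the same property is scraped again when a later announcement for it appears.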

Let's take a look at how this would look in Python. We'll be scraping this feed at a fixed interval (60 seconds by default) and appending results to a JSON Lines file (1 JSON object per line):

import asyncio
import json
import math
from datetime import datetime
from pathlib import Path
from typing import Dict, List, Optional
from typing_extensions import TypedDict

import httpx
from parsel import Selector

...  # NOTE: include code from property scraping section

async def scrape_feed(url) -> Dict[str, datetime]:
    """scrapes atom RSS feed and returns all entries in "url:publish date" format"""
    response = await session.get(url)
    body = response.content
    selector = Selector(text=body.decode(), type="xml")
    results = {}
    for item in selector.xpath("//item"):
        url = item.xpath("link/text()").get()
        pub_date = item.xpath("pubDate/text()").get()
        results[url] = datetime.strptime(pub_date, "%a, %d %b %Y %H:%M:%S")
    return results

async def track_feed(url: str, output: Path, interval: int = 60):
    """Track feed, scrape new listings and append them as JSON to the output file"""
    # to prevent duplicates let's keep a set of "url:date" keys we already scraped
    seen = set()
    output.touch(exist_ok=True)  # create file if it doesn't exist
    try:
        while True:
            # scrape feed for listings
            listings = await scrape_feed(url=url)
            # remove listings we scraped in previous loops
            listings = {k: v for k, v in listings.items() if f"{k}:{v}" not in seen}
            properties = []
            if listings:
                # scrape properties and save to file - 1 property as JSON per line
                properties = await scrape_properties(list(listings.keys()))
                with output.open("a") as f:
                    for property in properties:
                        f.write(json.dumps(property) + "\n")

                # add seen to deduplication filter
                for k, v in listings.items():
                    seen.add(f"{k}:{v}")
            print(f"scraped {len(properties)} properties; waiting {interval} seconds")
            await asyncio.sleep(interval)
    except KeyboardInterrupt:  # Note: CTRL+C will gracefully stop our scraper
        print("stopping price tracking")
Run Code & Example Output
async def run():
    # for example price feed for California
    feed_url = f""
    # or 
    await track_feed(feed_url, Path("track-pricing.jsonl"))

if __name__ == "__main__":
    asyncio.run(run())

In the example above, we wrote an RSS feed scraper that scrapes Realtor.com announcements. Then we wrote an endless loop scraper, which scrapes this feed and appends the full property datasets to a JSON Lines file.
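
Because the tracker appends one JSON object per line, the output file can be read back line by line. Below is a minimal sketch, assuming each record carries a property_id key as in the type hints earlier; the load_latest helper and the sample records are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict


def load_latest(path: Path) -> Dict[str, Dict[str, Any]]:
    """Read a JSON Lines file and keep the last record seen for each property_id."""
    latest: Dict[str, Dict[str, Any]] = {}
    with path.open() as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            latest[record["property_id"]] = record
    return latest


# write a tiny made-up file to demonstrate (values are not real data)
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "track-pricing.jsonl"
    rows = [
        {"property_id": "1", "list_price": 500000},
        {"property_id": "2", "list_price": 750000},
        {"property_id": "1", "list_price": 480000},
    ]
    demo.write_text("\n".join(json.dumps(row) for row in rows) + "\n")
    latest = load_latest(demo)

print(latest["1"]["list_price"])
```

Later lines overwrite earlier ones for the same property, so the result reflects the most recently recorded price.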

Bypass Blocking with ScrapFly

Scraping Realtor.com seems easy, though when scraping at scale we're very likely to be blocked or asked to solve captchas.

Realtor.com can block web scrapers with a 'Pardon Our Interruption' message

To get around this, let's take advantage of the ScrapFly API which can avoid all of these blocks for us!
ScrapFly offers several powerful features that'll help us get around Realtor.com's web scraper blocking.

For this, we'll be using the scrapfly-sdk Python package and the Anti Scraping Protection Bypass feature. First, let's install scrapfly-sdk using pip:

$ pip install scrapfly-sdk

To take advantage of ScrapFly's API in our web scraper all we need to do is replace our httpx session code with scrapfly-sdk client requests:

import httpx

response = httpx.get("some url")
# in ScrapFly SDK becomes
from scrapfly import ScrapflyClient, ScrapeConfig
client = ScrapflyClient("YOUR SCRAPFLY KEY")
result = client.scrape(ScrapeConfig(
    "some Realtor.com url",
    # we can select specific proxy country
    country="US",
    # and enable anti scraping protection bypass:
    asp=True,
))

For more on how to scrape using ScrapFly, see the Full Scraper Code section.


FAQ

To wrap this guide up, let's take a look at some frequently asked questions about web scraping Realtor.com data:

Is it legal to scrape Realtor.com?

Yes. Realtor.com's data is publicly available; we're not extracting anything personal or private. Scraping at slow, respectful rates would fall under the ethical scraping definition.
That being said, attention should be paid to GDPR compliance in the EU when scraping personal data of non-agent listings (seller's name, phone number etc). For more, see our Is Web Scraping Legal? article.
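
One way to keep request rates respectful is to cap concurrency and pause between requests. The sketch below uses asyncio.Semaphore with a stand-in fake_fetch coroutine instead of real HTTP calls; the limit and delay values are arbitrary assumptions, and the example.com URLs are placeholders.

```python
import asyncio
from typing import List

MAX_CONCURRENCY = 2   # assumption: at most 2 requests in flight
DELAY_SECONDS = 0.01  # assumption: short pause after each request

active = 0  # requests currently in flight
peak = 0    # highest number of overlapping requests observed


async def fake_fetch(url: str) -> str:
    """Stand-in for session.get(url); tracks how many calls overlap."""
    global active, peak
    active += 1
    peak = max(peak, active)
    await asyncio.sleep(0.01)
    active -= 1
    return f"response for {url}"


async def polite_fetch(url: str, limiter: asyncio.Semaphore) -> str:
    """Fetch one URL while holding a slot in the limiter."""
    async with limiter:
        result = await fake_fetch(url)
        await asyncio.sleep(DELAY_SECONDS)  # pause before releasing the slot
        return result


async def scrape_politely(urls: List[str]) -> List[str]:
    limiter = asyncio.Semaphore(MAX_CONCURRENCY)
    return await asyncio.gather(*(polite_fetch(url, limiter) for url in urls))


results = asyncio.run(scrape_politely([f"https://example.com/{i}" for i in range(6)]))
print(len(results), peak)
```

In a real scraper, fake_fetch would be replaced by the HTTP client call, and the limit and delay would be tuned to stay well within what the site tolerates.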

Does Realtor.com have an API?

No, Realtor.com doesn't offer a public API for property data. However, as seen in this guide, it's really easy to scrape property data and track property changes using Realtor.com's official RSS feeds.

How to crawl Realtor.com?

Like scraping, we can also crawl Realtor.com by following related rental pages listed on every property page. For that, see the relatedRentals field in datasets scraped in the Scraping Property Data section.
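
A crawl loop of this kind boils down to a queue plus a set of visited pages. The sketch below runs over an in-memory stand-in for property pages; the RELATED mapping, the crawl helper, and the page names are hypothetical, and the real relatedRentals structure may differ.

```python
from collections import deque
from typing import Dict, List

# hypothetical stand-in: page -> related pages it links to
RELATED: Dict[str, List[str]] = {
    "property-a": ["property-b", "property-c"],
    "property-b": ["property-a", "property-d"],
    "property-c": [],
    "property-d": ["property-c"],
}


def crawl(start: str, max_pages: int = 100) -> List[str]:
    """Breadth-first crawl that visits every page at most once."""
    seen = {start}
    queue = deque([start])
    visited = []
    while queue and len(visited) < max_pages:
        page = queue.popleft()
        visited.append(page)
        # in a real crawler this lookup would be a scrape of the page
        for related in RELATED.get(page, []):
            if related not in seen:
                seen.add(related)
                queue.append(related)
    return visited


order = crawl("property-a")
print(order)
```

The max_pages cap keeps the crawl bounded, since related-page links can otherwise lead through a very large part of the site.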

Realtor.com Scraping Summary

In this tutorial, we built a Realtor.com scraper in Python. We started by taking a look at how to scrape a single property page by extracting hidden web data.

Then, we took a look at how to find properties using Realtor.com's search system. We built a search URL from the given parameters and scraped all of the listings in the query pagination.

Finally, we took a look at how to track changes on Realtor.com, like price changes, sales and new listing announcements, by scraping the official RSS feed.

For all of this, we used Python with httpx and parsel packages and to avoid being blocked we used ScrapFly's API that smartly configures every web scraper connection to avoid being blocked.
For more about ScrapFly, see our documentation and try it out for FREE!
