How to Scrape Zillow Real Estate Property Data in Python

In this web scraping tutorial, we'll look at how to scrape property data from Zillow - the biggest real estate marketplace in the United States.

In this Zillow data scraper, we'll extract real estate data, including rent and sale property information such as prices, addresses, photos, and other listing details. We'll start with a brief overview of how the website works and how to navigate it. Then, we'll extract the full property details from individual listing pages, and finally, we'll explain how to use Zillow's search system for effective real estate data discovery. Let's get started!

Latest Zillow.com Scraper Code

https://github.com/scrapfly/scrapfly-scrapers/

Why Scrape Zillow?

Zillow.com contains a massive real estate dataset: prices, locations, contact information, etc. This is valuable information for market analytics, the study of the housing industry, and a general competitor overview.

This means that by web scraping Zillow, we have access to the biggest real estate market in the US!

For further details on data scraping use cases, refer to our extensive guide.

Project Setup

In this tutorial, we'll scrape Zillow using Python with two community packages:

  • httpx - HTTP client library to get Zillow data in either HTML or JSON.
  • parsel - HTML parsing library to parse our web scraped HTML files.

Optionally, we'll also use loguru, a logging library that will allow us to track our Zillow data scraper.
These packages can be installed using the following pip command:

$ pip install httpx parsel loguru

Alternatively, feel free to replace httpx with any other HTTP client package, such as requests, as we'll only send basic HTTP requests. As for parsel, another great alternative is the beautifulsoup package.
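
For instance, here's a minimal sanity-check sketch of how these two packages work together. Note that httpbin.org is our own choice of a public test page, not part of this tutorial's targets:

import httpx
from parsel import Selector

# fetch a page with httpx (httpbin.org/html is a public HTML test page)
response = httpx.get("https://httpbin.org/html")
# parse the returned HTML with parsel's CSS selectors
selector = Selector(response.text)
print(selector.css("h1::text").get())  # prints the page's first <h1> text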

Hands on Python Web Scraping Tutorial and Example Project

If you're new to web scraping with Python, we recommend checking out our full introduction tutorial with the common best practices.


How to Scrape Zillow Property Pages?

To start, let's explore scraping Zillow data from property pages. First, let's locate the data on the HTML from a given Zillow page, like this one.

To scrape this page data, we could parse every detail using XPath or CSS selectors. However, there is a better approach: hidden web data.

To find this data, open the page source (right-click the page and select "View page source") and search for a known property detail, such as the price. You will find the full property dataset hidden in a JavaScript variable inside the script#__NEXT_DATA__ tag:

capture of page source of Zillow's property page
We can see the property data is available as a JSON object in a script tag

This real estate data is the same data visible on the page, just captured before it gets rendered into the HTML - commonly known as hidden web data.
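
Before writing the full scraper, a short sketch can confirm the hidden dataset is really there. This one fetches the example property page used later in this tutorial and prints the top-level keys of the __NEXT_DATA__ JSON (note: without browser-like headers, Zillow may block the request):

import json
import httpx
from parsel import Selector

url = "https://www.zillow.com/homedetails/1625-E-13th-St-APT-3K-Brooklyn-NY-11229/245001606_zpid/"
# a browser-like user-agent reduces the chance of instant blocking
headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36"}
response = httpx.get(url, headers=headers)
selector = Selector(response.text)
# the hidden dataset lives in the __NEXT_DATA__ script tag
hidden = selector.css("script#__NEXT_DATA__::text").get()
if hidden:
    print(list(json.loads(hidden)))  # top-level keys of the hidden JSON
else:
    print("hidden dataset not found - the request was likely blocked")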

How to Scrape Hidden Web Data

Learn what hidden data is through some common examples. You will also learn how to scrape it using regular expressions and other clever parsing algorithms.


Let's power our Zillow data scraper with requesting and parsing logic for property pages:

Python
import asyncio
from typing import List
import httpx
import json
from parsel import Selector

client = httpx.AsyncClient(
    # enable http2
    http2=True,
    # add basic browser like headers to prevent being blocked
    headers={
        "accept-language": "en-US,en;q=0.9",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "accept-encoding": "gzip, deflate, br",
    },
)

async def scrape_properties(urls: List[str]):
    """scrape zillow property pages for property data"""
    to_scrape = [client.get(url) for url in urls]
    results = []
    for response in asyncio.as_completed(to_scrape):
        response = await response
        assert response.status_code == 200, "request has been blocked"
        selector = Selector(response.text)
        data = selector.css("script#__NEXT_DATA__::text").get()
        if data:
            # Option 1: some properties are located in NEXT DATA cache
            data = json.loads(data)
            property_data = json.loads(data["props"]["pageProps"]["componentProps"]["gdpClientCache"])
            property_data = property_data[list(property_data)[0]]['property']
        else:
            # Option 2: other times it's in Apollo cache
            data = selector.css("script#hdpApolloPreloadedData::text").get()
            data = json.loads(json.loads(data)["apiCache"])
            property_data = next(
                v["property"] for k, v in data.items() if "ForSale" in k
            )
        results.append(property_data)
    return results

ScrapFly
import asyncio
import json
from typing import List
from scrapfly import ScrapeConfig, ScrapflyClient

scrapfly = ScrapflyClient(key="Your ScrapFly API key")

async def scrape_properties(urls: List[str]):
    """scrape zillow property pages for property data"""
    to_scrape = [ScrapeConfig(url, asp=True, country="US") for url in urls]
    results = []
    async for result in scrapfly.concurrent_scrape(to_scrape):
        data = result.selector.css("script#__NEXT_DATA__::text").get()
        if data:
            # Option 1: some properties are located in NEXT DATA cache
            data = json.loads(data)
            property_data = json.loads(data["props"]["pageProps"]["componentProps"]["gdpClientCache"])
            property_data = property_data[list(property_data)[0]]['property']
        else:
            # Option 2: other times it's in Apollo cache
            data = result.selector.css("script#hdpApolloPreloadedData::text").get()
            data = json.loads(json.loads(data)["apiCache"])
            property_data = next(v["property"] for k, v in data.items() if "ForSale" in k)
        results.append(property_data)
    return results
Run the code
async def run():
    data = await scrape_properties(
            ["https://www.zillow.com/homedetails/1625-E-13th-St-APT-3K-Brooklyn-NY-11229/245001606_zpid/"]
        )
    print(json.dumps(data, indent=2))
if __name__ == "__main__":
    asyncio.run(run())

Let's break down the above code for scraping Zillow. We start by defining an httpx client with standard browser headers to avoid blocking. Then, we define a scrape_properties function, which does the following:

  • Add the Zillow property page URLs to a scraping list.
  • Request the page URLs concurrently to get the data as HTML pages.
  • Parse each page HTML for the script tag with the JSON data.

Let's run this property scraper and see what the extracted data from Zillow looks like:

Example Output
[
  {
    "address": {
      "streetAddress": "1065 2nd Ave",
      "city": "New York",
      "state": "NY",
      "zipcode": "10022",
      "__typename": "Address",
      "neighborhood": null
    },
    "description": "Inspired by Alvar Aaltos iconic vase, Aalto57s sculptural architecture reflects classic concepts of design both inside and out. Each residence in this boutique rental building features clean modern finishes. Amenities such as a landscaped terrace with gas grills, private and group dining areas, sun loungers, and fire feature as well as an indoor rock climbing wall, basketball court, game room, childrens playroom, guest suite, and a fitness center make Aalto57 a home like no other.",
    "photos": [
      "https://photos.zillowstatic.com/fp/0c1099a1882a904acc8cedcd83ebd9dc-p_d.jpg",
      "..."
    ],
    "zipcode": "10022",
    "phone": "646-681-3805",
    "name": "Aalto57",
    "floor_plans": [
      {
        "zpid": "2096631846",
        "__typename": "FloorPlan",
        "availableFrom": "1657004400000",
        "baths": 1,
        "beds": 1,
        "floorPlanUnitPhotos": [],
        "floorplanVRModel": null,
        "maxPrice": 6200,
        "minPrice": 6200,
        "name": "1 Bed/1 Bath-1D",
        ...
      }
    ...
  ]
}]

Cool! Our Zillow scraper can extract various details from the property web pages, including price, address, photos, and property structure. Next, let's explore extracting data from search pages!
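
Since the full dataset is very large, in practice it's common to flatten it down to just the fields you need. Here's an illustrative parse_property helper - the name is ours, and the keys are taken from the example output above, so they may vary between listing types:

def parse_property(property_data: dict) -> dict:
    """flatten the full hidden dataset to a few common fields;
    keys are based on the example output above and may differ
    between sale listings, rentals and buildings"""
    return {
        "name": property_data.get("name"),
        "address": property_data.get("address"),
        "phone": property_data.get("phone"),
        "description": property_data.get("description"),
        "photos": property_data.get("photos"),
        "floor_plans": property_data.get("floor_plans"),
    }

# usage with the scraper above:
# properties = await scrape_properties([...])
# summaries = [parse_property(p) for p in properties]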

How to Find Zillow Properties

Our previous code for scraping Zillow can extract data from a property page. In this section, we'll explore finding real estate listings using Zillow's search bar. Here is how the search system works under the hood:

Inspecting Zillow's search functionality with Chrome Dev Tools (accessed via the F12 key)

Above, we can see that upon submitting a search query, a background request is sent to Zillow's search API. The search query includes the map coordinates as well as other comprehensive details. However, only a few of the query parameters are actually required:

{
  "searchQueryState":{
    "pagination":{},
    "usersSearchTerm":"New Haven, CT",
    "mapBounds":
      {
        "west":-73.03037621240235,
        "east":-72.82781578759766,
        "south":41.23043771298298,
        "north":41.36611033618769
      }
    },
  "wants": {
    "cat1":["mapResults"]
  },
  "requestId": 2
}

The Zillow search API is really powerful, allowing us to find listings in any map area defined by two location points expressed as 4 direction values: north, west, south, and east:

illustration of drawing areas on maps using only two points
with these 4 values, we can draw a square or circular area at any point on the map!
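
For example, if we only know a point of interest (a latitude/longitude pair), we can build such an area ourselves. Here's an illustrative helper - mapBounds is the real query field, but the function name and the degree offset are our own assumptions:

def bbox_around(lat: float, lon: float, offset: float = 0.1) -> dict:
    """build a Zillow-style mapBounds dict around a center point;
    offset is in degrees (0.1 of latitude is roughly 11 km)"""
    return {
        "west": lon - offset,
        "east": lon + offset,
        "south": lat - offset,
        "north": lat + offset,
    }

# e.g. an area roughly covering New Haven, CT:
print(bbox_around(41.31, -72.92))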

Let's replicate this logic for finding properties by location in our Zillow scraping code using the latitude and longitude values:

Python
import json
import httpx

# we should use browser-like request headers to prevent being instantly blocked
BASE_HEADERS = {
    "accept-language": "en-US,en;q=0.9",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-encoding": "gzip, deflate, br",
}


url = "https://www.zillow.com/async-create-search-page-state"
body = {
    "searchQueryState": {
        "pagination": {},
        "usersSearchTerm": "New Haven, CT",
        # map coordinates that indicate New Haven city's area
        "mapBounds": {
            "west": -73.03037621240235,
            "east": -72.82781578759766,
            "south": 41.23043771298298,
            "north": 41.36611033618769,
        },
    },
    "wants": {"cat1": ["listResults", "mapResults"], "cat2": ["total"]},
    "requestId": 2,
}
response = httpx.put(
    url,
    headers={**BASE_HEADERS, "content-type": "application/json"},
    content=json.dumps(body),
)
assert response.status_code == 200, "request has been blocked"
data = response.json()
results = data["cat1"]["searchResults"]["mapResults"]
print(json.dumps(results, indent=2))
print(f"found {len(results)} property results")

ScrapFly
import json
from scrapfly import ScrapeConfig, ScrapflyClient

scrapfly = ScrapflyClient(key="Your ScrapFly API key")

url = "https://www.zillow.com/async-create-search-page-state"
body = {
    "searchQueryState": {
        "pagination": {},
        "usersSearchTerm": "New Haven, CT",
        # map coordinates that indicate New Haven city's area
        "mapBounds": {
            "west": -73.03037621240235,
            "east": -72.82781578759766,
            "south": 41.23043771298298,
            "north": 41.36611033618769,
        },
    },
    "wants": {"cat1": ["listResults", "mapResults"], "cat2": ["total"]},
    "requestId": 2,
}

response = scrapfly.scrape(
    ScrapeConfig(
        url,
        asp=True,
        country="US",
        headers={"content-type": "application/json"},
        body=json.dumps(body),
        method="PUT",
    )
)

data = json.loads(response.content)
results = data["cat1"]["searchResults"]["mapResults"]
print(json.dumps(results, indent=2))
print(f"found {len(results)} property results")  

We can successfully replicate Zillow's search requests. Next, let's utilize this to scrape the search pages.

How to Scrape Zillow Search Pages?

To scrape Zillow search, we need the geographical location details, which can be challenging to obtain. Therefore, we'll extract them from an easier source: the search pages themselves. To illustrate this, go to any search URL on Zillow, like zillow.com/homes/New-Haven,-CT_rb/. You will find the geographical details hidden in the HTML:

capture of page source of Zillow's search page
We can see the query and geo data of this search hidden in the page source

The geographical details exist in the __NEXT_DATA__ script tag. Let's use them to scrape Zillow data from search pages:

Python
import random
import asyncio
import json
import httpx
from loguru import logger as log
from parsel import Selector


async def _search(
    query: str,
    session: httpx.AsyncClient,
    filters: dict = None,
    categories=("cat1", "cat2"),
):
    """base search function which is used by sale and rent search functions"""
    html_response = await session.get(f"https://www.zillow.com/homes/{query}_rb/")
    assert html_response.status_code == 200, "request is blocked"
    selector = Selector(html_response.text)
    # find query data in script tags
    script_data = json.loads(
        selector.xpath("//script[@id='__NEXT_DATA__']/text()").get()
    )
    query_data = script_data["props"]["pageProps"]["searchPageState"]["queryState"]
    if filters:
        query_data["filterState"] = filters

    # scrape search API
    url = "https://www.zillow.com/async-create-search-page-state"
    found = []
    # cat1 - Agent Listings
    # cat2 - Other Listings
    for category in categories:
        full_query = {
            "searchQueryState": query_data,
            "wants": {category: ["mapResults"]},
            "requestId": random.randint(2, 10),
        }
        api_response = await session.put(
            url,
            headers={"content-type": "application/json"},
            content=json.dumps(full_query),
        )
        data = api_response.json()
        _total = data["categoryTotals"][category]["totalResultCount"]
        if _total > 500:
            log.warning(f"query has more results ({_total}) than 500 result limit ")
        else:
            log.info(f"found {_total} results for query: {query}")
        map_results = data[category]["searchResults"]["mapResults"]
        found.extend(map_results)
    return found


async def search_sale(query: str, session: httpx.AsyncClient):
    """search properties that are for sale"""
    log.info(f"scraping sale search for: {query}")
    return await _search(query=query, session=session)


async def search_rent(query: str, session: httpx.AsyncClient):
    """search properites that are for rent"""
    log.info(f"scraping rent search for: {query}")
    filters = {
        "isForSaleForeclosure": {"value": False},
        "isMultiFamily": {"value": False},
        "isAllHomes": {"value": True},
        "isAuction": {"value": False},
        "isNewConstruction": {"value": False},
        "isForRent": {"value": True},
        "isLotLand": {"value": False},
        "isManufactured": {"value": False},
        "isForSaleByOwner": {"value": False},
        "isComingSoon": {"value": False},
        "isForSaleByAgent": {"value": False},
    }
    return await _search(
        query=query, session=session, filters=filters, categories=["cat1"]
    )

ScrapFly
import asyncio
import json
from random import randint
from scrapfly import ScrapeConfig, ScrapflyClient
from loguru import logger as log

scrapfly = ScrapflyClient(key="Your ScrapFly API key")


async def _search(query: str, filters: dict = None, categories=("cat1", "cat2")):
    """base search function which is used by sale and rent search functions"""
    html_response = await scrapfly.async_scrape(
        ScrapeConfig(
            f"https://www.zillow.com/homes/{query}_rb/", asp=True, country="US"
        )
    )
    script_data = json.loads(
        html_response.selector.xpath("//script[@id='__NEXT_DATA__']/text()").get()
    )
    query_data = script_data["props"]["pageProps"]["searchPageState"]["queryState"]
    if filters:
        query_data["filterState"] = filters
    # scrape search API
    url = "https://www.zillow.com/async-create-search-page-state"
    found = []
    # cat1 - Agent Listings
    # cat2 - Other Listings
    for category in categories:
        full_query = {
            "searchQueryState": query_data,
            "wants": {category: ["mapResults"]},
            "requestId": randint(2, 10),
        }
        api_response = await scrapfly.async_scrape(
            ScrapeConfig(
                url,
                asp=True,
                country="US",
                headers={"content-type": "application/json"},
                body=json.dumps(full_query),
                method="PUT",
            )
        )
        data = json.loads(api_response.content)
        _total = data["categoryTotals"][category]["totalResultCount"]
        if _total > 500:
            log.warning(f"query has more results ({_total}) than 500 result limit ")
        else:
            log.info(f"found {_total} results for query: {query}")
        map_results = data[category]["searchResults"]["mapResults"]
        found.extend(map_results)
    return found


async def search_sale(query: str):
    """search properties that are for sale"""
    log.info(f"scraping sale search for: {query}")
    return await _search(query=query)


async def search_rent(query: str):
    """search properites that are for rent"""
    log.info(f"scraping rent search for: {query}")
    filters = {
        "isForSaleForeclosure": {"value": False},
        "isMultiFamily": {"value": False},
        "isAllHomes": {"value": True},
        "isAuction": {"value": False},
        "isNewConstruction": {"value": False},
        "isForRent": {"value": True},
        "isLotLand": {"value": False},
        "isManufactured": {"value": False},
        "isForSaleByOwner": {"value": False},
        "isComingSoon": {"value": False},
        "isForSaleByAgent": {"value": False},
    }
    return await _search(query=query, filters=filters, categories=["cat1"])


async def run():
    data = await search_rent("New Haven, CT")
    print(json.dumps(data, indent=2))


if __name__ == "__main__":
    asyncio.run(run())

Let's break down the above Zillow scraper code. We use the _search function to request the search page first and parse the HTML for the location details. Then, we use the location details to request the search API, either for sale or rent real estate data.

Executing the above scraping code will extract the following data from Zillow:

Run code and example output
BASE_HEADERS = {
    "accept-language": "en-US,en;q=0.9",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-encoding": "gzip, deflate, br",
}

async def run():
    limits = httpx.Limits(max_connections=5)
    async with httpx.AsyncClient(limits=limits, timeout=httpx.Timeout(15.0), headers=BASE_HEADERS) as session:
        data = await search_rent("New Haven, CT", session)
        print(json.dumps(data, indent=2))


if __name__ == "__main__":
    asyncio.run(run())
[
  {
    "buildingId": "40.609608--73.960045",
    "lotId": 1004524429,
    "price": "From $295,000",
    "latLong": {
      "latitude": 40.609608,
      "longitude": -73.960045
    },
    "minBeds": 1,
    "minBaths": 1.0,
    "minArea": 1200,
    "imgSrc": "https://photos.zillowstatic.com/fp/3c0259c716fc4793a65838aa40af6350-p_e.jpg",
    "hasImage": true,
    "plid": "1611681",
    "isFeaturedListing": false,
    "unitCount": 2,
    "isBuilding": true,
    "address": "1625 E 13th St, Brooklyn, NY",
    "variableData": {},
    "badgeInfo": null,
    "statusType": "FOR_SALE",
    "statusText": "For Rent",
    "listingType": "",
    "isFavorite": false,
    "detailUrl": "/b/1625-e-13th-st-brooklyn-ny-5YGKWY/",
    "has3DModel": false,
    "hasAdditionalAttributions": false,
  },
...
]

The search results provide valuable information about each listing, such as the address, geolocation, and metadata. However, to obtain all of the relevant listing details, we must scrape each individual property listing page, whose URL can be found in the detailUrl field - as shown in the sketch below.
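
To glue the two scrapers together, we can turn each result's relative detailUrl into an absolute URL and feed it to the scrape_properties function from earlier. A brief sketch reusing the functions defined above:

async def scrape_search_properties(query: str, session: httpx.AsyncClient):
    """a sketch chaining search and property scraping together"""
    search_results = await search_rent(query, session)
    # detailUrl is relative, e.g. "/b/1625-e-13th-st-brooklyn-ny-5YGKWY/"
    urls = [
        "https://www.zillow.com" + result["detailUrl"]
        for result in search_results
        if result.get("detailUrl")
    ]
    return await scrape_properties(urls)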

Note that Zillow search is limited to 500 properties per query, so we have to use smaller geographical zones to cover larger areas completely. For this, refer to the Zillow zip code index page.
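
A common workaround is to split a too-large search area into quadrants and search each of them recursively until every sub-area returns fewer than 500 results. The helper below is an illustrative sketch of the splitting step, not part of Zillow's API:

def split_bbox(bounds: dict) -> list:
    """split a mapBounds area into 4 equal quadrants; search each
    quadrant separately (recursively if needed) to stay under the
    500 results per query limit"""
    mid_lat = (bounds["north"] + bounds["south"]) / 2
    mid_lon = (bounds["west"] + bounds["east"]) / 2
    return [
        {"north": bounds["north"], "south": mid_lat, "west": bounds["west"], "east": mid_lon},
        {"north": bounds["north"], "south": mid_lat, "west": mid_lon, "east": bounds["east"]},
        {"north": mid_lat, "south": bounds["south"], "west": bounds["west"], "east": mid_lon},
        {"north": mid_lat, "south": bounds["south"], "west": mid_lon, "east": bounds["east"]},
    ]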


Our Zillow scraping code can successfully extract real estate data from property and search pages. However, running the scraper at scale will lead the website to block our HTTP requests. Let's take a look at avoiding Zillow web scraping blocking next!

Bypass Zillow Scraping Blocking

Creating a Zillow data scraper doesn't seem too complicated. However, scraper blocking will get in our way, such as CAPTCHAs or IP address blocking. This is where Scrapfly can lend a hand!

ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale.

For example, here's how we can scrape Zillow without getting blocked. All we have to do is replace our HTTP client with the ScrapFly API client, enable the asp parameter, and select a proxy country:

# standard web scraping code
import httpx
from parsel import Selector

response = httpx.get("some zillow.com URL")
selector = Selector(response.text)

# in ScrapFly becomes this 👇
from scrapfly import ScrapeConfig, ScrapflyClient

# replaces your HTTP client (httpx in this case)
scrapfly = ScrapflyClient(key="Your ScrapFly API key")

response = scrapfly.scrape(ScrapeConfig(
    url="website URL",
    asp=True, # enable the anti scraping protection to bypass blocking
    country="US", # set the proxy location to a specfic country
    render_js=True # enable rendering JavaScript (like headless browsers) to scrape dynamic content if needed
))

# use the built in Parsel selector
selector = response.selector
# access the HTML content
html = response.scrape_result['content']

FAQ

To wrap this guide up, let's take a look at some frequently asked questions about web scraping Zillow real estate data:

Is it legal to scrape Zillow?

Yes. Zillow's data is publicly available, and we're not extracting anything personal or private. Scraping Zillow.com at slow, respectful rates falls under the ethical scraping definition.
That being said, attention should be paid to GDPR compliance in the EU when scraping personal data of non-agent listings (seller's name, phone number, etc.). For more, see our Is Web Scraping Legal? article.

Are there public APIs for Zillow?

Zillow offers official APIs, but they are extremely limited and not suitable for dataset collection. Instead, we can scrape Zillow data with Python and httpx.

How to crawl Zillow?

We can easily create a Zillow crawler with the subjects we've covered in this tutorial. Instead of searching for properties explicitly, we can crawl Zillow properties from seed links (any Zillow URLs) and follow the related properties mentioned in a loop. For more on crawling, see How to Crawl the Web with Python.
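
As a rough sketch (reusing scrape_properties from this tutorial and assuming a hypothetical extract_links helper that pulls related-property URLs out of a scraped dataset), such a crawl loop could look like this:

async def crawl(seed_urls: list, max_properties: int = 100):
    """a rough Zillow crawl-loop sketch; extract_links() is a
    hypothetical helper that would collect related-property URLs
    from a scraped property dataset"""
    seen, queue, results = set(), list(seed_urls), []
    while queue and len(results) < max_properties:
        url = queue.pop()
        if url in seen:
            continue
        seen.add(url)
        properties = await scrape_properties([url])
        results.extend(properties)
        for prop in properties:
            queue.extend(extract_links(prop))  # hypothetical helper
    return results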

Are there alternatives for Zillow?

Yes, Redfin is another popular real estate marketplace in the United States. We have covered scraping Redfin in a previous guide. For more guides on real estate target websites, refer to our #realestate blog tag.

Latest Zillow.com Scraper Code
https://github.com/scrapfly/scrapfly-scrapers/

Zillow Scraping Summary

In this guide, we explained scraping real estate data from Zillow.

We used Zillow's search system to find real estate properties for sale or rent in any region. Then, we used hidden web data scraping (extracting Zillow's state cache from the HTML page) to scrape property data such as prices, building information, and contact details.

For this, we used Python with the httpx and parsel packages. To avoid Zillow scraper blocking, we used ScrapFly's API, which smartly configures every web scraper connection to avoid being blocked.
