How to Scrape Zillow Real Estate Property Data in Python


In this web scraping tutorial, we'll be taking a look at how to scrape Zillow.com - the biggest real estate marketplace in the United States.

In this guide, we'll be scraping rent and sale property information such as pricing info, addresses, photos and phone numbers displayed on Zillow.com property pages.
We'll start with a brief overview of how the website works. Then we'll take a look at how to use the search system to discover properties and, finally, how to scrape all of the property information.

We'll be using Python with a few community packages that'll make this Zillow scraper a breeze - let's dive in!

Latest Zillow.com Scraper Code

https://github.com/scrapfly/scrapfly-scrapers/

Why Scrape Zillow.com?

Zillow.com contains a massive real estate dataset: prices, locations, contact information, etc. This is valuable information for market analytics, the study of the housing industry, and a general competitor overview.

So, by web scraping Zillow, we can access the biggest real estate property dataset in the US!

For more on scraping use cases see our extensive write-up Scraping Use Cases

Project Setup

In this tutorial, we'll scrape Zillow using Python with two community packages:

  • httpx - HTTP client library which will let us communicate with Zillow.com's servers
  • parsel - HTML parsing library which will help us to parse our web scraped HTML files.

Optionally we'll also use loguru - a pretty logging library that'll help us to keep track of what's going on.

These packages can be easily installed via the pip install command:

$ pip install httpx parsel loguru

Alternatively, feel free to swap httpx out for any other HTTP client package, such as requests, as we'll only need basic HTTP functions, which are almost interchangeable in every library. As for parsel, another great alternative is the beautifulsoup package.
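To verify the setup, here's a quick sketch of the fetch-and-parse pattern the rest of this tutorial builds on. Note that httpbin.org is just a neutral test host of our choosing, not a Zillow endpoint:

# a quick sanity check of our setup: httpx retrieves the page,
# parsel extracts data from the HTML. httpbin.org is only a test host.
import httpx
from parsel import Selector

response = httpx.get("https://httpbin.org/html")
selector = Selector(response.text)
print(selector.css("h1::text").get())  # prints the page's <h1> text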

Hands on Python Web Scraping Tutorial and Example Project

If you're new to web scraping with Python we recommend checking out our full introduction tutorial to web scraping with Python and common best practices.


How to Scrape Zillow Property Pages

To start, let's take a look at how to scrape the Zillow property dataset from given property page URLs.

First, let's take a look at where the data we want is located on the property page, like this one: zillow.com/b/1625-e-13th-st-brooklyn-ny-5YGKWY/

If we take a look at the property listing's page source, we can see that the property dataset is hidden in the HTML body as a JavaScript variable:

capture of page source of Zillow's property page
We can see the property data is available as a JSON object in a script tag

This is generally referred to as hidden web data scraping. In this example, Zillow's backend stores the property dataset in a JavaScript variable so that the front-end code can access and render it for end users.

In this particular example, Zillow is using the Next.js framework.
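Conceptually, extracting this hidden dataset takes just two steps: select the script element's text and decode it as JSON. Here's a minimal sketch, assuming the variable html already holds a downloaded property page; the full scraper below does this end-to-end:

# a minimal hidden web data extraction sketch; `html` is assumed
# to hold the page source of a property page downloaded beforehand
import json
from parsel import Selector

selector = Selector(html)
next_data = json.loads(selector.css("script#__NEXT_DATA__::text").get())
print(list(next_data["props"]["pageProps"]))  # top-level keys of the page state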

Let's add property scraping and parsing to our zillow scraper code:

Python
import asyncio
from typing import List
import httpx
import json
from parsel import Selector

client = httpx.AsyncClient(
    # enable http2
    http2=True,
    # add basic browser like headers to prevent being blocked
    headers={
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "accept-language": "en-US,en;q=0.9",
        "accept-encoding": "gzip, deflate, br",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    },
)


async def scrape_properties(urls: List[str]):
    """scrape zillow property pages for property data"""
    to_scrape = [client.get(url) for url in urls]
    results = []
    for response in asyncio.as_completed(to_scrape):
        response = await response
        assert response.status_code == 200, "request has been blocked"
        selector = Selector(response.text)
        data = selector.css("script#__NEXT_DATA__::text").get()
        if data:
            # Option 1: some properties are located in NEXT DATA cache
            data = json.loads(data)
            property_data = json.loads(data["props"]["pageProps"]["componentProps"]["gdpClientCache"])
            property_data = property_data[list(property_data)[0]]['property']
        else:
            # Option 2: other times it's in Apollo cache
            data = selector.css("script#hdpApolloPreloadedData::text").get()
            data = json.loads(json.loads(data)["apiCache"])
            property_data = next(
                v["property"] for k, v in data.items() if "ForSale" in k
            )
        results.append(property_data)
    return results
ScrapFly

import asyncio
import json
from typing import List
from scrapfly import ScrapeConfig, ScrapflyClient

scrapfly = ScrapflyClient(key="Your ScrapFly API key")

async def scrape_properties(urls: List[str]):
    """scrape zillow property pages for property data"""
    to_scrape = [ScrapeConfig(url, asp=True, country="US") for url in urls]
    results = []
    async for result in scrapfly.concurrent_scrape(to_scrape):
        data = result.selector.css("script#__NEXT_DATA__::text").get()
        if data:
            # Option 1: some properties are located in NEXT DATA cache
            data = json.loads(data)
            property_data = json.loads(data["props"]["pageProps"]["componentProps"]["gdpClientCache"])
            property_data = property_data[list(property_data)[0]]['property']
        else:
            # Option 2: other times it's in Apollo cache
            data = result.selector.css("script#hdpApolloPreloadedData::text").get()
            data = json.loads(json.loads(data)["apiCache"])
            property_data = next(v["property"] for k, v in data.items() if "ForSale" in k)
        results.append(property_data)
    return results
Run the code
async def run():
    data = await scrape_properties(
        ["https://www.zillow.com/homedetails/1625-E-13th-St-APT-3K-Brooklyn-NY-11229/245001606_zpid/"]
    )
    print(json.dumps(data, indent=2))


if __name__ == "__main__":
    asyncio.run(run())

Above, to scrape data from Zillow, we defined a small function that takes a list of property URLs, scrapes their HTML pages and extracts the embedded JavaScript state data, which contains property info such as addresses, prices and phone numbers.

Let's run this property scraper and see the results it generates:

Example Output
[
  {
    "address": {
      "streetAddress": "1065 2nd Ave",
      "city": "New York",
      "state": "NY",
      "zipcode": "10022",
      "__typename": "Address",
      "neighborhood": null
    },
    "description": "Inspired by Alvar Aaltos iconic vase, Aalto57s sculptural architecture reflects classic concepts of design both inside and out. Each residence in this boutique rental building features clean modern finishes. Amenities such as a landscaped terrace with gas grills, private and group dining areas, sun loungers, and fire feature as well as an indoor rock climbing wall, basketball court, game room, childrens playroom, guest suite, and a fitness center make Aalto57 a home like no other.",
    "photos": [
      "https://photos.zillowstatic.com/fp/0c1099a1882a904acc8cedcd83ebd9dc-p_d.jpg",
      "..."
    ],
    "zipcode": "10022",
    "phone": "646-681-3805",
    "name": "Aalto57",
    "floor_plans": [
      {
        "zpid": "2096631846",
        "__typename": "FloorPlan",
        "availableFrom": "1657004400000",
        "baths": 1,
        "beds": 1,
        "floorPlanUnitPhotos": [],
        "floorplanVRModel": null,
        "maxPrice": 6200,
        "minPrice": 6200,
        "name": "1 Bed/1 Bath-1D",
        ...
      }
    ...
  ]
}]
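The full property object is much larger than this excerpt. Here's a hedged sketch for trimming it down to the fields we care about; the key names are taken from the example output above and may vary between sale, rent and building pages, hence the .get() fallbacks:

# reduce the full property object to a compact record; key names are
# based on the example output above and may differ per listing type
def parse_property(data: dict) -> dict:
    return {
        "name": data.get("name"),
        "phone": data.get("phone"),
        "address": data.get("address"),
        "zipcode": data.get("zipcode"),
        "description": data.get("description"),
        "photos": data.get("photos"),
        "floor_plans": data.get("floor_plans"),
    }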

Our zillow scraper can scrape real estate data from property pages. Let's scrape search pages!

How to Find Zillow Properties

Now that we know how to scrape a single Zillow property listing, we can take a look at how to find listings using Zillow's search system.

Let's take a look at how it functions and how we can use it in Zillow web scraping with Python:

Inspecting Zillow's search functionality with Chrome Dev tools (accessed via F12 key)

Above, we can see that once we submit our search, a background request is made to Zillow's search API. We send a search query with some map coordinates and receive hundreds of listing previews. To query Zillow, we only need a few parameter inputs:

{
  "searchQueryState": {
    "pagination": {},
    "usersSearchTerm": "New Haven, CT",
    "mapBounds": {
      "west": -73.03037621240235,
      "east": -72.82781578759766,
      "south": 41.23043771298298,
      "north": 41.36611033618769
    }
  },
  "wants": {
    "cat1": ["mapResults"]
  },
  "requestId": 2
}

We can see that this API is really powerful and allows us to find listings in any map area defined by a bounding box of four direction values: north, west, south and east:

illustration of drawing areas on maps using only two points
with these 4 values we can draw a rectangular area at any point on the map!
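If we only have a center coordinate, we can approximate such a bounding box ourselves. Here's a hedged helper sketch using the common 111 km-per-degree rule of thumb, which is our own approximation, not Zillow's logic:

# approximate a square mapBounds box around a coordinate;
# the 111 km-per-degree figure is a rough rule of thumb
import math

def map_bounds(lat: float, lon: float, radius_km: float) -> dict:
    lat_delta = radius_km / 111  # ~111 km per degree of latitude
    # degrees of longitude cover fewer km away from the equator
    lon_delta = radius_km / (111 * math.cos(math.radians(lat)))
    return {
        "west": lon - lon_delta,
        "east": lon + lon_delta,
        "south": lat - lat_delta,
        "north": lat + lat_delta,
    }

print(map_bounds(41.3, -72.93, 8))  # roughly the New Haven, CT area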

This means we can find properties in any location area as long as we know its latitude and longitude. We can replicate this request in our Zillow scraper:

Python
import json
import httpx

# we should use browser-like request headers to prevent being instantly blocked
BASE_HEADERS = {
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US,en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
}


url = "https://www.zillow.com/async-create-search-page-state"
body = {
    "searchQueryState": {
        "pagination": {},
        "usersSearchTerm": "New Haven, CT",
        # map coordinates that indicate New Haven city's area
        "mapBounds": {
            "west": -73.03037621240235,
            "east": -72.82781578759766,
            "south": 41.23043771298298,
            "north": 41.36611033618769,
        },
    },
    "wants": {"cat1": ["listResults", "mapResults"], "cat2": ["total"]},
    "requestId": 2,
}
# httpx's json= argument serializes the body and sets the content-type header for us
response = httpx.put(url, headers=BASE_HEADERS, json=body)
assert response.status_code == 200, "request has been blocked"
data = response.json()
results = data["cat1"]["searchResults"]["mapResults"]
print(json.dumps(results, indent=2))
print(f"found {len(results)} property results")
ScrapFly

import json
from scrapfly import ScrapeConfig, ScrapflyClient

scrapfly = ScrapflyClient(key="Your ScrapFly API key")

url = "https://www.zillow.com/async-create-search-page-state"
body = {
    "searchQueryState": {
        "pagination": {},
        "usersSearchTerm": "New Haven, CT",
        # map coordinates that indicate New Haven city's area
        "mapBounds": {
            "west": -73.03037621240235,
            "east": -72.82781578759766,
            "south": 41.23043771298298,
            "north": 41.36611033618769,
        },
    },
    "wants": {"cat1": ["listResults", "mapResults"], "cat2": ["total"]},
    "requestId": 2,
}

response = scrapfly.scrape(
    ScrapeConfig(
        url,
        asp=True,
        country="US",
        headers={"content-type": "application/json"},
        body=json.dumps(body),
        method="PUT",
    )
)

data = json.loads(response.content)
results = data["cat1"]["searchResults"]["mapResults"]
print(json.dumps(results, indent=2))
print(f"found {len(results)} property results")  

We can see that this search request is relatively easy to replicate. So, let's take a look at how we can scrape it properly!

To scrape Zillow's search, we need geographical location details, which are difficult to come by unless you're familiar with geospatial tooling. However, there's an easy way to find a location's geographical details by exploring Zillow's search page itself.
If we take a look at a search URL like zillow.com/homes/New-Haven,-CT_rb/ we can see the geographical details hidden away in the HTML body:

capture of page source of Zillow's search pager
We can see the query and geo data of this search hidden in the page source

We can extract the geographical details from these hidden script tags and use them in our search API request. Let's see how to do this in our Zillow scraper code:

Python
import random
import asyncio
import json
import httpx
from loguru import logger as log
from parsel import Selector


async def _search(
    query: str,
    session: httpx.AsyncClient,
    filters: dict = None,
    categories=("cat1", "cat2"),
):
    """base search function which is used by sale and rent search functions"""
    html_response = await session.get(f"https://www.zillow.com/homes/{query}_rb/")
    assert html_response.status_code == 200, "request has been blocked"
    selector = Selector(html_response.text)
    # find query data in script tags
    script_data = json.loads(
        selector.xpath("//script[@id='__NEXT_DATA__']/text()").get()
    )
    query_data = script_data["props"]["pageProps"]["searchPageState"]["queryState"]
    if filters:
        query_data["filterState"] = filters

    # scrape search API
    url = "https://www.zillow.com/async-create-search-page-state"
    found = []
    # cat1 - Agent Listings
    # cat2 - Other Listings
    for category in categories:
        full_query = {
            "searchQueryState": query_data,
            "wants": {category: ["mapResults"]},
            "requestId": random.randint(2, 10),
        }
        api_response = await session.put(
            url,
            headers={"content-type": "application/json"},
            # httpx takes the raw request body via the content= argument
            content=json.dumps(full_query),
        )
        data = api_response.json()
        _total = data["categoryTotals"][category]["totalResultCount"]
        if _total > 500:
            log.warning(f"query has more results ({_total}) than 500 result limit ")
        else:
            log.info(f"found {_total} results for query: {query}")
        map_results = data[category]["searchResults"]["mapResults"]
        found.extend(map_results)
    return found


async def search_sale(query: str, session: httpx.AsyncClient):
    """search properties that are for sale"""
    log.info(f"scraping sale search for: {query}")
    return await _search(query=query, session=session)


async def search_rent(query: str, session: httpx.AsyncClient):
    """search properites that are for rent"""
    log.info(f"scraping rent search for: {query}")
    filters = {
        "isForSaleForeclosure": {"value": False},
        "isMultiFamily": {"value": False},
        "isAllHomes": {"value": True},
        "isAuction": {"value": False},
        "isNewConstruction": {"value": False},
        "isForRent": {"value": True},
        "isLotLand": {"value": False},
        "isManufactured": {"value": False},
        "isForSaleByOwner": {"value": False},
        "isComingSoon": {"value": False},
        "isForSaleByAgent": {"value": False},
    }
    return await _search(
        query=query, session=session, filters=filters, categories=["cat1"]
    )
ScrapFly

import asyncio
import json
from random import randint
from scrapfly import ScrapeConfig, ScrapflyClient
from loguru import logger as log

scrapfly = ScrapflyClient(key="Your ScrapFly API key")


async def _search(query: str, filters: dict = None, categories=("cat1", "cat2")):
    """base search function which is used by sale and rent search functions"""
    html_response = await scrapfly.async_scrape(
        ScrapeConfig(
            f"https://www.zillow.com/homes/{query}_rb/", asp=True, country="US"
        )
    )
    script_data = json.loads(
        html_response.selector.xpath("//script[@id='__NEXT_DATA__']/text()").get()
    )
    query_data = script_data["props"]["pageProps"]["searchPageState"]["queryState"]
    if filters:
        query_data["filterState"] = filters
    # scrape search API
    url = "https://www.zillow.com/async-create-search-page-state"
    found = []
    # cat1 - Agent Listings
    # cat2 - Other Listings
    for category in categories:
        full_query = {
            "searchQueryState": query_data,
            "wants": {category: ["mapResults"]},
            "requestId": randint(2, 10),
        }
        api_response = await scrapfly.async_scrape(
            ScrapeConfig(
                url,
                asp=True,
                country="US",
                headers={"content-type": "application/json"},
                body=json.dumps(full_query),
                method="PUT",
            )
        )
        data = json.loads(api_response.content)
        _total = data["categoryTotals"][category]["totalResultCount"]
        if _total > 500:
            log.warning(f"query has more results ({_total}) than 500 result limit ")
        else:
            log.info(f"found {_total} results for query: {query}")
        map_results = data[category]["searchResults"]["mapResults"]
        found.extend(map_results)
    return found


async def search_sale(query: str):
    """search properties that are for sale"""
    log.info(f"scraping sale search for: {query}")
    return await _search(query=query)


async def search_rent(query: str):
    """search properites that are for rent"""
    log.info(f"scraping rent search for: {query}")
    filters = {
        "isForSaleForeclosure": {"value": False},
        "isMultiFamily": {"value": False},
        "isAllHomes": {"value": True},
        "isAuction": {"value": False},
        "isNewConstruction": {"value": False},
        "isForRent": {"value": True},
        "isLotLand": {"value": False},
        "isManufactured": {"value": False},
        "isForSaleByOwner": {"value": False},
        "isComingSoon": {"value": False},
        "isForSaleByAgent": {"value": False},
    }
    return await _search(query=query, filters=filters, categories=["cat1"])


async def run():
    data = await search_rent("New Haven, CT")
    print(json.dumps(data, indent=2))


if __name__ == "__main__":
    asyncio.run(run())

Above, we define our search functions for scraping Zillow's rent and sale listings. The first thing to notice is that both use the same search endpoint; the only difference is that the rent search applies extra filtering to exclude sale properties.

Let's run this Zillow web scraper and see what results we receive:

Run code and example output
BASE_HEADERS = {
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US,en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
}

async def run():
    limits = httpx.Limits(max_connections=5)
    async with httpx.AsyncClient(limits=limits, timeout=httpx.Timeout(15.0), headers=BASE_HEADERS) as session:
        data = await search_rent("New Haven, CT", session)
        print(json.dumps(data, indent=2))


if __name__ == "__main__":
    asyncio.run(run())
[
  {
    "buildingId": "40.609608--73.960045",
    "lotId": 1004524429,
    "price": "From $295,000",
    "latLong": {
      "latitude": 40.609608,
      "longitude": -73.960045
    },
    "minBeds": 1,
    "minBaths": 1.0,
    "minArea": 1200,
    "imgSrc": "https://photos.zillowstatic.com/fp/3c0259c716fc4793a65838aa40af6350-p_e.jpg",
    "hasImage": true,
    "plid": "1611681",
    "isFeaturedListing": false,
    "unitCount": 2,
    "isBuilding": true,
    "address": "1625 E 13th St, Brooklyn, NY",
    "variableData": {},
    "badgeInfo": null,
    "statusType": "FOR_SALE",
    "statusText": "For Rent",
    "listingType": "",
    "isFavorite": false,
    "detailUrl": "/b/1625-e-13th-st-brooklyn-ny-5YGKWY/",
    "has3DModel": false,
    "hasAdditionalAttributions": false,
  },
...
]

Note: Zillow search is limited to 500 properties per search, so we need to search in smaller geographical squares or use Zillow's zipcode index, which covers all US zipcodes; these are essentially small, predefined geographical zones. One way to split a large area is sketched below.
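Here's a hedged sketch of that tiling approach: split one large mapBounds box into a grid of smaller boxes, each of which can then replace the mapBounds value in the searchQueryState we saw earlier:

# split one mapBounds box into an n x n grid of smaller boxes to stay
# under the 500-result cap; each sub-box can then be searched separately
from typing import Iterator

def split_bounds(bounds: dict, n: int = 2) -> Iterator[dict]:
    lat_step = (bounds["north"] - bounds["south"]) / n
    lon_step = (bounds["east"] - bounds["west"]) / n
    for i in range(n):
        for j in range(n):
            yield {
                "south": bounds["south"] + i * lat_step,
                "north": bounds["south"] + (i + 1) * lat_step,
                "west": bounds["west"] + j * lon_step,
                "east": bounds["west"] + (j + 1) * lon_step,
            }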

The search returned a lot of useful preview data about each listing: the address, geolocation and some metadata. However, to retrieve the full listing data, we need to scrape each property page, whose URL we can find in the detailUrl field.
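To tie the two scrapers together, the relative detailUrl just needs the domain prefixed before being handed to the property scraper. A quick sketch, assuming the search_sale() and scrape_properties() functions from the earlier snippets are in scope:

# combine search and property scraping; assumes search_sale() and
# scrape_properties() from the earlier snippets are in scope
async def scrape_search_with_details(query: str, session: httpx.AsyncClient):
    previews = await search_sale(query, session)
    urls = [
        "https://www.zillow.com" + preview["detailUrl"]
        for preview in previews
        if preview.get("detailUrl", "").startswith("/")  # detailUrl is site-relative
    ]
    return await scrape_properties(urls)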


We wrote a quick Python scraper that discovers Zillow properties from a given query string and then scrapes each property page for the full property information.

However, to run this scraper at scale without being blocked, we'll take a look at using ScrapFly web scraping API. ScrapFly will help us to scale up our scraper and avoid zillow web scraping blocking.

Avoid Zillow Web Scraping Blocking

Web scraping Zillow.com data doesn't seem too difficult. Unfortunately, when scraping at scale, it's very likely we'll be blocked or forced to solve captchas, which will hinder or completely disable our web scraper.

To get around this, let's take advantage of ScrapFly API which can avoid all of these blocks for us!

illustration of scrapfly's middleware

ScrapFly offers several powerful features, such as anti-scraping protection bypass and country-specific proxies, that'll help us get around Zillow's web scraper blocking.

For this, we'll be using the scrapfly-sdk Python package, which can be installed via pip:

$ pip install scrapfly-sdk

To take advantage of ScrapFly's API in our Zillow scraper, all we need to do is replace our httpx client code with scrapfly-sdk requests:

import httpx

response = httpx.get("some zillow url")
# in ScrapFly SDK becomes
from scrapfly import ScrapflyClient, ScrapeConfig
client = ScrapflyClient("YOUR SCRAPFLY KEY")
result = client.scrape(ScrapeConfig(
    "some zillow url",
    # we can select specific proxy country
    country="US",
    # and enable anti scraping protection bypass:
    asp=True
))

FAQ

To wrap this guide up, let's take a look at some frequently asked questions about web scraping Zillow data:

Is it legal to scrape Zillow.com?

Yes. Zillow's data is publicly available; we're not extracting anything personal or private. Scraping Zillow.com at slow, respectful rates falls under the ethical scraping definition.
That being said, attention should be paid to GDPR compliance in the EU when scraping personal data of non-agent listings (seller's name, phone number etc.). For more, see our Is Web Scraping Legal? article.

Does Zillow.com have an API?

Yes, but it's extremely limited, not suitable for dataset collection, and there are no Zillow API Python clients available. Instead, we can scrape Zillow data ourselves with Python and httpx, which is perfectly legal and easy to do.

How to crawl Zillow?

We can easily create a Zillow crawler with the subjects we've covered in this tutorial. Instead of searching for properties explicitly, we can crawl Zillow properties from seed links (any Zillow URLs) and follow the related properties mentioned in a loop. For more on crawling, see How to Crawl the Web with Python.

Latest Zillow.com Scraper Code
https://github.com/scrapfly/scrapfly-scrapers/

Zillow Scraping Summary

In this tutorial, we explained how to write a Zillow scraper using Python.

We used the search system to discover real estate properties for sale or rent in any given region. To scrape the property data, such as prices, building information and contact details, we used hidden web data scraping by extracting Zillow's state cache from the HTML page.

For this, we used Python with the httpx and parsel packages, and to avoid being blocked we used ScrapFly's API, which smartly configures every web scraper connection to avoid blocking. For more on ScrapFly, see our documentation and try it out for free!

Related Posts

How to Scrape Bing Search with Python

In this scrape guide we'll be taking a look at scraping Bing search results. It's the second biggest search engine in the world and it contains a lot of data - all retrievable with a bit of Python.

How to Scrape G2 Company Data and Reviews

In this scrape guide we're taking a look at G2.com - one of the biggest digital product metawebsites out there. We'll be scraping product data, reviews and company profiles.

How to Scrape Etsy.com Product, Shop and Search Data

In this scrape guide we're taking a look at Etsy.com - a popular e-commerce market for hand-crafted and vintage items. We'll be using Python and HTML parsing to scrape search and product data.