How to Scrape Zillow.com


In this web scraping tutorial, we'll be scraping Zillow.com - the biggest real estate marketplace in the United States.

In this guide, we'll take a look at how to collect rent and sale property information such as pricing info, addresses, photos and phone numbers displayed on Zillow.com property pages. We'll start by taking a look at how the website functions, how to use its search system to discover properties and finally how to scrape all of the property information.

We'll be using Python with a few community packages that'll make this web scraper a breeze!

If you're new to web scraping with Python, we recommend checking out our full introduction tutorial and common best practices: Web Scraping With Python Tutorial.

Why Scrape Zillow.com?

Zillow.com contains a massive real estate dataset: prices, locations, contact information and more. This is valuable information for market analytics, studying the housing industry and competitor research.

For more on scraping use cases, see our extensive write-up: Scraping Use Cases.

Setup

In this tutorial we'll be using Python with two community packages:

  • httpx - HTTP client library which will let us communicate with Zillow.com's servers
  • parsel - HTML parsing library which will help us to parse our web scraped HTML files.

Optionally, we'll also use loguru - a pretty logging library that'll help us keep track of what's going on.

These packages can be easily installed via the pip command:

$ pip install httpx parsel loguru

Alternatively, feel free to swap httpx out with any other HTTP client package such as requests, as we'll only need basic HTTP functions, which are almost interchangeable in every library. As for parsel, another great alternative is the beautifulsoup package.
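For example, a basic GET request looks nearly identical in either client:

import httpx
import requests

response = requests.get("https://www.zillow.com/")
# httpx has a near-identical interface, though it doesn't follow redirects by default
response = httpx.get("https://www.zillow.com/", follow_redirects=True)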

Finding Properties

We'll start off our scraper by taking a look at how we can find property listings. For this, Zillow provides a powerful search functionality. Let's take a look at how it functions and how we can use it in web scraping:

Inspecting Zillow's search functionality with Chrome Dev tools (accessed via the F12 key)

Above, we can see that once we submit our search, a background request is made to Zillow's search API. We send a search query with some map coordinates and receive hundreds of listing previews. We can see that to query Zillow we only need a few parameters:

{
  "searchQueryState": {
    "pagination": {},
    "usersSearchTerm": "New Haven, CT",
    "mapBounds": {
      "west": -73.03037621240235,
      "east": -72.82781578759766,
      "south": 41.23043771298298,
      "north": 41.36611033618769
    }
  },
  "wants": {
    "cat1": ["mapResults"]
  },
  "requestId": 2
}

We can see that this API is really powerful and allows us to find listings in any map area defined by two location points comprised of 4 direction values: north, west, south and east:

With these 4 values we can draw a square or circle area at any point of the map!

This means we can find properties in any location area as long as we know its latitude and longitude. We can replicate this request in our Python scraper:

from urllib.parse import urlencode
import json
import httpx

# we should use browser-like request headers to prevent being instantly blocked
BASE_HEADERS = {
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US,en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
}


url = "https://www.zillow.com/search/GetSearchPageState.htm?"
parameters = {
    "searchQueryState": {
        "pagination": {},
        "usersSearchTerm": "New Haven, CT",
        # map coordinates that indicate New Haven city's area
        "mapBounds": {
            "west": -73.03037621240235,
            "east": -72.82781578759766,
            "south": 41.23043771298298,
            "north": 41.36611033618769,
        },
    },
    "wants": {
        # cat1 stands for agent listings
        "cat1": ["mapResults"]
        # and cat2 for non-agent listings
        # "cat2":["mapResults"]
    },
    "requestId": 2,
}
# note: the search API expects its query values to be JSON-encoded,
# so we serialize them before url-encoding
response = httpx.get(
    url + urlencode({k: json.dumps(v) for k, v in parameters.items()}),
    headers=BASE_HEADERS,
)
data = response.json()
results = data["cat1"]["searchResults"]["mapResults"]
print(json.dumps(results, indent=2))
print(f"found {len(results)} property results")

We can see that this search request is rather easy to replicate. So, let's take a full look at how we can scrape it properly!

To scrape Zillow's search, we need geographical location details, which are difficult to come up with unless you're familiar with geographic programming. However, there's an easy way to find a location's geographical details: exploring Zillow's search page itself.
If we take a look at a search URL like zillow.com/homes/New-Haven,-CT_rb/, we can see the geographical details hidden away in the HTML body:

We can see the query and geo data of this search hidden in a page source comment

We can use simple regular expression patterns to extract these details and submit our geographically based search request. Let's see how we can do it in Python scraping code:

import json
import re
from random import randint
from urllib.parse import urlencode

from loguru import logger as log
import httpx

async def _search(query: str, session: httpx.AsyncClient, filters: dict = None, categories=("cat1", "cat2")):
    """base search function which is used by sale and rent search functions"""
    html_response = await session.get(f"https://www.zillow.com/homes/{query}_rb/")
    # find query data in search landing page
    query_data = json.loads(re.findall(r'"queryState":(\{.+}),\s*"filter', html_response.text)[0])
    if filters:
        query_data["filterState"] = filters

    # scrape search API
    url = "https://www.zillow.com/search/GetSearchPageState.htm?"
    found = []
    # cat1 - Agent Listings
    # cat2 - Other Listings
    for category in categories:
        full_query = {
            "searchQueryState": query_data,
            "wants": {category: ["mapResults"]},
            "requestId": randint(2, 10),
        }
        # note: the search API expects its query values to be JSON-encoded
        api_response = await session.get(url + urlencode({k: json.dumps(v) for k, v in full_query.items()}))
        data = api_response.json()
        _total = data["categoryTotals"][category]["totalResultCount"]
        if _total > 500:
            log.warning(f"query has more results ({_total}) than 500 result limit ")
        else:
            log.info(f"found {_total} results for query: {query}")
        map_results = data[category]["searchResults"]["mapResults"]
        found.extend(map_results)
    return found


async def search_sale(query: str, session: httpx.AsyncClient):
    """search properties that are for sale"""
    log.info(f"scraping sale search for: {query}")
    return await _search(query=query, session=session)


async def search_rent(query: str, session: httpx.AsyncClient):
    """search properites that are for rent"""
    log.info(f"scraping rent search for: {query}")
    filters = {
        "isForSaleForeclosure": {"value": False},
        "isMultiFamily": {"value": False},
        "isAllHomes": {"value": True},
        "isAuction": {"value": False},
        "isNewConstruction": {"value": False},
        "isForRent": {"value": True},
        "isLotLand": {"value": False},
        "isManufactured": {"value": False},
        "isForSaleByOwner": {"value": False},
        "isComingSoon": {"value": False},
        "isForSaleByAgent": {"value": False},
    }
    return await _search(query=query, session=session, filters=filters, categories=["cat1"])

Above, we define our search functions for scraping rent and sale listings. The first thing we notice is that rent and sale pages use the same search API; the rent search simply applies extra filters to exclude sale properties.

Let's run this scraper and see what results we receive:

Run code and example output
import json
import asyncio
import httpx

async def run():
    limits = httpx.Limits(max_connections=5)
    async with httpx.AsyncClient(limits=limits, timeout=httpx.Timeout(15.0), headers=BASE_HEADERS) as session:
        data = await search_rent("New Haven, CT", session)
        print(json.dumps(data, indent=2))


if __name__ == "__main__":
    asyncio.run(run())
[
  {
    "buildingId": "40.609608--73.960045",
    "lotId": 1004524429,
    "price": "From $295,000",
    "latLong": {
      "latitude": 40.609608,
      "longitude": -73.960045
    },
    "minBeds": 1,
    "minBaths": 1.0,
    "minArea": 1200,
    "imgSrc": "https://photos.zillowstatic.com/fp/3c0259c716fc4793a65838aa40af6350-p_e.jpg",
    "hasImage": true,
    "plid": "1611681",
    "isFeaturedListing": false,
    "unitCount": 2,
    "isBuilding": true,
    "address": "1625 E 13th St, Brooklyn, NY",
    "variableData": {},
    "badgeInfo": null,
    "statusType": "FOR_SALE",
    "statusText": "For Rent",
    "listingType": "",
    "isFavorite": false,
    "detailUrl": "/b/1625-e-13th-st-brooklyn-ny-5YGKWY/",
    "has3DModel": false,
    "hasAdditionalAttributions": false,
  },
...
]

Note that Zillow's search is limited to 500 properties per query, so we either need to search in smaller geographical squares (see the sketch below), or take the easier approach of using Zillow's zip code index, which covers all US zip codes - essentially small, predefined geographical zones!
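As a quick illustration of the square-splitting approach, here's a minimal sketch - a helper of our own, not part of Zillow's API - that splits a mapBounds box into four quadrants which can then be searched individually:

def split_map_bounds(bounds: dict) -> list:
    """split a mapBounds box into 4 quadrants to stay under the 500 result limit"""
    mid_lat = (bounds["north"] + bounds["south"]) / 2
    mid_lng = (bounds["east"] + bounds["west"]) / 2
    return [
        {"north": bounds["north"], "south": mid_lat, "west": bounds["west"], "east": mid_lng},  # north-west
        {"north": bounds["north"], "south": mid_lat, "west": mid_lng, "east": bounds["east"]},  # north-east
        {"north": mid_lat, "south": bounds["south"], "west": bounds["west"], "east": mid_lng},  # south-west
        {"north": mid_lat, "south": bounds["south"], "west": mid_lng, "east": bounds["east"]},  # south-east
    ]

Applying this recursively until every area returns fewer than 500 results would let us cover a region of any size.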

There's a bit of useful preview data about each listing, like the address, geolocation and some metadata, but to retrieve all of the listing data we need to scrape each individual property listing page, which is defined in the detailUrl field.
So, our scraper can discover properties via a location name (be it a city, zip code etc.), scrape the property previews and then follow all of the detailUrl fields to scrape the full property data. Next, let's take a look at how we can do that.
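One small detail first: the detailUrl values are relative paths, so we need to join them with Zillow's domain before scraping them. A minimal sketch:

from urllib.parse import urljoin

def to_property_urls(previews: list) -> list:
    """turn relative detailUrl fields of search previews into absolute property page URLs"""
    return [urljoin("https://www.zillow.com", preview["detailUrl"]) for preview in previews]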

Scraping Properties

Now that we've found our listing previews, we can extract the rest of the listing information by scraping each individual page. To start, let's take a look at where the data we want is located on a property page, like the one we found previously: zillow.com/b/1625-e-13th-st-brooklyn-ny-5YGKWY/

If we take a look at the page source of this page (or any other listing), we can see that the property data is hidden in the HTML body as a JavaScript variable:

We can see the property data is available as a JSON object in a script tag

This is generally referred to as a "JavaScript state cache" and is used by various JavaScript front-ends for dynamic data rendering. In this particular example, Zillow is using the Next.js framework.

Let's add property scraping and parsing to our scraper code:

import json
import asyncio
import httpx

from parsel import Selector
from typing import List

def parse_property(data: dict) -> dict:
    """parse zillow property"""
    # zillow property data is massive, let's take a look just
    # at the basic information to keep this tutorial brief:
    parsed = {
        "address": data["address"],
        "description": data["description"],
        "photos": [photo["url"] for photo in data["galleryPhotos"]],
        "zipcode": data["zipcode"],
        "phone": data["buildingPhoneNumber"],
        "name": data["buildingName"],
        # floor plans include price details, availability etc.
        "floor_plans": data["floorPlans"],
    }
    return parsed


async def scrape_properties(urls: List[str], session: httpx.AsyncClient):
    """scrape zillow properties"""
    async def scrape(url):
        resp = await session.get(url)
        sel = Selector(text=resp.text)
        data = sel.css("script#__NEXT_DATA__::text").get()
        data = json.loads(data)
        return parse_property(data["props"]["initialReduxState"]["gdp"]["building"])

    return await asyncio.gather(*[scrape(url) for url in urls])

Above, we define our scrape_properties function which, given a list of property URLs, will scrape their HTML pages, extract the embedded JavaScript state data and parse property info such as the address, prices and phone numbers!

Let's run this property scraper and see the results it generates:

Run code & example output
async def run():
    limits = httpx.Limits(max_connections=5)
    async with httpx.AsyncClient(limits=limits, timeout=httpx.Timeout(15.0), headers=BASE_HEADERS) as session:
        data = await scrape_properties(
            ["https://www.zillow.com/b/1625-e-13th-st-brooklyn-ny-5YGKWY/"], 
            session=session
        )
        print(json.dumps(data, indent=2))


if __name__ == "__main__":
    asyncio.run(run())
[
  {
    "address": {
      "streetAddress": "1065 2nd Ave",
      "city": "New York",
      "state": "NY",
      "zipcode": "10022",
      "__typename": "Address",
      "neighborhood": null
    },
    "description": "Inspired by Alvar Aaltos iconic vase, Aalto57s sculptural architecture reflects classic concepts of design both inside and out. Each residence in this boutique rental building features clean modern finishes. Amenities such as a landscaped terrace with gas grills, private and group dining areas, sun loungers, and fire feature as well as an indoor rock climbing wall, basketball court, game room, childrens playroom, guest suite, and a fitness center make Aalto57 a home like no other.",
    "photos": [
      "https://photos.zillowstatic.com/fp/0c1099a1882a904acc8cedcd83ebd9dc-p_d.jpg",
      "..."
    ],
    "zipcode": "10022",
    "phone": "646-681-3805",
    "name": "Aalto57",
    "floor_plans": [
      {
        "zpid": "2096631846",
        "__typename": "FloorPlan",
        "availableFrom": "1657004400000",
        "baths": 1,
        "beds": 1,
        "floorPlanUnitPhotos": [],
        "floorplanVRModel": null,
        "maxPrice": 6200,
        "minPrice": 6200,
        "name": "1 Bed/1 Bath-1D",
        ...
      }
    ...
  ]
}]

We wrote a quick Python scraper that finds Zillow's properties from a given query string and then scrapes each individual property page for property information. However, to run this scraper at scale without being blocked, we'll take a look at using the ScrapFly web scraping API, which will help us scale up our scraper and avoid blocking and captchas.

ScrapFly - Avoiding Blocking and Captchas

Scraping Zillow.com data doesn't seem to be too difficult, though unfortunately when scraping at scale it's very likely we'll be blocked or need to solve captchas, which will hinder or completely stop our web scraper.

To get around this, let's take advantage of ScrapFly API which can avoid all of these blocks for us!

Illustration of ScrapFly's middleware

ScrapFly offers several powerful features that'll help us get around Zillow's web scraper blocking, such as its anti-scraping protection bypass and residential proxy pools - both of which we use in the code below.

For this we'll be using the scrapfly-sdk python package. First, let's install it using pip:

$ pip install scrapfly-sdk

To take advantage of ScrapFly's API in our Zillow web scraper, all we need to do is replace our httpx session requests with scrapfly-sdk client requests.
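For example, a minimal fetch helper might look like this - a sketch using the same settings as the full code below:

from scrapfly import ScrapeConfig, ScrapflyClient

async def fetch_page(url: str, session: ScrapflyClient) -> str:
    """the ScrapFly equivalent of httpx's `await session.get(url)`"""
    result = await session.async_scrape(
        # asp=True enables ScrapFly's anti-scraping protection bypass
        ScrapeConfig(url=url, asp=True, country="US", proxy_pool="public_residential_pool")
    )
    # .content holds the page body, like httpx's response.text
    return result.content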

Full Scraper Code

Let's take a look at how our full scraper code would look with ScrapFly integration:

Full Scraper Code with ScrapFly integration
import asyncio
import json
import re
from random import randint
from typing import List
from urllib.parse import urlencode

from loguru import logger as log
from parsel import Selector
from scrapfly import ScrapeConfig, ScrapflyClient


async def _search(query: str, session: ScrapflyClient, filters: dict = None, categories=("cat1", "cat2")) -> List[dict]:
    """base search function which is used by sale and rent search functions"""
    html_result = await session.async_scrape(
        ScrapeConfig(
            url=f"https://www.zillow.com/homes/{query}_rb/",
            proxy_pool="public_residential_pool",
            country="US",
            asp=True,
        )
    )
    query_data = json.loads(re.findall(r'"queryState":(\{.+}),\s*"filter', html_result.content)[0])
    if filters:
        query_data["filterState"] = filters
    url = "https://www.zillow.com/search/GetSearchPageState.htm?"
    found = []
    # cat1 - Agent Listings
    # cat2 - Other Listings
    for category in categories:
        full_query = {
            "searchQueryState": query_data,
            "wants": {category: ["mapResults"]},
            "requestId": randint(2, 10),
        }
        api_result = await session.async_scrape(
            ScrapeConfig(
                # note: the search API expects its query values to be JSON-encoded
                url=url + urlencode({k: json.dumps(v) for k, v in full_query.items()}),
                proxy_pool="public_residential_pool",
                country="US",
                asp=True,
            )
        )
        data = json.loads(api_result.content)
        _total = data["categoryTotals"][category]["totalResultCount"]
        if _total > 500:
            log.warning(f"query has more results ({_total}) than 500 result limit ")
        else:
            log.info(f"found {_total} results for query: {query}")
        map_results = data[category]["searchResults"]["mapResults"]
        found.extend(map_results)
    return found


async def search_sale(query: str, session: ScrapflyClient) -> List[dict]:
    """search properties that are for sale"""
    log.info(f"scraping sale search for: {query}")
    return await _search(query=query, session=session)


async def search_rent(query: str, session: ScrapflyClient) -> List[dict]:
    """search properites that are for rent"""
    log.info(f"scraping rent search for: {query}")
    filters = {
        "isForSaleForeclosure": {"value": False},
        "isMultiFamily": {"value": False},
        "isAllHomes": {"value": True},
        "isAuction": {"value": False},
        "isNewConstruction": {"value": False},
        "isForRent": {"value": True},
        "isLotLand": {"value": False},
        "isManufactured": {"value": False},
        "isForSaleByOwner": {"value": False},
        "isComingSoon": {"value": False},
        "isForSaleByAgent": {"value": False},
    }
    return await _search(query=query, session=session, filters=filters, categories=["cat1"])


def parse_property(data: dict) -> dict:
    """parse zillow property"""
    # zillow property data is massive, let's take a look just
    # at the basic information to keep this tutorial brief:
    parsed = {
        "address": data["address"],
        "description": data["description"],
        "photos": [photo["url"] for photo in data["galleryPhotos"]],
        "zipcode": data["zipcode"],
        "phone": data["buildingPhoneNumber"],
        "name": data["buildingName"],
        # floor plans include price details, availability etc.
        "floor_plans": data["floorPlans"],
    }
    return parsed


async def scrape_properties(urls: List[str], session: ScrapflyClient):
    """scrape zillow properties"""

    async def scrape(url):
        result = await session.async_scrape(
            ScrapeConfig(url=url, asp=True, country="US", proxy_pool="public_residential_pool")
        )
        response = result.upstream_result_into_response()
        sel = Selector(text=response.text)
        data = sel.css("script#__NEXT_DATA__::text").get()
        data = json.loads(data)
        return parse_property(data["props"]["initialReduxState"]["gdp"]["building"])

    return await asyncio.gather(*[scrape(url) for url in urls])


async def run():
    with ScrapflyClient(key="YOUR_SCRAPFLY_KEY", max_concurrency=10) as session:
        rentals = await search_rent("New Haven, CT", session)
        sales = await search_sale("New Haven, CT", session)
        property_data = await scrape_properties(
            ["https://www.zillow.com/b/aalto57-new-york-ny-5twVDd/"], session=session
        )


if __name__ == "__main__":
    asyncio.run(run())


Summary

In this tutorial we built a Zillow.com scraper capable of using search to discover properties for sale or rent in any given region, as well as scraping all of the property data such as price and building information, contact details and so on.

For this we used Python with the httpx and parsel packages, and to avoid blocking we used ScrapFly's API, which smartly configures every web scraper connection. For more on ScrapFly, see our documentation and try it out for free!
