How to Scrape Zillow Real Estate Data in 2025
Learn how to scrape Zillow property, search, and real estate listing data using Python, directly as JSON.
In this web scraping tutorial, we'll look at how to scrape property data from Zillow - the biggest real estate marketplace in the United States.
In this Zillow data scraper, we'll extract real estate data, including rent and sale property information such as prices, addresses, photos, and other listing details. We'll start with a brief overview of how the website works and how to navigate it. Then, we'll explain how to use its search system for effective Zillow real estate data discovery, and finally, we'll extract the full property details. Let's get started!
This tutorial covers popular web scraping techniques for education. Interacting with public servers requires diligence and respect.
Scrapfly does not offer legal advice, but these are good general rules to follow in web scraping; for more, you should consult a lawyer.
Zillow.com contains a massive real estate dataset: prices, locations, contact information, etc. This is valuable information for market analytics, the study of the housing industry, and a general competitor overview.
This means that by web scraping Zillow, we have access to the biggest real estate market in the US!
For further details on data scraping use cases, refer to our extensive guide.
In this tutorial, we'll scrape Zillow using Python with two community packages:
- httpx - an HTTP client library we'll use to request Zillow's pages and search API.
- parsel - an HTML parsing library we'll use to select elements with XPath and CSS selectors.
Optionally, we'll also use loguru, a logging library that will allow us to track our Zillow data scraper.
These packages can be installed using the following pip command:
$ pip install httpx parsel loguru
Alternatively, feel free to replace httpx with any other HTTP client package, such as requests, as we'll only send basic HTTP requests. As for parsel, another great alternative is the beautifulsoup package.
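For example, here's what the same kind of data extraction could look like with requests and beautifulsoup4 instead. This is only a minimal sketch of the swap, not the approach used in the rest of this tutorial, and the property URL is a placeholder:

import json
import requests
from bs4 import BeautifulSoup

# request any Zillow property page (placeholder URL)
response = requests.get(
    "https://www.zillow.com/homedetails/...",
    headers={"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"},
)
soup = BeautifulSoup(response.text, "html.parser")
# select the hidden web data script tag (covered in the next section)
script = soup.find("script", id="__NEXT_DATA__")
data = json.loads(script.text) if script else None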
If you're new to web scraping with Python, we recommend checking out our full introduction tutorial with the common best practices.
To start, let's explore scraping Zillow data from property pages. First, let's locate the data in the HTML of a given Zillow page, like this one.
To scrape this page data, we can parse every detail using XPath or CSS selectors. However, there is a better approach: hidden web data. To find this data, follow the below steps:
1. Open the browser developer tools by pressing the F12 key.
2. Search the page HTML for the //script[@id='__NEXT_DATA__'] XPath selector.
After following the above steps, you will find the property dataset hidden in a JavaScript variable matched by the above XPath selector:
The above real estate data is the same data seen on the page, but captured before it gets rendered into the HTML. This is commonly known as hidden web data.
Learn what hidden data is through some common examples. You will also learn how to scrape it using regular expressions and other clever parsing algorithms.
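For instance, a hidden dataset like the one above can also be extracted with a regular expression instead of an HTML parser. A minimal sketch, assuming the dataset lives in a script tag with the __NEXT_DATA__ id:

import json
import re

def extract_next_data(html: str) -> dict:
    """find the hidden __NEXT_DATA__ JSON dataset in raw HTML"""
    match = re.search(r'<script id="__NEXT_DATA__"[^>]*>(.+?)</script>', html, re.DOTALL)
    if not match:
        raise ValueError("hidden dataset not found - layout change or blocked request")
    return json.loads(match.group(1))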
Let's power our Zillow data scraper with requesting and parsing logic for property pages:
import asyncio
from typing import List
import httpx
import json
from parsel import Selector
client = httpx.AsyncClient(
# enable http2
http2=True,
# add basic browser like headers to prevent being blocked
headers={
"accept-language": "en-US,en;q=0.9",
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
"accept-language": "en-US;en;q=0.9",
"accept-encoding": "gzip, deflate, br",
},
)
async def scrape_properties(urls: List[str]):
"""scrape zillow property pages for property data"""
to_scrape = [client.get(url) for url in urls]
results = []
for response in asyncio.as_completed(to_scrape):
response = await response
assert response.status_code == 200, "request has been blocked"
selector = Selector(response.text)
data = selector.css("script#__NEXT_DATA__::text").get()
if data:
# Option 1: some properties are located in NEXT DATA cache
data = json.loads(data)
property_data = json.loads(data["props"]["pageProps"]["componentProps"]["gdpClientCache"])
property_data = property_data[list(property_data)[0]]['property']
else:
# Option 2: other times it's in Apollo cache
data = selector.css("script#hdpApolloPreloadedData::text").get()
data = json.loads(json.loads(data)["apiCache"])
property_data = next(
v["property"] for k, v in data.items() if "ForSale" in k
)
results.append(property_data)
return results
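To run this httpx version, we can use the same example entry point as in the ScrapFly code below:

async def run():
    data = await scrape_properties(
        ["https://www.zillow.com/homedetails/1625-E-13th-St-APT-3K-Brooklyn-NY-11229/245001606_zpid/"]
    )
    print(json.dumps(data, indent=2))

if __name__ == "__main__":
    asyncio.run(run())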
import asyncio
import json
from typing import List
from scrapfly import ScrapeConfig, ScrapflyClient
scrapfly = ScrapflyClient(key="Your ScrapFly API key")
async def scrape_properties(urls: List[str]):
"""scrape zillow property pages for property data"""
to_scrape = [ScrapeConfig(url, asp=True, country="US") for url in urls]
results = []
async for result in scrapfly.concurrent_scrape(to_scrape):
data = result.selector.css("script#__NEXT_DATA__::text").get()
if data:
# Option 1: some properties are located in NEXT DATA cache
data = json.loads(data)
property_data = json.loads(data["props"]["pageProps"]["componentProps"]["gdpClientCache"])
property_data = property_data[list(property_data)[0]]['property']
else:
# Option 2: other times it's in Apollo cache
data = result.selector.css("script#hdpApolloPreloadedData::text").get()
data = json.loads(json.loads(data)["apiCache"])
property_data = next(v["property"] for k, v in data.items() if "ForSale" in k)
results.append(property_data)
return results
async def run():
data = await scrape_properties(
["https://www.zillow.com/homedetails/1625-E-13th-St-APT-3K-Brooklyn-NY-11229/245001606_zpid/"]
)
print(json.dumps(data, indent=2))
if __name__ == "__main__":
asyncio.run(run())
Let's break down the above code for scraping Zillow. We start by defining an httpx client with standard browser headers to avoid blocking. Then, we define a scrape_properties function, which requests each property page concurrently, parses the HTML, and loads the JSON dataset from the hidden script tag.
Here is what the extracted data from Zillow looks like:
[
{
"address": {
"streetAddress": "1065 2nd Ave",
"city": "New York",
"state": "NY",
"zipcode": "10022",
"__typename": "Address",
"neighborhood": null
},
"description": "Inspired by Alvar Aaltos iconic vase, Aalto57s sculptural architecture reflects classic concepts of design both inside and out. Each residence in this boutique rental building features clean modern finishes. Amenities such as a landscaped terrace with gas grills, private and group dining areas, sun loungers, and fire feature as well as an indoor rock climbing wall, basketball court, game room, childrens playroom, guest suite, and a fitness center make Aalto57 a home like no other.",
"photos": [
"https://photos.zillowstatic.com/fp/0c1099a1882a904acc8cedcd83ebd9dc-p_d.jpg",
"..."
],
"zipcode": "10022",
"phone": "646-681-3805",
"name": "Aalto57",
"floor_plans": [
{
"zpid": "2096631846",
"__typename": "FloorPlan",
"availableFrom": "1657004400000",
"baths": 1,
"beds": 1,
"floorPlanUnitPhotos": [],
"floorplanVRModel": null,
"maxPrice": 6200,
"minPrice": 6200,
"name": "1 Bed/1 Bath-1D",
...
}
...
]
}]
Cool! Our Zillow scraper can extract various details from the property web pages, including price, address, photos, and property structure. Next, let's explore extracting data from search pages!
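Since the full hidden dataset is quite large, it's often handy to reduce it to just the fields of interest. Here's a minimal sketch of such a parse step, using field names taken from the example output above:

def parse_property(property_data: dict) -> dict:
    """reduce the full hidden property dataset to a few fields of interest"""
    return {
        "address": property_data.get("address"),
        "description": property_data.get("description"),
        "phone": property_data.get("phone"),
        "photos": property_data.get("photos"),
        "floor_plans": property_data.get("floor_plans"),
    }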
Our previous code for scraping Zillow can extract data from a property page. In this section, we'll explore finding real estate listings using Zillow's search bar. Here is how the search system works under the hood:
Above, we can see that upon submitting a search query, a background request is sent to Zillow's search API. The search query includes the map coordinates as well as other comprehensive details. However, only a few query parameters are actually required:
{
"searchQueryState":{
"pagination":{},
"usersSearchTerm":"New Haven, CT",
"mapBounds":
{
"west":-73.03037621240235,
"east":-72.82781578759766,
"south":41.23043771298298,
"north":41.36611033618769
},
},
"wants": {
"cat1":["mapResults"]
},
"requestId": 2
}
The Zillow search API is really powerful, allowing us to find listings in any map area defined by a bounding box of four direction values: north, west, south, and east.
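If we only have a center coordinate, we can approximate such a bounding box ourselves. Note that make_map_bounds and its delta value are our own hypothetical helper, not part of Zillow's API:

def make_map_bounds(lat: float, lon: float, delta: float = 0.1) -> dict:
    """approximate a search bounding box around a center point (hypothetical helper)"""
    return {
        "west": lon - delta,
        "east": lon + delta,
        "south": lat - delta,
        "north": lat + delta,
    }

# e.g. a rough box around New Haven, CT:
bounds = make_map_bounds(41.298, -72.929)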
Let's replicate the logic for finding properties by location in our Zillow scraping code using the latitude and longitude values:
import json
import httpx
# we should use browser-like request headers to prevent being instantly blocked
BASE_HEADERS = {
"accept-language": "en-US,en;q=0.9",
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
"accept-language": "en-US;en;q=0.9",
"accept-encoding": "gzip, deflate, br",
}
url = "https://www.zillow.com/async-create-search-page-state"
body = {
"searchQueryState": {
"pagination": {},
"usersSearchTerm": "New Haven, CT",
# map coordinates that indicate New Haven city's area
"mapBounds": {
"west": -73.03037621240235,
"east": -72.82781578759766,
"south": 41.23043771298298,
"north": 41.36611033618769,
},
},
"wants": {"cat1": ["listResults", "mapResults"], "cat2": ["total"]},
"requestId": 2,
}
response = httpx.put(url, headers={**BASE_HEADERS, "content-type": "application/json"}, content=json.dumps(body))
assert response.status_code == 200, "request has been blocked"
data = response.json()
results = data["cat1"]["searchResults"]["mapResults"]
print(json.dumps(results, indent=2))
print(f"found {len(results)} property results")
import json
from scrapfly import ScrapeConfig, ScrapflyClient
scrapfly = ScrapflyClient(key="Your ScrapFly API key")
url = "https://www.zillow.com/async-create-search-page-state"
body = {
"searchQueryState": {
"pagination": {},
"usersSearchTerm": "New Haven, CT",
# map coordinates that indicate New Haven city's area
"mapBounds": {
"west": -73.03037621240235,
"east": -72.82781578759766,
"south": 41.23043771298298,
"north": 41.36611033618769,
},
},
"wants": {"cat1": ["listResults", "mapResults"], "cat2": ["total"]},
"requestId": 2,
}
response = scrapfly.scrape(
ScrapeConfig(
url,
asp=True,
country="US",
headers={"content-type": "application/json"},
body=json.dumps(body),
method="PUT",
)
)
data = json.loads(response.content)
results = data["cat1"]["searchResults"]["mapResults"]
print(json.dumps(results, indent=2))
print(f"found {len(results)} property results")
We can successfully replicate the search request. Next, we'll utilize it to scrape the search pages.
To scrape Zillow search, we need the geographical location details, which can be challenging to get. Therefore, we'll extract the location's geographical details from an easier user interface: search pages. To illustrate this, go to any search URL on Zillow, like zillow.com/homes/New-Haven,-CT_rb/. You will find the geographical details hidden in the HTML:
The geographical details exist in the script tag. Let's use it to scrape Zillow data from search pages:
import json
import httpx
import random
import asyncio
from loguru import logger as log
from parsel import Selector
def create_search_payload(
query_data: dict, filters: dict = None, page_number: int = None
):
"""create a search payload for Zillow's search API"""
payload = {
"searchQueryState": query_data,
"wants": {"cat1": ["listResults", "mapResults"], "cat2": ["total"]},
"requestId": random.randint(2, 10),
}
if filters:
query_data["filterState"] = filters
if page_number:
payload["searchQueryState"]["pagination"] = {"currentPage": page_number}
return json.dumps(payload)
async def _search(
query: str,
session: httpx.AsyncClient,
filters: dict = None,
max_scrape_pages: int = None,
):
"""base search function which is used by sale and rent search functions"""
html_response = await session.get(f"https://www.zillow.com/homes/{query}_rb/")
    assert html_response.status_code == 200, "request is blocked"
selector = Selector(html_response.text)
# find query data in script tags
try:
script_data = json.loads(
selector.xpath("//script[@id='__NEXT_DATA__']/text()").get()
)
    except Exception:
log.error("request is blocked, use Scrapfly code tab")
return
query_data = script_data["props"]["pageProps"]["searchPageState"]["queryState"]
# scrape search API
url = "https://www.zillow.com/async-create-search-page-state"
search_data = []
    api_response = await session.put(
        url,
        headers={"content-type": "application/json"},
        content=create_search_payload(query_data=query_data, filters=filters),
    )
data = api_response.json()
property_data = data["cat1"]["searchResults"]["listResults"]
search_data.extend(property_data)
_total_pages = data["cat1"]["searchList"]["totalPages"]
# if no pagination data, return
if _total_pages == 1:
log.success(f"scraped {len(search_data)} properties from search pages")
return search_data
# else paginate remaining pages
if max_scrape_pages and max_scrape_pages < _total_pages:
_total_pages = max_scrape_pages
log.info(f"scraping search pagination, {_total_pages} more pages remaining")
to_scrape = [
await session.put(
url,
headers={"content-type": "application/json"},
body=create_search_payload(
query_data=query_data, filters=filters, page_number=page
),
)
for page in range(2, _total_pages + 1)
]
for response in asyncio.as_completed(to_scrape):
response = await response
data = api_response.json()
property_data = data["cat1"]["searchResults"]["listResults"]
search_data.extend(property_data)
log.success(f"scraped {len(search_data)} properties from search pages")
return search_data
# Example usages 👇
async def search_sale(query: str, session: httpx.AsyncClient):
"""search properties that are for sale"""
log.info(f"scraping sale search for: {query}")
return await _search(query=query, session=session, max_scrape_pages=3)
async def search_rent(query: str, session: httpx.AsyncClient):
"""search properites that are for rent"""
log.info(f"scraping rent search for: {query}")
filters = {
"isForSaleForeclosure": {"value": False},
"isMultiFamily": {"value": False},
"isAllHomes": {"value": True},
"isAuction": {"value": False},
"isNewConstruction": {"value": False},
"isForRent": {"value": True},
"isLotLand": {"value": False},
"isManufactured": {"value": False},
"isForSaleByOwner": {"value": False},
"isComingSoon": {"value": False},
"isForSaleByAgent": {"value": False},
}
return await _search(
query=query, session=session, filters=filters, max_scrape_pages=3
)
BASE_HEADERS = {
"accept-language": "en-US,en;q=0.9",
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
"accept-language": "en-US;en;q=0.9",
"accept-encoding": "gzip, deflate, br",
}
async def run():
limits = httpx.Limits(max_connections=5)
async with httpx.AsyncClient(limits=limits, timeout=httpx.Timeout(15.0), headers=BASE_HEADERS) as session:
data = await search_rent("New Haven, CT", session)
print(json.dumps(data, indent=2))
if __name__ == "__main__":
asyncio.run(run())
import json
import random
import asyncio
from typing import List
from loguru import logger as log
from scrapfly import ScrapeConfig, ScrapflyClient
SCRAPFLY = ScrapflyClient(key="Your ScrapFly API key")
BASE_CONFIG = {
# Zillow.com requires Anti Scraping Protection bypass feature:
"asp": True,
"country": "US",
}
def create_search_payload(
query_data: dict, filters: dict = None, page_number: int = None
):
"""create a search payload for Zillow's search API"""
payload = {
"searchQueryState": query_data,
"wants": {"cat1": ["listResults", "mapResults"], "cat2": ["total"]},
"requestId": random.randint(2, 10),
}
if filters:
query_data["filterState"] = filters
if page_number:
payload["searchQueryState"]["pagination"] = {"currentPage": page_number}
return json.dumps(payload)
async def _search(
query: str, filters: dict = None, max_scrape_pages: int = None
) -> List[dict]:
"""base search function which is used by sale and rent search functions"""
search_data = []
url = f"https://www.zillow.com/homes/{query}_rb/"
log.info(f"scraping search: {url}")
# first scrape the search HTML page and find query variables for this search
html_result = await SCRAPFLY.async_scrape(ScrapeConfig(url, **BASE_CONFIG))
script_data = json.loads(
html_result.selector.xpath("//script[@id='__NEXT_DATA__']/text()").get()
)
query_data = script_data["props"]["pageProps"]["searchPageState"]["queryState"]
# then scrape Zillow's backend API for all query results:
_backend_url = "https://www.zillow.com/async-create-search-page-state"
api_result = await SCRAPFLY.async_scrape(
ScrapeConfig(
_backend_url,
**BASE_CONFIG,
headers={"content-type": "application/json"},
body=create_search_payload(query_data=query_data, filters=filters),
method="PUT",
)
)
data = json.loads(api_result.content)
property_data = data["cat1"]["searchResults"]["listResults"]
search_data.extend(property_data)
_total_pages = data["cat1"]["searchList"]["totalPages"]
# if no pagination data, return
if _total_pages == 1:
log.success(f"scraped {len(search_data)} properties from search pages")
return search_data
# else paginate remaining pages
if max_scrape_pages and max_scrape_pages < _total_pages:
_total_pages = max_scrape_pages
    log.info(f"scraping search pagination, {_total_pages - 1} more pages remaining")
to_scrape = [
ScrapeConfig(
_backend_url,
**BASE_CONFIG,
headers={"content-type": "application/json"},
body=create_search_payload(
query_data=query_data, filters=filters, page_number=page
),
method="PUT",
)
for page in range(2, _total_pages + 1)
]
async for result in SCRAPFLY.concurrent_scrape(to_scrape):
property_data = json.loads(result.content)["cat1"]["searchResults"][
"listResults"
]
search_data.extend(property_data)
log.success(f"scraped {len(search_data)} properties from search pages")
return search_data
# Example usages 👇
async def search_sale(query: str):
"""search properties that are for sale"""
log.info(f"scraping sale search for: {query}")
return await _search(query=query, max_scrape_pages=3)
async def search_rent(query: str):
"""search properites that are for rent"""
log.info(f"scraping rent search for: {query}")
filters = {
"isForSaleForeclosure": {"value": False},
"isMultiFamily": {"value": False},
"isAllHomes": {"value": True},
"isAuction": {"value": False},
"isNewConstruction": {"value": False},
"isForRent": {"value": True},
"isLotLand": {"value": False},
"isManufactured": {"value": False},
"isForSaleByOwner": {"value": False},
"isComingSoon": {"value": False},
"isForSaleByAgent": {"value": False},
}
return await _search(query=query, filters=filters, max_scrape_pages=3)
async def run():
data = await search_rent("New Haven, CT")
print(json.dumps(data, indent=2))
if __name__ == "__main__":
asyncio.run(run())
Let's break down the above Zillow scraper code. We use the _search
function to request the search page first and parse the HTML for the location details. Then, we use the location details to request the search API, either for sale or rent real estate data.
Executing the above scraping code will extract the following data from Zillow:
[
{
"zpid": "2052939967",
"rawHomeStatusCd": "ForSale",
"marketingStatusSimplifiedCd": "For Sale by Agent",
"imgSrc": "https://photos.zillowstatic.com/fp/2a7635a58d8e19762acc923e6c938551-p_e.jpg",
"hasImage": true,
"detailUrl": "/homedetails/755-757-Sutter-St-San-Francisco-CA-94109/2052939967_zpid/",
"statusType": "FOR_SALE",
"statusText": "Multi-family home for sale",
"price": "$7,800,000",
"priceLabel": "$7.80M",
"address": "755-757 Sutter St, San Francisco, CA 94109",
"baths": 0.0,
"area": 22572,
"latLong": {
"latitude": 37.78844,
"longitude": -122.41274
},
"variableData": {
"type": "TIME_ON_INFO",
"text": "1 hour ago",
"data": {
"isRead": null,
"isFresh": false
}
},
"hdpData": {
"homeInfo": {
"zpid": 2052939967,
"streetAddress": "755-757 Sutter St",
"zipcode": "94109",
"city": "San Francisco",
"state": "CA",
"latitude": 37.78844,
"longitude": -122.41274,
"price": 7800000.0,
"bathrooms": 0.0,
"livingArea": 22572.0,
"homeType": "MULTI_FAMILY",
"homeStatus": "FOR_SALE",
"daysOnZillow": -1,
"isFeatured": false,
"shouldHighlight": false,
"listing_sub_type": {
"is_FSBA": true
},
"isUnmappable": false,
"isPreforeclosureAuction": false,
"homeStatusForHDP": "FOR_SALE",
"priceForHDP": 7800000.0,
"timeOnZillow": 5508000,
"isNonOwnerOccupied": true,
"isPremierBuilder": false,
"isZillowOwned": false,
"currency": "USD",
"country": "USA",
"lotAreaValue": 5568.0,
"lotAreaUnit": "sqft",
"isShowcaseListing": false
}
},
"isUserClaimingOwner": false,
"isUserConfirmedClaim": false,
"pgapt": "ForSale",
"sgapt": "For Sale (Broker)",
"shouldShowZestimateAsPrice": false,
"has3DModel": false,
"hasVideo": false,
"isHomeRec": false,
"hasAdditionalAttributions": true,
"isFeaturedListing": false,
"isShowcaseListing": false,
"listingType": "",
"isFavorite": false,
"visited": false,
"info6String": "Dustin Dolby DRE #01963487",
"brokerName": "Colliers International",
"timeOnZillow": 5508000
},
...
]
The search results provided valuable information about each listing, such as the address, geolocation, and metadata. However, in order to obtain all of the relevant listing details, we must scrape each individual property listing page, which can be found in the detailUrl
field.
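Since detailUrl is relative, we can join it with the Zillow domain and feed the result straight into the scrape_properties function from the first section. A short sketch combining the two (using the httpx variants; scrape_search_properties is our own hypothetical helper):

async def scrape_search_properties(query: str, session: httpx.AsyncClient):
    """find sale listings for a query, then scrape each property page"""
    search_results = await search_sale(query, session)
    # detailUrl is relative, e.g. "/homedetails/...", so prepend the domain
    property_urls = ["https://www.zillow.com" + result["detailUrl"] for result in search_results]
    return await scrape_properties(property_urls)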
Note the Zillow search is limited to 500 properties per query. Therefore, we have to use smaller geographical zones to scrape real estate data precisely. For this, refer to the Zillow zip code index page.
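One way to stay under this cap is to split a large bounding box into a grid of smaller mapBounds boxes and run a search query per cell. A hypothetical helper sketch:

def split_map_bounds(bounds: dict, rows: int = 2, cols: int = 2) -> list:
    """split a mapBounds box into a rows x cols grid of smaller boxes (hypothetical helper)"""
    lat_step = (bounds["north"] - bounds["south"]) / rows
    lon_step = (bounds["east"] - bounds["west"]) / cols
    cells = []
    for i in range(rows):
        for j in range(cols):
            cells.append({
                "south": bounds["south"] + i * lat_step,
                "north": bounds["south"] + (i + 1) * lat_step,
                "west": bounds["west"] + j * lon_step,
                "east": bounds["west"] + (j + 1) * lon_step,
            })
    return cells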
Our scraping Zillow code can successfully extract Zillow real estate data from property and search pages. However, running the scrape at scale will lead the website to block our HTTP requests. Let's have a look at avoiding Zillow web scraping blocking next!
Creating a Zillow data scraper doesn't seem to be complicated. However, scraping blocking will get in our way, such as CAPTCHAs or IP address blocking. This is where Scrapfly can lend a hand!
ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale.
For example, here's how we can scrape Zillow without getting blocked. All we have to do is replace our HTTP client with the ScrapFly API client, enable the asp parameter, and select a proxy country:
# standard web scraping code
import httpx
from parsel import Selector
response = httpx.get("some zillow.com URL")
selector = Selector(response.text)
# in ScrapFly becomes this 👇
from scrapfly import ScrapeConfig, ScrapflyClient
# replaces your HTTP client (httpx in this case)
scrapfly = ScrapflyClient(key="Your ScrapFly API key")
response = scrapfly.scrape(ScrapeConfig(
url="website URL",
asp=True, # enable the anti scraping protection to bypass blocking
country="US", # set the proxy location to a specfic country
render_js=True # enable rendering JavaScript (like headless browsers) to scrape dynamic content if needed
))
# use the built in Parsel selector
selector = response.selector
# access the HTML content
html = response.scrape_result['content']
To wrap this guide up, let's take a look at some frequently asked questions about web scraping Zillow real estate data:
Is web scraping Zillow legal?
Yes. Zillow's data is publicly available; we're not extracting anything personal or private. Scraping Zillow.com at slow, respectful rates would fall under the ethical scraping definition.
That being said, attention should be paid to GDPR compliance in the EU when scraping personal data of non-agent listings (seller's name, phone number, etc.). For more, see our Is Web Scraping Legal? article.
Does Zillow have a public API?
Zillow does offer some official APIs, but they're extremely limited, not suitable for dataset collection, and not available for general public use. Instead, we can scrape Zillow data with Python and httpx.
How to crawl Zillow?
We can easily create a Zillow crawler with the subjects we've covered in this tutorial. Instead of searching for properties explicitly, we can crawl Zillow properties from seed links (any Zillow URLs) and follow the related properties mentioned in a loop, as sketched below. For more on crawling, see How to Crawl the Web with Python.
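Here's a rough sketch of such a crawl loop. It reuses the httpx client defined earlier and discovers new property links with a simple regex; treat it as an illustration rather than a production crawler:

import re

async def crawl_properties(seed_urls: list, max_properties: int = 20):
    """breadth-first crawl: scrape seed pages and follow discovered /homedetails/ links"""
    seen = set(seed_urls)
    queue = list(seed_urls)
    crawled = []
    while queue and len(crawled) < max_properties:
        url = queue.pop(0)
        response = await client.get(url)  # the httpx client defined earlier
        crawled.append(url)
        # discover related property links in the raw HTML
        for path in set(re.findall(r'"(/homedetails/[^"]+)"', response.text)):
            full_url = "https://www.zillow.com" + path
            if full_url not in seen:
                seen.add(full_url)
                queue.append(full_url)
    return crawled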
Are there alternatives to Zillow?
Yes, Redfin is another popular real estate marketplace in the United States. We have covered scraping Redfin in a previous guide. For more guides on real estate target websites, refer to our #realestate blog tag.
In this guide, we explained scraping real estate data from Zillow.
We searched for real estate properties for sale or rent in any region and used hidden web data scraping to extract Zillow's state cache from the HTML pages, capturing property details such as prices, building information, and contact details.
For this, we used Python with the httpx and parsel packages, and to avoid Zillow scraper blocking, we used ScrapFly's API, which smartly configures every web scraper connection to avoid being blocked.