How to Scrape Redfin Real Estate Property Data in Python
In this web scraping tutorial, we'll be taking a look at how to scrape Redfin - a popular real estate listing website.
We'll be scraping real estate data such as pricing info, addresses, photos and phone numbers displayed on Redfin property pages.
To scrape Redfin properties we'll be using the hidden web data scraping method. We'll also take a look at how to find real estate properties using Redfin's search and sitemap system to collect the entire real estate dataset available on the website.
Finally, we'll also cover property tracking by continuously scraping for newly listed or updated properties, giving us an upper hand in real estate bidding. We'll be using Python with a few community libraries - let's dive in!
Redfin.com is one of the biggest real estate websites in the United States, making it the biggest public real estate dataset out there, containing fields like real estate prices, listing locations, sale dates and general property information.
This is valuable information for market analytics, the study of the housing industry, and a general competitor overview. By web scraping Redfin we can easily have access to a major real estate dataset.
See our Scraping Use Cases guide for more.
We can scrape Redfin for several popular real estate datafields and targets:
Properties for sale
Land for sale
Open house events
Properties for rent
Real estate agent info
In this guide, we'll focus on scraping real estate property listings (rent and sale), though everything we'll learn can be easily applied to other pages.
Setup
In this tutorial, we'll be using Python with two community packages:
httpx - HTTP client library which will let us communicate with Redfin.com's servers
parsel - HTML parsing library which will help us to parse our web scraped HTML files.
jmespath - JSON parsing library. It allows us to write XPath-like rules for JSON.
These packages can be easily installed via the pip install command:
$ pip install httpx parsel jmespath
Alternatively, feel free to swap httpx out with any other HTTP client package such as requests as we'll only need basic HTTP functions which are almost interchangeable in every library. As for parsel, another great alternative is the beautifulsoup package.
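For reference, here's a rough sketch of the same kind of request made with requests instead of httpx (synchronous, so the async scraping functions later in this article would need adjusting; the listing URL is one of the examples used later on):
import requests

# fetch a Redfin property page with browser-like headers
response = requests.get(
    "https://www.redfin.com/FL/Cape-Coral/402-SW-28th-St-33914/home/61856041",
    headers={
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36",
    },
)
print(response.status_code)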
Scraping Property Data
To start let's take a look at how to scrape property data of a single listing page.
Redfin is using Next.js for rendering its pages. We can take advantage of this fact and scrape the hidden web data instead of parsing the HTML directly. This might appear a bit complex, so if you're unfamiliar with hidden web data scraping, see our introduction article:
Redfin's hidden dataset contains all of the property data and more. In this scenario, the property data is located in a JavaScript variable __reactServerState.InitialContext:
If we click "view source" and scroll to the bottom, we can see a script element containing the page cache
To extract the whole dataset we will:
Find the script element which contains this javascript variable
Use regular expressions to find the variable's value
Load it as a Python dictionary and clean up the dataset
Let's see it in action:
import json
import re
import asyncio
from typing import List

from httpx import AsyncClient, Response
from parsel import Selector
session = AsyncClient(headers={
    # use same headers as a popular web browser (Chrome on macOS in this case)
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"Accept-Language": "en-US,en;q=0.9",
})
def extract_cache(react_initial_context):
"""extract microservice cache data from the react server agent"""
result = {}
for name, cache in react_initial_context["ReactServerAgent.cache"]["dataCache"].items():
# first we retrieve cached response and see whether it's a success
try:
cache_response = cache["res"]
except KeyError: # empty cache
continue
if cache_response.get("status") != 200:
print("skipping non 200 cache")
continue
# then extract cached response body and interpret it as a JSON
cache_data = cache_response.get("body", {}).get("payload")
if not cache_data:
cache_data = json.loads(cache_response["text"].split("&&", 1)[-1]).get("payload")
if not cache_data:
# skip empty caches
continue
# for Redfin we can cleanup cache names for home data endpoints:
if "/home/details" in name:
name = name.split("/home/details/")[-1]
result[name.replace("/", "")] = cache_data
# ^note: we sanitize name to avoid slashes as they are not allowed in JMESPath
return result
def parse_property(response: Response):
selector = Selector(response.text)
script = selector.xpath('//script[contains(.,"ServerState.InitialContext")]/text()').get()
initial_context = re.findall(r"ServerState.InitialContext = (\{.+\});", script)
if not initial_context:
print(f"page {response.url} is not a property listing page")
return
return extract_cache(json.loads(initial_context[0]))
async def scrape_properties(urls: List[str]) -> List[dict]:
to_scrape = [session.get(url) for url in urls]
properties = []
for response in asyncio.as_completed(to_scrape):
properties.append(parse_property(await response))
return properties
To run our scraper all we have to do is call the asyncio coroutine:
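For example, here's a minimal run sketch, reusing one of the example listing URLs from later in this article and the imports from the scraper above:
if __name__ == "__main__":
    # scrape a single example property listing
    urls = ["https://www.redfin.com/FL/Cape-Coral/402-SW-28th-St-33914/home/61856041"]
    results = asyncio.run(scrape_properties(urls))
    print(json.dumps(results, indent=2, default=str))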
Above, we used httpx to retrieve the HTML page and load it as a parsel.Selector. Then, we find the script element which contains the JavaScript cache variable. To extract the cache we use a simple regular expression that captures the text between the InitialContext keyword and the closing }; characters.
This results in a colossal Redfin property dataset, and since it's an internal web dataset it's full of technical data fields, meaning we have to do a bit of cleanup. Let's parse it!
Parsing Data
The dataset we scraped is huge and contains loads of data we don't need. To trim it down to something we can digest, we'll be using JMESPath - a popular JSON parsing syntax.
JMESPath is a bit similar to XPath or CSS selectors but for JSON. Using it, we can create path rules of where to find the data fields we want to keep.
For example, for the price field we'll be using this JMESPath path:
aboveTheFold.addressSectionInfo.priceInfo.amount
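To illustrate how such a path works, here's a quick standalone sketch using a made-up nested dictionary (not real Redfin data):
import jmespath

data = {"aboveTheFold": {"addressSectionInfo": {"priceInfo": {"amount": 285000}}}}
# the path walks the nested keys and returns the value at the end
print(jmespath.search("aboveTheFold.addressSectionInfo.priceInfo.amount", data))
# 285000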
Let's take a look at the whole parser:
from typing import Dict, List, TypedDict
import jmespath
class PropertyResult(TypedDict):
"""type hint for property result. i.e. Defines what fields are expected in property dataset"""
photos: List[str]
videos: List[str]
price: int
info: Dict[str, str]
amenities: List[Dict[str, str]]
records: Dict[str, str]
history: Dict[str, str]
floorplan: Dict[str, str]
activity: Dict[str, str]
def parse_redfin_property_cache(data_cache) -> PropertyResult:
    """parse Redfin's cache data for property information"""
# here we define field name to JMESPath mapping
parse_map = {
# from top area of the page: basic info, videos and photos
"photos": "aboveTheFold.mediaBrowserInfo.photos[*].photoUrls.fullScreenPhotoUrl",
"videos": "aboveTheFold.mediaBrowserInfo.videos[*].videoUrl",
"price": "aboveTheFold.addressSectionInfo.priceInfo.amount",
"info": """aboveTheFold.addressSectionInfo.{
bed_num: beds,
        bath_num: baths,
full_baths_num: numFullBaths,
sqFt: sqFt,
        year_built: yearBuilt,
city: city,
state: state,
zip: zip,
country_code: countryCode,
fips: fips,
apn: apn,
redfin_age: timeOnRedfin,
cumulative_days_on_market: cumulativeDaysOnMarket,
property_type: propertyType,
listing_type: listingType,
url: url
}
""",
# from bottom area of the page: amenities, records and event history
"amenities": """belowTheFold.amenitiesInfo.superGroups[].amenityGroups[].amenityEntries[].{
name: amenityName, values: amenityValues
}""",
"records": "belowTheFold.publicRecordsInfo",
"history": "belowTheFold.propertyHistoryInfo",
# other: sometimes there are floorplans
"floorplan": r"listingfloorplans.floorPlans",
# and there's always internal Redfin performance info: views, saves, etc.
"activity": "activityInfo",
}
results = {}
for key, path in parse_map.items():
value = jmespath.search(path, data_cache)
results[key] = value
return results
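To plug this parser into the parse_property function we wrote earlier, we just pass the extracted cache through it before returning (this mirrors the full scraper code at the end of the article):
def parse_property(response: Response):
    selector = Selector(response.text)
    script = selector.xpath('//script[contains(.,"ServerState.InitialContext")]/text()').get()
    initial_context = re.findall(r"ServerState.InitialContext = (\{.+\});", script)
    if not initial_context:
        print(f"page {response.url} is not a property listing page")
        return
    # extract the hidden cache and reduce it to the fields we care about
    return parse_redfin_property_cache(extract_cache(json.loads(initial_context[0])))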
We've reduced the thousands-of-lines-long Redfin property dataset to just a few dozen of the most important fields using JMESPath and Python.
We can see how easy it is to scrape modern websites with modern scraping tools - to scrape a Redfin property we used only a few lines of Python code. Next, let's take a look at how we can find listings to scrape.
Finding Properties
There are several ways to find listings on Redfin for scraping, though the most obvious and fastest way is to use Redfin's sitemaps.
Redfin offers an extensive sitemap system that contains sitemaps for listings by US state, neighborhood, school district and so on. For that let's take a look at the /robots.txt page, specifically the sitemap section.
For example, there are sitemaps for all location directories, as well as feeds for newly listed and recently updated properties.
⌚ Note that these sitemaps use the UTC-8 timezone, indicated by the trailing offset of the datetime string: -08:00.
To scrape these Redfin feeds in Python we'll be using the httpx and parsel libraries we've used before:
import arrow  # for handling datetime: pip install arrow
from datetime import datetime
from typing import Dict

from httpx import AsyncClient
from parsel import Selector
session = AsyncClient(headers={
    # use same headers as a popular web browser (Chrome on macOS in this case)
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"Accept-Language": "en-US,en;q=0.9",
})
async def scrape_feed(url) -> Dict[str, datetime]:
"""scrape Redfin sitemap and return url:datetime dictionary"""
result = await session.get(url)
selector = Selector(text=result.text, type="xml")
results = {}
for item in selector.xpath("//url"):
url = item.xpath("loc/text()").get()
        pub_date = item.xpath("lastmod/text()").get()
results[url] = arrow.get(pub_date).datetime
return results
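For example, we can grab Redfin's feed of newly created listings (the same feed URL is used in the full scraper code at the end of this article):
import asyncio

feed = asyncio.run(scrape_feed("https://www.redfin.com/newest_listings.xml"))
# print a few of the most recently listed property URLs and their publish dates
for url, pub_date in list(feed.items())[:3]:
    print(pub_date, url)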
We can then use the Python Redfin scraper we wrote earlier to scrape these URLs for property datasets.
Avoiding Blocking with ScrapFly
Scraping Redfin.com seems very straightforward, though when scraping at scale our scrapers are very likely to be blocked or asked to solve captchas.
Redfin.com can block web scrapers: 'our usage analysis algorithms think that you might be a robot'
To get around this, let's take advantage of ScrapFly API which can avoid all of these blocks for us!
ScrapFly offers several powerful features that'll help us get around Redfin's web scraper blocking, such as anti-scraping protection bypass and proxy country selection.
To take advantage of ScrapFly's API in our Redfin.com web scraper all we need to do is replace our httpx session code with scrapfly-sdk client requests:
import httpx
response = httpx.get("some redfin.com url")
# in ScrapFly SDK becomes
from scrapfly import ScrapflyClient, ScrapeConfig
client = ScrapflyClient("YOUR SCRAPFLY KEY")
result = client.scrape(ScrapeConfig(
"some Redfin.ocm url",
# we can select specific proxy country
country="US",
# and enable anti scraping protection bypass:
asp=True
))
For more on how to scrape Redfin.com using ScrapFly, see the Full Scraper Code section.
FAQ
To wrap this guide up, let's take a look at some frequently asked questions about web scraping Redfin data:
Is it legal to scrape Redfin.com?
Yes. Redfin.com's data is available publicly; we're not collecting anything private. Scraping Redfin at slow, respectful rates would fall under the ethical scraping definition.
That being said, attention should be paid to GDPR compliance in the EU when storing personal data such as the seller's name, phone number, etc. For more, see our Is Web Scraping Legal? article.
Does Redfin.com have an API?
No, there's no Redfin API for real estate data. However, Redfin does publish market summary datasets in its data-center section. For detailed property data, we can scrape Redfin using Python.
How to crawl Redfin.com?
Like scraping, we can also crawl redfin.com by following the related rental pages listed on every property page. To write a Redfin crawler, see the related properties fields in the datasets scraped in this tutorial.
Summary
In this tutorial, we built a Redfin scraper in Python with a few free community packages. We started by taking a look at how to scrape a single property page by extracting hidden web cache data.
To parse property data we used the JMESPath JSON parsing language to write a few simple rules which reduced the scraped dataset to the vital property data fields.
Finally, to find property listings and track new/updated ones we explored Redfin's sitemap system.
For this Redfin data scraper we used Python with the httpx, parsel and jmespath packages. To avoid being blocked we used ScrapFly's API, which smartly configures every web scraper connection.
For more about ScrapFly, see our documentation and try it out for FREE!
Full Scraper Code
Here's the full Redfin web scraper code with ScrapFly integration:
💙 This code should only be used as a reference. To scrape data from Redfin at scale you'll need some error handling, logging and retry logic.
import asyncio
import json
import re
from datetime import datetime
from pathlib import Path
from typing import Dict, List
import arrow
import jmespath
from scrapfly import ScrapeApiResponse, ScrapeConfig, ScrapflyClient
from typing_extensions import TypedDict
scrapfly = ScrapflyClient(key="YOUR SCRAPFLY KEY", max_concurrency=2)
def extract_cache(react_initial_context):
"""extract microservice cache data from the react server agent"""
result = {}
for name, cache in react_initial_context["ReactServerAgent.cache"]["dataCache"].items():
# first we retrieve cached response and see whether it's a success
try:
cache_response = cache["res"]
except KeyError: # empty cache
continue
if cache_response.get("status") != 200:
print("skipping non 200 cache")
continue
# then extract cached response body and interpret it as a JSON
cache_data = cache_response.get("body", {}).get("payload")
if not cache_data:
cache_data = json.loads(cache_response["text"].split("&&", 1)[-1]).get("payload")
if not cache_data:
# skip empty caches
continue
# for Redfin we can cleanup cache names for home data endpoints:
if "/home/details" in name:
name = name.split("/home/details/")[-1]
result[name.replace("/", "")] = cache_data
# ^note: we sanitize name to avoid slashes as they are not allowed in JMESPath
return result
class PropertyResult(TypedDict):
"""type hint for property result. i.e. Defines what fields are expected in property dataset"""
photos: List[str]
videos: List[str]
price: int
info: Dict[str, str]
amenities: List[Dict[str, str]]
records: Dict[str, str]
history: Dict[str, str]
floorplan: Dict[str, str]
activity: Dict[str, str]
def parse_redfin_property_cache(data_cache) -> PropertyResult:
    """parse Redfin's cache data for property information"""
# here we define field name to JMESPath mapping
parse_map = {
# from top area of the page: basic info, videos and photos
"photos": "aboveTheFold.mediaBrowserInfo.photos[*].photoUrls.fullScreenPhotoUrl",
"videos": "aboveTheFold.mediaBrowserInfo.videos[*].videoUrl",
"price": "aboveTheFold.addressSectionInfo.priceInfo.amount",
"info": """aboveTheFold.addressSectionInfo.{
bed_num: beds,
        bath_num: baths,
full_baths_num: numFullBaths,
sqFt: sqFt,
        year_built: yearBuilt,
city: city,
state: state,
zip: zip,
country_code: countryCode,
fips: fips,
apn: apn,
redfin_age: timeOnRedfin,
cumulative_days_on_market: cumulativeDaysOnMarket,
property_type: propertyType,
listing_type: listingType,
url: url
}
""",
# from bottom area of the page: amenities, records and event history
"amenities": """belowTheFold.amenitiesInfo.superGroups[].amenityGroups[].amenityEntries[].{
name: amenityName, values: amenityValues
}""",
"records": "belowTheFold.publicRecordsInfo",
"history": "belowTheFold.propertyHistoryInfo",
# other: sometimes there are floorplans
"floorplan": r"listingfloorplans.floorPlans",
# and there's always internal Redfin performance info: views, saves, etc.
"activity": "activityInfo",
}
results = {}
for key, path in parse_map.items():
value = jmespath.search(path, data_cache)
results[key] = value
return results
def parse_property(result: ScrapeApiResponse) -> PropertyResult:
script = result.selector.xpath('//script[contains(.,"ServerState.InitialContext")]/text()').get()
initial_context = re.findall(r"ServerState.InitialContext = (\{.+\});", script)
if not initial_context:
print(f"page {result.context['url']} is not a property listing page")
return
    return parse_redfin_property_cache(extract_cache(json.loads(initial_context[0])))
async def scrape_properties(urls: List[str]) -> List[PropertyResult]:
to_scrape = [ScrapeConfig(url=url, asp=True, country="US", cache=True) for url in urls]
properties = []
async for result in scrapfly.concurrent_scrape(to_scrape):
properties.append(parse_property(result))
return properties
async def scrape_feed(url) -> Dict[str, datetime]:
"""Scrape Redfin sitemap for URLs"""
result = await scrapfly.async_scrape(ScrapeConfig(url=url, country="US", cache=True, asp=True))
results = {}
for item in result.selector.xpath("//url"):
url = item.xpath("loc/text()").get()
pub_date = item.xpath("lastmod/text()").get()
results[url] = arrow.get(pub_date).datetime
return results
async def example_run():
urls = [
"https://www.redfin.com/FL/Cape-Coral/402-SW-28th-St-33914/home/61856041",
"https://www.redfin.com/FL/Cape-Coral/4202-NW-16th-Ter-33993/home/62053611",
"https://www.redfin.com/FL/Cape-Coral/1415-NW-38th-Pl-33993/home/62079956",
"https://www.redfin.com/FL/Cape-Coral/1026-NE-34th-Ln-33909/home/67830364",
"https://www.redfin.com/FL/Cape-Coral/1022-NE-34th-Ln-33909/home/62069246",
"https://www.redfin.com/FL/Cape-Coral/4132-NE-21st-Ave-33909/home/67818227",
"https://www.redfin.com/FL/Cape-Coral/2115-NW-8th-Ter-33993/home/62069405",
"https://www.redfin.com/FL/Cape-Coral/1451-Weeping-Willow-Ct-33909/home/178539244",
"https://www.redfin.com/FL/Cape-Coral/1449-Weeping-Willow-Ct-33909/home/178539243",
"https://www.redfin.com/FL/Cape-Coral/5431-SW-6th-Ave-33914/home/61888403",
"https://www.redfin.com/FL/Cape-Coral/1445-Weeping-Willow-Ct-33909/home/178539241",
]
    properties = await scrape_properties(urls)
    print(json.dumps(properties, indent=2, default=str))
    feed = await scrape_feed("https://www.redfin.com/newest_listings.xml")
    print(json.dumps(feed, indent=2, default=str))
if __name__ == "__main__":
asyncio.run(example_run())