How to Scrape Redfin Real Estate Property Data in Python
In this web scraping tutorial, we'll be taking a look at how to scrape Redfin - a popular real estate listing web page.
We'll be scraping real estate data such as pricing info, addresses, photos and phone numbers displayed on Redfin property pages.
To scrape Redfin properties we'll be using the hidden web data scraping method. We'll also take a look at how to find real estate properties using Redfin's search and sitemap system to collect the entire real estate dataset available on the website.
Finally, we'll also cover property tracking by continuously scraping for newly listed or updated properties - giving us an upper hand in real estate bidding. We'll be using Python with a few community libraries - let's dive in!
If you're new to web scraping with Python we recommend checking out our full introduction tutorial to web scraping with Python and common best practices.
Redfin.com is one of the biggest real estate websites in the United States, making it one of the biggest public real estate datasets out there. It contains fields like real estate prices, listing locations, sale dates and general property information.
This is valuable information for market analytics, the study of the housing industry, and a general competitor overview. By web scraping Redfin we can easily have access to a major real estate dataset.
See our Scraping Use Cases guide for more.
For more real estate scrape guides see our hub article which covers scraping of Zillow, Realtor.com, Idealista and other popular platforms.
We can scrape Redfin for several popular real estate datafields and targets:
In this guide, we'll focus on scraping real estate property listings (rent and sale), though everything we'll learn can be easily applied to other pages.
In this tutorial, we'll be using Python with three community packages:
These packages can be easily installed via the pip install command:
$ pip install httpx parsel jmespath
Alternatively, feel free to swap httpx out with any other HTTP client package such as requests, as we'll only need basic HTTP functions which are almost interchangeable in every library. As for parsel, another great alternative is the beautifulsoup package.
To start let's take a look at how to scrape property data of a single listing page.
Redfin is using Next.js for rendering its pages. We can take advantage of this fact and scrape the hidden web data instead of parsing the HTML directly. This might appear a bit complex so if you're unfamiliar with hidden web data scraping see our introduction article:
Introduction to scraping hidden web data - what is it and best ways to parse it in Python
Redfin's hidden dataset contains all of the property data and more. In this scenario, the property data is located in the javascript variable __reactServerState.InitialContext:
To extract the whole dataset we will find the script element which contains this javascript variable, extract the variable's value with a regular expression and load it as a Python dictionary. Let's see it in action:
import json
import re
import asyncio
from typing import List
from httpx import AsyncClient, Response
from parsel import Selector
session = AsyncClient(headers={
    # use same headers as a popular web browser (Chrome on macOS in this case)
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"Accept-Language": "en-US,en;q=0.9",
})
def extract_cache(react_initial_context):
"""extract microservice cache data from the react server agent"""
result = {}
for name, cache in react_initial_context["ReactServerAgent.cache"]["dataCache"].items():
# first we retrieve cached response and see whether it's a success
try:
cache_response = cache["res"]
except KeyError: # empty cache
continue
if cache_response.get("status") != 200:
print("skipping non 200 cache")
continue
# then extract cached response body and interpret it as a JSON
cache_data = cache_response.get("body", {}).get("payload")
if not cache_data:
cache_data = json.loads(cache_response["text"].split("&&", 1)[-1]).get("payload")
if not cache_data:
# skip empty caches
continue
# for Redfin we can cleanup cache names for home data endpoints:
if "/home/details" in name:
name = name.split("/home/details/")[-1]
result[name.replace("/", "")] = cache_data
# ^note: we sanitize name to avoid slashes as they are not allowed in JMESPath
return result
def parse_property(response: Response):
selector = Selector(response.text)
script = selector.xpath('//script[contains(.,"ServerState.InitialContext")]/text()').get()
initial_context = re.findall(r"ServerState.InitialContext = (\{.+\});", script)
if not initial_context:
print(f"page {response.url} is not a property listing page")
return
return extract_cache(json.loads(initial_context[0]))
async def scrape_properties(urls: List[str]) -> List[dict]:
to_scrape = [session.get(url) for url in urls]
properties = []
for response in asyncio.as_completed(to_scrape):
properties.append(parse_property(await response))
return properties
To run our scraper all we have to do is call the asyncio coroutine:
urls = [
"https://www.redfin.com/FL/Cape-Coral/402-SW-28th-St-33914/home/61856041",
"https://www.redfin.com/FL/Cape-Coral/4202-NW-16th-Ter-33993/home/62053611",
"https://www.redfin.com/FL/Cape-Coral/1415-NW-38th-Pl-33993/home/62079956",
]
if __name__ == "__main__":
asyncio.run(scrape_properties(urls))
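To inspect what we scraped we can also dump the results to a JSON file - a minimal sketch (the output filename here is just an example):
results = asyncio.run(scrape_properties(urls))
with open("redfin_properties.json", "w") as file:
    json.dump(results, file, indent=2)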
Above, we used httpx to retrieve the HTML page and load it as a parsel.Selector. Then, we find the script element which contains the javascript cache variable. To extract the cache we use a simple regular expression that captures text between the InitialContext keyword and the }; character.
This results in a colossal Redfin property dataset, and since this is an internal web dataset it's full of technical datafields - so we'll need to do a bit of cleanup. Let's parse it!
The dataset we scraped is huge and contains loads of useless information. To parse it down to something we can digest we'll be using JMESPath - a popular JSON parsing syntax.
JMESPath is a bit similar to XPath or CSS selectors but for JSON. Using it, we can create path rules of where to find the data fields we want to keep.
For example, for the price we'll be using the JMESPath path:
aboveTheFold.addressSectionInfo.priceInfo.amount
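To illustrate, here's how jmespath.search resolves such a path against a nested dictionary (the dictionary below is a tiny made-up stand-in for the real scraped cache):
import jmespath

# a tiny stand-in for the scraped property cache (illustrative values only)
example_cache = {
    "aboveTheFold": {
        "addressSectionInfo": {"priceInfo": {"amount": 311485}},
    },
}
print(jmespath.search("aboveTheFold.addressSectionInfo.priceInfo.amount", example_cache))
# prints: 311485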
Let's take a look at the whole parser:
from typing import Dict, List, TypedDict

import jmespath
class PropertyResult(TypedDict):
"""type hint for property result. i.e. Defines what fields are expected in property dataset"""
photos: List[str]
videos: List[str]
price: int
info: Dict[str, str]
amenities: List[Dict[str, str]]
records: Dict[str, str]
history: Dict[str, str]
floorplan: Dict[str, str]
activity: Dict[str, str]
def parse_redfin_proprety_cache(data_cache) -> PropertyResult:
"""parse Redfin's cache data for proprety information"""
# here we define field name to JMESPath mapping
parse_map = {
# from top area of the page: basic info, videos and photos
"photos": "aboveTheFold.mediaBrowserInfo.photos[*].photoUrls.fullScreenPhotoUrl",
"videos": "aboveTheFold.mediaBrowserInfo.videos[*].videoUrl",
"price": "aboveTheFold.addressSectionInfo.priceInfo.amount",
"info": """aboveTheFold.addressSectionInfo.{
bed_num: beds,
bath_numr: baths,
full_baths_num: numFullBaths,
sqFt: sqFt,
year_built: yearBuitlt,
city: city,
state: state,
zip: zip,
country_code: countryCode,
fips: fips,
apn: apn,
redfin_age: timeOnRedfin,
cumulative_days_on_market: cumulativeDaysOnMarket,
property_type: propertyType,
listing_type: listingType,
url: url
}
""",
# from bottom area of the page: amenities, records and event history
"amenities": """belowTheFold.amenitiesInfo.superGroups[].amenityGroups[].amenityEntries[].{
name: amenityName, values: amenityValues
}""",
"records": "belowTheFold.publicRecordsInfo",
"history": "belowTheFold.propertyHistoryInfo",
# other: sometimes there are floorplans
"floorplan": r"listingfloorplans.floorPlans",
# and there's always internal Redfin performance info: views, saves, etc.
"activity": "activityInfo",
}
results = {}
for key, path in parse_map.items():
value = jmespath.search(path, data_cache)
results[key] = value
return results
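Running this parser on the cache we extracted earlier produces results like the example below. Here's a minimal wiring sketch (parse_property_page is just an illustrative name - the full scraper code at the end of this article performs the same step inside parse_property):
def parse_property_page(response: Response) -> PropertyResult:
    """combine the hidden cache extraction with the JMESPath parser (illustrative wiring)"""
    data_cache = parse_property(response)  # cache dictionary from the earlier section
    if data_cache:
        return parse_redfin_proprety_cache(data_cache)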
{
"photos": [
"https://ssl.cdn-redfin.com/photo/192/bigphoto/966/222084966_0.jpg",
"https://ssl.cdn-redfin.com/photo/192/bigphoto/966/222084966_1_0.jpg",
"https://ssl.cdn-redfin.com/photo/192/bigphoto/966/222084966_2_0.jpg",
"https://ssl.cdn-redfin.com/photo/192/bigphoto/966/222084966_3_0.jpg",
"https://ssl.cdn-redfin.com/photo/192/bigphoto/966/222084966_4_0.jpg",
"https://ssl.cdn-redfin.com/photo/192/bigphoto/966/222084966_5_0.jpg",
"https://ssl.cdn-redfin.com/photo/192/bigphoto/966/222084966_6_0.jpg",
"https://ssl.cdn-redfin.com/photo/192/bigphoto/966/222084966_7_0.jpg",
"https://ssl.cdn-redfin.com/photo/192/bigphoto/966/222084966_8_0.jpg",
"https://ssl.cdn-redfin.com/photo/192/bigphoto/966/222084966_9_0.jpg",
"https://ssl.cdn-redfin.com/photo/192/bigphoto/966/222084966_10_0.jpg",
"https://ssl.cdn-redfin.com/photo/192/bigphoto/966/222084966_11_0.jpg",
"https://ssl.cdn-redfin.com/photo/192/bigphoto/966/222084966_12_0.jpg"
],
"videos": [],
"price": 311485,
"info": {
"bed_num": 3,
"bath_numr": 2.5,
"full_baths_num": 2,
"sqFt": {
"displayLevel": 1,
"value": 1636
},
"year_built": null,
"city": "Cape Coral",
"state": "FL",
"zip": "33909",
"country_code": "US",
"fips": "12071",
"apn": "304324C2137060780",
"redfin_age": 489909873,
"cumulative_days_on_market": 6,
"property_type": 13,
"listing_type": 1,
"url": "/FL/Cape-Coral/1445-Weeping-Willow-Ct-33909/home/178539241"
},
"amenities": [
{
"name": "Parking",
"values": [
"2+ Spaces",
"Driveway Paved"
]
},
{
"name": "Amenities",
"values": [
"Basketball",
"Business Center",
"Clubhouse",
"Community Pool",
"Community Room",
"Community Spa/Hot tub",
"Exercise Room",
"Pickleball",
"Play Area",
"Sidewalk",
"Tennis Court",
"Underground Utility",
"Volleyball"
]
},
... // truncated for blog
],
"records": {
"basicInfo": {
"propertyTypeName": "Townhouse",
"lotSqFt": 1965,
"apn": "304324C2137060780",
"propertyLastUpdatedDate": 1669759070845,
"displayTimeZone": "US/Eastern"
},
"taxInfo": {},
"allTaxInfo": [],
"addressInfo": {
"isFMLS": false,
"street": "1445 Weeping Willow Ct",
"city": "Cape Coral",
"state": "FL",
"zip": "33909",
"countryCode": "US"
},
"mortgageCalculatorInfo": {
"displayLevel": 1,
"dataSourceId": 192,
"listingPrice": 311485,
"downPaymentPercentage": 20.0,
"monthlyHoaDues": 312,
"propertyTaxRate": 1.29,
"homeInsuranceRate": 1.17,
"mortgageInsuranceRate": 0.75,
"creditScore": 740,
"loanType": 1,
"mortgageRateInfo": {
"fifteenYearFixed": 5.725,
"fiveOneArm": 5.964,
"thirtyYearFixed": 6.437,
"isFromBankrate": true
},
"countyId": 471,
"stateId": 19,
"countyName": "Lee County",
"stateName": "Florida",
"mortgageRatesPageLinkText": "View all rates",
"baseMortgageRatesPageURL": "/mortgage-rates?location=33909&locationType=4&locationId=14465",
"zipCode": "33909",
"isCoop": false
},
"countyUrl": "/county/471/FL/Lee-County",
"countyName": "Lee County",
"countyIsActive": true,
"sectionPreviewText": "County data refreshed on 11/29/2022"
},
"history": {
"isHistoryStillGrowing": true,
"hasAdminContent": false,
"hasLoginContent": false,
"dataSourceId": 192,
"canSeeListing": true,
"listingIsNull": false,
"hasPropertyHistory": true,
"showLogoInLists": false,
"definitions": [],
"displayTimeZone": "US/Eastern",
"isAdminOnlyView": false,
"events": [
{
"isEventAdminOnly": false,
"price": 311485,
"isPriceAdminOnly": false,
"eventDescription": "Listed",
"mlsDescription": "Active",
"source": "BEARMLS",
"sourceId": "222084966",
"dataSourceDisplay": {
"dataSourceId": 192,
"dataSourceDescription": "Bonita Springs Association of Realtors (BEARMLS)",
"dataSourceName": "BEARMLS",
"shouldShowLargerLogo": false
},
"priceDisplayLevel": 1,
"historyEventType": 1,
"eventDate": 1669708800000
}
],
"mediaBrowserInfoBySourceId": {},
"addressInfo": {
"isFMLS": false,
"street": "1445 Weeping Willow Ct",
"city": "Cape Coral",
"state": "FL",
"zip": "33909",
"countryCode": "US"
},
"isFMLS": false,
"historyHasHiddenRows": false,
"priceEstimates": {
"displayLevel": 1,
"priceHomeUrl": "/what-is-my-home-worth?estPropertyId=178539241&src=ldp-estimates"
},
"sectionPreviewText": "Details will be added when we have them"
},
"floorplan": [
"https://ssl.cdn-redfin.com/photo/192/bigphoto/966/222084966_1_0.jpg",
],
"activity": {
"viewCount": 28,
"favoritesCount": 1,
"totalFavoritesCount": 1,
"xOutCount": 0,
"totalXOutCount": 0,
"tourCount": 0,
"totalTourCount": 0,
"addressInfo": {
"isFMLS": false,
"street": "1445 Weeping Willow Ct",
"city": "Cape Coral",
"state": "FL",
"zip": "33909",
"countryCode": "US"
},
"sectionPreviewText": "1 people favorited this home"
}
}
We've reduced the thousands-of-lines-long Redfin property dataset to just a few dozen of the most important fields using JMESPath and Python.
We can see how easy it is to scrape modern websites with modern scraping tools - to scrape a Redfin property we used only a few lines of Python code. Next, let's take a look at how we can find listings to scrape.
There are several ways to find listings on Redfin for scraping, though the most obvious and fastest way is to use Redfin's sitemaps.
Redfin offers an extensive sitemap system that contains sitemaps for listings by US state, neighborhood, school district and so on. For that, let's take a look at the /robots.txt page, specifically the sitemap section.
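Below is a quick sketch of pulling the sitemap entries from robots.txt using the httpx session from before (discover_sitemaps is just an illustrative helper name):
async def discover_sitemaps():
    """print the Sitemap: entries declared in Redfin's robots.txt (illustrative)"""
    response = await session.get("https://www.redfin.com/robots.txt")
    for line in response.text.splitlines():
        if line.lower().startswith("sitemap:"):
            print(line.split(":", 1)[-1].strip())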
For example, there are sitemaps for all location directories:
And for rental data:
Finally, we have sitemaps for non-listing objects such as agents.
To keep track of new Redfin listings we can use sitemap feeds for the newest and updated listings:
To find new listings and updates we'll be scraping these two sitemaps which provide a listing URL and timestamp when it was listed or updated:
<url>
<loc>https://www.redfin.com/NH/Boscawen/1-Sherman-Dr-03303/home/96531826</loc>
<lastmod>2022-12-01T00:53:20.426-08:00</lastmod>
<changefreq>daily</changefreq>
<priority>1.0</priority>
</url>
⌚ Note that this sitemap is using the UTC-8 timezone. It's indicated by the offset at the end of the datetime string: -08:00.
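To normalize these timestamps we can use a datetime library that understands offsets - for example arrow, which we'll use in the scraper below (the timestamp here is taken from the sitemap sample above):
import arrow

# parse the sitemap timestamp and convert it to UTC
listed_at = arrow.get("2022-12-01T00:53:20.426-08:00")
print(listed_at.to("UTC"))  # 2022-12-01T08:53:20.426000+00:00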
To scrape these Redfin feeds in Python we'll be using the httpx and parsel libraries we've used before:
import arrow  # for handling datetime: pip install arrow
from datetime import datetime
from typing import Dict
from httpx import AsyncClient
from parsel import Selector

session = AsyncClient(headers={
    # use same headers as a popular web browser (Chrome on macOS in this case)
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"Accept-Language": "en-US,en;q=0.9",
})
async def scrape_feed(url) -> Dict[str, datetime]:
"""scrape Redfin sitemap and return url:datetime dictionary"""
result = await session.get(url)
selector = Selector(text=result.text, type="xml")
results = {}
for item in selector.xpath("//url"):
url = item.xpath("loc/text()").get()
        pub_date = item.xpath("lastmod/text()").get()
results[url] = arrow.get(pub_date).datetime
return results
We can then use the Python Redfin scraper we wrote earlier to scrape these URLs for property datasets.
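For example, here's a small sketch tying the two steps together (track_new_listings is just an illustrative name; the newest_listings.xml feed is one of the new-listing feeds mentioned above):
async def track_new_listings():
    """scrape the newest listings feed, then each property page (illustrative)"""
    feed = await scrape_feed("https://www.redfin.com/newest_listings.xml")
    return await scrape_properties(list(feed))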
Scraping Redfin.com seems very straightforward, though when scraping at scale our scrapers are very likely to be blocked or asked to solve captchas.
To get around this, let's take advantage of ScrapFly API which can avoid all of these blocks for us!
ScrapFly offers several powerful features that'll help us to get around Redfin's web scraper blocking:
For this, we'll be using the scrapfly-sdk python package and the Anti Scraping Protection Bypass feature. First, let's install scrapfly-sdk using pip:
$ pip install scrapfly-sdk
To take advantage of ScrapFly's API in our Redfin.com web scraper, all we need to do is replace our httpx session code with scrapfly-sdk client requests:
import httpx
response = httpx.get("some redfin.com url")
# in ScrapFly SDK becomes
from scrapfly import ScrapflyClient, ScrapeConfig
client = ScrapflyClient("YOUR SCRAPFLY KEY")
result = client.scrape(ScrapeConfig(
"some Redfin.ocm url",
# we can select specific proxy country
country="US",
# and enable anti scraping protection bypass:
asp=True
))
For more on how to scrape Redfin.com using ScrapFly, see the Full Scraper Code section.
To wrap this guide up, let's take a look at some frequently asked questions about web scraping Redfin data:
Is it legal to scrape Redfin.com?
Yes. Redfin.com's data is publicly available; we're not collecting anything private. Scraping Redfin at slow, respectful rates would fall under the ethical scraping definition.
That being said, attention should be paid to GDPR compliance in the EU when storing personal data such as the seller's name, phone number etc. For more, see our Is Web Scraping Legal? article.
Is there a Redfin API?
No, there's no Redfin API for real estate data, though Redfin does publish market summary datasets in their data-center section. For detailed property data we can scrape Redfin data using Python.
How to crawl Redfin?
Like scraping, we can also crawl redfin.com by following the related rental pages listed on every property page. To write a Redfin crawler, see the related properties field in the datasets scraped in this tutorial.
In this tutorial, we built a Redfin scraper in Python with a few free community packages. We started by taking a look at how to scrape a single property page by extracting hidden web cache data.
To parse property data we used the JMESPath JSON parsing language to write a few simple rules which reduced the scraped dataset to the vital property data fields.
Finally, to find property listings and track new/updated ones we explored Redfin's sitemap system.
For this Redfin data scraper we used Python with the httpx, parsel and jmespath packages. To avoid blocking we used ScrapFly's API, which smartly configures every web scraper connection to get around scraper blocking.
For more about ScrapFly, see our documentation and try it out for FREE!
Here's the full Redfin web scraper code with ScrapFly integration:
💙 This code should only be used as a reference. To scrape data from Redfin at scale you'll need some error handling, logging and retrying logic.
import asyncio
import json
import re
from datetime import datetime
from pathlib import Path
from typing import Dict, List
import arrow
import jmespath
from scrapfly import ScrapeApiResponse, ScrapeConfig, ScrapflyClient
from typing_extensions import TypedDict
scrapfly = ScrapflyClient(key="YOUR SCRAPFLY KEY", max_concurrency=2)
def extract_cache(react_initial_context):
"""extract microservice cache data from the react server agent"""
result = {}
for name, cache in react_initial_context["ReactServerAgent.cache"]["dataCache"].items():
# first we retrieve cached response and see whether it's a success
try:
cache_response = cache["res"]
except KeyError: # empty cache
continue
if cache_response.get("status") != 200:
print("skipping non 200 cache")
continue
# then extract cached response body and interpret it as a JSON
cache_data = cache_response.get("body", {}).get("payload")
if not cache_data:
cache_data = json.loads(cache_response["text"].split("&&", 1)[-1]).get("payload")
if not cache_data:
# skip empty caches
continue
# for Redfin we can cleanup cache names for home data endpoints:
if "/home/details" in name:
name = name.split("/home/details/")[-1]
result[name.replace("/", "")] = cache_data
# ^note: we sanitize name to avoid slashes as they are not allowed in JMESPath
return result
class PropertyResult(TypedDict):
"""type hint for property result. i.e. Defines what fields are expected in property dataset"""
photos: List[str]
videos: List[str]
price: int
info: Dict[str, str]
amenities: List[Dict[str, str]]
records: Dict[str, str]
history: Dict[str, str]
floorplan: Dict[str, str]
activity: Dict[str, str]
def parse_redfin_proprety_cache(data_cache) -> PropertyResult:
"""parse Redfin's cache data for proprety information"""
# here we define field name to JMESPath mapping
parse_map = {
# from top area of the page: basic info, videos and photos
"photos": "aboveTheFold.mediaBrowserInfo.photos[*].photoUrls.fullScreenPhotoUrl",
"videos": "aboveTheFold.mediaBrowserInfo.videos[*].videoUrl",
"price": "aboveTheFold.addressSectionInfo.priceInfo.amount",
"info": """aboveTheFold.addressSectionInfo.{
bed_num: beds,
bath_numr: baths,
full_baths_num: numFullBaths,
sqFt: sqFt,
year_built: yearBuitlt,
city: city,
state: state,
zip: zip,
country_code: countryCode,
fips: fips,
apn: apn,
redfin_age: timeOnRedfin,
cumulative_days_on_market: cumulativeDaysOnMarket,
property_type: propertyType,
listing_type: listingType,
url: url
}
""",
# from bottom area of the page: amenities, records and event history
"amenities": """belowTheFold.amenitiesInfo.superGroups[].amenityGroups[].amenityEntries[].{
name: amenityName, values: amenityValues
}""",
"records": "belowTheFold.publicRecordsInfo",
"history": "belowTheFold.propertyHistoryInfo",
# other: sometimes there are floorplans
"floorplan": r"listingfloorplans.floorPlans",
# and there's always internal Redfin performance info: views, saves, etc.
"activity": "activityInfo",
}
results = {}
for key, path in parse_map.items():
value = jmespath.search(path, data_cache)
results[key] = value
return results
def parse_property(result: ScrapeApiResponse) -> PropertyResult:
script = result.selector.xpath('//script[contains(.,"ServerState.InitialContext")]/text()').get()
initial_context = re.findall(r"ServerState.InitialContext = (\{.+\});", script)
if not initial_context:
print(f"page {result.context['url']} is not a property listing page")
return
return parse_redfin_proprety_cache(extract_cache(json.loads(initial_context[0])))
async def scrape_properties(urls: List[str]) -> List[PropertyResult]:
to_scrape = [ScrapeConfig(url=url, asp=True, country="US", cache=True) for url in urls]
properties = []
async for result in scrapfly.concurrent_scrape(to_scrape):
properties.append(parse_property(result))
return properties
async def scrape_feed(url) -> Dict[str, datetime]:
"""Scrape Redfin sitemap for URLs"""
result = await scrapfly.async_scrape(ScrapeConfig(url=url, country="US", cache=True, asp=True))
results = {}
for item in result.selector.xpath("//url"):
url = item.xpath("loc/text()").get()
pub_date = item.xpath("lastmod/text()").get()
results[url] = arrow.get(pub_date).datetime
return results
async def example_run():
urls = [
"https://www.redfin.com/FL/Cape-Coral/402-SW-28th-St-33914/home/61856041",
"https://www.redfin.com/FL/Cape-Coral/4202-NW-16th-Ter-33993/home/62053611",
"https://www.redfin.com/FL/Cape-Coral/1415-NW-38th-Pl-33993/home/62079956",
"https://www.redfin.com/FL/Cape-Coral/1026-NE-34th-Ln-33909/home/67830364",
"https://www.redfin.com/FL/Cape-Coral/1022-NE-34th-Ln-33909/home/62069246",
"https://www.redfin.com/FL/Cape-Coral/4132-NE-21st-Ave-33909/home/67818227",
"https://www.redfin.com/FL/Cape-Coral/2115-NW-8th-Ter-33993/home/62069405",
"https://www.redfin.com/FL/Cape-Coral/1451-Weeping-Willow-Ct-33909/home/178539244",
"https://www.redfin.com/FL/Cape-Coral/1449-Weeping-Willow-Ct-33909/home/178539243",
"https://www.redfin.com/FL/Cape-Coral/5431-SW-6th-Ave-33914/home/61888403",
"https://www.redfin.com/FL/Cape-Coral/1445-Weeping-Willow-Ct-33909/home/178539241",
]
feed = await scrape_feed("https://www.redfin.com/stingray/api/gis-cms/city-sitemap/CA/San-Francisco?channel=buy")
    new_listings = await scrape_feed("https://www.redfin.com/newest_listings.xml")
    properties = await scrape_properties(urls)
    print(json.dumps(properties, indent=2))
if __name__ == "__main__":
asyncio.run(example_run())