How to Scrape AngelList (angel.co) Company Data and Job Listings with Python
In this tutorial, we'll take a look at how to scrape AngelList (angel.co) - a major directory for startup data and job listing information in the tech industry.
Angel.co contains data fields like: company overview, employee information, culture overview, funding details and job listings.
AngelList is notoriously challenging to scrape as it uses many anti-scrape protection tools, so to scrape AngelList we'll be using Python with ScrapFly SDK, which will make this task a breeze. Let's dive in!
Angel.co contains loads of data related to tech startups. By scraping details like company information, employee data, company culture, funding and jobs we can create powerful business intelligence datasets. This can be used for competitive advantage or general market analysis. Job data and company contacts are also used by recruiters and growth hackers to generate business leads.
For more on scraping use cases see our extensive web scraping use case article
As we'll see later on, angel.co is actually a pretty easy target to parse, though its anti-bot protection makes it hard to access. So all we need is a modern version of Python (3.7+) and the scrapfly-sdk package, which will allow us to bypass the anti-scraping technologies used by angel.co and retrieve the public HTML data.
Optionally, for this tutorial, we'll also use loguru - a pretty logging library that'll help us keep track of what's going on via nice colorful logs.
These packages can be easily installed via the pip command:
$ pip install scrapfly-sdk loguru
ScrapFly offers several powerful features that'll help us get around web scraper blocking.
Angel.co uses many anti-scrape protection technologies to prevent automated access to their public data. So, to access it we'll be using ScrapFly's Anti Scraping Protection Bypass feature which can be enabled for any request in the python SDK:
from scrapfly import ScrapflyClient, ScrapeConfig

client = ScrapflyClient(key='YOUR_SCRAPFLY_KEY')
result = client.scrape(ScrapeConfig(
    url="https://angel.co/company/moxion-power-co",
    # we need to enable Anti Scraping Protection bypass with a keyword argument:
    asp=True,
))
We'll be using this technique for every page we'll be scraping in this tutorial, let's take a look at how it all adds up!
Let's start our AngelList scraper by taking a look at scraping AngelList's search system. This will allow us to find companies and jobs listed on the website.
There are several ways to find these details on angel.co, but we'll take a look at the two most popular ones: searching by role and/or location:

To search by role, the /role/<role name> endpoint can be used, for example: angel.co/role/python-developer
To search by location, the /location/<location name> endpoint is used, for example: angel.co/location/france
To search by both role and location, the /role/l/<role name>/<location name> endpoint is used, for example: https://angel.co/role/l/python-developer/san-francisco

This is the URL progression of the search - now let's replicate it in our scraper code!
To scrape the search first let's take a look at the contents of a single search page. Where is the wanted data located and how can we extract it from the HTML page?
If we take a look at a search page like angel.co/role/l/python-developer/san-francisco and view the page source, we can see the search result data embedded in a JavaScript variable:
we can see data tucked away in a script node
This is a common pattern for GraphQL-powered websites where the page cache is stored as JSON in the HTML. Angel.co, in particular, is powered by Apollo GraphQL.
For a complete detailed guide on what is GraphQL and how to scrape it see our in-depth tutorial on scraping GraphQL with Python.
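To make this clearer, here is a simplified, heavily truncated sketch of what that embedded state roughly looks like (the nesting matches the fields we'll read in the extraction code below):

<script id="__NEXT_DATA__">
{
  "props": {
    "pageProps": {
      "apolloState": {
        "data": {
          "StartupResult:6427941": { ... },
          "JobListingSearchResult:2275832": { ... }
        }
      }
    }
  }
}
</script>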
This is super convenient for our AngelList web scraper because we don't need to parse the HTML and can pick up all of the data at once. Let's see how to scrape this:
import json
import asyncio
from scrapfly import ScrapeApiResponse, ScrapeConfig, ScrapflyClient
from loguru import logger as log

def extract_apollo_state(result: ScrapeApiResponse):
    """extract apollo state graph from a page"""
    data = result.selector.css("script#__NEXT_DATA__::text").get()
    data = json.loads(data)
    graph = data["props"]["pageProps"]["apolloState"]["data"]
    return graph
async def scrape_search(session: ScrapflyClient, role: str = "", location: str = ""):
    """scrape angel.co search"""
    # angel.co has 3 types of search urls: for roles, for locations and for combination of both
    if role and location:
        url = f"https://angel.co/role/l/{role}/{location}"
    elif role:
        url = f"https://angel.co/role/{role}"
    elif location:
        url = f"https://angel.co/location/{location}"
    else:
        raise ValueError("need to pass either role or location argument to scrape search")
    log.info(f'scraping search of "{role}" in "{location}"')
    scrape = ScrapeConfig(
        url=url,  # url to scrape
        asp=True,  # this will enable anti-scraping protection bypass
    )
    result = await session.async_scrape(scrape)
    graph = extract_apollo_state(result)
    return graph
Let's run this code and see the results it generates:
if __name__ == "__main__":
    with ScrapflyClient(key="YOUR_SCRAPFLY_KEY", max_concurrency=2) as session:
        result = asyncio.run(scrape_search(session, role="python-developer"))
        print(json.dumps(result, indent=2, ensure_ascii=False))
{
  "props": {
    "pageProps": {
      "page": null,
      "role": "python-developer",
      "apollo": null,
      "apolloState": {
        "data": {
          ...
          "StartupResult:6427941": {
            "id": "6427941",
            "badges": [
              {
                "type": "id",
                "generated": false,
                "id": "Badge:ACTIVELY_HIRING",
                "typename": "Badge"
              }
            ],
            "companySize": "SIZE_11_50",
            ...
          "JobListingSearchResult:2275832": {
            "autoPosted": false,
            "atsSource": null,
            "description": "**Company: Capitalmind**\n\nAt Capitalmind we ...",
            "jobType": "full-time",
            "liveStartAt": 1656420205,
            "locationNames": {
              "type": "json",
              "json": ["Bengaluru"]
            },
            "primaryRoleTitle": "DevOps",
            "remote": false,
            "slug": "python-developer",
            "title": "Python Developer",
            "compensation": "₹50,000 – ₹1L",
            ...
The first thing we can notice is that there are a lot of results in a very complicated format. The data we receive here is a data graph which is a data storage format where various data objects are connected by references. To make better sense of this, let's parse it into a familiar, flat structure instead:
from copy import deepcopy

def unpack_node_references(node, graph, debug=False):
    """
    unpacks references in a graph node to a flat node structure:

    >>> unpack_node_references({"field": {"id": "reference1", "type": "id"}}, graph={"reference1": {"foo": "bar"}})
    {'field': {'foo': 'bar'}}
    """
    def flatten(value):
        try:
            if value["type"] != "id":
                return value
        except (KeyError, TypeError):
            return value
        data = deepcopy(graph[value["id"]])
        # flatten nodes too:
        if data.get("node"):
            data = flatten(data["node"])
        if debug:
            data["__reference"] = value["id"]
        return data

    node = flatten(node)
    for key, value in node.items():
        if isinstance(value, list):
            node[key] = [flatten(v) for v in value]
        elif isinstance(value, dict):
            node[key] = unpack_node_references(value, graph)
    return node
Above, we defined a function to flatten complex graph structures. It works by replacing all references with the data itself. In our case, we want to get the Company object from the graph together with all of its related objects like jobs, people etc.:
process of converting graph into a flat structure
In the illustration above, we can visualize reference unpacking better.
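To see what this does in practice, here's a tiny sketch using a made-up two-node graph (the keys below are illustrative, not real angel.co identifiers):

# a minimal, made-up graph: a company node that references a job node by id
graph = {
    "StartupResult:1": {
        "name": "Example Co",
        "highlightedJobListings": [{"type": "id", "id": "JobListingSearchResult:2"}],
    },
    "JobListingSearchResult:2": {"title": "Python Developer"},
}
company = unpack_node_references(graph["StartupResult:1"], graph)
print(company)
# {'name': 'Example Co', 'highlightedJobListings': [{'title': 'Python Developer'}]}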
Next, let's add this graph parsing to our scraper as well as paging ability so we can collect nicely formatted company data from all of the job pages:
from typing import Dict, List, Tuple
from typing_extensions import TypedDict

class JobData(TypedDict):
    """type hint for scraped job result data"""
    id: str
    title: str
    slug: str
    remote: bool
    primaryRoleTitle: str
    locationNames: Dict
    liveStartAt: int
    jobType: str
    description: str
    # there are more fields, but these are basic ones

class CompanyData(TypedDict):
    """type hint for scraped company result data"""
    id: str
    badges: list
    companySize: str
    highConcept: str
    highlightedJobListings: List[JobData]
    logoUrl: str
    name: str
    slug: str
    # there are more fields, but these are basic ones
async def scrape_search(session: ScrapflyClient, role: str = "", location: str = "") -> List[CompanyData]:
    """scrape angel.co search"""
    # angel.co has 3 types of search urls: for roles, for locations and for combination of both
    if role and location:
        url = f"https://angel.co/role/l/{role}/{location}"
    elif role:
        url = f"https://angel.co/role/{role}"
    elif location:
        url = f"https://angel.co/location/{location}"
    else:
        raise ValueError("need to pass either role or location argument to scrape search")

    async def scrape_search_page(page_numbers: List[int]) -> Tuple[List[CompanyData], Dict]:
        """scrape search pages concurrently"""
        companies = []
        log.info(f"scraping search of {role} in {location}; pages {page_numbers}")
        search_meta = None
        async for result in session.concurrent_scrape(
            [ScrapeConfig(url + f"?page={page}", asp=True, cache=True) for page in page_numbers]
        ):
            graph = extract_apollo_state(result)
            search_meta = graph[next(key for key in graph if "seoLandingPageJobSearchResults" in key)]
            companies.extend(
                [unpack_node_references(graph[key], graph) for key in graph if key.startswith("StartupResult")]
            )
        return companies, search_meta

    # scrape first page
    first_page_companies, pagination_meta = await scrape_search_page([1])
    # scrape other pages
    pages_to_scrape = list(range(2, pagination_meta["pageCount"] + 1))
    other_page_companies, _ = await scrape_search_page(pages_to_scrape)
    return first_page_companies + other_page_companies
if __name__ == "__main__":
    with ScrapflyClient(key="YOUR_SCRAPFLY_KEY", max_concurrency=2) as session:
        result = asyncio.run(scrape_search(session, role="python-developer"))
        print(json.dumps(result, indent=2, ensure_ascii=False))
[
  {
    "id": "6427941",
    "badges": [
      {
        "id": "ACTIVELY_HIRING",
        "name": "ACTIVELY_HIRING_BADGE",
        "label": "Actively Hiring",
        "tooltip": "Actively processing applications",
        "avatarUrl": null,
        "rating": null,
        "__typename": "Badge"
      }
    ],
    "companySize": "SIZE_11_50",
    "highConcept": "India's First Digital Asset Management Company",
    "highlightedJobListings": [
      {
        "autoPosted": false,
        "atsSource": null,
        "description": "**Company: Capitalmind**\n\nAt Capitalmind <...truncated...>",
        "jobType": "full-time",
        "liveStartAt": 1656420205,
        "locationNames": {
          "type": "json",
          "json": [
            "Bengaluru"
          ]
        },
        "primaryRoleTitle": "DevOps",
        "remote": false,
        "slug": "python-developer",
        "title": "Python Developer",
        "compensation": "₹50,000 – ₹1L",
        "id": "2275832",
        "isBookmarked": false,
        "__typename": "JobListingSearchResult"
      }
    ],
    "logoUrl": "https://photos.angel.co/startups/i/6427941-9e4960b31904ccbcfe7e3235228ceb41-medium_jpg.jpg?buster=1539167505",
    "name": "Capitalmind",
    "slug": "capitalmindamc",
    "__typename": "StartupResult"
  },
  ...
]
If you're having trouble executing this code, see the Full Scraper Code section below.
Our updated scraper is now capable of scraping all search pages and flattening the graph data into something more readable. We could further parse it to get rid of unwanted fields, but we'll leave that up to you.
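For example, a small helper along these lines could keep only a handful of fields (the selection here is just an illustration, using field names from the results above):

def simplify_company(company: dict) -> dict:
    """trim a flattened company result down to a few illustrative fields"""
    fields = ["id", "name", "slug", "companySize", "highConcept", "logoUrl"]
    simplified = {field: company.get(field) for field in fields}
    # keep only the basics of each highlighted job listing as well
    simplified["jobs"] = [
        {"id": job.get("id"), "title": job.get("title"), "remote": job.get("remote")}
        for job in company.get("highlightedJobListings", [])
    ]
    return simplified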
One thing to notice here is that the company and job data is not complete. While there's a lot of data here, there's even more of it in the full dataset available on the /company/ endpoint pages. Next, let's take a look at how we can scrape that!
Company pages on angel.co contain even more details than we can see during search. For example, if we take a look at a page like angel.co/company/moxion-power-co we can see much more data available in the visible part of the page:
Example of company profile page on angel.co
We can apply the same scraping techniques we used for search to company pages as well. Let's take a look at how:
def parse_company(result: ScrapeApiResponse) -> CompanyData:
    """parse company data from angel.co company page"""
    graph = extract_apollo_state(result)
    company = None
    for key in graph:
        if key.startswith("Startup:"):
            company = graph[key]
            break
    else:
        raise ValueError("no embedded company data could be found")
    return unpack_node_references(company, graph)

async def scrape_companies(company_ids: List[str], session: ScrapflyClient) -> List[CompanyData]:
    """scrape angel.co companies"""
    urls = [f"https://angel.co/company/{company_id}/jobs" for company_id in company_ids]
    companies = []
    async for result in session.concurrent_scrape([ScrapeConfig(url, asp=True, cache=True) for url in urls]):
        companies.append(parse_company(result))
    return companies
if __name__ == "__main__":
    with ScrapflyClient(key="YOUR_SCRAPFLY_KEY", max_concurrency=2) as session:
        result = asyncio.run(scrape_companies(["moxion-power-co"], session=session))
        print(json.dumps(result[0], indent=2, ensure_ascii=False))
{
  "id": "8281817",
  "__typename": "Startup",
  "slug": "moxion-power-co",
  "completenessScore": 92,
  "currentUserCanEditProfile": false,
  "currentUserCanRecruitForStartup": false,
  "completeness": {"score": 95},
  "name": "Moxion Power",
  "logoUrl": "https://photos.angel.co/startups/i/8281817-91faf535f176a41dc39259fc232d1b4e-medium_jpg.jpg?buster=1619536432",
  "highConcept": "Zero-Emissions Temporary Power as a Service",
  "hiring": true,
  "isOperating": null,
  "companySize": "SIZE_11_50",
  "totalRaisedAmount": 13225000,
  "companyUrl": "https://www.moxionpower.com/",
  "twitterUrl": "https://twitter.com/moxionpower",
  "blogUrl": "",
  "facebookUrl": "",
  "linkedInUrl": "https://www.linkedin.com/company/moxion-power-co/",
  "productHuntUrl": "",
  "public": true,
  "published": true,
  "quarantined": false,
  "isShell": false,
  "isIncubator": false,
  "currentUserCanUpdateInvestors": false,
  "jobPreamble": "Moxion is looking to hire a diverse team across several disciplines, currently focusing on engineering and production.",
  "jobListingsConnection({\"after\":\"MA==\",\"filters\":{\"jobTypes\":[],\"locationIds\":[],\"roleIds\":[]},\"first\":20})": {
    "totalPageCount": 3,
    "pageSize": 20,
    "edges": [
      {
        "id": "2224735",
        "public": true,
        "primaryRoleTitle": "Product Designer",
        "primaryRoleParent": "Designer",
        "liveStartAt": 1653724125,
        "descriptionSnippet": "<ul>\n<li>Conduct user research to drive design decisions</li>\n<li>Design graphics to be vinyl printed onto physical hardware and signage</li>\n</ul>\n",
        "title": "Senior UI/UX Designer",
        "slug": "senior-ui-ux-designer",
        "jobType": "full_time",
        ...
}
Just by adding a few lines of code, we collect each company's job, employee, culture and funding details. Because we used a generic way of scraping Apollo GraphQL-powered websites like angel.co, we can apply this approach to many other pages with ease!
Let's wrap this up by taking a look at the full scraper code and some other tips and tricks when it comes to scraping this target.
To wrap this guide up let's take a look at some frequently asked questions about web scraping angel.co:
Is it legal to scrape AngelList?
Yes. AngelList data is publicly available, and we're not extracting anything private. Scraping angel.co at slow, respectful rates would fall under the ethical scraping definition. That being said, attention should be paid to GDPR compliance in the EU when scraping personal data such as people's (employee) identifiable data. For more, see our Is Web Scraping Legal? article.
How to find company pages on angel.co?
Finding company pages without job listings is a bit more difficult since angel.co doesn't provide a site directory or a sitemap for crawlers.
For this, the angel.co/search endpoint can be used. Alternatively, we can take advantage of public search indexes such as google.com or bing.com using queries like: site:angel.co inurl:/company/
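As a rough sketch of that approach, once such URLs are collected (from the search endpoint or a search engine's result pages), the company slugs that scrape_companies() expects can be pulled out with a simple pattern (extract_company_slugs is a hypothetical helper, not part of the scraper above):

import re
from typing import List

def extract_company_slugs(urls: List[str]) -> List[str]:
    """pull company slugs out of angel.co/company/<slug> style URLs"""
    slugs = []
    for url in urls:
        match = re.search(r"angel\.co/company/([\w-]+)", url)
        if match:
            slugs.append(match.group(1))
    return slugs

# e.g. extract_company_slugs(["https://angel.co/company/moxion-power-co/jobs"])
# returns ['moxion-power-co']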
In this tutorial, we built an angel.co scraper. We've taken a look at how to discover company pages through AngelList's search functionality. Then, we wrote a generic dataset parser for GraphQL-powered websites that we applied to angel.co search result and company data parsing.
For this, we used Python with a few community packages (loguru and the scrapfly-sdk). To prevent blocking, we used ScrapFly's API, which smartly configures every web scraper connection to avoid being blocked. For more on ScrapFly, see our documentation and try it out for free!
Finally, let's put everything together: finding companies using search and scraping their details with ScrapFly integration:
import asyncio
import json
from copy import deepcopy
from pathlib import Path
from typing import Dict, List, Tuple
from loguru import logger as log
from parsel import Selector
from scrapfly import ScrapeApiResponse, ScrapeConfig, ScrapflyClient
from typing_extensions import TypedDict
def unpack_node_references(node: Dict, graph: Dict, debug: bool = False) -> Dict:
    """
    unpacks references in a graph node to a flat node structure:

    >>> unpack_node_references({"field": {"id": "reference1", "type": "id"}}, graph={"reference1": {"foo": "bar"}})
    {'field': {'foo': 'bar'}}
    """
    def flatten(value):
        try:
            if value["type"] != "id":
                return value
        except (KeyError, TypeError):
            return value
        data = deepcopy(graph[value["id"]])
        # flatten nodes too:
        if data.get("node"):
            data = flatten(data["node"])
        if debug:
            data["__reference"] = value["id"]
        return data

    node = flatten(node)
    for key, value in node.items():
        if isinstance(value, list):
            node[key] = [flatten(v) for v in value]
        elif isinstance(value, dict):
            node[key] = unpack_node_references(value, graph)
    return node
def extract_apollo_state(result: ScrapeApiResponse):
    """extract apollo state graph from a page"""
    data = result.selector.css("script#__NEXT_DATA__::text").get()
    data = json.loads(data)
    graph = data["props"]["pageProps"]["apolloState"]["data"]
    return graph
class JobData(TypedDict):
    """type hint for scraped job result data"""
    id: str
    title: str
    slug: str
    remote: bool
    primaryRoleTitle: str
    locationNames: Dict
    liveStartAt: int
    jobType: str
    description: str
    # there are more fields, but these are basic ones

class CompanyData(TypedDict):
    """type hint for scraped company result data"""
    id: str
    badges: list
    companySize: str
    highConcept: str
    highlightedJobListings: List[JobData]
    logoUrl: str
    name: str
    slug: str
    # there are more fields, but these are basic ones
def parse_company(result: ScrapeApiResponse) -> CompanyData:
    """parse company data from angel.co company page"""
    graph = extract_apollo_state(result)
    for key in graph:
        if key.startswith("Startup:"):
            company = graph[key]
            break
    else:
        raise ValueError("no embedded company data could be found")
    return unpack_node_references(company, graph)

async def scrape_companies(company_ids: List[str], session: ScrapflyClient) -> List[CompanyData]:
    """scrape angel.co companies"""
    log.info(f"scraping {len(company_ids)} companies: {company_ids}")
    urls = [f"https://angel.co/company/{company_id}/jobs" for company_id in company_ids]
    companies = []
    async for result in session.concurrent_scrape([ScrapeConfig(url, asp=True, cache=True) for url in urls]):
        companies.append(parse_company(result))
    return companies
async def scrape_search(session: ScrapflyClient, role: str = "", location: str = "") -> List[CompanyData]:
    """scrape angel.co search"""
    # angel.co has 3 types of search urls: for roles, for locations and for combination of both
    if role and location:
        url = f"https://angel.co/role/l/{role}/{location}"
    elif role:
        url = f"https://angel.co/role/{role}"
    elif location:
        url = f"https://angel.co/location/{location}"
    else:
        raise ValueError("need to pass either role or location argument to scrape search")

    async def scrape_search_page(page_numbers: List[int]) -> Tuple[List[CompanyData], Dict]:
        """scrape search pages concurrently"""
        companies = []
        log.info(f"scraping search of {role} in {location}; pages {page_numbers}")
        search_meta = None
        async for result in session.concurrent_scrape(
            [ScrapeConfig(url + f"?page={page}", asp=True, cache=True) for page in page_numbers]
        ):
            graph = extract_apollo_state(result)
            search_meta = graph[next(key for key in graph if "seoLandingPageJobSearchResults" in key)]
            companies.extend(
                [unpack_node_references(graph[key], graph) for key in graph if key.startswith("StartupResult")]
            )
        return companies, search_meta

    # scrape first page
    first_page_companies, pagination_meta = await scrape_search_page([1])
    # scrape other pages
    pages_to_scrape = list(range(2, pagination_meta["pageCount"] + 1))
    other_page_companies, _ = await scrape_search_page(pages_to_scrape)
    return first_page_companies + other_page_companies
async def run():
    with ScrapflyClient(key="YOUR_SCRAPFLY_KEY", max_concurrency=2) as session:
        result_search = await scrape_search(session=session, role="python-developer")
        result_companies = await scrape_companies(["moxion-power-co"], session=session)

if __name__ == "__main__":
    asyncio.run(run())