How to Scrape Angel.co Company Data and Job Listings


In this tutorial, we'll take a look at how to scrape AngelList (angel.co) - a major directory for startup data and job listing information in the tech industry.

Angel.co contains data fields like: company overview, employee information, culture overview, funding details and job listings.

AngelList is notoriously challenging to scrape as it uses many anti-scraping protection tools. To get around them, we'll be using Python with the ScrapFly SDK, which will make this task a breeze. Let's dive in!

Why Scrape Angel.co?

Angel.co contains loads of data related to tech startups. By scraping details like company information, employee data, company culture, funding and jobs we can create powerful business intelligence datasets. This can be used for competitive advantage or general market analysis. Job data and company contacts are also used by recruiters to generate business leads and for growth hacking.

For more on scraping use cases, see our extensive web scraping use case article.

Setup

As we'll see later on, angel.co is a pretty easy target to scrape once its blocking is dealt with. All we need is a modern version of Python (3.7+) and the scrapfly-sdk package, which will allow us to bypass the anti-scraping technologies Angel.co uses and retrieve the public HTML data.

Optionally, for this tutorial, we'll also use loguru - a pretty logging library that'll help us keep track of what's going on via nice colorful logs.

These packages can be easily installed via pip command:

$ pip install scrapfly-sdk loguru

Why are we using ScrapFly?


ScrapFly offers several powerful features that'll help us get around web scraper blocking.

Angel.co uses many anti-scraping protection technologies to prevent automated access to its public data. So, to access it we'll be using ScrapFly's Anti Scraping Protection Bypass feature, which can be enabled for any request in the Python SDK:

from scrapfly import ScrapflyClient, ScrapeConfig

client = ScrapflyClient(key='YOUR_SCRAPFLY_KEY')
result = client.scrape(ScrapeConfig(
    url="https://angel.co/company/moxion-power-co",
    # we need to enable Anti Scraping Protection bypass with a keyword argument:
    asp=True,
))

We'll be using this technique for every page we scrape in this tutorial. Let's take a look at how it all adds up!

Finding Companies and Jobs

Let's start our AngelList scraper by taking a look at AngelList's search system. This will allow us to find companies and jobs listed on the website.

There are several ways to find these details on angel.co, but we'll take a look at the two most popular ones: searching by role and/or location:

we can see the URL change when we adjust our search

In the video above, we can see the URL progression of the search - now let's replicate it in our scraper code!

To scrape the search, let's first take a look at the contents of a single search page: where is the data we want located, and how can we extract it from the HTML?

If we take a look at a search page like angel.co/role/l/python-developer/san-francisco and view the page source, we can see the search result data embedded in a JavaScript variable:

we can see the search data tucked away in a script node

This is a common pattern for GraphQL-powered websites where the page cache is stored as JSON in the HTML. Angel.co in particular is powered by Apollo GraphQL.
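To illustrate, here's a rough sketch (heavily trimmed and purely illustrative) of the shape of that embedded JSON - the key names are taken from the results we'll encounter later in this tutorial:

import json

# a heavily trimmed, illustrative sketch of the JSON found in the
# <script id="__NEXT_DATA__"> element of an angel.co search page
next_data = json.loads("""
{
  "props": {
    "pageProps": {
      "apolloState": {
        "data": {
          "StartupResult:6427941": {"id": "6427941", "name": "Capitalmind"},
          "JobListingSearchResult:2275832": {"title": "Python Developer"}
        }
      }
    }
  }
}
""")
# the Apollo cache graph is what we're after:
graph = next_data["props"]["pageProps"]["apolloState"]["data"]
print(list(graph))  # ['StartupResult:6427941', 'JobListingSearchResult:2275832']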


For a complete, detailed guide on what GraphQL is and how to scrape it, see our in-depth tutorial on scraping GraphQL with Python.


This is super convenient for our AngelList web scraper because we don't need to parse the HTML and can pick up all of the data at once. Let's see how to scrape this:

import json
import asyncio
from scrapfly import ScrapeApiResponse, ScrapeConfig, ScrapflyClient
from loguru import logger as log


def extract_apollo_state(result: ScrapeApiResponse):
    """extract apollo state graph from a page"""
    data = result.selector.css("script#__NEXT_DATA__::text").get()
    data = json.loads(data)
    graph = data["props"]["pageProps"]["apolloState"]["data"]
    return graph


async def scrape_search(session: ScrapflyClient, role: str = "", location: str = ""):
    """scrape angel.co search"""
    # angel.co has 3 types of search urls: for roles, for locations and for combination of both
    if role and location:
        url = f"https://angel.co/role/l/{role}/{location}"
    elif role:
        url = f"https://angel.co/role/{role}"
    elif location:
        url = f"https://angel.co/location/{location}"
    else:
        raise ValueError("need to pass either role or location argument to scrape search")
        
    log.info(f'scraping search of "{role}" in "{location}"')
    scrape = ScrapeConfig(
        url=url,  # url to scrape
        asp=True,  # this will enable anti-scraping protection bypass
    )
    result = await session.async_scrape(scrape)
    graph = extract_apollo_state(result)
    return graph

Let's run this code and see the results it generates:

Run code and example output
if __name__ == "__main__":
    with ScrapflyClient(key="YOUR_SCRAPFLY_KEY", max_concurrency=2) as session:
        result = asyncio.run(scrape_search(session, role="python-developer"))
        print(json.dumps(result, indent=2, ensure_ascii=False))
{
    "props": {
        "pageProps": {
            "page": null,
            "role": "python-developer",
            "apollo": null,
            "apolloState": {
                "data": {
                    ...
                    "StartupResult:6427941": {
                        "id": "6427941",
                        "badges": [
                            {
                                "type": "id",
                                "generated": false,
                                "id": "Badge:ACTIVELY_HIRING",
                                "typename": "Badge"
                            }
                        ],
                        "companySize": "SIZE_11_50",
                        ...
                    "JobListingSearchResult:2275832": {
                        "autoPosted": false,
                        "atsSource": null,
                        "description": "**Company: Capitalmind**\n\nAt Capitalmind we ...",
                        "jobType": "full-time",
                        "liveStartAt": 1656420205,
                        "locationNames": {
                            "type": "json",
                            "json": ["Bengaluru"]
                        },
                        "primaryRoleTitle": "DevOps",
                        "remote": false,
                        "slug": "python-developer",
                        "title": "Python Developer",
                        "compensation": "₹50,000 – ₹1L",
                    ...

The first thing we notice is that there are a lot of results and that they're in a very complicated format. The data we receive here is a graph - a storage format where various data objects are connected by references. To make better sense of it, let's parse it into a familiar, flat structure instead:

from copy import deepcopy


def unpack_node_references(node, graph, debug=False):
    """
    unpacks references in a graph node to a flat node structure:

    >>> unpack_node_references({"field": {"id": "reference1", "type": "id"}}, graph={"reference1": {"foo": "bar"}})
    {'field': {'foo': 'bar'}}
    """

    def flatten(value):
        try:
            if value["type"] != "id":
                return value
        except (KeyError, TypeError):
            return value
        data = deepcopy(graph[value["id"]])
        # flatten nodes too:
        if data.get("node"):
            data = flatten(data["node"])
        if debug:
            data["__reference"] = value["id"]
        return data

    node = flatten(node)

    for key, value in node.items():
        if isinstance(value, list):
            node[key] = [flatten(v) for v in value]
        elif isinstance(value, dict):
            node[key] = unpack_node_references(value, graph)
    return node

Above, we defined a function to flatten complex graph structures. It works by replacing every reference with the data it points to. In our case, we want to get the Company object from the graph and all of its related objects like jobs, people etc.:

process of converting the graph into a flat structure

The illustration above visualizes the reference unpacking process.
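For example, here's how the function behaves on a tiny made-up graph that mimics angel.co's structure (the badge node and company name here are hypothetical, but the reference format matches what we saw in the search results):

# a tiny made-up graph mimicking angel.co's reference structure
graph = {
    "Badge:ACTIVELY_HIRING": {"id": "ACTIVELY_HIRING", "label": "Actively Hiring"},
    "StartupResult:1": {
        "name": "Example Co",
        # a reference pointing at the badge node above:
        "badges": [{"type": "id", "generated": False, "id": "Badge:ACTIVELY_HIRING"}],
    },
}
print(unpack_node_references(graph["StartupResult:1"], graph))
# {'name': 'Example Co', 'badges': [{'id': 'ACTIVELY_HIRING', 'label': 'Actively Hiring'}]}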

Next, let's add this graph parsing to our scraper as well as paging ability so we can collect nicely formatted company data from all of the job pages:


from typing import Dict, List, Tuple
from typing_extensions import TypedDict


class JobData(TypedDict):
    """type hint for scraped job result data"""
    id: str
    title: str
    slug: str
    remote: bool
    primaryRoleTitle: str
    locationNames: Dict
    liveStartAt: int
    jobType: str
    description: str
    # there are more fields, but these are basic ones


class CompanyData(TypedDict):
    """type hint for scraped company result data"""
    id: str
    badges: list
    companySize: str
    highConcept: str
    highlightedJobListings: List[JobData]
    logoUrl: str
    name: str
    slug: str
    # there are more fields, but these are basic ones


async def scrape_search(session: ScrapflyClient, role: str = "", location: str = "") -> List[CompanyData]:
    """scrape angel.co search"""
    # angel.co has 3 types of search urls: for roles, for locations and for combination of both
    if role and location:
        url = f"https://angel.co/role/l/{role}/{location}"
    elif role:
        url = f"https://angel.co/role/{role}"
    elif location:
        url = f"https://angel.co/location/{location}"
    else:
        raise ValueError("need to pass either role or location argument to scrape search")

    async def scrape_search_page(page_numbers: List[int]) -> Tuple[List[CompanyData], Dict]:
        """scrape search pages concurrently"""
        companies = []
        log.info(f"scraping search of {role} in {location}; pages {page_numbers}")
        search_meta = None
        async for result in session.concurrent_scrape(
            [ScrapeConfig(url + f"?page={page}", asp=True, cache=True) for page in page_numbers]
        ):
            graph = extract_apollo_state(result)
            search_meta = graph[next(key for key in graph if "seoLandingPageJobSearchResults" in key)]
            companies.extend(
                [unpack_node_references(graph[key], graph) for key in graph if key.startswith("StartupResult")]
            )
        return companies, search_meta

    # scrape first page
    first_page_companies, pagination_meta = await scrape_search_page([1])
    # scrape other pages
    pages_to_scrape = list(range(2, pagination_meta["pageCount"] + 1))
    other_page_companies, _ = await scrape_search_page(pages_to_scrape)
    return first_page_companies + other_page_companies
Run code and example output
if __name__ == "__main__":
    with ScrapflyClient(key="YOUR_SCRAPFLY_KEY", max_concurrency=2) as session:
        result = asyncio.run(scrape_search(session, role="python-developer"))
        print(json.dumps(result, indent=2, ensure_ascii=False))
[
  {
    "id": "6427941",
    "badges": [
      {
        "id": "ACTIVELY_HIRING",
        "name": "ACTIVELY_HIRING_BADGE",
        "label": "Actively Hiring",
        "tooltip": "Actively processing applications",
        "avatarUrl": null,
        "rating": null,
        "__typename": "Badge"
      }
    ],
    "companySize": "SIZE_11_50",
    "highConcept": "India's First Digital Asset Management Company",
    "highlightedJobListings": [
      {
        "autoPosted": false,
        "atsSource": null,
        "description": "**Company: Capitalmind**\n\nAt Capitalmind <...truncacted...>",
        "jobType": "full-time",
        "liveStartAt": 1656420205,
        "locationNames": {
          "type": "json",
          "json": [
            "Bengaluru"
          ]
        },
        "primaryRoleTitle": "DevOps",
        "remote": false,
        "slug": "python-developer",
        "title": "Python Developer",
        "compensation": "₹50,000 – ₹1L",
        "id": "2275832",
        "isBookmarked": false,
        "__typename": "JobListingSearchResult"
      }
    ],
    "logoUrl": "https://photos.angel.co/startups/i/6427941-9e4960b31904ccbcfe7e3235228ceb41-medium_jpg.jpg?buster=1539167505",
    "name": "Capitalmind",
    "slug": "capitalmindamc",
    "__typename": "StartupResult"
  },
...
]

If you're having trouble executing this code, see the Full Scraper Code section below for the full code.

Our updated scraper is now capable of scraping all search pages and flattening the graph data into something more readable. We could further parse it to get rid of unwanted fields, but we'll leave that up to you.
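For example, a small post-processing helper could keep just a few fields per company. This is purely illustrative - the helper name is made up, but the field names come from the example output above:

def simplify_company(company: CompanyData) -> dict:
    """illustrative helper: keep only a handful of fields from a scraped search result"""
    return {
        "id": company["id"],
        "name": company["name"],
        "size": company.get("companySize"),
        "jobs": [job["title"] for job in company.get("highlightedJobListings", [])],
    }

# e.g. applied to the list returned by scrape_search() above:
# simplified = [simplify_company(company) for company in results]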

One thing to notice here is that the company and job data is not complete. While there's a lot of data here, there's even more of it in the full dataset available on the /company/ endpoint pages. Next, let's take a look at how we can scrape that!

Scraping Companies and Jobs

Company pages on angel.co contain even more details than we can see during search. For example, if we take a look at a page like angel.co/company/moxion-power-co we can see much more data available in the visible part of the page:

Example of a company profile page on angel.co

We can apply the same scraping techniques we used for search to company pages as well. Let's take a look at how:

def parse_company(result: ScrapeApiResponse) -> CompanyData:
    """parse company data from angel.co company page"""
    graph = extract_apollo_state(result)
    company = None
    for key in graph:
        if key.startswith("Startup:"):
            company = graph[key]
            break
    else:
        raise ValueError("no embedded company data could be found")
    return unpack_node_references(company, graph)


async def scrape_companies(company_ids: List[str], session: ScrapflyClient) -> List[CompanyData]:
    """scrape angel.co companies"""
    urls = [f"https://angel.co/company/{company_id}/jobs" for company_id in company_ids]
    companies = []
    async for result in session.concurrent_scrape([ScrapeConfig(url, asp=True, cache=True) for url in urls]):
        companies.append(parse_company(result))
    return companies
Run code and example output
if __name__ == "__main__":
    with ScrapflyClient(key="YOUR_SCRAPFLY_KEY", max_concurrency=2) as session:
        result = asyncio.run(scrape_companies(["moxion-power-co"], session=session))
        print(json.dumps(result[0], indent=2, ensure_ascii=False))
{
  "id": "8281817",
  "__typename": "Startup",
  "slug": "moxion-power-co",
  "completenessScore": 92,
  "currentUserCanEditProfile": false,
  "currentUserCanRecruitForStartup": false,
  "completeness": {"score": 95},
  "name": "Moxion Power",
  "logoUrl": "https://photos.angel.co/startups/i/8281817-91faf535f176a41dc39259fc232d1b4e-medium_jpg.jpg?buster=1619536432",
  "highConcept": "Zero-Emissions Temporary Power as a Service",
  "hiring": true,
  "isOperating": null,
  "companySize": "SIZE_11_50",
  "totalRaisedAmount": 13225000,
  "companyUrl": "https://www.moxionpower.com/",
  "twitterUrl": "https://twitter.com/moxionpower",
  "blogUrl": "",
  "facebookUrl": "",
  "linkedInUrl": "https://www.linkedin.com/company/moxion-power-co/",
  "productHuntUrl": "",
  "public": true,
  "published": true,
  "quarantined": false,
  "isShell": false,
  "isIncubator": false,
  "currentUserCanUpdateInvestors": false,
  "jobPreamble": "Moxion is looking to hire a diverse team across several disciplines, currently focusing on engineering and production.",
  "jobListingsConnection({\"after\":\"MA==\",\"filters\":{\"jobTypes\":[],\"locationIds\":[],\"roleIds\":[]},\"first\":20})": {
    "totalPageCount": 3,
    "pageSize": 20,
    "edges": [
      {
        "id": "2224735",
        "public": true,
        "primaryRoleTitle": "Product Designer",
        "primaryRoleParent": "Designer",
        "liveStartAt": 1653724125,
        "descriptionSnippet": "<ul>\n<li>Conduct user research to drive design decisions</li>\n<li>Design graphics to be vinyl printed onto physical hardware and signage</li>\n</ul>\n",
        "title": "Senior UI/UX Designer",
        "slug": "senior-ui-ux-designer",
        "jobType": "full_time",
  ...
}

Just by adding a few lines of code, we collect each company's job, employee, culture and funding details. Because we used a generic way of scraping Apollo GraphQL-powered websites like angel.co, we can apply this approach to many other pages with ease!

Let's wrap this up by taking a look at the full scraper code and some other tips and tricks when it comes to scraping this target.

FAQ

To wrap this guide up let's take a look at some frequently asked questions about web scraping angel.co:

Is it legal to scrape angel.co?

Yes. AngelList data is publicly available, and we're not extracting anything private. Scraping angel.co at slow, respectful rates falls under the ethical scraping definition. That being said, attention should be paid to GDPR compliance in the EU when scraping personal data such as people's (employee) identifiable information. For more, see our Is Web Scraping Legal? article.

How to find all company pages on angel.co?

Finding company pages without job listings is a bit more difficult since angel.co doesn't provide a site directory or a sitemap for crawlers.

For this, the angel.co/search endpoint can be used. Alternatively, we can take advantage of public search indexes such as google.com or bing.com using queries like: site:angel.co inurl:/company/

Summary

In this tutorial, we built an angel.co scraper. We took a look at how to discover company pages through AngelList's search functionality. Then, we wrote a generic dataset parser for GraphQL-powered websites, which we applied to parsing angel.co search results and company data.

For this, we used Python with a few community packages included in the scrapfly-sdk, and to prevent being blocked we used ScrapFly's API, which smartly configures every web scraper connection. For more on ScrapFly, see our documentation and try it out for free!

Full Scraper Code

Finally, let's put everything together: finding companies using search and scraping their company and job data, with ScrapFly integration:

import asyncio
import json
from copy import deepcopy
from pathlib import Path
from typing import Dict, List, Tuple

from loguru import logger as log
from parsel import Selector
from scrapfly import ScrapeApiResponse, ScrapeConfig, ScrapflyClient
from typing_extensions import TypedDict


def unpack_node_references(node: Dict, graph: Dict, debug: bool = False) -> Dict:
    """
    unpacks references in a graph node to a flat node structure:

    >>> unpack_node_references({"field": {"id": "reference1", "type": "id"}}, graph={"reference1": {"foo": "bar"}})
    {'field': {'foo': 'bar'}}
    """

    def flatten(value):
        try:
            if value["type"] != "id":
                return value
        except (KeyError, TypeError):
            return value
        data = deepcopy(graph[value["id"]])
        # flatten nodes too:
        if data.get("node"):
            data = flatten(data["node"])
        if debug:
            data["__reference"] = value["id"]
        return data

    node = flatten(node)

    for key, value in node.items():
        if isinstance(value, list):
            node[key] = [flatten(v) for v in value]
        elif isinstance(value, dict):
            node[key] = unpack_node_references(value, graph)
    return node


def extract_apollo_state(result: ScrapeApiResponse):
    """extract apollo state graph from a page"""
    data = result.selector.css("script#__NEXT_DATA__::text").get()
    data = json.loads(data)
    graph = data["props"]["pageProps"]["apolloState"]["data"]
    return graph


class JobData(TypedDict):
    """type hint for scraped job result data"""
    id: str
    title: str
    slug: str
    remote: bool
    primaryRoleTitle: str
    locationNames: Dict
    liveStartAt: int
    jobType: str
    description: str
    # there are more fields, but these are basic ones


class CompanyData(TypedDict):
    """type hint for scraped company result data"""
    id: str
    badges: list
    companySize: str
    highConcept: str
    highlightedJobListings: List[JobData]
    logoUrl: str
    name: str
    slug: str
    # there are more fields, but these are basic ones


def parse_company(result: ScrapeApiResponse) -> CompanyData:
    """parse company data from angel.co company page"""
    graph = extract_apollo_state(result)
    for key in graph:
        if key.startswith("Startup:"):
            company = graph[key]
            break
    else:
        raise ValueError("no embedded company data could be found")
    return unpack_node_references(company, graph)


async def scrape_companies(company_ids: List[str], session: ScrapflyClient) -> List[CompanyData]:
    """scrape angel.co companies"""
    log.info(f"scraping {len(company_ids)} companies: {company_ids}")
    urls = [f"https://angel.co/company/{company_id}/jobs" for company_id in company_ids]
    companies = []
    async for result in session.concurrent_scrape([ScrapeConfig(url, asp=True, cache=True) for url in urls]):
        companies.append(parse_company(result))
    return companies


async def scrape_search(session: ScrapflyClient, role: str = "", location: str = "") -> List[CompanyData]:
    """scrape angel.co search"""
    # angel.co has 3 types of search urls: for roles, for locations and for combination of both
    if role and location:
        url = f"https://angel.co/role/l/{role}/{location}"
    elif role:
        url = f"https://angel.co/role/{role}"
    elif location:
        url = f"https://angel.co/location/{location}"
    else:
        raise ValueError("need to pass either role or location argument to scrape search")

    async def scrape_search_page(page_numbers: List[int]) -> Tuple[List[CompanyData], Dict]:
        """scrape search pages concurrently"""
        companies = []
        log.info(f"scraping search of {role} in {location}; pages {page_numbers}")
        search_meta = None
        async for result in session.concurrent_scrape(
            [ScrapeConfig(url + f"?page={page}", asp=True, cache=True) for page in page_numbers]
        ):
            graph = extract_apollo_state(result)
            search_meta = graph[next(key for key in graph if "seoLandingPageJobSearchResults" in key)]
            companies.extend(
                [unpack_node_references(graph[key], graph) for key in graph if key.startswith("StartupResult")]
            )
        return companies, search_meta

    # scrape first page
    first_page_companies, pagination_meta = await scrape_search_page([1])
    # scrape other pages
    pages_to_scrape = list(range(2, pagination_meta["pageCount"] + 1))
    other_page_companies, _ = await scrape_search_page(pages_to_scrape)
    return first_page_companies + other_page_companies


async def run():
    with ScrapflyClient(key="YOUR_SCRAPFLY_KEY", max_concurrency=2) as session:
        result_search = await scrape_search(session=session, role="python-developer")
        result_companies = await scrape_companies(["moxion-power-co"], session=session)

if __name__ == "__main__":
    asyncio.run(run())
