How to Scrape Crunchbase Company and People Data (2023 Update)

In this tutorial, we'll take a look at how to scrape Crunchbase - the largest public resource for financial information on public and private companies and investments.

Crunchbase contains thousands of company profiles which include investment data, funding information, leadership positions, mergers, news and industry trends.

To scrape Crunchbase, we'll be using the hidden web data scraping approach with Python and an HTTP client library.

We'll be focusing mostly on capturing company data, though the generic scraping techniques we'll learn can be applied to other Crunchbase areas such as people or acquisition data with very little effort. Let's dive in!

Latest Crunchbase.com Scraper Code

https://github.com/scrapfly/scrapfly-scrapers/

Why Scrape Crunchbase.com?

Crunchbase has an enormous business dataset that can be used in many forms of market analytics and business intelligence. For example, the company dataset contains the company's summary details (like description, website and address), public financial information (like acquisitions and investments) as well as leadership and technology usage data.

Additionally, Crunchbase contains many data points used in lead generation, like the company's contact details, leadership's social profiles and event aggregation.

For more on scraping use cases see our extensive web scraping use case article.

Project Setup

In this tutorial we'll be using Python and two major community packages:

  • httpx - HTTP client library which will let us communicate with crunchbase.com's servers
  • parsel - HTML parsing library though we'll be doing very little HTML parsing in this tutorial and will be mostly working with JSON data directly instead.

Optionally we'll also use loguru - a pretty logging library that'll help us keep track of what's going on via nice colorful logs.

These packages can be easily installed using the pip command:

$ pip install httpx parsel loguru

Alternatively, feel free to swap httpx out for any other HTTP client package such as requests, as we'll only need basic HTTP functions which are almost interchangeable in every library. As for parsel, another great alternative is the beautifulsoup package.
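
If you want a quick sanity check that everything installed correctly, here's a tiny snippet that fetches a page with httpx and extracts its title with parsel (example.com is just a placeholder target):

import httpx
from parsel import Selector

# fetch a page and extract its <title> text - just to confirm both packages work
response = httpx.get("https://example.com/")
sel = Selector(text=response.text)
print(response.status_code, sel.xpath("//title/text()").get())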

Hands on Python Web Scraping Tutorial and Example Project

If you're new to web scraping with Python we recommend checking out our full introduction tutorial to web scraping with Python and common best practices.


Available Crunchbase Targets

Crunchbase contains several data types: acquisitions, people, events, hubs, funding rounds and companies. In this tutorial, we'll focus on company and people data though we'll be using generic parsing techniques which can be applied to all of the Crunchbase pages.

crunchbase discovery page
Crunchbase.com/discovery page shows all available dataset types

You can explore available data types by taking a look at the crunchbase.com/discover page.

Finding Crunchbase Companies and People

To start scraping Crunchbase.com content we first need to discover all of the company and people URLs. Crunchbase does offer a search system; however, it's only available to premium users. So, how do we find these targets?

Since Crunchbase wants to be crawled and indexed by search engines it offers a sitemap directory that contains all of its target URLs. Let's start by taking a look at crunchbase.com/robots.txt endpoint:

User-agent: *
Allow: /v4/md/applications/crunchbase
Disallow: /login
<...>
Sitemap: https://www.crunchbase.com/www-sitemaps/sitemap-index.xml

The /robots.txt page indicates crawling suggestions for various web crawlers (like Google) and, conveniently, advertises the location of the sitemap index.
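
Rather than hard-coding this address in our scraper, we could also read it straight from robots.txt - here's a minimal sketch that simply picks out the Sitemap: directives:

import httpx

# fetch robots.txt and pick out the advertised sitemap locations
response = httpx.get("https://www.crunchbase.com/robots.txt")
sitemap_urls = [
    line.split(":", 1)[1].strip()
    for line in response.text.splitlines()
    if line.lower().startswith("sitemap:")
]
print(sitemap_urls)
# ['https://www.crunchbase.com/www-sitemaps/sitemap-index.xml']

The sitemap index itself lists separate sitemaps for each target page type: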

<?xml version='1.0' encoding='UTF-8'?>
<sitemapindex xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd" xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <sitemap>
      <loc>https://www.crunchbase.com/www-sitemaps/sitemap-acquisitions-2.xml.gz</loc>
      <lastmod>2022-07-06T06:05:33.000Z</lastmod>
	</sitemap>
	<...>
    <sitemap>
      <loc>https://www.crunchbase.com/www-sitemaps/sitemap-events-0.xml.gz</loc>
      <lastmod>2022-07-06T06:09:30.000Z</lastmod>
	</sitemap>
	<...>
    <sitemap>
      <loc>https://www.crunchbase.com/www-sitemaps/sitemap-funding_rounds-9.xml.gz</loc>
      <lastmod>2022-07-06T06:10:49.000Z</lastmod>
    </sitemap>
	<...>
	<sitemap>
      <loc>https://www.crunchbase.com/www-sitemaps/sitemap-hubs-1.xml.gz</loc>
      <lastmod>2022-07-06T06:05:10.000Z</lastmod>
    </sitemap>
	<...>
	<sitemap>
      <loc>https://www.crunchbase.com/www-sitemaps/sitemap-organizations-42.xml.gz</loc>
      <lastmod>2022-07-06T06:10:35.000Z</lastmod>
	</sitemap>
	<...>
	<sitemap>
      <loc>https://www.crunchbase.com/www-sitemaps/sitemap-people-29.xml.gz</loc>
      <lastmod>2022-07-06T06:09:25.000Z</lastmod>
    </sitemap>
</sitemapindex>

We can see that this page contains sitemap index pages for acquisitions, events, funding rounds, hubs as well as companies (aka organizations) and people.

Each sitemap file can contain a maximum of 50,000 URLs, so using this index we can currently find over 2 million companies and almost 1.5 million people!

Further, each entry carries a last update date in the <lastmod> node, so we also know when each sitemap was last updated.

Let's take a look at how we can scrape all of this.

Scraping Sitemaps

To scrape sitemaps we'll download the sitemap indexes using our httpx client and parse the URLs using parsel:

import gzip
from datetime import datetime
from typing import Iterator, List, Literal, Tuple

import httpx
from loguru import logger as log
from parsel import Selector


async def _scrape_sitemap_index(session: httpx.AsyncClient) -> List[str]:
    """scrape Crunchbase Sitemap index for all sitemap urls"""
    log.info("scraping sitemap index for sitemap urls")
    response = await session.get("https://www.crunchbase.com/www-sitemaps/sitemap-index.xml")
    sel = Selector(text=response.text)
    urls = sel.xpath("//sitemap/loc/text()").getall()
    log.info(f"found {len(urls)} sitemaps")
    return urls


def parse_sitemap(response) -> Iterator[Tuple[str, datetime]]:
    """parse sitemap for location urls and their last modification times"""
    sel = Selector(text=gzip.decompress(response.content).decode())
    urls = sel.xpath("//url")
    log.info(f"found {len(urls)} in sitemap {response.url}")
    for url_node in urls:
        url = url_node.xpath("loc/text()").get()
        last_modified = datetime.fromisoformat(url_node.xpath("lastmod/text()").get().strip("Z"))
        yield url, last_modified


async def discover_target(
    target: Literal["organizations", "people"], session: httpx.AsyncClient, min_last_modified=None
):
    """discover url from a specific sitemap type"""
    sitemap_urls = await _scrape_sitemap_index(session)
    urls = [url for url in sitemap_urls if target in url]
    log.info(f"found {len(urls)} matching sitemap urls (from total of {len(sitemap_urls)})")
    for url in urls:
        log.info(f"scraping sitemap: {url}")
        response = await session.get(url)
        for url, mod_time in parse_sitemap(response):
            if min_last_modified and mod_time < min_last_modified:
                continue  # skip
            yield url

Above, our code retrieves the central sitemap index and collects all sitemap URLs. Then, we scrape each sitemap URL matching either people or organization (aka company) patterns. Let's run this code and see the values it produces:

Run code and example output
# append this to the previous code snippet to run it:
import asyncio
BASE_HEADERS = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US,en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
}


async def run():
    async with httpx.AsyncClient(
        limits=httpx.Limits(max_connections=5), timeout=httpx.Timeout(15.0), headers=BASE_HEADERS, http2=True
    ) as session:
        print('discovering companies:')
        async for url in discover_target("organizations", session):
            print(url)
        print('discovering people:')
        async for url in discover_target("people", session):
            print(url)


if __name__ == "__main__":
    asyncio.run(run())
discovering companies:
INFO     | _scrape_sitemap_index - scraping sitemap index for sitemap urls
INFO     | _scrape_sitemap_index - found 89 sitemaps
INFO     | discover_target - found 43 matching sitemap urls (from total of 89)
INFO     | discover_target - scraping sitemap: https://www.crunchbase.com/www-sitemaps/sitemap-organizations-0.xml.gz
INFO     | parse_sitemap - found 50000 in sitemap https://www.crunchbase.com/www-sitemaps/sitemap-organizations-0.xml.gz
https://www.crunchbase.com/organization/tesla
<...>
discovering people:
INFO     | _scrape_sitemap_index - scraping sitemap index for sitemap urls
INFO     | _scrape_sitemap_index - found 89 sitemaps
INFO     | discover_target - found 30 matching sitemap urls (from total of 89)
INFO     | discover_target - scraping sitemap: https://www.crunchbase.com/www-sitemaps/sitemap-people-0.xml.gz
INFO     | parse_sitemap - found 50000 in sitemap https://www.crunchbase.com/www-sitemaps/sitemap-people-0.xml.gz
https://www.crunchbase.com/person/john-doe
<...>

We can see that by exploring the Crunchbase sitemaps we can quickly and easily discover all of the profiles listed on the website. The min_last_modified argument also lets us restrict discovery to recently updated profiles, which is handy for keeping an existing dataset fresh.
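
For example, to only pick up organization profiles updated after a given date we can pass a cutoff datetime (the date below is arbitrary):

from datetime import datetime

async def discover_recent_organizations():
    async with httpx.AsyncClient(
        limits=httpx.Limits(max_connections=5), timeout=httpx.Timeout(15.0), headers=BASE_HEADERS, http2=True
    ) as session:
        # only yield organization urls whose sitemap <lastmod> is newer than the cutoff
        async for url in discover_target("organizations", session, min_last_modified=datetime(2022, 7, 1)):
            print(url)

Now that we can find the company and people page URLs, let's take a look at how we can scrape this public data.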

Scraping Crunchbase Companies

The Crunchbase company page contains a lot of data scattered through multiple pages:

company page
The company data seems to be scattered through multiple pages

However, instead of parsing the HTML we can dig into the page source, where the same data is also available in the page's app state variable:

page source of Crunchbase company page
dataset present in the page source (note: unquoted for readability, in the original source some characters are escaped)

We can see that a <script id="client-app-state"> node contains a large JSON document with many of the same details we see on the page. Since Crunchbase uses the Angular JavaScript front-end framework, it stores the page data in a state cache which we can extract directly instead of parsing the HTML. Let's take a look at how we can do that:

import json
from typing import Dict, List, TypedDict

import httpx
from parsel import Selector

class CompanyData(TypedDict):
    """Type hint for data returned by Crunchbase company page parser"""

    organization: Dict
    employees: List[Dict]

def _parse_organization_data(data: Dict) -> Dict:
    """example that parses main company details from the whole company dataset"""
    properties = data['properties']
    cards = data['cards']
    parsed = {
        # there's metadata in the properties field:
        "name": properties['title'],
        "id": properties['identifier']['permalink'],
        "logo": "https://res.cloudinary.com/crunchbase-production/image/upload/" + properties['identifier']['image_id'],
        "description": properties['short_description'],
        # but most of the data is in the cards field:
        "semrush_global_rank": cards['semrush_summary']['semrush_global_rank'],
        "semrush_visits_latest_month": cards['semrush_summary']['semrush_visits_latest_month'],
        # etc... There's much more data!
    }
    return parsed

def _parse_employee_data(data: Dict) -> List[Dict]:
    """example that parses employee details from the whole employee dataset"""
    parsed = []
    for person in data['entities']:
        parsed.append({
            "name": person['properties']['name'],
            "linkedin": person['properties'].get('linkedin'),
            "job_levels": person['properties'].get('job_levels'),
            "job_departments": person['properties'].get('job_departments'),
            # etc...
        })
    return parsed

def _unescape_angular(text):
    """Helper function to unescape Angular quoted text"""
    ANGULAR_ESCAPE = {
        "&a;": "&",
        "&q;": '"',
        "&s;": "'",
        "&l;": "<",
        "&g;": ">",
    }
    for from_, to in ANGULAR_ESCAPE.items():
        text = text.replace(from_, to)
    return text


def parse_company(response) -> CompanyData:
    """parse company page for company and employee data"""

    sel = Selector(text=response.text)
    app_state_data = _unescape_angular(sel.css("script#client-app-state::text").get())
    app_state_data = json.loads(app_state_data)
    # there are multiple caches:
    cache_keys = list(app_state_data["HttpState"])
    # Organization data can be found in this cache:
    data_cache_key = next(key for key in cache_keys if "entities/organizations/" in key)
    # Some employee/contact data can be found in this key:
    people_cache_key = next(key for key in cache_keys if "/data/searches/contacts" in key)

    organization = app_state_data["HttpState"][data_cache_key]["data"]
    employees = app_state_data["HttpState"][people_cache_key]["data"]
    return {
        "organization": _parse_organization_data(organization),
        "employees": _parse_employee_data(employees),
    }

async def scrape_company(company_id: str, session: httpx.AsyncClient) -> CompanyData:
    """scrape crunchbase company page for organization and employee data"""
    # note: we use /people tab because it contains the most data:
    url = f"https://www.crunchbase.com/organization/{company_id}/people"
    response = await session.get(url)
    return parse_company(response)

Run code and example output
# append this to the previous code snippet to run it:
import asyncio
import json

async def run():
    async with httpx.AsyncClient(
        limits=httpx.Limits(max_connections=5), timeout=httpx.Timeout(15.0), headers=BASE_HEADERS, http2=True
    ) as session:
        data = await scrape_company("tesla-motors", session=session)
        print(json.dumps(data, indent=2, ensure_ascii=False))


if __name__ == "__main__":
    asyncio.run(run())
{
  "organization": {
    "name": "Tesla",
    "id": "tesla-motors",
    "logo": "https://res.cloudinary.com/crunchbase-production/image/upload/v1459804290/mkxozts4fsvkj73azuls.png",
    "description": "Tesla Motors specializes in developing a full range of electric vehicles.",
    "semrush_global_rank": 3462,
    "semrush_visits_latest_month": 34638116
  },
  "employees": [
    {
      "name": "Kenneth Rogers",
      "linkedin": "kenneth-rogers-07a7b149",
      "job_levels": [
        "l_500_exec"
      ],
      "job_departments": [
        "management"
      ]
    },
    ...
  ]
}

Above we define our company scraper which as you can see is mostly parsing code. Let's quickly unpack our process here:

  1. We retrieve the organization's "people" tab page, e.g. /organization/tesla-motors/people. We use this page because all of the organization sub-pages (aka tabs) contain the same cache, but the people page additionally contains some employee data.
  2. We find the cache data in the <script id="client-app-state"> node and unquote it, as it uses special Angular escaping.
  3. We load it as JSON into a Python dictionary and select a few important fields from the dataset. Note, there's a lot of data in the cache - most of what's visible on the page and more - but for this demonstration we stick to a few essential fields (see the exploration sketch right below for a quick way to list everything that's available).
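
If you're curious what else is available in the cache, a handy trick is to list all of the HttpState cache keys - each one looks like an internal API path (e.g. keys containing "entities/organizations/" or "/data/searches/contacts", which we used above). A minimal exploration sketch reusing the _unescape_angular() helper from above:

import json

from parsel import Selector

def list_cache_keys(response) -> list:
    """list all dataset cache keys present in a Crunchbase page's app state"""
    sel = Selector(text=response.text)
    app_state_data = json.loads(_unescape_angular(sel.css("script#client-app-state::text").get()))
    # each key corresponds to a cached dataset we could parse the same way
    return list(app_state_data["HttpState"])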

As you can see, since we're scraping the Angular cache directly instead of parsing the HTML we can easily pick up the entire dataset in just a few lines of code! Can we apply this to scraping other data types hosted on Crunchbase?

Scraping Other Crunchbase Data Types

Crunchbase contains details not only on companies but also on industry news, investors (people), funding rounds and acquisitions. Because we chose to parse the Angular cache rather than the HTML itself, we can easily adapt our parser to extract datasets from these other endpoints as well:

import json
from typing import Dict, List, TypedDict

import httpx
from parsel import Selector


class PersonData(TypedDict):
    id: str
    name: str


def parse_person(response) -> PersonData:
    """parse person/investor profile from Crunchbase person's page"""
    sel = Selector(text=response.text)
    app_state_data = _unescape_angular(sel.css("script#client-app-state::text").get())
    app_state_data = json.loads(app_state_data)
    cache_keys = list(app_state_data["HttpState"])
    dataset_key = next(key for key in cache_keys if "data/entities" in key)
    dataset = app_state_data["HttpState"][dataset_key]["data"]
    parsed = {
        # we can get metadata from properties field:
        "title": dataset['properties']['title'],
        "description": dataset['properties']['short_description'],
        "type": dataset['properties']['layout_id'],
        # the rest of the data can be found in the cards field:
        "investing_overview": dataset['cards']['investor_overview_headline'],
        "socials": {k: v['value'] for k, v in dataset['cards']['overview_fields2'].items()},
        "positions": [{
            "started": job.get('started_on', {}).get('value'),
            "title": job['title'],
            "org": job['organization_identifier']['value'],
            # etc.
        } for job in dataset['cards']['current_jobs_image_list']],
        # etc... there are many more fields to parse
    }
    return parsed


async def scrape_person(person_id: str, session: httpx.AsyncClient) -> PersonData:
    """scrape Crunchbase.com investor's profile"""
    url = f"https://www.crunchbase.com/person/{person_id}"
    response = await session.get(url)
    return parse_person(response)

The example above uses the same technique we used for company data to scrape investor profiles. By extracting data from the Angular app state we can scrape the dataset of any Crunchbase endpoint in just a few lines of code!
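
To try it out, we can run the person scraper the same way we ran the company one. A short usage sketch - the person id below is a placeholder, substitute any id discovered through the people sitemap:

import asyncio
import json

async def run():
    async with httpx.AsyncClient(
        limits=httpx.Limits(max_connections=5), timeout=httpx.Timeout(15.0), headers=BASE_HEADERS, http2=True
    ) as session:
        # "example-person-id" is a placeholder - use a real id from the people sitemap
        data = await scrape_person("example-person-id", session=session)
        print(json.dumps(data, indent=2, ensure_ascii=False))

if __name__ == "__main__":
    asyncio.run(run())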

Bypass Blocking with ScrapFly

We looked at how to scrape Crunchbase.com; however, when scraping at scale we are likely to either get blocked or be served captchas, which will hinder or completely disable our web scraper.

crunchbase.com blocked page
Crunchbase.com verification page: please verify you are human

To get around this, let's take advantage of ScrapFly API which can avoid all of these blocks for us just with a few extra lines of Python code!

illustration of scrapfly's middleware

ScrapFly offers several powerful features that'll help us get around Crunchbase's web scraper blocking - most importantly, the Anti Scraping Protection bypass.

For this, we'll be using the scrapfly-sdk Python package. To start, let's install it using pip:

$ pip install scrapfly-sdk

To take advantage of ScrapFly's API in our Crunchbase web scraper all we need to do is replace our httpx session code with scrapfly-sdk client requests. For scraping Crunchbase we'll be using the Anti Scraping Protection bypass feature, which can be enabled via the asp=True argument. For example, let's take a look at how we can use ScrapFly to scrape a single company page:

from scrapfly import ScrapflyClient, ScrapeConfig

client = ScrapflyClient(key='YOUR_SCRAPFLY_KEY')
result = client.scrape(ScrapeConfig(
    url="https://www.crunchbase.com/organization/tesla-motors/people",
    # we need to enable Anti Scraping Protection bypass with a keyword argument:
    asp=True,
))
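
The scraped page can then be fed into the same parse_company() function we wrote earlier. Below is a small sketch of how that could look, assuming the SDK response exposes the page HTML through its content attribute (check the scrapfly-sdk documentation for your version):

from dataclasses import dataclass

from scrapfly import ScrapflyClient, ScrapeConfig

@dataclass
class _PageResponse:
    """tiny shim so parse_company(), which expects a .text attribute, can reuse ScrapFly results"""
    text: str

client = ScrapflyClient(key='YOUR_SCRAPFLY_KEY')
result = client.scrape(ScrapeConfig(
    url="https://www.crunchbase.com/organization/tesla-motors/people",
    asp=True,  # enable Anti Scraping Protection bypass
))
# assumption: result.content holds the scraped page HTML
company = parse_company(_PageResponse(text=result.content))
print(company["organization"]["name"])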

For more, see the Full Scrape Code section.

FAQ

To wrap this guide up let's take a look at some frequently asked questions about web scraping Crunchbase.com:

Is it legal to scrape Crunchbase.com?

Yes. Crunchbase data is publicly available, and we're not extracting anything private. Scraping Crunchbase.com at slow, respectful rates would fall under the ethical scraping definition. That being said, attention should be paid to GDPR compliance in the EU when scraping personal data such as people's (investor) data. For more, see our Is Web Scraping Legal? article.

Can you crawl Crunchbase.com?

Yes, there are many ways to crawl Crunchbase. However, crawling is unnecessary as Crunchbase has a rich sitemap infrastructure. For more, see the Finding Crunchbase Companies and People section.

Latest Crunchbase.com Scraper Code
https://github.com/scrapfly/scrapfly-scrapers/

Summary

In this tutorial, we built a Crunchbase scraper. We've taken a look at how to discover the company and people pages through Crunchbase's sitemap functionality. Then, we wrote a generic dataset parser for Angular-powered websites like Crunchbase itself and put it to use for scraping company and people data.

For this, we used Python with a few community packages like httpx and parsel, and to prevent blocking we used ScrapFly's API, which smartly configures every web scraper connection to avoid being blocked. For more on ScrapFly see our documentation and try it out for free!
