How to Scrape Crunchbase in 2024

In this tutorial, we'll explain how to scrape Crunchbase - the most extensive public resource for financial information on public and private companies and their investments.

Crunchbase contains thousands of company profiles, which include investment data, funding information, leadership positions, mergers, news and industry trends.

To scrape Crunchbase, we'll use the hidden web data scraping approach with Python and an HTTP client library.

We'll mostly focus on capturing company data through generic scraping techniques which can be applied to other Crunchbase areas, such as people or acquisition data, with minimal effort. Let's dive in!

Latest Crunchbase.com Scraper Code

https://github.com/scrapfly/scrapfly-scrapers/

Why Scrape Crunchbase.com?

Crunchbase has an enormous business dataset that can be used in various forms of market analytics and business intelligence research. For example, the company dataset contains the company's summary details (like description, website and address), public financial information (like acquisitions and investments) and technology usage data.

Additionally, Crunchbase data contains many data points used in lead generation, like the company's contact details, leadership's social profiles, and event aggregation.

For more on scraping use cases, see our extensive web scraping use case article.

Project Setup

To scrape Crunchbase, we'll be using Python and two major community packages:

  • httpx - an HTTP client library which will let us communicate with crunchbase.com's servers
  • parsel - an HTML parsing library, though we'll be doing very little HTML parsing in this tutorial and will mostly be working with JSON data directly instead.

Optionally we'll also use loguru - a pretty logging library that'll help us keep track of what's going on through nice colorful logs.

These packages can be easily installed using the pip command:

$ pip install httpx parsel loguru

Alternatively, feel free to swap httpx out for any other HTTP client package, such as requests, as we'll only need basic HTTP functions which are almost interchangeable in every library. As for parsel, another great alternative is the beautifulsoup package.
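
To confirm that everything is installed correctly, here's a quick sanity check (an illustrative snippet, not part of the scraper) that should run without errors:

# quick sanity check that httpx and parsel are importable and working
import httpx
from parsel import Selector

response = httpx.get("https://www.crunchbase.com/robots.txt")
print(response.status_code)  # expect 200 (may differ if your IP is being blocked)
print(Selector(text="<html><body><h1>hello</h1></body></html>").css("h1::text").get())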

Hands on Python Web Scraping Tutorial and Example Project

If you're new to web scraping with Python we recommend checking out our full introduction tutorial to web scraping with Python and common best practices.

Available Crunchbase Targets

Crunchbase contains several data types: acquisitions, people, events, hubs and funding rounds. Our Crunchbase scraper will focus on the company and people data. That being said, the same technical concepts can be applied to other pages on Crunchbase.

crunchbase discovery page
Crunchbase.com/discovery page shows all available dataset types

You can explore available data types by taking a look at the Crunchbase discovery pages.

Finding Crunchbase Companies and People

To start scraping Crunchbase.com content, we first need a way to find all of the company and people URLs. Although Crunchbase offers a search system, it's only available to premium users. So, how do we find these targets?

Crunchbase offers a sitemap directory that contains all of its target URLs to be crawled and indexed by search engines. Let's start by taking a look at the crunchbase.com/robots.txt endpoint:

User-agent: *
Allow: /v4/md/applications/crunchbase
Disallow: /login
<...>
Sitemap: https://www.crunchbase.com/www-sitemaps/sitemap-index.xml

The /robots.txt page provides crawling suggestions for various web crawlers, such as Google. We can see that it points to a sitemap index which contains sitemaps for various target pages in XML format:

<?xml version='1.0' encoding='UTF-8'?>
<sitemapindex xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd" xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <sitemap>
      <loc>https://www.crunchbase.com/www-sitemaps/sitemap-acquisitions-2.xml.gz</loc>
      <lastmod>2022-07-06T06:05:33.000Z</lastmod>
	</sitemap>
	<...>
    <sitemap>
      <loc>https://www.crunchbase.com/www-sitemaps/sitemap-events-0.xml.gz</loc>
      <lastmod>2022-07-06T06:09:30.000Z</lastmod>
	</sitemap>
	<...>
    <sitemap>
      <loc>https://www.crunchbase.com/www-sitemaps/sitemap-funding_rounds-9.xml.gz</loc>
      <lastmod>2022-07-06T06:10:49.000Z</lastmod>
    </sitemap>
	<...>
	<sitemap>
      <loc>https://www.crunchbase.com/www-sitemaps/sitemap-hubs-1.xml.gz</loc>
      <lastmod>2022-07-06T06:05:10.000Z</lastmod>
    </sitemap>
	<...>
	<sitemap>
      <loc>https://www.crunchbase.com/www-sitemaps/sitemap-organizations-42.xml.gz</loc>
      <lastmod>2022-07-06T06:10:35.000Z</lastmod>
	</sitemap>
	<...>
	<sitemap>
      <loc>https://www.crunchbase.com/www-sitemaps/sitemap-people-29.xml.gz</loc>
      <lastmod>2022-07-06T06:09:25.000Z</lastmod>
    </sitemap>
</sitemapindex>

We can see that this page contains sitemap index pages for acquisitions, events, funding rounds, hubs as well as companies and people.

Each sitemap file can contain a maximum of 50,000 URLs, so with 43 organization sitemaps and 30 people sitemaps (see the run output below) this index lets us find over 2 million companies and almost 1.5 million people!

Further, the <lastmod> node indicates when each entry was last updated, so we also know how fresh the data is. Next, let's explore how to scrape this XML data.

Scraping Sitemaps

To scrape sitemaps, we'll download the sitemap indexes using our httpx client and parse the URLs using parsel:

import gzip
from datetime import datetime
from typing import Iterator, List, Literal, Tuple

import httpx
from loguru import logger as log
from parsel import Selector


async def _scrape_sitemap_index(session: httpx.AsyncClient) -> List[str]:
    """scrape Crunchbase Sitemap index for all sitemap urls"""
    log.info("scraping sitemap index for sitemap urls")
    response = await session.get("https://www.crunchbase.com/www-sitemaps/sitemap-index.xml")
    sel = Selector(text=response.text)
    urls = sel.xpath("//sitemap/loc/text()").getall()
    log.info(f"found {len(urls)} sitemaps")
    return urls


def parse_sitemap(response) -> Iterator[Tuple[str, datetime]]:
    """parse sitemap for location urls and their last modification times"""
    sel = Selector(text=gzip.decompress(response.content).decode())
    urls = sel.xpath("//url")
    log.info(f"found {len(urls)} in sitemap {response.url}")
    for url_node in urls:
        url = url_node.xpath("loc/text()").get()
        last_modified = datetime.fromisoformat(url_node.xpath("lastmod/text()").get().strip("Z"))
        yield url, last_modified


async def discover_target(
    target: Literal["organizations", "people"], session: httpx.AsyncClient, min_last_modified=None
):
    """discover url from a specific sitemap type"""
    sitemap_urls = await _scrape_sitemap_index(session)
    urls = [url for url in sitemap_urls if target in url]
    log.info(f"found {len(urls)} matching sitemap urls (from total of {len(sitemap_urls)})")
    for url in urls:
        log.info(f"scraping sitemap: {url}")
        response = await session.get(url)
        for url, mod_time in parse_sitemap(response):
            if min_last_modified and mod_time < min_last_modified:
                continue  # skip
            yield url

Above, our code retrieves the central sitemap index and collects all sitemap URLs. Then, we scrape each sitemap URL matching either people or organization patterns. Let's run this Crunchbase scraper code and see the data returned:
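
The min_last_modified parameter enables incremental discovery - for example, re-scraping only the URLs whose sitemap entries changed recently. Here's a minimal sketch of that idea (run it the same way as the example below):

# sketch: discover only organization urls whose sitemap entries changed in the last 7 days
from datetime import timedelta

async def discover_recent():
    week_ago = datetime.utcnow() - timedelta(days=7)
    async with httpx.AsyncClient(timeout=httpx.Timeout(15.0)) as session:
        async for url in discover_target("organizations", session, min_last_modified=week_ago):
            print(url)

# run with: asyncio.run(discover_recent())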

Run code and example output
# append this to the previous code snippet to run it:
import asyncio
BASE_HEADERS = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US,en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
}


async def run():
    async with httpx.AsyncClient(
        limits=httpx.Limits(max_connections=5), timeout=httpx.Timeout(15.0), headers=BASE_HEADERS, http2=True
    ) as session:
        print('discovering companies:')
        async for url in discover_target("organizations", session):
            print(url)
        print('discovering people:')
        async for url in discover_target("people", session):
            print(url)


if __name__ == "__main__":
    asyncio.run(run())
discovering companies:
INFO     | _scrape_sitemap_index - scraping sitemap index for sitemap urls
INFO     | _scrape_sitemap_index - found 89 sitemaps
INFO     | discover_target - found 43 matching sitemap urls (from total of 89)
INFO     | discover_target - scraping sitemap: https://www.crunchbase.com/www-sitemaps/sitemap-organizations-0.xml.gz
INFO     | parse_sitemap - found 50000 in sitemap https://www.crunchbase.com/www-sitemaps/sitemap-organizations-0.xml.gz
https://www.crunchbase.com/organization/tesla
<...>
discovering people:
INFO     | _scrape_sitemap_index - scraping sitemap index for sitemap urls
INFO     | _scrape_sitemap_index - found 89 sitemaps
INFO     | discover_target - found 30 matching sitemap urls (from total of 89)
INFO     | discover_target - scraping sitemap: https://www.crunchbase.com/www-sitemaps/sitemap-people-0.xml.gz
INFO     | parse_sitemap - found 50000 in sitemap https://www.crunchbase.com/www-sitemaps/sitemap-people-0.xml.gz
https://www.crunchbase.com/person/john-doe
<...>

Cool! By exploring the Crunchbase sitemaps, we can successfully discover pages on the website. Next, let's explore how to scrape Crunchbase for this data using the URLs we've collected.

Scraping Crunchbase Companies

The Crunchbase company page contains various data spread across multiple pages:

company page
The company data seems to be scattered through multiple pages

However, instead of parsing the HTML, we can dig into the page source, where we can see that the same data is also available in the page's app state variable:

page source of Crunchbase company page
dataset present in the page source (note: unquoted for readability, in the original source some characters are escaped)

We can see that a <script id="client-app-state"> node contains a large JSON document with much of the same data we see on the page (on some pages the same cache appears in a <script id="ng-state"> node instead). Since Crunchbase uses the Angular JavaScript front-end framework, it stores the page data in a state cache, which we can extract directly instead of parsing the HTML. Let's take a look at how we can apply that:

import json
import httpx
import asyncio
from typing import Dict, List, TypedDict

from parsel import Selector

class CompanyData(TypedDict):
    """Type hint for data returned by Crunchbase company page parser"""

    organization: Dict
    employees: List[Dict]

def _parse_organization_data(data: Dict) -> Dict:
    """example that parses main company details from the whole company dataset"""
    properties = data['properties']
    cards = data['cards']
    parsed = {
        # there's metadata in the properties field:
        "name": properties['title'],
        "id": properties['identifier']['permalink'],
        "logo": "https://res.cloudinary.com/crunchbase-production/image/upload/" + properties['identifier']['image_id'],
        "description": properties['short_description'],
        # but most of the data is in the cards field:
        "semrush_global_rank": cards['semrush_summary']['semrush_global_rank'],
        "semrush_visits_latest_month": cards['semrush_summary']['semrush_visits_latest_month'],
        # etc... There's much more data!
    }
    return parsed

def _parse_employee_data(data: Dict) -> List[Dict]:
    """example that parses employee details from the whole employee dataset"""
    parsed = []
    for person in data['entities']:
        parsed.append({
            "name": person['properties']['name'],
            "linkedin": person['properties'].get('linkedin'),
            "job_levels": person['properties'].get('job_levels'),
            "job_departments": person['properties'].get('job_departments'),
            # etc...
        })
    return parsed

def _unescape_angular(text):
    """Helper function to unescape Angular quoted text"""
    ANGULAR_ESCAPE = {
        "&a;": "&",
        "&q;": '"',
        "&s;": "'",
        "&l;": "<",
        "&g;": ">",
    }
    for from_, to in ANGULAR_ESCAPE.items():
        text = text.replace(from_, to)
    return text


def parse_company(response) -> CompanyData:
    """parse company page for company and employee data"""
    sel = Selector(text=response.text)
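    # the state cache can appear as plain JSON in <script id="ng-state">,
    # or Angular-escaped in <script id="client-app-state"> which needs unescaping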
    app_state_data = sel.css("script#ng-state::text").get()
    if not app_state_data:
        app_state_data = _unescape_angular(sel.css("script#client-app-state::text").get() or "")
    app_state_data = json.loads(app_state_data)
    # there are multiple caches:
    cache_keys = list(app_state_data["HttpState"])
    # Organization data can be found in this cache:
    data_cache_key = next(key for key in cache_keys if "entities/organizations/" in key)
    # Some employee/contact data can be found in this key:
    people_cache_key = next(key for key in cache_keys if "/data/searches/contacts" in key)

    organization = app_state_data["HttpState"][data_cache_key]["data"]
    employees = app_state_data["HttpState"][people_cache_key]["data"]
    return {
        "organization": _parse_organization_data(organization),
        "employees": _parse_employee_data(employees),
    }


async def scrape_company(company_id: str, session: httpx.AsyncClient) -> CompanyData:
    """scrape crunchbase company page for organization and employee data"""
    # note: we use /people tab because it contains the most data:
    url = f"https://www.crunchbase.com/organization/{company_id}/people"
    response = await session.get(url)
    return parse_company(response)
Run code and example output
# append this to the previous code snippet to run it:
BASE_HEADERS = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US,en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
}

async def run():
    async with httpx.AsyncClient(
        limits=httpx.Limits(max_connections=5), timeout=httpx.Timeout(15.0), headers=BASE_HEADERS, http2=True
    ) as session:
        data = await scrape_company("tesla-motors", session=session)
        print(json.dumps(data, indent=2, ensure_ascii=False))


if __name__ == "__main__":
    asyncio.run(run())
{
  "organization": {
    "name": "Tesla",
    "id": "tesla-motors",
    "logo": "https://res.cloudinary.com/crunchbase-production/image/upload/v1459804290/mkxozts4fsvkj73azuls.png",
    "description": "Tesla Motors specializes in developing a full range of electric vehicles.",
    "semrush_global_rank": 3462,
    "semrush_visits_latest_month": 34638116
  },
  "employees": [
    {
      "name": "Kenneth Rogers",
      "linkedin": "kenneth-rogers-07a7b149",
      "job_levels": [
        "l_500_exec"
      ],
      "job_departments": [
        "management"
      ]
    },
    ...
  ]
}

Above, we define our company scraper which, as you can see, is mostly parsing code. Let's quickly unpack the process:

  1. We retrieve the organization's "people" tab page, e.g. /organization/tesla-motors/people. We use this page because all of the organization sub-pages (aka tabs) contain the same cache, but the people tab additionally contains some employee data.
  2. We find the cache data in the <script id="ng-state"> node (or <script id="client-app-state"> on pages that use it, which needs to be unescaped from the special Angular quoting).
  3. We load it as JSON into a Python dictionary and select a few important fields from the dataset. Note that there's a lot of data in the cache - most of what's visible on the page and more - but for this demonstration we stick to a few essential fields (see the exploration sketch below).
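
Since the cache holds far more data than we parse, a quick way to see what else is available is to print the card names of the organization dataset. This is just an exploration sketch reusing the helpers and BASE_HEADERS defined above:

# exploration sketch: print the available "cards" of a company dataset
import asyncio

async def explore(company_id: str):
    async with httpx.AsyncClient(headers=BASE_HEADERS, http2=True) as session:
        response = await session.get(f"https://www.crunchbase.com/organization/{company_id}/people")
        sel = Selector(text=response.text)
        raw = sel.css("script#ng-state::text").get() or _unescape_angular(sel.css("script#client-app-state::text").get() or "")
        state = json.loads(raw)
        org_key = next(key for key in state["HttpState"] if "entities/organizations/" in key)
        print(list(state["HttpState"][org_key]["data"]["cards"]))  # e.g. semrush_summary, ...

asyncio.run(explore("tesla-motors"))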

As you can see, since we're scraping the Angular cache directly instead of parsing HTML, we can pick up the entire dataset in just a few lines of code! Can we apply this to other data types hosted on Crunchbase?

Scraping Other Crunchbase Data Types

Crunchbase contains details not only of companies but also of industry news, investors (people), funding rounds and acquisitions. Because we chose to parse the Angular cache rather than the HTML itself, we can easily adapt our parser to extract datasets from these other endpoints as well:

import json
import httpx
from typing import Dict, List, TypedDict

from parsel import Selector

# note: we also reuse the _unescape_angular() helper defined in the company scraper above


class PersonData(TypedDict):
    title: str
    description: str
    type: str
    investing_overview: str
    socials: Dict
    positions: List[Dict]


def parse_person(response) -> PersonData:
    """parse person/investor profile from Crunchbase person's page"""
    sel = Selector(text=response.text)
    app_state_data = sel.css("script#ng-state::text").get()
    if not app_state_data:
        app_state_data = _unescape_angular(sel.css("script#client-app-state::text").get() or "")
    app_state_data = json.loads(app_state_data)
    cache_keys = list(app_state_data["HttpState"])
    dataset_key = next(key for key in cache_keys if "data/entities" in key)
    dataset = app_state_data["HttpState"][dataset_key]["data"]
    parsed = {
        # we can get metadata from properties field:
        "title": dataset['properties']['title'],
        "description": dataset['properties']['short_description'],
        "type": dataset['properties']['layout_id'],
        # the rest of the data can be found in the cards field:
        "investing_overview": dataset['cards']['investor_overview_headline'],
        "socials": {k: v['value'] for k, v in dataset['cards']['overview_fields2'].items()},
        "positions": [{
            "started": job.get('started_on', {}).get('value'),
            "title": job['title'],
            "org": job['organization_identifier']['value'],
            # etc.
        } for job in dataset['cards']['current_jobs_image_list']],
        # etc... there are many more fields to parse
    }
    return parsed


async def scrape_person(person_id: str, session: httpx.AsyncClient) -> PersonData:
    """scrape Crunchbase.com investor's profile"""
    url = f"https://www.crunchbase.com/person/{person_id}"
    response = await session.get(url)
    return parse_person(response)

The example above applies the same technique we used for company data to investor (person) data. By extracting the Angular app state, we can scrape the dataset of any Crunchbase endpoint with just a few lines of code!
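
As with the company scraper, we can run this with a small asyncio wrapper. This is just a usage sketch - the person id below is a placeholder, so substitute any id discovered through the people sitemaps, and BASE_HEADERS is the same header set used in the earlier examples:

# append this to the previous code snippet to run it:
import asyncio

async def run():
    async with httpx.AsyncClient(
        limits=httpx.Limits(max_connections=5), timeout=httpx.Timeout(15.0), headers=BASE_HEADERS, http2=True
    ) as session:
        data = await scrape_person("<person-id>", session=session)
        print(json.dumps(data, indent=2, ensure_ascii=False))


if __name__ == "__main__":
    asyncio.run(run())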

Bypass Blocking with ScrapFly

We looked at how to scrape Crunchbase.com. However, when scraping at scale we are likely to be blocked or served captchas to solve. This will hinder or completely disable our web scraper.

crunchbase.com blocked page
Crunchbase.com verification page: please verify you are human

ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale.

To scrape Crunchbase with scrapfly-sdk we can start by installing scrapfly-sdk package using pip:

$ pip install scrapfly-sdk

To take advantage of ScrapFly's API in our Crunchbase scraper, all we need to do is replace our httpx session code with scrapfly-sdk client requests. To avoid Crunchbase scraping blocks, we'll use the Anti Scraping Protection bypass feature, which can be enabled with the asp=True argument. For example, let's take a look at how we can use ScrapFly to scrape a single company page:

from scrapfly import ScrapflyClient, ScrapeConfig

client = ScrapflyClient(key='YOUR_SCRAPFLY_KEY')
result = client.scrape(ScrapeConfig(
    url="https://www.crunchbase.com/organization/tesla-motors/people",
    # we need to enable Anti Scraping Protection bypass with a keyword argument:
    asp=True,
))
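
The returned result contains the scraped page HTML, which can be fed straight into the parse_company() function we wrote earlier. Below is a minimal sketch that assumes the SDK response exposes the HTML through a .content attribute - check the SDK documentation for the exact accessor:

# minimal sketch: reuse our parse_company() parser with the HTML returned by ScrapFly
from types import SimpleNamespace

html = result.content  # assumption: .content holds the page HTML
data = parse_company(SimpleNamespace(text=html))  # parse_company only reads the .text attribute
print(data["organization"]["name"])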

FAQ

To wrap this guide up let's take a look at some frequently asked questions about web scraping Crunchbase:

Is it legal to scrape Crunchbase.com?

Yes. Crunchbase data is publicly available, and we're not extracting anything private. Scraping Crunchbase.com at slow, respectful rates falls under the ethical scraping definition. That being said, attention should be paid to GDPR compliance in the EU when scraping personal data such as people's (investor) data. For more, see our Is Web Scraping Legal? article.

Can you crawl Crunchbase.com?

Yes, there are many ways to crawl Crunchbase. However, crawling is unnecessary as Crunchbase has a rich sitemap infrastructure. For more, see the Finding Crunchbase Companies and People section.

Latest Crunchbase.com Scraper Code
https://github.com/scrapfly/scrapfly-scrapers/

Summary

In this tutorial, we built a Crunchbase scraper. We've taken a look at how to discover the company and people pages through Crunchbase's sitemap functionality. Then, we wrote a generic dataset parser for Angular-powered websites like Crunchbase itself and put it to use for scraping company and people data.

For this, we used Python with a few community packages like httpx and parsel, and to prevent blocking we used ScrapFly's API, which smartly configures every web scraper connection to avoid being blocked. For more on ScrapFly, see our documentation and try it out for free!

Related Posts

How to Scrape Reddit Posts, Subreddits and Profiles

In this article, we'll explore how to scrape Reddit. We'll extract various social data types from subreddits, posts, and user pages. All of which through plain HTTP requests without headless browser usage.

How to Scrape LinkedIn in 2024

In this scrape guide we'll be taking a look at one of the most popular web scraping targets - LinkedIn.com. We'll be scraping people profiles, company profiles as well as job listings and search.

How to Scrape SimilarWeb Website Traffic Analytics

In this guide, we'll explain how to scrape SimilarWeb through a step-by-step guide. We'll scrape comprehensive website traffic insights, websites comparing data, sitemaps, and trending industry domains.