How to Scrape Glassdoor (2023 update)


Glassdoor is mostly known for company reviews from past and current employees, though it contains much more data such as company metadata, salary information and job listings. This makes Glassdoor a great public data target for web scraping!

In this hands-on web scraping tutorial, we'll be taking a look at glassdoor.com and how we can scrape company information, job listings and reviews. We'll do this in Python using a few popular community packages, so let's dive in.

Latest Glassdoor.com Scraper Code

https://github.com/scrapfly/scrapfly-scrapers/

Project Setup

In this tutorial, we'll be using Python and a couple of popular community packages:

  • httpx - an HTTP client library that will let us communicate with glassdoor.com's servers
  • parsel - an HTML parsing library, though we'll be doing very little HTML parsing in this tutorial and will mostly be working with JSON data directly instead.

These packages can be easily installed via pip command:

$ pip install httpx parsel 

Alternatively, feel free to swap httpx out with any other HTTP client package such as requests, as we'll only need basic HTTP functions which are almost interchangeable in every library. As for parsel, another great alternative is the beautifulsoup package.

Dealing with Glassdoor Overlay

When browsing Glassdoor for a while, we are sure to encounter an overlay that requests users to log in:

glassdoor overlay example
After a few pages, an advertisement to register/log in appears

All of the content is still there, just covered up by the overlay. When scraping, our parsing tools will still be able to find this data:

import httpx
from parsel import Selector

response = httpx.get(
    "https://www.glassdoor.com/Overview/Working-at-eBay-EI_IE7853.11,15.htm"
)
selector = Selector(response.text)
# find description in the HTML:
print(selector.css('[data-test="employerDescription"]::text').get())
# will print:
# eBay is where the world goes to shop, sell, and give. Every day, our professionals connect millions of buyers and sellers around the globe, empowering people and creating opportunity. We're on a mission to build a better, more connected form of commerce that benefits individuals

That being said, while we're developing our web scraper we want to see and inspect the web page. We can easily remove the overlay with a little bit of JavaScript:

function addGlobalStyle(css) {
    var head, style;
    head = document.getElementsByTagName('head')[0];
    if (!head) { return; }
    style = document.createElement('style');
    style.type = 'text/css';
    style.innerHTML = css;
    head.appendChild(style);
}

addGlobalStyle("#HardsellOverlay {display:none !important;}");
addGlobalStyle("body {overflow:auto !important; position: initial !important}");

window.addEventListener("scroll", event => event.stopPropagation(), true);
window.addEventListener("mousemove", event => event.stopPropagation(), true);

This script sets a few global CSS styles to hide the overlay. It can be executed through the web browser's developer tools console (F12 key, Console tab).

Alternatively, it can be added to the bookmarks toolbar as a bookmarklet, simply drag this link: glassdoor overlay remover to your bookmarks toolbar and click it to get rid of the overlay at any time.

Selecting Region

Glassdoor operates all around the world and most of its content is region-aware. For example, if we're looking at eBay's Glassdoor profile on the glassdoor.co.uk website, we'll see only job listings relevant to the United Kingdom.

To select the region when web scraping we can either supply a cookie with the selected region's ID:

from parsel import Selector
import httpx

france_location_cookie = {"tldp": "6"}
response = httpx.get(
    "https://www.glassdoor.com/Overview/Working-at-eBay-EI_IE7853.11,15.htm",
    cookies=france_location_cookie,
    follow_redirects=True,
)
selector = Selector(response.text)
# find employee count in the HTML:
print(selector.css('[data-test="employer-size"]::text').get())
# will print:
# Plus de 10 000 employés
How to get country IDs?

All country IDs are present in the HTML of every Glassdoor page and can be extracted with a simple regular expression pattern:

import re
import json
import httpx

response = httpx.get(
    "https://www.glassdoor.com/",
    follow_redirects=True,
)
country_data = re.findall(r'"countryMenu\\":.+?(\[.+?\])', response.text)[0].replace('\\', '')
country_data = json.loads(country_data)
for country in country_data:
    print(f"{country['textKey']}: {country['id']}")

Note that these IDs are unlikely to change so here's the full output:

Argentina: 13
Australia: 5
Belgique (Français): 15
België (Nederlands): 14
Brasil: 9
Canada (English): 3
Canada (Français): 19
Deutschland: 7
España: 8
France: 6
Hong Kong: 20
India: 4
Ireland: 18
Italia: 23
México: 12
Nederland: 10
New Zealand: 21
Schweiz (Deutsch): 16
Singapore: 22
Suisse (Français): 17
United Kingdom: 2
United States: 1
Österreich: 11

Some pages, however, use URL parameters, which we'll cover more in the web scraping sections of the article.
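
For instance, the job listing pages scraped later in this article accept the same country IDs through a filter.countryId URL parameter. Here's a minimal sketch of that approach; treat the URL and selector as assumptions borrowed from the jobs scraper shown later:

import httpx
from parsel import Selector

# eBay's jobs page filtered to France (country ID 6 - same IDs as the tldp cookie)
url = "https://www.glassdoor.com/Jobs/eBay-Jobs-E7853.htm?filter.countryId=6"
response = httpx.get(url, follow_redirects=True)
selector = Selector(response.text)
# the pagination footer reflects the filtered result count
print(selector.css(".paginationFooter::text").get())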

Scraping Glassdoor Company Data

In this tutorial, we'll focus on scraping company information such as the company overview, job listings, reviews, etc. That being said, the techniques covered in this section can be applied to almost any other data page on glassdoor.com.

Company IDs

Before we can scrape any specific company's data, we need to know its internal Glassdoor ID and name. For that, we can use the Glassdoor search page recommendations.

For example, if we search for "eBay" we'll see a list of companies with their IDs:

glassdoor company search highlight
typing in text gives us suggestions revealing company IDs

To scrape this in Python we can use the typeahead API endpoint:

Python
import json
import httpx


def find_companies(query: str):
    """find company Glassdoor ID and name by query. e.g. "ebay" will return "eBay" with ID 7853"""
    result = httpx.get(
        url=f"https://www.glassdoor.com/searchsuggest/typeahead?numSuggestions=8&source=GD_V2&version=NEW&rf=full&fallback=token&input={query}",
    )
    data = json.loads(result.content)
    return data[0]["suggestion"], data[0]["employerId"]

print(find_companies("ebay"))
["eBay", "7853"]

ScrapFly

import asyncio
import json
from scrapfly import ScrapeApiResponse, ScrapeConfig, ScrapflyClient

client = ScrapflyClient(key="YOUR SCRAPFLY KEY", max_concurrency=10)
BASE_CONFIG = {"country": "CA", "asp": True, "cookies": {"tldp": "1"}}

async def find_companies(query: str):
    """find company Glassdoor ID and name by query. e.g. "ebay" will return "eBay" with ID 7853"""
    result = await client.async_scrape(
        ScrapeConfig(
            url=f"https://www.glassdoor.com/searchsuggest/typeahead?numSuggestions=8&source=GD_V2&version=NEW&rf=full&fallback=token&input={query}",
            country="US",
            asp=True,
            cookies={"tldp":"1"},  # sets location to US
        )
    )
    data = json.loads(result.content)
    return data[0]["suggestion"], data[0]["employerId"]

print(asyncio.run(find_companies("ebay")))
["eBay", "7853"]

Now that we can easily retrieve a company's name and numeric ID, we can start scraping company job listings, reviews, salaries, etc.

Company Overview

Let's start our scraper by scraping company overview data:

glassdoor company overview highlight
We'll be extracting highlighted data and much more

To scrape these details, all we need is the company page URL, or we can generate the URL ourselves from the company name and ID number.

Python
import httpx
from parsel import Selector

company_name = "eBay"
company_id = "7671"

url = f"https://www.glassdoor.com/Overview/Working-at-{company_name}-EI_IE{company_id}.htm"
response = httpx.get(
    url, 
    cookies={"tldp": "1"},  # use cookies to force US location
    follow_redirects=True
)  
sel = Selector(response.text)
print(sel.css("h1::text").get())

ScrapFly

from scrapfly import ScrapflyClient, ScrapeConfig

client = ScrapflyClient(key="YOUR SCRAPFLY KEY", max_concurrency=10)

company_name = "eBay"
company_id = "7671"
url = f"https://www.glassdoor.com/Overview/Working-at-{company_name}-EI_IE{company_id}.htm"
result = client.scrape(ScrapeConfig(url, country="US", cookies={"tldp": "1"}))
print(result.selector.css("h1 ::text").get())

To parse the company data, we could use traditional HTML parsing tools like BeautifulSoup on the rendered HTML. However, since Glassdoor uses Apollo GraphQL to power its website, we can extract the hidden JSON web data from the page source instead.

The advantage of scraping hidden web data is that it's a full dataset of all the data available on the page. This means we can extract even more data than is visible on the page, and it comes already structured for us.

Let's take a look at how we can do this with Python:

Python
import re
import httpx
import json


def extract_apollo_state(html):
    """Extract apollo graphql state data from HTML source"""
    data = re.findall(r'apolloState":\s*({.+})};', html)[0]
    return json.loads(data)


def scrape_overview(company_name: str, company_id: str) -> dict:
    url = f"https://www.glassdoor.com/Overview/Working-at-{company_name}-EI_IE{company_id}.htm"
    response = httpx.get(url, cookies={"tldp": "1"}, follow_redirects=True) 
    apollo_state = extract_apollo_state(response.text)
    return next(v for k, v in apollo_state.items() if k.startswith("Employer:"))


print(json.dumps(scrape_overview("eBay", "7853"), indent=2))

ScrapFly

import re
import json
from scrapfly import ScrapflyClient, ScrapeConfig

client = ScrapflyClient(key="YOUR SCRAPFLY KEY", max_concurrency=10)

def extract_apollo_state(html):
    """Extract apollo graphql state data from HTML source"""
    data = re.findall(r'apolloState":\s*({.+})};', html)[0]
    return json.loads(data)


def scrape_overview(company_name: str, company_id: str) -> dict:
    url = f"https://www.glassdoor.com/Overview/Working-at-{company_name}-EI_IE{company_id}.htm"
    result = client.scrape(ScrapeConfig(url, country="US", cookies={"tldp": "1"}))
    apollo_state = extract_apollo_state(result.content)
    return next(v for k, v in apollo_state.items() if k.startswith("Employer:"))


print(json.dumps(scrape_overview("eBay", "7853"), indent=2))
Example Output
{
  "__typename": "Employer",
  "id": 7853,
  "awards({\"limit\":200,\"onlyFeatured\":false})": [
    {
      "__typename": "EmployerAward",
      "awardDetails": null,
      "name": "Best Places to Work",
      "source": "Glassdoor",
      "year": 2022,
      "featured": true
    },
    "... truncated for preview ..."
  ],
  "shortName": "eBay",
  "links": {
    "__typename": "EiEmployerLinks",
    "reviewsUrl": "/Reviews/eBay-Reviews-E7853.htm",
    "manageoLinkData": null
  },
  "website": "www.ebayinc.com",
  "type": "Company - Public",
  "revenue": "$10+ billion (USD)",
  "headquarters": "San Jose, CA",
  "size": "10000+ Employees",
  "stock": "EBAY",
  "squareLogoUrl({\"size\":\"SMALL\"})": "https://media.glassdoor.com/sqls/7853/ebay-squareLogo-1634568971365.png",
  "primaryIndustry": {
    "__typename": "EmployerIndustry",
    "industryId": 200063,
    "industryName": "Internet & Web Services"
  },
  "yearFounded": 1995,
  "overview": {
    "__typename": "EmployerOverview",
    "description": "eBay is where the world goes to shop, sell, and give. Every day, our professionals connect millions of buyers and sellers around the globe, empowering people and creating opportunity. We're on a mission to build a better, more connected form of commerce that benefits individuals, businesses, and society. We create stronger connections between buyers and sellers, offering product experiences that are fast, mobile and secure. At eBay, we develop technologies that enable connected commerce and make every interaction effortless\u2014and more human. And we are doing it on a global scale, providing everyone with the chance to participate and create value.",
    "mission": "We connect people and build communities to create economic opportunity for all."
  },
  "bestProfile": {
    "__ref": "EmployerProfile:7925"
  },
  "employerManagedContent({\"parameters\":[{\"divisionProfileId\":961530,\"employerId\":7853}]})": [
    {
      "__typename": "EmployerManagedContent",
      "diversityContent": {
        "__typename": "DiversityAndInclusionContent",
        "programsAndInitiatives": {
          "__ref": "EmployerManagedContentSection:0"
        },
        "goals": []
      }
    }
  ],
  "badgesOfShame": []
}

By parsing the embedded GraphQL data we can easily extract the entire company dataset with a few lines of code!
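
As a quick follow-up, here is a minimal sketch of how the parsed Employer record could be reduced to a flat summary. The field names are taken from the example output above and may vary between companies:

def summarize_overview(employer: dict) -> dict:
    """pick out a few common fields from the parsed Employer record (keys as seen in the example output above)"""
    return {
        "name": employer.get("shortName"),
        "website": employer.get("website"),
        "headquarters": employer.get("headquarters"),
        "size": employer.get("size"),
        "founded": employer.get("yearFounded"),
        "industry": (employer.get("primaryIndustry") or {}).get("industryName"),
        "revenue": employer.get("revenue"),
    }


print(summarize_overview(scrape_overview("eBay", "7853")))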

Let's take a look at how we can use this technique to scrape other details such as jobs and reviews next.

Scraping Glassdoor Job Listings

To scrape job listings we'll also take a look at the embedded GraphQL data, though this time we'll be parsing the GraphQL cache rather than the state data.

For this, let's take a look at eBay's Glassdoor jobs page:

glassdoor company jobs page highlight
We'll be scraping all of the job listing data

If we look around the page source, we can see all of the job data is present in the JavaScript variable window.appCache in a hidden <script> node:

glassdoor jobs page source highlight on app data
We can see that the hidden dataset contains even more data than is visible on the page.

To extract it we can use some common parsing algorithms:

import json
from collections import defaultdict

from parsel import Selector


def find_json_objects(text: str, decoder=json.JSONDecoder()):
    """Find JSON objects in text, and generate decoded JSON data and its ID"""
    pos = 0
    while True:
        match = text.find("{", pos)
        if match == -1:
            break
        try:
            result, index = decoder.raw_decode(text[match:])
            # backtrack to find the key/identifier for this json object:
            key_end = text.rfind('"', 0, match)
            key_start = text.rfind('"', 0, key_end)
            key = text[key_start + 1 : key_end]
            yield key, result
            pos = match + index
        except ValueError:
            pos = match + 1


def extract_apollo_cache(html):
    """Extract apollo graphql cache data from HTML source"""
    selector = Selector(text=html)
    script_with_cache = selector.xpath("//script[contains(.,'window.appCache')]/text()").get()
    cache = defaultdict(list)
    for key, data in find_json_objects(script_with_cache):
        cache[key].append(data)
    return cache

Our parser above takes the HTML text, finds the <script> node with the window.appCache variable and extracts the cache objects. Let's take a look at how it handles Glassdoor's jobs page:

Python
import asyncio
import json
import math
from collections import defaultdict
from typing import Dict, List

import httpx
from parsel import Selector

session = httpx.AsyncClient(
    timeout=httpx.Timeout(30.0),
    cookies={"tldp": "1"},
    follow_redirects=True,
)


def find_json_objects(text: str, decoder=json.JSONDecoder()):
    """Find JSON objects in text, and generate decoded JSON data and it's ID"""
    pos = 0
    while True:
        match = text.find("{", pos)
        if match == -1:
            break
        try:
            result, index = decoder.raw_decode(text[match:])
            # backtrack to find the key/identifier for this json object:
            key_end = text.rfind('"', 0, match)
            key_start = text.rfind('"', 0, key_end)
            key = text[key_start + 1 : key_end]
            yield key, result
            pos = match + index
        except ValueError:
            pos = match + 1


def extract_apollo_cache(html):
    """Extract apollo graphql cache data from HTML source"""
    selector = Selector(text=html)
    script_with_cache = selector.xpath("//script[contains(.,'window.appCache')]/text()").get()
    cache = defaultdict(list)
    for key, data in find_json_objects(script_with_cache):
        cache[key].append(data)
    return cache


def parse_jobs(html) -> List[Dict]:
    """parse jobs page for job data and total amount of jobs"""
    cache = extract_apollo_cache(html)
    return [v["jobview"] for v in cache["JobListingSearchResult"]]


def parse_job_page_count(html) -> int:
    """parse job page count from pagination details in Glassdoor jobs page"""
    _total_results = Selector(html).css(".paginationFooter::text").get()
    if not _total_results:
        return 1
    _total_results = int(_total_results.split()[-1])
    _total_pages = math.ceil(_total_results / 40)
    return _total_pages


async def scrape_jobs(employer_name: str, employer_id: str):
    """Scrape job listings"""
    # scrape first page of jobs:
    first_page = await session.get(
        url=f"https://www.glassdoor.com/Jobs/{employer_name}-Jobs-E{employer_id}.htm?filter.countryId={session.cookies.get('tldp') or 0}",
    )
    jobs = parse_jobs(first_page.text)
    total_pages = parse_job_page_count(first_page.text)

    print(f"scraped first page of jobs, scraping remaining {total_pages - 1} pages")
    other_pages = [
        session.get(
            url=str(first_page.url).replace(".htm", f"_P{page}.htm"),
        )
        for page in range(2, total_pages + 1)
    ]
    for page in await asyncio.gather(*other_pages):
        jobs.extend(parse_jobs(page.text))
    return jobs


async def main():
    jobs = await scrape_jobs("eBay", "7853")
    print(json.dumps(jobs, indent=2))


asyncio.run(main())

ScrapFly

import asyncio
import json
import math
from pathlib import Path
import re
from collections import defaultdict
from typing import Dict, List, Optional

from scrapfly import ScrapeApiResponse, ScrapeConfig, ScrapflyClient

client = ScrapflyClient(key="YOUR SCRAPFLY KEY")
BASE_CONFIG = {
    "country": "US",
    "asp": True,
    "cookies": {"tldp": "1"}
}


def find_json_objects(text: str, decoder=json.JSONDecoder()):
    """Find JSON objects in text, and generate decoded JSON data and it's ID"""
    pos = 0
    while True:
        match = text.find("{", pos)
        if match == -1:
            break
        try:
            result, index = decoder.raw_decode(text[match:])
            # backtrack to find the key/identifier for this json object:
            key_end = text.rfind('"', 0, match)
            key_start = text.rfind('"', 0, key_end)
            key = text[key_start + 1 : key_end]
            yield key, result
            pos = match + index
        except ValueError:
            pos = match + 1


def extract_apollo_state(result: ScrapeApiResponse) -> Dict:
    """Extract apollo graphql state data from HTML source"""
    data = re.findall(r'apolloState":\s*({.+})};', result.content)[0]
    return json.loads(data)


def extract_apollo_cache(result: ScrapeApiResponse) -> Dict[str, List]:
    """Extract apollo graphql cache data from HTML source"""
    script_with_cache = result.selector.xpath("//script[contains(.,'window.appCache')]/text()").get()
    cache = defaultdict(list)
    for key, data in find_json_objects(script_with_cache):
        cache[key].append(data)
    return cache


def parse_jobs(result: ScrapeApiResponse) -> List[Dict]:
    """parse jobs page for job data and total amount of jobs"""
    cache = extract_apollo_cache(result)
    return [v["jobview"] for v in cache["JobListingSearchResult"]]


def parse_job_page_count(result: ScrapeApiResponse) -> int:
    """parse job page count from pagination details in Glassdoor jobs page"""
    total_results = result.selector.css(".paginationFooter::text").get()
    if not total_results:
        return 1
    total_results = int(total_results.split()[-1])
    total_pages = math.ceil(total_results / 40)
    return total_pages


def change_page(url: str, page: int) -> str:
    """update page number in a glassdoor url"""
    new = re.sub(r"(?:_P\d+)*\.htm", f"_P{page}.htm", url)
    assert new != url
    return new


async def scrape_jobs(employer: str, employer_id: str, max_pages: Optional[int] = None) -> List[Dict]:
    """Scrape job listings"""
    first_page_url = f"https://www.glassdoor.com/Jobs/{employer}-Jobs-E{employer_id}.htm?filter.countryId={BASE_CONFIG['cookies']['tldp']}"
    first_page = await client.async_scrape(ScrapeConfig(url=first_page_url, **BASE_CONFIG))

    jobs = parse_jobs(first_page)
    total_pages = parse_job_page_count(first_page)
    if max_pages and total_pages > max_pages:
        total_pages = max_pages

    print(f"scraped first page of jobs, scraping remaining {total_pages - 1} pages")
    other_pages = [
        ScrapeConfig(url=change_page(first_page.context["url"], page=page), **BASE_CONFIG)
        for page in range(2, total_pages + 1)
    ]
    async for result in client.concurrent_scrape(other_pages):
        jobs.extend(parse_jobs(result))
    return jobs


async def run():
    """this is example demo run that'll scrape US jobs, reviews and salaries for ebay and save results to ./results/*.json files"""
    emp_name, emp_id = "ebay", "7853"

    ebay_jobs_in_US = await scrape_jobs(emp_name, emp_id, max_pages=2)
    print(json.dumps(ebay_jobs_in_US, indent=2))


if __name__ == "__main__":
    asyncio.run(run())

Above, we have a scraper that goes through a classic pagination scraping algorithm:

  1. Scrape the first jobs page
  2. Extract GraphQl cache for jobs data
  3. Parse HTML for total page count
  4. Scrape remaining pages concurrently

If we run our scraper, we should get all of the job listings in no time!

Example Output
[
  {
    "header": {
      "adOrderId": 1281260,
      "advertiserType": "EMPLOYER",
      "ageInDays": 0,
      "easyApply": false,
      "employer": {
        "id": 7853,
        "name": "eBay inc.",
        "shortName": "eBay",
        "__typename": "Employer"
      },
      "goc": "machine learning engineer",
      "gocConfidence": 0.9,
      "gocId": 102642,
      "jobLink": "/partner/jobListing.htm?pos=140&ao=1281260&s=21&guid=0000018355c715f3b12a6090d334a7dc&src=GD_JOB_AD&t=ESR&vt=w&cs=1_9a5bdc18&cb=1663591454509&jobListingId=1008147859269&jrtk=3-0-1gdase5jgjopr801-1gdase5k2irmo800-952fa651f152ade0-",
      "jobTitleText": "Sr. Manager, AI-Guided Service Products",
      "locationName": "Salt Lake City, UT",
      "divisionEmployerName": null,
      "needsCommission": false,
      "payCurrency": "USD",
      "payPercentile10": 75822,
      "payPercentile25": 0,
      "payPercentile50": 91822,
      "payPercentile75": 0,
      "payPercentile90": 111198,
      "payPeriod": "ANNUAL",
      "salarySource": "ESTIMATED",
      "sponsored": true,
      "__typename": "JobViewHeader"
    },
    "job": {
      "importConfigId": 322429,
      "jobTitleText": "Sr. Manager, AI-Guided Service Products",
      "jobTitleId": 0,
      "listingId": 1008147859269,
      "__typename": "JobDetails"
    },
    "jobListingAdminDetails": {
      "cpcVal": null,
      "jobListingId": 1008147859269,
      "jobSourceId": 0,
      "__typename": "JobListingAdminDetailsVO"
    },
    "overview": {
      "shortName": "eBay",
      "squareLogoUrl": "https://media.glassdoor.com/sql/7853/ebay-squareLogo-1634568971326.png",
      "__typename": "Employer"
    },
    "__typename": "JobView"
  },
  "..."
]

Now that we understand how Glassdoor works let's take a look at how we can grab other details available in the graphql cache like reviews.

Glassdoor Company Reviews

To scrape reviews we'll take a look at another GraphQL feature - page state data.

Just like how we found the GraphQL cache in the page HTML, we can also find the GraphQL state:

glassdoor reviews page source highlight on app state data
Just like with the cache, the state data contains much more information than is visible on the page

So, to scrape reviews we can parse the GraphQL state data, which contains all of the reviews, review metadata and loads of other details:

import asyncio
import re
import json
from typing import Tuple, List, Dict
import httpx


def extract_apollo_state(html):
    """Extract apollo graphql state data from HTML source"""
    data = re.findall(r'apolloState":\s*({.+})};', html)[0]
    data = json.loads(data)
    return data


def parse_reviews(html) -> Dict:
    """parse reviews page for review data"""
    cache = extract_apollo_state(html)
    xhr_cache = cache["ROOT_QUERY"]
    reviews = next(v for k, v in xhr_cache.items() if k.startswith("employerReviews") and v.get("reviews"))
    return reviews


async def scrape_reviews(employer: str, employer_id: str, session: httpx.AsyncClient):
    """Scrape job listings"""
    # scrape first page of jobs:
    first_page = await session.get(
        url=f"https://www.glassdoor.com/Reviews/{employer}-Reviews-E{employer_id}.htm",
    )
    reviews = parse_reviews(first_page.text)
    # find total amount of pages and scrape remaining pages concurrently
    total_pages = reviews["numberOfPages"]
    print(f"scraped first page of reviews, scraping remaining {total_pages - 1} pages")
    other_pages = [
        session.get(
            url=str(first_page.url).replace(".htm", f"_P{page}.htm"),
        )
        for page in range(2, total_pages + 1)
    ]
    for page in await asyncio.gather(*other_pages):
        page_reviews = parse_reviews(page.text)
        reviews["reviews"].extend(page_reviews["reviews"])
    return reviews
Run Code & Example Output

We can run our scraper the same as before:

async def main():
    async with httpx.AsyncClient(
        timeout=httpx.Timeout(30.0),
        cookies={"tldp": "1"},
        follow_redirects=True,
    ) as client:
        reviews = await scrape_reviews("eBay", "7853", client)
        print(json.dumps(reviews, indent=2))


asyncio.run(main())

Which will produce results similar to:

{
  "__typename": "EmployerReviews",
  "filteredReviewsCountByLang": [
    {
      "__typename": "ReviewsCountByLanguage",
      "count": 4109,
      "isoLanguage": "eng"
    },
    "..."
  ],
  "employer": {
    "__ref": "Employer:7853"
  },
  "queryLocation": null,
  "queryJobTitle": null,
  "currentPage": 1,
  "numberOfPages": 411,
  "lastReviewDateTime": "2022-09-16T14:51:36.650",
  "allReviewsCount": 5017,
  "ratedReviewsCount": 4218,
  "filteredReviewsCount": 4109,
  "ratings": {
    "__typename": "EmployerRatings",
    "overallRating": 4.1,
    "reviewCount": 4218,
    "ceoRating": 0.87,
    "recommendToFriendRating": 0.83,
    "cultureAndValuesRating": 4.2,
    "diversityAndInclusionRating": 4.3,
    "careerOpportunitiesRating": 3.8,
    "workLifeBalanceRating": 4.1,
    "seniorManagementRating": 3.7,
    "compensationAndBenefitsRating": 4.1,
    "businessOutlookRating": 0.66,
    "ceoRatingsCount": 626,
    "ratedCeo": {
      "__ref": "Ceo:768619"
    }
  },
  "reviews": [
    {
      "__typename": "EmployerReview",
      "isLegal": true,
      "reviewId": 64767391,
      "reviewDateTime": "2022-05-27T08:41:43.217",
      "ratingOverall": 5,
      "ratingCeo": "APPROVE",
      "ratingBusinessOutlook": "POSITIVE",
      "ratingWorkLifeBalance": 5,
      "ratingCultureAndValues": 5,
      "ratingDiversityAndInclusion": 5,
      "ratingSeniorLeadership": 5,
      "ratingRecommendToFriend": "POSITIVE",
      "ratingCareerOpportunities": 5,
      "ratingCompensationAndBenefits": 5,
      "employer": {
        "__ref": "Employer:7853"
      },
      "isCurrentJob": true,
      "lengthOfEmployment": 1,
      "employmentStatus": "REGULAR",
      "jobEndingYear": null,
      "jobTitle": {
        "__ref": "JobTitle:60210"
      },
      "location": {
        "__ref": "City:1139151"
      },
      "originalLanguageId": null,
      "pros": "Thorough training, compassionate and very patient with all trainees. Benefits day one. Inclusive and really work with their employees to help them succeed in their role.",
      "prosOriginal": null,
      "cons": "No cons at all! So far everything is great!",
      "consOriginal": null,
      "summary": "Excellent Company",
      "summaryOriginal": null,
      "advice": null,
      "adviceOriginal": null,
      "isLanguageMismatch": false,
      "countHelpful": 2,
      "countNotHelpful": 0,
      "employerResponses": [],
      "isCovid19": false,
      "divisionName": null,
      "divisionLink": null,
      "topLevelDomainId": 1,
      "languageId": "eng",
      "translationMethod": null
    },
    "..."
  ]
}

Other Details

The embedded graphql data parsing techniques we've learned in job and review data scraping can be applied to scrape other company details like salaries, interviews, benefits and photos.

For example, salary data can be found the same way we found the reviews:

...

def parse_salaries(html) -> Dict:
    """parse salaries page for salary data"""
    cache = extract_apollo_state(html)
    xhr_cache = cache["ROOT_QUERY"]
    salaries = next(v for k, v in xhr_cache.items() if k.startswith("salariesByEmployer") and v.get("results"))
    return salaries


async def scrape_salaries(employer: str, employer_id: str, session: httpx.AsyncClient):
    """Scrape salary records"""
    # scrape first page of salaries:
    first_page = await session.get(
        url=f"https://www.glassdoor.com/Salaries/{employer}-Salaries-E{employer_id}.htm",
    )
    salaries = parse_salaries(first_page.text)
    total_pages = salaries["pages"]

    print(f"scraped first page of salaries, scraping remaining {total_pages - 1} pages")
    other_pages = [
        session.get(
            url=str(first_page.url).replace(".htm", f"_P{page}.htm"),
        )
        for page in range(2, total_pages + 1)
    ]
    for page in await asyncio.gather(*other_pages):
        page_salaries = parse_salaries(page.text)
        salaries["results"].extend(page_salaries["results"])
    return salaries

By understanding how Glassdoor's web infrastructure works, we can easily collect all of the public company datasets available on Glassdoor.

However, when collecting all of this data at scale we're very likely to be blocked, so let's take a look at how we can avoid blocking by using the ScrapFly API.

Bypass Glassdoor Blocking with ScrapFly

ScrapFly offers several powerful features that'll help us get around web scraper blocking, such as anti-scraping protection bypass, proxy country selection and residential proxy pools.

For this, we'll be using the ScrapFly SDK Python package. To start, let's install scrapfly-sdk using pip:

$ pip install scrapfly-sdk

To convert our httpx-powered code to use the ScrapFly SDK, all we have to do is replace our httpx requests with SDK ones:

from scrapfly import ScrapflyClient, ScrapeConfig

client = ScrapflyClient(key='YOUR_SCRAPFLY_KEY')
result = client.scrape(ScrapeConfig(
    url="https://www.glassdoor.com/Salary/eBay-Salaries-E7853.htm",
    # we can enable anti-scraper protection bypass:
    asp=True,
    # or select proxies from specific countries:
    country="US",
    # and change proxy types:
    proxy_pool="public_residential_pool",
))

For an extended example of using ScrapFly to scrape Glassdoor, see the Full Scraper Code section.

FAQ

To wrap this guide up, let's take a look at some frequently asked questions about web scraping Glassdoor.com:

Is it legal to scrape Glassdoor.com?

Yes. Data displayed on Glassdoor is publicly available, and we're not extracting anything private. Scraping Glassdoor.com at slow, respectful rates would fall under the ethical scraping definition. That being said, attention should be paid to GDPR compliance in the EU when scraping data submitted by its users, such as reviews. For more, see our Is Web Scraping Legal? article.

How to find all company pages listed on Glassdoor?

Glassdoor contains over half a million US companies alone but doesn't have a sitemap. It does, however, contain several limited directory pages, such as the directory for US companies. Unfortunately, directory pages are limited to a few hundred pages, though with crawling and filtering it's possible to discover all of the pages.

Another approach that we covered in this tutorial is to discover company pages through company IDs. Each company on Glassdoor is assigned an incremental ID in the range of 1000-1_000_000+. Using HEAD-type requests we can easily poke each of these IDs to see whether they lead to company pages, as sketched below.
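
To illustrate, here's a minimal sketch of this ID-probing idea. The URL pattern (with a placeholder company name segment) and the assumption that non-existing IDs return a 404 rather than redirecting elsewhere are untested assumptions worth verifying before running this at any scale:

import asyncio
import httpx


async def probe_company_ids(ids: range) -> list:
    """probe Glassdoor company IDs with HEAD requests and return those that resolve to a page"""
    found = []
    async with httpx.AsyncClient(follow_redirects=True, timeout=30.0) as session:
        for company_id in ids:
            # hypothetical probe URL: the company name segment is a placeholder,
            # assuming Glassdoor resolves the page by the numeric ID
            url = f"https://www.glassdoor.com/Overview/Working-at-company-EI_IE{company_id}.htm"
            response = await session.head(url)
            if response.status_code == 200:
                found.append(company_id)
    return found


print(asyncio.run(probe_company_ids(range(7850, 7860))))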

Glassdoor Scraping Summary

In this web scraping tutorial, we've taken a look at how we can scrape company details such as metadata, reviews, job listings and salaries displayed on glassdoor.com.

We did this by taking advantage of the GraphQL cache and state data, which we extracted with a few generic web scraping algorithms in plain Python.

Related Posts

How to Web Scrape with HTTPX and Python

Intro to using Python's httpx library for web scraping. Proxy and user agent rotation and common web scraping challenges, tips and tricks.

How to Scrape Goat.com for Fashion Apparel Data in Python

Goat.com is a rising storefront for luxury fashion apparel items. It's known for high quality apparel data, so in this tutorial we'll take a look at how to scrape it using Python.

How to Scrape Fashionphile for Second Hand Fashion Data

In this fashion scrape guide we'll be taking a look at Fashionphile - another major second-hand luxury fashion marketplace. We'll be using Python and hidden web data scraping to grab all of this data in just a few lines of code.