How to Scrape Glassdoor


Glassdoor is mostly known for company reviews from past and current employees, though it also contains much more data, like company metadata, salary information and job listings. This makes Glassdoor a great public data target for web scraping!

In this hands-on web scraping tutorial, we'll be taking a look at glassdoor.com and how we can scrape company information, job listings and reviews. We'll do this in Python using a few popular community packages, so let's dive in.

Setup

In this tutorial we'll be using Python and a couple of popular community packages:

  • httpx - an HTTP client library that will let us communicate with glassdoor.com's servers
  • parsel - an HTML parsing library, though we'll be doing very little HTML parsing in this tutorial and will mostly be working with JSON data directly instead.

These packages can be easily installed via the pip command:

$ pip install httpx parsel 

Alternatively, feel free to swap httpx out for any other HTTP client package such as requests, as we'll only need basic HTTP functions which are almost interchangeable across libraries. As for parsel, another great alternative is the beautifulsoup package.
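
For example, here's a rough sketch of the same kind of request we'll make throughout this article, using requests instead of httpx (requests follows redirects by default, so no extra flags are needed):

import requests
from parsel import Selector

# fetch eBay's Glassdoor overview page and parse out the page title
response = requests.get(
    "https://www.glassdoor.com/Overview/Working-at-eBay-EI_IE7853.11,15.htm"
)
selector = Selector(response.text)
print(selector.css("h1::text").get())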

Dealing with the Overlay

When browsing Glassdoor for a while we are sure to encounter an overlay that requests users to log in:

glassdoor overlay example

After a few pages, an advertisement to register/login appears

All of the content is still there, just covered up by the overlay. When scraping, our parsing tools will still be able to find this data:

import httpx
from parsel import Selector

response = httpx.get(
    "https://www.glassdoor.com/Overview/Working-at-eBay-EI_IE7853.11,15.htm"
)
selector = Selector(response.text)
# find description in the HTML:
print(selector.css('[data-test="employerDescription"]::text').get())
# will print:
# eBay is where the world goes to shop, sell, and give. Every day, our professionals connect millions of buyers and sellers around the globe, empowering people and creating opportunity. We're on a mission to build a better, more connected form of commerce that benefits individuals

That being said, while we're developing our web scraper, we want to see and inspect the web page. We can easily remove the overlay with a little bit of JavaScript:

function addGlobalStyle(css) {
    var head, style;
    head = document.getElementsByTagName('head')[0];
    if (!head) { return; }
    style = document.createElement('style');
    style.type = 'text/css';
    style.innerHTML = css;
    head.appendChild(style);
}

addGlobalStyle("#HardsellOverlay {display:none !important;}");
addGlobalStyle("body {overflow:auto !important; position: initial !important}");

window.addEventListener("scroll", event => event.stopPropagation(), true);
window.addEventListener("mousemove", event => event.stopPropagation(), true);

This script sets a few global CSS styles to hide the overlay, and it can be executed through the web browser's developer tools console (F12 key, Console tab).

Alternatively, it can be added to the bookmarks toolbar as a bookmarklet: simply drag the glassdoor overlay remover link to your bookmarks toolbar and click it to get rid of the overlay at any time.

Selecting Region

Glassdoor operates all around the world and most of its content is region-aware. For example, if we're looking at eBay's Glassdoor profile on the glassdoor.co.uk website, we'll see only job listings relevant to the United Kingdom.

To select the region when web scraping, we can supply a cookie with the selected region's ID:

from parsel import Selector
import httpx

france_location_cookie = {"tldp": "6"}
response = httpx.get(
    "https://www.glassdoor.com/Overview/Working-at-eBay-EI_IE7853.11,15.htm",
    cookies=france_location_cookie,
    follow_redirects=True,
)
selector = Selector(response.text)
# find employee count in the HTML:
print(selector.css('[data-test="employer-size"]::text').get())
# will print:
# Plus de 10 000 employés
How to get country IDs?

All country IDs are present in every Glassdoor page's HTML and can be extracted with a simple regular expression pattern:

import re
import json
import httpx

response = httpx.get(
    "https://www.glassdoor.com/",
    follow_redirects=True,
)
country_data = re.findall(r'"countryMenu\\":.+?(\[.+?\])', response.text)[0].replace('\\', '')
country_data = json.loads(country_data)
for country in country_data:
    print(f"{country['textKey']}: {country['id']}")

Note that these IDs are unlikely to change, so here's the full output:

Argentina: 13
Australia: 5
Belgique (Français): 15
België (Nederlands): 14
Brasil: 9
Canada (English): 3
Canada (Français): 19
Deutschland: 7
España: 8
France: 6
Hong Kong: 20
India: 4
Ireland: 18
Italia: 23
México: 12
Nederland: 10
New Zealand: 21
Schweiz (Deutsch): 16
Singapore: 22
Suisse (Français): 17
United Kingdom: 2
United States: 1
Österreich: 11

Some pages, however, use URL parameters instead, which we'll cover more in the web scraping sections of the article.
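
For example, the job listing pages we'll scrape later accept a filter.countryId URL parameter instead of the cookie. Here's a quick sketch using the France ID (6) from the list above (the exact URL format is explained in the job listings section):

import httpx

# eBay's job listings filtered to France (country ID 6) via a URL parameter
response = httpx.get(
    "https://www.glassdoor.com/Jobs/-Jobs-E7853_P1.htm?filter.countryId=6",
    follow_redirects=True,
)
print(response.status_code)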

Scraping Company Data

In this tutorial, we'll focus on scraping company information such as the company overview, job listings, reviews etc. That being said, the techniques covered in this section can be applied to almost any other data page on glassdoor.com.

Company Overview

Let's start our scraper by scraping company overview data:

glassdoor company overview highlight

We'll be extracting the highlighted data and much more

To scrape these details, all we need is the company page URL or, alternatively, just the company ID number (which we can use to generate the URL):

import httpx
from parsel import Selector

company_id = "7671"
short_url = f"https://www.glassdoor.com/Overview/-EI_IE{company_id}.htm"
response = httpx.get(short_url)
print(response.url)
# request to short url wil redirect us to full page
# https://www.glassdoor.com/Overview/Working-at-eBay-EI_IE7853.11,15.htm
sel = Selector(response.text)
print(sel.css("h1::text").get())
# Ebay

To parse company data we can either parse the rendered HTML or, since Glassdoor uses Apollo GraphQL to power its website, extract the embedded state data:

import re
import httpx
import json

def extract_apollo_state(html):
    """Extract apollo graphql state data from HTML source"""
    data = re.findall(r'apolloState":\s*({.+})};', html)[0]
    return json.loads(data)

def scrape_overview(company_id: str):
    short_url = f"https://www.glassdoor.com/Overview/-EI_IE{company_id}.htm"
    response = httpx.get(short_url, follow_redirects=True)
    apollo_state = extract_apollo_state(response.text)
    return next(v for k, v in apollo_state.items() if k.startswith("Employer:"))

print(json.dumps(scrape_overview("7853"), indent=2))
Example Output
{
  "__typename": "Employer",
  "id": 7853,
  "awards({\"limit\":200,\"onlyFeatured\":false})": [
    {
      "__typename": "EmployerAward",
      "awardDetails": null,
      "name": "Best Places to Work",
      "source": "Glassdoor",
      "year": 2022,
      "featured": true
    },
    "... truncated for preview ..."
  ],
  "shortName": "eBay",
  "links": {
    "__typename": "EiEmployerLinks",
    "reviewsUrl": "/Reviews/eBay-Reviews-E7853.htm",
    "manageoLinkData": null
  },
  "website": "www.ebayinc.com",
  "type": "Company - Public",
  "revenue": "$10+ billion (USD)",
  "headquarters": "San Jose, CA",
  "size": "10000+ Employees",
  "stock": "EBAY",
  "squareLogoUrl({\"size\":\"SMALL\"})": "https://media.glassdoor.com/sqls/7853/ebay-squareLogo-1634568971365.png",
  "primaryIndustry": {
    "__typename": "EmployerIndustry",
    "industryId": 200063,
    "industryName": "Internet & Web Services"
  },
  "yearFounded": 1995,
  "overview": {
    "__typename": "EmployerOverview",
    "description": "eBay is where the world goes to shop, sell, and give. Every day, our professionals connect millions of buyers and sellers around the globe, empowering people and creating opportunity. We're on a mission to build a better, more connected form of commerce that benefits individuals, businesses, and society. We create stronger connections between buyers and sellers, offering product experiences that are fast, mobile and secure. At eBay, we develop technologies that enable connected commerce and make every interaction effortless\u2014and more human. And we are doing it on a global scale, providing everyone with the chance to participate and create value.",
    "mission": "We connect people and build communities to create economic opportunity for all."
  },
  "bestProfile": {
    "__ref": "EmployerProfile:7925"
  },
  "employerManagedContent({\"parameters\":[{\"divisionProfileId\":961530,\"employerId\":7853}]})": [
    {
      "__typename": "EmployerManagedContent",
      "diversityContent": {
        "__typename": "DiversityAndInclusionContent",
        "programsAndInitiatives": {
          "__ref": "EmployerManagedContentSection:0"
        },
        "goals": []
      }
    }
  ],
  "badgesOfShame": []
}

By parsing the embedded GraphQL data we can easily extract the entire company dataset with a few lines of code!
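
For example, here's a quick illustration of picking a few fields out of the employer object returned by the scrape_overview() function above (key names taken from the example output):

overview = scrape_overview("7853")
print(overview["shortName"])     # eBay
print(overview["website"])       # www.ebayinc.com
print(overview["headquarters"])  # San Jose, CA
print(overview["yearFounded"])   # 1995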

Let's take a look at how we can use this technique to scrape other details such as jobs and reviews next.

Job Listings

To scrape job listings we'll also take a look at the embedded GraphQL data, though this time we'll be parsing the GraphQL cache rather than the state data.

For this, let's take a look at eBay's Glassdoor jobs page:

glassdoor company jobs page highlight

We'll be scraping all of the job listing data

If we look around the page source we can see all of the job data is present in the JavaScript variable window.appCache inside a hidden <script> node:

glassdoor jobs page source highlight on app data

We can see that the hidden dataset contains even more data than is visible on the page.

To extract it we can use some common parsing algorithms:

import json
from collections import defaultdict

from parsel import Selector

def find_json_objects(text: str, decoder=json.JSONDecoder()):
    """Find JSON objects in text, and generate decoded JSON data and its key"""
    pos = 0
    while True:
        match = text.find("{", pos)
        if match == -1:
            break
        try:
            result, index = decoder.raw_decode(text[match:])
            # backtrack to find the key/identifier for this json object:
            key_end = text.rfind('"', 0, match)
            key_start = text.rfind('"', 0, key_end)
            key = text[key_start + 1 : key_end]
            yield key, result
            pos = match + index
        except ValueError:
            pos = match + 1


def extract_apollo_cache(html):
    """Extract apollo graphql cache data from HTML source"""
    selector = Selector(text=html)
    script_with_cache = selector.xpath("//script[contains(.,'window.appCache')]/text()").get()
    cache = defaultdict(list)
    for key, data in find_json_objects(script_with_cache):
        cache[key].append(data)
    return cache

Our parser above will take the HTML text, find the <script> node with the window.appCache variable and extract the cache objects. Let's take a look at how it handles Glassdoor's job page:

import asyncio
import math
from typing import Dict, List
import httpx
from parsel import Selector

...  # don't forget the parser code we wrote previously

def parse_jobs(html) -> List[Dict]:
    """parse jobs page for job listing data"""
    cache = extract_apollo_cache(html)
    return [v["jobview"] for v in cache["JobListingSearchResult"]]


def parse_job_page_count(html) -> int:
    """parse job page count from pagination details in Glassdoor jobs page"""
    _total_results = Selector(html).css(".paginationFooter::text").get()
    if not _total_results:
        return 1
    _total_results = int(_total_results.split()[-1])
    _total_pages = math.ceil(_total_results / 40)
    return _total_pages


async def scrape_jobs(employer_id: str, session: httpx.AsyncClient):
    """Scrape job listings"""
    # scrape first page of jobs:
    first_page = await session.get(
        url=f"https://www.glassdoor.com/Jobs/-Jobs-E{employer_id}_P1.htm?filter.countryId={session.cookies.get('tldp') or 0}",
    )
    # parse first page of jobs:
    jobs = parse_jobs(first_page.text)
    # parse total job page count
    total_pages = parse_job_page_count(first_page.text)
    # then scrape remaining pages concurrently:
    print(f"scraped first page of jobs, scraping remaining {total_pages - 1} pages")
    other_pages = [
        session.get(url=str(first_page.url).replace(".htm", f"_P{page}.htm"))
        for page in range(2, total_pages + 1)
    ]
    for page in await asyncio.gather(*other_pages):
        jobs.extend(parse_jobs(page.text))
    return jobs

Above, we have a scraper that goes through a classic pagination scraping algorithm:

  1. Scrape the first jobs page
  2. Extract GraphQl cache for jobs data
  3. Parse HTML for total page count
  4. Scrape remaining pages concurrently

If we run our scraper, we should get all of the job listings in no time!

Run Code & Example Output

To run our scraper all we have to do is start a connection session and await our scrape function:

async def main():
    async with httpx.AsyncClient(
        timeout=httpx.Timeout(30.0),
        cookies={"tldp": "1"},  # we can set country region using this cookie
        follow_redirects=True,
    ) as client:
        company_id = "7853"  #  Ebay
        jobs = await scrape_jobs(company_id, client)
        print(json.dumps(jobs, indent=2))

asyncio.run(main())

Which should generate all of the jobs available in our selected region:

[
  {
    "header": {
      "adOrderId": 1281260,
      "advertiserType": "EMPLOYER",
      "ageInDays": 0,
      "easyApply": false,
      "employer": {
        "id": 7853,
        "name": "eBay inc.",
        "shortName": "eBay",
        "__typename": "Employer"
      },
      "goc": "machine learning engineer",
      "gocConfidence": 0.9,
      "gocId": 102642,
      "jobLink": "/partner/jobListing.htm?pos=140&ao=1281260&s=21&guid=0000018355c715f3b12a6090d334a7dc&src=GD_JOB_AD&t=ESR&vt=w&cs=1_9a5bdc18&cb=1663591454509&jobListingId=1008147859269&jrtk=3-0-1gdase5jgjopr801-1gdase5k2irmo800-952fa651f152ade0-",
      "jobTitleText": "Sr. Manager, AI-Guided Service Products",
      "locationName": "Salt Lake City, UT",
      "divisionEmployerName": null,
      "needsCommission": false,
      "payCurrency": "USD",
      "payPercentile10": 75822,
      "payPercentile25": 0,
      "payPercentile50": 91822,
      "payPercentile75": 0,
      "payPercentile90": 111198,
      "payPeriod": "ANNUAL",
      "salarySource": "ESTIMATED",
      "sponsored": true,
      "__typename": "JobViewHeader"
    },
    "job": {
      "importConfigId": 322429,
      "jobTitleText": "Sr. Manager, AI-Guided Service Products",
      "jobTitleId": 0,
      "listingId": 1008147859269,
      "__typename": "JobDetails"
    },
    "jobListingAdminDetails": {
      "cpcVal": null,
      "jobListingId": 1008147859269,
      "jobSourceId": 0,
      "__typename": "JobListingAdminDetailsVO"
    },
    "overview": {
      "shortName": "eBay",
      "squareLogoUrl": "https://media.glassdoor.com/sql/7853/ebay-squareLogo-1634568971326.png",
      "__typename": "Employer"
    },
    "__typename": "JobView"
  },
  "..."
]

Now that we understand how Glassdoor works, let's take a look at how we can grab other details, like reviews.

Company Reviews

To scrape reviews we'll take a look at another GraphQL feature: page state data.

Just like we found the GraphQL cache in the page HTML, we can also find the GraphQL state:

glassdoor reviews page source highlight on app state data

Just like with the cache, the state data contains much more information than is visible on the page

So, to scrape reviews we can parse the GraphQL state data, which contains all of the reviews, review metadata and loads of other details:

import asyncio
import re
import json
from typing import Tuple, List, Dict
import httpx


def extract_apollo_state(html):
    """Extract apollo graphql state data from HTML source"""
    data = re.findall(r'apolloState":\s*({.+})};', html)[0]
    data = json.loads(data)
    return data


def parse_reviews(html) -> Dict:
    """parse reviews page for review data and metadata"""
    cache = extract_apollo_state(html)
    xhr_cache = cache["ROOT_QUERY"]
    reviews = next(v for k, v in xhr_cache.items() if k.startswith("employerReviews") and v.get("reviews"))
    return reviews


async def scrape_reviews(employer_id: str, session: httpx.AsyncClient):
    """Scrape reviews of a given employer"""
    # scrape the first page of reviews:
    first_page = await session.get(
        url=f"https://www.glassdoor.com/Reviews/-Reviews-E{employer_id}_P1.htm",
    )
    reviews = parse_reviews(first_page.text)
    # find total amount of pages and scrape remaining pages concurrently
    total_pages = reviews["numberOfPages"]
    print(f"scraped first page of reviews, scraping remaining {total_pages - 1} pages")
    other_pages = [
        session.get(
            url=str(first_page.url).replace("_P1.htm", f"_P{page}.htm"),
        )
        for page in range(2, total_pages + 1)
    ]
    for page in await asyncio.gather(*other_pages):
        page_reviews = parse_reviews(page.text)
        reviews["reviews"].extend(page_reviews["reviews"])
    return reviews
Run Code & Example Output

We can run our scraper the same way as before:

async def main():
    async with httpx.AsyncClient(
        timeout=httpx.Timeout(30.0),
        cookies={"tldp": "1"},
        follow_redirects=True,
    ) as client:
        reviews = await scrape_reviews("7853", client)
        print(json.dumps(reviews, indent=2))


asyncio.run(main())

Which will produce results similar to:

{
  "__typename": "EmployerReviews",
  "filteredReviewsCountByLang": [
    {
      "__typename": "ReviewsCountByLanguage",
      "count": 4109,
      "isoLanguage": "eng"
    },
    "..."
  ],
  "employer": {
    "__ref": "Employer:7853"
  },
  "queryLocation": null,
  "queryJobTitle": null,
  "currentPage": 1,
  "numberOfPages": 411,
  "lastReviewDateTime": "2022-09-16T14:51:36.650",
  "allReviewsCount": 5017,
  "ratedReviewsCount": 4218,
  "filteredReviewsCount": 4109,
  "ratings": {
    "__typename": "EmployerRatings",
    "overallRating": 4.1,
    "reviewCount": 4218,
    "ceoRating": 0.87,
    "recommendToFriendRating": 0.83,
    "cultureAndValuesRating": 4.2,
    "diversityAndInclusionRating": 4.3,
    "careerOpportunitiesRating": 3.8,
    "workLifeBalanceRating": 4.1,
    "seniorManagementRating": 3.7,
    "compensationAndBenefitsRating": 4.1,
    "businessOutlookRating": 0.66,
    "ceoRatingsCount": 626,
    "ratedCeo": {
      "__ref": "Ceo:768619"
    }
  },
  "reviews": [
    {
      "__typename": "EmployerReview",
      "isLegal": true,
      "reviewId": 64767391,
      "reviewDateTime": "2022-05-27T08:41:43.217",
      "ratingOverall": 5,
      "ratingCeo": "APPROVE",
      "ratingBusinessOutlook": "POSITIVE",
      "ratingWorkLifeBalance": 5,
      "ratingCultureAndValues": 5,
      "ratingDiversityAndInclusion": 5,
      "ratingSeniorLeadership": 5,
      "ratingRecommendToFriend": "POSITIVE",
      "ratingCareerOpportunities": 5,
      "ratingCompensationAndBenefits": 5,
      "employer": {
        "__ref": "Employer:7853"
      },
      "isCurrentJob": true,
      "lengthOfEmployment": 1,
      "employmentStatus": "REGULAR",
      "jobEndingYear": null,
      "jobTitle": {
        "__ref": "JobTitle:60210"
      },
      "location": {
        "__ref": "City:1139151"
      },
      "originalLanguageId": null,
      "pros": "Thorough training, compassionate and very patient with all trainees. Benefits day one. Inclusive and really work with their employees to help them succeed in their role.",
      "prosOriginal": null,
      "cons": "No cons at all! So far everything is great!",
      "consOriginal": null,
      "summary": "Excellent Company",
      "summaryOriginal": null,
      "advice": null,
      "adviceOriginal": null,
      "isLanguageMismatch": false,
      "countHelpful": 2,
      "countNotHelpful": 0,
      "employerResponses": [],
      "isCovid19": false,
      "divisionName": null,
      "divisionLink": null,
      "topLevelDomainId": 1,
      "languageId": "eng",
      "translationMethod": null
    },
    "..."
  ]
}

Other Details

The embedded GraphQL data parsing techniques we've learned in job and review data scraping can be applied to scrape other company details like salaries, interviews, benefits and photos.

For example, salary data can be found the same way we found the reviews:

...

def parse_salaries(html) -> Dict:
    """parse salaries page for salary data and metadata"""
    cache = extract_apollo_state(html)
    xhr_cache = cache["ROOT_QUERY"]
    salaries = next(v for k, v in xhr_cache.items() if k.startswith("salariesByEmployer") and v.get("results"))
    return salaries


async def scrape_salaries(employer_id: str, session: httpx.AsyncClient):
    """Scrape salary records of a given employer"""
    # scrape the first page of salaries:
    first_page = await session.get(
        url=f"https://www.glassdoor.com/Salaries/-Salaries-E{employer_id}_P1.htm",
    )
    salaries = parse_salaries(first_page.text)
    total_pages = salaries["pages"]

    print(f"scraped first page of salaries, scraping remaining {total_pages - 1} pages")
    other_pages = [
        session.get(
            url=str(first_page.url).replace("_P1.htm", f"_P{page}.htm"),
        )
        for page in range(2, total_pages + 1)
    ]
    for page in await asyncio.gather(*other_pages):
        page_salaries = parse_salaries(page.text)
        salaries["results"].extend(page_salaries["results"])
    return salaries
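
As with jobs and reviews, we can run this scraper with an async httpx client. Here's a minimal sketch reusing the imports and session setup from the earlier run examples:

async def main():
    async with httpx.AsyncClient(
        timeout=httpx.Timeout(30.0),
        cookies={"tldp": "1"},  # United States
        follow_redirects=True,
    ) as client:
        salaries = await scrape_salaries("7853", client)  # eBay
        print(json.dumps(salaries, indent=2))

asyncio.run(main())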

By understanding how Glassdoor's web infrastructure works, we can easily collect all of the public company datasets available on Glassdoor.

However, when collecting all of this data at scale we're very likely to be blocked, so let's take a look at how we can avoid blocking by using the ScrapFly API.

Avoiding Blocking with ScrapFly

ScrapFly offers several powerful features that'll help us get around web scraper blocking, such as anti-scraping protection bypass and residential proxies.

For this, we'll be using the ScrapFly SDK python package. To start, let's install scrapfly-sdk using pip:

$ pip install scrapfly-sdk

To convert our httpx-powered code to use the ScrapFly SDK, all we have to do is replace our httpx requests with SDK ones:

from scrapfly import ScrapflyClient, ScrapeConfig

client = ScrapflyClient(key='YOUR_SCRAPFLY_KEY')
result = client.scrape(ScrapeConfig(
    url="https://www.glassdoor.com/Salary/eBay-Salaries-E7853.htm",
    # we can enable anti-scraper protection bypass:
    asp=True,
    # or select proxies from specific countries:
    country="US",
    # and change proxy types:
    proxy_pool="public_residential_pool",
))

Let's take a look at how our full scraper looks with ScrapFly features enabled next.

Full Scraper Code

We can put everything we've learned together with scrapfly integration for the final scraper code:

import asyncio
import json
import math
import re
from collections import defaultdict
from typing import Dict, List, Tuple

from scrapfly import ScrapeApiResponse, ScrapeConfig, ScrapflyClient

client = ScrapflyClient(key="YOUR SCRAPFLY KEY", max_concurrency=10)
BASE_CONFIG = {
    # we can select any country proxy:
    "country": "CA",
    # To see Glassdoor results of a specific country we must set a cookie:
    "cookies": {"tldp": "1"}
    # note: here are the other country codes:
    # 1: United States
    # 2: United Kingdom
    # 3: Canada (English)
    # 4: India
    # 5: Australia
    # 6: France
    # 7: Deutschland
    # 8: España
    # 9: Brasil
    # 10: Nederland
    # 11: Österreich
    # 12: México
    # 13: Argentina
    # 14: België (Nederlands)
    # 15: Belgique (Français)
    # 16: Schweiz (Deutsch)
    # 17: Suisse (Français)
    # 18: Ireland
    # 19: Canada (Français)
    # 20: Hong Kong
    # 21: New Zealand
    # 22: Singapore
    # 23: Italia
}


def find_json_objects(text: str, decoder=json.JSONDecoder()):
    """Find JSON objects in text, and generate decoded JSON data and it's ID"""
    pos = 0
    while True:
        match = text.find("{", pos)
        if match == -1:
            break
        try:
            result, index = decoder.raw_decode(text[match:])
            # backtrack to find the key/identifier for this json object:
            key_end = text.rfind('"', 0, match)
            key_start = text.rfind('"', 0, key_end)
            key = text[key_start + 1 : key_end]
            yield key, result
            pos = match + index
        except ValueError:
            pos = match + 1


def extract_apollo_state(result: ScrapeApiResponse) -> Dict:
    """Extract apollo graphql state data from HTML source"""
    data = re.findall(r'apolloState":\s*({.+})};', result.content)[0]
    return json.loads(data)


def extract_apollo_cache(result: ScrapeApiResponse) -> Dict[str, List]:
    """Extract apollo graphql cache data from HTML source"""
    script_with_cache = result.selector.xpath("//script[contains(.,'window.appCache')]/text()").get()
    cache = defaultdict(list)
    for key, data in find_json_objects(script_with_cache):
        cache[key].append(data)
    return cache


def parse_jobs(result: ScrapeApiResponse) -> List[Dict]:
    """parse jobs page for job listing data"""
    cache = extract_apollo_cache(result)
    return [v["jobview"] for v in cache["JobListingSearchResult"]]


def parse_job_page_count(result: ScrapeApiResponse) -> int:
    """parse job page count from pagination details in Glassdoor jobs page"""
    total_results = result.selector.css(".paginationFooter::text").get()
    if not total_results:
        return 1
    total_results = int(total_results.split()[-1])
    total_pages = math.ceil(total_results / 40)
    return total_pages


def change_page(url: str, page: int) -> str:
    """update page number in a glassdoor url"""
    new = re.sub(r"(?:_P\d+)*\.htm", f"_P{page}.htm", url)
    assert new != url
    return new


async def scrape_jobs(employer_id: str) -> List[Dict]:
    """Scrape job listings"""
    first_page_url = (
        f"https://www.glassdoor.com/Jobs/-Jobs-E{employer_id}_P1.htm?filter.countryId={BASE_CONFIG['cookies']['tldp']}"
    )
    first_page = await client.async_scrape(ScrapeConfig(url=first_page_url, **BASE_CONFIG))

    jobs = parse_jobs(first_page)
    total_pages = parse_job_page_count(first_page)

    print(f"scraped first page of jobs, scraping remaining {total_pages - 1} pages")
    other_pages = [
        ScrapeConfig(url=change_page(first_page.context["url"], page=page), **BASE_CONFIG)
        for page in range(2, total_pages + 1)
    ]
    async for result in client.concurrent_scrape(other_pages):
        jobs.extend(parse_jobs(result))
    return jobs


def parse_reviews(result: ScrapeApiResponse) -> Dict:
    """parse reviews page for review data"""
    cache = extract_apollo_state(result)
    xhr_cache = cache["ROOT_QUERY"]
    reviews = next(v for k, v in xhr_cache.items() if k.startswith("employerReviews") and v.get("reviews"))
    return reviews


async def scrape_reviews(employer_id: str) -> Dict:
    """Scrape reviews of a given company"""
    # scrape the first page of reviews:
    first_page_url = f"https://www.glassdoor.com/Reviews/-Reviews-E{employer_id}_P1.htm?filter.countryId={BASE_CONFIG['cookies']['tldp']}"
    first_page = await client.async_scrape(ScrapeConfig(url=first_page_url, **BASE_CONFIG))

    reviews = parse_reviews(first_page)
    total_pages = reviews["numberOfPages"]

    print(f"scraped first page of reviews, scraping remaining {total_pages - 1} pages")
    other_pages = [
        ScrapeConfig(url=change_page(first_page.context["url"], page=page), **BASE_CONFIG)
        for page in range(2, total_pages + 1)
    ]
    async for result in client.concurrent_scrape(other_pages):
        reviews["reviews"].extend(parse_reviews(result)["reviews"])
    return reviews


def parse_salaries(result: ScrapeApiResponse) -> Dict:
    """parse salary page for salary data"""
    cache = extract_apollo_state(result)
    xhr_cache = cache["ROOT_QUERY"]
    salaries = next(v for k, v in xhr_cache.items() if k.startswith("salariesByEmployer") and v.get("results"))
    return salaries


async def scrape_salaries(employer_id: str) -> Dict:
    """Scrape salary listings"""
    # scrape the first page of salaries:
    first_page_url = f"https://www.glassdoor.com/Salaries/-Salaries-E{employer_id}_P1.htm?filter.countryId={BASE_CONFIG['cookies']['tldp']}"
    first_page = await client.async_scrape(ScrapeConfig(url=first_page_url, **BASE_CONFIG))
    salaries = parse_salaries(first_page)
    total_pages = salaries["pages"]

    print(f"scraped first page of salaries, scraping remaining {total_pages - 1} pages")
    other_pages = [
        ScrapeConfig(url=change_page(first_page.context["url"], page=page), **BASE_CONFIG)
        for page in range(2, total_pages + 1)
    ]
    async for result in client.concurrent_scrape(other_pages):
        salaries["results"].extend(parse_salaries(result)["results"])
    return salaries


if __name__ == "__main__":
    ebay_jobs_in_US = asyncio.run(scrape_jobs("7853"))
    ebay_reviews_in_US = asyncio.run(scrape_reviews("7853"))
    ebay_salaries_in_US = asyncio.run(scrape_salaries("7853"))

FAQ

To wrap this guide up, let's take a look at some frequently asked questions about web scraping Glassdoor.com:

How to find all company pages listed on Glassdoor?

Glassdoor contains over half a million US companies alone but doesn't have a sitemap. It does, however, contain several limited directory pages, such as the directory for US companies. Unfortunately, directory pages are limited to a few hundred pages, though with crawling and filtering it's possible to discover all of the company pages.

Another approach, which we covered in this tutorial, is to discover company pages through company IDs. Each company on Glassdoor is assigned an incremental ID in the 1,000 to 1,000,000+ range. Using HEAD-type requests we can easily poke each of these IDs to see whether it leads to a company page, as sketched below.
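
Here's a minimal sketch of that idea, assuming that existing company pages resolve to a 200 status after redirects (in practice this should be rate-limited and run concurrently):

import httpx

def company_page_exists(company_id: int) -> bool:
    """check whether a company ID resolves to a Glassdoor company page"""
    url = f"https://www.glassdoor.com/Overview/-EI_IE{company_id}.htm"
    response = httpx.head(url, follow_redirects=True)
    return response.status_code == 200

# probe a small range of IDs:
for company_id in range(7850, 7860):
    if company_page_exists(company_id):
        print(f"found a company page for ID {company_id}")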

Summary

In this web scraping tutorial, we've taken a look at how we can scrape company details such as metadata, reviews, job listings and salaries displayed on glassdoor.com.

We did this by taking advantage of the GraphQL cache and state data, which we extracted with a few generic web scraping algorithms in plain Python.
