How to Scrape Indeed.com


In this web scraping tutorial, we'll take a look at how to scrape job listing data from Indeed.com.

Indeed.com is one of the most popular job listing websites, and it's pretty easy to scrape!
In this tutorial, we'll build our scraper with just a few lines of Python code. We'll take a look at how Indeed's search works to replicate it in our scraper and extract job data from embedded javascript variables. Let's dive in!

Web Scraping With Python Tutorial

If you're new to web scraping with Python we recommend checking out our full introduction tutorial to web scraping with Python and common best practices.


Setup

For this web scraper, we'll only need an HTTP client library such as httpx, which can be installed with the pip console command:

$ pip install httpx 

There are many HTTP clients in Python, such as requests, httpx and aiohttp. However, we recommend httpx as it's the one least likely to be blocked since it supports the HTTP/2 protocol. httpx also supports asynchronous Python, which means we can scrape many pages concurrently and really fast.

Finding Jobs

To start, let's take a look at how we can find job listings on Indeed.com.
If we go to the homepage and submit our search, we can see that Indeed redirects us to a search URL with a few key parameters:

https://www.indeed.com/jobs?q=python&l=Texas

So, to find Python jobs in Texas, all we have to do is send a request with l=Texas and q=Python URL parameters:

import httpx
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Connection": "keep-alive",
    "Accept-Language": "en-US,en;q=0.9,lt;q=0.8,et;q=0.7,de;q=0.6",
}

response = httpx.get("https://www.indeed.com/jobs?q=python&l=Texas", headers=HEADERS)
print(response)

Note: if you receive a 403 response status code here, you are likely being blocked; see the Avoid Blocking with ScrapFly section below for more information.

We got a single page that contains 15 job listings! Before we collect the remaining pages, let's see how we can parse job listing data from this response.

We could parse the HTML document using CSS or XPath selectors, but there's an easier way: we can find all of the job listing data hidden away deep in the HTML as a JSON document:

page source of indeed.com search page embedded data

So, instead, let's parse this data using a simple regular expression pattern:

import json
import re

import httpx
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Connection": "keep-alive",
    "Accept-Language": "en-US,en;q=0.9,lt;q=0.8,et;q=0.7,de;q=0.6",
}

def parse_search_page(html: str):
    data = re.findall(r'window.mosaic.providerData\["mosaic-provider-jobcards"\]=(\{.+?\});', html)
    data = json.loads(data[0])
    return {
        "results": data['metaData']['mosaicProviderJobCardsModel']['results'],
        "meta": data['metaData']['mosaicProviderJobCardsModel']['tierSummaries'],
    }

response = httpx.get("https://www.indeed.com/jobs?q=python&l=Texas", headers=HEADERS)
print(parse_search_page(response.text))

In the code above, we use a regular expression pattern to select the mosaic-provider-jobcards variable's value, load it as a Python dictionary and parse out the results and paging metadata.
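To see how this extraction works in isolation, here's the same pattern applied to a tiny hand-made snippet (the snippet only mimics the shape of Indeed's embedded variable; it's not real page data):

```python
import json
import re

# a synthetic snippet mimicking Indeed's embedded javascript variable
sample_html = '<script>window.mosaic.providerData["mosaic-provider-jobcards"]={"metaData": {"mosaicProviderJobCardsModel": {"results": [{"title": "Python Developer"}], "tierSummaries": [{"jobCount": 1}]}}};</script>'

# the non-greedy group captures everything between "=" and the first "};"
match = re.findall(r'window\.mosaic\.providerData\["mosaic-provider-jobcards"\]=(\{.+?\});', sample_html)
data = json.loads(match[0])
print(data["metaData"]["mosaicProviderJobCardsModel"]["results"][0]["title"])  # Python Developer
```

The non-greedy `.+?` stops at the first `};` sequence, which on the real page marks the end of the embedded JSON statement.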

Now that we have the first page's results and the total result count, we can retrieve the remaining pages:

import asyncio
import json
import re
from typing import List
from urllib.parse import urlencode

import httpx


def parse_search_page(html: str):
    data = re.findall(r'window.mosaic.providerData\["mosaic-provider-jobcards"\]=(\{.+?\});', html)
    data = json.loads(data[0])
    return {
        "results": data["metaData"]["mosaicProviderJobCardsModel"]["results"],
        "meta": data["metaData"]["mosaicProviderJobCardsModel"]["tierSummaries"],
    }


async def scrape_search(client: httpx.AsyncClient, query: str, location: str):
    def make_page_url(offset):
        parameters = {"q": query, "l": location, "filter": 0, "start": offset}
        return "https://www.indeed.com/jobs?" + urlencode(parameters)

    print(f"scraping first page of search: {query=}, {location=}")
    response_first_page = await client.get(make_page_url(0))
    data_first_page = parse_search_page(response_first_page.text)

    results = data_first_page["results"]
    total_results = sum(category["jobCount"] for category in data_first_page["meta"])
    # there's a page limit on indeed.com
    if total_results > 1000:
        total_results = 1000
    print(f"scraping remaining {(total_results - 10) // 10} pages")
    other_pages = [make_page_url(offset) for offset in range(10, total_results + 10, 10)]
    for response in await asyncio.gather(*[client.get(url=url) for url in other_pages]):
        results.extend(parse_search_page(response.text)["results"])
    return results
async def main():
    # we need to use browser-like headers to avoid being blocked instantly:
    HEADERS = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9,lt;q=0.8,et;q=0.7,de;q=0.6",
    }
    async with httpx.AsyncClient(headers=HEADERS) as client:
        search_data = await scrape_search(client, query="python", location="texas")
        print(json.dumps(search_data, indent=2))

asyncio.run(main())

This will result in search result data similar to:

[
    {
        "company": "Apple",
        "companyBrandingAttributes": {
            "headerImageUrl": "https://d2q79iu7y748jz.cloudfront.net/s/_headerimage/1960x400/ecdb4796986d27b654fe959e2fdac201",
            "logoUrl": "https://d2q79iu7y748jz.cloudfront.net/s/_squarelogo/256x256/86583e966849b2f081928769a6abdb09"
        },
        "companyIdEncrypted": "c1099851e9794854",
        "companyOverviewLink": "/cmp/Apple",
        "companyOverviewLinkCampaignId": "serp-linkcompanyname",
        "companyRating": 4.1,
        "companyReviewCount": 11193,
        "companyReviewLink": "/cmp/Apple/reviews",
        "companyReviewLinkCampaignId": "cmplinktst2",
        "displayTitle": "Software Quality Engineer, Apple Pay",
        "employerAssistEnabled": false,
        "employerResponsive": false,
        "encryptedFccompanyId": "6e7b40121fbb5e2f",
        "encryptedResultData": "VwIPTVJ1cTn5AN7Q-tSqGRXGNe2wB2UYx73qSczFnGU",
        "expired": false,
        "extractTrackingUrls": "https://jsv3.recruitics.com/partner/a51b8de1-f7bf-11e7-9edd-d951492604d9.gif?client=3427&rx_c=&rx_campaign=indeed16&rx_group=130795&rx_source=Indeed&job=200336736-2&rx_r=none&rx_ts=20220831T001748Z&rx_pre=1&indeed=sp",
        "extractedEntities": [],
        "fccompanyId": -1,
        "featuredCompanyAttributes": {},
        "featuredEmployer": false,
        "featuredEmployerCandidate": false,
        "feedId": 2772,
        "formattedLocation": "Austin, TX",
        "formattedRelativeTime": "Today",
        "hideMetaData": false,
        "hideSave": false,
        "highVolumeHiringModel": {
            "highVolumeHiring": false
        },
        "highlyRatedEmployer": false,
        "hiringEventJob": false,
        "indeedApplyEnabled": false,
        "indeedApplyable": false,
        "isJobSpotterJob": false,
        "isJobVisited": false,
        "isMobileThirdPartyApplyable": true,
        "isNoResumeJob": false,
        "isSubsidiaryJob": false,
        "jobCardRequirementsModel": {
            "additionalRequirementsCount": 0,
            "requirementsHeaderShown": false
        },
        "jobLocationCity": "Austin",
        "jobLocationState": "TX",
        "jobTypes": [],
        "jobkey": "5b47456ae8554711",
        "jsiEnabled": false,
        "locationCount": 0,
        "mobtk": "1gbpe4pcikib6800",
        "moreLocUrl": "",
        "newJob": true,
        "normTitle": "Software Quality Engineer",
        "openInterviewsInterviewsOnTheSpot": false,
        "openInterviewsJob": false,
        "openInterviewsOffersOnTheSpot": false,
        "openInterviewsPhoneJob": false,
        "overrideIndeedApplyText": true,
        "preciseLocationModel": {
            "obfuscateLocation": false,
            "overrideJCMPreciseLocationModel": true
        },
        "pubDate": 1661835600000,
        "redirectToThirdPartySite": false,
        "remoteLocation": false,
        "resumeMatch": false,
        "salarySnippet": {
            "salaryTextFormatted": false
        },
        "saved": false,
        "savedApplication": false,
        "showCommutePromo": false,
        "showEarlyApply": false,
        "showJobType": false,
        "showRelativeDate": true,
        "showSponsoredLabel": false,
        "showStrongerAppliedLabel": false,
        "smartFillEnabled": false,
        "snippet": "<ul style=\"list-style-type:circle;margin-top: 0px;margin-bottom: 0px;padding-left:20px;\"> \n <li style=\"margin-bottom:0px;\">At Apple, new ideas become extraordinary products, services, and customer experiences.</li>\n <li>We have the rare and rewarding opportunity to shape upcoming products\u2026</li>\n</ul>",
        "sourceId": 2700,
        "sponsored": true,
        "taxoAttributes": [],
        "taxoAttributesDisplayLimit": 5,
        "taxoLogAttributes": [],
        "taxonomyAttributes": [ { "attributes": [], "label": "job-types" }, "..."],
        "tier": {
            "matchedPreferences": {
                "longMatchedPreferences": [],
                "stringMatchedPreferences": []
            },
            "type": "DEFAULT"
        },
        "title": "Software Quality Engineer, Apple Pay",
        "translatedAttributes": [],
        "translatedCmiJobTags": [],
        "truncatedCompany": "Apple",
        "urgentlyHiring": false,
        "viewJobLink": "...",
        "vjFeaturedEmployerCandidate": false
    },
]

We've successfully scraped mountains of data with very few lines of Python code! Next, let's take a look at how to get the remainder of the job listing details (like full description) by scraping job pages.

Scraping Job Data

Our search results contain almost all job listing data except a few details, such as the full job description. To scrape these, we need the job id, which is found in the jobkey field of each search result:

{
  "jobkey": "a82cf0bd2092efa3",
}

Using the jobkey, we can request the full job details page, and just like with the search, we can parse the embedded data instead of the HTML:

page source of indeed.com job page embedded data

We can see that all of the job and page information is hidden in the _initialData variable, which we can extract with a simple regular expression pattern:

import asyncio
import json
import re
from typing import List

import httpx


def parse_job_page(html):
    """parse job data from job listing page"""
    data = re.findall(r"_initialData=(\{.+?\});", html)
    data = json.loads(data[0])
    return data["jobInfoWrapperModel"]["jobInfoModel"]


async def scrape_jobs(client: httpx.AsyncClient, job_keys: List[str]):
    """scrape job details from job page for given job keys"""
    urls = [f"https://www.indeed.com/m/basecamp/viewjob?viewtype=embedded&jk={job_key}" for job_key in job_keys]
    scraped = []
    for response in await asyncio.gather(*[client.get(url=url) for url in urls]):
        scraped.append(parse_job_page(response.text))
    return scraped
async def main():
    HEADERS = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Connection": "keep-alive",
        "Accept-Language": "en-US,en;q=0.9,lt;q=0.8,et;q=0.7,de;q=0.6",
    }
    async with httpx.AsyncClient(headers=HEADERS) as client:
        job_data = await scrape_jobs(client, ["a82cf0bd2092efa3"])
        print(job_data[0]['sanitizedJobDescription']['content'])
        print(job_data)

asyncio.run(main())

This will scrape results similar to:

[
    {
        "jobInfoHeaderModel": {
            "...",
            "companyName": "ExxonMobil",
            "companyOverviewLink": "https://www.indeed.com/cmp/Exxonmobil?campaignid=mobvjcmp&from=mobviewjob&tk=1gbpekba3is92800&fromjk=9dacdef3068a1d25",
            "companyReviewLink": "https://www.indeed.com/cmp/Exxonmobil/reviews?campaignid=mobvjcmp&cmpratingc=mobviewjob&from=mobviewjob&tk=1gbpekba3is92800&fromjk=9dacdef3068a1d25&jt=Geoscience+Technician",
            "companyReviewModel": {
                "companyName": "ExxonMobil",
                "desktopCompanyLink": "https://www.indeed.com/cmp/Exxonmobil/reviews?campaignid=viewjob&cmpratingc=mobviewjob&from=viewjob&tk=1gbpekba3is92800&fromjk=9dacdef3068a1d25&jt=Geoscience+Technician",
                "mobileCompanyLink": "https://www.indeed.com/cmp/Exxonmobil/reviews?campaignid=mobvjcmp&cmpratingc=mobviewjob&from=mobviewjob&tk=1gbpekba3is92800&fromjk=9dacdef3068a1d25&jt=Geoscience+Technician",
                "ratingsModel": {
                    "ariaContent": "3.9 out of 5 stars from 4,649 employee ratings",
                    "count": 4649,
                    "countContent": "4,649 reviews",
                    "descriptionContent": "Read what people are saying about working here.",
                    "rating": 3.9,
                    "showCount": true,
                    "showDescription": true,
                    "size": null
                }
            },
            "disableAcmeLink": false,
            "employerActivity": null,
            "employerResponsiveCardModel": null,
            "formattedLocation": "Spring, TX 77389",
            "hideRating": false,
            "isDesktopApplyButtonSticky": false,
            "isSimplifiedHeader": false,
            "jobTitle": "Geoscience Technician",
            "openCompanyLinksInNewTab": false,
            "parentCompanyName": null,
            "preciseLocationModel": null,
            "ratingsModel": null,
            "remoteWorkModel": null,
            "subtitle": "ExxonMobil - Spring, TX 77389",
            "tagModels": null,
            "viewJobDisplay": "DESKTOP_EMBEDDED"
        },
        "sanitizedJobDescription": {
            "content": "<p></p>\n<div>\n <div>\n  <div>\n   <div>\n    <h2 class=\"jobSectionHeader\"><b>Education and Related Experience</b></h2>\n   </div>\n   <div>\n  ...",
            "contentKind": "HTML"
        },
        "viewJobDisplay": "DESKTOP_EMBEDDED"
    }
]

We should see the full job description printed out if we run this scraper.
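The description comes back as an HTML fragment (the sanitizedJobDescription field above). If plain text is more convenient, here's a minimal sketch using Python's built-in html.parser to strip the tags (the sample fragment below is hypothetical, shaped like Indeed's output):

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """collect only the text content of an HTML fragment"""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def text(self):
        # join fragments and normalize whitespace
        return " ".join(" ".join(self.parts).split())


# hypothetical fragment shaped like sanitizedJobDescription["content"]
description_html = '<h2 class="jobSectionHeader"><b>Education</b></h2><p>Bachelor degree required.</p>'
extractor = TextExtractor()
extractor.feed(description_html)
print(extractor.text())  # Education Bachelor degree required.
```

For heavier HTML processing, a dedicated parser library would be a better fit, but the standard library is enough for simple tag stripping like this.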


With this last feature, our scraper is ready to go! However, if we run it at scale, we're likely to get blocked, so let's take a look at how we can integrate ScrapFly to avoid that.

Avoid Blocking with ScrapFly

Indeed.com uses anti-scraping protection to block web scraper traffic. To get around this, we can use the ScrapFly web scraping API, which offers several powerful features:

illustration of scrapfly's middleware

For our Indeed scraper, we'll be using the Anti Scraping Protection Bypass feature via the scrapfly-sdk package, which can be installed with the pip console command:

$ pip install scrapfly-sdk

Now, we can enable the Anti Scraping Protection bypass via the asp=True flag:

from scrapfly import ScrapflyClient, ScrapeConfig

client = ScrapflyClient(key="YOUR_API_KEY")
result = client.scrape(ScrapeConfig(
    url="https://www.indeed.com/jobs?q=python&l=Texas",
    asp=True,
    # ^ enable Anti Scraping Protection
))
print(result.content)  # print page HTML

Let's convert the rest of our scraper to the ScrapFly SDK to avoid blocking entirely.

Full Scraper Code

Let's put everything we've learned in this tutorial together into a single scraper:

import asyncio
import json
import re
from typing import List
from urllib.parse import urlencode

from scrapfly import ScrapeApiResponse, ScrapeConfig, ScrapflyClient


def parse_search_page(result):
    data = re.findall(r'window.mosaic.providerData\["mosaic-provider-jobcards"\]=(\{.+?\});', result.content)
    data = json.loads(data[0])
    return {
        "results": data["metaData"]["mosaicProviderJobCardsModel"]["results"],
        "meta": data["metaData"]["mosaicProviderJobCardsModel"]["tierSummaries"],
    }


async def scrape_search(client: ScrapflyClient, query: str, location: str):
    def make_page_url(offset):
        parameters = {"q": query, "l": location, "filter": 0, "start": offset}
        return "https://www.indeed.com/jobs?" + urlencode(parameters)

    print(f"scraping first page of search: {query=}, {location=}")
    result_first_page = await client.async_scrape(
        ScrapeConfig(
            make_page_url(0),
            country="US",
            asp=True,
        )
    )
    data_first_page = parse_search_page(result_first_page)

    results = data_first_page["results"]
    total_results = sum(category["jobCount"] for category in data_first_page["meta"])
    # there's a page limit on indeed.com
    if total_results > 1000:
        total_results = 1000

    print(f"scraping remaining {(total_results - 10) // 10} pages")
    other_pages = [
        ScrapeConfig(url=make_page_url(offset), country="US", asp=True) for offset in range(10, total_results + 10, 10)
    ]
    async for result in client.concurrent_scrape(other_pages):
        try:
            data = parse_search_page(result)
            results.extend(data["results"])
        except Exception as e:
            print(e)
    return results


def parse_job_page(result: ScrapeApiResponse):
    """parse job data from job listing page"""
    data = re.findall(r"_initialData=(\{.+?\});", result.content)
    data = json.loads(data[0])
    return data["jobInfoWrapperModel"]["jobInfoModel"]


async def scrape_jobs(client: ScrapflyClient, job_keys: List[str]):
    """scrape job page"""
    urls = [f"https://www.indeed.com/m/basecamp/viewjob?viewtype=embedded&jk={job_key}" for job_key in job_keys]
    scraped = []
    async for result in client.concurrent_scrape([ScrapeConfig(url=url, country="US", asp=True) for url in urls]):
        scraped.append(parse_job_page(result))
    return scraped


async def main():
    with ScrapflyClient(key="YOUR SCRAPFLY KEY", max_concurrency=10) as client:
        search_results = await scrape_search(client, "python", "Texas")
        print(json.dumps(search_results, indent=2))
        _found_job_ids = [result["jobkey"] for result in search_results]
        job_results = await scrape_jobs(client, job_keys=_found_job_ids[:10])
        print(json.dumps(job_results, indent=2))


asyncio.run(main())

By using the ScrapFly SDK instead of httpx, we can safely avoid scraper blocking with very few modifications to our code!

Summary

In this short web scraping tutorial, we looked at scraping Indeed.com's job listing search. We built a search URL using custom search parameters and parsed job data from the embedded JSON using regular expressions. As a bonus, we also scraped full job listing descriptions and saw how to avoid blocking using the ScrapFly SDK.
