In this guide, we'll explain how to scrape data from LinkedIn, the most popular career-related social media platform out there.
We'll scrape LinkedIn information from search, job, company, and public profile pages, all through straightforward Python code along with a few parsing tips and tricks. Let's get started!
Legal Disclaimer and Precautions
This tutorial covers popular web scraping techniques for educational purposes. Interacting with public servers requires diligence and respect, and here's a good summary of what not to do:
Do not scrape at rates that could damage the website.
Do not scrape data that's not available publicly.
Do not store PII of EU citizens who are protected by GDPR.
Do not repurpose entire public datasets, which can be illegal in some countries.
Scrapfly does not offer legal advice, but these are good general rules to follow in web scraping. For more, you should consult a lawyer.
A LinkedIn scraping tool enables valuable data extraction for both businesses and individuals through different use cases.
Market Research
Market trends and qualifications are fast-changing. Hence, LinkedIn web scraping is beneficial for keeping up with these changes by extracting industry-related data from company or job pages.
Personalized Job Research
LinkedIn includes thousands of job listing posts across various domains. Scraping data from LinkedIn enables creating alerts for personalized job preferences while also aggregating this data to gain insights into the in-demand skills and job requirements.
Lead Generation
Scraping leads from LinkedIn provides businesses with a wide range of opportunities by identifying potential leads with common interests. This lead data empowers decision-making and helps attract new clients.
Scraping LinkedIn data without getting blocked using Scrapfly is fairly straightforward. All we have to do is replace our HTTP client with the ScrapFly client, enable the asp parameter, and select a proxy country:
# standard web scraping code
import httpx
from parsel import Selector
response = httpx.get("some linkedin.com URL")
selector = Selector(response.text)
# in ScrapFly becomes this 👇
from scrapfly import ScrapeConfig, ScrapflyClient
# replaces your HTTP client (httpx in this case)
scrapfly = ScrapflyClient(key="Your ScrapFly API key")
response = scrapfly.scrape(ScrapeConfig(
    url="website URL",
    asp=True,  # enable the anti scraping protection to bypass blocking
    country="US",  # set the proxy location to a specific country
    proxy_pool="public_residential_pool",  # select the residential proxy pool for higher success rate
    render_js=True  # enable rendering JavaScript (like headless browsers) to scrape dynamic content if needed
))
# use the built in Parsel selector
selector = response.selector
# access the HTML content
html = response.scrape_result['content']
Since LinkedIn is known for its high blocking rate, we'll be using Scrapfly to extract data from LinkedIn for the rest of this guide. So, to follow along, register to get your API key.
In this section, we'll extract the publicly available data on LinkedIn user profiles. If we take a look at one of the public LinkedIn profiles (like the one for Bill Gates), we can see loads of valuable public data:
Before we start scraping LinkedIn profiles, let's identify the HTML parsing approach. We can manually parse each data point from the HTML or extract data from hidden script tags.
To locate this hidden data, we can follow these steps:
Search the page's HTML (for example, in the browser developer tools) for the XPath selector: //script[@type='application/ld+json'].
This will lead to a script tag with the following details:
This script tag holds the core details available on the page, though a few fields, such as the job title, are missing because the page is viewed publicly (without logging in). To scrape it, we'll extract the script and parse it:
import json
import asyncio
from typing import Dict, List
from parsel import Selector
from loguru import logger as log
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

SCRAPFLY = ScrapflyClient(key="Your ScrapFly API key")

BASE_CONFIG = {
    # bypass linkedin.com web scraping blocking
    "asp": True,
    # set the proxy country to US
    "country": "US",
    "headers": {
        "Accept-Language": "en-US,en;q=0.5"
    }
}
def refine_profile(data: Dict) -> Dict:
    """refine and clean the parsed profile data"""
    parsed_data = {}
    # the JSON-LD graph mixes object types; pick the Person object for the profile
    profile_data = [key for key in data["@graph"] if key["@type"] == "Person"][0]
    # keep only the first listed employer
    profile_data["worksFor"] = [profile_data["worksFor"][0]]
    articles = [key for key in data["@graph"] if key["@type"] == "Article"]
    for article in articles:
        # the article body is HTML; keep only the paragraph text
        selector = Selector(article["articleBody"])
        article["articleBody"] = "".join(selector.xpath("//p/text()").getall())
    parsed_data["profile"] = profile_data
    parsed_data["posts"] = articles
    return parsed_data


def parse_profile(response: ScrapeApiResponse) -> Dict:
    """parse profile data from hidden script tags"""
    selector = response.selector
    data = json.loads(selector.xpath("//script[@type='application/ld+json']/text()").get())
    refined_data = refine_profile(data)
    return refined_data


async def scrape_profile(urls: List[str]) -> List[Dict]:
    """scrape public linkedin profile pages"""
    to_scrape = [ScrapeConfig(url, **BASE_CONFIG) for url in urls]
    data = []
    # scrape the URLs concurrently
    async for response in SCRAPFLY.concurrent_scrape(to_scrape):
        profile_data = parse_profile(response)
        data.append(profile_data)
    log.success(f"scraped {len(data)} profiles from Linkedin")
    return data
Run the code
async def run():
    profile_data = await scrape_profile(
        urls=[
            "https://www.linkedin.com/in/williamhgates"
        ]
    )
    # save the data to a JSON file
    with open("profile.json", "w", encoding="utf-8") as file:
        json.dump(profile_data, file, indent=2, ensure_ascii=False)

if __name__ == "__main__":
    asyncio.run(run())
In the above LinkedIn profile scraper, we define three functions. Let's break them down:
scrape_profile(): To request LinkedIn account URLs concurrently and utilize the parsing logic to extract each profile data.
parse_profile(): To parse the script tag containing the profile data.
refine_profile(): To refine and organize the extracted data.
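The parse-and-refine flow can also be tried offline. The snippet below is a minimal sketch that uses only the standard library instead of parsel, and the SAMPLE_HTML payload is made up for illustration, so the real LinkedIn markup and field names may differ:

```python
import json
from html.parser import HTMLParser

# Made-up page fragment for illustration only; real LinkedIn markup will differ.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@graph": [
  {"@type": "Person", "name": "Jane Doe",
   "worksFor": [{"name": "Example Corp"}, {"name": "Old Corp"}]},
  {"@type": "Article", "headline": "Hello",
   "articleBody": "<p>First part. </p><p>Second part.</p>"}
]}
</script>
</head><body></body></html>
"""


class LdJsonExtractor(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._inside = True
            self.blocks.append("")

    def handle_data(self, data):
        if self._inside:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False


class TextOnly(HTMLParser):
    """Strip tags from an HTML fragment, keeping only the text."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(fragment):
    """Return the text content of an HTML fragment."""
    parser = TextOnly()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.parts)


def extract_profile(html):
    """Load the JSON-LD graph and split it into a profile and its posts."""
    extractor = LdJsonExtractor()
    extractor.feed(html)
    graph = json.loads(extractor.blocks[0])["@graph"]
    person = next(item for item in graph if item["@type"] == "Person")
    # keep only the first listed employer
    person["worksFor"] = person["worksFor"][:1]
    posts = [item for item in graph if item["@type"] == "Article"]
    for post in posts:
        post["articleBody"] = strip_tags(post["articleBody"])
    return {"profile": person, "posts": posts}


result = extract_profile(SAMPLE_HTML)
print(result["profile"]["worksFor"])
print(result["posts"][0]["articleBody"])
```

Note that strip_tags() keeps all text in the fragment, which is a simplification of the //p/text() selector used in the scraper.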
Here's a sample output of the LinkedIn profile data retrieved
With this LinkedIn lead scraper, we can successfully gather detailed information on potential leads, given their job titles, companies, industries, and contact information from LinkedIn profiles. This contact data allows for more personalized and strategic outreach efforts.
Next, let's explore how to scrape company data!
How to Scrape LinkedIn Company Pages?
LinkedIn company profiles include various valuable data points like the company's industry, addresses, number of employees, jobs, and related company businesses. Moreover, the company profiles are public, meaning that we can scrape their full details!
Let's start by taking a look at a company profile page on LinkedIn such as Microsoft:
Just like with people pages, the LinkedIn company page data can also be found in hidden script tags:
From the above image, we can see that the script tag doesn't contain the full company details. Therefore, to extract the entire company dataset, we'll use a bit of HTML parsing as well:
import json
import asyncio
import jmespath
from typing import Dict, List
from loguru import logger as log
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

SCRAPFLY = ScrapflyClient(key="Your ScrapFly API key")

BASE_CONFIG = {
    "asp": True,
    "country": "US",
    "headers": {
        "Accept-Language": "en-US,en;q=0.5"
    }
}

def strip_text(text):
    """remove extra spaces while handling None values"""
    return text.strip() if text is not None else text
def parse_company(response: ScrapeApiResponse) -> Dict:
    """parse company main overview page"""
    selector = response.selector
    script_data = json.loads(selector.xpath("//script[@type='application/ld+json']/text()").get())
    # keep only the fields of interest from the JSON-LD object
    script_data = jmespath.search(
        """{
        name: name,
        url: url,
        mainAddress: address,
        description: description,
        numberOfEmployees: numberOfEmployees.value,
        logo: logo
        }""",
        script_data
    )
    # the "about us" block is a list of <dt> label / <dd> value pairs
    data = {}
    for element in selector.xpath("//div[contains(@data-test-id, 'about-us')]"):
        name = strip_text(element.xpath(".//dt/text()").get())
        value = strip_text(element.xpath(".//dd/text()").get())
        # skip entries without a label instead of failing on None
        if name:
            data[name] = value
    # additional office addresses (address-0 is the main one from the script tag)
    addresses = []
    for element in selector.xpath("//div[contains(@id, 'address') and @id != 'address-0']"):
        address_lines = element.xpath(".//p/text()").getall()
        address = ", ".join(line.replace("\n", "").strip() for line in address_lines)
        addresses.append(address)
    affiliated_pages = []
    for element in selector.xpath("//section[@data-test-id='affiliated-pages']/div/div/ul/li"):
        affiliated_pages.append({
            "name": element.xpath(".//a/div/h3/text()").get().strip(),
            "industry": strip_text(element.xpath(".//a/div/p[1]/text()").get()),
            "address": strip_text(element.xpath(".//a/div/p[2]/text()").get()),
            "linkedinUrl": element.xpath(".//a/@href").get().split("?")[0]
        })
    similar_pages = []
    for element in selector.xpath("//section[@data-test-id='similar-pages']/div/div/ul/li"):
        similar_pages.append({
            "name": element.xpath(".//a/div/h3/text()").get().strip(),
            "industry": strip_text(element.xpath(".//a/div/p[1]/text()").get()),
            "address": strip_text(element.xpath(".//a/div/p[2]/text()").get()),
            "linkedinUrl": element.xpath(".//a/@href").get().split("?")[0]
        })
    # merge the script tag fields with the fields parsed from the HTML
    data = {**script_data, **data}
    data["addresses"] = addresses
    data["affiliatedPages"] = affiliated_pages
    data["similarPages"] = similar_pages
    return data
async def scrape_company(urls: List[str]) -> List[Dict]:
    """scrape public linkedin company pages"""
    to_scrape = [ScrapeConfig(url, **BASE_CONFIG) for url in urls]
    data = []
    # scrape the URLs concurrently
    async for response in SCRAPFLY.concurrent_scrape(to_scrape):
        data.append(parse_company(response))
    log.success(f"scraped {len(data)} companies from Linkedin")
    return data
Run the code
async def run():
    company_data = await scrape_company(
        urls=[
            "https://linkedin.com/company/microsoft"
        ]
    )
    # save the data to a JSON file
    with open("company.json", "w", encoding="utf-8") as file:
        json.dump(company_data, file, indent=2, ensure_ascii=False)

if __name__ == "__main__":
    asyncio.run(run())
In the above LinkedIn scraping code, we define two functions. Let's break them down:
parse_company(): To parse the company data from script tags while using JMESPath to refine it and parse other HTML elements using XPath selectors.
scrape_company(): To request the company page URLs while utilizing the parsing logic.
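To make the JMESPath step more concrete, here is a rough plain-Python equivalent of the same field selection. It is only a sketch: pick() and project_company() are hypothetical helper names, and the sample dictionary is made up rather than taken from a real company page:

```python
from typing import Any, Dict

# Output key -> dotted path inside the JSON-LD object (same mapping as the query above).
FIELD_PATHS: Dict[str, str] = {
    "name": "name",
    "url": "url",
    "mainAddress": "address",
    "description": "description",
    "numberOfEmployees": "numberOfEmployees.value",
    "logo": "logo",
}


def pick(data: Any, path: str) -> Any:
    """Follow a dotted path through nested dicts; return None on any miss."""
    current = data
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def project_company(script_data: Dict) -> Dict:
    """Build the reduced company record from the raw JSON-LD dict."""
    return {field: pick(script_data, path) for field, path in FIELD_PATHS.items()}


# Made-up input for illustration only; not real LinkedIn data.
sample = {
    "name": "Example Corp",
    "url": "https://www.linkedin.com/company/example-corp",
    "address": {"addressCountry": "US"},
    "numberOfEmployees": {"value": 1234},
}
company = project_company(sample)
print(company["numberOfEmployees"])
print(company["logo"])
```

Missing fields come back as None rather than raising, which keeps the record shape stable when a company page omits a field.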
Here's a sample output of the extracted company information:
The above data represents the "about" section of the company pages. Next, we'll scrape data from the dedicated section for company jobs.
Scraping Company Jobs
The company jobs are found in a dedicated section of the main page, under the /jobs path of the primary LinkedIn URL for a company:
The page data here is loaded dynamically on mouse scroll. We could use a real headless browser to emulate the scroll action, though this approach isn't practical, as the job pages can include thousands of results!
Instead, we'll utilize a more efficient data extraction approach: scraping hidden APIs!
When the user scrolls down the page in the browser, the website sends an API request to retrieve the next page of data as HTML. We'll replicate this mechanism in our scraper.
First, to find this hidden API, we can use our web browser:
Open the browser developer tools.
Select the network tab and filter by Fetch/XHR requests.
Scroll down the page to activate the API.
The API requests should be captured as the page is being scrolled:
We can see that the results are paginated using the start URL query parameter:
To scrape LinkedIn company jobs, we'll request the first job page to get the total number of results available and then use the above API endpoint for pagination:
import json
import asyncio
from typing import Dict, List
from loguru import logger as log
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

SCRAPFLY = ScrapflyClient(key="Your ScrapFly API key")

BASE_CONFIG = {
    "asp": True,
    "country": "US",
    "headers": {
        "Accept-Language": "en-US,en;q=0.5"
    }
}

def strip_text(text):
    """remove extra spaces while handling None values"""
    return text.strip() if text is not None else text
def parse_jobs(response: ScrapeApiResponse) -> Dict:
    """parse job data from Linkedin company pages"""
    selector = response.selector
    # the job counter looks like "1,000+"; it is only present on the first page
    total_results = selector.xpath("//span[contains(@class, 'job-count')]/text()").get()
    total_results = int(total_results.replace(",", "").replace("+", "")) if total_results else None
    data = []
    for element in selector.xpath("//section[contains(@class, 'results-list')]/ul/li"):
        data.append({
            "title": element.xpath(".//div/a/span/text()").get().strip(),
            "company": element.xpath(".//div/div[contains(@class, 'info')]/h4/a/text()").get().strip(),
            "address": element.xpath(".//div/div[contains(@class, 'info')]/div/span/text()").get().strip(),
            "timeAdded": element.xpath(".//div/div[contains(@class, 'info')]/div/time/@datetime").get(),
            "jobUrl": element.xpath(".//div/a/@href").get().split("?")[0],
            "companyUrl": element.xpath(".//div/div[contains(@class, 'info')]/h4/a/@href").get().split("?")[0],
            "salary": strip_text(element.xpath(".//span[contains(@class, 'salary')]/text()").get())
        })
    return {"data": data, "total_results": total_results}
async def scrape_jobs(url: str, max_pages: int = None) -> List[Dict]:
    """scrape Linkedin company job pages"""
    first_page = await SCRAPFLY.async_scrape(ScrapeConfig(url, **BASE_CONFIG))
    # parse the first page once and reuse the result
    first_page_data = parse_jobs(first_page)
    data = first_page_data["data"]
    # fall back to the first page size if the job counter is missing
    total_results = first_page_data["total_results"] or len(data)
    # cap the number of results to scrape, each page contains 25 results
    if max_pages and max_pages * 25 < total_results:
        total_results = max_pages * 25
    # scrape the remaining pages using the API
    search_keyword = url.split("jobs/")[-1]
    jobs_api_url = "https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/" + search_keyword
    # use "?" for the first query parameter and "&" for any following one
    separator = "&" if "?" in jobs_api_url else "?"
    to_scrape = [
        ScrapeConfig(jobs_api_url + f"{separator}start={index}", **BASE_CONFIG)
        # the first page covers results 0-24, so the next one starts at offset 25
        for index in range(25, total_results, 25)
    ]
    log.info(f"scraped the first job page, {len(to_scrape)} more pages")
    async for response in SCRAPFLY.concurrent_scrape(to_scrape):
        page_data = parse_jobs(response)["data"]
        data.extend(page_data)
    log.success(f"scraped {len(data)} jobs from Linkedin company job pages")
    return data
Run the code
async def run():
    job_search_data = await scrape_jobs(
        url="https://www.linkedin.com/jobs/microsoft-jobs-worldwide",
        max_pages=3
    )
    # save the data to a JSON file
    with open("company_jobs.json", "w", encoding="utf-8") as file:
        json.dump(job_search_data, file, indent=2, ensure_ascii=False)

if __name__ == "__main__":
    asyncio.run(run())
Let's break down the above LinkedIn scraper code:
parse_jobs(): For parsing the jobs data on the HTML using XPath selectors.
scrape_jobs(): For the main scraping tasks. It requests the company page URL and the jobs hidden API for pagination.
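The offset arithmetic behind this kind of pagination is easy to get wrong by one page, so it can help to check it offline. The following is a minimal sketch, assuming the first page covers offset 0 and every page holds 25 results; parse_total(), page_offsets(), and page_urls() are hypothetical helper names:

```python
from typing import List, Optional

PAGE_SIZE = 25  # results per page, as stated in this guide


def parse_total(text: Optional[str]) -> Optional[int]:
    """Turn a counter such as '1,000+' into 1000; a missing counter stays None."""
    if not text:
        return None
    return int(text.strip().replace(",", "").replace("+", ""))


def page_offsets(total_results: int, max_pages: Optional[int] = None) -> List[int]:
    """Offsets of the pages after the first one (offset 0 is already scraped)."""
    if max_pages:
        total_results = min(total_results, max_pages * PAGE_SIZE)
    return list(range(PAGE_SIZE, total_results, PAGE_SIZE))


def page_urls(api_url: str, total_results: int, max_pages: Optional[int] = None) -> List[str]:
    """Append the start parameter with the right separator for each offset."""
    separator = "&" if "?" in api_url else "?"
    return [
        f"{api_url}{separator}start={offset}"
        for offset in page_offsets(total_results, max_pages)
    ]


API_URL = "https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/microsoft-jobs-worldwide"
total = parse_total("1,000+")
for url in page_urls(API_URL, total, max_pages=3):
    print(url)
```

With max_pages=3, only offsets 25 and 50 are requested, since the first of the three pages has already been fetched.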
Here's an example output of the above LinkedIn data extracted:
Next, as we have covered the parsing logic for job listing pages, let's apply it to another section of LinkedIn - job search pages.
How to Scrape LinkedIn Job Search Pages?
LinkedIn has a robust job search system that includes millions of job listings in different industries across the globe. The job listings on these search pages have the same HTML structure as the ones listed on the company profile page. Hence, we'll utilize almost the same scraping logic as in the previous section.
To define the URL for job search pages on LinkedIn, we have to add search keywords and location parameters, like the following:
The above URL uses basic search filters. However, it accepts further parameters to narrow down the search, such as date, experience level, or city.
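As an illustration, the search URL can be assembled with the standard library. This is a minimal sketch: build_search_url() is a hypothetical helper, and any extra filters are passed through as-is because their exact parameter names are not covered in this guide:

```python
from urllib.parse import urlencode, quote_plus

SEARCH_URL = "https://www.linkedin.com/jobs/search?"


def build_search_url(keyword: str, location: str, **extra_params: str) -> str:
    """Build a job search URL; urlencode() escapes every value once."""
    params = {"keywords": keyword, "location": location, **extra_params}
    return SEARCH_URL + urlencode(params)


url = build_search_url("Python Developer", "United States")
print(url)

# Pitfall: quoting a value before urlencode() escapes it a second time,
# turning the "+" that stands for a space into a literal "%2B".
double_encoded = urlencode({"keywords": quote_plus("Python Developer")})
print(double_encoded)
```

Passing raw strings and letting urlencode() do the escaping avoids searching for a literal "Python+Developer" keyword.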
We'll request the first page URL to retrieve the total number of results and paginate the remaining pages using the jobs hidden API:
import json
import asyncio
from typing import Dict, List
from loguru import logger as log
from urllib.parse import urlencode, quote_plus
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

SCRAPFLY = ScrapflyClient(key="Your ScrapFly API key")

BASE_CONFIG = {
    "asp": True,
    "country": "US",
    "headers": {
        "Accept-Language": "en-US,en;q=0.5"
    }
}

def strip_text(text):
    """remove extra spaces while handling None values"""
    return text.strip() if text is not None else text
def parse_job_search(response: ScrapeApiResponse) -> Dict:
    """parse job data from job search pages"""
    selector = response.selector
    # the job counter looks like "1,000+"; it is only present on the first page
    total_results = selector.xpath("//span[contains(@class, 'job-count')]/text()").get()
    total_results = int(total_results.replace(",", "").replace("+", "")) if total_results else None
    data = []
    for element in selector.xpath("//section[contains(@class, 'results-list')]/ul/li"):
        data.append({
            "title": element.xpath(".//div/a/span/text()").get().strip(),
            "company": element.xpath(".//div/div[contains(@class, 'info')]/h4/a/text()").get().strip(),
            "address": element.xpath(".//div/div[contains(@class, 'info')]/div/span/text()").get().strip(),
            "timeAdded": element.xpath(".//div/div[contains(@class, 'info')]/div/time/@datetime").get(),
            "jobUrl": element.xpath(".//div/a/@href").get().split("?")[0],
            "companyUrl": element.xpath(".//div/div[contains(@class, 'info')]/h4/a/@href").get().split("?")[0],
            "salary": strip_text(element.xpath(".//span[contains(@class, 'salary')]/text()").get())
        })
    return {"data": data, "total_results": total_results}
async def scrape_job_search(keyword: str, location: str, max_pages: int = None) -> List[Dict]:
    """scrape Linkedin job search"""

    def form_urls_params(keyword, location):
        """form the job search URL params"""
        params = {
            # urlencode() already escapes the values, so pass them unquoted;
            # quoting them first would re-encode the "+" of each space as "%2B"
            "keywords": keyword,
            "location": location,
        }
        return urlencode(params)

    first_page_url = "https://www.linkedin.com/jobs/search?" + form_urls_params(keyword, location)
    first_page = await SCRAPFLY.async_scrape(ScrapeConfig(first_page_url, **BASE_CONFIG))
    # parse the first page once and reuse the result
    first_page_data = parse_job_search(first_page)
    data = first_page_data["data"]
    # fall back to the first page size if the job counter is missing
    total_results = first_page_data["total_results"] or len(data)
    # cap the number of results to scrape, each page contains 25 results
    if max_pages and max_pages * 25 < total_results:
        total_results = max_pages * 25
    # scrape the remaining pages concurrently
    other_pages_url = "https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?"
    to_scrape = [
        ScrapeConfig(other_pages_url + form_urls_params(keyword, location) + f"&start={index}", **BASE_CONFIG)
        # the first page covers results 0-24, so the next one starts at offset 25
        for index in range(25, total_results, 25)
    ]
    log.info(f"scraped the first job page, {len(to_scrape)} more pages")
    async for response in SCRAPFLY.concurrent_scrape(to_scrape):
        page_data = parse_job_search(response)["data"]
        data.extend(page_data)
    log.success(f"scraped {len(data)} jobs from Linkedin job search")
    return data
Run the code
async def run():
    job_search_data = await scrape_job_search(
        keyword="Python Developer",
        location="United States",
        max_pages=3
    )
    # save the data to a JSON file
    with open("job_search.json", "w", encoding="utf-8") as file:
        json.dump(job_search_data, file, indent=2, ensure_ascii=False)

if __name__ == "__main__":
    asyncio.run(run())
Here, we start the scraping process by defining the job page URL using the search query and location. Then, we request and parse the pages the same way we did in the previous section.
Here's an example output of the above code for scraping LinkedIn job search:
We can successfully scrape the job listings. However, the data returned doesn't contain the full details of each job. Let's scrape them from their dedicated pages!
How to Scrape LinkedIn Job Pages?
To scrape LinkedIn job pages, we'll utilize the hidden web data approach once again.
To start, search for the selector //script[@type='application/ld+json'], and you will find results similar to the below:
If we take a closer look at the description field, we'll find the job description encoded in HTML. Therefore, we'll extract the script tag hidden data and parse the description field to get the full job details:
import json
import asyncio
from typing import Dict, List
from loguru import logger as log
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

SCRAPFLY = ScrapflyClient(key="Your ScrapFly API key")

BASE_CONFIG = {
    "asp": True,
    "country": "US",
    "headers": {
        "Accept-Language": "en-US,en;q=0.5"
    }
}
def parse_job_page(response: ScrapeApiResponse) -> Dict:
    """parse individual job data from Linkedin job pages"""
    selector = response.selector
    script_data = json.loads(selector.xpath("//script[@type='application/ld+json']/text()").get())
    # collect the description bullet points from the page HTML
    description = []
    for element in selector.xpath("//div[contains(@class, 'show-more')]/ul/li/text()").getall():
        text = element.replace("\n", "").strip()
        if len(text) != 0:
            description.append(text)
    script_data["jobDescription"] = description
    # remove the key with the encoded HTML, if present
    script_data.pop("description", None)
    return script_data
async def scrape_jobs(urls: List[str]) -> List[Dict]:
    """scrape Linkedin job pages"""
    to_scrape = [ScrapeConfig(url, **BASE_CONFIG) for url in urls]
    data = []
    # scrape the URLs concurrently
    async for response in SCRAPFLY.concurrent_scrape(to_scrape):
        data.append(parse_job_page(response))
    log.success(f"scraped {len(data)} jobs from Linkedin")
    return data
Run the code
async def run():
    job_data = await scrape_jobs(
        urls=[
            "https://in.linkedin.com/jobs/view/data-center-engineering-operations-engineer-hyd-infinity-dceo-at-amazon-web-services-aws-4017265505",
            "https://www.linkedin.com/jobs/view/content-strategist-google-cloud-content-strategy-and-experience-at-google-4015776107",
            "https://www.linkedin.com/jobs/view/sr-content-marketing-manager-brand-protection-brand-protection-at-amazon-4007942181"
        ]
    )
    # save the data to a JSON file
    with open("jobs.json", "w", encoding="utf-8") as file:
        json.dump(job_data, file, indent=2, ensure_ascii=False)

if __name__ == "__main__":
    asyncio.run(run())
Similar to our previous LinkedIn scraping logic, we add the job page URLs to a scraping list and request them concurrently. Then, we use the parse_job_page() function to parse the job data from the hidden script tag, including the HTML inside the description field.
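As an alternative to reading the list items from the page, the encoded description field itself can be decoded. The snippet below is a minimal standard-library sketch; it assumes the field holds entity-escaped HTML, and the sample string is made up for illustration:

```python
import html
from html.parser import HTMLParser
from typing import List


class TextChunks(HTMLParser):
    """Collect every non-empty piece of text found between HTML tags."""

    def __init__(self):
        super().__init__()
        self.chunks: List[str] = []

    def handle_data(self, data):
        text = data.replace("\n", " ").strip()
        if text:
            self.chunks.append(text)


def description_to_lines(encoded: str) -> List[str]:
    """Unescape the HTML entities, then strip the tags and keep the text."""
    parser = TextChunks()
    parser.feed(html.unescape(encoded))
    parser.close()
    return parser.chunks


# Made-up value for illustration; a real description field will be longer.
encoded_description = (
    "&lt;p&gt;About the role&lt;/p&gt;"
    "&lt;ul&gt;&lt;li&gt;Write Python&lt;/li&gt;&lt;li&gt;Review code&lt;/li&gt;&lt;/ul&gt;"
)
print(description_to_lines(encoded_description))
```

Unlike the XPath approach, this keeps paragraph text as well as list items, so it may suit descriptions that are not written as bullet points.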
Here's what the above LinkedIn extractor output looks like:
The job page scraping code can be extended with further LinkedIn crawling logic: the job URLs retrieved from the job search pages can be passed to it to scrape the full details of each listing.
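One way to connect the two scrapers is to collect the unique job URLs from the search results and hand them to the job page scraper. This is a minimal sketch: collect_job_urls() is a hypothetical helper, and the sample records only mimic the jobUrl key produced by the search parser:

```python
from typing import Dict, List, Optional


def collect_job_urls(search_results: List[Dict], limit: Optional[int] = None) -> List[str]:
    """Return unique job URLs in their original order, optionally capped."""
    seen = set()
    urls = []
    for job in search_results:
        if limit is not None and len(urls) >= limit:
            break
        url = job.get("jobUrl")
        # skip records without a URL and URLs that were already collected
        if not url or url in seen:
            continue
        seen.add(url)
        urls.append(url)
    return urls


# Made-up search results for illustration only.
search_results = [
    {"title": "Python Developer", "jobUrl": "https://www.linkedin.com/jobs/view/example-1"},
    {"title": "Python Developer", "jobUrl": "https://www.linkedin.com/jobs/view/example-1"},
    {"title": "Data Engineer", "jobUrl": "https://www.linkedin.com/jobs/view/example-2"},
    {"title": "Missing URL", "jobUrl": None},
]
job_urls = collect_job_urls(search_results)
print(job_urls)
```

The resulting list can then be passed as the urls argument of the job page scraper.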
With this last feature, our LinkedIn scrapers are complete. They can successfully scrape LinkedIn profile, company, and job data. However, attempts to increase the scraping rate will lead the website to detect and block the IP address. Hence, make sure to rotate high-quality residential proxies.
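For readers managing their own proxies, the simplest form of rotation is round-robin assignment. The snippet below is only a sketch of that idea: the proxy addresses are placeholders, and assign_proxies() is a hypothetical helper that does not send any requests:

```python
from itertools import cycle
from typing import Iterable, List, Tuple

# Placeholder proxy addresses; replace them with real residential proxies.
PROXIES = [
    "http://proxy-1.example:8000",
    "http://proxy-2.example:8000",
    "http://proxy-3.example:8000",
]


def assign_proxies(urls: Iterable[str], proxies: List[str]) -> List[Tuple[str, str]]:
    """Pair each URL with the next proxy, wrapping around when the pool ends."""
    if not proxies:
        raise ValueError("at least one proxy is required")
    pool = cycle(proxies)
    return [(url, next(pool)) for url in urls]


urls = [f"https://www.linkedin.com/jobs/view/example-{i}" for i in range(1, 5)]
for url, proxy in assign_proxies(urls, PROXIES):
    print(proxy, "->", url)
```

Each (url, proxy) pair can then be given to whichever HTTP client is in use, so that consecutive requests leave from different IP addresses.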
FAQ
To wrap up this guide on web scraping LinkedIn, let's have a look at a few frequently asked questions.
Is it legal to scrape LinkedIn data?
Yes, for public LinkedIn pages such as public people profiles, company pages, and job listings. Hence, it's perfectly legal to scrape data from LinkedIn as long as the scraper doesn't damage the LinkedIn website.
Are there public APIs for LinkedIn?
Yes, LinkedIn offers paid APIs for developers. That being said, scraping LinkedIn is straightforward, and you can use it to create your own scraper APIs.
Are there alternatives for web scraping LinkedIn?
Yes, other popular platforms for job data collection are Indeed, Glassdoor, and Zoominfo, which we have covered earlier. For more guides on scraping similar target websites, refer to our #scrapeguide blog tag.
In this guide, we explained how to scrape LinkedIn with Python. We went through a step-by-step guide on extracting different data types from LinkedIn:
Company and public profile pages.
Jobs and their search pages.
For this LinkedIn data extractor, we have used the ScrapFly client to send the requests and its built-in parsel selector to parse the HTML. We have also used some web scraping tricks, such as extracting hidden data from script tags and using hidden APIs.