G2.com is a leading website for software product and service data. It features thousands of product profiles, their reviews and alternative suggestions in various categories. However, due to the high protection level and the heavy use of CAPTCHA challenges, scraping G2.com can be challenging.
In this article, we'll explore web scraping G2. We'll explain how to scrape company data, reviews and alternatives from the website without getting blocked. We'll also use some web scraping tricks to make our scraper resilient, such as error handling and retrying logic. Let's get started!
This tutorial covers popular web scraping techniques for educational purposes. Interacting with public servers requires diligence and respect. Here's a good summary of what not to do:
Do not scrape at rates that could damage the website.
Do not scrape data that's not available publicly.
Do not store PII of EU citizens who are protected by GDPR.
Do not repurpose entire public datasets, which can be illegal in some countries.
Scrapfly does not offer legal advice, but these are good general rules to follow in web scraping. For more, you should consult a lawyer.
Why Scrape G2?
G2 provides comprehensive software product and service details as well as metadata, review and alternative information with detailed pros/cons comparisons. So, if you are looking to become a customer, scraping G2's company data can help in decision-making and product comparisons.
Scraping G2's reviews can also be a good resource for developing Machine Learning models. Companies can analyze these reviews through sentiment analysis to gain insights into specific companies or market niches.
Moreover, manually exploring tens of company review pages on the website can be tedious and time-consuming. Therefore, scraping G2 can save a lot of manual effort by quickly retrieving thousands of reviews.
Project Setup
To scrape G2.com, we'll use a few Python packages:
scrapfly-sdk for bypassing G2 anti-scraping challenges and blocking.
parsel for parsing the website's HTML using XPath and CSS selectors.
loguru for monitoring our G2 scraper.
asyncio for increasing our web scraping speed by running our code asynchronously.
Note that asyncio comes pre-installed with Python, so you only have to install the other packages using the following pip command:
pip install scrapfly-sdk parsel loguru
Avoid G2 Web Scraping Blocking
G2 heavily relies on Cloudflare challenges to prevent scraping. For example, let's send a simple request to the website using httpx. We'll use headers similar to real browsers to decrease the chance of getting detected and blocked:
from httpx import Client

# initialize an httpx client
client = Client(
    # enable HTTP/2 (requires the optional extra: pip install "httpx[http2]")
    http2=True,
    # add basic browser-like headers to reduce the chance of being blocked
    headers={
        "accept-language": "en-US,en;q=0.9",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "accept-encoding": "gzip, deflate, br",
    },
)
response = client.get("https://www.g2.com")
print(response)
# <Response [403 Forbidden]>
The above request gets detected and is asked to solve a CAPTCHA challenge.
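A scraper can check for this kind of block before it tries to parse a page. The sketch below is only illustrative: the looks_blocked() helper, the status codes and the marker strings are assumptions rather than anything G2 or Cloudflare documents, so adjust them to what the blocked page actually contains.

```python
# Hypothetical helper: guess whether a response was blocked.
# The status codes and marker strings are assumptions for illustration.
BLOCK_STATUS_CODES = {403, 429, 503}
BLOCK_MARKERS = ("captcha", "just a moment", "verify you are human")


def looks_blocked(status_code: int, body: str) -> bool:
    """Return True when the status code or page text suggests a block page."""
    if status_code in BLOCK_STATUS_CODES:
        return True
    text = body.lower()
    return any(marker in text for marker in BLOCK_MARKERS)


print(looks_blocked(403, ""))                        # True
print(looks_blocked(200, "<h1>Best Software</h1>"))  # False
```

With the httpx response from the previous snippet, this could be called as looks_blocked(response.status_code, response.text).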
To scrape G2 without getting blocked we don't actually need to solve the captcha. We're just not going to get it at all! For that, we'll use ScrapFly - a web scraping API that allows for scraping at scale by providing:
Cloud headless browsers - for scraping dynamically loaded content without running headless browsers yourself.
Residential proxies from 50+ countries - for avoiding IP address blocking and throttling, while also allowing for scraping from almost any geographic location.
By using Scrapfly's asp feature with the ScrapFly SDK, we can easily bypass G2 scraper blocking:
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

scrapfly = ScrapflyClient(key="Your ScrapFly API key")
api_response: ScrapeApiResponse = scrapfly.scrape(
    ScrapeConfig(
        # some G2 URL
        url="https://www.g2.com",
        # cloud headless browser similar to Playwright
        render_js=True,
        # bypass anti-scraping protection
        asp=True,
        # set the geographical location to a specific country
        country="US",
    )
)
# print the website's status code
print(api_response.upstream_status_code)
# 200
We'll use ScrapFly as our HTTP client for the rest of the article. So, to follow along, you need to get a ScrapFly API key.
G2 search pages are paginated through the page query parameter, which we'll use to crawl over them. But before we start crawling the search pages, let's write the selectors we'll use to capture the data from the HTML:
import math

from scrapfly import ScrapeApiResponse


def parse_search_page(response: ScrapeApiResponse):
    """parse company data from search pages"""
    # let selector errors propagate so the caller can log and retry the page
    selector = response.selector
    data = []
    total_results = selector.xpath("//div[@class='ml-half']/text()").get()
    # strip the characters wrapping the number before converting it
    total_results = int(total_results[1:-1]) if total_results else None
    _search_page_size = 20  # each search page contains 20 listings
    # get the number of total pages; assume a single page if the counter is missing
    total_pages = math.ceil(total_results / _search_page_size) if total_results else 1
    for result in selector.xpath("//div[contains(@class, 'paper mb-1')]"):
        name = result.xpath(".//div[contains(@class, 'product-name')]/a/div/text()").get()
        link = result.xpath(".//div[contains(@class, 'product-name')]/a/@href").get()
        image = result.xpath(".//a[contains(@class, 'listing__img')]/img/@data-deferred-image-src").get()
        rate = result.xpath(".//a[contains(@title, 'Reviews')]/div/span[2]/span[1]/text()").get()
        reviews_number = result.xpath(".//a[contains(@title, 'Reviews')]/div/span[1]/text()").get()
        description = result.xpath(".//span[contains(@class, 'paragraph')]/text()").get()
        categories = result.xpath(".//div[span[contains(text(),'Categories')]]/a/text()").getall()
        data.append(
            {
                "name": name,
                "link": link,
                "image": image,
                "rate": float(rate) if rate else None,
                "reviewsNumber": int(reviews_number.replace("(", "").replace(")", "")) if reviews_number else None,
                "description": description,
                "categories": categories,
            }
        )
    return {"search_data": data, "total_pages": total_pages}
Here, we define a parse_search_page() function, which parses the company data from the page HTML using XPath selectors.
We also extract the total search results to get the number of total pages, which we'll use later to crawl over search pagination.
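The page-count arithmetic can be tried in isolation. The following is a minimal sketch with a hypothetical count_pages() helper. It assumes the counter text wraps the number in brackets, such as "(371)", and that each page holds 20 listings as in the parser above. It also tolerates thousands separators and a missing counter.

```python
import math
import re
from typing import Optional


def count_pages(total_text: Optional[str], page_size: int = 20) -> int:
    """Turn a counter such as "(371)" into a number of result pages."""
    if not total_text:
        return 1  # no counter found: assume a single page
    digits = re.sub(r"[^\d]", "", total_text)  # drop brackets, commas, spaces
    if not digits:
        return 1
    return max(1, math.ceil(int(digits) / page_size))


print(count_pages("(371)"))    # 19
print(count_pages("(1,200)"))  # 60
print(count_pages(None))       # 1
```

Rounding up with math.ceil matters here: 371 results at 20 per page fill 18 pages and leave 11 listings on a 19th.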
Next, we'll request the search pages using ScrapFly and use the function we created to parse the data from the HTML:
import math
import json
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
from typing import Dict, List
from loguru import logger as log
SCRAPFLY = ScrapflyClient(key="Your ScrapFly API key")
def parse_search_page(response: ScrapeApiResponse):
    """parse company data from search pages"""
    # ... code from the previous section


async def scrape_search(url: str, max_scrape_pages: int = None) -> List[Dict]:
    """scrape company listings from search pages"""
    log.info(f"scraping search page {url}")
    # scrape the first search page
    # enable the asp and set the proxy location to US
    first_page = await SCRAPFLY.async_scrape(ScrapeConfig(url, asp=True, country="US"))
    # parse the listings and the total page count from the first page
    data = parse_search_page(first_page)
    search_data = data["search_data"]
    total_pages = data["total_pages"]  # used for pagination in the next step
    log.success(f"scraped {len(search_data)} company listings from G2 search pages with the URL {url}")
    return search_data
Here, we've added a scrape_search() function that sends a request to the first search page using the ScrapFly client. Then, we extract its data using the parse_search_page() function we defined earlier.
With this, our G2 scraper can scrape the first search page data. Next, let's extend our functions with pagination support so we can scrape all of the search pages:
import asyncio
import math
import json
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
from typing import Dict, List
from loguru import logger as log
SCRAPFLY = ScrapflyClient(key="Your ScrapFly API key")
def parse_search_page(response: ScrapeApiResponse):
    """parse company data from search pages"""
    # ... code from the previous section


async def scrape_search(url: str, max_scrape_pages: int = None) -> List[Dict]:
    """scrape company listings from search pages"""
    log.info(f"scraping search page {url}")
    # scrape the first search page
    # enable the asp and set the proxy location to US
    first_page = await SCRAPFLY.async_scrape(ScrapeConfig(url, asp=True, country="US"))
    data = parse_search_page(first_page)
    search_data = data["search_data"]
    total_pages = data["total_pages"]
    # get the total number of pages to scrape
    if max_scrape_pages and max_scrape_pages < total_pages:
        total_pages = max_scrape_pages
    # scrape the remaining search pages concurrently and remove the successful request URLs
    log.info(f"scraping search pagination, remaining ({total_pages - 1}) more pages")
    remaining_urls = [url + f"&page={page_number}" for page_number in range(2, total_pages + 1)]
    to_scrape = [ScrapeConfig(url, asp=True, country="US") for url in remaining_urls]
    async for response in SCRAPFLY.concurrent_scrape(to_scrape):
        try:
            data = parse_search_page(response)
            search_data.extend(data["search_data"])
            # remove the successful requests from the URLs list
            remaining_urls.remove(response.context["url"])
        except Exception as e:  # catch any exception
            log.error(f"Error encountered: {e}")
            continue
    # try again with the blocked requests if any using headless browsers and residential proxies
    if len(remaining_urls) != 0:
        log.debug(f"{len(remaining_urls)} requests are blocked, trying again with render_js enabled and residential proxies")
        try:
            failed_requests = [
                ScrapeConfig(url, asp=True, country="US", render_js=True, proxy_pool="public_residential_pool")
                for url in remaining_urls
            ]
            async for response in SCRAPFLY.concurrent_scrape(failed_requests):
                data = parse_search_page(response)
                search_data.extend(data["search_data"])
        except Exception as e:  # catch any exception
            log.error(f"Error encountered: {e}")
    log.success(f"scraped {len(search_data)} company listings from G2 search pages with the URL {url}")
    return search_data
Run the code
async def run():
    search_data = await scrape_search(
        url="https://www.g2.com/search?query=Infrastructure",
        max_scrape_pages=3,
    )
    # save the result to a JSON file
    with open("search.json", "w", encoding="utf-8") as file:
        json.dump(search_data, file, indent=2, ensure_ascii=False)


if __name__ == "__main__":
    asyncio.run(run())
In the above code, we add the remaining search page URLs to a scraping list and scrape them concurrently. Next, we remove the successful requests from the URL list and extend the first page data with the new ones. Then, we try the failed requests again with headless browsers and residential proxies.
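The same "scrape, track failures, retry with stronger settings" pattern can be sketched without any scraping library. The scrape_with_retry() and fake_fetch() names below are hypothetical, and the sketch uses plain asyncio.gather instead of ScrapFly's concurrent_scrape. The strong flag only stands in for enabling render_js and residential proxies on the second pass.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

# fetch(url, strong) is a stand-in for the real request; `strong` mimics
# enabling render_js and residential proxies on the retry.
Fetch = Callable[[str, bool], Awaitable[str]]


async def scrape_with_retry(urls: List[str], fetch: Fetch) -> Dict[str, str]:
    """Scrape all URLs once, then retry only the failures with stronger settings."""
    results: Dict[str, str] = {}
    remaining = list(urls)
    for strong in (False, True):  # first pass, then one retry pass
        if not remaining:
            break
        outcomes = await asyncio.gather(
            *(fetch(url, strong) for url in remaining), return_exceptions=True
        )
        failed = []
        for url, outcome in zip(remaining, outcomes):
            if isinstance(outcome, BaseException):
                failed.append(url)  # keep for the next pass
            else:
                results[url] = outcome
        remaining = failed
    return results


async def fake_fetch(url: str, strong: bool) -> str:
    """Simulated request: page 3 only succeeds with the stronger settings."""
    if url.endswith("page=3") and not strong:
        raise RuntimeError("blocked")
    return f"html of {url}"


urls = [f"https://www.g2.com/search?query=Infrastructure&page={n}" for n in (2, 3)]
pages = asyncio.run(scrape_with_retry(urls, fake_fetch))
print(sorted(pages))
```

Keeping the failed URLs in their own list means the more expensive settings are spent only on the requests that need them.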
Here is a sample of the search data we got:
Sample output
[
{
"name": "Oracle Cloud Infrastructure",
"link": "https://www.g2.com/products/oracle-oracle-cloud-infrastructure/reviews",
"image": "https://images.g2crowd.com/uploads/product/image/large_detail/large_detail_2753ea8c7953188158425365667be750/oracle-oracle-cloud-infrastructure.png",
"rate": 4.2,
"reviewsNumber": 371,
"description": null,
"categories": [
"Other Product Suites"
]
},
{
"name": "Nutanix Cloud Infrastructure (NCI)",
"link": "https://www.g2.com/products/nutanix-cloud-infrastructure-nci/reviews",
"image": "https://images.g2crowd.com/uploads/product/hd_favicon/16abd9cd27db884c0f93e6fe06630051/nutanix-cloud-infrastructure-nci.svg",
"rate": 4.6,
"reviewsNumber": 268,
"description": "Nutanix Cloud Infrastructure (NCI) combines feature-rich software-defined storage with built-in virtualization in a turnkey hyperconverged infrastructure solution that can run any application at any scale.",
"categories": [
"Hyperconverged Infrastructure (HCI) Solutions"
]
},
{
"name": "Data Center Virtualization and Cloud Infrastructure",
"link": "https://www.g2.com/products/data-center-virtualization-and-cloud-infrastructure/reviews",
"image": "https://images.g2crowd.com/uploads/product/image/large_detail/large_detail_b786517d9ff0c2ebbc03614aaa4b38e0/data-center-virtualization-and-cloud-infrastructure.jpg",
"rate": 4.6,
"reviewsNumber": 343,
"description": null,
"categories": [
"Other Product Suites"
]
},
{
"name": "Splunk Infrastructure Monitoring",
"link": "https://www.g2.com/products/splunk-infrastructure-monitoring/reviews",
"image": "https://images.g2crowd.com/uploads/product/hd_favicon/658dd11af574d5228f22a7f5855f32dc/splunk-infrastructure-monitoring.svg",
"rate": 4.3,
"reviewsNumber": 47,
"description": "Splunk Infrastructure Monitoring proactively finds and fixes problems before it impacts business performance through high resolution, full fidelity metrics monitoring that automatically detects and accurately alerts on critical patterns in seconds using real-time, AI-driven streaming analytics. Splunk Infrastructure Monitoring significantly shortens MTTD and MTTR by providing unmatched instant visibility across the infrastructure and application stack that builds on the value of an extensible Sp",
"categories": [
"Server Monitoring",
"Log Monitoring",
"Hardware Monitoring",
"Cloud Infrastructure Monitoring",
"Container Monitoring",
"Log Analysis",
"Network Monitoring",
"Enterprise Monitoring",
"Observability Solution Suites"
]
},
{
"name": "Oracle Cloud Infrastructure Compute",
"link": "https://www.g2.com/products/oracle-cloud-infrastructure-compute/reviews",
"image": "https://images.g2crowd.com/uploads/product/image/large_detail/large_detail_aca40b5464fcf97310dbc332c2296b5c/oracle-cloud-infrastructure-compute.png",
"rate": 4.1,
"reviewsNumber": 67,
"description": "Compute options range from VMs to GPUs to bare metal servers, and includes options for dense I/O workloads, high performance computing (HPC), and AMD EPYC processors.",
"categories": [
"Container Engine"
]
},
....
]
Our G2 scraper can successfully scrape company data from search pages. Let's scrape company reviews next.
How to Scrape G2 Company Reviews
In this section, we'll scrape company reviews from their pages. To see what the G2 review pages look like, go to any company or product page on the website, such as the DigitalOcean page, and find its reviews section.
Review pages also support pagination by adding the same page parameter, for example: https://www.g2.com/products/digitalocean/reviews?page=2
In the above code, we define a parse_review_page function, which extracts each page review data alongside the total number of reviews.
The next step is to scrape the first review page and then crawl over the remaining pages:
import asyncio
import math
import json
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
from typing import Dict, List
from loguru import logger as log
SCRAPFLY = ScrapflyClient(key="Your ScrapFly API key")
def parse_review_page(response: ScrapeApiResponse):
    """parse reviews data from G2 company pages"""
    # the rest of the function code


async def scrape_reviews(url: str, max_review_pages: int = None) -> List[Dict]:
    """scrape company reviews from G2 review pages"""
    log.info(f"scraping first review page from company URL {url}")
    first_page = await SCRAPFLY.async_scrape(ScrapeConfig(url, asp=True, country="US"))
    data = parse_review_page(first_page)
    reviews_data = data["reviews_data"]
    total_pages = data["total_pages"]
    # get the number of total review pages to scrape
    if max_review_pages and max_review_pages < total_pages:
        total_pages = max_review_pages
    # scrape the remaining review pages
    log.info(f"scraping reviews pagination, remaining ({total_pages - 1}) more pages")
    remaining_urls = [url + f"?page={page_number}" for page_number in range(2, total_pages + 1)]
    to_scrape = [ScrapeConfig(url, asp=True, country="US") for url in remaining_urls]
    async for response in SCRAPFLY.concurrent_scrape(to_scrape):
        try:
            data = parse_review_page(response)
            reviews_data.extend(data["reviews_data"])
            remaining_urls.remove(response.context["url"])
        except Exception as e:  # catch any exception
            log.error(f"Error encountered: {e}")
            continue
    # you can add retrying logic here, similar to what we did while scraping G2 search pages
    log.success(f"scraped {len(reviews_data)} company reviews from G2 review pages with the URL {url}")
    return reviews_data
Run the code
async def run():
    reviews_data = await scrape_reviews(
        url="https://www.g2.com/products/digitalocean/reviews",
        max_review_pages=3,
    )
    # save the result to a JSON file
    with open("reviews.json", "w", encoding="utf-8") as file:
        json.dump(reviews_data, file, indent=2, ensure_ascii=False)


if __name__ == "__main__":
    asyncio.run(run())
The above code is similar to the G2 search scraping logic we wrote earlier. We start by scraping the first review page and the total number of reviews. Next, we add the remaining review pages to a scraping list and scrape them concurrently. Finally, we save the result to a JSON file.
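The search scraper appends "&page=" and the review scraper appends "?page=", because only the search URL already carries a query string. As an optional alternative, the sketch below builds both kinds of URL with one helper from the standard library. The with_page() and page_urls() names are hypothetical and not part of the scraper above.

```python
from typing import List
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_page(url: str, page_number: int) -> str:
    """Return `url` with its `page` query parameter set to `page_number`."""
    parts = urlsplit(url)
    # keep existing parameters, dropping any previous `page` value
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != "page"]
    query.append(("page", str(page_number)))
    return urlunsplit(parts._replace(query=urlencode(query)))


def page_urls(url: str, total_pages: int) -> List[str]:
    """URLs for pages 2..total_pages; page 1 is the base URL itself."""
    return [with_page(url, n) for n in range(2, total_pages + 1)]


print(page_urls("https://www.g2.com/products/digitalocean/reviews", 3))
print(with_page("https://www.g2.com/search?query=Infrastructure", 2))
```

The first call yields the review URLs ending in ?page=2 and ?page=3, while the second keeps the existing query and appends &page=2.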
Here is a sample of the review data we got:
Sample output
[
{
"author": {
"authorName": "Marlon P.",
"authorProfile": "https://www.g2.com/users/d523e9ac-7e5b-453f-85f8-9ab05b27a556",
"authorPosition": "Desenvolvedor de front-end",
"authorCompanySize": []
},
"review": {
"reviewTags": [
"Validated Reviewer",
"Verified Current User",
"Review source: Seller invite",
"Incentivized Review"
],
"reviewData": "2023-11-14",
"reviewRate": 4.5,
"reviewTitle": "Good for beginners",
"reviewLikes": "It was very simple to start playing around and be able to test the projects I'm learning about for a cool price. I use it at work and it's easy to create new machines. Initial configuration is simple with the app Free tier is so short. is now than the company need money but, for me 90 days it was very fast to user the credits. \n\nThere is an app configuration file that I find very annoying to configure. It would be cool if there was a way to test that locally. I've had a lot of problems that doing several deployments in production to see my app's configuration is ok. leave personal projects public. And in the company, when I have to use it, I find it very simple to use the terminal via the platform ",
"reviewDilikes": "Free tier is so short. is now than the company need money but, for me 90 days it was very fast to user the credits. \n\nThere is an app configuration file that I find very annoying to configure. It would be cool if there was a way to test that locally. I've had a lot of problems that doing several deployments in production to see my app's configuration is ok. "
}
},
{
"author": {
"authorName": "Alberto G.",
"authorProfile": "https://www.g2.com/users/1e8b4d38-bc84-470d-a5a8-1be22fc4c95b",
"authorPosition": null,
"authorCompanySize": [
"Small-Business",
"(50 or fewer emp.)"
]
},
"review": {
"reviewTags": [
"Validated Reviewer",
"Verified Current User",
"Review source: Seller invite",
"Incentivized Review"
],
"reviewData": "2023-11-09",
"reviewRate": 5.0,
"reviewTitle": "Digital Ocean review",
"reviewLikes": "I like the agility, speed, and technology of DigitalOcean. I find the pricing of DigitalOcean's plans to be expensive. DigitalOcean is solving several problems for me. One of the key benefits is the simplification of cloud infrastructure management. It provides a user-friendly platform that makes it easier for me to deploy and manage servers and applications without the need for advanced technical expertise. Additionally, DigitalOcean's scalable infrastructure and fast provisioning times help me quickly adapt to changing resource demands, which is crucial for my projects. The availability of a wide range of pre-configured one-click applications and developer-friendly features also saves me time and effort. Overall, DigitalOcean's solutions have streamlined my workflow and reduced the complexities associated with cloud hosting, making it a more cost-effective and efficient choice for my needs. ",
"reviewDilikes": "I find the pricing of DigitalOcean's plans to be expensive. "
}
},
{
"author": {
"authorName": "Jailyn P.",
"authorProfile": "https://www.g2.com/users/ef6efaf5-57ad-4c59-95c4-0311a0c1833f",
"authorPosition": "A",
"authorCompanySize": [
"Small-Business",
"(50 or fewer emp.)"
]
},
"review": {
"reviewTags": [
"Validated Reviewer",
"Review source: Seller invite",
"Incentivized Review"
],
"reviewData": "2023-11-14",
"reviewRate": 4.5,
"reviewTitle": "Reliable Hosting Service",
"reviewLikes": "DigitalOcean is one of the best cloud providers out there. I like the fact that it provides an easy to use platform with a variety of features and services that can be used to create a powerful and reliable cloud infrastructure. It also offers an intuitive user interface that makes it easy to set up and manage cloud servers quickly and efficiently. Additionally, the pricing is very competitive with other providers and the customer support is excellent. The company also offers an extensive library of tutorials and documentation to help users get the most out of the platform. DigitalOcean also provides a wide range of services, including dedicated servers, managed databases, object storage, and more. Furthermore, the company offers a variety of tools and features such as monitoring, logging, backups, and auditing, which make it easier to keep your cloud infrastructure secure and reliable. All in all, DigitalOcean is a great choice for anyone who wants to create and maintain a reliable cloud infrastructure with the help of a powerful and easy to use platform. DigitalOcean is a great cloud hosting service, but there are a few drawbacks. One of the biggest issues with DigitalOcean is its limited customer support. While they do offer 24/7 support for most issues, the response time can be slow, and there are only a few ways to contact them. Web hosting ",
"reviewDilikes": "DigitalOcean is a great cloud hosting service, but there are a few drawbacks. One of the biggest issues with DigitalOcean is its limited customer support. While they do offer 24/7 support for most issues, the response time can be slow, and there are only a few ways to contact them. "
}
},
{
"author": {
"authorName": "Danny A.",
"authorProfile": "https://www.g2.com/users/bb327c9d-ee4c-46c5-a41b-ecd7c77bf01d",
"authorPosition": "LibreChat.ai",
"authorCompanySize": [
"Small-Business",
"(50 or fewer emp.)"
]
},
"review": {
"reviewTags": [
"Validated Reviewer",
"Review source: Seller invite",
"Incentivized Review"
],
"reviewData": "2023-11-09",
"reviewRate": 5.0,
"reviewTitle": "DDoS, Encrypted At Rest, Affordable, easy to setup",
"reviewLikes": "The aspects of DigitalOcean that stand out include its cost-effectiveness, transparent and straightforward billing, and competitive pricing. The user interface is very clear and intuitive, making it incredibly simple to create a cloud instance in seconds. Additionally, the speed performance of DigitalOcean is excellent, with an impressive record for load speed, which is a critical factor for web hosting. The platform also offers robust and reliable cloud servers with the convenience of one-click application installations, and the flexibility to easily scale the infrastructure with growing business needs is a significant advantage. One limitation is the reliance on CloudFlare, which restricts request times to a maximum of 100 seconds, leading to potential timeouts that cannot be altered. Additionally, the initial pricing for certain features, such as load balancers, being too high, though there has been a recent reduction in this cost. Despite these disadvantages, DigitalOcean's strengths in performance, pricing, and user experience seem to outweigh the drawbacks. Easy deployment ",
"reviewDilikes": "One limitation is the reliance on CloudFlare, which restricts request times to a maximum of 100 seconds, leading to potential timeouts that cannot be altered. Additionally, the initial pricing for certain features, such as load balancers, being too high, though there has been a recent reduction in this cost. Despite these disadvantages, DigitalOcean's strengths in performance, pricing, and user experience seem to outweigh the drawbacks. "
}
},
{
"author": {
"authorName": "Matt D.",
"authorProfile": "https://www.g2.com/users/mjd",
"authorPosition": "Computer Software",
"authorCompanySize": [
"Small-Business",
"(50 or fewer emp.)"
]
},
"review": {
"reviewTags": [
"Validated Reviewer",
"Verified Current User",
"Review source: Organic Review from User Profile"
],
"reviewData": "2023-08-15",
"reviewRate": 4.0,
"reviewTitle": "Just what you'd expect from a cloud provider!",
"reviewLikes": "DigitalOcean is an easy option at an affordable price, compared to most of it's competitors, Digital Ocean offers some really low prices and no surprise costs which make it really easy to trust. Their specifications are fair and it's easy to launch a Droplet or other server type anywhere in their range of regions.\n\nThe company offer a variety of different server types as well as other services which means you have a lot of versatility with the platform to run multiple different parts of your business on one cloud system. There are also free options for products such as Functions, which while limited are still generous and useful for specific use cases. The platform offered by DigitalOcean lacks the depth that certain other Platforms such as AWS contains. DigitalOcean lock port 25 on Droplets making it impossible to use for email without an external SMTP relay. DigitalOcean also do not offer an SMTP relay of their own, making it necessary to look elsewhere for such a tool if you plan to use the server for email hosting - as someone managing a web server this is disappointing, but not a dealbreaker. As a web development business we host our clients' websites to close the room for confusion that this entails for someone unexperienced in the industry. No small business owner out there selling their product needs a second 9 to 5 to manage their website. We were finding that any host we tried to use to host our client's sites was costing a lot of money and we needed to change this, so we opted to setup our own servers using DigitalOcean to make a cost-effective way to manage our clients' websites. DigitalOcean costs us a fraction of the cost to get all of our client's sites online as well as our own, meaning we don't have to charge as much, and we have full control of what's going on! ",
"reviewDilikes": "The platform offered by DigitalOcean lacks the depth that certain other Platforms such as AWS contains. DigitalOcean lock port 25 on Droplets making it impossible to use for email without an external SMTP relay. DigitalOcean also do not offer an SMTP relay of their own, making it necessary to look elsewhere for such a tool if you plan to use the server for email hosting - as someone managing a web server this is disappointing, but not a dealbreaker. "
}
}
....
]
Our G2 scraper can successfully scrape review pages. Next, we'll scrape company competitor pages for company alternative listings.
How to Scrape G2 Company Alternatives
G2 company competitor pages offer detailed company product comparisons. We'll focus on the company's alternative listings, but other comparison details can be scraped in the same way.
First, go to any company alternative page, like the DigitalOcean alternatives page, to see the company alternatives listing.
The company listings can be narrowed down according to a specific market niche, like small business, mid-market and enterprise alternatives. While the default URL represents the top 10 alternatives filter, we can apply other filters by adding the filter name at the end of the URL.
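To make the URL shape concrete, here is a small sketch that follows the pattern the scraper uses later in this section. The alternatives_url() helper is hypothetical, and dropping the trailing slash for the default listing is an assumption about how G2 resolves the URL.

```python
from typing import Optional

# Filter names taken from the article; None means the default "top 10" listing.
ALTERNATIVE_FILTERS = ("small-business", "mid-market", "enterprise")


def alternatives_url(product: str, market: Optional[str] = None) -> str:
    """Build the alternatives page URL for a product slug and optional filter."""
    base = f"https://www.g2.com/products/{product}/competitors/alternatives"
    if market is None:
        return base
    if market not in ALTERNATIVE_FILTERS:
        raise ValueError(f"unknown filter: {market!r}")
    return f"{base}/{market}"


print(alternatives_url("digitalocean"))
print(alternatives_url("digitalocean", "small-business"))
```

Validating the filter name up front turns a typo into an immediate error instead of a wasted request.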
We'll use this filter to control which alternatives our G2 scraper collects. But first, let's start with defining our parsing logic:
def parse_alternatives(response: ScrapeApiResponse):
    """parse G2 alternative pages"""
    # let selector errors propagate so the caller can log them
    selector = response.selector
    data = []
    for alt in selector.xpath("//div[@class='product-listing--competitor']"):
        sponsored = alt.xpath(".//strong[text()='Sponsored']").get()
        if sponsored:  # ignore sponsored cards
            continue
        name = alt.xpath(".//div[@itemprop='name']/text()").get()
        link = alt.xpath(".//h3/a[contains(@class, 'link')]/@href").get()
        ranking = alt.xpath(".//div[@class='product-listing__number']/text()").get()
        # raw counter text, cleaned below
        number_of_reviews = alt.xpath(".//div[div[contains(@class,'stars')]]/span/text()").get()
        rate = alt.xpath(".//div[div[contains(@class,'stars')]]/span/span/text()").get()
        description = alt.xpath(".//div[@data-max-height-expand-type]/p/text()").get()
        data.append(
            {
                "name": name,
                "link": link,
                "ranking": ranking,
                "numberOfReviews": int(number_of_reviews[1:-1].replace(",", "")) if number_of_reviews else None,
                "rate": float(rate.strip()) if rate else None,
                "description": description,
            }
        )
    return data
Here, we define a parse_alternatives() function. It iterates over the alternative cards in the HTML and extracts the company listings data from each card.
Next, we'll call this function after requesting the pages to extract the data:
import asyncio
import json
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
from typing import Dict, List, Literal
from loguru import logger as log

SCRAPFLY = ScrapflyClient(key="Your ScrapFly API key")


def parse_alternatives(response: ScrapeApiResponse):
    """parse G2 alternative pages"""
    # the rest of the function code


async def scrape_alternatives(
    product: str, alternatives: Literal["", "small-business", "mid-market", "enterprise"] = ""
) -> List[Dict]:
    """scrape product alternatives from G2 alternative pages"""
    # the default alternative is top 10, which takes no argument
    # define the alternative page URL using the product name
    url = f"https://www.g2.com/products/{product}/competitors/alternatives/{alternatives}"
    log.info(f"scraping alternative page {url}")
    data = []  # stays empty if the request or parsing fails
    try:
        response = await SCRAPFLY.async_scrape(ScrapeConfig(url, asp=True, country="US"))
        data = parse_alternatives(response)
    except Exception as e:  # catch any exception
        log.error(f"Error encountered: {e}")
        # you can add retrying logic here
    log.success(f"scraped {len(data)} company alternatives from G2 alternative pages")
    return data
Run the code
async def run():
    alternatives_data = await scrape_alternatives(
        product="digitalocean",
        # you can filter alternatives using the "alternatives" parameter
        # e.g. alternatives="small-business"
    )
    # save the result to a JSON file
    with open("alternatives.json", "w", encoding="utf-8") as file:
        json.dump(alternatives_data, file, indent=2, ensure_ascii=False)


if __name__ == "__main__":
    asyncio.run(run())
In the above code, we request the company alternative page and parse the data using the parse_alternatives() function we defined earlier.
Here is a sample output of the result we got:
Sample output
[
{
"name": "Hostwinds",
"link": "https://www.g2.com/products/hostwinds/reviews",
"ranking": "#1",
"numberOfReviews": 438,
"rate": 4.9,
"description": "Hostwinds offers website hosting for individuals and businesses of all sizes, with 24/7/365 support and nightly backups."
},
{
"name": "Vultr",
"link": "https://www.g2.com/products/vultr/reviews",
"ranking": "#2",
"numberOfReviews": 154,
"rate": 4.2,
"description": "Vultr offers a standardized highly reliable high performance cloud compute environment with 14 datacenters around the globe."
},
{
"name": "Amazon EC2",
"link": "https://www.g2.com/products/amazon-ec2/reviews",
"ranking": "#3",
"numberOfReviews": 1200,
"rate": 4.6,
"description": "AWS Elastic Compute Cloud (EC2) is a web service that provides resizable compute capacity in the cloud, making web-scale computing easier for developers."
},
{
"name": "AWS Lambda",
"link": "https://www.g2.com/products/aws-lambda/reviews",
"ranking": "#4",
"numberOfReviews": 1009,
"rate": 4.6,
"description": "Run code without thinking about servers. Pay for only the compute time you consume."
},
{
"name": "Amazon Relational Database Service (RDS)",
"link": "https://www.g2.com/products/amazon-relational-database-service-rds/reviews",
"ranking": "#5",
"numberOfReviews": 763,
"rate": 4.5,
"description": "Amazon Relational Database Service (RDS) is a web service that makes it easy to set up, operate, and scale a relational DB in the cloud: Amazon Aurora, PostgreSQL, MySQL, MariaDB, Oracle, and Microsoft SQL Server."
},
....
]
With this last piece, our G2 scraper is complete! It can scrape company data from search, competitor and review pages on G2.com. There are other pages on the website that are worth scraping, such as detailed company comparison pages. These pages can be scraped by following the steps in our previous G2 scraping code snippets.
FAQ
To wrap up this guide, let's have a look at some frequently asked questions about web scraping G2.
Is it legal to scrape G2?
Yes, all the data on G2.com is publicly available and it's legal to scrape as long as the website is not harmed in the process. However, commercializing personal data such as reviewers' emails may violate GDPR in EU countries. Refer to our previous article on web scraping legality for more details.
Is there a public API for G2.com?
At the time of writing, there is no public G2 API we can use for web scraping. However, G2's HTML is pretty descriptive, which makes HTML parsing a viable way to scrape it.
Are there alternatives for G2?
Yes, Trustpilot.com is another popular website for company reviews. Refer to our #scrapeguide blog tag for its scraping guide and for other related web scraping guides.
G2.com is a global website for company reviews and comparisons, known for its high protection level.
We explained how to avoid G2 scraping blocking using ScrapFly. We also went through a step-by-step guide on how to scrape G2 using Python. We have used HTML parsing to scrape search, review and competitor pages on G2.