SimilarWeb is a leading web analytics platform, acting as a directory for worldwide website traffic. Imagine the insights and SEO advantages that scraping SimilarWeb could unlock!
In this guide, we'll explain how to scrape SimilarWeb step by step. We'll scrape comprehensive domain traffic insights, website comparison data, sitemaps, and trending industry domains. Let's get started!
This tutorial covers popular web scraping techniques for education. Interacting with public servers requires diligence and respect; here's a good summary of what not to do:
Do not scrape at rates that could damage the website.
Do not scrape data that's not available publicly.
Do not store PII of EU citizens who are protected by GDPR.
Do not repurpose entire public datasets, which can be illegal in some countries.
Scrapfly does not offer legal advice, but these are good general rules to follow in web scraping; for more, you should consult a lawyer.
Why Scrape SimilarWeb?
Web scraping SimilarWeb provides us with detailed insights into websites' traffic, which can be valuable across several use cases.
Competitor Analysis
One of the key features of web analytics is analyzing industry peers and benchmarking against their traffic. Scraping SimilarWeb enables such data retrieval, allowing businesses to fine-tune their strategies against their competitors and gain a competitive edge.
SEO and Keyword Analysis
Search Engine Optimization (SEO) is crucial for driving traffic to domains. SimilarWeb data extraction provides comprehensive insights into SEO keywords and search engine rankings, allowing for a better online presence and visibility.
Data-Driven Decision Making
Search trends are competitive and fast-changing. Therefore, scraping SimilarWeb for data-based insights is crucial for supporting decision-making and defining strategies.
loguru: An optional package for monitoring our code through colored terminal outputs.
Since asyncio comes pre-installed in Python, we'll only have to install the other packages using the following pip command:
pip install httpx parsel jmespath loguru
How to Discover SimilarWeb Pages?
Crawling sitemaps is a great way to discover and navigate pages on a website. Since they direct search engines for organized indexing, we can use them for scraping, too!
Each of the above sitemap indexes represents a group of related sitemaps. Let's explore the latest sitemap, /sitemaps/sitemap_index.xml.gz. It's a gzip-compressed file to save bandwidth, which looks like this after extracting:
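In code, fetching and decompressing such a sitemap boils down to a gzip-decode followed by extracting the `<loc>` entries. Here's a minimal standard-library sketch; the sample XML below is illustrative, not real SimilarWeb data:

```python
import gzip
import re
from typing import List

def parse_gz_sitemap(payload: bytes) -> List[str]:
    """decompress a gzipped sitemap and extract all <loc> URLs"""
    xml = gzip.decompress(payload).decode("utf-8")
    return re.findall(r"<loc>(.*?)</loc>", xml)

# illustrative sitemap content (not real SimilarWeb data)
sample_xml = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.similarweb.com/website/google.com/</loc></url>
  <url><loc>https://www.similarweb.com/top-websites/finance/investing/</loc></url>
</urlset>"""

urls = parse_gz_sitemap(gzip.compress(sample_xml))
print(urls)
```

In a real scraper, the `payload` bytes would come from an HTTP response body for the sitemap URL.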
Search for the XPath selector: //script[@id='dataset-json-ld'].
After following the above steps, you will find the below data:
The above data is the same as on the web page, but before it gets rendered into the HTML. To scrape it, we'll select its associated script tag and then parse it:
Python
ScrapFly
import asyncio
import json
from typing import List, Dict
from httpx import AsyncClient, Response
from parsel import Selector
from loguru import logger as log
# initialize an async httpx client
client = AsyncClient(
# enable http2
http2=True,
# add basic browser like headers to prevent getting blocked
headers={
"Accept-Language": "en-US,en;q=0.9",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
"Accept-Encoding": "gzip, deflate, br"
},
)
def parse_trending_data(response: Response) -> Dict:
"""parse hidden trending JSON data from script tags"""
selector = Selector(response.text)
json_data = json.loads(selector.xpath("//script[@id='dataset-json-ld']/text()").get())["mainEntity"]
data = {}
data["name"] = json_data["name"]
data["url"] = str(response.url)
data["list"] = json_data["itemListElement"]
return data
async def scrape_trendings(urls: List[str]) -> List[Dict]:
"""parse trending websites data"""
to_scrape = [client.get(url) for url in urls]
data = []
for response in asyncio.as_completed(to_scrape):
response = await response
category_data = parse_trending_data(response)
data.append(category_data)
log.success(f"scraped {len(data)} trending categories from similarweb")
return data
import asyncio
import json
from typing import List, Dict
from loguru import logger as log
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
SCRAPFLY = ScrapflyClient(key="Your ScrapFly API key")
def parse_trending_data(response: ScrapeApiResponse) -> Dict:
"""parse hidden trending JSON data from script tags"""
selector = response.selector
json_data = json.loads(selector.xpath("//script[@id='dataset-json-ld']/text()").get())["mainEntity"]
data = {}
data["name"] = json_data["name"]
data["url"] = response.scrape_result["url"]
data["list"] = json_data["itemListElement"]
return data
async def scrape_trendings(urls: List[str]) -> List[Dict]:
"""parse trending websites data"""
to_scrape = [ScrapeConfig(url, asp=True, country="US") for url in urls]
data = []
async for response in SCRAPFLY.concurrent_scrape(to_scrape):
category_data = parse_trending_data(response)
data.append(category_data)
log.success(f"scraped {len(data)} trending categories from similarweb")
return data
Run the code
async def run():
data = await scrape_trendings(
urls=[
"https://www.similarweb.com/top-websites/computers-electronics-and-technology/programming-and-developer-software/",
"https://www.similarweb.com/top-websites/computers-electronics-and-technology/social-networks-and-online-communities/",
"https://www.similarweb.com/top-websites/finance/investing/"
]
)
# save the data to a JSON file
with open("trendings.json", "w", encoding="utf-8") as file:
json.dump(data, file, indent=2, ensure_ascii=False)
if __name__ == "__main__":
asyncio.run(run())
We use the previously defined httpx client and define additional functions:
parse_trending_data: For extracting the page JSON data from the hidden script tag, organizing the data by removing the JSON schema details and adding the URL.
scrape_trendings: For adding the page URLs to a list and requesting them concurrently.
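For downstream use, each scraped category can be flattened into one row per ranked website. Note that the item field names below (position, name, url) are assumptions about the JSON-LD itemListElement schema and may differ from the live page:

```python
from typing import Dict, List

def flatten_trending(category: Dict) -> List[Dict]:
    """flatten a scraped trending category into one row per ranked website"""
    rows = []
    for item in category["list"]:
        rows.append({
            "category": category["name"],
            "rank": item.get("position"),  # assumed JSON-LD field
            "website": item.get("name"),   # assumed JSON-LD field
            "url": item.get("url"),        # assumed JSON-LD field
        })
    return rows

# illustrative sample mirroring the parse_trending_data output shape
sample = {
    "name": "Top Programming Websites",
    "url": "https://www.similarweb.com/top-websites/...",
    "list": [{"position": 1, "name": "github.com", "url": "https://github.com"}],
}
print(flatten_trending(sample))
```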
Here is a sample output of the above SimilarWeb scraping code:
Next, let's explore the exciting part of our SimilarWeb scraper: website analytics! But before this, we must solve a SimilarWeb scraping blocking issue: validation challenge.
How to Avoid SimilarWeb Validation Challenge?
The SimilarWeb validation challenge is an anti-scraping mechanism that blocks HTTP requests from clients without JavaScript support. It's a JavaScript challenge that's automatically solved after about 5 seconds when requesting the domain for the first time:
Since we scrape SimilarWeb with an HTTP client that doesn't support JavaScript (httpx), requests sent to pages with this challenge will be blocked, as the challenge is never evaluated:
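A simple heuristic for detecting whether a response was served the challenge instead of the real page is to check for the hidden dataset markers we rely on later in this guide. This is a sketch, not an official detection method:

```python
def looks_blocked(html: str) -> bool:
    """heuristic: real SimilarWeb pages embed a hidden dataset script,
    while the challenge page does not"""
    return "window.__APP_DATA__" not in html and "dataset-json-ld" not in html

# a challenge page lacks both markers
print(looks_blocked("<html><body>checking your browser...</body></html>"))
# a real analytics page contains the hidden dataset
print(looks_blocked("<script>window.__APP_DATA__ = {}</script>"))
```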
To avoid the SimilarWeb validation challenge, we can use a headless browser to complete the challenge automatically using JavaScript. However, there's a trick we can use to bypass the challenge without JavaScript: cookies!
When the validation challenge is solved, the website cookies are updated with the challenge state, so it's not triggered again.
We can make use of cookies for web scraping to bypass the validation challenge automatically! To do this, we have to get the cookie value:
Go to any protected SimilarWeb page with the challenge.
Open the browser developer tools by pressing the F12 key.
Select the Application tab and choose cookies.
Copy the _abck cookie value, which is responsible for the challenge.
After following the above steps, you will find the SimilarWeb saved cookies:
Adding the _abck cookie to the requests will authorize them against the challenge:
from httpx import Client
client = Client(
# enable http2
http2=True,
# add basic browser like headers to prevent getting blocked
headers={
"Accept-Language": "en-US,en;q=0.9",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
"Accept-Encoding": "gzip, deflate, br",
"Cookie": "_abck=85E72C5791B36ED327B311F1DC7461A6~0~YAAQHPR6XIavCzyOAQAANeDuYgs83VF+IZs6MdB2WGsdsp5d89AWqe1hI+IskJ6V24OYvokUZSIn2Om9PATl5rqminoOTHQYZAMWO5Om8bcXlT3q2D9axmG+YQkS/77h/7O98vFFDrFX8Jns/upO+RbomHm7SxQ0IGk0yS80GGbWBQoSkxN+770ltBb9vdyT/7ShUBl3eKz/iLfyMSe4SyOxymE0pQL0pch0FJhvCiC2CD4asMBXGBNMQv2qvA553uO9bwz4Yr1X/7zLPOm6Vn2bz242O7rephGPmVud25Yc3Khs0oEqiQ4pgMvCy/NGIXTlVKN8anBc5QlnqGw7dq8kLqDrID9HqzbqusS9p5gkNUd4A2QJXDj80pjB9k4SWitpn1zRhsUNUYzrfvHMeGiDZhNuTYSq3sMcYg==~-1~-1~-1"
},
)
response = client.get("https://www.similarweb.com/website/google.com/")
print(response.text) # full HTML response
We can successfully bypass the validation challenge. However, the cookie value has to be rotated as it can expire. The rotation logic can also be automated with a headless browser for better efficiency.
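As a sketch of such rotation logic, the requests can be wrapped so that whenever a response looks blocked, a refresh routine (e.g. one backed by a headless browser) supplies a fresh cookie value. The refresh function below is a stub for illustration only:

```python
from typing import Callable

class CookieRotator:
    """keep the _abck cookie fresh by re-fetching it when requests get blocked"""

    def __init__(self, refresh: Callable[[], str]):
        # refresh would typically drive a headless browser and return a new _abck value
        self.refresh = refresh
        self.abck = refresh()

    def cookie_header(self) -> str:
        """build the Cookie header value for the HTTP client"""
        return f"_abck={self.abck}"

    def on_blocked(self) -> None:
        """call when a response looks like the validation challenge"""
        self.abck = self.refresh()

# stub refresh for illustration; real rotation would use a headless browser
values = iter(["COOKIE_V1", "COOKIE_V2"])
rotator = CookieRotator(refresh=lambda: next(values))
print(rotator.cookie_header())  # _abck=COOKIE_V1
rotator.on_blocked()
print(rotator.cookie_header())  # _abck=COOKIE_V2
```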
How to Scrape SimilarWeb Website Analytics?
The SimilarWeb website analytics pages are a powerful feature that includes comprehensive insights about a domain, including:
Ranking: The domain's category, country, and global rank.
Traffic: Engagement analysis including total visits, bounce rate, and visit duration.
Geography: The domain's traffic by top countries.
Demographics: The visitors' distribution by age and gender.
Interests: The visitors' interests by categories and topics.
Competitors: The domain's competitors and alternatives and their similarities.
Traffic sources: The domain's traffic by its source, such as search, direct, or emails.
Keywords: Top keywords visitors use to search the domain.
First, let's look at what the website analysis page on our target website looks like by targeting a specific domain: Google.com. Go to the domain's page on SimilarWeb, and you will get a page similar to this:
The above page data are challenging to scrape using selectors, as they are mostly located in charts and graphs. Therefore, we'll use the hidden web data approach.
Search through the HTML using the following XPath selector: //script[contains(text(), 'window.__APP_DATA__')]. The script tag found contains a comprehensive JSON dataset with the domain analysis data:
To scrape SimilarWeb traffic analytics pages, we'll select this script tag and parse the JSON data inside:
Python
ScrapFly
import re
import asyncio
import json
from typing import List, Dict
from httpx import AsyncClient, Response
from parsel import Selector
from loguru import logger as log
# initialize an async httpx client
client = AsyncClient(
# enable http2
http2=True,
# add basic browser like headers to prevent getting blocked
headers={
"Accept-Language": "en-US,en;q=0.9",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
"Accept-Encoding": "gzip, deflate, br",
"Cookie": "_abck=D2F915DBAC628EA7C01A23D7AA5DF495~0~YAAQLvR6XFLvH1uOAQAAJcI+ZgtcRlotheILrapRd0arqRZwbP71KUNMK6iefMI++unozW0X7uJgFea3Mf8UpSnjpJInm2rq0py0kfC+q1GLY+nKzeWBFDD7Td11X75fPFdC33UV8JHNmS+ET0pODvTs/lDzog84RKY65BBrMI5rpnImb+GIdpddmBYnw1ZMBOHdn7o1bBSQONMFqJXfIbXXEfhgkOO9c+DIRuiiiJ+y24ubNN0IhWu7XTrcJ6MrD4EPmeX6mFWUKoe/XLiLf1Hw71iP+e0+pUOCbQq1HXwV4uyYOeiawtCcsedRYDcyBM22ixz/6VYC8W5lSVPAve9dabqVQv6cqNBaaCM2unTt5Vy+xY3TCt1s8a0srhH6qdAFdCf9m7xRuRsi6OarPvDYjyp94oDlKc0SowI=~-1~-1~-1"
},
)
def parse_hidden_data(response: Response) -> Dict:
"""parse website insights from hidden script tags"""
selector = Selector(response.text)
script = selector.xpath("//script[contains(text(), 'window.__APP_DATA__')]/text()").get()
data = json.loads(re.findall(r"(\{.*?)(?=window\.__APP_META__)", script, re.DOTALL)[0])
return data
async def scrape_website(domains: List[str]) -> List[Dict]:
"""scrape website inights from website pages"""
# define a list of similarweb URLs for website pages
urls = [f"https://www.similarweb.com/website/{domain}/" for domain in domains]
to_scrape = [client.get(url) for url in urls]
data = []
for response in asyncio.as_completed(to_scrape):
response = await response
website_data = parse_hidden_data(response)["layout"]["data"]
data.append(website_data)
log.success(f"scraped {len(data)} website insights from similarweb website pages")
return data
import re
import asyncio
import json
from typing import List, Dict
from loguru import logger as log
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
SCRAPFLY = ScrapflyClient(key="Your ScrapFly API key")
def parse_hidden_data(response: ScrapeApiResponse) -> Dict:
"""parse website insights from hidden script tags"""
selector = response.selector
script = selector.xpath("//script[contains(text(), 'window.__APP_DATA__')]/text()").get()
data = json.loads(re.findall(r"(\{.*?)(?=window\.__APP_META__)", script, re.DOTALL)[0])
return data
async def scrape_website(domains: List[str]) -> List[Dict]:
"""scrape website inights from website pages"""
# define a list of similarweb URLs for website pages
urls = [f"https://www.similarweb.com/website/{domain}/" for domain in domains]
to_scrape = [ScrapeConfig(url, asp=True, country="US") for url in urls]
data = []
async for response in SCRAPFLY.concurrent_scrape(to_scrape):
website_data = parse_hidden_data(response)["layout"]["data"]
data.append(website_data)
log.success(f"scraped {len(data)} website insights from similarweb website pages")
return data
Run the code
async def run():
data = await scrape_website(
domains=["google.com", "twitter.com", "instagram.com"]
)
# save the data to a JSON file
with open("websites.json", "w", encoding="utf-8") as file:
json.dump(data, file, indent=2, ensure_ascii=False)
if __name__ == "__main__":
asyncio.run(run())
🤖 Update the "_abck" cookie before running the above code, as it may expire, to avoid validation challenge blocking, or use the ScrapFly code tab instead.
Let's break down the above SimilarWeb scraping code:
parse_hidden_data: For selecting the script tag that contains the domain analysis data and then parsing the JSON data, using regex to exclude the surrounding JavaScript.
scrape_website: For creating the domain analytics page URLs on SimilarWeb and then requesting them concurrently while utilizing the parsing logic.
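To make the regex step concrete, here's how the extraction behaves on a minimal inline script shaped like the page's embedded JavaScript. The sample payload and its keys (overview, globalRank) are illustrative:

```python
import json
import re

# illustrative inline script resembling what the page embeds
script = (
    'window.__APP_DATA__ = {"layout": {"data": {"overview": {"globalRank": 1}}}}\n'
    'window.__APP_META__ = {}'
)

# capture from the first "{" up to (but not including) window.__APP_META__
raw = re.findall(r"(\{.*?)(?=window\.__APP_META__)", script, re.DOTALL)[0]
data = json.loads(raw)
print(data["layout"]["data"]["overview"]["globalRank"])  # 1
```

The non-greedy `.*?` with the `window.__APP_META__` lookahead stops the match right before the next global assignment, leaving a clean JSON object for `json.loads`.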
To scrape the above data, we'll use the hidden data approach again using the previously used selector //script[contains(text(), 'window.__APP_DATA__')]. The data inside the script tag looks like the following:
Similar to our previous SimilarWeb scraping code, we'll select the script tag and parse the inside data:
Python
ScrapFly
import jmespath
import re
import asyncio
import json
from typing import List, Dict, Optional
from httpx import AsyncClient, Response
from parsel import Selector
from loguru import logger as log
# initialize an async httpx client
client = AsyncClient(
# enable http2
http2=True,
# add basic browser like headers to prevent getting blocked
headers={
"Accept-Language": "en-US,en;q=0.9",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
"Accept-Encoding": "gzip, deflate, br",
"Cookie": "_abck=D2F915DBAC628EA7C01A23D7AA5DF495~0~YAAQLvR6XFLvH1uOAQAAJcI+ZgtcRlotheILrapRd0arqRZwbP71KUNMK6iefMI++unozW0X7uJgFea3Mf8UpSnjpJInm2rq0py0kfC+q1GLY+nKzeWBFDD7Td11X75fPFdC33UV8JHNmS+ET0pODvTs/lDzog84RKY65BBrMI5rpnImb+GIdpddmBYnw1ZMBOHdn7o1bBSQONMFqJXfIbXXEfhgkOO9c+DIRuiiiJ+y24ubNN0IhWu7XTrcJ6MrD4EPmeX6mFWUKoe/XLiLf1Hw71iP+e0+pUOCbQq1HXwV4uyYOeiawtCcsedRYDcyBM22ixz/6VYC8W5lSVPAve9dabqVQv6cqNBaaCM2unTt5Vy+xY3TCt1s8a0srhH6qdAFdCf9m7xRuRsi6OarPvDYjyp94oDlKc0SowI=~-1~-1~-1"
},
)
def parse_hidden_data(response: Response) -> Dict:
"""parse website insights from hidden script tags"""
selector = Selector(response.text)
script = selector.xpath("//script[contains(text(), 'window.__APP_DATA__')]/text()").get()
data = json.loads(re.findall(r"(\{.*?)(?=window\.__APP_META__)", script, re.DOTALL)[0])
return data
def parse_website_compare(response: Response, first_domain: str, second_domain: str) -> Dict:
"""parse website comparings inights between two domains"""
def parse_domain_insights(data: Dict, second_domain: bool = False) -> Dict:
"""parse each website data and add it to each domain"""
data_key = data["layout"]["data"]
if second_domain:
data_key = data_key["compareCompetitor"] # the 2nd website compare key is nested
parsed_data = jmespath.search(
"""{
overview: overview,
traffic: traffic,
trafficSources: trafficSources,
ranking: ranking,
demographics: geography
}""",
data_key
)
return parsed_data
script_data = parse_hidden_data(response)
data = {}
data[first_domain] = parse_domain_insights(data=script_data)
data[second_domain] = parse_domain_insights(data=script_data, second_domain=True)
return data
async def scrape_website_compare(first_domain: str, second_domain: str) -> Dict:
"""parse website comparing data from similarweb comparing pages"""
url = f"https://www.similarweb.com/website/{first_domain}/vs/{second_domain}/"
response = await client.get(url)
data = parse_website_compare(response, first_domain, second_domain)
log.success(f"scraped comparing insights between {first_domain} and {second_domain}")
return data
import jmespath
import re
import asyncio
import json
from typing import List, Dict, Optional
from loguru import logger as log
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
SCRAPFLY = ScrapflyClient(key="Your ScrapFly API key")
def parse_hidden_data(response: ScrapeApiResponse) -> Dict:
"""parse website insights from hidden script tags"""
selector = response.selector
script = selector.xpath("//script[contains(text(), 'window.__APP_DATA__')]/text()").get()
data = json.loads(re.findall(r"(\{.*?)(?=window\.__APP_META__)", script, re.DOTALL)[0])
return data
def parse_website_compare(response: ScrapeApiResponse, first_domain: str, second_domain: str) -> Dict:
"""parse website comparings inights between two domains"""
def parse_domain_insights(data: Dict, second_domain: bool = False) -> Dict:
"""parse each website data and add it to each domain"""
data_key = data["layout"]["data"]
if second_domain:
data_key = data_key["compareCompetitor"] # the 2nd website compare key is nested
parsed_data = jmespath.search(
"""{
overview: overview,
traffic: traffic,
trafficSources: trafficSources,
ranking: ranking,
demographics: geography
}""",
data_key
)
return parsed_data
script_data = parse_hidden_data(response)
data = {}
data[first_domain] = parse_domain_insights(data=script_data)
data[second_domain] = parse_domain_insights(data=script_data, second_domain=True)
return data
async def scrape_website_compare(first_domain: str, second_domain: str) -> Dict:
"""parse website comparing data from similarweb comparing pages"""
url = f"https://www.similarweb.com/website/{first_domain}/vs/{second_domain}/"
response = await SCRAPFLY.async_scrape(ScrapeConfig(url, country="US", asp=True))
data = parse_website_compare(response, first_domain, second_domain)
log.success(f"scraped comparing insights between {first_domain} and {second_domain}")
return data
Run the code
async def run():
comparing_data = await scrape_website_compare(
first_domain="twitter.com",
second_domain="instagram.com"
)
# save the data to a JSON file
with open("websites_compare.json", "w", encoding="utf-8") as file:
json.dump(comparing_data, file, indent=2, ensure_ascii=False)
if __name__ == "__main__":
asyncio.run(run())
In the above code, we use the previously defined parse_hidden_data to parse data from the page and define two additional functions:
parse_website_compare: For organizing the JSON data and parsing it with JMESPath to exclude unnecessary details.
scrape_website_compare: For building the SimilarWeb comparison URL and requesting it, while utilizing the parsing logic.
For further details on JMESPath, refer to our dedicated guide.
With this last feature, our SimilarWeb scraper is complete. It can scrape tons of website traffic data from sitemap, trending, domain, and comparison pages. However, our scraper will soon encounter a major challenge: scraper blocking!
Bypass SimilarWeb Web Scraping Blocking
We can successfully scrape SimilarWeb for a limited number of requests. However, attempting to scale our scraper will lead SimilarWeb to block the IP address or require us to log in:
This is where Scrapfly can lend a hand for scraping SimilarWeb without getting blocked.
For example, with ScrapFly all we have to do is enable the asp parameter and select a proxy country:
# standard web scraping code
import httpx
from parsel import Selector
response = httpx.get("some similarweb.com URL")
selector = Selector(response.text)
# in ScrapFly becomes this 👇
from scrapfly import ScrapeConfig, ScrapflyClient
# replaces your HTTP client (httpx in this case)
scrapfly = ScrapflyClient(key="Your ScrapFly API key")
response = scrapfly.scrape(ScrapeConfig(
url="website URL",
asp=True, # enable the anti scraping protection to bypass blocking
proxy_pool="public_residential_pool", # select the residential proxy pool
country="US", # set the proxy location to a specfic country
render_js=True # enable rendering JavaScript (like headless browsers) to scrape dynamic content if needed
))
# use the built in Parsel selector
selector = response.selector
# access the HTML content
html = response.scrape_result['content']
Yes, you can scrape Google Trends and SEO keywords for similar web traffic insights. For more website scraping tutorials, refer to our #scrapeguide blog tag.
In this guide, we explained how to scrape SimilarWeb with Python. We started by exploring and navigating the website by scraping sitemaps. Then, we went through a step-by-step guide on scraping several SimilarWeb pages for traffic, rankings, trending, and comparing data.
We have also explored how to scrape SimilarWeb without getting blocked using ScrapFly and how to avoid its validation challenges.