The amount of data available on the internet is constantly increasing, and with it the demand for web scraping to retrieve this data. In response, websites have developed anti-scraping mechanisms that try to keep web scrapers away from this public data.
In this article, we'll explore one of these mechanisms - honeypots. We'll explain what honeypots are, how they work and how to avoid honeypots while scraping. Let's get started!
Honeypots are sweet vulnerabilities in the system that are left there intentionally to trick cyber attackers into attacking them.
In essence, the idea behind honeypots is to lure cyber attackers and bots into a trap. Once the attacker/bot is trapped the honeypots can gather identifiable information, such as IP addresses, location, used techniques and behavior patterns. Then, the bots can be either blocked or further monitored for more data.
Although honeypots were initially aimed at cyber attacks, they can't distinguish between cyber attacks and web scraping requests, so web scrapers get caught as well. Moreover, certain types of honeypots can identify web scrapers without requiring them to fall deep into their traps. Let's take a closer look in the following section.
There are two main kinds of honeypots and each one serves a unique purpose:
Research Honeypots - traps that are placed far from the production network, usually before the actual website's firewalls. These honeypots are typically easier to get through and security professionals use them to analyze attackers' behavior and techniques. Rarely encountered in web scraping.
Production Honeypots - traps that are placed within the same production network. The goal of these honeypots is to detect and block harmful bots. Most commonly encountered in web scraping.
Despite the difference between the main types of honeypots, they work almost in the same way - by tricking harmful bots into specific parts of the network.
For example, a common type of honeypot is an exposed SSH server on the website with minimum protection. Attackers often fall for this trap and connect to the server and then can be blocked or monitored.
In web scraping, honeypots are generally placed in areas that are not visible to humans but visible to bots. Scrapers are then fingerprinted, blocked, throttled or even served false data.
Honeypot types are diverse, each designed to mimic a specific attack type. However, they all aim for one thing: analyzing the tactics, techniques, and procedures (TTPs) used by attackers and harmful bots. So, let's explore the most popular categories of honeypots.
Database Honeypots
Honeypots that simulate database vulnerabilities such as SQL injections and authorization weaknesses. Besides monitoring and analyzing attackers, they are used to shift the actual database attacks into fake honeypot ones.
Client Honeypots
Client honeypots emulate normal client applications such as browsers and email clients. The main goal of client honeypots is to catch and analyze potential threats that could target your clients while accessing bad websites and servers.
Server Honeypots
Server honeypots are designed to mimic the services, protocols and networks of legitimate servers, while also imitating common server vulnerabilities.
Web Honeypots (most common in web scraping)
As the name suggests, web honeypots are related to web applications. This honeypot is the most commonly encountered while scraping. It emulates a web application with some weaknesses or traps, such as:
A popular trap that web scrapers encounter through web honeypots is fake page data. For example, a website detecting a scraper can serve different product details such as price or images, which leads to corrupting scraping datasets.
The easiest way to confirm this type of trap is by scraping the target web page through two distinct web scrapers with different configurations, such as the IP address, and comparing the results.
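The comparison described above can be sketched as a small field-by-field diff. This is only an illustration: the `diff_records` helper and the two sample records are hypothetical, and in practice each record would come from a separate scraper profile (different IP address, headers and so on).

```python
from typing import Any


def diff_records(first: dict[str, Any], second: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return the fields whose values differ between two scrapes of the same page."""
    mismatches = {}
    for field in sorted(first.keys() | second.keys()):
        a, b = first.get(field), second.get(field)
        if a != b:
            mismatches[field] = (a, b)
    return mismatches


# Hypothetical results for one product page, scraped through two different profiles
scrape_a = {"name": "Blue Mug", "price": "12.99", "image": "/img/mug.jpg"}
scrape_b = {"name": "Blue Mug", "price": "19.99", "image": "/img/mug.jpg"}

suspicious = diff_records(scrape_a, scrape_b)
print(suspicious)  # {'price': ('12.99', '19.99')}
```

An empty result suggests both profiles were served the same data. Any mismatch is only a hint that one of them may have been served fake data, since prices and images can also differ legitimately, for example by region.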
When it comes to web scraping, honeypots are placed in areas that are not visible to users but visible to bots. Let's take a look at some techniques that honeypots use to detect web scrapers and how to avoid them.
Honeypots can be placed as hidden links or page areas that are invisible to humans but visible to bots.
For example, websites can add links to the web page and hide them through JavaScript and CSS tricks, such as setting the CSS display property to none (display: none;) or giving the link the same color as the background. Here's an example:
<!-- hidden honeypot trap by hiding a link using CSS -->
<html>
  <head>
    <style>
      .hidden-link {
        display: none;
      }
    </style>
  </head>
  <body>
    <a href="/honeypot-trap" class="hidden-link">Hidden Honeypot Link</a>
  </body>
</html>
<!-- hidden honeypot trap by hiding a link by applying the same background color -->
<html>
  <head>
    <style>
      body {
        background-color: white;
      }
      .invisible-link {
        color: white;
        text-decoration: none;
      }
    </style>
  </head>
  <body>
    <a href="/honeypot-trap" class="invisible-link">Invisible Honeypot Link</a>
  </body>
</html>
Since humans only see the visible part of the HTML, they are unlikely to click on these invisible honeypots. However, web crawlers do not consider what's visible and what's not, and fall into these traps very easily.
To avoid this web scraping honeypot, it's best to:
Have well-defined crawling rules to avoid hidden links. Often these links can be identified by extra class names like d-none (from Bootstrap) or other HTML properties.
Verify details using headless browsers. When scraping with a headless browser, we can execute JavaScript to check whether an element is visible, for example:
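// Returns true when the element matched by `selector` exists and is not hidden by CSS
function isElementVisible(selector) {
  const element = document.querySelector(selector);
  // A missing element cannot be visible
  if (!element) {
    return false;
  }
  // getComputedStyle resolves rules from stylesheets as well as inline styles
  const style = window.getComputedStyle(element);
  if (style.display === 'none' || style.visibility === 'hidden' || style.opacity === '0') {
    return false;
  }
  return true;
}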
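For scrapers that don't run a browser, the crawling-rule approach can be sketched in Python using only the standard library. This is a minimal sketch under assumptions: the `HIDDEN_CLASSES` and `HIDDEN_STYLES` values are illustrative guesses that need to be adapted to each target website, and the check only looks at the link's own class and inline style, so it will miss links hidden through stylesheet rules or hidden parent elements.

```python
from html.parser import HTMLParser

# Assumed markers of hidden elements; adapt these to the target website
HIDDEN_CLASSES = {"d-none", "hidden", "hidden-link", "invisible"}
HIDDEN_STYLES = ("display:none", "visibility:hidden")


class VisibleLinkParser(HTMLParser):
    """Collect href values of <a> tags that are not obviously hidden."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if not href:
            return
        classes = set((attributes.get("class") or "").split())
        # Strip spaces so "display: none" and "display:none" match the same rule
        style = (attributes.get("style") or "").replace(" ", "").lower()
        if classes & HIDDEN_CLASSES or any(rule in style for rule in HIDDEN_STYLES):
            return  # likely a honeypot link, skip it
        self.links.append(href)


def extract_visible_links(html):
    parser = VisibleLinkParser()
    parser.feed(html)
    return parser.links


sample_html = """
<a href="/products">Products</a>
<a href="/honeypot-trap" class="d-none">Hidden Honeypot Link</a>
<a href="/another-trap" style="display: none">Hidden Honeypot Link</a>
"""
print(extract_visible_links(sample_html))  # ['/products']
```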
Website sitemaps and robots.txt files are perfect places for honeypots as these files are mostly read by robots.
When scraping links gathered from sitemaps, robots.txt and other directories that are rarely used by human users but visible to bots, it's best to validate the existence of these pages on the live website. Here are some tips on how this can be done:
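One possible way to do this validation, sketched here as an assumption rather than a definitive method, is to only trust sitemap URLs that are also linked from pages a human visitor can reach. The `normalize` and `split_by_presence` helpers, the example.com URLs and the sample link lists are all hypothetical.

```python
from urllib.parse import urljoin, urlsplit


def normalize(url, base):
    """Resolve a link against the base URL and drop the fragment and trailing slash."""
    parts = urlsplit(urljoin(base, url))
    path = parts.path.rstrip("/") or "/"
    query = f"?{parts.query}" if parts.query else ""
    return f"{parts.scheme}://{parts.netloc.lower()}{path}{query}"


def split_by_presence(sitemap_urls, page_links, base):
    """Split sitemap URLs into those linked from live pages and those that are not."""
    linked = {normalize(link, base) for link in page_links}
    confirmed, unconfirmed = [], []
    for url in sitemap_urls:
        (confirmed if normalize(url, base) in linked else unconfirmed).append(url)
    return confirmed, unconfirmed


base = "https://example.com/"
# URLs listed in the sitemap (hypothetical)
sitemap_urls = [
    "https://example.com/products/1",
    "https://example.com/secret-admin-offers",
]
# Links collected from visible parts of already scraped pages (hypothetical)
page_links = ["/products/1/", "/about", "https://example.com/products/2#reviews"]

confirmed, unconfirmed = split_by_presence(sitemap_urls, page_links, base)
print(confirmed)    # ['https://example.com/products/1']
print(unconfirmed)  # ['https://example.com/secret-admin-offers']
```

URLs that end up in the unconfirmed list are not necessarily honeypots, but they are worth skipping or inspecting manually before scraping.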
Modern websites are very complex and create many opportunities for honeypot attacks. That being said, the easiest way to handle honeypots is to resist their tracking by mixing up scraping profiles. Let's take a look at that next.
It can be difficult not to stumble on a honeypot when web scraping, especially when using broad crawling techniques. However, to reduce the risks caused by honeypots, we can use common scraping techniques and mix up the scraper fingerprint:
While it's important to mimic normal browser headers to avoid detection, there's some wiggle room in the overall header profile and we can change up some values in each request to avoid detection.
Most commonly, the User-Agent header is rotated, but other headers like Referer, Accept-Language and Accept-Encoding can also be rotated to avoid detection.
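Header rotation can be sketched as picking each value from a small pool on every request. The pools below are illustrative assumptions: the User-Agent strings follow real browser formats but should be replaced with current values copied from actual browsers, and `build_headers` is a hypothetical helper.

```python
import random

# Illustrative pools; replace with values taken from real, current browsers
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0",
]
ACCEPT_LANGUAGES = ["en-US,en;q=0.9", "en-GB,en;q=0.8", "en-US,en;q=0.7,de;q=0.3"]
ACCEPT_ENCODINGS = ["gzip, deflate, br", "gzip, deflate"]
REFERERS = ["https://www.google.com/", "https://www.bing.com/", "https://duckduckgo.com/"]


def build_headers(rng=random):
    """Build a request header set with each rotated value picked at random."""
    return {
        "User-Agent": rng.choice(USER_AGENTS),
        "Accept-Language": rng.choice(ACCEPT_LANGUAGES),
        "Accept-Encoding": rng.choice(ACCEPT_ENCODINGS),
        "Referer": rng.choice(REFERERS),
    }


# Generate a fresh header profile for each request
for _ in range(3):
    print(build_headers())
```

The resulting dictionary can be passed to most HTTP clients as request headers. Fully random combinations can produce header sets that no real browser would send, so a stricter version might rotate between complete, internally consistent profiles instead.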
The IP address is the prime identifier of each HTTP connection, and rotating proxies for scraping requests can be an easy way to resist web scraping honeypots.
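A simple round-robin rotation could look like the sketch below. The proxy addresses are placeholders rather than real servers, and `rotating_proxies` is a hypothetical helper. The yielded mapping uses the scheme-to-proxy format accepted by the `proxies` argument of the `requests` library.

```python
from itertools import cycle, islice

# Placeholder proxy addresses; replace with proxies you actually control or rent
PROXIES = [
    "http://proxy-1.example.com:8000",
    "http://proxy-2.example.com:8000",
    "http://proxy-3.example.com:8000",
]


def rotating_proxies(proxies):
    """Endlessly yield proxy settings, moving to the next proxy on every request."""
    for proxy in cycle(proxies):
        yield {"http": proxy, "https": proxy}


rotation = rotating_proxies(PROXIES)
# Each request takes the next proxy; the fourth one wraps around to the first
for settings in islice(rotation, 4):
    print(settings["http"])
```

Round-robin is only one strategy. Picking proxies at random or weighting them by past success rate are common alternatives.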
Honeypots can be surprisingly dangerous and difficult to avoid, but ScrapFly can be of assistance!
ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale. Each product is equipped with an automatic bypass for any anti-bot system and we achieve this by:
To avoid web scraping honeypots using ScrapFly, all we have to do is request the target website using the ScrapFly API:
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

scrapfly_client = ScrapflyClient("Your ScrapFly API key")

result: ScrapeApiResponse = scrapfly_client.scrape(ScrapeConfig(
    # some website URL you want to scrape
    url="website URL",
    # we can select a specific proxy country location
    country="US",
    # enable anti scraping protection bypass to avoid blocking and detection
    asp=True,
    # allows JavaScript rendering to render JavaScript loaded content, similar to headless browsers
    render_js=True
))
# use the built-in parsel selector
selector = result.selector
# access the page HTML
html = result.scrape_result['content']
Before we wrap up this intro to web scraping honeypots, let's take a look at some frequently asked questions:
Honeypots in web scraping use fake data to detect and block web scrapers. They are usually placed in areas not visible to humans but visible to bots. For example, a website can serve different product details such as price or images to scrapers.
Due to the complexity of the modern web, detecting honeypots can be very difficult. Validating that scraped URLs exist on the public web is the most reliable strategy.
The best way to avoid being blocked by honeypots is to use a big pool of web scraping profiles with different fingerprints and IP addresses while also using strict scraping rules to avoid honeypot traps.
Identified scrapers are sometimes served fake data to poison scraped datasets. The only way to avoid this is to verify each scraped dataset multiple times from different scraping profiles (IP addresses, geographical location and scrape configuration).
Honeypot traps are very common in IT security and are becoming more common in web scraping. These traps lure robots into specific parts of the network to be fingerprinted and analyzed. The resulting data can be used to throttle, block or even serve false data to scrapers.
We went through the essential steps to avoid web scraping honeypots. In a nutshell, these are: