What are Honeypots and How to Avoid Them in Web Scraping

The amount of data available on the internet is constantly increasing, and with it the demand for web scraping to retrieve that data. In response to this demand, anti-bot and obfuscation mechanisms have evolved to deny web scrapers access to this public data.

In this article, we'll explore one of these mechanisms - honeypots. We'll explain what honeypots are, how they work, and how to avoid them while scraping. Let's get started!

What Is a Honeypot Trap?

Honeypots are tempting vulnerabilities left in a system intentionally to lure cyber attackers into attacking them.

In essence, the idea behind honeypots is to lure cyber attackers and bots into a trap. Once the attacker or bot is trapped, the honeypot can gather identifiable information, such as IP address, location, techniques used and behavior patterns. The bot can then either be blocked or monitored further for more data.

Although honeypots were initially aimed at cyber attacks, they can't reliably distinguish between cyber attacks and web scraping requests. Moreover, certain types of honeypots can identify web scrapers without requiring them to fall deep into the trap. Let's take a closer look in the following section.

How do Honeypots Work?

There are two main kinds of honeypots and each one serves a unique purpose:

  • Research Honeypots - traps placed far from the production network, usually in front of the actual website's firewalls. These honeypots are typically easier to get through, and security professionals use them to analyze attackers' behavior and techniques. They are rarely encountered in web scraping.

  • Production Honeypots - traps placed within the production network itself. The goal of these honeypots is to detect and block harmful bots. They are the type most commonly encountered in web scraping.

Despite the differences between the two main types of honeypots, they work in almost the same way - by tricking harmful bots into specific parts of the network.

For example, a common type of honeypot is an exposed SSH server with minimal protection. Attackers often fall for this trap and connect to the server, where they can then be blocked or monitored.

In web scraping, honeypots are generally placed in areas that are not visible to humans but visible to bots. Scrapers are then fingerprinted, blocked, throttled or even served false data.

What are Honeypot Types?

Honeypot types are diverse, each designed to mimic a specific attack type. However, they all aim for one thing: analyzing the tactics, techniques, and procedures (TTPs) used by attackers and harmful bots. So, let's explore the most popular categories of honeypots.

  • Database Honeypots
    Honeypots that simulate database vulnerabilities such as SQL injection and weak authorization. Besides monitoring and analyzing attackers, they are used to divert attacks away from the real database and toward the fake honeypot one.

  • Client Honeypots
    Client honeypots emulate normal client applications such as browsers and email clients. The main goal of client honeypots is to catch and analyze threats that could target your clients while they access malicious websites and servers.

  • Server Honeypots
    Server honeypots are designed to mimic the services, protocols and networks of legitimate servers, while also imitating common server vulnerabilities.

  • Web Honeypots (most common in web scraping)
    As the name suggests, web honeypots are related to web applications. This honeypot is the most commonly encountered while scraping. It emulates a web application with some weaknesses or traps, such as:

    • CSRF and SSRF security leaks.
    • Fake website APIs and URLs.
    • Fake input fields that attract spam bots.
    • Insecure HTML forms.
    • Redirections to fake web pages.

A popular trap that web scrapers encounter through web honeypots is fake page data. For example, a website that detects a scraper can serve it different product details, such as prices or images, which corrupts the scraped dataset.

The easiest way to confirm this type of trap is by scraping the target web page through two distinct web scrapers with different configurations, such as the IP address, and comparing the results.
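
As a minimal sketch of this comparison in Python, the snippet below fetches the same page through two different proxy configurations and compares an extracted field. The target URL, proxy endpoints and the .price selector are placeholders rather than real values:

import requests
from parsel import Selector

URL = "https://example.com/product/1"  # placeholder product page
# two scraper profiles that differ only by IP address (placeholder proxies)
PROFILES = [
    {"proxies": {"http": "http://proxy-1.example.com:8000", "https": "http://proxy-1.example.com:8000"}},
    {"proxies": {"http": "http://proxy-2.example.com:8000", "https": "http://proxy-2.example.com:8000"}},
]

prices = []
for profile in PROFILES:
    response = requests.get(URL, proxies=profile["proxies"], timeout=30)
    # ".price" is an assumed selector - adjust it to the real page structure
    prices.append(Selector(text=response.text).css(".price::text").get())

if len(set(prices)) > 1:
    print("results differ between profiles - the page may be serving fake data:", prices)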

How to Avoid Web Scraping Honeypots?

When it comes to web scraping, honeypots are placed in areas that are not visible to users but visible to bots. Let's take a look at some techniques that honeypots use to detect web scrapers and how to avoid them.

One of the most common traps is hidden links or page areas that are invisible to humans but still present in the HTML for bots to find.

For example, websites can add links to the web page and hide them through JavaScript and CSS tricks, such as setting display: none; in CSS or using the same color as the background. Here are two examples:

<!-- hidden honeypot trap by hiding a link using CSS -->
<html>
<head>
    <style>
        .hidden-link {
            display: none;
        }
    </style>
</head>
<body>
    <a href="/honeypot-trap" class="hidden-link">Hidden Honeypot Link</a>
</body>
</html>

<!-- hidden honeypot trap by hiding a link by applying the same background color -->
<html>
<head>
    <style>
        body {
            background-color: white;
        }
        .invisible-link {
            color: white;
            text-decoration: none;
        }
    </style>
</head>
<body>
    <a href="/honeypot-trap" class="invisible-link">Invisible Honeypot Link</a>
</body>
</html>

Since humans only see the visible part of the HTML, they are unlikely to click on these invisible honeypots. Web crawlers, however, don't consider what's visible and what's not, and fall into these traps very easily.

To avoid this web scraping honeypot, it's best to:

Have well-defined crawling rules to skip hidden links. Often these links can be identified by utility class names like d-none (from Bootstrap) or other HTML attributes.
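
For instance, here's a minimal sketch that filters out links hidden through known class names or inline styles using the parsel library. The list of suspicious class names is an assumption and should be extended per target website:

from parsel import Selector

# class names that commonly hide elements from human users (assumed list)
HIDDEN_CLASSES = {"d-none", "hidden", "hidden-link", "invisible-link"}

def visible_links(html: str) -> list:
    """Return href values of links that aren't hidden by class names or inline styles."""
    links = []
    for link in Selector(text=html).css("a[href]"):
        classes = set((link.attrib.get("class") or "").split())
        style = (link.attrib.get("style") or "").replace(" ", "").lower()
        if classes & HIDDEN_CLASSES:
            continue  # skip links hidden by known utility classes
        if "display:none" in style or "visibility:hidden" in style:
            continue  # skip links hidden by inline CSS
        links.append(link.attrib["href"])
    return links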

Verify details using headless browsers. With a headless browser we can execute JavaScript to check whether an element is actually visible, for example:

// returns true only if the element exists and isn't hidden through CSS
function isElementVisible(selector) {
    const element = document.querySelector(selector);
    if (!element) {
        return false;
    }
    const style = window.getComputedStyle(element);
    // treat display:none, visibility:hidden and zero opacity as hidden
    if (style.display === 'none' || style.visibility === 'hidden' || style.opacity === '0') {
        return false;
    }
    return true;
}
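
To tie it together, here's a minimal sketch of running the same visibility checks from Python with a headless Playwright browser and only collecting links a human could actually see. The target URL is a placeholder, and Playwright is just one of several headless browser options:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")  # placeholder target URL

    visible_links = []
    for link in page.query_selector_all("a[href]"):
        # apply the same CSS checks as isElementVisible() to each link element
        is_visible = link.evaluate(
            """el => {
                const style = window.getComputedStyle(el);
                return !(style.display === 'none'
                    || style.visibility === 'hidden'
                    || style.opacity === '0');
            }"""
        )
        if is_visible:
            visible_links.append(link.get_attribute("href"))

    print(visible_links)
    browser.close()

Playwright's built-in element_handle.is_visible() method can also be used as a simpler alternative to the custom JavaScript check.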
How to Crawl the Web with Python

If you're unfamiliar with crawling, see our introduction tutorial and example project here


Validate Sitemaps and Robots.txt Contents

Website sitemaps and robots.txt files are perfect places for honeypots as these files are mostly read by robots.

When scraping links gathered from sitemaps, robots.txt and other files that are rarely read by human users but visible to bots, it's best to validate that these pages exist on the live website. Here are some tips on how this can be done, with a small example after the list:

  • Use the website's search system to see whether the page is present on the live website.
  • Use web indexes like Google or Bing to see whether the page is indexed.
  • Use a burner scraper profile (Proxy IP) for uncertain links.
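
As an illustration of this validation idea, here's a minimal sketch that collects URLs from a sitemap and flags the ones never linked from the live homepage for extra scrutiny. The domain is a placeholder, and a real check would usually crawl more than just the homepage:

from urllib.parse import urljoin

import requests
from parsel import Selector

BASE = "https://example.com"  # placeholder domain

# URLs advertised to bots through the sitemap
sitemap = requests.get(urljoin(BASE, "/sitemap.xml"), timeout=30)
sitemap_urls = set(Selector(text=sitemap.text).xpath("//*[local-name()='loc']/text()").getall())

# URLs actually linked on the live homepage
homepage = requests.get(BASE, timeout=30)
linked_urls = {
    urljoin(BASE, href)
    for href in Selector(text=homepage.text).css("a::attr(href)").getall()
}

# sitemap-only URLs are candidates for extra validation (site search, web indexes, burner proxies)
suspicious = sorted(sitemap_urls - linked_urls)
print(suspicious)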

Modern websites are very complex and create many opportunities for honeypot traps. That being said, the easiest way to handle honeypots is to resist their tracking by mixing up scraping profiles. Let's take a look at that next.

How to Resist Web Scraping Honeypots?

It can be difficult not to stumble on a honeypot when web scraping, especially when using broad crawling techniques. However, to alleviate the risks posed by honeypots, we can use common scraping techniques and mix up the scraper fingerprint:

Rotate Browser Headers

While it's important to mimic normal browser headers to avoid detection, there's some wiggle room in the overall header profile, and we can change some values in each request.

Most commonly the User-Agent header is rotated, but other headers like Referer, Accept-Language and Accept-Encoding can also be varied to avoid detection.
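
As a minimal sketch in Python, headers can be rotated by picking a random profile per request. The User-Agent strings below are example values, and real pools should be larger and kept up to date:

import random
import requests

# a small pool of header profiles to rotate through
HEADER_PROFILES = [
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
    },
    {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
        "Accept-Language": "en-GB,en;q=0.8",
        "Accept-Encoding": "gzip, deflate, br",
    },
]

# each request goes out with a randomly selected header profile
response = requests.get("https://example.com", headers=random.choice(HEADER_PROFILES))
print(response.status_code)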

How Headers Are Used to Block Web Scrapers and How to Fix It

For more on the role HTTP headers play in scraper blocking, see our complete introduction tutorial


Rotate IP Address

The IP address is the prime identifier of each HTTP connection, and rotating proxies between scraping requests is an easy way to resist web scraping honeypots.
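
Here's a minimal sketch of proxy rotation in Python. The proxy endpoints are placeholders for whatever proxy provider or pool is actually used:

import random
import requests

# placeholder proxy pool - replace with real proxy endpoints
PROXY_POOL = [
    "http://user:pass@proxy-1.example.com:8000",
    "http://user:pass@proxy-2.example.com:8000",
    "http://user:pass@proxy-3.example.com:8000",
]

urls = ["https://example.com/page/1", "https://example.com/page/2"]
for url in urls:
    proxy = random.choice(PROXY_POOL)  # use a different exit IP per request
    response = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)
    print(url, response.status_code)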

Avoid Scraping Honeypots using ScrapFly

Honeypots can be surprisingly dangerous and difficult to avoid, but ScrapFly can be of assistance!


ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale. Each product is equipped with an automatic bypass for any anti-bot system and we achieve this by:

  • Maintaining a fleet of real, reinforced web browsers with real fingerprint profiles.
  • Millions of self-healing proxies of the highest possible trust score.
  • Constantly evolving and adapting to new anti-bot systems.
  • We've been doing this publicly since 2020 with the best bypass on the market!

To avoid web scraping honeypots using ScrapFly, all we have to do is request the target website using the ScrapFly API:

from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

scrapfly_client = ScrapflyClient("Your ScrapFly API key")
result: ScrapeApiResponse = scrapfly_client.scrape(ScrapeConfig(
    # some website URL you want to scrape
    url="website URL",
    # we can select a specific proxy country location
    country="US",
    # enable anti scraping protection bypass to avoid blocking and detection
    asp=True,
    # allows JavaScript rendering to render JavaScript loaded content, similar to headless browsers
    render_js=True
))
# use the built-in parsel selector
selector = result.selector
# access the page HTML
html = result.scrape_result['content']

FAQ

Before we wrap up this intro on web scraping honeypots, let's take a look at some frequently asked questions:

What is a honeypot in web scraping?

Honeypots in web scraping are fake links, pages or data used to detect and block web scrapers. They are usually placed in areas that are not visible to humans but visible to bots. For example, a website can serve scrapers different product details, such as prices or images.

How do you detect honeypots in web scraping?

Due to the complexity of the modern web, detecting honeypots can be very difficult. Validating that scraped URLs exist on the public web is the most reliable strategy.

How to avoid being blocked by honeypots?

The best way to avoid being blocked by honeypots is to use a big pool of web scraping profiles with different fingerprints and IP addresses while also using strict scraping rules to avoid honeypot traps.

How do I know whether my scraper is being served fake data?

Identified scrapers are sometimes served fake data to poison scraped datasets. The only way to catch this is to verify each scraped dataset multiple times using different scraping profiles (IP address, geographic location and scrape configuration).

Web Scraping Honeypots Summary

Honeypot traps are very common in IT security and becoming more common in web scraping. These traps lure robots into specific parts of the network to be fingerprinted and analyzed. The resulting data can be used to throttle, block or even serve false data to scrapers.

We went through the essential steps to avoid web scraping honeypots. In a nutshell, these are:

  • Use strict scraping or crawling rules to avoid fake data.
  • Analyze links before scraping them.
  • Use headers that mimic normal browser requests.
  • Rotate the User-Agent and other headers.
  • Split your traffic between multiple IP addresses.
