How to Bypass CAPTCHA While Web Scraping in 2024

CAPTCHAs are one of the oldest methods used by WAFs to identify and block bots. They are extremely popular and can be found on a large number of websites.

In this article, we'll explain what CAPTCHAs are and how they work. We'll also go over the most effective techniques for bypassing CAPTCHAs while scraping. So, let's get started!

How to Scrape Without Getting Blocked? In-Depth Tutorial

Learn how to scrape without getting blocked by exploring 4 core areas where web scrapers fail to cover their tracks and how analysis of these details can lead to blocking.

What are CAPTCHAs?

CAPTCHAs are a popular security check used to prevent bots and automated programs from engaging with a web page. They block malicious and spam activity by popping up a challenge that requires human interaction to solve.

[Image: the popular Google reCAPTCHA challenge]

CAPTCHAs became widespread in the early 2000s, and they have evolved over the years to resist new AI capabilities for solving them.

Various anti-bot protection services (like Akamai, Cloudflare, DataDome and PerimeterX) often also use CAPTCHAs on low-trust connections.

Anti-bot services share many technical aspects with modern CAPTCHAs. This means that we can bypass CAPTCHAs while scraping in the same way we bypass anti-bot services - by using fortified HTTP connections that look like those of real users.

How do CAPTCHAs work?

There are various types of CAPTCHA tests. They present challenges based on text, images, or even sounds that are designed to tell computers and humans apart: easy for humans to solve but hard for automated scripts.

[Image: CAPTCHA types (image, fingerprint, text image, audio). While there are many CAPTCHA types, they function very similarly.]

These challenges can be solved using a CAPTCHA solver with machine learning and computer vision capabilities. However, the results in this case are often inaccurate, require multiple retry attempts, and tend to consume a lot of processing resources. This is often enough to deter bots, as solving CAPTCHAs is simply too expensive.

Therefore, it's better to avoid them entirely rather than try to solve them.

How to Avoid CAPTCHAs?

Since CAPTCHAs negatively affect the user experience while browsing, anti-bot services show them only when they suspect a request is automated. To decide, they first calculate a trust score for the request. This score determines whether the request has to solve a CAPTCHA challenge.

This means that we can bypass CAPTCHAs while scraping by raising our trust score. In simple terms, we have to make our requests look like those of a normal human using a web browser. Let's take a closer look!

Use Resistant TLS Fingerprint

When a request is sent to a website served over HTTPS, the client and the web server must first complete a TLS handshake to initialize a secure transmission channel. During this process, the client and server exchange details about their TLS capabilities, such as supported versions, cipher suites, and extensions. The values sent by the client can be combined into a fingerprint, commonly known as a JA3 fingerprint.

Since web scraping tools and HTTP clients perform TLS handshakes differently from real browsers, their fingerprints differ from browser fingerprints. This can lead to a lower trust score and eventually to a CAPTCHA challenge.

Therefore, having a web scraper with a browser-like TLS fingerprint is necessary for bypassing CAPTCHA challenges.
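
To make the idea of a TLS fingerprint more concrete, here is a minimal sketch of how a JA3 hash is derived. JA3 joins five decimal fields from the Client Hello (TLS version, cipher suites, extensions, elliptic curves, and curve point formats) and hashes the resulting string with MD5. The numeric values below are illustrative assumptions rather than captures from real clients, and the sketch skips details such as filtering GREASE values:

```python
import hashlib


def ja3_fingerprint(tls_version, ciphers, extensions, curves, point_formats):
    """Return the JA3 string and its MD5 digest for the given Client Hello values."""
    fields = [
        str(tls_version),
        "-".join(str(v) for v in ciphers),
        "-".join(str(v) for v in extensions),
        "-".join(str(v) for v in curves),
        "-".join(str(v) for v in point_formats),
    ]
    ja3_string = ",".join(fields)
    return ja3_string, hashlib.md5(ja3_string.encode("ascii")).hexdigest()


# Illustrative values only (assumptions, not real captures)
browser_like = ja3_fingerprint(771, [4865, 4866, 4867, 49195], [0, 23, 65281, 10, 11], [29, 23, 24], [0])
script_like = ja3_fingerprint(771, [4866, 4867, 4865, 49196], [0, 11, 10, 35], [29, 23, 24], [0])

print(browser_like[0])
# Different cipher or extension lists produce a different hash
print(browser_like[1] != script_like[1])
```

Because the hash depends on the order and content of these lists, an HTTP client that offers different ciphers or extensions than a browser ends up with a different fingerprint, even if its headers are identical.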

For further details, see the ScrapFly JA3 fingerprinting tool and refer to our guide on TLS fingerprinting.

How TLS Fingerprint is Used to Block Web Scrapers?

For more, see how TLS can leak the fact that the connecting client is a web scraper and how it can be used to establish a fingerprint that tracks the client across the web.

Enable JavaScript and Rotate JS Fingerprint

Real web browsers and web users usually have JavaScript enabled. Therefore, scraping web pages without JavaScript capabilities lowers the trust score and is likely to trigger CAPTCHA challenges.

In addition, JavaScript can be used to collect numerous details about the request sender, including:

  • Hardware specs and capabilities
  • Operating system details
  • Browser configuration and its details
  • JavaScript runtime and its version

This is called a JavaScript fingerprint, and it can be used to tell web scrapers apart from human users.

Therefore, use browser automation tools, such as Selenium, Playwright, and Puppeteer, to bypass CAPTCHAs during the scraping process.

That being said, browser automation tools can leak specific details about the browser engine, allowing websites to identify them as controlled browsers and not normal ones. For example, a popular leak is the navigator.webdriver value, which is set to true in browsers controlled by automation tools:

[Image: illustration of natural vs automated browser]

So, to summarize:

  • Enable JavaScript execution using browser automation tools.
  • Patch headless browser leaks, using web scraping tools such as Undetected ChromeDriver and Puppeteer Stealth.
  • Rotate JavaScript fingerprint details (browser viewport size, operating system, etc)
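
The steps summarized above can be sketched with Playwright for Python. This is a minimal illustration under several assumptions: the profile pool is hypothetical (real profiles should be modeled on real browsers), and overriding navigator.webdriver with an init script is only the simplest possible patch, which dedicated stealth tools go well beyond:

```python
import random

# Hypothetical profile pool (assumption); keep each profile internally consistent
PROFILES = [
    {"viewport": {"width": 1920, "height": 1080}, "locale": "en-US", "timezone_id": "America/New_York"},
    {"viewport": {"width": 1536, "height": 864}, "locale": "en-GB", "timezone_id": "Europe/London"},
    {"viewport": {"width": 1366, "height": 768}, "locale": "en-US", "timezone_id": "America/Chicago"},
]

# Runs before any page script, so the page reads navigator.webdriver as undefined
HIDE_WEBDRIVER_JS = "Object.defineProperty(navigator, 'webdriver', {get: () => undefined});"


def pick_profile(rng=random):
    """Pick one fingerprint profile at random from the pool."""
    return dict(rng.choice(PROFILES))


def fetch_html(url):
    """Load a page in a browser context that uses a randomly picked profile."""
    # Imported lazily so the helpers above work without Playwright installed
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context(**pick_profile())
        context.add_init_script(script=HIDE_WEBDRIVER_JS)
        page = context.new_page()
        page.goto(url)
        html = page.content()
        browser.close()
        return html


print(pick_profile())
```

Calling fetch_html("https://example.com/") would open a fresh context with a new profile on every call, which is the rotation part of the summary.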

How JavaScript is Used to Block Web Scrapers? In-Depth Guide

For more details, see our full intro to common fingerprinting techniques and fingerprint leaks in headless browsers and how to patch them.

Rotate IP Address Fingerprint

An IP address is a numeric identifier for a device on a network. Websites can look up the request's IP address to get details such as geolocation and ISP and use them to create an IP address fingerprint.

This fingerprint includes details about the IP address type, which can be one of three types:

  • Residential
    IP addresses assigned by ISPs to home networks and retail users. These IPs have a high trust score with anti-bots and CAPTCHA providers, as they are most likely used by humans.

  • Mobile
    IP addresses assigned to mobile networks through cell phone towers. They have a high trust score, as they are used by real users. These IPs are also hard to track because cell tower connections are short-lived and rotate often.

  • Datacenter
    IP addresses assigned to data centers and cloud providers, such as AWS and Google Cloud. They have a low trust score because normal users are unlikely to browse the web from data center IPs.

Based on the IP address trust score, anti-bot providers may challenge the request with a CAPTCHA if the traffic is suspected to come from a bot. Websites can also set rate-limiting rules or even block an IP if too much traffic arrives from the same IP address in a short time window.

Therefore, it's essential to rotate high-quality IP addresses to prevent CAPTCHA detection for your web scraper.
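
As a rough illustration of IP rotation, the sketch below cycles through a proxy pool in round-robin order using only the Python standard library. The proxy URLs and the ProxyRotator helper are hypothetical placeholders; in practice the pool would come from a residential or mobile proxy provider:

```python
import itertools
import urllib.request

# Hypothetical proxy URLs (assumption); replace with real residential or mobile proxies
PROXIES = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
    "http://user:pass@proxy3.example.com:8000",
]


class ProxyRotator:
    """Hand out proxies in round-robin order so traffic is split across IPs."""

    def __init__(self, proxies):
        self._cycle = itertools.cycle(proxies)

    def next_proxy(self):
        return next(self._cycle)

    def build_opener(self):
        """Return the chosen proxy and an opener that routes requests through it."""
        proxy = self.next_proxy()
        handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
        return proxy, urllib.request.build_opener(handler)


rotator = ProxyRotator(PROXIES)
for _ in range(4):
    proxy, opener = rotator.build_opener()
    # opener.open("https://example.com/", timeout=30) would send the request via this proxy
    print(proxy)
```

Round-robin is the simplest strategy. A production rotator would typically also track failures and CAPTCHA rates per proxy and retire the ones that get flagged.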

How to Hide Your IP Address

For more on IP address and proxy blocking see our full introduction tutorial.

Request Header Fingerprint

HTTP headers are key-value pairs used to exchange information between a client and a web server. Websites compare the request's headers with those of normal browsers. If headers are missing, misconfigured or mismatched, the anti-bot can suspect the request and require it to solve a CAPTCHA challenge.

Let's take a look at some common headers websites use to detect web scrapers.

Accept

Indicates the content type accepted by the browser or HTTP client:

Accept: text/html, application/xhtml+xml, application/xml;q=0.9, image/webp

Accept-Language

Indicates the languages that the browser prefers. It takes effect on websites that support localized versions, so it can also be used to control the language of the scraped content:

Accept-Language: en-US,en;q=0.5

User-Agent

Arguably the most important header in web scraping. It provides information about the client's browser name, type, version and operating system:

User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:109.0) Gecko/20100101 Firefox/113.0

Websites also use the User-Agent to detect web scraping if the same User-Agent is used across many requests. Therefore, rotating User-Agent headers can help with avoiding CAPTCHAs while web scraping.

Cookie

The Cookie header contains the cookies stored in the browser and sends them to the server:

Cookie: name=value; name2=value2; name3=value3

Cookies are used to store user data and website preferences. They can also store CAPTCHA-related keys that authenticate requests for a specific period, preventing CAPTCHAs from popping up as long as these values haven't expired. Therefore, using cookies in web scraping can help mimic normal users' behavior and avoid CAPTCHAs.
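
Persisting cookies between requests can be sketched with the standard library alone. The make_session helper is a hypothetical name used for illustration. It attaches a cookie jar to an opener, so any Set-Cookie values from one response are sent back automatically on later requests:

```python
import http.cookiejar
import urllib.request


def make_session():
    """Create a cookie jar and an opener that stores and resends cookies."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return jar, opener


jar, opener = make_session()
# opener.open("https://example.com/") would fill the jar from Set-Cookie headers,
# and later opener.open(...) calls would send those cookies back automatically.
print(len(jar))  # 0 until a response sets a cookie
```

Reusing the same opener for a whole scraping session keeps any CAPTCHA-related cookies alive instead of starting from a blank state on every request.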

Websites can also create custom headers, usually starting with the X- prefix. These headers can contain keys related to security, analytics, and authentication.

Therefore, correctly adding headers can help requests bypass CAPTCHA while scraping.
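
One way to apply this is to rotate whole header profiles rather than individual values, so that the User-Agent, Accept and Accept-Language headers always match each other. The sketch below uses the standard library, and the two profiles are illustrative assumptions modeled loosely on Firefox and Chrome defaults:

```python
import random
import urllib.request

# Illustrative header profiles (assumption); real ones should be copied from real browsers
HEADER_PROFILES = [
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/113.0",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
    },
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
    },
]


def build_request(url, rng=random):
    """Build a request that carries one complete, consistent header profile."""
    headers = dict(rng.choice(HEADER_PROFILES))
    return urllib.request.Request(url, headers=headers)


req = build_request("https://example.com/")
# urllib stores header names capitalized as "User-agent"
print(req.get_header("User-agent"))
```

Note that urllib does not preserve header order or casing the way a browser does, so this only covers header values. Matching order and casing generally requires an HTTP client built for browser impersonation or a real browser.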

How Headers Are Used to Block Web Scrapers and How to Fix It

For more on the role of request headers in scraper blocking, see our full introduction article, which takes a deep dive into header fingerprinting.

Avoid Web Scraping Honeypots

Honeypots are traps used to lure bots and attackers. They are often used to identify web scrapers and block automated programs. These honeypots are usually placed in the HTML code and are invisible to normal users.

Websites can place hidden honeypot traps in the HTML using JavaScript and CSS tricks, such as adding hidden links and buttons that are not visible to real users but are visible to scrapers like web crawlers. Honeypots can also be used to disrupt web scraping results by manipulating HTML pages and presenting misleading data, such as fake product prices.

So, avoiding honeypots by following direct links and mimicking normal users' behavior can reduce detection and CAPTCHA rates.
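
As a simple illustration, the sketch below collects only links that are not obviously hidden. It uses the standard library HTML parser and only checks the hidden attribute and inline styles on the link itself, which is an assumption that keeps the example short. Links hidden through CSS classes, stylesheets or parent elements can only be detected reliably in a rendered browser:

```python
from html.parser import HTMLParser

# Inline style fragments that hide an element (compared with spaces removed)
HIDDEN_STYLE_MARKERS = ("display:none", "visibility:hidden")


class VisibleLinkParser(HTMLParser):
    """Collect href values of <a> tags that are not hidden inline."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        style = (attributes.get("style") or "").replace(" ", "").lower()
        hidden = "hidden" in attributes or any(m in style for m in HIDDEN_STYLE_MARKERS)
        if href and not hidden:
            self.links.append(href)


def visible_links(html):
    parser = VisibleLinkParser()
    parser.feed(html)
    return parser.links


sample = """
<a href="/products/1">Product 1</a>
<a href="/trap" style="display: none">hidden</a>
<a href="/trap2" hidden>hidden</a>
<a href="/products/2" style="color: red">Product 2</a>
"""
print(visible_links(sample))  # ['/products/1', '/products/2']
```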

What are Honeypots and How to Avoid Them in Web Scraping

Learn what honeypots are, how they work and how to avoid them while scraping.

There are so many CAPTCHA techniques that it's almost impossible to bypass them all. So, to focus our CAPTCHA bypass efforts, we'll take a look at the most popular CAPTCHA providers and the bypass techniques that work best for them.

reCAPTCHA

Google reCAPTCHA is the most commonly encountered CAPTCHA service while scraping web data. It uses a combination of image-based and audio CAPTCHAs to challenge automated scripts. To bypass reCAPTCHA while scraping, we can focus on the following details:

  • TLS fingerprint
  • JavaScript fingerprint
  • IP address type (residential or mobile)

hCaptcha

hCaptcha is a CAPTCHA provider developed by Intuition Machines. It's becoming an increasingly popular alternative to reCAPTCHA. To bypass hCaptcha while scraping, the most important details to focus on are:

  • IP address type (residential or mobile)
  • JavaScript fingerprint
  • Request details (headers, cookies, etc.)
  • Browsing behavior

Friendly Captcha

Friendly Captcha is a new frictionless, privacy-first CAPTCHA service. It's based on a proof-of-work system that requires the client to solve a complex mathematical challenge (similar to what cryptocurrencies do). This means bypassing Friendly Captcha is possible through the following details:

  • JavaScript execution engine (real web browser)
  • JavaScript fingerprint
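
To show what a proof-of-work challenge looks like in general, here is a generic hashcash-style sketch. It is not Friendly Captcha's actual puzzle format, which is an assumption-free detail we don't reproduce here. The challenge bytes, the difficulty value and the function names are all hypothetical. The client has to find a nonce whose SHA-256 hash starts with a given number of zero bits, which is costly to compute but cheap for the server to verify:

```python
import hashlib
import itertools


def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the start of a digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def solve(challenge: bytes, difficulty: int) -> int:
    """Brute-force a nonce whose hash has at least `difficulty` leading zero bits."""
    for nonce in itertools.count():
        digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
        if leading_zero_bits(digest) >= difficulty:
            return nonce


def verify(challenge: bytes, nonce: int, difficulty: int) -> bool:
    """Check a solution with a single hash, as a server would."""
    digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
    return leading_zero_bits(digest) >= difficulty


nonce = solve(b"example-challenge", 12)
print(nonce, verify(b"example-challenge", nonce, 12))
```

In the real service this work is done by JavaScript running in the page, which is why a scraper needs a real JavaScript execution engine to pass it.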

Bypass CAPTCHAs While Scraping using ScrapFly

We have seen that avoiding CAPTCHAs can be tedious, requiring attention to endless details. This is why we created ScrapFly - a web scraping API equipped with anti-CAPTCHA and anti-blocking bypass.

ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale. Each product is equipped with an automatic bypass for any anti-bot system and we achieve this by:

  • Maintaining a fleet of real, reinforced web browsers with real fingerprint profiles.
  • Providing millions of self-healing proxies with the highest possible trust score.
  • Constantly evolving and adapting to new anti-bot systems.
  • We've been doing this publicly since 2020 with the best bypass on the market!

Here is how to prevent CAPTCHAs using ScrapFly. All we have to do is send a request to the target website and enable the asp feature. ScrapFly will then handle the CAPTCHA avoidance logic for us:

from scrapfly import ScrapflyClient, ScrapeConfig, ScrapeApiResponse

scrapfly = ScrapflyClient(key="Your ScrapFly API key")

result: ScrapeApiResponse = scrapfly.scrape(ScrapeConfig(
    url="the target website URL",
    # select the proxy country
    country="us",
    # enable the ASP to bypass CAPTCHAs on any website
    asp=True,
    # enable JS rendering, similar to headless browsers
    render_js=True
))

# get the page HTML content
print(result.scrape_result['content'])

Sign-up for FREE to get your API key!

FAQ

To wrap up this guide on how to avoid CAPTCHA, let's take a look at some frequently asked questions.

Is it legal to bypass CAPTCHAs?

Yes, bypassing CAPTCHAs for scraping public pages at a reasonable rate without damaging the website is considered legal.

Can I bypass CAPTCHAs while scraping?

Yes, it's possible to solve CAPTCHAs using computer vision techniques or paid CAPTCHA solvers. However, these methods are often complex, inaccurate or expensive, and it's easier to avoid CAPTCHAs in the first place.

How to get around reCAPTCHA?

Of all CAPTCHA services, reCAPTCHA relies the most on JavaScript fingerprinting, though it also analyzes IP addresses. Therefore, it's best to use headless browsers with strong JavaScript fingerprints and residential IP addresses, especially when it comes to bypassing reCAPTCHA v3.

Are there any captchas that are impossible to bypass?

Yes, webmasters can configure CAPTCHA appearance rules, and it's possible that a CAPTCHA is configured to appear 100% of the time. This adds undesired friction for users, so it rarely happens. In that case, however, there's no other option but to solve the CAPTCHA to access the page.

Summary

We've taken a look at CAPTCHAs in web scraping: how they work and how to avoid them by improving the scraper's trust score. This is done by fortifying connection details like TLS, IP address, headers and JavaScript execution.

To summarize, these five effective techniques can be used to create an anti-CAPTCHA scraper in Python, JavaScript or any other programming language that supports headless browsers:

  • Using TLS fingerprint similar to normal browsers.
  • Rotating the JavaScript fingerprint by randomizing its variables and hiding the headless browsers' traces.
  • Splitting the requests' traffic across multiple IP addresses.
  • Using headers similar to normal users.
  • Avoiding honeypots by requesting the necessary page URLs only.

With proper use of these tools, bypassing CAPTCHAs using Python is a very achievable task!
