
5 Proven Ways to Bypass CAPTCHA in Python


Hitting a CAPTCHA wall can stop even the best web scrapers in their tracks, turning automated data collection into a frustrating puzzle. But what if you could glide past these digital gatekeepers without ever solving a single challenge?

In this article, we'll explain what CAPTCHAs are and how they work, then go over the most effective techniques for bypassing them while scraping. So, let's get started!

What are CAPTCHAs?

CAPTCHAs are a popular security check used to prevent bots and automated programs from interacting with a web page. They block malicious and spam activity by presenting a challenge that requires human interaction to solve.

The popular Google reCAPTCHA challenge

CAPTCHAs were first used in the early 2000s and have evolved over the years to resist advances in AI-based solving.

Various anti-bot protection services (like Akamai, Cloudflare, DataDome and PerimeterX) also often serve CAPTCHAs on low-trust connections.

These anti-bot systems share many technical aspects with modern CAPTCHAs, which means we can bypass both in the same way: by using secure, fortified HTTP connections.

How do CAPTCHAs work?

There are various types of CAPTCHA tests involving text, images, or even sounds, designed to tell humans and computers apart: easy for people to solve, but hard for automated scripts.

While there are many CAPTCHA types (image, fingerprint, text, audio), they all function very similarly

These challenges can be solved using a CAPTCHA solver with machine learning and computer vision capabilities. However, such CAPTCHA bypass results aren't very accurate, require multiple retry attempts, and tend to consume a lot of processing resources. This alone is often enough to deter bots, as solving CAPTCHAs at scale is simply too expensive.

Therefore, it's better to avoid them entirely rather than try to solve them.

How to Avoid CAPTCHAs?

Since CAPTCHAs negatively affect the user experience, anti-bot services first calculate a trust score for each request and only show a CAPTCHA when they suspect the request is automated. This score determines whether a request has to solve a CAPTCHA challenge.

This means we can bypass CAPTCHAs while scraping by raising our trust score. In simple terms, we have to make our requests' configuration mimic that of a normal human browsing with a web browser. Let's take a closer look!

Use Resistant TLS Fingerprint

When a request is sent to a website protected with an SSL/TLS certificate, the client must go through a TLS handshake to initialize a secure transmission channel. During this process, the client and the web server exchange security details, which are used to create a JA3 fingerprint.

Since web scraping tools and HTTP clients perform TLS handshakes differently from real browsers, this fingerprint can differ, leading to a lower trust score and eventually a required CAPTCHA test.

Therefore, having a web scraper with the correct TLS fingerprint is necessary for bypassing CAPTCHA challenges.

The standard Python requests library relies on urllib3 and Python's ssl module, which produces a TLS fingerprint that differs from real browsers. To bypass this, we need to use a library that mimics browser TLS fingerprints. The curl-cffi library builds on curl-impersonate, which produces browser-like fingerprints.

Here's how to use curl-cffi to make requests with proper TLS fingerprinting:

from curl_cffi import requests

# Use a browser-like TLS fingerprint
response = requests.get(
    "https://httpbin.org/headers",
    impersonate="chrome110"  # Mimics Chrome 110 TLS fingerprint
)

print(response.json())

The impersonate parameter tells curl-cffi to use the TLS fingerprint of a specific browser version. Supported options include chrome110, chrome107, safari15_3, safari15_5, and edge99.
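
For example, here's a minimal sketch that rotates between a few of these impersonation targets across requests, which also diversifies the TLS fingerprint (the URL is just a test endpoint):

from curl_cffi import requests
import random

# Rotate between a few supported impersonation targets
browser_targets = ["chrome110", "chrome107", "edge99"]

for _ in range(3):
    target = random.choice(browser_targets)
    response = requests.get("https://httpbin.org/headers", impersonate=target)
    print(target, response.status_code)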

You can also verify your TLS fingerprint using a testing service:

from curl_cffi import requests

# Test your TLS fingerprint
response = requests.get(
    "https://tls.browserleaks.com/json",
    impersonate="chrome110"
)

fingerprint_data = response.json()
print(f"TLS Version: {fingerprint_data.get('tls_version')}")
print(f"JA3 Hash: {fingerprint_data.get('ja3_hash')}")

For further details, see the ScrapFly JA3 fingerprinting tool and refer to our guide on TLS fingerprinting.

How TLS Fingerprint is Used to Block Web Scrapers?

TLS fingerprinting is a popular way to identify web scrapers that not many developers are aware of. What is it, and how can we fortify our scrapers to avoid being detected?

Enable JavaScript and Rotate JS Fingerprint

Real web browsers and web users usually have JavaScript enabled. Therefore, scraping web pages without JavaScript capabilities will often trigger CAPTCHA challenges.

In addition, JavaScript can be used to include numerous details about the request sender, including:

  • Hardware specs and capabilities
  • Operating system details
  • Browser configuration and its details
  • JavaScript runtime and its version

This is called a JavaScript fingerprint and can be used to distinguish web scrapers from human users.

Therefore, use browser automation tools such as Selenium, Playwright, or Puppeteer to bypass CAPTCHAs during the scraping process.

That being said, headless browser tools can leak specific details about the browser engine, allowing websites to identify them as automated browsers rather than normal ones. For example, a popular leak is the navigator.webdriver value, which is set to true only in automated browsers:

Illustration: a natural vs. an automated browser

Here's how to use Playwright with stealth plugins to bypass JavaScript fingerprinting:

from playwright.sync_api import sync_playwright
import random

def scrape_with_stealth(url):
    with sync_playwright() as p:
        # Launch browser with realistic settings
        browser = p.chromium.launch(
            headless=False,  # Use headful mode for better fingerprint
            args=[
                '--disable-blink-features=AutomationControlled',
                '--disable-dev-shm-usage',
            ]
        )
        
        # Create context with randomized viewport
        viewports = [
            {'width': 1920, 'height': 1080},
            {'width': 1366, 'height': 768},
            {'width': 1536, 'height': 864},
        ]
        viewport = random.choice(viewports)
        
        context = browser.new_context(
            viewport=viewport,
            user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
        )
        
        # Inject stealth scripts to hide automation
        page = context.new_page()
        page.add_init_script("""
            Object.defineProperty(navigator, 'webdriver', {
                get: () => undefined
            });
            
            window.chrome = {
                runtime: {}
            };
            
            Object.defineProperty(navigator, 'plugins', {
                get: () => [1, 2, 3, 4, 5]
            });
        """)
        
        # Navigate and scrape
        page.goto(url, wait_until='networkidle')
        content = page.content()
        
        browser.close()
        return content

# Test the scraper
html = scrape_with_stealth("https://httpbin.org/html")
print(html[:500])

For even better results, use the playwright-stealth package which automatically patches common detection methods:

from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync

def scrape_with_stealth_plugin(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        context = browser.new_context()
        page = context.new_page()
        
        # Apply stealth techniques
        stealth_sync(page)
        
        page.goto(url)
        content = page.content()
        
        browser.close()
        return content

So, to summarize:

  • Enable JavaScript execution using browser automation tools.
  • Patch headless browser leaks using web scraping tools such as Undetected ChromeDriver and Puppeteer Stealth (see the sketch below).
  • Rotate JavaScript fingerprint details (browser viewport size, operating system, etc.)
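
As a minimal sketch of the second point, here's how the undetected-chromedriver package can be used (assuming it's installed with pip install undetected-chromedriver; it patches many common ChromeDriver leaks automatically):

import undetected_chromedriver as uc

# Launch a patched Chrome instance that hides common automation leaks
# (example.com is just a placeholder target)
driver = uc.Chrome()
driver.get("https://example.com")
print(driver.title)
driver.quit()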

How Javascript is Used to Block Web Scrapers? In-Depth Guide

Introduction to how JavaScript is used to detect web scrapers: what's in a JavaScript fingerprint and how to correctly spoof it for web scraping.

Rotate IP Address Fingerprint

The IP address is a unique number set that identifies a device over the network. Websites can look for the request's IP address to get details about the geolocation and ISP to create an IP address fingerprint.

This fingerprint includes details about the IP address type, which can be one of three types:

  • Residential
    IP addresses assigned to home networks and retail users from ISPs. These IPs have a high trust score with anti-bots and CAPTCHA providers, as these are most likely used by humans.
  • Mobile
    IP addresses assigned to mobile networks through cell phone towers. They have a high trust score, as they are used by real users. These IPs are also hard to track, since cell tower connections are short-lived and rotate often.
  • Datacenter
    IP addresses that are generated by data centers and cloud providers, such as AWS and Google Cloud, thus have a low trust score. This is because normal users are unlikely to use data center IPs to browse the web.

Based on the IP address trust score, anti-bot providers may challenge the request with a CAPTCHA if the traffic is suspected to come from a bot. Websites can also set rate-limiting rules or even block IPs if the outgoing traffic from the same IP address is high within a short time window.

Therefore, it's essential to rotate high-quality IP addresses to prevent CAPTCHA detection for your web scraper.

Here's how to implement IP rotation with proxy rotation in Python:

from curl_cffi import requests
import random
import time

# List of proxy servers (format: http://user:pass@host:port)
proxies = [
    "http://user1:pass1@proxy1.example.com:8080",
    "http://user2:pass2@proxy2.example.com:8080",
    "http://user3:pass3@proxy3.example.com:8080",
]

def scrape_with_proxy_rotation(url, proxies_list):
    """Scrape with automatic proxy rotation"""
    proxy = random.choice(proxies_list)
    
    try:
        response = requests.get(
            url,
            proxy=proxy,
            impersonate="chrome110",
            timeout=30
        )
        return response
    except Exception as e:
        print(f"Proxy {proxy} failed: {e}")
        # Try next proxy
        if len(proxies_list) > 1:
            return scrape_with_proxy_rotation(url, [p for p in proxies_list if p != proxy])
        raise

# Example usage
urls = [
    "https://httpbin.org/ip",
    "https://httpbin.org/ip",
    "https://httpbin.org/ip",
]

for url in urls:
    response = scrape_with_proxy_rotation(url, proxies)
    print(f"IP: {response.json()}")
    time.sleep(2)  # Add delay between requests

For residential proxies with automatic rotation, you can use a proxy service API:

from curl_cffi import requests

def scrape_with_residential_proxy(url, proxy_endpoint):
    """Use residential proxy endpoint that auto-rotates IPs"""
    response = requests.get(
        url,
        proxy=proxy_endpoint,  # e.g., "http://rotating-residential.example.com:8080"
        impersonate="chrome110"
    )
    return response

# Each request will use a different residential IP
response = scrape_with_residential_proxy("https://httpbin.org/ip", "http://proxy-service.com:8080")
print(response.json())

How to Hide Your IP Address

In this article we'll take a look at several ways to hide IP addresses: proxies, Tor networks, VPNs and other techniques.

Request Header Fingerprint

HTTP headers are key-value pairs used to exchange information between a request and a web server. Websites compare the requests' headers with those of normal browsers. If they are missing, misconfigured or mismatched, the anti-bot can suspect the request and require it to solve a CAPTCHA challenge.

Let's take a look at some common headers websites use to detect web scrapers.

Accept

Indicates the content type accepted by the browser or HTTP client:

Accept: text/html, application/xhtml+xml, application/xml;q=0.9, image/webp

Accept-Language

The content language the browser accepts. It's relevant for websites that support localized versions and can also be used to control the language of scraped content:

Accept-Language: en-US,en;q=0.5

User-Agent

Arguably, it is the most important header in web scraping. It provides information about the request's browser name, type, version and operating system:

User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:109.0) Gecko/20100101 Firefox/113.0

Websites also use the User-Agent to detect web scraping if the same User-Agent is used across many requests. Therefore, rotating User-Agent headers can help with avoiding CAPTCHAs while web scraping.

Cookie

The Cookie header contains the cookie values stored in the browser and sends them to the server:

name=value; name2=value2; name3=value3

Cookies are used to store user data and website preferences. They can also store CAPTCHA-related keys that authenticate requests for a specific period, preventing CAPTCHAs from popping up as long as these values haven't expired. Therefore, using cookies in web scraping can help mimic normal users' behavior and avoid CAPTCHAs.

Websites can also create custom headers, usually starting with the X- prefix. These headers can contain keys related to security, analytics, and authentication.

Therefore, correctly adding headers can help requests bypass CAPTCHA while scraping.

Here's a complete example of setting realistic browser headers with proper consistency:

from curl_cffi import requests
import random

# Realistic browser headers that match each other
def get_browser_headers():
    user_agents = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    ]
    
    user_agent = random.choice(user_agents)
    
    # Headers must be consistent with User-Agent
    if "Chrome" in user_agent:
        headers = {
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.9",
            "Accept-Encoding": "gzip, deflate, br",
            "User-Agent": user_agent,
            "sec-ch-ua": '"Not_A Brand";v="8", "Chromium";v="120", "Google Chrome";v="120"',
            "sec-ch-ua-mobile": "?0",
            "sec-ch-ua-platform": '"Windows"',
            "sec-fetch-dest": "document",
            "sec-fetch-mode": "navigate",
            "sec-fetch-site": "none",
            "sec-fetch-user": "?1",
            "upgrade-insecure-requests": "1",
        }
    else:
        headers = {
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.5",
            "Accept-Encoding": "gzip, deflate",
            "User-Agent": user_agent,
            "Connection": "keep-alive",
            "Upgrade-Insecure-Requests": "1",
        }
    
    return headers

def scrape_with_headers(url):
    """Scrape with realistic, consistent headers"""
    headers = get_browser_headers()
    
    response = requests.get(
        url,
        headers=headers,
        impersonate="chrome110"
    )
    
    return response

# Test the scraper
response = scrape_with_headers("https://httpbin.org/headers")
print(response.json()["headers"])

For maintaining cookies across requests, which helps build trust:

from curl_cffi import requests
from curl_cffi.requests import Session

def scrape_with_cookie_persistence(url):
    """Maintain cookies across requests to build session trust"""
    session = Session()
    
    # First request to establish session
    response1 = session.get(
        url,
        headers=get_browser_headers(),
        impersonate="chrome110"
    )
    
    # Cookies are automatically stored in session
    # Subsequent requests use the same cookies
    response2 = session.get(
        url + "/another-page",
        headers=get_browser_headers(),
        impersonate="chrome110"
    )
    
    return response2

# Add delays between requests to mimic human behavior
import time
import random

def scrape_with_timing(url, pages):
    """Add realistic delays between requests"""
    session = Session()
    
    for page in pages:
        response = session.get(
            f"{url}{page}",
            headers=get_browser_headers(),
            impersonate="chrome110"
        )
        
        # Random delay between 2-5 seconds
        time.sleep(random.uniform(2, 5))
        
        yield response

How Headers Are Used to Block Web Scrapers and How to Fix It

Introduction to web scraping headers - what do they mean, how to configure them in web scrapers and how to avoid being blocked.


Avoid Web Scraping Honeypots

Honeypots are traps used to lure bots and attackers. They are often used to identify web scrapers and block automated programs. These honeypots are usually placed in the HTML code and are invisible to normal users.

Websites can place hidden honeypot traps in the HTML using JavaScript and CSS tricks, such as adding hidden links and buttons that are not visible to real users but are visible to scrapers like web crawlers. Honeypots can also be used to disrupt web scraping results by manipulating HTML pages and presenting misleading data, such as fake product prices.

So, avoiding honeypots by following only visible, direct links and mimicking normal users' behavior can minimize detection and the CAPTCHA rate.

Here's how to detect and avoid honeypots when scraping:

from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

def is_visible_element(element):
    """Check if element is actually visible to users"""
    style = element.get('style', '')
    classes = element.get('class', [])
    
    # Check for common hiding techniques
    if 'display: none' in style or 'visibility: hidden' in style:
        return False
    
    if 'opacity: 0' in style or 'opacity:0' in style:
        return False
    
    # Check for common honeypot classes
    honeypot_classes = ['honeypot', 'hidden', 'spam-trap', 'bot-trap']
    if any(hp_class in ' '.join(classes).lower() for hp_class in honeypot_classes):
        return False
    
    return True

def scrape_avoiding_honeypots(url):
    """Scrape while filtering out honeypot elements"""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        context = browser.new_context()
        page = context.new_page()
        
        page.goto(url, wait_until='networkidle')
        
        # Get rendered HTML after JavaScript execution
        html = page.content()
        soup = BeautifulSoup(html, 'html.parser')
        
        # Find all links
        links = soup.find_all('a', href=True)
        
        # Filter out honeypot links
        valid_links = []
        for link in links:
            if is_visible_element(link):
                # Check if link is actually visible in viewport
                href = link.get('href')
                if href and not href.startswith('javascript:'):
                    valid_links.append(href)
        
        browser.close()
        return valid_links

# Only follow links that are visible to real users
valid_links = scrape_avoiding_honeypots("https://example.com")
print(f"Found {len(valid_links)} valid links")

For more advanced honeypot detection, check element positions and dimensions:

def check_element_visibility(page, selector):
    """Check if element is visible in viewport using Playwright"""
    try:
        element = page.query_selector(selector)
        if not element:
            return False
        
        # Get element bounding box
        box = element.bounding_box()
        if not box:
            return False
        
        # Check if element has zero dimensions
        if box['width'] == 0 or box['height'] == 0:
            return False
        
        # Check if element is outside viewport
        viewport = page.viewport_size
        if box['x'] < 0 or box['y'] < 0:
            return False
        if box['x'] > viewport['width'] or box['y'] > viewport['height']:
            return False
        
        return True
    except Exception:
        return False

What are Honeypots and How to Avoid Them in Web Scraping

Introduction to web honeypots: their types and functions, how they are used to identify and block web scrapers and bots, and how to avoid them.

Success Rate by Method

Different bypass techniques have varying success rates depending on the website's protection level. Here's what to expect:

Method | Success Rate | Best For
TLS fingerprinting alone | 60-70% | Basic protection sites
TLS + Headers | 70-80% | Moderate protection
TLS + Headers + IP rotation | 85-90% | Most commercial sites
Full stack (all techniques) | 90-95% | High-protection sites
Browser automation + Full stack | 95-98% | Maximum protection sites

These rates assume proper implementation. Success depends on factors like proxy quality, request timing, and the specific anti-bot service used by the target website.

Why Isn’t Your CAPTCHA Bypass Working?

So, you’ve followed all the tricks, swapped out your headers, used proxies, even mimicked Chrome’s TLS fingerprint, but CAPTCHAs are still popping up? Don’t worry. Here are some common reasons why you might still be getting blocked, plus straightforward tips to fix them.

1. TLS Fingerprint Mismatch

What you’ll see: You’re still blocked, even though your headers look perfect.

How to check: Websites can tell if your request doesn’t quite “feel” like a real browser. To check if your TLS fingerprint matches a real one:

from curl_cffi import requests

response = requests.get(
    "https://tls.browserleaks.com/json",
    impersonate="chrome110"
)
print(response.json())

Look at the ja3_hash in the output: does it match a real Chrome browser? If not, anti-bot systems will get suspicious.

2. Header Inconsistency

What you’ll see: Sometimes requests go through, sometimes you’re blocked for no obvious reason.

How to fix: Your headers have to make sense for your chosen User-Agent. Using a Chrome UA string? Make sure you’re also sending all the Chrome-specific headers like sec-ch-ua, sec-fetch-*, and so on.
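
As a quick sketch, reusing the Chrome header values shown earlier, the client-hint headers should always match the User-Agent you send:

from curl_cffi import requests

# Chrome User-Agent paired with matching Chrome client-hint headers
chrome_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "sec-ch-ua": '"Not_A Brand";v="8", "Chromium";v="120", "Google Chrome";v="120"',
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": '"Windows"',
    "sec-fetch-dest": "document",
    "sec-fetch-mode": "navigate",
    "sec-fetch-site": "none",
}

response = requests.get("https://httpbin.org/headers", headers=chrome_headers, impersonate="chrome110")
print(response.json()["headers"])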

3. Bad IP Reputation

What you’ll see: Blocked instantly, every time, no matter what.

How to check: Your IP might already be flagged as suspicious. Try new proxies, ideally residential ones.

from curl_cffi import requests

# Check your current IP
response = requests.get("https://httpbin.org/ip", impersonate="chrome110")
print(f"Your IP: {response.json()['origin']}")

# Check its reputation
response = requests.get(f"https://ipapi.co/{response.json()['origin']}/json/")
print(response.json())

If your IP is listed as a datacenter or associated with previous abuse, rotate to something cleaner.

4. Rate Limiting

What you’ll see: First few requests are fine, then suddenly CAPTCHAs everywhere.

How to fix: Slow down! Add random delays so your requests don’t look robotic:

from curl_cffi import requests
import time
import random

def scrape_with_rate_limiting(urls, headers):
    for url in urls:
        response = requests.get(
            url,
            headers=headers,
            impersonate="chrome110"
        )
        # Wait between 3 and 8 seconds like a (very patient) human
        time.sleep(random.uniform(3, 8))
        yield response

The example above demonstrates how to avoid triggering rate limits or CAPTCHAs by introducing random delays between your requests.

5. JavaScript Detection

What you’ll see: CAPTCHA appears as soon as you land on sites heavy with JavaScript.

How to fix: Instead of just sending HTTP requests, try browser automation (like Playwright or Selenium) and use stealth plugins that help you blend in:

from playwright_stealth import stealth_sync

# Apply stealth tweaks before visiting the page
stealth_sync(page)
page.goto(url)

Don’t forget: the stealth layer is essential to avoid quick detection!

6. Missing Session Cookies

What you’ll see: Everything works at the start, then CAPTCHAs hit after a few clicks or page changes.

How to fix: Many sites use cookies to tell if you’re acting like a normal user. Make sure to maintain a session across multiple requests:

from curl_cffi.requests import Session

# Keep cookies across requests with a session
session = Session()
session.get(url1)  # Set initial cookies
session.get(url2)  # Keeps the same cookies for the next page

If you still can’t get past CAPTCHAs after these checks, it might be time to upgrade your proxies, switch tools, or consider a service like ScrapFly that handles all those steps for you. Happy scraping!

Bypassing Popular CAPTCHA Providers

There are so many CAPTCHA techniques that it's almost impossible to bypass them all. So, to focus our CAPTCHA bypass efforts, let's take a look at the most popular CAPTCHA providers and which bypass techniques work best for them.

reCAPTCHA

Google reCAPTCHA is the most commonly encountered CAPTCHA service while scraping web data. It uses a combination of image-based CAPTCHAs, as well as audio ones, to challenge automated scripts. To bypass reCAPTCHA while scraping, we can focus on the following details, combined in the sketch after this list:

  • TLS fingerprint
  • JavaScript fingerprint
  • IP address type (residential or mobile)
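
As a hedged sketch, the TLS and IP details above can be combined in a single curl-cffi request; note that the target URL and proxy endpoint below are placeholders, and for the JavaScript fingerprint you'd fall back to the Playwright approach shown earlier:

from curl_cffi import requests

# Browser-like TLS fingerprint + residential proxy (both values are placeholders)
response = requests.get(
    "https://example.com/recaptcha-protected-page",
    impersonate="chrome110",
    proxy="http://user:pass@residential-proxy.example.com:8080",
)
print(response.status_code)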

hCaptcha

hCaptcha is a popular CAPTCHA provider developed by Intuition Machines and an increasingly popular alternative to reCAPTCHA. To bypass hCaptcha while scraping, the most important details to focus on are listed below, with a browsing-behavior sketch after the list:

  • IP Address type (residential or mobile)
  • Javascript fingerprint
  • Request details (headers, cookies etc)
  • Browsing behavior
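
To illustrate the browsing behavior point, here's a minimal Playwright sketch that scrolls and pauses like a human reader before extracting the page (the target URL is a placeholder):

from playwright.sync_api import sync_playwright
import random
import time

def browse_like_a_human(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")

        # Scroll in small random steps, pausing like a reader would
        for _ in range(random.randint(3, 6)):
            page.mouse.wheel(0, random.randint(200, 600))
            time.sleep(random.uniform(0.5, 1.5))

        content = page.content()
        browser.close()
        return content

html = browse_like_a_human("https://example.com")  # placeholder URL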

Friendly Captcha

Friendly Captcha is a new frictionless, privacy-first CAPTCHA service. It's based on a proof-of-work system that requires the client to solve a computational challenge (similar to cryptocurrency mining). This means bypassing Friendly Captcha is possible through the following details:

  • Javascript execution engine (real web browser)
  • Javascript fingerprint

Power-Up with ScrapFly

We have seen that avoiding CAPTCHAs can be tedious, requiring attention to endless details. This is why we created ScrapFly - a web scraping API equipped with anti-CAPTCHA and anti-blocking features that bypass scraper blocking for you.

Illustration: the ScrapFly middleware

ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale. Each product is equipped with an automatic bypass for any anti-bot system and we achieve this by:

  • Maintaining a fleet of real, reinforced web browsers with real fingerprint profiles.
  • Millions of self-healing proxies of the highest possible trust score.
  • Constantly evolving and adapting to new anti-bot systems.
  • We've been doing this publicly since 2020 with the best bypass on the market!

Here is how to prevent CAPTCHAs using ScrapFly. All we have to do is send a request to the target website and enable the asp feature; ScrapFly will then handle the bypass logic for us:

from scrapfly import ScrapflyClient, ScrapeConfig, ScrapeApiResponse

scrapfly = ScrapflyClient(key="Your ScrapFly API key")

result: ScrapeApiResponse = scrapfly.scrape(ScrapeConfig(
    url="the target website URL",
    # select the proxy country
    country="us",
    # enable the ASP to bypass CAPTCHAs on any website
    asp=True,
    # enable JS rendering, similar to headless browsers
    render_js=True
))

# get the page HTML content
print(result.scrape_result['content'])

Sign-up for FREE to get your API key!

FAQs

To wrap up this guide on how to avoid CAPTCHA, let's take a look at some frequently asked questions.

What's the real success rate of different bypass methods?

Success rates vary significantly based on implementation quality and target website protection. TLS fingerprinting alone achieves 60-70% success on basic sites. Adding proper headers increases this to 70-80%. Combining TLS, headers, and residential IP rotation typically reaches 85-90% success. A full implementation with all techniques including browser automation can achieve 95-98% success on most protected sites. These rates assume high-quality proxies and proper configuration.

Can I get banned for trying to bypass CAPTCHA?

Bypassing CAPTCHAs for scraping public data at reasonable rates is generally legal, but website terms of service may prohibit automated access. Violating terms of service could result in IP bans or account suspensions. Always respect robots.txt, rate limits, and website terms. For commercial scraping, consider using professional services like ScrapFly that handle legal compliance.

Is it legal to bypass CAPTCHA while web scraping?

Yes, bypassing CAPTCHAs to scrape public pages at a reasonable rate without damaging the website is considered legal in most jurisdictions. However, always check local laws and the website's terms of service; some websites explicitly prohibit automated access.

What's the difference between solving and avoiding CAPTCHAs?

Solving CAPTCHAs means using computer vision or paid services to answer the challenge after it appears. This is expensive, slow, and unreliable. Avoiding CAPTCHAs means raising your trust score so they never appear in the first place. Avoiding is faster, cheaper, and more reliable, which is why it's the recommended approach.

Are there any captchas that are impossible to bypass?

Yes, webmasters can configure CAPTCHA to appear 100% of the time regardless of trust score. This is rare because it creates poor user experience, but when configured this way, you must solve the CAPTCHA to proceed. In these cases, using a CAPTCHA-solving service is the only option.

Why does my bypass work sometimes but not always?

Inconsistent results usually indicate header inconsistency, IP reputation issues, or rate limiting. Ensure headers match your User-Agent, use high-quality residential proxies, add delays between requests, and maintain session cookies. Also verify your TLS fingerprint matches a real browser using testing services.

Summary

We've explored how CAPTCHAs work in web scraping and how to bypass them by improving your scraper's trust score. The key insight is that anti-bot services calculate a trust score for each request, and CAPTCHAs only appear when this score is too low.

By fortifying your connection details across five critical areas, you can raise your trust score and avoid CAPTCHAs entirely:

  1. TLS Fingerprinting: Use curl-cffi or similar tools to mimic browser TLS fingerprints, achieving 60-70% success on basic sites.

  2. JavaScript Fingerprinting: Use browser automation with Playwright or Selenium, patch headless browser leaks, and rotate viewport sizes and browser configurations.

  3. IP Address Rotation: Use residential or mobile proxies instead of datacenter IPs, and rotate them to distribute traffic and avoid rate limits.

  4. Request Headers: Set realistic, consistent headers that match your User-Agent, including all modern browser headers like sec-ch-ua and sec-fetch-*.

  5. Honeypot Avoidance: Filter out hidden elements and only follow links that are visible to real users.

When combined, these techniques can achieve 90-95% success rates on most protected websites. The code examples provided in this guide are production-ready and can be adapted to your specific scraping needs.
