CAPTCHAs are one of the oldest methods used by WAFs to identify and block bots. They are extremely popular and can be found on a large share of websites.
In this article, we'll explain CAPTCHAs, what they are and how they work. We'll also go over the most effective techniques for bypassing CAPTCHAs while scraping. So, let's get started!
What are CAPTCHAs?
CAPTCHAs are a popular security check used to prevent bots and automated scripts from engaging with websites. They block malicious and spam activity by popping up a challenge that requires human interaction to solve.
CAPTCHAs were first used in the early 2000s, and they have evolved over the years to resist new AI techniques for solving them.
Modern CAPTCHAs share many technical aspects with anti-bot systems and can be bypassed in almost the same way: by fortifying the scraper's HTTP connections so they look like those of a real browser.
How do CAPTCHAs work?
There are various types of CAPTCHAs. They require solving challenges based on text, images, or even sounds that are easy for humans but hard for automated scripts.
These challenges can be solved using machine learning and computer vision, but the results aren't very accurate, often need multiple retries, and consume a lot of processing resources. This is often enough to deter bots, as solving CAPTCHAs at scale is simply too expensive.
Therefore, it's better to avoid them entirely rather than try to solve them.
How to Avoid CAPTCHAs?
Since CAPTCHAs negatively affect the user experience, anti-bot services only show them when they suspect that a request is automated. To decide, they first calculate a trust score for the request, and this score determines whether to show a CAPTCHA at all.
This means that we can avoid CAPTCHAs by raising our trust score, which simply means making our requests look like those of normal web users. Let's take a deeper look!
Use Resistant TLS Fingerprint
When a request is sent to a website served over HTTPS, the client and the web server must first complete a TLS handshake to initialize a secure transmission channel. During this process, the client announces its TLS capabilities (protocol version, cipher suites, extensions and so on), and these values can be combined into a JA3 fingerprint.
Since HTTP clients used in scraping perform the TLS handshake differently from real browsers, their fingerprint differs too. This can lead to a lower trust score and eventually to a CAPTCHA challenge.
Therefore, having a browser-like TLS fingerprint is necessary to bypass CAPTCHA detection.
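To make the idea of a JA3 fingerprint more concrete, here is a minimal sketch of how such a fingerprint is derived: the decimal values of five ClientHello fields are joined into a string and hashed with MD5. The field values below are hypothetical and chosen only for illustration, and details such as ignoring GREASE values are left out:

```python
import hashlib


def ja3_fingerprint(tls_version, ciphers, extensions, curves, point_formats):
    """Build the JA3 string and its MD5 digest from ClientHello fields."""
    fields = [
        str(tls_version),
        "-".join(str(value) for value in ciphers),
        "-".join(str(value) for value in extensions),
        "-".join(str(value) for value in curves),
        "-".join(str(value) for value in point_formats),
    ]
    ja3_string = ",".join(fields)
    ja3_hash = hashlib.md5(ja3_string.encode("ascii")).hexdigest()
    return ja3_string, ja3_hash


# Two hypothetical clients that differ only in the order of their cipher suites
browser_like = ja3_fingerprint(771, [4865, 4866, 4867], [0, 23, 65281], [29, 23, 24], [0])
script_like = ja3_fingerprint(771, [4866, 4865, 4867], [0, 23, 65281], [29, 23, 24], [0])

print(browser_like[0])  # prints "771,4865-4866-4867,0-23-65281,29-23-24,0"
print(browser_like[1] != script_like[1])  # prints True
```

Even a small difference in the handshake, such as the cipher order, produces a completely different hash, which is why an HTTP client with default TLS settings is easy to tell apart from a browser.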
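JavaScript running in the browser can also be used to fingerprint the client by collecting details such as:
Hardware specs and capabilities
Operating system details
Browser configuration and its details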
Unfortunately, headless browser tools can leak specific details about the browser engine, allowing websites to identify them as controlled browsers and not normal ones. For example, a popular headless browser leak is the navigator.webdriver value, which is set to true only in browsers controlled by automation tools.
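As a rough sketch of how this leak can be inspected and patched, the snippet below uses Playwright for Python (an assumption, as any browser automation tool with init-script support would work similarly). The helper name read_webdriver_flag is hypothetical, and hiding this single property does not cover the many other leaks a headless browser can have:

```python
# JavaScript injected before any page script runs; it hides the automation flag
HIDE_WEBDRIVER_JS = """
Object.defineProperty(navigator, 'webdriver', {
    get: () => undefined,
});
"""


def read_webdriver_flag(url: str, patched: bool):
    """Open `url` in headless Chromium and return the navigator.webdriver value."""
    # imported lazily so the snippet can be loaded without Playwright installed
    from playwright.sync_api import sync_playwright

    with sync_playwright() as pw:
        browser = pw.chromium.launch(headless=True)
        page = browser.new_page()
        if patched:
            # register the patch so it runs before the website's own scripts
            page.add_init_script(HIDE_WEBDRIVER_JS)
        page.goto(url)
        value = page.evaluate("() => navigator.webdriver")
        browser.close()
        return value
```

Calling read_webdriver_flag with patched=False and then patched=True against any page is a quick way to compare what a website would see before and after the patch.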
So, to summarize:
Patch headless browser leaks
Rotate IP Address Fingerprint
An IP address is a unique set of numbers that identifies a device on a network. Websites can look up the request's IP address to get details about its geolocation and ISP and build an IP address fingerprint.
This fingerprint includes details about the IP address type, which can be one of three types:
Residential IP addresses are assigned by ISPs to home networks and retail users. They have a high trust score with anti-bots and CAPTCHA providers, as they are most likely used by humans.
Mobile IP addresses are assigned to mobile networks through cell phone towers. They have a high trust score, as they are used by real users. These IPs are also hard to track because cell tower connections are short-lived and rotate often.
Datacenter IP addresses are assigned by data centers and cloud providers, such as AWS and Google Cloud. They have a low trust score because normal users are unlikely to browse the web from a datacenter.
Based on the IP address trust score, anti-bot providers may challenge the request with a CAPTCHA if the traffic is suspected to come from a bot. Websites can also set rate-limiting rules or even block an IP if it sends too much traffic in a short time window.
Therefore, it's essential to rotate high-quality IP addresses to make your scraper resistant to CAPTCHAs.
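A minimal sketch of IP rotation is shown below. It cycles through a pool of proxies in round-robin order so that consecutive requests leave from different IP addresses. The proxy URLs and the ProxyRotator class are hypothetical placeholders, and a real pool would contain residential or mobile proxies from a provider:

```python
import itertools
import urllib.request

# Hypothetical proxy endpoints; replace them with real residential or mobile proxies
PROXY_POOL = [
    "http://user:pass@proxy-1.example.com:8000",
    "http://user:pass@proxy-2.example.com:8000",
    "http://user:pass@proxy-3.example.com:8000",
]


class ProxyRotator:
    """Hand out proxies from the pool in round-robin order."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("the proxy pool must not be empty")
        self._cycle = itertools.cycle(proxies)

    def next_proxy(self) -> str:
        return next(self._cycle)

    def build_opener(self):
        """Return the chosen proxy and a urllib opener that routes through it."""
        proxy = self.next_proxy()
        handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
        return proxy, urllib.request.build_opener(handler)


rotator = ProxyRotator(PROXY_POOL)
used = [rotator.next_proxy() for _ in range(4)]
print(used[0] == used[3])  # prints True: the pool wraps around after 3 proxies
```

Round-robin is the simplest strategy. Picking proxies at random or weighting them by past success rate are common alternatives when some IPs get challenged more often than others.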
Request Header Fingerprint
HTTP headers are key-value pairs used to exchange information between a client and a web server. Websites compare the request's headers with those of normal browsers. If headers are missing, misconfigured or mismatched, the anti-bot can suspect the request and require it to solve a CAPTCHA challenge.
Let's take a look at some common headers websites use to identify web scrapers.
Accept indicates the content types accepted by the browser or HTTP client.
Accept-Language indicates the languages that the browser accepts. It's effective with websites that support localized versions, and it can also be used to control the language of the scraped pages.
User-Agent is arguably the most important header in web scraping. It provides information about the request's browser name, type, version and operating system:
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:109.0) Gecko/20100101 Firefox/113.0
Websites also use the User-Agent to detect web scraping if the same User-Agent is used across many requests. Therefore, rotating User-Agent headers can help with avoiding CAPTCHAs while web scraping.
Cookie: name=value; name2=value2; name3=value3
Cookies are used to store user data and website preferences. They can also store CAPTCHA-related keys that authenticate requests for a specific period, preventing CAPTCHAs from popping up as long as these values haven't expired. Therefore, using cookies in web scraping can help mimic normal users' behavior and avoid CAPTCHAs.
Websites can also create custom headers, usually starting with the X- prefix. These headers can contain keys related to security, analytics and authentication.
Therefore, correctly adding headers can help requests bypass CAPTCHAs while scraping.
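The header advice above can be sketched as a small helper that picks one complete browser profile at a time, so that User-Agent, Accept and Accept-Language always match each other, and then attaches cookies. The profile values and the build_headers name are illustrative assumptions and should be refreshed from a real browser session:

```python
import random

# Illustrative browser profiles (assumed values): each keeps its headers consistent
BROWSER_PROFILES = [
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/113.0",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
    },
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
    },
]


def build_headers(cookies=None):
    """Pick a whole browser profile at random and optionally attach cookies."""
    headers = dict(random.choice(BROWSER_PROFILES))
    if cookies:
        # serialize cookies in the same "name=value; name2=value2" form browsers send
        headers["Cookie"] = "; ".join(f"{name}={value}" for name, value in cookies.items())
    return headers


headers = build_headers({"name": "value", "name2": "value2"})
print(headers["Cookie"])  # prints "name=value; name2=value2"
```

Rotating whole profiles rather than individual headers avoids the mismatches described above, such as a Firefox User-Agent paired with Chrome-style Accept values.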
Avoid Web Scraping Honeypots
Honeypots are traps used to lure bots and attackers and are often used to identify web scrapers. They are usually placed in the HTML code, for example as links hidden with CSS, so they are invisible to normal users but still followed by naive crawlers.
So, avoiding honeypots by following only direct, visible links and mimicking normal users' behavior can reduce both the detection rate and the number of CAPTCHAs.
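As a simplified sketch of honeypot avoidance, the parser below collects only links that are not hidden through their own inline markup. It is an assumption-heavy illustration: it does not detect links hidden by a parent element, a CSS class or an external stylesheet, which would require rendering the page:

```python
from html.parser import HTMLParser


class VisibleLinkParser(HTMLParser):
    """Collect hrefs from <a> tags that are not hidden by inline markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        # normalize the inline style so "display: none" and "display:none" both match
        style = (attributes.get("style") or "").replace(" ", "").lower()
        hidden = (
            "hidden" in attributes
            or "display:none" in style
            or "visibility:hidden" in style
        )
        if href and not hidden:
            self.links.append(href)


sample_html = (
    '<a href="/products">Products</a>'
    '<a href="/trap" style="display: none">Special offer</a>'
    '<a href="/trap2" hidden>More</a>'
)
parser = VisibleLinkParser()
parser.feed(sample_html)
print(parser.links)  # prints ['/products']
```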
Popular CAPTCHA Providers
There are so many captcha techniques that it's almost impossible to bypass them all. To focus our efforts, we'll take a look at the most popular captcha providers and the bypass techniques that work best for each.
reCAPTCHA, developed by Google, is the most commonly encountered captcha service. It uses a combination of image and audio challenges. To bypass reCAPTCHA, the most important details to focus on are:
IP address type (residential or mobile)
Prioritizing these 3 aspects gives the highest chance of avoiding reCAPTCHA.
hCaptcha is a captcha provider developed by Intuition Machines. It's becoming an increasingly popular alternative to reCAPTCHA. To bypass hCaptcha, the most important details to focus on are:
IP Address type (residential or mobile)
Request details (headers, cookies etc)
Friendly Captcha is a newer, frictionless, privacy-first captcha service. It's based on a proof-of-work system that requires the client to solve a computationally expensive mathematical challenge (similar to what cryptocurrencies do). This means that to bypass Friendly Captcha, the client must be able to run the challenge computation itself.
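To illustrate the proof-of-work idea in general terms, here is a minimal hash-based sketch: the client searches for a nonce whose SHA-256 digest starts with a given number of zeros, and the server verifies it with a single hash. This is a generic example and not Friendly Captcha's actual puzzle format. The function names, the challenge string and the difficulty are all assumptions:

```python
import hashlib
import itertools


def pow_digest(challenge: str, nonce: int) -> str:
    """Hash the challenge together with a candidate nonce."""
    return hashlib.sha256(f"{challenge}:{nonce}".encode("utf-8")).hexdigest()


def solve_pow(challenge: str, difficulty: int) -> int:
    """Brute-force the first nonce whose digest starts with `difficulty` zeros."""
    prefix = "0" * difficulty
    for nonce in itertools.count():
        if pow_digest(challenge, nonce).startswith(prefix):
            return nonce


def verify_pow(challenge: str, nonce: int, difficulty: int) -> bool:
    """Verification needs one hash, while solving needs many attempts."""
    return pow_digest(challenge, nonce).startswith("0" * difficulty)


nonce = solve_pow("example-challenge", 4)
print(verify_pow("example-challenge", nonce, 4))  # prints True
```

The asymmetry is the point of the design: a single visitor barely notices the cost, but a scraper sending thousands of requests has to pay it every time.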
Bypass CAPTCHAs using ScrapFly
We have seen that avoiding CAPTCHAs can be tedious and requires paying attention to many details. This is why we created ScrapFly, a web scraping API equipped with an anti-CAPTCHA and anti-blocking bypass. Furthermore, it allows for scraping at scale by providing:
Residential proxies from 50+ countries, which help avoid IP address throttling and blocking while also allowing for scraping from almost any geographical location.
Here is how to avoid CAPTCHAs using ScrapFly. All we have to do is send a request to the target website with the asp feature enabled. ScrapFly will then handle the avoidance logic for us:
from scrapfly import ScrapflyClient, ScrapeConfig, ScrapeApiResponse

scrapfly = ScrapflyClient(key="Your ScrapFly API key")

result: ScrapeApiResponse = scrapfly.scrape(ScrapeConfig(
    url="the target website URL",
    # select the proxy country ("us" is only an example value)
    country="us",
    # enable the ASP to bypass CAPTCHAs on any website
    asp=True,
    # enable JS rendering, similar to headless browsers
    render_js=True,
))

# get the page HTML content
print(result.scrape_result["content"])
To wrap up this guide on how to avoid CAPTCHA, let's take a look at some frequently asked questions.
Is it legal to bypass CAPTCHAs?
Yes, bypassing CAPTCHAs for scraping public pages at a reasonable rate without damaging the website is generally considered legal.
Can I bypass CAPTCHAs while scraping?
Yes, it's possible to solve CAPTCHAs using computer vision techniques or paid captcha solvers. However, these methods are often complex, inaccurate or expensive and it's easier to avoid them in the first place.
How to get around recaptcha?
Are there any captchas that are impossible to bypass?
Yes, webmasters can configure the rules for when a captcha appears, and it's possible for a captcha to be configured to appear 100% of the time. This adds undesired friction for real users, so it happens rarely. In that case, there's no other option but to solve the captcha to access the page.
To summarize, we can avoid CAPTCHAs while scraping by:
Using a TLS fingerprint similar to that of normal browsers.
Splitting the requests' traffic across multiple IP addresses.
Using headers similar to those of normal users.
Avoiding honeypots by requesting the necessary page URLs only.
With proper use of these tools, bypassing captchas using Python is a very achievable task!