Anti Scraping Protection (ASP)

Service disruption of this feature can occur regardless of our will. Protections evolve; we must find and adapt our solution, which can take days up to weeks to reach production grade. If you use this feature, keep this in mind and design your software accordingly.
Anti Bot
Unblock protected websites: our Anti Scraping Protection is an internal solution for scraping protected websites that won't let bots scrape them. It combines many concepts to keep a coherent fingerprint and replicate a real user's fingerprint as closely as possible while you scrape a website. If you are interested in how it technically works to stay undetected, we have written a series of articles.
Although we are able to bypass protection, many usages are prohibited* and any attempt will result in account suspension.
*Cybersecurity (red team) firms can be allowed after we receive authorization from the blue team for the domain(s) they approved.
- Automated Online Payment
- Account Creation
- Spam Post
- Falsify Vote
- Credit Card Testing
- Login Brute Force
- Referral / Gifting systems
- Ads fraud
- Banks
- Ticketing (Automated Buying System)
- Betting, Casino, Gambling
Scrapfly detects and resolves challenges from well-known anti-scraping solutions on the market. Scrapfly also supports custom solutions on popular websites. Anti-bot bypasses are transparent: no extra work on your side, you directly retrieve the successful response.
Keep in mind that anti-bot solutions evolve, and we may need to adapt our bypass techniques; this is why you should handle ASP errors correctly when relying on ASP.
If you plan to target a protected website, you should try many configurations: with or without a browser, with residential proxies. If you send a POST request with a body, you must configure the headers and content type to mimic a real call.
Despite all your attempts, if you are still blocked, you can contact us through the chat to investigate; custom bypasses are available from the custom plan with a minimum engagement of one year.
Usage
When ASP is enabled, anti-bot solution vendors are automatically detected and everything is managed to bypass them.
```python
import requests

# minimal scrape call with ASP enabled (see the parameter breakdown below)
url = "https://api.scrapfly.io/scrape?key=__API_KEY__&url=https%3A%2F%2Fhttpbin.dev%2Fanything&tags=project%3Adefault&proxy_pool=public_datacenter&asp=true"
response = requests.get(url)
print(response.text)

# data = response.json()
# print(data['result']['content'])
# print(data['result']['status_code'])
```
"https://api.scrapfly.io/scrape?key=&url=https%3A%2F%2Fhttpbin.dev%2Fanything&tags=project%3Adefault&proxy_pool=public_datacenter&asp=true"
"api.scrapfly.io"
"/scrape"
key = ""
url = "https://httpbin.dev/anything"
tags = "project:default"
proxy_pool = "public_datacenter"
asp = "true"
ASP is magic but has some limitations
Popular anti-bot vendors are bypassed, but there are still some areas we can't cover in a fully automated way that require your attention to configure the calls correctly. Websites exposing their internal API publicly, or protecting form POSTs (authentication / proof of browser), are cases the ASP can't guess or fix on its own. You need to investigate yourself; if you make no effort in that direction while your scrapes (POST / private API) are rejected, you will stay unsuccessful. The most complicated part (defeating the anti-bot) is already done for you; if you are in this case, here is the last mile:
POST form
Posting a form requires passing the correct headers and information, just like the website does. Most of the time, the Accept and Content-Type headers require special attention.
The best way to replicate the headers correctly is to trigger the call from your browser, inspect it with the dev tools, and examine the related resources.
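As an illustration, here is a minimal sketch of a form POST through the scrape endpoint, assuming the endpoint forwards your POST body to the target; the target URL, form field, and header values are hypothetical placeholders to replace with what your browser's dev tools show:

```python
import requests
from urllib.parse import urlencode

params = {
    "key": "__API_KEY__",
    "url": "https://example.com/search",  # hypothetical form action
    "asp": "true",
    "headers[accept]": "text/html",  # mimic what the browser sends
    "headers[content-type]": "application/x-www-form-urlencoded",
}
api_url = "https://api.scrapfly.io/scrape?" + urlencode(params)
# the POST body is forwarded to the target website
response = requests.post(api_url, data="q=laptop")
print(response.json()["result"]["status_code"])
```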
Website Private API
Basically, the same as for POST requests. Most of the time, a private API requires authentication; if basic cookies from normal user navigation are not enough, you might need to reverse the process to figure out how it's authenticated, retrieve the token, and pass it along with your scrape to be authorized.
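For example, once you have reversed how the target authenticates, you can forward the token with the headers parameter; in this sketch the token, target URL, and header name are hypothetical and depend entirely on the website:

```python
import requests
from urllib.parse import urlencode

# TOKEN and the private API URL are hypothetical; how the token is obtained
# depends entirely on the target (cookie, localStorage, signed header, ...)
TOKEN = "eyJ..."
params = {
    "key": "__API_KEY__",
    "url": "https://example.com/api/v1/products",  # hypothetical private API
    "asp": "true",
    "headers[authorization]": f"Bearer {TOKEN}",  # forwarded to the target
    "headers[accept]": "application/json",
}
response = requests.get("https://api.scrapfly.io/scrape?" + urlencode(params))
print(response.json()["result"]["content"])
```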
Maximize Your Success Rate
Network Quality
Most of the time, datacenter IPs are good enough, but websites protected by anti-bot vendors check whether the IP originates from a datacenter or a regular connection. With a residential network, you get an IP with a better reputation, registered under a regular ASN (which is used to verify the origin of the IP).
- Introduction To Proxies in Web Scraping
- How to Avoid Web Scraping Blocking: IP Address Guide
- Learn how to change the network type
API Usage: proxy_pool=public_residential_pool, check out the related documentation
Use a Browser
Most anti-bot solutions check the browser fingerprint / JavaScript engine to generate a proof of legitimacy.
API Usage: render_js=true, check out the related documentation
Verify Cookies and Headers
Observe the headers/cookies of regular calls that are successful; you can figure out whether you need to add extra headers or retrieve specific cookies to authenticate. Use the dev tools and inspect the network activity.
API Usage: headers[referer]=https%3A%2F%2Fexample.com (value is URL-encoded), check out the related documentation
Navigation Coherence
You might need to retrieve cookies from navigation before calling an unofficial API. The easiest way to achieve that is to scrape with session enabled and JS rendering to retrieve the cookies; then you can scrape without rendering JS since the cookies are now stored in your session and applied back, as sketched below.
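A minimal sketch of this two-step flow, assuming the session parameter from the session feature names a reusable session (the target URLs are placeholders):

```python
import requests
from urllib.parse import urlencode

API = "https://api.scrapfly.io/scrape"

def scrape(target_url, **extra):
    # `session` is assumed to be the session parameter from the session docs
    params = {"key": "__API_KEY__", "url": target_url, "asp": "true",
              "session": "my-session", **extra}
    return requests.get(API + "?" + urlencode(params)).json()

# 1. Navigate once with a browser so the anti-bot cookies land in the session
scrape("https://example.com/products", render_js="true")

# 2. Later calls reuse the session cookies without paying for a browser
result = scrape("https://example.com/api/products?page=2")
print(result["result"]["status_code"])
```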
Geo Blocking
Some websites block navigation based on IP location. By default, Scrapfly selects a random country from the pool; specifying a country matching the location of the website could help.
API Usage: country=us, check out the related documentation
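Putting the options above together, a single call can combine residential proxies, browser rendering, an extra header, and a country; a sketch with a hypothetical target:

```python
import requests
from urllib.parse import urlencode

params = {
    "key": "__API_KEY__",
    "url": "https://example.com/",  # hypothetical protected target
    "asp": "true",
    "proxy_pool": "public_residential_pool",  # better IP reputation
    "render_js": "true",  # pass browser fingerprint checks
    "headers[referer]": "https://example.com",  # extra header, URL-encoded by urlencode
    "country": "us",  # match the website's location
}
response = requests.get("https://api.scrapfly.io/scrape?" + urlencode(params))
print(response.json()["result"]["status_code"])
```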
Pricing
Pricing is not easy to predict with Anti Scraping Protection. Everything is designed and optimized to reduce the cost and maximize the reuse of authenticated sessions (which is free even when protection is activated).
Be aware that scraping protected sites at scale has a real cost; the best way to budget your volume is to take an account and monitor the cost while increasing the volume, to avoid surprises. We try to be as transparent as we can on this subject because we know you need to predict and budget the price of the solution; if you have any questions or feedback, please contact us.
Pricing Grid
| Scenario | API Credits Cost |
| --- | --- |
| ASP + Residential Proxies* | 25 |
| ASP + Residential Proxies + Browser* | 25 + 5 = 30 |
| ASP + Datacenter Proxies | 1 |
| ASP + Datacenter Proxies + Browser | 6 |
Pricing Predictability and Control of Your Spending
Since the ASP dynamically updates the configuration to pass protection, the price is dynamic for the following reasons:
- The first try always respects your configuration.
- When the first, low-cost try fails, we upgrade according to the protection (browser, proxy quality). Most well-known anti-bots require a browser.
- Some targets/shields have special fees due to the resources required to pass them.
- We can optimize and reduce the cost when we reuse the bypass and avoid the challenge.
- If no protection/block is triggered, you don't pay any extra.
To help you make your cost more predictable, the following options are available:
- Project: allow or disallow extra quota, set an extra-usage spending limit and a concurrency limit
- Throttler: define a per-target speed limit (request rate, concurrency) and an API Credit budget per period (hour, day, month)
- API: use the cost_budget parameter to define the maximum budget the ASP should respect (see the sketch below). If the budget interrupts the configuration upgrade during a web scrape, the scrape that was performed is billed regardless of the status code. Make sure to define the correct minimum budget for your target: if the budget is too low, you will never be able to pass and will pay for blocked results.
Therefore, you can enable ASP even if you are unsure whether a website is protected: if nothing is blocking, no extra credits are counted.
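A minimal sketch using cost_budget, assuming a cap of 30 credits to allow the residential + browser upgrade path from the pricing grid (the target is a placeholder):

```python
import requests
from urllib.parse import urlencode

params = {
    "key": "__API_KEY__",
    "url": "https://example.com/",  # hypothetical protected target
    "asp": "true",
    # cap the upgrade path: 30 credits covers residential proxies + browser
    # per the pricing grid above; too low a cap may never pass the protection
    "cost_budget": "30",
}
response = requests.get("https://api.scrapfly.io/scrape?" + urlencode(params))
print(response.status_code)
```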
Success Rate and Fair Use
Failed requests (status >= 400) are not billed, except the following: 400, 401, 404, 405, 406, 407, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 422, 424, 426, 428, 456.
To prevent abuse, this is subject to our fair use policy: if more than 30% of your traffic returns the HTTP codes above, the fair use is disabled and failed requests are billed.
If your account falls under a 60% success rate and/or you deliberately scrape protected websites without ASP or keep scraping a failing target, your account will be suspended.
You can try to target the desired website through our API player by creating a free account.
The API response contains the header X-Scrapfly-Api-Cost, which indicates the billed amount, and X-Scrapfly-Remaining-Api-Credit, which indicates the remaining amount of Scrape API Credits, as shown below.
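A quick sketch reading both headers from the response:

```python
import requests
from urllib.parse import urlencode

params = {"key": "__API_KEY__", "url": "https://httpbin.dev/anything", "asp": "true"}
response = requests.get("https://api.scrapfly.io/scrape?" + urlencode(params))

# billing information exposed on every API response
print("billed:   ", response.headers.get("X-Scrapfly-Api-Cost"))
print("remaining:", response.headers.get("X-Scrapfly-Remaining-Api-Credit"))
```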
Some specific protected domains have a special price due to high protection; you can ask through support or test via the player to see how many credits are billed.
Integration
Related Errors
All related errors are listed below. You can see the full description and an example of the error response in the Errors section.
- ERR::ASP::CAPTCHA_ERROR - Something went wrong with the captcha. We will figure out and fix the problem as soon as possible
- ERR::ASP::CAPTCHA_TIMEOUT - The time budgeted to solve the captcha was reached
- ERR::ASP::PROTECTION_FAILED - The attempt to solve or bypass the bot protection failed this time. Unfortunately it happens sometimes; retry if this error is sporadic. If it always happens, check your config and ask support
- ERR::ASP::SHIELD_ERROR - The ASP encountered an unexpected problem. We will fix it as soon as possible. Our team has been alerted
- ERR::ASP::SHIELD_EXPIRED - The ASP shield previously set has expired; you must retry
- ERR::ASP::SHIELD_PROTECTION_FAILED - The ASP shield failed to solve the challenge against the anti-scraping protection
- ERR::ASP::TIMEOUT - The ASP took too much time to solve or respond
- ERR::ASP::UNABLE_TO_SOLVE_CAPTCHA - Despite our efforts, we were unable to solve the captcha. It can happen sporadically; please retry
- ERR::ASP::UPSTREAM_UNEXPECTED_RESPONSE - The response given by the upstream after challenge resolution was not the one expected. Our team has been alerted
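As recommended above, handle ASP errors and retry the sporadic ones. A sketch, assuming the error code can be found in the response body (adapt the check to the error format documented in the Errors section):

```python
import time
import requests
from urllib.parse import urlencode

# errors worth retrying when sporadic, per the list above
RETRYABLE = {
    "ERR::ASP::PROTECTION_FAILED",
    "ERR::ASP::SHIELD_EXPIRED",
    "ERR::ASP::TIMEOUT",
    "ERR::ASP::UNABLE_TO_SOLVE_CAPTCHA",
}

params = {"key": "__API_KEY__", "url": "https://example.com/", "asp": "true"}
api_url = "https://api.scrapfly.io/scrape?" + urlencode(params)

response = None
for attempt in range(3):
    response = requests.get(api_url)
    if response.ok:
        break  # success: the bypass worked
    if not any(code in response.text for code in RETRYABLE):
        break  # permanent error: check your config and ask support
    time.sleep(2 ** attempt)  # sporadic error: back off and retry
print(response.status_code)
```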