Service disruption on this feature can occur, regardless of our will. Protections evolve, and when they do we need to find and adapt our solution, which can take days up to weeks to reach a production-grade solution again. As soon as you use this feature, keep this in mind and design your software accordingly.
Unblock protected websites: our Anti Scraping Protection (ASP) is an internal solution to scrape protected websites that won't let bots scrape them. It combines many techniques to keep a coherent fingerprint and replicate a real user fingerprint as closely as possible when you scrape a website. If you are interested in how it technically works to stay undetected, we have written a series of articles.
Although we are able to bypass protections, many usages are prohibited* and any attempt will result in account suspension. (*) Cybersecurity (red team) firms can be allowed after we get the authorization from the blue team for the domain(s) they approved.
- Automated Online Payment
- Account Creation
- Spam Post
- Falsify Vote
- Credit Card Testing
- Login Brute Force
- Referral / Gifting systems
- Ads fraud
- Banks
- Ticketing (Automated Buying System)
- Betting, Casino, Gambling
Scrapfly detects and resolves challenges from well-known anti-scraping solutions on the market. Scrapfly also supports custom solutions on popular websites. The anti-bot bypass is transparent and requires no extra work on your side: you directly retrieve the successful response.
Keep in mind that anti-bot solutions evolve and we may need to adapt our bypass techniques; this is why you should handle ASP errors correctly when relying on ASP.
If you plan to target a protected website, you should try several configurations: with or without the browser, and with residential proxies. If you send a POST request with a body, you must configure the headers and content type to mimic a real call.
If you are still blocked despite all your attempts, you can contact us through the chat so we can investigate; custom bypasses are available on a custom plan with a minimum engagement of one year.
When ASP is enabled, anti-bot solution vendors are automatically detected and everything is managed to bypass them.
curl -G \
--request "GET" \
--url "https://api.scrapfly.io/scrape" \
--data-urlencode "key=__API_KEY__" \
--data-urlencode "url=https://httpbin.dev/anything" \
--data-urlencode "tags=project:default" \
--data-urlencode "proxy_pool=public_datacenter" \
--data-urlencode "asp=true"
"https://api.scrapfly.io/scrape?key=&url=https%3A%2F%2Fhttpbin.dev%2Fanything&tags=project%3Adefault&proxy_pool=public_datacenter&asp=true"
"api.scrapfly.io"
"/scrape"
key = ""
url = "https://httpbin.dev/anything"
tags = "project:default"
proxy_pool = "public_datacenter"
asp = "true"
Popular anti-bot vendors are bypassed, but there are still some areas that we can't cover in a fully automated way and that still require your attention to configure the calls correctly. Websites exposing their internal API publicly, or accepting form POSTs, also protect those endpoints (authentication / proof of browser), and the ASP can't guess or fix that for you. You need to investigate it yourself: if you do not put any effort in that direction while your scrapes (POST / private API) are rejected, you will stay unsuccessful. The most complicated part (defeating the anti-bot) is already done for you; if you are in this case, here is the last mile:
POST form
Posting a form requires passing the correct headers and information, just like the website does. Most of the time, the Accept and Content-Type headers require special attention.
The best way to replicate the headers correctly is to trigger the call from your browser, inspect it with the dev tools, and check the related resources.
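As an illustration, a form POST through the API could look like the following sketch. The form fields and header values are placeholders to replace with what the dev tools show for the real call; the headers[...] query parameters follow the pattern shown later on this page, and the exact forwarding behavior should be checked against the official documentation.

# Hypothetical form POST: form fields go in the body, target headers are
# passed as URL-encoded headers[...] query parameters (brackets encoded as %5B/%5D)
curl --request "POST" \
  --url "https://api.scrapfly.io/scrape?key=__API_KEY__&asp=true&url=https%3A%2F%2Fhttpbin.dev%2Fanything&headers%5Bcontent-type%5D=application%2Fx-www-form-urlencoded&headers%5Baccept%5D=application%2Fjson" \
  --data "field1=value1&field2=value2"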
Basically, the same as for a POST request. Most of the time a private API requires authentication; if the basic cookies from a normal user navigation are not enough, you might need to reverse the process to figure out how it is authenticated, retrieve the token, and pass it along with your scrape to be authorized.
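For example, once you have figured out the token, you could forward it as an extra header on your scrape; the header name and token value below are purely illustrative.

curl -G \
  --url "https://api.scrapfly.io/scrape" \
  --data-urlencode "key=__API_KEY__" \
  --data-urlencode "url=https://httpbin.dev/anything" \
  --data-urlencode "asp=true" \
  --data-urlencode "headers[authorization]=Bearer REPLACE_WITH_TOKEN"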
Most of the time, datacenter IPs are good enough, but websites protected by anti-bot vendors check the origin of the IP to determine whether the traffic is coming from a datacenter or a regular connection. With the residential network, you get an IP with a better reputation, registered under a regular ASN (which is what is used to check the origin of the IP).
API Usage: proxy_pool=public_residential_pool, check out the related documentation
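A minimal sketch of the earlier example switched to the residential pool; only the proxy_pool value changes.

curl -G \
  --url "https://api.scrapfly.io/scrape" \
  --data-urlencode "key=__API_KEY__" \
  --data-urlencode "url=https://httpbin.dev/anything" \
  --data-urlencode "asp=true" \
  --data-urlencode "proxy_pool=public_residential_pool"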
Most anti-bot solutions check the browser fingerprint / JavaScript engine to generate a proof of legitimacy.
API Usage: render_js=true, check out the related documentation
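For example, enabling the headless browser together with ASP looks like this sketch:

curl -G \
  --url "https://api.scrapfly.io/scrape" \
  --data-urlencode "key=__API_KEY__" \
  --data-urlencode "url=https://httpbin.dev/anything" \
  --data-urlencode "asp=true" \
  --data-urlencode "render_js=true"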
Observe the headers and cookies of regular calls that succeed; from there you can figure out whether you need to add extra headers or retrieve specific cookies to authenticate. You can use the dev tools and inspect the network activity.
API Usage: headers[referer]=https%3A%2F%2Fexample.com (the value is URL encoded), check out the related documentation
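With curl, --data-urlencode takes care of encoding the header value for you, as in this sketch:

curl -G \
  --url "https://api.scrapfly.io/scrape" \
  --data-urlencode "key=__API_KEY__" \
  --data-urlencode "url=https://httpbin.dev/anything" \
  --data-urlencode "asp=true" \
  --data-urlencode "headers[referer]=https://example.com"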
You might need to retrieve cookies from navigation before calling an unofficial API. The easiest way to achieve that is to scrape with the session enabled and JS rendering to retrieve the cookies, then scrape without rendering JS; the cookies are now stored in your session and applied back.
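A sketch of that two-step flow, assuming a session name of your choice passed via the session parameter (the name "my-session" below is hypothetical):

# Step 1: render JS once so the navigation cookies are collected into the session
curl -G \
  --url "https://api.scrapfly.io/scrape" \
  --data-urlencode "key=__API_KEY__" \
  --data-urlencode "url=https://httpbin.dev/anything" \
  --data-urlencode "asp=true" \
  --data-urlencode "render_js=true" \
  --data-urlencode "session=my-session"

# Step 2: subsequent calls reuse the session cookies without rendering JS
curl -G \
  --url "https://api.scrapfly.io/scrape" \
  --data-urlencode "key=__API_KEY__" \
  --data-urlencode "url=https://httpbin.dev/anything" \
  --data-urlencode "asp=true" \
  --data-urlencode "session=my-session"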
Some websites block navigation based on the IP location; by default, Scrapfly selects a random country from the pool, so specifying a country matching the location of the website could help.
API Usage: country=us, check out the related documentation
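For example, pinning the proxy country to the United States:

curl -G \
  --url "https://api.scrapfly.io/scrape" \
  --data-urlencode "key=__API_KEY__" \
  --data-urlencode "url=https://httpbin.dev/anything" \
  --data-urlencode "asp=true" \
  --data-urlencode "country=us"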
Pricing is not easy to predict with Anti Scraping Protection. Everything is designed and optimized to reduce the cost and maximize the reuse of authenticated sessions (which is free even when the protection is activated).
Be aware that protected sites at scale have a real cost; the best way to budget your volume is to create an account and monitor the cost while gradually increasing the volume to avoid surprises. We try to be as transparent as we can on this subject because we know you need to predict and budget the price of the solution; if you have any questions or feedback, please contact us.
| Scenario | API Credits Cost |
|---|---|
| ASP + Residential Proxies* | 25 |
| ASP + Residential Proxies + Browser* | 25 + 5 = 30 |
| ASP + Datacenter Proxies | 1 |
| ASP + Datacenter Proxies + Browser | 6 |
Since the ASP dynamically updates the configuration to be able to pass the protection (for example by switching to residential proxies or enabling the browser), the price is dynamic.
To help you make your cost more predictable, the following option is available:
- cost_budget parameter: defines the maximum budget the ASP should respect. If the budget interrupts the configuration mutation of a web scrape, the web scrape that was performed is billed regardless of the status code. Make sure to define a realistic minimum budget for your target: if the budget is too low, you will never be able to pass the protection and you will pay for blocked results.
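As a sketch, a scrape capped at a maximum of 30 API credits (the value is illustrative; pick it according to the scenario table above):

curl -G \
  --url "https://api.scrapfly.io/scrape" \
  --data-urlencode "key=__API_KEY__" \
  --data-urlencode "url=https://httpbin.dev/anything" \
  --data-urlencode "asp=true" \
  --data-urlencode "cost_budget=30"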
You can therefore enable ASP even if you are unsure whether a website is protected: if nothing is blocking, no extra credits are counted.
Failed requests (HTTP status >= 400) are not billed, except for the following status codes: 400, 401, 404, 405, 406, 407, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 422, 424, 426, 428, 456.
To prevent any abuse, this is subject to our fair use policy: if more than 30% of your traffic results in the previously listed HTTP codes, the fair use is disabled and failed requests are billed.
If your account falls under a 60% success rate and/or you deliberately scrape protected websites without ASP or keep hitting failing targets, your account will be suspended.
You can try to target the desired website through our API player by creating a free account.
The API response contains the header X-Scrapfly-Api-Cost, which indicates the billed amount, and X-Scrapfly-Remaining-Api-Credit, which indicates the remaining amount of Scrape API Credits.
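For example, you can dump the response headers to check both values after a scrape (the amounts below are illustrative):

curl -sD - -o /dev/null -G \
  --url "https://api.scrapfly.io/scrape" \
  --data-urlencode "key=__API_KEY__" \
  --data-urlencode "url=https://httpbin.dev/anything" \
  --data-urlencode "asp=true" | grep -i "x-scrapfly-"
# x-scrapfly-api-cost: 25
# x-scrapfly-remaining-api-credit: 9975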
Some specific protected domains have a special price due to their high level of protection; you can ask through support or test via the player to see how many credits are billed.
All related errors are listed below. You can see the full description and an example of the error response in the Errors section.